Here we will focus on a regression problem. We will fit a univariate (because we only have one predictor) linear (because the model representing the data is a line) regression (because our outcomes are continuous values) model.

**Univariate Linear Regression**

One of the basic algorithm used in statistical learning is the linear regression. This simple model looks like that:

` `**Y = W0 + W1*X**

**W0** is called the intercept; **W1** is the slope.

W0 and W1 are the **parameters**. Usually written W (weight) or ß (Beta) or θ (Theta). I think calling the parameters ‘Weight’ makes totally sense so I’ll use W.

Let’s get a concrete (very simple) example: Here is our dataset D with Y = grades for a given exam and X = number of hours spent reviewing for this particular exam. We don’t know the true relation (f) between Y and X but we will try to approximate it by training a model g.

Data set D:

In order to make sure everything is clear, in our data set D, we got 7 inputs that gives 7 outputs. In other words:

x1 = 5, x2 = 4, x3 = 2 , x4 = 3, x5 = 1respectively give y1 = 10, y2 =8, y3 = 4, y4= 6, y5= 2.

Our data on a graph give:

Before plotting the data, we assumed that the true relation between X and Y is linear. This is why we decided to fit a univariate linear regression:

** g(X): Y = W0 + W1*X**

The graph above clearly confirms our assumption. First, let’s choose randomly W0 and W1, the parameters of the model we want to fit. Let’s say that W0 = 2 and W1 = 1. We end up with:

It is obvious that this line is not a good representation of our data points. But it’s gonna be useful to learn more about the notion of prediction error.

# The cost function

In order to know if our function is a good representation of our data points, we compute a cost function or error function which calculates the prediction error. The prediction error is the distance between the true output y (from f) and the predicted outcome y'(from g).

First, the notation:Y = true outcome here: Y = [10, 8, 4, 6, 2]Y' = outcome predicted by our model (here, g(X) = W0 + W1*X)here since W0 = 2 and W1 = 1 and X = [5, 4, 2, 3, 1], we got: Y' = [7, 6, 4, 5, 3]

The prediction error is the distance between what g(x) (in our example, g(x) = 2 + 1*x) predicts for a given x, and the true y of this particular x. For example in the graph below, for x = 4, our function predicts that y’ = 2 + 1*4 = 6 whereas the true output is 8, so the prediction error at this point is 2. But for x = 1, our function predicts that y’ = 2 + 1*1 = 3 whereas y = 2, so the prediction error at this point is -1. As you can see in the graph below, some prediction errors are positive (the predicted Y’ is bigger than the true Y) while one is negative (the predicted Y’ is smaller than the true Y).

In order to know if our approximate target function is representative of our data set, we want to know the overall prediction error. If we sum the value of the prediction error at each point, we will end up with an addition of positive (e.g: 2 for x = 4) and negative values (e.g: -1 for x = 1) so one cancelling the other. The problem is that each prediction error, whether it is positive or negative, is an actual error and must increase the overall error function. This is why we square each error to get only positive values. Thus, at each point, we use:

(y-y')²

instead of:

(y-y')

The total sum of square error looks like that:

∑(y-y')²

In order to get the average of the sum of square error, we divide it by the number of data points (M) that we have:

1/M*∑(y-y')²

This is called the cost function, we write it this way:

**J(W0, W1)**

Meaning that the cost function depends on W0 and W1.

For our example, we got an averaged sum of square error equal to:

J(W0, W1) = 1/M*∑(y-y')² <=> 1/5*((10 - 7)² + (8 - 6)² +(4 - 4)² +(6 - 4)²+(2 - 3)²) where g(x): y' = 2 + 1*x, W0 = 2 and W1 = 1 <=> Thus J(2, 1) = 27.

27 is pretty high, maybe we can do better. Let’s check what’s the cost function for W0 = 1 and W1 = 1.5, we got:

J(W0, W1) = J(1, 1.5) = 24.75.

It’s better but is it the best we can do? We could randomly choose W0 and W1 and check the value of the cost function for each and eventually take the parameters which lead to the smallest cost function. But there is an infinity of possible value for W0 and W1, maybe there is a simplest method?

We want to minimize J(W0, W1). Meaning that we want 1/M*∑(y-y’)² to be as small as possible. To make it happen, we will apply an optimisation method called the gradient descent. In our example, we can even make J(W0, W1) equal to zero which means that our function g(x) will perfectly predict the Y of our dataset. Optimised, it will look like this:

So now, let’s see how to find the optimal values for W0 and W1 that minimise the cost function and make g(x) fits our dataset as well as possible.

# The gradient descent (Batch gradient descent)

First, let’s take a look on how the cost function behaves. To make it as clear as possible, let’s start with a simpler model like:

g(X): Y = W1X (so with W0 = 0)

Let’s say that the true function f(x) is equal to 4X. Thus we have:

**Y = 4X**

The chosen model g(X) has a cost function J(W1) which behaves like this:

On the absciss, the cost function (from 0 to infinity), on the ordinate, the value of W1. As you can see, if W1 = 4, we got J(W1) = 0 which is the optimal value in this example because it would give g = f. If we choose a bigger or smaller W1, we end up with an increased cost function. For example, if we choose W1 = 2:

J(2) = (Y-Y')² = (4X - 2X)² = 4*X²

our predicted Y’ will be 2*X**²** away from the true Y.

if we choose W1 = 6:

**J(2) = (Y-Y')² = (4X - 6X)² = 4*X²**

our predicted Y’ will be 4*X**²** away from the true Y

if we choose W1 = 0:

**J(2) = (Y-Y')² = (4X - 0X)² = 16*X²**

Our predicted Y’ will be 16*X**²** away from the true Y therefore on the far upper left of the graph above.

Now, let’s see how it behaves with a model like Y = W0 + W1X. So it’s going to be in 3 dimensions since this time, we will have to plot W0, W1 and J(W0, W1).

As you can see, in this example the optimal values for W0 and W1 this time are around W0 = W1 ≈ -5.

With the gradient descent we aim for this minimum. Remember your high school math classes about derivative? A derivative of a function (of a single variable at a chosen input value when it exists) basically gives “the slope of the tangent line to the graph of the function at that point” (Wikipedia). In other words, we got the direction of the minimum we want to reach:

We start by choosing 2 random values for our parameters W0 and W1.

We take the derivative of the function J(W1, W0) (Which is the function we want to minimize). Instead of 1/M*∑(y-y’)², We use J(W0, W1) = 1/**2**M*∑(y-y’)² in order to make the math easier when it comes to do the derivative. Of course it doesn’t change the solution: At the end, the minimum divided by 2 or not is still the minimum.

We then use this derivative to compute the gradient.This is basically the quantity that will be subtracted from the parameter in order to minimize J(W0, W1); Adding it to the parameter would increase our cost function.

And we repeat the process until we reach the minimum.

Concretely, we do this until convergence:

**Wj := Wj – α*(∂/∂Wj)*J(W0,W1)**

**Wj** means that we have to do it for W0 and W1

**:=** means assignement, meaning we set or reset the value of a variable, here: Wj

**α** is the learning rate, this is basically the speed of your descent. The bigger it is, the more aggressive the gradient descent is. If it’s too big, you can miss the minimum and fail to converge. If it’s too small, it will take too long to reach the minimum. You might think that as you approach the minimum you might have to change your learning rate α. In fact there is no need for that since as you approach it, your derivative term gets smaller, so your update gets also smaller.

**(∂/∂Wj)*J(W0,W1)** represents the derivative term (We will get into more details later).

Remember that we have to update W0 and W1 **at the same time!**

** |W0 := W0 – α*(∂/∂W0)*J(W0,W1)**

** |W1 := W1 – α*(∂/∂W1)*J(W0,W1)**

Repeat the process until you reach the minimum. **How do you know you reached the minimum?** Because the slope of the tangent will be zero. So the derivative term will be zero therefore the parameter will remain the same (Wj := Wj – 0).

**The derivative of the cost function:**

**(∂/∂Wj)*J(W0,W1) **= **(∂/∂Wj) * 1/ 2M*∑(y-y’)² **=

**(∂/∂Wj) * 1/**

**2**M*∑(y-(W0 + W1x))²So far we just replaced **J(W0, W1)** by** 1/2M*∑(y-y’)²** and **y’** by **(W0 + W1x).**

We are actually doing the partial derivative of the cost function. Meaning that all the variables except the one that we are looking for are treated as constant. So when we do the derivative for the updating W0, we treat W1 as a constant and vice versa.

If you want to go deeper in the math, read this.

Eventually, we end up with:

** |W0 := W0 – α*1/M* ∑ (y-(W0 + W1x))**

**|W1 := W1 – α**

***1/M*****∑ x*(y-(W0 + W1x))****Local and global minimum:**

Your function might have several minima. You can get several local minima but only one global minimum. You are looking for the global minimum because this is where the cost function is the lowest.

But reaching the global minimum is not always easy, it depends on where you start. In the example below, two different starting points (e.g. two different values for W0 and W1) lead to different minima.

Good news: the linear regression cost function is a convex function. Therefore, there is always a solution.

**The analytical solution:**

Linear algebra can solve this optimisation problem.

The optimal parameters are equal to: W = (X^{t}X)^{-1}X^{t}y

To understand this equation, you have to go a bit into linear algebra, here is a FANTASTIC CLASS from G.Strang (MIT).

X = Matrix of your inputs

Y = Matrix of your outputs

X^{t} = X transpose

(…)^{-1} = Inverse matrix

We basically end up with a scalar (if only one parameter W) or a vector (if many W’s) which is/are the optimal value/s of your parameter/s:

Wj = (X^{t}X)^{-1}X^{t}y <=> (p*n. n*p)^{-1}p*n.n*1 <=> final matrix: (p*1)

In our example with grades and hours of work, we end up with an 2*1 matrix with 0 and 2 in it, meaning W0 = 0, W1 = 2.

But why doing the gradient descent when we can resolve the problem analytically? Because of a computational problem, when p = 2, that’s okay, but when p is huge, then the computational power required is gigantic and it’s way faster to apply the gradient descent. So, basically, to save time.

# Bibliography

Hastie and al. The element of statistical learning. 2008. PDF

James and al. An introduction to statistical learning. 2013. PDF

Stanford’s statistical learning course. Hastie and Tibshirani. Stanford Online class (link)

Stanford’s machine learning course. Andrew NG. Stanford Online class (link)

Machine learning class. Holehouse.org (link)

**I AM A BEGINNER IN MACHINE LEARNING. IN CASE OF DOUBT. TRUST YOUR PROFESSOR INSTEAD.**