Consider a phenomenon characterized by two variables, x and y.
After observing the phenomenon, we now have a sample dataset.
A reasonable start would be to create a scatter plot of the dataset, to get an idea of how y changes with respect to x.
The scatter plot suggests that the relationship between x and y is almost linear, so we assume:
Where a and b are the constants to be determined.
Problem: This formula can represent infinitely many lines, depending on the values of a and b.
Solution: Find the values of a and b for which y(x) best fits the dataset.
This is where linear regression comes into play.
Linear regression is a model that establishes a linear relationship between an independent variable (x in our case) and a dependent variable (y in our case).
The least-squares method is one of several methods that can be used to determine the values of the two parameters a and b based on the dataset.
The idea is to calculate the values of a and b that correspond to the minimum of the sum of squared errors.
An error is defined as follows:
The sum of squared errors is defined as:
Where n is the number of samples in the dataset, which is 4 in our case.
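To make the definition concrete, here is a minimal Python sketch of the sum of squared errors. It assumes the line is written y(x) = a*x + b with a as the slope, and that an error is the observed value minus the predicted one. Both conventions, the function name sum_squared_errors and the four sample points are illustrative assumptions, not the dataset from this article.

```python
def sum_squared_errors(a, b, xs, ys):
    """Return S(a, b), the sum of squared errors of the line y(x) = a*x + b."""
    total = 0.0
    for x, y in zip(xs, ys):
        error = y - (a * x + b)  # assumed convention: observed minus predicted
        total += error ** 2
    return total


# Hypothetical sample (n = 4), made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

print(sum_squared_errors(2.0, 0.0, xs, ys))  # 1.0
print(sum_squared_errors(1.0, 1.0, xs, ys))  # 11.0
```

The first candidate line has the smaller value of S, so it fits these made-up points better than the second. Least squares looks for the pair (a, b) that makes this number as small as possible.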
We want to find the values of a and b for which the function S(a,b) is minimized.
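The extrema (minima or maxima) of a function can be found by setting its derivative to 0, as long as the function is differentiable over the real numbers ℝ.
In our case, the error function S(a,b) is a polynomial in a and b, and therefore differentiable everywhere, so we set its partial derivatives with respect to a and b to 0:
We end up with a system of two linear equations in the two unknowns a and b. Because S(a,b) is a sum of squares, the solution of this system is a minimum rather than a maximum.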
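For reference, the two conditions can be written out explicitly. The sketch below assumes the model is written y(x) = ax + b and the error is e_i = y_i - (a x_i + b); the article's own notation may differ, but the resulting equations have the same structure.

```latex
\begin{aligned}
S(a,b) &= \sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - a x_i - b\bigr) = 0 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr) = 0
\end{aligned}
\qquad\Longrightarrow\qquad
\begin{aligned}
a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \\
a \sum_{i=1}^{n} x_i + n\,b &= \sum_{i=1}^{n} y_i
\end{aligned}
```

Both equations are linear in a and b, so they can be solved by substitution or elimination.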
Using the results of this analysis, we calculate the optimal values of a and b:
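The calculation can also be sketched in Python. This is a minimal illustration, assuming the model y(x) = a*x + b with a as the slope. The function name fit_line and the four data points are hypothetical and are not the article's dataset.

```python
def fit_line(xs, ys):
    """Least-squares fit of y(x) = a*x + b; returns the pair (a, b)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of the 2x2 linear system obtained from dS/da = 0 and dS/db = 0.
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")

    a = (n * sum_xy - sum_x * sum_y) / denom  # slope
    b = (sum_y - a * sum_x) / n               # intercept
    return a, b


# Hypothetical sample (n = 4), made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

a, b = fit_line(xs, ys)
print(a, b)
```

Explicit sums are used here, rather than a library routine, so that each term can be matched against the system of equations.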
Finally, our regression model looks like this:
Why go through the process of squaring the errors before minimizing the error function? Why not just sum all the errors?
The problem with summing the errors directly is that errors may be positive or negative, and positive and negative errors cancel each other out.
Say we have an error e1=100 and another error e2=-100; the sum of the two errors (e1+e2=0) suggests that our model is 100% accurate, which is clearly not the case since the errors are just too big.
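The cancellation is easy to check numerically. The short Python sketch below uses the two errors from the example above and compares their plain sum with the sum of their squares.

```python
errors = [100, -100]

plain_sum = sum(errors)                       # positive and negative errors cancel
sum_of_squares = sum(e ** 2 for e in errors)  # squaring removes the sign first

print(plain_sum)       # 0
print(sum_of_squares)  # 20000
```

The plain sum hides both errors, while the sum of squares reports a large total.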
So before creating the error function, we need to make sure that every error contributes a positive amount; defining the error function as the plain sum of the errors won’t work.
To convert a negative number into a positive one, we usually take the absolute value of the number, or we just square it. So why not take the absolute values of the errors instead?
This approach can be used, and is called the least absolute deviations method. However, least squares is more common in regression analysis, because absolute values are awkward to differentiate.
As an example, take a look at the absolute value function, f(x) = |x|:
This function is clearly not differentiable at x = 0.
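A quick numerical check illustrates the point. The sketch below assumes the function in question is the absolute value, f(x) = |x|, and compares the slope just to the right of 0 with the slope just to the left. The step size h is an arbitrary small value chosen for illustration.

```python
def f(x):
    return abs(x)


h = 1e-6  # arbitrary small step, for illustration only

right_slope = (f(0 + h) - f(0)) / h  # slope approaching 0 from the right
left_slope = (f(0) - f(0 - h)) / h   # slope approaching 0 from the left

print(right_slope)  # 1.0
print(left_slope)   # -1.0
```

The two one-sided slopes disagree, so there is no single derivative at x = 0. For a squared term such as x**2, both slopes tend to 0 and the derivative exists.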
Using the least-squares method just makes the process easier.