Briefly, regression analysis is utilised when you need to predict a continuous dependent variable from a set of independent variables. One very important thing to keep in mind is that regression analysis cannot determine the causal relationship among variables. In the model performance evaluation lesson, we saw that there is a subtle difference between a classification problem and a regression problem, based on the data set in hand. Referring to that discussion, regression analysis should be used when the value to be predicted is continuous or numeric rather than nominal.

Linear regression is a well-established statistical method dating back to French mathematician Adrien-Marie Legendre[1] in 1805 and German mathematician and physicist Johann Carl Friedrich Gauss[2] in 1809.

Now, let me introduce you to the problem. You have a cloud of data points and your job is to fit a straight line through them, in such a way that the line is the best possible fit. Therefore, linear regression can roughly be said to consist of finding the best-fitting straight line through a set of data points, and this best-fitting line is called the regression line. A model of the data can be formulated using the following equation:
y = w0 + w1x1 + w2x2 + … + wnxn + ε        (Equation 1.1)
In this equation, the ws are the weights, which are to be calculated from the data, ε is the error and the xs are the features, attributes or predictors. For the simple case of fitting a straight line to two-dimensional data, that is, a dataset with only one attribute, there will be only one x, that is, x1, and we can ignore the rest of the terms. This brings us to the following simpler equation:
y = w0 + w1x1 + ε        (Equation 1.2)
This is the equation of a straight line (as shown in Figure 1), where we want to determine the values of the constants w0 and w1 from the training data. Again, remember that this type of analysis is applied to numeric or continuous data rather than to nominal data.


Figure 1. Example of linear regression with a straight line (source: Wikipedia[3]).

After the calculation of the weights from the training data, we can predict the values based on those weights and calculate the error using the following formulae: 
ŷ = w0 + w1x1        (Equation 1.3)

e = y − ŷ        (Equation 1.4)
One important thing you may notice here is that the data model or observed data (as shown in Equation 1.1) and the predicted data (as shown in Equation 1.3) differ mainly by the error term, which is the difference between the observed data and the predicted data (as shown in Equation 1.4).
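As a quick sketch in plain Python (the weights and data values here are invented purely for illustration), the prediction and error calculations look like this:

```python
# Hypothetical weights, e.g. obtained from a previous fit (illustrative values)
w0, w1 = 0.05, 1.99

# Toy observed data: one attribute x and a continuous target y
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Predicted values: y_hat = w0 + w1 * x1 (Equation 1.3)
y_hat = [w0 + w1 * xi for xi in x]

# Errors: difference between observed and predicted values (Equation 1.4)
errors = [yi - yhi for yi, yhi in zip(y, y_hat)]
```

Each entry of `errors` tells us how far the fitted line is from the corresponding observed point.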

The goal is to reduce or minimise the error. This is often referred to as ordinary least squares, or OLS, where the goal is to fit a model in such a way that the sum of squares of the differences between the observed data (i.e., y, the available data) and the predicted data (i.e., ŷ) is minimised. Mathematically, it is denoted as follows:
SSE = Σi (yi − ŷi)²        (Equation 1.5)
Now, this OLS is actually a very standard matrix problem, and it is usually solved by calculating the covariance between x and y, and of x with itself. From a practical perspective, the calculations represented in these equations work fine if there are more instances than attributes and the attributes are numeric.
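To make this concrete, here is a minimal sketch in plain Python of the simple one-attribute case, where the OLS weights follow directly from the covariance of x and y and the variance of x; the data values are invented for illustration:

```python
# Toy data: one attribute x and a continuous target y (illustrative values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance of x and y, and variance of x (the covariance of x with itself)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
var_x = sum((xi - mean_x) ** 2 for xi in x) / n

# OLS solution for the simple straight-line case:
# w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)
w1 = cov_xy / var_x
w0 = mean_y - w1 * mean_x
```

For data with many attributes, the same idea generalises to solving a system of linear equations built from these covariances.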

Linear regression models are usually based on a few assumptions, including that the errors are normally distributed with zero mean and constant variance. If these assumptions are satisfied, linear regression estimators perform optimally as unbiased, efficient and consistent estimators. Here, by unbiased we mean that the expected value of the estimator equals the true value of the parameter being estimated. The term efficient means that the estimator has the smallest possible variance. And, finally, the term consistent means that the bias and variance of the estimator approach zero as the sample size increases.

We have already seen the various metrics used to evaluate the performance of a regression model. In addition to these, an OLS-based regression model is usually evaluated using the correlation coefficient (denoted by R) or the coefficient of determination (denoted by R2). The relationship between the two is simply that the coefficient of determination (i.e., R2) is the square of the correlation coefficient (i.e., R×R).

The R2 shows the proportion of the variation in y that is explained by all the variables together. The higher the value of R2, the better the performance of the model. The value of R2 is always between 0 and 1 and can never be negative, as it is a squared value. The simple formula for deriving the value is as follows:
R2 = 1 − SSE / SST        (Equation 1.6)
As we have mentioned earlier, a model fits well when the R2 value is very close to 1, which requires the SSE value to be very close to 0. Conversely, if the SSE value is very high and similar to the value of SST, then the R2 value approaches 0, denoting that the model has fitted the data very poorly.
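Continuing the plain-Python sketch (the observed targets and predictions here are invented for illustration), SSE, SST and R2 can be computed as follows:

```python
# Toy observed targets and hypothetical predictions from some fitted model
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.0]

mean_y = sum(y) / len(y)

# SSE: sum of squared differences between observed and predicted values
sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))

# SST: total sum of squares of the observed values around their mean
sst = sum((yi - mean_y) ** 2 for yi in y)

# Coefficient of determination: R2 = 1 - SSE / SST
r2 = 1 - sse / sst
```

Here the predictions track the observations closely, so SSE is small relative to SST and R2 comes out close to 1.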


Finally, it is important to remember that none of these measures implies any causal relationship.