Unlike linear regression, which predicts a numeric output from a linear combination of the input variables, logistic regression aims to classify data based on a linear discriminant.
This means that, unlike linear regression, logistic regression does not simply predict values of numeric variables for sets of input data. Rather, for each set of input values, it outputs the probability that it belongs to a given class. As an example, let us assume that we have only two classes (this can also be extended to multiclass problems using multinomial logistic regression), say, classes ‘pos’ and ‘neg’. The probability Ppos denotes the probability that a certain data point belongs to the ‘pos’ class, and Pneg = 1 − Ppos denotes the probability of the data point belonging to the ‘neg’ class. Therefore, it can be said that the output of logistic regression lies within the range of [0, 1].
Again, considering a two-class example, the main idea of logistic regression is that the input space can be separated into two clearly distinguishable ‘regions’, one for each class, by a linear boundary. The term linear boundary means that, for a two-dimensional data set, the classes can be separated by a straight line rather than a curve. Extending this to a three-dimensional data set, the boundary becomes a plane, and so on. Therefore, the boundary condition is dictated by your input data and the learning algorithm you plan to use. To put things into perspective, for a two-dimensional data set with two classes, the data points must be clearly separable into two regions by a linear boundary. If your data points do satisfy this constraint, they are said to be linearly separable, as shown in the figure below.
Figure 2. Linearly separable data (source: SairahulBlog)
This is called a linear discriminant, because:
1. The dividing boundary is formed using a linear function, and
2. It allows the model to ‘discriminate’ the separable classes.
Now, let us ask ourselves a question: how can logistic regression, using only a linear boundary, quantify fairly accurately the probability of a data point belonging to a certain class? To answer this, we need to understand the mechanism of the method itself.
Let us begin with the separation of the input space into distinct regions. For simplicity, we shall consider two input variables, say, x1 and x2, and so the function corresponding to the boundary will be something like:

β0 + β1·x1 + β2·x2 = 0   (2.1)

where x1 and x2 are input variables and β0, β1, β2 are coefficients to be learnt. Normally, an output variable is not mentioned as a part of this conceptual space, which otherwise is very common in techniques like linear regression.
As an example, let us take a point (a, b) and plug these values into equation 2.1 as x1 and x2. The boundary function’s output then becomes:

t = β0 + β1·a + β2·b   (2.2)
Now, based on where the point (a, b) is located, one of the following three possibilities holds:

1. t > 0: the point lies on one side of the boundary, in the region associated with the ‘pos’ class;
2. t < 0: the point lies on the other side, in the region associated with the ‘neg’ class;
3. t = 0: the point lies exactly on the boundary.
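These three possibilities depend only on the sign of the boundary function value. A minimal Python sketch, with purely illustrative coefficient values (they are assumptions for the example, not values from the text):

```python
# Evaluate the linear boundary function at a point (a, b).
# beta0, beta1, beta2 are illustrative coefficients for this sketch.
def boundary(x1, x2, beta0=-3.0, beta1=1.0, beta2=1.0):
    """Return t = beta0 + beta1*x1 + beta2*x2 (the boundary function value)."""
    return beta0 + beta1 * x1 + beta2 * x2

# The sign of t tells us on which side of the line the point lies:
print(boundary(4, 4))  # t > 0: one region
print(boundary(1, 1))  # t < 0: the other region
print(boundary(1, 2))  # t = 0: exactly on the boundary
```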
From the above discussion, it is clear that, given a data point, we can calculate a function value (equation 2.2) lying anywhere within the range of (−∞, ∞), and we would like to use this value to decide the probability of that data point belonging to a specific class. But a probability value must lie within the range of [0, 1], and therefore the infinite range needs to be mapped to the probability range. This can be done using something called an odds function.
So, let P(X) represent the probability of an event X occurring; the odds ratio, or OR(X), can then be defined by

OR(X) = P(X) / (1 − P(X))
which essentially represents the probability ratio of the event happening vs. not happening. Now, it can be clearly asserted that the probability and the odds convey the same information. But while the range of P(X) is [0, 1], the range of OR(X) is (0, ∞).
The odds function covers the value range (0, ∞), but the boundary function can output any value in (−∞, ∞). This can be handled by taking the logarithm of OR(X), giving what is called the log-odds function. Mathematically, OR(X) ranges over (0, ∞) and log OR(X) maps this to the required range of (−∞, ∞).
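The probability-to-odds and odds-to-log-odds mappings described above can be sketched in a few lines of Python (the sample probabilities are illustrative):

```python
import math

# Map a probability p in (0, 1) to odds in (0, inf),
# and then to log-odds in (-inf, inf).
def odds(p):
    return p / (1.0 - p)

def log_odds(p):
    return math.log(odds(p))

for p in (0.1, 0.5, 0.9):
    print(p, odds(p), log_odds(p))
```

Note that p = 0.5 gives odds of 1 and log-odds of 0; probabilities below 0.5 map to negative log-odds and those above 0.5 to positive log-odds, so the whole real line is covered.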
This allows us to intuitively understand and interpret the result when we plug in the attributes of an input into the boundary function. In this model, the boundary function defines the log-odds of the pos class.
This brings us back to our two-dimensional example, and we can now walk through the steps of logistic regression for a given point (a, b), which are as follows:
Step 1. Compute the boundary function value (alternatively, the log-odds function value) using equation 2.2, i.e., t = β0 + β1·a + β2·b.
Step 2. Compute the odds ratio, i.e., ORpos = e^t (since t is the logarithm of ORpos).
Step 3. Having computed ORpos, Ppos can be calculated using the following formula:

Ppos = ORpos / (1 + ORpos)
Now, combining steps 2 and 3 and substituting t, we have:

Ppos = e^t / (1 + e^t) = 1 / (1 + e^(−t))   (2.4)
The right-hand side of equation 2.4 is popularly known as the logistic function, from which logistic regression takes its name.
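Steps 1 to 3 can be traced end to end in a short Python sketch. The coefficients and the point (a, b) are illustrative assumptions, not values from the text:

```python
import math

# Illustrative coefficients and an example point (a, b).
beta0, beta1, beta2 = -1.0, 2.0, 0.5
a, b = 1.0, 1.0

t = beta0 + beta1 * a + beta2 * b    # Step 1: the log-odds (equation 2.2)
or_pos = math.exp(t)                 # Step 2: the odds ratio, OR_pos = e^t
p_pos = or_pos / (1.0 + or_pos)      # Step 3: the probability of 'pos'

# Equivalently, applying the logistic function of equation 2.4 directly:
p_logistic = 1.0 / (1.0 + math.exp(-t))
print(p_pos, p_logistic)  # the two values agree
```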
Figure 3. The logistic function (source: Wikipedia)
We have seen how the probabilities in logistic regression are calculated; now let us have a quick look at how it learns the boundary function outlined in equation 2.1.
Let us define a function g(x), with x being a data point from the training data set. The function g(x) is assigned the appropriate class probability as follows:

1. If x belongs to the ‘pos’ class, g(x) = Ppos (this value is given by equation 2.4).
2. If x belongs to the ‘neg’ class, g(x) = 1 − Ppos.
Therefore, through g(x), we can quantify the probability that a data point from the training set is appropriately classified by our model. Averaging g(x) over all the data points in the training set gives the likelihood that a randomly selected data point is correctly classified by the model, regardless of which class it belongs to. In other words, the logistic regression learner attempts to maximise this ‘average’ g(x) (more precisely, the product of the individual g(x) values, or equivalently the average of their logarithms) through a method called maximum likelihood estimation, or MLE.
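One common way to carry out this maximisation is gradient ascent on the log-likelihood. The following is a minimal sketch of that idea, not the only possible MLE solver; the toy data, learning rate, and epoch count are illustrative assumptions:

```python
import math

def predict(beta, x):
    """P_pos for a point x = (x1, x2) under coefficients beta."""
    t = beta[0] + beta[1] * x[0] + beta[2] * x[1]
    return 1.0 / (1.0 + math.exp(-t))

def fit(data, labels, lr=0.1, epochs=2000):
    """Maximise the log-likelihood by stochastic gradient ascent."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(data, labels):
            p = predict(beta, x)
            err = y - p  # gradient of this point's log-likelihood term
            beta[0] += lr * err
            beta[1] += lr * err * x[0]
            beta[2] += lr * err * x[1]
    return beta

# Linearly separable toy data: label 1 ('pos') for points with larger x1 + x2.
data = [(1, 1), (2, 1), (1, 2), (3, 3), (4, 3), (3, 4)]
labels = [0, 0, 0, 1, 1, 1]
beta = fit(data, labels)
print([round(predict(beta, x)) for x in data])  # recovers the labels
```

Each update nudges the coefficients so that g(x) increases for the observed class of each training point, which is exactly the maximisation described above.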