Unlike linear regression, which predicts a numeric output as a linear combination of the input variables, logistic regression aims to classify data using a linear discriminant.

This means that, unlike linear regression, logistic regression does not simply predict values of numeric variables for sets of input data. Rather, the method outputs probability values denoting which class each set of input data belongs to. As an example, let us assume that we have only two classes (this can also be extended to multiclass problems using multinomial logistic regression), say, classes ‘*pos*’ and ‘*neg*’. The probability *P _{pos}* denotes the probability that a certain data point belongs to the ‘*pos*’ class.

Again, considering a two-class example, the main idea of logistic regression is that the input space can be separated into two clearly separable ‘regions’, one for each class, by a linear (straight) boundary. The term linear boundary means that, for a two-dimensional data set, the data can be clearly separated using a straight line rather than a curve. Extending this to a three-dimensional data set, the boundary becomes a plane, and so on. The shape of the boundary is therefore dictated by your input data and the learning algorithm you plan to use. To put things into perspective, for a two-dimensional data set with two classes, the data points must be clearly separable into two regions by a linear boundary. If your data points do satisfy this constraint, they are said to be linearly separable, as shown in the figure below.
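As a minimal sketch of this idea, the snippet below checks whether two small two-dimensional point sets are separated by a candidate straight-line boundary. The line *x*_{1} + *x*_{2} − 1 = 0 and both point sets are illustrative assumptions, not taken from the text:

```python
# Sketch: check whether two toy 2-D point sets are linearly separable
# by a hypothetical straight-line boundary x1 + x2 - 1 = 0.
# The line and the points are illustrative assumptions.

def boundary(x1, x2):
    """Signed value of the candidate linear boundary at (x1, x2)."""
    return x1 + x2 - 1.0

pos_points = [(1.0, 1.0), (2.0, 0.5), (1.5, 2.0)]   # toy 'pos' class
neg_points = [(0.0, 0.0), (0.2, 0.3), (-1.0, 0.5)]  # toy 'neg' class

# The sets are linearly separable by this line if every 'pos' point
# falls on the positive side and every 'neg' point on the negative side.
separable = (all(boundary(x1, x2) > 0 for x1, x2 in pos_points)
             and all(boundary(x1, x2) < 0 for x1, x2 in neg_points))
print(separable)  # True for this toy data
```

A curved boundary would be needed only if no straight line could achieve such a split.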

Figure 2. Linearly separable data (source: SairahulBlog[1])

This is called a linear discriminant, because:

1. The dividing boundary is formed using a linear function, and

2. It allows the model to ‘discriminate’ the separable classes.

[1] http://blog.sairahul.com/2014/01/linear-separability.html

Now, let us ask ourselves a question: how can logistic regression use only a linear boundary and still quantify fairly accurately the probability of a data point belonging to a certain class? To answer this, we need to understand the mechanism of the method itself.

Let us begin with the separation of the input space into distinct regions. For simplicity, we shall consider two input variables, say, *x*_{1} and *x*_{2}, and so the function corresponding to the boundary will be something like:
β_{0} + β_{1}*x*_{1} + β_{2}*x*_{2} = 0 (2.1)

where *x*_{1} and *x*_{2} are input variables. Note that, unlike in techniques such as linear regression, no output variable appears here: the boundary is defined purely in the input space.

As an example, let us take a point (*a, b*) and plug these values into equation 2.1 as *x*_{1} and *x*_{2}. The boundary function’s output then becomes:
*t* = β_{0} + β_{1}*a* + β_{2}*b* (2.2)

Now, based on where the point (*a, b*) is located, there are three possibilities to consider:

- (*a, b*) lies in the region defined as the *pos* class. The output of equation 2.2 will then be positive, within the range (0, ∞). The higher this value, the farther the point lies from the boundary, and the greater, intuitively, the probability that (*a, b*) belongs to the *pos* class. Therefore, *P _{pos}* will lie somewhere in (0.5, 1].

- (*a, b*) lies in the region defined as the *neg* class. The output of equation 2.2 will then be negative, within the range (-∞, 0). Again, as with the *pos* class, the higher the absolute value of the output, the greater the probability that (*a, b*) belongs to the *neg* class; that is, *P _{pos}* will now lie somewhere in [0, 0.5).

- (*a, b*) lies exactly on the linear boundary. Although unlikely, in this case the output of equation 2.2 will be 0, which means the model cannot determine whether (*a, b*) belongs to the *pos* or the *neg* class. The probability *P _{pos}* will then be exactly 0.5.
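The three cases above can be sketched numerically. The boundary coefficients below (a hypothetical *t* = 2*x*_{1} + *x*_{2} − 4) are assumptions for illustration, and the mapping from boundary value to probability anticipates the logistic function derived later in the text:

```python
import math

# Hypothetical boundary function evaluated at a point (a, b);
# the coefficients 2, 1, -4 are illustrative assumptions.
def boundary(a, b):
    return 2.0 * a + 1.0 * b - 4.0

def p_pos(t):
    # Logistic mapping from a boundary value in (-inf, inf)
    # to a probability in (0, 1); derived later in the text.
    return 1.0 / (1.0 + math.exp(-t))

# Case 1: point on the 'pos' side  -> t > 0, P_pos in (0.5, 1]
# Case 2: point on the 'neg' side  -> t < 0, P_pos in [0, 0.5)
# Case 3: point exactly on the boundary -> t = 0, P_pos = 0.5
for point in [(3.0, 1.0), (1.0, 0.0), (1.5, 1.0)]:
    t = boundary(*point)
    print(point, t, p_pos(t))
```

For the three sample points the boundary values are 3, -2 and 0, so the corresponding probabilities fall above 0.5, below 0.5, and exactly at 0.5.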

From the above discussion, it is clear that, given a data point, we can compute a function (equation 2.2) whose output lies in the range (-∞, ∞) and reflects how strongly that point should be assigned to a specific class. But a probability cannot lie outside the range [0, 1], and therefore this infinite range needs to be mapped onto the probability range. This can be done using something called an odds function.

So, let *P(X)* represent the probability of an event *X* occurring; the odds ratio *OR(X)* can then be defined as
*OR(X)* = *P(X)* / (1 − *P(X)*) (2.3)

which essentially represents the ratio of the probability of the event happening to that of it not happening. Clearly, probability and odds convey the same information; but while *P(X)* ranges over [0, 1], *OR(X)* ranges over (0, ∞).

The odds function covers the value range (0, ∞), but we have to cover the boundary function’s whole range, which is (-∞, ∞). This can be done by taking the *logarithm* of *OR(X)*, giving what is called the *log-odds* function. Mathematically, while *OR(X)* ranges over (0, ∞), log *OR(X)* covers the required range of (-∞, ∞).

This allows us to intuitively understand and interpret the result when we plug in the attributes of an input into the boundary function. In this model, the boundary function defines the log-odds of the *pos* class.
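The probability → odds → log-odds chain can be sketched directly; the sample probabilities below are arbitrary illustrations:

```python
import math

def odds(p):
    # Odds of an event with probability p: p / (1 - p).
    return p / (1.0 - p)

def log_odds(p):
    # Log-odds (logit): maps probabilities in (0, 1) onto (-inf, +inf).
    return math.log(odds(p))

# Probabilities below 0.5 give odds in (0, 1) and negative log-odds;
# probabilities above 0.5 give odds in (1, inf) and positive log-odds;
# p = 0.5 gives odds of exactly 1 and log-odds of exactly 0.
for p in (0.1, 0.5, 0.9):
    print(p, odds(p), log_odds(p))
```

Note the symmetry: 0.1 and 0.9 produce log-odds of equal magnitude and opposite sign, mirroring the two sides of the boundary.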

This brings us back to our two-dimensional example. We can now walk through the steps of logistic regression for a given point (*a, b*), which are as follows:

__Step 1__. Compute the boundary function value using equation 2.2 (equivalently, the log-odds function value); call this value *t*.

__Step 2__. Compute the odds ratio, i.e., *OR _{pos}* = e^{t} (since *t* is the log-odds, log *OR _{pos}* = *t*).

__Step 3__. Having computed *OR _{pos}*, convert it back into a probability using *P _{pos}* = *OR _{pos}* / (1 + *OR _{pos}*).

Now, combining steps 2 and 3 and substituting *t*, we have:

*P _{pos}* = e^{t} / (1 + e^{t}) (2.4)

The right-hand side of equation 2.4 is popularly known as the logistic function, which gives logistic regression its name.
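The three steps above can be carried out for a single point; the boundary coefficients β_{0}, β_{1}, β_{2} and the point (*a, b*) below are illustrative assumptions:

```python
import math

# Steps 1-3 for a single point, using illustrative (assumed)
# boundary coefficients -- these values are not from the text.
beta0, beta1, beta2 = -4.0, 2.0, 1.0
a, b = 3.0, 1.0  # the point (a, b)

# Step 1: boundary function value, i.e. the log-odds t.
t = beta0 + beta1 * a + beta2 * b

# Step 2: odds ratio for the 'pos' class.
or_pos = math.exp(t)

# Step 3: convert the odds back into a probability.
p_pos = or_pos / (1.0 + or_pos)

# This agrees with the standard sigmoid form 1 / (1 + e^-t).
assert abs(p_pos - 1.0 / (1.0 + math.exp(-t))) < 1e-12
print(t, or_pos, p_pos)
```

Here *t* = 3, so the point sits on the *pos* side of the boundary and *P _{pos}* comes out well above 0.5.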

Figure 3. The logistic function (source: Wikipedia[1])

We have seen how the probabilities in logistic regression are calculated; now let us have a quick look at how it learns the boundary function outlined in equation 2.1.

Let us define a function *g(x)*, with *x* being a data point from a training data set. The function *g(x)* is assigned the appropriate class probability as follows:

If *x* is a part of the *pos* class, *g(x)* = *P _{pos}* (this value is given by equation 2.4).

If *x* is a part of the *neg* class, *g(x)* = 1 − *P _{pos}*.

Therefore, through *g(x)* we can quantify the probability that a data point from the training set is appropriately classified by our model. Multiplying *g(x)* over all the data points of the training set gives the likelihood of the whole training set being correctly classified, regardless of which class each point belongs to. The logistic regression learner attempts to maximise this likelihood (equivalently, the average of log *g(x)*) through a method called maximum likelihood estimation (MLE).
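The maximum-likelihood fit can be sketched with a tiny gradient-ascent loop on the average log-likelihood. The toy dataset, learning rate and iteration count below are all illustrative assumptions:

```python
import math

# Toy maximum-likelihood fit of a 2-D logistic regression boundary by
# gradient ascent on the average log-likelihood. The dataset, learning
# rate and iteration count are illustrative assumptions.
data = [((1.0, 1.0), 1), ((2.0, 0.5), 1), ((1.5, 2.0), 1),   # 'pos' = 1
        ((0.0, 0.0), 0), ((0.2, 0.3), 0), ((-1.0, 0.5), 0)]  # 'neg' = 0

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

w = [0.0, 0.0, 0.0]  # [beta0, beta1, beta2]
lr = 0.5
for _ in range(2000):
    grad = [0.0, 0.0, 0.0]
    for (x1, x2), y in data:
        p = sigmoid(w[0] + w[1] * x1 + w[2] * x2)  # predicted P_pos
        err = y - p          # term of the log-likelihood gradient
        grad[0] += err
        grad[1] += err * x1
        grad[2] += err * x2
    # Step uphill on the average gradient.
    w = [wi + lr * g / len(data) for wi, g in zip(w, grad)]

# After training, every point should fall on its own side of the boundary.
correct = sum((sigmoid(w[0] + w[1] * x1 + w[2] * x2) > 0.5) == (y == 1)
              for (x1, x2), y in data)
print(correct, len(data))
```

Because each step increases the (log-)likelihood, the learned boundary ends up separating this linearly separable toy set perfectly; a library implementation would of course use a faster optimiser.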