In my previous article, “Intuitively, how can we understand different classification algorithms”, I introduced the main principles of classification algorithms. However, the toy data I used was quite simple and almost linearly separable; in real life, the data is almost always non-linear, so we should make our algorithm able to tackle non-linearly separable data.
Let’s compare how logistic regression behaves with almost linearly separable data and non-linearly separable data.
With the two toy datasets below, we can see that Logistic Regression finds the decision boundary when the data is almost linearly separable, but when the data is not linearly separable, it is not capable of finding a clear decision boundary. This is understandable, because Logistic Regression can only separate the data into two parts with a single boundary.
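To make the comparison concrete, here is a minimal sketch. The two one-dimensional toy datasets are hypothetical stand-ins for the figures above; the only point is that a single Logistic Regression fits the first case well and struggles with the second:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# almost linearly separable: blue cluster on the left, red cluster on the right
X_lin = np.concatenate([rng.normal(-2, 0.7, 50), rng.normal(2, 0.7, 50)]).reshape(-1, 1)
y_lin = np.array([0] * 50 + [1] * 50)

# non-linearly separable: a red cluster sandwiched between two blue clusters
X_non = np.concatenate([rng.normal(-3, 0.5, 50), rng.normal(0, 0.5, 50), rng.normal(3, 0.5, 50)]).reshape(-1, 1)
y_non = np.array([0] * 50 + [1] * 50 + [0] * 50)

for name, X, y in [("almost linear", X_lin, y_lin), ("non-linear", X_non, y_non)]:
    acc = LogisticRegression().fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {acc:.2f}")
```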
In this article, we will reuse logistic regression in a different way and try to make it work for nonlinear data.
Looking at the blue dots and red dots, the data is nonlinear: there are 3 parts, so we need to find two decision boundaries. So, what if… what if we try to combine two logistic regressions?
For example, we would have these two models based on logistic regression.
Then we can combine h1 and h2 to do the final prediction with… well, another logistic regression.
Then we can replace h1 and h2 with their own logistic function expressions:
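Written out (the coefficient names below are mine, since the actual notation lives in the figures), the two models, their combination, and the substituted form look like this:

$$h_1(x) = \sigma(a_0 + a_1 x), \qquad h_2(x) = \sigma(b_0 + b_1 x), \qquad \text{where } \sigma(z) = \frac{1}{1+e^{-z}}$$

$$o(x) = \sigma\big(c_0 + c_1\,h_1(x) + c_2\,h_2(x)\big) = \sigma\!\left(c_0 + \frac{c_1}{1+e^{-(a_0 + a_1 x)}} + \frac{c_2}{1+e^{-(b_0 + b_1 x)}}\right)$$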
Then we can solve the problem by finding the different coefficients, and guess what: the final result is quite satisfying. We can see that the two combined Logistic Regressions give us the two decision boundaries.
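As a rough sketch of what “finding the coefficients” could look like (the article does not show its fitting code, so the data and the least-squares fitter here are my assumptions), we can feed the composed expression to a generic curve fitter:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_model(x, a0, a1, b0, b1, c0, c1, c2):
    # two inner logistic regressions ...
    h1 = sigmoid(a0 + a1 * x)
    h2 = sigmoid(b0 + b1 * x)
    # ... combined by an outer one
    return sigmoid(c0 + c1 * h1 + c2 * h2)

# hypothetical 1D data: a red cluster between two blue clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 0.5, 50), rng.normal(0, 0.5, 50), rng.normal(3, 0.5, 50)])
y = np.concatenate([np.zeros(50), np.ones(50), np.zeros(50)])

# fit all seven coefficients at once (an unlucky random start may need a retry)
coeffs, _ = curve_fit(combined_model, x, y, p0=rng.normal(size=7), maxfev=20000)
print(coeffs)
```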
Let’s go back to the final expression of our model:
This expression may seem quite scary. Let’s put some artistic sense into it.
Let’s draw something to represent the functions in a more readable way:
Horizontally, it would be even better:
Let’s specify all the coefficients to be determined; we can put them on the lines so the diagram is more intelligible:
And we can separate the linear function and the sigmoid function to improve readability even further.
Rectangles? No, circles are better. Wait, do you see what I see?
Wow, they are like … neurons!
We can add some color to make it even better. The connections represent the weights: we can make them thicker to represent bigger values, and use color to represent the sign.
Now let’s create a whole new world around this neural network.
When you compute the output value for a given input value, you have to go through all the hidden layers, so you go from left to right. Let’s call this “forward propagation”.
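In code, forward propagation through this small network is just two matrix multiplications, each followed by a sigmoid. A minimal sketch, with placeholder weights (not the ones from the article) for a 1-input, 2-hidden-neuron, 1-output network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # left to right: input -> hidden layer -> output
    h = sigmoid(x @ W1 + b1)      # hidden activations (h1, h2)
    return sigmoid(h @ W2 + b2)   # final prediction between 0 and 1

# placeholder weights: the middle region comes out "red", the sides "blue"
W1 = np.array([[2.0, -2.0]]); b1 = np.array([3.0, 3.0])
W2 = np.array([[4.0], [4.0]]); b2 = np.array([-6.0])

print(forward(np.array([[0.0], [-3.0], [3.0]]), W1, b1, W2, b2))
```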
In the end, you obtain an estimate of the output, which you compare with the actual output. The resulting error helps us fine-tune the weights; for this, we can use gradient descent, which means we have to calculate the different derivatives.
Since the error is calculated at the end, we first adjust the weights of the neurons just before the output layer, and we keep adjusting from right to left. So let’s call this “backward propagation”.
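Putting the two passes together, here is a minimal training sketch on hypothetical one-dimensional data (the cross-entropy loss, learning rate, and cluster positions are my assumptions, not the article’s):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical 1D data: a red cluster between two blue clusters
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 0.5, 30),
                    rng.normal(0, 0.5, 30),
                    rng.normal(3, 0.5, 30)]).reshape(-1, 1)
y = np.concatenate([np.zeros(30), np.ones(30), np.zeros(30)]).reshape(-1, 1)

# 1 input -> 2 hidden sigmoid neurons -> 1 sigmoid output
W1 = rng.normal(size=(1, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(2, 1)); b2 = np.zeros(1)
lr = 1.0

for epoch in range(10000):
    # forward propagation (left to right)
    H = sigmoid(X @ W1 + b1)            # hidden activations h1, h2
    out = sigmoid(H @ W2 + b2)          # estimated output
    # backward propagation (right to left): gradients of the cross-entropy loss
    d_out = (out - y) / len(X)          # error at the output neuron
    dW2, db2 = H.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)
    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# (with an unlucky random initialization this can land in a poor local minimum)
pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5
print("training accuracy:", (pred == (y > 0.5)).mean())
```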
Now let’s consider two input variables, and here is the toy data.
Intuitively, we can see that two decision boundaries would be sufficient. So let’s use two hidden neurons:
Then we can visualize the final result by evaluating the neural network on a grid of points covering the surrounding area. We can see that the model is doing quite a good job.
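One way to reproduce this kind of experiment: the toy data below is a hypothetical stand-in for the figure, and scikit-learn’s MLPClassifier with two logistic hidden units is just a convenient way to stack the logistic regressions. We train it, then evaluate it on a grid covering the surrounding area:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# hypothetical 2D toy data: a red band between two blue regions
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = ((X[:, 0] + X[:, 1] > -1.5) & (X[:, 0] + X[:, 1] < 1.5)).astype(int)

# two hidden sigmoid neurons plus a sigmoid output: stacked logistic regressions
net = MLPClassifier(hidden_layer_sizes=(2,), activation='logistic',
                    solver='lbfgs', max_iter=5000)
net.fit(X, y)

# evaluate the network on a grid covering the surrounding area
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
zz = net.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, zz, levels=20, cmap='RdBu_r', alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r', edgecolor='k', s=15)
plt.show()
```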
Now since we know that there are only logistic regressions inside the neural network, we can try to visualize the different steps of the transformation.
Let’s look at H1: the input data are the original blue and red dots. H1 is a Logistic Regression that has two input variables, so the result is a surface.
The second hidden neuron, H2, is similar.
The output layer O1 is a Logistic Regression that also takes two inputs, so we again have a surface to represent the result. The inputs of O1 are two series of values between 0 and 1, because they are the outputs of two Logistic Regressions. And if we slice the surface at the value 0.5, we can see that the initial blue and red dots are linearly separated into two parts.
So what happens in this neural network? In a sense, we can say that the neural network transforms the nonlinearly separable data into almost linearly separable data.
The hidden layer has 2 neurons, so two dimensions. If we want to see this transformation, we expect to see how the dots are moved on the plane. And we know that the final positions of the dots will be inside the unit square (0,1)x(0,1).
We can look at the blue dots and the red dots separately. Note that the arrows point to the final positions.
And then we can see the work of the final logistic regression.
The original data are not linearly separable, but after the transformation, they are moved into the unit square, where they become almost linearly separable.
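To draw this transformation ourselves, we can read the hidden activations off a fitted network. With the same hypothetical setup as in the earlier sketch, each point lands somewhere inside the unit square:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# same hypothetical toy data as in the earlier sketch
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = ((X[:, 0] + X[:, 1] > -1.5) & (X[:, 0] + X[:, 1] < 1.5)).astype(int)

net = MLPClassifier(hidden_layer_sizes=(2,), activation='logistic',
                    solver='lbfgs', max_iter=5000).fit(X, y)

# hidden activations: where each point lands inside the unit square (0,1)x(0,1)
H = sigmoid(X @ net.coefs_[0] + net.intercepts_[0])

plt.scatter(H[y == 0, 0], H[y == 0, 1], c='tab:blue', s=15, label='blue dots')
plt.scatter(H[y == 1, 0], H[y == 1, 1], c='tab:red', s=15, label='red dots')
plt.xlabel('h1'); plt.ylabel('h2'); plt.legend(); plt.show()
```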
Now let’s consider another toy dataset. Intuitively, how many hidden neurons would you use?
Well, 3 neurons would be sufficient, as we can see in the graph below:
We can see the logistic regression surface for each neuron.
The output neuron has 3 inputs (since they are outputs of Logistic Regressions, their values are between 0 and 1), so we can visualize them in a unit cube.
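A sketch of this unit-cube view, again on hypothetical data shaped so that three boundaries are enough (red inside a triangle, blue outside):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical toy data: the red region is bounded by three straight lines
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = ((X[:, 1] > -2) & (X[:, 1] < 2 - X[:, 0]) & (X[:, 1] < 2 + X[:, 0])).astype(int)

net = MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
                    solver='lbfgs', max_iter=5000).fit(X, y)

# the three hidden activations place every point inside the unit cube (0,1)^3
H = sigmoid(X @ net.coefs_[0] + net.intercepts_[0])

ax = plt.figure().add_subplot(projection='3d')
ax.scatter(H[y == 0, 0], H[y == 0, 1], H[y == 0, 2], c='tab:blue', s=10)
ax.scatter(H[y == 1, 0], H[y == 1, 1], H[y == 1, 2], c='tab:red', s=10)
ax.set_xlabel('h1'); ax.set_ylabel('h2'); ax.set_zlabel('h3')
plt.show()
```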
We can create an animation to visualize this better.