Neural networks are a hot topic in the world of machine learning. There are numerous blog posts, online videos, courses, and even mainstream media coverage devoted to them. Many of the top-performing AI systems in the world today, including the latest version of Google Translate, are built on this technique.
Many aspiring data scientists today are keen to learn specific areas such as natural language processing and image processing. Much of that work relies on a basic understanding of neural networks, and that is precisely what I will try to help you build as you read on, rather than delving deep into the finer details of the theory.
To begin with, we shall look at two specific cases of neural networks: the perceptron algorithm and logistic regression (hello, old friend!), both of which are much simpler to reason about. Neither is usually thought of as a neural network, but we'll start here rather than march through a pile of "here is what neural networks are" content in one go. The hope is that once you understand these two in the context of neural networks, it becomes relatively easy to grasp the general form of a neural network and what it tries to achieve.
Consider the case where your data has two input features, say, X1 and X2, and the objective is to make a binary classification decision — whether the predicted class is 0 or 1. A classic machine learning problem.
The Perceptron algorithm would tackle this problem in two steps:
Step 1: Form a linear combination of the input features by assigning weights and adding a bias: (w1X1 + w2X2 + b), where "b" is the bias, a constant term.
Step 2: Pass the quantity (w1X1 + w2X2 + b), let's call it "z", through a step function, and then assign the output to class 1 (if z ≥ 0) or class 0 (if z < 0).
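Here is a rough sketch of these two steps in code; the weights and bias below are made-up values chosen for illustration, not parameters learned from data:

def perceptron_predict(x1, x2, w1, w2, b):
    z = w1 * x1 + w2 * x2 + b        # Step 1: weighted sum of the inputs plus the bias
    return 1 if z >= 0 else 0        # Step 2: step function decides the class

# hand-picked, illustrative weights (a real perceptron learns these from data)
print(perceptron_predict(x1=2.0, x2=-1.0, w1=0.5, w2=0.3, b=-0.2))   # prints 1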
A visual representation of how the perceptron makes the decision is given below:
The only difference in the way logistic regression makes the same decision is that it uses a sigmoid function instead of a step function:
Step 1: same as above
Although the two look similar in shape, the difference is that the sigmoid function is continuous while the step function is not. And as its output, logistic regression gives a predicted probability (p) that an observation belongs to class 1.
Step 2: (w1X1 + w2X2 + b) = "z" is now run through a sigmoid transformation, whose mathematical form is:
p = 1/(1 + e^(-z)), which returns a value in the open interval (0, 1). This value is interpreted as the probability of class 1 and, typically by thresholding at 0.5, determines which class the observation is assigned to (0 or 1).
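As a minimal sketch of the same decision with a sigmoid (the weights, bias, and the 0.5 threshold below are illustrative assumptions, not learned values):

import math

def sigmoid(z):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict(x1, x2, w1, w2, b, threshold=0.5):
    z = w1 * x1 + w2 * x2 + b       # Step 1: same linear combination as before
    p = sigmoid(z)                  # Step 2: predicted probability of class 1
    return p, (1 if p >= threshold else 0)

p, label = logistic_predict(x1=2.0, x2=-1.0, w1=0.5, w2=0.3, b=-0.2)
print(round(p, 3), label)           # ~0.622, class 1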
This function (be it the step or the sigmoid) is, in the context of neural networks, called an activation function: it accepts a linear combination of the inputs and runs it through a transformation.
Let’s look at a more general representation of how a neural network algorithm functions. Given below is the architecture of a general neural network. Aligning your understanding of the algorithm to the below representation will make it relatively easier to remember.
The input features are exactly the same as before, X1 and X2. The output is the same as in the logistic regression case: a predicted probability (p). But instead of going directly from the inputs to the output, we now have a middle layer, called a "hidden layer." We take linear combinations of the inputs and pass them through an activation function to get the hidden layer. From one layer to the next, we keep applying linear combinations followed by activation functions, until we arrive at the output, which is a predicted probability.
But why do we have these hidden layers? At a high level, the hidden layers can extract useful features that a direct input-to-output mapping could not. This is also why we apply an activation function at each layer: without it, stacking layers would still give us nothing more than a linear function of the input features. The sigmoid, for example, being a curve, helps us uncover non-linear relationships in the data. And so, from one hidden layer to the next, we extract non-linear features that a simple perceptron could never capture.
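To make that concrete, here is a small sketch (using arbitrary random weights, purely for illustration) showing that two stacked layers with no activation in between collapse into a single linear transformation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)                            # two input features
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)

# Two "layers" applied back to back with no activation in between...
out_stacked = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer with combined weights and bias.
out_single = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(out_stacked, out_single))       # True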
Please note that the image given here is of a very simple neural network; in reality, the hidden layers can vary in number and size. Applying the activation function to linear combinations of the original input features gives us H1 and H2, the first and second features of the hidden layer. Both the original data and the hidden layer happen to have two features each here, but that need not always be the case: depending on the problem at hand, the hidden layer can have many more features than we started with.
For bookkeeping purposes, every time you see a number in parentheses as a superscript on a term, it tells you which layer that term belongs to: (1) indicates the first hidden layer, (2) the second, and so on.
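Putting the pieces together, here is a rough sketch of a single forward pass through the simple network described above. All the weights and biases are placeholder values chosen for illustration; a real network would learn them from data:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2):
    # Layer (1): linear combinations of the inputs, passed through the activation, give H1 and H2
    h1 = sigmoid(0.4 * x1 - 0.6 * x2 + 0.1)    # made-up weights and bias
    h2 = sigmoid(-0.3 * x1 + 0.8 * x2 + 0.2)

    # Layer (2): a linear combination of H1 and H2, passed through the activation, gives the output p
    p = sigmoid(0.7 * h1 + 0.5 * h2 - 0.4)     # made-up weights and bias
    return p

print(forward(x1=2.0, x2=-1.0))                # predicted probability of class 1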
While this article doesn't cover neural networks in any depth, it has hopefully served its purpose of getting you comfortable with their architecture, both visually and mathematically. There are plenty of questions we have left unanswered, but they fall outside the scope of this article.
As a real-world example, think of a basic autopilot feature in an aeroplane: a neural network could read signals from the cockpit instruments (the inputs) and use them to adjust the aeroplane's controls so as to maintain the desired course (the output).
The primary objective of this write-up is to serve as an introduction to the world of neural networks without digging too deep into the finer details. Hopefully it has done its job well enough, and you now have a fair notion of what neural networks are and what they are used for.