The Data Daily

Random Forest Algorithm for Machine Learning

Have you ever asked yourself a series of questions to help make a final decision on something? Maybe it was a simple decision, like what you wanted to eat for dinner. You might have asked yourself whether you wanted to cook, pick food up, or get delivery. If you decided to cook, then you would have needed to figure out what type of cuisine you were in the mood for. And lastly, you probably needed to figure out whether you had all of the ingredients in your fridge or needed to make a run to the store. Answering these questions would have helped you come to a final decision on dinner that night.

We all use this decision-making process multiple times, every single day. In the machine learning world, this process is called a decision tree. You start with a node, which then branches to another node, repeating this process until you reach a leaf. A node asks a question in order to help classify the data. A branch represents the different possibilities that the node could lead to. A leaf is the end of a decision tree, or a node that no longer has any branches.
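As a concrete sketch, scikit-learn’s DecisionTreeClassifier learns exactly this node/branch/leaf structure from data; the dinner-themed features and labels below are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dinner-style data: [wants_to_cook, has_ingredients] -> decision
X = [[1, 1], [1, 0], [0, 1], [0, 0]]   # made-up feature values
y = ["cook", "store run", "order out", "order out"]

tree = DecisionTreeClassifier().fit(X, y)

# Print the learned nodes, branches, and leaves
print(export_text(tree, feature_names=["wants_to_cook", "has_ingredients"]))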

The Random Forest Algorithm is composed of many decision trees, each trained on a different random sample of the data, which leads to different splits and different leaves. It merges the decisions of these trees to find an answer, which represents the average (for regression) or majority vote (for classification) of all the individual trees.

The random forest algorithm is a supervised learning model; it uses labeled data to “learn” how to classify unlabeled data. This is the opposite of the k-means clustering algorithm, which we learned in a past article is an unsupervised learning model. The Random Forest Algorithm is used to solve both regression and classification problems, making it a versatile model that is widely used by engineers.

Picture, for example, three individual decision trees that together make up a random forest. Random forest is considered ensemble learning, meaning it creates more accurate results by using multiple models to come to its conclusion. The algorithm uses the leaves, or final decisions, of each tree to come to a conclusion of its own. This increases the accuracy of the model, since it is looking at the results of many different decision trees and finding an average.
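A minimal sketch of that averaging idea, assuming scikit-learn and a made-up dataset: train a few trees on different bootstrap samples of the data, then average their predictions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # made-up feature
y = 3 * X.ravel() + rng.normal(0, 2, size=200)   # noisy made-up target

trees = []
for _ in range(3):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x_new = [[5.0]]
predictions = [t.predict(x_new)[0] for t in trees]
print(sum(predictions) / len(predictions))       # the forest's averaged answer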

Let’s say you want to estimate the average household income in your town. You could easily find an estimate using the Random Forest Algorithm. You would start off by distributing surveys asking people to answer a number of different questions. Depending on how they answered these questions, an estimated household income would be generated for each person.

After you’ve generated an estimate from each of several decision trees, you can apply the Random Forest Algorithm to this data. You would look at the result of each decision tree and use random forest to find the average income across all of the trees. Applying this algorithm would provide you with an accurate estimate of the average household income of the people you surveyed.
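A hedged sketch of that workflow with scikit-learn’s RandomForestRegressor; the survey features and income figures below are invented purely for illustration.

from sklearn.ensemble import RandomForestRegressor

# Hypothetical survey answers: [years_of_education, household_size, hours_worked]
X = [[12, 3, 40], [16, 2, 45], [18, 4, 50], [10, 5, 35], [14, 1, 38]]
y = [42_000, 61_000, 88_000, 35_000, 52_000]  # invented incomes

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Each tree predicts a number; the forest reports their average
print(model.predict([[15, 2, 42]]))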

Our next example deals with classification, which involves categorical (non-numerical) data. Let’s say you are doing market research for a new company that wants to know what type of people are likely to buy its products. You’ll probably start by asking a sample of people in the target market a series of questions about their buying behaviors and the kinds of products they prefer. Based on their answers, you’ll be able to classify them as a potential customer or not a potential customer.

Before applying the Random Forest Algorithm to these results, you will need to perform something called one-hot encoding. This entails converting each categorical variable into a set of binary (0/1) columns, one per category, so that mathematics can be applied to the problem.
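For example, with pandas (the column name here is invented for illustration), get_dummies turns each category into its own 0/1 column:

import pandas as pd

survey = pd.DataFrame({"favorite_store": ["online", "retail", "online", "outlet"]})
print(pd.get_dummies(survey, columns=["favorite_store"]))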

After the data is one-hot encoded, the mathematics can be applied and the Random Forest Algorithm can come to a conclusion. If the algorithm concludes that most people in the target market are not potential customers, it may be a good idea for the company to rethink its product with these people in mind.
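Putting the two steps together, here is a minimal sketch (the survey answers and labels are invented for illustration) that one-hot encodes the responses and then fits a RandomForestClassifier:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

answers = pd.DataFrame({
    "shops_online": ["yes", "no", "yes", "yes", "no"],
    "price_sensitive": ["no", "yes", "yes", "no", "yes"],
})
labels = ["customer", "not customer", "not customer", "customer", "not customer"]

X = pd.get_dummies(answers)                      # one-hot encode the categories
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

new_person = pd.get_dummies(pd.DataFrame(
    {"shops_online": ["yes"], "price_sensitive": ["no"]}
)).reindex(columns=X.columns, fill_value=0)      # align columns with training data
print(model.predict(new_person))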

When using the Random Forest Algorithm to solve regression problems, you use the mean squared error (MSE) to determine how your data branches from each node.
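In standard notation, with the variables defined just below, the formula is:

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (f_i - y_i)^2

where N is the number of data points.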

This formula measures the distance between each prediction and the actual value, helping to decide which branch is the better choice for your forest. Here, y_i is the actual value of the data point you are testing at a certain node and f_i is the value returned by the decision tree.

When building random forests on classification data, you should know that you are often using the Gini index, the formula used to decide how nodes on a decision tree branch.
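In standard notation, with the variables defined just below:

\text{Gini} = 1 - \sum_{i=1}^{c} (p_i)^2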

This formula uses the class probabilities to determine the Gini impurity of each branch of a node, indicating which of the branches is more likely to occur. Here, p_i represents the relative frequency of the class you are observing in the dataset and c represents the number of classes.

You can also use entropy to determine how nodes branch in a decision tree.
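In standard notation, using the same p_i and c as above:

\text{Entropy} = \sum_{i=1}^{c} -p_i \log_2(p_i)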

Entropy uses the probability of a certain outcome in order to make a decision on how the node should branch. Unlike the Gini index, it is more mathematically intensive due to the logarithmic function used in calculating it.
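A quick sketch comparing the two measures for a single node’s class distribution (the class frequencies are made up):

import math

p = [0.7, 0.2, 0.1]  # made-up relative class frequencies at one node

gini = 1 - sum(pi ** 2 for pi in p)             # cheap: squares and a sum
entropy = sum(-pi * math.log2(pi) for pi in p)  # pricier: one logarithm per class

print(f"Gini: {gini:.3f}")
print(f"Entropy: {entropy:.3f}")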

It is essential to understand a single decision tree before you can fully understand the random forest algorithm. You must understand the difference between a node, branch and leaf, and how the different formulas are applied in order to come to a final decision.

When used correctly, the random forest algorithm can be extremely useful on many different types of data sets, whether regression or classification data. It is easy to use, fast to train, and produces an accurate aggregate of the decision trees it builds.

For more resources, check out some projects using random forest.
