The Data Daily

What is Bootstrap Sampling in Machine Learning and Why is it Important?

The bootstrap sampling method is a very simple concept and a building block for more advanced machine learning techniques such as bagging, which underlies algorithms like random forests. However, when I started my data science journey, I couldn’t quite understand the point of it. So my goal is to explain what the bootstrap method is and why it’s important to know!

Technically speaking, the bootstrap sampling method is a resampling method that uses random sampling with replacement.

Don’t worry if that sounded confusing. Let me walk through it with an example:

Suppose you have an initial sample with 3 observations. Using the bootstrap sampling method, you’ll create a new sample with 3 observations as well. Each observation has an equal chance of being chosen (1/3). In this case, the second observation was chosen randomly and will be the first observation in our new sample.

Next, another observation is chosen at random, and this time it is the green observation.

Lastly, the yellow observation is chosen again at random. Remember that bootstrap sampling uses random sampling with replacement. This means that an observation that has already been chosen can be chosen again.

And this is the essence of bootstrap sampling!
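The walk-through above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the helper name `bootstrap_sample` and the three color labels are assumptions made up for the example.

```python
import random


def bootstrap_sample(observations, rng=None):
    """Return a bootstrap sample: same size as the input, drawn with replacement.

    `bootstrap_sample` is a hypothetical helper name used for illustration.
    """
    rng = rng or random.Random()
    # random.choices samples WITH replacement, so on every draw each
    # observation has the same 1/n chance of being picked.
    return rng.choices(observations, k=len(observations))


# Hypothetical 3-observation sample; the labels are placeholders.
original = ["blue", "green", "yellow"]

# A seeded generator makes the draw repeatable between runs.
resample = bootstrap_sample(original, rng=random.Random(0))
print(resample)
```

Because the draws are made with replacement, the new sample can contain the same observation more than once and can leave other observations out entirely.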

Great, now you understand what bootstrap sampling is, and you know how simple the concept is, but now you’re probably wondering what makes it so useful.

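As you learn more about machine learning, you’ll almost certainly come across the term “bootstrap aggregating”, also known as “bagging”. Bagging trains several models on different bootstrap samples of the training data and combines their predictions, and it is the technique behind ensemble algorithms like random forests. Boosting algorithms such as AdaBoost, gradient boosting, and XGBoost belong to a different family of ensemble methods: they train their models one after another instead of on independent bootstrap samples.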
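To make bagging a bit more concrete, here is a rough sketch in plain Python. Everything in it is an assumption chosen for illustration: the toy one-dimensional dataset, the very simple "stump" base model, and the helper names `fit_stump`, `bagged_stumps`, and `predict`. Real libraries implement bagging with far more care.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a 1-D threshold model that predicts 1 when x >= threshold.

    Hypothetical base learner: tries every observed x as the threshold
    and keeps the one with the fewest training errors.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x >= threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap step: same size as the data, drawn with replacement.
        sample = rng.choices(points, k=len(points))
        thresholds.append(fit_stump(sample))
    return thresholds


def predict(thresholds, x):
    """Aggregate step: majority vote over all the stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Made-up training data: (feature value, class label).
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]

model = bagged_stumps(data)
print(predict(model, 0.8), predict(model, 4.2))
```

Each stump sees a slightly different version of the data, so the individual thresholds differ, and the vote smooths out those differences.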

Sometimes when estimating a parameter of a population (e.g. the mean), you may have a sample that is not large enough to assume that the sampling distribution is normal. Also, in some cases, it may be difficult to work out the standard error of the estimate. In either case, bootstrap sampling can be used to work around these problems.

In essence, under the assumption that the sample is representative of the population, bootstrap sampling is conducted to provide an estimate of the sampling distribution of the sample statistic in question.
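As a rough illustration of that idea, the sketch below resamples a small dataset many times, computes the mean of each bootstrap sample, and uses the spread of those means to approximate a standard error and a percentile interval. The data values, the number of resamples, and the helper name `bootstrap_statistic` are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_statistic(sample, statistic, n_resamples=2000, seed=0):
    """Compute `statistic` on many bootstrap samples of `sample`.

    Hypothetical helper: the returned list approximates the sampling
    distribution of the statistic.
    """
    rng = random.Random(seed)
    return [
        statistic(rng.choices(sample, k=len(sample)))  # draw with replacement
        for _ in range(n_resamples)
    ]


# Made-up measurements standing in for a small sample.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.3]

boot_means = bootstrap_statistic(sample, statistics.mean)

# The standard deviation of the bootstrap means estimates the standard error.
standard_error = statistics.stdev(boot_means)

# n=40 gives cut points every 2.5%, so the first and last cut points
# bound an approximate 95% percentile interval.
cuts = statistics.quantiles(boot_means, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"mean={statistics.mean(sample):.2f} "
      f"se={standard_error:.2f} interval=({lower:.2f}, {upper:.2f})")
```

The same helper works for statistics with no convenient standard error formula, such as the median, by passing `statistics.median` instead of `statistics.mean`.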

This point is a little more statistical, so if you don’t understand it, don’t worry. All you have to understand is that bootstrap sampling serves as the basis for “bagging”, a technique used by ensemble models such as random forests.

If you want to learn more machine learning fundamentals and stay up to date with my content, you can do so here.

If you want to continue your learnings, check out my article on ensemble learning, bagging, and boosting here.