As we all know, statistics is one of the areas of knowledge one needs to be a data scientist. In this blog, I’m going to pen down a few thoughts about it.
I want to start with my favorite line from the book “Rich Dad Poor Dad”, a non-fiction book about personal finance, investing, business, etc. Although the line is not about statistics, I relate it to statistics.
Here it is
For me, the above words give a crystal-clear explanation of what statistics is.
We have all come across the phrase that machine learning and deep learning are data-driven technologies. That is because data is the oil that runs those robust engines.
The flow of any data science project will be,
From the flow, we can say that the model ultimately depends on the data. The data we download from Kaggle and other internet resources is usually clean and does not need much preparation. But in reality, data is rarely like that. Around forty percent of the work goes into data preparation. This is where statistics plays the hero role.
Some of the most commonly used statistical techniques in data science are converting nominal or ordinal variables into numerical data, filling missing values with the mean, median, or mode as appropriate, the chi-squared test, the ANOVA test, the z-test, etc. And there are Python libraries where you can apply these techniques without writing out the formulas yourself.
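To make two of these techniques concrete, here is a minimal sketch that uses only Python’s built-in `statistics` module. The `ages` and `sizes` columns, the `fill_missing` helper, and the small/medium/large ordering are all made up for illustration; in a real project you would more likely reach for a library such as pandas or SciPy.

```python
import statistics

# Hypothetical toy column of ages; None marks a missing value.
ages = [25, 30, None, 22, 30, None, 41]
observed = [a for a in ages if a is not None]

def fill_missing(values, fill_value):
    """Return a copy of values with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]

# Three common choices for the fill value.
filled_mean = fill_missing(ages, statistics.mean(observed))
filled_median = fill_missing(ages, statistics.median(observed))
filled_mode = fill_missing(ages, statistics.mode(observed))

# Ordinal variable: the categories have a natural order, so map them to ranks.
size_rank = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
sizes = ["medium", "small", "large", "medium"]
encoded_sizes = [size_rank[s] for s in sizes]

print(filled_mean)
print(filled_median)
print(filled_mode)
print(encoded_sizes)
```

Which fill value to pick depends on the column: the mean is sensitive to extreme values, the median is not, and the mode is the usual choice for categorical data.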
Here, I will not dive deep into the various statistical techniques and theories. Instead, I will talk about why a data science model has to depend on one of the heroes among statistical techniques: the BELL CURVE (or normal distribution).
Why is the bell curve needed?
• The center of the bell curve, where it peaks, sits at the mean of the data, and the standard deviation tells us how much the data differs from that mean.
• If the bell curve is wider, the standard deviation is large, which in turn means the data is spread out widely from the mean.
• On the other hand, if the curve is taller and thinner, the standard deviation is small, and we know that most of the data points lie close to the mean.
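The points above can be checked with a small sketch using `statistics.NormalDist` from the Python standard library. The mean of 50 and the standard deviations of 5 and 15 are arbitrary values chosen for illustration, and `share_within` is a helper I made up for this example.

```python
from statistics import NormalDist

# Two bell curves with the same mean but different spreads (assumed values).
narrow = NormalDist(mu=50, sigma=5)
wide = NormalDist(mu=50, sigma=15)

# Height of each curve at its center: the smaller the sigma, the taller the peak.
narrow_peak = narrow.pdf(50)
wide_peak = wide.pdf(50)

def share_within(dist, distance):
    """Fraction of the distribution lying within `distance` of its mean."""
    return dist.cdf(dist.mean + distance) - dist.cdf(dist.mean - distance)

narrow_share = share_within(narrow, 10)
wide_share = share_within(wide, 10)

print(f"peak height: narrow={narrow_peak:.4f}, wide={wide_peak:.4f}")
print(f"share within 10 of the mean: narrow={narrow_share:.2%}, wide={wide_share:.2%}")
```

The narrow curve has the taller peak and keeps a much larger share of its values close to the mean, which is exactly what "taller and thinner" is telling us.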
From the points above, we can read the dispersion of the data straight off the curve. If the data is roughly bell-shaped, it is in good shape for many modeling techniques. If not, you may have to transform it so that it gets closer to that shape. While doing this, we have to reduce the dispersion without changing what the data actually means. Here comes another hero: normalization. Normalizing brings the values onto a common scale around the mean; for skewed data it is usually paired with a transformation (a log transform, for example), since rescaling alone does not change the shape of the curve. By doing so, our model outcomes will not be biased toward any specific range of values.
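As a rough illustration, here is one common form of normalization, the z-score (subtract the mean, divide by the standard deviation), again with only the standard library. I am assuming this is the kind of normalization meant here; the `incomes` values and the `standardize` helper are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / spread for v in values]

incomes = [32_000, 45_000, 51_000, 58_000, 64_000, 120_000]  # hypothetical values
z_scores = standardize(incomes)

for income, z in zip(incomes, z_scores):
    print(f"{income:>7} -> {z:+.2f}")
```

After this step the values have mean 0 and standard deviation 1, so a feature measured in large numbers no longer dominates the others. Keep in mind that the z-score only shifts and rescales the data: the order of the values and the overall shape of the distribution stay the same.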
Why is it called the normal distribution?
Because it is a normal thing that occurs in our day-to-day life. For example, take the test scores of students in a class. Few students score more than 80% or less than 40%; most scores lie in the 40–80 range. That is exactly the shape of the bell curve. The same story shows up in many real-life examples, such as a person’s height, IQ level, etc.
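To put rough numbers on the test-score example, here is a small sketch. It assumes the scores follow a normal distribution with a mean of 60 and a standard deviation of 15; both figures are my own assumptions for illustration, not measured data.

```python
from statistics import NormalDist

# Assumed for illustration: class scores are roughly normal, mean 60, sd 15.
scores = NormalDist(mu=60, sigma=15)

below_40 = scores.cdf(40)
between_40_and_80 = scores.cdf(80) - scores.cdf(40)
above_80 = 1 - scores.cdf(80)

print(f"below 40: {below_40:.1%}")
print(f"40 to 80: {between_40_and_80:.1%}")
print(f"above 80: {above_80:.1%}")
```

Under these assumptions a little over 80% of the class lands between 40 and 80, with the rest split evenly between the two tails.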
I hope this helps people like me who study these concepts from books or other sources but don’t know how and where they are actually used.