
The Data Daily

Histograms. Why Histograms?


A better question is “Why not histograms?” The histogram is arguably the best way to look at the distribution of your numerical data (most statisticians give it an unofficial seal of approval).

It’s also a quick way to spot clustering in a single numeric variable. And of course, it lets you know whether the numerical data is shaped like the holy grail of statistics, a.k.a. the normal distribution.

In short, a histogram lets you:

· Look directly at the distribution of the numeric data

· Check for grouping (clustering) within the single numeric variable

· Know what the probability distribution of the data approximately looks like for further analysis

So, it gives you almost everything you need to know before moving on to more complex analysis of a numerical variable.

The two most important components of a histogram are the number of bins and the width of the bins.

The number of bins breaks the numeric variable up into smaller groupings (or bigger ones, if you prefer). The higher the number of bins, the more breaks you have and the finer the detail your histogram shows. With enough data it starts to look like a smooth curve, although with too many bins for the amount of data it turns jagged and noisy instead. Fewer bins do the reverse: the histogram looks coarser and blockier.

The width of the bins has a similar effect, but in reverse, because for equal-width bins the width is roughly the range of the data divided by the number of bins. So the larger the bin width, the fewer breaks you have and the less smooth the curve looks. The smaller the width, the more breaks you have and the smoother your histogram will look.
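To make the link between the two settings concrete, here is a minimal Python sketch. It assumes equal-width bins that span the data from minimum to maximum. The function names and the BMI-like values are made up for illustration and are not taken from the examples in this article.

```python
import math

def bin_width_for_count(values, n_bins):
    """Width of each bin when the data range is split into n_bins equal bins."""
    return (max(values) - min(values)) / n_bins

def bin_count_for_width(values, width):
    """Number of equal-width bins needed to cover the data range."""
    span = max(values) - min(values)
    return max(1, math.ceil(span / width))

# Hypothetical BMI-like readings, made up for illustration only.
bmi = [18.0, 21.5, 24.0, 26.5, 29.0, 33.0, 41.0, 48.0]

print(bin_width_for_count(bmi, 30))  # 30 bins over a 30-unit range -> 1.0
print(bin_count_for_width(bmi, 10))  # 10-unit bins over the same range -> 3
```

Picking one setting fixes the other, which is why plotting tools usually ask for either a bin count or a bin width, not both.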

(What about skewed or bimodal histograms? Frankly, I think these topics are a bit too boring, and you probably will too, but I will address them later in other topics so that all the boring stuff doesn’t show up in one place.)

The y-axis of a standard histogram, as many will know, shows the count of observations that fall within each defined bin.
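As a rough illustration of what those counts are, the sketch below tallies observations into equal-width bins by hand. The helper name `histogram_counts` and the sample values are hypothetical. It assumes each bin includes its left edge but not its right edge, with the maximum value placed in the last bin.

```python
def histogram_counts(values, n_bins):
    """Return (edges, counts) for n_bins equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; there is no range to bin")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Each bin is [left, right); the maximum value goes in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return edges, counts

# Hypothetical BMI-like readings, made up for illustration only.
bmi = [18.0, 21.5, 24.0, 26.5, 29.0, 33.0, 41.0, 48.0]
edges, counts = histogram_counts(bmi, 3)
print(edges)   # [18.0, 28.0, 38.0, 48.0]
print(counts)  # [4, 2, 2]
```

The heights of the bars in a histogram are exactly these counts, so they always add up to the number of observations.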

The histogram of BMI values here isn’t quite perfectly normal, but you get the idea. This is your typical histogram, with the number of bins set to 30.

Let’s take a second look, but this time we will set the bin width to 10-unit intervals.

Looks like a big difference! The key take-away is that changing either the number of bins or the bin-width changes the shape and smoothness of the histogram.
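The same comparison can be tried on your own machine with a small text-based sketch. It uses synthetic, roughly bell-shaped values generated with Python's `random` module, which are not the BMI data behind the histograms in this article, and the helper names are made up for illustration.

```python
import math
import random

def counts_for_bins(values, start, width, n_bins):
    """Count the values falling in n_bins bins of the given width from start."""
    counts = [0] * n_bins
    for v in values:
        # Values on the far right edge are placed in the last bin.
        i = min(int((v - start) / width), n_bins - 1)
        counts[i] += 1
    return counts

def print_histogram(values, start, width, n_bins):
    """Print one row of '#' marks per bin and return the counts."""
    counts = counts_for_bins(values, start, width, n_bins)
    for i, c in enumerate(counts):
        left = start + i * width
        print(f"{left:6.1f} - {left + width:6.1f} | {'#' * c}")
    return counts

random.seed(0)
# Synthetic, roughly bell-shaped "BMI" values; NOT the article's data.
bmi = [random.gauss(28, 6) for _ in range(200)]
lo, hi = min(bmi), max(bmi)

# View 1: fix the number of bins at 30 and let the width follow.
by_count = print_histogram(bmi, lo, (hi - lo) / 30, 30)
print()

# View 2: fix the width at 10 units and let the number of bins follow.
start = math.floor(lo / 10) * 10
n_wide = math.ceil((hi - start) / 10)
by_width = print_histogram(bmi, start, 10, n_wide)
```

Both views describe the same 200 values. Only the binning differs, and that alone is enough to change how smooth or blocky the shape appears.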

In the two examples above, the shape of the distribution suggests that some groupings exist in the data: for instance, between BMI 25–35, and in the few high peaks toward the right end of the first histogram. These could indicate clustering among some of the observations, although more exploration and analysis would be required to tell. We can also see that the true distribution of BMIs in this population may be normal, but this sample shows some skew.
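One simple way to flag candidate peaks like these is to look for bins that are taller than both of their neighbours. The sketch below is only a first-pass hint under that assumption: random noise produces small peaks too, so it does not prove that clusters exist. The function name and the bin counts are hypothetical.

```python
def local_peaks(counts):
    """Indices of bins taller than both neighbours (end bins have only one)."""
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        if c > left and c > right:
            peaks.append(i)
    return peaks

# Hypothetical bin counts, made up for illustration only.
counts = [1, 4, 9, 6, 3, 5, 2, 0, 3]
print(local_peaks(counts))  # [2, 5, 8]
```

Peaks that survive when you change the number of bins are more likely to reflect real structure than peaks that appear at only one setting.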

So, for practical purposes, you should use histograms to explore your numeric variables, as they give some fairly interesting insight into your data.

