
Statistical Comparison Among Multiple Groups With ANOVA

Wondering How to Compare Multiple Groups? No Worries! Just Use ANOVA


Motivation

As a data science enthusiast, I always like to share knowledge with my peers. One fine morning, I was conducting a session on statistics for data science, covering hypothesis testing, the z-test, Student’s t-test, p-values, and so on. One of my students noticed that all of the tests compared or analyzed only two groups. Out of curiosity, he asked me, “All of the tests you mentioned compare two groups. But if I have three or more groups, how can I run a comparative analysis among them?” I was pleased to hear the question because his analytical thinking impressed me. I paused for a moment, and then it clicked that a test named ANOVA solves exactly this problem. However, it had been a long time since I had studied ANOVA, so I could not recall the procedure at that moment. I told the student that I would deliver a lecture on ANOVA in the next class. Now, I feel every data science enthusiast and practitioner should have a clear understanding of the ANOVA test. This article discusses the test.

A Brief Intro and Use Case of ANOVA

ANOVA stands for ANALYSIS OF VARIANCE. According to Wikipedia —

Analysis of variance (ANOVA) is a collection of statistical models, and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means [1].

To put ANOVA in simpler terms, it is a statistical test that checks whether the means of two or more populations are equal.

Before ANOVA was developed, Laplace and Gauss used other methods to compare multiple groups. Later, in 1918, Ronald Fisher introduced the term variance in his article “The Correlation Between Relatives on the Supposition of Mendelian Inheritance” [2]. In 1921, Fisher published his first application of the analysis of variance [3], and the method became widely known after it was included in his 1925 book “Statistical Methods for Research Workers” [1].

It is often hard to decide which statistical technique to use in which situation. So, I will mention some real-life examples where we can use the ANOVA test. Some use cases are as follows:

Suppose you choose three separate groups of 20 students and assign three teachers to conduct statistics classes. At the end of the semester, you calculate the mean result of the students in each group. Now, you can test whether there is any significant difference between the mean results of the groups. If there is, the group means point to the teacher whose students performed best.

A large-scale farm is interested in understanding which of three different fertilizers leads to the highest crop yield. They sprinkle each fertilizer on ten different fields and measure the total yield at the end of the growing season. To understand whether there is a statistically significant difference in the mean yield that results from these three fertilizers, researchers can conduct an ANOVA test [4].

A pharmaceutical company produces four medicines for a particular disease and gives them to 20 patients to measure the success rate. An ANOVA test can help the company decide whether any drug performs differently from the others, so that it can carry on producing the best one.

Biologists want to know how different levels of sunlight exposure (no sunlight, low sunlight, medium sunlight, high sunlight) and watering frequency (daily, weekly) impact the growth of a particular plant. In this case, two factors are involved (level of sunlight exposure and watering frequency), so they will conduct a two-way ANOVA to see if either factor significantly impacts plant growth and whether or not the two factors are related to each other [4].

A manufacturing company has three branches and wants to shut down the less profitable ones. An analyst can run an ANOVA test on the monthly or yearly profit and help the company make the right decision.

Beyond the above use cases, there are many other fields where ANOVA can be used.

Let’s learn how we can implement it in real life.

When ANOVA Test is Needed

We can use the z-test and t-test to test whether or not two samples belong to the same population. But these two tests can’t answer the question for three or more samples. ANOVA resolves this issue, and the distribution associated with ANOVA is called the ‘F-distribution’.


F-Distribution

The F-distribution is used to determine whether two samples from the same population have different variances. In ANOVA, we consider two types of variances:

Variance between the groups

Variance within the groups

(You will get clear concepts of the two variances in the upcoming sections)

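Basically, the F-distribution is the distribution of all possible F-values [5]. The value of F can be calculated as follows.

F = (variance between the groups) / (variance within the groups)

According to the equation of variance (S²),

S² = Σ(x − x̄)² / (n − 1)

So, the overall formula turns out to be

F = (SSG / dfgroups) / (SSE / dferror)

where SSG and SSE are the sums of squares between and within the groups, and dfgroups and dferror are their degrees of freedom.

Let’s see what the F-distribution looks like.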

Sample F-Distribution (Image by Author)

The F-distribution is a right-skewed distribution whose F-values start from 0 [6]. The minimum value of F is zero, and there is no maximum value [7]. The red shaded area of the above figure is the rejection region (its area is alpha, the significance level), and 3.3 is the critical value of F.
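The critical value and the rejection region can also be looked up numerically. The sketch below uses SciPy's `scipy.stats.f`; the degrees of freedom (2 and 27) are an assumption chosen for illustration and are not necessarily the ones behind the figure above.

```python
from scipy import stats

alpha = 0.05        # significance level (area of the rejection region)
dfn, dfd = 2, 27    # illustrative degrees of freedom (an assumption)

# Critical value: the point that leaves an area of alpha in the right tail
f_critical = stats.f.ppf(1 - alpha, dfn=dfn, dfd=dfd)

# Area to the right of the critical value (equals alpha by construction)
right_tail_area = stats.f.sf(f_critical, dfn, dfd)

# No probability lies below zero, the minimum value of F
area_below_zero = stats.f.cdf(0, dfn, dfd)

print("Critical value of F: {:.3f}".format(f_critical))
print("Right-tail area: {:.3f}".format(right_tail_area))
```

Any calculated F-value to the right of the critical value falls inside the rejection region.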

So far, we have covered all the basics for implementing ANOVA. Let’s move on.

Some Predefined Assumptions for ANOVA

The following assumptions should be fulfilled for the ANOVA test [8] —

Experimental errors in the samples are normally distributed.

Variances of the experimental groups are equal (homogeneity of variance).

Observation samples are independent of each other.

All the dependent variables should be continuous.

How ANOVA Works

By default, ANOVA presumes that all the groups’ means are equal. Then we need to test this presumption. So, first we state the null hypothesis and the alternative hypothesis.

Null Hypothesis: All the groups’ means are equal.

Alternative Hypothesis: One or more groups’ means differ from the others.

Next, we calculate the F-value for our data and compare it with the critical value of F for a specific significance level. If the calculated F-value is less than F-critical, it falls outside the rejection region, and we fail to reject the null hypothesis. Otherwise, we reject the null hypothesis in favor of the alternative hypothesis.
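The decision rule can be written as a small helper. This is only a sketch: the function name `anova_decision` and the two F-values passed to it are hypothetical, chosen to show one outcome on each side of the critical value.

```python
from scipy import stats

def anova_decision(f_value, df_groups, df_error, alpha=0.05):
    """Compare a calculated F-value with the critical F-value."""
    f_critical = stats.f.ppf(1 - alpha, dfn=df_groups, dfd=df_error)
    if f_value < f_critical:
        return "fail to reject the null hypothesis"
    return "reject the null hypothesis"

# Hypothetical F-values with 2 and 27 degrees of freedom
print(anova_decision(1.2, 2, 27))
print(anova_decision(5.0, 2, 27))
```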

Types of ANOVA

When we start to perform an ANOVA test on different samples of a population, we need to find which kind of test suits our problem. Depending on the nature of the problem, there are mainly two types of ANOVA.

i. One-way ANOVA

ii. Two-way ANOVA

In the following sections, we will explain both of the ANOVA tests with examples.

One-Way ANOVA

We use one-way ANOVA when we need to test the variability of one independent variable. Let’s look at the following example.

A company provides one per cent, two per cent and three per cent discount on different products. The main target of giving discounts is to receive payment faster from the customers. Now, we want to check whether the company’s initiative is fruitful or not using the ANOVA test.

Image by Author

Here, 1% Discount, 2% Discount and 3% Discount are the three groups (levels) of a single independent variable, the discount.

Image by Author

μ represents the mean of each group. μTOT indicates the combined mean of all the groups. Each group is an individual sample.

First, we state the hypotheses.

Null Hypothesis: μ1% = μ2% = μ3%

Alternative Hypothesis: One or more group means are different.

Sum of Squares Groups (SSG):

(μ1% − μTOT)² = (44 − 49)² = 25

(μ2% − μTOT)² = (50 − 49)² = 1

(μ3% − μTOT)² = (53 − 49)² = 16

Total Sum = (25 + 1 + 16) = 42. Now, multiply the total sum by the number of items in each group to get the SSG.

SSG = (42 x 10) = 420.

Degrees of freedom of Groups:

For our data, the number of groups = 3. So, the degrees of freedom are

dfgroups = (3 − 1) = 2.
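The same arithmetic can be reproduced with NumPy. This is a minimal sketch that starts from the three group means quoted above; the variable names are only illustrative.

```python
import numpy as np

group_means = np.array([44, 50, 53])   # means of the 1%, 2% and 3% groups
n_per_group = 10                       # number of items in each group

# With equal group sizes, the overall mean is the mean of the group means
overall_mean = group_means.mean()

ssg = n_per_group * np.sum((group_means - overall_mean) ** 2)
df_groups = len(group_means) - 1

print("SSG = {:.0f}, dfgroups = {}".format(ssg, df_groups))
# SSG = 420, dfgroups = 2
```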

Sum of Squares Errors (SSE):

It is the total of the squared differences between each value of a group and the mean value of that group. Let’s demonstrate it with the above data.

Image by Author

Degrees of Freedom of Error:

The degrees of freedom of error can be calculated with the following formula:

dferror = (number of items in each group − 1) x number of groups

According to the formula, for our case, the degrees of freedom of error are

dferror = (10 − 1) x 3 = 27

F-value for our example data
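The F-value is the ratio of two mean squares: SSG divided by dfgroups over SSE divided by dferror. The sketch below computes SSE and the F-value with NumPy. The data values are the ones used in the Python section later in this article, SSG and dfgroups are taken from the steps above, and the variable names are only illustrative.

```python
import numpy as np

# One row per observation, one column per discount group (1%, 2%, 3%)
data = np.array([[37, 62, 50], [60, 27, 63], [52, 69, 58], [43, 64, 54],
                 [40, 43, 49], [52, 54, 52], [55, 44, 53], [39, 31, 43],
                 [39, 49, 65], [23, 57, 43]])
n_per_group, n_groups = data.shape

ssg, df_groups = 420, 2                  # from the steps above

group_means = data.mean(axis=0)          # 44, 50, 53
sse = np.sum((data - group_means) ** 2)  # squared deviations within each group
df_error = (n_per_group - 1) * n_groups

f_value = (ssg / df_groups) / (sse / df_error)
print("SSE = {:.0f}, dferror = {}, F = {:.3f}".format(sse, df_error, f_value))
# SSE = 3300, dferror = 27, F = 1.718
```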

Finding critical value for 95% confidence level:

For the 95% confidence level, the significance level (α) is 0.05. We can get the critical value of F from the standard F-distribution table for a 0.05 significance level, 2 degrees of freedom of groups and 27 degrees of freedom of error.

Image by Author

From the above table, the critical F-value is 3.35.

Final decision:

Image by Author

Here, the calculated value of F is 1.718, the critical value of F is 3.35 and F-value (1.718) < F-critical (3.35). So, the calculated value falls in the null hypothesis region, and we fail to reject the null hypothesis.

Now, we can say that there is no evidence that the company receives payment faster by providing discounts on different products.

So far, we have done some tiresome work calculating the F-value to test our hypothesis. Fortunately, Python already provides libraries by which we can calculate the F-value and the F-critical value with a few lines of code.

Implementation of One-way ANOVA in Python

Let’s create the above dataset in Python.

import pandas as pd

data=[[37,62,50],[60,27,63],[52,69,58],[43,64,54],[40,43,49],[52,54,52],[55,44,53],[39,31,43],[39,49,65],[23,57,43]]

table=pd.DataFrame(data,columns=['1% Discount','2% Discount','3% Discount'])

It will create the following dataframe.

Image by Author

Visualization of data for getting insight into the dataset.
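A boxplot per discount group is one way to draw this. The sketch below uses matplotlib directly; the axis label and the output file name are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = [[37, 62, 50], [60, 27, 63], [52, 69, 58], [43, 64, 54], [40, 43, 49],
        [52, 54, 52], [55, 44, 53], [39, 31, 43], [39, 49, 65], [23, 57, 43]]
table = pd.DataFrame(data, columns=['1% Discount', '2% Discount', '3% Discount'])

# One box per discount group; showmeans marks each group mean on its box
fig, ax = plt.subplots(figsize=(6, 4))
boxes = ax.boxplot([table[col] for col in table.columns], showmeans=True)
ax.set_xticklabels(table.columns)
ax.set_ylabel('Value')

# Save the figure to a file (hypothetical name)
fig.savefig('discount_boxplot.png')
```

In an interactive session, `plt.show()` can be used instead of saving the figure.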

It will produce the following output.

Image by Author

The boxplot shows that there are slight differences among the group means. Let’s find out the final decision with a few lines of code.
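One way to get both numbers is SciPy's `f_oneway` for the F-value and `stats.f.ppf` for the critical value. This is a sketch under that assumption and may differ from the code originally used to produce the result below.

```python
import pandas as pd
from scipy import stats

data = [[37, 62, 50], [60, 27, 63], [52, 69, 58], [43, 64, 54], [40, 43, 49],
        [52, 54, 52], [55, 44, 53], [39, 31, 43], [39, 49, 65], [23, 57, 43]]
table = pd.DataFrame(data, columns=['1% Discount', '2% Discount', '3% Discount'])

# One-way ANOVA across the three discount groups
f_value, p_value = stats.f_oneway(table['1% Discount'],
                                  table['2% Discount'],
                                  table['3% Discount'])

# Degrees of freedom: groups - 1, and groups x (items per group - 1)
df_groups = table.shape[1] - 1
df_error = table.shape[1] * (table.shape[0] - 1)
f_critical = stats.f.ppf(1 - 0.05, dfn=df_groups, dfd=df_error)

print("Calculated F-value is {:.3f}".format(f_value))
print("Critical F-value is {:.3f}".format(f_critical))
if f_value < f_critical:
    print("Fail to reject the null hypothesis")
else:
    print("Reject the null hypothesis")
```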

Result

The initiative of the company is not effective, because the F-value of 1.718 is less than the critical value of 3.354.

Two-Way ANOVA

Two-way ANOVA is slightly different from one-way ANOVA. In one-way ANOVA, we compared the groups of one independent variable. Two-way ANOVA lets us compare across 2 independent variables. There are two types of two-way ANOVA tests.

Two-way ANOVA without repetition

Two-way ANOVA with repetition

Two-way ANOVA without repetition

Let’s start with an example. In the previous invoice problem, we saw how to compare and make a decision for only one independent variable, Discount. If we also want to account for the size of the invoice that receives the discount, we need to add one more independent variable, Invoice Amount. We want to answer the same question as in our one-way ANOVA problem with the new data. Our target is to show whether the initiative of providing discounts is effective or not.

Image by Author

In one-way ANOVA, we only care about the variability among the individual sample groups.

Our hypotheses are the same as in one-way ANOVA.

As we also want to capture the variability across the individual invoices, we need to consider each block (row) as well as each sample group. So, we calculate the mean of each row and of each column. The calculation is shown below.

Image by Author

The Sum of Squares Groups (SSG) is the sum of the squared differences between each sample group’s mean and the overall mean (μTOT).

(μ1% − μTOT)² = (12 − 15)² = 9

(μ2% − μTOT)² = (17 − 15)² = 4

(μ3% − μTOT)² = (16 − 15)² = 1

We get the Sum of Squares Groups by adding these values and multiplying by the number of items in each group.

So, SSG = (9 + 4 + 1) x 5 = 70.

Degrees of Freedom of Groups

According to the formula, Degrees of Freedom of Groups = (3–1)=2.

So, dfgroups=2.

Sum of Squares Blocks (SSB)

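The Sum of Squares Blocks is the sum of the squared differences between each block’s mean and the total mean, multiplied by the number of items in each block.

(μ50 − μTOT)² = (20 − 15)² = 25

(μ100 − μTOT)² = (17 − 15)² = 4

(μ150 − μTOT)² = (15 − 15)² = 0

(μ200 − μTOT)² = (13 − 15)² = 4

(μ250 − μTOT)² = (10 − 15)² = 25

So, SSB = (25 + 4 + 0 + 4 + 25) x 3 = 174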
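Both sums of squares follow directly from the row and column means. A minimal NumPy sketch is given here, using the same values as the dataset created in the Python section below; the variable names are only illustrative.

```python
import numpy as np

# Rows are invoice blocks ($50 to $250); columns are the 1%, 2%, 3% discount groups
data = np.array([[16, 23, 21],
                 [14, 21, 16],
                 [11, 16, 18],
                 [10, 15, 14],
                 [9, 10, 11]])
n_blocks, n_groups = data.shape

overall_mean = data.mean()         # 15
group_means = data.mean(axis=0)    # column means: 12, 17, 16
block_means = data.mean(axis=1)    # row means: 20, 17, 15, 13, 10

# Each group mean is based on n_blocks values, each block mean on n_groups values
ssg = n_blocks * np.sum((group_means - overall_mean) ** 2)
ssb = n_groups * np.sum((block_means - overall_mean) ** 2)

print("SSG = {:.0f}, SSB = {:.0f}".format(ssg, ssb))
# SSG = 70, SSB = 174
```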

Sum of Squares Total (SST)

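The Sum of Squares Total is the sum of the squared differences between every individual value and the overall mean (μTOT). It can be calculated as follows: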

Image by Author

Sum of Squares Error (SSE)

SSE = SST − SSG − SSB = 268 − 70 − 174 = 24

Degrees of Freedom of Error

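The degrees of freedom of error are the product of (number of blocks − 1) and (number of groups − 1). So, for our data, dferror = (5 − 1) x (3 − 1) = 8.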

Calculation of F value

The calculated F-value for our data is 11.67. Now, we need to find out the critical value of F for taking our decision.
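The steps from SST to the F-value can be retraced in a few lines. This sketch takes SSG = 70 and SSB = 174 from the sections above and computes the rest with NumPy; the variable names are only illustrative.

```python
import numpy as np

data = np.array([[16, 23, 21],
                 [14, 21, 16],
                 [11, 16, 18],
                 [10, 15, 14],
                 [9, 10, 11]])
n_blocks, n_groups = data.shape

ssg, ssb = 70, 174                         # from the sections above

sst = np.sum((data - data.mean()) ** 2)    # squared deviations from the overall mean
sse = sst - ssg - ssb

df_groups = n_groups - 1
df_error = (n_blocks - 1) * (n_groups - 1)

# F-value for the discount groups: mean square of groups over mean square of error
f_value = (ssg / df_groups) / (sse / df_error)
print("SST = {:.0f}, SSE = {:.0f}, F = {:.2f}".format(sst, sse, f_value))
# SST = 268, SSE = 24, F = 11.67
```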

Decision for our problem

To find the critical value, we need to look up the F-table. In our case, we assume a standard confidence level of 95%. So, the significance level α = 0.05. Degrees of Freedom of Groups or Numerator (dfgroups) is 2 and Degrees of Freedom of Error or Denominator (dferror) is 8. If we look up the value of F in the F-table, we will find that the value is 4.46.

Image by Author

Here, the calculated F-value is 11.67 and F-critical is 4.46.

F-critical < F-value

Image by Author

Our calculated F-value falls in the rejection region. So, we reject the null hypothesis.

That’s why we can say that the decision taken by the company is fruitful, as there is variability among the different discounted sample groups.

Implementation with Python

First of all, we will create the dataset in Python.

import pandas as pd

data=[['$50',16,23,21],['$100',14,21,16],['$150',11,16,18],['$200',10,15,14],['$250',9,10,11]]

table=pd.DataFrame(data,columns=['Invoice Amount','1% Discount','2% Discount','3% Discount'])

The above code will generate the following output.

Image by Author

Let’s see how the discounted price varies for the different invoice amounts. We will draw some boxplots to show it.

Visual Output

The above figure shows the row-wise value distribution along with the mean value.

Now, we will show the column-wise value distribution with a boxplot.

Output

Formatting the dataset for fitting the model.
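The model formula used below expects a long table named `final` with the columns Invoice, Discount and Value. One way to build it is sketched here, assuming the same melt-and-rename pattern this article uses later for the replication example.

```python
import pandas as pd

data = [['$50', 16, 23, 21], ['$100', 14, 21, 16], ['$150', 11, 16, 18],
        ['$200', 10, 15, 14], ['$250', 9, 10, 11]]
table = pd.DataFrame(data, columns=['Invoice Amount', '1% Discount',
                                    '2% Discount', '3% Discount'])

# Wide to long: one row per (invoice amount, discount level) pair
final = pd.melt(table, id_vars=['Invoice Amount'],
                value_vars=['1% Discount', '2% Discount', '3% Discount'])
final.columns = ['Invoice', 'Discount', 'Value']

print(final.head())
```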

Output

Fitting the value for calculating F-value.

#fitting the model

import statsmodels.api as sm

from statsmodels.formula.api import ols

model = ols('Value ~ C(Invoice) + C(Discount)', data=final).fit()

f_calculated=sm.stats.anova_lm(model, typ=2)

Output

sum_sq of C(Invoice), C(Discount) and Residual represent the values of SSB, SSG and SSE respectively. df of C(Discount) represents dfgroups, and df of Residual represents dferror.

Extracting F-value and F-critical value

#finding out the critical value of F

from scipy import stats

f_critical= stats.f.ppf(1-0.05,dfn=2,dfd=8)

print("Critical f-value is {:.3f}.".format(f_critical))

print("Calculated f-value is {:.3f}".format(f_calculated.loc['C(Discount)','F']))

Final Result

Two-way ANOVA with Replication

Image by Author

If we closely observe the above figure, we can see that without replication, there is no repetition within the blocks. On the other hand, with replication, each block holds multiple sample values. Let’s see a real-world example.

Suppose you are the owner of a company. The company has two manufacturing plants (Plant A and Plant B) and produces three products: A, B and C. Now, you want to know whether there is a significant difference in production between the two plants. The data is given below.

Image by Author

Calculating the F-value by hand here requires a tiresome calculation, so we won’t show it. Instead, we will show how to find the F-value for two-way ANOVA with replication in Python.

Python Implementation

Firstly, we will create the dataset.

import pandas as pd

data=[['Plant A',13,21,18],['Plant A',14,19,15],['Plant A',12,17,15],['Plant B',16,14,15],['Plant B',18,11,13],['Plant B',17,14,8]]

table=pd.DataFrame(data,columns=['Plant','A','B','C'])

Now, we need to reshape the data to fit the model.

import statsmodels.api as sm

from statsmodels.formula.api import ols

reformation = pd.melt(table,id_vars=['Plant'], value_vars=['A', 'B', 'C'])

reformation.columns=['Plant','Product','Value']

model = ols('Value ~ Product + Plant:Product', data=reformation).fit()

f_calculated=sm.stats.anova_lm(model, typ=2)

Look at the line

model=ols('Value ~ Product + Plant:Product', data=reformation).fit()

For fitting the model, the variable on the left of ~ should be a continuous numerical variable, and we use the Repeated_variable:variable term (here Plant:Product) to calculate the F-value with repetition. The resulting F-value of Product is 1.44.
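The Product F-value of 1.44 can be cross-checked by hand with pandas. The sketch below assumes a balanced design (three values in every Plant/Product cell) and is not how statsmodels computes its table internally; the variable names are only illustrative.

```python
import pandas as pd

data = [['Plant A', 13, 21, 18], ['Plant A', 14, 19, 15], ['Plant A', 12, 17, 15],
        ['Plant B', 16, 14, 15], ['Plant B', 18, 11, 13], ['Plant B', 17, 14, 8]]
table = pd.DataFrame(data, columns=['Plant', 'A', 'B', 'C'])
long_table = pd.melt(table, id_vars=['Plant'], value_vars=['A', 'B', 'C'])
long_table.columns = ['Plant', 'Product', 'Value']

overall_mean = long_table['Value'].mean()

# Sum of squares between the products
product_stats = long_table.groupby('Product')['Value'].agg(['mean', 'count'])
ss_product = (product_stats['count']
              * (product_stats['mean'] - overall_mean) ** 2).sum()
df_product = len(product_stats) - 1

# Error sum of squares: deviations from each Plant/Product cell mean
cells = long_table.groupby(['Plant', 'Product'])['Value']
ss_error = ((long_table['Value'] - cells.transform('mean')) ** 2).sum()
df_error = len(long_table) - cells.ngroups

f_product = (ss_product / df_product) / (ss_error / df_error)
print("F-value of Product is {:.2f}".format(f_product))
# F-value of Product is 1.44
```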

Finding the F-critical

Here, we assume the standard confidence level of 95%. From the ANOVA table above, the degrees of freedom of groups are 2, and the degrees of freedom of error are 12. So, the critical value of F is:

from scipy import stats

f_critical= stats.f.ppf(1-0.05,dfn=2,dfd=12)

print('Critical value of F is {:.3f}'.format(f_critical))

Result

Critical value of F is 3.885

As the F-value (1.44) < F-critical (3.885), we fail to reject the null hypothesis, so we can say that there is no significant difference in production among the groups being compared.


Conclusion

That’s all about the ANOVA test. Though the ANOVA test is a little confusing and difficult, it plays a significant role in data science and data analysis.

[N.B.: Finally, I would like to thank Renesh Bedre for his simple explanation of the ANOVA test with Python. Special thanks go to instructor Jose Portilla, whose explanation helped me a lot to understand the test deeply.]

References

[1]. https://en.wikipedia.org/wiki/Analysis_of_variance

[2]. Fisher, R.A. (1918) The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52, 399–433.

[3]. Fisher, R.A. (1921) On the “Probable Error” of a coefficient of correlation deduced from a small sample. Metron, 1, 3–32.
