Common Statistical Tests in R – Part I
pq_data %>%
  filter(LevelDesignation %in% c("Manager", "Senior IC")) %>%
  mutate(
    ManagerIndicator = factor(LevelDesignation, levels = c("Manager", "Senior IC"))
  )

Recall also that our dataset pq_data is a cross-sectional time-series dataset, which means that for every individual identified by PersonId, there will be multiple rows, each representing a snapshot of a different week. In other words, a unique identifier would be something like a PersonWeekId. To simplify the dataset so that we are looking at person averages, we can group the dataset by PersonId and calculate the mean of Multitasking_hours for each person. After this manipulation, Multitasking_hours represents the mean multitasking hours per person, as opposed to per person per week. Let us do this by building on the pipe-chain:

pq_data_grouped <- pq_data %>%
  filter(LevelDesignation %in% c("Manager", "Senior IC")) %>%
  mutate(
    ManagerIndicator = factor(LevelDesignation, levels = c("Manager", "Senior IC"))
  ) %>%
  group_by(PersonId, ManagerIndicator) %>%
  summarise(Multitasking_hours = mean(Multitasking_hours), .groups = "drop")

glimpse(pq_data_grouped)
## Rows: 56
## Columns: 3
## $ PersonId           <chr> "00f6d464-ba1f-31ee-b51e-ab6e8ec4fb79", "023ddb61-1~
## $ ManagerIndicator   <fct> Senior IC, Manager, Senior IC, Senior IC, Manager, ~
## $ Multitasking_hours <dbl> 0.2813373, 0.5980080, 0.3319752, 0.2938879, 0.70762~

Now our data is in the right format. Let us presume that the data satisfies all the assumptions of the t-test, and see what happens when we run it with the base t.test() function:

t.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped,
  paired = FALSE
)
##
##  Welch Two Sample t-test
##
## data:  Multitasking_hours by ManagerIndicator
## t = 10.097, df = 28.758, p-value = 5.806e-11
## alternative hypothesis: true difference in means between group Manager and group Senior IC is not equal to 0
## 95 percent confidence interval:
##  0.3444870 0.5195712
## sample estimates:
##   mean in group Manager mean in group Senior IC
##               0.8103354               0.3783063

In the function, the outcome and predictor variables are supplied using the tilde (~) formula syntax common in R, and we have specified paired = FALSE to run an unpaired t-test. As for the output: t represents the t-statistic; df represents the degrees of freedom; and p-value is - well - the p-value. The result here is significant, as the p-value is smaller than the significance level of 0.05, so the test allows us to reject the null hypothesis that the mean multitasking hours of Managers and ICs are the same.

Note that the t-test used here is Welch's t-test, which is an adaptation of the classic Student's t-test. Welch's t-test does not assume that the two groups have equal variances (i.e. it handles heteroscedasticity), whereas the classic Student's t-test assumes the variances of the two groups to be equal (fancy term = homoscedastic).
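As an aside, base R's t.test() runs Welch's test by default (var.equal = FALSE). If the equal-variance assumption did hold, the classic Student's t-test could be requested explicitly. A minimal sketch, reusing the same data:

# Classic Student's t-test: pools the two group variances,
# so it assumes homoscedasticity. Only var.equal changes.
t.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped,
  paired = FALSE,
  var.equal = TRUE
)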
1.1 Testing for normality

But hang on! There are several assumptions behind the classic t-test we haven't examined properly, namely:

1. independence - the samples are independent
2. normality - the data for each group are normally distributed
3. homoscedasticity - the data across samples have equal variance

We can at least be sure of (1), as we know that senior ICs and Managers are separate populations. However, (2) and (3) are assumptions that we have to validate and address specifically.

To test whether our data is normally distributed, we can use the Shapiro-Wilk test of normality, with the function shapiro.test():

pq_data_grouped %>%
  group_by(ManagerIndicator) %>%
  summarise(
    p = shapiro.test(Multitasking_hours)$p.value,
    statistic = shapiro.test(Multitasking_hours)$statistic
  )
## # A tibble: 2 x 3
##   ManagerIndicator      p statistic
##   <fct>             <dbl>     <dbl>
## 1 Manager          0.146      0.936
## 2 Senior IC        0.0722     0.941

As both p-values are above 0.05, the test does not allow us to reject the null hypothesis that the data are normally distributed - although the p-value for the Senior IC group (0.0722) is borderline. To confirm, you can also perform a visual check for normality using a histogram or a Q-Q plot.

# Multitasking hours - IC
mth_ic <- pq_data_grouped %>%
  filter(ManagerIndicator == "Senior IC") %>%
  pull(Multitasking_hours)

qqnorm(mth_ic, pch = 1, frame = FALSE)
qqline(mth_ic, col = "steelblue", lwd = 2)

# Multitasking hours - Manager
mth_man <- pq_data_grouped %>%
  filter(ManagerIndicator == "Manager") %>%
  pull(Multitasking_hours)

qqnorm(mth_man, pch = 1, frame = FALSE)
qqline(mth_man, col = "steelblue", lwd = 2)

In the Q-Q plots, the points broadly adhere to the reference line, which is consistent with the Shapiro-Wilk results. Below is a good thing to bear in mind:[3]

"Statistical tests have the advantage of making an objective judgment of normality but have the disadvantage of sometimes not being sensitive enough at low sample sizes or overly sensitive to large sample sizes. Graphical interpretation has the advantage of allowing good judgment to assess normality in situations when numerical tests might be over- or under-sensitive."

In other words, our modest sample sizes may well have played a role in the Shapiro-Wilk results, and a borderline p-value is not strong evidence either way.[4] As our data is not conclusively normal, the unpaired t-test is in turn less conclusive. When we cannot safely assume normality, we can consider alternatives such as the non-parametric two-samples Wilcoxon Rank-Sum test, which is covered further below.

1.2 Testing for equality of variance (homoscedasticity)

Aside from normality, another assumption of the t-test that we had not properly tested for prior to running t.test() is the equality of variance across the two groups (homoscedasticity). Thankfully, this was not something we had to worry about, as we used Welch's t-test: recall that the classic Student's t-test assumes the two variances to be equal, whereas Welch's t-test already takes the difference in variance into account. If required, however, here is an example of how you can test for homoscedasticity in R, using var.test():

# F test to compare two variances
var.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped
)
##
##  F test to compare two variances
##
## data:  Multitasking_hours by ManagerIndicator
## F = 4.5726, num df = 22, denom df = 32, p-value = 0.0001085
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   2.146082 10.318237
## sample estimates:
## ratio of variances
##           4.572575

The var.test() function run above is an F-test (i.e. it uses the F-distribution) that compares whether the variances of two samples are the same. Under the null hypothesis the two variances are equal (homoscedasticity), and since the F-statistic is a ratio of the two variances, it would tend towards 1. The arguments are provided in a similar format to t.test().

It appears that homoscedasticity does not hold: since the p-value is less than 0.05, we reject the null hypothesis that the variances of the Manager and IC groups are equal. The Student's t-test would not have been appropriate here, and we were correct to have used Welch's t-test.
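Note that the F-test itself is sensitive to departures from normality, which we have just flagged as a concern. A more robust alternative is Levene's test; here is a minimal sketch, assuming the car package is installed (leveneTest() comes from car, not base R):

library(car)

# Levene's test: the null hypothesis is that the group variances are equal.
# By default it centres on the median, which makes it more robust to
# non-normality than the variance-ratio F-test above.
leveneTest(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped
)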
Homoscedasticity can also be examined visually, using a boxplot or a dotplot (the latter via graphics::dotchart(), which is suitable for small datasets). The code to do so is as follows. For this example, visual examination is a bit more challenging, as the Senior IC and Manager groups have starkly different levels of multitasking hours.

dotchart(
  x = pq_data_grouped$Multitasking_hours,
  groups = pq_data_grouped$ManagerIndicator
)

boxplot(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped
)

2. Non-parametric tests

2.1 Wilcoxon Rank-Sum Test

Previously, we could not safely rely on the unpaired two-sample t-test because we were not fully confident that the data satisfies the normality condition. As an alternative, we can use the Wilcoxon Rank-Sum test (also known as the Mann-Whitney U test). The Wilcoxon test is described as a non-parametric test, which in statistics typically means that it does not assume a particular distribution or distribution parameters; in this case, the Wilcoxon test does not assume a normal distribution. Another difference from the unpaired t-test is that the Wilcoxon Rank-Sum test compares the locations of the two populations based on ranks (often interpreted as a comparison of medians), whereas the parametric t-test compares means between two independent groups. The test is run using wilcox.test():

wilcox.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped,
  paired = FALSE
)
##
##  Wilcoxon rank sum exact test
##
## data:  Multitasking_hours by ManagerIndicator
## W = 752, p-value = 2.842e-14
## alternative hypothesis: true location shift is not equal to 0

The p-value of the test is less than the significance level (alpha = 0.05), which allows us to conclude that the Managers' median multitasking hours are significantly different from the ICs'.

Note that the Wilcoxon Rank-Sum test is different from the similarly named Wilcoxon Signed-Rank test, which is the equivalent alternative to the paired t-test. To perform the Wilcoxon Signed-Rank test instead, you can simply specify paired = TRUE. As with the decision between the paired and the unpaired t-test, you should ensure that the paired-samples condition applies before using the Wilcoxon Signed-Rank test.
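To demystify the W statistic reported above: it is the sum of the joint ranks of the first group (here Manager, the first factor level) minus the smallest rank sum that group could possibly have. A minimal sketch, reusing the mth_man and mth_ic vectors created earlier; assuming no ties, this should reproduce the W = 752 from wilcox.test():

# Rank all observations from both groups together
r <- rank(c(mth_man, mth_ic))
n1 <- length(mth_man)

# W = rank sum of group 1, shifted down by its minimum n1 * (n1 + 1) / 2
W <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2
W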
2.2 Kruskal-Wallis test

So far, we have only been looking at tests which compare exactly two populations. If we are looking for a test that works with comparisons across three or more populations, we can consider the Kruskal-Wallis test.

Let us create a new data frame that is grouped at the PersonId level, but this time filtering on a wider set of values in LevelDesignation:

pq_data_grouped_2 <- pq_data %>%
  filter(LevelDesignation %in% c(
    "Support",
    "Senior IC",
    "Junior IC",
    "Manager",
    "Director"
  )) %>%
  mutate(ManagerIndicator = factor(LevelDesignation)) %>%
  group_by(PersonId, ManagerIndicator) %>%
  summarise(Multitasking_hours = mean(Multitasking_hours), .groups = "drop")

glimpse(pq_data_grouped_2)
## Rows: 198
## Columns: 3
## $ PersonId           <chr> "0049ef24-ec83-356d-89f7-46b67364e677", "00f6d464-b~
## $ ManagerIndicator   <fct> Support, Senior IC, Manager, Support, Support, Supp~
## $ Multitasking_hours <dbl> 0.3812649, 0.2813373, 0.5980080, 0.2918829, 0.42288~

We can then run the Kruskal-Wallis test:

kruskal.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped_2
)
##
##  Kruskal-Wallis rank sum test
##
## data:  Multitasking_hours by ManagerIndicator
## Kruskal-Wallis chi-squared = 91.061, df = 4, p-value < 2.2e-16

Based on the Kruskal-Wallis test, we reject the null hypothesis and conclude that at least one level of LevelDesignation differs in its weekly hours spent multitasking. The most obvious downside to this method is that it does not tell us which groups differ from which, so it may need to be followed up with multiple pairwise-comparison tests (also known as post-hoc tests).

3. Comparison tests: ANOVA

3.1 ANOVA

What if we want to run the t-test across more than two groups? Analysis of Variance (ANOVA) is an alternative method that generalises the t-test beyond two groups, so it is used to compare three or more groups. There are several versions of ANOVA. The simplest is the one-way ANOVA, but there is also the two-way ANOVA, which is used to estimate how the mean of a quantitative variable changes according to the levels of two categorical variables (e.g. rain/no-rain and weekend/weekday with respect to ice cream sales). In this example, we will focus on the one-way ANOVA.

There are three assumptions in ANOVA, and these may look familiar:

1. The data are independent.
2. The responses for each factor level have a normal population distribution.
3. These distributions have the same variance.

These assumptions are the same as those required for the classic t-test above, and it is recommended that you check for variance and normality prior to ANOVA. ANOVA calculates the ratio of the between-group variance to the within-group variance (both quantified using sums of squares), and then compares this with a threshold from the Fisher (F) distribution, typically based on a significance level. The key function is aov():

res_aov <- aov(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped_2
)

summary(res_aov)
##                   Df Sum Sq Mean Sq F value Pr(>F)
## ManagerIndicator   4  40.55   10.14   504.6 <2e-16 ***
## Residuals        193   3.88    0.02

In the summary, Df is the degrees of freedom, Sum Sq and Mean Sq are the sums and means of squares, F value is the test statistic, and Pr(>F) is the p-value of the F-statistic. It shows how likely it is that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true. Given that the p-value is smaller than 0.05, we reject the null hypothesis that all means are equal, and we can conclude that at least one level of LevelDesignation differs in its weekly hours spent multitasking.
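Since ANOVA shares the normality and equal-variance assumptions discussed earlier, a sensible follow-up (not shown in the original post) is to check them on the model residuals rather than on the raw groups. A minimal sketch using base R on the fitted res_aov object:

# Shapiro-Wilk test on the residuals of the fitted ANOVA model
shapiro.test(residuals(res_aov))

# Q-Q plot of the standardised residuals (diagnostic plot no. 2 for lm/aov fits)
plot(res_aov, which = 2)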
Antoine Soetewey's blog recommends the use of the report package, which can help you make sense of the results more easily:

library(report)
report(res_aov)
## The ANOVA (formula: Multitasking_hours ~ ManagerIndicator) suggests that:
##
##   - The main effect of ManagerIndicator is statistically significant and large
##     (F(4, 193) = 504.61, p < .001; Eta2 = 0.91, 95% CI [0.90, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.

The same drawback that applies to the Kruskal-Wallis test also applies to ANOVA: it does not actually tell you which group differs from which; it only tells you that at least one group mean differs from the others. This is why the ANOVA is sometimes also referred to as an 'omnibus' test.

3.2 Next steps after ANOVA

A pairwise t-test (note: pairwise, not paired!) is likely required to provide more information, and it is recommended that you review the p-value adjustment methods when doing so.[6] Type I errors become more likely when running t-tests pairwise across many groups, and therefore a correction is necessary. Here is an example of how you might run a pairwise t-test:

pairwise.t.test(
  x = pq_data_grouped_2$Multitasking_hours,
  g = pq_data_grouped_2$ManagerIndicator,
  paired = FALSE,
  p.adjust.method = "bonferroni"
)
##
##  Pairwise comparisons using t tests with pooled SD
##
## data:  pq_data_grouped_2$Multitasking_hours and pq_data_grouped_2$ManagerIndicator
##
##           Director Junior IC Manager Senior IC
## Junior IC
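As an alternative to the pairwise t-test (and not covered in the source post), Tukey's Honest Significant Differences is another standard post-hoc procedure after aov(), and it adjusts for multiple comparisons by design:

# All pairwise differences in group means, with family-wise
# confidence intervals and adjusted p-values
TukeyHSD(res_aov, conf.level = 0.95)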
