I am not an engineer

Analysis of Variance (ANOVA)

If you see an error in the article, please comment or drop me an email.

Three Conditions for using ANOVA

  1. Homogeneity of variances in each group
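# Standard deviations of the six groups
sd_1 <- 64.43
sd_2 <- 38.63
sd_3 <- 52.24
sd_4 <- 64.90
sd_5 <- 54.13
sd_6 <- 48.84
sds <- c(sd_1, sd_2, sd_3, sd_4, sd_5, sd_6)
# Rule of thumb: the smallest SD should be at least half the largest one.
# min/max can never exceed 1, so only the lower bound needs checking.
sds_ratio <- round(min(sds) / max(sds), 2)
print(sds_ratio)
## [1] 0.6
ifelse(sds_ratio >= .5, "variances are roughly equal", "variances are unequal")
## [1] "variances are roughly equal"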
  2. Nearly normal distribution in each group

  3. Independence of observations

See more at https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-2.php
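The normality condition can also be checked numerically. The sketch below is only an illustration: the data are simulated (the names `group` and `y` are hypothetical, not from the article), and it uses the Shapiro-Wilk test per group plus Bartlett's test as a formal complement to the SD-ratio rule of thumb. With large samples these tests flag even trivial deviations, so a Q-Q plot per group is often the more useful check.

```r
set.seed(1)

# Hypothetical data: three groups of 40 observations each (simulated, not from the article)
group <- factor(rep(c("a", "b", "c"), each = 40))
y <- rnorm(120, mean = c(5, 6, 7)[as.integer(group)], sd = 1.5)

# Shapiro-Wilk test within each group; a small p-value suggests non-normality
shapiro_p <- tapply(y, group, function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 3))

# Bartlett's test of equal variances across groups
bartlett_p <- bartlett.test(y ~ group)$p.value
print(round(bartlett_p, 3))
```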

Parameters used in ANOVA

MSG (mean square between groups): describes variability between groups

MSE (mean square error): describes variability within groups

F = MSG/MSE = ratio of variability in the sample means relative to the variability within the groups

The F statistic has two degrees-of-freedom parameters: df_g (group) and df_e (error). The total degrees of freedom df_t is their sum.

df_t = n – 1 (total number of observations minus one)

df_g = k – 1 (number of groups minus one)

df_e = n – k (combined sample size minus number of groups)
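To see how these quantities fit together, here is a minimal sketch that computes SSG, SSE, MSG, MSE and F by hand and compares the result with R's built-in `aov()`. The data are simulated and the variable names are my own assumptions, so treat it as an illustration rather than a reference implementation.

```r
set.seed(42)

# Hypothetical data: three groups of 30 observations each (simulated)
group <- factor(rep(c("a", "b", "c"), each = 30))
y <- rnorm(90, mean = c(10, 11, 12)[as.integer(group)], sd = 2)

n <- length(y)        # total number of observations
k <- nlevels(group)   # number of groups

df_g <- k - 1
df_e <- n - k

grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)
group_sizes <- tapply(y, group, length)

# Sum of squares between groups: spread of the group means around the grand mean
SSG <- sum(group_sizes * (group_means - grand_mean)^2)
# Sum of squares error: spread of each observation around its own group mean
SSE <- sum((y - ave(y, group))^2)

MSG <- SSG / df_g
MSE <- SSE / df_e
f_manual <- MSG / MSE
p_manual <- pf(f_manual, df_g, df_e, lower.tail = FALSE)

# Same analysis with the built-in function
fit <- summary(aov(y ~ group))[[1]]
f_aov <- fit[["F value"]][1]
p_aov <- fit[["Pr(>F)"]][1]

print(c(manual = f_manual, aov = f_aov))
```

Both F values should agree up to floating-point error, which is a handy sanity check when working from summary statistics only.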

Example of ANOVA

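# HYPOTHESES
# H_0 : mu_a = mu_b = mu_c
# H_A : at least one pair of means differs (mu_i != mu_j for some i, j)
# The test uses only the upper tail: F is a ratio of mean squares and
# cannot be negative, and only large values count as evidence against H_0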

n <- 999 # total number of observations
k <- 3 # number of groups

df_t <- n - 1
df_g <- k - 1
df_e <- n - k # same as df_t - df_g

SSG <- 8888
SSE <- 7777

MSG <- SSG/df_g
MSE <- SSE/df_e

f_statistic <- MSG/MSE

p_value <- pf(f_statistic, df_g, df_e, lower.tail = FALSE)
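The example stops at the p-value. One possible way to turn it into a decision is sketched below, reusing the numbers from the example. The significance level of 0.05 is my assumption, since the example does not state one.

```r
# Values taken from the example above
n <- 999
k <- 3
SSG <- 8888
SSE <- 7777

df_g <- k - 1
df_e <- n - k
f_statistic <- (SSG / df_g) / (SSE / df_e)

alpha <- 0.05  # assumed significance level, not given in the example

# Critical value: reject H_0 if the F statistic lies above this upper-tail cutoff
f_critical <- qf(1 - alpha, df_g, df_e)

# Equivalent decision via the p-value
p_value <- pf(f_statistic, df_g, df_e, lower.tail = FALSE)
reject_null <- p_value < alpha

print(c(f_statistic = f_statistic, f_critical = f_critical))
print(reject_null)
```

Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision.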

Multiple comparisons

Why use multiple comparisons?

  1. To check which means are different

  2. To control the Type I error rate across the whole set of pairwise tests
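The second point is easier to see with numbers. The sketch below shows how quickly the chance of at least one false positive grows with the number of groups when every pair is tested at alpha = 0.05. It assumes the tests are independent, which pairwise comparisons are not, so the figures are an approximation meant only to show the trend.

```r
alpha <- 0.05
k <- 3:6                       # number of groups
nb_comp <- k * (k - 1) / 2     # number of pairwise comparisons

# Approximate probability of at least one Type I error without correction
# (assumes independent tests)
fwer <- 1 - (1 - alpha)^nb_comp

# Bonferroni-corrected significance level for each individual test
alpha_star <- alpha / nb_comp

print(data.frame(k, nb_comp, fwer = round(fwer, 3), alpha_star = round(alpha_star, 4)))
```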

# HYPOTHESIS
# H_0 : mu_lower - mu_middle = 0
# H_A : mu_lower - mu_middle != 0
# This is a two-sided test

# GIVEN DATA
n_1 <- 41 # sample size of lower class
n_2 <- 331 # sample size of middle class
n <- 792 # total sample size
df <- n - 3 # error degrees of freedom from the ANOVA: df_e = n - k, with k = 3 groups
null_value <- 0
mean_1 <- 5.07
mean_2 <- 6.76
alpha <- .05 # Significance level
sides <- 2

MSE <- 3.628

# Standard error for multiple pairwise comparisons
SE <- sqrt((MSE/n_1) + (MSE/n_2))

t_statistic <- (abs(mean_1 - mean_2) - null_value) / SE

# Calculate the number of comparisons
k <- 3 # number of groups
nb_comp <- (k*(k-1))/2 

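# Bonferroni correction of the significance level
alpha_star <- alpha/nb_comp

# Two-sided p-value, to be compared with the corrected level alpha_star
p_value <- pt(t_statistic, df = df, lower.tail = FALSE) * sides
reject_null <- p_value < alpha_star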
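When the raw data are available, base R can run all pairwise comparisons at once with `pairwise.t.test()`. The sketch below uses simulated data: the group sizes, means and the third group name are hypothetical and not taken from the example above. Note that this function adjusts the p-values (multiplying them by the number of comparisons, capped at 1) instead of lowering alpha, so the adjusted p-values are compared with the original significance level.

```r
set.seed(7)

# Hypothetical data: three classes with unequal sizes (simulated)
group <- factor(rep(c("lower", "middle", "upper"), times = c(40, 60, 50)))
y <- rnorm(150, mean = c(5, 6.5, 6)[as.integer(group)], sd = 2)

# Pairwise t-tests using the pooled SD (as with MSE above) and Bonferroni-adjusted p-values
res <- pairwise.t.test(y, group, p.adjust.method = "bonferroni", pool.sd = TRUE)
print(res$p.value)

alpha <- 0.05  # assumed significance level
# TRUE where a pair of means differs at the 5% level; NA marks the unused upper triangle
print(res$p.value < alpha)
```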