I am not an engineer

Comparing Categorical Variables

If you see an error in the article, please comment or drop me an email.


When Do We Test for Goodness of Fit (GOF)?

  • A goodness-of-fit test is a one-variable Chi-square test.

  • The goal of a Chi-square goodness-of-fit test is to determine whether a set of observed frequencies or proportions is similar to, and therefore “fits”, a hypothesized set of frequencies or proportions.

  • A Chi-square goodness-of-fit test is like a one-sample t-test. It determines if a sample is similar to, and representative of, a population.

  • It applies when a sample of cases has been classified into several groups and we want to determine whether the sample is representative of the general population.

Typical questions:

  • Do the data resemble a particular distribution (normal, geometric, etc.)?

  • Is this die fair or not?

When Do We Test for Independence?

  • A test of independence is a two-variable Chi-square test.

  • As in any Chi-square test, the data are frequencies, so there are no scores and no means or standard deviations.

  • The goal of a two-variable Chi-square test is to determine whether the first variable is related to, or independent of, the second variable.

Typical questions:

  • Is the outcome in one variable related to the outcome in some other variable?

  • Does brand preference depend on age or not?

Goodness of Fit: One-way table

The test statistic for a one-way table is the following X^2:

X^2 = SUM [ (nb_observed - nb_expected)^2 / nb_expected ]

with: nb_expected <- p_pop * n, where p_pop is the hypothesized population proportion for a level and n is the sample size
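As a minimal sketch of the formula above, the statistic can be computed by hand for a fair-die question. The roll counts here are hypothetical and serve only to illustrate the arithmetic; they are not data from this article.

```r
# Hypothetical counts from 60 rolls of a six-sided die (illustrative only)
nb_observed <- c(8, 12, 9, 11, 6, 14)

p_pop <- rep(1 / 6, 6)        # hypothesized proportions for a fair die
n <- sum(nb_observed)         # sample size: 60
nb_expected <- p_pop * n      # expected count per face: 10

# X^2 = SUM [ (nb_observed - nb_expected)^2 / nb_expected ]
chi_sqr <- sum((nb_observed - nb_expected)^2 / nb_expected)
chi_sqr
## [1] 4.2
```

Each face contributes its squared deviation from 10, divided by 10, so faces far from the expected count dominate the total.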

Conditions for Goodness of fit chi-square test

  1. Independence: each case that contributes a count to a cell (=level) of the table must be independent of all other cases in the table

  2. Sample size: each expected value (=count) must be >= 5

One-way table – Goodness of Fit Test

Define the expected values

An example of observed data, compared to expected data. In this case, the expected values come from the geometric model P(D) = (1 - prob)^(D - 1) * prob

Results     1     2     3     4     5     6     7+    Total
Observed    1532  760   338   194   74    33    17    2948
Geometric   1569  734   343   161   75    35    31    2948

Calculate Chi-Square

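obs <- c(1532,760,338,194,74,33,17)
# named "expected" rather than "exp" to avoid masking base R's exp()
expected <- c(1569,734,343,161,75,35,31)
sqr <- (obs - expected)^2/expected
chi_sqr <- sum(sqr)
round(chi_sqr, 2)
## [1] 15.08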

Check P-value

n <- length(obs)    # number of categories (7), not the sample size
df <- n - 1

p_value <- pchisq(q=chi_sqr,df=df,lower.tail = FALSE)
round(p_value, 2)
## [1] 0.02
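The manual calculation can be cross-checked with R's built-in chisq.test(). The sketch below assumes the same observed and geometric counts as in the table above. Because chisq.test() expects hypothesized proportions rather than counts, the expected counts are divided by their total. It also checks the expected-count condition first.

```r
obs <- c(1532, 760, 338, 194, 74, 33, 17)
expected <- c(1569, 734, 343, 161, 75, 35, 31)

# Condition 2: every expected count should be at least 5
stopifnot(all(expected >= 5))

# Goodness-of-fit test against the hypothesized proportions
gof <- chisq.test(x = obs, p = expected / sum(expected))

round(unname(gof$statistic), 2)   # chi-square statistic, about 15.08
unname(gof$parameter)             # degrees of freedom: 7 categories - 1 = 6
round(gof$p.value, 2)             # about 0.02
```

With a p-value of about 0.02, the data would lead to rejecting the geometric model at the 5% level but not at the 1% level.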

Independence Test

Calculation differences for the independence test, as compared to the GOF test:

  • Degrees of freedom: (nb_rows - 1) * (nb_cols - 1)

  • Expected values: col_total * (row_total / table_total)

Example: Independence test

Two categorical variables:

  • Weight: obese or not obese

  • Relationship status: dating, cohabiting or married

Results     Dating  Cohabiting  Married  Total
Obese       81      103         147      331
Not obese   359     326         277      962
Total       440     429         424      1293


H_0: weight and relationship status are independent. Obesity rates do not vary by relationship status.

H_A: weight and relationship status are dependent. Obesity rates do vary by relationship status.

This is called an independence test since we are evaluating the relationship between two categorical variables.

Calculating the chi-square value for all cells

To calculate the expected value:

expected_value <- row_total/total * column_total

To calculate the chi-square value in each cell:

chisq <- (observed_value - expected_value)^2/expected_value
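One compact way to apply both formulas to every cell at once is to hold the counts in a matrix and use outer() on the row and column totals. This is a sketch rather than the only approach; the names observed, expected and cell_chisq are chosen here for illustration.

```r
# Observed counts from the table above, without the Total row and column
observed <- matrix(
  c( 81, 103, 147,
    359, 326, 277),
  nrow = 2, byrow = TRUE,
  dimnames = list(weight = c("Obese", "Not obese"),
                  status = c("Dating", "Cohabiting", "Married"))
)

# expected_value = row_total / total * column_total, for all cells
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contribution to the chi-square statistic
cell_chisq <- (observed - expected)^2 / expected

round(expected, 1)
round(sum(cell_chisq), 2)
```

The largest contributions come from the cells whose observed counts sit furthest from what independence would predict, here the obese/married and obese/dating cells.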

Feed in data

# One entry per relationship status: dating, cohabiting, married.
# The Total column of the table is left out: totals are not observations,
# and including them would inflate the degrees of freedom.
obese_obs <- c(81,103,147)
notob_obs <- c(359,326,277)
tbl <- data.frame(obese_obs,notob_obs)

row_total <- tbl$obese_obs + tbl$notob_obs
tbl_total <- sum(tbl$obese_obs)+sum(tbl$notob_obs)
ratio_obese_notob <- (sum(tbl$obese_obs)/tbl_total)
ratio_notob_obese <- (sum(tbl$notob_obs)/tbl_total)
tbl$obese_exp <- (row_total)*(ratio_obese_notob)
tbl$notob_exp <- (row_total)*(ratio_notob_obese)

Calculate Chi-Square

tbl$obese_sqr <- (tbl$obese_obs - tbl$obese_exp)^2/tbl$obese_exp
tbl$notob_sqr <- (tbl$notob_obs - tbl$notob_exp)^2/tbl$notob_exp
chi_sqr <- sum(tbl$obese_sqr,tbl$notob_sqr)
round(chi_sqr, 2)
## [1] 30.83

Calculate degrees of freedom

nb_rows <- length(tbl$obese_obs)    # 3 relationship statuses
nb_columns <- 2                     # obese, not obese
df <- (nb_rows - 1) * (nb_columns - 1)
df
## [1] 2

Check P-value

p_value <- pchisq(q=chi_sqr,df=df,lower.tail = FALSE)
round(p_value, 2)
## [1] 0
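As a cross-check, the same test can be run with chisq.test() on a matrix of the six observed cells. This sketch assumes the counts from the table above, with the Total row and column excluded. The table has 2 weight categories and 3 relationship statuses, so the degrees of freedom are (2 - 1) * (3 - 1) = 2.

```r
# 2 x 3 contingency table: weight (rows) by relationship status (columns)
observed <- rbind(
  Obese       = c(Dating = 81,  Cohabiting = 103, Married = 147),
  `Not obese` = c(Dating = 359, Cohabiting = 326, Married = 277)
)

indep <- chisq.test(observed)

round(unname(indep$statistic), 2)   # chi-square statistic, about 30.83
unname(indep$parameter)             # degrees of freedom: 2
indep$p.value                       # far below 0.001
round(indep$expected, 1)            # expected counts under independence
```

The p-value is close to zero, so H_0 is rejected: these data suggest that obesity rates vary by relationship status.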