# Comparing Categorical Variables

If you see an error in the article, please comment or drop me an email.

# Introduction

## When Do We Test for Goodness of Fit (GOF)?

• A goodness-of-fit test is a one-variable Chi-square test.

• The goal of a Chi-square goodness-of-fit test is to determine whether a set of observed frequencies or proportions is similar to, and therefore "fits", a hypothesized set of frequencies or proportions.

• A Chi-square goodness-of-fit test is like a one-sample t-test. It determines if a sample is similar to, and representative of, a population.

• Given a sample of cases classified into several groups, the test determines whether the sample is representative of the general population.

Typical questions:

• Do the data resemble a particular distribution (normal, geometric, etc.)?

• Is this die fair or not?

## When Do We Test for Independence?

• A test of independence is a two-variable Chi-square test.

• As in any Chi-square test, the data are frequencies, so there are no scores and no means or standard deviations.

• The goal of a two-variable Chi-square test is to determine whether the first variable is related to, or independent of, the second variable.

Typical questions:

• Is the outcome in one variable related to the outcome in some other variable?

• Does brand preference depend on age or not?

# Goodness of Fit: One-way table

The test statistic for a one-way table is the following X^2:

`X^2 = SUM [ (nb_observed - nb_expected)^2 / nb_expected ]`

with `nb_expected <- p_pop * n`, where `p_pop` is the hypothesized population proportion for a category and `n` is the sample size.
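
As a minimal sketch of the formula above, the statistic can be wrapped in a small helper. The function name `gof_statistic` and the die-roll counts are hypothetical and made up for illustration only (for a fair die, `p_pop` is 1/6 for each face).

```r
# Hypothetical helper: chi-square goodness-of-fit statistic.
# observed: vector of observed counts, one per category
# p_pop:    hypothesized population proportions (should sum to 1)
gof_statistic <- function(observed, p_pop) {
  n <- sum(observed)             # sample size
  expected <- p_pop * n          # nb_expected = p_pop * n
  sum((observed - expected)^2 / expected)
}

# Made-up counts for 60 rolls of a die (illustration only)
rolls <- c(8, 12, 9, 11, 10, 10)
gof_statistic(rolls, rep(1/6, 6))
```

A value near 0 means the observed counts sit close to the expected ones; larger values indicate a poorer fit.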

## Conditions for the Goodness of Fit Chi-Square Test

1. Independence: each case that contributes a count to a cell (level) of the table must be independent of all other cases in the table.

2. Expected values: each expected count must be at least 5.
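
The second condition is easy to check in code. The sketch below uses a hypothetical helper, `expected_counts_ok`, applied to the expected counts of the geometric example that appears later in this article; independence, by contrast, has to be judged from how the data were collected.

```r
# Hypothetical helper: TRUE when every expected count reaches the minimum
expected_counts_ok <- function(expected, min_count = 5) {
  all(expected >= min_count)
}

# Expected counts under the geometric model (from the example table)
expected_counts_ok(c(1569, 734, 343, 161, 75, 35, 31))
```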

# One-way table – Goodness of Fit Test

### Define the expected values

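An example of observed data compared to expected data. In this case, the expected values come from the geometric model `P(D) = (1 - prob)^(D - 1) * prob`, where `D` is the outcome (1, 2, 3, ...) and `prob` is the probability of success on each trial.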

| Results | 1 | 2 | 3 | 4 | 5 | 6 | 7+ | Total |
|---|---|---|---|---|---|---|---|---|
| Observed | 1532 | 760 | 338 | 194 | 74 | 33 | 17 | 2948 |
| Geometric | 1569 | 734 | 343 | 161 | 75 | 35 | 31 | 2948 |
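
The following sketch shows how expected counts can be derived from a geometric model, including the pooled "7+" category. The article does not state the value of `prob` behind the table, so `prob <- 0.5` is a placeholder assumption and the output will not reproduce the "Geometric" row above.

```r
# Assumption: prob is a placeholder; the value used for the table is not given
prob <- 0.5
n <- 2948                                # table total from the example

d <- 1:6
p_exact <- (1 - prob)^(d - 1) * prob     # P(D = d) for d = 1..6
p_tail  <- (1 - prob)^6                  # P(D >= 7): six failures in a row
p_model <- c(p_exact, p_tail)            # probabilities for 1..6 and "7+"

expected <- n * p_model                  # nb_expected = p_pop * n
names(expected) <- c(1:6, "7+")
round(expected, 1)
```

Pooling the tail into one category keeps the model probabilities summing to 1, so the expected counts add up to the table total.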

### Calculate Chi-Square

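```r
obs <- c(1532, 760, 338, 194, 74, 33, 17)       # observed counts
expected <- c(1569, 734, 343, 161, 75, 35, 31)  # expected counts (geometric model)

# Contribution of each category, then the chi-square statistic
sqr <- (obs - expected)^2 / expected
chi_sqr <- sum(sqr)
round(chi_sqr, 2)
```
```
## [1] 15.08
```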

### Check P-value

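```r
nb_levels <- length(obs)  # number of categories in the one-way table
df <- nb_levels - 1       # degrees of freedom for the goodness-of-fit test

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(q = chi_sqr, df = df, lower.tail = FALSE)
round(p_value, 2)
```
```
## [1] 0.02
```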
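
As a cross-check on the manual calculation, base R's `chisq.test()` can run the same goodness-of-fit test. This is a sketch that redefines the data so it runs on its own; the names `expected` and `gof` are illustrative. Note that `chisq.test()` expects hypothesized proportions in `p`, not counts, so the expected counts are divided by their total.

```r
obs <- c(1532, 760, 338, 194, 74, 33, 17)
expected <- c(1569, 734, 343, 161, 75, 35, 31)

# Goodness-of-fit test against the proportions implied by the expected counts
gof <- chisq.test(x = obs, p = expected / sum(expected))

round(unname(gof$statistic), 2)  # chi-square statistic
unname(gof$parameter)            # degrees of freedom
round(gof$p.value, 2)            # p-value
```

The statistic, degrees of freedom and p-value should agree with the step-by-step results above.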

# Independence Test

Calculation differences for the independence test, as compared to the GOF test:

• Degrees of freedom: `(nb_rows - 1) * (nb_cols - 1)`

• Expected values: `col_total * (row_total / table_total)`

## Example: Independence test

Two categorical variables:

• Weight: obese or not obese

• Relationship status: dating, cohabiting or married

| Results | Dating | Cohabiting | Married | Total |
|---|---|---|---|---|
| Obese | 81 | 103 | 147 | 331 |
| Not obese | 359 | 326 | 277 | 962 |
| Total | 440 | 429 | 424 | 1293 |

### Hypotheses

H_0: weight and relationship status are independent. Obesity rates do not vary by relationship status.

H_A: weight and relationship status are dependent. Obesity rates do vary by relationship status.

This is called an independence test since we are evaluating the relationship between two categorical variables.
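
To make the hypotheses concrete, a quick sketch can compute the observed obesity rate within each relationship status. The object names (`observed`, `col_props`, `obesity_rate`) are illustrative and not part of the original analysis.

```r
# Observed counts from the example table (totals excluded)
observed <- matrix(c(81, 103, 147,
                     359, 326, 277),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Obese", "Not obese"),
                                   c("Dating", "Cohabiting", "Married")))

# Share of each column (relationship status) that falls in each weight group
col_props <- prop.table(observed, margin = 2)
obesity_rate <- col_props["Obese", ]
round(obesity_rate, 3)
```

The observed rates are roughly 18%, 24% and 35%. Under H_0 they would differ only by sampling variation; the test quantifies whether the observed gaps are larger than that.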

### Calculating the chi-square value for all cells

To calculate the expected value of a cell:

`expected_value <- row_total / total * column_total`

To calculate the chi-square value in each cell:

`chisq <- (observed_value - expected_value)^2 / expected_value`
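
The two formulas can be applied to every cell at once by storing the counts in a matrix. This is a sketch of one possible approach, using `outer()` to multiply row totals by column totals; the object names are illustrative.

```r
# Observed counts from the example table (totals excluded)
observed <- matrix(c(81, 103, 147,
                     359, 326, 277),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Obese", "Not obese"),
                                   c("Dating", "Cohabiting", "Married")))

# expected = row_total * column_total / table_total, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contribution to the chi-square statistic
cell_chisq <- (observed - expected)^2 / expected

round(expected, 1)
round(cell_chisq, 2)
```

Summing `cell_chisq` over all six cells gives the chi-square statistic for the table.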

### Feed in data

```r
# One entry per relationship status: dating, cohabiting, married.
# The table's "Total" column is left out so it is not counted as a category.
obese_obs <- c(81, 103, 147)
notob_obs <- c(359, 326, 277)
tbl <- data.frame(obese_obs, notob_obs)

status_total <- tbl$obese_obs + tbl$notob_obs          # total per relationship status
tbl_total <- sum(tbl$obese_obs) + sum(tbl$notob_obs)   # table total
ratio_obese <- sum(tbl$obese_obs) / tbl_total          # overall share of obese cases
ratio_notob <- sum(tbl$notob_obs) / tbl_total          # overall share of non-obese cases

# Expected counts under independence
tbl$obese_exp <- status_total * ratio_obese
tbl$notob_exp <- status_total * ratio_notob
```

### Calculate Chi-Square

```r
tbl$obese_sqr <- (tbl$obese_obs - tbl$obese_exp)^2 / tbl$obese_exp
tbl$notob_sqr <- (tbl$notob_obs - tbl$notob_exp)^2 / tbl$notob_exp
chi_sqr <- sum(tbl$obese_sqr, tbl$notob_sqr)
round(chi_sqr, 2)
```
```
## [1] 30.83
```

### Calculate degrees of freedom

```r
nb_rows <- nrow(tbl)  # 3 relationship statuses
nb_columns <- 2       # obese, not obese
df <- (nb_rows - 1) * (nb_columns - 1)
print(df)
```
```
## [1] 2
```

### Check P-value

```r
p_value <- pchisq(q = chi_sqr, df = df, lower.tail = FALSE)
round(p_value, 2)
```
```
## [1] 0
```
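
As a cross-check, base R's `chisq.test()` runs the test of independence directly on a matrix of counts. This sketch redefines the data so it runs on its own; the names `observed` and `ind` are illustrative. The matrix holds only the six observed cells (no totals), so the test uses (2 - 1) * (3 - 1) = 2 degrees of freedom.

```r
# Observed counts from the example table (totals excluded)
observed <- matrix(c(81, 103, 147,
                     359, 326, 277),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Obese", "Not obese"),
                                   c("Dating", "Cohabiting", "Married")))

ind <- chisq.test(observed)

round(unname(ind$statistic), 2)  # chi-square statistic
unname(ind$parameter)            # degrees of freedom
ind$p.value                      # unrounded p-value
round(ind$expected, 1)           # expected counts under independence
```

Printing the unrounded p-value shows that it is very small rather than exactly zero, which is strong evidence against H_0.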