Comparing Categorical Variables
If you see an error in the article, please comment or drop me an email.
Introduction
When Do We Test for Goodness of Fit (GOF)?

A goodness-of-fit test is a one-variable Chi-square test.

The goal of a Chi-square goodness-of-fit test is to determine whether a set of frequencies or proportions is similar to, and therefore “fits”, a hypothesized set of frequencies or proportions.

A Chi-square goodness-of-fit test is like a one-sample t-test. It determines if a sample is similar to, and representative of, a population.

Given a sample of cases classified into several groups, determine whether the sample is representative of the general population.
Typical questions:

Do the data resemble a particular distribution (normal, geometric, etc.)?

“Is this die fair or not?”
When Do We Test for Independence?

A test of independence is a two-variable Chi-square test.

As in any Chi-square test, the data are frequencies, so there are no scores and no means or standard deviations.

The goal of a two-variable Chi-square test is to determine whether the first variable is related to, or independent of, the second variable.
Typical questions:

Is the outcome in one variable related to the outcome in some other variable?

“Does brand preference depend on age or not?”
Goodness of Fit: One-way table
The test statistic for a one-way table is the following X^2:
X^2 = SUM [ (nb_observed - nb_expected)^2 / nb_expected ]
with: nb_expected <- p_pop * n
where p_pop is the hypothesized population proportion for a level and n is the sample size.
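The statistic above can be sketched as a small R function. The function name `gof_statistic` and the example counts are hypothetical and only serve to illustrate the formula; they are not part of the original analysis.

```r
# Chi-square goodness-of-fit statistic for a one-way table.
# observed: vector of observed counts
# p_pop:    hypothesized population proportions (must sum to 1)
gof_statistic <- function(observed, p_pop) {
  stopifnot(length(observed) == length(p_pop),
            abs(sum(p_pop) - 1) < 1e-8)
  n <- sum(observed)
  nb_expected <- p_pop * n
  sum((observed - nb_expected)^2 / nb_expected)
}

# Hypothetical counts, for illustration only:
# three equally likely levels, so each expected count is 20
gof_statistic(c(10, 20, 30), rep(1/3, 3))
## [1] 10
```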
Conditions for the goodness-of-fit Chi-square test

Independence: each case that contributes a count to a cell (=level) of the table must be independent of all the other cases in the table

Expected values (=counts) must be >= 5
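The sample-size condition is easy to check in code, while independence has to be judged from how the data were collected. A minimal sketch, assuming a hypothetical helper named `expected_counts_ok` and using the expected counts from the geometric example in this article:

```r
# TRUE when every expected count reaches the minimum size (5 by default)
expected_counts_ok <- function(expected, min_count = 5) {
  all(expected >= min_count)
}

# Expected counts from the geometric example in this article
exp_geometric <- c(1569, 734, 343, 161, 75, 35, 31)
expected_counts_ok(exp_geometric)
## [1] TRUE
```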
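One-way table – Goodness of Fit Test
Define the expected values
An example of observed data, compared to expected data. In this case, the expected values come from the geometric model P = (1 - prob)^(D - 1) * prob, where D is the category shown in the header row (1, 2, ..., 7+) and prob is the probability of success on a single trial.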
Results    1     2    3    4    5   6   7+  Total
Observed   1532  760  338  194  74  33  17  2948
Geometric  1569  734  343  161  75  35  31  2948
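The article does not state the value of prob behind the Geometric row. As a sketch under a stated assumption, taking prob as the expected share of the first category (1569 / 2948) reproduces the row after rounding. That choice is an assumption made here for illustration, not something the source confirms.

```r
n <- 2948
prob <- 1569 / n            # ASSUMPTION: not given in the article

d <- 1:6
p_d <- (1 - prob)^(d - 1) * prob   # P(D = d) for d = 1..6
p_tail <- (1 - prob)^6             # P(D >= 7), the "7+" category

expected <- n * c(p_d, p_tail)
round(expected)
## [1] 1569  734  343  161   75   35   31
```

The last category uses the tail probability rather than the point probability, so the expected counts add up to n.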
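Calculate Chi-Square
# Observed counts and expected counts under the geometric model
obs <- c(1532,760,338,194,74,33,17)
exp <- c(1569,734,343,161,75,35,31)
# Contribution of each category, then their sum
sqr <- (obs - exp)^2/exp
chi_sqr <- sum(sqr)
round(chi_sqr,2)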
## [1] 15.08
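Check P-value
# Degrees of freedom: number of categories minus one
n <- length(obs)
df <- n - 1
# Upper-tail probability of the chi-square distribution
p_value <- pchisq(q=chi_sqr,df=df,lower.tail = FALSE)
round(p_value,2)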
## [1] 0.02
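As a cross-check, the same test can be run with R's built-in `chisq.test()`. This is a sketch rather than the article's method: the expected counts are turned into proportions for the `p` argument, and the variable names `exp_counts` and `res` are chosen here for illustration.

```r
obs <- c(1532, 760, 338, 194, 74, 33, 17)
exp_counts <- c(1569, 734, 343, 161, 75, 35, 31)

# p must be proportions summing to 1, so rescale the expected counts
res <- chisq.test(x = obs, p = exp_counts / sum(exp_counts))

round(unname(res$statistic), 2)  # chi-square statistic
unname(res$parameter)            # degrees of freedom
round(res$p.value, 2)            # p-value
```

The statistic, degrees of freedom and p-value should agree with the manual calculation (15.08, 6 and 0.02).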
Independence Test
Calculation differences for the independence test, as compared to the GOF test:

Degrees of freedom: (nb_rows - 1) * (nb_cols - 1)

Expected values: col_total * (row_total / table_total)
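Both differences can be sketched for a table stored as a matrix without its totals. The helper names `expected_counts` and `independence_df`, and the toy table, are hypothetical and used only to illustrate the two formulas.

```r
# Expected count per cell: row_total * col_total / table_total
expected_counts <- function(tbl) {
  outer(rowSums(tbl), colSums(tbl)) / sum(tbl)
}

# Degrees of freedom: (nb_rows - 1) * (nb_cols - 1)
independence_df <- function(tbl) {
  (nrow(tbl) - 1) * (ncol(tbl) - 1)
}

# Hypothetical 2 x 2 table, for illustration only (filled column by column)
toy <- matrix(c(10, 20, 30, 40), nrow = 2)
expected_counts(toy)
##      [,1] [,2]
## [1,]   12   28
## [2,]   18   42
independence_df(toy)
## [1] 1
```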
Example: Independence test
Two categorical variables:

Weight: obese or not obese

Relationship status: dating, cohabiting or married
Results    Dating  Cohabiting  Married  Total
Obese      81      103         147      331
Not obese  359     326         277      962
Total      440     429         424      1293
Hypotheses
H_0: weight and relationship status are independent. Obesity rates do not vary by relationship status.
H_A: weight and relationship status are dependent. Obesity rates do vary by relationship status.
This is called an independence test since we are evaluating the relationship between two categorical variables.
Calculating the chi-square value for all cells
To calculate the expected value:
expected_value <- row_total/total * column_total
To calculate the chi-square value in each cell:
chisq <- (observed_value - expected_value)^2/expected_value
Feed in data
# Counts by relationship status (dating, cohabiting, married); the Total column is left out
obese_obs <- c(81,103,147)
notob_obs <- c(359,326,277)
df <- data.frame(obese_obs,notob_obs)
# Total per relationship status, and overall table total
row_total <- df$obese_obs + df$notob_obs
tbl_total <- sum(df$obese_obs)+sum(df$notob_obs)
# Overall share of obese and of not obese cases
ratio_obese_notob <- (sum(df$obese_obs)/tbl_total)
ratio_notob_obese <- (sum(df$notob_obs)/tbl_total)
# Expected counts under independence
df$obese_exp <- (row_total)*(ratio_obese_notob)
df$notob_exp <- (row_total)*(ratio_notob_obese)
Calculate Chi-Square
df$obese_sqr <- (df$obese_obs - df$obese_exp)^2/df$obese_exp
df$notob_sqr <- (df$notob_obs - df$notob_exp)^2/df$notob_exp
chi_sqr <- sum(df$obese_sqr,df$notob_sqr)
round(chi_sqr,2)
## [1] 30.83
Calculate degrees of freedom
# The data frame has one row per relationship status and one column per weight group
nb_rows <- length(df$obese_obs)
nb_columns <- 2
df <- (nb_rows - 1) * (nb_columns - 1)
print(df)
## [1] 2
Check P-value
p_value <- pchisq(q=chi_sqr,df=df,lower.tail = FALSE)
round(p_value,2)
## [1] 0
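The result can be cross-checked with R's built-in `chisq.test()` applied to the table without its totals. This is a sketch, not the article's method; the names `tbl` and `res` are chosen here for illustration.

```r
# Observed counts without the Total row and column
tbl <- matrix(c(81, 103, 147,
                359, 326, 277),
              nrow = 2, byrow = TRUE,
              dimnames = list(weight = c("Obese", "Not obese"),
                              status = c("Dating", "Cohabiting", "Married")))

res <- chisq.test(tbl)

round(unname(res$statistic), 2)  # chi-square statistic
unname(res$parameter)            # degrees of freedom
res$p.value                      # p-value, unrounded
round(res$expected, 1)           # expected counts under independence
```

For this table with 2 weight groups and 3 relationship statuses, the test uses (2 - 1) * (3 - 1) = 2 degrees of freedom. The unrounded p-value is far below 0.05, so H_0 (independence) is rejected.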