I am not an engineer

Proportions

If you see an error in the article, please comment or drop me an email.

Conditions for near normality of the distribution of sample proportions?

1. Observations are independent

2. Sample size: n * p >= 10 and n * (1 - p) >= 10
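As a minimal sketch, the sample-size condition can be wrapped in a small helper. The function name `check_success_failure` and its `threshold` argument are illustrative choices, not part of any package, and independence still has to be judged from the study design.

```r
# Hypothetical helper: TRUE when both the expected successes and the
# expected failures reach the threshold (10 by default)
check_success_failure <- function(n, p, threshold = 10) {
  expected_successes <- n * p
  expected_failures <- n * (1 - p)
  expected_successes >= threshold && expected_failures >= threshold
}

check_success_failure(n = 200, p = 0.9) # 180 expected successes, 20 expected failures
check_success_failure(n = 62, p = 0.1)  # only 6.2 expected successes
```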

Proportion inference in a nutshell

Let’s say we are interested in the proportion of smokers in the world.

N would be the world population

p_population would be the population proportion of smokers (n_smokers/N)

Then you draw a random sample of 1000 participants from each country (p_FR, p_DE, etc.). For each sample, you obtain a proportion of smokers.

Put together, these proportions form a sampling distribution of proportions. The mean of this sampling distribution would be a good estimate of the population proportion.
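The idea can be illustrated with a short simulation. This is only a sketch: the population proportion `p_true`, the number of samples and the seed are made-up values, and every sample is drawn from the same population for simplicity.

```r
set.seed(42)          # arbitrary seed, for reproducibility only
p_true <- 0.2         # assumed population proportion (made up for this sketch)
n_per_sample <- 1000  # participants per sample
n_samples <- 500      # number of samples drawn

# Each draw is the number of smokers in one sample; dividing by the
# sample size turns it into a sample proportion
sample_props <- rbinom(n = n_samples, size = n_per_sample, prob = p_true) / n_per_sample

mean(sample_props)                          # centre of the sampling distribution
sd(sample_props)                            # empirical spread of the proportions
sqrt(p_true * (1 - p_true) / n_per_sample)  # theoretical standard error
```

The mean of the simulated proportions should land close to `p_true`, and their standard deviation close to the theoretical standard error.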

First example: calculating the probability of a proportion

Exercise: 90% of all plant species are classified as angiosperms (flowering plants). If you were to randomly sample 200 plants from the list of all known plant species, what is the probability that at least 95% of plants in your sample will be flowering plants?

P(p_hat >= .95) = ?

#GIVEN Data
p_population <- .9
n <- 200
p_hat <- .95

Let us check conditions for CLT:

  • Independence observed –> OKAY

  • n * p >= 10 –> 200 x .9 = 180 >= 10 –> OKAY

  • n * (1-p) >= 10 –> 200 x .1 = 20 >= 10 –> OKAY

–> This means that the sampling distribution of proportions is nearly normally distributed

Calculations

#Let us find the standard error
SE <- sqrt( (p_population * (1 - p_population)) / n)

#Then the test statistic
z_score <- (p_hat - p_population)/SE

#Now compute the probability of observing p_hat or more in a sample of size n from a population with proportion p_population
prob <- pnorm(q=z_score,lower.tail = FALSE)
print(prob)
## [1] 0.009211063
#An alternative is the exact binomial probability of at least 190 successes (95% of 200)
sum(dbinom(x=190:200,size=200,prob=.9))
## [1] 0.00807125

Second example: calculating a confidence interval

Exercise: In a sample of 670 Americans, 571 have good intuition about experimental design. What percent of Americans have good intuition about experimental design?

#GIVEN Data
p_population <- NULL # unknown parameter
n <- 670 # sample size
p_hat <- 571/n # sample proportion of Americans with good intuition
conf_level <- .95
alpha <- 1-conf_level
sides <- 2

Let us check conditions for CLT:

  • Independence observed –> OKAY

  • n * p >= 10 –> 670 x .852 = 571 >= 10 –> OKAY

  • n * (1-p) >= 10 –> 670 x .148 = 99 >= 10 –> OKAY

–> This means that the sampling distribution of proportions is nearly normally distributed

Calculation of the confidence interval

SE <- sqrt( (p_hat * (1-p_hat)) / n)
z_score <- qnorm(1-(alpha/sides), lower.tail = TRUE)

confidence_interval <- p_hat + c(-1,1) * z_score * SE
print(confidence_interval)
## [1] 0.8253686 0.8791090
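To get a feel for what the 95% confidence level means, one can simulate many samples and count how often the interval contains the true proportion. The sketch below assumes a true proportion of 0.85 (a made-up value close to the observed one); the seed and the number of simulations are arbitrary.

```r
set.seed(1)              # arbitrary seed, for reproducibility only
p_true <- 0.85           # assumed true proportion (hypothetical)
n <- 670                 # same sample size as in the exercise
nb_sims <- 2000          # number of simulated samples
z_score <- qnorm(0.975)  # critical value for a 95% confidence level

# Simulated sample proportions and their intervals
p_hats <- rbinom(n = nb_sims, size = n, prob = p_true) / n
SEs <- sqrt(p_hats * (1 - p_hats) / n)
lower <- p_hats - z_score * SEs
upper <- p_hats + z_score * SEs

# Share of simulated intervals that contain the assumed true proportion
coverage <- mean(lower <= p_true & p_true <= upper)
print(coverage)
```

The share should come out in the neighbourhood of 0.95, although it varies from one simulation run to the next.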

Addition to second example: calculate the right sample size

Exercise: let us say we would like to reduce the margin of error from about 2.7% (the margin of the interval above) to 1%.

Let us solve for 1%:

ME <- .01 # 1%

ME = SE * z_score

ME = sqrt( (p_hat * (1 - p_hat)) / n ) * z_score

ME^2 = ( (p_hat * (1 - p_hat)) / n ) * z_score^2

n = (p_hat * (1 - p_hat)) * z_score^2 / ME^2

ME <- .01 # 1%
n <- (p_hat * (1 - p_hat)) * z_score^2 / ME^2
n <- ceiling(n)
print(n)
## [1] 4838

Third example: calculate the required sample size for desired ME

Exercise: calculate the appropriate sample size for desired ME of x% without having a sample proportion.

Use the following formula:

ME = z_score * sqrt( ( p_hat * (1 - p_hat) ) / n )

OR

n = (p_hat * (1 - p_hat)) * z_score^2 / ME^2

with p_hat = .5 (the most conservative guess, since p * (1 - p) is largest at p = .5) and z_score to be determined according to the confidence level

# define input
p_hat <- .5
conf_level <- .95
alpha <- 1 - conf_level
sides <- 2
ME <- .03
z_score <- qnorm(p=1-(alpha/sides),lower.tail=TRUE)

# calculate sample size
n <- (p_hat * (1 - p_hat)) * z_score^2 / ME^2
n <- ceiling(n)
print(n)
## [1] 1068
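The two sample-size calculations can be folded into one function. This is a sketch only: the name `required_sample_size` and its arguments are illustrative, and the default guess of 0.5 is the conservative choice used when no sample proportion is available.

```r
# Hypothetical helper: smallest n giving a margin of error of at most ME
required_sample_size <- function(ME, p_guess = 0.5, conf_level = 0.95) {
  z_score <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  ceiling(p_guess * (1 - p_guess) * z_score^2 / ME^2)
}

required_sample_size(ME = 0.03)                       # no prior estimate available
required_sample_size(ME = 0.01, p_guess = 571 / 670)  # using an earlier sample proportion
```

The two calls should reproduce the sample sizes of 1068 and 4838 obtained in the examples.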

Hypothesis testing for proportions

1) Set the hypotheses

H_0 : p_pop = null_value

H_A : p_pop != or > or < null_value

2) Calculate the point estimate p_hat

3) Check conditions

  • Independence of the sample

  • Sample size & skew (>= 10 expected failures/successes)

4) Draw sampling distribution, shade p-value, calculate test_statistic

SE <- sqrt( (p_pop * (1 - p_pop)) / n )

test_statistic <- ((p_hat - p_pop) / SE)

Note that in a hypothesis test, p_pop is the value claimed by the null hypothesis, so the standard error must be computed from it and not from p_hat (sample proportion). Use p_hat in the standard error only when calculating a confidence interval for a proportion!

5) Make a decision: reject or fail to reject the null hypothesis
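The steps above can be sketched as a single function. The name `one_prop_ztest`, its arguments and the numbers in the usage call are all hypothetical; the function simply strings together the condition check, the standard error under the null, the test statistic and the p-value.

```r
# Hypothetical helper: z-test for a single proportion
one_prop_ztest <- function(p_hat, p_null, n,
                           alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  # Success-failure condition, checked with the null value and not with p_hat
  stopifnot(n * p_null >= 10, n * (1 - p_null) >= 10)
  SE <- sqrt(p_null * (1 - p_null) / n)  # standard error under the null
  z <- (p_hat - p_null) / SE             # test statistic
  p_value <- switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),
    less = pnorm(z),
    greater = pnorm(z, lower.tail = FALSE)
  )
  list(SE = SE, test_statistic = z, p_value = p_value)
}

# Made-up numbers: 55% observed in a sample of 400, null value of 50%
result <- one_prop_ztest(p_hat = 0.55, p_null = 0.5, n = 400, alternative = "greater")
print(result$p_value)
```

Independence cannot be checked in code, so it remains a judgment about how the sample was collected.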

First example

Exercise: A 2013 Pew Research poll found that 60% of 1983 randomly selected American adults believe in evolution. Does this provide convincing evidence that a majority of Americans believe in evolution?

1) set the hypotheses

H_0 : p_pop = .5

H_A : p_pop > .5

p_pop <- .5

2) Calculate the point estimate p_hat

p_hat <- .6

3) Check conditions

  • Independence of the sample: random sample and 1983 < 10% of US population –> OKAY

  • Sample size & skew (>= 10 expected failures/successes): 1983 * p_pop = 991.5 >= 10 and 1983 * (1 - p_pop) = 991.5 >= 10 –> OKAY

n <- 1983

4) Draw sampling distribution, shade p-value, calculate test_statistic

p_hat ~ N(mean=p_pop, SE=0.0112)

SE <- sqrt( (p_pop * (1 - p_pop)) / n )
test_statistic <- ((p_hat - p_pop) / SE)

5) Make a decision: reject or fail to reject the null hypothesis

p_value <- pnorm(q=test_statistic,lower.tail=FALSE)
print(p_value)
## [1] 2.641113e-19

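The p-value is far below any usual significance level. We therefore reject the null hypothesis: the data provide convincing evidence that a majority of Americans believe in evolution.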

Confidence Interval: Estimating the difference between two proportions

Basically, estimating the difference between two proportions is calculating a confidence interval of the difference between two unknown population parameters.

Exercise: In early October 2013, a Gallup poll asked “Do you think there should or should not be a law that would ban the possession of handguns, except by the police and other authorized persons?”, suggesting the following answers:

  1. No, there should not be such a law

  2. Yes, there should be such a law

  3. No opinion

Results     success   n      p_hat
US          257       1028   .25
Coursera    59        83     .71

#DATA INPUT
n_US <- 1028
p_hat_US <- 257 / n_US
n_coursera <- 83
p_hat_coursera <- 59 / n_coursera

Define the parameter of interest and the point estimate

Parameter of interest

Difference between the proportions of all Coursera students and all Americans who believe there should be a ban on possession of handguns:

p_pop_coursera – p_pop_US

Point estimate

Difference between the proportions of sampled Coursera students and sampled Americans who believe there should be a ban on possession of handguns:

p_hat_coursera – p_hat_US

Checking conditions

  1. Independence within groups: randomly sampled and <10% of population
  • –> OKAY for the US sample (Gallup poll)

  • –> NOT OKAY for the Coursera sample (voluntary poll!) –> we must be extremely careful

  2. Independence between groups (non-paired)
  3. Sample size and skewness: at least 10 successes and 10 failures in both groups (US: 257 and 771, Coursera: 59 and 24)

Result: the sampling distribution is nearly normal, keeping in mind the caveat about the Coursera sample

Calculating the standard error for the difference between two proportions

#Each term is a variance (squared standard error); the square root is taken only after summing
var_coursera <- (p_hat_coursera * (1 - p_hat_coursera)) / n_coursera
var_US <- (p_hat_US * (1 - p_hat_US)) / n_US
SE <- sqrt( var_coursera + var_US )

Calculating the confidence interval

conf_level <- .95
alpha <- 1 - conf_level
sides <- 2

z_score <- qnorm(p=1-(alpha/sides), lower.tail = TRUE)
ME <- SE * z_score
difference <- p_hat_coursera - p_hat_US
conf_interval <- difference + c(-1,1) * ME
conf_interval <- round(conf_interval,2)
print(conf_interval)
## [1] 0.36 0.56
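For reuse, the same calculation can be written as a function of the raw counts. This is a sketch with an illustrative name (`two_prop_ci`); it uses the unpooled standard error, as is appropriate for a confidence interval.

```r
# Hypothetical helper: confidence interval for p1 - p2 from counts
two_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  # Unpooled standard error: each group contributes its own variance
  SE <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_score <- qnorm(1 - (1 - conf_level) / 2)
  (p1 - p2) + c(-1, 1) * z_score * SE
}

# Coursera minus US, with the counts from the table
ci <- two_prop_ci(x1 = 59, n1 = 83, x2 = 257, n2 = 1028)
print(round(ci, 2))
```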

Calculating hypothesis test for comparing two proportions

Exercise: A SurveyUSA poll asked respondents whether any of their children have ever been the victim of bullying. Also recorded on this survey was the gender of the respondent (the parent). Below is the distribution of responses by gender of respondent.

Results     Male   Female
Yes         34     61
No          52     61
Not sure    4      0
Total       90     122

#DATA INPUT
n_m <- 90
n_f <- 122
suc_m <- 34
suc_f <- 61
p_hat_m <- suc_m / n_m
p_hat_f <- suc_f / n_f

Setting the hypotheses

H_0 : p_m – p_f = 0

H_A : p_m – p_f != 0

null_value <- 0
sides <- 2

Calculating the pooled proportion

p_pool <- (suc_m + suc_f) / (n_m + n_f)

Checking conditions

  • Important: when checking conditions for a hypothesis test with a null_value of 0, the pooled proportion has to be used.

Which proportion should be used for checking conditions and calculating the SE?

  • Confidence interval –> p_hat

  • Hypothesis test with null_value other than 0 –> p_hat

  • Hypothesis test with null_value = 0 –> p_pooled

Checking conditions based on the expected proportion

n_m * p_pool >= 10
## [1] TRUE
n_m * (1- p_pool) >= 10 
## [1] TRUE
n_f * p_pool >= 10
## [1] TRUE
n_f * (1- p_pool) >= 10 
## [1] TRUE

Therefore, we have the following distribution: (p_m – p_f) ~ N(mean = null_value, SE)

Calculating the expected standard error…

SE <- sqrt( ( (p_pool * (1 - p_pool)) / n_m ) +  ( (p_pool * (1 - p_pool)) / n_f ) )

…the test statistic

point_estimate <- p_hat_m - p_hat_f
test_statistic <- (point_estimate - null_value) / SE

…the p-value

p_value <- pnorm(q=-abs(test_statistic)) * sides #tail area beyond |test_statistic|, doubled for a two-sided test; works whatever the sign of the statistic
print(p_value)
## [1] 0.0769369
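As a cross-check, base R's `prop.test` can run the same comparison. With `correct = FALSE` (no continuity correction) its chi-squared statistic is the square of the pooled z statistic, so the two-sided p-value should agree with the manual calculation. The variable names below are illustrative.

```r
# Counts of "Yes" answers and group sizes: male respondents first, female second
successes <- c(34, 61)
totals <- c(90, 122)

# correct = FALSE turns off the continuity correction, so the test
# corresponds to the pooled two-proportion z-test
res <- prop.test(x = successes, n = totals, alternative = "two.sided", correct = FALSE)

print(res$p.value)
sqrt(unname(res$statistic))  # absolute value of the z statistic
```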

Hypothesis Testing for Small Proportions

This inferential method applies to cases where the success-failure condition is not met.

Example

A medical consultant reports 3 complications in 62 surgeries, as compared to a 10% industry average. Is the consultant’s lower complication rate due to chance or statistically significant?

Hypotheses

diff = p_consultant - null_value

H_0 : diff = 0

H_A : diff < 0 (one-sided: both solutions below compute the lower tail only)

Checking conditions

Independence –> OKAY if each of the surgeries is from a different surgical team.

Success-failure condition: 0.1 * 62 = 6.2 < 10 –> NOT OKAY

Solution no1: Simulate the null distribution

sample_size <- 62
industry_avg <- .1
consultant_ratio <- 3/62
nb_sims <- 10000

#Simulate nb_sims samples under the null hypothesis: each draw is the number of complications in sample_size surgeries
simulated_samples <- rbinom(n=nb_sims,size=sample_size,prob=industry_avg)/sample_size

#Share of simulated samples with a complication ratio at most as high as the consultant's
sum(simulated_samples <= consultant_ratio)/nb_sims
## [1] 0.1223

Solution no2: Generate the exact null distribution and p-value

n <- 62
k <- 0:3 # 3 complications or less
p <- .1

#Formula to determine probability of observing exactly k successes in n trials:
p_value <- sum(choose(n,k) * p^k * (1 - p)^(n-k))

round(p_value,2)
## [1] 0.12
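The same exact p-value can be obtained from base R without writing out the formula. The sketch below uses `pbinom` for the cumulative probability and `binom.test` with `alternative = "less"` as a cross-check; both should agree with the manual sum.

```r
# P(X <= 3) for X ~ Binomial(62, 0.1), i.e. 3 complications or fewer under the null
p_exact <- pbinom(q = 3, size = 62, prob = 0.1)

# Exact binomial test, one-sided towards fewer complications
bt <- binom.test(x = 3, n = 62, p = 0.1, alternative = "less")

print(c(pbinom = p_exact, binom_test = bt$p.value))
```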

Experiment: Permutation/randomization tests for small samples

Results      Survived   Died   Total
Control      11         39     50
Treatment    14         26     40

Is the difference in survived/died ratio due to chance?

Starting with the regular procedure

Feed in the data

survived <- c(11,14)
died <- c(39,26)
df <- data.frame(survived,died)

n_c <- sum(df[1,])
n_t <- sum(df[2,])

suc_c <- df$survived[1]
suc_t <- df$survived[2]

p_c <- suc_c/n_c
p_t <- suc_t/n_t

Set the hypotheses

diff = p_control – p_treatment

H_0 : diff = 0

H_A : diff != 0

–> two-sided hypothesis test

–> control and treatment proportions are equal under the null hypothesis. Therefore, we have to pool the proportions.

null_value <- 0
sides <- 2

Solution no1: theoretical

Pool proportions

suc_total <- sum(df$survived) #Total of successes
n <- sum(df) #Total of trials
failure_total <- n - suc_total
p_pool <- (suc_total/n)

Checking conditions

Can we use the normal model?

  • Independence: patients are independent (as long as medical teams are too)

  • success-failure (control group): 0.2777778 * 50 = 13.89 > 10 & 0.7222222 * 50 = 36.11 > 10

  • success-failure (treatment group): 0.2777778 * 40 = 11.11 > 10 & 0.7222222 * 40 = 28.89 > 10

Calculating Standard error…

SE <- sqrt( ((p_pool*(1-p_pool))/n_t) + ((p_pool*(1-p_pool))/n_c) )

…and the z_score…

difference <- p_c - p_t
test_statistic <- (difference - null_value)/SE

…and the p-value

p_value_th <- pnorm(q=-abs(test_statistic)) * sides #two-sided p-value; -abs() makes it work whatever the sign of the difference
round(p_value_th,2)
## [1] 0.17

Solution no2: simulation

Create a sampling distribution of the differences in proportion

sample_source <- c(rep(1,25),rep(0,65)) #pooled outcomes: 25 survivors (1) and 65 deaths (0)
sampling_dist <- numeric(10000) #preallocate instead of growing the vector
for (i in seq_along(sampling_dist)) {
    shuffled <- sample(x=sample_source) #random permutation of the pooled outcomes
    tmp_p_c <- sum(shuffled[1:50])/50 #the first 50 play the role of the control group
    tmp_p_t <- sum(shuffled[51:90])/40 #the remaining 40 play the role of the treatment group
    sampling_dist[i] <- tmp_p_c - tmp_p_t
}
hist(sampling_dist)

Check probability based on simulated sampling distribution

p_value_sim <- sum(abs(sampling_dist)>=abs(difference))/length(sampling_dist) #share of simulated differences at least as extreme as the observed one, in either direction
round(p_value_sim,2)
## [1] 0.24

Compare solutions

The difference between the p-value obtained with the theoretical method and the one generated by simulation is quite large, which suggests that the normal approximation is rough for samples this small:

  • Theoretical method: 0.1712462

  • Simulation method: 0.2351
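The simulation can also be replaced by an exact calculation. When the 25 survivors are shuffled at random between a group of 50 and a group of 40, the number of survivors landing in the control group follows a hypergeometric distribution. The sketch below is an assumption-laden illustration and not part of the original procedure; it sums the probabilities of all outcomes at least as extreme as the observed difference.

```r
n_c <- 50        # size of the control group
n_t <- 40        # size of the treatment group
survivors <- 25  # total number of survivors
deaths <- 65     # total number of deaths
observed_diff <- 11 / n_c - 14 / n_t

# Possible numbers of survivors in the control group and the resulting differences
x <- 0:survivors
diffs <- x / n_c - (survivors - x) / n_t

# Probability of each outcome when group labels are assigned at random
probs <- dhyper(x, m = survivors, n = deaths, k = n_c)

# Two-sided p-value; the small tolerance guards against floating-point ties
p_value_exact <- sum(probs[abs(diffs) >= abs(observed_diff) - 1e-12])
print(p_value_exact)
```

This exact value is what the simulated p-value approximates. One likely reason the theoretical p-value is lower is that the normal approximation treats a discrete count as continuous and no continuity correction was applied.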

Memo

Which proportion is used in confidence intervals for proportions?

For confidence intervals use p_hat (observed sample proportion) when calculating the standard error and checking the success/failure condition. For hypothesis tests use p0 (null value) when calculating the standard error and checking the success/failure condition.

Use the observed number of successes and failures when calculating a confidence interval for a proportion, but not when doing a hypothesis test. In a hypothesis test for a proportion, you should use n * p0 and n * (1 - p0) successes and failures; that is, the expected number based on the null proportion.

“In statistical inference for proportions, standard error (SE) is calculated differently for hypothesis tests and confidence intervals.” Which of the following is the best justification for this statement?

Because in hypothesis testing, we assume the null hypothesis is true, hence we calculate SE using the null value of the parameter. In confidence intervals, there is no null value, hence we use the sample proportion(s). Note that the reason for the difference in calculations of standard error is the same as in the case of the single proportion: when the null hypothesis claims that the two population proportions are equal, we need to take that into consideration when calculating the standard error for the hypothesis test, and use a common proportion for both samples.