Proportions
If you see an error in the article, please comment or drop me an email.
Conditions for near normality of the distribution of sample proportions:
1. Observations are independent
2. Sample size/skew: np >= 10 and n(1 - p) >= 10
Proportion inference in a nutshell
Let’s say we are interested in the proportion of smokers in the world.
N would be the world population
p_population would be the population proportion of smokers (n_smokers/N)
Then you draw a random sample of 1000 participants in each country and obtain a proportion of smokers for each sample (p_FR, p_DE, etc.).
Put together, these proportions form a sampling distribution of proportions. The mean of this sampling distribution would be a good estimate of the population proportion.
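The idea can be sketched with a quick simulation. This is a minimal sketch, not part of the exercise: the 20% smoker rate, the sample size of 1000 and the number of samples are hypothetical values chosen only for illustration.

```r
# Hypothetical values, chosen only for illustration
p_population <- .2 # assumed true proportion of smokers
n <- 1000          # participants per sample
nb_samples <- 5000 # number of samples drawn

# draw many samples and record the proportion of smokers in each
sample_proportions <- replicate(nb_samples,
                                mean(rbinom(n = n, size = 1, prob = p_population)))

# the mean of the sampling distribution is close to the population proportion
mean(sample_proportions)
```

hist(sample_proportions) shows the nearly normal shape of the sampling distribution, centered at p_population.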
First example: calculating the probability of a proportion
Exercise: 90% of all plant species are classified as angiosperms (flowering plants). If you were to randomly sample 200 plants from the list of all known plant species, what is the probability that at least 95% of plants in your sample will be flowering plants?
P(p_hat > .95) = ?
# GIVEN data
p_population <- .9
n <- 200
p_hat <- .95
Let us check conditions for the CLT:

Independence of observations -> OKAY

n * p >= 10 -> 200 x .9 = 180 >= 10 -> OKAY

n * (1 - p) >= 10 -> 200 x .1 = 20 >= 10 -> OKAY

-> This means that the sampling distribution of proportions is nearly normal.
Calculations
# Let us find the standard error
SE <- sqrt((p_population * (1 - p_population)) / n)
# Then the test statistic
z_score <- (p_hat - p_population) / SE
# Now check the probability of p_hat in a sample of size n from a population with p_population
prob <- pnorm(q = z_score, lower.tail = FALSE)
print(prob)
## [1] 0.009211063
# An alternative way of calculating the probability is to use the binomial distribution
sum(dbinom(x = 190:200, size = 200, prob = .9))
## [1] 0.00807125
Second example: calculating a confidence interval
Exercise: In a sample of 670 Americans, 571 have good intuition about experimental design. What percent of Americans have good intuition about experimental design?
# GIVEN data
p_population <- NULL # unknown parameter
n <- 670 # sample size
p_hat <- 571/n # sample proportion of Americans with good intuition
conf_level <- .95
alpha <- 1 - conf_level
sides <- 2
Let us check conditions for the CLT:

Independence of observations -> OKAY

n * p_hat >= 10 -> 670 x .852 = 571 >= 10 -> OKAY

n * (1 - p_hat) >= 10 -> 670 x .148 = 99 >= 10 -> OKAY

-> This means that the sampling distribution of proportions is nearly normal.
Calculation of the confidence interval
SE <- sqrt((p_hat * (1 - p_hat)) / n)
z_score <- qnorm(1 - (alpha/sides), lower.tail = TRUE)
confidence_interval <- p_hat + c(-1, 1) * z_score * SE
print(confidence_interval)
## [1] 0.8253686 0.8791090
Addition to second example: calculate the right sample size
Exercise: let us say we would like to reduce the margin of error from roughly 3% to 1%.
Let us solve for n:
ME = SE * z_score
ME = sqrt((p_hat * (1 - p_hat)) / n) * z_score
ME^2 = ((p_hat * (1 - p_hat)) / n) * z_score^2
n = (p_hat * (1 - p_hat)) * z_score^2 / ME^2
ME <- .01 # 1%
n <- (p_hat * (1 - p_hat)) * z_score^2 / ME^2
n <- ceiling(n)
print(n)
## [1] 4838
Third example: calculate the required sample size for desired ME
Exercise: calculate the appropriate sample size for desired ME of x% without having a sample proportion.
Use the following formula:
ME = z_score * sqrt((p_hat * (1 - p_hat)) / n)
OR
n = (p_hat * (1 - p_hat)) * z_score^2 / ME^2
with p_hat = .5 (our best possible guess) and z_score to be determined according to the confidence level
# define input
p_hat <- .5
conf_level <- .95
alpha <- 1 - conf_level
sides <- 2
ME <- .03
z_score <- qnorm(p = 1 - (alpha/sides), lower.tail = TRUE)
# calculate sample size
n <- (p_hat * (1 - p_hat)) * z_score^2 / ME^2
n <- ceiling(n)
print(n)
## [1] 1068
Hypothesis testing for proportions
1) Set the hypotheses
H_0 : p_pop = null_value
H_A : p_pop != or > or < null_value
2) Calculate the point estimate p_hat
3) Check conditions

Independence of the sample

Sample size & skew (>= 10 expected failures/successes)
4) Draw the sampling distribution, shade the p-value, calculate the test statistic
SE <- sqrt((p_pop * (1 - p_pop)) / n)
test_statistic <- (p_hat - p_pop) / SE
Note that you cannot use p_hat (sample proportion) instead of p_pop (population proportion) if the latter is unknown. You can only use p_hat when calculating a confidence interval for a proportion!
5) Make a decision: reject or fail to reject the null hypothesis
First example
Exercise: A 2013 Pew Research poll found that 60% of 1983 randomly selected American adults believe in evolution. Does this provide convincing evidence that a majority of Americans believe in evolution?
1) set the hypotheses
H_0 : p_pop = .5
H_A : p_pop > .5
p_pop <- .5
2) Calculate the point estimate p_hat
p_hat <- .6
3) Check conditions

Independence of the sample: 1983 < 10% of the US population -> OKAY

Sample size & skew: 1983 * .5 = 991.5 >= 10 expected successes and failures -> OKAY

n <- 1983
4) Draw the sampling distribution, shade the p-value, calculate the test statistic
p_hat ~ N(mean=p_pop, SE=0.0112)
SE <- sqrt((p_pop * (1 - p_pop)) / n)
test_statistic <- (p_hat - p_pop) / SE
5) Make a decision: reject or fail to reject the null hypothesis
p_value <- pnorm(q = test_statistic, lower.tail = FALSE)
print(p_value)
## [1] 2.641113e-19
The p-value is far below .05; we therefore reject the null hypothesis.
Confidence Interval: Estimating the difference between two proportions
Basically, estimating the difference between two proportions is calculating a confidence interval of the difference between two unknown population parameters.
Exercise: In early October 2013, a Gallup poll asked “Do you think there should or should not be a law that would ban the possession of handguns, except by the police and other authorized persons?”, suggesting the following answers:

No, there should not be such a law

Yes, there should be such a law

No opinion
Results  | success | n    | p_hat
US       | 257     | 1028 | .25
Coursera | 59      | 83   | .71
# DATA input
n_US <- 1028
p_hat_US <- 257 / n_US
n_coursera <- 83
p_hat_coursera <- 59 / n_coursera
Define the parameter of interest and the point estimate
Parameter of interest
Difference between the proportions of all Coursera students and all Americans who believe there should be a ban on possession of handguns:
p_pop_coursera - p_pop_US
Point estimate
Difference between the proportions of sampled Coursera students and sampled Americans who believe there should be a ban on possession of handguns:
p_hat_coursera - p_hat_US
Checking conditions
Independence within groups: randomly sampled and < 10% of population

-> OKAY for the US sample (Gallup poll)

-> NOT OKAY for the Coursera sample (voluntary poll!) -> we must be extremely careful

Independence between groups (non-paired)

Sample size and skewness: >= 10 successes/failures in both groups

Result: the sampling distribution is nearly normal
Calculating the standard error for the difference between two proportions
SE_coursera <- (p_hat_coursera * (1 - p_hat_coursera)) / n_coursera
SE_US <- (p_hat_US * (1 - p_hat_US)) / n_US
SE <- sqrt(SE_coursera + SE_US)
Calculating the confidence interval
conf_level <- .95
alpha <- 1 - conf_level
sides <- 2
z_score <- qnorm(p = 1 - (alpha/sides), lower.tail = TRUE)
ME <- SE * z_score
difference <- p_hat_coursera - p_hat_US
conf_interval <- difference + c(-1, 1) * ME
conf_interval <- round(conf_interval, 2)
print(conf_interval)
## [1] 0.36 0.56
Hypothesis test for comparing two proportions
Exercise: A SurveyUSA poll asked respondents whether any of their children have ever been the victim of bullying. Also recorded on this survey was the gender of the respondent (the parent). Below is the distribution of responses by gender of respondent.
Results  | Male | Female
Yes      | 34   | 61
No       | 52   | 61
Not sure | 4    | 0
Total    | 90   | 122
# DATA input
n_m <- 90
n_f <- 122
suc_m <- 34
suc_f <- 61
p_hat_m <- suc_m / n_m
p_hat_f <- suc_f / n_f
Setting the hypotheses
H_0 : p_m - p_f = 0
H_A : p_m - p_f != 0
null_value <- 0
sides <- 2
Calculating the pooled proportion
p_pool <- (suc_m + suc_f) / (n_m + n_f)
Checking conditions
Important: when checking conditions for a hypothesis test, the pooled proportion has to be used, unless the null value is different from 0.
Which proportion should be used for checking conditions and calculating the SE?

Confidence interval -> p_hat

Hypothesis test with null_value other than 0 -> p_hat

Hypothesis test with null_value = 0 -> p_pooled
Checking conditions based on the expected proportion
n_m * p_pool >= 10
## [1] TRUE
n_m * (1 - p_pool) >= 10
## [1] TRUE
n_f * p_pool >= 10
## [1] TRUE
n_f * (1 - p_pool) >= 10
## [1] TRUE
Therefore, we have the following distribution: (p_m - p_f) ~ N(mean = null_value, SE)
Calculating the expected standard error…
SE <- sqrt(((p_pool * (1 - p_pool)) / n_m) + ((p_pool * (1 - p_pool)) / n_f))
…the test statistic
point_estimate <- p_hat_m - p_hat_f
test_statistic <- (point_estimate - null_value) / SE
…the p-value
p_value <- pnorm(q = test_statistic, lower.tail = TRUE) * sides
print(p_value)
## [1] 0.0769369
Hypothesis Testing for Small Proportions
This inferential method applies to cases where the success-failure condition is not met.
Example
A medical consultant claims a complication rate of 3 out of 62 surgeries, as compared to the 10% industry average. Is the consultant's lower complication rate due to chance, or is it statistically significant?
Hypotheses
diff = p_consultant - null_value
H_0 : diff = 0
H_A : diff != 0
Checking conditions
Independence -> OKAY if each of the surgeries is from a different surgical team.
Success-failure condition: 0.1 * 62 = 6.2 < 10 -> NOT OKAY
Solution no1: Simulate the null distribution
sample_size <- 62
industry_avg <- .1
consultant_ratio <- 3/62
nb_sims <- 10000
simulated_samples <- NULL
for (i in 1:nb_sims) {
  simulated_samples <- c(simulated_samples, sum(rbinom(n = sample_size, size = 1, prob = industry_avg))/sample_size)
}
# Proportion of simulated samples with a complication ratio no higher than the consultant's
sum(simulated_samples <= consultant_ratio)/nb_sims
## [1] 0.1223
Solution no2: Generate the exact null distribution and pvalue
n <- 62
k <- 0:3 # 3 complications or less
p <- .1
# Formula to determine the probability of observing exactly k successes in n trials:
p_value <- sum(choose(n, k) * p^k * (1 - p)^(n - k))
round(p_value,2)
## [1] 0.12
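Since this sums the binomial probabilities for k = 0 to 3, the same value can also be obtained directly from the binomial cumulative distribution function pbinom:

```r
# P(X <= 3) for a binomial with n = 62 trials and success probability .1
p_value <- pbinom(q = 3, size = 62, prob = .1)
round(p_value, 2)
## [1] 0.12
```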
Experiment: Permutation/randomization tests for small samples
Results   | Survived | Died | Total
Control   | 11       | 39   | 50
Treatment | 14       | 26   | 40
Is the difference in survived/died ratio due to chance?
Starting with the regular procedure
Feed in the data
survived <- c(11, 14)
died <- c(39, 26)
df <- data.frame(survived, died)
n_c <- sum(df[1,])
n_t <- sum(df[2,])
suc_c <- df$survived[1]
suc_t <- df$survived[2]
p_c <- suc_c/n_c
p_t <- suc_t/n_t
Set the hypotheses
diff = p_control - p_treatment
H_0 : diff = 0
H_A : diff != 0
-> two-sided hypothesis test
-> control and treatment proportions are equal under the null hypothesis. Therefore, we have to pool the proportions.
null_value <- 0
sides <- 2
Solution no1: theoretical
Pool proportions
suc_total <- sum(df$survived) # Total of successes
n <- sum(df) # Total of trials
failure_total <- n - suc_total
p_pool <- suc_total/n
Checking conditions
Can we use the normal model?

Independence: patients are independent (as long as medical teams are too)

Success-failure (control group): 0.278 * 50 = 13.89 > 10 and 0.722 * 50 = 36.11 > 10 -> OKAY

Success-failure (treatment group): 0.278 * 40 = 11.11 > 10 and 0.722 * 40 = 28.89 > 10 -> OKAY
Calculating the standard error…
SE <- sqrt(((p_pool * (1 - p_pool))/n_t) + ((p_pool * (1 - p_pool))/n_c))
…and the z_score…
difference <- p_c - p_t
test_statistic <- (difference - null_value)/SE
…and the p-value
p_value_th <- pnorm(q = test_statistic, lower.tail = TRUE) * sides # difference is negative! -> lower.tail = TRUE
round(p_value_th, 2)
## [1] 0.17
Solution no2: simulation
Create a sampling distribution of the differences in proportion
tmp1 <- rep(1, 25)
tmp2 <- rep(0, 65)
sample_source <- c(tmp1, tmp2) # create a sample example with n_successes (25) and n_failures (65)
sampling_dist <- NULL
for (i in 1:10000) {
  newtemp_sample <- sample(x = sample_source, replace = FALSE, size = length(sample_source))
  tmp_p_c <- sum(newtemp_sample[1:50])/50 # set numbers manually, but could have been set by using the variables
  tmp_p_t <- sum(newtemp_sample[51:90])/40
  tmp_diff <- tmp_p_c - tmp_p_t
  sampling_dist <- c(sampling_dist, tmp_diff)
}
hist(sampling_dist)
Check probability based on simulated sampling distribution
p_value_sim <- sum(abs(sampling_dist) >= abs(difference))/length(sampling_dist) # what part of the sampling distribution is bigger or equal to the positive difference OR smaller or equal to the negative difference?
round(p_value_sim,2)
## [1] 0.24
Compare solutions
The difference between the p-value obtained following the theoretical method and the one generated by simulation is quite large:

Theoretical method: 0.1712462

Simulation method: 0.2351
Memo
Which proportion is used in confidence intervals for proportions?
For confidence intervals, use p_hat (the observed sample proportion) when calculating the standard error and checking the success/failure condition. For hypothesis tests, use p_0 (the null value) when calculating the standard error and checking the success/failure condition.
Use the observed numbers of successes and failures when calculating a confidence interval for a proportion, but not when doing a hypothesis test. In a hypothesis test for a proportion, you should use n * p_0 successes and n * (1 - p_0) failures; that is, the expected numbers based on the null proportion.
“In statistical inference for proportions, standard error (SE) is calculated differently for hypothesis tests and confidence intervals.” Which of the following is the best justification for this statement?
Because in hypothesis testing, we assume the null hypothesis is true, hence we calculate SE using the null value of the parameter. In confidence intervals, there is no null value, hence we use the sample proportion(s). Note that the reason for the difference in calculations of standard error is the same as in the case of the single proportion: when the null hypothesis claims that the two population proportions are equal, we need to take that into consideration when calculating the standard error for the hypothesis test, and use a common proportion for both samples.
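The distinction can be sketched in code. The sample values below (60 successes in a sample of n = 100, null value .5) are hypothetical, chosen only for illustration:

```r
# Hypothetical example: 60 successes in a sample of n = 100,
# testing against a null value p_0 = .5
n <- 100
p_hat <- 60 / n # observed sample proportion
p_0 <- .5       # null value

# Confidence interval: SE is built from the observed proportion p_hat
SE_ci <- sqrt((p_hat * (1 - p_hat)) / n)

# Hypothesis test: SE is built from the null value p_0
SE_ht <- sqrt((p_0 * (1 - p_0)) / n)

c(SE_ci, SE_ht)
```

The two standard errors differ (about .049 versus .05 here), which is why a confidence interval and a hypothesis test on the same data can occasionally disagree.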