
# Bootstrapping

If you see an error in the article, please comment or drop me an email.

The basic bootstrap principle treats the observed sample as a stand-in for the population: drawing random samples with replacement from the observed data yields an estimated distribution for the statistic of interest.

Sample → resamples → estimated distribution

# Steps of bootstrapping

1. Take a bootstrap sample (a random sample drawn with replacement, of the same size as the original sample).

2. Calculate a bootstrap statistic (such as the mean, median, or proportion).

3. Repeat steps 1 and 2 many times to create a bootstrap distribution (a distribution of bootstrap statistics).
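The three steps can be sketched as a plain loop. This is only a minimal illustration: the simulated sample `toy_sample`, the seed, and the number of replicates `n_boot` are assumptions made for the example, not values from the article.

```r
set.seed(1)                                   # assumed seed, for reproducibility only
toy_sample <- rnorm(50, mean = 170, sd = 7)   # hypothetical sample, not real data
n_obs  <- length(toy_sample)
n_boot <- 2000                                # assumed number of bootstrap replicates

boot_medians <- numeric(n_boot)
for (b in seq_len(n_boot)) {
  # Step 1: bootstrap sample of the original size, drawn with replacement
  boot_sample <- sample(toy_sample, size = n_obs, replace = TRUE)
  # Step 2: bootstrap statistic (here the median)
  boot_medians[b] <- median(boot_sample)
}
# Step 3: after the loop, boot_medians holds the bootstrap distribution
summary(boot_medians)
```

A loop is slower than a vectorised version, but it maps one-to-one onto the steps above.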

# Example: bootstrapping the median in R

```r
library(UsingR)  # assumption: father.son is loaded from the UsingR package
data(father.son)
x <- father.son$sheight  # sons' heights
n <- length(x)           # original sample size
B <- 10000               # number of bootstrap resamples
# replace = TRUE --> a drawn card goes back in the deck
values <- sample(x = x, size = n * B, replace = TRUE)
# one resample per row: a matrix of dimensions 10000 x original sample size n
resamples <- matrix(data = values, nrow = B, ncol = n)
# MARGIN = 1 applies median to each row, giving one median per resample
resampledMedians <- apply(X = resamples, MARGIN = 1, FUN = median)
```

# Two bootstrapping methods

## The percentile method

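```r
## ATTENTION: STARTING WITH A DIFFERENT DATA SET

n <- 20                  # sample size
confidence_level <- 0.9  # 90%
alpha <- 1 - confidence_level
sides <- 2               # two-sided interval

# THEORETICAL EXAMPLE: quantile() is applied to random normal draws here;
# in practice it is applied to the bootstrap statistics
vector <- rnorm(n = n)
# probs = 0.05 and 0.95, i.e. alpha / sides is cut off in each tail
confidence_interval <- quantile(x = vector, probs = 0.5 + c(-1, 1) * (confidence_level / sides))
```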
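To connect the percentile method to an actual bootstrap distribution, here is a small sketch. The skewed sample `skewed_sample`, the seed, and the number of replicates are assumptions chosen for illustration. The interval is read directly off the quantiles of the bootstrap medians.

```r
set.seed(42)                                  # assumed seed, for reproducibility only
skewed_sample <- rexp(20, rate = 1 / 900)     # hypothetical skewed sample
n_rep <- 5000                                 # assumed number of bootstrap replicates

# bootstrap distribution of the median (sample() defaults to the original size)
pct_boot_medians <- replicate(n_rep, median(sample(skewed_sample, replace = TRUE)))

level     <- 0.9                              # 90% confidence level
tail_prob <- (1 - level) / 2                  # probability cut off in each tail
ci_percentile <- quantile(pct_boot_medians, probs = c(tail_prob, 1 - tail_prob))
ci_percentile
```

Because the interval comes straight from the bootstrap distribution, it does not have to be symmetric around the sample median.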

## The standard error method

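```r
sample_median <- 887  # median of the original sample
SE_boot <- 89.5758    # bootstrap standard error
df <- n - 1           # n, alpha and sides are defined in the percentile example
t_score <- qt(1 - (alpha / sides), df = df, lower.tail = TRUE)  # critical t value
# point estimate plus/minus the margin of error
confidence_interval <- sample_median + c(-1, 1) * t_score * SE_boot
```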
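The snippet above takes `SE_boot` as given. A bootstrap standard error is usually computed as the standard deviation of the bootstrap statistics, so a self-contained sketch might look like the following. The data, seed, and replicate count are hypothetical and are not the values behind 887 and 89.5758.

```r
set.seed(7)                                   # assumed seed, for reproducibility only
obs <- rexp(20, rate = 1 / 900)               # hypothetical sample of size 20
reps <- 5000                                  # assumed number of bootstrap replicates

# bootstrap distribution of the median
se_boot_medians <- replicate(reps, median(sample(obs, replace = TRUE)))

se_boot <- sd(se_boot_medians)                # bootstrap standard error
t_crit  <- qt(1 - 0.10 / 2, df = length(obs) - 1)  # two-sided 90% critical value
ci_se   <- median(obs) + c(-1, 1) * t_crit * se_boot
ci_se
```

Unlike the percentile interval, this one is symmetric around the sample median by construction.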