I am not an engineer

Distributions and probability

If you see an error in the article, please comment or drop me an email.

Normal distribution

Randomized sample of independent and identically distributed variables.

The normal model can be used for the sampling distribution of the mean if

  • the sample size is large (n > 30 is the usual rule of thumb)

  • the observations are independent

  • the sample is random
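As a minimal sketch of what the normal model gives you once these conditions hold: the sample mean is approximately normal with standard error sigma / sqrt(n). The values of mu, sigma, n and the cutoff of 105 are made up for illustration.

```r
# Hypothetical population parameters and sample size (illustrative values only)
mu    <- 100
sigma <- 15
n     <- 36   # larger than 30, so the normal model is a reasonable approximation

# Standard error of the sample mean
se <- sigma / sqrt(n)

# Probability that the sample mean exceeds 105 under the normal model
p_above <- pnorm(105, mean = mu, sd = se, lower.tail = FALSE)

print(se)
print(p_above)
```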

F distribution

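Ratio of two independent mean squares: the mean of n1 squared standard normals divided by the mean of n2 squared standard normals (that is, two independent chi-square variables, each divided by its degrees of freedom).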

MSG – Mean square between groups –> describes variability between groups

MSE – Mean square error –> describes variability within groups

F = MSG/MSE = ratio of variability in the sample means relative to the variability within the groups

The F statistic has two degrees-of-freedom parameters: df1 and df2

df1 = k – 1 (number of groups minus one)

df2 = n – k (combined sample size minus number of groups)
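The definitions above can be sketched by hand on a toy data set. The three groups and their values are hypothetical, chosen only so the arithmetic is easy to follow; in practice you would probably let aov() or anova() do this for you.

```r
# Hypothetical data: three groups (a, b, c) with four observations each
values <- c(4, 5, 6, 5,   7, 8, 9, 8,   1, 2, 3, 2)
group  <- factor(rep(c("a", "b", "c"), each = 4))

k <- nlevels(group)    # number of groups
n <- length(values)    # combined sample size

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Sum of squares between groups and within groups
ssg <- sum(group_sizes * (group_means - grand_mean)^2)
sse <- sum((values - ave(values, group))^2)   # ave() repeats each group's mean per row

df1 <- k - 1
df2 <- n - k

msg <- ssg / df1       # mean square between groups
mse <- sse / df2       # mean square error (within groups)

f_stat  <- msg / mse
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(c(df1 = df1, df2 = df2, F = f_stat, p = p_value))
```

The same F value should come out of summary(aov(values ~ group)), which is a handy cross-check.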

f_check <- function () {
    print("ANOVA CONDITIONS:")
    print("1) Observations are independent within and across groups --> random sample of 10% or less of the population")
    print("2) Data within each group are nearly normal --> normal probability plto for each group")
    print("3) Variability across the groups is about equal --> compare sds/vars of the different groups")
}

Geometric distribution

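The number of independent Bernoulli trials needed to get the first success: the first success occurs on the nth trial with probability (1-p)^(n-1) * p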
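A small sketch of the geometric probability, assuming a made-up success probability p = 0.2 and asking for the first success on the 3rd trial. Note that R's dgeom() counts the failures before the first success, so it takes n - 1 rather than n.

```r
p <- 0.2   # hypothetical probability of success on each trial
n <- 3     # we want the first success to occur on trial n

# By hand: n - 1 failures followed by one success
by_hand <- (1 - p)^(n - 1) * p

# Built-in: dgeom() is parameterised by the number of failures
built_in <- dgeom(n - 1, prob = p)

print(by_hand)
print(built_in)
```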

Binomial distribution

Exactly k successes in n trials

Check whether the distribution is binomial

binom_check <- function () {
    print("1) The trials are independent")
    print("2) The number of trials *n* is fixed")
    print("3) Each trial outcome can be classified as a *success* or *failure*")
    print("4) The probability of success *p* is the same for each trial")
}

Obtain the probability of k successes in n trials at probability p:

binom_probability <- function (n=0,k=0,p=0) {
    if (sum(c(n,k,p))==0) {
        print("You need to specify n trials, k successes and p probability")
        print("FORMULA : choose(n,k)*p^k*(1-p)^(n-k)")
    } else {
        choose(n,k)*p^k*(1-p)^(n-k)
    }
}
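For comparison, base R already ships this formula as dbinom(). A minimal sketch with made-up values (n = 10, k = 3, p = 0.5) shows that the hand-written formula and the built-in agree, and how pbinom() differs by accumulating everything up to k.

```r
n <- 10    # hypothetical number of trials
k <- 3     # hypothetical number of successes
p <- 0.5   # hypothetical probability of success

# Same formula as in the function above
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

# Built-in: probability of exactly k successes
built_in <- dbinom(k, size = n, prob = p)

# Built-in: probability of k or fewer successes
at_most_k <- pbinom(k, size = n, prob = p)

print(c(by_hand = by_hand, dbinom = built_in, pbinom = at_most_k))
```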

Obtain the mean of a binomial distribution:

binom_mean <- function (n=0,p=0) {
    if (sum(c(n,p))==0) {
        print("You need to specify n trials and p probability.")
        print("FORMULA : Mean = n * p")
    } else {
        print(n*p)
    }
}

Obtain the standard deviation of a binomial distribution:

binom_sd <- function (n=0,p=0) {
    if (sum(c(n,p))==0) {
        print("You need to specify n trials and p probability.")
        print("FORMULA : sigma = sqrt(n*p*(1-p))")
    } else {
        print(sqrt(n*p*(1-p)))
    }
}
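One way to convince yourself that the two shortcut formulas are right is to compute the mean and standard deviation directly from the probability of every possible outcome. This is only a sketch, with n = 20 and p = 0.3 picked arbitrarily.

```r
n <- 20    # hypothetical number of trials
p <- 0.3   # hypothetical probability of success

k   <- 0:n                             # every possible number of successes
pmf <- dbinom(k, size = n, prob = p)   # probability of each outcome

# Mean and standard deviation computed from the definition
mean_from_pmf <- sum(k * pmf)
sd_from_pmf   <- sqrt(sum((k - mean_from_pmf)^2 * pmf))

# Shortcut formulas
mean_formula <- n * p
sd_formula   <- sqrt(n * p * (1 - p))

print(c(mean_from_pmf, mean_formula))
print(c(sd_from_pmf, sd_formula))
```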

Negative binomial distribution

Check whether the distribution is negative binomial

nbinom_check <- function () {
    print("1) The trials are independent")
    print("2) Each trial outcome can be classified as a *success* or *failure*")
    print("3) The probability of success *p* is the same for each trial")
    print("4) The last trial is a success")
}

Obtain the probability that the kth success occurs on the nth trial, at probability p:

nbinom_probability <- function (n=0,k=0,p=0) {
    if (sum(c(n,k,p))==0) {
        print("You need to specify k successes at the nth trial, at p probability")
        print("FORMULA : choose(n-1,k-1)*p^k*(1-p)^(n-k)")
    } else {
        print(choose(n-1,k-1)*p^k*(1-p)^(n-k))
    }
}
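Base R has dnbinom() for this, but its parameterisation is easy to trip over: it asks for the number of failures before the kth success, which is n - k here. A minimal sketch with made-up values (3rd success on the 6th trial, p = 0.5):

```r
n <- 6     # hypothetical trial on which the kth success occurs
k <- 3     # hypothetical number of successes
p <- 0.5   # hypothetical probability of success

# Same formula as in the function above
by_hand <- choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)

# Built-in: x is the number of failures (n - k), size is the number of successes (k)
built_in <- dnbinom(n - k, size = k, prob = p)

print(by_hand)
print(built_in)
```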

Poisson distribution

The Poisson distribution is useful for estimating the number of events in a large population over a unit of time.

Check whether it is a Poisson distribution:

pois_check <- function () {
    print("1) We are looking for the number of events (=successes)")
    print("2) The population is large")
    print("3) Events occur independently from each other")
}

Obtain Poisson probability:

pois_probability <- function (lambda=0,k=0) {
    if (sum(c(lambda,k))==0) {
        print("You need to specify lambda and k successes")
        print("Note the difference between pois_probability and ppois! pois_probability gives the probability of EXACTLY k successes (the same value as dpois), whereas ppois gives the probability of k or fewer successes (a cumulative chunk of the distribution)")
        print("FORMULA : ((lambda^k)*exp(-lambda))/factorial(k)")
    } else {
        print(((lambda^k)*exp(-lambda))/factorial(k))
    }
}
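The difference between "exactly k" and "k or fewer" is easiest to see side by side. A small sketch, assuming a made-up rate of lambda = 2 events per unit of time and k = 3:

```r
lambda <- 2   # hypothetical average number of events per unit of time
k      <- 3   # hypothetical number of events we are asking about

# By hand: probability of exactly k events
by_hand <- lambda^k * exp(-lambda) / factorial(k)

# Built-in: dpois() is the "exactly k" probability
exactly_k <- dpois(k, lambda)

# Built-in: ppois() accumulates the probabilities of 0, 1, ..., k events
at_most_k <- ppois(k, lambda)

print(c(by_hand = by_hand, dpois = exactly_k, ppois = at_most_k))
```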