
Logistic Regression

If you see an error in the article, please comment or drop me an email.

Logistic regression is a generalized linear model where the outcome is a categorical variable. Logistic regression can be binomial (when the dependent variable is binary), ordinal (when the categories are ordered) or multinomial (when there are more than two categories).

Binary Generalized Linear Models

Binary Generalized Linear Models come from trying to model outcomes that can take only two values. These binary outcomes are called Bernoulli outcomes. In binary logistic regression there is one binary dependent variable and one or more independent variables.

If we aggregate several Bernoulli outcomes into a count of 1’s, we have binomial data. For instance, the number of heads in a collection of coin flips with a constant probability of success is a binomial random variable.

Why use a transformation?

\(P_i = b_0 + b_1 * x1_i\)

If there were no transformation for this equation, the left hand side could only take values between 0 and 1 (it is a probability of success), whereas the right hand side could take any value outside of this range.

We want a transformation that makes the range of possibilities on the left hand side of the equation equal to the range of possibilities on the right hand side.

For the binomial family, we can use the logit, probit or cloglog as link functions. Example in R: glm(formula, family = binomial(link = "probit"))
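As a sketch, here is how each of the three links would be specified in R (df, y and x are placeholder names, not from a real dataset):

fit_logit   <- glm(y ~ x, data = df, family = binomial(link = "logit"))
fit_probit  <- glm(y ~ x, data = df, family = binomial(link = "probit"))
fit_cloglog <- glm(y ~ x, data = df, family = binomial(link = "cloglog"))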

Example of binomial GLM: Ravens wins

Source: Johns Hopkins Data Science Specialization on Coursera

\[ RW_i = b_0 + b_1 RS_i + e_i \]

  • \(RW_i\) – 1 if the Ravens win, 0 if not

  • \(RS_i\) – Number of points Ravens scored

  • \(b_0\) – probability of a Ravens win if they score 0 points

  • \(b_1\) – increase in probability of a Ravens win for each additional point

  • \(e_i\) – residual variation (everything affecting the outcome other than the points scored)

Fitting a model for the odds

Here we are trying to fit a linear model to binary data. This is not recommended, as a linear model assumes Gaussian errors, which binary outcomes cannot satisfy. Rather than modelling the wins directly, it is more natural to model the odds.
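A quick sketch of the problem with made-up data (x, y and lin are just illustrative names):

x <- 1:20                                # made-up predictor
y <- as.numeric(x > 10)                  # made-up binary outcome
lin <- lm(y ~ x)                         # naive linear fit to 0/1 data
predict(lin, data.frame(x = c(0, 25)))   # predictions fall below 0 and above 1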

How we build the generalized linear model:

  • Probability: P = \(\frac{odds}{(1 + odds)}\)

  • Odds: Odds = \(\frac{probability}{1 - probability}\)

  • Binary Outcome 0/1: \(RW_i\)

  • Probability (0,1): \(\rm{Pr}(RW_i | RS_i, b_0, b_1 )\) The success probability would differ from game to game, depending on how many points the Ravens score.

  • Odds \((0,\infty)\): \(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\)

  • Log odds \((-\infty,\infty)\): \(\log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right)\)
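As a quick numeric check of these transformations: a success probability of \(p = 0.75\) corresponds to odds of \(0.75 / 0.25 = 3\) and log odds of \(\log(3) \approx 1.10\), while \(p = 0.5\) gives odds of \(1\) and log odds of \(0\).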

How the model works:

Outcome depends on a probability distribution.

The probability distribution depends on a success probability.

The success probability depends on the regressors.
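In symbols (with \(p_i\) denoting the success probability for game \(i\)):

\[RW_i \sim \mathrm{Bernoulli}(p_i), \qquad \log\left(\frac{p_i}{1-p_i}\right) = b_0 + b_1 RS_i\]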

How to read our model:

\[\log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i\]

  • \(b_0\) – Log odds of a Ravens win if they score zero points

  • \(b_1\) – Log odds ratio of win probability for each point scored (compared to zero points)

  • \(\exp(b_1)\) – Odds ratio of win probability for each point scored (compared to zero points)

If the Ravens score zero points, we get \(b_0\), which is the log odds of a Ravens win when they score 0 points.

Therefore, \(\frac{e^{b_0}}{1 + e^{b_0}}\) is the probability that the Ravens win if they score 0 points.

Similarly, \(\frac{e^{b_0 + b_1 X}}{1 + e^{b_0 + b_1 X}}\) is the probability that the Ravens win if they score X points.

How to read \(b_1\):

\((b_0 + b_1 (RS_i + 1)) - (b_0 + b_1 RS_i) = b_1\)

Therefore \(b_1\) is the increase or decrease in the log odds of a Ravens win associated with a one unit increase in the regression variable (in this case the score).
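Exponentiating this difference turns it into a ratio of odds, which is why \(\exp(b_1)\) is read as an odds ratio:

\[\frac{\rm{odds}(RS_i + 1)}{\rm{odds}(RS_i)} = \frac{e^{b_0 + b_1 (RS_i + 1)}}{e^{b_0 + b_1 RS_i}} = e^{b_1}\]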

Using the glm() function

logdata <- glm(Y ~ X, data = data, family = "binomial")
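As a self-contained sketch, here is the same call on simulated data (the scores and wins below are made up, with coefficients chosen to roughly match the numbers discussed further down; they are not the real Ravens data):

set.seed(42)
score <- rpois(100, lambda = 23)             # hypothetical points scored per game
p_win <- plogis(-1.8 + 0.11 * score)         # assumed true win probability
win <- rbinom(100, size = 1, prob = p_win)   # simulated 0/1 win outcomes
logdata <- glm(win ~ score, family = "binomial")
summary(logdata)$coefficients                # estimates on the log odds scale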

Interpreting the outcome of glm()

If this is the model (using logit as a link):

\(\log\frac{p_i}{1-p_i} = -2.12 - 1.81 X_1\)

This is how to read the probability:

\(p_i = \frac{e^{-2.12 - 1.81 X_1}}{1 + e^{-2.12 - 1.81 X_1}}\)
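In R, the back-transformation for a chosen predictor value can be computed directly (x1 <- 1 is just an illustrative value):

x1 <- 1                            # illustrative predictor value
eta <- -2.12 - 1.81 * x1           # linear predictor on the log odds scale
exp(eta) / (1 + exp(eta))          # back-transformed probability; same as plogis(eta)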

Example of interpretation:

Always exponentiate the results!

exp(logdata$coeff) = intercept 0.1648, slope 1.1125

There is an 11% increase in the odds of winning for each additional point scored.

On the logit scale, check whether the coefficients are close to 0 (no effect).

On the exponentiated scale, check whether they are close to 1 (an odds ratio of 1 means no effect).

To get the confidence interval

exp(confint(logdata))

If the confidence interval of the exponentiated slope contains 1 (for example, its lower boundary is below 1), the coefficient is not significant (we cannot conclude that scoring determines winning).

Interpreting the odds ratios

  • Odds ratio of 1 = no difference in odds (0 for the log odds ratio)

  • Odds ratio between .5 and 2 is commonly a “moderate effect”

  • Relative risk (the ratio of two probabilities) is often easier to interpret, yet harder to estimate. Although they are often numerically close, relative risk and odds ratio are not the same! See the small example below.
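A quick made-up illustration in R (the probabilities p1 and p2 are arbitrary):

p1 <- 0.10                            # probability of the event in group 1
p2 <- 0.05                            # probability of the event in group 2
p1 / p2                               # relative risk: 2.0
(p1 / (1 - p1)) / (p2 / (1 - p2))     # odds ratio: about 2.11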

Example of spam e-mails

Source: OpenIntro Statistics, chapter 8.4

\(Y_i\) is the outcome variable (index i)

\(x1_i\) is the value of variable 1 for observation i

In our spam example, there are 10 predictor variables (k = 10)
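Putting these together, the multiple logistic regression model relates the log odds of the outcome to all \(k\) predictors (with \(p_i\) the probability that e-mail \(i\) is spam):

\[\log\left(\frac{p_i}{1-p_i}\right) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_k x_{k,i}\]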

Diagnostics for logistic regression

There are two key conditions for fitting a logistic regression model:

  1. Each predictor is linearly related to logit(p_i) if all other predictors are held constant.

  2. Each outcome is independent of the other outcomes.

The first condition can be checked by plotting success/failure on the y-axis against the predicted probability of success on the x-axis. If the model fits well, outcomes with a high predicted probability of success should cluster around observed successes, as shown on p. 374 of the OpenIntro Stats book (https://www.openintro.org/download.php?file=os2_08&referrer=coursera2014Spring).
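A minimal sketch of this check in R, assuming the fitted model logdata and the 0/1 outcome vector win from the simulated example above:

pred <- predict(logdata, type = "response")   # predicted probability of success
plot(pred, jitter(win, amount = 0.05),
     xlab = "Predicted probability of success",
     ylab = "Observed outcome (0/1, jittered)")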