I am not an engineer

Poisson regression

If you see an error in the article, please comment or drop me an email.


In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables.

Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters.

A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

For what kind of data do we use Poisson GLM?

  • unbounded count data: for example, the number of calls to a call center or the number of flu cases in an area or the number of hits to a web site.

  • bounded counts with unknown upper limit: in some cases the counts are clearly bounded. However, the counts are often modeled as unbounded when the upper limit is unknown or very large relative to the number of events.

  • proportions or rates: count data where the upper bound is known. Examples: percentage of children passing a test; percentage of hits to a website from a given country; incident rates of industrial assets.

  • Contingency table data: table of counts. Examples: a table with variables male/female and eye colour.

In a Poisson model, the rate \(\lambda\) can depend on time or on other covariates. To model this dependence, you use Poisson regression.

Difference between Poisson GLM and Linear Regression

Linear regression models the outcome on an additive scale, whereas a Poisson GLM models it on a relative (multiplicative) scale.

Poisson GLM is about getting relative interpretations from a linear model.

When you take the natural log of the outcome in a linear regression, the exponentiated coefficients are interpretable with respect to geometric means. For example, \(e^{b_0}\) is the estimated geometric mean of the outcome on day zero.

In the formulas below, \(NH_i\) is the number of hits and \(JD_i\) is the day.

Linear regression: \(NH_i = b_0 + b_1 JD_i + e_i\)

Poisson / log-linear regression: \(\log\left(E[NH_i | JD_i, b_0, b_1]\right) = b_0 + b_1 JD_i\)
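To make the additive versus relative reading concrete, here is a minimal sketch in R on simulated data. The variable names jd and nh and the true coefficients (1 and 0.02) are assumptions chosen for illustration, not values from a real data set. In the Poisson model, \(E[NH_i | JD_i = d + 1] / E[NH_i | JD_i = d] = e^{b_1}\), so the exponentiated slope is a multiplicative change per day, while the slope of the linear model is an additive change per day.

```r
# Simulated data: names and true coefficients are illustrative assumptions.
set.seed(42)
jd <- 0:99                                    # day index
nh <- rpois(length(jd), exp(1 + 0.02 * jd))   # counts whose log-mean is linear in jd

fit_lm  <- lm(nh ~ jd)                        # additive scale
fit_glm <- glm(nh ~ jd, family = "poisson")   # relative (log) scale

coef(fit_lm)        # slope: expected additional hits per extra day
exp(coef(fit_glm))  # exp(slope): factor by which expected hits change per extra day
```

With these simulated data the exponentiated Poisson slope should land near exp(0.02), that is, roughly a 2% increase in expected hits per day.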

Reminder of the Poisson characteristics

  • \(X \sim Poisson(t\lambda)\) if \(P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!}\) for \(x = 0, 1, \ldots\).

  • The mean of the Poisson is \(E[X] = t\lambda\), thus \(E[X / t] = \lambda\)

  • The variance of the Poisson is equal to the mean: \(Var(X) = t\lambda\).

  • The Poisson tends to a normal distribution as \(t\lambda\) gets large.
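These properties can be checked numerically. The sketch below uses arbitrary illustrative values for t and \(\lambda\) (the names t_units and lambda are assumptions), compares the formula for \(P(X = x)\) with R's built-in dpois(), and then compares the sample mean and variance of simulated draws.

```r
# Illustrative values, not taken from any data set.
t_units <- 2
lambda  <- 3
x       <- 0:5

# Probability mass function written out by hand, as in the formula above.
manual <- (t_units * lambda)^x * exp(-t_units * lambda) / factorial(x)
all.equal(manual, dpois(x, t_units * lambda))   # should agree with dpois()

# Mean and variance of simulated draws should both be close to t * lambda.
set.seed(1)
draws <- rpois(1e5, t_units * lambda)
mean(draws)
var(draws)
```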

Example

If we have the number of hits to a webpage per day, we set t to 1 to keep a per-day rate. For a per-hour rate we set t to 24, and for a per-minute rate we set t to 24*60.
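As a small illustration of \(E[X / t] = \lambda\), the sketch below simulates one year of daily hit counts and rescales the estimated rate. The true rate of 240 hits per day and the variable names are assumptions made for the example.

```r
# Simulated daily counts; 240 hits per day is an assumed true rate.
set.seed(7)
hits_per_day <- rpois(365, lambda = 240)

mean_daily <- mean(hits_per_day)             # estimates t * lambda for one day

rate_per_day    <- mean_daily / 1            # t = 1
rate_per_hour   <- mean_daily / 24           # t = 24
rate_per_minute <- mean_daily / (24 * 60)    # t = 24 * 60

c(day = rate_per_day, hour = rate_per_hour, minute = rate_per_minute)
```

The counts are the same in all three cases. Only the unit in which \(\lambda\) is expressed changes.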

Rates

A Poisson process generates counts, and counts are whole numbers: 0, 1, 2, 3, etc. A proportion is a fraction. So how can a Poisson model handle a proportion? The trick is to include the denominator of the fraction, or more precisely its log, as an offset. For instance, suppose we are analyzing hits to a webpage from a specific source (NHSS). The log of the total number of hits (NH), which includes those from the specific source, would be included in the model as an offset.

Expected outcome for a part over a total (NH).

\(E[NHSS_i | JD_i, b_0, b_1]/NH_i = \exp\left(b_0 + b_1 JD_i\right)\)

\(\log\left(E[NHSS_i | JD_i, b_0, b_1]\right) - \log(NH_i) = b_0 + b_1 JD_i\)

\(\log\left(E[NHSS_i | JD_i, b_0, b_1]\right) = \log(NH_i) + b_0 + b_1 JD_i\)

Fitting rates in R

The coefficient of the offset has to be fixed at 1. The offset argument of glm() has precisely this effect. To model the proportion of visits from a specific source, we set offset = log(NH + 1), where NH is the total number of hits (the variable total in the call below):

glm(Y ~ X, offset = log(total + 1), family = "poisson", data = dataset)

Adding 1 to the total guarantees that the argument of the logarithm is never 0, since log(0) is not finite.
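A runnable sketch of this call on simulated data follows. The data frame dataset, its columns Y, X and total, and the true coefficients (-2 and 0.05) are all assumptions made for the example. It also fits the same model with offset() written inside the formula, which is an alternative way to fix the coefficient at 1.

```r
# Simulated data: all names and parameter values are illustrative assumptions.
set.seed(123)
n     <- 200
x     <- runif(n, 0, 10)                              # hypothetical predictor
total <- rpois(n, 500)                                # total hits per observation
y     <- rpois(n, (total + 1) * exp(-2 + 0.05 * x))   # hits from the specific source
dataset <- data.frame(Y = y, X = x, total = total)

# Offset passed through the offset argument.
fit_arg <- glm(Y ~ X, offset = log(total + 1), family = "poisson", data = dataset)

# Same model with the offset written as a term in the formula.
fit_term <- glm(Y ~ X + offset(log(total + 1)), family = "poisson", data = dataset)

all.equal(coef(fit_arg), coef(fit_term))   # the two fits should agree
exp(coef(fit_arg))                         # baseline proportion and relative change per unit of X
```

Because of the offset, the exponentiated intercept reads as a proportion of the total rather than as a raw count, and the exponentiated slope as the relative change in that proportion per unit of X.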