
# Model Selection

If you see an error in the article, please comment or drop me an email.

Scott Zeger: “a model is a lens through which to look at your data”.

George Box: “All models are wrong, but some are useful.”

# Collinearity and parsimony

Collinearity: a high correlation between two predictor variables, such that the two variables contribute redundant information to the model. This is something we want to avoid in multiple linear regression.

Two predictor variables are said to be collinear when they are correlated with each other. Predictors are also called independent variables, and they should indeed be independent of each other.

While it is impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent predictors from being collinear.
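As a quick illustration (using the built-in swiss dataset, which also appears later in this article), pairwise correlations between candidate predictors can be checked with cor():

```r
#Pairwise correlations among some predictors of the swiss dataset
data(swiss)
predictors <- swiss[, c("Agriculture", "Examination", "Education")]
round(cor(predictors), 2)
#Examination and Education are strongly correlated (about 0.7),
#so including both contributes partly redundant information
```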

Model selection: identification of the best model for predicting a given response variable. We usually prefer simpler (parsimonious) models over more complicated ones:

• Among competing hypotheses, the one with the fewest assumptions should be selected

• The fewer predictors, the better

A full model is a model with all explanatory variables included as predictors.

# Bias and significance

## Bias

In model selection, bias arises from the choice of including one regressor rather than another. The fewer regressors are included, the higher the potential bias. However, bias might be justified.

• What happens if you omit an important regressor? – Bias increases and the estimates will be incorrect. This has a negative impact on the regression model fit.

• What happens if you include unimportant regressors? – Bias is eliminated, but the variance (the standard errors of the coefficient estimates) inflates.

• In what sense does R squared provide limited insight into which regressors to include? – R squared monotonically increases as more regressors are included, while SSE (sum of squares of unexplained variability) decreases.
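This point can be sketched with a small simulated example (the variables here are illustrative, not from the article's data):

```r
#Adding a pure-noise regressor never lowers R squared
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)  #response depends on x only
noise <- rnorm(n)  #unrelated regressor
fit_small <- lm(y ~ x)
fit_big <- lm(y ~ x + noise)
#always TRUE for nested OLS models with an intercept
summary(fit_small)$r.squared <= summary(fit_big)$r.squared
#adjusted R squared penalizes the extra term and may decrease instead
summary(fit_small)$adj.r.squared
summary(fit_big)$adj.r.squared
```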

## Significance of a model

The significance of a model as a whole is assessed using an F-Test.

The p-values associated with each predictor are conditional on other variables being included in the model, so they can be used to assess if a given predictor is significant, given that all others are in the model.
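For instance, with the swiss dataset used later in this article, summary() reports both the overall F-statistic and the conditional p-value of each predictor (the model chosen here is only illustrative):

```r
data(swiss)
fit <- lm(Fertility ~ Agriculture + Education, data = swiss)
s <- summary(fit)
s$fstatistic  #F-test of the model as a whole (value, numdf, dendf)
s$coefficients[, "Pr(>|t|)"]  #p-value of each predictor, given the others
```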

# The backward model selection

The general idea behind backward-selection is to start with the full model and eliminate one variable at a time until the ideal model is reached.

### The p-value method

1. Start with the full model (all candidate predictors included).

2. Drop the variable with the highest p-value and refit the model. (Note that we cannot drop individual levels of a categorical variable – or indicator variables. If at least one level is significant, we keep the whole variable.)

3. Repeat until all remaining variables are significant.

### The adjusted R2 method

1. Start with the full model (all candidate predictors included).

2. Refit all possible models omitting one variable at a time, and choose the model with the highest adjusted R2.

3. Repeat until the maximum possible adjusted R2 is reached.
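The p-value variant can be sketched as a short loop on the swiss dataset (the 0.05 cutoff is an arbitrary choice; swiss has no categorical predictors, so the indicator-variable caveat does not arise here):

```r
#Backward elimination by p-value on the swiss dataset
data(swiss)
fit <- lm(Fertility ~ ., data = swiss)  #start with the full model
alpha <- 0.05
repeat {
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  #skip the intercept
  if (max(pvals) < alpha) break  #all remaining predictors are significant
  worst <- names(which.max(pvals))  #drop the variable with the highest p-value
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
names(coef(fit))  #Examination is the only variable eliminated
```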

### Why choose R2 over the P-value method?

The adjusted R2 method aims at finding reliable predictions.

The p-value method aims at finding statistically significant predictors, and it depends on an arbitrary significance level. It is nonetheless used more commonly, since it requires fitting fewer models.

# The forward model selection

The general idea behind forward-selection is to start with only one variable and add one variable at a time until the ideal model is reached.

### The P-value method

1. Try all possible simple linear regression models predicting y using one explanatory variable at a time. Choose the model where the explanatory variable of choice has the lowest p-value.

2. Try all possible models adding one more explanatory variable at a time, and choose the model where the added explanatory variable has the lowest p-value.

3. Repeat until all added variables are significant.

### The adjusted R2 method

1. Try all possible simple linear regression models predicting y using one explanatory variable at a time. Choose the model with the highest adjusted R2.

2. Try all possible models adding one more explanatory variable at a time, and choose the model with the highest adjusted R2.

3. Repeat until maximum possible adjusted R2 is reached.
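The adjusted R2 variant can be sketched as a greedy loop on the swiss dataset (one addition per step, stopping when no addition improves adjusted R2):

```r
#Forward selection by adjusted R2 on the swiss dataset
data(swiss)
response <- "Fertility"
candidates <- setdiff(names(swiss), response)
selected <- character(0)
best_adj <- -Inf
repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break  #no variables left to try
  adj <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response)
    summary(lm(f, data = swiss))$adj.r.squared
  })
  if (max(adj) <= best_adj) break  #no improvement, stop
  best_adj <- max(adj)
  selected <- c(selected, names(which.max(adj)))
}
selected  #variables in order of inclusion
best_adj
```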

The adjusted R2 method is more computationally intensive, but it is more reliable, since it doesn’t depend on an arbitrary significance level.

### Different results for backward and forward selection

• There is no guarantee that backward elimination and forward selection will arrive at the same final model.

# Variance inflation

Linear regression minimizes the squared difference between predicted and actual observations, i.e., it minimizes the variance of the residuals. If an additional predictor significantly reduces the residual variance, the predictor is deemed important. Note that in generalized linear regression, we use deviance (log likelihoods) in place of variance.

• Adding a random unimportant regressor slightly decreases the deviance and reduces the residual degrees of freedom, while inflating the variance of the other estimates. You can measure variance inflation with the vif() function. Analysis of variance (ANOVA) between two nested models is a useful way to quantify the significance of additional regressors.

• Omitting an important regressor adds bias to the model.

This simulation shows how the number of unimportant variables (x2 and x3) included in a model affects the estimated coefficient of an important variable (x1).

The variance does not necessarily go up when including a regressor, which is why simulation is appropriate here: the high number of simulations evens out cases where the variance decreases due to chance.

```r
#Create data
n <- 100
nosim <- 1000
x1 <- rnorm(n)  #regressor 1
x2 <- rnorm(n)  #regressor 2
x3 <- rnorm(n)  #regressor 3

betas <- sapply(1:nosim, function(i) {  #1000 simulations
  y <- x1 + rnorm(n, sd = .3)           #response variable = x1 plus noise
  c(coef(lm(y ~ x1))[2],                #estimate of x1 with 1 regressor
    coef(lm(y ~ x1 + x2))[2],           #estimate of x1 with 2 regressors
    coef(lm(y ~ x1 + x2 + x3))[2])      #estimate of x1 with 3 regressors
})

round(apply(betas, 1, sd), 5)  #variability of x1 estimates per number of regressors
##      x1      x1      x1
## 0.03537 0.03534 0.03539
#Source: https://github.com/DataScienceSpecialization/courses
```

Results:

• As we can see, the variance increases only negligibly when the added regressors are unrelated to x1.

In this simulation, we see what happens when regressors correlated with x1 (x2 and x3) are added to a model.

```r
#Create data
n <- 100
nosim <- 1000
x1 <- rnorm(n)                                   #random regressor 1
x2 <- x1 / sqrt(2) + rnorm(n) / sqrt(2)          #x2 dependent on x1
x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2)    #x3 heavily dependent on x1

betas <- sapply(1:nosim, function(i) {  #1000 simulations
  y <- x1 + rnorm(n, sd = .3)           #response variable = x1 plus noise
  c(coef(lm(y ~ x1))[2],                #estimate of x1 with 1 regressor
    coef(lm(y ~ x1 + x2))[2],           #estimate of x1 with 2 regressors
    coef(lm(y ~ x1 + x2 + x3))[2])      #estimate of x1 with 3 regressors
})

round(apply(betas, 1, sd), 5)  #variability of x1 estimates per number of regressors
##      x1      x1      x1
## 0.02940 0.04452 0.09763
#Source: https://github.com/DataScienceSpecialization/courses
```

Results:

• Adding a correlated regressor to a model inflates the variance significantly.

### Variance inflation factors – VIF

We cannot know the true residual standard deviation sigma, but we can estimate the increase in the actual standard error of the coefficients caused by including a regressor.

The variance inflation factor (VIF) is the increase in the variance of the estimate for the ith regressor compared to the ideal setting where it is orthogonal (uncorrelated) to the other regressors.

```r
library(car)
fit <- lm(Fertility ~ . , data = swiss)
vif(fit)  #variance inflation factor per regressor
##      Agriculture      Examination        Education         Catholic
##         2.284129         3.675420         2.774943         1.937160
## Infant.Mortality
##         1.107542
sqrt(vif(fit))  #apply square root for the inflation of the standard error
##      Agriculture      Examination        Education         Catholic
##         1.511334         1.917138         1.665816         1.391819
## Infant.Mortality
##         1.052398
```

How to read the results of the vif function:

The variance in the estimated coefficient of Education is 2.774943 times what it would have been if Education were not correlated with the other regressors.

# Nested Model Selection

```r
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit1, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
##     Infant.Mortality
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     45 6283.1
## 2     43 3180.9  2    3102.2 30.211 8.638e-09 ***
## 3     41 2105.0  2    1075.9 10.477 0.0002111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

### Variance in generalized linear models

The anova table shows, for each addition, the change in residual degrees of freedom and the drop in residual sum of squares, as well as the F-statistic and p-value of each model comparison.

The F statistic is a ratio of mean squares: the reduction in residual sum of squares per degree of freedom, divided by an estimate of the residual variance.

In generalized linear regression, we use deviance (log likelihoods) in place of variance, as shown by the anova table. RSS stands for residual sum of squares, of which deviance is a generalization; for a linear model the two coincide. You can also use the deviance() function to calculate the residual sum of squares.

```r
deviance(fit1)
## [1] 6283.116
deviance(fit3)
## [1] 3180.925
deviance(fit5)
## [1] 2105.043
```

Using deviance() you can imitate the anova F statistic:

```r
#difference of deviance over difference of degrees of freedom
numerator <- (deviance(fit1) - deviance(fit3)) / (fit1$df.residual - fit3$df.residual)
#mean residual square of the larger model
denominator <- deviance(fit3) / fit3$df.residual
numerator / denominator  #F statistic for fit1 vs. fit3
## [1] 20.96783
```

(This value differs from the 30.211 in the anova table above because anova() estimates the residual variance from the largest model, fit5, rather than from fit3.)

Analysis of variance is sensitive to its assumption that model residuals are approximately normal. If they are not, we could get a small p-value for that reason alone. It is thus worth testing the residuals for normality. The Shapiro-Wilk test (shapiro.test()) is quick and easy in R; normality is its null hypothesis.

```r
shapiro.test(fit1$residuals)
##
##  Shapiro-Wilk normality test
##
## data:  fit1$residuals
## W = 0.98048, p-value = 0.6121
```

A p-value below the significance level (here .05) would lead us to reject normality, which is the null hypothesis of the Shapiro-Wilk test. With p = 0.6121, we find no evidence against normality.