Model Selection
If you see an error in the article, please comment or drop me an email.
Scott Zeger: “a model is a lens through which to look at your data”.
George Box: “All models are wrong, but some are useful.”
Collinearity and parsimony
Collinearity: a high correlation between two independent variables, such that the two variables contribute redundant information to the model. This is something we want to avoid in multiple linear regression.
Two predictor variables are said to be collinear when they are correlated with each other. Predictors are, after all, also called independent variables, so ideally they should be independent of each other.
While it is impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent predictors from being collinear.
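As an illustration (a sketch using the swiss dataset, which appears again later in these notes), a quick look at the correlation matrix of the predictors reveals collinearity:

```r
#Pairwise correlations among the predictors of the swiss dataset
#(column 1, Fertility, is the response and is excluded)
round(cor(swiss[, -1]), 2)
#Examination and Education are strongly correlated (about 0.7),
#so they carry partly redundant information
```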
Model selection: identification of the best model for predicting a given response variable. We usually prefer simpler (parsimonious) models over more complicated ones:

Among competing hypotheses, the one with the fewest assumptions should be selected

The fewer predictors, the better
A full model is a model with all explanatory variables included as predictors.
Bias and significance
Bias
In model selection, bias results from including one regressor rather than another. The fewer regressors are included, the higher the bias. However, some bias might be justified.

What happens if you omit an important regressor? – Bias increases and estimates will be incorrect. This has a negative impact on the regression model fit.

What happens if you include unimportant regressors? – Bias is eliminated, but the variance (the standard error of all regressors) inflates.

In what sense does R squared provide limited insight into which regressors to include? – R squared monotonically increases as more regressors are included, while SSE (the sum of squares of unexplained variability) decreases.
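A minimal simulation illustrates the point: adding even a pure-noise regressor never decreases R squared.

```r
set.seed(42)
y    <- rnorm(50)   #response, pure noise
x1   <- rnorm(50)   #a candidate regressor
junk <- rnorm(50)   #an unrelated, unimportant regressor
r2_small <- summary(lm(y ~ x1))$r.squared
r2_big   <- summary(lm(y ~ x1 + junk))$r.squared
r2_big >= r2_small  #always TRUE: R squared cannot decrease
```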
Significance of a model
The significance of a model as a whole is assessed using an F-test.
The p-values associated with each predictor are conditional on the other variables being included in the model, so they can be used to assess whether a given predictor is significant, given that all others are in the model.
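For instance (a sketch using the swiss dataset from the later sections), the overall F statistic is reported by summary(), and its p-value can be computed with pf():

```r
fit <- lm(Fertility ~ ., data = swiss)   #full model
fstat <- summary(fit)$fstatistic         #F value and its two degrees of freedom
pval <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
pval                                     #significance of the model as a whole
```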
The backward model selection
The general idea behind backward selection is to start with the full model and eliminate one variable at a time until the ideal model is reached.
The p-value method

Start with the full model.

Drop the variable with the highest p-value and refit the model. (Note that we cannot drop individual levels of a categorical variable – its indicator variables. If at least one level is significant, we keep the whole variable.)

Repeat until all remaining variables are significant.
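The loop above can be sketched in R as follows, assuming a 0.05 significance level and the swiss dataset (which has no categorical predictors, so dropping whole terms is safe here):

```r
fit <- lm(Fertility ~ ., data = swiss)                    #start with the full model
repeat {
  p <- summary(fit)$coefficients[-1, "Pr(>|t|)"]          #p-values, intercept excluded
  if (all(p < 0.05)) break                                #stop: all remaining are significant
  worst <- names(which.max(p))                            #variable with the highest p-value
  fit <- update(fit, as.formula(paste(". ~ . -", worst))) #drop it and refit
}
attr(terms(fit), "term.labels")                           #surviving predictors
```

On the swiss data this drops Examination, the only non-significant predictor in the full model.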
The adjusted R2 method

Start with the full model.

Refit all possible models omitting one variable at a time, and choose the model with the highest adjusted R2.

Repeat until maximum possible adjusted R2 is reached.
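One elimination step of this procedure might be sketched as follows (again with the swiss data); the step would be repeated as long as some reduced model beats the current adjusted R2:

```r
fit <- lm(Fertility ~ ., data = swiss)                    #current (here: full) model
current_adj <- summary(fit)$adj.r.squared
vars <- attr(terms(fit), "term.labels")
#adjusted R2 of each candidate model omitting one variable
adj <- sapply(vars, function(v)
  summary(update(fit, as.formula(paste(". ~ . -", v))))$adj.r.squared)
adj             #compare each candidate against current_adj
```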
Why choose the adjusted R2 method over the p-value method?
The adjusted R2 method aims at finding reliable predictions.
The p-value method aims at finding statistically significant predictors, and it depends on the chosen significance level. It is used more commonly since it requires fitting fewer models.
The forward model selection
The general idea behind forward selection is to start with only one variable and add one variable at a time until the ideal model is reached.
The p-value method

Try all possible simple linear regression models predicting y using one explanatory variable at a time. Choose the model whose explanatory variable has the lowest p-value.

Try all possible models adding one more explanatory variable at a time, and choose the model where the added explanatory variable has the lowest p-value.

Repeat until all added variables are significant.
Adjusted R2 method

Try all possible simple linear regression models predicting y using one explanatory variable at a time. Choose the model with the highest adjusted R2.

Try all possible models adding one more explanatory variable at a time, and choose the model with the highest adjusted R2.

Repeat until maximum possible adjusted R2 is reached.
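The first forward step can be sketched like this (swiss data; reformulate() builds the one-variable formula):

```r
response <- "Fertility"
candidates <- setdiff(names(swiss), response)
#adjusted R2 of each single-predictor model
adj <- sapply(candidates, function(v)
  summary(lm(reformulate(v, response), data = swiss))$adj.r.squared)
names(which.max(adj))   #the first variable to enter the model
```

On the swiss data, Education wins the first round, since it has the strongest single-variable relationship with Fertility.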
The adjusted R2 method is more computationally intensive, but it is more reliable, since it doesn’t depend on an arbitrary significance level.
Different results for backward and forward selection
There is no guarantee that backward elimination and forward selection will arrive at the same final model.
Variance inflation
Linear regression minimizes the squared difference between predicted and actual observations, i.e., it minimizes the variance of the residuals. If an additional predictor significantly reduces the residual variance, the predictor is deemed important. Note that in generalized linear models, we use deviance (log-likelihoods) in place of variance.

Adding a random, unimportant regressor decreases deviance and reduces the residual degrees of freedom. You can measure variance inflation with the vif() function. Analysis of variance (ANOVA) between two models is a useful way to quantify the significance of additional regressors.
Omitting an important regressor adds bias to the model.
Adding random unimportant variable
This simulation shows how the number of unimportant variables (x2 and x3) included in a model affects the coefficient of an important variable (x1).
The variance does not necessarily go up when a regressor is included, which is why simulation is appropriate here: averaging over a high number of simulations evens out cases where the variance decreases by chance.
#Create data
n <- 100
nosim <- 1000
x1 <- rnorm(n) #regressor 1
x2 <- rnorm(n) #regressor 2
x3 <- rnorm(n) #regressor 3
#
betas <- sapply(1 : nosim, function(i){ #1000 simulations
y <- x1 + rnorm(n, sd = .3) #response variable = x1 plus noise
c(coef(lm(y ~ x1))[2], #estimate of x1 with 1 regressor
coef(lm(y ~ x1 + x2))[2], #estimate of x1 with 2 regressors
coef(lm(y ~ x1 + x2 + x3))[2]) #estimate of x1 with 3 regressors
})
round(apply(betas, 1, sd), 5) #variability of x1 estimates per number of regressors
## x1 x1 x1
## 0.03537 0.03534 0.03539
#Source: https://github.com/DataScienceSpecialization/courses
Results:
As we can see, the variance increases only slightly: unrelated regressors barely inflate it.
Adding dependent unimportant variable
In this simulation, we see what happens when dependent variables (x2 and x3) are added to a model.
#Create data
n <- 100
nosim <- 1000
x1 <- rnorm(n) #Random regressor 1
x2 <- x1/sqrt(2) + rnorm(n)/sqrt(2) #x2 dependent on x1
x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2) #x3 heavily dependent on x1
#
betas <- sapply(1 : nosim, function(i){ #1000 simulations
y <- x1 + rnorm(n, sd = .3) #response variable = x1 plus noise
c(coef(lm(y ~ x1))[2], #estimate of x1 with 1 regressor
coef(lm(y ~ x1 + x2))[2], #estimate of x1 with 2 regressors
coef(lm(y ~ x1 + x2 + x3))[2]) #estimate of x1 with 3 regressors
})
round(apply(betas, 1, sd), 5) #variability of x1 estimates per number of regressors
## x1 x1 x1
## 0.02940 0.04452 0.09763
#Source: https://github.com/DataScienceSpecialization/courses
Results:
Adding a correlated regressor to a model inflates the variance significantly.
Variance inflation factors – VIF
We cannot know sigma, but we can estimate the increase in the actual standard error of the coefficients caused by including a regressor.
The variance inflation factor (VIF) is the increase in the variance of the ith regressor’s coefficient compared to the ideal setting where it is orthogonal to (independent of) the other regressors.
library(car)
fit <- lm(Fertility ~ . , data = swiss)
vif(fit) #vif provides added variance
## Agriculture Examination Education Catholic
## 2.284129 3.675420 2.774943 1.937160
## Infant.Mortality
## 1.107542
sqrt(vif(fit)) #apply square root for added standard error
## Agriculture Examination Education Catholic
## 1.511334 1.917138 1.665816 1.391819
## Infant.Mortality
## 1.052398
How to read the results of the vif function:
The variance in the estimated coefficient of Education is 2.774943 times what it would have been if Education were not correlated with the other regressors.
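This figure can be reproduced by hand: the VIF of a regressor equals 1/(1 - R2), where R2 comes from regressing that variable on all the other regressors.

```r
#R2 from regressing Education on the other predictors (Fertility excluded)
r2_edu <- summary(lm(Education ~ . - Fertility, data = swiss))$r.squared
1 / (1 - r2_edu)   #about 2.77, matching vif(fit)["Education"] above
```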
Nested Model Selection
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 6283.1
## 2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
## 3 41 2105.0 2 1075.9 10.477 0.0002111 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variance in generalized linear models
The anova table shows the increments per addition (degrees of freedom), as well as the F statistic and p-value for each model comparison.
The F statistic is a ratio of residual sums of squares, each divided by its respective degrees of freedom: the drop in residual sum of squares per degree of freedom gained, over the mean squared residual of the larger model.
In generalized linear models, we use deviance (log-likelihoods) in place of variance, as shown by the anova table. RSS stands for residual sum of squares, of which deviance is a fancy synonym. You can also use the deviance()
function to calculate the residual sum of squares.
deviance(fit1)
## [1] 6283.116
deviance(fit3)
## [1] 3180.925
deviance(fit5)
## [1] 2105.043
Using deviance()
you can imitate anova(fit1, fit3):
#difference of deviance over difference of degrees of freedom
numerator <- (deviance(fit1) - deviance(fit3)) / (fit1$df.residual - fit3$df.residual)
#mean residual squares of the larger model
denominator <- deviance(fit3) / fit3$df.residual
#show the F statistic for fit1 vs fit3
numerator/denominator
## [1] 20.96783
Note that the three-model table above instead uses the residual variance of the largest model as denominator, which is why it reports an F statistic of 30.211 for the same comparison.
Analysis of variance is sensitive to its assumption that model residuals are approximately normal. If they are not, we could get a small p-value for that reason alone. It is thus worth testing the residuals for normality. The Shapiro-Wilk test (shapiro.test()
) is quick and easy in R. Normality is its null hypothesis.
shapiro.test(fit1$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit1$residuals
## W = 0.98048, p-value = 0.6121
A p-value under the significance level (of .05) would mean rejecting normality, the null hypothesis of the Shapiro-Wilk test. Here, with a p-value of 0.6121, we fail to reject normality.