Chapter 7 Multiple covariates
7.1 Examples
Reading materials: Slides 101 - 110 in STA108_LinearRegression_S20.pdf.
We will revisit the examples that have been studied in Chapter 1.
7.1.1 Advertising data
We now consider all the covariates in the data set. The resulting model is \[
y_i = \beta_0 + \sum_{j=1}^3 x_{i,j} \beta_j + \epsilon_i, i=1,\ldots, 200,
\] where \(y_i\) is the sales (1 unit = 1000 dollars) for the \(i\)th entry, and \(x_{i,1}\), \(x_{i,2}\), and \(x_{i,3}\) are the TV, radio, and newspaper advertising budgets for the \(i\)th entry, respectively. In addition, we assume that the errors \(\epsilon_1,\ldots, \epsilon_{200}\) are independently and identically distributed (i.i.d.) with \(\mathbb{E}[\epsilon_i]= 0\) and \(\mathrm{var}(\epsilon_i)=\sigma^2\) for \(i=1,2,\ldots, 200\). Recall that we consider a fixed design (i.e., \(x_i\) is not random) in this course for simplicity.
dat.advertising=read.csv('./data/advertising.csv');
# Fit a multiple linear regression
fit.advertising = lm(sales~TV+radio+newspaper+1,data=dat.advertising);
# Summarize the fitted results
summary(fit.advertising)
fit.advertising.slr = lm(sales~TV+1,data=dat.advertising);
# Summarize the fitted results
summary(fit.advertising.slr)
How would you interpret the fitted results? How do they compare with the results from the simple linear regression fitted in the earlier chapters?
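One way to start the comparison is to put the TV coefficients from the two fits side by side; a short sketch, assuming the two fitted objects above:
# Compare the coefficient of TV across the multiple and simple regression fits
c(multiple = coef(fit.advertising)['TV'], simple = coef(fit.advertising.slr)['TV'])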
7.1.2 Flu shot
We will include age and female (gender) in the regression, in addition to the treatment received. What can you conclude from the fitted results?
dat.flu = read.table("./data/flu240.txt", header = TRUE)
# Fit a multiple linear regression
fit.flu= lm(outcome~treatment.received+age+female+1,data=dat.flu);
# Summarize the fitted results
summary(fit.flu)
In addition, we can fit a regression with treatment received as the outcome, where treatment assigned, age and gender are covariates. What can you conclude now?
fit.trt= lm(treatment.received~treatment.assigned+age+female+1,data=dat.flu);
# Summarize the fitted results
summary(fit.trt)
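To see the compliance pattern behind this second regression, one can cross-tabulate assignment against receipt; a short sketch using the same columns as above:
# Cross-tabulate treatment assigned against treatment received
with(dat.flu, table(treatment.assigned, treatment.received))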
7.1.3 Project STAR
Project STAR is a stratified randomized experiment, where randomization happened within each school. Hence, we will add schoolid2 as an additional factor in the regression.
library(AER)
data("STAR")
dat.STAR=STAR; # Just to be consistent
# Fit a linear regression with factor covariates
fit.STAR= lm(math2~as.factor(star2)+as.factor(schoolid2)+1,data=dat.STAR);
# Summarize the fitted results
summary(fit.STAR)$coef[1:3,]
# We can no longer draw a fitted line here...
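For comparison, it can be instructive to refit without the school indicators and see how the class-size coefficients change; a short sketch re-using dat.STAR from above:
# Fit without adjusting for school
fit.STAR.unadj = lm(math2~as.factor(star2),data=dat.STAR);
summary(fit.STAR.unadj)$coef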
7.2 Classification of variables
Reading materials: Slides 111 - 118 in STA108_LinearRegression_S20.pdf.
We will use simulation to examine the roles of different types of covariates: confounders, precision variables, and instruments.
source('sources.r')
simulate.one.instance.data<-function(){
n=100;
confounder = rnorm(n)*2; # affects both the treatment and the outcome
precision = runif(n); # affects the outcome only
instrument = rbinom(n,size=1,prob=0.4); # affects the outcome only through the treatment
trt=rnorm(n) + confounder*0.4-instrument;
y=trt*1-0.5*confounder+precision*0.7+0.5*rnorm(n); # the true treatment effect is 1
out=data.frame(y=y,trt=trt,confounder=confounder, precision=precision, instrument=instrument)
return(out)
}
## Effect of ignoring the confounder:
simulate.one.instance<-function(){
dat=simulate.one.instance.data();
beta.hat=fit.linear.model(covariate=dat[,'trt'],outcome=dat$y)
beta.hat.confounder=fit.linear.model(covariate=as.matrix(dat[,c('trt','confounder')]),outcome=dat$y)
return(c(beta.hat[2],beta.hat.confounder[2]))
}
N.sim=10000;
set.seed(1)
sim.confounder=replicate(N.sim, simulate.one.instance());
apply(sim.confounder,1,mean) # The true value is 1; the first entry ignores the confounder, the second adjusts for it
apply( (sim.confounder-1),1,function(x){ mean(x^2) }) # The mean squared error
## Effect of ignoring the precision variable:
simulate.one.instance<-function(){
dat=simulate.one.instance.data();
beta.hat=fit.linear.model(covariate=as.matrix(dat[,c('trt','confounder')]),outcome=dat$y)
beta.hat.precision=fit.linear.model(covariate=as.matrix(dat[,c('trt','confounder','precision')]),outcome=dat$y)
return(c(beta.hat[2],beta.hat.precision[2]))
}
N.sim=10000;
set.seed(1)
sim.precision=replicate(N.sim, simulate.one.instance());
apply(sim.precision,1,mean) # The true value is 1; the first entry omits the precision variable, the second includes it
apply( (sim.precision-1),1,function(x){ mean(x^2) }) # The mean squared error
## The power of instruments
### If the confounder is unobserved, we can use the two-stage least squares method
simulate.one.instance<-function(){
dat=simulate.one.instance.data();
beta.hat.naive=fit.linear.model(covariate=as.matrix(dat[,'trt']),outcome=dat$y)
yiv=fit.linear.model(covariate=as.matrix(dat[,c('instrument')]),outcome=dat$y)
trtiv=fit.linear.model(covariate=as.matrix(dat[,c('instrument')]),outcome=dat$trt)
return(c(beta.hat.naive[2],yiv[2]/trtiv[2]))
}
N.sim=10000;
set.seed(1)
sim.iv=replicate(N.sim, simulate.one.instance());
apply(sim.iv,1,mean) # The true value is 1; the first entry is the naive estimate, the second is the instrumental variable estimate
apply( (sim.iv-1),1,function(x){ mean(x^2) }) # The mean squared error
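The same instrumental variable estimate can also be obtained with the ivreg() function from the AER package; the sketch below assumes AER is installed and re-uses the data-generating function above.
library(AER)
set.seed(2)
dat = simulate.one.instance.data();
# Two-stage least squares: trt is instrumented by the binary instrument
fit.iv = ivreg(y ~ trt | instrument, data = dat);
coef(fit.iv) # the coefficient on trt should be close to the true value 1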
7.3 Least squares estimation
Reading materials: Slides 119 - 141 in STA108_LinearRegression_S20.pdf.
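In matrix form, the model stacks the \(n\) observations as \(y = X\beta + \epsilon\), where the first column of \(X\) is a column of ones for the intercept. The least squares estimator minimizes \(\|y - X\beta\|^2\) and has the closed form \[\widehat{\beta} = (X^\top X)^{-1} X^\top y, \qquad \widehat{y} = X\widehat{\beta} = Py, \quad P = X(X^\top X)^{-1}X^\top,\] where \(P\) is the hat (projection) matrix. The functions fit.linear.model() and calculate.hat() below implement these formulas directly with solve().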
# Consider another data set with more covariates
library(AER)
data("Fatalities")
dat.fatalities = Fatalities
# This dataset contains 34 variables
# It is actually a longitudinal/panel data set
# We will ignore the spatial-temporal structure in this class
y=dat.fatalities$fatal; # Number of vehicle fatalities
X=as.matrix(dat.fatalities[,c('spirits','unemp','income','beertax','baptist','mormon','drinkage','dry','youngdrivers','miles','gsp')]);
# Rewrite the functions in Chapter 2
linear.model<-function(beta,covariate){
# beta: a (p+1)-vector, where the first entry is the intercept
# covariate: an n x p matrix (or a vector when p = 1)
beta=matrix(beta,ncol=1);
yout=covariate%*%beta[-1]+beta[1];
return(yout);
# Note: unlike the version in Chapter 2, this function handles any number of covariates
}
sum.of.squares<-function(beta,covariate,outcome){
yout=linear.model(beta=beta,covariate=covariate);
res=outcome-yout;
sos= sum(res^2);
return(sos)
}
fit.linear.model<-function(covariate,outcome){
# Append a column of ones for the intercept, then solve the normal equations
X=cbind(1,covariate);
beta.fit=solve( t(X)%*%X )%*%t(X)%*%outcome;
return(beta.fit)
}
beta.hat=fit.linear.model(covariate=X,outcome=y)
# Compare with lm(): the two sets of estimates should agree
fit.lm=lm(y~X+1);
cbind(beta.hat, coef(fit.lm))
# Hat matrix
calculate.hat<-function(covariate){
X=cbind(1,covariate);
P=X%*%solve(t(X)%*%X)%*%t(X);
return(P)
}
P=calculate.hat(covariate=X);
# Verify the properties of the hat matrix:
## Residuals
resid= y-linear.model(beta=beta.hat,covariate=X);
resid.hat = (diag(length(y))-P)%*%y;
sum((resid-resid.hat)^2) # should be numerically zero
## Idempotent
max( abs(P%*%P-P)) # should be numerically zero
## Orthogonality
max(abs((diag(length(y))-P)%*%X)) # should be numerically zero: (I - P) annihilates the covariates
# Sum of squares with the hat matrix:
n=length(y);J=matrix(1,n,n);I=diag(n)
total.sum.of.squares= t(y)%*%(I-J/n)%*%y
explained.sum.of.squares=t(y)%*%(P-J/n)%*%y
residual.sum.of.squares=t(y)%*%(I-P)%*%y
# Check: this should equal residual.sum.of.squares above
sum.of.squares(beta.hat,covariate = X,outcome = y)
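# Added check with the quantities above: the decomposition SSTO = SSR + SSE should hold up to rounding error
total.sum.of.squares - (explained.sum.of.squares + residual.sum.of.squares)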
# R^2
# Simulate data from a new model
n=100;
set.seed(1)
X=matrix(rnorm(n*4),nrow=n,ncol=4);
y=X%*%c(1,2,3,4)+4*rnorm(n);
Z=matrix(rnorm(n*10),nrow=n,ncol=10)
P=calculate.hat(covariate=X);
P.Z=calculate.hat(covariate=cbind(X,Z));
n=length(y);J=matrix(1,n,n);I=diag(n)
R.sq =t(y)%*%(P-J/n)%*%y/t(y)%*%(I-J/n)%*%y;
R.sq.Z= t(y)%*%(P.Z-J/n)%*%y/t(y)%*%(I-J/n)%*%y; # R^2 never decreases when covariates are added, even irrelevant ones
# Hence the adjusted R^2, which penalizes the number of covariates:
R.sq.adj=1-(t(y)%*%(I-P)%*%y/(n-dim(X)[2]-1) )/(t(y)%*%(I-J/n)%*%y/(n-1));
R.sq.Z.adj=1-(t(y)%*%(I-P.Z)%*%y/(n-dim(X)[2]-dim(Z)[2]-1) )/(t(y)%*%(I-J/n)%*%y/(n-1));
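As a sanity check, these quantities can be compared with the values reported by lm(); a short sketch, assuming y, X, and Z from the simulation above are still in the workspace:
fit.X = lm(drop(y)~X);
fit.XZ = lm(drop(y)~X+Z);
c(R.sq, summary(fit.X)$r.squared) # R^2 for the model with X only
c(R.sq.adj, summary(fit.X)$adj.r.squared) # adjusted R^2 for the model with X only
c(R.sq.Z.adj, summary(fit.XZ)$adj.r.squared) # adjusted R^2 for the model with X and Z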
7.4 Underfitting and overfitting
Reading materials: Slides 142 - 152 in STA108_LinearRegression_S20.pdf.
## Underfitting
simulate.one.instance<-function(beta.true,X,Z){
eta=c(3,4);
n= dim(X)[1];
noise = rnorm(n)/2;
y.XZ=X%*%beta.true+Z%*%eta + noise;
beta.full = fit.linear.model(covariate=cbind(X,Z),outcome=y.XZ)
beta.under= fit.linear.model(covariate=X,outcome=y.XZ)
return(cbind(beta.full[2:3]-beta.true,beta.under[2:3]-beta.true)) # columns: full model, underfitted model
}
beta.true=c(1,2)
n=100;
X=matrix(rnorm(2*n),ncol=2)
Z=matrix(rnorm(2*n),ncol=2)
set.seed(1);
N.sim=10000;
under.sim=replicate(N.sim,simulate.one.instance(beta.true,X,Z))
# First entry: estimation error of the full (true) model; second entry: the underfitted model
apply(under.sim[1,,],1,mean) # The underfitted estimate has a non-zero bias
# Does this contradict the claim about precision variables?
# Try another simulation in which the design is redrawn in each replication:
simulate.one.instance.pre<-function(beta.true){
eta=c(3,4);
n=100;
X=matrix(rnorm(2*n),ncol=2)
Z=matrix(rnorm(2*n),ncol=2)
noise = rnorm(n)/2;
y.XZ=X%*%beta.true+Z%*%eta + noise;
beta.full = fit.linear.model(covariate=cbind(X,Z),outcome=y.XZ)
beta.under= fit.linear.model(covariate=X,outcome=y.XZ)
return(cbind(beta.full[2:3]-beta.true,beta.under[2:3]-beta.true))
}
beta.true=c(1,2)
set.seed(1);
N.sim=10000;
pre.sim=replicate(N.sim,simulate.one.instance.pre(beta.true))
apply(pre.sim[1,,],1,mean) # Almost zero bias for both fits once the design is redrawn each time
## Overfitting
# Just modify the code slightly...
simulate.one.instance<-function(beta.true,X,Z){
eta=c(3,4);
n= dim(X)[1];
noise = rnorm(n)/2;
y.X=X%*%beta.true+ noise;
beta.over= fit.linear.model(covariate=cbind(X,Z),outcome=y.X)
beta.full= fit.linear.model(covariate=X,outcome=y.X)
return(cbind(beta.full[2:3]-beta.true,beta.over[2:3]-beta.true)) # columns: true model, overfitted model
}
beta.true=c(1,2)
n=100;
X=matrix(rnorm(2*n),ncol=2)
Z=matrix(rnorm(2*n),ncol=2)
set.seed(1);
N.sim=10000;
over.sim=replicate(N.sim,simulate.one.instance(beta.true,X,Z))
# First entry: estimation error of the true model; second entry: the overfitted model
apply(over.sim[1,,],1,mean) # Neither fit shows a noticeable bias
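Bias is only half of the story; the variances of the estimators matter as well. A quick sketch, assuming under.sim and over.sim from the simulations above are still in memory:
# Variance of the estimates of the first regression coefficient
apply(under.sim[1,,],1,var) # full model vs. underfitted model
apply(over.sim[1,,],1,var) # true model vs. overfitted model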
7.5 Sampling distribution and inference
Reading materials: Slides 153 - 165 in STA108_LinearRegression_S20.pdf.
We will skip the material on the multivariate normal distribution. You will learn more about it in STA 135: Multivariate Data Analysis.
7.5.1 ANOVA
attach(dat.advertising)
full.model=lm(sales~TV+radio+newspaper);
reduced.model=lm(sales~TV);
anova(reduced.model,full.model)
RSS.Reduced<-sum((summary(reduced.model)$residuals)^2);
RSS.Full<-sum((summary(full.model)$residuals)^2);
df.Reduced <- summary(reduced.model)$df[2];
df.Full <- summary(full.model)$df[2];
tF<-((RSS.Reduced-RSS.Full)/(df.Reduced-df.Full))/(RSS.Full/df.Full);
tF
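The p-value of this F statistic comes from the F distribution with df.Reduced - df.Full and df.Full degrees of freedom; a short check against the anova() output above:
# p-value of the F statistic; it should match the Pr(>F) column from anova()
1-pf(tF,df.Reduced-df.Full,df.Full)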
7.5.2 Wald test
Suppose that we want to test the null hypothesis \[H_0: \beta_1-2\beta_2 = 0 \ {\rm v.s.}\ H_a: \beta_1-2\beta_2 \neq 0,\] where \(\beta_1\) is the regression coefficient for TV and \(\beta_2\) is the regression coefficient for radio.
full.model=lm(sales~TV+radio+newspaper);
covariance.LSE=vcov(full.model); # Calculate the covariance for the least squares estimators
coef.LSE=coef(full.model);
covariance.test = covariance.LSE[2:3,2:3]; # Extract the submatrix of covariance for beta1 and beta2
coef.test=coef.LSE[2:3];
R=matrix(0,1,2);
R[1,1]=1;R[1,2]=-2;
test.stat=t(R%*%coef.test)%*%solve(R%*%covariance.test%*%t(R))%*%(R%*%coef.test);
p.value=1-pchisq(test.stat,df=1);
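# Report the Wald statistic and its chi-squared p-value (df = 1)
c(statistic=test.stat, p.value=p.value)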
We can verify that the Wald test yields results similar to the F-test computed by anova() for the null hypothesis \[H_0: \beta_1=\beta_2 = 0 \ {\rm v.s.}\ H_a: \beta_1\neq 0 \ {\rm or}\ \beta_2 \neq 0.\]
full.model=lm(sales~TV+radio+newspaper);
covariance.LSE=vcov(full.model); # Calculate the covariance for the least squares estimators
coef.LSE=coef(full.model);
covariance.test = covariance.LSE[2:3,2:3]; # Extract the submatrix of covariance for beta1 and beta2
coef.test=coef.LSE[2:3];
R=matrix(0,2,2);
R[1,1]=1;R[2,2]=1;
test.stat=t(R%*%coef.test)%*%solve(R%*%covariance.test%*%t(R))%*%(R%*%coef.test);
p.value=1-pchisq(test.stat,df=2);
reduced.model=lm(sales~newspaper);
anova(reduced.model,full.model)
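Because both tests rely on the same full-model variance estimate, the Wald statistic divided by the number of constraints should essentially reproduce the F statistic reported by anova(); a short check re-using the objects above:
# Wald statistic divided by its degrees of freedom vs. the F statistic from anova()
test.stat/2
anova(reduced.model,full.model)$F[2]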