Chapter 1 Linear regression with R

Reading materials: Slides 3 - 11 in STA108_LinearRegression_S20.pdf.

Fitting a linear model is simple in R. At the bare minimum, you only need to know two functions: lm() and summary(). We will apply linear regression to three data sets: advertising, flu shot, and Project STAR.

1.1 Advertising data

This data is taken from An Introduction to Statistical Learning, by James et al. A brief description from Section 2.1 in ISL is provided below. You may read more about the data set in the free e-book.

The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper… It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

It is important to note that we will avoid making any causal statements (e.g., increasing the TV advertising budget will increase sales by …) in our analysis. Causal inference on observational data is a major field of statistics in its own right. We will limit our discussion to association.

dat.advertising=read.csv('./data/advertising.csv');
# Use the code in Appendix C to visualize this data set
library(ggplot2)
library(GGally)
ggpairs(dat.advertising)
# What is X in this data frame?
# X is simply the row index carried over from the csv file, so we drop it
dat.advertising=dat.advertising[,-1]

We will consider a simple linear regression model in this chapter. The only covariate/predictor/independent variable that we use here is the TV advertising budget. Therefore, the model is \[ y_i = x_i \beta_1 +\beta_0 + \epsilon_i, \quad i=1,\ldots, 200, \] where \(y_i\) is the sales (1 unit = 1000 dollars) for the \(i\)th entry, \(x_i\) is the TV advertising budget for the \(i\)th entry, \(\beta_0\) is the intercept term, and \(\beta_1\) is the regression slope. In addition, we assume that the errors \(\epsilon_1,\ldots, \epsilon_{200}\) are independently and identically distributed (i.i.d.) with \(\mathbb{E}[\epsilon_i]= 0\) and \(\mathrm{var}(\epsilon_i)=\sigma^2\) for \(i=1,2,\ldots, 200\). Recall that we consider a fixed design (i.e., \(x_i\) is not random) in this course for simplicity.

# Fit a simple linear regression
# The "+1" makes the intercept explicit; lm() includes an intercept by default
fit.advertising = lm(sales~TV+1,data=dat.advertising); 
# Summarize the fitted results
summary(fit.advertising) 
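
Since you will later be asked to reproduce this analysis with your own functions, here is a minimal sketch that computes the two estimates from their closed-form least-squares expressions and compares them with the lm() output.

# Closed-form least squares estimates for the simple linear regression
x = dat.advertising$TV
y = dat.advertising$sales
beta1.hat = sum( (x-mean(x))*(y-mean(y)) ) / sum( (x-mean(x))^2 )
beta0.hat = mean(y) - beta1.hat*mean(x)
c(beta0.hat, beta1.hat)
# Compare with the coefficients reported by lm()
coef(fit.advertising)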

We see that \(\hat{\beta}_1\) is approximately 0.05 and \(\hat{\beta}_0\) is approximately 7.03. We can draw the fitted line on the scatter plot.

How would you interpret the fitted slope and intercept?

# Draw the fitted line using ggplot2 
ggplot(data = dat.advertising) + 
  geom_point(mapping = aes(x = TV, y = sales)) +
  geom_line(data = fortify(fit.advertising), aes(x = TV, y = .fitted), color = 'red')
# In fact, you can use the following code without fitting the lm()

ggplot(data = dat.advertising,aes(x = TV, y = sales)) + 
  geom_point() +
  geom_smooth(method = "lm")
# The shaded band is a (point-wise) 95% confidence band for the regression line
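
The same band can also be obtained from the fitted model with predict(); below is a minimal sketch that computes the point-wise 95% confidence interval for the mean response over a grid of TV values.

# Point-wise 95% confidence interval for the mean response
new.TV = data.frame(TV = seq(min(dat.advertising$TV), max(dat.advertising$TV), length.out = 100))
conf.band = predict(fit.advertising, newdata = new.TV, interval = "confidence", level = 0.95)
head(conf.band) # columns: fit, lwr, upr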

1.2 Flu shot

The data come from the following paper: McDonald, C. J., Hui, S. L., and Tierney, W. M. (1992). Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing 9, 304-312. (link).

Briefly, in this study, physicians were randomly selected to receive a letter encouraging them to inoculate patients at risk for flu. The treatment of interest is the actual flu shot, and the outcome is an indicator of flu-related hospital visits by the patients. In this data set, you can also find four important background covariates: gender, age, a chronic obstructive pulmonary disease (COPD) indicator, and a heart disease indicator. For further details, read the original publication, Hirano, Imbens, Rubin and Zhou (2000) (link), or Chapter 25 in Imbens and Rubin (2015) (link).

dat.flu = read.table("./data/flu240.txt", header = TRUE)

# Take a look at the size of the data before proceeding:
dim(dat.flu)
colnames(dat.flu)
# Basic visualization:
ggpairs(dat.flu) # This might take a while...
# Do you think the plots are informative?

# One strategy: jittering

# Without jittering:
ggplot(dat.flu, aes(x=treatment.assigned, y=outcome)) + geom_point()
# With jittering:
ggplot(dat.flu, aes(x=treatment.assigned, y=outcome)) + geom_jitter(width = 0.25)

Again, we will consider a simple linear regression model for now. The only covariate/predictor/independent variable that we use here is the treatment assignment. Therefore, the model is \[ y_i = x_i \beta_1 +\beta_0 + \epsilon_i, \quad i=1,\ldots, 2891, \] where \(y_i\) is the outcome (0 or 1) for the \(i\)th patient, \(x_i\) is the treatment assignment (0 or 1) for the \(i\)th patient’s physician, \(\beta_0\) is the intercept term, and \(\beta_1\) is the regression slope. In addition, we assume that the errors \(\epsilon_1,\ldots, \epsilon_{2891}\) are independently and identically distributed (i.i.d.) with \(\mathbb{E}[\epsilon_i]= 0\) and \(\mathrm{var}(\epsilon_i)=\sigma^2\) for \(i=1,2,\ldots, 2891\). The following code is almost identical to that in the previous section.

How would you interpret the fitted slope and intercept?

Are there any violations of the assumptions on \(\{\epsilon_i\}_{i=1}^{2891}\)?

# Fit a simple linear regression
fit.flu= lm(outcome~treatment.assigned+1,data=dat.flu); 
# Summarize the fitted results
summary(fit.flu) 
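
With a binary covariate, the least-squares estimates have a simple interpretation: the intercept is the mean outcome among patients whose physicians were not assigned the letter, and the slope is the difference in mean outcomes between the two assignment groups. A minimal sketch to verify this against the lm() output:

# Group means of the outcome by treatment assignment
mean.control = mean(dat.flu$outcome[dat.flu$treatment.assigned == 0])
mean.treated = mean(dat.flu$outcome[dat.flu$treatment.assigned == 1])
c(mean.control, mean.treated - mean.control)
# Compare with the coefficients reported by lm()
coef(fit.flu)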

# Draw the fitted line using ggplot2 
ggplot(dat.flu, aes(x=treatment.assigned, y=outcome)) + 
  geom_jitter(width = 0.25)+
  geom_smooth(method = "lm")

1.3 Project STAR

The Tennessee Student/Teacher Achievement Ratio study (Project STAR) was conducted in the late 1980s to evaluate the effect of class size on test scores. This data set has been used as a classic example in many textbooks and research papers. Briefly, the study randomly assigned students to small classes, regular classes, and regular classes with a teacher’s aide. In order to randomize properly, schools were enrolled only if they had a large enough student body to have at least one class of each type. Once the schools were enrolled, students were randomly assigned to the three types of classes, and one teacher was randomly assigned to each class. You can read more about the study on the Harvard dataverse (link) or in Chapter 9 in Imbens and Rubin (2015) (link).

library(AER)
data("STAR")
dat.STAR=STAR; # Just to be consistent
# Take a look at the size of the data before proceeding:
dim(dat.STAR)
colnames(dat.STAR)
# With 47 variables, you might want to read the help file to see what these variables are.


# Basic visualization:
# ggpairs(dat.STAR) # This might take a really long time ...

# We will consider the math scores and the class assignments in the second grade

# With jittering:
ggplot(dat.STAR, aes(x=star2, y=math2)) + 
  geom_jitter(width = 0.25)

Again, we will consider a linear regression model for now. We will study the association between math scores and the class assignments in the second grade. Therefore, the model is \[ y_i =\beta_0 + x_{i,1} \beta_1 +x_{i,2} \beta_2 + \epsilon_i, \quad i=1,\ldots, 6065, \] where \(y_i\) is the math score for the \(i\)th student, \(x_{i,1}\) is an indicator (0 or 1) of whether the \(i\)th student is in a small class, \(x_{i,2}\) is an indicator (0 or 1) of whether the \(i\)th student is in a regular class with aide, \(\beta_0\) is the intercept term, and \(\beta_1\) and \(\beta_2\) are the regression coefficients for \(x_{i,1}\) and \(x_{i,2}\), respectively. In addition, we assume that the errors \(\epsilon_1,\ldots, \epsilon_{6065}\) are independently and identically distributed (i.i.d.) with \(\mathbb{E}[\epsilon_i]= 0\) and \(\mathrm{var}(\epsilon_i)=\sigma^2\) for \(i=1,2,\ldots, 6065\). The sample size (6065) is smaller than the number of rows in the original data set because some observations have missing values. The available data will change when we look at data from other grades.
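
As a quick check, the effective sample size can be recovered by counting the rows in which both the second-grade math score and the class assignment are observed; a minimal sketch:

# Number of students with both a second-grade math score and a class assignment
sum(complete.cases(dat.STAR[, c("math2", "star2")]))
# By default, lm() drops rows with missing values (na.action = na.omit)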

The following code is almost identical to that in the previous sections.

# Fit a linear regression with the class type as a categorical covariate
fit.STAR= lm(math2~as.factor(star2)+1,data=dat.STAR); 
# Summarize the fitted results
summary(fit.STAR) 

# We can no longer draw a single fitted line here since the covariate is categorical...
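
With a categorical covariate, the coefficients are again contrasts of group means: the intercept is the mean math score for the baseline class type (the first level of the factor), and each remaining coefficient is the difference between the corresponding class type and the baseline. A minimal sketch to verify this:

# Mean second-grade math score by class type (missing values removed)
group.means = tapply(dat.STAR$math2, dat.STAR$star2, mean, na.rm = TRUE)
group.means
# Differences from the baseline level should match the slope coefficients from lm()
group.means - group.means[1]
coef(fit.STAR)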

1.4 Note

In the homework assignments and the midterm project, you will be asked to reproduce the analysis on the real and synthetic data using your own functions. The hope is that, by reinventing the wheel, you will gain a thorough understanding of linear regression.