Chapter 9 Project description

If a statistical method is employed, you need to clearly state the model and justify your choice.

9.1 Project 1: Project STAR I

9.1.1 Background

In this project, we study the dataset from a very influential randomized experiment. Tennessee's Student/Teacher Achievement Ratio study (Project STAR) was conducted in the late 1980s to evaluate the effect of class size on test scores. The dataset has been used as a classic example in many textbooks and research papers. You are encouraged to read more about the experimental design and how others have analyzed this dataset. This document provides only a brief explanation of the dataset, which suffices for this course project.

The study randomly assigned students to small classes, regular classes, and regular classes with a teacher’s aide. To randomize properly, schools were enrolled only if their student body was large enough to form at least one class of each type. Once the schools were enrolled, students were randomly assigned to the three types of classes, and teachers were randomly assigned to classes.

The dataset contains scaled scores for math and reading from kindergarten to 3rd grade. We will only examine the math scores in 1st grade in this project.

9.1.2 Tasks

1. Install the AER package and load the STAR dataset.
2. Explore this dataset and generate summary statistics (in the form of tables or plots) that you find informative, and explain them.
3. Write down a one-way ANOVA model to study the effects of class types on the math scaled scores. Explain your notation.
4. Explain why your model is appropriate for this task on this dataset. You may want to include statistics and plots in your explanation.
5. Fit the model you chose in Task 3 and show your fits in the report.
6. Conduct model diagnostics and/or sensitivity analysis.
7. Test whether there is a difference in the math scaled score in 1st grade across students in different class types. Justify your choice of test.
8. Discuss whether you are able to make any causal statements based on your analysis.
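
As a starting point for Tasks 1, 2, and 5, the following is a minimal R sketch, not a complete or recommended analysis. It assumes that the 1st-grade class type and math score are stored in the columns star1 and math1 of the STAR data frame in AER (check ?STAR to confirm). The names star1st and fit1 are illustrative only, and dropping students with missing values is a simplifying assumption that you should examine yourself.

```r
# install.packages("AER")  # run once if the package is not yet installed
library(AER)
data("STAR")

# Keep students who have both a 1st-grade class type and a 1st-grade math score
# (assumption: complete-case analysis; justify or revisit this in your report)
star1st <- subset(STAR, !is.na(star1) & !is.na(math1),
                  select = c(star1, math1))

# Summary statistics of the math scaled score by class type
aggregate(math1 ~ star1, data = star1st,
          FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
boxplot(math1 ~ star1, data = star1st,
        xlab = "Class type (1st grade)", ylab = "Math scaled score")

# One-way ANOVA with class type as the single factor
fit1 <- aov(math1 ~ star1, data = star1st)
summary(fit1)
```

The F-test reported by summary(fit1) is only one possible choice for Task 7. Whether its assumptions are reasonable here is what Tasks 4 and 6 ask you to assess.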

9.2 Project 2: Project STAR II

9.2.1 Background

We will continue our study of Project STAR. In the previous project, we examined the effects on individual students. Here we consider each teacher as the individual unit. Moreover, we will also explore the longitudinal feature of this dataset and the fact that randomization happened within schools.

9.2.2 Tasks

1. Explore math scaled scores in the 1st grade with teachers as the unit. Generate summary statistics (in the form of tables or plots) that you find informative, and explain them.
2. Write down a two-way ANOVA model to study the effects of class types on math scaled scores in 1st grade, with the school indicator as the other factor. Explain your notation.
3. Explain why your model is appropriate for this task on this dataset. You may want to include statistics and plots in your explanation.
4. Fit the model you chose in Task 2 and show your fits in the report.
5. Conduct model diagnostics and/or sensitivity analysis.
6. Test whether there is a difference in math scaled score in 1st grade across teachers in different class types. Justify your choice of test.
7. Discuss whether you are able to make any causal statements based on your analysis.
8. Do the results of Project 2 differ from those of Project 1?

In any of these tasks, if a statistical method is employed, you need to clearly state the model and justify your choice.
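
For orientation, one possible way to write a two-way ANOVA model for this setting is sketched below. It is an illustration under stated assumptions, not the required answer. The notation is hypothetical: Y_{ijk} is a teacher-level summary of 1st-grade math scaled scores (for example the class mean, a choice you would need to justify), i indexes class type, j indexes school, and k indexes teachers within a class type and school. Omitting the interaction term is also an assumption that you should check.

```latex
\[
  Y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk},
  \qquad i = 1, 2, 3, \quad j = 1, \ldots, J, \quad k = 1, \ldots, n_{ij},
\]
\[
  \sum_{i=1}^{3} \alpha_i = 0, \qquad \sum_{j=1}^{J} \beta_j = 0, \qquad
  \epsilon_{ijk} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).
\]
```

Under this formulation, the test in Task 6 corresponds to the null hypothesis \alpha_1 = \alpha_2 = \alpha_3 = 0 against the alternative that at least one \alpha_i is nonzero.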

9.3 Project 3: US Traffic Fatalities

9.3.1 Background

The National Highway Traffic Safety Administration reported 36,560 highway fatalities across the US in 2018 (link). Alarmed by the high number of traffic fatalities, you want to use your knowledge and skills in statistics to explore measures that could reduce them.

You found out that well-cleaned traffic fatality data for 48 US states from 1982 to 1988 are available in the R package AER. A fuller description is given in the help file of the Fatalities dataset (e.g., using ?Fatalities).

You want to analyze this dataset to see whether any of the variables caused traffic fatalities to decrease or increase. Note that the ultimate goal is to recommend specific measures to policy makers.

The beer tax is the tax on a case of beer, which is an available measure of state alcohol taxes more generally. The drinking age variable is a factor indicating whether the legal drinking age is 18, 19, or 20. The two binary punishment variables describe the state’s minimum sentencing requirements for an initial drunk driving conviction.

Total vehicle miles traveled annually by state was obtained from the Department of Transportation. Personal income was obtained from the US Bureau of Economic Analysis, and the unemployment rate was obtained from the US Bureau of Labor Statistics.

9.3.2 Tasks

1. Explore this dataset and generate summary statistics (in the form of tables or plots) that you find crucial for your own interest, or for convincing the policy makers.
2. Using the full dataset from 1982 to 1988, propose a regression model to study whether having a mandatory jail sentence is associated with reduced traffic fatalities. In particular, you need to
   a. write down your model,
   b. state the assumptions required,
   c. fit the model with appropriate methods,
   d. conduct model diagnostics and/or sensitivity analysis, and
   e. discuss the causal interpretation of the proposed model.
3. Summarize the conclusions of your analysis. You may want to test a hypothesis, construct a confidence interval, or draw a confidence band.
4. Explain the implications of your results to policy makers who know little about statistics. Make suggestions if you want to.
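
To make Task 2 more concrete, here is a minimal R sketch of one candidate model, offered as an illustration rather than a recommended specification. It assumes the columns fatal, pop, jail, beertax, state, and year described in ?Fatalities. The outcome frate (fatalities per 10,000 residents), the choice of beertax as the only additional covariate, and the object name fit3 are all illustrative assumptions.

```r
# install.packages("AER")  # run once if the package is not yet installed
library(AER)
data("Fatalities")

# Fatalities per 10,000 residents (assumption: one common normalization)
Fatalities$frate <- with(Fatalities, fatal / pop * 10000)

# Raw comparison of mean fatality rates with and without mandatory jail
aggregate(frate ~ jail, data = Fatalities, FUN = mean)

# Regression with state and year dummy variables (two-way fixed effects);
# rows with a missing value of jail are dropped by lm() by default
fit3 <- lm(frate ~ jail + beertax + state + year, data = Fatalities)

# Estimate for the mandatory jail sentence indicator
coef(summary(fit3))["jailyes", ]
```

The default standard errors from lm() ignore possible correlation among observations from the same state, so treat the reported p-value with caution. Whether the coefficient of jail can be given a causal reading is exactly what the last item of Task 2 asks you to discuss.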

9.3.3 Hints

Take a look at the help file for this dataset.
You may want to think about the causal interpretation (the last item of Task 2) before choosing an appropriate model (the first item of Task 2).
Some keywords here: longitudinal data, time-series data, propensity score, matching, instrumental variable, and observational study. Not all keywords lead to useful resources.
You may use additional data or studies to strengthen your argument in Task 4.

9.4 Project 4: Bank Marketing

9.4.1 Background

A Portuguese retail bank started a telemarketing campaign in 2008, aiming to subscribe new users to a long-term deposit. Information collected during the campaign was recorded in this data set. More information is available on the UCI machine learning repository (link) and the citations therein.

You are a consultant hired to study the retail banking market in Portugal. You learn that this dataset is publicly available. Naturally, you want to gain some insight into the market from this dataset before conducting an expensive survey.

9.4.2 Tasks

1. Acquire the dataset from the UCI machine learning repository (link).
2. Pick an appropriate dataset to study, and justify your decision.
3. Explore this dataset and generate summary statistics (in the form of tables or plots) that you find crucial for your clients to know.
4. Build a predictive model for whether a client will sign on to a long-term deposit. You will use logistic regression in this task. Specifically, you will
   a. write down a proper logistic regression model,
   b. fit the model,
   c. evaluate the performance of the fitted model, and
   d. conduct model diagnostics and/or sensitivity analysis.
5. Build another predictive model using a method of your choice, and compare its performance to the logistic regression.
6. Explain the gap in performance, if any, to your supervisor, who knows statistics quite well and only believes in data and mathematics.

In any of these tasks, if a statistical method is employed, you need to clearly state the model and justify your choice.
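
As a reminder of the general form that Task 4 refers to, a logistic regression model can be written as below. The notation is illustrative: Y_i = 1 if client i subscribes to the long-term deposit and 0 otherwise, and x_i is a vector of covariates for client i. Which covariates to include, and in what form, is left to you.

```latex
\[
  Y_i \mid x_i \sim \mathrm{Bernoulli}(p_i) \ \text{independently}, \qquad
  \log\frac{p_i}{1 - p_i} = \beta_0 + x_i^{\top}\beta,
  \qquad i = 1, \ldots, n.
\]
```

Equivalently, p_i = \exp(\beta_0 + x_i^{\top}\beta) / \{1 + \exp(\beta_0 + x_i^{\top}\beta)\}, which is the predicted probability that client i subscribes.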

9.4.3 Hints

Pay close attention to the dataset (Task 3) before building your prediction model.
You may use random forest in Task 5 if you have no other methods in mind.
You need to evaluate the performance of each method in a statistically appropriate manner.
You may use methods implemented in other programming languages and display the results in your report. Any code written in another language should be made available for reproducibility.