13. Case Studies for Multiple Regression

metricsAI: An Introduction to Econometrics with Python and AI in the Cloud

Carlos Mendez

Chapter 13 Visual Summary

This notebook presents comprehensive case studies applying multiple regression to real-world economic problems.

Open In Colab

Chapter Overview

This chapter presents nine comprehensive case studies that apply multiple regression techniques to real-world economic questions. From analyzing school performance to estimating production functions to testing causal relationships, these case studies demonstrate the breadth and power of regression analysis.

Design Note: This chapter uses an integrated case study structure where sections 13.1–13.9 ARE the case studies, each demonstrating different regression techniques and applications.

What you’ll learn:

  • Apply multiple regression to analyze school performance and socioeconomic factors
  • Use logarithmic transformations to estimate production functions
  • Understand and test for constant returns to scale
  • Identify and correct for omitted variables bias
  • Apply cluster-robust standard errors for grouped data
  • Understand randomized control trials and difference-in-differences methods
  • Apply regression discontinuity design to causal questions
  • Use instrumental variables to estimate causal effects
  • Navigate the data cleaning and preparation process

Datasets used:

  • AED_API99.DTA: 807 California high schools, Academic Performance Index (1999) — Case Study 13.1
  • AED_CobbDouglas.DTA: 24 years of U.S. manufacturing data (1899–1922) — Case Study 13.2
  • AED_PhillipsCurve.DTA: 66 years of U.S. macroeconomic data (1949–2014) — Case Study 13.3
  • AED_AutoEfficiency.DTA: 26,995 vehicles, fuel efficiency data (1980–2006) — Case Study 13.4
  • AED_RandHealthInsurance.DTA: RAND health insurance experiment (RCT) — Case Study 13.5
  • AED_SouthAfricaHealth.DTA: South Africa cross-sectional health data (DiD) — Case Study 13.6
  • AED_SenateIncumbency.DTA: U.S. Senate election results (RD) — Case Study 13.7
  • AED_InstitutionsGDP.DTA: Cross-country institutions and GDP data (IV) — Case Study 13.8

Chapter outline:

  • 13.1 School Academic Performance Index
  • 13.2 Cobb-Douglas Production Function
  • 13.3 Phillips Curve and Omitted Variables Bias
  • 13.4 Automobile Fuel Efficiency
  • 13.5 RAND Health Insurance Experiment (RCT)
  • 13.6 Health Care Access (Difference-in-Differences)
  • 13.7 Political Incumbency (Regression Discontinuity)
  • 13.8 Institutions and GDP (Instrumental Variables)
  • 13.9 From Raw Data to Final Data
  • Key Takeaways
  • Practice Exercises

Key Concepts

Five core ideas anchor this chapter. Skim them before you start, and come back when a term feels fuzzy. Each entry pairs a concrete example using the chapter’s data with a non-technical analogy. Click a panel to expand it.

Production Function: A model that maps inputs (capital \(K\), labor \(L\), sometimes more) to output \(Q\). The Cobb–Douglas form \(Q = A K^{\beta_2} L^{\beta_3}\) is the chapter’s headline example: it is multiplicative in inputs, but linear in their logarithms — which is what makes it estimable with OLS.

For 24 years of U.S. manufacturing data (data_cobb, 1899–1922), the chapter fits \(\ln Q = \beta_1 + \beta_2 \ln K + \beta_3 \ln L\) and estimates \(\hat\beta_2 \approx 0.23\) (capital elasticity) and \(\hat\beta_3 \approx 0.81\) (labor elasticity). The two elasticities tell you exactly how much each input contributes — labor here was nearly four times as important as capital in turning early-century factories’ inputs into output.

A bakery’s recipe says: a loaf needs flour, water, and time, in fixed proportions. Double the flour but keep water and time fixed and you don’t get two loaves — you get a sticky mess. A production function is the regression’s recipe: it tells you how much each ingredient (input) actually contributes to the finished output, given the others.

Constant Returns to Scale (CRS): The property of a production function in which scaling all inputs by a factor \(\lambda\) scales output by exactly the same factor: \(f(\lambda K, \lambda L) = \lambda \cdot f(K, L)\). In Cobb–Douglas form, this happens when \(\beta_2 + \beta_3 = 1\).

For the Cobb–Douglas fit on data_cobb, the sum of input elasticities is \(\hat\beta_2 + \hat\beta_3 \approx 0.23 + 0.81 = 1.04\) — close to 1. The chapter’s \(F\)-test of \(H_0: \beta_2 + \beta_3 = 1\) yields \(p \approx 0.636\), so the data are consistent with constant returns: doubling capital and labor would, on average, double output across early-century U.S. manufacturing.

A photocopier shows constant returns to scale: feed in twice as much paper and twice as much toner, get twice as many copies. Increasing returns would mean you somehow get more than double; decreasing returns would mean you get less. The Cobb–Douglas test asks whether the U.S. economy in 1899–1922 was a well-behaved photocopier or some weirder machine.

Selection Bias: Bias arising when the people who end up in a sample are not a random slice of the population — typically because they self-select on characteristics that also affect the outcome. Observational comparisons of “treated” vs. “untreated” units are vulnerable; randomised assignment is the standard fix.

The RAND Health Insurance Experiment (data_health, 5{,}639 first-year observations) randomly assigned 5{,}809 individuals to plans with cost-sharing of 0%, 25%, 50%, 95% — a feature impossible to obtain from observational data, where sicker or richer people self-select into more generous plans. Random assignment closed off this selection channel and let a simple regression of spending on plan dummies estimate the causal effect of insurance generosity.

A restaurant’s online reviews look great — but only the most ecstatic and most outraged customers ever bother to write. The “average review” badly overstates the variance of the actual diner experience because reviewers select themselves. Selection bias in econometrics works the same way: the people you observe don’t represent the people you want to know about.

Parallel Trends Assumption: The identifying assumption of difference-in-differences (DiD) — that, in the absence of treatment, the treatment and control groups would have followed the same trend over time. The DiD estimator subtracts the control group’s change from the treatment group’s, and is unbiased only if parallel trends hold.

The South-Africa health-clinic study (data_access, 1{,}071 children, 1993 vs 1998) compares high-treatment communities to low-treatment ones over a five-year window. The DiD estimate of \(\approx 0.52\) standard deviations in waz (weight-for-age \(z\)-score) credits clinic access only under parallel trends — i.e., that low-treatment communities show how high-treatment communities would have evolved without the new clinics.

Two runners normally jog the same route at the same pace. One day, one of them straps on weighted shoes and starts to lag behind. Without the weighted shoes, the assumption goes, both runners would still be neck-and-neck. The lag is the causal effect of the shoes — but only if you’re sure both runners would otherwise have kept matching strides. Parallel trends is exactly that “would otherwise have matched” condition.

Counterfactual: The unobservable outcome a unit would have experienced under the alternative treatment status. Every causal effect is a comparison of an observed outcome to a counterfactual; randomised experiments, DiD, RD, and IV are all different recipes for imputing the missing counterfactual.

For the U.S. Senate incumbency study (data_incumb, 1{,}390 elections 1914–2010), the regression-discontinuity estimate of \(\approx 4.8\) percentage points is the gap between a barely-winning candidate’s next-election vote share and what it would have been had they barely lost (the counterfactual). RD makes that counterfactual credible because the two groups — barely won, barely lost — are nearly identical apart from incumbency status.

A counterfactual is the parallel-universe version of an event — what would have happened if the coin had landed the other way. You can never visit that universe directly. The whole craft of causal inference is to find clever observable proxies for it: a randomised “twin” experiment, a near-tied election, a coincidence of geography that gave one village clinics and another none.

Setup

Setup: Import libraries and configure environment
# --- Libraries ---
import numpy as np                        # numerical operations
import pandas as pd                       # data manipulation
import matplotlib.pyplot as plt           # plotting
import seaborn as sns                     # statistical visualizations
import pyfixest as pf                    # fast OLS/IV estimation
from scipy import stats                   # statistical tests
import warnings
warnings.filterwarnings('ignore')

# --- Reproducibility ---
np.random.seed(42)

# --- Data source ---
GITHUB_DATA_URL = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/"

# --- Plotting style (dark theme matching book design) ---
plt.style.use('dark_background')
sns.set_style("darkgrid")
plt.rcParams.update({
    'axes.facecolor': '#1a2235',
    'figure.facecolor': '#12162c',
    'grid.color': '#3a4a6b',
    'figure.figsize': (10, 6),
    'text.color': 'white',
    'axes.labelcolor': 'white',
    'xtick.color': 'white',
    'ytick.color': 'white',
    'axes.edgecolor': '#1a2235',
})

print("✓ Setup complete!")
✓ Setup complete!

13.1 School Academic Performance Index

Analyzing factors that determine school test scores in California.

# Load API data
data_api = pd.read_stata(GITHUB_DATA_URL + 'AED_API99.DTA')
print(f"Loaded {len(data_api)} California high schools")
print(f"Variables: {list(data_api.columns)}")
data_api.head()
Loaded 807 California high schools
Variables: ['sch_code', 'api99', 'edparent', 'meals', 'englearn', 'yearround', 'credteach', 'emerteach', 'avg_ed_raw', 'pct_af_am', 'pct_am_ind', 'pct_asian', 'pct_fil', 'pct_hisp', 'pct_pac', 'pct_white', 'mobility']
sch_code api99 edparent meals englearn yearround credteach emerteach avg_ed_raw pct_af_am pct_am_ind pct_asian pct_fil pct_hisp pct_pac pct_white mobility
0 130054 633 12.78 21 6 0 88 13 2.89 6 1 12 9 30 3 39 0.0
1 130062 646 13.42 16 17 0 86 24 3.21 5 1 27 6 17 1 43 12.0
2 130096 797 14.90 1 2 0 100 4 3.95 1 1 9 2 5 0 81 0.0
3 130229 693 13.66 0 18 0 93 7 3.33 5 1 33 7 10 1 38 7.0
4 130450 773 14.94 0 9 0 98 7 3.97 8 0 28 1 9 1 46 0.0

About the Dataset

This dataset contains information on 807 California high schools from 1999. The Academic Performance Index (API) is a score from 200-1000 that measures school performance based on standardized test results.

Key Variables:

  • api99: Academic Performance Index (outcome variable)
  • edparent: Average parental education level (1-5 scale)
  • mealpct: Percent of students eligible for free/reduced meals (poverty proxy)
  • elpct: Percent of English language learners
  • yrs_teach: Average teacher experience in years
  • totcredpc: Per-pupil total credentials
  • emrpct: Percent of emergency credential teachers

Economic Question: What factors determine school performance? Is it resources (teachers, funding) or student demographics (poverty, language, parental education)?

# Summary statistics
vars_api = ['api99', 'edparent', 'meals', 'englearn', 'yearround', 'credteach', 'emerteach']
data_api[vars_api].describe()
api99 edparent meals englearn yearround credteach emerteach
count 807.000000 807.000000 807.000000 807.000000 807.000000 807.000000 807.000000
mean 620.944238 12.841289 21.918216 14.003717 0.023544 89.836431 10.464684
std 107.440771 1.234461 23.667952 12.786517 0.151717 8.437510 8.214762
min 355.000000 9.620000 0.000000 0.000000 0.000000 33.000000 0.000000
25% 542.000000 12.020000 0.000000 4.000000 0.000000 85.000000 4.000000
50% 620.000000 12.880000 14.000000 10.000000 0.000000 92.000000 9.000000
75% 695.000000 13.680000 36.500000 21.000000 0.000000 96.000000 15.000000
max 966.000000 16.000000 98.000000 66.000000 1.000000 100.000000 56.000000
# Histogram of API scores
plt.figure(figsize=(10, 6))
plt.hist(data_api['api99'], bins=30, color='steelblue', alpha=0.7, edgecolor='#3a4a6b')  # bins = number of bars, alpha = transparency
plt.axvline(data_api['api99'].mean(), color='red', linestyle='--', linewidth=2,
            label=f'Mean = {data_api["api99"].mean():.1f}')
plt.axvline(800, color='green', linestyle='--', linewidth=2, label='Target = 800')
plt.xlabel('Academic Performance Index (API)')
plt.ylabel('Number of Schools')
plt.title('Figure 13.1: Distribution of API Scores')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

What to look for in this histogram:

  • Shape: Roughly symmetric, slightly left-skewed – most schools cluster around the middle of the API range
  • Spread: Scores range widely from below 400 to above 900, indicating large performance differences across schools
  • Benchmarks: The red dashed line (mean) and green dashed line (target of 800) show that most schools fall short of the state target
# Bivariate regression: API ~ Edparent
fit_api_biv = pf.feols('api99 ~ edparent', data=data_api, vcov='HC1')

# Key results
intercept_api = fit_api_biv.coef()['Intercept']
slope_api     = fit_api_biv.coef()['edparent']
r2_api        = fit_api_biv._r2

print(f"Estimated equation: API = {intercept_api:.1f} + {slope_api:.1f} x edparent")
print(f"Slope: each additional year of parent education is associated with {slope_api:.1f} higher API")
print(f"R-squared: {r2_api:.4f} ({r2_api*100:.1f}% of variation explained)")

# Full regression output
fit_api_biv.summary()
Estimated equation: API = -400.3 + 79.5 x edparent
Slope: each additional year of parent education is associated with 79.5 higher API
R-squared: 0.8350 (83.5% of variation explained)
###

Estimation:  OLS
Dep. var.: api99
sample: None = all
Inference:  HC1
Observations:  807

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |     2.5% |    97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|---------:|---------:|
| Intercept     |   -400.314 |       15.993 |   -25.030 |      0.000 | -431.707 | -368.920 |
| edparent      |     79.529 |        1.219 |    65.247 |      0.000 |   77.137 |   81.922 |
---
RMSE: 43.62 R2: 0.835 
# Scatter plot with regression line
plt.figure(figsize=(10, 6))
plt.scatter(data_api['edparent'], data_api['api99'], alpha=0.5, s=30, color='#22d3ee')  # s = marker size, alpha = transparency
plt.plot(data_api['edparent'], fit_api_biv.predict(), color='#c084fc', linewidth=2,
         label='Fitted line')
plt.xlabel('Average Years of Parent Education')
plt.ylabel('Academic Performance Index (API)')
plt.title('Figure 13.2: API vs Parent Education')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

What to look for in this scatter plot:

  • Direction: Strong positive relationship – higher parental education is associated with higher API scores
  • Form: Roughly linear – the fitted line captures the trend well
  • Scatter: Considerable variation around the line, suggesting other factors also matter
# Correlation matrix
corr_matrix = data_api[vars_api].corr()
# Correlation Matrix
corr_matrix.round(2)

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1)
plt.title('Figure 13.3: Correlation Matrix')
plt.tight_layout()
plt.show()

What the Correlation Matrix Reveals

The correlation matrix shows the linear relationships between all variables:

Strong negative correlations (darker blues):

  • api99 and mealpct (-0.90): Schools with more poverty have much lower test scores
  • api99 and elpct (-0.76): More English learners → lower scores
  • These suggest socioeconomic factors strongly predict performance

Positive correlations (reds):

  • api99 and edparent (0.76): Parental education strongly predicts scores
  • Teacher experience and credentials show weaker correlations

Multicollinearity concerns:

  • mealpct and edparent are highly correlated (-0.79)
  • This makes it difficult to separate their individual effects
  • Standard errors may be inflated in multiple regression
# Multiple regression
fit_api_mult = pf.feols('api99 ~ edparent + meals + englearn + yearround + credteach + emerteach',
                         data=data_api)
# Multiple Regression
fit_api_mult.summary()

# Coefficient table
coef_df = pd.DataFrame({
    'Coefficient': fit_api_mult.coef(),
    'Std Error': fit_api_mult.se(),
    't-stat': fit_api_mult.tstat(),
    'p-value': fit_api_mult.pvalue()
}).round(3)
print("\n", coef_df)
###

Estimation:  OLS
Dep. var.: api99
sample: None = all
Inference:  iid
Observations:  807

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |     2.5% |    97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|---------:|---------:|
| Intercept     |   -345.328 |       39.954 |    -8.643 |      0.000 | -423.755 | -266.900 |
| edparent      |     73.942 |        1.883 |    39.273 |      0.000 |   70.246 |   77.638 |
| meals         |      0.079 |        0.081 |     0.980 |      0.327 |   -0.079 |    0.238 |
| englearn      |     -0.358 |        0.167 |    -2.145 |      0.032 |   -0.685 |   -0.030 |
| yearround     |     25.956 |       10.185 |     2.548 |      0.011 |    5.963 |   45.949 |
| credteach     |      0.387 |        0.311 |     1.247 |      0.213 |   -0.222 |    0.997 |
| emerteach     |     -1.470 |        0.315 |    -4.672 |      0.000 |   -2.088 |   -0.853 |
---
RMSE: 41.22 R2: 0.853 

              Coefficient  Std Error  t-stat  p-value
Coefficient                                         
Intercept       -345.328     39.954  -8.643    0.000
edparent          73.942      1.883  39.273    0.000
meals              0.079      0.081   0.980    0.327
englearn          -0.358      0.167  -2.145    0.032
yearround         25.956     10.185   2.548    0.011
credteach          0.387      0.311   1.247    0.213
emerteach         -1.470      0.315  -4.672    0.000

Key Concept 13.1: Multiple Regression and Socioeconomic Determinants

In multiple regression, the coefficient for parent education remains strong (~74 points) even after controlling for meals, English learners, year-round schools, and teacher quality. The high correlation among socioeconomic variables makes it difficult to isolate their separate effects — a common challenge in observational studies.

13.2 Cobb-Douglas Production Function

Estimating returns to scale using log transformations.

The Cobb-Douglas Production Function

The Cobb-Douglas production function is one of the most famous models in economics. It describes how inputs (capital K and labor L) combine to produce output Q:

\[Q = \alpha K^{\beta_2} L^{\beta_3}\]

Parameters:

  • \(\alpha\): Total factor productivity (technology level)
  • \(\beta_2\): Output elasticity of capital (% change in Q from 1% change in K)
  • \(\beta_3\): Output elasticity of labor (% change in Q from 1% change in L)

Returns to Scale:

  • If \(\beta_2 + \beta_3 = 1\): Constant returns (doubling inputs doubles output)
  • If \(\beta_2 + \beta_3 > 1\): Increasing returns (doubling inputs more than doubles output)
  • If \(\beta_2 + \beta_3 < 1\): Decreasing returns (doubling inputs less than doubles output)

Why Log-Transform?

Taking natural logs of both sides converts the multiplicative model into a linear model:

\[\ln(Q) = \beta_1 + \beta_2 \ln(K) + \beta_3 \ln(L) + u\]

where \(\beta_1 = \ln(\alpha)\). Now we can use ordinary least squares (OLS) regression!

Dataset: Douglas (1976) used U.S. manufacturing data from 1899-1922 (24 years) to estimate this function.

# Load Cobb-Douglas data
data_cobb = pd.read_stata(GITHUB_DATA_URL + 'AED_COBBDOUGLAS.DTA')
print(f"Loaded {len(data_cobb)} years of US manufacturing data (1899-1922)")
print(f"Variables: {list(data_cobb.columns)}")
data_cobb.head(12)
Loaded 24 years of US manufacturing data (1899-1922)
Variables: ['year', 'q', 'k', 'l', 'lnq', 'lnk', 'lnl']
year q k l lnq lnk lnl
0 1899 100 100 100 4.605170 4.605170 4.605170
1 1900 101 107 105 4.615120 4.672829 4.653960
2 1901 112 114 110 4.718499 4.736198 4.700480
3 1902 122 122 118 4.804021 4.804021 4.770685
4 1903 124 131 123 4.820282 4.875197 4.812184
5 1904 122 138 116 4.804021 4.927254 4.753590
6 1905 143 149 125 4.962845 5.003946 4.828314
7 1906 152 163 133 5.023880 5.093750 4.890349
8 1907 151 176 138 5.017280 5.170484 4.927254
9 1908 126 185 121 4.836282 5.220356 4.795791
10 1909 155 198 140 5.043425 5.288267 4.941642
11 1910 159 208 144 5.068904 5.337538 4.969813
# Create log transformations
data_cobb['lnq'] = np.log(data_cobb['q'])
data_cobb['lnk'] = np.log(data_cobb['k'])
data_cobb['lnl'] = np.log(data_cobb['l'])

print("Summary statistics (original and log-transformed):")
data_cobb[['q', 'k', 'l', 'lnq', 'lnk', 'lnl']].describe()
Summary statistics (original and log-transformed):
q k l lnq lnk lnl
count 24.000000 24.000000 24.000000 24.000000 24.000000 24.000000
mean 165.916667 234.166667 145.791667 5.077336 5.356483 4.962723
std 43.753178 106.266550 29.616357 0.269234 0.459178 0.201077
min 100.000000 100.000000 100.000000 4.605170 4.605170 4.605170
25% 125.500000 146.250000 122.500000 4.832282 4.984773 4.808086
50% 157.000000 212.000000 144.500000 5.056165 5.356408 4.973274
75% 196.250000 307.250000 155.750000 5.277434 5.726353 5.048065
max 240.000000 431.000000 200.000000 5.480639 6.066108 5.298317
# Estimate Cobb-Douglas with HAC standard errors
data_cobb['_time'] = range(len(data_cobb))
fit_cobb = pf.feols('lnq ~ lnk + lnl', data=data_cobb,
                    vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 3})

# Key results
beta_k = fit_cobb.coef()['lnk']
beta_l = fit_cobb.coef()['lnl']
r2_cobb = fit_cobb._r2

print(f"Estimated equation: ln(Q) = {fit_cobb.coef()['Intercept']:.3f} + {beta_k:.3f} ln(K) + {beta_l:.3f} ln(L)")
print(f"Capital elasticity: {beta_k:.3f} (1% increase in K raises output by {beta_k:.3f}%)")
print(f"Labor elasticity: {beta_l:.3f} (1% increase in L raises output by {beta_l:.3f}%)")
print(f"Sum of coefficients: {beta_k:.3f} + {beta_l:.3f} = {beta_k + beta_l:.3f}")
print(f"R-squared: {r2_cobb:.4f}")

# Full regression output
fit_cobb.summary()
Estimated equation: ln(Q) = -0.177 + 0.233 ln(K) + 0.807 ln(L)
Capital elasticity: 0.233 (1% increase in K raises output by 0.233%)
Labor elasticity: 0.807 (1% increase in L raises output by 0.807%)
Sum of coefficients: 0.233 + 0.807 = 1.040
R-squared: 0.9574
###

Estimation:  OLS
Dep. var.: lnq
sample: None = all
Inference:  NW
Observations:  24

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |     -0.177 |        0.398 |    -0.446 |      0.660 | -1.000 |   0.646 |
| lnk           |      0.233 |        0.063 |     3.720 |      0.001 |  0.103 |   0.363 |
| lnl           |      0.807 |        0.135 |     5.995 |      0.000 |  0.529 |   1.086 |
---
RMSE: 0.054 R2: 0.957 

Key Concept 13.2: Logarithmic Transformation of Production Functions

Taking natural logarithms of the Cobb-Douglas production function \(Q = AK^\alpha L^\beta\) transforms it into the linear model \(\ln Q = \ln A + \alpha \ln K + \beta \ln L\), suitable for OLS estimation. The resulting coefficients are elasticities — directly interpretable as percentage changes.

Understanding the Results

Estimated Coefficients:

  • \(\hat{\beta}_2\) (capital) ≈ 0.23: A 1% increase in capital raises output by 0.23%
  • \(\hat{\beta}_3\) (labor) ≈ 0.81: A 1% increase in labor raises output by 0.81%
  • Sum: 0.23 + 0.81 ≈ 1.04 → Very close to constant returns to scale!

Why HAC Standard Errors?

This is time series data (24 consecutive years). Two problems arise:

  1. Autocorrelation: Output in year t is correlated with output in year t-1
  • Recessions/booms span multiple years
  • Technology shocks persist over time
  1. Heteroskedasticity: Variance of errors may change over time
  • Economy was more volatile in early 1900s
  • Structural changes during WWI

HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors correct for both problems. We use maxlags=3, allowing correlations up to 3 years apart.

Comparison:

  • Default SEs: Assume errors are uncorrelated and homoskedastic (WRONG here!)
  • Robust (HC1) SEs: Fix heteroskedasticity but ignore autocorrelation
  • HAC SEs: Fix BOTH problems (correct choice for time series)

Economic Interpretation:

Labor’s output elasticity (0.81) is much larger than capital’s (0.23). This suggests that in early 20th century U.S. manufacturing:

  • Labor was the dominant input
  • Capital was relatively less important
  • Consistent with labor-intensive production before automation
# Test constant returns to scale
# H0: beta_k + beta_l = 1
sum_betas = beta_k + beta_l
print(f"Testing constant returns to scale:")
print(f"H0: beta_capital + beta_labor = 1")
print(f"Estimated sum: {sum_betas:.3f}")

# Restricted model: ln(Q/L) ~ ln(K/L)
data_cobb['lnq_per_l'] = data_cobb['lnq'] - data_cobb['lnl']
data_cobb['lnk_per_l'] = data_cobb['lnk'] - data_cobb['lnl']
fit_restricted = pf.feols('lnq_per_l ~ lnk_per_l', data=data_cobb)

# F-test
rss_unr = np.sum(fit_cobb._u_hat**2)
rss_r = np.sum(fit_restricted._u_hat**2)
df_resid = fit_cobb._N - len(fit_cobb.coef())
f_stat = ((rss_r - rss_unr) / 1) / (rss_unr / df_resid)
p_value = 1 - stats.f.cdf(f_stat, 1, df_resid)

print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.3f}")
print(f"Conclusion: {'Reject' if p_value < 0.05 else 'Fail to reject'} H0 at 5% level")
Testing constant returns to scale:
H0: beta_capital + beta_labor = 1
Estimated sum: 1.040
F-statistic: 0.20
p-value: 0.663
Conclusion: Fail to reject H0 at 5% level

Key Concept 13.3: Testing Constant Returns to Scale

Constant returns to scale implies \(\alpha + \beta = 1\): doubling all inputs exactly doubles output. The estimated sum of 1.040 is not significantly different from 1 (F-test p = 0.636), consistent with the theoretical prediction for competitive markets. This is a joint hypothesis test on the coefficients.

# Predicted output with bias correction
se = np.sqrt(np.mean(fit_cobb._u_hat**2))
bias_correction = np.exp(se**2 / 2)
data_cobb['q_pred'] = bias_correction * np.exp(fit_cobb.predict())

# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.plot(data_cobb['year'], data_cobb['q'], 'o-', color='#22d3ee', linewidth=2,
         markersize=6, label='Actual Q')
plt.plot(data_cobb['year'], data_cobb['q_pred'], 's--', color='#c084fc', linewidth=2,
         markersize=5, label='Predicted Q')
plt.xlabel('Year')
plt.ylabel('Output Index')
plt.title('Figure 13.4: Actual vs Predicted Output (1899-1922)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

13.3 Phillips Curve and Omitted Variables Bias

Demonstrating omitted variables bias through the breakdown of the Phillips curve.

The Phillips Curve and Its Breakdown

The Phillips Curve describes a negative relationship between unemployment and inflation:

  • When unemployment is low → workers demand higher wages → prices rise → inflation increases
  • When unemployment is high → workers accept lower wages → prices fall → inflation decreases

Historical Context:

A.W. Phillips (1958) found this negative relationship in UK data from 1861-1957. It became a cornerstone of macroeconomic policy:

  • Governments believed they could “trade off” unemployment for inflation
  • Want lower unemployment? Accept higher inflation
  • Want lower inflation? Accept higher unemployment

The 1970s Crisis:

The relationship broke down in the 1970s! Economies experienced stagflation (high unemployment AND high inflation simultaneously). This was a major crisis in economic theory.

The Solution: Expected Inflation

Milton Friedman and Edmund Phelps showed the original Phillips curve suffered from omitted variables bias. The correct model includes expected inflation:

\[\text{Inflation}_t = \beta_1 + \beta_2 \text{Urate}_t + \beta_3 \text{Expected Inflation}_t + u_t\]

We’ll demonstrate this using U.S. data from 1949-2014 (66 years).

# Load Phillips curve data
data_phillips = pd.read_stata(GITHUB_DATA_URL + 'AED_PHILLIPS.DTA')
print(f"Loaded {len(data_phillips)} years of US data (1949-2014)")
print(f"Variables: {list(data_phillips.columns)}")
data_phillips.head()
Loaded 66 years of US data (1949-2014)
Variables: ['year', 'urate', 'gdpdef', 'inflgdp', 'pastinflgdp', 'inflgdp1yr', 'cpi', 'inflcpi', 'pastinflcpi', 'infcpi1yr', 'infcpi10yr', 'mich', 'date', 'daten']
year urate gdpdef inflgdp pastinflgdp inflgdp1yr cpi inflcpi pastinflcpi infcpi1yr infcpi10yr mich date daten
0 1949.0 6.6 13.518 -1.965335 NaN NaN 23.6 -2.074689 7.698837 NaN NaN NaN 1949-10-01 1949-10-01
1 1950.0 4.3 14.090 4.231395 NaN NaN 25.0 5.932203 3.648189 NaN NaN NaN 1950-10-01 1950-10-01
2 1951.0 3.1 14.869 5.528744 NaN NaN 26.5 6.000000 3.232486 NaN NaN NaN 1951-10-01 1951-10-01
3 1952.0 2.7 15.091 1.493039 3.474261 NaN 26.7 0.754717 4.063869 NaN NaN NaN 1952-10-01 1952-10-01
4 1953.0 4.5 15.219 0.848188 2.905584 NaN 26.9 0.749064 3.080858 NaN NaN NaN 1953-10-01 1953-10-01
# Pre-1970 regression
data_pre1970 = data_phillips[data_phillips['year'] < 1970].copy()
data_pre1970['_time'] = range(len(data_pre1970))
fit_pre = pf.feols('inflgdp ~ urate', data=data_pre1970,
                   vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 3})
# Phillips Curve Pre-1970 (1949-1969)
fit_pre.summary()
###

Estimation:  OLS
Dep. var.: inflgdp
sample: None = all
Inference:  NW
Observations:  21

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      7.106 |        1.581 |     4.493 |      0.000 |  3.807 |  10.405 |
| urate         |     -1.030 |        0.325 |    -3.167 |      0.005 | -1.708 |  -0.352 |
---
RMSE: 1.251 R2: 0.454 

Pre-1970: Phillips Curve Works!

Results:

  • Coefficient on unemployment ≈ -0.89
  • Negative relationship (as theory predicts!)
  • Statistically significant
  • 1% increase in unemployment → 0.89% decrease in inflation

The scatter plot shows a clear downward slope. This is the classic Phillips curve that policymakers relied on in the 1950s-60s.

Why did it work?

In the 1950s-60s, expected inflation was stable and low (around 2-3%). Since it didn’t vary much, omitting it from the regression didn’t cause serious bias.

# Plot pre-1970
plt.figure(figsize=(10, 6))
plt.scatter(data_pre1970['urate'], data_pre1970['inflgdp'], alpha=0.7, s=50, color='#22d3ee')  # s = marker size, alpha = transparency
plt.plot(data_pre1970['urate'], fit_pre.predict(), color='#c084fc', linewidth=2,
         label='Fitted line')
plt.xlabel('Unemployment Rate (%)')
plt.ylabel('Inflation Rate (%)')
plt.title('Figure 13.5: Phillips Curve Pre-1970 (Negative Relationship)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

What to look for in this scatter plot:

  • Direction: Negative slope – higher unemployment is associated with lower inflation (the classic Phillips curve trade-off)
  • Strength: Reasonably tight clustering around the fitted line for this small sample (21 years)
  • Context: This downward-sloping pattern is what policymakers relied on in the 1950s-60s
# Post-1970 regression
data_post1970 = data_phillips[data_phillips['year'] >= 1970].copy()
data_post1970['_time'] = range(len(data_post1970))
fit_post = pf.feols('inflgdp ~ urate', data=data_post1970,
                    vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 5})
# Phillips Curve Post-1970 (1970-2014)
fit_post.summary()
###

Estimation:  OLS
Dep. var.: inflgdp
sample: None = all
Inference:  NW
Observations:  45

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      1.928 |        1.868 |     1.032 |      0.308 | -1.837 |   5.693 |
| urate         |      0.265 |        0.326 |     0.814 |      0.420 | -0.392 |   0.922 |
---
RMSE: 2.387 R2: 0.028 

Key Concept 13.4: The Phillips Curve Breakdown

Pre-1970, higher unemployment was associated with lower inflation (the classic Phillips curve trade-off). Post-1970, this relationship reversed — suggesting the simple bivariate model was misspecified. The breakdown motivated the expectations-augmented Phillips curve, which includes expected inflation as a regressor.

Post-1970: Phillips Curve Breaks Down!

Results:

  • Coefficient on unemployment ≈ +0.27
  • Positive relationship (opposite of theory!)
  • The sign reversed!

The scatter plot shows an upward slope - higher unemployment is associated with higher inflation. This is stagflation.

What went wrong?

In the 1970s, expected inflation became:

  • Much higher (reached 10%+ during oil shocks)
  • Much more variable (changed frequently)
  • Correlated with unemployment (both rose together)

Omitting expected inflation now causes severe omitted variables bias that reverses the sign of the unemployment coefficient!

# Plot post-1970
plt.figure(figsize=(10, 6))
plt.scatter(data_post1970['urate'], data_post1970['inflgdp'], alpha=0.7, s=50, color='#22d3ee')  # s = marker size, alpha = transparency
plt.plot(data_post1970['urate'], fit_post.predict(), color='red', linewidth=2,
         label='Fitted line')
plt.xlabel('Unemployment Rate (%)')
plt.ylabel('Inflation Rate (%)')
plt.title('Figure 13.6: Phillips Curve Post-1970 (Positive Relationship - Breakdown!)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

What to look for in this scatter plot:

  • Direction: Positive slope – the relationship has reversed compared to pre-1970
  • Scatter: Much wider dispersion around the line, reflecting the turbulent 1970s-80s
  • Implication: The simple bivariate Phillips curve is misspecified post-1970 (omitted variables bias)
# Augmented Phillips curve (adding expected inflation)

data_post1970_exp = data_post1970.dropna(subset=['inflgdp1yr'])

data_post1970_exp = data_post1970_exp.copy()
data_post1970_exp['_time'] = range(len(data_post1970_exp))
fit_augmented = pf.feols('inflgdp ~ urate + inflgdp1yr', data=data_post1970_exp,
                         vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 5})


# Augmented Phillips Curve Post-1970


fit_augmented.summary()
###

Estimation:  OLS
Dep. var.: inflgdp
sample: None = all
Inference:  NW
Observations:  45

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.270 |        0.634 |     0.426 |      0.672 | -1.007 |   1.547 |
| urate         |     -0.128 |        0.081 |    -1.575 |      0.122 | -0.291 |   0.036 |
| inflgdp1yr    |      1.146 |        0.085 |    13.480 |      0.000 |  0.975 |   1.318 |
---
RMSE: 0.835 R2: 0.881 
# Demonstrate omitted variables bias

# Bivariate regression of Expinflation on Urate
fit_aux = pf.feols('inflgdp1yr ~ urate', data=data_post1970_exp)

gamma = fit_aux.coef()['urate']
beta3 = fit_augmented.coef()['inflgdp1yr']
beta2 = fit_augmented.coef()['urate']

# Omitted Variables Bias Calculation
# True model: Inflation = β1 + β2*Urate + β3*Expinflation
# Omitted model: Inflation = b1 + b2*Urate
# Omitted variables bias formula: E[b2] = β2 + β3*γ
# where γ = coefficient from Expinflation ~ Urate regression

print(f"\nγ (from auxiliary regression): {gamma:.3f}")
print(f"β3 (from full model): {beta3:.3f}")
print(f"β2 (from full model): {beta2:.3f}")

print(
    f"\nPredicted E[b2] = {beta2:.3f} + {beta3:.3f} * {gamma:.3f} "
    f"= {beta2 + beta3 * gamma:.3f}"
)
print(f"Actual b2 (from bivariate): {fit_post.coef()['urate']:.3f}")

print("\n✓ Omitted variables bias explains the sign reversal!")

γ (from auxiliary regression): 0.343
β3 (from full model): 1.146
β2 (from full model): -0.128

Predicted E[b2] = -0.128 + 1.146 * 0.343 = 0.265
Actual b2 (from bivariate): 0.265

✓ Omitted variables bias explains the sign reversal!

Key Concept 13.5: Omitted Variables Bias

When a relevant variable (expected inflation) is omitted from the regression, its effect gets absorbed into the included variable’s coefficient. The OVB formula shows that the bias equals the omitted variable’s coefficient times its correlation with the included regressor. In the Phillips curve, this bias reversed the sign of the unemployment coefficient.

Understanding Omitted Variables Bias

The omitted variables bias formula tells us how the coefficient in the bivariate (wrong) model relates to the true coefficients:

\[E[b_2] = \beta_2 + \beta_3 \gamma\]

where:

  • \(b_2\): coefficient on Urate in bivariate model (omits expected inflation)
  • \(\beta_2\): true coefficient on Urate in full model
  • \(\beta_3\): true coefficient on expected inflation in full model
  • \(\gamma\): coefficient from regressing expected inflation on Urate (auxiliary regression)

What we found:

  • \(\beta_2 \approx -1.15\) (true negative effect of unemployment)
  • \(\beta_3 \approx +1.15\) (expected inflation has 1-to-1 effect on actual inflation)
  • \(\gamma \approx +0.34\) (when unemployment rises, expected inflation also rises in 1970s!)

Predicted bias: \[E[b_2] = -1.15 + 1.15 \times 0.34 \approx +0.26\]

Actual \(b_2\) from bivariate model: +0.27

Perfect match! The omitted variables bias formula exactly explains why the sign reversed.

Economic Interpretation:

In the 1970s:

  1. Oil shocks raised unemployment (supply shocks)
  2. Same oil shocks raised expected inflation (cost-push)
  3. Unemployment and expected inflation moved together
  4. Omitting expected inflation makes unemployment look like it causes inflation
  5. But really, both are caused by the same underlying shocks

The Lesson:

The Phillips curve didn’t “disappear” - it was always conditional on expected inflation. Once we control for expectations, the negative relationship returns. This revolutionized central bank policy: to fight inflation, must manage expectations!

13.4 Automobile Fuel Efficiency

Large dataset analysis with cluster-robust standard errors.

Case Study: Automobile Fuel Efficiency (1980-2006)

The Policy Question: What determines vehicle fuel efficiency, and why didn’t MPG improve more despite technological advances?

Economic Context:

From 1980-2006, the U.S. auto industry faced:

  • 1970s oil shocks: Created demand for fuel efficiency
  • CAFE standards: Government regulations on average MPG
  • Consumer preferences: Growing demand for SUVs and trucks
  • Technological progress: Better engines, materials, aerodynamics

Yet average fuel economy stagnated. Why?

The Data: 26,995 vehicles sold in the U.S. market (1980-2006)

Key Variables:

  • mpg: Miles per gallon (fuel efficiency)
  • curbwt: Vehicle weight in pounds
  • hp: Horsepower (engine power)
  • torque: Engine torque (pulling power)
  • year: Model year (captures technology trends)
  • mfr: Manufacturer (for clustering)

The Research Question:

How much does each factor contribute to fuel efficiency?

  • Weight vs power vs technology
  • Are efficiency gains “spent” on other attributes?

Methodological Challenge:

Vehicles from the same manufacturer share:

  • Common technology platforms
  • Similar design philosophy
  • Shared engineering teams

This creates clustered errors → need cluster-robust standard errors!

Method: Log-log regression with cluster-robust SEs (by manufacturer)

# Load automobile data
data_auto = pd.read_stata(GITHUB_DATA_URL + 'AED_AUTOSMPG.DTA')
print(f"Loaded {len(data_auto)} vehicle observations (1980-2006)")
print(f"\nKey variables: {['year', 'mfr', 'mpg', 'curbwt', 'hp', 'torque']}")
print(f"\nNote: Dataset has pre-computed log transformations (lmpg, lcurbwt, lhp, ltorque)")
data_auto[['year', 'mfr', 'mpg', 'curbwt', 'hp', 'torque']].head()
Loaded 26995 vehicle observations (1980-2006)

Key variables: ['year', 'mfr', 'mpg', 'curbwt', 'hp', 'torque']

Note: Dataset has pre-computed log transformations (lmpg, lcurbwt, lhp, ltorque)
year mfr mpg curbwt hp torque
0 1980.0 FIAT 21.700001 2455 111 149.199997
1 1980.0 AMC 22.400000 2807 90 176.300003
2 1980.0 AMC 24.600000 2767 90 176.300003
3 1980.0 AMC 20.400000 2856 110 284.799988
4 1980.0 AMC 20.500000 2816 110 284.799988
# Summary statistics
key_vars = ['mpg', 'curbwt', 'hp', 'torque', 'year']
# Summary Statistics
data_auto[key_vars].describe()

# Manufacturer distribution
# TOP 10 MANUFACTURERS BY OBSERVATIONS
data_auto['mfr'].value_counts().head(10)
mfr
GMC           7979
CHRYSLER      4172
FORD          3709
TOYOTA        1554
NISSAN        1079
VOLKSWAGEN     831
HONDA          743
BMW            740
MITSUBISHI     685
MAZDA          612
Name: count, dtype: int64

Key Observations from Summary Statistics:

  • Sample size: 26,995 vehicles (1980-2006)
  • Time span: 27 years of data
  • MPG range: 8.7 to 76.4 (huge variation!)
  • Weight: 1,450 to 6,700 lbs (compact cars to large trucks)

Top manufacturers: General Motors, Ford, Chrysler dominate (this is U.S. market data)

Why log-log specification?

We use logs of both outcome (MPG) and predictors (weight, HP, torque) because:

  1. Elasticity interpretation: Coefficients = percentage changes
    • β = -0.5 means: 1% ↑ in weight → 0.5% ↓ in MPG
  2. Nonlinear relationships: MPG doesn’t change linearly with weight
    • Going from 2,000→2,100 lbs has different effect than 4,000→4,100 lbs
    • Logs capture this naturally
  3. Reduces heteroskedasticity: Log transformation stabilizes variance
# Log-log regression with cluster-robust standard errors
# Use pre-computed log variables
fit_auto = pf.feols('lmpg ~ lhp + lcurbwt + ltorque + year', data=data_auto, vcov={'CRV1': 'mfr'})

# Log-Log Regression: Fuel Efficiency
fit_auto.summary()

# Interpretation
# ELASTICITY INTERPRETATION
print(f"Horsepower elasticity: {fit_auto.coef()['lhp']:.3f}")
print(f"  → 1% increase in HP → {fit_auto.coef()['lhp']:.2f}% change in MPG")
print(f"\nWeight elasticity: {fit_auto.coef()['lcurbwt']:.3f}")
print(f"  → 1% increase in weight → {fit_auto.coef()['lcurbwt']:.2f}% change in MPG")
print(f"\nTorque elasticity: {fit_auto.coef()['ltorque']:.3f}")
print(f"  → 1% increase in torque → {fit_auto.coef()['ltorque']:.2f}% change in MPG")
print(f"\nYear trend: {fit_auto.coef()['year']:.4f}")
print(f"  → Efficiency improves {fit_auto.coef()['year']*100:.2f}% per year")

# CLUSTER-ROBUST STANDARD ERRORS
print(f"Clustered by manufacturer (mfr)")
print(f"Number of clusters: {data_auto['mfr'].nunique()}")
print(f"Average observations per cluster: {len(data_auto)/data_auto['mfr'].nunique():.0f}")
print(f"\nWhy cluster? Vehicles from same manufacturer likely have correlated errors")
print(f"due to common technology, design philosophy, and engineering teams.")
###

Estimation:  OLS
Dep. var.: lmpg
sample: None = all
Inference:  CRV1
Observations:  26995

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -20.843 |        1.988 |   -10.484 |      0.000 | -24.868 | -16.818 |
| lhp           |     -0.181 |        0.065 |    -2.786 |      0.008 |  -0.313 |  -0.050 |
| lcurbwt       |     -0.562 |        0.057 |    -9.938 |      0.000 |  -0.677 |  -0.448 |
| ltorque       |     -0.141 |        0.064 |    -2.211 |      0.033 |  -0.271 |  -0.012 |
| year          |      0.015 |        0.001 |    13.840 |      0.000 |   0.013 |   0.017 |
---
RMSE: 0.125 R2: 0.777 
Horsepower elasticity: -0.181
  → 1% increase in HP → -0.18% change in MPG

Weight elasticity: -0.562
  → 1% increase in weight → -0.56% change in MPG

Torque elasticity: -0.141
  → 1% increase in torque → -0.14% change in MPG

Year trend: 0.0152
  → Efficiency improves 1.52% per year
Clustered by manufacturer (mfr)
Number of clusters: 39
Average observations per cluster: 692

Why cluster? Vehicles from same manufacturer likely have correlated errors
due to common technology, design philosophy, and engineering teams.

Key Concept 13.6: Cluster-Robust Standard Errors for Grouped Data

When observations are grouped (e.g., vehicles by manufacturer), errors within groups may be correlated. Cluster-robust standard errors account for this within-group correlation. Using default SEs when clustering exists understates uncertainty, leading to false rejections of the null hypothesis.

Interpreting the Results

Key Findings:

  1. Weight Elasticity: ≈ -0.56
  • 1% increase in vehicle weight → 0.56% decrease in MPG
  • Largest effect among all variables
  • Heavier vehicles are substantially less efficient
  1. Horsepower Elasticity: ≈ -0.30 to -0.35
  • 1% increase in HP → ~0.3% decrease in MPG
  • More powerful engines burn more fuel
  1. Torque Elasticity: ≈ -0.10 to -0.15
  • Smaller effect than weight or HP
  • Torque affects efficiency but less than other factors
  1. Year Trend: ≈ +0.01 to +0.02
  • Technology improves efficiency ~1-2% per year
  • But this is offset by increasing weight!

The Weight Paradox:

Despite technological improvements:

  • Vehicles got heavier (better safety, more features)
  • Average MPG didn’t improve much from 1980-2005
  • Efficiency gains were “spent” on weight/power instead of MPG

Why Cluster-Robust Standard Errors?

We cluster by manufacturer because:

  • Ford vehicles share common technology
  • Toyota vehicles share design philosophy
  • Errors within manufacturer are correlated

Number of clusters: 40+ manufacturers Average per cluster: ~600 vehicles

Clustering increases SEs → more conservative inference

Policy Implications:

  1. CAFE Standards: Corporate Average Fuel Economy regulations
  • Weight matters more than technology
  • Encouraging lighter vehicles could have large impact
  1. Consumer Trade-offs:
  • Safety (heavier) vs Efficiency (lighter)
  • Power (higher HP) vs MPG (lower HP)
  • Consumers chose weight/power over efficiency
  1. Recent Trends (post-2005, not in data):
  • Hybrid/electric technology
  • Lightweight materials (aluminum, carbon fiber)
  • Smaller turbocharged engines

Model Quality:

R² ≈ 0.78 means the model explains 78% of MPG variation

  • Excellent fit!
  • Weight, HP, torque are key predictors
  • But 22% remains unexplained (aerodynamics, transmission, etc.)

13.5 Rand Health Insurance Experiment (RCT)

Randomized control trial for causal inference.

Case Study: The RAND Health Insurance Experiment (Randomized Control Trial)

The Gold Standard for Causal Inference: Randomized Control Trials (RCTs)

The Policy Question: Does health insurance coverage affect medical spending?

Why This Matters:

This is the moral hazard question in health economics:

  • If insurance is free, do people overuse healthcare?
  • Should insurance have cost-sharing (deductibles, co-pays)?
  • What’s the right balance between access and efficiency?

The Challenge:

Observational studies are biased because:

  • People who buy insurance are different (sicker? richer?)
  • Can’t separate insurance effect from selection effect
  • Reverse causation: health affects insurance choice

The RCT Solution:

The RAND Health Insurance Experiment (1974-1982):

  • Randomly assigned 5,809 individuals to different insurance plans
  • Plans varied in cost-sharing: 0%, 25%, 50%, 95%, individual deductible
  • Followed families for 3-5 years
  • Measured healthcare utilization and spending

Why Randomization Enables Causal Inference:

Random Assignment → Insurance Plan → Medical Spending

Because assignment is random:

  • Treatment and control groups are identical in expectation
  • No confounding variables
  • Any difference in outcomes is caused by insurance
  • Selection bias eliminated!

Data: Year 1 data (5,639 observations)

Variables:

  • spending: Total medical expenditure
  • plan: Insurance plan assignment
  • Plan indicators: coins0 (free care), coins25, coins50, coins95, coinsmixed, coinsindiv

Method: Regression with cluster-robust SEs (by family)

# Load health insurance experiment data
data_health = pd.read_stata(GITHUB_DATA_URL + 'AED_HEALTHINSEXP.DTA')

# Use first year data only (as per textbook)
data_health_y1 = data_health[data_health['year'] == 1]

print(f"Loaded {len(data_health)} total observations")
print(f"Using Year 1 only: {len(data_health_y1)} observations")
print(f"\nInsurance plans: {sorted(data_health_y1['plan'].unique())}")
print(f"\nKey variables:")
print(f"  - plan: Insurance plan assignment (randomized)")
print(f"  - spending: Total medical spending")
print(f"  - Plan indicators: coins0, coins25, coins50, coins95, coinsmixed, coinsindiv")

# Summary statistics by plan
# MEAN SPENDING BY INSURANCE PLAN
spending_by_plan = data_health_y1.groupby('plan')['spending'].agg(['mean', 'std', 'count'])
print(spending_by_plan)

# Regression with plan indicators
fit_rct = pf.feols('spending ~ coins25 + coins50 + coins95 + coinsmixed + coinsindiv',
                data=data_health_y1, vcov={'CRV1': 'idfamily'})

# RCT REGRESSION: SPENDING ON INSURANCE PLANS
print("Omitted category: Free Care (coins0)")
fit_rct.summary()

# F-test for joint significance
# JOINT F-TEST: DO PLANS MATTER?
from scipy.stats import f as f_dist
coef_names = ['coins25', 'coins50', 'coins95', 'coinsmixed', 'coinsindiv']
coefs = np.array([fit_rct.coef()[c] for c in coef_names])
V = fit_rct._vcov
idx = [list(fit_rct.coef().index).index(c) for c in coef_names]
V_sub = V[np.ix_(idx, idx)]
f_val = float(coefs @ np.linalg.inv(V_sub) @ coefs / len(coef_names))
p_val = 1 - f_dist.cdf(f_val, len(coef_names), fit_rct._N - len(fit_rct.coef()))
print(f"H0: All plan coefficients = 0")
print(f"F-statistic: {f_val:.2f}")
print(f"p-value: {p_val:.4f}")
print(f"Conclusion: {'Reject H0' if p_val < 0.05 else 'Fail to reject H0'} at 5% level")

# CAUSAL INTERPRETATION
print("✓ Randomized Control Trial enables causal inference")
print("✓ Random assignment eliminates selection bias")
print("✓ Free care → highest spending (omitted baseline)")
print("✓ Higher cost-sharing → lower spending")
Loaded 20203 total observations
Using Year 1 only: 5639 observations

Insurance plans: ['25% Coins', '50% Coins', '95%/100% Coins', 'Free Care', 'Indv Deduct', 'Mixed Coins']

Key variables:
  - plan: Insurance plan assignment (randomized)
  - spending: Total medical spending
  - Plan indicators: coins0, coins25, coins50, coins95, coinsmixed, coinsindiv
                       mean           std  count
plan                                            
Free Care       2153.570365   4959.743126   1873
25% Coins       1396.663070   3458.960029    639
Mixed Coins     1701.874211   4328.165674    480
50% Coins       1785.845449  11643.607841    374
95%/100% Coins  1045.820283   2634.823581   1057
Indv Deduct     1607.071473   3815.953001   1216
Omitted category: Free Care (coins0)
###

Estimation:  OLS
Dep. var.: spending
sample: None = all
Inference:  CRV1
Observations:  5639

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |      2.5% |    97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|----------:|---------:|
| Intercept     |   2153.570 |      118.315 |    18.202 |      0.000 |  1921.532 | 2385.608 |
| coins25       |   -756.907 |      190.041 |    -3.983 |      0.000 | -1129.615 | -384.200 |
| coins50       |   -367.725 |      618.044 |    -0.595 |      0.552 | -1579.830 |  844.380 |
| coins95       |  -1107.750 |      150.366 |    -7.367 |      0.000 | -1402.647 | -812.853 |
| coinsmixed    |   -451.696 |      239.424 |    -1.887 |      0.059 |  -921.254 |   17.862 |
| coinsindiv    |   -546.499 |      162.151 |    -3.370 |      0.001 |  -864.508 | -228.490 |
---
RMSE: 4951.405 R2: 0.007 
H0: All plan coefficients = 0
F-statistic: 11.39
p-value: 0.0000
Conclusion: Reject H0 at 5% level
✓ Randomized Control Trial enables causal inference
✓ Random assignment eliminates selection bias
✓ Free care → highest spending (omitted baseline)
✓ Higher cost-sharing → lower spending

Key Concept 13.7: Randomized Control Trials as the Gold Standard

In a randomized control trial (RCT), subjects are randomly assigned to treatment and control groups. Random assignment ensures the groups are comparable on both observed and unobserved characteristics, eliminating selection bias and omitted variable concerns. The RAND experiment showed that better insurance coverage increases healthcare utilization.

Interpreting the RCT Results

Key Findings:

  1. Insurance Plans Matter: F-test strongly rejects H₀ (F = 11.39, p < 0.001)
  • Different insurance plans lead to significantly different spending
  • Cost-sharing reduces medical utilization
  1. Spending by Plan (relative to Free Care):

Free Care (baseline): Highest spending

  • People use more healthcare when it’s free

Cost-sharing plans (25%, 50%, 95%): Lower spending

  • Higher cost-sharing → less utilization
  • This is moral hazard: insurance affects behavior
  1. Why This is Causal:

Randomization: Families randomly assigned to plans

  • No selection bias (unlike observational studies)
  • Treated and control groups are balanced on all characteristics

Experimental control: RAND ensured compliance

Large sample: 5,639 person-years in first year

  1. R² is Low (0.007) - This is Actually Good!
  • Most variation in spending is random (health shocks)
  • Insurance explains small fraction (as expected)
  • But effects are statistically significant and policy-relevant

Policy Implications:

Trade-offs in health insurance design:

Free care:

  • More utilization
  • Better access for poor
  • But: higher costs, potential overuse

Cost-sharing:

  • Lower spending
  • Reduced “unnecessary” care
  • But: may also reduce necessary care (problematic!)

The RAND Health Insurance Experiment (HIE) Legacy:

This 1970s experiment fundamentally shaped U.S. health policy:

  • Informed Medicare/Medicaid design
  • Influenced Affordable Care Act
  • Demonstrated feasibility of large-scale RCTs
  • Showed moral hazard exists but is moderate

Limitations:

1970s data (healthcare has changed) Selected sites (may not generalize) Ethical concerns (denying some families full coverage)

Modern Relevance:

The fundamental trade-off remains:

  • Universal free care (equity) vs
  • Cost-sharing (efficiency)

13.6 Health Care Access (Difference-in-Differences)

Causal inference using DiD methodology.

Case Study: Health Care Access and Child Nutrition (Difference-in-Differences)

The Policy Question: Does building health clinics improve child health outcomes?

The Setting: Rural South Africa, 1990s

After apartheid ended (1994), the new government built primary health care clinics in underserved rural areas. Some communities got many new clinics (high treatment), others got few (low treatment).

The Challenge: Simple before/after comparison is biased because:

  • Child health was improving nationally due to many factors
  • Communities that got clinics may have been different
  • Cannot separate clinic effect from other trends

The Difference-in-Differences Solution:

Key Idea: Compare change in treated communities to change in control communities

\[\text{DiD} = (\text{Treated}_{after} - \text{Treated}_{before}) - (\text{Control}_{after} - \text{Control}_{before})\]

Why This Works:

The first difference removes:

  • Permanent differences between communities
  • Baseline health levels

The second difference removes:

  • Common time trends
  • National-level changes

What remains: The differential impact of clinic access!

The Design:

Group 1993 (Pre) 1998 (Post) Change
High Treatment \(\bar{Y}_{T,0}\) \(\bar{Y}_{T,1}\) \(\Delta_T\)
Low Treatment \(\bar{Y}_{C,0}\) \(\bar{Y}_{C,1}\) \(\Delta_C\)

DiD = \(\Delta_T - \Delta_C\)

Data: 1,071 children aged 0-4 in rural KwaZulu-Natal

Outcome: waz (weight-for-age z-score, measure of nutrition)

Method: DiD regression with cluster-robust SEs (by community)

# Load health care access data (South Africa)
data_access = pd.read_stata(GITHUB_DATA_URL + 'AED_HEALTHACCESS.DTA')

print(f"Loaded {len(data_access)} observations (South African children 0-4)")
print(f"\nDifference-in-Differences Setup:")
print(f"  - Treatment: High treatment communities (hightreat=1)")
print(f"  - Control: Low treatment communities (hightreat=0)")
print(f"  - Pre period: 1993 (post=0)")
print(f"  - Post period: 1998 (post=1)")
print(f"  - Outcome: waz (weight-for-age z-score)")

# Summary statistics by treatment and time
# MEAN WEIGHT-FOR-AGE Z-SCORE (WAZ)
did_table = data_access.groupby(['hightreat', 'post'])['waz'].agg(['mean', 'count'])
print(did_table)

# Calculate DiD manually
pre_control = data_access[(data_access['hightreat']==0) & (data_access['post']==0)]['waz'].mean()
post_control = data_access[(data_access['hightreat']==0) & (data_access['post']==1)]['waz'].mean()
pre_treat = data_access[(data_access['hightreat']==1) & (data_access['post']==0)]['waz'].mean()
post_treat = data_access[(data_access['hightreat']==1) & (data_access['post']==1)]['waz'].mean()

did_estimate = (post_treat - pre_treat) - (post_control - pre_control)

print(f"\nManual DiD calculation:")
print(f"  Control change: {post_control:.3f} - {pre_control:.3f} = {post_control - pre_control:.3f}")
print(f"  Treated change: {post_treat:.3f} - {pre_treat:.3f} = {post_treat - pre_treat:.3f}")
print(f"  DiD estimate: ({post_treat:.3f} - {pre_treat:.3f}) - ({post_control:.3f} - {pre_control:.3f}) = {did_estimate:.3f}")

# DiD regression
fit_did = pf.feols('waz ~ hightreat + post + postXhigh', data=data_access, vcov={'CRV1': 'idcommunity'})

# DiD REGRESSION
fit_did.summary()

# INTERPRETATION
print(f"DiD coefficient (postXhigh): {fit_did.coef()['postXhigh']:.3f}")
print(f"Matches manual calculation: {did_estimate:.3f} ✓")
print(f"\nCausal interpretation:")
print(f"Clinic access improved child nutrition by {fit_did.coef()['postXhigh']:.2f} standard deviations")
print(f"This is a {'statistically significant' if fit_did.pvalue()['postXhigh'] < 0.05 else 'not significant'} effect")
print(f"\nCluster-robust SEs by community account for within-community correlation")
Loaded 1071 observations (South African children 0-4)

Difference-in-Differences Setup:
  - Treatment: High treatment communities (hightreat=1)
  - Control: Low treatment communities (hightreat=0)
  - Pre period: 1993 (post=0)
  - Post period: 1998 (post=1)
  - Outcome: waz (weight-for-age z-score)
                    mean  count
hightreat post                 
0.0       0.0  -0.414185    325
          1.0  -0.069097    288
1.0       0.0  -0.545244    246
          1.0   0.321462    212

Manual DiD calculation:
  Control change: -0.069 - -0.414 = 0.345
  Treated change: 0.321 - -0.545 = 0.867
  DiD estimate: (0.321 - -0.545) - (-0.069 - -0.414) = 0.522
###

Estimation:  OLS
Dep. var.: waz
sample: None = all
Inference:  CRV1
Observations:  1071

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |     -0.414 |        0.115 |    -3.597 |      0.001 | -0.645 |  -0.183 |
| hightreat     |     -0.131 |        0.197 |    -0.666 |      0.508 | -0.526 |   0.264 |
| post          |      0.345 |        0.137 |     2.517 |      0.015 |  0.070 |   0.620 |
| postXhigh     |      0.522 |        0.235 |     2.217 |      0.031 |  0.050 |   0.994 |
---
RMSE: 1.555 R2: 0.04 
DiD coefficient (postXhigh): 0.522
Matches manual calculation: 0.522 ✓

Causal interpretation:
Clinic access improved child nutrition by 0.52 standard deviations
This is a statistically significant effect

Cluster-robust SEs by community account for within-community correlation

Key Concept 13.8: Difference-in-Differences for Causal Inference

Difference-in-differences (DiD) compares changes over time between treatment and control groups. The key assumption is parallel trends: both groups would have followed the same trajectory absent treatment. DiD removes time-invariant confounders and common time effects, isolating the treatment effect.

Interpreting the DiD Results

Key Findings:

  1. Treatment Effect: DiD coefficient ≈ 0.52 standard deviations
  • Clinic access improved child weight-for-age by 0.52 SD
  • Statistically significant (p = 0.027 < 0.05)
  • Large and meaningful effect size
  1. Verification: Manual calculation matches regression
  • This confirms our DiD estimate is correct
  • The interaction term postXhigh captures the treatment effect
  1. Effect Size Interpretation:

0.52 SD improvement means:

  • Moving from 25th percentile → 45th percentile in weight distribution
  • Substantial reduction in malnutrition
  • Comparable to 6-12 months of normal growth

Why This is Causal (Under Assumptions):

The DiD design requires parallel trends assumption:

  • Without treatment, treated and control groups would have followed parallel trends
  • Treatment assignment (clinic placement) is exogenous
  • No other interventions differentially affected treated communities

Evidence for validity:

  • Communities were similar at baseline
  • Clinic placement based on need, not outcomes
  • No other major health programs during this period

Policy Implications:

Primary health care access works: Even basic clinics improve child health

Targeting matters: High-treatment communities saw larger benefits

Cost-effective: Clinic construction is relatively inexpensive

External Validity:

These results from rural South Africa likely generalize to other:

  • Low-income settings
  • Areas with limited health infrastructure
  • Populations with high baseline malnutrition

Limitations:

Short-term effects (1998 vs 1993) Self-reported data (possible measurement error) Cluster-level treatment (less statistical power)

13.7 Political Incumbency (Regression Discontinuity)

Causal inference using RD design.

Case Study: The Incumbency Advantage in U.S. Senate Elections

The Question: Does being an incumbent senator give you an advantage in the next election?

The Challenge: Simple comparison of incumbents vs challengers is biased because:

  • Incumbents are generally stronger candidates (that’s why they won before!)
  • Districts that elect incumbents may be different
  • Selection bias confounds the causal effect

The Regression Discontinuity Solution:

Key Insight: Compare candidates who barely won to those who barely lost.

At the threshold (vote margin ≈ 0):

  • Winners and losers are essentially identical in quality
  • Only difference: winners become incumbents
  • This creates quasi-random assignment to incumbency status

The Design:

Running variable: margin (vote share in election t - 50%)
Threshold: margin = 0
Treatment: win = 1 if margin > 0

Outcome: vote (vote share in election t+1)

Visual Intuition:

If there’s an incumbency advantage, we should see a jump in next-election vote share exactly at margin = 0:

Next Vote % ▲
 │
 60─────┤ •
 │ • ← Jump = incumbency advantage
 50─────┤ •
 │ •
 40─────┤•
 │
 └──────────────────→
 -10 0 +10
 Lost │ Won
 Margin

Data: 1,390 U.S. Senate elections (1914-2010)

Method: RD with linear control for margin

# Load incumbency data (U.S. Senate elections)
data_incumb = pd.read_stata(GITHUB_DATA_URL + 'AED_INCUMBENCY.DTA')

print(f"Loaded {len(data_incumb)} Senate elections (1914-2010)")
print(f"\nRegression Discontinuity Setup:")
print(f"  - Running variable: margin (vote margin in election t)")
print(f"  - Threshold: margin = 0 (barely won vs barely lost)")
print(f"  - Outcome: vote (vote share in election t+1)")
print(f"  - win: Indicator for margin > 0")

# Summary statistics
data_incumb[['vote', 'margin', 'win']].describe()

# Keep only elections with non-missing vote in t+1
data_rd = data_incumb[data_incumb['vote'].notna()].copy()
print(f"\nObservations with outcome data: {len(data_rd)}")

# RD regression (linear)
fit_rd = pf.feols('vote ~ win + margin', data=data_rd, vcov='HC1')

# REGRESSION DISCONTINUITY ESTIMATION
fit_rd.summary()

# INCUMBENCY ADVANTAGE
print(f"RD estimate (win coefficient): {fit_rd.coef()['win']:.3f}")
ci = fit_rd.confint()
print(f"95% CI: [{ci.loc['win', '2.5%']:.3f}, {ci.loc['win', '97.5%']:.3f}]")
print(f"\nInterpretation:")
print(f"Barely winning vs barely losing increases vote share in next election by {fit_rd.coef()['win']:.1f}%")
print(f"This is the causal effect of incumbency")
print(f"\nWhy causal? At the threshold (margin≈0), winning is quasi-random")
print(f"Candidates just above/below threshold are similar in all respects except incumbency status")

# Visualization note
# RD PLOT
print("To visualize discontinuity:")
# - Bin observations by margin
# - Plot mean vote in next election vs margin
# - Should see jump at margin=0
Loaded 1390 Senate elections (1914-2010)

Regression Discontinuity Setup:
  - Running variable: margin (vote margin in election t)
  - Threshold: margin = 0 (barely won vs barely lost)
  - Outcome: vote (vote share in election t+1)
  - win: Indicator for margin > 0

Observations with outcome data: 1297
###

Estimation:  OLS
Dep. var.: vote
sample: None = all
Inference:  HC1
Observations:  1297

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |     47.331 |        0.526 |    89.945 |      0.000 | 46.299 |  48.363 |
| win           |      4.785 |        0.860 |     5.565 |      0.000 |  3.098 |   6.471 |
| margin        |      0.348 |        0.017 |    20.422 |      0.000 |  0.315 |   0.381 |
---
RMSE: 11.767 R2: 0.578 
RD estimate (win coefficient): 4.785
95% CI: [3.098, 6.471]

Interpretation:
Barely winning vs barely losing increases vote share in next election by 4.8%
This is the causal effect of incumbency

Why causal? At the threshold (margin≈0), winning is quasi-random
Candidates just above/below threshold are similar in all respects except incumbency status
To visualize discontinuity:

Key Concept 13.9: Regression Discontinuity Design

Regression discontinuity (RD) exploits sharp cutoffs in treatment assignment. Units just above and below the threshold are nearly identical in all respects except treatment status, creating a quasi-experiment. In U.S. Senate elections, barely winning provides an incumbency advantage of approximately 5–7 percentage points.

Interpreting the RD Results

Key Findings:

  1. Incumbency Advantage: ≈ 4.8 percentage points
  • Barely winning vs barely losing increases vote share by ~4.8% in next election
  • This is highly statistically significant (p < 0.001)
  • Robust across different specifications
  1. Why This is Causal:

At the threshold (margin ≈ 0), winning is quasi-random:

  • Candidates just above and below 50% are nearly identical
  • Only difference is incumbency status
  • No selection bias (unlike comparing all winners vs all losers)
  1. Magnitude in Context:
  • 4.8% advantage is substantial in close elections
  • Many Senate races decided by < 5%
  • Could explain why incumbents rarely lose

What Drives the Incumbency Advantage?

Possible mechanisms:

  • Name recognition: Incumbents are well-known
  • Fundraising: Easier to raise money as incumbent
  • Media coverage: Senators get more press
  • Constituent services: Can deliver benefits to voters
  • Experience: Learn how to campaign effectively

Policy Implications:

  • Incumbency advantage creates barriers to electoral competition
  • May reduce democratic accountability
  • Term limits could level the playing field
  • Campaign finance reform might reduce fundraising advantage

RD Design Advantages:

No need to control for other variables (identification at threshold) Transparent and credible Robust to model specification

Limitations:

Local estimate (only applies near threshold) May not generalize to landslide winners/losers Assumes no manipulation of vote margin (reasonable for U.S. Senate)

13.8 Institutions and GDP (Instrumental Variables)

Causal inference using IV/2SLS.

Case Study: Do Institutions Cause Economic Growth?

The Fundamental Question: Why are some countries rich and others poor?

The Endogeneity Problem:

If we simply regress GDP on institutions quality, we get:

  • Reverse causation: Rich countries can afford better institutions
  • Omitted variables: Culture, geography, history all affect both
  • Measurement error: How do we measure “institutions quality”?

All of these bias OLS estimates, making causal inference impossible.

The Instrumental Variables Solution:

Acemoglu, Johnson, and Robinson (2001) use settler mortality as an instrument:

The Historical Argument:

  1. In the 1600s-1800s, European colonizers faced different disease environments
  2. High mortality areas (Africa, tropical Americas) → extractive institutions
  • Europeans didn’t settle permanently
  • Built institutions to extract resources (slavery, forced labor)
  1. Low mortality areas (North America, Australia) → settler institutions
  • Europeans settled permanently
  • Built institutions to protect their own property rights

Why This Works as an Instrument:

Relevant: Settler mortality strongly predicts colonial institutions Exogenous: Malaria in 1700s doesn’t directly affect GDP in 2000s

Data: 64 former colonies

Variables:

  • logpgp95: Log GDP per capita 1995 (outcome)
  • avexpr: Protection against expropriation (institutions quality, 0-10)
  • logem4: Log settler mortality rate (instrument)

Method: Two-Stage Least Squares (2SLS/IV)

# Load institutions data (cross-country)
data_inst = pd.read_stata(GITHUB_DATA_URL + 'AED_INSTITUTIONS.DTA')

print(f"Loaded {len(data_inst)} countries")
print(f"\nInstrumental Variables Setup:")
print(f"  - Outcome: logpgp95 (log GDP per capita 1995)")
print(f"  - Endogenous regressor: avexpr (institutions quality)")
print(f"  - Instrument: logem4 (log settler mortality)")
print(f"\nKey idea: Settler mortality affected colonial institutions,")
print(f"which persist to affect GDP today, but mortality doesn't")
print(f"directly affect modern GDP")

# Summary statistics
data_inst[['logpgp95', 'avexpr', 'logem4']].describe()

# OLS (biased - endogeneity problem)
fit_ols = pf.feols('logpgp95 ~ avexpr', data=data_inst, vcov='HC1')

# OLS REGRESSION (BIASED)
fit_ols.summary()
print(f"\nOLS coefficient: {fit_ols.coef()['avexpr']:.3f}")
print("⚠️ This is biased due to endogeneity (omitted variables, reverse causation)")

# First stage
fit_first = pf.feols('avexpr ~ logem4', data=data_inst, vcov='HC1')

# FIRST STAGE: INSTITUTIONS ~ SETTLER MORTALITY
fit_first.summary()
first_f = fit_first.tstat()['logem4']**2
print(f"\nFirst stage F-statistic: {first_f:.2f}")
print(f"Rule of thumb: F > 10 for strong instrument")
print(f"Instrument strength: {'Strong ✓' if first_f > 10 else 'Weak ⚠️'}")

# 2SLS using pyfixest IV syntax
# TWO-STAGE LEAST SQUARES (2SLS)

# Single-call IV estimation: y ~ exog | endog ~ instrument
fit_iv = pf.feols('logpgp95 ~ 1 | avexpr ~ logem4', data=data_inst, vcov='HC1')

print("\nIV/2SLS estimation:")
fit_iv.summary()

# COMPARISON: OLS vs IV
print(f"OLS coefficient: {fit_ols.coef()['avexpr']:.3f}")
print(f"IV/2SLS coefficient: {fit_iv.coef()['avexpr']:.3f}")
print(f"\nDifference: {fit_iv.coef()['avexpr'] - fit_ols.coef()['avexpr']:.3f}")
print(f"\nIV estimate is larger → OLS has attenuation bias")
print(f"(measurement error and omitted variables bias OLS toward zero)")

# CAUSAL INTERPRETATION
print(f"1-unit improvement in institutions → {fit_iv.coef()['avexpr']:.2f} increase in log GDP")
print(f"Exponentiating: {np.exp(fit_iv.coef()['avexpr']):.2f}x increase in GDP level")
print(f"\n✓ This is a causal estimate (under IV assumptions)")
print(f"✓ Instrument (settler mortality) is:")
print(f"  - Relevant: Strong first stage (F = {first_f:.1f})")
print(f"  - Exogenous: Mortality in 1700s doesn't directly affect modern GDP")
Loaded 64 countries

Instrumental Variables Setup:
  - Outcome: logpgp95 (log GDP per capita 1995)
  - Endogenous regressor: avexpr (institutions quality)
  - Instrument: logem4 (log settler mortality)

Key idea: Settler mortality affected colonial institutions,
which persist to affect GDP today, but mortality doesn't
directly affect modern GDP
###

Estimation:  OLS
Dep. var.: logpgp95
sample: None = all
Inference:  HC1
Observations:  64

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      4.660 |        0.320 |    14.558 |      0.000 |  4.020 |   5.300 |
| avexpr        |      0.522 |        0.050 |    10.458 |      0.000 |  0.422 |   0.622 |
---
RMSE: 0.702 R2: 0.54 

OLS coefficient: 0.522
⚠️ This is biased due to endogeneity (omitted variables, reverse causation)
###

Estimation:  OLS
Dep. var.: avexpr
sample: None = all
Inference:  HC1
Observations:  64

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      9.341 |        0.704 |    13.264 |      0.000 |  7.934 |  10.749 |
| logem4        |     -0.607 |        0.150 |    -4.040 |      0.000 | -0.907 |  -0.307 |
---
RMSE: 1.245 R2: 0.27 

First stage F-statistic: 16.32
Rule of thumb: F > 10 for strong instrument
Instrument strength: Strong ✓

IV/2SLS estimation:
###

Estimation:  IV
Dep. var.: logpgp95
sample: None = all
Inference:  HC1
Observations:  64

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      1.910 |        1.193 |     1.601 |      0.114 | -0.475 |   4.294 |
| avexpr        |      0.944 |        0.179 |     5.278 |      0.000 |  0.587 |   1.302 |
---

OLS coefficient: 0.522
IV/2SLS coefficient: 0.944

Difference: 0.422

IV estimate is larger → OLS has attenuation bias
(measurement error and omitted variables bias OLS toward zero)
1-unit improvement in institutions → 0.94 increase in log GDP
Exponentiating: 2.57x increase in GDP level

✓ This is a causal estimate (under IV assumptions)
✓ Instrument (settler mortality) is:
  - Relevant: Strong first stage (F = 16.3)
  - Exogenous: Mortality in 1700s doesn't directly affect modern GDP

Key Concept 13.10: Instrumental Variables and Two-Stage Least Squares

When a regressor is endogenous (correlated with the error term), OLS is biased. Instrumental variables (IV) use an instrument that is correlated with the endogenous regressor but not directly with the outcome. Two-stage least squares (2SLS) first predicts the endogenous variable, then uses these predictions in the second-stage regression.

Interpreting the IV Results

Key Findings:

  1. OLS vs IV Comparison:
  • OLS coefficient: ≈ 0.52 (biased downward)
  • IV coefficient: ≈ 0.94 (causal estimate)
  • IV estimate is 80% larger than OLS!
  1. Why the Difference?

Attenuation bias: OLS is biased toward zero due to:

  • Measurement error in institutions quality
  • Omitted variables (culture, geography)
  • Reverse causation (rich countries build better institutions)
  1. Economic Interpretation:
  • 1-unit improvement in institutions → 0.94 increase in log GDP
  • Exponentiating: \(e^{0.94} \approx 2.56\)156% increase in GDP level
  • This is a massive effect!
  1. Instrument Validity:

Relevance: First stage F = 16.3 > 10 (strong instrument)

Exogeneity: Settler mortality in 1700s doesn’t directly affect modern GDP

The instrument works through this channel: Settler mortality → Colonial institutions → Modern institutions → GDP

High mortality areas got “extractive” institutions (exploitation). Low mortality areas got “settler” institutions (property rights, democracy).

Policy Implications:

Institutions are not just correlated with prosperity—they cause it. Countries can escape poverty by:

  • Strengthening property rights
  • Reducing corruption
  • Improving rule of law
  • Building democratic accountability

Caveat: These results apply to former colonies. Institutional change is slow and difficult.

13.9 From Raw Data to Final Data

Best practices for data preparation and cleaning.

Why Data Wrangling Matters

Real-world econometric analysis spends 60-80% of time on data preparation:

  • Reading data from various sources
  • Merging multiple datasets
  • Cleaning errors and outliers
  • Transforming variables
  • Creating new variables

Common Data Challenges:

  1. Missing values: How to handle? Drop, impute, or model explicitly?
  2. Outliers: Real extreme values or data errors?
  3. Inconsistent units: Converting currencies, adjusting for inflation
  4. Different time frequencies: Monthly data merged with quarterly
  5. Measurement error: Misreported values, typos

Best Practices:

  • Document everything: Keep track of all cleaning steps
  • Preserve raw data: Never overwrite original files
  • Reproducibility: Write scripts (not manual Excel edits!)
  • Validation: Check summary statistics before and after cleaning
  • Transparency: Report how many observations dropped and why
# Demonstrate reading different file formats
# Data Reading Examples

# Stata files
print("\n1. Reading Stata files (.dta):")
# data = pd.read_stata('file.dta')

# CSV files
print("\n2. Reading CSV files:")
# data = pd.read_csv('file.csv')

# Excel files
print("\n3. Reading Excel files:")
# data = pd.read_excel('file.xlsx')

# Example: merge operations
# DATA MERGING EXAMPLE

df1 = pd.DataFrame({'id': [1, 2, 3], 'value_a': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 4], 'value_b': [100, 200, 400]})

print("\nDataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

merged = pd.merge(df1, df2, on='id', how='inner')
print("\nMerged (inner join):")
print(merged)

1. Reading Stata files (.dta):

2. Reading CSV files:

3. Reading Excel files:

DataFrame 1:
   id  value_a
0   1       10
1   2       20
2   3       30

DataFrame 2:
   id  value_b
0   1      100
1   2      200
2   4      400

Merged (inner join):
   id  value_a  value_b
0   1       10      100
1   2       20      200
# Data cleaning examples

# Example dataset
df_dirty = pd.DataFrame({
    'age': [25, 30, -5, 200, 35],
    'income': [50000, 60000, None, 75000, 80000],
    'gender': ['M', 'F', 'm', 'Female', 'M']
})

print("\nOriginal data (with errors):")
print(df_dirty)

# Clean age (remove impossible values)
df_clean = df_dirty.copy()
df_clean.loc[df_clean['age'] < 0, 'age'] = np.nan
df_clean.loc[df_clean['age'] > 120, 'age'] = np.nan

# Fill missing income with median
df_clean['income'].fillna(df_clean['income'].median(), inplace=True)

# Standardize gender coding
df_clean['gender'] = df_clean['gender'].str.upper().str[0]

print("\nCleaned data:")
print(df_clean)

print("\n✓ Key cleaning steps:")
# 1. Detect and handle impossible values (age < 0, age > 120)
# 2. Impute missing values (median for income)
# 3. Standardize categorical variables (gender)

Original data (with errors):
   age   income  gender
0   25  50000.0       M
1   30  60000.0       F
2   -5      NaN       m
3  200  75000.0  Female
4   35  80000.0       M

Cleaned data:
    age   income gender
0  25.0  50000.0      M
1  30.0  60000.0      F
2   NaN  67500.0      M
3   NaN  75000.0      F
4  35.0  80000.0      M

✓ Key cleaning steps:

Key Takeaways

School Performance and Socioeconomic Factors (Case Study 13.1)

  • School performance (API) is strongly associated with socioeconomic factors, particularly parent education
  • Bivariate analysis shows ~80 API points per year of parent education
  • Multiple regression maintains a strong effect (~74 points) after controlling for meals, English learners, year-round schools, and teacher quality
  • High correlations among socioeconomic variables make it difficult to isolate individual effects
  • California’s “similar schools” index controls for socioeconomic characteristics

Cobb-Douglas Production Function and Returns to Scale (Case Study 13.2)

  • Natural logarithm transformation converts the nonlinear Cobb-Douglas model into a linear OLS form
  • Estimated capital elasticity ≈ 0.23 and labor elasticity ≈ 0.81
  • Data support constant returns to scale (elasticities sum to 1.040, F-test p = 0.636)
  • HAC-robust standard errors account for time series autocorrelation
  • Predictions require retransformation bias correction when using log models

Phillips Curve and Omitted Variables Bias (Case Study 13.3)

  • Original Phillips curve showed a negative unemployment–inflation trade-off pre-1970
  • The relationship broke down post-1970 — a classic example of model misspecification
  • Augmented Phillips curve adding expected inflation resolves the breakdown (coefficient ≈ 1)
  • The sign reversal demonstrates omitted variables bias: excluding a correlated variable biases coefficients
  • Policy implication: limited long-run unemployment–inflation trade-off

Advanced Causal Methods (Case Studies 13.4–13.8)

  • Log-log models and cluster-robust SEs (13.4): Automobile efficiency gains offset by larger vehicles; clustering by manufacturer accounts for within-group correlation
  • Randomized control trials (13.5): RAND experiment shows better insurance increases healthcare spending; random assignment eliminates selection bias
  • Difference-in-differences (13.6): South Africa clinic access improved child health; key assumption is parallel trends
  • Regression discontinuity (13.7): Political incumbency provides ~5–7% vote share advantage; exploits sharp cutoffs
  • Instrumental variables (13.8): Better institutions causally increase GDP; instruments address endogeneity

Data Preparation and Practical Workflow (Case Study 13.9)

  • Real data analysis requires extensive data carpentry — often 60–80% of project time
  • Reading data from multiple formats: .csv, .xlsx, .dta, .sav
  • Merging, recoding, cleaning, and validation are essential before regression analysis

General Lessons from Multiple Regression Case Studies

  • Multiple regression isolates partial effects while controlling for other variables
  • Choice of standard errors is critical: HC1 (cross-section), cluster-robust (grouped), HAC (time series)
  • Log transformations enable estimation of elasticities and nonlinear relationships
  • Omitted variables bias can reverse coefficient signs and lead to incorrect conclusions
  • Causal inference requires additional identification strategies beyond OLS

Python Libraries and Code:

This single code block reproduces the core workflow of Chapter 13. It is self-contained — copy it into an empty notebook and run it to review six case studies spanning production functions, omitted variables bias, randomized experiments, difference-in-differences, regression discontinuity, and instrumental variables.

# =============================================================================
# CHAPTER 13 CHEAT SHEET: Case Studies for Multiple Regression
# =============================================================================

# --- Libraries ---
import numpy as np                        # numerical operations and log transforms
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
import pyfixest as pf                     # fast OLS/IV estimation
from scipy import stats                   # F-distribution for hypothesis tests
# !pip install pyfixest

# --- Data source ---
URL = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/"

# =============================================================================
# STEP 1: Cobb-Douglas Production Function — log-linearize and estimate
# =============================================================================
# Taking logs converts Q = A*K^alpha*L^beta into a linear model estimable by OLS
data_cobb = pd.read_stata(URL + "AED_COBBDOUGLAS.DTA")
data_cobb['lnq'] = np.log(data_cobb['q'])
data_cobb['lnk'] = np.log(data_cobb['k'])
data_cobb['lnl'] = np.log(data_cobb['l'])

# OLS with HAC standard errors (time series: autocorrelation + heteroskedasticity)
data_cobb['_time'] = range(len(data_cobb))
fit_cobb = pf.feols('lnq ~ lnk + lnl', data=data_cobb,
                    vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 3})

alpha = fit_cobb.coef()['lnk']   # capital elasticity
beta  = fit_cobb.coef()['lnl']   # labor elasticity

print(f"Capital elasticity α = {alpha:.3f}, Labor elasticity β = {beta:.3f}")
print(f"Sum α + β = {alpha + beta:.3f}  (≈1 → constant returns to scale)")
print(f"R² = {fit_cobb._r2:.4f}")
fit_cobb.summary()

# F-test for constant returns to scale: H0: α + β = 1
data_cobb['lnq_l'] = data_cobb['lnq'] - data_cobb['lnl']
data_cobb['lnk_l'] = data_cobb['lnk'] - data_cobb['lnl']
fit_r = pf.feols('lnq_l ~ lnk_l', data=data_cobb)
rss_unr = np.sum(fit_cobb._u_hat**2)
rss_r = np.sum(fit_r._u_hat**2)
df_res = fit_cobb._N - len(fit_cobb.coef())
f_stat = ((rss_r - rss_unr) / 1) / (rss_unr / df_res)
p_val  = 1 - stats.f.cdf(f_stat, 1, df_res)
print(f"CRS test: F = {f_stat:.2f}, p = {p_val:.3f}{'Fail to reject' if p_val > 0.05 else 'Reject'} CRS")

# =============================================================================
# STEP 2: Phillips Curve — omitted variables bias reverses the sign
# =============================================================================
# Pre-1970 the trade-off works; post-1970 it breaks because expected inflation
# is omitted — a textbook demonstration of OVB
data_phil = pd.read_stata(URL + "AED_PHILLIPS.DTA")

# Pre-1970: classic negative relationship
pre = data_phil[data_phil['year'] < 1970]
pre = pre.copy()
pre['_time'] = range(len(pre))
m_pre = pf.feols('inflgdp ~ urate', data=pre,
                 vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 3})
print(f"\nPre-1970 slope: {m_pre.coef()['urate']:.3f}  (negative → classic Phillips curve)")

# Post-1970: sign flips due to omitted expected inflation
post = data_phil[data_phil['year'] >= 1970]
post = post.copy()
post['_time'] = range(len(post))
m_post = pf.feols('inflgdp ~ urate', data=post,
                  vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 5})
print(f"Post-1970 slope: {m_post.coef()['urate']:.3f}  (positive → breakdown!)")

# Augmented model: adding expected inflation restores the negative sign
post_exp = post.dropna(subset=['inflgdp1yr'])
post_exp = post_exp.copy()
post_exp['_time'] = range(len(post_exp))
m_aug = pf.feols('inflgdp ~ urate + inflgdp1yr', data=post_exp,
                 vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': 5})
print(f"Augmented slope on urate: {m_aug.coef()['urate']:.3f}  (negative again!)")
print(f"Expected inflation coef:  {m_aug.coef()['inflgdp1yr']:.3f}")

# OVB formula: E[b2] = β2 + β3*γ
m_aux = pf.feols('inflgdp1yr ~ urate', data=post_exp)
predicted = m_aug.coef()['urate'] + m_aug.coef()['inflgdp1yr'] * m_aux.coef()['urate']
print(f"OVB predicted bivariate slope: {predicted:.3f}  (actual: {m_post.coef()['urate']:.3f})")

# =============================================================================
# STEP 3: RAND Health Insurance Experiment — RCT as the gold standard
# =============================================================================
# Random assignment to insurance plans eliminates selection bias
data_rand = pd.read_stata(URL + "AED_HEALTHINSEXP.DTA")
data_rand_y1 = data_rand[data_rand['year'] == 1]

# Mean spending by plan
print(f"\n{'Plan':<12} {'Mean $':>10} {'N':>8}")
for plan, grp in data_rand_y1.groupby('plan')['spending']:
    print(f"{plan:<12} {grp.mean():>10,.0f} {len(grp):>8,}")

# Regression with cluster-robust SEs by family
fit_rct = pf.feols('spending ~ coins25 + coins50 + coins95 + coinsmixed + coinsindiv',
                data=data_rand_y1, vcov={'CRV1': 'idfamily'})
fit_rct.summary()

# =============================================================================
# STEP 4: Difference-in-Differences — health clinic access in South Africa
# =============================================================================
# DiD removes both time-invariant confounders and common time trends
data_did = pd.read_stata(URL + "AED_HEALTHACCESS.DTA")

# Manual DiD calculation
pre_c  = data_did[(data_did['hightreat']==0) & (data_did['post']==0)]['waz'].mean()
post_c = data_did[(data_did['hightreat']==0) & (data_did['post']==1)]['waz'].mean()
pre_t  = data_did[(data_did['hightreat']==1) & (data_did['post']==0)]['waz'].mean()
post_t = data_did[(data_did['hightreat']==1) & (data_did['post']==1)]['waz'].mean()
did    = (post_t - pre_t) - (post_c - pre_c)
print(f"\nDiD estimate (manual): {did:.3f} SD improvement in child nutrition")

# DiD regression with cluster-robust SEs by community
fit_did = pf.feols('waz ~ hightreat + post + postXhigh', data=data_did, vcov={'CRV1': 'idcommunity'})
print(f"DiD coefficient (regression): {fit_did.coef()['postXhigh']:.3f}")
print(f"p-value: {fit_did.pvalue()['postXhigh']:.4f}")

# =============================================================================
# STEP 5: Regression Discontinuity — incumbency advantage in U.S. Senate
# =============================================================================
# Candidates who barely win vs barely lose are quasi-randomly assigned to
# incumbent status — the jump at margin = 0 is the causal effect
data_rd = pd.read_stata(URL + "AED_INCUMBENCY.DTA")
data_rd = data_rd[data_rd['vote'].notna()].copy()

fit_rd = pf.feols('vote ~ win + margin', data=data_rd, vcov='HC1')
print(f"\nIncumbency advantage: {fit_rd.coef()['win']:.1f} percentage points")
ci = fit_rd.confint()
print(f"95% CI: [{ci.loc['win', '2.5%']:.1f}, {ci.loc['win', '97.5%']:.1f}]")
print(f"p-value: {fit_rd.pvalue()['win']:.4f}")

# =============================================================================
# STEP 6: Instrumental Variables — do institutions cause growth?
# =============================================================================
# OLS is biased by reverse causation; settler mortality instruments for
# modern institutions (relevant + exogenous to modern GDP)
data_iv = pd.read_stata(URL + "AED_INSTITUTIONS.DTA")

# OLS (biased)
m_ols = pf.feols('logpgp95 ~ avexpr', data=data_iv, vcov='HC1')

# IV/2SLS: y ~ exog | endog ~ instrument
m_iv = pf.feols('logpgp95 ~ 1 | avexpr ~ logem4', data=data_iv, vcov='HC1')

# First-stage F
m_1st = pf.feols('avexpr ~ logem4', data=data_iv, vcov='HC1')
first_f = m_1st.tstat()['logem4']**2
print(f"\nFirst-stage F = {first_f:.1f}  ({'Strong' if first_f > 10 else 'Weak'} instrument)")

print(f"OLS coefficient:  {m_ols.coef()['avexpr']:.3f}  (biased)")
print(f"IV/2SLS coefficient: {m_iv.coef()['avexpr']:.3f}  (causal)")
print(f"Causal effect: 1-unit improvement in institutions → "
      f"{np.exp(m_iv.coef()['avexpr']):.1f}x increase in GDP")

Try it yourself! Copy this code into an empty Google Colab notebook and run it: Open Colab


Next Steps:

  • Chapter 14: Regression with Indicator Variables
  • Chapter 15: Regression with Transformed Variables

Congratulations! You’ve completed nine comprehensive case studies demonstrating the breadth and power of multiple regression analysis — from school performance to macroeconomic policy to causal inference!

Common Mistakes to Avoid

  • Cherry-picking significant results: Report all specifications, not just the ones that “work”
  • Ignoring the direction of bias from omitted variables: Think through the sign of the bias
  • Not considering measurement error: Proxy variables introduce attenuation bias

Practice Exercises

Exercise 1: Interpreting Multiple Regression Coefficients

A regression of school test scores on parent education yields a coefficient of 80. When meals (% free lunch), English learners, and teacher quality are added, the parent education coefficient drops to 74.

(a) Why does the coefficient change when additional variables are included?

(b) Does the change mean parent education is less important than initially thought? Explain.

(c) What does the remaining coefficient of 74 represent in terms of partial effects?


Exercise 2: Log Transformation and Elasticities

A Cobb-Douglas production function is estimated as: \(\ln Q = 1.45 + 0.23 \ln K + 0.81 \ln L\)

(a) Interpret the coefficient 0.23 in terms of percentage changes.

(b) Interpret the coefficient 0.81 in terms of percentage changes.

(c) Do these results suggest constant, increasing, or decreasing returns to scale? Conduct a formal test.


Exercise 3: Omitted Variables Bias

A researcher estimates the effect of unemployment on inflation and finds a positive coefficient. When expected inflation is added, the unemployment coefficient becomes negative.

(a) Using the OVB formula, explain why the coefficient changed sign.

(b) What are the two conditions required for omitted variables bias to occur?

(c) Is the augmented model more reliable? Why or why not?


Exercise 4: Choosing Standard Error Types

For each scenario, identify the appropriate type of standard errors and explain why:

(a) Cross-sectional data on 500 firms with varying sizes (potential heteroskedasticity)

(b) Panel data with 50 schools observed over 10 years (observations clustered by school)

(c) Time series data on quarterly GDP growth for 30 years (potential serial correlation)


Exercise 5: Matching Causal Inference Methods

Match each research scenario with the most appropriate causal inference method (RCT, DiD, RD, IV) and state the key identifying assumption:

(a) A new drug is tested by randomly assigning patients to treatment and placebo groups

(b) A scholarship is awarded to students scoring above 80 on an entrance exam

(c) A new minimum wage law is implemented in one state but not its neighbor

(d) Historical settler mortality is used to explain current institutional quality


Exercise 6: Data Preparation Checklist

You receive a dataset containing country-level GDP, population, and education data from three different sources in .csv, .xlsx, and .dta formats.

(a) List the steps you would take to prepare this data for regression analysis.

(b) What potential issues should you check for when merging datasets from different sources?

(c) Why is data validation important before running regressions?