Chapter 12 of 18 · Interactive Dashboard

Additional Topics in Multiple Regression

Explore robust standard errors, prediction intervals, Type I/II errors, autocorrelation, and bootstrap methods with real house-price and GDP data.

Robust Standard Error Selector

Default OLS standard errors assume error variance is constant — but real data rarely obliges. When homoskedasticity fails, your t-statistics and p-values lie by construction.

When error variance is not constant, default OLS standard errors are invalid. Heteroskedastic-robust (HC1) SEs correct this without changing the coefficient estimates themselves. Only the SEs, t-statistics, and confidence intervals change. For cross-sectional data, reporting HC1 robust SEs is considered modern best practice — they cost nothing when homoskedasticity holds and protect you when it fails.
What you can do here
  • Toggle between Default and HC1-Robust SEs — watch SEs, t-stats, and p-values change while the coefficient stays fixed.
  • Select each variable in turn to see where heteroskedasticity bites hardest.
  • Check the SE ratio — ratios far from 1 flag heteroskedasticity.
Coefficient
Std. Error
t-statistic
p-value
SE Ratio
Try this
  1. Select bathrooms and toggle Default vs HC1-Robust. The SE ratio is > 1.2. Heteroskedasticity clearly inflates this variable's uncertainty — the default SE was too optimistic.
  2. Select lotsize and compare. The robust SE is smaller than the default (ratio < 1). Unusual but legitimate — robust SEs can tighten CIs as well as widen them.
  3. Scan every variable for a p-value significance flip. None flip at 5% here. Small dataset, modest heteroskedasticity — but on larger cross-sections, robust SEs can overturn conclusions.

Take-away: Coefficients are the same; only inference changes — always report robust SEs alongside defaults on cross-sectional data. Read §12.2 in the chapter →

Prediction Interval Visualizer

Predicting the average 2,000-sqft house is easy. Predicting a specific 2,000-sqft house is brutal — even with unlimited data, some uncertainty never goes away.

Predicting the average outcome E[y|x*] is more precise than predicting an individual y|x*. The forecast variance equals the conditional-mean variance plus σ²: Var(ŷf) = Var(ŷcm) + σ². As n → ∞ the conditional-mean SE shrinks to zero, but the forecast SE remains at least se — the RMSE. This is a fundamental limit on individual predictions: 95% PIs are at least ±1.96 × se wide, no matter how much data you collect.
What you can do here
  • Slide the "Predict at size" slider from 1,200 to 3,500 sqft.
  • Toggle "Both" / "CI Only" / "PI Only" to separate mean-prediction uncertainty from individual-prediction uncertainty.
  • Watch the PI / CI width ratio — it reveals how much of the total uncertainty is irreducible.
Predicted price
CI SE
PI SE
PI / CI width
RMSE
Try this
  1. Set size to the sample mean (~1,883 sqft). The CI is at its narrowest here. Drag to 3,500 sqft and both bands widen. Predictions are most precise near the center of the data; extrapolating to the edges costs you precision both in the mean and in the individual.
  2. Toggle CI Only, then PI Only. The PI dwarfs the CI. Most of the uncertainty in predicting an individual house is irreducible — the σ that no amount of data can eliminate.
  3. Watch the PI / CI width ratio. Largest near the mean, smallest at the extremes. At the data's center, σ dominates; at the edges, estimation uncertainty catches up — but never overtakes σ.

Take-away: Predicting a conditional mean gets arbitrarily precise with more data; predicting an individual outcome does not — a built-in ceiling on forecasting any specific y. Read §12.3 in the chapter →

Type I/II Error Explorer

A test can fail two ways: cry wolf when nothing's there, or miss a real effect. How do sample size, noise, and α each shift that balance?

Type I error is a false positive (rejecting a true H0); its probability equals α. Type II error is a false negative (failing to reject a false H0). Power = 1 − P(Type II) measures the ability to detect true effects. The most powerful test for a given size uses the most precise estimator — one more reason efficient estimation matters beyond the point estimate itself.
What you can do here
  • Slide the sample size n — larger n tightens SE(β̂) and shifts the whole power curve upward.
  • Slide the noise σ — more noise flattens the curve.
  • Slide the α level — higher α gives more power but more false positives.
SE(β̂)
80% power at |β|
Type I rate (α)
Power at β = SE
Try this
  1. Keep defaults (n = 30, σ = 15), then double n to 60. The minimum detectable effect at 80% power drops by ~30%. Larger samples shrink SE(β̂) — the whole power curve shifts upward for free.
  2. Increase α from 0.05 to 0.10. Power rises everywhere, but false positives rise with it. The fundamental size-power tradeoff: you can only buy detection with tolerance for mistakes.
  3. Set σ = 5 (low noise). The power curve becomes sharply steep. Precision beats sample size for detection — even a small effect pops once σ is small enough.

Take-away: Power depends on α, σ, and n together — and the most efficient estimator pushes the whole curve up, increasing your ability to detect real effects. Read §12.7 in the chapter →

Autocorrelation Visualizer

GDP growth today looks a lot like growth last quarter. How much, exactly? And when do those correlations finally die out?

In time series data, errors are often autocorrelated — today's shock persists into tomorrow. HAC (heteroskedasticity and autocorrelation consistent) standard errors, also called Newey-West SEs, account for both forms of dependence. The lag length m must be specified; a common rule of thumb is m = 0.75 × T1/3. For T = 241 quarters, m ≈ 5.
What you can do here
  • Slide max lags displayed from 4 to 20.
  • Toggle Time Series + ACF, Time Series only, or ACF only.
  • Compare ρ(1), ρ(2), … to the 95% band to see which lags are statistically distinguishable from zero.
Mean growth
ρ(1)
ρ(2)
95% band ±
Try this
  1. Look at lags 1–3 in the ACF. All well above the 95% band. GDP growth has strong persistence — today's shock spills into the next several quarters.
  2. Increase max lags to 20. By lag 6–7 autocorrelations cross zero and turn slightly negative. Shocks eventually mean-revert — the ACF both measures persistence and tells you when it ends.
  3. In the time series panel, scan for long runs of above- or below-average quarters. They cluster visibly. That visual clustering is autocorrelation's fingerprint in the raw data.

Take-away: When lag-1 autocorrelation is large, default SEs underestimate uncertainty — HAC (Newey-West) SEs are the standard fix. Read §12.2 in the chapter →

SE Ratio Comparator

One number tells you whether heteroskedasticity matters for each coefficient: robust SE / default SE. Is it close to 1 — or far off?

The ratio robust-SE / default-SE summarises how badly heteroskedasticity affects each coefficient. Ratio > 1 means the default SE was too optimistic — heteroskedasticity inflates uncertainty. Ratio < 1 means the default was too pessimistic (less common). Ratio ≈ 1 means both methods agree. A substantial deviation (say > 30%) flags heteroskedasticity worth correcting for.
What you can do here
  • Toggle Sort by Ratio to see which variables are most affected by heteroskedasticity.
  • Toggle Alphabetical to find a specific predictor quickly.
  • Compare to the reference line at 1.0 — the break-even between inflation and deflation.
Max ratio
Min ratio
Mean ratio
Vars > 1.0
Try this
  1. Sort by Ratio. The variables at the tops of the bars are the most heteroskedasticity-affected. Bathrooms typically leads; lotsize often sits below 1.0 (robust SE is actually smaller).
  2. Locate the 1.0 reference line. Variables to its right had inflated default SEs; those to its left had deflated ones. Heteroskedasticity doesn't just blow up SEs — it can shrink them too, always unpredictably.
  3. Remember that the coefficients are identical in both columns. Only inference changes — robust SEs reshuffle confidence, not point estimates.

Take-away: When the SE ratio strays from 1 for any coefficient, robust SEs earn their keep — and the biggest ratio tells you which variable cares most about heteroskedasticity. Read §12.2 in the chapter →

Bootstrap Confidence Intervals

What if normal-based CIs don't fit your data? The bootstrap builds a CI empirically — by resampling the 29 houses and watching where the slope lands.

The bootstrap resamples the original data with replacement many times to estimate the sampling distribution of a statistic. Bootstrap CIs don't rely on normality or large-sample approximations, making them especially useful with small samples, skewed distributions, or non-standard estimators where analytical formulas aren't available. A 95% percentile CI is just the 2.5th and 97.5th percentiles of the bootstrap estimates.
What you can do here
  • Slide "Reps per click" to set batch size (50–500).
  • Click Run Bootstrap to accumulate more resamples and smooth the histogram.
  • Click Clear All to start over.
  • Compare the Bootstrap SE to the OLS analytical SE — the two should converge.
OLS slope
Boot mean
Boot SE
95% CI lower
95% CI upper
Total reps
0
Try this
  1. Click Run Bootstrap once (200 reps). The histogram is rough. Click again — smoother. After 1,000+ reps the CI stabilises. Bootstrap precision is controlled by B, not by n — more reps buy more stable intervals at constant data.
  2. Compare the Bootstrap SE to the OLS analytical SE. They should be close. When they agree, the normal-based SE is validated empirically — when they disagree, trust the bootstrap.
  3. Inspect the histogram for skew. Slightly asymmetric even with a large B. This is why the bootstrap percentile CI can outperform normal-approximation CIs in small or skewed samples.

Take-away: The bootstrap builds a sampling distribution from the data itself — no normality, no formulas, no asymptotics; just replication. Read §12.6 in the chapter →

Code Summary

You've explored the key concepts interactively — now reproduce them in code. These self-contained blocks cover everything you practiced above. Pick your language, copy the code, and run it.

# =============================================================================
# CHAPTER 12 CHEAT SHEET: Further Topics in Multiple Regression
# =============================================================================

# --- Libraries ---
import pandas as pd                       # data loading and manipulation
import numpy as np                        # numerical operations
import matplotlib.pyplot as plt           # creating plots and visualizations
import pyfixest as pf                     # OLS regression with R-style formulas
# !pip install pyfixest                   # uncomment if running in Google Colab
import statsmodels.api as sm              # prediction tools (get_prediction)
from scipy import stats                   # statistical distributions for CIs and power
from statsmodels.graphics.tsaplots import plot_acf  # autocorrelation plots

# =============================================================================
# STEP 1: Load house price data from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files directly from the web
url_house = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url_house)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")
print(data_house[['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']].describe().round(2))

# =============================================================================
# STEP 2: OLS regression with default standard errors
# =============================================================================
# Default SEs assume homoskedasticity (constant error variance)
fit_default = pf.feols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                       data=data_house)

print(f"\nR-squared: {fit_default._r2:.4f}")
print(f"Size effect (default): ${fit_default.coef()['size']:,.2f} per sq ft")
fit_default.summary()

# =============================================================================
# STEP 3: Robust (HC1) standard errors — compare with default
# =============================================================================
# HC1 SEs correct for heteroskedasticity without changing coefficient estimates
# Only the SEs, t-statistics, and confidence intervals change
fit_robust = pf.feols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                      data=data_house, vcov='HC1')

# SE ratio: how much heteroskedasticity affects each coefficient's uncertainty
print(f"{'Variable':<14} {'Coef':>10} {'Default SE':>12} {'Robust SE':>12} {'Ratio':>8}")
print("-" * 58)
for var in fit_default.coef().index:
    coef  = fit_default.coef()[var]
    se_d  = fit_default.se()[var]
    se_r  = fit_robust.se()[var]
    ratio = se_r / se_d                   # ratio > 1 → default was too optimistic
    print(f"{var:<14} {coef:>10.2f} {se_d:>12.2f} {se_r:>12.2f} {ratio:>8.3f}")

# =============================================================================
# STEP 4: Prediction — conditional mean CI vs. individual forecast PI
# =============================================================================
# Predicting E[y|x*] is more precise than predicting an individual y|x*
fit_simple = pf.feols('price ~ size', data=data_house)

# Predict for a 2000 sq ft house — use statsmodels get_prediction for CI/PI
# pyfixest provides point predictions; statsmodels computes interval bands
sm_simple = sm.OLS.from_formula('price ~ size', data=data_house).fit()
new_house = pd.DataFrame({'size': [2000]})
pred = sm_simple.get_prediction(new_house)
pred_frame = pred.summary_frame(alpha=0.05)

s_e = np.sqrt(np.sum(fit_simple._u_hat**2) / fit_simple._N)  # RMSE

print(f"\nPredicted price (2000 sq ft): ${pred_frame['mean'].values[0]:,.0f}")
print(f"95% CI for E[Y|X=2000]: [${pred_frame['mean_ci_lower'].values[0]:,.0f}, "
      f"${pred_frame['mean_ci_upper'].values[0]:,.0f}]")
print(f"95% PI for individual Y:  [${pred_frame['obs_ci_lower'].values[0]:,.0f}, "
      f"${pred_frame['obs_ci_upper'].values[0]:,.0f}]")
print(f"\nRMSE (s_e): ${s_e:,.0f} — PI can never be narrower than ±1.96 × s_e")

# Visualize: CI (narrow) vs. PI (wide)
sizes_sorted = data_house[['size']].sort_values('size')
pred_all = sm_simple.get_prediction(sizes_sorted)
pf_pred = pred_all.summary_frame(alpha=0.05)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=50, alpha=0.6, label='Actual')
ax.plot(sizes_sorted['size'], pf_pred['mean'], color='red', linewidth=2, label='Fitted line')
ax.fill_between(sizes_sorted['size'], pf_pred['mean_ci_lower'], pf_pred['mean_ci_upper'],
                color='red', alpha=0.2, label='95% CI (conditional mean)')
ax.fill_between(sizes_sorted['size'], pf_pred['obs_ci_lower'], pf_pred['obs_ci_upper'],
                color='blue', alpha=0.1, label='95% PI (individual)')
ax.set_xlabel('House Size (sq ft)')
ax.set_ylabel('House Price ($)')
ax.set_title('Confidence Interval vs. Prediction Interval')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 5: HAC (Newey-West) standard errors for time series
# =============================================================================
# Time series errors are often autocorrelated — HAC SEs account for this
url_gdp = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_REALGDPPC.DTA"
data_gdp = pd.read_stata(url_gdp)

mean_growth = data_gdp['growth'].mean()
n_gdp       = len(data_gdp)
lag_length   = int(0.75 * n_gdp**(1/3))  # rule of thumb: m = 0.75 × T^(1/3)

# Compare default SE vs. HAC SE for the mean
se_default = data_gdp['growth'].std() / np.sqrt(n_gdp)

# pyfixest NW requires a time_id column — create sequential index
data_gdp['_time'] = range(len(data_gdp))
data_gdp['_one']  = 1                    # intercept-only: regress growth on constant
fit_hac = pf.feols('growth ~ 1', data=data_gdp,
                   vcov='NW', vcov_kwargs={'time_id': '_time', 'lag': lag_length})

print(f"\nGDP Growth: mean = {mean_growth:.4f}")
print(f"Default SE (no autocorrelation): {se_default:.6f}")
print(f"HAC SE (lag = {lag_length}):            {fit_hac.se()['Intercept']:.6f}")
print(f"Ratio HAC/Default: {fit_hac.se()['Intercept'] / se_default:.3f}")

# Correlogram — visualize autocorrelation structure
fig, ax = plt.subplots(figsize=(10, 6))
plot_acf(data_gdp['growth'], lags=10, ax=ax, alpha=0.05)
ax.set_xlabel('Lag')
ax.set_ylabel('Autocorrelation')
ax.set_title('Correlogram of GDP Growth')
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Power curve — Type I vs. Type II error tradeoff
# =============================================================================
# Power = P(reject H₀ | H₀ is false) — depends on true effect size
alpha = 0.05
se_test = 15                              # hypothetical standard error
n_test  = 30
t_crit  = stats.t.ppf(1 - alpha / 2, df=n_test - 1)

# Power as a function of the true coefficient
beta_range = np.linspace(-60, 60, 500)
power = []
for beta_true in beta_range:
    ncp = beta_true / se_test             # non-centrality parameter
    # Two-sided test: reject if |t| > t_crit
    power_val = 1 - (stats.t.cdf(t_crit - ncp, df=n_test - 1)
                      - stats.t.cdf(-t_crit - ncp, df=n_test - 1))
    power.append(power_val)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(beta_range, power, linewidth=2)
ax.axhline(0.05, color='gray', linestyle='--', alpha=0.5, label='α = 0.05 (Type I)')
ax.axhline(0.80, color='green', linestyle='--', alpha=0.5, label='Power = 0.80 target')
ax.axvline(0, color='gray', linestyle='-', alpha=0.3, label='H₀: β = 0')
ax.set_xlabel('True Coefficient Value (β)')
ax.set_ylabel('Power = P(Reject H₀)')
ax.set_title('Power Curve: Larger Effects Are Easier to Detect')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 7: Bootstrap confidence intervals — no distributional assumptions
# =============================================================================
# Resample data with replacement to estimate the sampling distribution of β
np.random.seed(42)
n_boot = 1000
boot_slopes = []

for _ in range(n_boot):
    boot_sample = data_house.sample(n=len(data_house), replace=True)
    boot_fit    = pf.feols('price ~ size', data=boot_sample)
    boot_slopes.append(boot_fit.coef()['size'])

# Percentile method: 95% CI = [2.5th, 97.5th percentile]
ci_lower = np.percentile(boot_slopes, 2.5)
ci_upper = np.percentile(boot_slopes, 97.5)
ols_slope = fit_simple.coef()['size']

print(f"\nOLS slope (size): {ols_slope:.4f}")
print(f"Bootstrap 95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"Analytical 95% CI: [{fit_simple.confint().loc['size'][0]:.4f}, "
      f"{fit_simple.confint().loc['size'][1]:.4f}]")

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(boot_slopes, bins=40, edgecolor='black', alpha=0.7)
ax.axvline(ols_slope, color='red', linewidth=2, label=f'OLS estimate: {ols_slope:.4f}')
ax.axvline(ci_lower, color='green', linewidth=2, linestyle='--', label=f'2.5th pctl: {ci_lower:.4f}')
ax.axvline(ci_upper, color='green', linewidth=2, linestyle='--', label=f'97.5th pctl: {ci_upper:.4f}')
ax.set_xlabel('Bootstrap Slope Estimates')
ax.set_ylabel('Frequency')
ax.set_title('Bootstrap Distribution of Size Coefficient (1000 replications)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →
* =============================================================================
* CHAPTER 12 CHEAT SHEET: Further Topics in Multiple Regression
* =============================================================================

* --- Setup ---
clear all                                // start with a clean workspace
set more off                             // do not pause output for long results

* =============================================================================
* STEP 1: Load house price data from a URL
* =============================================================================
* use loads a Stata .dta file; "clear" drops any data already in memory
use "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA", clear

describe                                 // list all variables, types, and labels
display "Observations: " _N              // _N is Stata's built-in observation count
summarize price size bedrooms bathrooms lotsize age

* =============================================================================
* STEP 2: OLS regression with default standard errors
* =============================================================================
* Default SEs assume homoskedasticity (constant error variance)
regress price size bedrooms bathrooms lotsize age monthsold

// After regress, Stata stores results you can reference:
display "R-squared: " e(r2)
display "Size effect: " _b[size]         // _b[varname] retrieves coefficients

* =============================================================================
* STEP 3: Robust (HC1) standard errors — compare with default
* =============================================================================
* Adding , robust to regress requests HC1 heteroskedasticity-robust SEs
* Coefficient estimates stay identical; only SEs, t-stats, CIs change
regress price size bedrooms bathrooms lotsize age monthsold, robust

// estimates store saves results so you can compare models later
estimates store robust_model

// Re-run default for side-by-side comparison
quietly regress price size bedrooms bathrooms lotsize age monthsold
estimates store default_model

// estimates table displays both sets of results side by side
estimates table default_model robust_model, se stats(r2 N)

* =============================================================================
* STEP 4: Prediction — conditional mean CI vs. individual forecast PI
* =============================================================================
* Predicting E[y|x*] is more precise than predicting an individual y|x*
regress price size

// predict generates fitted values for every observation
predict yhat, xb                         // xb = linear prediction (fitted value)

// For a specific prediction at size = 2000:
// margins computes predicted values at specified covariate values
margins, at(size = 2000)                 // predicted E[Y|X=2000] with 95% CI

// For the prediction interval (individual), we need RMSE
display "RMSE (s_e): " e(rmse)          // irreducible individual uncertainty

// Visualize fitted line with CI bands
// lfitci plots the linear fit with confidence interval
twoway (scatter price size, msize(small) mcolor(gs10))            ///
       (lfitci price size, level(95) fcolor(red%20) lcolor(red)), ///
    xtitle("House Size (sq ft)") ytitle("House Price ($)")        ///
    title("Confidence Interval for the Conditional Mean")         ///
    legend(order(1 "Actual" 3 "Fitted line" 2 "95% CI"))

* =============================================================================
* STEP 5: HAC (Newey-West) standard errors for time series
* =============================================================================
* Time series errors are often autocorrelated — HAC SEs account for this
use "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_REALGDPPC.DTA", clear

summarize growth
display "Mean GDP growth: " r(mean)

// tsset declares the time-series structure
tsset daten

// newey estimates regression with Newey-West HAC standard errors
// lag(4) sets the maximum lag for the autocorrelation correction
newey growth, lag(4)
display "HAC SE: " _se[_cons]

// Compare with default OLS SE
regress growth
display "Default SE: " _se[_cons]

// Correlogram — visualize autocorrelation structure
// ac draws the autocorrelation function plot
ac growth, lags(10) level(95)            ///
    ytitle("Autocorrelation")            ///
    title("Correlogram of GDP Growth")

* =============================================================================
* STEP 6: Power curve — Type I vs. Type II error tradeoff
* =============================================================================
* Power = P(reject H0 | H0 is false) — depends on true effect size
* Stata's power command computes power for common tests

// power twomeans: power for testing whether a mean differs from a value
// power onettest: power for a one-sample t-test scenario
// Here we compute power manually for a range of effect sizes:

clear
set obs 500
gen beta_true = -60 + (_n - 1) * 240 / 499   // range -60 to 60

local se = 15                            // hypothetical standard error
local n  = 30                            // sample size
local df = `n' - 1
local t_crit = invttail(`df', 0.025)     // two-sided critical value at alpha=0.05

// Non-centrality parameter and power
gen ncp = beta_true / `se'
gen power = 1 - (t(`df', `t_crit' - ncp) - t(`df', -`t_crit' - ncp))

twoway (line power beta_true, lwidth(medthick) lcolor(blue))      ///
       (function y = 0.05, range(-60 60) lpattern(dash) lcolor(gs8)) ///
       (function y = 0.80, range(-60 60) lpattern(dash) lcolor(green)), ///
    xtitle("True Coefficient Value ({β)}")                   ///
    ytitle("Power = P(Reject H{sub:0})")                          ///
    title("Power Curve: Larger Effects Are Easier to Detect")     ///
    legend(order(1 "Power" 2 "{α} = 0.05" 3 "Target = 0.80"))

* =============================================================================
* STEP 7: Bootstrap confidence intervals — no distributional assumptions
* =============================================================================
* Resample data with replacement to estimate the sampling distribution of beta
use "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA", clear

// bootstrap prefix resamples data and re-estimates the model each time
// reps(1000) = 1000 bootstrap replications; seed(42) for reproducibility
bootstrap _b[size], reps(1000) seed(42): regress price size

// estat bootstrap displays percentile-method CIs alongside normal CIs
estat bootstrap, percentile

// Compare with analytical 95% CI from OLS
regress price size
display "OLS slope: " _b[size]
display "Analytical 95% CI: [" _b[size] - invttail(e(df_r), 0.025)*_se[size] ///
    ", " _b[size] + invttail(e(df_r), 0.025)*_se[size] "]"
Paste into your Stata do-file editor
# =============================================================================
# CHAPTER 12 CHEAT SHEET: Further Topics in Multiple Regression
# =============================================================================

# --- Libraries ---
library(haven)           # read Stata .dta files
library(fixest)          # fast OLS estimation with feols()
library(dplyr)           # data manipulation
library(ggplot2)         # grammar of graphics

# =============================================================================
# STEP 1: Load house price data from a URL
# =============================================================================
url_house <- "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house <- read_dta(url_house)

cat("Dataset:", nrow(data_house), "observations,", ncol(data_house), "variables\n")

# =============================================================================
# STEP 2: OLS regression with default standard errors
# =============================================================================
model_default <- feols(price ~ size + bedrooms + bathrooms + lotsize + age + monthsold,
                       data = data_house)
summary(model_default)

# =============================================================================
# STEP 3: Robust (HC1) standard errors — compare with default
# =============================================================================
model_robust <- feols(price ~ size + bedrooms + bathrooms + lotsize + age + monthsold,
                      data = data_house, vcov = "HC1")

etable(model_default, model_robust,
       headers = c("Default", "Robust (HC1)"))

# =============================================================================
# STEP 4: Prediction — conditional mean CI vs. individual forecast PI
# =============================================================================
model_simple <- feols(price ~ size, data = data_house)

# Use base R's lm() for predict() with interval support
model_lm <- lm(price ~ size, data = data_house)

new_house <- data.frame(size = 2000)
ci_pred <- predict(model_lm, new_house, interval = "confidence", level = 0.95)
pi_pred <- predict(model_lm, new_house, interval = "prediction", level = 0.95)

cat("\nPredicted price (2000 sq ft): $", round(ci_pred[1], 0), "\n")
cat("95% CI for E[Y|X=2000]: [$", round(ci_pred[2], 0), ", $", round(ci_pred[3], 0), "]\n")
cat("95% PI for individual Y: [$", round(pi_pred[2], 0), ", $", round(pi_pred[3], 0), "]\n")
cat("RMSE:", round(sigma(model_lm), 0), "\n")

# =============================================================================
# STEP 5: HAC (Newey-West) standard errors for time series
# =============================================================================
url_gdp <- "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_REALGDPPC.DTA"
data_gdp <- read_dta(url_gdp)

# fixest supports Newey-West via vcov = NW(L) ~ time_var
model_hac <- feols(growth ~ 1, data = data_gdp, vcov = NW(4) ~ daten)
model_ols <- feols(growth ~ 1, data = data_gdp)

cat("\nGDP Growth: mean =", round(mean(data_gdp$growth), 4), "\n")
cat("Default SE:", round(se(model_ols)["(Intercept)"], 6), "\n")
cat("HAC SE:    ", round(se(model_hac)["(Intercept)"], 6), "\n")

# Correlogram — visualize autocorrelation structure
acf(data_gdp$growth, lag.max = 10, main = "Correlogram of GDP Growth")

# =============================================================================
# STEP 6: Power curve — Type I vs. Type II error tradeoff
# =============================================================================
alpha  <- 0.05
se_test <- 15
n_test  <- 30
t_crit  <- qt(1 - alpha/2, df = n_test - 1)

beta_range <- seq(-60, 60, length.out = 500)
ncp <- beta_range / se_test
power <- 1 - (pt(t_crit - ncp, df = n_test - 1) -
              pt(-t_crit - ncp, df = n_test - 1))

ggplot(data.frame(beta = beta_range, power = power), aes(x = beta, y = power)) +
  geom_line(linewidth = 1.2, color = "blue") +
  geom_hline(yintercept = 0.05, linetype = "dashed", color = "gray") +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "green") +
  labs(x = "True Coefficient Value", y = "Power = P(Reject H0)",
       title = "Power Curve: Larger Effects Are Easier to Detect") +
  theme_minimal()

# =============================================================================
# STEP 7: Bootstrap confidence intervals — no distributional assumptions
# =============================================================================
set.seed(42)
n_boot <- 1000

boot_slopes <- replicate(n_boot, {
  idx <- sample(nrow(data_house), replace = TRUE)
  coef(feols(price ~ size, data = data_house[idx, ]))["size"]
})

# Percentile method: 95% CI = [2.5th, 97.5th percentile]
ci_boot <- quantile(boot_slopes, probs = c(0.025, 0.975))

cat("\nOLS slope (size):", round(coef(model_simple)["size"], 4), "\n")
cat("Bootstrap 95% CI: [", round(ci_boot[1], 4), ",", round(ci_boot[2], 4), "]\n")
cat("Analytical 95% CI: [", round(confint(model_simple)["size", 1], 4), ",",
    round(confint(model_simple)["size", 2], 4), "]\n")

hist(boot_slopes, breaks = 40, col = "steelblue", border = "white",
     main = "Bootstrap Distribution of Size Coefficient",
     xlab = "Bootstrap Slope Estimates")
abline(v = coef(model_simple)["size"], col = "red", lwd = 2)
abline(v = ci_boot, col = "green", lwd = 2, lty = 2)
Paste into your R console or RStudio