Chapter 12 of 18 · Interactive Dashboard

Further Topics in Multiple Regression

Explore robust standard errors, prediction intervals, Type I/II errors, autocorrelation, and bootstrap methods with real house-price and GDP data.

Robust Standard Error Selector

Default OLS standard errors assume the error variance is constant, but real data rarely obliges. When homoskedasticity fails, your t-statistics and p-values lie by construction.

When error variance is not constant, default OLS standard errors are invalid. Heteroskedasticity-robust (HC1) SEs correct this without changing the coefficient estimates themselves. Only the SEs, t-statistics, and confidence intervals change. For cross-sectional data, reporting HC1 robust SEs is considered modern best practice: they cost nothing when homoskedasticity holds and protect you when it fails.
What you can do here
  • Toggle between Default and HC1-Robust SEs: watch SEs, t-stats, and p-values change while the coefficient stays fixed.
  • Select each variable in turn to see where heteroskedasticity bites hardest.
  • Check the SE ratio: ratios far from 1 flag heteroskedasticity.
Coefficient
—
Std. Error
—
t-statistic
—
p-value
—
R²
—
SE Ratio
—
Try this
  1. Select bathrooms and toggle Default vs HC1-Robust. The SE ratio is > 1.2. Heteroskedasticity clearly inflates this variable's uncertainty; the default SE was too optimistic.
  2. Select lotsize and compare. The robust SE is smaller than the default (ratio < 1). Unusual but legitimate: robust SEs can tighten CIs as well as widen them.
  3. Scan every variable for a p-value significance flip. None flip at 5% here. Small dataset, modest heteroskedasticity; but on larger cross-sections, robust SEs can overturn conclusions.

Take-away: Coefficients are the same; only inference changes. Always report robust SEs alongside defaults on cross-sectional data. Read §12.2 in the chapter →

Prediction Interval Visualizer

Predicting the average 2,000-sqft house is easy. Predicting a specific 2,000-sqft house is brutal: even with unlimited data, some uncertainty never goes away.

Predicting the average outcome E[y|x*] is more precise than predicting an individual y|x*. The forecast variance equals the conditional-mean variance plus σ²: Var(ŷ_f) = Var(ŷ_cm) + σ². As n → ∞ the conditional-mean SE shrinks to zero, but the forecast SE remains at least s_e, the RMSE. This is a fundamental limit on individual predictions: 95% PIs are at least ±1.96 × s_e wide, no matter how much data you collect.
What you can do here
  • Slide the "Predict at size" slider from 1,200 to 3,500 sqft.
  • Toggle "Both" / "CI Only" / "PI Only" to separate mean-prediction uncertainty from individual-prediction uncertainty.
  • Watch the PI / CI width ratio: it reveals how much of the total uncertainty is irreducible.
Predicted price
—
CI SE
—
PI SE
—
PI / CI width
—
RMSE
—
Try this
  1. Set size to the sample mean (~1,883 sqft). The CI is at its narrowest here. Drag to 3,500 sqft and both bands widen. Predictions are most precise near the center of the data; extrapolating to the edges costs you precision both in the mean and in the individual.
  2. Toggle CI Only, then PI Only. The PI dwarfs the CI. Most of the uncertainty in predicting an individual house is irreducible: the σ that no amount of data can eliminate.
  3. Watch the PI / CI width ratio. Largest near the mean, smallest at the extremes. At the data's center, σ dominates; at the edges, estimation uncertainty catches up, but never overtakes σ.

Take-away: Predicting a conditional mean gets arbitrarily precise with more data; predicting an individual outcome does not. That is a built-in ceiling on forecasting any specific y. Read §12.3 in the chapter →
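The identity Var(ŷ_f) = Var(ŷ_cm) + σ² can be verified numerically by backing both SEs out of statsmodels' interval output; this sketch uses simulated data with illustrative variable names.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
df = pd.DataFrame({'size': rng.uniform(1000, 3000, 100)})
df['price'] = 50 + 0.1 * df['size'] + rng.normal(0, 20, size=100)

fit = ols('price ~ size', data=df).fit()
sf = fit.get_prediction(pd.DataFrame({'size': [2000]})).summary_frame(alpha=0.05)

# back out the SEs from the symmetric 95% intervals
t_crit = stats.t.ppf(0.975, df=fit.df_resid)
se_cm = (sf['mean_ci_upper'].iloc[0] - sf['mean'].iloc[0]) / t_crit  # CI SE
se_pi = (sf['obs_ci_upper'].iloc[0] - sf['mean'].iloc[0]) / t_crit   # PI SE
s_e = np.sqrt(fit.mse_resid)                                         # RMSE

print(se_pi**2, se_cm**2 + s_e**2)   # forecast variance = CM variance + sigma^2
```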

Type I/II Error Explorer

A test can fail two ways: cry wolf when nothing's there, or miss a real effect. How do sample size, noise, and α each shift that balance?

Type I error is a false positive (rejecting a true H0); its probability equals α. Type II error is a false negative (failing to reject a false H0). Power = 1 − P(Type II) measures the ability to detect true effects. The most powerful test for a given size uses the most precise estimator: one more reason efficient estimation matters beyond the point estimate itself.
What you can do here
  • Slide the sample size n: larger n tightens SE(β̂) and shifts the whole power curve upward.
  • Slide the noise σ: more noise flattens the curve.
  • Slide the α level: higher α gives more power but more false positives.
SE(β̂)
—
80% power at |β|
—
Type I rate (α)
—
Power at β = SE
—
Try this
  1. Keep defaults (n = 30, σ = 15), then double n to 60. The minimum detectable effect at 80% power drops by ~30%. Larger samples shrink SE(β̂); the whole power curve shifts upward for free.
  2. Increase α from 0.05 to 0.10. Power rises everywhere, but false positives rise with it. The fundamental size-power tradeoff: you can only buy detection with tolerance for mistakes.
  3. Set σ = 5 (low noise). The power curve becomes sharply steep. Precision beats sample size for detection: even a small effect pops once σ is small enough.

Take-away: Power depends on α, σ, and n together, and the most efficient estimator pushes the whole curve up, increasing your ability to detect real effects. Read §12.7 in the chapter →
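The power curve in this panel can be reproduced pointwise. This sketch uses the same normal-shift approximation to the two-sided t-test, with illustrative values se = 15 and n = 30 (df = 29).

```python
from scipy import stats

def power_two_sided(beta_true, se, alpha=0.05, df=29):
    """Approximate power of a two-sided t-test of H0: beta = 0."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=df)
    ncp = beta_true / se                  # standardized true effect
    return 1 - (stats.t.cdf(t_crit - ncp, df=df)
                - stats.t.cdf(-t_crit - ncp, df=df))

print(round(power_two_sided(0, 15), 3))   # at beta = 0, power equals alpha
print(round(power_two_sided(45, 15), 3))  # a 3-SE effect is detected most of the time
```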

Autocorrelation Visualizer

GDP growth today looks a lot like growth last quarter. How much, exactly? And when do those correlations finally die out?

In time series data, errors are often autocorrelated: today's shock persists into tomorrow. HAC (heteroskedasticity and autocorrelation consistent) standard errors, also called Newey-West SEs, account for both forms of dependence. The lag length m must be specified; a common rule of thumb is m = 0.75 × T^(1/3). For T = 241 quarters, m ≈ 5.
What you can do here
  • Slide max lags displayed from 4 to 20.
  • Toggle Time Series + ACF, Time Series only, or ACF only.
  • Compare ฯ(1), ฯ(2), โ€ฆ to the 95% band to see which lags are statistically distinguishable from zero.
Mean growth
—
ρ(1)
—
ρ(2)
—
95% band ±
—
Try this
  1. Look at lags 1–3 in the ACF. All well above the 95% band. GDP growth has strong persistence: today's shock spills into the next several quarters.
  2. Increase max lags to 20. By lag 6–7 autocorrelations cross zero and turn slightly negative. Shocks eventually mean-revert; the ACF both measures persistence and tells you when it ends.
  3. In the time series panel, scan for long runs of above- or below-average quarters. They cluster visibly. That visual clustering is autocorrelation's fingerprint in the raw data.

Take-away: When lag-1 autocorrelation is large, default SEs underestimate uncertainty; HAC (Newey-West) SEs are the standard fix. Read §12.2 in the chapter →
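The ρ(k) values and the ±1.96/√T band shown in the panel are simple to compute by hand. This sketch uses a simulated AR(1) series of T = 241 "quarters" rather than the GDP file, so the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 241
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.4 * y[t - 1] + rng.normal()  # AR(1): each shock persists into next period

def acf(series, lag):
    """Sample autocorrelation at the given lag."""
    d = series - series.mean()
    return (d[lag:] * d[:-lag]).sum() / (d ** 2).sum()

band = 1.96 / np.sqrt(T)                  # 95% band under no autocorrelation
print(acf(y, 1), acf(y, 2), band)         # rho(1) sits well above the band
```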

SE Ratio Comparator

One number tells you whether heteroskedasticity matters for each coefficient: robust SE / default SE. Is it close to 1, or far off?

The ratio robust-SE / default-SE summarises how badly heteroskedasticity affects each coefficient. Ratio > 1 means the default SE was too optimistic: heteroskedasticity inflates uncertainty. Ratio < 1 means the default was too pessimistic (less common). Ratio ≈ 1 means both methods agree. A substantial deviation (say > 30%) flags heteroskedasticity worth correcting for.
What you can do here
  • Toggle Sort by Ratio to see which variables are most affected by heteroskedasticity.
  • Toggle Alphabetical to find a specific predictor quickly.
  • Compare to the reference line at 1.0, the break-even between inflation and deflation.
Max ratio
—
Min ratio
—
Mean ratio
—
Vars > 1.0
—
Try this
  1. Sort by Ratio. The variables at the top are the most heteroskedasticity-affected. Bathrooms typically leads; lotsize often sits below 1.0 (robust SE is actually smaller).
  2. Locate the 1.0 reference line. Variables to its right had inflated default SEs; those to its left had deflated ones. Heteroskedasticity doesn't just blow up SEs; it can shrink them too, and not in a direction you can predict in advance.
  3. Remember that the coefficients are identical in both columns. Only inference changes: robust SEs reshuffle confidence, not point estimates.

Take-away: When the SE ratio strays from 1 for any coefficient, robust SEs earn their keep, and the biggest ratio tells you which variable cares most about heteroskedasticity. Read §12.2 in the chapter →
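Given the two fitted models, building the sorted ratio table is only a couple of lines. A sketch with simulated data and hypothetical variable names x1, x2:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
n = 150
df = pd.DataFrame({'x1': rng.uniform(0, 10, n), 'x2': rng.normal(0, 1, n)})
df['y'] = 1 + 2 * df['x1'] - df['x2'] + rng.normal(0, df['x1'].to_numpy())

fit_d = ols('y ~ x1 + x2', data=df).fit()                # default SEs
fit_r = ols('y ~ x1 + x2', data=df).fit(cov_type='HC1')  # robust SEs

ratios = (fit_r.bse / fit_d.bse).sort_values(ascending=False)
print(ratios)   # top of the list = coefficient most affected by heteroskedasticity
```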

Bootstrap Confidence Intervals

What if normal-based CIs don't fit your data? The bootstrap builds a CI empirically โ€” by resampling the 29 houses and watching where the slope lands.

The bootstrap resamples the original data with replacement many times to estimate the sampling distribution of a statistic. Bootstrap CIs don't rely on normality or large-sample approximations, making them especially useful with small samples, skewed distributions, or non-standard estimators where analytical formulas aren't available. A 95% percentile CI is just the 2.5th and 97.5th percentiles of the bootstrap estimates.
What you can do here
  • Slide "Reps per click" to set batch size (50โ€“500).
  • Click Run Bootstrap to accumulate more resamples and smooth the histogram.
  • Click Clear All to start over.
  • Compare the Bootstrap SE to the OLS analytical SE; the two should converge.
OLS slope
—
Boot mean
—
Boot SE
—
95% CI lower
—
95% CI upper
—
Total reps
0
Try this
  1. Click Run Bootstrap once (200 reps). The histogram is rough. Click again: smoother. After 1,000+ reps the CI stabilises. Bootstrap precision is controlled by B, not by n; more reps buy more stable intervals at constant data.
  2. Compare the Bootstrap SE to the OLS analytical SE. They should be close. When they agree, the normal-based SE is validated empirically; when they disagree, trust the bootstrap.
  3. Inspect the histogram for skew. Slightly asymmetric even with a large B. This is why the bootstrap percentile CI can outperform normal-approximation CIs in small or skewed samples.

Take-away: The bootstrap builds a sampling distribution from the data itself: no normality, no formulas, no asymptotics; just replication. Read §12.6 in the chapter →
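Stripped to essentials, the percentile bootstrap is a resampling loop plus two quantiles. A compact sketch on simulated (x, y) pairs with n = 29, mirroring the sample size above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 29
x = rng.uniform(1000, 3000, n)
y = 50 + 0.1 * x + rng.normal(0, 30, n)

def slope(x, y):
    """OLS slope of y on x."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)            # resample row indices with replacement
    boot.append(slope(x[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile-method 95% CI
print(slope(x, y), lo, hi)
```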

Python Libraries and Code

You've explored the key concepts interactively; now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 12 CHEAT SHEET: Further Topics in Multiple Regression
# =============================================================================

# --- Libraries ---
import pandas as pd                       # data loading and manipulation
import numpy as np                        # numerical operations
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas
import statsmodels.api as sm              # prediction tools (add_constant, get_prediction)
from scipy import stats                   # statistical distributions for CIs and power
from statsmodels.graphics.tsaplots import plot_acf  # autocorrelation plots

# =============================================================================
# STEP 1: Load house price data from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files directly from the web
url_house = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url_house)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")
print(data_house[['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']].describe().round(2))

# =============================================================================
# STEP 2: OLS regression with default standard errors
# =============================================================================
# Default SEs assume homoskedasticity (constant error variance)
model_default = ols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                    data=data_house).fit()

print(f"\nR-squared: {model_default.rsquared:.4f}")
print(f"Size effect (default): ${model_default.params['size']:,.2f} per sq ft")
model_default.summary()

# =============================================================================
# STEP 3: Robust (HC1) standard errors -- compare with default
# =============================================================================
# HC1 SEs correct for heteroskedasticity without changing coefficient estimates
# Only the SEs, t-statistics, and confidence intervals change
model_robust = ols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                   data=data_house).fit(cov_type='HC1')

# SE ratio: how much heteroskedasticity affects each coefficient's uncertainty
print(f"{'Variable':<14} {'Coef':>10} {'Default SE':>12} {'Robust SE':>12} {'Ratio':>8}")
print("-" * 58)
for var in model_default.params.index:
    coef  = model_default.params[var]
    se_d  = model_default.bse[var]
    se_r  = model_robust.bse[var]
    ratio = se_r / se_d                   # ratio > 1 -> default was too optimistic
    print(f"{var:<14} {coef:>10.2f} {se_d:>12.2f} {se_r:>12.2f} {ratio:>8.3f}")

# =============================================================================
# STEP 4: Prediction -- conditional mean CI vs. individual forecast PI
# =============================================================================
# Predicting E[y|x*] is more precise than predicting an individual y|x*
model_simple = ols('price ~ size', data=data_house).fit()

# Predict for a 2000 sq ft house
new_house = pd.DataFrame({'size': [2000]})
pred = model_simple.get_prediction(new_house)
pred_frame = pred.summary_frame(alpha=0.05)

s_e = np.sqrt(model_simple.mse_resid)    # RMSE -- irreducible individual uncertainty

print(f"\nPredicted price (2000 sq ft): ${pred_frame['mean'].values[0]:,.0f}")
print(f"95% CI for E[Y|X=2000]: [${pred_frame['mean_ci_lower'].values[0]:,.0f}, "
      f"${pred_frame['mean_ci_upper'].values[0]:,.0f}]")
print(f"95% PI for individual Y:  [${pred_frame['obs_ci_lower'].values[0]:,.0f}, "
      f"${pred_frame['obs_ci_upper'].values[0]:,.0f}]")
print(f"\nRMSE (s_e): ${s_e:,.0f} -- PI can never be narrower than ±1.96 × s_e")

# Visualize: CI (narrow) vs. PI (wide)
sizes_sorted = data_house[['size']].sort_values('size')
pred_all = model_simple.get_prediction(sizes_sorted)
pf = pred_all.summary_frame(alpha=0.05)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=50, alpha=0.6, label='Actual')
ax.plot(sizes_sorted['size'], pf['mean'], color='red', linewidth=2, label='Fitted line')
ax.fill_between(sizes_sorted['size'], pf['mean_ci_lower'], pf['mean_ci_upper'],
                color='red', alpha=0.2, label='95% CI (conditional mean)')
ax.fill_between(sizes_sorted['size'], pf['obs_ci_lower'], pf['obs_ci_upper'],
                color='blue', alpha=0.1, label='95% PI (individual)')
ax.set_xlabel('House Size (sq ft)')
ax.set_ylabel('House Price ($)')
ax.set_title('Confidence Interval vs. Prediction Interval')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 5: HAC (Newey-West) standard errors for time series
# =============================================================================
# Time series errors are often autocorrelated -- HAC SEs account for this
url_gdp = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_REALGDPPC.DTA"
data_gdp = pd.read_stata(url_gdp)

mean_growth = data_gdp['growth'].mean()
n_gdp       = len(data_gdp)
lag_length  = round(0.75 * n_gdp**(1/3))  # rule of thumb: m = 0.75 * T^(1/3), here m = 5

# Compare default SE vs. HAC SE for the mean
se_default = data_gdp['growth'].std() / np.sqrt(n_gdp)
y = data_gdp['growth']
X = np.ones(n_gdp)
from statsmodels.regression.linear_model import OLS
model_hac = OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': lag_length})

print(f"\nGDP Growth: mean = {mean_growth:.4f}")
print(f"Default SE (no autocorrelation): {se_default:.6f}")
print(f"HAC SE (lag = {lag_length}):            {model_hac.bse[0]:.6f}")
print(f"Ratio HAC/Default: {model_hac.bse[0] / se_default:.3f}")

# Correlogram โ€” visualize autocorrelation structure
fig, ax = plt.subplots(figsize=(10, 6))
plot_acf(data_gdp['growth'], lags=10, ax=ax, alpha=0.05)
ax.set_xlabel('Lag')
ax.set_ylabel('Autocorrelation')
ax.set_title('Correlogram of GDP Growth')
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Power curve -- Type I vs. Type II error tradeoff
# =============================================================================
# Power = P(reject H0 | H0 is false) -- depends on true effect size
alpha = 0.05
se_test = 15                              # hypothetical standard error
n_test  = 30
t_crit  = stats.t.ppf(1 - alpha / 2, df=n_test - 1)

# Power as a function of the true coefficient
beta_range = np.linspace(-60, 60, 500)
power = []
for beta_true in beta_range:
    ncp = beta_true / se_test             # non-centrality parameter
    # Two-sided test: reject if |t| > t_crit
    power_val = 1 - (stats.t.cdf(t_crit - ncp, df=n_test - 1)
                      - stats.t.cdf(-t_crit - ncp, df=n_test - 1))
    power.append(power_val)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(beta_range, power, linewidth=2)
ax.axhline(0.05, color='gray', linestyle='--', alpha=0.5, label='α = 0.05 (Type I)')
ax.axhline(0.80, color='green', linestyle='--', alpha=0.5, label='Power = 0.80 target')
ax.axvline(0, color='gray', linestyle='-', alpha=0.3, label='H₀: β = 0')
ax.set_xlabel('True Coefficient Value (β)')
ax.set_ylabel('Power = P(Reject H₀)')
ax.set_title('Power Curve: Larger Effects Are Easier to Detect')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 7: Bootstrap confidence intervals -- no distributional assumptions
# =============================================================================
# Resample data with replacement to estimate the sampling distribution of β
np.random.seed(42)
n_boot = 1000
boot_slopes = []

for _ in range(n_boot):
    boot_sample = data_house.sample(n=len(data_house), replace=True)
    boot_model  = ols('price ~ size', data=boot_sample).fit()
    boot_slopes.append(boot_model.params['size'])

# Percentile method: 95% CI = [2.5th, 97.5th percentile]
ci_lower = np.percentile(boot_slopes, 2.5)
ci_upper = np.percentile(boot_slopes, 97.5)
ols_slope = model_simple.params['size']

print(f"\nOLS slope (size): {ols_slope:.4f}")
print(f"Bootstrap 95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"Analytical 95% CI: [{model_simple.conf_int().loc['size'][0]:.4f}, "
      f"{model_simple.conf_int().loc['size'][1]:.4f}]")

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(boot_slopes, bins=40, edgecolor='black', alpha=0.7)
ax.axvline(ols_slope, color='red', linewidth=2, label=f'OLS estimate: {ols_slope:.4f}')
ax.axvline(ci_lower, color='green', linewidth=2, linestyle='--', label=f'2.5th pctl: {ci_lower:.4f}')
ax.axvline(ci_upper, color='green', linewidth=2, linestyle='--', label=f'97.5th pctl: {ci_upper:.4f}')
ax.set_xlabel('Bootstrap Slope Estimates')
ax.set_ylabel('Frequency')
ax.set_title('Bootstrap Distribution of Size Coefficient (1000 replications)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →