Chapter 11 of 18 · Interactive Dashboard

Statistical Inference for Multiple Regression

Explore confidence intervals, hypothesis tests, multicollinearity, F-tests, and robust standard errors — using the Davis housing dataset from the book.

Confidence Interval Explorer

A regression gives you a single coefficient, but the range of plausible values is where significance actually lives. Does the CI clear zero?

A 95% confidence interval b_j ± t_(n−k, 0.025) × se(b_j) gives a range of plausible values for β_j. If the interval excludes zero, the coefficient is statistically significant at the 5% level. Narrower intervals mean more precise estimation, which improves with larger samples and lower multicollinearity. Raising the confidence level buys more certainty at the cost of a wider, less precise interval.
What you can do here
  • Pick a variable from the dropdown — any of the six regressors in the full model.
  • Slide the confidence level from 80% to 99% and watch the CI widen.
  • Check the "Contains 0?" stat card — the quickest test of significance.
[Interactive stat cards — confidence level slider (default 95%): Coefficient · SE · t-critical · CI Lower · CI Upper · Contains 0?]
Try This
  1. Pick size at 95%. The CI is [$36.44, $100.30] and excludes zero. Size is significant at the 5% level — the effect of size on price is unmistakable even in a multicollinear model.
  2. Switch to bedrooms at 95%, then push to 90%. The CI still includes zero at both levels. Bedrooms cannot clear the bar even when you lower it — multicollinearity has swallowed any signal.
  3. Set 99% on the size variable. The CI widens but still excludes zero. Robust significance — the effect is real at any conventional confidence level.

Take-away: A CI that excludes zero is the quickest significance test — and its width tells you how confident "significant" actually is. Read §11.3 in the chapter →
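The interval above can be reproduced in a few lines of scipy. A minimal sketch using the dashboard's reported values (SE ≈ $15.39, df = 22; the point estimate ≈ $68.37 is inferred from the CI midpoint, so treat it as approximate):

```python
from scipy import stats

# Values reported in the dashboard (point estimate inferred from the CI midpoint)
b_size, se_size, df = 68.37, 15.39, 22

t_crit = stats.t.ppf(0.975, df)            # two-sided 95% critical value
lower = b_size - t_crit * se_size
upper = b_size + t_crit * se_size
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")  # ≈ [36.44, 100.30] → excludes zero
```

Changing `0.975` to `0.995` reproduces the 99% interval from step 3 of "Try This".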

Hypothesis Test for a Single Coefficient

Not just "is it significant?" — test any claim about the size coefficient: is it exactly $68, less than $90, more than $30? Pick a null, get an answer.

The most common hypothesis test examines H_0: β_j = 0 — whether variable x_j has any partial effect on y. The t-statistic t = (b_j − β_0) / se(b_j) measures how many standard errors the estimate is from the hypothesized value. Reject H_0 when |t| > t_critical, or equivalently when the p-value < α. Generalizing to any β_0 lets you test any substantive claim about the coefficient.
What you can do here
  • Slide the hypothesized value β0 from $0 to $120/sqft.
  • Watch the t-statistic, p-value, and decision badge update live.
  • Find the β0 where p = 0.05 — that value is the CI endpoint.
[Interactive stat cards — β₀ slider (default $0/sqft): b_size · SE · t-statistic · p-value · decision badge]
Try This
  1. Set β0 = 0 (the standard significance test). t = 4.44, p < 0.001, REJECT. Size has a non-zero partial effect on price — and the triangle is deep in the tail.
  2. Drag β0 toward 68 (the estimate itself). t → 0 and p → 1.0. We can never reject that β equals its own sample estimate — an obvious sanity check.
  3. Find the β0 where p first crosses 0.05. That value is exactly the lower (~$36) or upper (~$100) CI endpoint. Tests and CIs are two views of the same object.

Take-away: A t-test against any β0 is just a CI read from the side — the values you'd reject are precisely those outside the interval. Read §11.4 in the chapter →
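The slider's computation is just the t formula evaluated at whatever β₀ you pick. A sketch using the dashboard's reported values (SE ≈ $15.39, df = 22, estimate ≈ $68.37 inferred from the CI midpoint):

```python
from scipy import stats

b_size, se_size, df = 68.37, 15.39, 22   # dashboard values (estimate inferred)

def t_test(beta0):
    """Two-sided t-test of H0: beta_size = beta0."""
    t = (b_size - beta0) / se_size
    p = 2 * stats.t.sf(abs(t), df)       # sf = survival function = 1 - cdf
    return t, p

t0, p0 = t_test(0)          # the standard significance test
t_self, p_self = t_test(b_size)  # testing the estimate against itself
print(f"H0: beta = 0      -> t = {t0:.2f}, p = {p0:.4f}")
print(f"H0: beta = 68.37  -> t = {t_self:.2f}, p = {p_self:.2f}")
```

As in step 2 of "Try This", testing the estimate against itself gives t = 0 and p = 1 exactly.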

Multicollinearity & SE Inflation

Does adding more predictors always sharpen your estimate? No — when predictors overlap, extra variables buy noise and inflate standard errors without moving R² much.

The standard error se(b_j) = s_e / √(Σ x̃_ji²) reveals what makes estimates precise. Four levers: (1) a well-fitting model (small s_e), (2) large n, (3) high variation in x_j after controlling for the others (large Σ x̃_ji²), and (4) low multicollinearity. When regressors are highly correlated, Σ x̃_ji² shrinks — and standard errors explode.
What you can do here
  • Toggle between Size only, + Bedrooms, and the Full (6-var) model.
  • Watch SE(size) grow even though the coefficient barely changes.
  • Compare R² and adjusted R² — R² crawls up, adj R² drifts down.
[Interactive stat cards — model toggle: b_size · SE(size) · t(size) · R² · Adj R²]
Try This
  1. Start with Size only. SE = $11.17, R² = 0.618. A clean, precise bivariate estimate — baseline for the experiment.
  2. Add Bedrooms. SE jumps to $13.30 but R² stays at 0.618. Bedrooms overlaps with size (r = 0.52); it adds noise, not explanatory power.
  3. Switch to the Full 6-variable model. SE grows to $15.39 (38% larger than the bivariate!); R² rises 3.3 percentage points to 0.651. A trivial fit gain for a serious precision cost — more variables ≠ more information.

Take-away: Every correlated regressor you add inflates the SE of the ones you already care about — add variables deliberately, with precision-for-parsimony in mind. Read §11.2 in the chapter →
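The SE-inflation mechanism is easy to demonstrate outside the housing data. A synthetic sketch (not the Davis dataset): the same y is regressed on x1 plus a second regressor that is either independent of x1 or nearly collinear with it, and the classical SE of b1 is computed from s²(X′X)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def slope_se(x1, x2, y):
    """Classical SE of the coefficient on x1 in y ~ 1 + x1 + x2."""
    X = np.column_stack([np.ones(len(y)), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])     # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)               # classical covariance matrix
    return np.sqrt(cov[1, 1])                       # SE of the x1 coefficient

x1 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)                   # true model depends only on x1

x2_indep = rng.normal(size=n)                       # uncorrelated second regressor
x2_coll = 0.95 * x1 + 0.1 * rng.normal(size=n)      # nearly collinear with x1

print(f"SE(b1) with independent x2: {slope_se(x1, x2_indep, y):.4f}")
print(f"SE(b1) with collinear x2:   {slope_se(x1, x2_coll, y):.4f}")
```

The collinear case inflates SE(b1) severely while leaving the coefficient roughly unchanged — the same pattern the dashboard shows for size.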

Joint F-Test

Five individually insignificant coefficients — could they be jointly significant anyway? t-tests can't answer; only the F-test can.

Individual t-tests cannot test multiple restrictions simultaneously because they ignore correlations between coefficient estimates. The F-test evaluates joint significance using the F(q, n−k) distribution, where q is the number of restrictions. It compares how much worse the restricted model fits the data than the unrestricted one (KC 11.5). Under homoskedasticity, F = [(RSS_r − RSS_u)/q] / [RSS_u/(n−k)] (KC 11.6). A large F means the restrictions are inconsistent with the data.
What you can do here
  • Toggle Overall F — tests whether all coefficients are zero.
  • Toggle Subset F — tests whether the 5 extra variables jointly add anything beyond size.
  • Watch the F-statistic, p-value, and decision badge update.
[Interactive stat cards — test toggle: F-statistic · df1, df2 · p-value · decision badge]
Try This
  1. Start with Overall F. F = 6.83, df = (6, 22), p < 0.001, REJECT. At least one variable in the full model is doing real work — the regression is jointly significant.
  2. Switch to Subset F. F = 0.42, df = (5, 22), p = 0.83, FAIL. The five extra variables are jointly indistinguishable from zero — size alone carries the model.
  3. Reconcile the two results. The overall F is significant; the subset F is not. Size contributes all the signal; bedrooms-through-monthsold contribute nothing. Parsimony wins on the evidence.

Take-away: F-tests reveal joint significance that t-tests miss — use them to justify dropping groups of correlated regressors. Read §11.5 in the chapter →
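The KC 11.6 formula is mechanical enough to code directly. A sketch with hypothetical RSS values (this page does not report the chapter's actual sums of squares), plus a check of the dashboard's subset-test p-value:

```python
from scipy import stats

def f_from_rss(rss_r, rss_u, q, df_u):
    """F-statistic for q restrictions (KC 11.6, homoskedastic case)."""
    f = ((rss_r - rss_u) / q) / (rss_u / df_u)
    p = stats.f.sf(f, q, df_u)       # upper-tail probability under F(q, df_u)
    return f, p

# Hypothetical RSS values, for illustration only
f, p = f_from_rss(rss_r=110.0, rss_u=100.0, q=5, df_u=22)
print(f"F = {f:.2f}, p = {p:.3f}")

# Sanity check against the dashboard's subset test: F = 0.42 on (5, 22)
print(f"p for F = 0.42 on (5, 22): {stats.f.sf(0.42, 5, 22):.2f}")  # ≈ 0.83 as reported
```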

Model Comparison: R² vs Parsimony

A side-by-side table answers one question in one glance: does adding variables actually improve the model, or does it just run up the complexity bill?

To test whether a subset of regressors belongs in the model, compare the restricted and unrestricted models with an F-test, and compare them on adjusted R², AIC, and BIC. If the F-statistic is small (large p-value), the extras don't meaningfully improve fit and the simpler model is preferred. Raw R² always rises with more regressors; adjusted R² and the information criteria reward parsimony and can fall as you add noise.
What you can do here
  • Scan the R² row left to right — it only rises, by construction.
  • Scan the adjusted R² row — it can rise and fall.
  • Scan the F-statistic row — fewer variables with each still pulling weight gives a higher F.
[Interactive table — columns: Variable · Model 1 · Model 2 · Model 3]
Try This
  1. Scan the R² row. It rises monotonically (0.618 → 0.651). Guaranteed mathematically — raw R² is a one-way ratchet.
  2. Scan Adjusted R². Model 1 (0.603) wins; Model 3 (0.555) loses. The five extra regressors are penalised for weak explanatory power and adj R² flips the verdict.
  3. Scan the F-statistic row. Model 1's F (43.58) dwarfs Model 3's (6.83). Fewer regressors, but each one earns its keep — a cleaner "joint significance" story.

Take-away: When R² and adjusted R² disagree, believe adjusted R² — parsimony is not a stylistic preference, it's a statistical one. Read §11.7 in the chapter →
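Adjusted R²'s penalty is easy to verify by hand from the table's reported R² values (n = 29; k counts the intercept). Small last-digit differences from the dashboard arise because the reported R² values are themselves rounded:

```python
def adj_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R2)(n - 1)/(n - k), penalizing each extra regressor."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

n = 29
print(f"Model 1 (size only, k=2): {adj_r2(0.618, n, 2):.3f}")  # dashboard reports 0.603
print(f"Model 3 (full, k=7):      {adj_r2(0.651, n, 7):.3f}")  # dashboard reports 0.555
```

Note the reversal: Model 3 wins on raw R² (0.651 vs 0.618) but loses on adjusted R².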

Robust vs Standard Errors

If error variance changes with x, your textbook SEs lie. Robust SEs don't — but do they change any conclusions on this dataset?

Heteroskedasticity-robust (HC1) standard errors provide valid inference without assuming homoskedasticity. Coefficient estimates remain unchanged; only the standard errors — and thus t-statistics and p-values — may differ. For cross-sectional data, reporting robust SEs alongside standard SEs is modern best practice: it costs nothing when homoskedasticity holds and protects you when it fails.
What you can do here
  • Toggle Standard vs Robust (HC1) SE.
  • Compare the SE, t, and p columns side by side across the six regressors.
  • Check whether any variable flips significance between the two SE types.
[Interactive table — columns: Variable · Coef · SE · t · p]
Try This
  1. Compare SE(size) across Standard and Robust. $15.39 vs $15.36 — essentially identical. For size, the two worldviews agree completely.
  2. Compare SE(bathrooms). Standard $15,721 vs robust $19,284 — a 23% jump. Heteroskedasticity in this small cross-section inflates bathrooms' uncertainty once you account for it.
  3. Scan every row for a significance flip. None here. In 29-observation datasets the differences rarely decide significance — but in larger, more heteroskedastic data, robust SEs can flip conclusions.

Take-away: Always report robust (HC1) SEs alongside standard SEs — when they disagree, the robust ones are the honest reference. Read §11.7 in the chapter →
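The HC1 correction itself is a short "sandwich" formula. A synthetic sketch (not the Davis data) comparing classical and HC1 slope SEs when the error variance grows with x:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(0, 2, n)
y = 1 + 2 * x + rng.normal(size=n) * (0.2 + x**2)   # error variance increases with x

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# Classical: s^2 (X'X)^-1 — assumes constant error variance
s2 = e @ e / (n - k)
se_classical = np.sqrt(s2 * XtX_inv[1, 1])

# HC1 sandwich: n/(n-k) * (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * (e**2)[:, None])
cov_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
se_hc1 = np.sqrt(cov_hc1[1, 1])

print(f"classical SE(slope): {se_classical:.4f}")
print(f"HC1 SE(slope):       {se_hc1:.4f}")
```

With heteroskedasticity this strong, the classical SE understates the slope's uncertainty and HC1 comes out clearly larger — the same direction as the bathrooms row in the table, only more pronounced.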

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 11 CHEAT SHEET: Statistical Inference for Multiple Regression
# =============================================================================

# --- Libraries ---
import numpy as np                              # numerical operations
import pandas as pd                             # data loading and manipulation
import matplotlib.pyplot as plt                 # creating plots and visualizations
from statsmodels.formula.api import ols         # OLS regression with R-style formulas
from scipy import stats                         # t and F distributions for inference
from statsmodels.stats.anova import anova_lm    # ANOVA table for model comparison

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files; the dataset has 29 houses from Davis, CA
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")
print(data_house[['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']].describe().round(2))

# =============================================================================
# STEP 2: Estimate the full multiple regression model
# =============================================================================
# Formula syntax: 'y ~ x1 + x2 + ...' includes an intercept automatically
# IMPORTANT: .fit() estimates the model — without it, nothing is computed!
model_full = ols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                 data=data_house).fit()

n = int(model_full.nobs)
k = len(model_full.params)
df = n - k  # degrees of freedom for t and F tests

print(f"\nSize effect: ${model_full.params['size']:,.2f} per sq ft (p = {model_full.pvalues['size']:.4f})")
print(f"R-squared: {model_full.rsquared:.4f} ({model_full.rsquared*100:.1f}% of variation explained)")
print(f"Degrees of freedom: n-k = {n}-{k} = {df}")

# Full regression table (coefficients, std errors, t-stats, p-values, R²)
model_full.summary()

# =============================================================================
# STEP 3: Confidence intervals — does the CI exclude zero?
# =============================================================================
# 95% CI: b_j ± t_critical × se(b_j)
# If the interval excludes zero, the coefficient is statistically significant at 5%
conf_int = model_full.conf_int(alpha=0.05)
print("\n95% Confidence Intervals:")
print(conf_int.round(2))

# Manual calculation for the size coefficient
coef_size = model_full.params['size']
se_size   = model_full.bse['size']
t_crit    = stats.t.ppf(0.975, df)  # 0.975 = upper tail for two-sided 95% CI

ci_lower = coef_size - t_crit * se_size
ci_upper = coef_size + t_crit * se_size
print(f"\nSize 95% CI: [${ci_lower:.2f}, ${ci_upper:.2f}]  (excludes zero \u2192 significant)")

# =============================================================================
# STEP 4: Hypothesis test on a single coefficient
# =============================================================================
# Test H₀: β_size = 50 vs H₁: β_size ≠ 50
# t = (b_j - hypothesized value) / se(b_j)
null_value = 50
t_stat = (coef_size - null_value) / se_size
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

print(f"\nTest H\u2080: \u03b2_size = {null_value}")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4f}")
print(f"  Decision: {'Reject' if p_value < 0.05 else 'Fail to reject'} H\u2080 at 5% level")

# Test of statistical significance: H₀: β_size = 0
t_stat_zero = coef_size / se_size
print(f"\nTest H\u2080: \u03b2_size = 0")
print(f"  t-statistic: {t_stat_zero:.4f}, p-value: {model_full.pvalues['size']:.6f}")
print(f"  Size IS statistically significant at 5% level")

# =============================================================================
# STEP 5: Joint F-test — are groups of coefficients jointly significant?
# =============================================================================
# Overall F-test: H₀: all slope coefficients = 0
print(f"\nOverall F-test: F = {model_full.fvalue:.4f}, p = {model_full.f_pvalue:.6e}")
print(f"  At least one variable matters \u2192 model is jointly significant")

# Subset F-test: are variables other than size jointly significant?
# H₀: β_bedrooms = β_bathrooms = β_lotsize = β_age = β_monthsold = 0
hypotheses = ['bedrooms = 0', 'bathrooms = 0', 'lotsize = 0',
              'age = 0', 'monthsold = 0']
f_test_result = model_full.f_test(hypotheses)
print(f"\nSubset F-test (drop 5 vars, keep size):")
# np.squeeze handles fvalue/pvalue being a scalar or a 1x1 array across statsmodels versions
print(f"  F = {float(np.squeeze(f_test_result.fvalue)):.4f}, p = {float(np.squeeze(f_test_result.pvalue)):.4f}")
print(f"  Extra variables are NOT jointly significant \u2192 simpler model preferred")

# =============================================================================
# STEP 6: Model comparison — restricted vs unrestricted
# =============================================================================
# Restricted model: only size as predictor
model_restricted = ols('price ~ size', data=data_house).fit()

# ANOVA table confirms the F-test result
anova_results = anova_lm(model_restricted, model_full)
print("\nANOVA: Restricted (size only) vs Full model:")
print(anova_results)

# Compare fit statistics
print(f"\n{'Model':<25} {'R\u00b2':>8} {'Adj. R\u00b2':>9} {'F-stat':>8}")
print("-" * 52)
for name, m in [('Size only', model_restricted), ('Full model', model_full)]:
    print(f"{name:<25} {m.rsquared:>8.4f} {m.rsquared_adj:>9.4f} {m.fvalue:>8.2f}")

# =============================================================================
# STEP 7: Robust standard errors — valid under heteroskedasticity
# =============================================================================
# HC1 (White's correction) provides valid inference without constant variance.
# Refitting with cov_type='HC1' keeps the pandas labels on params, bse, and pvalues.
model_robust = ols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                   data=data_house).fit(cov_type='HC1')

print("\nStandard vs Robust SEs:")
print(f"{'Variable':<14} {'Coef':>10} {'Std SE':>10} {'Robust SE':>10} {'p (robust)':>10}")
print("-" * 56)
for var in model_full.params.index:
    print(f"{var:<14} {model_full.params[var]:>10.2f} "
          f"{model_full.bse[var]:>10.2f} {model_robust.bse[var]:>10.2f} "
          f"{model_robust.pvalues[var]:>10.4f}")

# =============================================================================
# STEP 8: Coefficient plot — visual summary of significance
# =============================================================================
# Error bars crossing zero = not significant at 5%
params_plot = model_full.params[1:]  # exclude intercept
ci_plot = conf_int.iloc[1:, :]

fig, ax = plt.subplots(figsize=(10, 6))
y_pos = np.arange(len(params_plot))
ax.errorbar(params_plot.values, y_pos,
            xerr=[params_plot.values - ci_plot.iloc[:, 0].values,
                  ci_plot.iloc[:, 1].values - params_plot.values],
            fmt='o', markersize=8, capsize=5, capthick=2, linewidth=2)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1.5, alpha=0.7, label='H\u2080: \u03b2 = 0')
ax.set_yticks(y_pos)
ax.set_yticklabels(params_plot.index)
ax.set_xlabel('Coefficient Value')
ax.set_title('Coefficient Estimates with 95% Confidence Intervals')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
Open empty Colab notebook →