Chapter 10 of 18 · Interactive Dashboard

Data Summary for Multiple Regression

Toggle variables, compare models, and visualize partial effects — building intuition for how multiple regression separates the contribution of each predictor.

Partial effects vs total effects

Why did the bedrooms coefficient collapse from $23,667 to $1,553 the moment we added size? That drop is where multiple regression earns its keep.

Multiple regression separates each variable's partial effect from the indirect effects flowing through its correlation with other regressors. In bivariate regression, the bedrooms coefficient ($23,667) captures both the direct effect and the indirect effect through size. Adding size to the model drops bedrooms to $1,553 — the partial effect holding size constant. This dramatic change illustrates why controlling for confounders is essential for isolating individual variable effects.
What you can do here
  • Pick a focus variable — the coefficient we want to understand.
  • Pick a control variable — the confounder we want to hold constant.
  • Watch the bivariate-vs-partial comparison for the percentage drop when you add the control.
Try this
  1. Set focus = Bedrooms, control = Size. The coefficient drops from $23,667 to $1,553 — a 93% reduction. Size was doing most of the work the bivariate regression attributed to bedrooms.
  2. Switch focus to Bathrooms, still controlling for Size. The partial effect shrinks again, for the same reason. Bigger houses have more bathrooms; controlling for size strips out that shared variation.
  3. Set focus = Age, control = Size. The bivariate effect was already near zero; controlling barely moves it. When a predictor is uncorrelated with the control, the partial effect equals the bivariate effect.
  4. Keep focus = Bedrooms and switch the control to Bedrooms' weaker confounder (e.g., Lot size). The coefficient drops less. The strength of the drop measures how much the control and the focus overlap.

Take-away: The gap between bivariate and partial effects is exactly the contribution that gets miscounted when you omit a correlated control. Read §10.1 in the chapter →
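The drop can be reproduced in a few lines. Below is a minimal sketch on synthetic stand-in data (not the book's housing file): bedrooms is built to be correlated with size, but only size actually moves price, so the bivariate bedrooms coefficient is large while the partial one collapses toward zero.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: price depends on size only; bedrooms merely tracks size
rng = np.random.default_rng(0)
n = 200
size = rng.normal(2000, 400, n)                          # sq ft
bedrooms = np.round(size / 600 + rng.normal(0, 0.7, n))  # correlated with size
price = 70 * size + rng.normal(0, 20_000, n)             # no direct bedrooms effect

df = pd.DataFrame({'price': price, 'size': size, 'bedrooms': bedrooms})

# Bivariate: total effect (direct + indirect through size)
b_biv = ols('price ~ bedrooms', data=df).fit().params['bedrooms']
# Multiple: partial effect, holding size constant
b_par = ols('price ~ bedrooms + size', data=df).fit().params['bedrooms']

print(f"bivariate bedrooms coefficient: {b_biv:,.0f}")
print(f"partial bedrooms coefficient:   {b_par:,.0f}")
# The bivariate coefficient is large because bedrooms proxies for size;
# the partial coefficient collapses once size is held constant.
```

The cheat sheet at the end of this page runs the same comparison on the actual housing data.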

Correlation matrix heatmap

Before running a six-predictor regression, check the grid. Who's correlated with whom, and how much overlap do your regressors carry?

Pairwise scatterplot and correlation matrices reveal all two-way relationships before any formal modeling. They expose linear associations, nonlinearities, clusters, and outliers — and they flag multicollinearity: if two regressors are tightly correlated (e.g., size and bedrooms, r = 0.52), their individual effects may be hard to separate in a regression, and high bivariate correlations (like bedrooms with price) may shrink once you control for confounders (KC 10.3).
What you can do here
  • Scan the top row or leftmost column — correlations of each variable with price.
  • Check the off-diagonal cells between predictors — high values flag multicollinearity.
  • Hover any cell to read the exact r.
Try this
  1. Find the strongest predictor of price. Size leads at r = 0.79. This is why the size-only model gets most of the R² the full model ever reaches.
  2. Find the size–bedrooms cell. r = 0.52 — a substantial overlap. This is why bedrooms' coefficient shrinks so dramatically once size enters the regression.
  3. Find monthsold–bathrooms. r ≈ −0.39 — negative and moderate. Idiosyncratic to this sample; the two aren't causally linked, just correlated in 29 observations.
  4. Find a pair with near-zero correlation. Those pairs barely interact in a regression. Their coefficients stay stable as you move between bivariate and multiple specs.

Take-away: A correlation matrix is the cheapest insurance against multicollinearity surprises — always look at it before you pick your regressors. Read §10.3 in the chapter →
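Eyeballing the heatmap scales poorly past a handful of variables. Here is a small sketch of doing the same scan programmatically, on synthetic data; the helper name `flag_collinear_pairs` is illustrative, not from the chapter.

```python
import numpy as np
import pandas as pd

def flag_collinear_pairs(df, threshold=0.5):
    """Return (var_i, var_j, r) for every pair with |r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Synthetic data: bedrooms is built to overlap with size; age is independent
rng = np.random.default_rng(1)
size = rng.normal(0, 1, 300)
df = pd.DataFrame({
    'size': size,
    'bedrooms': 0.6 * size + 0.8 * rng.normal(0, 1, 300),
    'age': rng.normal(0, 1, 300),
})
print(flag_collinear_pairs(df, threshold=0.5))
# Only the size-bedrooms pair should be flagged; age stays out of the list.
```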

Multiple regression builder

Build your own model, coefficient by coefficient. Which predictors earn their spot, and which just inflate your standard errors?

Each coefficient bj is the expected change in y when xj increases by one unit, holding all other regressors constant. For the house-price example, a size coefficient of $68.37 means each additional square foot predicts a $68.37 price increase, controlling for bedrooms, bathrooms, lot size, age, and month sold. Statistical significance is assessed via t-statistics and confidence intervals; p < 0.05 is the conventional threshold.
What you can do here
  • Check or uncheck each of the 6 predictors to build any of the 63 possible models.
  • Watch the fit statistics — R², adjusted R², AIC, BIC — for evidence of overfitting.
  • Read the coefficient table for each variable's coefficient, SE, t-stat, p-value, and 95% CI.
Try this
  1. Start with size only (R² = 0.618, adj R² = 0.603). Now check bedrooms. R² barely moves (0.618) and adj R² falls to 0.589. Bedrooms adds essentially no new information once you already have size.
  2. Check all 6 variables. R² rises to 0.651 but adj R² falls to 0.555. Raw R² only goes up; adjusted R² punishes complexity, and here it says the extra variables are noise.
  3. Inspect the p-value column. Only size is significant at 5%; every other p > 0.25. Multicollinearity inflates SEs, hiding whatever signal the other predictors might carry.
  4. Uncheck everything except size. The parsimony principle is vindicated here: with n = 29, the simplest adequate specification wins on every criterion.

Take-away: Every coefficient holds all other regressors fixed — add variables deliberately, and let adjusted R², AIC, and BIC tell you when you've gone too far. Read §10.4 in the chapter →
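The builder's 63 models are just every non-empty subset of six predictors. Below is a sketch of the same exhaustive search on synthetic data with three candidates (7 subsets); swapping in the housing data and its six predictor names would reproduce the widget's full search.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: only x1 actually moves y; x2 and x3 are noise
rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
df['y'] = 2 * df['x1'] + rng.normal(size=n)

candidates = ['x1', 'x2', 'x3']
rows = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        fit = ols('y ~ ' + ' + '.join(subset), data=df).fit()
        rows.append({'model': ' + '.join(subset),
                     'adj_R2': fit.rsquared_adj,
                     'AIC': fit.aic, 'BIC': fit.bic})

results = pd.DataFrame(rows).sort_values('adj_R2', ascending=False)
print(results.to_string(index=False))
# With 3 candidates there are 2**3 - 1 = 7 models; with 6, the widget's 63.
```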

Frisch-Waugh-Lovell theorem

"Holding other variables constant" sounds abstract — until you see it done mechanically. FWL shows you the literal steps that make partial effects partial.

The partial effect of xj in a multiple regression equals the slope from a bivariate regression of y on the residualized xj. Residualize xj by regressing it on all other regressors and keeping the residuals — what's left is the variation in xj that is independent of the other variables. Regressing y on that residualized xj reproduces the multiple-regression coefficient exactly. This is how multiple regression "controls for" covariates.
What you can do here
  • Pick a target variable X whose partial effect you want to isolate.
  • Pick a control variable Z to hold constant.
  • Read the FWL slope — it should match the corresponding multiple-regression coefficient exactly.
Try this
  1. Default: target = size, control = bedrooms. The FWL slope (~72.41) matches the multiple-regression size coefficient exactly. This is not an approximation: the two numbers are the same by construction.
  2. Switch target to bedrooms, control to size. The FWL slope (~$1,553) matches bedrooms' coefficient in price ~ bedrooms + size. Same identity, different focus; the roles of target and control can be swapped freely.
  3. Try target = bathrooms, control = size. The slope reveals bathrooms' partial contribution after size's variation has been residualized away. That residual variation is what a multiple regression actually regresses on; FWL just makes the mechanics visible.
  4. Look at the x-axis ("Residualized X"). The points have been shifted so that all the control's influence is removed. This is the "holding other variables constant" phrase made literal.

Take-away: Multiple regression coefficients are bivariate regressions in disguise — on the parts of each regressor that the other regressors cannot explain. Read §10.5 in the chapter →
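The identity is easy to verify directly. Here is a minimal sketch in the widget's two-variable setting, on synthetic data: residualize x on z, regress y on the residuals, and compare slopes.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: x overlaps with z, and both move y
rng = np.random.default_rng(3)
n = 150
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = 1.5 * x + 2.0 * z + rng.normal(size=n)
df = pd.DataFrame({'y': y, 'x': x, 'z': z})

# Multiple-regression coefficient on x
b_multiple = ols('y ~ x + z', data=df).fit().params['x']

# FWL: residualize x on z, then regress y on the residuals
df['x_resid'] = ols('x ~ z', data=df).fit().resid
b_fwl = ols('y ~ x_resid', data=df).fit().params['x_resid']

print(f"multiple: {b_multiple:.10f}")
print(f"FWL:      {b_fwl:.10f}")
assert abs(b_multiple - b_fwl) < 1e-8  # identical by construction
```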

Model comparison — when more is less

Adding variables always raises R². Does that mean every extra variable improves the model? Adjusted R², AIC, and BIC say no — and they can all move down as you add junk.

Adjusted R², AIC, and BIC penalize complexity so that extra variables pay only when they earn their keep. Adjusted R² divides sums of squares by degrees of freedom — a mild penalty. AIC and BIC impose stronger penalties; BIC's grows with sample size, favoring more parsimonious specifications (KC 10.6). The parsimony principle: prefer simpler models unless extra variables meaningfully improve fit (KC 10.7).
What you can do here
  • Read the four curves — R², adjusted R², AIC, BIC — across models of 1 to 6 predictors.
  • Watch the divergence: R² always rises while adj R² and AIC/BIC can move the other way.
  • Cross-reference with the regression builder above to see which specification each curve prefers.
Try this
  1. Read R² across the models. It climbs from 0.618 (size only) to 0.651 (all six) — a 3.3-point gain for five extra variables. Raw R² is a one-way ratchet.
  2. Read adjusted R² on the same models. It falls from 0.603 to 0.555. Once you penalize degrees of freedom, the extra five variables are net-negative.
  3. Read AIC and BIC. Both rise — worse models. BIC rises more. The BIC penalty grows with sample size; for n = 29 it is especially allergic to overfitting.
  4. Compare the four verdicts. All four criteria agree the best model is size alone. When adj R², AIC, and BIC converge, the decision is easy.

Take-away: R² alone is the wrong criterion for model choice — lean on adjusted R², AIC, and BIC, and let parsimony win ties. Read §10.7 in the chapter →
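The one-way-ratchet behavior of R² is easy to demonstrate. Below is a sketch on synthetic data with the same n = 29: one real predictor plus five pure-noise regressors added one at a time, printing all four criteria at each step.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends on x1 only; x2..x6 are pure noise
rng = np.random.default_rng(4)
n = 29  # same small sample size as the housing data
df = pd.DataFrame({'x1': rng.normal(size=n)})
df['y'] = 2 * df['x1'] + rng.normal(size=n)
for j in range(2, 7):
    df[f'x{j}'] = rng.normal(size=n)

terms, r2_path = [], []
for j in range(1, 7):
    terms.append(f'x{j}')
    fit = ols('y ~ ' + ' + '.join(terms), data=df).fit()
    r2_path.append(fit.rsquared)
    print(f"{j} predictors: R²={fit.rsquared:.3f}  adjR²={fit.rsquared_adj:.3f}  "
          f"AIC={fit.aic:.1f}  BIC={fit.bic:.1f}")
# R² never decreases across these nested models; the penalized criteria can
# and typically do deteriorate as the noise regressors pile up.
```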

Variance inflation factors — detecting multicollinearity

If two regressors carry the same information, OLS can't tell them apart — but it can tell you how bad the overlap is. That number is the variance inflation factor (VIF), the standard first check for multicollinearity.

The Variance Inflation Factor measures how much a coefficient's variance is inflated by correlation with the other regressors. VIFj = 1 / (1 − Rj²), where Rj² is from regressing xj on all other regressors. Rules of thumb: VIF > 5 is moderate concern; VIF > 10 is severe; VIF → ∞ means perfect collinearity and inestimable coefficients. High VIFs inflate SEs and destabilize estimates.
What you can do here
  • Read the bar for each predictor — taller bars mean more inflation.
  • Compare to the 5 and 10 thresholds marked on the scale.
  • Cross-reference with the correlation matrix above — highly-correlated predictors drive the worst VIFs.
Try this
  1. Read bedrooms and size. VIF ≈ 57.8 and 40.1. Both far past the "severe" threshold — the two variables carry nearly redundant information.
  2. Read lotsize and monthsold. VIF ≈ 12.0 and 12.8. Also above threshold — with 29 observations and 6 correlated predictors, the whole model is over-collinear.
  3. Connect to the earlier widgets. Only size is significant in the full regression because the other SEs are inflated. High VIF → inflated SE → inflated p-value → "insignificant" coefficients that may actually carry signal.
  4. Mentally drop bedrooms or bathrooms from the model and imagine what remains. The remedy for multicollinearity is surgical: drop or combine the redundant predictors until VIFs fall below 5.

Take-away: High VIFs are the fingerprint of multicollinearity — they inflate standard errors and turn real effects into "insignificant" ones, so prune correlated regressors before chasing p-values. Read §10.8 in the chapter →
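VIF can also be computed straight from its definition rather than from a library call. Here is a sketch on synthetic data, with the auxiliary regression including an intercept (the usual convention): two regressors share a common factor, the third is independent.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: x1 and x2 share a common factor; x3 is independent
rng = np.random.default_rng(5)
n = 200
shared = rng.normal(size=n)
df = pd.DataFrame({
    'x1': shared + 0.3 * rng.normal(size=n),
    'x2': shared + 0.3 * rng.normal(size=n),
    'x3': rng.normal(size=n),
})

# VIF_j = 1 / (1 - R²_j), with R²_j from regressing x_j on the others
vifs = {}
for j in df.columns:
    others = [c for c in df.columns if c != j]
    r2_aux = ols(f'{j} ~ ' + ' + '.join(others), data=df).fit().rsquared
    vifs[j] = 1 / (1 - r2_aux)
    print(f"VIF({j}) = {vifs[j]:.2f}")
# x1 and x2 show elevated VIFs; x3 sits near the no-inflation value of 1.
```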

Regression diagnostics — actual vs predicted & residuals

Two regressions can have nearly the same R². But are their residuals equally well-behaved? Diagnostic plots are where models earn — or lose — your trust.

A well-fit regression shows predictions hugging the 45° line and residuals scattering randomly around zero. The actual-vs-predicted plot surfaces bias and heteroskedasticity; the residual plot reveals curvature, fan shapes, or outliers. For a multiple regression, these two plots are the first place to check whether the model's assumptions are plausible — before reporting any coefficient's standard error.
What you can do here
  • Pick a model — size only, size + bedrooms, or the full six-predictor model.
  • Scan the actual-vs-predicted plot for systematic under- or over-prediction.
  • Scan the residual plot for fans, curves, or isolated outliers.
Try this
  1. Start with Size only. Points scatter around the 45° line with moderate spread; residuals look random. A simple, well-behaved bivariate fit.
  2. Switch to the full six-predictor model. The scatter tightens slightly (R² 0.618 → 0.651). A marginal improvement at a large cost in standard-error precision — exactly what VIF and adj R² warned us about.
  3. Inspect the residual plot for any model. No fan, no curve, no serial pattern. The OLS homoskedasticity and linearity assumptions look defensible on this sample.
  4. Hover the house near $340,000. It is consistently under-predicted across all three specifications. A stable residual outlier — likely an omitted quality variable that no specification captures.

Take-away: Actual-vs-predicted and residual plots are cheap insurance — they can invalidate an R² in a single glance. Read §10.4 in the chapter →
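The stubborn-outlier check can be automated. Below is a sketch on synthetic data with one planted outlier (row 10 here is hypothetical, not the $340,000 house): find the observation with the largest absolute residual and flag it for inspection.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with one deliberately under-predicted observation
rng = np.random.default_rng(6)
n = 29
df = pd.DataFrame({'size': rng.normal(2000, 400, n)})
df['price'] = 70 * df['size'] + rng.normal(0, 15_000, n)
df.loc[10, 'price'] += 120_000  # plant an outlier at row 10

fit = ols('price ~ size', data=df).fit()
worst = fit.resid.abs().idxmax()
print(f"largest |residual| at row {worst}: {fit.resid[worst]:,.0f}")
# A residual this far from the rest is worth checking in the raw data:
# a recording error, or a variable (quality, location) the model omits.
```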

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 10 CHEAT SHEET: Data Summary for Multiple Regression
# =============================================================================

# --- Libraries ---
import numpy as np                                          # numerical operations
import pandas as pd                                         # data loading and manipulation
import matplotlib.pyplot as plt                             # creating plots and visualizations
import seaborn as sns                                       # statistical visualization (heatmaps, pairplots)
from statsmodels.formula.api import ols                     # OLS regression with R-style formulas
from statsmodels.stats.outliers_influence import variance_inflation_factor  # multicollinearity detection

# =============================================================================
# STEP 1: Load data and explore
# =============================================================================
# pd.read_stata() reads Stata .dta files directly from a URL
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")
print(data_house[['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']].describe().round(2))

# =============================================================================
# STEP 2: Partial effects vs. total effects — why controls matter
# =============================================================================
# Bivariate regression captures TOTAL effect (direct + indirect through size)
model_bivariate = ols('price ~ bedrooms', data=data_house).fit()

# Multiple regression isolates the PARTIAL effect (holding size constant)
model_partial = ols('price ~ bedrooms + size', data=data_house).fit()

print(f"Bedrooms coefficient (bivariate):  ${model_bivariate.params['bedrooms']:,.2f}")
print(f"Bedrooms coefficient (multiple):   ${model_partial.params['bedrooms']:,.2f}")
print(f"Change: ${model_partial.params['bedrooms'] - model_bivariate.params['bedrooms']:,.2f}")
# The coefficient drops from ~$23,667 to ~$1,553 once we control for size

# =============================================================================
# STEP 3: Correlation matrix — check pairwise associations before modeling
# =============================================================================
# High correlations between regressors signal potential multicollinearity
corr_vars = ['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']
corr_matrix = data_house[corr_vars].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1)
ax.set_title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()

print(f"Price-Size correlation: {corr_matrix.loc['price', 'size']:.3f}")
print(f"Size-Bedrooms correlation: {corr_matrix.loc['size', 'bedrooms']:.3f}")

# =============================================================================
# STEP 4: Full multiple regression — estimate partial effects
# =============================================================================
# Each coefficient measures the change in price for a one-unit change in that
# variable, holding ALL other regressors constant
model_full = ols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                 data=data_house).fit()

size_coef = model_full.params['size']
print(f"Size effect: each additional sq ft is associated with ${size_coef:,.2f} higher price")
print(f"R-squared: {model_full.rsquared:.4f} ({model_full.rsquared*100:.1f}% of variation explained)")
print(f"Adjusted R-squared: {model_full.rsquared_adj:.4f}")

# Full regression table (coefficients, std errors, t-stats, p-values)
print(model_full.summary())

# =============================================================================
# STEP 5: FWL theorem — how "holding constant" actually works
# =============================================================================
# Step A: Regress size on all other regressors, keep residuals
model_size_on_others = ols('size ~ bedrooms + bathrooms + lotsize + age + monthsold',
                            data=data_house).fit()
resid_size = model_size_on_others.resid

# Step B: Regress price on those residuals — the slope matches the full model
data_house['resid_size'] = resid_size
model_fwl = ols('price ~ resid_size', data=data_house).fit()

print(f"Size coef from FULL regression:     {model_full.params['size']:.10f}")
print(f"Coef from FWL residual regression:  {model_fwl.params['resid_size']:.10f}")
# These are identical — the FWL theorem in action

# =============================================================================
# STEP 6: Model comparison — parsimony vs. complexity
# =============================================================================
# Compare simple (size only) vs. full model using fit statistics
model_simple = ols('price ~ size', data=data_house).fit()

comparison = pd.DataFrame({
    'Model': ['Size only', 'Full (all variables)'],
    'R²':     [model_simple.rsquared,     model_full.rsquared],
    'Adj R²': [model_simple.rsquared_adj, model_full.rsquared_adj],
    'AIC':    [model_simple.aic,          model_full.aic],
    'BIC':    [model_simple.bic,          model_full.bic],
})
print(comparison.to_string(index=False))
# Adj R² DECREASES when adding 5 weak predictors — parsimony wins

# =============================================================================
# STEP 7: VIF — detect multicollinearity
# =============================================================================
# VIF_j = 1 / (1 - R²_j), where R²_j is from regressing x_j on all other x's
# VIF > 5: moderate concern; VIF > 10: severe multicollinearity
# Note: computed here on the raw regressor matrix without a constant column;
# the usual convention adds one first (statsmodels.api.add_constant) and
# reports VIFs for the non-constant columns only
X = data_house[['size', 'bedrooms', 'bathrooms', 'lotsize', 'age', 'monthsold']]
vif_data = pd.DataFrame({
    'Variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif_data.to_string(index=False))

# =============================================================================
# STEP 8: Diagnostics — actual vs. predicted and residual plots
# =============================================================================
# Good fit: points hug the 45° line; residuals scatter randomly around zero
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Actual vs. predicted
axes[0].scatter(data_house['price'], model_full.fittedvalues, alpha=0.7, s=50)
axes[0].plot([data_house['price'].min(), data_house['price'].max()],
             [data_house['price'].min(), data_house['price'].max()],
             'r--', linewidth=2, label='Perfect prediction (45° line)')
axes[0].set_xlabel('Actual Price ($)')
axes[0].set_ylabel('Predicted Price ($)')
axes[0].set_title('Actual vs Predicted')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residual plot
axes[1].scatter(model_full.fittedvalues, model_full.resid, alpha=0.7, s=50)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Fitted Values ($)')
axes[1].set_ylabel('Residuals ($)')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Open empty Colab notebook →