Chapter 16 of 18 · Interactive Dashboard

Checking the Model and Data

Explore regression diagnostics interactively -- multicollinearity, heteroskedasticity, autocorrelation, omitted variable bias, diagnostic plots, and influential observations.

VIF Explorer: Multicollinearity

Add one more regressor and your SEs triple — but the coefficient barely moves. That's multicollinearity, and VIF is the one number that diagnoses it.

The Variance Inflation Factor quantifies multicollinearity: VIFj = 1 / (1 − Rj²), where Rj² is from regressing xj on all other regressors. VIF = 1 means no collinearity; VIF > 10 indicates serious problems (standard errors inflated by √10 ≈ 3.2×). OLS stays unbiased under multicollinearity — only precision suffers. Joint F-tests on correlated regressors remain valid and powerful even when individual t-tests are unreliable (Key Concept 16.2).
What you can do here
  • Toggle Base (age + educ) vs. + Interaction to see VIF explode when the interaction is added.
  • Slide the threshold to change the "concern" line from 2 to 20.
  • Read the correlation heatmap — high off-diagonal correlations drive the VIFs.
[Live readouts: VIF age, VIF education, VIF interaction, Max VIF]
Try this
  1. Start on the Base model. VIFs sit near 1.0 for both age and education. Toggle to + Interaction. VIFs jump to 22–37 — the interaction term is nearly a linear combination of the two main effects.
  2. Drag the threshold slider to 5. The base model stays comfortably below; the interaction model blows past the line on every regressor — any threshold from 5 to 10 flags the same problem.
  3. Read the correlation heatmap. age×education correlates 0.73 with age and 0.64 with education. These high bivariate correlations are the mechanical cause of the VIF blow-up.

Take-away: VIF flags when correlated regressors will inflate your SEs — but joint F-tests still work, so don't drop variables solely to "fix" a high VIF. Read §16.1 in the chapter →

Robust vs. Standard SEs

Default SEs assume error variance is constant. In cross-sectional data it almost never is — and default SEs quietly lie. Robust SEs are the cheap insurance.

Heteroskedasticity means error variance depends on the regressors: Var[ui|xi] ≠ σ². OLS coefficients remain unbiased, but default SEs are wrong — typically too small, giving false confidence. Use heteroskedasticity-robust (HC1/White) standard errors, valid whether or not heteroskedasticity is present. When assumptions 3 (homoskedasticity) or 4 (no autocorrelation) fail, the fix is to change the inference method — not the model (Key Concept 16.3).
What you can do here
  • Toggle SE Comparison vs Residual Scatter.
  • Highlight All Variables, Education only, or Age only to focus the comparison.
  • Read the SE ratio — values far from 1.0 are the signature of heteroskedasticity.
[Live readouts: SE(age) standard, SE(age) robust, SE(educ) standard, SE(educ) robust, Ratio (educ)]
Try this
  1. Stay on SE Comparison. Robust SE bars are consistently taller than standard SE bars; the ratio for education ≈ 1.12. Default SEs understate uncertainty by ~12% on this regressor — enough to change a "significant at 5%" verdict on a borderline case.
  2. Toggle to Residual Scatter. The cloud fans outward at higher fitted values. Classic heteroskedasticity — error variance grows with predicted earnings, exactly the situation default SEs can't handle.
  3. Highlight Education Only. Its SE inflates more than age's does. Heteroskedasticity affects regressors unequally — robust SEs fix each one individually.

Take-away: Robust SEs cost nothing when homoskedasticity holds and protect you when it fails — report them as a default on any cross-sectional regression. Read §16.5 in the chapter →

Autocorrelation Explorer

When today's shock bleeds into tomorrow's, default SEs underestimate uncertainty — and your effective sample size is smaller than n. Raise ρ and watch it happen.

Autocorrelation means errors are correlated over time (Cov[ut, us] ≠ 0), common in time series when shocks persist. OLS stays unbiased but SEs are wrong — typically too small, leading to false significance. Use HAC (Newey-West) standard errors for valid inference. Check the ACF of residuals: a single large spike at lag 1 suggests an AR(1) process; multiple significant lags suggest higher-order autocorrelation. Severe autocorrelation drastically reduces the effective sample size.
What you can do here
  • Slide ρ from 0 (pure white noise) to 0.95 (extreme persistence).
  • Slide the number of observations to see how the ACF estimate stabilises at larger n.
  • Click Resimulate to draw new errors at the same ρ.
[Live readouts: Lag 1 ACF, Lag 5 ACF, Theory Lag 1, Eff. sample / n]
Try this
  1. Set ρ = 0. The time series looks like white noise and ACF bars sit inside the band. No serial correlation — default SEs would be valid here.
  2. Drag to ρ = 0.80, then ρ = 0.95. Errors cluster in long runs and the ACF decays slowly — lag 10 is still large at ρ = 0.95. The effective sample size collapses as persistence grows — 500 observations can carry the information content of maybe 50.
  3. Resimulate at ρ = 0.80 several times. Each realisation looks different but the slow-decay ACF pattern is consistent — persistence is a population property, not a draw-specific artifact.

Take-away: Autocorrelation shrinks your effective sample even when n looks large — detect it via the ACF of residuals and correct inference with HAC standard errors. Read §16.6 in the chapter →

Diagnostic Plots

A tidy regression table can hide a sick model. The residual plot tells the truth — and it takes 10 seconds to read.

Diagnostic plots assess model adequacy after fitting. The actual-vs-fitted plot checks how well the model predicts: points should cluster near the 45° line. The residual-vs-fitted plot checks the assumptions: random scatter around zero with constant spread means assumptions 3 (homoskedasticity) and 4 are plausible; fan shapes suggest heteroskedasticity, curved patterns suggest nonlinearity. LOWESS smoothing through the residuals helps reveal systematic patterns you might otherwise miss.
What you can do here
  • Toggle Bivariate vs With Controls to see how adding regressors changes the diagnostics.
  • Toggle Actual vs Fitted / Residual vs Fitted between the two standard diagnostic views.
  • Turn country labels on to identify specific outliers.
[Live readouts: Growth coef, Resid mean, Resid std]
Try this
  1. Start with Bivariate on Actual vs Fitted. Points scatter widely — R² ≈ 0.19. Switch to With Controls. R² roughly doubles to ~0.45 — the multivariate model is genuinely better, not just denser.
  2. Toggle to Residual vs Fitted. Compare Bivariate and With Controls side by side. Residuals look more randomly scattered and less patterned in the multivariate version — a good sign for the OLS assumptions.
  3. Turn Show Countries on and hover the extreme points. The named outliers are the same countries across both views β€” those are the observations to investigate (or flag) before reporting results.

Take-away: Residual plots are the cheapest assumption-check you own — run them on every regression before you interpret any coefficient. Read §16.8 in the chapter →

Influential Observations

Is your regression driven by the whole sample — or by three countries? DFITS and DFBETAS tell you which observations are carrying the estimate.

DFITS measures influence on fitted values; DFBETAS measures influence on individual coefficients. DFITSi is the scaled change in ŷi when observation i is excluded; DFBETASj,i is the scaled change in β̂j. Thresholds for investigation: |DFITS| > 2√(k/n) and |DFBETAS| > 2/√n. Don't automatically delete flagged points — investigate whether they're data errors, genuine outliers, or valid extreme values carrying important information.
What you can do here
  • Toggle DFITS vs DFBETAS (growth) — different influence measures surface different points.
  • Toggle the label mode — Flagged only, All countries, or Off.
  • Compare the threshold line to each bar — anything beyond warrants investigation.
[Live readouts: n countries, k parameters, Threshold, Flagged]
Try this
  1. Keep DFITS and Flagged Only. China, Oman, Myanmar typically top the chart. These countries move predicted GDP growth noticeably when excluded — worth investigating before reporting pooled results.
  2. Switch to DFBETAS (growth). A different set of countries may emerge. DFITS measures influence on predictions; DFBETAS isolates influence on a specific coefficient — the two flags don't always agree.
  3. Turn on All Countries labels and inspect points near the threshold. Borderline cases are worth a look too — influence is a continuum, not a binary flag.

Take-away: DFITS and DFBETAS surface observations that bend the regression — use them to investigate, not automatically delete, influential points. Read §16.8 in the chapter →

Omitted Variable Bias

Does economic growth foster democracy? The bivariate regression says 0.131. With the right controls, it drops to 0.047 — 64% of the "effect" turns out to be confounding.

An omitted variable biases the coefficient of every included regressor it correlates with. In the democracy-growth example, the growth coefficient falls from 0.131 (bivariate) to 0.047 (with controls) — a 64% reduction. Institutional variables (religion, executive constraints) were correlated with both growth and democracy, biasing the bivariate estimate upward. The question to ask is always: "What else affects y and correlates with x?" — find those variables and include them.
What you can do here
  • Toggle between None, + Institutional, + Religion, and All Controls to add layers of controls.
  • Watch the growth coefficient shrink as each layer is added.
  • Toggle the Coefficient Bar view to see the shrinkage visually.
[Live readouts: Growth coef, SE (robust), % change from bivariate]
Try this
  1. Start at None (bivariate). Growth coef = 0.131. Add + Institutional controls. The coefficient drops substantially — institutions explain part of what the bivariate estimate attributed to growth.
  2. Add + Religion controls. The coefficient drops further; at All Controls it lands at 0.047. Each layer of controls absorbs more of the spurious association.
  3. Toggle to Coefficient Bar. The shrinkage is visually dramatic. The bivariate "growth effect" overstated the real relationship by 64% — the multivariate estimate is the honest number to report.

Take-away: If a coefficient moves substantially when you add a plausible control, the bivariate estimate was contaminated — OVB is not a hypothetical, it's what the numbers are telling you. Read §16.7 in the chapter →

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers the key techniques you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 16 CHEAT SHEET: Checking the Model and Data
# =============================================================================

# --- Libraries ---
import numpy as np                                          # numerical operations
import pandas as pd                                         # data loading and manipulation
import matplotlib.pyplot as plt                             # creating plots and visualizations
from statsmodels.formula.api import ols                     # OLS regression with R-style formulas
import statsmodels.api as sm                                # add_constant for VIF calculation
from statsmodels.stats.outliers_influence import (          # diagnostic tools:
    variance_inflation_factor, OLSInfluence)                #   VIF and influence measures
from statsmodels.nonparametric.smoothers_lowess import lowess  # LOWESS smooth for residual plots

# =============================================================================
# STEP 1: Load data
# =============================================================================
# Two datasets: earnings (cross-section) and democracy (cross-country)
url_base = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/"

data_earnings  = pd.read_stata(url_base + "AED_EARNINGS_COMPLETE.DTA")
data_democracy = pd.read_stata(url_base + "AED_DEMOCRACY.DTA")

print(f"Earnings:  {data_earnings.shape[0]} workers, {data_earnings.shape[1]} variables")
print(f"Democracy: {data_democracy.shape[0]} countries, {data_democracy.shape[1]} variables")

# =============================================================================
# STEP 2: Detect multicollinearity with VIF
# =============================================================================
# VIF_j = 1/(1 - R_j^2): measures how much SE inflates due to collinearity
# VIF > 10 = serious problem; VIF > 5 = investigate further

X_vif = data_earnings[['age', 'education', 'agebyeduc']].copy()
X_vif = sm.add_constant(X_vif)

vif_data = pd.DataFrame({
    'Variable': X_vif.columns[1:],      # skip the constant: its VIF is not meaningful
    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]
})
print("\nVariance Inflation Factors (with interaction term):")
print(vif_data.to_string(index=False))

# =============================================================================
# STEP 3: Compare standard vs robust standard errors
# =============================================================================
# Heteroskedasticity makes default SEs too small -> use HC1 (White) robust SEs
model_std    = ols('earnings ~ age + education', data=data_earnings).fit()
model_robust = ols('earnings ~ age + education', data=data_earnings).fit(cov_type='HC1')

se_comparison = pd.DataFrame({
    'Variable':    model_std.params.index,
    'Standard SE': model_std.bse.values.round(2),
    'Robust SE':   model_robust.bse.values.round(2),
    'Ratio':       (model_robust.bse / model_std.bse).values.round(3)
})
print("\nSE Comparison (ratio > 1 signals heteroskedasticity):")
print(se_comparison.to_string(index=False))

# =============================================================================
# STEP 4: Omitted variable bias β€” democracy and growth
# =============================================================================
# Adding controls reveals how much the bivariate estimate was biased upward
model_bivariate = ols('democracy ~ growth', data=data_democracy).fit(cov_type='HC1')
model_multiple  = ols('democracy ~ growth + constraint + indcent + catholic + muslim + protestant',
                      data=data_democracy).fit(cov_type='HC1')

b_biv  = model_bivariate.params['growth']
b_mult = model_multiple.params['growth']
print(f"\nGrowth coefficient (bivariate):       {b_biv:.4f}")
print(f"Growth coefficient (with controls):   {b_mult:.4f}")
print(f"Reduction: {(1 - b_mult/b_biv)*100:.0f}% — institutional controls absorb the bias")

# =============================================================================
# STEP 5: Diagnostic plots β€” residual vs fitted
# =============================================================================
# Random scatter around zero = assumptions OK; fan shape = heteroskedasticity
uhat = model_multiple.resid
yhat = model_multiple.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel A: Actual vs Fitted
axes[0].scatter(yhat, data_democracy['democracy'], s=50, alpha=0.6)
axes[0].plot([yhat.min(), yhat.max()], [yhat.min(), yhat.max()],
             'r-', linewidth=2, label='45° line')
axes[0].set_xlabel('Fitted Democracy')
axes[0].set_ylabel('Actual Democracy')
axes[0].set_title('Actual vs Fitted')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Panel B: Residual vs Fitted (with LOWESS smooth)
axes[1].scatter(yhat, uhat, s=50, alpha=0.6)
axes[1].axhline(y=0, color='gray', linewidth=1)
lw = lowess(uhat, yhat, frac=0.3)          # LOWESS reveals hidden patterns
axes[1].plot(lw[:, 0], lw[:, 1], 'r--', linewidth=2, label='LOWESS smooth')
axes[1].set_xlabel('Fitted Democracy')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residual vs Fitted')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Influential observations β€” DFITS
# =============================================================================
# DFITS_i measures how much prediction i changes when observation i is excluded
# Threshold: |DFITS| > 2*sqrt(k/n)
influence = OLSInfluence(model_multiple)
dfits     = influence.dffits[0]
n = len(data_democracy)
k = len(model_multiple.params)
threshold = 2 * np.sqrt(k / n)

print(f"\nDFITS threshold: {threshold:.4f}")
print(f"Observations exceeding threshold: {np.sum(np.abs(dfits) > threshold)} out of {n}")

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['red' if abs(d) > threshold else 'steelblue' for d in dfits]
ax.scatter(range(n), dfits, c=colors, s=40, alpha=0.7)
ax.axhline(y=threshold,  color='red', linestyle='--', label=f'Threshold ±{threshold:.3f}')
ax.axhline(y=-threshold, color='red', linestyle='--')
ax.axhline(y=0, color='gray', linewidth=0.5)
ax.set_xlabel('Observation Index')
ax.set_ylabel('DFITS')
ax.set_title('Influential Observations (DFITS)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →