Chapter 16 of 18 · Interactive Dashboard

Checking the Model and Data

Explore regression diagnostics interactively -- multicollinearity, heteroskedasticity, autocorrelation, omitted variable bias, diagnostic plots, and influential observations.

VIF Explorer: Multicollinearity

Add one more regressor and your SEs triple — but the coefficient barely moves. That's multicollinearity, and VIF is the one number that diagnoses it.

The Variance Inflation Factor quantifies multicollinearity: VIFj = 1 / (1 − Rj²), where Rj² is from regressing xj on all other regressors. VIF = 1 means no collinearity; VIF > 10 indicates serious problems (standard errors inflated by √10 ≈ 3.2×). OLS stays unbiased under multicollinearity — only precision suffers. Joint F-tests on correlated regressors remain valid and powerful even when individual t-tests are unreliable (Key Concept 16.2).
What you can do here
  • Toggle Base (age + educ) vs. + Interaction to see VIF explode when the interaction is added.
  • Slide the threshold to change the "concern" line from 2 to 20.
  • Read the correlation heatmap — high off-diagonal correlations drive the VIFs.
[Live readouts: VIF age, VIF education, VIF interaction, Max VIF]
Try this
  1. Start on the Base model. VIFs sit near 1.0 for both age and education. Toggle to + Interaction. VIFs jump to 22–37 — the interaction term is nearly a linear combination of the two main effects.
  2. Drag the threshold slider to 5. The base model stays comfortably below; the interaction model blows past the line on every regressor — any threshold from 5 to 10 flags the same problem.
  3. Read the correlation heatmap. age×education correlates 0.73 with age and 0.64 with education. These high bivariate correlations are the mechanical cause of the VIF blow-up.

Take-away: VIF flags when correlated regressors will inflate your SEs — but joint F-tests still work, so don't drop variables solely to "fix" a high VIF. Read §16.1 in the chapter →

Robust vs. Standard SEs

Default SEs assume error variance is constant. In cross-sectional data it almost never is — and default SEs quietly lie. Robust SEs are the cheap insurance.

Heteroskedasticity means error variance depends on the regressors: Var[ui|xi] ≠ σ². OLS coefficients remain unbiased, but default SEs are wrong — typically too small, giving false confidence. Use heteroskedasticity-robust (HC1/White) standard errors, valid whether or not heteroskedasticity is present. When assumptions 3 (homoskedasticity) or 4 (no autocorrelation) fail, the fix is to change the inference method — not the model (Key Concept 16.3).
What you can do here
  • Toggle SE Comparison vs Residual Scatter.
  • Highlight All Variables, Education only, or Age only to focus the comparison.
  • Read the SE ratio — values far from 1.0 are the signature of heteroskedasticity.
[Live readouts: SE(age) standard, SE(age) robust, SE(educ) standard, SE(educ) robust, Ratio (educ)]
Try this
  1. Stay on SE Comparison. Robust SE bars are consistently taller than standard SE bars; the ratio for education ≈ 1.12. Default SEs understate uncertainty by ~12% on this regressor — enough to change a "significant at 5%" verdict on a borderline case.
  2. Toggle to Residual Scatter. The cloud fans outward at higher fitted values. Classic heteroskedasticity — error variance grows with predicted earnings, exactly the situation default SEs can't handle.
  3. Highlight Education Only. Its SE inflates more than age's does. Heteroskedasticity affects regressors unequally — robust SEs fix each one individually.

Take-away: Robust SEs cost nothing when homoskedasticity holds and protect you when it fails — report them as a default on any cross-sectional regression. Read §16.5 in the chapter →

Autocorrelation Explorer

When today's shock bleeds into tomorrow's, default SEs underestimate uncertainty — and your effective sample size is smaller than n. Raise ρ and watch it happen.

Autocorrelation means errors are correlated over time (Cov[ut, us] ≠ 0), common in time series when shocks persist. OLS stays unbiased but SEs are wrong — typically too small, leading to false significance. Use HAC (Newey-West) standard errors for valid inference. Check the ACF of residuals: a single large spike at lag 1 suggests an AR(1) process; multiple significant lags suggest higher-order autocorrelation. Severe autocorrelation drastically reduces the effective sample size.
What you can do here
  • Slide ρ from 0 (pure white noise) to 0.95 (extreme persistence).
  • Slide the number of observations to see how the ACF estimate stabilises at larger n.
  • Click Resimulate to draw new errors at the same ρ.
[Live readouts: Lag 1 ACF, Lag 5 ACF, Theory Lag 1, Eff. sample / n]
Try this
  1. Set ρ = 0. The time series looks like white noise and ACF bars sit inside the band. No serial correlation — default SEs would be valid here.
  2. Drag to ρ = 0.80, then ρ = 0.95. Errors cluster in long runs and the ACF decays slowly — lag 10 is still large at ρ = 0.95. The effective sample size collapses as persistence grows — 500 observations can carry the information content of maybe 50.
  3. Resimulate at ρ = 0.80 several times. Each realisation looks different but the slow-decay ACF pattern is consistent — persistence is a population property, not a draw-specific artifact.

Take-away: Autocorrelation shrinks your effective sample even when n looks large — detect it via the ACF of residuals and correct inference with HAC standard errors. Read §16.6 in the chapter →

Diagnostic Plots

A tidy regression table can hide a sick model. The residual plot tells the truth — and it takes 10 seconds to read.

Diagnostic plots assess model adequacy after fitting. The actual-vs-fitted plot checks how well the model predicts: points should cluster near the 45° line. The residual-vs-fitted plot checks the assumptions: random scatter around zero with constant spread means assumptions 3 (homoskedasticity) and 4 are plausible; fan shapes suggest heteroskedasticity, curved patterns suggest nonlinearity. LOWESS smoothing through the residuals helps reveal systematic patterns you might otherwise miss.
What you can do here
  • Toggle Bivariate vs With Controls to see how adding regressors changes the diagnostics.
  • Toggle Actual vs Fitted / Residual vs Fitted between the two standard diagnostic views.
  • Turn country labels on to identify specific outliers.
[Live readouts: Growth coef, Resid mean, Resid std]
Try this
  1. Start with Bivariate on Actual vs Fitted. Points scatter widely — R² ≈ 0.19. Switch to With Controls. R² roughly doubles to ~0.45 — the multivariate model is genuinely better, not just denser.
  2. Toggle to Residual vs Fitted. Compare Bivariate and With Controls side by side. Residuals look more randomly scattered and less patterned in the multivariate version — a good sign for the OLS assumptions.
  3. Turn Show Countries on and hover the extreme points. The named outliers are the same countries across both views β€” those are the observations to investigate (or flag) before reporting results.

Take-away: Residual plots are the cheapest assumption-check you own — run them on every regression before you interpret any coefficient. Read §16.8 in the chapter →

Influential Observations

Is your regression driven by the whole sample — or by three countries? DFITS and DFBETAS tell you which observations are carrying the estimate.

DFITS measures influence on fitted values; DFBETAS measures influence on individual coefficients. DFITSi is the scaled change in ŷi when observation i is excluded; DFBETASj,i is the scaled change in β̂j. Thresholds for investigation: |DFITS| > 2√(k/n) and |DFBETAS| > 2/√n. Don't automatically delete flagged points — investigate whether they're data errors, genuine outliers, or valid extreme values carrying important information.
What you can do here
  • Toggle DFITS vs DFBETAS (growth) — different influence measures surface different points.
  • Toggle the label mode — Flagged only, All countries, or Off.
  • Compare the threshold line to each bar — anything beyond warrants investigation.
[Live readouts: n countries, k parameters, Threshold, Flagged]
Try this
  1. Keep DFITS and Flagged Only. China, Oman, Myanmar typically top the chart. These countries move predicted GDP growth noticeably when excluded — worth investigating before reporting pooled results.
  2. Switch to DFBETAS (growth). A different set of countries may emerge. DFITS measures influence on predictions; DFBETAS isolates influence on a specific coefficient — the two flags don't always agree.
  3. Turn on All Countries labels and inspect points near the threshold. Borderline cases are worth a look too — influence is a continuum, not a binary flag.

Take-away: DFITS and DFBETAS surface observations that bend the regression — use them to investigate, not automatically delete, influential points. Read §16.8 in the chapter →

Omitted Variable Bias

Does economic growth foster democracy? The bivariate regression says 0.131. With the right controls, it drops to 0.047 — 64% of the "effect" turns out to be confounding.

An omitted variable biases the coefficient of every included regressor it correlates with. In the democracy-growth example, the growth coefficient falls from 0.131 (bivariate) to 0.047 (with controls) — a 64% reduction. Institutional variables (religion, executive constraints) were correlated with both growth and democracy, biasing the bivariate estimate upward. The question to ask is always: "What else affects y and correlates with x?" — find those variables and include them.
What you can do here
  • Toggle between None, + Institutional, + Religion, and All Controls to add layers of controls.
  • Watch the growth coefficient shrink as each layer is added.
  • Toggle the Coefficient Bar view to see the shrinkage visually.
[Live readouts: Growth coef, SE (robust), % change from bivariate]
Try this
  1. Start at None (bivariate). Growth coef = 0.131. Add + Institutional controls. The coefficient drops substantially — institutions explain part of what the bivariate estimate attributed to growth.
  2. Add + Religion controls. The coefficient drops further; at All Controls it lands at 0.047. Each layer of controls absorbs more of the spurious association.
  3. Toggle to Coefficient Bar. The shrinkage is visually dramatic. The bivariate "growth effect" overstated the real relationship by 64% — the multivariate estimate is the honest number to report.

Take-away: If a coefficient moves substantially when you add a plausible control, the bivariate estimate was contaminated — OVB is not a hypothetical, it's what the numbers are telling you. Read §16.7 in the chapter →

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers the key techniques you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 16 CHEAT SHEET: Checking the Model and Data
# =============================================================================

# --- Libraries ---
import numpy as np                                          # numerical operations
import pandas as pd                                         # data loading and manipulation
import matplotlib.pyplot as plt                             # creating plots and visualizations
from statsmodels.formula.api import ols                     # OLS regression with R-style formulas
import statsmodels.api as sm                                # add_constant for VIF calculation
from statsmodels.stats.outliers_influence import (          # diagnostic tools:
    variance_inflation_factor, OLSInfluence)                #   VIF and influence measures
from statsmodels.nonparametric.smoothers_lowess import lowess  # LOWESS smooth for residual plots

# =============================================================================
# STEP 1: Load data
# =============================================================================
# Two datasets: earnings (cross-section) and democracy (cross-country)
url_base = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/"

data_earnings  = pd.read_stata(url_base + "AED_EARNINGS_COMPLETE.DTA")
data_democracy = pd.read_stata(url_base + "AED_DEMOCRACY.DTA")

print(f"Earnings:  {data_earnings.shape[0]} workers, {data_earnings.shape[1]} variables")
print(f"Democracy: {data_democracy.shape[0]} countries, {data_democracy.shape[1]} variables")

# =============================================================================
# STEP 2: Detect multicollinearity with VIF
# =============================================================================
# VIF_j = 1/(1 - R_j^2): measures how much SE inflates due to collinearity
# VIF > 10 = serious problem; VIF > 5 = investigate further

X_vif = data_earnings[['age', 'education', 'agebyeduc']].copy()
X_vif = sm.add_constant(X_vif)

vif_data = pd.DataFrame({
    'Variable': X_vif.columns[1:],      # skip the constant: its VIF is not meaningful
    'VIF': [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]
})
print("\nVariance Inflation Factors (with interaction term):")
print(vif_data.to_string(index=False))

# =============================================================================
# STEP 3: Compare standard vs robust standard errors
# =============================================================================
# Heteroskedasticity makes default SEs too small -> use HC1 (White) robust SEs
model_std    = ols('earnings ~ age + education', data=data_earnings).fit()
model_robust = ols('earnings ~ age + education', data=data_earnings).fit(cov_type='HC1')

se_comparison = pd.DataFrame({
    'Variable':    model_std.params.index,
    'Standard SE': model_std.bse.values.round(2),
    'Robust SE':   model_robust.bse.values.round(2),
    'Ratio':       (model_robust.bse / model_std.bse).values.round(3)
})
print("\nSE Comparison (ratio > 1 signals heteroskedasticity):")
print(se_comparison.to_string(index=False))

# =============================================================================
# STEP 4: Omitted variable bias β€” democracy and growth
# =============================================================================
# Adding controls reveals how much the bivariate estimate was biased upward
model_bivariate = ols('democracy ~ growth', data=data_democracy).fit(cov_type='HC1')
model_multiple  = ols('democracy ~ growth + constraint + indcent + catholic + muslim + protestant',
                      data=data_democracy).fit(cov_type='HC1')

b_biv  = model_bivariate.params['growth']
b_mult = model_multiple.params['growth']
print(f"\nGrowth coefficient (bivariate):       {b_biv:.4f}")
print(f"Growth coefficient (with controls):   {b_mult:.4f}")
print(f"Reduction: {(1 - b_mult/b_biv)*100:.0f}% — institutional controls absorb the bias")

# =============================================================================
# STEP 5: Diagnostic plots β€” residual vs fitted
# =============================================================================
# Random scatter around zero = assumptions OK; fan shape = heteroskedasticity
uhat = model_multiple.resid
yhat = model_multiple.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel A: Actual vs Fitted
axes[0].scatter(yhat, data_democracy['democracy'], s=50, alpha=0.6)
axes[0].plot([yhat.min(), yhat.max()], [yhat.min(), yhat.max()],
             'r-', linewidth=2, label='45° line')
axes[0].set_xlabel('Fitted Democracy')
axes[0].set_ylabel('Actual Democracy')
axes[0].set_title('Actual vs Fitted')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Panel B: Residual vs Fitted (with LOWESS smooth)
axes[1].scatter(yhat, uhat, s=50, alpha=0.6)
axes[1].axhline(y=0, color='gray', linewidth=1)
lw = lowess(uhat, yhat, frac=0.3)          # LOWESS reveals hidden patterns
axes[1].plot(lw[:, 0], lw[:, 1], 'r--', linewidth=2, label='LOWESS smooth')
axes[1].set_xlabel('Fitted Democracy')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residual vs Fitted')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Influential observations β€” DFITS
# =============================================================================
# DFITS_i measures how much prediction i changes when observation i is excluded
# Threshold: |DFITS| > 2*sqrt(k/n)
influence = OLSInfluence(model_multiple)
dfits     = influence.dffits[0]
n = len(data_democracy)
k = len(model_multiple.params)
threshold = 2 * np.sqrt(k / n)

print(f"\nDFITS threshold: {threshold:.4f}")
print(f"Observations exceeding threshold: {np.sum(np.abs(dfits) > threshold)} out of {n}")

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['red' if abs(d) > threshold else 'steelblue' for d in dfits]
ax.scatter(range(n), dfits, c=colors, s=40, alpha=0.7)
ax.axhline(y=threshold,  color='red', linestyle='--', label=f'Threshold ±{threshold:.3f}')
ax.axhline(y=-threshold, color='red', linestyle='--')
ax.axhline(y=0, color='gray', linewidth=0.5)
ax.set_xlabel('Observation Index')
ax.set_ylabel('DFITS')
ax.set_title('Influential Observations (DFITS)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →