Chapter 15 of 18 · Interactive Dashboard

Regression with Transformed Variables

Explore log transformations, quadratic models, standardized coefficients, interaction effects, and retransformation bias — using the same earnings dataset as the book.

Log Transformation Explorer

Same data, same regressors — but one specification reports "$5,737 per year of school" and the other reports "10% per year." Which answer is the right one?

In a log-linear model (ln y = β₁ + β₂x), β₂ is a semi-elasticity: a 1-unit increase in x is associated with a 100 × β₂ percent change in y. In a log-log model (ln y = β₁ + β₂ ln x), β₂ is an elasticity — a 1% increase in x gives a β₂% change in y. You cannot directly compare R² across models with different dependent variables (y vs ln y); instead compare by prediction accuracy, information criteria, or economic plausibility (Key Concept 15.2).
What you can do here
  • Toggle Levels vs Log-linear to see how the coefficient and interpretation change.
  • Read the interpretation stat card — it translates the coefficient into plain English.
  • Compare the fitted lines on the scatter — notice where each specification misses.
Education coef
Interpretation
t-statistic
Try This
  1. Start in Levels. Each extra year of education adds ≈ $5,737 to earnings. Switch to Log-linear. Each extra year adds ≈ 10%. Dollars-per-year is concrete at the sample mean but misleading at the extremes; percent-per-year scales naturally across income levels.
  2. Compare the fit of the two specs. The log model's R² (0.176) beats the levels model's (0.103) — though remember the caveat above: the two R² values measure fit to different dependent variables, so treat the comparison as suggestive. Earnings are right-skewed; logging compresses the tail so the regression line fits everyone, not just the middle.
  3. Inspect the fitted lines on the scatter. The levels line misses the high earners badly; the log line is visibly more even. Shape of the residuals, not just R², is what decides which functional form is right.

Take-away: Log transformations convert dollar-change questions into percent-change questions — the right choice when the outcome is skewed and scale-dependent. Read §15.2 in the chapter →
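The semi-elasticity reading of a log-linear coefficient can be checked on simulated data. A minimal sketch with synthetic numbers (not the chapter's dataset), where the true return to schooling is built in at 10% per year:

```python
import numpy as np

# Synthetic earnings with a known 10%-per-year return to schooling
rng = np.random.default_rng(0)
educ = rng.uniform(8, 20, 5000)                      # years of schooling
lny = 9.0 + 0.10 * educ + rng.normal(0, 0.5, 5000)   # ln(earnings)

# Regress ln y on x: the slope recovers the semi-elasticity
slope, intercept = np.polyfit(educ, lny, 1)
print(f"Estimated semi-elasticity: {100 * slope:.1f}% per year")
```

With 5,000 observations the estimate lands within a fraction of a point of the true 10%, illustrating why the log-linear coefficient reads as "percent per unit of x."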

Quadratic Model & Turning Point

Earnings rise with age — up to a point. When does the rise stop? A quadratic term answers in one closed-form formula.

A quadratic model y = β₁ + β₂x + β₃x² captures nonlinear relationships with a turning point at x* = −β₂ / (2β₃). The marginal effect ME = β₂ + 2β₃x varies with x — unlike linear models where it is constant. If β₃ < 0, the shape is an inverted U (earnings peaking at a certain age). Always test the joint significance of x and x² with an F-test, because individual t-tests on the quadratic term are misleading when x and x² are highly correlated (Key Concept 15.4).
What you can do here
  • Toggle between Linear and Quadratic to compare the two fits.
  • Slide age to compute the marginal effect at any point on the life cycle.
  • Read the turning point — where the marginal effect crosses zero.
Age = 40
Turning point
Marginal effect
Try This
  1. Toggle to Linear. The line rises monotonically, missing the decline at older ages. Toggle back to Quadratic. The curve peaks in middle age and falls — the shape that linear can never capture.
  2. Slide age to the turning point (~49.5). The marginal effect crosses zero. One more year of age neither helps nor hurts — that's the peak of the inverted-U curve.
  3. Slide age to 60. The marginal effect is negative. Past the peak, additional age costs earnings — the classic life-cycle result quantified.

Take-away: A single quadratic term converts a monotonic line into an inverted-U with a named turning point — and its marginal effect has an age-specific formula. Read §15.3 in the chapter →
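The turning-point and marginal-effect formulas are a few lines of arithmetic. A sketch with illustrative coefficient magnitudes (chosen to land near the ~49.5-year peak above, not the actual fitted values):

```python
# Illustrative magnitudes, not the fitted coefficients from the data
b_age, b_agesq = 3000.0, -30.3

# Turning point: x* = -beta2 / (2 * beta3)
turning_point = -b_age / (2 * b_agesq)

def marginal_effect(age):
    return b_age + 2 * b_agesq * age    # ME = beta2 + 2 * beta3 * age

print(f"Turning point: {turning_point:.1f} years")
print(f"ME at 40: {marginal_effect(40):+,.0f}  ME at 60: {marginal_effect(60):+,.0f}")
```

Before the turning point the marginal effect is positive, after it negative — the same sign flip the slider shows.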

Standardized Coefficients

Which predictor matters most — education in years, gender as 0/1, or log-hours? Raw coefficients can't answer; standardized ones can.

Standardized (beta) coefficients β* = β × (sx / sy) measure effects in standard-deviation units. A one-standard-deviation increase in x is associated with a β* standard-deviation change in y. This lets you compare variables on the same scale regardless of their original units — a genuine ranking of explanatory importance within a single regression.
What you can do here
  • Toggle Raw coefficients vs Standardized (β*).
  • Compare bar lengths in the two modes — the ranking often changes.
  • Read the full coefficient label for interpretation.
Try This
  1. Start in Raw. ln-hours has the largest coefficient (0.94); gender is small (−0.19). But these raw numbers aren't comparable — they ride on different unit scales and different variances.
  2. Switch to Standardized. Education leads (β* = 0.40); ln-hours drops to 0.23. Standardization reveals that log-hours has low variance in this sample, so its raw-coefficient lead was misleading.
  3. Read gender's standardized coefficient (β* ≈ −0.13). Small in SD units, but still ~$7,000 at the sample mean — "small in β*" isn't the same as "unimportant in dollars".

Take-away: Standardized coefficients let you compare predictors on one yardstick — but always check back against the dollar (or raw) units when stakes are practical. Read §15.4 in the chapter →

Interaction Marginal Effects

Does one more year of schooling pay the same at 25 as it does at 55? An interaction term lets the data say "no."

With an interaction term x × z, the marginal effect of x depends on z: MEx = β₂ + β₄z. The effect of one variable changes with the level of another — the linear-constant-effect assumption is relaxed. Individual coefficients on x and x × z may appear insignificant due to multicollinearity between them, so always use joint F-tests to evaluate overall significance.
What you can do here
  • Slide the age from 25 to 65.
  • Watch the marginal effect at this age update — it follows the formula ME = β₂ + β₄ × age from above.
  • Trace the ME line on the chart — its slope is the interaction coefficient.
Age = 40
Education coef
Interaction coef
ME at this age
Try This
  1. Slide to age 25, then to age 55. The ME rises from ~$5,241 to ~$6,112. The interaction coefficient is about $29/year-of-age, so more experienced workers earn more per extra year of schooling.
  2. Watch the marginal-effect line slope upward. Education and experience are complementary — schooling pays off more when combined with more labor-market experience.
  3. Do the arithmetic. Age 25 → 55 is 30 years × $29 = ~$871 extra return per year of education. Modest in absolute terms, but economically meaningful over a career.

Take-away: Interactions turn the constant marginal effect into an age-varying one — and always check their block with a joint F-test, since x and x × z are highly collinear. Read §15.5 in the chapter →

Retransformation Bias

Regress in logs, exponentiate back, publish prediction. Wrong by 18% — unless you know the smearing trick.

The naive prediction exp(ln̂ y) systematically underestimates E[y|x] because E[exp(u)] ≠ exp(E[u]) (Jensen's inequality). Under normal homoskedastic errors, multiply by the correction factor exp(se² / 2). Duan's smearing estimator offers a nonparametric alternative: ŷ = exp(ln̂ y) × (1/n) Σ exp(ûi). Either way, naive exponentiation is biased — and the bias grows with the variance of ln y.
What you can do here
  • Toggle Compare both / Naive only / Corrected only.
  • Read the smearing factor in the stat cards.
  • Compare the Actual, Naive, and Corrected means — the corrected version should be close to actual.
Actual mean
Naive mean
Corrected mean
Smearing factor
Try This
  1. Start in Compare both. The naive bar underpredicts actual earnings by ~$9,500 (~18%). That's the Jensen's-inequality gap that silently contaminates every uncorrected log-model forecast.
  2. Read the smearing factor (exp(σ²/2) ≈ 1.18). Multiply every naive prediction by 1.18 to remove the downward bias — a one-line fix that restores unbiased level predictions.
  3. Note the small residual gap ($55,501 vs $56,369 actual). The normal-based correction is approximate when errors aren't exactly Gaussian — Duan's nonparametric smearing is the fallback when you don't want to assume normality.

Take-away: Exponentiating log-model predictions without a smearing correction systematically underpredicts the mean — a bias that only vanishes when error variance is zero. Read §15.6 in the chapter →
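Both corrections can be demonstrated by simulation. A sketch with an assumed log-scale error SD of 0.57 (chosen so the smearing factor is ≈ 1.18, as in the panel) and a single hypothetical fitted value:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.57                               # assumed log-residual SD
u = rng.normal(0, sigma, 200_000)
lny_hat = 10.5                             # hypothetical log-scale prediction
y = np.exp(lny_hat + u)                    # realized levels outcome

naive = np.exp(lny_hat)                    # biased: E[exp(u)] > exp(E[u])
normal_fix = naive * np.exp(sigma**2 / 2)  # normal-errors correction
duan_fix = naive * np.exp(u).mean()        # Duan's nonparametric smearing

print(f"Smearing factor: {np.exp(sigma**2 / 2):.2f}")
print(f"Actual mean {y.mean():,.0f} | naive {naive:,.0f} | corrected {normal_fix:,.0f}")
```

The naive prediction sits visibly below the simulated mean; either correction closes the gap.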

Residual Distribution: Levels vs. Log

OLS inference assumes residuals are roughly normal. Are they? One histogram decides whether you should be regressing on y or on ln y.

Choosing between levels and logs is partly about which specification produces better-behaved residuals. Right-skewed residuals (from a levels model on skewed y) suggest a log transformation would help — the OLS standard errors, t-tests, and p-values rely on residuals being approximately normal. The skewness statistic quantifies the asymmetry: closer to 0 is better. You can't compare R² across the two specifications (different dependent variables), but you can compare residual shapes directly.
What you can do here
  • Toggle Levels residuals vs Log residuals.
  • Watch the skewness stat card — it measures asymmetry numerically.
  • Compare histogram shapes — normal means roughly symmetric with thin tails.
Mean
SD
Skewness
Try This
  1. Stay on Levels residuals. The histogram has a long right tail — large positive residuals (underpredictions of high earners). A strong signal that the normality assumption is broken and the standard errors lie.
  2. Switch to Log residuals. The distribution becomes visibly symmetric and the skewness drops. Logging y pulls the tail back and delivers residuals the OLS formulas actually assume.
  3. Conclude. This is the residual-shape reason economists default to ln(earnings) — inference is only valid if the residuals cooperate.

Take-away: The right specification is partly a residual-diagnostic question — logs produce cleaner, more symmetric residuals when y is skewed, and cleaner residuals mean honest t-tests and CIs. Read §15.2 in the chapter →
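The residual-skewness diagnostic is easy to run yourself. A sketch on synthetic right-skewed "earnings" (lognormal by construction, not the chapter's sample):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(8, 20, 5000)
y = np.exp(9 + 0.10 * x + rng.normal(0, 0.5, 5000))   # right-skewed outcome

def skewness(r):
    r = r - r.mean()
    return (r**3).mean() / (r**2).mean() ** 1.5

# Residuals from a levels fit vs a log fit of the same relationship
resid_levels = y - np.polyval(np.polyfit(x, y, 1), x)
resid_logs = np.log(y) - np.polyval(np.polyfit(x, np.log(y), 1), x)

print(f"Skewness, levels: {skewness(resid_levels):.2f}")
print(f"Skewness, logs:   {skewness(resid_logs):.2f}")
```

The levels residuals inherit the long right tail; the log residuals come out nearly symmetric — the same contrast the histogram toggle shows.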

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 15 CHEAT SHEET: Regression with Transformed Variables
# =============================================================================

# --- Libraries ---
import numpy as np                        # numerical operations (log, exp, sqrt)
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# 872 full-time workers aged 25-65 with earnings, education, age, and hours
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_EARNINGS_COMPLETE.DTA"
data_earnings = pd.read_stata(url)

# Create log-age for the log-log specification (lnearnings, agesq,
# agebyeduc, and lnhours already come with the dataset)
data_earnings['lnage'] = np.log(data_earnings['age'])

print(f"Dataset: {data_earnings.shape[0]} observations, {data_earnings.shape[1]} variables")
print(data_earnings[['earnings', 'lnearnings', 'age', 'education']].describe().round(2))

# =============================================================================
# STEP 2: Log transformations — levels vs log-linear vs log-log
# =============================================================================
# Three specifications of the same relationship reveal different stories
ols_levels = ols('earnings ~ age + education', data=data_earnings).fit(cov_type='HC1')
ols_loglin = ols('lnearnings ~ age + education', data=data_earnings).fit(cov_type='HC1')
ols_loglog = ols('lnearnings ~ lnage + education', data=data_earnings).fit(cov_type='HC1')

print("=== Levels: absolute dollar effects ===")
print(f"  Education: +${ols_levels.params['education']:,.0f} per year")

print("\n=== Log-Linear: semi-elasticity (% change per unit) ===")
print(f"  Education: +{100*ols_loglin.params['education']:.1f}% per year (Mincer return)")

print("\n=== Log-Log: elasticity (% change per % change) ===")
print(f"  Age elasticity: {ols_loglog.params['lnage']:.4f}")

# =============================================================================
# STEP 3: Quadratic model — turning point and varying marginal effects
# =============================================================================
# A quadratic in age captures the inverted-U life-cycle earnings profile
ols_quad = ols('earnings ~ age + agesq + education', data=data_earnings).fit(cov_type='HC1')

bage    = ols_quad.params['age']
bagesq  = ols_quad.params['agesq']
turning_point = -bage / (2 * bagesq)        # age where earnings peak

print(f"Turning point: {turning_point:.1f} years")
for a in [25, 40, 55]:
    me = bage + 2 * bagesq * a              # ME varies with age
    print(f"  ME at age {a}: ${me:,.0f}/year")

# Joint F-test: H0: age and age² are jointly zero
f_test = ols_quad.f_test('age = 0, agesq = 0')
print(f"Joint F-test p-value: {f_test.pvalue:.4f}")

# =============================================================================
# STEP 4: Standardized coefficients — compare variable importance
# =============================================================================
# Raw coefficients can't be compared across different units; beta* can
ols_mix = ols('earnings ~ gender + age + agesq + education + dself + dgovt + lnhours',
              data=data_earnings).fit(cov_type='HC1')

sd_y = data_earnings['earnings'].std()
predictors = ['gender', 'age', 'agesq', 'education', 'dself', 'dgovt', 'lnhours']

print(f"\n{'Variable':<12} {'Raw coef':>12} {'Beta*':>8}")
print("-" * 34)
for var in sorted(predictors, key=lambda v: abs(ols_mix.params[v] * data_earnings[v].std() / sd_y), reverse=True):
    raw  = ols_mix.params[var]
    beta_star = raw * data_earnings[var].std() / sd_y
    print(f"{var:<12} {raw:>12.2f} {beta_star:>8.4f}")

# =============================================================================
# STEP 5: Interaction terms — education returns that vary with age
# =============================================================================
# Does one more year of schooling pay the same at 25 as at 55?
ols_inter = ols('earnings ~ age + education + agebyeduc', data=data_earnings).fit(cov_type='HC1')

b_educ  = ols_inter.params['education']
b_inter = ols_inter.params['agebyeduc']

print(f"\nME of education = {b_educ:,.0f} + {b_inter:.1f} × age")
for a in [25, 40, 55]:
    me = b_educ + b_inter * a               # ME depends on age
    print(f"  At age {a}: ${me:,.0f} per year of education")

# =============================================================================
# STEP 6: Retransformation bias — naive exp() underpredicts
# =============================================================================
# Jensen's inequality: E[exp(u)] > exp(E[u]), so naive predictions are biased
rmse_log = np.sqrt(ols_loglin.mse_resid)
correction = np.exp(rmse_log**2 / 2)        # normal-based smearing factor

naive_pred    = np.exp(ols_loglin.fittedvalues)
adjusted_pred = correction * naive_pred

print(f"\nSmearing factor: {correction:.4f}")
print(f"Actual mean:     ${data_earnings['earnings'].mean():,.0f}")
print(f"Naive mean:      ${naive_pred.mean():,.0f}  (underpredicts)")
print(f"Corrected mean:  ${adjusted_pred.mean():,.0f}  (bias removed)")

# =============================================================================
# STEP 7: Comprehensive model — combine all transformation types
# =============================================================================
# A single model mixing logs, quadratics, dummies, and continuous regressors
ols_full = ols('lnearnings ~ gender + age + agesq + education + dself + dgovt + lnhours',
               data=data_earnings).fit(cov_type='HC1')

print(f"\nR²: {ols_full.rsquared:.4f}")
print(f"Education return: ~{100*ols_full.params['education']:.1f}% per year (semi-elasticity)")
print(f"Gender gap: ~{100*ols_full.params['gender']:.1f}%")
print(f"Hours elasticity: {ols_full.params['lnhours']:.3f} (log-log coefficient)")

# Full regression table
ols_full.summary()
Open empty Colab notebook →