Chapter 14 of 18 · Interactive Dashboard

Regression with Indicator Variables

Explore the gender earnings gap, interaction effects, the dummy variable trap, and worker-type differences using the same dataset and models as the book.

Gender Earnings Gap Explorer

Women earn $16,396 less per year than men in this sample. How much of that gap survives once we compare workers with similar education, age, and hours?

Regressing y on just an intercept and a single indicator d is algebraically a difference-in-means test. The fitted model ŷ = b + ad has intercept b = mean of y when d = 0, and slope a = the mean difference (ȳ₁ − ȳ₀). When you add other regressors, the coefficient on d becomes the adjusted difference — how much of the raw gap remains after controlling for observable differences in those regressors.
What you can do here
  • Pick a model — from gender-only through gender + education + age + hours.
  • Watch the gender coefficient move as controls are added.
  • Compare R² across models to see how much each control adds.
Stat cards: Gender coef · t-statistic · p-value
Try This
  1. Start with Model 1 (gender only). Gap = −$16,396, R² = 2.5%. The raw gap is large but gender alone explains almost none of the earnings variation — most of the variation in earnings is driven by other things.
  2. Switch to Model 2 (+ Education). The gap widens to −$18,258 — it got bigger! Women in this sample have slightly more education on average, so controlling for education strips out some "protective" variation and the residual gap grows.
  3. Switch to Model 4 (full controls). The gap shrinks substantially once age and hours are added — quantifying how much of the raw gap is "explained" by observable characteristics vs. genuinely unexplained.

Take-away: A regression on an indicator is a difference-in-means test — and its coefficient adjusts once you add controls, quantifying what's explained vs. what isn't. Read §14.1 in the chapter →
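The equivalence above is easy to check numerically. This is a minimal sketch on synthetic data (the names y and d are illustrative, not the book's dataset): the fitted intercept reproduces the d = 0 group mean, and the fitted slope reproduces the difference in means.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 200
d = rng.integers(0, 2, n)                           # indicator: 0 or 1
y = 50_000 - 16_000 * d + rng.normal(0, 10_000, n)  # synthetic "earnings"
df = pd.DataFrame({'y': y, 'd': d})

fit = ols('y ~ d', data=df).fit()
b = fit.params['Intercept']                         # should equal group-0 mean
a = fit.params['d']                                 # should equal mean difference

mean0 = df.loc[df['d'] == 0, 'y'].mean()
mean1 = df.loc[df['d'] == 1, 'y'].mean()

print(f"intercept b = {b:,.2f}   group-0 mean = {mean0:,.2f}")
print(f"slope a     = {a:,.2f}   mean diff    = {mean1 - mean0:,.2f}")
```

The match is exact to floating-point precision, not approximate: OLS on an indicator is the difference-in-means estimator.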

Interaction Effects: Do Returns to Education Differ by Gender?

Maybe the earnings gap isn't a flat number — maybe it widens (or narrows) with education. An interaction term lets the data say so.

An interacted indicator variable — the product of an indicator d and a continuous regressor x — lets both intercepts and slopes differ between groups. In the model y = β₁ + β₂x + α₁d + α₂(d × x) + u, α₁ shifts the intercept and α₂ measures how the slope on x differs between groups. Include only d (no interaction) and the lines stay parallel; add the interaction and they can fan in or out across the range of x.
What you can do here
  • Toggle the interaction term On or Off.
  • Watch the male and female slopes in the stat cards — parallel lines mean equal slopes; diverging lines mean different ones.
  • Read the interaction coefficient — it's exactly the slope gap.
Stat cards: Male slope · Female slope · Interaction coef
Try This
  1. Keep interaction OFF. Both genders share a single slope of $5,907/year; the two lines are perfectly parallel. The model forces equal returns to education, regardless of what the data would prefer, and only the intercept shifts between groups.
  2. Toggle interaction ON. Male slope = $6,921/yr; female slope = $6,921 − $2,765 = $4,156/yr. Very different returns — women's earnings rise more slowly with education than men's in this sample.
  3. Follow the lines out to 20 years of education. The gender gap widens with education — the two groups diverge, so interaction is necessary to tell the right story.

Take-away: Without interactions, a model forces parallel slopes; adding d × x lets the data reveal differential returns that a simple indicator cannot capture. Read §14.3 in the chapter →
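The "interaction coefficient is exactly the slope gap" claim can be verified directly. A sketch on synthetic data (names x, d, y are illustrative): the coefficient on d × x from the fully interacted model equals the difference between the two slopes fit separately on each group.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 300
d = rng.integers(0, 2, n)
x = rng.uniform(8, 20, n)                            # e.g. years of education
y = 10_000 + 6_900 * x - 5_000 * d - 2_700 * d * x + rng.normal(0, 8_000, n)
df = pd.DataFrame({'y': y, 'x': x, 'd': d})

# Fully interacted model: both intercepts and slopes may differ by group
full = ols('y ~ x + d + d:x', data=df).fit()

# Separate regressions per group give each group its own slope
slope0 = ols('y ~ x', data=df[df['d'] == 0]).fit().params['x']
slope1 = ols('y ~ x', data=df[df['d'] == 1]).fit().params['x']

print(f"interaction coefficient:     {full.params['d:x']:,.2f}")
print(f"slope gap (group1 - group0): {slope1 - slope0:,.2f}")
```

The fully interacted model is algebraically equivalent to running the two group regressions side by side, so the two numbers agree exactly.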

Model Comparison

Does the gender coefficient stay put as you add controls? Or does it grow, shrink, or even flip sign depending on what else is in the model?

When an indicator appears with its interaction, individual t-tests can be misleading; use an F-test for joint significance. Running separate regressions for each group allows all coefficients to differ simultaneously, and the Chow test (F-test) evaluates whether the relationship is fundamentally different between groups (Key Concept 14.5). The joint test H₀: α₁ = 0, α₂ = 0 evaluates whether the categorical variable matters at all.
What you can do here
  • Scan the gender row across the five columns to see how the coefficient evolves.
  • Compare R² to see which controls add the most explanatory power.
  • Cross-reference with the interaction widget — Models 3 and 5 add the gender × education interaction.
Table columns: Variable · Model 1 · Model 2 · Model 3 · Model 4 · Model 5
Try This
  1. Scan the gender row across all five models. The coefficient shifts substantially — and may even flip sign once the interaction enters. That flip is not evidence that the gap disappeared — it's evidence that the interaction absorbs part of the effect, which is why joint F-tests are required.
  2. Compare R² across models. Education adds the biggest jump; later variables add less. Early controls earn their keep; the last few variables fight over the residual.
  3. Read the Model 5 gender coefficient (~$57,129). Does that mean women earn more? No — in an interacted model, the "gender" coefficient is the gap at x = 0 (zero education); the true gender effect must be evaluated at realistic x.

Take-away: In models with interactions, individual coefficients are not the whole story — test the gender block jointly and interpret effects at meaningful values of the interacting variable. Read §14.4 in the chapter →
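One way to run the joint test H₀: α₁ = 0, α₂ = 0 is to compare the full interacted model against a restricted model that drops both gender terms; statsmodels' compare_f_test does this directly. A sketch on synthetic data (not the book's numbers):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 300
d = rng.integers(0, 2, n)
x = rng.uniform(8, 20, n)
y = 20_000 + 6_000 * x - 3_000 * d - 2_500 * d * x + rng.normal(0, 8_000, n)
df = pd.DataFrame({'y': y, 'x': x, 'd': d})

full = ols('y ~ x + d + d:x', data=df).fit()      # unrestricted model
restricted = ols('y ~ x', data=df).fit()          # H0: coefficients on d and d:x both zero

# F-test of the two restrictions jointly
f_stat, p_val, df_diff = full.compare_f_test(restricted)
print(f"F = {f_stat:.2f}, p = {p_val:.4g}, restrictions = {df_diff:.0f}")
```

Even when the individual t-statistics on d and d:x look weak (they are highly collinear), the joint F can be decisive, which is exactly why the block test is the right tool here.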

The Dummy Variable Trap

Three categories, three indicators — and suddenly OLS refuses to run. Why does adding the "last" category break everything, and which one do you drop?

Including all C indicators from a set of mutually exclusive categories plus an intercept creates perfect multicollinearity — the dummy variable trap. Because d₁ + d₂ + ⋯ + dC = 1 for every observation, the intercept is an exact linear combination of the indicators. The fix: drop one indicator (the base category) or drop the intercept. Standard practice keeps the intercept and drops one category; the remaining coefficients are then interpreted as differences from the base.
What you can do here
  • Toggle the base (omitted) category between Private, Self-employed, and Government.
  • Watch the coefficients and SEs change when the base changes.
  • Watch the predicted means at the bottom — they stay identical.
Table columns: Variable · Coefficient · SE · t
Predicted means shown: Self-employed · Private · Government
Try This
  1. Keep Private as base. The coefficients on dself and dgovt are differences from the private-sector mean. Private is the reference point; the other two categories are compared to it.
  2. Switch to Self-employed as base. All the coefficients change dramatically — but the predicted means for each category stay the same. The parameterization changes; the model's predictions do not.
  3. Switch to Government. Same R², same predictions, new coefficient interpretation. The base category is a choice about interpretation, not a choice about model fit — pick whichever makes the story clearest.

Take-away: C categories require C − 1 indicators plus an intercept; the dropped category becomes the reference point and every other coefficient is read relative to it. Read §14.5 in the chapter →
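The rank deficiency behind the trap can be seen directly in the design matrix. A sketch with three synthetic categories: an intercept plus all three dummies leaves the matrix one rank short (the dummy columns sum to the intercept column), while dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 100
cat = rng.choice(['govt', 'private', 'self'], size=n)
dummies = pd.get_dummies(pd.Series(cat), dtype=float)  # one column per category

# Intercept + ALL three dummies: columns are linearly dependent (the trap)
X_trap = np.column_stack([np.ones(n), dummies.to_numpy()])
# Intercept + two dummies (private = base category): full column rank
X_ok = np.column_stack([np.ones(n), dummies[['govt', 'self']].to_numpy()])

print("trap design: ", np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1], "columns independent")
print("fixed design:", np.linalg.matrix_rank(X_ok), "of", X_ok.shape[1], "columns independent")
```

OLS needs X to have full column rank to invert X′X; the trap design fails that condition for every sample, no matter how large n is.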

Worker Type Comparison

Self-employed workers look like they earn the most — but are the CIs narrow enough to trust that ranking? The bar heights are only half the story.

Regressing y on a set of mutually exclusive indicators (with no other controls) is equivalent to analysis of variance (ANOVA). The coefficients give group means or differences from the base mean. The regression's F-test for joint significance of the indicators is identical to the ANOVA F-statistic — testing whether the categorical variable explains significant variation in y (Key Concept 14.7). Looking at group means with CIs is the visual counterpart.
What you can do here
  • Compare the bar heights across employment types.
  • Check the whiskers (95% CIs) — wider = less precise, narrower = more precise.
  • Look for CI overlap between two categories — overlap usually means the difference isn't statistically significant.
Try This
  1. Identify the highest-mean category. Self-employed is visibly highest. But the CI is by far the widest — small sample (n = 79) and heavy-tailed self-employment income make the point estimate unreliable.
  2. Check if the private and government CIs overlap. Substantial overlap suggests the difference between those two categories may not be statistically significant at 5%.
  3. Note the asymmetry of precision. The narrowest CIs belong to the largest subsamples. A reminder that "highest mean" and "most precisely estimated" are two different things.

Take-away: A set of mutually exclusive indicators is just ANOVA in regression clothing — and the CIs around each group mean tell you which differences can be trusted. Read §14.5 in the chapter →
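The regression-equals-ANOVA claim can also be checked numerically. A sketch on synthetic data, using default homoskedastic OLS since the exact equivalence holds for the non-robust F: regressing y on the category indicators yields an overall F identical to the one-way ANOVA F.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
n = 150
cat = rng.choice(['govt', 'private', 'self'], size=n)
y = 40_000 + 8_000 * (cat == 'self') + 3_000 * (cat == 'govt') + rng.normal(0, 9_000, n)
df = pd.DataFrame({'y': y, 'cat': cat})

fit = ols('y ~ C(cat)', data=df).fit()               # regression on the indicator set
groups = [g['y'].to_numpy() for _, g in df.groupby('cat')]
f_anova, p_anova = f_oneway(*groups)                 # classical one-way ANOVA

print(f"regression overall F = {fit.fvalue:.6f}")
print(f"one-way ANOVA F      = {f_anova:.6f}")
```

The C() wrapper tells the formula interface to expand the category into C − 1 indicators automatically, sidestepping the dummy variable trap.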

Earnings vs Education by Gender

Two regression lines, one chart. Are they parallel, or do they fan apart? The answer decides whether you need an interaction term.

Scatter plots with separate regression lines by group visually reveal whether slopes and intercepts differ. Parallel lines indicate only an intercept shift — an indicator without interaction is enough. Non-parallel lines indicate differential slopes — an interaction term is needed. Either way, the gender gap is the vertical distance between the lines at a given x, and it's constant only when the lines are parallel.
What you can do here
  • Toggle Both / Male only / Female only to inspect each group.
  • Read male slope, female slope, and the gaps at 12 and 16 years of education in the stat cards.
  • Judge parallelism by eye — if the two lines tilt apart, interaction belongs in the model.
Stat cards: Male slope · Female slope · Gap at Educ=12 · Gap at Educ=16
Try This
  1. Compare the two regression lines by eye. Are they parallel? If not, returns to education differ by gender and an interaction term is mandatory in the model.
  2. Compare the gap at 12 years vs 16 years. If the gap grows with education, men's returns outpace women's — exactly what the interaction term in widget 2 estimates.
  3. Toggle Male only, then Female only. The male subsample has more high-earning outliers. Visual skew is one reason male regressions dominate the pooled slope — the interaction term makes that story explicit.

Take-away: Two lines on one chart tell you in a glance whether you need interactions — and at which education levels the gender gap is widest. Read §14.3 in the chapter →
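In the interacted model, the vertical gap between the lines at education level x is α₁ + α₂·x, so it can be computed at any x. A sketch on synthetic data (the coefficients are illustrative, not the book's estimates):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(5)
n = 400
d = rng.integers(0, 2, n)
x = rng.uniform(8, 20, n)
y = 12_000 + 6_900 * x - 4_000 * d - 2_700 * d * x + rng.normal(0, 8_000, n)
df = pd.DataFrame({'y': y, 'x': x, 'd': d})

fit = ols('y ~ x + d + d:x', data=df).fit()
a1 = fit.params['d']       # intercept shift between groups
a2 = fit.params['d:x']     # slope gap between groups

# The group gap at a given x is a1 + a2*x; constant only when a2 = 0
for educ in (12, 16):
    print(f"gap at x = {educ}: {a1 + a2 * educ:,.0f}")
```

When a2 is negative, the gap is more negative at 16 years than at 12: the lines fan apart exactly as the widget shows.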

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 14 CHEAT SHEET: Regression with Indicator Variables
# =============================================================================

# --- Libraries ---
import pandas as pd                       # data loading and manipulation
import numpy as np                        # numerical operations
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas
from scipy import stats                   # t-tests for group comparisons
from scipy.stats import f_oneway          # one-way ANOVA F-test

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files — 872 full-time workers aged 25-65
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_EARNINGS_COMPLETE.DTA"
data = pd.read_stata(url)

print(f"Dataset: {data.shape[0]} observations, {data.shape[1]} variables")

# =============================================================================
# STEP 2: Descriptive statistics — compare earnings by gender
# =============================================================================
# Indicator variable: gender = 1 (female), gender = 0 (male)
mean_male   = data[data['gender'] == 0]['earnings'].mean()
mean_female = data[data['gender'] == 1]['earnings'].mean()
diff_means  = mean_female - mean_male

print(f"Mean earnings (Male):   ${mean_male:,.2f}")
print(f"Mean earnings (Female): ${mean_female:,.2f}")
print(f"Difference (F - M):     ${diff_means:,.2f}")

# =============================================================================
# STEP 3: Regression on a single indicator — equivalent to difference in means
# =============================================================================
# The intercept = mean for d=0 (males); the gender coefficient = mean difference
# IMPORTANT: .fit(cov_type='HC1') uses robust standard errors
model1 = ols('earnings ~ gender', data=data).fit(cov_type='HC1')

intercept = model1.params['Intercept']    # mean earnings for males
gap       = model1.params['gender']       # earnings gap (females - males)
r2        = model1.rsquared

print(f"\nModel 1: earnings = {intercept:,.0f} + ({gap:,.0f}) × gender")
print(f"Raw gender gap: ${gap:,.0f} (females earn ${abs(gap):,.0f} less)")
print(f"R-squared: {r2:.4f} ({r2*100:.1f}% of variation explained)")

model1.summary()

# =============================================================================
# STEP 4: Add controls and interaction — how the gap changes
# =============================================================================
# Adding education as a control measures the gap AFTER accounting for education
model2 = ols('earnings ~ gender + education', data=data).fit(cov_type='HC1')

# Adding a gender×education interaction allows returns to education to differ by gender
# (construct the interaction column explicitly in case the dataset doesn't ship with it)
data['genderbyeduc'] = data['gender'] * data['education']
model3 = ols('earnings ~ gender + education + genderbyeduc', data=data).fit(cov_type='HC1')

# Full model with additional controls
model4 = ols('earnings ~ gender + education + genderbyeduc + age + hours',
             data=data).fit(cov_type='HC1')

# Compare how the gender coefficient evolves across models
print(f"{'Model':<12} {'Gender Coef':>14} {'R²':>8}")
print("-" * 36)
for name, m in [('Gender only', model1), ('+ Education', model2),
                ('+ Interact', model3), ('+ Age,Hours', model4)]:
    g = m.params['gender']
    print(f"{name:<12} {g:>14,.0f} {m.rsquared:>8.4f}")

# =============================================================================
# STEP 5: Scatter plot — visualize separate regression lines by gender
# =============================================================================
# Non-parallel lines indicate different slopes = interaction term is needed
fig, ax = plt.subplots(figsize=(10, 6))

for g, label, color in [(0, 'Male', 'tab:blue'), (1, 'Female', 'tab:red')]:
    subset = data[data['gender'] == g]
    ax.scatter(subset['education'], subset['earnings'], alpha=0.3, s=25,
               label=label, color=color)
    # Fit and plot regression line for each group
    z = np.polyfit(subset['education'], subset['earnings'], 1)
    edu_range = np.linspace(subset['education'].min(), subset['education'].max(), 100)
    ax.plot(edu_range, np.poly1d(z)(edu_range), linewidth=2, color=color,
            label=f'{label} slope: ${z[0]:,.0f}/yr')

ax.set_xlabel('Years of Education')
ax.set_ylabel('Earnings ($)')
ax.set_title('Earnings vs Education by Gender (non-parallel = interaction needed)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Sets of indicators — worker type and the dummy variable trap
# =============================================================================
# Three mutually exclusive categories: dself, dprivate, dgovt (sum to 1)
# Drop one (dprivate = base) to avoid perfect multicollinearity
model_worker = ols('earnings ~ dself + dgovt + education + age',
                   data=data).fit(cov_type='HC1')

print("Base category: Private sector")
print(f"Self-employed vs Private: ${model_worker.params['dself']:,.0f}")
print(f"Government vs Private:    ${model_worker.params['dgovt']:,.0f}")
print(f"R-squared: {model_worker.rsquared:.4f}")

model_worker.summary()

# =============================================================================
# STEP 7: ANOVA — test if earnings differ across worker types
# =============================================================================
# Regression on mutually exclusive indicators = analysis of variance (ANOVA)
group_self = data[data['dself'] == 1]['earnings']
group_priv = data[data['dprivate'] == 1]['earnings']
group_govt = data[data['dgovt'] == 1]['earnings']

f_stat, p_value = f_oneway(group_self, group_priv, group_govt)
print(f"\nANOVA F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")

# Group means with counts
data['worker_type'] = np.where(data['dself'] == 1, 'Self-employed',
                      np.where(data['dprivate'] == 1, 'Private', 'Government'))
print(data.groupby('worker_type')['earnings'].agg(['mean', 'count']).round(0))
Open empty Colab notebook →