Chapter 14 of 18 · Interactive Dashboard

Regression with Indicator Variables

Explore the gender earnings gap, interaction effects, the dummy variable trap, and worker-type differences using the same dataset and models as the book.

Gender Earnings Gap Explorer

Women earn $16,396 less per year than men in this sample. How much of that gap survives once we compare workers with similar education, age, and hours?

Regressing y on just an intercept and a single indicator d is algebraically a difference-in-means test. The fitted model ŷ = b + ad has intercept b = mean of y when d = 0, and slope a = the mean difference (ȳ₁ − ȳ₀). When you add other regressors, the coefficient on d becomes the adjusted difference — how much of the raw gap remains after controlling for observable differences in those regressors.
What you can do here
  • Pick a model — from gender-only through gender + education + age + hours.
  • Watch the gender coefficient move as controls are added.
  • Compare R² across models to see how much each control adds.
Stat cards: Gender coef · t-statistic · p-value
Try This
  1. Start with Model 1 (gender only). Gap = −$16,396, R² = 2.5%. The raw gap is large but gender alone explains almost none of the earnings variation — most of the variation in earnings is driven by other things.
  2. Switch to Model 2 (+ Education). The gap widens to −$18,258 — it got bigger! Women in this sample have slightly more education on average, so controlling for education strips out some "protective" variation and the residual gap grows.
  3. Switch to Model 4 (full controls). The gap shrinks substantially once age and hours are added — quantifying how much of the raw gap is "explained" by observable characteristics vs. genuinely unexplained.

Take-away: A regression on an indicator is a difference-in-means test — and its coefficient adjusts once you add controls, quantifying what's explained vs. what isn't. Read §14.1 in the chapter →
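The equivalence above is easy to check numerically. This is a minimal sketch on synthetic data (the names y and d are illustrative, not the book's dataset): the fitted intercept reproduces the d = 0 group mean, and the fitted slope reproduces the difference in means.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 200
d = rng.integers(0, 2, n)                           # indicator: 0 or 1
y = 50_000 - 16_000 * d + rng.normal(0, 10_000, n)  # synthetic "earnings"
df = pd.DataFrame({'y': y, 'd': d})

fit = ols('y ~ d', data=df).fit()
b = fit.params['Intercept']                         # should equal group-0 mean
a = fit.params['d']                                 # should equal mean difference

mean0 = df.loc[df['d'] == 0, 'y'].mean()
mean1 = df.loc[df['d'] == 1, 'y'].mean()

print(f"intercept b = {b:,.2f}   group-0 mean = {mean0:,.2f}")
print(f"slope a     = {a:,.2f}   mean diff    = {mean1 - mean0:,.2f}")
```

The match is exact to floating-point precision, not approximate: OLS on an indicator is the difference-in-means estimator.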

Interaction Effects: Do Returns to Education Differ by Gender?

Maybe the earnings gap isn't a flat number — maybe it widens (or narrows) with education. An interaction term lets the data say so.

An interacted indicator variable — the product of an indicator d and a continuous regressor x — lets both intercepts and slopes differ between groups. In the model y = β₁ + β₂x + α₁d + α₂(d × x) + u, α₁ shifts the intercept and α₂ measures how the slope on x differs between groups. Include only d (no interaction) and the lines stay parallel; add the interaction and they can fan in or out across the range of x.
What you can do here
  • Toggle the interaction term On or Off.
  • Watch the male and female slopes in the stat cards — parallel lines mean equal slopes; diverging lines mean different ones.
  • Read the interaction coefficient — it's exactly the slope gap.
Stat cards: Male slope · Female slope · Interaction coef
Try This
  1. Keep interaction OFF. Both genders share a single slope of $5,907/year; the two lines are perfectly parallel. The model forces equal returns to education, regardless of what the data would prefer, and only the intercept shifts between groups.
  2. Toggle interaction ON. Male slope = $6,921/yr; female slope = $6,921 − $2,765 = $4,156/yr. Very different returns — women's earnings rise more slowly with education than men's in this sample.
  3. Follow the lines out to 20 years of education. The gender gap widens with education — the two groups diverge, so interaction is necessary to tell the right story.

Take-away: Without interactions, a model forces parallel slopes; adding d × x lets the data reveal differential returns that a simple indicator cannot capture. Read §14.3 in the chapter →
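The "interaction coefficient is exactly the slope gap" claim can be verified directly. A sketch on synthetic data (names x, d, y are illustrative): the coefficient on d × x from the fully interacted model equals the difference between the two slopes fit separately on each group.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 300
d = rng.integers(0, 2, n)
x = rng.uniform(8, 20, n)                            # e.g. years of education
y = 10_000 + 6_900 * x - 5_000 * d - 2_700 * d * x + rng.normal(0, 8_000, n)
df = pd.DataFrame({'y': y, 'x': x, 'd': d})

# Fully interacted model: both intercepts and slopes may differ by group
full = ols('y ~ x + d + d:x', data=df).fit()

# Separate regressions per group give each group its own slope
slope0 = ols('y ~ x', data=df[df['d'] == 0]).fit().params['x']
slope1 = ols('y ~ x', data=df[df['d'] == 1]).fit().params['x']

print(f"interaction coefficient:     {full.params['d:x']:,.2f}")
print(f"slope gap (group1 - group0): {slope1 - slope0:,.2f}")
```

The fully interacted model is algebraically equivalent to running the two group regressions side by side, so the two numbers agree exactly.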

Model Comparison

Does the gender coefficient stay put as you add controls? Or does it grow, shrink, or even flip sign depending on what else is in the model?

When an indicator appears with its interaction, individual t-tests can be misleading; use an F-test for joint significance. Running separate regressions for each group allows all coefficients to differ simultaneously, and the Chow test (F-test) evaluates whether the relationship is fundamentally different between groups (Key Concept 14.5). The joint test H₀: α₁ = 0, α₂ = 0 evaluates whether the categorical variable matters at all.
What you can do here
  • Scan the gender row across the five columns to see how the coefficient evolves.
  • Compare R² to see which controls add the most explanatory power.
  • Cross-reference with the interaction widget — Models 3 and 5 add the gender × education interaction.
Table columns: Variable · Model 1 · Model 2 · Model 3 · Model 4 · Model 5
Try This
  1. Scan the gender row across all five models. The coefficient shifts substantially — and may even flip sign once the interaction enters. That flip is not evidence that the gap disappeared — it's evidence that the interaction absorbs part of the effect, which is why joint F-tests are required.
  2. Compare R² across models. Education adds the biggest jump; later variables add less. Early controls earn their keep; the last few variables fight over the residual.
  3. Read the Model 5 gender coefficient (~$57,129). Does that mean women earn more? No — in an interacted model, the "gender" coefficient is the gap at x = 0 (zero education); the true gender effect must be evaluated at realistic x.

Take-away: In models with interactions, individual coefficients are not the whole story — test the gender block jointly and interpret effects at meaningful values of the interacting variable. Read §14.4 in the chapter →
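One way to run the joint test H₀: α₁ = 0, α₂ = 0 is to compare the full interacted model against a restricted model that drops both gender terms; statsmodels' compare_f_test does this directly. A sketch on synthetic data (not the book's numbers):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 300
d = rng.integers(0, 2, n)
x = rng.uniform(8, 20, n)
y = 20_000 + 6_000 * x - 3_000 * d - 2_500 * d * x + rng.normal(0, 8_000, n)
df = pd.DataFrame({'y': y, 'x': x, 'd': d})

full = ols('y ~ x + d + d:x', data=df).fit()      # unrestricted model
restricted = ols('y ~ x', data=df).fit()          # H0: coefficients on d and d:x both zero

# F-test of the two restrictions jointly
f_stat, p_val, df_diff = full.compare_f_test(restricted)
print(f"F = {f_stat:.2f}, p = {p_val:.4g}, restrictions = {df_diff:.0f}")
```

Even when the individual t-statistics on d and d:x look weak (they are highly collinear), the joint F can be decisive, which is exactly why the block test is the right tool here.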

The Dummy Variable Trap

Three categories, three indicators — and suddenly OLS refuses to run. Why does adding the "last" category break everything, and which one do you drop?

Including all C indicators from a set of mutually exclusive categories plus an intercept creates perfect multicollinearity — the dummy variable trap. Because d₁ + d₂ + ⋯ + dC = 1 for every observation, the intercept is an exact linear combination of the indicators. The fix: drop one indicator (the base category) or drop the intercept. Standard practice keeps the intercept and drops one category; the remaining coefficients are then interpreted as differences from the base.
What you can do here
  • Toggle the base (omitted) category between Private, Self-employed, and Government.
  • Watch the coefficients and SEs change when the base changes.
  • Watch the predicted means at the bottom — they stay identical.
Table columns: Variable · Coefficient · SE · t
Predicted means shown: Self-employed · Private · Government
Try This
  1. Keep Private as base. The coefficients on dself and dgovt are differences from the private-sector mean. Private is the reference point; the other two categories are compared to it.
  2. Switch to Self-employed as base. All the coefficients change dramatically — but the predicted means for each category stay the same. The parameterization changes; the model's predictions do not.
  3. Switch to Government. Same R², same predictions, new coefficient interpretation. The base category is a choice about interpretation, not a choice about model fit — pick whichever makes the story clearest.

Take-away: C categories require C − 1 indicators plus an intercept; the dropped category becomes the reference point and every other coefficient is read relative to it. Read §14.5 in the chapter →
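The rank deficiency behind the trap can be seen directly in the design matrix. A sketch with three synthetic categories: an intercept plus all three dummies leaves the matrix one rank short (the dummy columns sum to the intercept column), while dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 100
cat = rng.choice(['govt', 'private', 'self'], size=n)
dummies = pd.get_dummies(pd.Series(cat), dtype=float)  # one column per category

# Intercept + ALL three dummies: columns are linearly dependent (the trap)
X_trap = np.column_stack([np.ones(n), dummies.to_numpy()])
# Intercept + two dummies (private = base category): full column rank
X_ok = np.column_stack([np.ones(n), dummies[['govt', 'self']].to_numpy()])

print("trap design: ", np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1], "columns independent")
print("fixed design:", np.linalg.matrix_rank(X_ok), "of", X_ok.shape[1], "columns independent")
```

OLS needs X to have full column rank to invert X′X; the trap design fails that condition for every sample, no matter how large n is.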

Worker Type Comparison

Self-employed workers look like they earn the most — but are the CIs narrow enough to trust that ranking? The bar heights are only half the story.

Regressing y on a set of mutually exclusive indicators (with no other controls) is equivalent to analysis of variance (ANOVA). The coefficients give group means or differences from the base mean. The regression's F-test for joint significance of the indicators is identical to the ANOVA F-statistic — testing whether the categorical variable explains significant variation in y (Key Concept 14.7). Looking at group means with CIs is the visual counterpart.
What you can do here
  • Compare the bar heights across employment types.
  • Check the whiskers (95% CIs) — wider = less precise, narrower = more precise.
  • Look for CI overlap between two categories — overlap usually means the difference isn't statistically significant.
Try This
  1. Identify the highest-mean category. Self-employed is visibly highest. But the CI is by far the widest — small sample (n = 79) and heavy-tailed self-employment income make the point estimate unreliable.
  2. Check if the private and government CIs overlap. Substantial overlap suggests the difference between those two categories may not be statistically significant at 5%.
  3. Note the asymmetry of precision. The narrowest CIs belong to the largest subsamples. A reminder that "highest mean" and "most precisely estimated" are two different things.

Take-away: A set of mutually exclusive indicators is just ANOVA in regression clothing — and the CIs around each group mean tell you which differences can be trusted. Read §14.5 in the chapter →
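The regression-equals-ANOVA claim can also be checked numerically. A sketch on synthetic data, using default homoskedastic OLS since the exact equivalence holds for the non-robust F: regressing y on the category indicators yields an overall F identical to the one-way ANOVA F.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from scipy.stats import f_oneway

rng = np.random.default_rng(4)
n = 150
cat = rng.choice(['govt', 'private', 'self'], size=n)
y = 40_000 + 8_000 * (cat == 'self') + 3_000 * (cat == 'govt') + rng.normal(0, 9_000, n)
df = pd.DataFrame({'y': y, 'cat': cat})

fit = ols('y ~ C(cat)', data=df).fit()               # regression on the indicator set
groups = [g['y'].to_numpy() for _, g in df.groupby('cat')]
f_anova, p_anova = f_oneway(*groups)                 # classical one-way ANOVA

print(f"regression overall F = {fit.fvalue:.6f}")
print(f"one-way ANOVA F      = {f_anova:.6f}")
```

The C() wrapper tells the formula interface to expand the category into C − 1 indicators automatically, sidestepping the dummy variable trap.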

Earnings vs Education by Gender

Two regression lines, one chart. Are they parallel, or do they fan apart? The answer decides whether you need an interaction term.

Scatter plots with separate regression lines by group visually reveal whether slopes and intercepts differ. Parallel lines indicate only an intercept shift — an indicator without interaction is enough. Non-parallel lines indicate differential slopes — an interaction term is needed. Either way, the gender gap is the vertical distance between the lines at a given x, and it's constant only when the lines are parallel.
What you can do here
  • Toggle Both / Male only / Female only to inspect each group.
  • Read male slope, female slope, and the gaps at 12 and 16 years of education in the stat cards.
  • Judge parallelism by eye — if the two lines tilt apart, interaction belongs in the model.
Stat cards: Male slope · Female slope · Gap at Educ=12 · Gap at Educ=16
Try This
  1. Compare the two regression lines by eye. Are they parallel? If not, returns to education differ by gender and an interaction term is mandatory in the model.
  2. Compare the gap at 12 years vs 16 years. If the gap grows with education, men's returns outpace women's — exactly what the interaction term in widget 2 estimates.
  3. Toggle Male only, then Female only. The male subsample has more high-earning outliers. Visual skew is one reason male regressions dominate the pooled slope — the interaction term makes that story explicit.

Take-away: Two lines on one chart tell you in a glance whether you need interactions — and at which education levels the gender gap is widest. Read §14.3 in the chapter →
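In the interacted model, the vertical gap between the lines at education level x is α₁ + α₂·x, so it can be computed at any x. A sketch on synthetic data (the coefficients are illustrative, not the book's estimates):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(5)
n = 400
d = rng.integers(0, 2, n)
x = rng.uniform(8, 20, n)
y = 12_000 + 6_900 * x - 4_000 * d - 2_700 * d * x + rng.normal(0, 8_000, n)
df = pd.DataFrame({'y': y, 'x': x, 'd': d})

fit = ols('y ~ x + d + d:x', data=df).fit()
a1 = fit.params['d']       # intercept shift between groups
a2 = fit.params['d:x']     # slope gap between groups

# The group gap at a given x is a1 + a2*x; constant only when a2 = 0
for educ in (12, 16):
    print(f"gap at x = {educ}: {a1 + a2 * educ:,.0f}")
```

When a2 is negative, the gap is more negative at 16 years than at 12: the lines fan apart exactly as the widget shows.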

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 14 CHEAT SHEET: Regression with Indicator Variables
# =============================================================================

# --- Libraries ---
import pandas as pd                       # data loading and manipulation
import numpy as np                        # numerical operations
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas
from scipy import stats                   # t-tests for group comparisons
from scipy.stats import f_oneway          # one-way ANOVA F-test

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files — 872 full-time workers aged 25-65
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_EARNINGS_COMPLETE.DTA"
data = pd.read_stata(url)

print(f"Dataset: {data.shape[0]} observations, {data.shape[1]} variables")

# =============================================================================
# STEP 2: Descriptive statistics — compare earnings by gender
# =============================================================================
# Indicator variable: gender = 1 (female), gender = 0 (male)
mean_male   = data[data['gender'] == 0]['earnings'].mean()
mean_female = data[data['gender'] == 1]['earnings'].mean()
diff_means  = mean_female - mean_male

print(f"Mean earnings (Male):   ${mean_male:,.2f}")
print(f"Mean earnings (Female): ${mean_female:,.2f}")
print(f"Difference (F - M):     ${diff_means:,.2f}")

# =============================================================================
# STEP 3: Regression on a single indicator — equivalent to difference in means
# =============================================================================
# The intercept = mean for d=0 (males); the gender coefficient = mean difference
# IMPORTANT: .fit(cov_type='HC1') uses robust standard errors
model1 = ols('earnings ~ gender', data=data).fit(cov_type='HC1')

intercept = model1.params['Intercept']    # mean earnings for males
gap       = model1.params['gender']       # earnings gap (females - males)
r2        = model1.rsquared

print(f"\nModel 1: earnings = {intercept:,.0f} + ({gap:,.0f}) × gender")
print(f"Raw gender gap: ${gap:,.0f} (females earn ${abs(gap):,.0f} less)")
print(f"R-squared: {r2:.4f} ({r2*100:.1f}% of variation explained)")

model1.summary()

# =============================================================================
# STEP 4: Add controls and interaction — how the gap changes
# =============================================================================
# Adding education as a control measures the gap AFTER accounting for education
model2 = ols('earnings ~ gender + education', data=data).fit(cov_type='HC1')

# Adding a gender×education interaction allows returns to education to differ by gender
# (construct the interaction column explicitly in case the dataset doesn't ship with it)
data['genderbyeduc'] = data['gender'] * data['education']
model3 = ols('earnings ~ gender + education + genderbyeduc', data=data).fit(cov_type='HC1')

# Full model with additional controls
model4 = ols('earnings ~ gender + education + genderbyeduc + age + hours',
             data=data).fit(cov_type='HC1')

# Compare how the gender coefficient evolves across models
print(f"{'Model':<12} {'Gender Coef':>14} {'R²':>8}")
print("-" * 36)
for name, m in [('Gender only', model1), ('+ Education', model2),
                ('+ Interact', model3), ('+ Age,Hours', model4)]:
    g = m.params['gender']
    print(f"{name:<12} {g:>14,.0f} {m.rsquared:>8.4f}")

# =============================================================================
# STEP 5: Scatter plot — visualize separate regression lines by gender
# =============================================================================
# Non-parallel lines indicate different slopes = interaction term is needed
fig, ax = plt.subplots(figsize=(10, 6))

for g, label, color in [(0, 'Male', 'tab:blue'), (1, 'Female', 'tab:red')]:
    subset = data[data['gender'] == g]
    ax.scatter(subset['education'], subset['earnings'], alpha=0.3, s=25,
               label=label, color=color)
    # Fit and plot regression line for each group
    z = np.polyfit(subset['education'], subset['earnings'], 1)
    edu_range = np.linspace(subset['education'].min(), subset['education'].max(), 100)
    ax.plot(edu_range, np.poly1d(z)(edu_range), linewidth=2, color=color,
            label=f'{label} slope: ${z[0]:,.0f}/yr')

ax.set_xlabel('Years of Education')
ax.set_ylabel('Earnings ($)')
ax.set_title('Earnings vs Education by Gender (non-parallel = interaction needed)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Sets of indicators — worker type and the dummy variable trap
# =============================================================================
# Three mutually exclusive categories: dself, dprivate, dgovt (sum to 1)
# Drop one (dprivate = base) to avoid perfect multicollinearity
model_worker = ols('earnings ~ dself + dgovt + education + age',
                   data=data).fit(cov_type='HC1')

print("Base category: Private sector")
print(f"Self-employed vs Private: ${model_worker.params['dself']:,.0f}")
print(f"Government vs Private:    ${model_worker.params['dgovt']:,.0f}")
print(f"R-squared: {model_worker.rsquared:.4f}")

model_worker.summary()

# =============================================================================
# STEP 7: ANOVA — test if earnings differ across worker types
# =============================================================================
# Regression on mutually exclusive indicators = analysis of variance (ANOVA)
group_self = data[data['dself'] == 1]['earnings']
group_priv = data[data['dprivate'] == 1]['earnings']
group_govt = data[data['dgovt'] == 1]['earnings']

f_stat, p_value = f_oneway(group_self, group_priv, group_govt)
print(f"\nANOVA F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")

# Group means with counts
data['worker_type'] = np.where(data['dself'] == 1, 'Self-employed',
                      np.where(data['dprivate'] == 1, 'Private', 'Government'))
print(data.groupby('worker_type')['earnings'].agg(['mean', 'count']).round(0))
Open empty Colab notebook →