Chapter 06 of 18 · Interactive Dashboard

The Least Squares Estimator

Slide, toggle, and simulate to build intuition for OLS properties — unbiasedness, sampling variability, standard errors, and efficiency.

Population vs. Sample Regression

Which line is the real one — the one we drew through the data, or the invisible one that generated it? In econometrics the answer matters: one is what we see, the other is what we want to know.

The population line is fixed and unknown; the sample line estimates it from data. The population regression E[y|x] = β₁ + β₂x describes the true relationship with unknown parameters β₁ and β₂. The sample regression ŷ = b₁ + b₂x estimates it from a limited sample. The error u = y − E[y|x] is the deviation from the unknown population line and is unobservable; the residual e = y − ŷ is the deviation from the fitted sample line and is observable. We use residuals to learn about errors.
What you can do here
  • Slide sample size n — small n means noisy sample lines; large n pulls the sample line close to the population line.
  • Toggle Errors / Residuals / Both — compare deviations from the true line (errors) vs. deviations from the fitted line (residuals).
  • Click Resimulate — draw a fresh random sample and watch how much b₁ and b₂ move around the true values.
True β₁
1.0000
True β₂
2.0000
Sample b₁
Sample b₂
Sampling error
Try this
  1. Toggle to "Both" at n = 30. Purple segments (errors) reach to the true line; pink segments (residuals) reach to the fitted line. Residuals are shorter on average — OLS minimizes squared residuals, not squared errors.
  2. Set n = 5 and Resimulate several times. The sample line swings wildly; now switch to n = 100 and resimulate. With more data the sample line barely moves — that is sampling variability shrinking.
  3. Show "Errors (u)". Some deviations are positive, some negative, and their average is close to zero. That is the E[u|x] = 0 assumption in action.

Take-away: Every sample gives a different ŷ, but across many draws the sample slopes average to the true β₂ — that is unbiasedness. Read §6.1 in the chapter →

OLS Assumptions Diagnostic

OLS is trustworthy only when four background assumptions hold. Which ones bend the line, and which only bend the standard errors?

Four assumptions separate a trustworthy OLS fit from a misleading one. (1) Correct specification y = β₁ + β₂x + u; (2) mean-zero errors E[u|x] = 0; (3) homoskedasticity Var[u|x] = σ²ᵤ; (4) independence of errors across observations. Assumptions 1–2 are essential for unbiasedness; 3–4 affect variance and can be relaxed with robust standard errors.
What you can do here
  • Pick an assumption to violate — choose linearity, mean-zero, homoskedasticity, or independence to see its effect in isolation.
  • Slide severity — turn the violation up from mild to extreme and watch the estimated slope drift or the scatter fan out.
  • Click Resimulate — redraw the sample under the chosen violation and see how consistent the distortion is.
True β₂
2.0000
Sample b₂
Bias (b₂ − β₂)
se(b₂)
Try this
  1. Select "Mean-Zero" and push severity to 100%. The sample line consistently misses the true line. This is omitted-variable bias — the most dangerous violation because it shifts b₂ systematically away from β₂.
  2. Select "Homosk." at 100%. The fitted line still tracks the true line well (b₂ ≈ 2), but the scatter fans out. The slope is fine; only the reported se(b₂) becomes unreliable.
  3. Set "All Correct" and Resimulate several times. b₂ varies randomly around 2.0 with no drift. Random around the truth, never systematic — that is the hallmark of an unbiased estimator.

Take-away: Assumptions 1–2 decide whether b₂ is centered on β₂; assumptions 3–4 decide how precise the SE reports are. Read §6.3 in the chapter →

Monte Carlo Unbiasedness Simulator

A single sample gives one estimate. Run 1,000 samples and what do they average to — and what shape do they form?

Across many samples, OLS estimates cluster around the true parameter and trace out a bell curve. Monte Carlo simulation demonstrates unbiasedness: the average of many OLS estimates equals the true parameter. The distribution of estimates is approximately normal (CLT), with spread measured by the standard error. This validates the theoretical properties of OLS in practice.
What you can do here
  • Slide the number of simulations — more draws produce a smoother histogram of b₂ or b₁.
  • Slide sample size n — larger n concentrates the histogram around the true value.
  • Toggle Slope vs. Intercept — compare the sampling distributions of b₂ and b₁.
  • Click Resimulate — regenerate all samples with a fresh random seed.
True value
2.0000
Mean of estimates
SD of estimates
Theoretical SE
|Bias|
Try this
  1. At n = 30 the SD of b₂ ≈ 0.38. Set n = 120 (4× larger). The SD drops to ≈ 0.19. Four times the data halves the spread — that is the √n rule.
  2. Switch to "Intercept (b₁)". The distribution is wider (SD ≈ 1.2) than the slope's. The intercept is harder to estimate because it extrapolates to x = 0, far from the data center.
  3. Set n = 10 and Resimulate; then set n = 100. The small-n histogram is lumpy; the large-n bell is smooth every time. The CLT becomes visible only when n is big enough.

Take-away: The mean of many b₂ estimates is β₂ (unbiased); the spread across samples is what the standard error measures. Read §6.3 in the chapter →

Standard Error Anatomy

Why are some slope estimates precise and others fuzzy? Three things decide — and you can turn each dial yourself.

The standard error of b₂ shrinks when noise is low, sample size is large, and x is spread wide. se(b₂) = sₑ / √[Σ(xᵢ − x̄)²]. We divide by (n − 2), not n, because estimating 2 parameters (b₁ and b₂) uses 2 degrees of freedom. Precision is better (smaller SE) when: (1) the model fits well (small sₑ), (2) sample size is large, (3) regressors are widely scattered.
What you can do here
  • Slide error σᵤ — smaller noise means points hug the line and the SE shrinks.
  • Slide sample size n — more data raises the degrees of freedom and tightens the SE at a √n rate.
  • Slide x-spread σₓ — wider x values expand SSx and make the slope more precisely identified.
  • Click Resimulate — refresh the noise draws to see the SE as a random-sample quantity.
sₑ (Root MSE)
df = n − 2
SSx = Σ(xᵢ−x̄)²
se(b₂)
Formula: se(b₂) =
Try this
  1. Set σᵤ = 0.5, then drag to σᵤ = 5. Points go from hugging the line to an exploding cloud, and se(b₂) grows proportionally. Model fit is the first lever of precision.
  2. Keep σᵤ = 2, set n = 5 (df = 3), then increase to n = 80. The SE falls, but each extra observation helps less than the last. Sample size helps only at the √n rate — diminishing returns.
  3. Set σₓ = 0.3, then drag to σₓ = 3. A tight cluster of x gives a poorly identified slope; wide spread gives a precise one. This is exactly why experimental design prefers widely varied regressors.

Take-away: Precision improves with a better-fitting model, more observations, and wider variation in x — but sample size helps only at the rate of √n. Read §6.4 in the chapter →

Gauss-Markov: OLS vs. Alternatives

Why do we use OLS instead of some other line-fitter? Because under the four assumptions, nothing linear can beat it.

Under the four assumptions, OLS is the Best Linear Unbiased Estimator (BLUE). Among all linear unbiased estimators, OLS has the smallest variance. If errors are also normally distributed, OLS is the Best Unbiased Estimator (not just among linear ones). This optimality is what justifies the widespread use of OLS.
What you can do here
  • Toggle Assumptions Met / Heteroskedastic — check whether OLS still wins when assumption 3 is broken.
  • Slide the number of simulations — more draws give cleaner, easier-to-compare sampling distributions.
  • Click Resimulate — redraw all samples and see whether the ranking of estimators is stable.
OLS SD
Alt-Linear SD
Median-Slope SD
Efficiency (Alt/OLS)
Try this
  1. Under "Assumptions Met", compare the three histograms. OLS (cyan) is the tightest; the efficiency ratio (Alt SD / OLS SD) sits well above 1.0. That is BLUE made visible.
  2. Switch to "Heteroskedastic". OLS is still centered on 2.0 (unbiased), but the median-slope estimator now has comparable spread. Break homoskedasticity and OLS loses its efficiency edge.
  3. Increase simulations to 2000. Each histogram smooths out and the ranking of variances becomes unambiguous. Monte Carlo convergence makes the theorem's claim concrete.

Take-away: When all four assumptions hold, OLS is tightest of the three; break homoskedasticity and its advantage fades. Read §6.3 in the chapter →

Real-Data Sampling Variability

Sampling variability isn't just a simulation curiosity — it happens every time a researcher works with a subset of real data. How far does one sample of 50 countries stray from all 108?

Even with real economic data, each sample tells a slightly different story — sampling error is always present. Treating the full dataset of 108 countries as the "population," drawing a sample of 50 simulates the real-world situation of working with incomplete data. The difference between the population coefficient β₂ and the sample coefficient b₂ is the sampling error — random, sometimes positive, sometimes negative, but on average zero.
What you can do here
  • Slide subsample size — from n = 10 to n = 90 of the 108 countries, and watch how closely the sample line tracks the population line.
  • Slide MC draws — more repeated subsamples yield a cleaner histogram of b₂.
  • Toggle Display — view the sample highlighted inside the full scatter, or the sample alone.
  • Click Resimulate — draw a fresh subsample of countries and compare.
Pop β₂
Sample b₂
Sampling error
MC Mean(b₂)
MC SD(b₂)
Try this
  1. Set n = 20 and Resimulate several times. The pink sample line swings widely around the purple population line. With only 20 of 108 countries, uncertainty in the slope is substantial.
  2. Increase n to 80. The sample line barely departs from the population line, and the MC SD drops by roughly √4 = 2×. The 1/√n rule holds with real economic data too.
  3. Toggle to "Sample Only" at n = 30. Different subsets of countries produce noticeably different slopes. Which countries end up in your sample matters — that is sampling variability in real research.

Take-away: The full 108-country slope is our benchmark; any single subsample wiggles around it with uncertainty that shrinks as n grows. Read the case study in the chapter →

Standard Errors and Sample Size

If I double my sample, does my standard error halve? Not quite — the tyranny of the square root says you need four times the data to halve it.

Standard errors fall with the square root of n — so precision has diminishing returns. se(b₂) = σᵤ / √[n × Var(x)]. To halve the standard error you need four times the sample size. Going from n = 100 to n = 400 yields the same precision gain as going from n = 25 to n = 100. This diminishing-returns relationship is a fundamental constraint of statistical estimation.
What you can do here
  • Slide error σᵤ — noisier data shifts the whole SE-vs-n curve upward.
  • Slide x-spread σₓ — wider regressors press the curve downward.
  • Toggle Generated DGP / Convergence Data — compare a textbook DGP with the real 108-country dataset.
  • Click Resimulate — redraw the Monte Carlo points along the theoretical curve.
SE at n=25
SE at n=100
Ratio (n=25/n=100)
Theory predicts
2.00×
Try this
  1. Read the SE at n = 25 and at n = 100. The ratio should be close to 2.0. That is the 1/√n law predicting that quadrupling n halves the SE.
  2. Increase σᵤ from 2 to 4. The entire SE-vs-n curve shifts upward at every sample size. Noisier data means a larger SE no matter how big your sample is.
  3. Switch to "Convergence Data". The theoretical curve no longer fits perfectly — real data is not a clean normal DGP. The √n decay still holds approximately, which is why the rule travels from theory to practice.

Take-away: Going from n = 25 to n = 100 halves the SE; going from n = 100 to n = 200 only saves you another ~29%. Read the case study in the chapter →

Code Summary

You've explored the key concepts interactively — now reproduce them in code. These self-contained blocks cover everything you practiced above. Pick your language, copy the code, and run it.

# =============================================================================
# CHAPTER 6 CHEAT SHEET: The Least Squares Estimator
# =============================================================================

# --- Libraries ---
import numpy as np                        # random sampling and numerical operations
import pandas as pd                       # data manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
import pyfixest as pf                     # OLS regression with R-style formulas
# !pip install pyfixest                   # uncomment if running in Google Colab

# =============================================================================
# STEP 1: Define the Data-Generating Process (DGP)
# =============================================================================
# The DGP specifies the TRUE population relationship: y = b1 + b2x + u
# We know the true parameters — in real research, we never do!
beta_1_true = 1       # true intercept
beta_2_true = 2       # true slope
sigma_u     = 2       # error standard deviation

# Generate one sample of n observations
np.random.seed(42)
n = 30
x = np.random.normal(3, 1, n)
u = np.random.normal(0, sigma_u, n)
y = beta_1_true + beta_2_true * x + u

data = pd.DataFrame({'x': x, 'y': y})
print(f"Generated sample: {n} observations from y = {beta_1_true} + {beta_2_true}x + u")

# =============================================================================
# STEP 2: Fit OLS and compare sample vs. population parameters
# =============================================================================
# The sample regression estimates the unknown population line from data
fit = pf.feols('y ~ x', data=data)

b1 = fit.coef()['Intercept']
b2 = fit.coef()['x']

print(f"\nPopulation:  E[y|x] = {beta_1_true} + {beta_2_true}x")
print(f"Sample:      y_hat = {b1:.2f} + {b2:.2f}x")
print(f"Sampling error in slope: b2 - b2_true = {b2 - beta_2_true:.4f}")

# Full regression table (coefficients, std errors, t-stats, p-values, R2)
fit.summary()

# =============================================================================
# STEP 3: Scatter plot — population line vs. sample line
# =============================================================================
# Visualizing the gap between the true line and our estimate
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data['x'], data['y'], s=50, alpha=0.7, label='Observed data')
ax.plot(data['x'], fit.predict(), color='red', linewidth=2,
        label=f'Sample: y_hat = {b1:.2f} + {b2:.2f}x')
x_range = np.linspace(data['x'].min(), data['x'].max(), 100)
ax.plot(x_range, beta_1_true + beta_2_true * x_range,
        color='green', linewidth=2, linestyle='--',
        label=f'Population: E[y|x] = {beta_1_true} + {beta_2_true}x')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Population Regression vs. Sample Regression')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Monte Carlo simulation — demonstrate unbiasedness
# =============================================================================
# Draw many samples from the SAME DGP to see how b2 varies
# Unbiasedness: on average, b2 equals the true b2
n_simulations = 1000
b2_estimates = []

for i in range(n_simulations):
    x_sim = np.random.normal(3, 1, n)
    u_sim = np.random.normal(0, sigma_u, n)
    y_sim = beta_1_true + beta_2_true * x_sim + u_sim
    df_sim = pd.DataFrame({'x': x_sim, 'y': y_sim})
    m = pf.feols('y ~ x', data=df_sim)
    b2_estimates.append(m.coef()['x'])

print(f"\nMonte Carlo results ({n_simulations} simulations, n={n} each):")
print(f"  True b2:              {beta_2_true}")
print(f"  Mean of b2 estimates: {np.mean(b2_estimates):.4f}  (approx b2, confirming unbiasedness)")
print(f"  Std dev of estimates:  {np.std(b2_estimates):.4f}  (empirical standard error)")

# =============================================================================
# STEP 5: Visualize the sampling distribution of b2
# =============================================================================
# The histogram should be centered on b2 (unbiasedness) and bell-shaped (CLT)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(b2_estimates, bins=40, density=True, alpha=0.7, edgecolor='white',
        label=f'{n_simulations} estimates of b2')
ax.axvline(beta_2_true, color='green', linewidth=2, linestyle='--',
           label=f'True b2 = {beta_2_true}')
ax.axvline(np.mean(b2_estimates), color='red', linewidth=2,
           label=f'Mean of estimates = {np.mean(b2_estimates):.4f}')
ax.set_xlabel('Slope estimate (b2)')
ax.set_ylabel('Density')
ax.set_title('Sampling Distribution of b2: Unbiasedness + CLT')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Standard error — what controls precision?
# =============================================================================
# se(b2) = s_e / sqrt[sum((xi - x_bar)^2)]
# Smaller when: (1) model fits well, (2) large n, (3) x spread wide
se_b2       = fit.se()['x']                         # from regression output
s_e         = np.sqrt(np.sum(fit._u_hat**2) / (n - 2))  # standard error of regression
x_variation = np.sum((data['x'] - data['x'].mean())**2)

print(f"\nStandard error anatomy (from the single-sample regression):")
print(f"  s_e (root MSE):          {s_e:.4f}")
print(f"  sum((xi - x_bar)^2):     {x_variation:.4f}")
print(f"  se(b2) = s_e / sqrt(sum) = {s_e / np.sqrt(x_variation):.4f}")
print(f"  se(b2) from output:      {se_b2:.4f}")

# =============================================================================
# STEP 7: Effect of sample size on precision
# =============================================================================
# Theory: se(b2) proportional to 1/sqrt(n) — doubling n cuts SE by ~30%, quadrupling halves it
sample_sizes = [20, 50, 100, 200]

print(f"\n{'n':>6}  {'Mean b2':>10}  {'Std dev (empirical SE)':>22}")
print("-" * 42)
for ns in sample_sizes:
    estimates = []
    for _ in range(1000):
        xs = np.random.normal(3, 1, ns)
        us = np.random.normal(0, sigma_u, ns)
        ys = beta_1_true + beta_2_true * xs + us
        m = pf.feols('y ~ x', data=pd.DataFrame({'x': xs, 'y': ys}))
        estimates.append(m.coef()['x'])
    print(f"{ns:>6}  {np.mean(estimates):>10.4f}  {np.std(estimates):>22.4f}")
Open empty Colab notebook →
* =============================================================================
* CHAPTER 6 CHEAT SHEET: The Least Squares Estimator
* =============================================================================

* --- Setup ---
clear all                                // start with a clean workspace
set more off                             // do not pause output for long results

* =============================================================================
* STEP 1: Define the Data-Generating Process (DGP)
* =============================================================================
* The DGP specifies the TRUE population relationship: y = b1 + b2*x + u
* We know the true parameters — in real research, we never do!
local beta_1_true = 1                    // true intercept
local beta_2_true = 2                    // true slope
local sigma_u     = 2                    // error standard deviation

* Generate one sample of n observations
set seed 42
set obs 30
gen x = rnormal(3, 1)
gen u = rnormal(0, `sigma_u')
gen y = `beta_1_true' + `beta_2_true' * x + u

display "Generated sample: " _N " observations from y = `beta_1_true' + `beta_2_true'x + u"

* =============================================================================
* STEP 2: Fit OLS and compare sample vs. population parameters
* =============================================================================
* The sample regression estimates the unknown population line from data
regress y x

// Display comparison of population vs sample parameters
display _newline "Population:  E[y|x] = `beta_1_true' + `beta_2_true'x"
display "Sample:      y_hat = " %6.2f _b[_cons] " + " %6.2f _b[x] "x"
display "Sampling error in slope: b2 - b2_true = " %8.4f (_b[x] - `beta_2_true')

* =============================================================================
* STEP 3: Scatter plot — population line vs. sample line
* =============================================================================
* Visualizing the gap between the true line and our estimate
predict yhat                             // fitted values from the regression

// Create the population (true) line for comparison
gen y_true = `beta_1_true' + `beta_2_true' * x

twoway (scatter y x, msize(small) msymbol(circle))                          ///
       (line yhat x, sort lcolor(red) lwidth(medthick)                      ///
            legend(label(2 "Sample: y_hat")))                               ///
       (line y_true x, sort lcolor(green) lwidth(medthick) lpattern(dash)   ///
            legend(label(3 "Population: E[y|x]"))),                         ///
    xtitle("x") ytitle("y")                                                 ///
    title("Population Regression vs. Sample Regression")                    ///
    legend(order(1 "Observed data" 2 "Sample line" 3 "Population line"))

* =============================================================================
* STEP 4: Monte Carlo simulation — demonstrate unbiasedness
* =============================================================================
* Draw many samples from the SAME DGP to see how b2 varies
* Unbiasedness: on average, b2 equals the true b2
local n_sims = 1000
local n_obs  = 30

// Program to run a single simulation
capture program drop one_sim
program define one_sim, rclass
    args beta1 beta2 sigma n
    drop _all
    set obs `n'
    gen x = rnormal(3, 1)
    gen u = rnormal(0, `sigma')
    gen y = `beta1' + `beta2' * x + u
    regress y x
    return scalar b2 = _b[x]
end

// Run the simulation
simulate b2 = r(b2), reps(`n_sims') seed(42): ///
    one_sim `beta_1_true' `beta_2_true' `sigma_u' `n_obs'

summarize b2
display _newline "Monte Carlo results (`n_sims' simulations, n=`n_obs' each):"
display "  True b2:              `beta_2_true'"
display "  Mean of b2 estimates: " %8.4f r(mean) "  (approx b2, confirming unbiasedness)"
display "  Std dev of estimates: " %8.4f r(sd) "  (empirical standard error)"

* =============================================================================
* STEP 5: Visualize the sampling distribution of b2
* =============================================================================
* The histogram should be centered on b2 (unbiasedness) and bell-shaped (CLT)
histogram b2, kdensity                                                      ///
    xline(`beta_2_true', lcolor(green) lwidth(medthick) lpattern(dash))     ///
    xtitle("Slope estimate (b2)")                                           ///
    ytitle("Density")                                                       ///
    title("Sampling Distribution of b2: Unbiasedness + CLT")               ///
    note("Green line = true b2 = `beta_2_true'")

* =============================================================================
* STEP 6: Standard error — what controls precision?
* =============================================================================
* se(b2) = s_e / sqrt[sum((xi - x_bar)^2)]
* Smaller when: (1) model fits well, (2) large n, (3) x spread wide

// Reload original sample for SE anatomy
clear all
set seed 42
set obs 30
gen x = rnormal(3, 1)
gen u = rnormal(0, 2)
gen y = 1 + 2 * x + u

regress y x
local se_b2 = _se[x]                    // standard error of slope
local s_e   = e(rmse)                   // root MSE (std error of regression)

// Compute sum of squared deviations of x
summarize x
local x_var = r(Var) * (r(N) - 1)       // sum((xi - x_bar)^2)

display _newline "Standard error anatomy (from the single-sample regression):"
display "  s_e (root MSE):          " %8.4f `s_e'
display "  sum((xi - x_bar)^2):     " %8.4f `x_var'
display "  se(b2) = s_e / sqrt(sum) = " %8.4f (`s_e' / sqrt(`x_var'))
display "  se(b2) from output:      " %8.4f `se_b2'

* =============================================================================
* STEP 7: Effect of sample size on precision
* =============================================================================
* Theory: se(b2) proportional to 1/sqrt(n)
* Doubling n cuts SE by ~30%, quadrupling halves it

display _newline "    n      Mean b2    Std dev (empirical SE)"
display "==========================================

foreach ns in 20 50 100 200 {
    capture program drop sim_n
    program define sim_n, rclass
        args beta1 beta2 sigma n
        drop _all
        set obs `n'
        gen x = rnormal(3, 1)
        gen u = rnormal(0, `sigma')
        gen y = `beta1' + `beta2' * x + u
        regress y x
        return scalar b2 = _b[x]
    end

    quietly simulate b2 = r(b2), reps(1000) seed(42): ///
        sim_n 1 2 2 `ns'
    quietly summarize b2
    display %6.0f `ns' "  " %10.4f r(mean) "  " %22.4f r(sd)
}
Paste into your Stata do-file editor
# =============================================================================
# CHAPTER 6 CHEAT SHEET: The Least Squares Estimator
# =============================================================================

# --- Libraries ---
library(fixest)          # fast OLS estimation with feols()
library(ggplot2)         # grammar of graphics for all plots
library(dplyr)           # data manipulation

# =============================================================================
# STEP 1: Define the Data-Generating Process (DGP)
# =============================================================================
# The DGP specifies the TRUE population relationship: y = b1 + b2*x + u
# We know the true parameters — in real research, we never do!
beta_1_true <- 1         # true intercept
beta_2_true <- 2         # true slope
sigma_u     <- 2         # error standard deviation

# Generate one sample of n observations
set.seed(42)
n <- 30
x <- rnorm(n, mean = 3, sd = 1)
u <- rnorm(n, mean = 0, sd = sigma_u)
y <- beta_1_true + beta_2_true * x + u

data <- data.frame(x = x, y = y)
cat("Generated sample:", n, "observations from y =", beta_1_true, "+", beta_2_true, "x + u\n")

# =============================================================================
# STEP 2: Fit OLS and compare sample vs. population parameters
# =============================================================================
model <- feols(y ~ x, data = data)
summary(model)

b1 <- coef(model)["(Intercept)"]
b2 <- coef(model)["x"]

cat("\nPopulation:  E[y|x] =", beta_1_true, "+", beta_2_true, "x\n")
cat("Sample:      y_hat =", round(b1, 2), "+", round(b2, 2), "x\n")
cat("Sampling error in slope: b2 - b2_true =", round(b2 - beta_2_true, 4), "\n")

# =============================================================================
# STEP 3: Scatter plot — population line vs. sample line
# =============================================================================
ggplot(data, aes(x = x, y = y)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "red", linewidth = 1.2) +
  geom_abline(intercept = beta_1_true, slope = beta_2_true,
              color = "green", linewidth = 1.2, linetype = "dashed") +
  labs(x = "x", y = "y",
       title = "Population Regression vs. Sample Regression") +
  theme_minimal()

# =============================================================================
# STEP 4: Monte Carlo simulation — demonstrate unbiasedness
# =============================================================================
# Draw many samples from the SAME DGP to see how b2 varies
set.seed(42)
n_sims <- 1000

# replicate() runs the simulation n_sims times
b2_estimates <- replicate(n_sims, {
  x_sim <- rnorm(n, mean = 3, sd = 1)
  u_sim <- rnorm(n, mean = 0, sd = sigma_u)
  y_sim <- beta_1_true + beta_2_true * x_sim + u_sim
  coef(feols(y_sim ~ x_sim))["x_sim"]
})

cat("\nMonte Carlo results (", n_sims, "simulations, n =", n, "each):\n")
cat("  True b2:             ", beta_2_true, "\n")
cat("  Mean of b2 estimates:", round(mean(b2_estimates), 4),
    " (confirming unbiasedness)\n")
cat("  Std dev of estimates:", round(sd(b2_estimates), 4),
    " (empirical standard error)\n")

# =============================================================================
# STEP 5: Visualize the sampling distribution of b2
# =============================================================================
ggplot(data.frame(b2 = b2_estimates), aes(x = b2)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.7) +
  geom_vline(xintercept = beta_2_true, color = "green",
             linewidth = 1.5, linetype = "dashed") +
  geom_vline(xintercept = mean(b2_estimates), color = "red", linewidth = 1.2) +
  labs(x = "Slope estimate (b2)", y = "Frequency",
       title = "Sampling Distribution of b2: Unbiasedness + CLT") +
  theme_minimal()

# =============================================================================
# STEP 6: Standard error — what controls precision?
# =============================================================================
# se(b2) = s_e / sqrt[sum((xi - x_bar)^2)]
se_b2       <- se(model)["x"]
s_e         <- sigma(model)               # residual standard error (root MSE)
x_variation <- sum((data$x - mean(data$x))^2)

cat("\nStandard error anatomy:\n")
cat("  s_e (root MSE):         ", round(s_e, 4), "\n")
cat("  sum((xi - x_bar)^2):    ", round(x_variation, 4), "\n")
cat("  se(b2) = s_e / sqrt(sum):", round(s_e / sqrt(x_variation), 4), "\n")
cat("  se(b2) from output:     ", round(se_b2, 4), "\n")

# =============================================================================
# STEP 7: Effect of sample size on precision
# =============================================================================
# Theory: se(b2) proportional to 1/sqrt(n)
cat("\n    n    Mean b2    Std dev (empirical SE)\n")

for (ns in c(20, 50, 100, 200)) {
  ests <- replicate(1000, {
    xs <- rnorm(ns, 3, 1); us <- rnorm(ns, 0, sigma_u)
    ys <- beta_1_true + beta_2_true * xs + us
    coef(feols(ys ~ xs))["xs"]
  })
  cat(sprintf("%5d  %10.4f  %22.4f\n", ns, mean(ests), sd(ests)))
}
Paste into your R console or RStudio