Chapter 06 of 18 · Interactive Dashboard

The Least Squares Estimator

Slide, toggle, and simulate to build intuition for OLS properties — unbiasedness, sampling variability, standard errors, and efficiency.

Population vs. Sample Regression

Which line is the real one — the one we drew through the data, or the invisible one that generated it? In econometrics the answer matters: one is what we see, the other is what we want to know.

The population line is fixed and unknown; the sample line estimates it from data. The population regression E[y|x] = β₁ + β₂x describes the true relationship with unknown parameters β₁ and β₂. The sample regression ŷ = b₁ + b₂x estimates it from a limited sample. The error u = y − E[y|x] is the deviation from the unknown population line and is unobservable; the residual e = y − ŷ is the deviation from the fitted sample line and is observable. We use residuals to learn about errors.
What you can do here
  • Slide sample size n — small n means noisy sample lines; large n pulls the sample line close to the population line.
  • Toggle Errors / Residuals / Both — compare deviations from the true line (errors) vs. deviations from the fitted line (residuals).
  • Click Resimulate — draw a fresh random sample and watch how much b₁ and b₂ move around the true values.
True β₁
1.0000
True β₂
2.0000
Sample b₁
—
Sample b₂
—
Sampling error
—
R²
—
Try this
  1. Toggle to "Both" at n = 30. Purple segments (errors) reach to the true line; pink segments (residuals) reach to the fitted line. Residuals are shorter on average — OLS minimizes squared residuals, not squared errors.
  2. Set n = 5 and Resimulate several times. The sample line swings wildly; now switch to n = 100 and resimulate. With more data the sample line barely moves — that is sampling variability shrinking.
  3. Show "Errors (u)". Some deviations are positive, some negative, and their average is close to zero. That is the E[u|x] = 0 assumption in action.

Take-away: Every sample gives a different ŷ, but across many draws the sample slopes average to the true β₂ — that is unbiasedness. Read §6.1 in the chapter →
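The error/residual distinction above can be checked numerically. A minimal sketch, assuming the chapter's DGP values (β₁ = 1, β₂ = 2, σᵤ = 2); the variable names are illustrative, not from the dashboard:

```python
# Errors vs. residuals under the chapter's DGP (assumed values: 1, 2, sigma_u = 2)
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(3, 1, n)
u = rng.normal(0, 2, n)          # errors: deviations from the TRUE line (unobservable in practice)
y = 1 + 2 * x + u

# OLS slope and intercept by hand
b2 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b1 = y.mean() - b2 * x.mean()
e = y - (b1 + b2 * x)            # residuals: deviations from the FITTED line (observable)

print(f"mean error    = {u.mean():+.3f}")
print(f"mean residual = {e.mean():+.3f}")
print(f"sum e^2 = {np.sum(e**2):.2f} <= sum u^2 = {np.sum(u**2):.2f}")
```

The last line is guaranteed: OLS minimizes the sum of squared residuals over all lines, and the true line is one of the candidates, so the residuals' sum of squares can never exceed the errors'.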

OLS Assumptions Diagnostic

OLS is trustworthy only when four background assumptions hold. Which ones bend the line, and which only bend the standard errors?

Four assumptions separate a trustworthy OLS fit from a misleading one. (1) Correct specification y = β₁ + β₂x + u; (2) mean-zero errors E[u|x] = 0; (3) homoskedasticity Var[u|x] = σ²ᵤ; (4) independence of errors across observations. Assumptions 1–2 are essential for unbiasedness; 3–4 affect variance and can be relaxed with robust standard errors.
What you can do here
  • Pick an assumption to violate — choose linearity, mean-zero, homoskedasticity, or independence to see its effect in isolation.
  • Slide severity — turn the violation up from mild to extreme and watch the estimated slope drift or the scatter fan out.
  • Click Resimulate — redraw the sample under the chosen violation and see how consistent the distortion is.
True β₂
2.0000
Sample b₂
—
Bias (b₂ − β₂)
—
se(b₂)
—
Try this
  1. Select "Mean-Zero" and push severity to 100%. The sample line consistently misses the true line. This is omitted-variable bias — the most dangerous violation because it shifts b₂ systematically away from β₂.
  2. Select "Homosk." at 100%. The fitted line still tracks the true line well (b₂ ≈ 2), but the scatter fans out. The slope is fine; only the reported se(b₂) becomes unreliable.
  3. Set "All Correct" and Resimulate several times. b₂ varies randomly around 2.0 with no drift. Random around the truth, never systematic — that is the hallmark of an unbiased estimator.

Take-away: Assumptions 1–2 decide whether b₂ is centered on β₂; assumptions 3–4 decide how precise the SE reports are. Read §6.3 in the chapter →
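The "Mean-Zero" violation can be reproduced in a few lines. This sketch assumes a hypothetical confounder z (coefficient 1.5, correlated with x) that is omitted from the fitted model; those numbers are illustrative, not taken from the dashboard:

```python
# Omitted-variable bias: E[u|x] != 0 shifts b2 systematically away from beta_2 = 2.
# The confounder z and its coefficient 1.5 are made-up illustration values.
import numpy as np

rng = np.random.default_rng(1)
slopes = []
for _ in range(500):
    n = 100
    z = rng.normal(0, 1, n)                          # unobserved confounder
    x = 3 + z + rng.normal(0, 1, n)                  # x is correlated with z
    y = 1 + 2 * x + 1.5 * z + rng.normal(0, 2, n)    # z belongs in the model...
    b2 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # ...but we regress y on x alone
    slopes.append(b2)

print(f"mean b2 over 500 samples: {np.mean(slopes):.3f}  (true beta_2 = 2)")
```

The gap does not shrink as n grows: this is bias, not sampling noise, which is exactly why the dashboard calls it the most dangerous violation.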

Monte Carlo Unbiasedness Simulator

A single sample gives one estimate. Run 1,000 samples and what do they average to — and what shape do they form?

Across many samples, OLS estimates cluster around the true parameter and trace out a bell curve. Monte Carlo simulation demonstrates unbiasedness: the average of many OLS estimates equals the true parameter. The distribution of estimates is approximately normal (CLT), with spread measured by the standard error. This validates the theoretical properties of OLS in practice.
What you can do here
  • Slide the number of simulations — more draws produce a smoother histogram of b₂ or b₁.
  • Slide sample size n — larger n concentrates the histogram around the true value.
  • Toggle Slope vs. Intercept — compare the sampling distributions of b₂ and b₁.
  • Click Resimulate — regenerate all samples with a fresh random seed.
True value
2.0000
Mean of estimates
—
SD of estimates
—
Theoretical SE
—
|Bias|
—
Try this
  1. At n = 30 the SD of b₂ ≈ 0.38. Set n = 120 (4× larger). The SD drops to ≈ 0.19. Four times the data halves the spread — that is the √n rule.
  2. Switch to "Intercept (b₁)". The distribution is wider (SD ≈ 1.2) than the slope's. The intercept is harder to estimate because it extrapolates to x = 0, far from the data center.
  3. Set n = 10 and Resimulate; then set n = 100. The small-n histogram is lumpy; the large-n bell is smooth every time. The CLT becomes visible only when n is big enough.

Take-away: The mean of many b₂ estimates is β₂ (unbiased); the spread across samples is what the standard error measures. Read §6.3 in the chapter →

Standard Error Anatomy

Why are some slope estimates precise and others fuzzy? Three things decide — and you can turn each dial yourself.

The standard error of b₂ shrinks when noise is low, sample size is large, and x is spread wide. se(b₂) = sₑ / √[Σ(xᵢ − x̄)²]. We divide by (n − 2), not n, because estimating 2 parameters (b₁ and b₂) uses 2 degrees of freedom. Precision is better (smaller SE) when: (1) the model fits well (small sₑ), (2) sample size is large, (3) regressors are widely scattered.
What you can do here
  • Slide error σᵤ — smaller noise means points hug the line and the SE shrinks.
  • Slide sample size n — more data raises the degrees of freedom and tightens the SE at a √n rate.
  • Slide x-spread σₓ — wider x values expand SSx and make the slope more precisely identified.
  • Click Resimulate — refresh the noise draws to see the SE as a random-sample quantity.
sₑ (Root MSE)
—
df = n − 2
—
SSx = Σ(xᵢ − x̄)²
—
se(b₂)
—
R²
—
Formula: se(b₂) = —
Try this
  1. Set σᵤ = 0.5, then drag to σᵤ = 5. Points go from hugging the line to an exploding cloud, and se(b₂) grows proportionally. Model fit is the first lever of precision.
  2. Keep σᵤ = 2, set n = 5 (df = 3), then increase to n = 80. The SE falls, but each extra observation helps less than the last. Sample size helps only at the √n rate — diminishing returns.
  3. Set σₓ = 0.3, then drag to σₓ = 3. A tight cluster of x gives a poorly identified slope; wide spread gives a precise one. This is exactly why experimental design prefers widely varied regressors.

Take-away: Precision improves with a better-fitting model, more observations, and wider variation in x — but sample size helps only at the rate of √n. Read §6.4 in the chapter →
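The three levers can be pulled one at a time in code. A sketch that computes se(b₂) by hand from sₑ, df, and SSx, using assumed baseline values (n = 30, σᵤ = 2, σₓ = 1); each variant moves exactly one dial:

```python
# se(b2) = s_e / sqrt(SSx), computed without a regression library.
import numpy as np

rng = np.random.default_rng(4)

def se_b2(n, sigma_u, sigma_x):
    x = rng.normal(3, sigma_x, n)
    y = 1 + 2 * x + rng.normal(0, sigma_u, n)
    b2 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b1 = y.mean() - b2 * x.mean()
    e = y - (b1 + b2 * x)
    s_e = np.sqrt(np.sum(e**2) / (n - 2))      # divide by df = n - 2, not n
    ssx = np.sum((x - x.mean())**2)            # SSx: spread of the regressor
    return s_e / np.sqrt(ssx)

print(f"baseline (n=30, sigma_u=2, sigma_x=1): {se_b2(30, 2, 1):.3f}")
print(f"less noise (sigma_u=0.5):              {se_b2(30, 0.5, 1):.3f}")
print(f"more data (n=480):                     {se_b2(480, 2, 1):.3f}")
print(f"wider x (sigma_x=3):                   {se_b2(30, 2, 3):.3f}")
```

Each variant should come out well below the baseline, mirroring the three sliders above (individual draws are random, so exact values vary).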

Gauss-Markov: OLS vs. Alternatives

Why do we use OLS instead of some other line-fitter? Because under the four assumptions, nothing linear can beat it.

Under the four assumptions, OLS is the Best Linear Unbiased Estimator (BLUE). Among all linear unbiased estimators, OLS has the smallest variance. If errors are also normally distributed, OLS is the Best Unbiased Estimator (not just among linear ones). This optimality is what justifies the widespread use of OLS.
What you can do here
  • Toggle Assumptions Met / Heteroskedastic — check whether OLS still wins when assumption 3 is broken.
  • Slide the number of simulations — more draws give cleaner, easier-to-compare sampling distributions.
  • Click Resimulate — redraw all samples and see whether the ranking of estimators is stable.
OLS SD
—
Alt-Linear SD
—
Median-Slope SD
—
Efficiency (Alt/OLS)
—
Try this
  1. Under "Assumptions Met", compare the three histograms. OLS (cyan) is the tightest; the efficiency ratio (Alt SD / OLS SD) sits well above 1.0. That is BLUE made visible.
  2. Switch to "Heteroskedastic". OLS is still centered on 2.0 (unbiased), but the median-slope estimator now has comparable spread. Break homoskedasticity and OLS loses its efficiency edge.
  3. Increase simulations to 2000. Each histogram smooths out and the ranking of variances becomes unambiguous. Monte Carlo convergence makes the theorem's claim concrete.

Take-away: When all four assumptions hold, OLS is tightest of the three; break homoskedasticity and its advantage fades. Read §6.3 in the chapter →
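A minimal version of this race can be run in code. As a stand-in for the dashboard's "Alt-Linear" estimator (whose exact definition isn't given here), this sketch uses the grouped (Wald) estimator: a line through the means of the low-x and high-x halves, which is linear and unbiased but not least-variance:

```python
# Gauss-Markov in miniature: OLS vs. a competing linear unbiased estimator.
# The Wald/grouped estimator is an assumed stand-in for the dashboard's "Alt-Linear".
import numpy as np

rng = np.random.default_rng(2)
ols_b2, wald_b2 = [], []
for _ in range(2000):
    n = 50
    x = rng.normal(3, 1, n)
    y = 1 + 2 * x + rng.normal(0, 2, n)
    # OLS slope
    ols_b2.append(np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1))
    # Wald slope: line through the two half-sample means, split at median(x)
    lo = x <= np.median(x)
    wald_b2.append((y[~lo].mean() - y[lo].mean()) / (x[~lo].mean() - x[lo].mean()))

print(f"OLS  SD: {np.std(ols_b2):.4f}")
print(f"Wald SD: {np.std(wald_b2):.4f}")
print(f"efficiency ratio (Wald/OLS): {np.std(wald_b2) / np.std(ols_b2):.2f}")
```

Both histograms center on 2 (both are unbiased), but the Wald spread is wider, so the ratio lands above 1.0: the BLUE claim made visible.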

Real-Data Sampling Variability

Sampling variability isn't just a simulation curiosity — it happens every time a researcher works with a subset of real data. How far does one sample of 50 countries stray from all 108?

Even with real economic data, each sample tells a slightly different story — sampling error is always present. Treating the full dataset of 108 countries as the "population," drawing a sample of 50 simulates the real-world situation of working with incomplete data. The difference between the population coefficient β₂ and the sample coefficient b₂ is the sampling error — random, sometimes positive, sometimes negative, but on average zero.
What you can do here
  • Slide subsample size — from n = 10 to n = 90 of the 108 countries, and watch how closely the sample line tracks the population line.
  • Slide MC draws — more repeated subsamples yield a cleaner histogram of b₂.
  • Toggle Display — view the sample highlighted inside the full scatter, or the sample alone.
  • Click Resimulate — draw a fresh subsample of countries and compare.
Pop β₂
—
Sample b₂
—
Sampling error
—
MC Mean(b₂)
—
MC SD(b₂)
—
Try this
  1. Set n = 20 and Resimulate several times. The pink sample line swings widely around the purple population line. With only 20 of 108 countries, uncertainty in the slope is substantial.
  2. Increase n to 80. The sample line barely departs from the population line, and the MC SD drops by roughly √4 = 2×. The 1/√n rule holds with real economic data too.
  3. Toggle to "Sample Only" at n = 30. Different subsets of countries produce noticeably different slopes. Which countries end up in your sample matters — that is sampling variability in real research.

Take-away: The full 108-country slope is our benchmark; any single subsample wiggles around it with uncertainty that shrinks as n grows. Read the case study in the chapter →
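The subsampling exercise can be replicated with any fixed dataset. Since the 108-country data isn't bundled here, this sketch simulates a stand-in 108-row "population" and draws repeated subsamples of 50 without replacement:

```python
# Treat a fixed 108-row frame as the "population" and subsample it repeatedly.
# The frame below is simulated stand-in data; the dashboard uses real country data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
pop = pd.DataFrame({'x': rng.normal(3, 1, 108)})
pop['y'] = 1 + 2 * pop['x'] + rng.normal(0, 2, 108)

def slope(d):
    return np.cov(d['x'], d['y'], ddof=1)[0, 1] / np.var(d['x'], ddof=1)

beta2_pop = slope(pop)                                   # full-"population" benchmark
sub = [slope(pop.sample(50, random_state=s)) for s in range(1000)]

print(f"population slope:       {beta2_pop:.3f}")
print(f"mean subsample slope:   {np.mean(sub):.3f}")
print(f"SD of subsample slopes: {np.std(sub):.3f}")
```

The mean of the subsample slopes sits on the population slope while individual draws scatter around it: the same picture as the dashboard, with sampling error averaging out to zero.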

Standard Errors and Sample Size

If I double my sample, does my standard error halve? Not quite — the tyranny of the square root says you need four times the data to halve it.

Standard errors fall with the square root of n — so precision has diminishing returns. se(b₂) = σᵤ / √[n × Var(x)]. To halve the standard error you need four times the sample size. Going from n = 100 to n = 400 yields the same precision gain as going from n = 25 to n = 100. This diminishing-returns relationship is a fundamental constraint of statistical estimation.
What you can do here
  • Slide error σᵤ — noisier data shifts the whole SE-vs-n curve upward.
  • Slide x-spread σₓ — wider regressors press the curve downward.
  • Toggle Generated DGP / Convergence Data — compare a textbook DGP with the real 108-country dataset.
  • Click Resimulate — redraw the Monte Carlo points along the theoretical curve.
SE at n=25
—
SE at n=100
—
Ratio (n=25/n=100)
—
Theory predicts
2.00×
Try this
  1. Read the SE at n = 25 and at n = 100. The ratio should be close to 2.0. That is the 1/√n law predicting that quadrupling n halves the SE.
  2. Increase σᵤ from 2 to 4. The entire SE-vs-n curve shifts upward at every sample size. Noisier data means a larger SE no matter how big your sample is.
  3. Switch to "Convergence Data". The theoretical curve no longer fits perfectly — real data is not a clean normal DGP. The √n decay still holds approximately, which is why the rule travels from theory to practice.

Take-away: Going from n = 25 to n = 100 halves the SE; going from n = 100 to n = 200 only saves you another ~29%. Read the case study in the chapter →

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 6 CHEAT SHEET: The Least Squares Estimator
# =============================================================================

# --- Libraries ---
import numpy as np                        # random sampling and numerical operations
import pandas as pd                       # data manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas

# =============================================================================
# STEP 1: Define the Data-Generating Process (DGP)
# =============================================================================
# The DGP specifies the TRUE population relationship: y = b1 + b2*x + u
# We know the true parameters — in real research, we never do!
beta_1_true = 1       # true intercept
beta_2_true = 2       # true slope
sigma_u     = 2       # error standard deviation

# Generate one sample of n observations
np.random.seed(42)
n = 30
x = np.random.normal(3, 1, n)
u = np.random.normal(0, sigma_u, n)
y = beta_1_true + beta_2_true * x + u

data = pd.DataFrame({'x': x, 'y': y})
print(f"Generated sample: {n} observations from y = {beta_1_true} + {beta_2_true}x + u")

# =============================================================================
# STEP 2: Fit OLS and compare sample vs. population parameters
# =============================================================================
# The sample regression estimates the unknown population line from data
model = ols('y ~ x', data=data).fit()

b1 = model.params['Intercept']
b2 = model.params['x']

print(f"\nPopulation:  E[y|x] = {beta_1_true} + {beta_2_true}x")
print(f"Sample:      y_hat = {b1:.2f} + {b2:.2f}x")
print(f"Sampling error in slope: b2 - b2_true = {b2 - beta_2_true:.4f}")

# Full regression table (coefficients, std errors, t-stats, p-values, R2)
print(model.summary())

# =============================================================================
# STEP 3: Scatter plot — population line vs. sample line
# =============================================================================
# Visualizing the gap between the true line and our estimate
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data['x'], data['y'], s=50, alpha=0.7, label='Observed data')
ax.plot(data['x'], model.fittedvalues, color='red', linewidth=2,
        label=f'Sample: y_hat = {b1:.2f} + {b2:.2f}x')
x_range = np.linspace(data['x'].min(), data['x'].max(), 100)
ax.plot(x_range, beta_1_true + beta_2_true * x_range,
        color='green', linewidth=2, linestyle='--',
        label=f'Population: E[y|x] = {beta_1_true} + {beta_2_true}x')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Population Regression vs. Sample Regression')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Monte Carlo simulation — demonstrate unbiasedness
# =============================================================================
# Draw many samples from the SAME DGP to see how b2 varies
# Unbiasedness: on average, b2 equals the true b2
n_simulations = 1000
b2_estimates = []

for i in range(n_simulations):
    x_sim = np.random.normal(3, 1, n)
    u_sim = np.random.normal(0, sigma_u, n)
    y_sim = beta_1_true + beta_2_true * x_sim + u_sim
    df_sim = pd.DataFrame({'x': x_sim, 'y': y_sim})
    m = ols('y ~ x', data=df_sim).fit()
    b2_estimates.append(m.params['x'])

print(f"\nMonte Carlo results ({n_simulations} simulations, n={n} each):")
print(f"  True b2:              {beta_2_true}")
print(f"  Mean of b2 estimates: {np.mean(b2_estimates):.4f}  (approx b2, confirming unbiasedness)")
print(f"  Std dev of estimates:  {np.std(b2_estimates):.4f}  (empirical standard error)")

# =============================================================================
# STEP 5: Visualize the sampling distribution of b2
# =============================================================================
# The histogram should be centered on b2 (unbiasedness) and bell-shaped (CLT)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(b2_estimates, bins=40, density=True, alpha=0.7, edgecolor='white',
        label=f'{n_simulations} estimates of b2')
ax.axvline(beta_2_true, color='green', linewidth=2, linestyle='--',
           label=f'True b2 = {beta_2_true}')
ax.axvline(np.mean(b2_estimates), color='red', linewidth=2,
           label=f'Mean of estimates = {np.mean(b2_estimates):.4f}')
ax.set_xlabel('Slope estimate (b2)')
ax.set_ylabel('Density')
ax.set_title('Sampling Distribution of b2: Unbiasedness + CLT')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Standard error — what controls precision?
# =============================================================================
# se(b2) = s_e / sqrt[sum((xi - x_bar)^2)]
# Smaller when: (1) model fits well, (2) large n, (3) x spread wide
se_b2       = model.bse['x']                       # from regression output
s_e         = np.sqrt(model.mse_resid)             # standard error of regression
x_variation = np.sum((data['x'] - data['x'].mean())**2)

print(f"\nStandard error anatomy (from the single-sample regression):")
print(f"  s_e (root MSE):          {s_e:.4f}")
print(f"  sum((xi - x_bar)^2):     {x_variation:.4f}")
print(f"  se(b2) = s_e / sqrt(sum) = {s_e / np.sqrt(x_variation):.4f}")
print(f"  se(b2) from output:      {se_b2:.4f}")

# =============================================================================
# STEP 7: Effect of sample size on precision
# =============================================================================
# Theory: se(b2) proportional to 1/sqrt(n) — doubling n cuts SE by ~29%, quadrupling halves it
sample_sizes = [20, 50, 100, 200]

print(f"\n{'n':>6}  {'Mean b2':>10}  {'Std dev (empirical SE)':>22}")
print("-" * 42)
for ns in sample_sizes:
    estimates = []
    for _ in range(1000):
        xs = np.random.normal(3, 1, ns)
        us = np.random.normal(0, sigma_u, ns)
        ys = beta_1_true + beta_2_true * xs + us
        m = ols('y ~ x', data=pd.DataFrame({'x': xs, 'y': ys})).fit()
        estimates.append(m.params['x'])
    print(f"{ns:>6}  {np.mean(estimates):>10.4f}  {np.std(estimates):>22.4f}")

Open empty Colab notebook →