Chapter 03 of 18 · Interactive Dashboard

The Sample Mean

Explore sampling distributions, the Central Limit Theorem, standard error, and estimator properties through interactive simulations.

Sampling Distribution of the Mean

How much does your estimate of the population mean depend on the particular sample you happened to draw? Each dot in this histogram is the mean of one sample; repeated sampling reveals the whole distribution of possible answers.

The sample mean X̄ is itself a random variable. Every observed x̄ is one realization of X̄ = (X1 + ⋯ + Xn) / n. Under simple random sampling, its distribution has mean E[X̄] = μ (unbiased) and standard deviation σ/√n, the theoretical standard error. That predictability is what makes statistical inference possible.
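Those two facts are easy to verify by simulation. A minimal sketch, independent of the dashboard's data, using a fair coin so that μ = 0.5 and σ = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 0.5, 30   # Bernoulli(0.5): mu = p, sigma = sqrt(p(1-p))

# Each row is one sample of n tosses; each row mean is one draw of X-bar
xbars = rng.binomial(1, mu, size=(4000, n)).mean(axis=1)

print(f"mean of xbars: {xbars.mean():.4f}")  # close to mu = 0.5 (unbiasedness)
print(f"sd of xbars:   {xbars.std():.4f}")   # close to sigma/sqrt(n) = 0.0913
```

The number of replications (4000) is arbitrary; more replications tighten the match to the theoretical values.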
What you can do here
  • Switch between Coin Toss and Census Ages: one has a symmetric population (p = 0.5), the other is heavily skewed.
  • Toggle the Normal overlay to compare the empirical histogram to the theoretical N(μ, SE²) curve.
  • Turn on ±1 or ±2 SE bands; the panel reports what fraction of the sample means land inside each band (theory predicts ~68% and ~95%).
Stat cards: Samples · Mean of X̄ · SD of X̄ · Theoretical SE
Try This
  1. Turn on ±2 SE bands on the Coin Toss data. Roughly 95% of the 400 sample means land inside the band, exactly what normal-based theory predicts for a well-behaved sampling distribution.
  2. Switch to Census Ages. The population is right-skewed (ages 0–100+), yet the 100 sample means cluster into a tidy bell shape: the Central Limit Theorem making non-normal data behave for inference.
  3. Toggle the Normal overlay on both datasets. The pink curve lines up with the empirical histogram under both populations: the sampling distribution of X̄ is approximately normal even when the underlying variable is not.

Take-away: X̄ isn't a single number; it's a random variable whose distribution centers at μ with spread σ/√n, and that distribution is the engine of all statistical inference that follows. Read §3.3 in the chapter →

How Sample Size Affects Precision

Is it worth quadrupling survey costs to halve your standard error? The √n rule makes the tradeoff painfully concrete.

The standard error se(X̄) = s/√n measures how precisely the sample mean estimates μ. Because σ is unknown in practice, we estimate it with the sample standard deviation s. The SE shrinks with √n, not with n, so to halve the SE you must quadruple the sample size. That asymmetry is why big precision gains come with big data bills.
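The √n arithmetic can be tabulated in a few lines; a quick sketch using the coin-toss σ = 0.5 (the particular n values are illustrative):

```python
import numpy as np

sigma = 0.5  # population SD of a fair coin toss

# SE = sigma / sqrt(n): quadrupling n halves the SE
ses = {n: sigma / np.sqrt(n) for n in [10, 40, 100, 400]}
for n, se in ses.items():
    print(f"n = {n:<4d}  SE = {se:.4f}")
# n=10 -> ~0.158, n=40 -> ~0.079: four times the data for half the SE
```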
What you can do here
  • Slide the sample size n from 5 to 500 and watch the histogram of 500 simulated means narrow around μ = 0.5.
  • Compare the empirical SE and the theoretical SE in the stat cards; they should agree closely.
  • Click Resimulate to draw a fresh batch of 500 samples; the bell shape is stable, while the bin heights wobble.
Slider: n = 30. Stat cards: n · Empirical SE · Theoretical SE · Mean of means
Try This
  1. Set n = 10, then n = 40. The SE drops from ~0.158 to ~0.079, halved by a 4× larger sample, exactly as √n predicts.
  2. Set n = 100, then n = 400. The SE moves from ~0.05 to ~0.025: another 4× in data for another halving, now with diminishing marginal returns, since the histogram is already nearly a spike.
  3. Click Resimulate several times at n = 30. The bar heights wobble (sampling variability) but the bell shape and width don't: that width is the SE, and it depends on n, not on which 500 samples you happen to draw.

Take-away: Precision grows with the square root of sample size, not with sample size itself; every halving of uncertainty costs four times more data. Read §3.3 in the chapter →

Central Limit Theorem in Action

Does the normal curve really describe sample means when the population is nowhere close to normal? The CLT says yes, but how quickly does it kick in?

The Central Limit Theorem says the standardized sample mean Z = (X̄ − μ) / (σ/√n) converges to N(0, 1) as n → ∞. This holds for any population distribution with a finite mean and variance (Key Concept 3.5). In practice (Key Concept 3.6), the convergence is fast: even strongly skewed or bimodal populations produce approximately normal sample means for moderate n.
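As a rough sketch of this standardization, using an Exponential(1) population (not one of the dashboard's datasets) so that both μ and σ equal 1:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = sigma = 1.0   # Exponential(1) population: mean and SD both equal 1
n = 30

# 5000 draws of X-bar from a right-skewed population
xbars = rng.exponential(mu, size=(5000, n)).mean(axis=1)

# Standardize: by the CLT, Z should be approximately N(0, 1)
z = (xbars - mu) / (sigma / np.sqrt(n))
cov = float(np.mean(np.abs(z) < 1.96))
print(f"P(|Z| < 1.96) ~ {cov:.3f}")  # near the N(0,1) value of 0.95
```

Even at n = 30 the coverage is close to the normal benchmark, despite the strong skew of the population.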
What you can do here
  • Pick a population shape: Bernoulli, Uniform, right-skewed Exponential, Bimodal, or Census-like ages.
  • Slide the sample size n from 2 to 100 and watch the right panel morph from the population's shape toward a bell curve.
  • Click Resimulate to draw fresh samples without changing the population or n.
Slider: n = 5. Stat cards: Pop μ · Pop σ · Mean of X̄s · SE (empirical) · SE (theoretical)

Panels: Population Distribution · Distribution of 500 Sample Means

Try This
  1. Select Right-Skewed (Exponential) and start at n = 2. At n = 2 the means are still skewed; by n ≈ 10 they are clearly bell-shaped; by n = 30 the normal curve fits tightly. CLT convergence happens in moderate, not huge, samples.
  2. Select Bimodal and set n = 50. The population has two distinct peaks; the distribution of sample means has exactly one. The CLT smooths out the original shape completely.
  3. Select Census-like ages (μ ≈ 24, σ ≈ 19) at n = 25. This matches the book's 1880 Census example: a heavily skewed population, but sample means look normal, which is why normal-based confidence intervals and tests work on real, messy data.

Take-away: The CLT is the bridge from arbitrary populations to normal-based inference; it's why t-tests, confidence intervals, and regression SEs work in practice. Read §3.4 in the chapter →

Estimator Properties

Is the mean always the best summary? A single outlier can wreck it. When should you reach for the median or a trimmed mean instead?

A good estimator is unbiased, consistent, and efficient. Unbiased means E[θ̂] = θ: no systematic error. Consistent means it converges to θ as n → ∞. Efficient means it has the smallest variance among unbiased estimators. The sample mean is all three for clean normal data, which is why it's the default choice, but its efficiency collapses when outliers enter, and the median (a less efficient but robust alternative) starts winning on MSE.
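A minimal contamination experiment makes these tradeoffs concrete. The settings below (15% one-sided outliers, n = 30, true value 50) are illustrative, not the dashboard's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 50.0, 30, 2000

# Clean N(theta, 10) samples, plus a 15%-contaminated copy where
# contaminated points are shifted far upward (one-sided outliers)
clean = rng.normal(theta, 10.0, size=(reps, n))
contam = clean.copy()
contam[rng.random((reps, n)) < 0.15] += 100.0

def mse(est):
    # MSE = mean squared error of the estimates around the true theta
    return float(np.mean((est - theta) ** 2))

print(f"clean:  MSE(mean) = {mse(clean.mean(axis=1)):.2f}, "
      f"MSE(median) = {mse(np.median(clean, axis=1)):.2f}")
print(f"contam: MSE(mean) = {mse(contam.mean(axis=1)):.2f}, "
      f"MSE(median) = {mse(np.median(contam, axis=1)):.2f}")
# Clean data: the mean has lower MSE (efficiency).
# Contaminated: the mean's MSE balloons; the median's stays modest (robustness).
```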
What you can do here
  • Toggle Mean / Median / Trimmed 10% to see each estimator's sampling distribution.
  • Slide the sample size n to watch variance shrink (consistency).
  • Slide outlier contamination from 0% to 20% to see which estimator copes with heavy-tail data.
  • Compare Bias, Variance, and MSE in the stat cards; MSE = Variance + Bias² is the summary metric.
Sliders: n = 30, contamination 0%. Stat cards: True μ (50) · Avg Estimate · Bias · Variance · MSE
Try This
  1. At 0% contamination, toggle Mean vs. Median. The mean has smaller variance; for clean normal data it's the most efficient estimator, which is exactly the Gauss-Markov result in miniature.
  2. Slide contamination to 15% and toggle again. Now the mean is biased upward by the outliers and its MSE balloons; the median's MSE stays modest. "Efficient for clean data" and "robust to outliers" are different virtues.
  3. Increase n from 30 to 200 at any contamination level. Variance shrinks for every estimator: that is consistency, the guarantee that more data eventually wins regardless of which estimator you chose.

Take-away: Pick the estimator to match the data: the mean for clean normal samples, the median when outliers threaten, and let MSE (bias² + variance) decide the tradeoff. Read §3.5 in the chapter →

Weighted vs Unweighted Means

If your sample over-represents one group (online polls, volunteer studies, mailing-list surveys), the plain average lies about the population. Can we un-bias it?

Simple random sampling assumes every observation comes from the same distribution with common mean μ. When inclusion probabilities πi differ across observations, the unweighted mean is biased toward over-sampled groups. Inverse-probability weights wi = 1/πi rebuild the population mean: x̄w = Σwixi / Σwi. The correction works exactly when you know the true inclusion probabilities.
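A deterministic sketch of the correction, with hypothetical numbers (group means 60 and 50 thousand dollars, population split 50/50, sample split 80/20):

```python
import numpy as np

# Group A has mean 60, group B has mean 50 (thousands of dollars).
# The population is 50/50, but the sample came out 80% A / 20% B.
x = np.concatenate([np.full(80, 60.0), np.full(20, 50.0)])

# Relative inclusion probability = sample share / population share
pi = np.concatenate([np.full(80, 0.8 / 0.5), np.full(20, 0.2 / 0.5)])

unweighted = x.mean()                     # pulled toward group A: 58
weighted = np.average(x, weights=1 / pi)  # recovers the true mean: 55
print(unweighted, weighted)
```

`np.average` with `weights=1/pi` is exactly the x̄w formula above: it divides the weighted sum by the sum of the weights.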
What you can do here
  • Slide Group A and Group B means to set the two subpopulations (in thousands of dollars).
  • Slide the population fraction of group A โ€” the share of the true population that is in group A.
  • Slide the sample fraction of group A โ€” the share of your actual sample that turned out to be in group A.
  • Compare the three bars: true population mean, unweighted sample mean, and weighted sample mean.
Sliders: μA = $60k, μB = $50k, population fraction A = 50%, sample fraction A = 80%. Stat cards: True Pop Mean · Unweighted · Weighted · Bias
Try This
  1. Keep defaults: μA = $60k, μB = $50k, 50% of population is A, 80% of sample is A. The unweighted mean is pulled upward toward A, while the weighted mean hits the true $55k exactly; IPW recovers the population mean from a biased sample.
  2. Slide sample fraction A down to 50% (matching the population). The bias disappears and unweighted = weighted: when a sample is already representative, weighting buys nothing.
  3. Make the groups more different (A = $80k, B = $30k) and push sample fraction A to 80% again. Bias grows linearly in both the mis-sampling gap and the group gap; the formula is exactly Bias = (fA − πA) × (μA − μB).
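The bias identity in item 3 takes only a few lines to verify, using that item's settings (μA = 80, μB = 30, πA = 0.5, fA = 0.8):

```python
mu_A, mu_B = 80.0, 30.0   # group means (thousands of dollars)
pi_A, f_A = 0.5, 0.8      # population share vs sample share of group A

true_mean = pi_A * mu_A + (1 - pi_A) * mu_B   # 55
unweighted = f_A * mu_A + (1 - f_A) * mu_B    # 70
bias = unweighted - true_mean

# Both expressions give the same bias (about 15, up to float rounding)
print(bias)
print((f_A - pi_A) * (mu_A - mu_B))
```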

Take-away: Nonrepresentative samples aren't fatal; they're correctable, but only if you know the inclusion probabilities that produced the imbalance. Read §3.7 in the chapter →

Python Libraries and Code

You've explored the key concepts interactively; now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 3 CHEAT SHEET: The Sample Mean
# =============================================================================

# --- Libraries ---
import numpy as np                        # numerical operations and random sampling
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from scipy import stats                   # normal distribution PDF for overlays

# =============================================================================
# STEP 1: Load pre-computed sample means from coin toss experiments
# =============================================================================
# 400 samples of 30 coin tosses each; precomputed in the textbook dataset
url_coin = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_COINTOSSMEANS.DTA"
data_coin = pd.read_stata(url_coin)
xbar_coin = data_coin['xbar']

print(f"Coin toss experiment: {len(xbar_coin)} sample means (each from n=30 tosses)")
print(f"Mean of sample means: {xbar_coin.mean():.4f}  (theoretical μ = 0.5)")
print(f"SD of sample means:   {xbar_coin.std():.4f}  (theoretical σ/√n = {np.sqrt(0.25/30):.4f})")

# =============================================================================
# STEP 2: Visualize the sampling distribution with normal overlay
# =============================================================================
# The histogram of 400 sample means approximates the sampling distribution of X̄
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(xbar_coin, bins=25, density=True, edgecolor='black', alpha=0.7,
        label='400 sample means')

# Overlay theoretical normal: N(μ, σ²/n)
theo_se = np.sqrt(0.25 / 30)
x_range = np.linspace(xbar_coin.min(), xbar_coin.max(), 100)
ax.plot(x_range, stats.norm.pdf(x_range, 0.5, theo_se),
        'r-', linewidth=2.5, label=f'N(0.5, {theo_se:.3f}²)')
ax.set_xlabel('Sample Mean (proportion of heads)')
ax.set_ylabel('Density')
ax.set_title('Sampling Distribution of X̄ from 400 Coin Toss Experiments (n=30)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 3: Central Limit Theorem: non-normal population still gives normal X̄
# =============================================================================
# 1880 U.S. Census ages: highly skewed population, yet sample means are normal
url_census = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_CENSUSAGEMEANS.DTA"
data_census = pd.read_stata(url_census)

# Identify the sample mean column
if 'mean' in data_census.columns:
    age_means = data_census['mean']
elif 'xmean' in data_census.columns:
    age_means = data_census['xmean']
else:
    age_means = data_census.iloc[:, 0]

print(f"\n1880 Census: {len(age_means)} sample means (each from n=25 people)")
print(f"Mean of sample means: {age_means.mean():.2f} years  (theoretical μ = 24.13)")
print(f"SD of sample means:   {age_means.std():.2f} years  (theoretical σ/√n = {18.61/np.sqrt(25):.2f})")

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(age_means, bins=20, density=True, edgecolor='black', alpha=0.7,
        label='100 sample means')
age_range = np.linspace(age_means.min(), age_means.max(), 100)
ax.plot(age_range, stats.norm.pdf(age_range, 24.13, 18.61 / np.sqrt(25)),
        'r-', linewidth=2.5, label=f'N(24.13, {18.61/np.sqrt(25):.2f}²)')
ax.set_xlabel('Sample Mean Age (years)')
ax.set_ylabel('Density')
ax.set_title('CLT in Action: Normal Sample Means from a Skewed Population')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Standard error: how sample size affects precision
# =============================================================================
# SE = σ/√n: to halve the SE, you must quadruple the sample size
sigma = 0.5  # coin toss population std dev

print(f"\nStandard error vs sample size (σ = {sigma}):")
print(f"{'n':<10} {'SE = σ/√n':<15} {'Var(X̄) = σ²/n':<15}")
print("-" * 40)
for n in [10, 30, 100, 400, 1000]:
    se = sigma / np.sqrt(n)
    var_xbar = sigma**2 / n
    print(f"{n:<10} {se:<15.4f} {var_xbar:<15.6f}")

# =============================================================================
# STEP 5: Monte Carlo simulation: verify the theory computationally
# =============================================================================
# Simulate 1000 samples of 30 coin tosses to see the CLT converge
np.random.seed(10101)
n_sims = 1000
sample_size = 30
sim_means = np.array([np.random.binomial(1, 0.5, sample_size).mean()
                       for _ in range(n_sims)])

print(f"\nMonte Carlo simulation ({n_sims} samples, n={sample_size}):")
print(f"Mean of simulated means: {sim_means.mean():.4f}  (theoretical: 0.5)")
print(f"SD of simulated means:   {sim_means.std():.4f}  (theoretical: {np.sqrt(0.25/30):.4f})")

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(sim_means, bins=30, density=True, edgecolor='black', alpha=0.7,
        label=f'{n_sims} simulated means')
x_range = np.linspace(sim_means.min(), sim_means.max(), 100)
ax.plot(x_range, stats.norm.pdf(x_range, 0.5, np.sqrt(0.25/30)),
        'r-', linewidth=2.5, label='Theoretical N(0.5, 0.091²)')
ax.set_xlabel('Sample Mean')
ax.set_ylabel('Density')
ax.set_title('Monte Carlo Simulation vs Theoretical Sampling Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Weighted means: correcting for nonrepresentative samples
# =============================================================================
# When inclusion probabilities differ, the unweighted mean is biased;
# inverse-probability weights w_i = 1/π_i recover the true population mean
np.random.seed(42)
income_men = np.random.normal(60000, 15000, 50)
income_women = np.random.normal(50000, 15000, 50)
true_pop_mean = (income_men.mean() + income_women.mean()) / 2

# Biased sample: oversample women (70% women, 30% men)
sample_men = np.random.choice(income_men, size=15, replace=False)
sample_women = np.random.choice(income_women, size=35, replace=False)
sample = np.concatenate([sample_men, sample_women])

# Unweighted mean is biased toward the oversampled group
unweighted = sample.mean()

# Weighted mean with IPW: w_i = 1/π_i corrects the imbalance
weights = np.concatenate([np.repeat(1/0.3, 15), np.repeat(1/0.7, 35)])
weighted = np.average(sample, weights=weights)

print(f"\nWeighted vs Unweighted Means:")
print(f"True population mean:  ${true_pop_mean:,.0f}")
print(f"Unweighted mean:       ${unweighted:,.0f}  (bias: ${unweighted - true_pop_mean:,.0f})")
print(f"Weighted mean (IPW):   ${weighted:,.0f}  (bias: ${weighted - true_pop_mean:,.0f})")
Open empty Colab notebook →