Chapter 03 of 18 · Interactive Dashboard

The Sample Mean

Explore sampling distributions, the Central Limit Theorem, standard error, and estimator properties through interactive simulations.

Sampling Distribution of the Mean

How much does your estimate of the population mean depend on the particular sample you happened to draw? Each dot in this histogram is the mean of one sample — repeated sampling reveals the whole distribution of possible answers.

The sample mean X̄ is itself a random variable. Every observed x̄ is one realization of X̄ = (X1 + ⋯ + Xn) / n. Under simple random sampling, its distribution has mean E[X̄] = μ (unbiased) and standard deviation σ/√n — the theoretical standard error. That predictability is what makes statistical inference possible.
What you can do here
  • Switch between Coin Toss and Census Ages — one has a symmetric population (p = 0.5), the other is heavily skewed.
  • Toggle the Normal overlay to compare the empirical histogram to the theoretical N(μ, SE²) curve.
  • Turn on ±1 or ±2 SE bands — the panel reports what fraction of the sample means land inside each band (theory predicts ~68% and ~95%).
Samples
Mean of X̄
SD of X̄
Theoretical SE
Try this
  1. Turn on ±2 SE bands on the Coin Toss data. Roughly 95% of the 400 sample means land inside the band — exactly what the normal-based theory predicts for a well-behaved sampling distribution.
  2. Switch to Census Ages. The population is right-skewed (ages 0–100+), yet the 100 sample means cluster into a tidy bell shape — the Central Limit Theorem making non-normal data behave for inference.
  3. Toggle the Normal overlay on both datasets. The pink curve lines up with the empirical histogram under both populations — the sampling distribution of X̄ is normal even when the underlying variable is not.

Take-away: X̄ isn't a single number — it's a random variable whose distribution sits at μ with spread σ/√n, and that distribution is the engine of all statistical inference that follows. Read §3.3 in the chapter →

How Sample Size Affects Precision

Is it worth quadrupling survey costs to halve your standard error? The √n rule makes the tradeoff painfully concrete.

The standard error se(X̄) = s/√n measures how precisely the sample mean estimates μ. Because σ is unknown in practice, we estimate it with the sample standard deviation s. The SE shrinks with √n, not with n — so to halve the SE you must quadruple the sample size. That asymmetry is why big precision gains come with big data bills.
What you can do here
  • Slide the sample size n from 5 to 500 and watch the histogram of 500 simulated means narrow around μ = 0.5.
  • Compare the empirical SE and the theoretical SE in the stat cards — they should agree closely.
  • Click Resimulate to draw a fresh batch of 500 samples — the bell shape is stable, the bin heights wobble.
n = 30
n
30
Empirical SE
Theoretical SE
Mean of means
Try this
  1. Set n = 10, then n = 40. The SE drops from ~0.158 to ~0.079 — halved by a 4× larger sample, exactly as √n predicts.
  2. Set n = 100, then n = 400. The SE moves from ~0.05 to ~0.025 — another 4× in data for another halving, now with diminishing marginal returns: the histogram is already nearly a spike.
  3. Click Resimulate several times at n = 30. The bar heights wobble (sampling variability) but the bell shape and width don't — that width is the SE, and it depends on n, not on which 500 samples you happen to draw.

Take-away: Precision grows with the square root of sample size, not with sample size itself — every halving of uncertainty costs four times more data. Read §3.3 in the chapter →

Central Limit Theorem in Action

Does the normal curve really describe sample means when the population is nowhere close to normal? The CLT says yes — but how quickly does it kick in?

The Central Limit Theorem says the standardized sample mean Z = (X̄ − μ) / (σ/√n) converges to N(0, 1) as n → ∞. This holds for any population distribution with a finite mean and variance (Key Concept 3.5). In practice (Key Concept 3.6), the convergence is fast: even strongly skewed or bimodal populations produce approximately-normal sample means for moderate n.
What you can do here
  • Pick a population shape — Bernoulli, Uniform, right-skewed Exponential, Bimodal, or Census-like ages.
  • Slide the sample size n from 2 to 100 and watch the right panel morph from the population's shape toward a bell curve.
  • Click Resimulate to draw fresh samples without changing the population or n.
n = 5
Pop μ
Pop σ
Mean of X̄s
SE (empirical)
SE (theoretical)

Population Distribution

Distribution of 500 Sample Means

Try this
  1. Select Right-Skewed (Exponential) and start at n = 2. At n = 2 the means are still skewed; by n ≈ 10 they are clearly bell-shaped; by n = 30 the normal curve fits tightly — CLT convergence happens in moderate, not huge, samples.
  2. Select Bimodal and set n = 50. The population has two distinct peaks; the distribution of sample means has exactly one. The CLT smooths out the original shape completely.
  3. Select Census-like ages (μ ≈ 24, σ ≈ 19) at n = 25. This matches the book's 1880 Census example: a heavily skewed population, but sample means look normal — which is why normal-based confidence intervals and tests work on real, messy data.

Take-away: The CLT is the bridge from arbitrary populations to normal-based inference — it's why t-tests, confidence intervals, and regression SEs work in practice. Read §3.4 in the chapter →

Estimator Properties

Is the mean always the best summary? A single outlier can wreck it. When should you reach for the median or a trimmed mean instead?

A good estimator is unbiased, consistent, and efficient. Unbiased means E[θ̂] = θ — no systematic error. Consistent means it converges to θ as n → ∞. Efficient means the smallest variance among unbiased estimators. The sample mean is all three for clean normal data, which is why it's the default choice — but its efficiency collapses when outliers enter, and the median (a less efficient, robust alternative) starts winning on MSE.
What you can do here
  • Toggle Mean / Median / Trimmed 10% to see each estimator's sampling distribution.
  • Slide the sample size n to watch variance shrink (consistency).
  • Slide outlier contamination from 0% to 20% to see which estimator copes with heavy-tail data.
  • Compare Bias, Variance, and MSE in the stat cards — MSE = Variance + Bias² is the summary metric.
n = 30
0%
True μ
50
Avg Estimate
Bias
Variance
MSE
Try this
  1. At 0% contamination, toggle Mean vs. Median. The mean has smaller variance — for clean normal data it's the most efficient estimator, which is exactly the Gauss-Markov result in miniature.
  2. Slide contamination to 15% and toggle again. Now the mean is biased upward by the outliers and its MSE balloons; the median's MSE stays modest. "Efficient for clean data" and "robust to outliers" are different virtues.
  3. Increase n from 30 to 200 at any contamination level. Variance shrinks for every estimator — that is consistency, the guarantee that more data eventually wins regardless of which estimator you chose.

Take-away: Pick the estimator to match the data: the mean for clean normal samples, the median when outliers threaten — and let MSE (bias² + variance) decide the tradeoff. Read §3.5 in the chapter →

Weighted vs Unweighted Means

If your sample over-represents one group — online polls, volunteer studies, mailing-list surveys — the plain average lies about the population. Can we un-bias it?

Simple random sampling assumes every observation comes from the same distribution with common mean μ. When inclusion probabilities πi differ across observations, the unweighted mean is biased toward over-sampled groups. Inverse-probability weights wi = 1/πi rebuild the population mean: w = Σwixi / Σwi. The correction works exactly when you know the true inclusion probabilities.
What you can do here
  • Slide Group A and Group B means to set the two subpopulations (in thousands of dollars).
  • Slide the population fraction of group A — the share of the true population that is in group A.
  • Slide the sample fraction of group A — the share of your actual sample that turned out to be in group A.
  • Compare the three bars: true population mean, unweighted sample mean, and weighted sample mean.
$60k
$50k
50%
80%
True Pop Mean
Unweighted
Weighted
Bias
Try this
  1. Keep defaults: μA = $60k, μB = $50k, 50% of population is A, 80% of sample is A. The unweighted mean is pulled upward toward A, while the weighted mean hits the true $55k exactly — IPW recovers the population mean from a biased sample.
  2. Slide sample fraction A down to 50% (matching the population). The bias disappears and unweighted = weighted — when a sample is already representative, weighting buys nothing.
  3. Make the groups more different (A = $80k, B = $30k) and push sample fraction A to 80% again. Bias grows linearly in both the mis-sampling gap and the group gap — the formula is exactly Bias = (fA − πA) × (μA − μB).

Take-away: Nonrepresentative samples aren't fatal — they're correctable, but only if you know the inclusion probabilities that produced the imbalance. Read §3.7 in the chapter →

Code Summary

You've explored the key concepts interactively — now reproduce them in code. These self-contained blocks cover everything you practiced above. Pick your language, copy the code, and run it.

# =============================================================================
# CHAPTER 3 CHEAT SHEET: The Sample Mean
# =============================================================================

# --- Libraries ---
import numpy as np                        # numerical operations and random sampling
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from scipy import stats                   # normal distribution PDF for overlays

# =============================================================================
# STEP 1: Load pre-computed sample means from coin toss experiments
# =============================================================================
# 400 samples of 30 coin tosses each — precomputed in the textbook dataset
url_coin = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_COINTOSSMEANS.DTA"
data_coin = pd.read_stata(url_coin)
xbar_coin = data_coin['xbar']

print(f"Coin toss experiment: {len(xbar_coin)} sample means (each from n=30 tosses)")
print(f"Mean of sample means: {xbar_coin.mean():.4f}  (theoretical μ = 0.5)")
print(f"SD of sample means:   {xbar_coin.std():.4f}  (theoretical σ/√n = {np.sqrt(0.25/30):.4f})")

# =============================================================================
# STEP 2: Visualize the sampling distribution with normal overlay
# =============================================================================
# The histogram of 400 sample means approximates the sampling distribution of X̄
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(xbar_coin, bins=25, density=True, edgecolor='black', alpha=0.7,
        label='400 sample means')

# Overlay theoretical normal: N(μ, σ²/n)
theo_se = np.sqrt(0.25 / 30)
x_range = np.linspace(xbar_coin.min(), xbar_coin.max(), 100)
ax.plot(x_range, stats.norm.pdf(x_range, 0.5, theo_se),
        'r-', linewidth=2.5, label=f'N(0.5, {theo_se:.3f}²)')
ax.set_xlabel('Sample Mean (proportion of heads)')
ax.set_ylabel('Density')
ax.set_title('Sampling Distribution of X̄ from 400 Coin Toss Experiments (n=30)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 3: Central Limit Theorem — non-normal population still gives normal X̄
# =============================================================================
# 1880 U.S. Census ages: highly skewed population, yet sample means are normal
url_census = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_CENSUSAGEMEANS.DTA"
data_census = pd.read_stata(url_census)

# Identify the sample mean column
if 'mean' in data_census.columns:
    age_means = data_census['mean']
elif 'xmean' in data_census.columns:
    age_means = data_census['xmean']
else:
    age_means = data_census.iloc[:, 0]

print(f"\n1880 Census: {len(age_means)} sample means (each from n=25 people)")
print(f"Mean of sample means: {age_means.mean():.2f} years  (theoretical μ = 24.13)")
print(f"SD of sample means:   {age_means.std():.2f} years  (theoretical σ/√n = {18.61/np.sqrt(25):.2f})")

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(age_means, bins=20, density=True, edgecolor='black', alpha=0.7,
        label='100 sample means')
age_range = np.linspace(age_means.min(), age_means.max(), 100)
ax.plot(age_range, stats.norm.pdf(age_range, 24.13, 18.61 / np.sqrt(25)),
        'r-', linewidth=2.5, label=f'N(24.13, {18.61/np.sqrt(25):.2f}²)')
ax.set_xlabel('Sample Mean Age (years)')
ax.set_ylabel('Density')
ax.set_title('CLT in Action: Normal Sample Means from a Skewed Population')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Standard error — how sample size affects precision
# =============================================================================
# SE = σ/√n: to halve the SE, you must quadruple the sample size
sigma = 0.5  # coin toss population std dev

print(f"\nStandard error vs sample size (σ = {sigma}):")
print(f"{'n':<10} {'SE = σ/√n':<15} {'Var(X̄) = σ²/n':<15}")
print("-" * 40)
for n in [10, 30, 100, 400, 1000]:
    se = sigma / np.sqrt(n)
    var_xbar = sigma**2 / n
    print(f"{n:<10} {se:<15.4f} {var_xbar:<15.6f}")

# =============================================================================
# STEP 5: Monte Carlo simulation — verify the theory computationally
# =============================================================================
# Simulate 1000 samples of 30 coin tosses to see the CLT converge
np.random.seed(10101)
n_sims = 1000
sample_size = 30
sim_means = np.array([np.random.binomial(1, 0.5, sample_size).mean()
                       for _ in range(n_sims)])

print(f"\nMonte Carlo simulation ({n_sims} samples, n={sample_size}):")
print(f"Mean of simulated means: {sim_means.mean():.4f}  (theoretical: 0.5)")
print(f"SD of simulated means:   {sim_means.std():.4f}  (theoretical: {np.sqrt(0.25/30):.4f})")

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(sim_means, bins=30, density=True, edgecolor='black', alpha=0.7,
        label=f'{n_sims} simulated means')
x_range = np.linspace(sim_means.min(), sim_means.max(), 100)
ax.plot(x_range, stats.norm.pdf(x_range, 0.5, np.sqrt(0.25/30)),
        'r-', linewidth=2.5, label='Theoretical N(0.5, 0.091²)')
ax.set_xlabel('Sample Mean')
ax.set_ylabel('Density')
ax.set_title('Monte Carlo Simulation vs Theoretical Sampling Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Weighted means — correcting for nonrepresentative samples
# =============================================================================
# When inclusion probabilities differ, the unweighted mean is biased;
# inverse-probability weights w_i = 1/π_i recover the true population mean
np.random.seed(42)
income_men = np.random.normal(60000, 15000, 50)
income_women = np.random.normal(50000, 15000, 50)
true_pop_mean = (income_men.mean() + income_women.mean()) / 2

# Biased sample: oversample women (70% women, 30% men)
sample_men = np.random.choice(income_men, size=15, replace=False)
sample_women = np.random.choice(income_women, size=35, replace=False)
sample = np.concatenate([sample_men, sample_women])

# Unweighted mean is biased toward the oversampled group
unweighted = sample.mean()

# Weighted mean with IPW: w_i = 1/π_i corrects the imbalance
weights = np.concatenate([np.repeat(1/0.3, 15), np.repeat(1/0.7, 35)])
weighted = np.average(sample, weights=weights)

print(f"\nWeighted vs Unweighted Means:")
print(f"True population mean:  ${true_pop_mean:,.0f}")
print(f"Unweighted mean:       ${unweighted:,.0f}  (bias: ${unweighted - true_pop_mean:,.0f})")
print(f"Weighted mean (IPW):   ${weighted:,.0f}  (bias: ${weighted - true_pop_mean:,.0f})")
Open empty Colab notebook →
* =============================================================================
* CHAPTER 3 CHEAT SHEET: The Sample Mean
* =============================================================================

* --- Setup ---
clear all                                // start with a clean workspace
set more off                             // do not pause output for long results

* =============================================================================
* STEP 1: Load pre-computed sample means from coin toss experiments
* =============================================================================
* 400 samples of 30 coin tosses each — precomputed in the textbook dataset
use "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_COINTOSSMEANS.DTA", clear

describe                                 // list all variables, types, and labels
display "Observations: " _N              // _N is Stata's built-in observation count

* Compute mean and SD of the sample means
summarize xbar
display "Mean of sample means: " r(mean)           // theoretical μ = 0.5
display "SD of sample means:   " r(sd)             // theoretical σ/√n = 0.0913

* =============================================================================
* STEP 2: Visualize the sampling distribution with normal overlay
* =============================================================================
* The histogram of 400 sample means approximates the sampling distribution
* kdensity overlays a smooth kernel density estimate
histogram xbar, kdensity normal                    ///
    xtitle("Sample Mean (proportion of heads)")    ///
    ytitle("Density")                              ///
    title("Sampling Distribution of X-bar from 400 Coin Toss Experiments (n=30)")

* =============================================================================
* STEP 3: Central Limit Theorem — non-normal population still gives normal X-bar
* =============================================================================
* 1880 U.S. Census ages: highly skewed population, yet sample means are normal
use "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_CENSUSAGEMEANS.DTA", clear

describe
summarize                                // identify the sample mean variable

* Use the first numeric variable (sample means of ages)
ds, has(type numeric)                    // list numeric variables

* Summarize the sample means
summarize xmean, detail
display "Mean of sample means: " r(mean)           // theoretical μ = 24.13
display "SD of sample means:   " r(sd)             // theoretical σ/√n = 3.72

* Histogram with normal overlay demonstrates CLT
histogram xmean, kdensity normal                   ///
    xtitle("Sample Mean Age (years)")              ///
    ytitle("Density")                              ///
    title("CLT in Action: Normal Sample Means from a Skewed Population")

* =============================================================================
* STEP 4: Standard error — how sample size affects precision
* =============================================================================
* SE = σ/√n: to halve the SE, you must quadruple the sample size
* Demonstrate with coin toss: σ = 0.5

display ""
display "Standard error vs sample size (σ = 0.5):"
display "{hline 40}"
display %~10s "n" %~15s "SE = σ/√n" %~15s "Var(X-bar)"
display "{hline 40}"
foreach n in 10 30 100 400 1000 {
    local se = 0.5 / sqrt(`n')
    local var_xbar = 0.25 / `n'
    display %10.0f `n' %15.4f `se' %15.6f `var_xbar'
}

* =============================================================================
* STEP 5: Monte Carlo simulation — verify the theory computationally
* =============================================================================
* Simulate 1000 samples of 30 coin tosses to see the CLT converge
clear
set seed 10101
set obs 1000                             // one observation per simulation run

* Generate each simulated mean: average of 30 Bernoulli(0.5) draws
gen sim_mean = .
forvalues i = 1/1000 {
    // Draw 30 coin tosses and compute mean
    quietly {
        preserve
        clear
        set obs 30
        gen toss = runiform() < 0.5
        summarize toss
        local m = r(mean)
        restore
        replace sim_mean = `m' in `i'
    }
}

summarize sim_mean
display ""
display "Monte Carlo simulation (1000 samples, n=30):"
display "Mean of simulated means: " r(mean)        // theoretical: 0.5
display "SD of simulated means:   " r(sd)          // theoretical: 0.0913

histogram sim_mean, kdensity normal                ///
    xtitle("Sample Mean")                          ///
    ytitle("Density")                              ///
    title("Monte Carlo Simulation vs Theoretical Sampling Distribution")

* =============================================================================
* STEP 6: Weighted means — correcting for nonrepresentative samples
* =============================================================================
* When inclusion probabilities differ, the unweighted mean is biased;
* inverse-probability weights w_i = 1/π_i recover the true population mean
clear
set seed 42

* Create a population: 50 men (mean 60000) and 50 women (mean 50000)
set obs 100
gen male = (_n <= 50)
gen income = rnormal(60000, 15000) if male == 1
replace income = rnormal(50000, 15000) if male == 0

* True population mean
summarize income
local true_mean = r(mean)

* Biased sample: oversample women (70% women, 30% men)
* Tag 15 random men and 35 random women
gen rand = runiform()
sort male rand
gen in_sample = 0
// First 15 men (male==1 block is obs 51-100 after sort)
by male: replace in_sample = 1 if male == 1 & _n <= 15
by male: replace in_sample = 1 if male == 0 & _n <= 35
keep if in_sample == 1

* Unweighted mean (biased toward oversampled group)
summarize income
local unweighted = r(mean)

* Create IPW weights: w_i = 1/π_i
gen ipw = 1/0.3 if male == 1
replace ipw = 1/0.7 if male == 0

* Weighted mean corrects the imbalance
summarize income [aw=ipw]
local weighted = r(mean)

display ""
display "Weighted vs Unweighted Means:"
display "True population mean:  $" %9.0fc `true_mean'
display "Unweighted mean:       $" %9.0fc `unweighted'
display "Weighted mean (IPW):   $" %9.0fc `weighted'
Paste into your Stata do-file editor
# =============================================================================
# CHAPTER 3 CHEAT SHEET: The Sample Mean
# =============================================================================

# --- Libraries ---
library(haven)           # read Stata .dta files directly from URLs
library(ggplot2)         # grammar of graphics for all plots
library(dplyr)           # data manipulation

# =============================================================================
# STEP 1: Load pre-computed sample means from coin toss experiments
# =============================================================================
# 400 samples of 30 coin tosses each — precomputed in the textbook dataset
url_coin <- "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_COINTOSSMEANS.DTA"
data_coin <- read_dta(url_coin)

cat("Coin toss experiment:", nrow(data_coin), "sample means (each from n=30 tosses)\n")
cat("Mean of sample means:", round(mean(data_coin$xbar), 4),
    " (theoretical μ = 0.5)\n")
cat("SD of sample means:  ", round(sd(data_coin$xbar), 4),
    " (theoretical σ/√n =", round(sqrt(0.25/30), 4), ")\n")

# =============================================================================
# STEP 2: Visualize the sampling distribution with normal overlay
# =============================================================================
# The histogram of 400 sample means approximates the sampling distribution
theo_se <- sqrt(0.25 / 30)

ggplot(data_coin, aes(x = xbar)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25,
                 fill = "steelblue", color = "black", alpha = 0.7) +
  stat_function(fun = dnorm, args = list(mean = 0.5, sd = theo_se),
                color = "red", linewidth = 1.5) +
  labs(x = "Sample Mean (proportion of heads)", y = "Density",
       title = "Sampling Distribution of X-bar from 400 Coin Toss Experiments (n=30)") +
  theme_minimal()

# =============================================================================
# STEP 3: Central Limit Theorem — non-normal population still gives normal X-bar
# =============================================================================
# 1880 U.S. Census ages: highly skewed population, yet sample means are normal
url_census <- "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_CENSUSAGEMEANS.DTA"
data_census <- read_dta(url_census)

# Identify the sample mean column
age_means <- if ("xmean" %in% names(data_census)) data_census$xmean else data_census[[1]]

cat("\n1880 Census:", length(age_means), "sample means\n")
cat("Mean of sample means:", round(mean(age_means), 2), "years\n")
cat("SD of sample means:  ", round(sd(age_means), 2), "years\n")

ggplot(data.frame(xmean = age_means), aes(x = xmean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "steelblue", color = "black", alpha = 0.7) +
  stat_function(fun = dnorm, args = list(mean = 24.13, sd = 18.61/sqrt(25)),
                color = "red", linewidth = 1.5) +
  labs(x = "Sample Mean Age (years)", y = "Density",
       title = "CLT in Action: Normal Sample Means from a Skewed Population") +
  theme_minimal()

# =============================================================================
# STEP 4: Standard error — how sample size affects precision
# =============================================================================
# SE = σ/√n: to halve the SE, you must quadruple the sample size
sigma <- 0.5

se_table <- data.frame(
  n = c(10, 30, 100, 400, 1000),
  SE = sigma / sqrt(c(10, 30, 100, 400, 1000)),
  Var_Xbar = sigma^2 / c(10, 30, 100, 400, 1000)
)
print(se_table, row.names = FALSE)

# =============================================================================
# STEP 5: Monte Carlo simulation — verify the theory computationally
# =============================================================================
# Simulate 1000 samples of 30 coin tosses to see the CLT converge
set.seed(10101)
n_sims <- 1000
sample_size <- 30

# replicate() runs an expression n_sims times and collects results
sim_means <- replicate(n_sims, mean(rbinom(sample_size, 1, 0.5)))

cat("\nMonte Carlo simulation (", n_sims, "samples, n =", sample_size, "):\n")
cat("Mean of simulated means:", round(mean(sim_means), 4), " (theoretical: 0.5)\n")
cat("SD of simulated means:  ", round(sd(sim_means), 4),
    " (theoretical:", round(sqrt(0.25/30), 4), ")\n")

ggplot(data.frame(x = sim_means), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "steelblue", color = "white", alpha = 0.7) +
  stat_function(fun = dnorm, args = list(mean = 0.5, sd = sqrt(0.25/30)),
                color = "red", linewidth = 1.5) +
  labs(x = "Sample Mean", y = "Density",
       title = "Monte Carlo Simulation vs Theoretical Sampling Distribution") +
  theme_minimal()

# =============================================================================
# STEP 6: Weighted means — correcting for nonrepresentative samples
# =============================================================================
# When inclusion probabilities differ, the unweighted mean is biased;
# inverse-probability weights w_i = 1/π_i recover the true population mean
set.seed(42)
income_men   <- rnorm(50, mean = 60000, sd = 15000)
income_women <- rnorm(50, mean = 50000, sd = 15000)
true_pop_mean <- (mean(income_men) + mean(income_women)) / 2

# Biased sample: oversample women (70% women, 30% men)
sample_men   <- sample(income_men, 15, replace = FALSE)
sample_women <- sample(income_women, 35, replace = FALSE)
sample_income <- c(sample_men, sample_women)

# Unweighted mean is biased toward the oversampled group
unweighted <- mean(sample_income)

# Weighted mean with IPW: w_i = 1/π_i corrects the imbalance
weights <- c(rep(1/0.3, 15), rep(1/0.7, 35))
weighted_mean <- weighted.mean(sample_income, weights)

cat("\nWeighted vs Unweighted Means:\n")
cat("True population mean: $", round(true_pop_mean, 0), "\n")
cat("Unweighted mean:      $", round(unweighted, 0), "\n")
cat("Weighted mean (IPW):  $", round(weighted_mean, 0), "\n")
Paste into your R console or RStudio