Chapter 2 of 18 · Interactive Dashboard

Univariate Data Summary

Slide, toggle, and compare to build intuition for summary statistics, histograms, box plots, transformations, and time-series smoothing — using the same datasets and examples as the chapter.

Summary statistics & the mean–median gap

What is a typical worker's earnings? One common answer is "the average." But averages can mislead when a few people earn much more than the rest. Which number should you trust — the mean or the median?

Summary statistics condense a dataset into a few interpretable numbers. They describe the center (mean, median) and the spread (standard deviation, quartiles). The median is more robust to outliers than the mean, which makes it preferred for skewed data like incomes and wealth.
What you can do here
  • Pick a dataset — earnings (skewed), GDP (roughly symmetric), or home sales.
  • Watch the mean and median update together in the stats cards and on the chart.
  • Drag the outlier slider (earnings only) to add one fake high earner and see who moves.
Try this
  1. Add one fake high earner. On the earnings dataset, drag the slider to $500k. The mean jumps by thousands; the median barely moves. That is what "robust to outliers" means.
  2. Switch to U.S. real GDP per capita. The mean and median are almost identical. The distribution over time is close to symmetric — no long tail pulling the mean.
  3. Switch to monthly home sales. The mean sits above the median again. Housing markets have big boom months but floors in the bust, so the right tail pulls the mean up.

Take-away: when the mean and median disagree, the data is skewed — and the median usually gives the more honest summary. Read §2.1 in the chapter →
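The slider experiment reproduces in a few lines of Python. The numbers below are a synthetic mini-sample for illustration, not the chapter's earnings file:

```python
import numpy as np

# Synthetic mini-sample of annual earnings (illustration only)
earnings = np.array([22_000, 28_000, 31_000, 36_000, 40_000, 47_000, 95_000])
print(f"mean = ${earnings.mean():,.0f}, median = ${np.median(earnings):,.0f}")

# Add one fake high earner, as the slider does
with_outlier = np.append(earnings, 500_000)
print(f"mean = ${with_outlier.mean():,.0f}, median = ${np.median(with_outlier):,.0f}")
# The mean jumps by tens of thousands; the median moves from $36k to $38k.
```

One extreme value drags the mean far to the right while the median shifts only one rank — exactly the "robust to outliers" behavior the widget demonstrates.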

Histograms & kernel density — bin width matters

Before we fit any model, we need to see the shape of the data. Where do most observations cluster? Is there a long tail? A histogram answers these questions — but only if we pick the right bin width.

A histogram shows the distribution of a variable by grouping observations into bins. The bin width is a choice, not a fact in the data. Narrow bins reveal fine detail (and noise). Wide bins smooth everything out. A kernel density estimate (KDE) sidesteps the choice by smoothing across all bin edges at once.
What you can do here
  • Pick a dataset to see how the shape changes across economic variables.
  • Slide the bin width from narrow to wide.
  • Overlay a KDE to check which peaks are real and which are just bin artifacts.
  • Toggle mean and median markers to spot skew at a glance.
Try this
  1. Slide to the narrowest bin width. Watch the spikes appear. Many earnings cluster on round numbers ($20k, $30k, $40k) — a reporting artifact, not a feature of the economy.
  2. Slide to the widest bin. The spikes vanish, but so do real peaks. Oversmoothing hides information; undersmoothing invents it.
  3. Turn on the KDE overlay. The smooth curve does not depend on any bin choice. If a histogram peak survives the KDE, it is probably real.

Take-away: bin width is a dial you tune — move it until the shape is clear but not misleading. Read §2.2 in the chapter →
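The bin-width experiment is easy to replay in code. This sketch uses synthetic right-skewed data drawn from a lognormal distribution, not the chapter's earnings file:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=10.4, sigma=0.6, size=500)   # synthetic skewed "earnings"

# Same data, three bin choices: the histogram's shape depends on the width you pick
for bins in (5, 20, 80):
    counts, edges = np.histogram(x, bins=bins)
    print(f"{bins:2d} bins: width ${edges[1] - edges[0]:>9,.0f}, tallest bar {counts.max():3d} obs")

# A KDE smooths without bin edges; its peak does not depend on any bin choice
kde = stats.gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 200)
print(f"KDE peak near ${grid[kde(grid).argmax()]:,.0f}")
```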

Box plot & the IQR rule for outliers

Which observations should count as "outliers"? There is no universal answer. The standard rule is a convention, not a physical law — and if you move the dial, the set of outliers changes.

A box plot summarizes the middle 50% of the data in one picture. The box spans the interquartile range (IQR) — from the 25th to the 75th percentile. The line inside is the median. Whiskers extend up to 1.5 × IQR past each quartile. Points beyond the whiskers are flagged as potential outliers.
What you can do here
  • Pick a dataset to compare outlier patterns.
  • Slide the IQR multiplier from 1.0 (strict) to 3.0 (lenient) and watch the outlier list change.
Try this
  1. Earnings, multiplier 1.5 → 3.0. At 1.5 several high earners are flagged. At 3.0 most of them vanish. The data did not change — you did.
  2. Switch to real GDP per capita. Almost nothing gets flagged. A smooth, trending time series has a wide middle and short tails under the IQR rule.
  3. Think about reporting. A point flagged at 1.5× but not at 3× is a judgment call. Good practice: report the rule you used.

Take-away: outliers are defined by a rule you choose. Always state the rule before labeling anything "unusual." Read §2.2 in the chapter →
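The multiplier slider boils down to Tukey's fence rule, which takes only a few lines. `iqr_outliers` is a helper defined here for illustration (not a library function), applied to a synthetic skewed sample:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Return points beyond k * IQR outside the quartiles (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    fence = k * (q3 - q1)
    return x[(x < q1 - fence) | (x > q3 + fence)]

rng = np.random.default_rng(1)
x = rng.lognormal(mean=10, sigma=0.7, size=300)   # synthetic skewed sample

for k in (1.0, 1.5, 3.0):
    print(f"multiplier {k}: {len(iqr_outliers(x, k))} points flagged")
# Same data, different rule, different outlier list -- always report your k.
```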

Time-series line chart — trend, cycles, and recessions

Cross-section data tells us who. Time-series data tells us when. Recessions, booms, oil shocks, and pandemics all leave visible marks — if we know how to look.

A time series is ordered in time, so neighbouring values are related. A line chart shows the path. Three complementary views: level (the raw values), log (whether growth is steady), and growth rate (quarter-on-quarter change). Overlaying recession shading anchors the series to real events.
What you can do here
  • Switch the view — level, log, or growth rate — to ask a different question.
  • Toggle recession shading to overlay NBER-dated U.S. recessions.
Level tells the growth story · Log shows whether growth is steady (straight line) or shifting · Growth tells the volatility story. Three views of the same quarterly data.
Try this
  1. Switch between Level and Log. Both look near-linear from 1959–2020. Growth was already roughly steady. The famous "hockey stick" lies earlier in history (see the log widget below).
  2. Switch to Growth. The picture changes completely. The 2008 Great Recession and the 2020-Q2 COVID crash become the two deepest downward spikes.
  3. Turn on recession shading. Every dip in Level and Growth lines up with a shaded NBER recession — evidence that recessions are macro-visible, not just statistical labels.

Take-away: one series, three views — each answers a different question. Always ask which view matches your question. Read §2.2 in the chapter →
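The three views are just three transformations of one pandas Series. A sketch on a synthetic quarterly series that grows a steady 0.5% per quarter (so the log view should be near-linear):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
quarters = pd.period_range("1959Q1", periods=244, freq="Q")
gdp = pd.Series(20_000 * 1.005 ** np.arange(244) * np.exp(rng.normal(0, 0.004, 244)),
                index=quarters)                 # synthetic "real GDP per capita"

level = gdp                                     # view 1: the raw path
log_level = np.log(gdp)                         # view 2: straight line <=> steady growth
growth = gdp.pct_change() * 100                 # view 3: quarter-on-quarter % change

print(f"log slope: {(log_level.iloc[-1] - log_level.iloc[0]) / 243:.4f} per quarter"
      f"  (ln(1.005) = {np.log(1.005):.4f})")
print(f"growth-rate sd: {growth.std():.2f} percentage points")
```

Because growth is constant by construction, the log view's slope recovers ln(1.005) almost exactly; on real data, changes in that slope are the interesting signal.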

Charts for categorical data — bar vs. pie, sorted vs. alphabetical

Bars or pies? The answer is almost always bars. Your eyes compare lengths much faster than angles — and categorical data has no natural order, so you choose one. That choice shapes the story.

Bar charts compare categories by length; pie charts compare by angle. Length comparisons are easier for the human eye, which is why bar charts win for anything past three categories. A sorted bar chart makes rankings instant. An alphabetical bar chart hides them.
What you can do here
  • Pick a dataset — 13 health-expenditure categories or 4 fishing-site categories.
  • Switch chart type between bar and pie.
  • Switch the sort order between value (ranked) and alphabetical.
Try this
  1. Health expenditures: bar vs. pie. Which chart makes it obvious that Hospital spending is about 60% larger than Physician spending? With 13 slices, the pie barely helps.
  2. Switch the sort to alphabetical. Same numbers, worse picture. Ranking now takes effort. Sort order is a design choice that carries information.
  3. Try the fishing dataset (4 categories). With so few slices a pie is still legible. The number of categories matters as much as the chart type.

Take-away: default to a sorted bar chart. Reach for a pie only with three or four categories. Read §2.3–2.4 in the chapter →
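A side-by-side of the two sort orders takes only a few lines of pandas and matplotlib. The category shares below are hypothetical placeholders, not the chapter's actual figures:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical expenditure shares (%), for illustration only
spend = pd.Series({"Hospital": 38.9, "Physician": 24.1, "Drugs": 12.3,
                   "Other": 12.2, "Nursing": 8.0, "Dental": 4.5})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
spend.sort_index().plot.bar(ax=ax1, color="steelblue")              # alphabetical: ranking hidden
ax1.set_title("Alphabetical")
spend.sort_values(ascending=False).plot.bar(ax=ax2, color="coral")  # sorted: ranking instant
ax2.set_title("Sorted by value")
plt.tight_layout()
plt.show()
```

Same numbers in both panels; only the sort order differs, and only the right panel makes the ranking readable at a glance.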

Log transformation — taming skew and linearizing exponential growth

When most observations are small and a few are huge, a histogram looks like a wall with a long tail. Exponential growth over two centuries looks like a vertical cliff. The natural log rebalances both views — revealing shape where the raw data hides it.

Taking the log of a variable compresses the big values and stretches the small ones. That has two concrete payoffs: (1) a right-skewed cross-section often becomes close to symmetric, which is friendlier for the statistics that come later; and (2) an exponential time series becomes a straight line, where the slope reads directly as the growth rate.
What you can do here
  • Compare four charts side by side — this widget has no controls, only observations to make.
  • Panel A: the same earnings data shown raw (left) vs. after ln() (right).
  • Panel B: 200 years of U.S. real GDP per capita shown raw (left) vs. on a log scale (right).
  • Watch the skewness number on the earnings histograms, and the shape of the GDP curve.

A · Cross-section: earnings before and after ln()

B · Time series: U.S. real GDP per capita (the hockey stick)

Try this
  1. Compare the two earnings histograms. Skewness drops from about 1.70 on the raw data toward zero after ln(). The long right tail flattens; the shape becomes close to symmetric.
  2. Find the take-off on the linear GDP chart. The series is nearly flat for decades, then lifts off around the 1870s — the Industrial Revolution. Before that, income per person barely moved.
  3. Look at the same data on the log chart. It becomes nearly a straight line. A straight log line means constant percentage growth (~1.5–1.8%/year). The hockey stick on the raw chart is just what compounding looks like.
  4. Spot the kink in the log slope. The slope steepens around 1870 — that is the onset of modern growth. The linear chart buries this signal; the log chart puts it in plain view.

Take-away: when data spans many orders of magnitude, reach for logs. Skewed cross-sections become symmetric; exponential time series become lines. Read §2.5 in the chapter →
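The "slope of the log line equals the growth rate" reading can be verified directly. A sketch with a hypothetical income series growing exactly 1.7% per year:

```python
import numpy as np

# Hypothetical income series with constant 1.7%/year growth
years = np.arange(150)
y = 2_000 * 1.017 ** years

# ln(y_t) = ln(y_0) + t * ln(1.017): a straight line whose slope is the growth rate
slope = np.polyfit(years, np.log(y), 1)[0]
print(f"fitted log slope = {slope:.5f}, ln(1.017) = {np.log(1.017):.5f}")
# On a log chart, a steeper segment simply means faster percentage growth.
```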

Z-scores — how unusual is this observation?

Is $100,000 a high income? The word high has no meaning by itself. High compared to whom? Standardization gives us an answer: how many standard deviations away from the mean?

A z-score rescales any observation onto a common ruler. The formula is simple: z = (x − mean) / sd. It measures distance from the mean in units of standard deviation. For bell-shaped data, three rules of thumb:
  • About 68% of observations fall within ±1.
  • About 95% within ±2.
  • About 99.7% within ±3.
What you can do here
  • Drag the slider to pick any earnings value.
  • Read the z-score and plain-language interpretation live.
  • Watch the chart show where your pick sits on the distribution.
Try this
  1. Drag to $36,000 (the median). The z-score lands near zero. This earner is typical — right at the center of the distribution.
  2. Drag to $100,000. The z-score is around +2, which puts this earner in the top ~2.5% by the rule of thumb. Rare, but not impossibly so.
  3. Drag to the maximum (~$172,000). The z-score is well above +3. Under the 3σ rule of thumb this is a clear outlier — fewer than one in a hundred observations would land this high in a bell-shaped distribution.

Take-away: z-scores turn any value into a distance from the mean on a common scale. "Rare" becomes a number, not a feeling. Read §2.5 in the chapter →
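The 68-95-99.7 rules of thumb can be checked empirically on a simulated bell-shaped sample (synthetic data, not the earnings file):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=50_000, scale=20_000, size=100_000)  # synthetic bell-shaped "incomes"
z = (x - x.mean()) / x.std()                            # standardize onto the z ruler

for k in (1, 2, 3):
    share = (np.abs(z) <= k).mean() * 100
    print(f"within +/-{k} sd: {share:.1f}%")            # expect ~68.3, ~95.4, ~99.7
```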

Moving averages & seasonal adjustment — finding the trend

Monthly home sales zigzag with the seasons — peaks every summer, troughs every winter. But underneath the zigzag there is a trend. How do we reveal it?

A moving average smooths a time series by averaging several consecutive observations. A centred window of about one year (11 or 13 months, so the window has a middle point) cancels out the seasonal cycle, leaving the trend visible. Too narrow a window keeps too much noise. Too wide a window smears away real events — like the 2007–2011 housing bust.
What you can do here
  • Slide the window width from 1 (raw data) to 24 months.
  • Toggle a seasonally-adjusted series to compare against the moving-average smooth.
  • Toggle recession shading to see how the trend responds to macro shocks.
Try this
  1. Set the window to 1. This is the raw series. Seasonal swings dominate; the trend is buried.
  2. Set the window to 11. Close to one seasonal cycle. The seasonal swings cancel. The 2007–2011 housing bust now leaps out.
  3. Set the window to 24. Oversmoothed. The start of the crash is blurred. More smoothing is not always better — there is an optimal window tied to the cycle you want to remove.

Take-away: a moving average is a dial between noise and signal — match the window to the cycle you want to cancel. Read §2.6 in the chapter →
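The window-width trade-off is easy to see with pandas `Series.rolling(center=True)`. The series below is synthetic (linear trend + 12-month seasonal swing + noise), so we know the true trend and can measure how much swing each window leaves behind:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t = np.arange(120)
trend = 400 + 0.5 * t                              # known underlying trend
seasonal = 60 * np.sin(2 * np.pi * t / 12)         # 12-month cycle
s = pd.Series(trend + seasonal + rng.normal(0, 10, 120),
              index=pd.period_range("2005-01", periods=120, freq="M"))

# Centred rolling means: an 11-point window nearly cancels the 12-month cycle
for w in (3, 11, 23):
    smooth = s.rolling(window=w, center=True).mean()
    leftover = (smooth - trend).std()              # swing surviving the smooth
    print(f"window {w:2d}: leftover swing sd = {leftover:5.1f}")
```

The narrow window keeps most of the seasonal swing; the year-wide window cancels it. This residual measure alone does not show the cost of oversmoothing (blurred turning points), which is why the widget matters.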

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 2 CHEAT SHEET: Univariate Data Summary
# =============================================================================

# --- Libraries ---
import numpy as np                        # numerical operations (log, mean)
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from scipy import stats                   # skewness, kurtosis, distribution shape

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files; this dataset has 171 observations
url_earnings = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_EARNINGS.DTA"
data_earnings = pd.read_stata(url_earnings)

earnings = data_earnings['earnings']
print(f"Dataset: {data_earnings.shape[0]} observations, {data_earnings.shape[1]} variables")

# =============================================================================
# STEP 2: Summary statistics — mean vs median reveals skewness
# =============================================================================
# .describe() gives count, mean, std, min, quartiles, max in one call
print(data_earnings[['earnings']].describe().round(2))

# Skewness and kurtosis measure the shape of the distribution
print(f"\nSkewness:        {stats.skew(earnings):.2f}  (> 1 = strongly right-skewed)")
print(f"Excess kurtosis: {stats.kurtosis(earnings):.2f}  (> 0 = heavier tails than normal)")
print(f"Mean - Median:   ${earnings.mean() - earnings.median():,.0f}  (positive gap signals right skew)")

# =============================================================================
# STEP 3: Histogram with KDE overlay — see the distribution shape
# =============================================================================
# Bin width is a choice: narrower = more detail (and noise), wider = smoother
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(earnings, bins=20, edgecolor='black', alpha=0.7, density=True, label='Histogram')
earnings.plot.kde(ax=ax, linewidth=2, color='red', label='KDE')
ax.set_xlabel('Annual Earnings ($)')
ax.set_ylabel('Density')
ax.set_title('Earnings Distribution: Histogram + Kernel Density Estimate')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Box plot — visualize quartiles and outliers
# =============================================================================
# The box spans Q1 to Q3 (IQR); whiskers extend 1.5×IQR; dots are outliers
fig, ax = plt.subplots(figsize=(10, 4))
ax.boxplot(earnings, vert=False, patch_artist=True,
           boxprops=dict(facecolor='lightblue', alpha=0.7),
           medianprops=dict(color='red', linewidth=2))
ax.set_xlabel('Annual Earnings ($)')
ax.set_title('Box Plot of Earnings — Median, Quartiles, and Outliers')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 5: Log transformation — taming right skew
# =============================================================================
# np.log() compresses big values and stretches small ones, making skewed
# distributions more symmetric — a prerequisite for many statistical methods
data_earnings['lnearnings'] = np.log(earnings)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(earnings, bins=20, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_title(f'Original  (skewness = {stats.skew(earnings):.2f})')
axes[0].set_xlabel('Earnings ($)')

axes[1].hist(data_earnings['lnearnings'], bins=20, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_title(f'Log-transformed  (skewness = {stats.skew(data_earnings["lnearnings"]):.2f})')
axes[1].set_xlabel('ln(Earnings)')

plt.suptitle('Effect of Log Transformation on Skewness', fontweight='bold')
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Z-scores — how unusual is each observation?
# =============================================================================
# z = (x - mean) / std  puts every value on a common "standard deviations
# from the mean" scale: |z| > 2 is unusual, |z| > 3 is very unusual
z_scores = (earnings - earnings.mean()) / earnings.std()

print(f"Highest earner: ${earnings.max():,.0f}  →  z = {z_scores.max():.2f}")
print(f"Median earner:  ${earnings.median():,.0f}  →  z = {(earnings.median() - earnings.mean()) / earnings.std():.2f}")
print(f"Observations with |z| > 2: {(z_scores.abs() > 2).sum()} out of {len(z_scores)}")

# =============================================================================
# STEP 7: Time series — moving average smooths seasonal noise
# =============================================================================
# Monthly home sales zigzag with the seasons; an 11-month moving average
# cancels one full seasonal cycle, revealing the underlying trend
url_homesales = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_MONTHLYHOMESALES.DTA"
data_hs = pd.read_stata(url_homesales)
data_hs = data_hs[data_hs['year'] >= 2005]

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(data_hs['daten'], data_hs['exsales'], linewidth=1, alpha=0.6, label='Original (monthly)')
ax.plot(data_hs['daten'], data_hs['exsales_ma11'], linewidth=2, color='red',
        linestyle='--', label='11-month Moving Average')
ax.set_xlabel('Year')
ax.set_ylabel('Monthly Home Sales')
ax.set_title('U.S. Home Sales: Raw Series vs. Moving Average (2005–2015)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →

Keep learning

You have used every widget. The full chapter covers everything here plus case studies (cross-country distributions, convergence, spatial data) that are not in the dashboard.

Read the full Chapter 2 →