Chapter 2 of 18 · Interactive Dashboard

Univariate Data Summary

Slide, toggle, and compare to build intuition for summary statistics, histograms, box plots, transformations, and time-series smoothing — using the same datasets and examples as the chapter.

Summary statistics & the mean–median gap

What is a typical worker's earnings? One common answer is "the average." But averages can mislead when a few people earn much more than the rest. Which number should you trust — the mean or the median?

Summary statistics condense a dataset into a few interpretable numbers. They describe the center (mean, median) and the spread (standard deviation, quartiles). The median is more robust to outliers than the mean, which makes it preferred for skewed data like incomes and wealth.
What you can do here
  • Pick a dataset — earnings (skewed), GDP (roughly symmetric), or home sales.
  • Watch the mean and median update together in the stats cards and on the chart.
  • Drag the outlier slider (earnings only) to add one fake high earner and see who moves.
Try this
  1. Add one fake high earner. On the earnings dataset, drag the slider to $500k. The mean jumps by thousands; the median barely moves. That is what "robust to outliers" means.
  2. Switch to U.S. real GDP per capita. The mean and median are almost identical. The distribution over time is close to symmetric — no long tail pulling the mean.
  3. Switch to monthly home sales. The mean sits above the median again. Housing markets have big boom months but floors in the bust, so the right tail pulls the mean up.

Take-away: when the mean and median disagree, the data is skewed — and the median usually gives the more honest summary. Read §2.1 in the chapter →
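The slider experiment reproduces in a few lines of Python. The numbers below are a synthetic mini-sample for illustration, not the chapter's earnings file:

```python
import numpy as np

# Synthetic mini-sample of annual earnings (illustration only)
earnings = np.array([22_000, 28_000, 31_000, 36_000, 40_000, 47_000, 95_000])
print(f"mean = ${earnings.mean():,.0f}, median = ${np.median(earnings):,.0f}")

# Add one fake high earner, as the slider does
with_outlier = np.append(earnings, 500_000)
print(f"mean = ${with_outlier.mean():,.0f}, median = ${np.median(with_outlier):,.0f}")
# The mean jumps by tens of thousands; the median moves from $36k to $38k.
```

One extreme value drags the mean far to the right while the median shifts only one rank — exactly the "robust to outliers" behavior the widget demonstrates.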

Histograms & kernel density — bin width matters

Before we fit any model, we need to see the shape of the data. Where do most observations cluster? Is there a long tail? A histogram answers these questions — but only if we pick the right bin width.

A histogram shows the distribution of a variable by grouping observations into bins. The bin width is a choice, not a fact in the data. Narrow bins reveal fine detail (and noise). Wide bins smooth everything out. A kernel density estimate (KDE) sidesteps the choice by smoothing across all bin edges at once.
What you can do here
  • Pick a dataset to see how the shape changes across economic variables.
  • Slide the bin width from narrow to wide.
  • Overlay a KDE to check which peaks are real and which are just bin artifacts.
  • Toggle mean and median markers to spot skew at a glance.
Try this
  1. Slide to the narrowest bin width. Watch the spikes appear. Many earnings cluster on round numbers ($20k, $30k, $40k) — a reporting artifact, not a feature of the economy.
  2. Slide to the widest bin. The spikes vanish, but so do real peaks. Oversmoothing hides information; undersmoothing invents it.
  3. Turn on the KDE overlay. The smooth curve does not depend on any bin choice. If a histogram peak survives the KDE, it is probably real.

Take-away: bin width is a dial you tune — move it until the shape is clear but not misleading. Read §2.2 in the chapter →
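The bin-width experiment is easy to replay in code. This sketch uses synthetic right-skewed data drawn from a lognormal distribution, not the chapter's earnings file:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=10.4, sigma=0.6, size=500)   # synthetic skewed "earnings"

# Same data, three bin choices: the histogram's shape depends on the width you pick
for bins in (5, 20, 80):
    counts, edges = np.histogram(x, bins=bins)
    print(f"{bins:2d} bins: width ${edges[1] - edges[0]:>9,.0f}, tallest bar {counts.max():3d} obs")

# A KDE smooths without bin edges; its peak does not depend on any bin choice
kde = stats.gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 200)
print(f"KDE peak near ${grid[kde(grid).argmax()]:,.0f}")
```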

Box plot & the IQR rule for outliers

Which observations should count as "outliers"? There is no universal answer. The standard rule is a convention, not a physical law — and if you move the dial, the set of outliers changes.

A box plot summarizes the middle 50% of the data in one picture. The box spans the interquartile range (IQR) — from the 25th to the 75th percentile. The line inside is the median. Whiskers extend up to 1.5 × IQR past each quartile. Points beyond the whiskers are flagged as potential outliers.
What you can do here
  • Pick a dataset to compare outlier patterns.
  • Slide the IQR multiplier from 1.0 (strict) to 3.0 (lenient) and watch the outlier list change.
Try this
  1. Earnings, multiplier 1.5 → 3.0. At 1.5 several high earners are flagged. At 3.0 most of them vanish. The data did not change — you did.
  2. Switch to real GDP per capita. Almost nothing gets flagged. A smooth, trending time series has a wide middle and short tails under the IQR rule.
  3. Think about reporting. A point flagged at 1.5× but not at 3× is a judgment call. Good practice: report the rule you used.

Take-away: outliers are defined by a rule you choose. Always state the rule before labeling anything "unusual." Read §2.2 in the chapter →
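The multiplier slider boils down to Tukey's fence rule, which takes only a few lines. `iqr_outliers` is a helper defined here for illustration (not a library function), applied to a synthetic skewed sample:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Return points beyond k * IQR outside the quartiles (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    fence = k * (q3 - q1)
    return x[(x < q1 - fence) | (x > q3 + fence)]

rng = np.random.default_rng(1)
x = rng.lognormal(mean=10, sigma=0.7, size=300)   # synthetic skewed sample

for k in (1.0, 1.5, 3.0):
    print(f"multiplier {k}: {len(iqr_outliers(x, k))} points flagged")
# Same data, different rule, different outlier list -- always report your k.
```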

Time-series line chart — trend, cycles, and recessions

Cross-section data tells us who. Time-series data tells us when. Recessions, booms, oil shocks, and pandemics all leave visible marks — if we know how to look.

A time series is ordered in time, so neighbouring values are related. A line chart shows the path. Three complementary views: level (the raw values), log (whether growth is steady), and growth rate (quarter-on-quarter change). Overlaying recession shading anchors the series to real events.
What you can do here
  • Switch the view — level, log, or growth rate — to ask a different question.
  • Toggle recession shading to overlay NBER-dated U.S. recessions.
Level tells the growth story · Log shows whether growth is steady (straight line) or shifting · Growth tells the volatility story. Three views of the same quarterly data.
Try this
  1. Switch between Level and Log. Both look near-linear from 1959–2020. Growth was already roughly steady. The famous "hockey stick" lies earlier in history (see the log widget below).
  2. Switch to Growth. The picture changes completely. The 2008 Great Recession and the 2020-Q2 COVID crash become the two deepest downward spikes.
  3. Turn on recession shading. Every dip in Level and Growth lines up with a shaded NBER recession — evidence that recessions are macro-visible, not just statistical labels.

Take-away: one series, three views — each answers a different question. Always ask which view matches your question. Read §2.2 in the chapter →
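The three views are just three transformations of one pandas Series. A sketch on a synthetic quarterly series that grows a steady 0.5% per quarter (so the log view should be near-linear):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
quarters = pd.period_range("1959Q1", periods=244, freq="Q")
gdp = pd.Series(20_000 * 1.005 ** np.arange(244) * np.exp(rng.normal(0, 0.004, 244)),
                index=quarters)                 # synthetic "real GDP per capita"

level = gdp                                     # view 1: the raw path
log_level = np.log(gdp)                         # view 2: straight line <=> steady growth
growth = gdp.pct_change() * 100                 # view 3: quarter-on-quarter % change

print(f"log slope: {(log_level.iloc[-1] - log_level.iloc[0]) / 243:.4f} per quarter"
      f"  (ln(1.005) = {np.log(1.005):.4f})")
print(f"growth-rate sd: {growth.std():.2f} percentage points")
```

Because growth is constant by construction, the log view's slope recovers ln(1.005) almost exactly; on real data, changes in that slope are the interesting signal.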

Charts for categorical data — bar vs. pie, sorted vs. alphabetical

Bars or pies? The answer is almost always bars. Your eyes compare lengths much faster than angles — and categorical data has no natural order, so you choose one. That choice shapes the story.

Bar charts compare categories by length; pie charts compare by angle. Length comparisons are easier for the human eye, which is why bar charts win for anything past three categories. A sorted bar chart makes rankings instant. An alphabetical bar chart hides them.
What you can do here
  • Pick a dataset — 13 health-expenditure categories or 4 fishing-site categories.
  • Switch chart type between bar and pie.
  • Switch the sort order between value (ranked) and alphabetical.
Try this
  1. Health expenditures: bar vs. pie. Which chart makes it obvious that Hospital spending is about 60% larger than Physician spending? With 13 slices, the pie barely helps.
  2. Switch the sort to alphabetical. Same numbers, worse picture. Ranking now takes effort. Sort order is a design choice that carries information.
  3. Try the fishing dataset (4 categories). With so few slices a pie is still legible. The number of categories matters as much as the chart type.

Take-away: default to a sorted bar chart. Reach for a pie only with three or four categories. Read §2.3–2.4 in the chapter →
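A side-by-side of the two sort orders takes only a few lines of pandas and matplotlib. The category shares below are hypothetical placeholders, not the chapter's actual figures:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical expenditure shares (%), for illustration only
spend = pd.Series({"Hospital": 38.9, "Physician": 24.1, "Drugs": 12.3,
                   "Other": 12.2, "Nursing": 8.0, "Dental": 4.5})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
spend.sort_index().plot.bar(ax=ax1, color="steelblue")              # alphabetical: ranking hidden
ax1.set_title("Alphabetical")
spend.sort_values(ascending=False).plot.bar(ax=ax2, color="coral")  # sorted: ranking instant
ax2.set_title("Sorted by value")
plt.tight_layout()
plt.show()
```

Same numbers in both panels; only the sort order differs, and only the right panel makes the ranking readable at a glance.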

Log transformation — taming skew and linearizing exponential growth

When most observations are small and a few are huge, a histogram looks like a wall with a long tail. Exponential growth over two centuries looks like a vertical cliff. The natural log rebalances both views — revealing shape where the raw data hides it.

Taking the log of a variable compresses the big values and stretches the small ones. That has two concrete payoffs: (1) a right-skewed cross-section often becomes close to symmetric, which is friendlier for the statistics that come later; and (2) an exponential time series becomes a straight line, where the slope reads directly as the growth rate.
What you can do here
  • Compare four charts side by side — this widget has no controls, only observations to make.
  • Panel A: the same earnings data shown raw (left) vs. after ln() (right).
  • Panel B: 200 years of U.S. real GDP per capita shown raw (left) vs. on a log scale (right).
  • Watch the skewness number on the earnings histograms, and the shape of the GDP curve.

A · Cross-section: earnings before and after ln()

B · Time series: U.S. real GDP per capita (the hockey stick)

Try this
  1. Compare the two earnings histograms. Skewness drops from about 1.70 on the raw data toward zero after ln(). The long right tail flattens; the shape becomes close to symmetric.
  2. Find the take-off on the linear GDP chart. The series is nearly flat for decades, then lifts off around the 1870s — the Industrial Revolution. Before that, income per person barely moved.
  3. Look at the same data on the log chart. It becomes nearly a straight line. A straight log line means constant percentage growth (~1.5–1.8%/year). The hockey stick on the raw chart is just what compounding looks like.
  4. Spot the kink in the log slope. The slope steepens around 1870 — that is the onset of modern growth. The linear chart buries this signal; the log chart puts it in plain view.

Take-away: when data spans many orders of magnitude, reach for logs. Skewed cross-sections become symmetric; exponential time series become lines. Read §2.5 in the chapter →
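The "slope of the log line equals the growth rate" reading can be verified directly. A sketch with a hypothetical income series growing exactly 1.7% per year:

```python
import numpy as np

# Hypothetical income series with constant 1.7%/year growth
years = np.arange(150)
y = 2_000 * 1.017 ** years

# ln(y_t) = ln(y_0) + t * ln(1.017): a straight line whose slope is the growth rate
slope = np.polyfit(years, np.log(y), 1)[0]
print(f"fitted log slope = {slope:.5f}, ln(1.017) = {np.log(1.017):.5f}")
# On a log chart, a steeper segment simply means faster percentage growth.
```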

Z-scores — how unusual is this observation?

Is $100,000 a high income? The word high has no meaning by itself. High compared to whom? Standardization gives us an answer: how many standard deviations away from the mean?

A z-score rescales any observation onto a common ruler. The formula is simple: z = (x − mean) / sd. It measures distance from the mean in units of standard deviation. For bell-shaped data, three rules of thumb:
  • About 68% of observations fall within ±1.
  • About 95% within ±2.
  • About 99.7% within ±3.
What you can do here
  • Drag the slider to pick any earnings value.
  • Read the z-score and plain-language interpretation live.
  • Watch the chart show where your pick sits on the distribution.
Try this
  1. Drag to $36,000 (the median). The z-score lands near zero. This earner is typical — right at the center of the distribution.
  2. Drag to $100,000. The z-score is around +2, which puts this earner in the top ~2.5% by the rule of thumb. Rare, but not impossibly so.
  3. Drag to the maximum (~$172,000). The z-score is well above +3. Under the 3σ rule of thumb this is a clear outlier — fewer than one in a hundred observations would land this high in a bell-shaped distribution.

Take-away: z-scores turn any value into a distance from the mean on a common scale. "Rare" becomes a number, not a feeling. Read §2.5 in the chapter →
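The 68-95-99.7 rules of thumb can be checked empirically on a simulated bell-shaped sample (synthetic data, not the earnings file):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=50_000, scale=20_000, size=100_000)  # synthetic bell-shaped "incomes"
z = (x - x.mean()) / x.std()                            # standardize onto the z ruler

for k in (1, 2, 3):
    share = (np.abs(z) <= k).mean() * 100
    print(f"within +/-{k} sd: {share:.1f}%")            # expect ~68.3, ~95.4, ~99.7
```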

Moving averages & seasonal adjustment — finding the trend

Monthly home sales zigzag with the seasons — peaks every summer, troughs every winter. But underneath the zigzag there is a trend. How do we reveal it?

A moving average smooths a time series by averaging several consecutive observations. A centred window of about one year (11 or 13 months, so the window has a middle point) cancels out the seasonal cycle, leaving the trend visible. Too narrow a window keeps too much noise. Too wide a window smears away real events — like the 2007–2011 housing bust.
What you can do here
  • Slide the window width from 1 (raw data) to 24 months.
  • Toggle a seasonally-adjusted series to compare against the moving-average smooth.
  • Toggle recession shading to see how the trend responds to macro shocks.
Try this
  1. Set the window to 1. This is the raw series. Seasonal swings dominate; the trend is buried.
  2. Set the window to 11. Close to one seasonal cycle. The seasonal swings cancel. The 2007–2011 housing bust now leaps out.
  3. Set the window to 24. Oversmoothed. The start of the crash is blurred. More smoothing is not always better — there is an optimal window tied to the cycle you want to remove.

Take-away: a moving average is a dial between noise and signal — match the window to the cycle you want to cancel. Read §2.6 in the chapter →
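The window-width trade-off is easy to see with pandas `Series.rolling(center=True)`. The series below is synthetic (linear trend + 12-month seasonal swing + noise), so we know the true trend and can measure how much swing each window leaves behind:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t = np.arange(120)
trend = 400 + 0.5 * t                              # known underlying trend
seasonal = 60 * np.sin(2 * np.pi * t / 12)         # 12-month cycle
s = pd.Series(trend + seasonal + rng.normal(0, 10, 120),
              index=pd.period_range("2005-01", periods=120, freq="M"))

# Centred rolling means: an 11-point window nearly cancels the 12-month cycle
for w in (3, 11, 23):
    smooth = s.rolling(window=w, center=True).mean()
    leftover = (smooth - trend).std()              # swing surviving the smooth
    print(f"window {w:2d}: leftover swing sd = {leftover:5.1f}")
```

The narrow window keeps most of the seasonal swing; the year-wide window cancels it. This residual measure alone does not show the cost of oversmoothing (blurred turning points), which is why the widget matters.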

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 2 CHEAT SHEET: Univariate Data Summary
# =============================================================================

# --- Libraries ---
import numpy as np                        # numerical operations (log, mean)
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from scipy import stats                   # skewness, kurtosis, distribution shape

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files; this dataset has 171 observations
url_earnings = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_EARNINGS.DTA"
data_earnings = pd.read_stata(url_earnings)

earnings = data_earnings['earnings']
print(f"Dataset: {data_earnings.shape[0]} observations, {data_earnings.shape[1]} variables")

# =============================================================================
# STEP 2: Summary statistics — mean vs median reveals skewness
# =============================================================================
# .describe() gives count, mean, std, min, quartiles, max in one call
print(data_earnings[['earnings']].describe().round(2))

# Skewness and kurtosis measure the shape of the distribution
print(f"\nSkewness:        {stats.skew(earnings):.2f}  (> 1 = strongly right-skewed)")
print(f"Excess kurtosis: {stats.kurtosis(earnings):.2f}  (> 0 = heavier tails than normal)")
print(f"Mean - Median:   ${earnings.mean() - earnings.median():,.0f}  (positive gap signals right skew)")

# =============================================================================
# STEP 3: Histogram with KDE overlay — see the distribution shape
# =============================================================================
# Bin width is a choice: narrower = more detail (and noise), wider = smoother
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(earnings, bins=20, edgecolor='black', alpha=0.7, density=True, label='Histogram')
earnings.plot.kde(ax=ax, linewidth=2, color='red', label='KDE')
ax.set_xlabel('Annual Earnings ($)')
ax.set_ylabel('Density')
ax.set_title('Earnings Distribution: Histogram + Kernel Density Estimate')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Box plot — visualize quartiles and outliers
# =============================================================================
# The box spans Q1 to Q3 (IQR); whiskers extend 1.5×IQR; dots are outliers
fig, ax = plt.subplots(figsize=(10, 4))
ax.boxplot(earnings, vert=False, patch_artist=True,
           boxprops=dict(facecolor='lightblue', alpha=0.7),
           medianprops=dict(color='red', linewidth=2))
ax.set_xlabel('Annual Earnings ($)')
ax.set_title('Box Plot of Earnings — Median, Quartiles, and Outliers')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 5: Log transformation — taming right skew
# =============================================================================
# np.log() compresses big values and stretches small ones, making skewed
# distributions more symmetric — a prerequisite for many statistical methods
data_earnings['lnearnings'] = np.log(earnings)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(earnings, bins=20, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].set_title(f'Original  (skewness = {stats.skew(earnings):.2f})')
axes[0].set_xlabel('Earnings ($)')

axes[1].hist(data_earnings['lnearnings'], bins=20, edgecolor='black', alpha=0.7, color='coral')
axes[1].set_title(f'Log-transformed  (skewness = {stats.skew(data_earnings["lnearnings"]):.2f})')
axes[1].set_xlabel('ln(Earnings)')

plt.suptitle('Effect of Log Transformation on Skewness', fontweight='bold')
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Z-scores — how unusual is each observation?
# =============================================================================
# z = (x - mean) / std  puts every value on a common "standard deviations
# from the mean" scale: |z| > 2 is unusual, |z| > 3 is very unusual
z_scores = (earnings - earnings.mean()) / earnings.std()

print(f"Highest earner: ${earnings.max():,.0f}  →  z = {z_scores.max():.2f}")
print(f"Median earner:  ${earnings.median():,.0f}  →  z = {(earnings.median() - earnings.mean()) / earnings.std():.2f}")
print(f"Observations with |z| > 2: {(z_scores.abs() > 2).sum()} out of {len(z_scores)}")

# =============================================================================
# STEP 7: Time series — moving average smooths seasonal noise
# =============================================================================
# Monthly home sales zigzag with the seasons; an 11-month moving average
# cancels one full seasonal cycle, revealing the underlying trend
url_homesales = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_MONTHLYHOMESALES.DTA"
data_hs = pd.read_stata(url_homesales)
data_hs = data_hs[data_hs['year'] >= 2005]

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(data_hs['daten'], data_hs['exsales'], linewidth=1, alpha=0.6, label='Original (monthly)')
ax.plot(data_hs['daten'], data_hs['exsales_ma11'], linewidth=2, color='red',
        linestyle='--', label='11-month Moving Average')
ax.set_xlabel('Year')
ax.set_ylabel('Monthly Home Sales')
ax.set_title('U.S. Home Sales: Raw Series vs. Moving Average (2005–2015)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →

Keep learning

You have used every widget. The full chapter covers everything here plus case studies (cross-country distributions, convergence, spatial data) that are not in the dashboard.

Read the full Chapter 2 →