Log Transformations
Why do statisticians and data analysts reach for the logarithm so often? This dashboard answers that question in three short sections — each with a live plot, a few sliders, and a takeaway. Drag the controls, watch the picture change, and read the numbers update in real time.
1. Right-skewed distributions
Income, wealth, city populations, firm sizes — these distributions all share a long right tail of very large values. The tail pulls the mean far above the median and makes the histogram unreadable. A log transformation rescales the data so that multiplicative differences become additive, and a lognormal distribution becomes a clean, symmetric bell.
Every household as a dot
x ~ Lognormal(μ, σ). Each dot is one household. Notice how the dots cluster densely at low incomes but spread out into a sparse tail of high earners on the right; that long tail pulls the mean (red) far above the median (green).
log(x). Same households, plotted on a log scale. The cluster is now roughly symmetric, and the mean and median lines sit on top of each other.
The same households, summarized as a box plot
| Statistic | Raw scale | Log scale |
|---|---|---|
| Mean | — | — |
| Median | — | — |
| Mean − median gap | — | — |
| Sample skewness | — | — |
Takeaway
A right-skewed lognormal becomes a symmetric bell on the log scale. Mean and median coincide; skewness collapses to ≈ 0. This is why log-income, log-wealth, and log-population are the standard objects of analysis whenever the underlying quantity is multiplicative.
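The statistics in the table above can be reproduced offline with a small simulation. The sketch below is only an illustration: the parameters `mu` and `sigma`, the sample size, the seed, and the helper names `skewness` and `summarize` are all assumptions rather than the dashboard's own settings, and skewness is computed as the mean cubed z-score.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness as the mean cubed z-score (population form)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def summarize(values):
    """Mean, median, mean-minus-median gap, and skewness of a sample."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return {
        "mean": mean,
        "median": median,
        "gap": mean - median,
        "skewness": skewness(values),
    }


rng = random.Random(0)      # fixed seed so the run is repeatable (assumed)
mu, sigma = 10.5, 0.9       # assumed lognormal parameters, chosen for illustration
raw = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
logged = [math.log(x) for x in raw]

raw_stats = summarize(raw)
log_stats = summarize(logged)
for name in ("mean", "median", "gap", "skewness"):
    print(f"{name:>9}: raw={raw_stats[name]:14.3f}  log={log_stats[name]:8.3f}")
```

On the raw scale the gap and the skewness should come out clearly positive; on the log scale both should sit near zero, up to sampling noise.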
2. Outliers and the mean–median gap
When a distribution is right-skewed because of outliers, the mean is pulled above the median by those extreme values: the median barely notices an outlier, but every extreme observation drags the mean up. (This is why news outlets report median household income, not mean.) A log transformation compresses the upper tail and reduces the influence of those outliers, so the distribution becomes more nearly normal and the mean and median move back together. The shrinking gap between them is the visible signature that the transformation worked.
Where does each household sit? Where are the mean and median?
ln(value) on the x-axis. The injected outliers (the same dots) sit much closer to the bulk on this scale, so the mean line barely shifts and stays on top of the median.
The outlier rule — which households does it flag?
The same households, summarized as a box plot
The same box-plot convention as Section 1: the shaded box covers the middle 50% of households, the line inside is the median, and red dots beyond the whiskers are the outliers per Tukey's rule. Watch the box stay anchored on the bulk while red outlier dots stretch out as you inject outliers.
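Tukey's rule, as used in the box plot, flags any value below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A minimal sketch of that rule on both scales follows. The household values, the three injected outliers, and the helper names are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the dashboard's quartile convention.

```python
import math
import random
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower fence, upper fence, IQR) for Tukey's k*IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr, iqr


def flag_outliers(values, k=1.5):
    """Values that fall outside the Tukey fences."""
    lower, upper, _ = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


def excess_in_iqrs(values, k=1.5):
    """How far the largest value sits beyond the upper fence, in IQR units."""
    _, upper, iqr = tukey_fences(values, k)
    return (max(values) - upper) / iqr


rng = random.Random(1)                                    # fixed seed (assumed)
# Hypothetical bulk of households plus three hand-picked injected outliers.
bulk = [rng.lognormvariate(math.log(50_000), 0.25) for _ in range(500)]
injected = [400_000.0, 750_000.0, 1_200_000.0]
raw = bulk + injected
logged = [math.log(v) for v in raw]

raw_flagged = flag_outliers(raw)
log_flagged = flag_outliers(logged)
print("flagged on raw scale:", len(raw_flagged))
print("flagged on log scale:", len(log_flagged))
print("largest value beyond upper fence, in IQRs (raw):", round(excess_in_iqrs(raw), 1))
print("largest value beyond upper fence, in IQRs (log):", round(excess_in_iqrs(logged), 1))
```

Measuring the distance in IQR units makes the two scales comparable: the largest injected value lies far fewer IQRs beyond the fence after the log transformation, which is the "closer to the bulk" effect described above.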
| Headline — mean vs median | Raw scale | Log scale |
|---|---|---|
| Mean | — | — |
| Median | — | — |
| Mean − median gap | — | — |
| Companion diagnostics | Raw scale | Log scale |
|---|---|---|
| Sample skewness | — | — |
| Outliers (Tukey 1.5·IQR rule) | — | — |
| Spread of the sample | — | — |
Takeaway
When a distribution is right-skewed because of outliers, the mean is dragged above the median. The log transformation compresses the upper tail and reduces the influence of those outliers, so the distribution becomes more nearly normal and the mean and median move back together. The shrinking mean–median gap is the diagnostic that the transformation worked. Caveats: the transformation is asymmetric, so values very close to zero can become more extreme on the log scale; and log(0) is undefined, so use log1p or filter out zeros before transforming.
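The two caveats can be seen in a few lines. The values below are made up for illustration, and `safe_log` is a hypothetical helper, not part of any library. Note that `math.log1p(x)` computes log(1 + x), so it is a different transformation from log(x), not just a zero-safe version of it.

```python
import math

# Made-up values: one zero, one tiny value, and three ordinary ones.
values = [0.0, 0.001, 1.0, 10.0, 1000.0]


def safe_log(x):
    """Natural log, or None where it is undefined (x <= 0)."""
    return math.log(x) if x > 0 else None


logs = [safe_log(v) for v in values]
log1ps = [math.log1p(v) for v in values]   # log(1 + x), defined at x = 0

for v, plain, shifted in zip(values, logs, log1ps):
    plain_text = "undefined" if plain is None else f"{plain:9.4f}"
    print(f"x={v:>8}: log(x)={plain_text:>9}  log1p(x)={shifted:8.4f}")
```

On the log scale, 0.001 lands as far below log(1) = 0 as 1000 lands above it, so a value that looked unremarkable on the raw scale can become the most extreme point after transforming.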
3. The hockey stick: log of a time series
U.S. GDP per capita over the last two centuries looks like a hockey stick on a linear axis: nearly flat for a hundred years, then sharply curving up. That shape is misleading — the country has been growing at roughly the same percentage each year. On a log axis, constant percentage growth is a straight line. The slope of log-GDP is the continuous growth rate, and the stability of that slope tells you whether the growth rate itself is steady or shifting.
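The straight-line claim follows from one line of algebra. As a sketch, write the series as x_t, its starting level as x_0, and the constant annual growth rate as g (these symbols are introduced here for illustration and do not appear on the dashboard):

```latex
\begin{align*}
x_t &= x_0\,(1+g)^t \\
\log x_t &= \log x_0 + t\,\log(1+g) \\
r &\equiv \log(1+g)
  \quad\Longleftrightarrow\quad
  x_t = x_0\,e^{rt}, \qquad g = e^{r} - 1
\end{align*}
```

So log x_t is linear in t with slope r, the continuous growth rate. For example, steady growth of g = 2% per year corresponds to a slope of r = ln(1.02) ≈ 0.0198.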
In Sections 1 and 2 we transformed the data (take log(x) and histogram the result). Here we keep the data in dollars and transform the axis (yaxis.type = "log"), so hover values stay readable as currency. For a strictly positive series the picture is the same either way.
| Country | Annualized growth (1820–2020) | R² of log-linear fit |
|---|---|---|
Takeaway
On a linear axis, a hockey stick can mean either an exploding growth rate or a steady percentage rate that just compounds for long enough. The log axis disambiguates: a straight line means a stable growth rate, and the slope itself is that rate. When the line bends, the rate is changing: Argentina bends downward (the divergence story), Japan bends from very steep catch-up to flat (post-1990 stagnation), and the United States stays remarkably straight.
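One way to turn "straight versus bent" into numbers is an ordinary least-squares fit of log(value) against year, reporting the implied annualized growth and the R² of the fit. The sketch below uses two synthetic series, not real GDP data, and the function name `log_linear_fit` and the growth rates are assumptions chosen for illustration; it is not necessarily how the dashboard computes its table.

```python
import math


def log_linear_fit(years, values):
    """Least-squares fit of log(values) = a + b * years; returns (b, r_squared)."""
    logs = [math.log(v) for v in values]
    n = len(years)
    t_bar = sum(years) / n
    y_bar = sum(logs) / n
    sxx = sum((t - t_bar) ** 2 for t in years)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(years, logs))
    syy = sum((y - y_bar) ** 2 for y in logs)
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


years = list(range(1820, 2021))

# Synthetic series: steady 2% growth per year from an arbitrary base.
steady = [1_000 * 1.02 ** (t - 1820) for t in years]

# Synthetic series whose rate changes: 4% per year until 1920, then 0.5%.
bent = []
level = 1_000.0
for t in years:
    bent.append(level)
    level *= 1.04 if t < 1920 else 1.005

steady_slope, steady_r2 = log_linear_fit(years, steady)
bent_slope, bent_r2 = log_linear_fit(years, bent)

# exp(slope) - 1 converts the continuous rate back to an annual percentage rate.
print(f"steady: annualized growth {math.exp(steady_slope) - 1:.4f}, R^2 {steady_r2:.4f}")
print(f"bent:   annualized growth {math.exp(bent_slope) - 1:.4f}, R^2 {bent_r2:.4f}")
```

The steady series recovers its 2% rate with an R² of essentially 1, while the bent series yields a single "average" rate and a visibly lower R², which is the numerical counterpart of a line that curves on the log axis.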