Chapter 09 — Models with Natural Logarithms

Econometricians write "log points" like they mean percentages. Do they? Mostly — but only up to a point, and the point is smaller than you think.

The change in ln(x) approximates the proportionate change in x: Δln(x) ≈ Δx/x. Multiplying by 100 gives the percentage change. The approximation is excellent for changes under 10%. For larger changes the exact formula is %Δx = 100 × (e^Δln(x) − 1), which grows nonlinearly and diverges from the log approximation.
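
As a quick numerical check of the claim above, the following sketch tabulates the exact percentage change against the log approximation for a few change sizes. The starting value of 100 and the list of changes are illustrative choices, not values taken from the chapter's data.

```python
import numpy as np

x0 = 100.0                       # illustrative starting value (assumption)
pct_changes = [1, 5, 10, 50]     # illustrative percentage changes (assumption)

rows = []
for pct in pct_changes:
    x1 = x0 * (1 + pct / 100)                   # new level after the change
    dlog = np.log(x1) - np.log(x0)              # Δln(x)
    log_approx = 100 * dlog                     # "log points" read as percent
    exact_via_exp = 100 * (np.exp(dlog) - 1)    # recovers the exact % change
    rows.append((pct, log_approx, exact_via_exp, pct - log_approx))

print(f"{'Exact %':>8} {'Log approx':>11} {'Via exp':>8} {'Error (pp)':>11}")
for pct, log_approx, exact_via_exp, err in rows:
    print(f"{pct:>8.2f} {log_approx:>11.2f} {exact_via_exp:>8.2f} {err:>11.2f}")
```

The error column is the gap, in percentage points, between the true change and the log reading; it grows quickly once the change moves past 10%.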

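Interactive widget: Logarithmic Approximation Explorer. Inputs: percentage change (default 5%), starting value x₀ (default 100). Outputs: exact %Δx, log approximation 100×Δln(x), approximation error, exact value via exp.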

Try this

Set the percentage change to 1%. The exact and log approximations are visually identical. For small changes the two are interchangeable — which is why econometricians slide between "log points" and "percent" so casually.
Drag to 10%. The error is about 0.5 percentage points. Still close, but you can now see the gap — this is roughly the ceiling of "log ≈ percent" sanity.
Push to 50%. The log approximation says 40.5% instead of 50%, an error of about 9.5 percentage points. At large changes the approximation simply breaks, so use the exact 100 × (e^Δln(x) − 1) formula instead.

Take-away: "Log changes ≈ percent changes" is a small-change approximation: under 10% it is accurate enough; above that, use the exact exp formula. Read §9.1 in the chapter →

The same data, the same two columns — but four very different regressions, depending on what you log. Which one fits the story you're actually telling?

Logging y, x, both, or neither produces four distinct models with four distinct interpretations. Linear: Δy = β₁Δx (dollar change per unit). Log-linear: 100·β₁ ≈ % change in y per unit increase in x (semi-elasticity). Log-log: β₁ ≈ % change in y per % change in x (elasticity, unit-free). Linear-log: β₁/100 ≈ change in y per % change in x. Choose on the basis of theory, data properties, fit, and interpretation needs.
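
To make the four interpretations concrete, here is a minimal sketch on synthetic data. The data-generating process (a 13% return per unit of x with multiplicative noise), the seed, and the helper `fit_line` are assumptions made for illustration; they are not the chapter's dataset or code.

```python
import numpy as np

rng = np.random.default_rng(0)                      # fixed seed (assumption)
n = 1000
x = rng.uniform(8, 20, n)                           # synthetic "years of education"
y = np.exp(8.5 + 0.13 * x + rng.normal(0, 0.3, n))  # synthetic "earnings"

def fit_line(u, v):
    """OLS of v on u with an intercept; returns (slope, R²). Hypothetical helper."""
    slope, intercept = np.polyfit(u, v, 1)
    resid = v - (intercept + slope * u)
    r2 = 1 - resid.var() / v.var()
    return slope, r2

b_lin, r2_lin = fit_line(x, y)                          # y on x
b_loglin, r2_loglin = fit_line(x, np.log(y))            # ln(y) on x
b_loglog, r2_loglog = fit_line(np.log(x), np.log(y))    # ln(y) on ln(x)
b_linlog, r2_linlog = fit_line(np.log(x), y)            # y on ln(x)

print(f"Linear:     one more unit of x -> {b_lin:,.0f} more units of y")
print(f"Log-linear: one more unit of x -> about {100 * b_loglin:.1f}% more y")
print(f"Log-log:    1% more x          -> about {b_loglog:.2f}% more y")
print(f"Linear-log: 1% more x          -> about {b_linlog / 100:,.0f} more units of y")
```

Note the rescaling in the print statements: the log-linear slope is multiplied by 100 and the linear-log slope is divided by 100, while the linear and log-log slopes are read as they are.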

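Interactive widget: Four Model Specifications. Input: model specification (Linear, Log-Linear, Log-Log, Linear-Log). Outputs: intercept and slope (labelled b₁ and b₂ in the widget), R², interpretation.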

Try this

Start with Linear. The raw scatter is skewed by a handful of high-earning observations. Dollar-per-year interpretation, but the outliers pull the line and heteroskedasticity is obvious.
Switch to Log-Linear. Logging earnings compresses the y-axis; R² climbs to ≈ 0.334 and the slope becomes a percentage return on education. Each extra year of schooling is associated with a ~13% higher wage — the classic Mincer result.
Compare Log-Log and Linear-Log. Log-Log yields an elasticity ≈ 1.48; Linear-Log has the worst R². Of the four, Log-Linear best matches both the data shape and the economic theory of percentage returns.

Take-away: Transforming variables changes the question the slope answers — pick the specification your theory implies, not the one that maximizes R². Read §9.3 in the chapter →

A 13% return to education sounds modest. But what does it actually look like across a 20-year career? The log-linear model answers: straight lines in log space, exponential curves in levels.

In the log-linear model ln(y) = β₀ + β₁x, the slope β₁ is the semi-elasticity of y with respect to x. Multiply by 100 to read it in percent: 100 × β₁ ≈ the percentage change in y when x increases by 1 unit. The approximation works well for |β₁| < 0.10; for larger values use the exact form 100 × (e^β₁ − 1). Log-linear is the default earnings specification because a percentage return scales naturally across income levels.
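
The gap between the approximate and exact readings, and the exponential shape in levels, can be checked directly. The sketch below uses β₀ = 8.50 and a few β₁ values purely as illustrative inputs; `exact_pct` and `predicted_level` are hypothetical helper names.

```python
import numpy as np

def exact_pct(beta1):
    """Exact % change in y for a one-unit increase in x."""
    return 100 * (np.exp(beta1) - 1)

def predicted_level(beta0, beta1, x):
    """Level of y implied by ln(y) = beta0 + beta1 * x (ignoring the error term)."""
    return np.exp(beta0 + beta1 * x)

beta0 = 8.50                                  # illustrative intercept (assumption)
for beta1 in (0.05, 0.13, 0.25):              # illustrative slopes (assumption)
    approx = 100 * beta1
    exact = exact_pct(beta1)
    print(f"beta1={beta1:.2f}: approx {approx:.1f}%, exact {exact:.1f}%, "
          f"gap {exact - approx:.1f} pp")

# Equal steps in x multiply y by the same factor, so the gaps in levels widen.
y10, y11 = predicted_level(beta0, 0.13, 10), predicted_level(beta0, 0.13, 11)
y16, y17 = predicted_level(beta0, 0.13, 16), predicted_level(beta0, 0.13, 17)
print(f"x 10 -> 11: +{y11 - y10:,.0f}   x 16 -> 17: +{y17 - y16:,.0f}")
```

The last line shows the compounding: the same one-unit step is worth more in levels at x = 16 than at x = 10, even though the percentage return is identical.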

Interactive widget: Semi-Elasticity Visualizer. Inputs: semi-elasticity β₁ (default 0.13), intercept β₀ (default 8.50). Outputs: % change per unit x, exact % (for large β₁), y at x=10, y at x=16.

Try this

Set β₁ = 0.05 (a 5% return per year). The log-space plot is a gentle straight line; the level-space curve is nearly linear. At small β₁ the approximation is essentially exact.
Set β₁ = 0.13 (the earnings-education estimate). The level-space curve is visibly exponential — earnings compound, not add, with education. A percentage return turns into a widening dollar gap at higher education levels.
Push β₁ to 0.25. The approximation says 25%; the exact formula gives 28.4%. At this magnitude the log approximation understates reality by roughly three percentage points — time to switch to the exact form.

Take-away: A semi-elasticity is "percent per unit" — straight lines in log space, exponential in levels; the curvature you see in level space is the economic content of the model. Read §9.3 in the chapter →

What curvature does "an elasticity of 0.33" actually put on a graph? And why does an elasticity of 1 look like a straight line through the origin?

In the log-log model ln(y) = β₀ + β₁ln(x), the slope β₁ is the elasticity: the % change in y per 1% change in x. It is unit-free: the answer doesn't care whether you measured in dollars or thousands of dollars. An elasticity between 0 and 1 gives diminishing returns (concave in levels); β₁ > 1 gives increasing returns (convex); β₁ = 1 gives a proportional relationship (straight line through the origin).
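
A small sketch of how β₁ shapes the curve in levels: under ln(y) = β₀ + β₁ln(x), the level relationship is y = e^β₀ · x^β₁, so doubling x multiplies y by 2^β₁. The intercept value and the helper names below are illustrative assumptions.

```python
import numpy as np

def level_y(beta0, beta1, x):
    """Level of y implied by ln(y) = beta0 + beta1 * ln(x)."""
    return np.exp(beta0) * x ** beta1

def returns_type(beta1):
    """Classify curvature in levels for a positive elasticity."""
    if np.isclose(beta1, 1.0):
        return "proportional (straight line through the origin)"
    return "diminishing (concave)" if beta1 < 1 else "increasing (convex)"

beta0 = 7.0                                   # illustrative intercept (assumption)
for beta1 in (0.33, 1.0, 2.0):                # illustrative elasticities
    ratio = level_y(beta0, beta1, 20.0) / level_y(beta0, beta1, 10.0)
    print(f"beta1={beta1:.2f}: doubling x multiplies y by {ratio:.2f} "
          f"-> {returns_type(beta1)}")
```

The doubling factor does not depend on β₀ or on where x starts, which is another way of seeing that the elasticity is a pure proportional response.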

Interactive widget: Elasticity Visualizer. Inputs: elasticity β₁ (default 1.48), intercept β₀ (default 7.00). Outputs: elasticity, returns type, y at x=5, y at x=20.

Try this

Set β₁ = 0.33 (typical capital elasticity in growth models). The level-space curve is strongly concave. Classic diminishing returns — each additional unit of capital adds less output.
Set β₁ = 1.0. The log-space slope is exactly 1; the level-space curve collapses to a straight line through the origin. Proportional growth — doubling x exactly doubles y.
Set β₁ = 2.0. The level-space curve bows upward (convex). Increasing returns — rare in standard production, but common for network and agglomeration effects.

Take-away: Elasticity is the one regression number that reads the same in any currency or unit — and its value above or below 1 decides whether you're looking at diminishing or increasing returns. Read §9.2 in the chapter →

The S&P 500 looks like a rocket in levels — and a straight line in logs. That single trick turns a century of compound growth into a regression slope.

Exponential growth x_t = x₀(1+r)^t becomes linear in logs: ln(x_t) ≈ ln(x₀) + r × t. The OLS slope on logged data directly estimates the annual growth rate r (KC 9.6). A back-of-envelope companion: the Rule of 72 — doubling time ≈ 72 / r, where r is in percent (KC 9.7). S&P 500 grew at ~6.5%/year (1927–2019), doubling roughly every 11 years.
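
How good is the Rule of 72? The sketch below compares it with the exact doubling time ln(2)/ln(1+r) for a few illustrative growth rates. It also checks that, for a noiseless simulated series, the slope of ln(x_t) on t equals ln(1+r), which is close to r when r is small. The series is simulated for illustration; it is not the S&P 500 data.

```python
import numpy as np

def rule_of_72(r_pct):
    """Approximate doubling time in years; r_pct is the growth rate in percent."""
    return 72 / r_pct

def exact_doubling(r_pct):
    """Exact doubling time under constant compound growth."""
    return np.log(2) / np.log(1 + r_pct / 100)

for r_pct in (3.0, 6.5, 12.0):                # illustrative growth rates
    print(f"r={r_pct:>4.1f}%: Rule of 72 {rule_of_72(r_pct):5.1f} yrs, "
          f"exact {exact_doubling(r_pct):5.1f} yrs")

# Simulated series x_t = x0 * (1 + r)^t with r = 6.5% (no noise, for illustration)
t = np.arange(0, 93)
x = 100 * 1.065 ** t
slope, intercept = np.polyfit(t, np.log(x), 1)
print(f"Slope of ln(x) on t: {slope:.4f}  vs  ln(1.065) = {np.log(1.065):.4f}")
```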

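Interactive widget: Exponential Growth and Rule of 72. Inputs: display mode (e.g. Levels + Logs, Logs Only), custom growth rate (default 6.5%). Outputs: estimated growth rate, R² (log model), Rule of 72 doubling time, custom doubling time.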

Try this

Start with Levels + Logs. The left panel is the S&P 500's explosive curve; the right panel is a nearly straight line. That straightness is the visual signature of constant proportional growth.
Slide the custom growth rate to 3%, then to 12%. The Rule of 72 gives 24 years and 6 years to double, respectively. A 4× gap in growth rate translates into a 4× gap in doubling time; the arithmetic is that simple.
Toggle to Logs Only and look at the 1930s, 2000, and 2008 dips. Major crashes show as pronounced drops below the linear trend. Crashes are defined by their deviation from trend — and the log plot makes them visible at a glance.

Take-away: Log-scale turns compound growth into a straight line — and the slope you estimate on logged data is simply the average growth rate. Read §9.4 in the chapter →

All four specs, one dataset, side by side. Which one lines up with the scatter you actually see?

Choosing the right functional form is guided by theory, data, fit, and interpretation. Ask: does economic theory predict absolute or percentage effects? Are the variables right-skewed or strictly positive? Which R² and residual shape look cleanest? Which coefficient interpretation is most useful? Log-linear typically wins for earnings because a percentage return scales naturally across income levels.
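
One way to look at the "residual shape" question is to compare how the residual spread changes with x under each specification. The sketch below does this on synthetic data with multiplicative noise, an assumption chosen because it mimics the fan shape typical of earnings. The data, the seed, and the helper `spread_ratio` are illustrative, not the chapter's.

```python
import numpy as np

rng = np.random.default_rng(1)                      # fixed seed (assumption)
n = 2000
x = rng.uniform(8, 20, n)                           # synthetic regressor
y = np.exp(8.5 + 0.13 * x + rng.normal(0, 0.5, n))  # multiplicative noise

def spread_ratio(x, v):
    """Residual SD for the upper half of x divided by that for the lower half."""
    slope, intercept = np.polyfit(x, v, 1)
    resid = v - (intercept + slope * x)
    high = x >= np.median(x)
    return resid[high].std() / resid[~high].std()

ratio_linear = spread_ratio(x, y)             # dependent variable in levels
ratio_loglin = spread_ratio(x, np.log(y))     # dependent variable in logs

print(f"Linear     residual SD ratio (high x / low x): {ratio_linear:.2f}")
print(f"Log-linear residual SD ratio (high x / low x): {ratio_loglin:.2f}")
```

A ratio near 1 is consistent with a roughly constant residual spread; a ratio well above 1 is the numerical counterpart of a scatter that fans out.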

Interactive widget: Model Comparison: 2×2 Grid. Input: highlight model. Outputs: R² for the Linear, Log-Linear, Log-Log, and Lin-Log models.

Try this

Scan the four R² values. Log-Linear has the highest. For earnings-on-education, logging the dependent variable pays off in fit as well as interpretability.
Highlight Log-Lin. The y-axis compression reduces the pull of high earners. Logging y stabilises the variance — a free upgrade to the homoskedasticity assumption.
Highlight Linear. The raw scatter fans out — extreme heteroskedasticity. That visible fan is exactly why economists default to log-earnings models.

Take-away: A higher R² alone is not enough — the best specification is the one that matches your theory's predictions and the shape of your data together. Read §9.3 in the chapter →

Productivity varies 100× across countries, and capital varies even more. Linear regressions on these numbers are swamped by a few rich outliers; log transformations save the story.

Logarithmic models are the standard tool in development economics for analyzing cross-country differences. They handle the huge numerical ranges (productivity, GDP, capital), match economic theory's multiplicative production-function structure, and deliver elasticities that compare cleanly across units and currencies (KC 9.8, 9.9). The log-log specification directly estimates the output elasticity of capital — a core parameter in growth theory.
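
The link between a multiplicative production function and the log-log regression can be sketched with simulated data. Assume, for illustration only, output per worker y = A·k^α·e^u with α = 0.33; taking logs gives ln(y) = ln(A) + α·ln(k) + u, so the log-log slope estimates α. The parameter values, sample size, and seed below are assumptions, and the data are synthetic rather than the cross-country dataset.

```python
import numpy as np

rng = np.random.default_rng(2)                # fixed seed (assumption)
n = 150                                       # synthetic "countries"
alpha, A = 0.33, 50.0                         # assumed elasticity and scale

# Capital per worker spanning three orders of magnitude
k = np.exp(rng.uniform(np.log(1e3), np.log(1e6), n))
y = A * k ** alpha * np.exp(rng.normal(0, 0.2, n))     # multiplicative shocks

slope, intercept = np.polyfit(np.log(k), np.log(y), 1)

# Re-express capital in thousands: only the intercept moves, not the elasticity.
slope_rescaled, intercept_rescaled = np.polyfit(np.log(k / 1000), np.log(y), 1)

print(f"Estimated elasticity (k in units):     {slope:.3f}")
print(f"Estimated elasticity (k in thousands): {slope_rescaled:.3f}")
print(f"Intercept shift: {intercept_rescaled - intercept:.3f} "
      f"(= slope * ln(1000) = {slope * np.log(1000):.3f})")
```

Changing the unit of k shifts the intercept by slope × ln(1000) and leaves the slope untouched, which is what lets elasticities be compared across units and currencies.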

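Interactive widget: Cross-Country Log Models. Inputs: model (Log-Log (kl), Log-Linear (h), Linear (kl)), show labels. Outputs: slope (b₂), R², interpretation, N countries.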

Try this

Start with Log-Log (kl). The slope is the output elasticity of capital, below 1. Diminishing returns to capital — the standard growth-model prediction.
Switch to Log-Linear (h). The slope is now the semi-elasticity of productivity with respect to the human-capital index (HCI). Each 1-unit rise in the HCI is associated with productivity that is higher by roughly 100 × slope percent: a different question, a different scale, same dataset.
Switch to Linear (kl). The scatter is dominated by a few rich countries and the rest crowd near the origin. This is exactly why cross-country work logs — the linear view is not wrong, it's unreadable.

Take-away: For cross-country data, log specifications are not a stylistic choice — they are the only way to make the data, the theory, and the interpretation all fit in one picture. Read the case studies →

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 9 CHEAT SHEET: Models with Natural Logarithms
# =============================================================================

# --- Libraries ---
import numpy as np                        # logarithms and exponentials
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas

# =============================================================================
# STEP 1: Load the earnings-education dataset
# =============================================================================
# pd.read_stata() reads Stata .dta files directly from a URL
url_earn = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_EARNINGS.DTA"
data_earnings = pd.read_stata(url_earn)

print(f"Dataset: {data_earnings.shape[0]} observations, {data_earnings.shape[1]} variables")

# =============================================================================
# STEP 2: Logarithmic approximation — why economists use logs
# =============================================================================
# Key property: Δln(x) ≈ Δx/x (proportionate change)
# Multiplying by 100 gives the percentage change
x0, x1 = 40, 40.4
exact = (x1 - x0) / x0
approx = np.log(x1) - np.log(x0)
print(f"Change from {x0} to {x1}:")
print(f"  Exact proportionate change: {exact:.6f} ({exact*100:.2f}%)")
print(f"  Log approximation Δln(x):   {approx:.6f} ({approx*100:.2f}%)")

# =============================================================================
# STEP 3: Descriptive statistics and log transformations
# =============================================================================
# Create log-transformed variables for the regression models
data_earnings['lnearn'] = np.log(data_earnings['earnings'])
data_earnings['lneduc'] = np.log(data_earnings['education'])

print(data_earnings[['earnings', 'lnearn', 'education', 'lneduc']].describe().round(2))

# =============================================================================
# STEP 4: Estimate all four model specifications
# =============================================================================
# Each model answers a different economic question about earnings and education

# Model 1: Linear — Δy = β₁Δx (dollar change per year of education)
model_linear = ols('earnings ~ education', data=data_earnings).fit()

# Model 2: Log-linear — 100β₁ ≈ % change in y per unit x (semi-elasticity)
model_loglin = ols('lnearn ~ education', data=data_earnings).fit()

# Model 3: Log-log — β₁ ≈ % change in y per % change in x (elasticity)
model_loglog = ols('lnearn ~ lneduc', data=data_earnings).fit()

# Model 4: Linear-log — β₁/100 ≈ dollar change per % change in x
model_linlog = ols('earnings ~ lneduc', data=data_earnings).fit()

# Print the most important model: log-linear (semi-elasticity)
semi_elast = model_loglin.params['education']
print(f"Log-linear: each year of education → {100*semi_elast:.1f}% higher earnings")
print(f"Log-log elasticity: {model_loglog.params['lneduc']:.3f}")

# Full regression table for the log-linear model
print(model_loglin.summary())

# =============================================================================
# STEP 5: Compare all four models side by side
# =============================================================================
# The comparison shows that model choice affects both R² and interpretation
# Each entry: (specification, fitted model, slope variable, slope multiplier, format)
models = {
    'Linear':     ('y ~ x',         model_linear, 'education', 1,    '${:,.0f} per year'),
    'Log-linear': ('ln(y) ~ x',     model_loglin, 'education', 100,  '{:.1f}% per year'),
    'Log-log':    ('ln(y) ~ ln(x)', model_loglog, 'lneduc',    1,    '{:.2f}% per 1%'),
    'Linear-log': ('y ~ ln(x)',     model_linlog, 'lneduc',    0.01, '${:,.0f} per 1%'),
}

print(f"{'Model':<12} {'Specification':<16} {'Slope':>10} {'R²':>8}  Interpretation")
print("-" * 75)
for name, (spec, m, var, scale, fmt) in models.items():
    slope = m.params[var]
    interp = fmt.format(scale * slope)   # rescale the slope into the units named in fmt
    print(f"{name:<12} {spec:<16} {slope:>10.4f} {m.rsquared:>8.3f}  {interp}")

# =============================================================================
# STEP 6: Scatter plot with the log-linear fitted line
# =============================================================================
# The log-linear model (semi-elasticity) provides the best fit for earnings data
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_earnings['education'], data_earnings['lnearn'], s=50, alpha=0.7)
ax.plot(data_earnings['education'], model_loglin.fittedvalues,
        color='red', linewidth=2, label='Fitted line')
ax.set_xlabel('Education (years)')
ax.set_ylabel('ln(Earnings)')
ax.set_title(f'Log-Linear Model: semi-elasticity = {semi_elast:.4f}  (R² = {model_loglin.rsquared:.3f})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 7: Exponential growth — S&P 500 and the Rule of 72
# =============================================================================
# Exponential growth in levels becomes linear in logs:
# ln(x_t) ≈ ln(x₀) + r × t, where slope r = annual growth rate
url_sp500 = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_SP500INDEX.DTA"
data_sp500 = pd.read_stata(url_sp500)

model_sp500 = ols('lnsp500 ~ year', data=data_sp500).fit()
growth_rate = model_sp500.params['year']

print(f"S&P 500 estimated growth rate: {100*growth_rate:.2f}% per year")
print(f"Rule of 72: doubles every {72/(100*growth_rate):.1f} years")
print(f"R-squared: {model_sp500.rsquared:.4f}")

# Visualize: exponential in levels vs. linear in logs
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(data_sp500['year'], data_sp500['sp500'], linewidth=2)
axes[0].set_xlabel('Year')
axes[0].set_ylabel('S&P 500 Index')
axes[0].set_title('Exponential Growth in Levels')
axes[0].grid(True, alpha=0.3)

axes[1].plot(data_sp500['year'], data_sp500['lnsp500'], linewidth=2)
axes[1].plot(data_sp500['year'], model_sp500.fittedvalues,
             color='red', linewidth=2, linestyle='--', label='Fitted (linear)')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('ln(S&P 500 Index)')
axes[1].set_title(f'Linear in Logs: growth = {100*growth_rate:.2f}%/year')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
