Chapter 01 of 18 · Interactive Dashboard

Analysis of Economics Data

Explore house-price data from Central Davis, CA to build intuition for scatter plots, regression lines, slope interpretation, R², and the difference between association and causation.

Data at a glance — descriptive statistics

Is a $253,000 house "typical" for Central Davis? Before you fit any regression, you need to know what typical even looks like for each variable.

Descriptive analysis summarizes data; statistical inference generalizes from it. Descriptive tools — mean, median, quartiles, std dev — describe the 29 houses in front of you. Inference uses those 29 observations to say something about the broader Davis housing market. Most econometric analysis involves both, in that order.
What you can do here
  • Switch the variable between sale price, size, and bedrooms.
  • Compare the mean (cyan line) and the median (pink dotted line) — the gap between them is a quick read on skew.
  • Scan the quartiles and IQR in the stat cards to feel where the middle 50% of the data sits.
Try this
  1. Select Sale price. Mean ≈ $253,910 sits above median ≈ $244,000 and skewness is positive. A right-skewed tail: a handful of expensive houses pulls the average above the typical price.
  2. Switch to Size. The mean and median land close together, so size is more symmetric than price — a better-behaved variable to build a regression around.
  3. Switch to Bedrooms. The box collapses onto a few integer values — summary statistics still compute, but the "distribution" is really a discrete bar chart in disguise.

Take-away: Know each variable's center, spread, and shape before running any regression — the same slope means very different things in a tight sample versus a dispersed one. Read §1.4 in the chapter →
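The mean-median gap as a skew diagnostic takes three lines of pandas. The prices below are illustrative stand-ins, not the actual 29 Central Davis sales:

```python
import pandas as pd

# Hypothetical right-skewed prices (one expensive outlier) -- NOT the real sample
prices = pd.Series([180_000, 200_000, 220_000, 240_000, 244_000, 260_000, 480_000])

mean, median = prices.mean(), prices.median()
print(f"mean     = ${mean:,.0f}")      # pulled upward by the $480k house
print(f"median   = ${median:,.0f}")    # unaffected by the tail
print(f"skewness = {prices.skew():.2f}")  # positive => right-skewed
```

When the mean sits well above the median and skewness is positive, you are looking at the same right-tail pattern the widget shows for sale price.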

Scatter plot & regression line — seeing the relationship

Does a bigger house really cost more? And if so, does the relationship look like a line — or a curve, or nothing at all?

Always plot your data before running a regression. A scatter plot reveals direction (positive or negative), form (linear or curved), and strength (tight or scattered), plus any outliers — all of which summary statistics alone can hide. The fitted OLS line then picks the slope and intercept that minimize the sum of squared residuals on the cloud.
What you can do here
  • Toggle the regression line on/off to see what OLS adds to a bare scatter.
  • Toggle residuals on to see the vertical gap between each house and its predicted price.
  • Hover a point to read its size and sale price.
Try this
  1. Turn the regression line off and mentally draw your own. Most people's eyeball line lands close to OLS but not exactly on it — OLS is a computed, reproducible answer, not a judgment call.
  2. Turn the line back on. OLS picks slope $73.77/sq ft and intercept $115,017 — the unique line that makes the sum of squared residuals as small as possible.
  3. Toggle residuals on and spot the longest pink segment. That house's price is furthest from what size alone predicts — a reminder that size is only one of many price drivers (condition, location, age all live inside the residual).

Take-away: A scatter plot is the cheapest insurance against running a regression on data that isn't linear to begin with. Read §1.5 in the chapter →
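Under the hood, OLS has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept forces the line through the point of means. A minimal sketch on simulated data (not the Davis sample), checked against NumPy's own least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
size = rng.uniform(1400, 3300, 29)                        # simulated sizes
price = 115_000 + 74 * size + rng.normal(0, 25_000, 29)   # simulated prices

# Closed-form OLS: slope = cov(x, y) / var(x); intercept through (x-bar, y-bar)
slope = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
intercept = price.mean() - slope * size.mean()

# Same answer as NumPy's degree-1 least-squares polynomial fit
np_slope, np_intercept = np.polyfit(size, price, deg=1)
print(f"slope = {slope:.2f} $/sq ft, intercept = {intercept:,.0f}")
assert np.isclose(slope, np_slope) and np.isclose(intercept, np_intercept)
```

This is why the eyeball line never quite matches: OLS is the unique minimizer of the sum of squared residuals, not a judgment call.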

Prediction explorer — what does the slope mean?

What price does our model predict for a 2,500-sq-ft house? And how much would that prediction move if we'd estimated the slope a little differently?

The slope is the marginal effect: each extra square foot adds $73.77 to the predicted price. Regression quantifies this per-unit effect of x on y (Key Concept 1.4), but predictions must stay inside the observed data range. Push the size outside 1,400–3,300 sq ft and you are extrapolating — the linear pattern may not hold there (Key Concept 1.6).
What you can do here
  • Slide the house size to watch the pink diamond trace predictions along the fitted line.
  • Slide the "what-if" slope from $50 to $100/sq ft to feel how much the prediction moves when the slope itself is uncertain.
  • Watch the dashed boundary lines at 1,400 and 3,300 sq ft — they mark where the data actually lives.
Try this
  1. Set size to 2,000 sq ft and keep the slope at $73.77. The prediction lands near $262,500 — matching the textbook's worked example, which uses the same intercept and slope.
  2. Increase size by 100 sq ft. The predicted price rises by exactly $7,377 — that's slope × 100, the textbook definition of a marginal effect.
  3. Drag size to 4,000 sq ft. The dashed boundary warns you you're past the observed data range — predictions here are assumptions, not evidence.
  4. Drag the slope to $60, then to $90. For a 2,000-sq-ft house that's a $60k swing in the prediction — uncertainty in the slope translates directly into uncertainty in every prediction.

Take-away: A regression equation lets you predict, but only within the range the data covers — and the uncertainty in the slope is also uncertainty in every prediction. Read §1.9 in the chapter →
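The prediction arithmetic is just the fitted equation, plus a range check before trusting the answer. A small helper using the slope, intercept, and observed size range reported above:

```python
# Fitted equation and observed size range from the regression above
SLOPE, INTERCEPT = 73.77, 115_017      # $/sq ft, dollars
SIZE_MIN, SIZE_MAX = 1_400, 3_300      # observed data range, sq ft

def predict_price(size_sqft):
    """Return the predicted price and whether it is interpolation."""
    in_range = SIZE_MIN <= size_sqft <= SIZE_MAX
    return INTERCEPT + SLOPE * size_sqft, in_range

price, ok = predict_price(2_000)
print(f"2,000 sq ft -> ${price:,.0f} (within data range: {ok})")

# Marginal effect: +100 sq ft moves the prediction by slope x 100
assert abs(predict_price(2_100)[0] - price - SLOPE * 100) < 1e-6

# 4,000 sq ft is extrapolation -- the linear pattern may not hold out there
print(predict_price(4_000))
```

The 2,000-sq-ft prediction comes out near $262,557, and the in-range flag is False at 4,000 sq ft — the code equivalent of the dashed boundary lines.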

R² — how much variation does the regression explain?

R² is 0.6175 — is that good? And what does "62% of the variation is explained" actually mean in picture form?

R² is the share of total price variation that size can account for. Reading regression output centers on four numbers: the coefficient estimate, the standard error, the t-statistic / p-value, and R². This widget shows R² geometrically: total variation (TSS) splits into what the line explains (ESS, cyan segments) and what it leaves over (RSS, pink segments). R² = ESS / TSS = 1 − RSS / TSS.
What you can do here
  • Click Explained to see only the cyan segments — each prediction's distance from the mean price.
  • Click Residual to see only the pink segments — each actual price's distance from its prediction.
  • Click Scatter + line to see both, superimposed on the data.
Try this
  1. Click Explained. The cyan bars get taller for houses far from the average size — that is exactly the variation the slope is capturing.
  2. Click Residual. Pink bars are what size can't explain — the 38% of price variation driven by location, condition, and everything else we didn't measure.
  3. Eyeball the cyan bars vs. the pink bars. The cyan bars dominate — that is the geometric meaning of R² = 0.62: explained variation outweighs residual variation roughly 62 to 38.

Take-away: R² is a ratio of two sums of squares — ESS over TSS — and you can see it as cyan-vs-pink rather than read it as a number. Read §1.7 in the chapter →
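The decomposition is easy to verify numerically: TSS = Σ(yᵢ − ȳ)², ESS = Σ(ŷᵢ − ȳ)², RSS = Σ(yᵢ − ŷᵢ)², and for OLS with an intercept TSS = ESS + RSS exactly. A sketch on simulated data (not the Davis sample):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(1400, 3300, 29)
y = 115_000 + 74 * x + rng.normal(0, 30_000, 29)   # simulated prices

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x                      # fitted values

tss = np.sum((y - y.mean()) ** 2)       # total variation
ess = np.sum((y_hat - y.mean()) ** 2)   # explained ("cyan") variation
rss = np.sum((y - y_hat) ** 2)          # residual ("pink") variation

print(f"R² = ESS/TSS = {ess / tss:.4f} = 1 - RSS/TSS = {1 - rss / tss:.4f}")
assert np.isclose(tss, ess + rss)       # holds for OLS with an intercept
```

The two expressions for R² agree only because the cross term between fitted values and residuals vanishes under OLS — that is the algebra behind the cyan-vs-pink picture.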

Multiple predictors — association is not causation

If size "explains" 62% of prices, does that mean size causes higher prices? And what happens when we try bedrooms, bathrooms, lot size, or age instead?

A high R² with one predictor never proves causation. Regression results must be read with caution: association does not imply causation, omitted variables can bias the slope, and predictions should not extrapolate beyond the data. Five regressions on the same 29 houses produce five different slopes and five different R²s — none of them rule out a lurking variable (location, condition, school district) driving both the predictor and the price.
What you can do here
  • Pick a predictor — size, bedrooms, bathrooms, lot size, or age.
  • Watch the slope, intercept, SE(slope), and R² update in the stat cards.
  • Read the callout — it compares each fit back to the baseline size regression.
Try this
  1. Start with Size (R² ≈ 62%) and switch to Bedrooms. R² drops sharply — bedrooms and size are correlated, so bedrooms partially proxies for size but carries less information on its own.
  2. Switch to Age. The slope is negative: older houses sell for less on average. "All else equal" is the trap — age may also proxy for neighborhood vintage or condition, which you haven't controlled for.
  3. Cycle through all five predictors. Five regressions, five different stories about price — until you can control for confounders in a multiple regression, none of them is a causal story.

Take-away: Switching predictors gives you five different stories about price — until you control for confounders, none of them is a causal story. Read §1.9 in the chapter →
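Omitted-variable bias is easy to see in a toy simulation (entirely made up, not the Davis data): let an unobserved "location quality" drive both size and price, with no direct effect of size at all — the simple regression still finds a positive slope.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
location = rng.normal(size=n)               # unobserved confounder
size = location + rng.normal(size=n)        # bigger houses sit in better locations
price = 5 * location + rng.normal(size=n)   # price depends ONLY on location

# Simple regression of price on size: slope = cov(x, y) / var(x)
slope = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
print(f"estimated slope = {slope:.2f}")     # near 2.5, though the true size effect is 0
```

The regression attributes location's effect to size because size is the only variable it can see — exactly the lurking-variable trap the widget warns about.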

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 1 CHEAT SHEET: Analysis of Economics Data
# =============================================================================

# --- Libraries ---
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files (pandas also supports CSV, Excel, etc.)
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")

# =============================================================================
# STEP 2: Descriptive statistics — summarize before modeling
# =============================================================================
# .head() shows the first rows; .describe() gives mean, std, min, quartiles, max
print(data_house[['price', 'size']].describe().round(2))

# =============================================================================
# STEP 3: Scatter plot — always visualize before fitting a regression
# =============================================================================
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=50, alpha=0.7)
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title('House Price vs Size')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: OLS regression — fit the model
# =============================================================================
# Formula syntax: 'y ~ x' regresses y on x (intercept included automatically)
# IMPORTANT: .fit() estimates the model — without it, nothing is computed!
model = ols('price ~ size', data=data_house).fit()

# Extract key results
slope     = model.params['size']       # marginal effect: $/sq ft
intercept = model.params['Intercept']  # predicted price when size = 0
r_squared = model.rsquared             # proportion of variation explained

print(f"Estimated equation: price = {intercept:,.0f} + {slope:.2f} × size")
print(f"Interpretation: each additional sq ft is associated with ${slope:,.2f} higher price")
print(f"R-squared: {r_squared:.4f} ({r_squared*100:.1f}% of variation explained)")

# Full regression table (coefficients, std errors, t-stats, p-values, R²)
# Wrapped in print() so it displays even when this isn't the cell's last line
print(model.summary())

# =============================================================================
# STEP 5: Scatter plot with fitted regression line and R²
# =============================================================================
# model.fittedvalues contains the predicted y-values from the estimated equation
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=50, alpha=0.7, label='Actual prices')
ax.plot(data_house['size'], model.fittedvalues, color='red', linewidth=2, label='Fitted line')
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title(f'OLS Regression: price = {intercept:,.0f} + {slope:.2f} × size  (R² = {r_squared:.2%})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Compare predictors — association is NOT causation
# =============================================================================
# Running separate regressions with different x-variables shows that each tells
# a different story. High R² does not prove causation — omitted variables
# (location, condition, school district) can bias any single-variable slope.
predictors = {
    'size':      'Size (sq ft)',
    'bedrooms':  'Bedrooms',
    'bathrooms': 'Bathrooms',
    'lotsize':   'Lot size',
    'age':       'Age (years)',
}

print(f"{'Predictor':<18} {'Slope':>10} {'R²':>8}")
print("-" * 38)
for var, label in predictors.items():
    m = ols(f'price ~ {var}', data=data_house).fit()
    print(f"{label:<18} {m.params[var]:>10.2f} {m.rsquared:>8.4f}")

Open empty Colab notebook →