Chapter 01 of 18 · Interactive Dashboard

Analysis of Economics Data

Explore house-price data from Central Davis, CA to build intuition for scatter plots, regression lines, slope interpretation, R², and the difference between association and causation.

Data at a glance — descriptive statistics

Is a $253,000 house "typical" for Central Davis? Before you fit any regression, you need to know what typical even looks like for each variable.

Descriptive analysis summarizes data; statistical inference generalizes from it. Descriptive tools — mean, median, quartiles, std dev — describe the 29 houses in front of you. Inference uses those 29 observations to say something about the broader Davis housing market. Most econometric analysis involves both, in that order.
What you can do here
  • Switch the variable between sale price, size, and bedrooms.
  • Compare the mean (cyan line) and the median (pink dotted line) — the gap between them is a quick read on skew.
  • Scan the quartiles and IQR in the stat cards to feel where the middle 50% of the data sits.
Try this
  1. Select Sale price. Mean ≈ $253,910 sits above median ≈ $244,000 and skewness is positive. A right-skewed tail: a handful of expensive houses pulls the average above the typical price.
  2. Switch to Size. The mean and median land close together, so size is more symmetric than price — a better-behaved variable to build a regression around.
  3. Switch to Bedrooms. The box collapses onto a few integer values — summary statistics still compute, but the "distribution" is really a discrete bar chart in disguise.

Take-away: Know each variable's center, spread, and shape before running any regression — the same slope means very different things in a tight sample versus a dispersed one. Read §1.4 in the chapter →
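The mean-median gap as a skew diagnostic takes three lines of pandas. The prices below are illustrative stand-ins, not the actual 29 Central Davis sales:

```python
import pandas as pd

# Hypothetical right-skewed prices (one expensive outlier) -- NOT the real sample
prices = pd.Series([180_000, 200_000, 220_000, 240_000, 244_000, 260_000, 480_000])

mean, median = prices.mean(), prices.median()
print(f"mean     = ${mean:,.0f}")      # pulled upward by the $480k house
print(f"median   = ${median:,.0f}")    # unaffected by the tail
print(f"skewness = {prices.skew():.2f}")  # positive => right-skewed
```

When the mean sits well above the median and skewness is positive, you are looking at the same right-tail pattern the widget shows for sale price.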

Scatter plot & regression line — seeing the relationship

Does a bigger house really cost more? And if so, does the relationship look like a line — or a curve, or nothing at all?

Always plot your data before running a regression. A scatter plot reveals direction (positive or negative), form (linear or curved), and strength (tight or scattered), plus any outliers — all of which summary statistics alone can hide. The fitted OLS line then picks the slope and intercept that minimize the sum of squared residuals on the cloud.
What you can do here
  • Toggle the regression line on/off to see what OLS adds to a bare scatter.
  • Toggle residuals on to see the vertical gap between each house and its predicted price.
  • Hover a point to read its size and sale price.
Try this
  1. Turn the regression line off and mentally draw your own. Most people's eyeball line lands close to OLS but not exactly on it — OLS is a computed, reproducible answer, not a judgment call.
  2. Turn the line back on. OLS picks slope $73.77/sq ft and intercept $115,017 — the unique line that makes the sum of squared residuals as small as possible.
  3. Toggle residuals on and spot the longest pink segment. That house's price is furthest from what size alone predicts — a reminder that size is only one of many price drivers (condition, location, age all live inside the residual).

Take-away: A scatter plot is the cheapest insurance against running a regression on data that isn't linear to begin with. Read §1.5 in the chapter →
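Under the hood, OLS has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept forces the line through the point of means. A minimal sketch on simulated data (not the Davis sample), checked against NumPy's own least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
size = rng.uniform(1400, 3300, 29)                        # simulated sizes
price = 115_000 + 74 * size + rng.normal(0, 25_000, 29)   # simulated prices

# Closed-form OLS: slope = cov(x, y) / var(x); intercept through (x-bar, y-bar)
slope = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
intercept = price.mean() - slope * size.mean()

# Same answer as NumPy's degree-1 least-squares polynomial fit
np_slope, np_intercept = np.polyfit(size, price, deg=1)
print(f"slope = {slope:.2f} $/sq ft, intercept = {intercept:,.0f}")
assert np.isclose(slope, np_slope) and np.isclose(intercept, np_intercept)
```

This is why the eyeball line never quite matches: OLS is the unique minimizer of the sum of squared residuals, not a judgment call.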

Prediction explorer — what does the slope mean?

What price does our model predict for a 2,500-sq-ft house? And how much would that prediction move if we'd estimated the slope a little differently?

The slope is the marginal effect: each extra square foot adds $73.77 to the predicted price. Regression quantifies this per-unit effect of x on y (Key Concept 1.4), but predictions must stay inside the observed data range. Push the size outside 1,400–3,300 sq ft and you are extrapolating — the linear pattern may not hold there (Key Concept 1.6).
What you can do here
  • Slide the house size to watch the pink diamond trace predictions along the fitted line.
  • Slide the "what-if" slope from $50 to $100/sq ft to feel how much the prediction moves when the slope itself is uncertain.
  • Watch the dashed boundary lines at 1,400 and 3,300 sq ft — they mark where the data actually lives.
Try this
  1. Set size to 2,000 sq ft and keep the slope at $73.77. The prediction lands near $262,500 — matching the textbook's worked example, which uses the same intercept and slope.
  2. Increase size by 100 sq ft. The predicted price rises by exactly $7,377 — that's slope × 100, the textbook definition of a marginal effect.
  3. Drag size to 4,000 sq ft. The dashed boundary warns you you're past the observed data range — predictions here are assumptions, not evidence.
  4. Drag the slope to $60, then to $90. For a 2,000-sq-ft house that's a $60k swing in the prediction — uncertainty in the slope translates directly into uncertainty in every prediction.

Take-away: A regression equation lets you predict, but only within the range the data covers — and the uncertainty in the slope is also uncertainty in every prediction. Read §1.9 in the chapter →
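The prediction arithmetic is just the fitted equation, plus a range check before trusting the answer. A small helper using the slope, intercept, and observed size range reported above:

```python
# Fitted equation and observed size range from the regression above
SLOPE, INTERCEPT = 73.77, 115_017      # $/sq ft, dollars
SIZE_MIN, SIZE_MAX = 1_400, 3_300      # observed data range, sq ft

def predict_price(size_sqft):
    """Return the predicted price and whether it is interpolation."""
    in_range = SIZE_MIN <= size_sqft <= SIZE_MAX
    return INTERCEPT + SLOPE * size_sqft, in_range

price, ok = predict_price(2_000)
print(f"2,000 sq ft -> ${price:,.0f} (within data range: {ok})")

# Marginal effect: +100 sq ft moves the prediction by slope x 100
assert abs(predict_price(2_100)[0] - price - SLOPE * 100) < 1e-6

# 4,000 sq ft is extrapolation -- the linear pattern may not hold out there
print(predict_price(4_000))
```

The 2,000-sq-ft prediction comes out near $262,557, and the in-range flag is False at 4,000 sq ft — the code equivalent of the dashed boundary lines.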

R² — how much variation does the regression explain?

R² is 0.6175 — is that good? And what does "62% of the variation is explained" actually mean in picture form?

R² is the share of total price variation that size can account for. Reading regression output centers on four numbers: the coefficient estimate, the standard error, the t-statistic / p-value, and R². This widget shows R² geometrically: total variation (TSS) splits into what the line explains (ESS, cyan segments) and what it leaves over (RSS, pink segments). R² = ESS / TSS = 1 − RSS / TSS.
What you can do here
  • Click Explained to see only the cyan segments — each prediction's distance from the mean price.
  • Click Residual to see only the pink segments — each actual price's distance from its prediction.
  • Click Scatter + line to see both, superimposed on the data.
Try this
  1. Click Explained. The cyan bars get taller for houses far from the average size — that is exactly the variation the slope is capturing.
  2. Click Residual. Pink bars are what size can't explain — the 38% of price variation driven by location, condition, and everything else we didn't measure.
  3. Eyeball the cyan bars vs. the pink bars. The cyan bars dominate — that is the geometric meaning of R² = 0.62: explained variation outweighs residual variation roughly 62 to 38.

Take-away: R² is a ratio of two sums of squares — ESS over TSS — and you can see it as cyan-vs-pink rather than read it as a number. Read §1.7 in the chapter →
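The decomposition is easy to verify numerically: TSS = Σ(yᵢ − ȳ)², ESS = Σ(ŷᵢ − ȳ)², RSS = Σ(yᵢ − ŷᵢ)², and for OLS with an intercept TSS = ESS + RSS exactly. A sketch on simulated data (not the Davis sample):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(1400, 3300, 29)
y = 115_000 + 74 * x + rng.normal(0, 30_000, 29)   # simulated prices

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x                      # fitted values

tss = np.sum((y - y.mean()) ** 2)       # total variation
ess = np.sum((y_hat - y.mean()) ** 2)   # explained ("cyan") variation
rss = np.sum((y - y_hat) ** 2)          # residual ("pink") variation

print(f"R² = ESS/TSS = {ess / tss:.4f} = 1 - RSS/TSS = {1 - rss / tss:.4f}")
assert np.isclose(tss, ess + rss)       # holds for OLS with an intercept
```

The two expressions for R² agree only because the cross term between fitted values and residuals vanishes under OLS — that is the algebra behind the cyan-vs-pink picture.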

Multiple predictors — association is not causation

If size "explains" 62% of prices, does that mean size causes higher prices? And what happens when we try bedrooms, bathrooms, lot size, or age instead?

A high R² with one predictor never proves causation. Regression results must be read with caution: association does not imply causation, omitted variables can bias the slope, and predictions should not extrapolate beyond the data. Five regressions on the same 29 houses produce five different slopes and five different R²s — none of them rule out a lurking variable (location, condition, school district) driving both the predictor and the price.
What you can do here
  • Pick a predictor — size, bedrooms, bathrooms, lot size, or age.
  • Watch the slope, intercept, SE(slope), and R² update in the stat cards.
  • Read the callout — it compares each fit back to the baseline size regression.
Try this
  1. Start with Size (R² ≈ 62%) and switch to Bedrooms. R² drops sharply — bedrooms and size are correlated, so bedrooms partially proxies for size but carries less information on its own.
  2. Switch to Age. The slope is negative: older houses sell for less on average. "All else equal" is the trap — age may also proxy for neighborhood vintage or condition, which you haven't controlled for.
  3. Cycle through all five predictors. Five regressions, five different stories about price — until you can control for confounders in a multiple regression, none of them is a causal story.

Take-away: Switching predictors gives you five different stories about price — until you control for confounders, none of them is a causal story. Read §1.9 in the chapter →
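Omitted-variable bias is easy to see in a toy simulation (entirely made up, not the Davis data): let an unobserved "location quality" drive both size and price, with no direct effect of size at all — the simple regression still finds a positive slope.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
location = rng.normal(size=n)               # unobserved confounder
size = location + rng.normal(size=n)        # bigger houses sit in better locations
price = 5 * location + rng.normal(size=n)   # price depends ONLY on location

# Simple regression of price on size: slope = cov(x, y) / var(x)
slope = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
print(f"estimated slope = {slope:.2f}")     # near 2.5, though the true size effect is 0
```

The regression attributes location's effect to size because size is the only variable it can see — exactly the lurking-variable trap the widget warns about.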

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 1 CHEAT SHEET: Analysis of Economics Data
# =============================================================================

# --- Libraries ---
import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # creating plots and visualizations
from statsmodels.formula.api import ols   # OLS regression with R-style formulas

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files (pandas also supports CSV, Excel, etc.)
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")

# =============================================================================
# STEP 2: Descriptive statistics — summarize before modeling
# =============================================================================
# .head() shows the first rows; .describe() gives mean, std, min, quartiles, max
print(data_house[['price', 'size']].describe().round(2))

# =============================================================================
# STEP 3: Scatter plot — always visualize before fitting a regression
# =============================================================================
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=50, alpha=0.7)
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title('House Price vs Size')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: OLS regression — fit the model
# =============================================================================
# Formula syntax: 'y ~ x' regresses y on x (intercept included automatically)
# IMPORTANT: .fit() estimates the model — without it, nothing is computed!
model = ols('price ~ size', data=data_house).fit()

# Extract key results
slope     = model.params['size']       # marginal effect: $/sq ft
intercept = model.params['Intercept']  # predicted price when size = 0
r_squared = model.rsquared             # proportion of variation explained

print(f"Estimated equation: price = {intercept:,.0f} + {slope:.2f} × size")
print(f"Interpretation: each additional sq ft is associated with ${slope:,.2f} higher price")
print(f"R-squared: {r_squared:.4f} ({r_squared*100:.1f}% of variation explained)")

# Full regression table (coefficients, std errors, t-stats, p-values, R²)
# Wrapped in print() so it displays even when this isn't the cell's last line
print(model.summary())

# =============================================================================
# STEP 5: Scatter plot with fitted regression line and R²
# =============================================================================
# model.fittedvalues contains the predicted y-values from the estimated equation
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=50, alpha=0.7, label='Actual prices')
ax.plot(data_house['size'], model.fittedvalues, color='red', linewidth=2, label='Fitted line')
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title(f'OLS Regression: price = {intercept:,.0f} + {slope:.2f} × size  (R² = {r_squared:.2%})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 6: Compare predictors — association is NOT causation
# =============================================================================
# Running separate regressions with different x-variables shows that each tells
# a different story. High R² does not prove causation — omitted variables
# (location, condition, school district) can bias any single-variable slope.
predictors = {
    'size':      'Size (sq ft)',
    'bedrooms':  'Bedrooms',
    'bathrooms': 'Bathrooms',
    'lotsize':   'Lot size',
    'age':       'Age (years)',
}

print(f"{'Predictor':<18} {'Slope':>10} {'R²':>8}")
print("-" * 38)
for var, label in predictors.items():
    m = ols(f'price ~ {var}', data=data_house).fit()
    print(f"{label:<18} {m.params[var]:>10.2f} {m.rsquared:>8.4f}")

Open empty Colab notebook →