Chapter 05 of 18 · Interactive Dashboard

Bivariate Data Summary

Slide, toggle, and compare to build intuition for scatterplots, correlation, regression, and R-squared — using the same house-price dataset as the chapter.

Bivariate summary statistics

Before you compare two variables, it pays to know each one first. How big are the typical values? How much do they vary? And do the two series even move together?

Summary statistics describe each variable before you examine their relationship. For bivariate analysis, compute the mean, median, standard deviation, minimum, and maximum of both variables. Comparing means and medians reveals skewness; standard deviations indicate variability. These univariate summaries give essential context for interpreting correlation and regression.
What you can do here
  • Pick a variable pair — each option compares price to one predictor.
  • Read the two stats grids side by side — one for price (Y), one for the predictor (X).
  • Watch the callout — it reports the covariance and the correlation r, letting you see a unit-dependent number next to a unit-free one.
Try this
  1. Start on price vs size. The size CV (0.21) is larger than the price CV (0.15). Relative to their means, house sizes vary more than house prices.
  2. Switch to price vs age. The covariance is nearly zero and r is close to zero. These two series barely co-move — age carries almost no information about price.
  3. Switch to price vs bedrooms. The mean (3.79) sits just under the median (4). Discrete counts rarely skew, so the mean–median gap here is tiny.

Take-away: when two variables have similar CVs they vary by a similar relative amount, but the sign of their covariance tells you whether they move together or apart. Read §5.1 in the chapter →
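The same univariate-then-bivariate workflow takes only a few lines of NumPy. The numbers below are made-up stand-ins, not the chapter's 29 house sales, so the printed CVs, covariance, and r will differ from the dashboard's.

```python
import numpy as np

# Hypothetical mini-sample, NOT the chapter's actual house data
price = np.array([250_000, 310_000, 199_000, 275_000, 330_000], dtype=float)
size = np.array([1_800, 2_400, 1_300, 2_000, 2_600], dtype=float)

def cv(x):
    """Coefficient of variation: standard deviation relative to the mean (unit-free)."""
    return np.std(x, ddof=1) / np.mean(x)

# Covariance is unit-dependent (dollars times square feet here)...
cov_xy = np.cov(price, size, ddof=1)[0, 1]
# ...while dividing by both standard deviations rescales it into unit-free r
r = cov_xy / (np.std(price, ddof=1) * np.std(size, ddof=1))

print(f"CV(price) = {cv(price):.2f}   CV(size) = {cv(size):.2f}")
print(f"cov(price, size) = {cov_xy:,.0f}   r = {r:.3f}")
```

Because the CV and r are both unit-free, they would come out identical if price were measured in thousands of dollars instead.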

Scatterplot & correlation coefficient

Two columns of numbers hide their story. A scatterplot shows it — direction, tightness, and outliers in a single picture. And r gives that picture one number between −1 and +1.

Scatterplots reveal direction, strength, form, and outliers; correlation summarises them in one number. A scatterplot provides visual evidence of a relationship between two continuous variables — positive or negative, tight or loose, linear or curved, with or without outliers. The correlation coefficient r is a scale-free measure of linear association that ranges from −1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). It is unit-free, symmetric in x and y, and only captures linear co-movement.
What you can do here
  • Pick an X variable — price is always on the Y axis.
  • Read r and the strength label — see how one number summarises the cloud.
  • Hover any point — inspect the observation that sits furthest from the trend.
Try this
  1. Stay on size. r ≈ 0.79 and the cloud forms a clear upward band. Size is the strongest single predictor of price in this sample.
  2. Switch to bedrooms. r drops to about 0.43 and the scatter loosens visibly. Bedrooms still predict price, but with much more uncertainty.
  3. Switch to age. r sits near −0.07 and the cloud looks random. For this sample, age is essentially uninformative about price.
  4. Hover the point furthest from the trend. Check its size, bedrooms, and price. With only n = 29 observations, most "outliers" are just natural variation rather than true anomalies.

Take-away: a scatterplot tells you the shape of a relationship; r gives you its strength and direction in one unit-free number. Read §5.3 in the chapter →
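The r that the panel reports can be computed from scratch. A minimal sketch on toy data; the helper names are mine, and the verbal cutoffs (strong above 0.7, moderate above 0.4) mirror the ones used in the cheat sheet at the end of this page.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-movement of deviations, scaled into [-1, +1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

def strength(r):
    """Verbal label for |r|, matching the cheat sheet's cutoffs."""
    a = abs(r)
    return "Strong" if a > 0.7 else "Moderate" if a > 0.4 else "Weak"

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy predictor
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # roughly 2x, so r should land near +1
r = pearson_r(x, y)
print(f"r = {r:.3f} ({strength(r)})")
```

Note that pearson_r is unit-free: rescaling x or y by any positive constant leaves the result unchanged.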

What does a given correlation look like?

What does r = 0.5 actually look like? The number is abstract until you see 30 points scattered under it. Slide the target and let your eye calibrate.

The correlation coefficient is a scale-free measure of linear association. It ranges from −1 (perfect negative) to +1 (perfect positive), with 0 meaning no linear relationship. Correlation is unit-free, symmetric in x and y, and measures only linear co-movement — a curved relationship can have r = 0 and still be strongly deterministic.
What you can do here
  • Drag the target-r slider from −1 through 0 to +1.
  • Compare target vs actual r — the simulated sample rarely lands exactly on target.
  • Watch the cloud change shape — tight bands mean strong correlation; round blobs mean none.
Try this
  1. Set target r to 0.8 (close to the house data's 0.79). The cloud forms a tight upward band. This is what "strong positive correlation" looks like.
  2. Slide to 0.0. The cloud becomes round with no direction. Knowing x tells you nothing about y when r = 0.
  3. Slide to −0.5. A downward tilt is visible but loose. Moderate correlation is obvious on inspection, but the scatter is still wide.
  4. Slide to +1.0. Every point lands on a single line. Real data never looks this clean — any real-world r of 1 is a warning sign.

Take-away: calibrate your eye — a given r always implies the same kind of scatter, independent of units. Read §5.4 in the chapter →
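A simulator like this panel's can be imitated with a standard trick: mix a predictor with independent noise so that the population correlation equals the target. This is a sketch under that assumption (the function name and seed are mine, and the panel may generate its points differently); it also shows why the actual sample r wobbles around the slider value.

```python
import numpy as np

def simulate_with_target_r(target_r, n=30, seed=0):
    """Draw n (x, y) pairs whose *population* correlation is target_r.

    y = r*x + sqrt(1 - r^2)*noise has correlation r with x when x and
    noise are independent standard normals; the *sample* r still varies."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    y = target_r * x + np.sqrt(1.0 - target_r ** 2) * noise
    return x, y

for target in (0.8, 0.0, -0.5):
    x, y = simulate_with_target_r(target)
    print(f"target r = {target:+.1f}   actual r = {np.corrcoef(x, y)[0, 1]:+.3f}")
```

With n = 30 the gap between target and actual is visible on every draw; increase n and the sample r converges to the target.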

OLS regression line

A cloud of points suggests a trend — but which straight line best captures it? OLS answers: the one that makes the sum of squared vertical errors as small as possible.

Ordinary Least Squares picks the line that minimises the sum of squared residuals. This yields closed-form expressions for the slope b2 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and the intercept b1 = ȳ − b2 · x̄, where x̄ and ȳ are the sample means. The slope equals the covariance of x and y divided by the variance of x.
What you can do here
  • Pick an X variable — price is always on the Y axis.
  • Slide "Predict at X" — read the fitted value for any x.
  • Toggle "Show residuals" on — see the vertical distance OLS is minimising for every observation.
Try this
  1. Stay on size. Read the equation: price = $115,017 + $73.77 × size. Slide to 2,000 sqft. The fitted price is $262,559 — each extra square foot adds about $73.77.
  2. Toggle "Show residuals" on. Vertical lines appear from every point to the line. OLS is the unique line that makes the sum of the squares of those lengths as small as possible.
  3. Switch X to bedrooms. The slope jumps to about $23,667 per bedroom but R² drops near 0.18. A steeper slope is not a better fit — spread around the line matters more.
  4. Switch X to age. The slope is nearly flat and R² ≈ 0.005. When a predictor carries no information, OLS simply returns a near-horizontal line at mean price.

Take-away: OLS converts a cloud of points into two numbers — an intercept and a slope — by minimising squared prediction errors. Read §5.5 in the chapter →
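The closed-form expressions above translate directly into code. The toy numbers here only illustrate the mechanics; they are not the chapter's sample, so the fitted coefficients will not match $115,017 and $73.77.

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS: slope from deviations, intercept from the means."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    intercept = ybar - slope * xbar
    return intercept, slope

size = np.array([1200, 1500, 1800, 2100, 2400], dtype=float)   # toy data
price = np.array([180e3, 210e3, 235e3, 255e3, 290e3])
b1, b2 = ols_fit(size, price)

# Equivalent formulation: slope = cov(x, y) / var(x)
slope_alt = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
print(f"price = {b1:,.0f} + {b2:.2f} * size   (cov/var gives {slope_alt:.2f})")
```

The cov/var form makes the link to the previous sections explicit: the slope is just the covariance rescaled by the predictor's own variance.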

R-squared decomposition — where does the variation go?

Your regression line explains some of the variation in price. But how much — and how much is left over?

R² is the fraction of variation in y explained by the regression on x. It ranges from 0 (no explanatory power) to 1 (perfect fit). For bivariate regression, R² equals the square of the correlation coefficient, so R² = r². An R² of 0.62 means 62% of the variation in house price is explained by variation in size; the remaining 38% is due to other factors.
What you can do here
  • Toggle "All" β€” see the full decomposition side by side.
  • Toggle "Total (TSS)" β€” distance from each price to the sample mean.
  • Toggle "Explained (ESS)" β€” distance from each fitted price to the mean.
  • Toggle "Residual (RSS)" β€” distance from each fitted price to the actual price.
Try this
  1. Start in "All" view. ESS (cyan) is 61.7% of TSS. That percentage is R² β€” the share of variation the line captures.
  2. Switch to "Total." Purple segments show how far each price sits from the mean of $253,910. This is the variation the model is trying to explain.
  3. Switch to "Explained." Cyan segments show how far each predicted price sits from the mean. This is the variation the line accounts for.
  4. Switch to "Residual." Pink segments show prediction errors. Smaller pink totals mean a tighter fit; R² = 1 βˆ’ RSS/TSS.

Take-away: R² is not about steepness — it is about how tightly the cloud hugs the line. Read §5.6 in the chapter →
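The decomposition behind the toggles can be verified numerically. A small sketch on synthetic data (invented sizes and prices, not the chapter's sample); the identities TSS = ESS + RSS and R² = r² hold for any bivariate OLS fit with an intercept.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(1000, 3000, 50)                   # synthetic "sizes"
y = 115_000 + 74 * x + rng.normal(0, 25_000, 50)  # synthetic "prices"

# Fit OLS via the closed-form slope and intercept
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x

tss = np.sum((y - y.mean()) ** 2)        # total variation around the mean
ess = np.sum((fitted - y.mean()) ** 2)   # variation the line accounts for
rss = np.sum((y - fitted) ** 2)          # leftover prediction errors

r = np.corrcoef(x, y)[0, 1]
print(f"TSS = ESS + RSS? {np.isclose(tss, ess + rss)}")
print(f"ESS/TSS = {ess/tss:.4f}   1 - RSS/TSS = {1 - rss/tss:.4f}   r^2 = {r**2:.4f}")
```

All three printed ratios agree: "explained share," "one minus residual share," and squared correlation are the same number.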

Regression asymmetry — Y~X ≠ X~Y

If size caused price, regressing each on the other would give reciprocal slopes. It doesn't. That gap is the numerical signature of association, not causation.

Regression measures association, not causation. A regression coefficient shows how much y changes when x changes, but does not prove x causes y. Causation requires additional assumptions, an experimental design, or advanced econometric techniques (Chapter 17). Regression is directional and asymmetric: regressing y on x yields a different slope than regressing x on y. Correlation, by contrast, is symmetric.
What you can do here
  • Toggle "price ~ size" β€” the default regression of price on size.
  • Toggle "size ~ price" β€” swap the roles of x and y.
  • Toggle "Both" β€” overlay the two fitted lines to see them cross at the mean point.
Try this
  1. Start with "price ~ size." The slope is $73.77 per sqft. If regression were symmetric, the reverse slope would be 1/73.77 ≈ 0.01356 sqft per dollar. It will not be — and that is the whole point.
  2. Switch to "size ~ price." The actual slope is 0.00837 sqft per dollar, not 0.01356. The two regressions answer different questions and give different answers.
  3. Toggle "Both." The two lines intersect at (1,883 sqft, $253,910) but diverge everywhere else. They agree only at the joint mean.
  4. Read the R² values. Both regressions share R² = 0.6175. Correlation is symmetric; regression is not — a lesson that will matter all the way through Chapter 17.

Take-away: the asymmetry of the two slopes is the numerical fingerprint of association, not causation. Read §5.10 in the chapter →
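You can check the asymmetry on any dataset: the y-on-x slope times the x-on-y slope equals r², not 1, so the two lines coincide only when |r| = 1. A sketch on synthetic data (the helper name is mine):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(200)
y = 0.8 * x + 0.6 * rng.standard_normal(200)   # correlated toy series

def ols_slope(u, v):
    """Slope from regressing v on u: cov(u, v) / var(u)."""
    return np.cov(u, v, ddof=1)[0, 1] / np.var(u, ddof=1)

b_yx = ols_slope(x, y)    # y ~ x
b_xy = ols_slope(y, x)    # x ~ y: NOT the reciprocal of b_yx
r = np.corrcoef(x, y)[0, 1]

print(f"y~x slope = {b_yx:.4f}   x~y slope = {b_xy:.4f}   1/b_yx = {1/b_yx:.4f}")
print(f"b_yx * b_xy = {b_yx * b_xy:.4f} = r^2 = {r**2:.4f}")
```

This also explains why the two regressions share the same R² in the panel: both slopes multiply out to the same r².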

Parametric vs nonparametric regression

Is the price–size relationship really a straight line? The only way to know is to let the data try a curve and see whether it picks one.

Parametric regression assumes a specific functional form; nonparametric regression lets the shape emerge from the data. LOWESS fits many small weighted regressions at each point, stitching them into a smooth curve whose flexibility is set by a bandwidth (the fraction of nearby points used at each location). Kernel smoothing is a weighted moving average that does the same job from a different angle. When the flexible curve agrees with the straight OLS line, the linearity assumption is validated.
What you can do here
  • Toggle OLS on/off — the parametric straight-line benchmark.
  • Toggle LOWESS on/off — the flexible locally-weighted alternative.
  • Drag the LOWESS bandwidth from 0.30 (very local, very wiggly) to 1.00 (global, nearly straight).
  • Toggle kernel smoothing — a second nonparametric check using a different weighting scheme.
Try this
  1. Keep OLS and LOWESS on (frac = 0.65). The two curves trace nearly the same path. The linear assumption holds for this sample.
  2. Drag bandwidth down to 0.30. LOWESS becomes wiggly, chasing individual points. With n = 29, a small bandwidth overfits noise.
  3. Push bandwidth up to 1.00. LOWESS flattens into something almost identical to OLS. When a flexible method converges to the rigid one, parsimony wins.
  4. Turn on kernel smoothing. A different nonparametric estimator — same verdict. The price–size relationship is genuinely linear, not a hidden curve.

Take-away: nonparametric curves are a sanity check — when they echo the OLS line, the straight-line model is trustworthy. Read §5.11 in the chapter →
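Kernel smoothing is simple enough to write by hand. This is a minimal Nadaraya-Watson sketch, not statsmodels' LOWESS (which additionally fits local regressions and downweights outliers); the data and bandwidths are invented.

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimator: at each grid point, a Gaussian-weighted
    moving average of the y values, with nearer x's weighted more heavily."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    smoothed = np.empty(len(grid))
    for i, g in enumerate(grid):
        w = np.exp(-0.5 * ((x - g) / bandwidth) ** 2)
        smoothed[i] = np.sum(w * y) / np.sum(w)
    return smoothed

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1000, 3000, 80))          # fake "sizes"
y = 115_000 + 74 * x + rng.normal(0, 20_000, 80)  # linear truth + noise
grid = np.linspace(1200, 2800, 9)

# A small bandwidth chases noise; a large one flattens toward a line
for bw in (100, 400):
    print(f"bw={bw}:", np.round(kernel_smooth(x, y, grid, bw) / 1000).astype(int))
```

Since the simulated truth here is linear, both bandwidths produce curves that climb steadily, which is exactly the "flexible method agrees with the straight line" verdict the panel demonstrates.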

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 5 CHEAT SHEET: Bivariate Data Summary
# =============================================================================

# --- Libraries ---
import pandas as pd                                         # data loading and manipulation
import matplotlib.pyplot as plt                              # creating plots and visualizations
from statsmodels.formula.api import ols                      # OLS regression with R-style formulas
from statsmodels.nonparametric.smoothers_lowess import lowess  # LOWESS nonparametric smoothing

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files — the dataset has 29 house sales
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")

# =============================================================================
# STEP 2: Descriptive statistics — summarize each variable before comparing
# =============================================================================
# .describe() gives mean, std, min, quartiles, max for both variables
print(data_house[['price', 'size']].describe().round(2))

# =============================================================================
# STEP 3: Scatter plot — visualize the relationship before quantifying it
# =============================================================================
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=60, alpha=0.7)
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title('House Price vs Size')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Correlation coefficient — one number for direction and strength
# =============================================================================
# .corr() computes the Pearson correlation matrix; r is unit-free and symmetric
corr_matrix = data_house[['price', 'size']].corr()
r = corr_matrix.loc['price', 'size']

print(f"Correlation coefficient: r = {r:.4f}")
print(f"Strength: {'Strong' if abs(r) > 0.7 else 'Moderate' if abs(r) > 0.4 else 'Weak'}")
print(f"rΒ² = {r**2:.4f} ({r**2*100:.1f}% of variation shared)")

# =============================================================================
# STEP 5: OLS regression — fit the best-fitting line
# =============================================================================
# Formula syntax: 'y ~ x' regresses y on x (intercept included automatically)
# IMPORTANT: .fit() estimates the model — without it, nothing is computed!
model = ols('price ~ size', data=data_house).fit()

slope     = model.params['size']        # marginal effect: $/sq ft
intercept = model.params['Intercept']   # predicted price when size = 0
r_squared = model.rsquared              # proportion of variation explained

print(f"Estimated equation: price = {intercept:,.0f} + {slope:.2f} Γ— size")
print(f"Interpretation: each additional sq ft is associated with ${slope:,.2f} higher price")
print(f"R-squared: {r_squared:.4f} ({r_squared*100:.1f}% of variation explained)")

# Full regression table (coefficients, std errors, t-stats, p-values, R²);
# print() ensures the table shows even when this line is not last in the cell
print(model.summary())

# =============================================================================
# STEP 6: Scatter plot with fitted line and R² — visualize model fit
# =============================================================================
# model.fittedvalues contains the predicted y-values from the estimated equation
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=60, alpha=0.7, label='Actual prices')
ax.plot(data_house['size'], model.fittedvalues, color='red', linewidth=2, label='Fitted line')
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title(f'OLS: price = {intercept:,.0f} + {slope:.2f} × size  (R² = {r_squared:.2%})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 7: Reverse regression — association is NOT causation
# =============================================================================
# If regression = causation, the reverse slope would be 1/slope. It is not.
reverse_model = ols('size ~ price', data=data_house).fit()

print(f"price ~ size  slope: {slope:.4f}")
print(f"size ~ price  slope: {reverse_model.params['price']:.6f}")
print(f"1 / original slope:  {1/slope:.6f}")
print(f"Reciprocals match?   {1/slope:.6f} β‰  {reverse_model.params['price']:.6f}")
print("β†’ Regression is asymmetric: association, not causation!")

# =============================================================================
# STEP 8: Nonparametric regression — check the linearity assumption
# =============================================================================
# LOWESS fits weighted local regressions; if the curve tracks the OLS line,
# the linear assumption is validated for this dataset
lowess_result = lowess(data_house['price'], data_house['size'], frac=0.6)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=60, alpha=0.6, label='Actual data')
ax.plot(data_house['size'], model.fittedvalues, color='red',
        linewidth=2, label='OLS (parametric)')
ax.plot(lowess_result[:, 0], lowess_result[:, 1], color='green',
        linewidth=2, linestyle='--', label='LOWESS (nonparametric)')
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title('Parametric vs Nonparametric Regression')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →