Chapter 05 of 18 · Interactive Dashboard

Bivariate Data Summary

Slide, toggle, and compare to build intuition for scatterplots, correlation, regression, and R-squared — using the same house-price dataset as the chapter.

Bivariate summary statistics

Before you compare two variables, it pays to know each one first. How big are the typical values? How much do they vary? And do the two series even move together?

Summary statistics describe each variable before you examine their relationship. For bivariate analysis, compute the mean, median, standard deviation, minimum, and maximum of both variables. Comparing means and medians reveals skewness; standard deviations indicate variability. These univariate summaries give essential context for interpreting correlation and regression.
What you can do here
  • Pick a variable pair — each option compares price to one predictor.
  • Read the two stats grids side by side — one for price (Y), one for the predictor (X).
  • Watch the callout — it reports the covariance and the correlation r, letting you see a unit-dependent number next to a unit-free one.
Try this
  1. Start on price vs size. The size CV (0.21) is larger than the price CV (0.15). Relative to their means, house sizes vary more than house prices.
  2. Switch to price vs age. The covariance is nearly zero and r is close to zero. These two series barely co-move — age carries almost no information about price.
  3. Switch to price vs bedrooms. The mean (3.79) sits just under the median (4). Discrete counts rarely skew, so the mean–median gap here is tiny.

Take-away: when two variables have similar CVs they vary by a similar relative amount, but the sign of their covariance tells you whether they move together or apart. Read §5.1 in the chapter →
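The same univariate-then-bivariate workflow takes only a few lines of NumPy. The numbers below are made-up stand-ins, not the chapter's 29 house sales, so the printed CVs, covariance, and r will differ from the dashboard's.

```python
import numpy as np

# Hypothetical mini-sample, NOT the chapter's actual house data
price = np.array([250_000, 310_000, 199_000, 275_000, 330_000], dtype=float)
size = np.array([1_800, 2_400, 1_300, 2_000, 2_600], dtype=float)

def cv(x):
    """Coefficient of variation: standard deviation relative to the mean (unit-free)."""
    return np.std(x, ddof=1) / np.mean(x)

# Covariance is unit-dependent (dollars times square feet here)...
cov_xy = np.cov(price, size, ddof=1)[0, 1]
# ...while dividing by both standard deviations rescales it into unit-free r
r = cov_xy / (np.std(price, ddof=1) * np.std(size, ddof=1))

print(f"CV(price) = {cv(price):.2f}   CV(size) = {cv(size):.2f}")
print(f"cov(price, size) = {cov_xy:,.0f}   r = {r:.3f}")
```

Because the CV and r are both unit-free, they would come out identical if price were measured in thousands of dollars instead.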

Scatterplot & correlation coefficient

Two columns of numbers hide their story. A scatterplot shows it — direction, tightness, and outliers in a single picture. And r gives that picture one number between −1 and +1.

Scatterplots reveal direction, strength, form, and outliers; correlation summarises them in one number. A scatterplot provides visual evidence of a relationship between two continuous variables — positive or negative, tight or loose, linear or curved, with or without outliers. The correlation coefficient r is a scale-free measure of linear association that ranges from −1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). It is unit-free, symmetric in x and y, and only captures linear co-movement.
What you can do here
  • Pick an X variable — price is always on the Y axis.
  • Read r and the strength label — see how one number summarises the cloud.
  • Hover any point — inspect the observation that sits furthest from the trend.
Try this
  1. Stay on size. r ≈ 0.79 and the cloud forms a clear upward band. Size is the strongest single predictor of price in this sample.
  2. Switch to bedrooms. r drops to about 0.43 and the scatter loosens visibly. Bedrooms still predict price, but with much more uncertainty.
  3. Switch to age. r sits near −0.07 and the cloud looks random. For this sample, age is essentially uninformative about price.
  4. Hover the point furthest from the trend. Check its size, bedrooms, and price. With only n = 29 observations, most "outliers" are just natural variation rather than true anomalies.

Take-away: a scatterplot tells you the shape of a relationship; r gives you its strength and direction in one unit-free number. Read §5.3 in the chapter →
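The r that the panel reports can be computed from scratch. A minimal sketch on toy data; the helper names are mine, and the verbal cutoffs (strong above 0.7, moderate above 0.4) mirror the ones used in the cheat sheet at the end of this page.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-movement of deviations, scaled into [-1, +1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

def strength(r):
    """Verbal label for |r|, matching the cheat sheet's cutoffs."""
    a = abs(r)
    return "Strong" if a > 0.7 else "Moderate" if a > 0.4 else "Weak"

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy predictor
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # roughly 2x, so r should land near +1
r = pearson_r(x, y)
print(f"r = {r:.3f} ({strength(r)})")
```

Note that pearson_r is unit-free: rescaling x or y by any positive constant leaves the result unchanged.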

What does a given correlation look like?

What does r = 0.5 actually look like? The number is abstract until you see 30 points scattered under it. Slide the target and let your eye calibrate.

The correlation coefficient is a scale-free measure of linear association. It ranges from −1 (perfect negative) to +1 (perfect positive), with 0 meaning no linear relationship. Correlation is unit-free, symmetric in x and y, and measures only linear co-movement — a curved relationship can have r = 0 and still be strongly deterministic.
What you can do here
  • Drag the target-r slider from −1 through 0 to +1.
  • Compare target vs actual r — the simulated sample rarely lands exactly on target.
  • Watch the cloud change shape — tight bands mean strong correlation; round blobs mean none.
Try this
  1. Set target r to 0.8 (close to the house data's 0.79). The cloud forms a tight upward band. This is what "strong positive correlation" looks like.
  2. Slide to 0.0. The cloud becomes round with no direction. Knowing x tells you nothing about y when r = 0.
  3. Slide to −0.5. A downward tilt is visible but loose. Moderate correlation is obvious on inspection, but the scatter is still wide.
  4. Slide to +1.0. Every point lands on a single line. Real data never looks this clean — any real-world r of 1 is a warning sign.

Take-away: calibrate your eye — a given r always implies the same kind of scatter, independent of units. Read §5.4 in the chapter →
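A simulator like this panel's can be imitated with a standard trick: mix a predictor with independent noise so that the population correlation equals the target. This is a sketch under that assumption (the function name and seed are mine, and the panel may generate its points differently); it also shows why the actual sample r wobbles around the slider value.

```python
import numpy as np

def simulate_with_target_r(target_r, n=30, seed=0):
    """Draw n (x, y) pairs whose *population* correlation is target_r.

    y = r*x + sqrt(1 - r^2)*noise has correlation r with x when x and
    noise are independent standard normals; the *sample* r still varies."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    y = target_r * x + np.sqrt(1.0 - target_r ** 2) * noise
    return x, y

for target in (0.8, 0.0, -0.5):
    x, y = simulate_with_target_r(target)
    print(f"target r = {target:+.1f}   actual r = {np.corrcoef(x, y)[0, 1]:+.3f}")
```

With n = 30 the gap between target and actual is visible on every draw; increase n and the sample r converges to the target.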

OLS regression line

A cloud of points suggests a trend — but which straight line best captures it? OLS answers: the one that makes the sum of squared vertical errors as small as possible.

Ordinary Least Squares picks the line that minimises the sum of squared residuals. This yields closed-form expressions for the slope b2 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and the intercept b1 = ȳ − b2 · x̄, where x̄ and ȳ are the sample means. The slope equals the covariance of x and y divided by the variance of x.
What you can do here
  • Pick an X variable — price is always on the Y axis.
  • Slide "Predict at X" — read the fitted value for any x.
  • Toggle "Show residuals" on — see the vertical distance OLS is minimising for every observation.
Try this
  1. Stay on size. Read the equation: price = $115,017 + $73.77 × size. Slide to 2,000 sqft. The fitted price is $262,559 — each extra square foot adds about $73.77.
  2. Toggle "Show residuals" on. Vertical lines appear from every point to the line. OLS is the unique line that makes the sum of the squares of those lengths as small as possible.
  3. Switch X to bedrooms. The slope jumps to about $23,667 per bedroom but R² drops near 0.18. A steeper slope is not a better fit — spread around the line matters more.
  4. Switch X to age. The slope is nearly flat and R² ≈ 0.005. When a predictor carries no information, OLS simply returns a near-horizontal line at mean price.

Take-away: OLS converts a cloud of points into two numbers — an intercept and a slope — by minimising squared prediction errors. Read §5.5 in the chapter →
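The closed-form expressions above translate directly into code. The toy numbers here only illustrate the mechanics; they are not the chapter's sample, so the fitted coefficients will not match $115,017 and $73.77.

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS: slope from deviations, intercept from the means."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    slope = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    intercept = ybar - slope * xbar
    return intercept, slope

size = np.array([1200, 1500, 1800, 2100, 2400], dtype=float)   # toy data
price = np.array([180e3, 210e3, 235e3, 255e3, 290e3])
b1, b2 = ols_fit(size, price)

# Equivalent formulation: slope = cov(x, y) / var(x)
slope_alt = np.cov(size, price, ddof=1)[0, 1] / np.var(size, ddof=1)
print(f"price = {b1:,.0f} + {b2:.2f} * size   (cov/var gives {slope_alt:.2f})")
```

The cov/var form makes the link to the previous sections explicit: the slope is just the covariance rescaled by the predictor's own variance.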

R-squared decomposition — where does the variation go?

Your regression line explains some of the variation in price. But how much — and how much is left over?

R² is the fraction of variation in y explained by the regression on x. It ranges from 0 (no explanatory power) to 1 (perfect fit). For bivariate regression, R² equals the square of the correlation coefficient, so R² = r². An R² of 0.62 means 62% of the variation in house price is explained by variation in size; the remaining 38% is due to other factors.
What you can do here
  • Toggle "All" β€” see the full decomposition side by side.
  • Toggle "Total (TSS)" β€” distance from each price to the sample mean.
  • Toggle "Explained (ESS)" β€” distance from each fitted price to the mean.
  • Toggle "Residual (RSS)" β€” distance from each fitted price to the actual price.
Try this
  1. Start in "All" view. ESS (cyan) is 61.7% of TSS. That percentage is R² β€” the share of variation the line captures.
  2. Switch to "Total." Purple segments show how far each price sits from the mean of $253,910. This is the variation the model is trying to explain.
  3. Switch to "Explained." Cyan segments show how far each predicted price sits from the mean. This is the variation the line accounts for.
  4. Switch to "Residual." Pink segments show prediction errors. Smaller pink totals mean a tighter fit; R² = 1 βˆ’ RSS/TSS.

Take-away: R² is not about steepness — it is about how tightly the cloud hugs the line. Read §5.6 in the chapter →
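The decomposition behind the toggles can be verified numerically. A small sketch on synthetic data (invented sizes and prices, not the chapter's sample); the identities TSS = ESS + RSS and R² = r² hold for any bivariate OLS fit with an intercept.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(1000, 3000, 50)                   # synthetic "sizes"
y = 115_000 + 74 * x + rng.normal(0, 25_000, 50)  # synthetic "prices"

# Fit OLS via the closed-form slope and intercept
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x

tss = np.sum((y - y.mean()) ** 2)        # total variation around the mean
ess = np.sum((fitted - y.mean()) ** 2)   # variation the line accounts for
rss = np.sum((y - fitted) ** 2)          # leftover prediction errors

r = np.corrcoef(x, y)[0, 1]
print(f"TSS = ESS + RSS? {np.isclose(tss, ess + rss)}")
print(f"ESS/TSS = {ess/tss:.4f}   1 - RSS/TSS = {1 - rss/tss:.4f}   r^2 = {r**2:.4f}")
```

All three printed ratios agree: "explained share," "one minus residual share," and squared correlation are the same number.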

Regression asymmetry — Y~X ≠ X~Y

If size caused price, regressing each on the other would give reciprocal slopes. It doesn't. That gap is the numerical signature of association, not causation.

Regression measures association, not causation. A regression coefficient shows how much y changes when x changes, but does not prove x causes y. Causation requires additional assumptions, an experimental design, or advanced econometric techniques (Chapter 17). Regression is directional and asymmetric: regressing y on x yields a different slope than regressing x on y. Correlation, by contrast, is symmetric.
What you can do here
  • Toggle "price ~ size" β€” the default regression of price on size.
  • Toggle "size ~ price" β€” swap the roles of x and y.
  • Toggle "Both" β€” overlay the two fitted lines to see them cross at the mean point.
Try this
  1. Start with "price ~ size." The slope is $73.77 per sqft. If regression were symmetric, the reverse slope would be 1/73.77 ≈ 0.01356 sqft per dollar. It will not be — and that is the whole point.
  2. Switch to "size ~ price." The actual slope is 0.00837 sqft per dollar, not 0.01356. The two regressions answer different questions and give different answers.
  3. Toggle "Both." The two lines intersect at (1,883 sqft, $253,910) but diverge everywhere else. They agree only at the joint mean.
  4. Read the R² values. Both regressions share R² = 0.6175. Correlation is symmetric; regression is not — a lesson that will matter all the way through Chapter 17.

Take-away: the asymmetry of the two slopes is the numerical fingerprint of association, not causation. Read §5.10 in the chapter →
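You can check the asymmetry on any dataset: the y-on-x slope times the x-on-y slope equals r², not 1, so the two lines coincide only when |r| = 1. A sketch on synthetic data (the helper name is mine):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(200)
y = 0.8 * x + 0.6 * rng.standard_normal(200)   # correlated toy series

def ols_slope(u, v):
    """Slope from regressing v on u: cov(u, v) / var(u)."""
    return np.cov(u, v, ddof=1)[0, 1] / np.var(u, ddof=1)

b_yx = ols_slope(x, y)    # y ~ x
b_xy = ols_slope(y, x)    # x ~ y: NOT the reciprocal of b_yx
r = np.corrcoef(x, y)[0, 1]

print(f"y~x slope = {b_yx:.4f}   x~y slope = {b_xy:.4f}   1/b_yx = {1/b_yx:.4f}")
print(f"b_yx * b_xy = {b_yx * b_xy:.4f} = r^2 = {r**2:.4f}")
```

This also explains why the two regressions share the same R² in the panel: both slopes multiply out to the same r².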

Parametric vs nonparametric regression

Is the price–size relationship really a straight line? The only way to know is to let the data try a curve and see whether it picks one.

Parametric regression assumes a specific functional form; nonparametric regression lets the shape emerge from the data. LOWESS fits many small weighted regressions at each point, stitching them into a smooth curve whose flexibility is set by a bandwidth (the fraction of nearby points used at each location). Kernel smoothing is a weighted moving average that does the same job from a different angle. When the flexible curve agrees with the straight OLS line, the linearity assumption is validated.
What you can do here
  • Toggle OLS on/off — the parametric straight-line benchmark.
  • Toggle LOWESS on/off — the flexible locally-weighted alternative.
  • Drag the LOWESS bandwidth from 0.30 (very local, very wiggly) to 1.00 (global, nearly straight).
  • Toggle kernel smoothing — a second nonparametric check using a different weighting scheme.
Try this
  1. Keep OLS and LOWESS on (frac = 0.65). The two curves trace nearly the same path. The linear assumption holds for this sample.
  2. Drag bandwidth down to 0.30. LOWESS becomes wiggly, chasing individual points. With n = 29, a small bandwidth overfits noise.
  3. Push bandwidth up to 1.00. LOWESS flattens into something almost identical to OLS. When a flexible method converges to the rigid one, parsimony wins.
  4. Turn on kernel smoothing. A different nonparametric estimator — same verdict. The price–size relationship is genuinely linear, not a hidden curve.

Take-away: nonparametric curves are a sanity check — when they echo the OLS line, the straight-line model is trustworthy. Read §5.11 in the chapter →
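Kernel smoothing is simple enough to write by hand. This is a minimal Nadaraya-Watson sketch, not statsmodels' LOWESS (which additionally fits local regressions and downweights outliers); the data and bandwidths are invented.

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth):
    """Nadaraya-Watson estimator: at each grid point, a Gaussian-weighted
    moving average of the y values, with nearer x's weighted more heavily."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    smoothed = np.empty(len(grid))
    for i, g in enumerate(grid):
        w = np.exp(-0.5 * ((x - g) / bandwidth) ** 2)
        smoothed[i] = np.sum(w * y) / np.sum(w)
    return smoothed

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1000, 3000, 80))          # fake "sizes"
y = 115_000 + 74 * x + rng.normal(0, 20_000, 80)  # linear truth + noise
grid = np.linspace(1200, 2800, 9)

# A small bandwidth chases noise; a large one flattens toward a line
for bw in (100, 400):
    print(f"bw={bw}:", np.round(kernel_smooth(x, y, grid, bw) / 1000).astype(int))
```

Since the simulated truth here is linear, both bandwidths produce curves that climb steadily, which is exactly the "flexible method agrees with the straight line" verdict the panel demonstrates.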

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 5 CHEAT SHEET: Bivariate Data Summary
# =============================================================================

# --- Libraries ---
import pandas as pd                                         # data loading and manipulation
import matplotlib.pyplot as plt                              # creating plots and visualizations
from statsmodels.formula.api import ols                      # OLS regression with R-style formulas
from statsmodels.nonparametric.smoothers_lowess import lowess  # LOWESS nonparametric smoothing

# =============================================================================
# STEP 1: Load data directly from a URL
# =============================================================================
# pd.read_stata() reads Stata .dta files — the dataset has 29 house sales
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")

# =============================================================================
# STEP 2: Descriptive statistics — summarize each variable before comparing
# =============================================================================
# .describe() gives mean, std, min, quartiles, max for both variables
print(data_house[['price', 'size']].describe().round(2))

# =============================================================================
# STEP 3: Scatter plot — visualize the relationship before quantifying it
# =============================================================================
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=60, alpha=0.7)
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title('House Price vs Size')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 4: Correlation coefficient — one number for direction and strength
# =============================================================================
# .corr() computes the Pearson correlation matrix; r is unit-free and symmetric
corr_matrix = data_house[['price', 'size']].corr()
r = corr_matrix.loc['price', 'size']

print(f"Correlation coefficient: r = {r:.4f}")
print(f"Strength: {'Strong' if abs(r) > 0.7 else 'Moderate' if abs(r) > 0.4 else 'Weak'}")
print(f"rΒ² = {r**2:.4f} ({r**2*100:.1f}% of variation shared)")

# =============================================================================
# STEP 5: OLS regression — fit the best-fitting line
# =============================================================================
# Formula syntax: 'y ~ x' regresses y on x (intercept included automatically)
# IMPORTANT: .fit() estimates the model — without it, nothing is computed!
model = ols('price ~ size', data=data_house).fit()

slope     = model.params['size']        # marginal effect: $/sq ft
intercept = model.params['Intercept']   # predicted price when size = 0
r_squared = model.rsquared              # proportion of variation explained

print(f"Estimated equation: price = {intercept:,.0f} + {slope:.2f} Γ— size")
print(f"Interpretation: each additional sq ft is associated with ${slope:,.2f} higher price")
print(f"R-squared: {r_squared:.4f} ({r_squared*100:.1f}% of variation explained)")

# Full regression table (coefficients, std errors, t-stats, p-values, R²);
# print() ensures the table shows even when this line is not last in the cell
print(model.summary())

# =============================================================================
# STEP 6: Scatter plot with fitted line and R² — visualize model fit
# =============================================================================
# model.fittedvalues contains the predicted y-values from the estimated equation
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=60, alpha=0.7, label='Actual prices')
ax.plot(data_house['size'], model.fittedvalues, color='red', linewidth=2, label='Fitted line')
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title(f'OLS: price = {intercept:,.0f} + {slope:.2f} × size  (R² = {r_squared:.2%})')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# =============================================================================
# STEP 7: Reverse regression — association is NOT causation
# =============================================================================
# If regression = causation, the reverse slope would be 1/slope. It is not.
reverse_model = ols('size ~ price', data=data_house).fit()

print(f"price ~ size  slope: {slope:.4f}")
print(f"size ~ price  slope: {reverse_model.params['price']:.6f}")
print(f"1 / original slope:  {1/slope:.6f}")
print(f"Reciprocals match?   {1/slope:.6f} β‰  {reverse_model.params['price']:.6f}")
print("β†’ Regression is asymmetric: association, not causation!")

# =============================================================================
# STEP 8: Nonparametric regression — check the linearity assumption
# =============================================================================
# LOWESS fits weighted local regressions; if the curve tracks the OLS line,
# the linear assumption is validated for this dataset
lowess_result = lowess(data_house['price'], data_house['size'], frac=0.6)

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data_house['size'], data_house['price'], s=60, alpha=0.6, label='Actual data')
ax.plot(data_house['size'], model.fittedvalues, color='red',
        linewidth=2, label='OLS (parametric)')
ax.plot(lowess_result[:, 0], lowess_result[:, 1], color='green',
        linewidth=2, linestyle='--', label='LOWESS (nonparametric)')
ax.set_xlabel('House Size (square feet)')
ax.set_ylabel('House Sale Price (dollars)')
ax.set_title('Parametric vs Nonparametric Regression')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Open empty Colab notebook →