Chapter 10 of 18 · Interactive Dashboard

Data Summary for Multiple Regression

Toggle variables, compare models, and visualize partial effects — building intuition for how multiple regression separates the contribution of each predictor.

Partial effects vs total effects

Why did the bedrooms coefficient collapse from $23,667 to $1,553 the moment we added size? That drop is where multiple regression earns its keep.

Multiple regression separates each variable's partial effect from the indirect effects flowing through its correlation with other regressors. In bivariate regression, the bedrooms coefficient ($23,667) captures both the direct effect and the indirect effect through size. Adding size to the model drops bedrooms to $1,553 — the partial effect holding size constant. This dramatic change illustrates why controlling for confounders is essential for isolating individual variable effects.
What you can do here
  • Pick a focus variable — the coefficient we want to understand.
  • Pick a control variable — the confounder we want to hold constant.
  • Watch the bivariate-vs-partial comparison for the percentage drop when you add the control.
Try this
  1. Set focus = Bedrooms, control = Size. The coefficient drops from $23,667 to $1,553 — a 93% reduction. Size was doing most of the work the bivariate regression attributed to bedrooms.
  2. Switch focus to Bathrooms, still controlling for Size. The partial effect shrinks again, for the same reason. Bigger houses have more bathrooms; controlling for size strips out that shared variation.
  3. Set focus = Age, control = Size. The bivariate effect was already near zero; controlling barely moves it. When a predictor is uncorrelated with the control, the partial effect equals the bivariate effect.
  4. Keep focus = Bedrooms and switch the control to Bedrooms' weaker confounder (e.g., Lot size). The coefficient drops less. The strength of the drop measures how much the control and the focus overlap.

Take-away: The gap between bivariate and partial effects is exactly the contribution that gets miscounted when you omit a correlated control. Read §10.1 in the chapter →
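The drop can be reproduced in a few lines. Below is a minimal sketch on synthetic stand-in data (not the book's housing file): bedrooms is built to be correlated with size, but only size actually moves price, so the bivariate bedrooms coefficient is large while the partial one collapses toward zero.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: price depends on size only; bedrooms merely tracks size
rng = np.random.default_rng(0)
n = 200
size = rng.normal(2000, 400, n)                          # sq ft
bedrooms = np.round(size / 600 + rng.normal(0, 0.7, n))  # correlated with size
price = 70 * size + rng.normal(0, 20_000, n)             # no direct bedrooms effect

df = pd.DataFrame({'price': price, 'size': size, 'bedrooms': bedrooms})

# Bivariate: total effect (direct + indirect through size)
b_biv = ols('price ~ bedrooms', data=df).fit().params['bedrooms']
# Multiple: partial effect, holding size constant
b_par = ols('price ~ bedrooms + size', data=df).fit().params['bedrooms']

print(f"bivariate bedrooms coefficient: {b_biv:,.0f}")
print(f"partial bedrooms coefficient:   {b_par:,.0f}")
# The bivariate coefficient is large because bedrooms proxies for size;
# the partial coefficient collapses once size is held constant.
```

The cheat sheet at the end of this page runs the same comparison on the actual housing data.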

Correlation matrix heatmap

Before running a six-predictor regression, check the grid. Who's correlated with whom, and how much overlap do your regressors carry?

Pairwise scatterplot and correlation matrices reveal all two-way relationships before any formal modeling. They expose linear associations, nonlinearities, clusters, and outliers — and they flag multicollinearity: if two regressors are tightly correlated (e.g., size and bedrooms, r = 0.52), their individual effects may be hard to separate in a regression, and high bivariate correlations (like bedrooms with price) may shrink once you control for confounders (KC 10.3).
What you can do here
  • Scan the top row or leftmost column — correlations of each variable with price.
  • Check the off-diagonal cells between predictors — high values flag multicollinearity.
  • Hover any cell to read the exact r.
Try this
  1. Find the strongest predictor of price. Size leads at r = 0.79. This is why the size-only model gets most of the R² the full model ever reaches.
  2. Find the size–bedrooms cell. r = 0.52 — a substantial overlap. This is why bedrooms' coefficient shrinks so dramatically once size enters the regression.
  3. Find monthsold–bathrooms. r ≈ −0.39 — negative and moderate. Idiosyncratic to this sample; the two aren't causally linked, just correlated in 29 observations.
  4. Find a pair with near-zero correlation. Those pairs barely interact in a regression. Their coefficients stay stable as you move between bivariate and multiple specs.

Take-away: A correlation matrix is the cheapest insurance against multicollinearity surprises — always look at it before you pick your regressors. Read §10.3 in the chapter →
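Eyeballing the heatmap scales poorly past a handful of variables. Here is a small sketch of doing the same scan programmatically, on synthetic data; the helper name `flag_collinear_pairs` is illustrative, not from the chapter.

```python
import numpy as np
import pandas as pd

def flag_collinear_pairs(df, threshold=0.5):
    """Return (var_i, var_j, r) for every pair with |r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Synthetic data: bedrooms is built to overlap with size; age is independent
rng = np.random.default_rng(1)
size = rng.normal(0, 1, 300)
df = pd.DataFrame({
    'size': size,
    'bedrooms': 0.6 * size + 0.8 * rng.normal(0, 1, 300),
    'age': rng.normal(0, 1, 300),
})
print(flag_collinear_pairs(df, threshold=0.5))
# Only the size-bedrooms pair should be flagged; age stays out of the list.
```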

Multiple regression builder

Build your own model, coefficient by coefficient. Which predictors earn their spot, and which just inflate your standard errors?

Each coefficient bj is the expected change in y when xj increases by one unit, holding all other regressors constant. For the house-price example, a size coefficient of $68.37 means each additional square foot predicts a $68.37 price increase, controlling for bedrooms, bathrooms, lot size, age, and month sold. Statistical significance is assessed via t-statistics and confidence intervals; p < 0.05 is the conventional threshold.
What you can do here
  • Check or uncheck each of the 6 predictors to build any of the 63 possible models.
  • Watch the fit statistics — R², adjusted R², AIC, BIC — for evidence of overfitting.
  • Read the coefficient table for each variable's coefficient, SE, t-stat, p-value, and 95% CI.
Try this
  1. Start with size only (R² = 0.618, adj R² = 0.603). Now check bedrooms. R² barely moves (0.618) and adj R² falls to 0.589. Bedrooms adds essentially no new information once you already have size.
  2. Check all 6 variables. R² rises to 0.651 but adj R² falls to 0.555. Raw R² only goes up; adjusted R² punishes complexity, and here it says the extra variables are noise.
  3. Inspect the p-value column. Only size is significant at 5%; every other p > 0.25. Multicollinearity inflates SEs, hiding whatever signal the other predictors might carry.
  4. Uncheck everything except size. The parsimony principle is vindicated here: with n = 29, the simplest adequate specification wins on every criterion.

Take-away: Every coefficient holds all other regressors fixed — add variables deliberately, and let adjusted R², AIC, and BIC tell you when you've gone too far. Read §10.4 in the chapter →
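The builder's 63 models are just every non-empty subset of six predictors. Below is a sketch of the same exhaustive search on synthetic data with three candidates (7 subsets); swapping in the housing data and its six predictor names would reproduce the widget's full search.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: only x1 actually moves y; x2 and x3 are noise
rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
df['y'] = 2 * df['x1'] + rng.normal(size=n)

candidates = ['x1', 'x2', 'x3']
rows = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        fit = ols('y ~ ' + ' + '.join(subset), data=df).fit()
        rows.append({'model': ' + '.join(subset),
                     'adj_R2': fit.rsquared_adj,
                     'AIC': fit.aic, 'BIC': fit.bic})

results = pd.DataFrame(rows).sort_values('adj_R2', ascending=False)
print(results.to_string(index=False))
# With 3 candidates there are 2**3 - 1 = 7 models; with 6, the widget's 63.
```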

Frisch-Waugh-Lovell theorem

"Holding other variables constant" sounds abstract — until you see it done mechanically. FWL shows you the literal steps that make partial effects partial.

The partial effect of xj in a multiple regression equals the slope from a bivariate regression of y on the residualized xj. Residualize xj by regressing it on all other regressors and keeping the residuals — what's left is the variation in xj that is independent of the other variables. Regressing y on that residualized xj reproduces the multiple-regression coefficient exactly. This is how multiple regression "controls for" covariates.
What you can do here
  • Pick a target variable X whose partial effect you want to isolate.
  • Pick a control variable Z to hold constant.
  • Read the FWL slope — it should match the corresponding multiple-regression coefficient exactly.
Try this
  1. Default: target = size, control = bedrooms. The FWL slope (~72.41) matches the multiple-regression size coefficient exactly. This is not an approximation: the two numbers are the same by construction.
  2. Switch target to bedrooms, control to size. The FWL slope (~$1,553) matches bedrooms' coefficient in price ~ bedrooms + size. Same identity, different focus; the roles of target and control can be swapped freely.
  3. Try target = bathrooms, control = size. The slope reveals bathrooms' partial contribution after size's variation has been residualized away. That residual variation is what a multiple regression actually regresses on; FWL just makes the mechanics visible.
  4. Look at the x-axis ("Residualized X"). The points have been shifted so that all the control's influence is removed. This is the "holding other variables constant" phrase made literal.

Take-away: Multiple regression coefficients are bivariate regressions in disguise — on the parts of each regressor that the other regressors cannot explain. Read §10.5 in the chapter →
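The identity is easy to verify directly. Here is a minimal sketch in the widget's two-variable setting, on synthetic data: residualize x on z, regress y on the residuals, and compare slopes.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: x overlaps with z, and both move y
rng = np.random.default_rng(3)
n = 150
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = 1.5 * x + 2.0 * z + rng.normal(size=n)
df = pd.DataFrame({'y': y, 'x': x, 'z': z})

# Multiple-regression coefficient on x
b_multiple = ols('y ~ x + z', data=df).fit().params['x']

# FWL: residualize x on z, then regress y on the residuals
df['x_resid'] = ols('x ~ z', data=df).fit().resid
b_fwl = ols('y ~ x_resid', data=df).fit().params['x_resid']

print(f"multiple: {b_multiple:.10f}")
print(f"FWL:      {b_fwl:.10f}")
assert abs(b_multiple - b_fwl) < 1e-8  # identical by construction
```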

Model comparison — when more is less

Adding variables always raises R². Does that mean every extra variable improves the model? Adjusted R², AIC, and BIC say no — and they can all move down as you add junk.

Adjusted R², AIC, and BIC penalize complexity so that extra variables pay only when they earn their keep. Adjusted R² divides sums of squares by degrees of freedom — a mild penalty. AIC and BIC impose stronger penalties; BIC's grows with sample size, favoring more parsimonious specifications (KC 10.6). The parsimony principle: prefer simpler models unless extra variables meaningfully improve fit (KC 10.7).
What you can do here
  • Read the four curves — R², adjusted R², AIC, BIC — across models of 1 to 6 predictors.
  • Watch the divergence: R² always rises while adj R² and AIC/BIC can move the other way.
  • Cross-reference with the regression builder above to see which specification each curve prefers.
Try this
  1. Read R² across the models. It climbs from 0.618 (size only) to 0.651 (all six) — a 3.3-point gain for five extra variables. Raw R² is a one-way ratchet.
  2. Read adjusted R² on the same models. It falls from 0.603 to 0.555. Once you penalize degrees of freedom, the extra five variables are net-negative.
  3. Read AIC and BIC. Both rise — worse models. BIC rises more. The BIC penalty grows with sample size; for n = 29 it is especially allergic to overfitting.
  4. Compare the four verdicts. All four criteria agree the best model is size alone. When adj R², AIC, and BIC converge, the decision is easy.

Take-away: R² alone is the wrong criterion for model choice — lean on adjusted R², AIC, and BIC, and let parsimony win ties. Read §10.7 in the chapter →
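The one-way-ratchet behavior of R² is easy to demonstrate. Below is a sketch on synthetic data with the same n = 29: one real predictor plus five pure-noise regressors added one at a time, printing all four criteria at each step.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends on x1 only; x2..x6 are pure noise
rng = np.random.default_rng(4)
n = 29  # same small sample size as the housing data
df = pd.DataFrame({'x1': rng.normal(size=n)})
df['y'] = 2 * df['x1'] + rng.normal(size=n)
for j in range(2, 7):
    df[f'x{j}'] = rng.normal(size=n)

terms, r2_path = [], []
for j in range(1, 7):
    terms.append(f'x{j}')
    fit = ols('y ~ ' + ' + '.join(terms), data=df).fit()
    r2_path.append(fit.rsquared)
    print(f"{j} predictors: R²={fit.rsquared:.3f}  adjR²={fit.rsquared_adj:.3f}  "
          f"AIC={fit.aic:.1f}  BIC={fit.bic:.1f}")
# R² never decreases across these nested models; the penalized criteria can
# and typically do deteriorate as the noise regressors pile up.
```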

Variance inflation factors — detecting multicollinearity

If two regressors carry the same information, OLS can't tell them apart — but it can tell you how bad the overlap is. That number is the variance inflation factor (VIF), the standard first check for multicollinearity.

The Variance Inflation Factor measures how much a coefficient's variance is inflated by correlation with the other regressors. VIFj = 1 / (1 − Rj²), where Rj² is from regressing xj on all other regressors. Rules of thumb: VIF > 5 is moderate concern; VIF > 10 is severe; VIF → ∞ means perfect collinearity and inestimable coefficients. High VIFs inflate SEs and destabilize estimates.
What you can do here
  • Read the bar for each predictor — taller bars mean more inflation.
  • Compare to the 5 and 10 thresholds marked on the scale.
  • Cross-reference with the correlation matrix above — highly-correlated predictors drive the worst VIFs.
Try this
  1. Read bedrooms and size. VIF ≈ 57.8 and 40.1. Both far past the "severe" threshold — the two variables carry nearly redundant information.
  2. Read lotsize and monthsold. VIF ≈ 12.0 and 12.8. Also above threshold — with 29 observations and 6 correlated predictors, the whole model is over-collinear.
  3. Connect to the earlier widgets. Only size is significant in the full regression because the other SEs are inflated. High VIF → inflated SE → inflated p-value → "insignificant" coefficients that may actually carry signal.
  4. Mentally drop bedrooms or bathrooms from the model and imagine what remains. The remedy for multicollinearity is surgical: drop or combine the redundant predictors until VIFs fall below 5.

Take-away: High VIFs are the fingerprint of multicollinearity — they inflate standard errors and turn real effects into "insignificant" ones, so prune correlated regressors before chasing p-values. Read §10.8 in the chapter →
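VIF can also be computed straight from its definition rather than from a library call. Here is a sketch on synthetic data, with the auxiliary regression including an intercept (the usual convention): two regressors share a common factor, the third is independent.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: x1 and x2 share a common factor; x3 is independent
rng = np.random.default_rng(5)
n = 200
shared = rng.normal(size=n)
df = pd.DataFrame({
    'x1': shared + 0.3 * rng.normal(size=n),
    'x2': shared + 0.3 * rng.normal(size=n),
    'x3': rng.normal(size=n),
})

# VIF_j = 1 / (1 - R²_j), with R²_j from regressing x_j on the others
vifs = {}
for j in df.columns:
    others = [c for c in df.columns if c != j]
    r2_aux = ols(f'{j} ~ ' + ' + '.join(others), data=df).fit().rsquared
    vifs[j] = 1 / (1 - r2_aux)
    print(f"VIF({j}) = {vifs[j]:.2f}")
# x1 and x2 show elevated VIFs; x3 sits near the no-inflation value of 1.
```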

Regression diagnostics — actual vs predicted & residuals

Two regressions can have nearly the same R². But are their residuals equally well-behaved? Diagnostic plots are where models earn — or lose — your trust.

A well-fit regression shows predictions hugging the 45° line and residuals scattering randomly around zero. The actual-vs-predicted plot surfaces bias and heteroskedasticity; the residual plot reveals curvature, fan shapes, or outliers. For a multiple regression, these two plots are the first place to check whether the model's assumptions are plausible — before reporting any coefficient's standard error.
What you can do here
  • Pick a model — size only, size + bedrooms, or the full six-predictor model.
  • Scan the actual-vs-predicted plot for systematic under- or over-prediction.
  • Scan the residual plot for fans, curves, or isolated outliers.
Try this
  1. Start with Size only. Points scatter around the 45° line with moderate spread; residuals look random. A simple, well-behaved bivariate fit.
  2. Switch to the full six-predictor model. The scatter tightens slightly (R² 0.618 → 0.651). A marginal improvement at a large cost in standard-error precision — exactly what VIF and adj R² warned us about.
  3. Inspect the residual plot for any model. No fan, no curve, no serial pattern. The OLS homoskedasticity and linearity assumptions look defensible on this sample.
  4. Hover the house near $340,000. It is consistently under-predicted across all three specifications. A stable residual outlier — likely an omitted quality variable that no specification captures.

Take-away: Actual-vs-predicted and residual plots are cheap insurance — they can invalidate an R² in a single glance. Read §10.4 in the chapter →
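The stubborn-outlier check can be automated. Below is a sketch on synthetic data with one planted outlier (row 10 here is hypothetical, not the $340,000 house): find the observation with the largest absolute residual and flag it for inspection.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with one deliberately under-predicted observation
rng = np.random.default_rng(6)
n = 29
df = pd.DataFrame({'size': rng.normal(2000, 400, n)})
df['price'] = 70 * df['size'] + rng.normal(0, 15_000, n)
df.loc[10, 'price'] += 120_000  # plant an outlier at row 10

fit = ols('price ~ size', data=df).fit()
worst = fit.resid.abs().idxmax()
print(f"largest |residual| at row {worst}: {fit.resid[worst]:,.0f}")
# A residual this far from the rest is worth checking in the raw data:
# a recording error, or a variable (quality, location) the model omits.
```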

Python Libraries and Code

You've explored the key concepts interactively — now reproduce them in Python. This self-contained code block covers everything you practiced above. Copy it into an empty notebook and run it.

# =============================================================================
# CHAPTER 10 CHEAT SHEET: Data Summary for Multiple Regression
# =============================================================================

# --- Libraries ---
import numpy as np                                          # numerical operations
import pandas as pd                                         # data loading and manipulation
import matplotlib.pyplot as plt                             # creating plots and visualizations
import seaborn as sns                                       # statistical visualization (heatmaps, pairplots)
from statsmodels.formula.api import ols                     # OLS regression with R-style formulas
from statsmodels.stats.outliers_influence import variance_inflation_factor  # multicollinearity detection

# =============================================================================
# STEP 1: Load data and explore
# =============================================================================
# pd.read_stata() reads Stata .dta files directly from a URL
url = "https://raw.githubusercontent.com/quarcs-lab/data-open/master/AED/AED_HOUSE.DTA"
data_house = pd.read_stata(url)

print(f"Dataset: {data_house.shape[0]} observations, {data_house.shape[1]} variables")
print(data_house[['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']].describe().round(2))

# =============================================================================
# STEP 2: Partial effects vs. total effects — why controls matter
# =============================================================================
# Bivariate regression captures TOTAL effect (direct + indirect through size)
model_bivariate = ols('price ~ bedrooms', data=data_house).fit()

# Multiple regression isolates the PARTIAL effect (holding size constant)
model_partial = ols('price ~ bedrooms + size', data=data_house).fit()

print(f"Bedrooms coefficient (bivariate):  ${model_bivariate.params['bedrooms']:,.2f}")
print(f"Bedrooms coefficient (multiple):   ${model_partial.params['bedrooms']:,.2f}")
print(f"Change: ${model_partial.params['bedrooms'] - model_bivariate.params['bedrooms']:,.2f}")
# The coefficient drops from ~$23,667 to ~$1,553 once we control for size

# =============================================================================
# STEP 3: Correlation matrix — check pairwise associations before modeling
# =============================================================================
# High correlations between regressors signal potential multicollinearity
corr_vars = ['price', 'size', 'bedrooms', 'bathrooms', 'lotsize', 'age']
corr_matrix = data_house[corr_vars].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1)
ax.set_title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()

print(f"Price-Size correlation: {corr_matrix.loc['price', 'size']:.3f}")
print(f"Size-Bedrooms correlation: {corr_matrix.loc['size', 'bedrooms']:.3f}")

# =============================================================================
# STEP 4: Full multiple regression — estimate partial effects
# =============================================================================
# Each coefficient measures the change in price for a one-unit change in that
# variable, holding ALL other regressors constant
model_full = ols('price ~ size + bedrooms + bathrooms + lotsize + age + monthsold',
                 data=data_house).fit()

size_coef = model_full.params['size']
print(f"Size effect: each additional sq ft is associated with ${size_coef:,.2f} higher price")
print(f"R-squared: {model_full.rsquared:.4f} ({model_full.rsquared*100:.1f}% of variation explained)")
print(f"Adjusted R-squared: {model_full.rsquared_adj:.4f}")

# Full regression table (coefficients, std errors, t-stats, p-values)
print(model_full.summary())

# =============================================================================
# STEP 5: FWL theorem — how "holding constant" actually works
# =============================================================================
# Step A: Regress size on all other regressors, keep residuals
model_size_on_others = ols('size ~ bedrooms + bathrooms + lotsize + age + monthsold',
                            data=data_house).fit()
resid_size = model_size_on_others.resid

# Step B: Regress price on those residuals — the slope matches the full model
data_house['resid_size'] = resid_size
model_fwl = ols('price ~ resid_size', data=data_house).fit()

print(f"Size coef from FULL regression:     {model_full.params['size']:.10f}")
print(f"Coef from FWL residual regression:  {model_fwl.params['resid_size']:.10f}")
# These are identical — the FWL theorem in action

# =============================================================================
# STEP 6: Model comparison — parsimony vs. complexity
# =============================================================================
# Compare simple (size only) vs. full model using fit statistics
model_simple = ols('price ~ size', data=data_house).fit()

comparison = pd.DataFrame({
    'Model': ['Size only', 'Full (all variables)'],
    'R²':     [model_simple.rsquared,     model_full.rsquared],
    'Adj R²': [model_simple.rsquared_adj, model_full.rsquared_adj],
    'AIC':    [model_simple.aic,          model_full.aic],
    'BIC':    [model_simple.bic,          model_full.bic],
})
print(comparison.to_string(index=False))
# Adj R² DECREASES when adding 5 weak predictors — parsimony wins

# =============================================================================
# STEP 7: VIF — detect multicollinearity
# =============================================================================
# VIF_j = 1 / (1 - R²_j), where R²_j is from regressing x_j on all other x's
# VIF > 5: moderate concern; VIF > 10: severe multicollinearity
# Note: computed here on the raw regressor matrix without a constant column;
# the usual convention adds one first (statsmodels.api.add_constant) and
# reports VIFs for the non-constant columns only
X = data_house[['size', 'bedrooms', 'bathrooms', 'lotsize', 'age', 'monthsold']]
vif_data = pd.DataFrame({
    'Variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})
print(vif_data.to_string(index=False))

# =============================================================================
# STEP 8: Diagnostics — actual vs. predicted and residual plots
# =============================================================================
# Good fit: points hug the 45° line; residuals scatter randomly around zero
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Actual vs. predicted
axes[0].scatter(data_house['price'], model_full.fittedvalues, alpha=0.7, s=50)
axes[0].plot([data_house['price'].min(), data_house['price'].max()],
             [data_house['price'].min(), data_house['price'].max()],
             'r--', linewidth=2, label='Perfect prediction (45° line)')
axes[0].set_xlabel('Actual Price ($)')
axes[0].set_ylabel('Predicted Price ($)')
axes[0].set_title('Actual vs Predicted')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residual plot
axes[1].scatter(model_full.fittedvalues, model_full.resid, alpha=0.7, s=50)
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Fitted Values ($)')
axes[1].set_ylabel('Residuals ($)')
axes[1].set_title('Residual Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Open empty Colab notebook →