5  Classical Synthetic Control

5.1 The SCM idea

Synthetic Control stops using one control state. Instead, it builds a weighted combination of donor states that matches the treated unit’s pre-period as closely as possible on a chosen set of predictors. The weighted combination is “synthetic California”. The gap between observed California and synthetic California is the estimated effect.

Why it works where DiD failed. Difference-in-Differences against Nevada (chapter 4) needed parallel pre-trends with one neighbour. Synthetic Control needs parallel pre-trends with a data-driven blend of many neighbours. The optimisation does the matching, so the analyst no longer has to pick “the right” control state by hand.

5.2 The four-stage tidysynth pipeline

flowchart LR
    A["1. synthetic_control()<br/>declare treated unit<br/>and intervention time"] --> B["2. generate_predictor()<br/>define matching variables<br/>(one call per time window)"]
    B --> C["3. generate_weights()<br/>optimise donor weights<br/>(quadratic programming)"]
    C --> D["4. generate_control()<br/>build synthetic California<br/>and post-period gap series"]
    D --> E["5. plot_/grab_ helpers<br/>trends, weights,<br/>placebos, MSPE ratio,<br/>Fisher exact p-value"]

    style A fill:#6a9bcc,stroke:#cbd5e0,color:#fff
    style B fill:#6a9bcc,stroke:#cbd5e0,color:#fff
    style C fill:#6a9bcc,stroke:#cbd5e0,color:#fff
    style D fill:#d97757,stroke:#cbd5e0,color:#fff
    style E fill:#00d4c8,stroke:#cbd5e0,color:#141413
Figure 5.1: The four-stage tidysynth pipeline: declare → predictors → weights → control, with a fifth stage of plot/grab helpers.

Stages 1–4 produce the estimate. Stage 5 is a battery of inspection helpers — plot_trends(), plot_differences(), plot_weights(), plot_placebos(), plot_mspe_ratio(), grab_unit_weights(), grab_predictor_weights(), grab_balance_table(), grab_significance() — that turn the fitted object into figures and tables. We use all of them below.

5.3 The equation

Let \(X_1\) be the vector of \(k\) pre-period predictors for the treated unit (California), and let \(X_0\) be the \(k \times J\) matrix holding the same predictors for the \(J = 38\) donor states. The Synthetic Control estimator chooses donor weights \(w\) to minimise the (V-weighted) discrepancy between treated and synthetic on the predictors:

\[w^* \, = \, \arg\min_{w \in \mathcal{W}} \, \big(X_1 - X_0 w\big)^\top V \big(X_1 - X_0 w\big),\]

subject to

\[\mathcal{W} = \big\{w \in \mathbb{R}^J \,:\, w_j \ge 0 \,\, \forall j, \,\, \textstyle\sum_{j=1}^J w_j = 1\big\}.\]

The diagonal matrix \(V\) holds the predictor importance weights — the optimiser can care more about pre-period cigarette sales than about, say, beer consumption (we inspect \(V\) below). Once \(w^*\) is solved, the synthetic California outcome at any year \(t\) is

\[\widehat{Y_{1t}(0)} = \sum_{j=1}^J w_j^* \, Y_{jt},\]

and the ATT over 1989–2000 is the mean post-period gap between observed California and that synthetic counterfactual.
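
As a reference point for the notation, here is a minimal sketch of the objective in R. The names scm_loss, X1, X0, V, and w are hypothetical illustration objects, not part of the chapter's pipeline; tidysynth solves this quadratic programme internally inside generate_weights(), so nothing in this sketch needs to be run.

Code
# V-weighted predictor discrepancy from the equation above, assuming
# X1 is a k-vector, X0 a k x J matrix, V a k x k diagonal matrix, and
# w a candidate weight vector on the unit simplex (non-negative, sums to 1).
scm_loss <- function(w, X1, X0, V) {
  gap <- X1 - X0 %*% w            # (X1 - X0 w), a k x 1 discrepancy
  drop(t(gap) %*% V %*% gap)      # quadratic form (X1 - X0 w)' V (X1 - X0 w)
}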

5.4 Setup and data

Packages. tidyverse covers wrangling and plotting. tidysynth is the workhorse for this chapter: it wraps the classical Abadie-Diamond-Hainmueller synthetic-control optimiser behind a tidy pipeline of synthetic_control() |> generate_predictor() |> generate_weights() |> generate_control() and ships a battery of plot_*() / grab_*() helpers for inspection. The R/table_helpers.R helper provides gt_pretty() for the donor-weight and balance tables.

Code
library(tidyverse)
library(tidysynth)
source("R/table_helpers.R")

set.seed(42)

knitr::opts_chunk$set(dev.args = list(bg = "transparent"))

theme_set(
  theme_minimal(base_size = 12) +
    theme(
      plot.background  = element_rect(fill = "transparent", color = NA),
      panel.background = element_rect(fill = "transparent", color = NA),
      panel.grid.major = element_line(color = "#94a3b8", linewidth = 0.25),
      panel.grid.minor = element_line(color = "#94a3b8", linewidth = 0.15),
      text             = element_text(color = "#94a3b8"),
      axis.text        = element_text(color = "#94a3b8")
    )
)

Dataset. Unlike the ITS, RDD, and DiD chapters — which used only California or only California-plus-Nevada — Synthetic Control uses the full 39-state × 31-year panel. The donor pool is the point: the other 38 states are the raw material the optimiser will blend into “synthetic California”. We don’t pre-filter or restrict to a window here; the chapter-2 outcome cigsale and the covariates lnincome, retprice, age15to24, beer are all in prop99 and feed straight into the pipeline below (cigsale as the outcome in synthetic_control(), the covariates inside generate_predictor()).

Code
prop99 <- read_rds("data/proposition99.rds") |> as_tibble()

The loaded prop99 is a 1,209-row × 7-column tibble: one row per state-year (39 states × 31 years), with columns state, year, cigsale, plus the four covariates. The pipeline in the next section consumes it as-is.
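
A quick structural sanity check of those dimensions (not part of the original pipeline; it only confirms the counts quoted above):

Code
n_distinct(prop99$state)   # expect 39 states
n_distinct(prop99$year)    # expect 31 years
dim(prop99)                # expect 1209 rows x 7 columns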

5.5 Fit the synthetic-control pipeline

Code
prop99_syn <- prop99 |>
  # 1. Declare the panel structure: outcome, unit, time, treated unit
  #    ("California"), and the last full pre-period year (1988).
  #    generate_placebos = TRUE also fits the model treating each donor
  #    state as treated, for the permutation test below.
  synthetic_control(
    outcome  = cigsale, unit = state, time = year,
    i_unit   = "California", i_time = 1988,
    generate_placebos = TRUE
  ) |>
  # 2. Predictors averaged over the full pre-period (1980-1988).
  generate_predictor(
    time_window = 1980:1988,
    lnincome    = mean(lnincome, na.rm = TRUE),
    retprice    = mean(retprice, na.rm = TRUE),
    age15to24   = mean(age15to24, na.rm = TRUE)
  ) |>
  # 2b. beer is sparser, so use a narrower window where data is densest.
  generate_predictor(time_window = 1984:1988,
                     beer = mean(beer, na.rm = TRUE)) |>
  # 2c. Three "lagged outcomes" - cigsale at three pre-period dates.
  #     These pin synthetic California's pre-period trajectory.
  generate_predictor(time_window = 1975, cigsale_1975 = cigsale) |>
  generate_predictor(time_window = 1980, cigsale_1980 = cigsale) |>
  generate_predictor(time_window = 1988, cigsale_1988 = cigsale) |>
  # 3. Solve the constrained QP for donor weights w*. The three IPOP
  #    parameters are tuning knobs for the interior-point optimiser.
  generate_weights(optimization_window = 1970:1988,
                   margin_ipop = .02,
                   sigf_ipop   = 7,
                   bound_ipop  = 6) |>
  # 4. Compute the synthetic California series from w* and donor outcomes.
  generate_control()

Predictor choices. Seven predictors are passed in. Three are covariate averages over 1980–1988 (lnincome, retprice, age15to24). One uses a narrower 1984–1988 window where the data are densest (beer). Three are lagged outcomes — cigarette sales themselves in 1975, 1980, and 1988. The lagged outcomes are the most important trick: anchoring the synthetic control on the treated unit’s own pre-period outcome levels at multiple time points forces the synthetic series to track California’s pre-1988 trajectory closely.
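
The three anchor values are easy to pull straight from the raw panel; this small cross-check simply reads off what generate_predictor() captures at those dates (the same California values reappear in the balance table in section 5.8):

Code
# California's cigarette sales at the three lagged-outcome anchor years.
prop99 |>
  filter(state == "California", year %in% c(1975, 1980, 1988)) |>
  select(year, cigsale)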

5.6 Donor weights and predictor weights

The optimisation produces two weight vectors that drive the entire fit. Both are extractable as tidy tables.

Code
grab_unit_weights(prop99_syn) |>
  arrange(desc(weight)) |>
  head(8) |>
  gt_pretty(decimals = 3) |>
  cols_label(unit = "Donor state", weight = "Weight")
Table 5.1: Donor unit weights (top 8 states).
Donor state Weight
Utah 0.342
Nevada 0.238
Montana 0.209
Colorado 0.149
Connecticut 0.062
New Mexico 0
Idaho 0
Wisconsin 0
Code
grab_predictor_weights(prop99_syn) |>
  arrange(desc(weight)) |>
  gt_pretty(decimals = 3) |>
  cols_label(variable = "Predictor", weight = "V-matrix weight")
Table 5.2: Predictor (V-matrix) weights.
Predictor V-matrix weight
cigsale_1975 0.468
cigsale_1980 0.412
retprice 0.055
cigsale_1988 0.037
beer 0.02
age15to24 0.007
lnincome 0

Two things to notice.

  1. Five states absorb essentially 100% of the donor weight: Utah, Nevada, Montana, Colorado, and Connecticut. Every other state gets effectively zero. California is matched mostly to other Western/sunbelt states with similar age structure and cigarette price levels, plus Connecticut as a smoking-rate counterweight from the east.
  2. The two earliest cigsale levels dominate the V matrix. cigsale_1975 and cigsale_1980 together get roughly 88% of the predictor weight, while the four behavioural and demographic covariates get less than 9% combined. The optimiser has effectively decided that a state’s own pre-period cigarette sales are the best fingerprint for matching: past smoking predicts future smoking better than income, prices, or demographics do. (Both claims are checked numerically just below.)
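
A quick numerical check of both claims, reusing the grab_*() helpers (nothing here goes beyond the two tables above):

Code
# 1. The top-five donors carry essentially all of the unit weight.
grab_unit_weights(prop99_syn) |>
  slice_max(weight, n = 5) |>
  summarise(top5_weight = sum(weight))     # ~1.00

# 2. The two early lagged outcomes carry most of the V-matrix weight.
grab_predictor_weights(prop99_syn) |>
  filter(variable %in% c("cigsale_1975", "cigsale_1980")) |>
  summarise(lagged_share = sum(weight))    # ~0.88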

For a one-line visual of both weight vectors, tidysynth ships a plot_weights() helper:

Code
plot_weights(prop99_syn)
Figure 5.2: Donor unit weights (left) and predictor V-matrix weights (right) from tidysynth.

5.6.1 A closer look at the V matrix

The combined plot_weights() view is convenient, but the V matrix deserves a stand-alone chart because it answers a different question than the donor weights. Donor weights say which states mimic California; the V matrix says which variables the optimiser used to decide what “mimics” means.

Code
predw_df <- grab_predictor_weights(prop99_syn) |>
  mutate(variable = fct_reorder(variable, weight))

ggplot(predw_df, aes(x = weight, y = variable)) +
  geom_col(fill = "#6a9bcc") +
  geom_text(aes(label = sprintf("%.3f", weight)), hjust = -0.12) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(x = "V-matrix weight (predictor importance)", y = NULL) +
  theme_minimal()
Figure 5.3: V matrix: cigsale_1975 and cigsale_1980 dominate; behavioural covariates get nearly zero weight.

Two readings of the same picture, one practical and one cautionary.

  • Practical reading. Two lagged outcomes (cigsale_1975 and cigsale_1980) carry the bulk of the matching information. The optimiser has decided that California’s pre-period cigarette sales — at multiple time points — are the best fingerprint to match.
  • Cautionary reading. The V matrix is not a causal ranking. It tells you which variables were useful for matching the treated unit’s pre-period, not which variables cause the outcome.

Common pitfall. Treating the V matrix as a list of causal drivers. It is a list of good pre-period predictors for one specific unit, not a structural model of smoking.

5.7 The estimate

Code
# grab_synthetic_control() returns a tidy tibble with observed (real_y)
# and synthetic (synth_y) cigsale for every year. We restrict to the
# post-period and compute the per-year gap.
sc_post <- grab_synthetic_control(prop99_syn) |>
  filter(time_unit > 1988) |>
  mutate(dif = real_y - synth_y)

# Average the per-year gap to recover the ATT.
mean(sc_post$dif)
[1] -18.84561

The Synthetic Control ATT is approximately \(-18.85\) packs/capita averaged over 1989–2000. This is the book’s primary causal estimate and within rounding of the canonical Abadie et al. (2010) result.

Code
plot_trends(prop99_syn)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the tidysynth package.
  Please report the issue to the authors.

Figure 5.4: plot_trends(): observed vs synthetic California cigarette sales, 1970–2000.

The pre-period fit is excellent — the synthetic and observed series are nearly indistinguishable through 1988. A substantial gap opens immediately after 1989, widening to roughly 30 packs per capita by 2000.
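
To put a rough number on that widening, here is a quick look at the tail of the per-year gap series, reusing the sc_post tibble built in the previous chunk; it shows nothing the plot does not already show:

Code
# Per-year gaps for the final post-period years; the gap keeps widening
# toward roughly 30 packs per capita by 2000.
sc_post |>
  filter(time_unit >= 1997) |>
  select(time_unit, real_y, synth_y, dif)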

5.8 Predictor balance: did the matching work?

grab_balance_table() shows California, synthetic California, and the unweighted donor average side-by-side on every predictor.

Code
grab_balance_table(prop99_syn) |>
  gt_pretty(decimals = 2)
Table 5.3: Predictor balance: California vs synthetic California vs donor average.
variable California synthetic_California donor_sample
age15to24 0.17 0.17 0.17
lnincome 10.08 9.85 9.83
retprice 89.42 89.39 87.27
beer 24.28 24.22 23.66
cigsale_1975 127.1 126.99 136.93
cigsale_1980 120.2 120.22 138.09
cigsale_1988 90.1 91.39 113.82

On every variable, synthetic California is far closer to California than the unweighted donor average is. The most dramatic improvement is on the lagged outcomes: cigsale_1988 is roughly 90 for California vs ≈91 for the synthetic — a near-perfect match — while the unweighted donor average is around 114. That gap of ~24 packs is exactly the bias the naive pre-post method silently absorbed.
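
The unweighted donor average in that comparison can be cross-checked directly from the raw panel (a small sketch on the loaded prop99 tibble, independent of the fitted object):

Code
# Unweighted mean of 1988 cigarette sales across the 38 donor states;
# compare with the donor_sample entry for cigsale_1988 in Table 5.3 (~114).
prop99 |>
  filter(year == 1988, state != "California") |>
  summarise(donor_avg_cigsale_1988 = mean(cigsale))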

5.9 Visualising the post-period gap

plot_trends() showed both observed and synthetic California on one canvas. The companion helper plot_differences() plots just the gap: \(Y_{1t} - \widehat{Y_{1t}(0)}\), year by year. This isolates the treatment-effect curve in its cleanest form.

Code
plot_differences(prop99_syn)
Ignoring unknown labels:
• colour : ""
• linetype : ""
Figure 5.5: Per-year gap between observed California and synthetic California.

Read the line as the effect of Proposition 99 on California in year \(t\). The pre-period values hover near zero (the matching worked), the line drops sharply after 1989, and it stays negative — steadily widening — throughout the post-period. The 1989–2000 mean of this series is exactly the \(-18.85\) packs ATT reported above.
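
The “hovers near zero before 1989” claim is easy to confirm numerically with a one-line summary of the same series:

Code
# Mean pre-period gap between observed and synthetic California;
# a value near zero is what a successful match should produce.
grab_synthetic_control(prop99_syn) |>
  filter(time_unit <= 1988) |>
  summarise(mean_pre_gap = mean(real_y - synth_y))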

5.10 Inference via placebo permutation

A “standard error” computed as cross-year SD divided by \(\sqrt{N}\) is not a real sampling-distribution-based standard error. The proper Synthetic Control uncertainty quantification is a permutation test.

The recipe. Refit the synthetic-control model treating each donor state as if it had been the treated unit. Compute the post-period gap for each placebo. Compare California’s gap trajectory to those placebo trajectories. If California’s gap is extreme relative to the placebos, the policy probably did something.
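
Before turning to the packaged plot, the recipe’s key quantity can be sketched by hand. This assumes the placebo argument that the grab_*() helpers accept and the .id unit column it adds; plot_placebos() does the equivalent internally:

Code
# Post-period mean gap for every unit, treated and placebo alike.
# California should sit at (or near) the bottom of this ordering.
grab_synthetic_control(prop99_syn, placebo = TRUE) |>
  filter(time_unit > 1988) |>
  group_by(.id) |>
  summarise(post_gap = mean(real_y - synth_y)) |>
  arrange(post_gap)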

Code
plot_placebos(prop99_syn)
Figure 5.6: Placebo permutation: California in orange, donor placebos in grey (badly pre-fit donors pruned by default).

The orange line is California; the grey lines are the donor placebos. By default, plot_placebos() prunes placebos whose pre-period mean squared prediction error (MSPE) exceeds twice California’s — those donors fit their own pre-period so badly that comparing their post-period gap to California’s would be misleading. After pruning, California’s post-period gap sits visibly below every retained placebo, which is the visual signature of a “real” treatment effect.
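
The pruning rule as described above can be reproduced from grab_significance(), which records each unit’s pre-period MSPE (a sketch of the rule, not necessarily the package’s exact internal code):

Code
# Donors whose pre-period MSPE is at most twice California's are the
# placebos retained by the default pruning.
sig    <- grab_significance(prop99_syn)
ca_pre <- sig$pre_mspe[sig$unit_name == "California"]
sig |>
  filter(type == "Donor", pre_mspe <= 2 * ca_pre) |>
  nrow()   # number of retained placebo donors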

The unpruned variant keeps every donor for full transparency:

Code
plot_placebos(prop99_syn, prune = FALSE)
Figure 5.7: Same placebo plot with every donor retained — including badly pre-fit ones — to show the full pool of placebo trajectories.

With pruning off, the grey cloud is messier and a few badly-fit donors swing wildly — but California’s post-period descent still ends up at the bottom of the bundle. The qualitative conclusion does not depend on the pruning rule.

5.11 MSPE ratio and Fisher exact p-value

A sharper version of the same test is the MSPE ratio — the ratio of post-period to pre-period mean squared prediction error. If a unit has a tight pre-period fit and a large post-period gap, the ratio is large.

Code
# grab_significance() returns one row per unit (treated + every placebo)
# with pre_mspe, post_mspe, the post/pre ratio, the unit's rank in that
# ratio, and the Fisher-style p-value (rank / n_units).
grab_significance(prop99_syn) |>
  arrange(desc(mspe_ratio)) |>
  head(5) |>
  gt_pretty(decimals = 3)
Table 5.4: MSPE ratios — top 5 units.
unit_name type pre_mspe post_mspe mspe_ratio rank fishers_exact_pvalue z_score
California Treated 3.166 392.198 123.87 1 0.026 5.324
Georgia Donor 3.786 178.712 47.208 2 0.051 1.702
Indiana Donor 25.171 769.656 30.577 3 0.077 0.916
West Virginia Donor 9.523 284.105 29.832 4 0.103 0.881
Wisconsin Donor 11.134 267.763 24.05 5 0.128 0.607

California’s MSPE ratio is around 124 — more than two and a half times higher than the next-highest unit. California ranks 1st out of 39 units. The Fisher exact \(p\)-value is rank divided by total units, so \(1/39 \approx 0.026\). Under the null hypothesis that Proposition 99 had no effect, the probability of seeing a unit this extreme purely by chance is about 2.6%.
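
That arithmetic is worth making explicit; a two-line sketch using the same significance table:

Code
# Fisher-style exact p-value by hand: California's rank in the MSPE-ratio
# ordering divided by the number of units in the permutation (39).
grab_significance(prop99_syn) |>
  summarise(p_value = rank[unit_name == "California"] / n())   # 1/39 ~ 0.026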

Code
plot_mspe_ratio(prop99_syn)
Ignoring unknown labels:
• colour : ""
Figure 5.8: MSPE-ratio bar chart with California highlighted at rank 1 out of 39 units.

The orange bar at the top is California; every blue bar below it is a placebo donor. The gap between California and the second-place state is enormous. That gap is the visual signature of “a real treatment effect that the donor pool does not naturally replicate”.

5.12 Inspecting the nested tidysynth object

prop99_syn is not a plain data frame — it is a nested tibble with one row per unit (treated unit + every donor refit as a placebo) and list-columns that hold every intermediate output of the optimisation.

This one chunk prints with R’s default nested-tibble formatter on purpose — the <tibble [N × M]> glyphs in the list-columns are the pedagogical point that a styled table would hide.

Code
prop99_syn
# A tibble: 78 × 11
   .id      .placebo .type .outcome .predictors .synthetic_control .unit_weights
   <fct>       <dbl> <chr> <list>   <list>      <list>             <list>       
 1 Califor…        0 trea… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 2 Califor…        0 cont… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 3 Alabama         1 trea… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 4 Alabama         1 cont… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 5 Arkansas        1 trea… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 6 Arkansas        1 cont… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 7 Colorado        1 trea… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 8 Colorado        1 cont… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
 9 Connect…        1 trea… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
10 Connect…        1 cont… <tibble> <tibble>    <tibble [31 × 3]>  <tibble>     
# ℹ 68 more rows
# ℹ 4 more variables: .predictor_weights <list>, .original_data <list>,
#   .meta <list>, .loss <list>

Each list-column can be flattened with tidyr::unnest() for custom downstream work, or pulled out with one of the grab_*() helpers used above.

Code
# Flatten .outcome into a wide table: one row per (unit, year).
# The actual cigarette-sales column is named after each unit, so we
# select metadata + California's series for a clean preview.
prop99_syn |>
  tidyr::unnest(cols = c(.outcome)) |>
  select(.id, .placebo, .type, time_unit, California) |>
  head(8) |>
  gt_pretty(decimals = 2)
Table 5.5: Unnested .outcome (first 8 rows).
.id .placebo .type time_unit California
California 0 treated 1,970 123
California 0 treated 1,971 121
California 0 treated 1,972 123.5
California 0 treated 1,973 124.4
California 0 treated 1,974 126.7
California 0 treated 1,975 127.1
California 0 treated 1,976 128
California 0 treated 1,977 126.4

For the tidy observed-vs-synthetic table (which is what most analyses want), the dedicated helper is more convenient:

Code
grab_synthetic_control(prop99_syn) |>
  head(8) |>
  gt_pretty(decimals = 2)
Table 5.6: grab_synthetic_control() — observed vs synthetic (first 8 rows).
time_unit real_y synth_y
1,970 123 116.79
1,971 121 118.66
1,972 123.5 124.09
1,973 124.4 125.23
1,974 126.7 126.83
1,975 127.1 126.99
1,976 128 127.73
1,977 126.4 125.59

This is the whole point of the nested-tibble design: every step of the optimisation is introspectable from R, with no need to dig into S4 slots or attr() blobs.

5.13 Recap

Question Answer
What does Synthetic Control estimate? The ATT on California, 1989–2000
What is the point estimate? \(-18.85\) packs/capita per year
What is “synthetic California”? A convex combination of five Western/sunbelt states (Utah, Nevada, Montana, Colorado, Connecticut)
What predictors did the matching? Mostly two lagged outcomes — cigsale_1975 and cigsale_1980
How good is the matching? Excellent — synthetic and observed California are near-identical through 1988
What is the inference statistic? Fisher exact \(p \approx 0.026\) (California ranks 1st out of 39 on the MSPE ratio)
What is the design-time pitfall? Don’t read the V matrix as a list of causal drivers — it is a list of good pre-period predictors

Synthetic Control is the book’s headline causal estimate, and the placebo / MSPE-ratio diagnostics both confirm that California’s post-1989 trajectory is unusual relative to what other states experienced in the same window. In chapter 6 we hand the same donor information to a Bayesian model and ask whether a credible interval (a direct probability statement about the effect) tells the same story.

5.14 Further reading

  • Abadie et al. (2010) — the original synthetic-control treatment of Proposition 99.
  • Abadie (2021) — methodological feasibility, data requirements, and assumptions in a JEL review.
  • Dunford (2024) — documentation for the tidysynth package used here.