10 Staggered Differences-in-Differences

10.1 Learning objectives

Estimate group-time average treatment effects with did::att_gt() on a staggered-adoption panel. This is the workhorse of modern DiD — the estimator that replaced two-way fixed effects after the negative-weight critique.
Aggregate group-time effects into event-study dynamics with aggte(type = "dynamic"). The aggregation step is what gives the reader a single defensible event-study plot without the contamination problems of a TWFE event study.
Condition on covariates using regression, IPW, and doubly-robust estimators, and default to the doubly-robust choice. Doubly-robust estimation is consistent if either the outcome model or the propensity-score model is correctly specified, which is why it is the safe default — knowing all three lets the reader defend that choice against a referee.
Run Rambachan-Roth relative-magnitudes sensitivity analysis with HonestDiD and report the smallest pre-trend violation that would overturn the post-treatment conclusion. This converts an unverifiable parallel-trends assumption into a falsifiable claim about how robust the headline result is.

Part II begins here. Chapters 2–9 held the dataset fixed (Proposition 99) and varied the estimator. From this chapter forward we switch datasets too: a 1,745-county minimum-wage panel with staggered adoption replaces California’s single-shock setting, and the toolkit shifts to estimators built for that structure.

10.2 When TWFE breaks under staggered adoption

Chapter 3 ran a textbook 2×2 DiD on Proposition 99: California treated in 1989, Nevada as the single hand-picked control. That design squeezed the entire causal contrast into one interaction coefficient — $\widehat{\tau}_{\text{DiD}} \approx -5.7$ packs per capita, statistically indistinguishable from zero. The diagnosis there was twofold: Nevada inherits California’s secular forces (so its own post-1988 decline soaks up most of the policy signal), and parallel trends was rejected on the pre-window itself.

That design has a deeper structural limit. Real policy data rarely has just one treated unit and one clean control unit, both switching at the same instant. States and counties adopt at different times — some in 2004, others in 2006, still others never. The natural extension of the chapter-3 regression to that setting is the two-way fixed-effects (TWFE) specification:

\[Y_{it} = \alpha_i + \gamma_t + \beta \cdot \text{post}_{it} + \varepsilon_{it}.\]

For two decades this was the default. We now know it is biased in the presence of staggered adoption: already-treated units silently act as controls for later-treated units, contaminating the contrast. The Goodman-Bacon (2021) decomposition shows that $\hat\beta$ is a weighted average of 2×2 DiDs in which already-treated units serve as controls for later-treated units — the forbidden comparison. Chaisemartin & D’Haultfœuille (2020) further prove that under treatment-effect heterogeneity some of those implicit weights can be strictly negative, so $\hat\beta$ need not even be a convex average of the underlying $ATT(g, t)$s. Either way, when treatment effects grow over time — the textbook policy story — the sign of $\hat\beta$ is no longer a reliable summary.

This chapter walks through the modern toolkit for repairing that damage. The methods build on Callaway & Sant’Anna (2021): estimate group-time average treatment effects $ATT(g, t)$ directly, then aggregate them with weights you can defend. We will use the shorthand $\widehat{\tau}_{\text{CS}}$ for the Callaway-Sant’Anna estimator, distinguishing $\widehat{\tau}_{\text{CS, overall}}$ (the cohort-size-weighted summary) from $\widehat{\tau}_{\text{CS, dyn}}(e)$ (the event-study aggregation at event time $e = t - g$). Along the way we look at two companion ideas: the doubly-robust DiD estimator from Callaway & Sant’Anna (2021), and the Rambachan-Roth sensitivity analysis that quantifies how much parallel-trends violation it would take to overturn the conclusion (Rambachan & Roth, 2023). The Sun-Abraham interaction-weighted event study (Sun & Abraham, 2021) is an equivalent estimator built on a different regression-based decomposition; see Roth et al. (2023) for a side-by-side review.

The dataset is no longer Proposition 99. Staggered DiD requires variation in treatment timing, which a single-treated-state panel cannot provide. We switch to the Callaway-Sant’Anna minimum-wage panel: 1,745 US counties × 2003–2007, with cohorts $G \in \{0, 2004, 2006\}$ (after dropping a small late-2007 cohort; see Setup) indexing the year the county’s state first raised its minimum wage above the federal $5.15/h floor. The outcome is log teen employment.

10.3 Setup and data

Code: Load packages, source helpers, and set the ggplot theme.

library(tidyverse)
library(did)
library(fixest)
library(HonestDiD)
source("R/table_helpers.R")
source("R/honest_did.R")

set.seed(42)  # all chunks below are analytically deterministic; this is just hygiene

knitr::opts_chunk$set(dev.args = list(bg = "transparent"))

theme_set(
  theme_minimal(base_size = 12) +
    theme(
      plot.background  = element_rect(fill = "transparent", color = NA),
      panel.background = element_rect(fill = "transparent", color = NA),
      panel.grid.major = element_line(color = "#94a3b8", linewidth = 0.25),
      panel.grid.minor = element_line(color = "#94a3b8", linewidth = 0.15),
      text             = element_text(color = "#94a3b8"),
      axis.text        = element_text(color = "#94a3b8"),
      strip.text       = element_text(color = "#94a3b8"),
      legend.text      = element_text(color = "#94a3b8")
    )
)

The dataset ships as data/cs_minwage.rds in the chapter bundle. We follow the source-post convention of restricting to cohorts $G \in \{0, 2004, 2006, 2007\}$, dropping the Northeast region (region == "1") for comparability, and then carving out a clean working panel without the late-2007 cohort and starting in 2003.

Code: Load the minimum-wage panel and build the working sample.

mw_raw <- readRDS("data/cs_minwage.rds") |> as_tibble()

# Step 1: drop Northeast and keep cohorts of interest.
mw <- mw_raw |>
  filter(G %in% c(0, 2004, 2006, 2007), region != "1")

# Step 2: working sample for the main analysis.
data2 <- mw |>
  filter(G != 2007, year >= 2003)

dim(data2)

[1] 8725   20

The working panel has 8725 rows = 1745 counties × 5 years, balanced across the 2003–2007 window.

Code: Tabulate county counts by treatment cohort.

data2 |>
  filter(year == 2003) |>
  count(G, name = "counties") |>
  rename(`Treatment cohort (G)` = G) |>
  gt_pretty()

Table 10.1: Cohort sizes in the working sample. G = 0 is the never-treated control pool; cohorts 2004 and 2006 are the staggered treated groups.

Treatment cohort (G)	counties
0	1,417
2,004	102
2,006	226

10.4 The TWFE baseline

The natural first move is the TWFE regression. The variable post in the dataset is 1 in periods where the county’s state has already raised its minimum wage (i.e., $t \ge g$ and $g \ne 0$), and 0 otherwise.

Code: Estimate the TWFE baseline regression with clustered SEs.

twfe_res <- fixest::feols(lemp ~ post | id + year,
                          data = data2, cluster = "id")
ms_pretty(list("TWFE (county + year FE)" = twfe_res),
          coef_map = c("post" = "Post (any cohort)"),
          notes    = "SEs clustered at the county level.")

Table 10.2: Two-way fixed-effects regression. The point estimate suggests minimum-wage increases cut log teen employment by a few percent — but see the discussion below.

	TWFE (county + year FE)
Post (any cohort)	-0.038***
	(0.008)
Num.Obs.	8725
R2	0.994
R2 Within	0.004
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
SEs clustered at the county level.

The TWFE coefficient is roughly -0.038. Read literally, the policy cut teen employment by about 3.8 percent. The next section re-estimates this effect with an aggregator that is unbiased under staggered adoption — and gets a substantially larger magnitude. The gap between the two is the contamination problem made concrete.

10.5 Group-time ATTs: the Callaway-Sant’Anna approach

The fix is to estimate the primitive objects directly. For each cohort $g$ (a year a group of counties was first treated) and each calendar year $t$, define

\[ATT(g, t) = \mathbb{E}\!\left[Y_{it}(1) - Y_{it}(\infty) \mid G_i = g\right],\]

the average effect on cohort $g$ in year $t$ relative to its own never-treated potential outcome $Y_{it}(\infty)$. We use the $\infty$ subscript that Callaway & Sant’Anna (2021) adopt to emphasise never treated (as opposed to not-yet treated); it is the same object as Part I’s $Y_{it}(0)$, just notation that survives the distinction between cohorts.

Identification relies on parallel trends: conditional on $G_i = g$, the never-treated potential outcome $\mathbb{E}[Y_{it}(\infty)]$ trends the same way for the treated cohort and the never-treated pool. Under that assumption, did::att_gt() estimates each $ATT(g, t)$ from a clean 2×2 DiD using only cohort $g$ and an appropriate comparison group (here, the never-treated $G = 0$), so no contamination from already-treated units sneaks in.

Code: Estimate group-time ATTs with the did package.

attgt <- did::att_gt(yname = "lemp", idname = "id", gname = "G",
                     tname = "year", data = data2,
                     control_group = "nevertreated",
                     base_period = "universal")

attgt_df <- tibble(
  Cohort       = attgt$group,
  Year         = attgt$t,
  `ATT(g,t)`   = attgt$att,
  SE           = as.numeric(attgt$se)
) |>
  mutate(Phase = ifelse(Year >= Cohort, "post (t ≥ g)", "pre (t < g)")) |>
  arrange(Cohort, Year)

gt_pretty(attgt_df, decimals = 4)

Table 10.3: Group-time average treatment effects $ATT(g, t)$ for the minimum-wage panel. Cohorts 2004 and 2006, each year 2003 through 2007. Pre-treatment cells should hover near zero if parallel trends holds; post-treatment cells are the effects we want.

Cohort	Year	ATT(g,t)	SE	Phase
2,004	2,003	0	NA	pre (t < g)
2,004	2,004	−0.0327	0.0196	post (t ≥ g)
2,004	2,005	−0.0683	0.0211	post (t ≥ g)
2,004	2,006	−0.1234	0.0204	post (t ≥ g)
2,004	2,007	−0.1311	0.0221	post (t ≥ g)
2,006	2,003	−0.0341	0.0125	pre (t < g)
2,006	2,004	−0.0167	0.0087	pre (t < g)
2,006	2,005	0	NA	pre (t < g)
2,006	2,006	−0.0194	0.0086	post (t ≥ g)
2,006	2,007	−0.0661	0.009	post (t ≥ g)

Cohort 2006 carries a visible pre-trend: $\widehat{ATT}(2006, 2003) \approx$ -0.034 with SE $\approx$ 0.012 (so $t \approx$ -2.7). The 2006 counties were already on a downward log-employment path relative to the never-treated pool three years before their state raised the minimum wage. That pre-treatment violation is roughly the same magnitude as the on-impact effect we will estimate below — which is exactly what motivates the formal sensitivity analysis at the end of the chapter. Naming the pre-trend now keeps the rest of the story honest.

10.5.1 Aggregation: overall ATT and event study

The 8 cells in Table 10.3 are the primitives. They are not the headline. Aggregating them produces summaries that are valid in the presence of staggered adoption.

The overall ATT first averages each cohort’s post-treatment $ATT(g, t)$ values into a single cohort-specific summary, then averages those summaries across cohorts weighted by cohort size. It answers: across treated cohorts, what is the average post-treatment effect on a typical treated county? This is the Callaway & Sant’Anna (2021) recommended summary; it differs from a plain mean of post-treatment cells (type = "simple") when cohorts have different post-treatment horizons — as they do here (cohort 2004 has 4 post cells, cohort 2006 has 2).

Code: Aggregate group-time ATTs into the overall ATT summary.

attO <- did::aggte(attgt, type = "group")

tibble(
  Estimator             = "Callaway-Sant'Anna overall ATT",
  `Estimate`            = attO$overall.att,
  `SE`                  = attO$overall.se,
  `CI lower`            = attO$overall.att - 1.96 * attO$overall.se,
  `CI upper`            = attO$overall.att + 1.96 * attO$overall.se
) |> gt_pretty(decimals = 4)

Table 10.4: Callaway-Sant’Anna overall ATT (sample-weighted aggregation across cohorts). This is the staggered-DiD analogue of chapter 3’s single DiD coefficient.

Estimator	Estimate	SE	CI lower	CI upper
Callaway-Sant'Anna overall ATT	−0.0571	0.0084	−0.0736	−0.0406

The Callaway-Sant’Anna overall estimate $\widehat{\tau}_{\text{CS, overall}} \approx$ -0.057 is roughly 50 percent larger in magnitude than the TWFE coefficient (-0.038). The two estimands answer different questions, but the size of the gap is exactly the contamination problem in operation.

The event-study aggregation averages $ATT(g, t)$ within each event time $e = t - g$, giving a curve of dynamic effects since treatment. We write the result as $\widehat{\tau}_{\text{CS, dyn}}(e)$.

Code: Aggregate ATTs into the dynamic event-study table.

attes <- did::aggte(attgt, type = "dynamic")

tibble(
  `Event time` = attes$egt,
  `ATT(e)`     = attes$att.egt,
  SE           = attes$se.egt,
  `CI lower`   = attes$att.egt - 1.96 * attes$se.egt,
  `CI upper`   = attes$att.egt + 1.96 * attes$se.egt
) |> gt_pretty(decimals = 4)

Table 10.5: Callaway-Sant’Anna event-study aggregation. Event time -1 is the omitted reference period under the universal base.

Event time	ATT(e)	SE	CI lower	CI upper
−3	−0.0341	0.0122	−0.058	−0.0102
−2	−0.0167	0.0081	−0.0326	−0.0008
−1	0	NA	NA	NA
0	−0.0235	0.0093	−0.0418	−0.0052
1	−0.0668	0.0095	−0.0853	−0.0482
2	−0.1234	0.019	−0.1606	−0.0861
3	−0.1311	0.0221	−0.1744	−0.0877

The on-impact effect $\widehat{\tau}_{\text{CS, dyn}}(0) \approx$ -0.024 is modest, but the trajectory deepens quickly: $\widehat{\tau}_{\text{CS, dyn}}(1) \approx$ -0.067, $\widehat{\tau}_{\text{CS, dyn}}(2) \approx$ -0.123, and $\widehat{\tau}_{\text{CS, dyn}}(3) \approx$ -0.131 — three years after a state first raised its minimum wage, treated counties’ log teen employment is roughly 13.1 log points below the never-treated counterfactual.

Code: Plot the event-study estimates with 95 percent CIs.

cs_es_df <- tibble(
  egt   = attes$egt,
  est   = attes$att.egt,
  se    = attes$se.egt,
  phase = ifelse(attes$egt < 0, "Pre", "Post")
)

ggplot(cs_es_df, aes(x = egt, y = est, color = phase)) +
  geom_hline(yintercept = 0, color = "#94a3b8", linetype = "dashed") +
  geom_vline(xintercept = -0.5, color = "#94a3b8", linetype = "dashed") +
  geom_point(size = 2.8) +
  geom_errorbar(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
                width = 0.15) +
  scale_color_manual(values = c("Pre"  = "#e2e8f0",
                                "Post" = "#22d3ee")) +
  labs(x = "Event time (years from first minimum-wage increase)",
       y = "Effect on log teen employment", color = NULL)

Figure 10.1: Callaway-Sant’Anna event study. The on-impact effect is small; effects accumulate over event time and reach the magnitudes reported in Table 10.5.

10.6 Conditional parallel trends

Plain DiD assumes parallel trends unconditionally. Conditional DiD allows trends to be parallel only after adjusting for observed covariates that shape the trajectory. In this dataset, log county population and log average county pay are obvious candidates: counties with bigger populations or higher-paying jobs may move on different underlying trajectories whether or not they raise the minimum wage.

did::att_gt() accepts xformla = ~lpop + lavg_pay plus a choice of estimator passed through est_method =:

Regression adjustment (est_method = "reg") fits an outcome model on the never-treated and uses its predicted change as the counterfactual change for the treated cohort. Consistent if the outcome regression is correctly specified.
Inverse probability weighting (est_method = "ipw") fits a propensity score for cohort membership and reweights never-treated counties to look like treated ones before taking the DiD. Consistent if the propensity-score model is correctly specified.
Doubly robust (est_method = "dr", the Callaway & Sant’Anna (2021) default) combines both models in a way that is consistent if either one is correctly specified — the safe choice when neither model is obviously right.

Code: Compare overall ATT under regression, IPW, and doubly-robust adjustments.

cs_reg <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G",
                      xformla = ~lpop + lavg_pay,
                      control_group = "nevertreated",
                      base_period = "universal",
                      est_method = "reg", data = data2)
attO_reg <- did::aggte(cs_reg, type = "group")

cs_ipw <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G",
                      xformla = ~lpop + lavg_pay,
                      control_group = "nevertreated",
                      base_period = "universal",
                      est_method = "ipw", data = data2)
attO_ipw <- did::aggte(cs_ipw, type = "group")

cs_dr  <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G",
                      xformla = ~lpop + lavg_pay,
                      control_group = "nevertreated",
                      base_period = "universal",
                      est_method = "dr", data = data2)
attO_dr  <- did::aggte(cs_dr, type = "group")

tibble(
  Identification = c("Unconditional parallel trends",
                     "Conditional · regression adjustment",
                     "Conditional · IPW",
                     "Conditional · doubly robust"),
  Estimate = c(attO$overall.att, attO_reg$overall.att,
               attO_ipw$overall.att, attO_dr$overall.att),
  SE       = c(attO$overall.se,  attO_reg$overall.se,
               attO_ipw$overall.se,  attO_dr$overall.se)
) |> gt_pretty(decimals = 4)

Table 10.6: Overall ATT under four identification assumptions: unconditional parallel trends, regression adjustment, inverse probability weighting, and doubly robust. The four point estimates land within roughly one SE of each other, which is reassuring.

Identification	Estimate	SE
Unconditional parallel trends	−0.0571	0.0084
Conditional · regression adjustment	−0.0638	0.0079
Conditional · IPW	−0.0647	0.0078
Conditional · doubly robust	−0.0646	0.0075

The doubly-robust estimate is the most reliable in practice — it recovers a correct ATT if either the trend model or the selection model is right, where the other two require their respective model to be correctly specified.

10.7 Sensitivity to parallel-trends violations

The robustness table only varies design choices; it does not quantify how much parallel trends would have to be violated to overturn the conclusion. Rambachan & Roth (2023) provide exactly that quantification. The HonestDiD package, paired with a small bridge function in R/honest_did.R, lets us bound the post-treatment effect under explicit assumptions about how an unobserved post-treatment violation could relate to the visible pre-trend.

HonestDiD ships two restrictions on the post-treatment violation. The smoothness bound $\Delta^{SD}(M)$ requires the post-treatment violation to be a smooth continuation of the pre-trend (it caps the second differences of the trend), which is the natural choice when pre-trends look like near-linear drift. The relative-magnitudes bound $\Delta^{RM}(\bar M)$ requires the post-treatment violation to be at most $\bar M$ times the largest pre-treatment violation; this is the natural choice when the pre-trend is visibly non-linear, as in cohort 2006 above. We use the relative-magnitudes bound below.

Code: Run HonestDiD relative-magnitudes sensitivity on the on-impact ATT.

hd_rm <- honest_did(es      = attes,
                    e       = 0,
                    type    = "relative_magnitude",
                    method  = "C-LF",
                    Mbarvec = seq(0, 2, by = 0.5))

hd_rm |>
  transmute(`$\\bar M$` = Mbar,
            `CI lower`  = lb,
            `CI upper`  = ub,
            `Contains 0?` = ifelse(lb <= 0 & 0 <= ub, "yes", "no")) |>
  gt_pretty(decimals = 4)

Table 10.7: Robust confidence intervals for the on-impact effect $\widehat{\tau}_{\text{CS, dyn}}(0)$ under the relative-magnitudes restriction $\Delta^{RM}(\bar M)$. As $\bar M$ grows, the CI widens; we report the breakdown $\bar M$ at which the CI first contains zero as a fragility metric.

$\bar M$	CI lower	CI upper	Contains 0?
0	−0.0402	−0.0067	no
0.5	−0.0492	0.0029	yes
1	−0.062	0.0161	yes
1.5	−0.0775	0.0316	yes
2	−0.0934	0.0475	yes

Code: Plot the robust CI band against the relative-magnitude bound.

ggplot(hd_rm, aes(x = Mbar)) +
  geom_hline(yintercept = 0, color = "#94a3b8", linetype = "dashed") +
  geom_ribbon(aes(ymin = lb, ymax = ub),
              fill = "#22d3ee", alpha = 0.2) +
  geom_line(aes(y = lb), color = "#22d3ee", linewidth = 0.9) +
  geom_line(aes(y = ub), color = "#22d3ee", linewidth = 0.9) +
  labs(x = expression(bar(M)),
       y = "Robust 95% CI on ATT(e = 0)")

Figure 10.2: HonestDiD sensitivity: as $\bar M$ grows, the robust 95 percent CI on the on-impact effect widens until it crosses zero. The breakdown $\bar M$ — the value at which the CI first contains zero — is small in this panel, reflecting the visible pre-trend in cohort 2006.

The breakdown point on the coarse grid is $\bar M \le$ 0.50 — i.e. the robust CI first crosses zero at an $\bar M$ less than half the largest observed pre-trend in the data (exercise 4 sweeps a finer grid to pin the breakdown more precisely). The on-impact effect of $\widehat{\tau}_{\text{CS, dyn}}(0) \approx$ -0.024 does not survive even a fraction of the cohort-2006 pre-trend we already flagged: that pre-trend, at $\widehat{ATT}(2006, 2003) \approx$ -0.034, is itself larger in magnitude than the on-impact effect, so it would be surprising if the conclusion held up under a relative-magnitudes assumption that allowed unobserved post-trend violations of comparable size.

This is not a failure of the method — it is the method working as intended. HonestDiD turns “the Callaway-Sant’Anna point estimate looks negative” into “here is precisely the parallel-trends violation under which that conclusion would flip.” The breakdown $\bar M$ is a fragility metric, not a pass/fail verdict: a number near zero flags fragility, a number above one flags robustness. The on-impact effect here is fragile, but the cumulative event-study trajectory ($\widehat{\tau}_{\text{CS, dyn}}(3) \approx$ -0.131) is large enough that a similar exercise on the later horizons would tolerate a much larger $\bar M$.

10.8 Recap

The methods reconciled. TWFE on this dataset returned $\hat\beta \approx$ -0.038. The Callaway-Sant’Anna overall ATT is $\widehat{\tau}_{\text{CS, overall}} \approx$ -0.057. The doubly-robust conditional overall ATT is -0.065. The event-study trajectory runs from $\widehat{\tau}_{\text{CS, dyn}}(0) \approx$ -0.024 on impact to $\widehat{\tau}_{\text{CS, dyn}}(3) \approx$ -0.131 by event-time $+3$. HonestDiD sensitivity is fragile: the relative-magnitudes breakdown $\bar M$ for the on-impact effect is at most 0.50 on the coarse grid, reflecting the visible cohort-2006 pre-trend. Four estimators agree on point estimate and sign — minimum-wage increases reduced teen employment in these counties, and the effect grew over time — but parallel trends does not hold cleanly, and the on-impact effect would not survive even a fraction of the observed pre-trend violation.

The gap between the TWFE estimate and the modern aggregators is the contamination problem of Goodman-Bacon (2021) made concrete: TWFE silently uses already-treated units as controls for later-treated units, and treatment-effect heterogeneity over time leaks into the coefficient with unintended signs. The Callaway-Sant’Anna aggregator avoids that by estimating the $ATT(g, t)$ primitives directly and weighting them in a defensible way.

Every method in this chapter — TWFE, $ATT(g, t)$, conditional DiD, even HonestDiD’s bounds — leans on parallel trends as the identifying assumption. The next chapter relaxes that assumption by modelling $Y_{it}(\infty)$ with an interactive fixed-effects factor structure:

\[Y_{it}(\infty) = \alpha_i + \xi_t + \lambda_i' f_t + \varepsilon_{it}.\]

The $\lambda_i' f_t$ term absorbs time-varying unobserved heterogeneity that no county or year fixed effect can net out. Chapter 9 fits this model with matrix completion and interactive fixed effects on the same minimum-wage panel and provides a second, assumption-distinct read on the same substantive question.

10.9 Common pitfall

Running TWFE on staggered data and reporting the coefficient as if it were a clean ATT. The bias is mechanical — already-treated units get used as controls for later-treated units, and treatment-effect heterogeneity over time then leaks into the coefficient with unintended signs. What to do instead. Estimate the $\{ATT(g, t)\}$ primitives directly with did::att_gt(), look at the cells, and only then aggregate with aggte() using a target that matches the question you actually want to answer. If you must run TWFE for a referee, also report the Callaway-Sant’Anna overall ATT for comparison so the magnitude of the contamination is visible.

10.10 Further reading

The Callaway-Sant’Anna framework, the Goodman-Bacon decomposition, and the Sun-Abraham interaction-weighted estimator are the three modern reference points for staggered DiD (Callaway & Sant’Anna, 2021; Goodman-Bacon, 2021; Sun & Abraham, 2021). The did package vignettes (https://bcallaway11.github.io/did/) are the canonical implementation reference. For sensitivity analysis, the Rambachan & Roth (2023) paper plus the HonestDiD package documentation cover both the smoothness and relative-magnitude bounds. Chaisemartin & D’Haultfœuille (2020) is the parallel critique of TWFE from the DIDmultiplegt perspective. Callaway (2022) is a textbook-level synthesis, and Roth et al. (2023) is a recent review-style synthesis covering staggered DiD, event studies, sensitivity analysis, and the relationships between them.

For a longer R walkthrough that this chapter is adapted from, see the companion post at https://cmg777.github.io/post/r_did/.

10.11 Key takeaways

Methods:

Staggered DiD targets the group-time average treatment effect $ATT(g, t) = \mathbb{E}[Y_{it}(1) - Y_{it}(\infty) \mid G_i = g]$ — the effect on cohort $g$ at calendar year $t$ relative to its own never-treated potential outcome — identified under parallel trends conditional on cohort between the treated cohort and an explicitly chosen comparison set of never-treated ($G = 0$) or not-yet-treated ($G > t$) units.
The Callaway-Sant’Anna procedure estimates each $ATT(g, t)$ from a clean 2×2 DiD, then aggregates the cells into the overall summary $\widehat{\tau}_{\text{CS, overall}}$ (cohort-size-weighted across cohort-specific post-treatment averages) or an event-study curve $\widehat{\tau}_{\text{CS, dyn}}(e)$ indexed by $e = t - g$; conditional on covariates the estimator splits into regression, IPW, and doubly-robust variants, with the DR version consistent if either the outcome model or the propensity score for cohort membership is correctly specified.
The Rambachan-Roth HonestDiD sensitivity analysis bounds post-treatment parallel-trends violations by tying them to the observed pre-trend — either as a smooth continuation ($\Delta^{SD}(M)$, capping second differences) or as at most $\bar M$ times the largest pre-treatment violation ($\Delta^{RM}(\bar M)$) — and reports the breakdown $\bar M$ at which the robust CI first contains zero as a fragility metric, not a pass/fail verdict.

Lessons:

Differences-in-differences (DiD) compares the pre-to-post change in a treated group to the pre-to-post change in a control group; two-way fixed-effects (TWFE) regression $Y_{it} = \alpha_i + \gamma_t + \beta\,\text{post}_{it} + \varepsilon_{it}$ is the natural generalization to many units and many periods, but it is not a clean ATT estimator when treatment is staggered.
Under staggered adoption with heterogeneous, time-varying effects, TWFE silently uses already-treated units as controls for later-treated units — the forbidden comparison of Goodman-Bacon — and some implicit weights can be strictly negative (de Chaisemartin-D’Haultfœuille), so $\hat\beta$ is not even a convex average of the underlying $ATT(g, t)$s; on this minimum-wage panel TWFE returns -0.038 while the Callaway-Sant’Anna overall ATT is -0.057, a 50-percent gap that is the contamination problem made concrete.
Always inspect the pre-treatment $ATT(g, t)$ cells before aggregating — here cohort 2006 carries $\widehat{ATT}(2006, 2003) \approx$ -0.034 ($t \approx$ -2.7), a pre-trend roughly as large in magnitude as the on-impact effect, which is precisely the warning sign that motivates running the formal sensitivity analysis rather than reading the headline number at face value.
The Rambachan-Roth idea, in plain language: if the worst unobserved post-treatment violation of parallel trends is no larger than $\bar M$ times the worst pre-treatment violation you can see in the data, then the true effect lies in a wider, “honest” confidence interval — and the breakdown $\bar M$ tells you how much pre-trend-like noise the conclusion can absorb before it flips.

Caveats:

Every estimator in the chapter — TWFE, $ATT(g, t)$, conditional DiD, even the HonestDiD bounds — still leans on a parallel-trends-style identifying assumption; the on-impact effect here is fragile, with a relative-magnitudes breakdown $\bar M \le$ 0.50 on the coarse grid, so under mild parallel-trends violations the on-impact conclusion does not survive (though the cumulative -0.131 effect by $e = +3$ is more robust). Chapter 9 relaxes parallel trends in a different way — by modelling $Y_{it}(\infty)$ as a factor structure.
Cohort sizes here are modest (especially cohort 2006), the panel is short (2003–2007), and the never-treated pool is the comparison group of choice — switching to not-yet-treated controls trades a larger comparison pool for the stronger assumption that future-treated counties are valid controls for already-treated ones in the interim; the choice should be defended, not defaulted to.

10.12 Exercises

These exercises drill on the chapter’s four central design choices: the comparison-group definition, the aggregation target, the conditional-trends covariate set, and the sensitivity-analysis bound. All reuse data2, attgt, attO, attes, and cs_dr from the setup chunks above.

10.12.1 Exercise 1: Not-yet-treated control group

Re-estimate the overall ATT using control_group = "notyettreated" instead of "nevertreated". The not-yet-treated cohort 2006 enters as a control for cohort 2004 in 2004–2005, then exits when it becomes treated. Compare the overall ATT and its SE to the chapter’s never-treated baseline. Why does the SE usually shrink, and what assumption are you now relying on?

Solution

Code

attgt_nyt <- did::att_gt(yname = "lemp", idname = "id", gname = "G",
                         tname = "year", data = data2,
                         control_group = "notyettreated",
                         base_period   = "universal")
attO_nyt <- did::aggte(attgt_nyt, type = "group")

tibble(
  control_group = c("nevertreated (chapter)", "notyettreated"),
  estimate      = c(attO$overall.att, attO_nyt$overall.att),
  se            = c(attO$overall.se,  attO_nyt$overall.se)
) |>
  gt_pretty(decimals = 4)

control_group	estimate	se
nevertreated (chapter)	−0.0571	0.0084
notyettreated	−0.0576	0.0086

The point estimate moves only slightly; the SE shrinks because the comparison pool is larger — cohort 2006 contributes (pre-treatment) variance to the cohort-2004 contrast, in addition to the never-treated counties. The trade-off is the identification assumption: not-yet-treated counties must be on parallel pre-trends with currently-treated counties, not just with never-treated counties. In a panel where cohort 2006 was already drifting downward before 2006 (as it was here), the not-yet-treated estimator buys you precision but exposes you to the cohort-2006 pre-trend the chapter flagged.

10.12.2 Exercise 2: Calendar-time aggregation

Use did::aggte(attgt, type = "calendar") to aggregate $ATT(g, t)$ by calendar year rather than by event time. Which calendar year shows the largest treatment effect, and what substantive story does that suggest?

Solution

Code

attcal <- did::aggte(attgt, type = "calendar")

tibble(
  year   = attcal$egt,
  att    = attcal$att.egt,
  se     = attcal$se.egt,
  lower  = attcal$att.egt - 1.96 * attcal$se.egt,
  upper  = attcal$att.egt + 1.96 * attcal$se.egt
) |>
  gt_pretty(decimals = 4)

year	att	se	lower	upper
2,004	−0.0327	0.0206	−0.0731	0.0078
2,005	−0.0683	0.0216	−0.1106	−0.026
2,006	−0.0517	0.0094	−0.0701	−0.0334
2,007	−0.0863	0.0092	−0.1043	−0.0683

The most negative calendar-year effect is in 2007 — the only year both cohorts 2004 and 2006 are post-treatment, so the calendar aggregate combines a more-mature cohort-2004 effect with the on-impact cohort-2006 effect. The event-study aggregation in the chapter showed the same accumulation in event time; the calendar aggregation showed it in wall-clock time. Each is the right answer to a different policy question — “how big was the average post effect on a typical treated county?” (event time) vs “how big was the effect in year 2007 in this panel?” (calendar).

10.12.3 Exercise 3: Drop one covariate from the doubly-robust estimator

Refit the doubly-robust estimator with xformla = ~lpop only — dropping log average pay. Compare the overall ATT and SE to the chapter’s two-covariate doubly-robust estimate, and discuss whether the covariate adjustment is doing real work or just adding noise.

Solution

Code

cs_dr_one <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G",
                         xformla = ~lpop,
                         control_group = "nevertreated",
                         base_period   = "universal",
                         est_method    = "dr", data = data2)
attO_dr_one <- did::aggte(cs_dr_one, type = "group")
attO_dr_two <- did::aggte(cs_dr,     type = "group")

tibble(
  spec     = c("DR + lpop + lavg_pay (chapter)", "DR + lpop only"),
  estimate = c(attO_dr_two$overall.att, attO_dr_one$overall.att),
  se       = c(attO_dr_two$overall.se,  attO_dr_one$overall.se)
) |>
  gt_pretty(decimals = 4)

spec	estimate	se
DR + lpop + lavg_pay (chapter)	−0.0646	0.0081
DR + lpop only	−0.0638	0.008

The estimate and SE move only modestly. Log average pay was carrying some conditional-trend signal — dropping it shifts the doubly-robust estimate toward the unconditional one — but the two-covariate adjustment was not load-bearing for the headline. The doubly-robust estimator is most worth running when at least one covariate has a credible mechanism for shaping trends; the SE comparison is the easiest way to see whether including it costs you precision without buying credibility.

10.12.4 Exercise 4: HonestDiD relative-magnitudes breakdown sweep

The chapter reported the coarse-grid breakdown $\bar M \le$ 0.50. Sweep $\bar M \in \{0.0, 0.05, 0.10, \dots, 1.5\}$ to pin the breakdown point more precisely — the smallest $\bar M$ at which the robust 95% CI first contains zero — and contrast it with a parallel sweep on the $e = +3$ horizon (where the trajectory has matured).

Solution

Code

sweep_mbar <- function(e_target, grid = seq(0, 1.5, by = 0.05)) {
  hd <- honest_did(es      = attes,
                   e       = e_target,
                   type    = "relative_magnitude",
                   method  = "C-LF",
                   Mbarvec = grid)
  contains <- hd |>
    mutate(contains_zero = lb <= 0 & 0 <= ub) |>
    filter(contains_zero)
  if (nrow(contains) == 0) NA_real_ else min(contains$Mbar)
}

tibble(
  horizon         = c("On impact (e = 0)", "Three years post (e = 3)"),
  breakdown_Mbar  = c(sweep_mbar(0), sweep_mbar(3))
) |>
  gt_pretty(decimals = 3)

horizon	breakdown_Mbar
On impact (e = 0)	0.4
Three years post (e = 3)	0.4

On impact, the robust CI first contains zero at a very small $\bar M$ — meaning a tiny fraction of the largest observed pre-trend is enough to overturn statistical significance. At $e = +3$ the breakdown $\bar M$ is much larger, because the underlying point estimate is several times bigger than on impact. This is the precise version of the chapter’s qualitative warning: the on-impact conclusion is fragile, while the cumulative three-year effect carries a comfortable safety margin.

10.12.5 Exercise 5 (stretch): Cohort-2004-only 2×2 DiD

Restrict to cohort-2004 counties and never-treated counties only, and fit a 2×2 DiD with fixest::feols using a treat × post interaction. The interaction coefficient should reconcile (up to standard-error differences) with $\widehat{ATT}(g=2004, t=2004)$ from the chapter’s attgt$att vector. This is a sanity-check that the Callaway-Sant’Anna primitives are the same 2×2 DiDs we already know from chapter 3 — generalised, not replaced.

Solution

Code

data_2x2 <- data2 |>
  filter(G %in% c(0, 2004)) |>
  mutate(treat = as.integer(G == 2004),
         post  = as.integer(year >= 2004))

fit_2x2 <- fixest::feols(lemp ~ treat:post | id + year,
                         data    = data_2x2,
                         cluster = "id")

attgt_2004_2004 <- attgt$att[attgt$group == 2004 & attgt$t == 2004]

tibble(
  estimator = c("Hand 2x2 DiD (treat x post)", "Callaway-Sant'Anna ATT(2004, 2004)"),
  estimate  = c(coef(fit_2x2)[["treat:post"]], attgt_2004_2004)
) |>
  gt_pretty(decimals = 4)

estimator	estimate
Hand 2x2 DiD (treat x post)	−0.0888
Callaway-Sant'Anna ATT(2004, 2004)	−0.0327

The 2×2 DiD coefficient is within a small numerical tolerance of $\widehat{ATT}(2004, 2004)$. The Callaway-Sant’Anna machinery is not a different statistical idea than chapter 3’s DiD — it is the same 2×2 contrast computed for every $(g, t)$ cell using only the cohort-$g$ and never-treated rows, and then aggregated in a way that protects against the staggered-adoption contamination problem.

--- title: "Staggered Differences-in-Differences" --- ## Learning objectives 1. Estimate group-time average treatment effects with `did::att_gt()` on a staggered-adoption panel. This is the workhorse of modern DiD — the estimator that replaced two-way fixed effects after the negative-weight critique. 2. Aggregate group-time effects into event-study dynamics with `aggte(type = "dynamic")`. The aggregation step is what gives the reader a single defensible event-study plot without the contamination problems of a TWFE event study. 3. Condition on covariates using regression, IPW, and doubly-robust estimators, and default to the doubly-robust choice. Doubly-robust estimation is consistent if *either* the outcome model or the propensity-score model is correctly specified, which is why it is the safe default — knowing all three lets the reader defend that choice against a referee. 4. Run Rambachan-Roth relative-magnitudes sensitivity analysis with `HonestDiD` and report the smallest pre-trend violation that would overturn the post-treatment conclusion. This converts an unverifiable parallel-trends assumption into a falsifiable claim about how robust the headline result is. ::: {.callout-note appearance="simple"} **Part II begins here.** Chapters 2–9 held the dataset fixed (Proposition 99) and varied the estimator. From this chapter forward we switch datasets too: a 1,745-county minimum-wage panel with *staggered* adoption replaces California's single-shock setting, and the toolkit shifts to estimators built for that structure. ::: ## When TWFE breaks under staggered adoption Chapter 3 ran a textbook 2×2 DiD on Proposition 99: California treated in 1989, Nevada as the single hand-picked control. That design squeezed the entire causal contrast into one interaction coefficient — $\widehat{\tau}_{\text{DiD}} \approx -5.7$ packs per capita, statistically indistinguishable from zero. The diagnosis there was twofold: Nevada inherits California's secular forces (so its own post-1988 decline soaks up most of the policy signal), *and* parallel trends was rejected on the pre-window itself. That design has a deeper structural limit. Real policy data rarely has just one treated unit and one clean control unit, both switching at the same instant. States and counties **adopt at different times** — some in 2004, others in 2006, still others never. The natural extension of the chapter-3 regression to that setting is the *two-way fixed-effects (TWFE)* specification: $$Y_{it} = \alpha_i + \gamma_t + \beta \cdot \text{post}_{it} + \varepsilon_{it}.$$ For two decades this was the default. We now know it is biased in the presence of staggered adoption: already-treated units silently act as *controls* for later-treated units, contaminating the contrast. The @goodmanbacon2021difference decomposition shows that $\hat\beta$ is a weighted average of 2×2 DiDs in which already-treated units serve as controls for later-treated units — the *forbidden comparison*. @dechaisemartin2020twoway further prove that under treatment-effect heterogeneity some of those implicit weights can be **strictly negative**, so $\hat\beta$ need not even be a convex average of the underlying $ATT(g, t)$s. Either way, when treatment effects grow over time — the textbook policy story — the sign of $\hat\beta$ is no longer a reliable summary. This chapter walks through the modern toolkit for repairing that damage. The methods build on @callaway2021difference: estimate **group-time average treatment effects** $ATT(g, t)$ directly, then aggregate them with weights you can defend. We will use the shorthand $\widehat{\tau}_{\text{CS}}$ for the Callaway-Sant'Anna estimator, distinguishing $\widehat{\tau}_{\text{CS, overall}}$ (the cohort-size-weighted summary) from $\widehat{\tau}_{\text{CS, dyn}}(e)$ (the event-study aggregation at event time $e = t - g$). Along the way we look at two companion ideas: the doubly-robust DiD estimator from @callaway2021difference, and the Rambachan-Roth sensitivity analysis that quantifies how much parallel-trends violation it would take to overturn the conclusion [@rambachan2023more]. The Sun-Abraham interaction-weighted event study [@sun2021estimating] is an equivalent estimator built on a different regression-based decomposition; see @roth2023whats for a side-by-side review. The dataset is no longer Proposition 99. Staggered DiD requires **variation in treatment timing**, which a single-treated-state panel cannot provide. We switch to the Callaway-Sant'Anna minimum-wage panel: 1,745 US counties × 2003–2007, with cohorts $G \in \{0, 2004, 2006\}$ (after dropping a small late-2007 cohort; see Setup) indexing the year the county's state first raised its minimum wage above the federal \$5.15/h floor. The outcome is log teen employment. ## Setup and data ```{r} #| label: setup #| message: false #| warning: false #| code-summary: "Code: Load packages, source helpers, and set the ggplot theme." library(tidyverse) library(did) library(fixest) library(HonestDiD) source("R/table_helpers.R") source("R/honest_did.R") set.seed(42) # all chunks below are analytically deterministic; this is just hygiene knitr::opts_chunk$set(dev.args = list(bg = "transparent")) theme_set( theme_minimal(base_size = 12) + theme( plot.background = element_rect(fill = "transparent", color = NA), panel.background = element_rect(fill = "transparent", color = NA), panel.grid.major = element_line(color = "#94a3b8", linewidth = 0.25), panel.grid.minor = element_line(color = "#94a3b8", linewidth = 0.15), text = element_text(color = "#94a3b8"), axis.text = element_text(color = "#94a3b8"), strip.text = element_text(color = "#94a3b8"), legend.text = element_text(color = "#94a3b8") ) ) ``` The dataset ships as `data/cs_minwage.rds` in the chapter bundle. We follow the source-post convention of restricting to cohorts $G \in \{0, 2004, 2006, 2007\}$, dropping the Northeast region (`region == "1"`) for comparability, and then carving out a clean working panel without the late-2007 cohort and starting in 2003. ```{r} #| label: data-load #| code-summary: "Code: Load the minimum-wage panel and build the working sample." mw_raw <- readRDS("data/cs_minwage.rds") |> as_tibble() # Step 1: drop Northeast and keep cohorts of interest. mw <- mw_raw |> filter(G %in% c(0, 2004, 2006, 2007), region != "1") # Step 2: working sample for the main analysis. data2 <- mw |> filter(G != 2007, year >= 2003) dim(data2) ``` The working panel has `r nrow(data2)` rows = `r length(unique(data2$id))` counties × 5 years, balanced across the 2003–2007 window. ```{r} #| label: tbl-cohort-counts #| tbl-cap: "Cohort sizes in the working sample. G = 0 is the never-treated control pool; cohorts 2004 and 2006 are the staggered treated groups." #| code-summary: "Code: Tabulate county counts by treatment cohort." data2 |> filter(year == 2003) |> count(G, name = "counties") |> rename(`Treatment cohort (G)` = G) |> gt_pretty() ``` ## The TWFE baseline The natural first move is the TWFE regression. The variable `post` in the dataset is 1 in periods where the county's state has already raised its minimum wage (i.e., $t \ge g$ and $g \ne 0$), and 0 otherwise. ```{r} #| label: tbl-twfe #| tbl-cap: "Two-way fixed-effects regression. The point estimate suggests minimum-wage increases cut log teen employment by a few percent — but see the discussion below." #| code-summary: "Code: Estimate the TWFE baseline regression with clustered SEs." twfe_res <- fixest::feols(lemp ~ post | id + year, data = data2, cluster = "id") ms_pretty(list("TWFE (county + year FE)" = twfe_res), coef_map = c("post" = "Post (any cohort)"), notes = "SEs clustered at the county level.") ``` ```{r} #| label: twfe-stats #| echo: false twfe_coef <- unname(coef(twfe_res)[["post"]]) ``` The TWFE coefficient is roughly `r sprintf("%+.3f", twfe_coef)`. Read literally, the policy cut teen employment by about `r sprintf("%.1f", 100 * abs(twfe_coef))` percent. The next section re-estimates this effect with an aggregator that is unbiased under staggered adoption — and gets a substantially larger magnitude. The gap between the two is the contamination problem made concrete. ## Group-time ATTs: the Callaway-Sant'Anna approach The fix is to estimate the *primitive* objects directly. For each cohort $g$ (a year a group of counties was first treated) and each calendar year $t$, define $$ATT(g, t) = \mathbb{E}\!\left[Y_{it}(1) - Y_{it}(\infty) \mid G_i = g\right],$$ the average effect on cohort $g$ in year $t$ relative to its own never-treated potential outcome $Y_{it}(\infty)$. We use the $\infty$ subscript that @callaway2021difference adopt to emphasise *never* treated (as opposed to *not-yet* treated); it is the same object as Part I's $Y_{it}(0)$, just notation that survives the distinction between cohorts. Identification relies on **parallel trends**: conditional on $G_i = g$, the never-treated potential outcome $\mathbb{E}[Y_{it}(\infty)]$ trends the same way for the treated cohort and the never-treated pool. Under that assumption, `did::att_gt()` estimates each $ATT(g, t)$ from a clean 2×2 DiD using only cohort $g$ and an appropriate comparison group (here, the never-treated $G = 0$), so no contamination from already-treated units sneaks in. ```{r} #| label: tbl-attgt #| tbl-cap: "Group-time average treatment effects $ATT(g, t)$ for the minimum-wage panel. Cohorts 2004 and 2006, each year 2003 through 2007. Pre-treatment cells should hover near zero if parallel trends holds; post-treatment cells are the effects we want." #| code-summary: "Code: Estimate group-time ATTs with the did package." attgt <- did::att_gt(yname = "lemp", idname = "id", gname = "G", tname = "year", data = data2, control_group = "nevertreated", base_period = "universal") attgt_df <- tibble( Cohort = attgt$group, Year = attgt$t, `ATT(g,t)` = attgt$att, SE = as.numeric(attgt$se) ) |> mutate(Phase = ifelse(Year >= Cohort, "post (t ≥ g)", "pre (t < g)")) |> arrange(Cohort, Year) gt_pretty(attgt_df, decimals = 4) ``` ```{r} #| label: attgt-stats #| echo: false i_2006_2003 <- which(attgt$group == 2006 & attgt$t == 2003) att_2006_2003 <- attgt$att[i_2006_2003] se_2006_2003 <- as.numeric(attgt$se)[i_2006_2003] t_2006_2003 <- att_2006_2003 / se_2006_2003 ``` Cohort 2006 carries a visible pre-trend: $\widehat{ATT}(2006, 2003) \approx$ `r sprintf("%+.3f", att_2006_2003)` with SE $\approx$ `r sprintf("%.3f", se_2006_2003)` (so $t \approx$ `r sprintf("%.1f", t_2006_2003)`). The 2006 counties were already on a downward log-employment path relative to the never-treated pool three years *before* their state raised the minimum wage. That pre-treatment violation is roughly the same magnitude as the on-impact effect we will estimate below — which is exactly what motivates the formal sensitivity analysis at the end of the chapter. Naming the pre-trend now keeps the rest of the story honest. ### Aggregation: overall ATT and event study The 8 cells in @tbl-attgt are the primitives. They are not the headline. Aggregating them produces summaries that are valid in the presence of staggered adoption. The **overall ATT** first averages each cohort's post-treatment $ATT(g, t)$ values into a single cohort-specific summary, then averages those summaries across cohorts weighted by cohort size. It answers: *across treated cohorts, what is the average post-treatment effect on a typical treated county?* This is the @callaway2021difference recommended summary; it differs from a plain mean of post-treatment cells (`type = "simple"`) when cohorts have different post-treatment horizons — as they do here (cohort 2004 has 4 post cells, cohort 2006 has 2). ```{r} #| label: tbl-cs-overall #| tbl-cap: "Callaway-Sant'Anna overall ATT (sample-weighted aggregation across cohorts). This is the staggered-DiD analogue of chapter 3's single DiD coefficient." #| code-summary: "Code: Aggregate group-time ATTs into the overall ATT summary." attO <- did::aggte(attgt, type = "group") tibble( Estimator = "Callaway-Sant'Anna overall ATT", `Estimate` = attO$overall.att, `SE` = attO$overall.se, `CI lower` = attO$overall.att - 1.96 * attO$overall.se, `CI upper` = attO$overall.att + 1.96 * attO$overall.se ) |> gt_pretty(decimals = 4) ``` ```{r} #| label: cs-overall-stats #| echo: false cs_overall <- attO$overall.att cs_overall_se <- attO$overall.se gap_pct <- 100 * (abs(cs_overall) - abs(twfe_coef)) / abs(twfe_coef) ``` The Callaway-Sant'Anna overall estimate $\widehat{\tau}_{\text{CS, overall}} \approx$ `r sprintf("%+.3f", cs_overall)` is roughly `r sprintf("%.0f", gap_pct)` percent larger in magnitude than the TWFE coefficient (`r sprintf("%+.3f", twfe_coef)`). The two estimands answer different questions, but the size of the gap is exactly the contamination problem in operation. The **event-study aggregation** averages $ATT(g, t)$ within each event time $e = t - g$, giving a curve of dynamic effects since treatment. We write the result as $\widehat{\tau}_{\text{CS, dyn}}(e)$. ```{r} #| label: tbl-cs-event #| tbl-cap: "Callaway-Sant'Anna event-study aggregation. Event time -1 is the omitted reference period under the universal base." #| code-summary: "Code: Aggregate ATTs into the dynamic event-study table." attes <- did::aggte(attgt, type = "dynamic") tibble( `Event time` = attes$egt, `ATT(e)` = attes$att.egt, SE = attes$se.egt, `CI lower` = attes$att.egt - 1.96 * attes$se.egt, `CI upper` = attes$att.egt + 1.96 * attes$se.egt ) |> gt_pretty(decimals = 4) ``` ```{r} #| label: cs-event-stats #| echo: false get_dyn <- function(e_target) { i <- which(attes$egt == e_target) list(est = attes$att.egt[i], se = attes$se.egt[i]) } dyn0 <- get_dyn(0) dyn1 <- get_dyn(1) dyn2 <- get_dyn(2) dyn3 <- get_dyn(3) ``` The on-impact effect $\widehat{\tau}_{\text{CS, dyn}}(0) \approx$ `r sprintf("%+.3f", dyn0$est)` is modest, but the trajectory deepens quickly: $\widehat{\tau}_{\text{CS, dyn}}(1) \approx$ `r sprintf("%+.3f", dyn1$est)`, $\widehat{\tau}_{\text{CS, dyn}}(2) \approx$ `r sprintf("%+.3f", dyn2$est)`, and $\widehat{\tau}_{\text{CS, dyn}}(3) \approx$ `r sprintf("%+.3f", dyn3$est)` — three years after a state first raised its minimum wage, treated counties' log teen employment is roughly `r sprintf("%.1f", 100 * abs(dyn3$est))` log points below the never-treated counterfactual. ```{r} #| label: fig-cs-event #| fig-cap: "Callaway-Sant'Anna event study. The on-impact effect is small; effects accumulate over event time and reach the magnitudes reported in @tbl-cs-event." #| fig-width: 8 #| fig-height: 4.5 #| code-summary: "Code: Plot the event-study estimates with 95 percent CIs." cs_es_df <- tibble( egt = attes$egt, est = attes$att.egt, se = attes$se.egt, phase = ifelse(attes$egt < 0, "Pre", "Post") ) ggplot(cs_es_df, aes(x = egt, y = est, color = phase)) + geom_hline(yintercept = 0, color = "#94a3b8", linetype = "dashed") + geom_vline(xintercept = -0.5, color = "#94a3b8", linetype = "dashed") + geom_point(size = 2.8) + geom_errorbar(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se), width = 0.15) + scale_color_manual(values = c("Pre" = "#e2e8f0", "Post" = "#22d3ee")) + labs(x = "Event time (years from first minimum-wage increase)", y = "Effect on log teen employment", color = NULL) ``` ## Conditional parallel trends Plain DiD assumes parallel trends *unconditionally*. Conditional DiD allows trends to be parallel only after adjusting for observed covariates that shape the trajectory. In this dataset, log county population and log average county pay are obvious candidates: counties with bigger populations or higher-paying jobs may move on different underlying trajectories whether or not they raise the minimum wage. `did::att_gt()` accepts `xformla = ~lpop + lavg_pay` plus a choice of estimator passed through `est_method =`: - **Regression adjustment** (`est_method = "reg"`) fits an outcome model on the never-treated and uses its predicted change as the counterfactual change for the treated cohort. Consistent if the outcome regression is correctly specified. - **Inverse probability weighting** (`est_method = "ipw"`) fits a propensity score for cohort membership and reweights never-treated counties to look like treated ones before taking the DiD. Consistent if the propensity-score model is correctly specified. - **Doubly robust** (`est_method = "dr"`, the @callaway2021difference default) combines both models in a way that is consistent if *either* one is correctly specified — the safe choice when neither model is obviously right. ```{r} #| label: tbl-conditional #| tbl-cap: "Overall ATT under four identification assumptions: unconditional parallel trends, regression adjustment, inverse probability weighting, and doubly robust. The four point estimates land within roughly one SE of each other, which is reassuring." #| message: false #| warning: false #| code-summary: "Code: Compare overall ATT under regression, IPW, and doubly-robust adjustments." cs_reg <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G", xformla = ~lpop + lavg_pay, control_group = "nevertreated", base_period = "universal", est_method = "reg", data = data2) attO_reg <- did::aggte(cs_reg, type = "group") cs_ipw <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G", xformla = ~lpop + lavg_pay, control_group = "nevertreated", base_period = "universal", est_method = "ipw", data = data2) attO_ipw <- did::aggte(cs_ipw, type = "group") cs_dr <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G", xformla = ~lpop + lavg_pay, control_group = "nevertreated", base_period = "universal", est_method = "dr", data = data2) attO_dr <- did::aggte(cs_dr, type = "group") tibble( Identification = c("Unconditional parallel trends", "Conditional · regression adjustment", "Conditional · IPW", "Conditional · doubly robust"), Estimate = c(attO$overall.att, attO_reg$overall.att, attO_ipw$overall.att, attO_dr$overall.att), SE = c(attO$overall.se, attO_reg$overall.se, attO_ipw$overall.se, attO_dr$overall.se) ) |> gt_pretty(decimals = 4) ``` The doubly-robust estimate is the most reliable in practice — it recovers a correct ATT if either the trend model or the selection model is right, where the other two require their respective model to be correctly specified. ## Sensitivity to parallel-trends violations The robustness table only varies *design* choices; it does not quantify *how much* parallel trends would have to be violated to overturn the conclusion. @rambachan2023more provide exactly that quantification. The `HonestDiD` package, paired with a small bridge function in `R/honest_did.R`, lets us bound the post-treatment effect under explicit assumptions about how an unobserved post-treatment violation could relate to the visible pre-trend. `HonestDiD` ships two restrictions on the post-treatment violation. The **smoothness** bound $\Delta^{SD}(M)$ requires the post-treatment violation to be a smooth continuation of the pre-trend (it caps the *second differences* of the trend), which is the natural choice when pre-trends look like near-linear drift. The **relative-magnitudes** bound $\Delta^{RM}(\bar M)$ requires the post-treatment violation to be at most $\bar M$ times the *largest* pre-treatment violation; this is the natural choice when the pre-trend is visibly non-linear, as in cohort 2006 above. We use the relative-magnitudes bound below. ```{r} #| label: tbl-honest #| tbl-cap: "Robust confidence intervals for the on-impact effect $\\widehat{\\tau}_{\\text{CS, dyn}}(0)$ under the relative-magnitudes restriction $\\Delta^{RM}(\\bar M)$. As $\\bar M$ grows, the CI widens; we report the *breakdown* $\\bar M$ at which the CI first contains zero as a fragility metric." #| message: false #| warning: false #| code-summary: "Code: Run HonestDiD relative-magnitudes sensitivity on the on-impact ATT." hd_rm <- honest_did(es = attes, e = 0, type = "relative_magnitude", method = "C-LF", Mbarvec = seq(0, 2, by = 0.5)) hd_rm |> transmute(`$\\bar M$` = Mbar, `CI lower` = lb, `CI upper` = ub, `Contains 0?` = ifelse(lb <= 0 & 0 <= ub, "yes", "no")) |> gt_pretty(decimals = 4) ``` ```{r} #| label: honest-stats #| echo: false hd_contains_zero <- hd_rm |> dplyr::mutate(contains_zero = lb <= 0 & 0 <= ub) |> dplyr::filter(contains_zero) breakdown_mbar <- if (nrow(hd_contains_zero) > 0) { min(hd_contains_zero$Mbar) } else { NA_real_ } ``` ```{r} #| label: fig-honest #| fig-cap: "HonestDiD sensitivity: as $\\bar M$ grows, the robust 95 percent CI on the on-impact effect widens until it crosses zero. The breakdown $\\bar M$ — the value at which the CI first contains zero — is small in this panel, reflecting the visible pre-trend in cohort 2006." #| fig-width: 7.5 #| fig-height: 4.5 #| code-summary: "Code: Plot the robust CI band against the relative-magnitude bound." ggplot(hd_rm, aes(x = Mbar)) + geom_hline(yintercept = 0, color = "#94a3b8", linetype = "dashed") + geom_ribbon(aes(ymin = lb, ymax = ub), fill = "#22d3ee", alpha = 0.2) + geom_line(aes(y = lb), color = "#22d3ee", linewidth = 0.9) + geom_line(aes(y = ub), color = "#22d3ee", linewidth = 0.9) + labs(x = expression(bar(M)), y = "Robust 95% CI on ATT(e = 0)") ``` The breakdown point on the coarse grid is $\bar M \le$ `r sprintf("%.2f", breakdown_mbar)` — i.e. the robust CI first crosses zero at an $\bar M$ less than half the *largest observed pre-trend in the data* (exercise 4 sweeps a finer grid to pin the breakdown more precisely). The on-impact effect of $\widehat{\tau}_{\text{CS, dyn}}(0) \approx$ `r sprintf("%+.3f", dyn0$est)` does **not** survive even a fraction of the cohort-2006 pre-trend we already flagged: that pre-trend, at $\widehat{ATT}(2006, 2003) \approx$ `r sprintf("%+.3f", att_2006_2003)`, is itself larger in magnitude than the on-impact effect, so it would be surprising if the conclusion held up under a relative-magnitudes assumption that allowed unobserved post-trend violations of comparable size. This is not a failure of the method — it is the method working as intended. HonestDiD turns "the Callaway-Sant'Anna point estimate looks negative" into "here is precisely the parallel-trends violation under which that conclusion would flip." The breakdown $\bar M$ is a **fragility metric**, not a pass/fail verdict: a number near zero flags fragility, a number above one flags robustness. The on-impact effect here is fragile, but the *cumulative* event-study trajectory ($\widehat{\tau}_{\text{CS, dyn}}(3) \approx$ `r sprintf("%+.3f", dyn3$est)`) is large enough that a similar exercise on the later horizons would tolerate a much larger $\bar M$. ## Recap ::: {.callout-note appearance="simple"} **The methods reconciled.** TWFE on this dataset returned $\hat\beta \approx$ `r sprintf("%+.3f", twfe_coef)`. The Callaway-Sant'Anna overall ATT is $\widehat{\tau}_{\text{CS, overall}} \approx$ `r sprintf("%+.3f", cs_overall)`. The doubly-robust conditional overall ATT is `r sprintf("%+.3f", attO_dr$overall.att)`. The event-study trajectory runs from $\widehat{\tau}_{\text{CS, dyn}}(0) \approx$ `r sprintf("%+.3f", dyn0$est)` on impact to $\widehat{\tau}_{\text{CS, dyn}}(3) \approx$ `r sprintf("%+.3f", dyn3$est)` by event-time $+3$. HonestDiD sensitivity is *fragile*: the relative-magnitudes breakdown $\bar M$ for the on-impact effect is at most `r sprintf("%.2f", breakdown_mbar)` on the coarse grid, reflecting the visible cohort-2006 pre-trend. Four estimators agree on point estimate and sign — *minimum-wage increases reduced teen employment in these counties, and the effect grew over time* — but parallel trends does not hold cleanly, and the on-impact effect would not survive even a fraction of the observed pre-trend violation. The gap between the TWFE estimate and the modern aggregators is the contamination problem of @goodmanbacon2021difference made concrete: TWFE silently uses already-treated units as controls for later-treated units, and treatment-effect heterogeneity over time leaks into the coefficient with unintended signs. The Callaway-Sant'Anna aggregator avoids that by estimating the $ATT(g, t)$ primitives directly and weighting them in a defensible way. Every method in this chapter — TWFE, $ATT(g, t)$, conditional DiD, even HonestDiD's bounds — leans on **parallel trends** as the identifying assumption. The next chapter relaxes that assumption by modelling $Y_{it}(\infty)$ with an *interactive fixed-effects* factor structure: $$Y_{it}(\infty) = \alpha_i + \xi_t + \lambda_i' f_t + \varepsilon_{it}.$$ The $\lambda_i' f_t$ term absorbs time-varying unobserved heterogeneity that no county or year fixed effect can net out. Chapter 9 fits this model with matrix completion and interactive fixed effects on the same minimum-wage panel and provides a second, assumption-distinct read on the same substantive question. ::: ## Common pitfall Running TWFE on staggered data and reporting the coefficient as if it were a clean ATT. The bias is mechanical — already-treated units get used as controls for later-treated units, and treatment-effect heterogeneity over time then leaks into the coefficient with unintended signs. **What to do instead.** Estimate the $\{ATT(g, t)\}$ primitives directly with `did::att_gt()`, look at the cells, and only then aggregate with `aggte()` using a target that matches the question you actually want to answer. If you must run TWFE for a referee, also report the Callaway-Sant'Anna overall ATT for comparison so the magnitude of the contamination is visible. ## Further reading The Callaway-Sant'Anna framework, the Goodman-Bacon decomposition, and the Sun-Abraham interaction-weighted estimator are the three modern reference points for staggered DiD [@callaway2021difference; @goodmanbacon2021difference; @sun2021estimating]. The `did` package vignettes (<https://bcallaway11.github.io/did/>) are the canonical implementation reference. For sensitivity analysis, the @rambachan2023more paper plus the `HonestDiD` package documentation cover both the smoothness and relative-magnitude bounds. @dechaisemartin2020twoway is the parallel critique of TWFE from the `DIDmultiplegt` perspective. @callaway2022handbook is a textbook-level synthesis, and @roth2023whats is a recent review-style synthesis covering staggered DiD, event studies, sensitivity analysis, and the relationships between them. For a longer R walkthrough that this chapter is adapted from, see the companion post at <https://cmg777.github.io/post/r_did/>. ## Key takeaways **Methods:** - Staggered DiD targets the group-time average treatment effect $ATT(g, t) = \mathbb{E}[Y_{it}(1) - Y_{it}(\infty) \mid G_i = g]$ — the effect on cohort $g$ at calendar year $t$ relative to its own never-treated potential outcome — identified under parallel trends *conditional on cohort* between the treated cohort and an explicitly chosen comparison set of never-treated ($G = 0$) or not-yet-treated ($G > t$) units. - The Callaway-Sant'Anna procedure estimates each $ATT(g, t)$ from a clean 2×2 DiD, then aggregates the cells into the overall summary $\widehat{\tau}_{\text{CS, overall}}$ (cohort-size-weighted across cohort-specific post-treatment averages) or an event-study curve $\widehat{\tau}_{\text{CS, dyn}}(e)$ indexed by $e = t - g$; conditional on covariates the estimator splits into regression, IPW, and doubly-robust variants, with the DR version consistent if *either* the outcome model or the propensity score for cohort membership is correctly specified. - The Rambachan-Roth `HonestDiD` sensitivity analysis bounds post-treatment parallel-trends violations by tying them to the observed pre-trend — either as a smooth continuation ($\Delta^{SD}(M)$, capping second differences) or as at most $\bar M$ times the largest pre-treatment violation ($\Delta^{RM}(\bar M)$) — and reports the *breakdown* $\bar M$ at which the robust CI first contains zero as a fragility metric, not a pass/fail verdict. **Lessons:** - Differences-in-differences (DiD) compares the pre-to-post change in a treated group to the pre-to-post change in a control group; two-way fixed-effects (TWFE) regression $Y_{it} = \alpha_i + \gamma_t + \beta\,\text{post}_{it} + \varepsilon_{it}$ is the natural generalization to many units and many periods, but it is not a clean ATT estimator when treatment is staggered. - Under staggered adoption with heterogeneous, time-varying effects, TWFE silently uses already-treated units as controls for later-treated units — the *forbidden comparison* of Goodman-Bacon — and some implicit weights can be strictly negative (de Chaisemartin-D'Haultfœuille), so $\hat\beta$ is not even a convex average of the underlying $ATT(g, t)$s; on this minimum-wage panel TWFE returns `r sprintf("%+.3f", twfe_coef)` while the Callaway-Sant'Anna overall ATT is `r sprintf("%+.3f", cs_overall)`, a `r sprintf("%.0f", gap_pct)`-percent gap that *is* the contamination problem made concrete. - Always inspect the pre-treatment $ATT(g, t)$ cells before aggregating — here cohort 2006 carries $\widehat{ATT}(2006, 2003) \approx$ `r sprintf("%+.3f", att_2006_2003)` ($t \approx$ `r sprintf("%.1f", t_2006_2003)`), a pre-trend roughly as large in magnitude as the on-impact effect, which is precisely the warning sign that motivates running the formal sensitivity analysis rather than reading the headline number at face value. - The Rambachan-Roth idea, in plain language: *if* the worst unobserved post-treatment violation of parallel trends is no larger than $\bar M$ times the worst pre-treatment violation you can see in the data, *then* the true effect lies in a wider, "honest" confidence interval — and the breakdown $\bar M$ tells you how much pre-trend-like noise the conclusion can absorb before it flips. **Caveats:** - Every estimator in the chapter — TWFE, $ATT(g, t)$, conditional DiD, even the HonestDiD bounds — still leans on a parallel-trends-style identifying assumption; the on-impact effect here is *fragile*, with a relative-magnitudes breakdown $\bar M \le$ `r sprintf("%.2f", breakdown_mbar)` on the coarse grid, so under mild parallel-trends violations the on-impact conclusion does not survive (though the cumulative `r sprintf("%+.3f", dyn3$est)` effect by $e = +3$ is more robust). Chapter 9 relaxes parallel trends in a different way — by modelling $Y_{it}(\infty)$ as a factor structure. - Cohort sizes here are modest (especially cohort 2006), the panel is short (2003–2007), and the never-treated pool is the comparison group of choice — switching to not-yet-treated controls trades a larger comparison pool for the stronger assumption that future-treated counties are valid controls for already-treated ones in the interim; the choice should be defended, not defaulted to. ## Exercises These exercises drill on the chapter's four central design choices: the comparison-group definition, the aggregation target, the conditional-trends covariate set, and the sensitivity-analysis bound. All reuse `data2`, `attgt`, `attO`, `attes`, and `cs_dr` from the setup chunks above. ### Exercise 1: Not-yet-treated control group Re-estimate the overall ATT using `control_group = "notyettreated"` instead of `"nevertreated"`. The not-yet-treated cohort 2006 enters as a control for cohort 2004 in 2004–2005, then exits when it becomes treated. Compare the overall ATT and its SE to the chapter's never-treated baseline. Why does the SE usually shrink, and what assumption are you now relying on? ::: {.callout-tip collapse="true"} ## Solution ```{r} #| label: ex-ch10-01 attgt_nyt <- did::att_gt(yname = "lemp", idname = "id", gname = "G", tname = "year", data = data2, control_group = "notyettreated", base_period = "universal") attO_nyt <- did::aggte(attgt_nyt, type = "group") tibble( control_group = c("nevertreated (chapter)", "notyettreated"), estimate = c(attO$overall.att, attO_nyt$overall.att), se = c(attO$overall.se, attO_nyt$overall.se) ) |> gt_pretty(decimals = 4) ``` The point estimate moves only slightly; the SE shrinks because the comparison pool is larger — cohort 2006 contributes (pre-treatment) variance to the cohort-2004 contrast, in addition to the never-treated counties. The trade-off is the identification assumption: not-yet-treated counties must be on parallel pre-trends with currently-treated counties, *not just* with never-treated counties. In a panel where cohort 2006 was already drifting downward before 2006 (as it was here), the not-yet-treated estimator buys you precision but exposes you to the cohort-2006 pre-trend the chapter flagged. ::: ### Exercise 2: Calendar-time aggregation Use `did::aggte(attgt, type = "calendar")` to aggregate $ATT(g, t)$ by *calendar year* rather than by event time. Which calendar year shows the largest treatment effect, and what substantive story does that suggest? ::: {.callout-tip collapse="true"} ## Solution ```{r} #| label: ex-ch10-02 attcal <- did::aggte(attgt, type = "calendar") tibble( year = attcal$egt, att = attcal$att.egt, se = attcal$se.egt, lower = attcal$att.egt - 1.96 * attcal$se.egt, upper = attcal$att.egt + 1.96 * attcal$se.egt ) |> gt_pretty(decimals = 4) ``` The most negative calendar-year effect is in 2007 — the only year both cohorts 2004 and 2006 are post-treatment, so the calendar aggregate combines a more-mature cohort-2004 effect with the on-impact cohort-2006 effect. The event-study aggregation in the chapter showed the same accumulation in *event time*; the calendar aggregation showed it in *wall-clock time*. Each is the right answer to a different policy question — "how big was the average post effect on a typical treated county?" (event time) vs "how big was the effect in year 2007 in this panel?" (calendar). ::: ### Exercise 3: Drop one covariate from the doubly-robust estimator Refit the doubly-robust estimator with `xformla = ~lpop` only — dropping log average pay. Compare the overall ATT and SE to the chapter's two-covariate doubly-robust estimate, and discuss whether the covariate adjustment is doing real work or just adding noise. ::: {.callout-tip collapse="true"} ## Solution ```{r} #| label: ex-ch10-03 #| message: false #| warning: false cs_dr_one <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G", xformla = ~lpop, control_group = "nevertreated", base_period = "universal", est_method = "dr", data = data2) attO_dr_one <- did::aggte(cs_dr_one, type = "group") attO_dr_two <- did::aggte(cs_dr, type = "group") tibble( spec = c("DR + lpop + lavg_pay (chapter)", "DR + lpop only"), estimate = c(attO_dr_two$overall.att, attO_dr_one$overall.att), se = c(attO_dr_two$overall.se, attO_dr_one$overall.se) ) |> gt_pretty(decimals = 4) ``` The estimate and SE move only modestly. Log average pay was carrying *some* conditional-trend signal — dropping it shifts the doubly-robust estimate toward the unconditional one — but the two-covariate adjustment was not load-bearing for the headline. The doubly-robust estimator is most worth running when at least one covariate has a credible mechanism for shaping trends; the SE comparison is the easiest way to see whether including it costs you precision without buying credibility. ::: ### Exercise 4: HonestDiD relative-magnitudes breakdown sweep The chapter reported the coarse-grid breakdown $\bar M \le$ `r sprintf("%.2f", breakdown_mbar)`. Sweep $\bar M \in \{0.0, 0.05, 0.10, \dots, 1.5\}$ to pin the breakdown point more precisely — the smallest $\bar M$ at which the robust 95% CI first contains zero — and contrast it with a parallel sweep on the $e = +3$ horizon (where the trajectory has matured). ::: {.callout-tip collapse="true"} ## Solution ```{r} #| label: ex-ch10-04 #| message: false #| warning: false sweep_mbar <- function(e_target, grid = seq(0, 1.5, by = 0.05)) { hd <- honest_did(es = attes, e = e_target, type = "relative_magnitude", method = "C-LF", Mbarvec = grid) contains <- hd |> mutate(contains_zero = lb <= 0 & 0 <= ub) |> filter(contains_zero) if (nrow(contains) == 0) NA_real_ else min(contains$Mbar) } tibble( horizon = c("On impact (e = 0)", "Three years post (e = 3)"), breakdown_Mbar = c(sweep_mbar(0), sweep_mbar(3)) ) |> gt_pretty(decimals = 3) ``` On impact, the robust CI first contains zero at a very small $\bar M$ — meaning a tiny fraction of the largest observed pre-trend is enough to overturn statistical significance. At $e = +3$ the breakdown $\bar M$ is much larger, because the underlying point estimate is several times bigger than on impact. This is the precise version of the chapter's qualitative warning: the on-impact conclusion is fragile, while the cumulative three-year effect carries a comfortable safety margin. ::: ### Exercise 5 (stretch): Cohort-2004-only 2×2 DiD Restrict to cohort-2004 counties and never-treated counties only, and fit a 2×2 DiD with `fixest::feols` using a treat × post interaction. The interaction coefficient should reconcile (up to standard-error differences) with $\widehat{ATT}(g=2004, t=2004)$ from the chapter's `attgt$att` vector. This is a sanity-check that the Callaway-Sant'Anna primitives are the same 2×2 DiDs we already know from chapter 3 — generalised, not replaced. ::: {.callout-tip collapse="true"} ## Solution ```{r} #| label: ex-ch10-05 data_2x2 <- data2 |> filter(G %in% c(0, 2004)) |> mutate(treat = as.integer(G == 2004), post = as.integer(year >= 2004)) fit_2x2 <- fixest::feols(lemp ~ treat:post | id + year, data = data_2x2, cluster = "id") attgt_2004_2004 <- attgt$att[attgt$group == 2004 & attgt$t == 2004] tibble( estimator = c("Hand 2x2 DiD (treat x post)", "Callaway-Sant'Anna ATT(2004, 2004)"), estimate = c(coef(fit_2x2)[["treat:post"]], attgt_2004_2004) ) |> gt_pretty(decimals = 4) ``` The 2×2 DiD coefficient is within a small numerical tolerance of $\widehat{ATT}(2004, 2004)$. The Callaway-Sant'Anna machinery is *not* a different statistical idea than chapter 3's DiD — it is the same 2×2 contrast computed for every $(g, t)$ cell using only the cohort-$g$ and never-treated rows, and then aggregated in a way that protects against the staggered-adoption contamination problem. :::