10  Staggered Differences-in-Differences

10.1 Learning objectives

  1. Estimate group-time average treatment effects with did::att_gt() on a staggered-adoption panel. This is the workhorse of modern DiD — the estimator that replaced two-way fixed effects after the negative-weight critique.
  2. Aggregate group-time effects into event-study dynamics with aggte(type = "dynamic"). The aggregation step is what gives the reader a single defensible event-study plot without the contamination problems of a TWFE event study.
  3. Condition on covariates using regression, IPW, and doubly-robust estimators, and default to the doubly-robust choice. Doubly-robust estimation is consistent if either the outcome model or the propensity-score model is correctly specified, which is why it is the safe default — knowing all three lets the reader defend that choice against a referee.
  4. Run Rambachan-Roth relative-magnitudes sensitivity analysis with HonestDiD and report the smallest pre-trend violation that would overturn the post-treatment conclusion. This converts an unverifiable parallel-trends assumption into a falsifiable claim about how robust the headline result is.

Part II begins here. Chapters 2–9 held the dataset fixed (Proposition 99) and varied the estimator. From this chapter forward we switch datasets too: a 1,745-county minimum-wage panel with staggered adoption replaces California’s single-shock setting, and the toolkit shifts to estimators built for that structure.

10.2 When TWFE breaks under staggered adoption

Chapter 3 ran a textbook 2×2 DiD on Proposition 99: California treated in 1989, Nevada as the single hand-picked control. That design squeezed the entire causal contrast into one interaction coefficient — \(\widehat{\tau}_{\text{DiD}} \approx -5.7\) packs per capita, statistically indistinguishable from zero. The diagnosis there was twofold: Nevada inherits California’s secular forces (so its own post-1988 decline soaks up most of the policy signal), and parallel trends was rejected on the pre-window itself.

That design has a deeper structural limit. Real policy data rarely has just one treated unit and one clean control unit, both switching at the same instant. States and counties adopt at different times — some in 2004, others in 2006, still others never. The natural extension of the chapter-3 regression to that setting is the two-way fixed-effects (TWFE) specification:

\[Y_{it} = \alpha_i + \gamma_t + \beta \cdot \text{post}_{it} + \varepsilon_{it}.\]

For two decades this was the default. We now know it can be badly biased under staggered adoption when treatment effects vary across cohorts or over time. The Goodman-Bacon (2021) decomposition shows that \(\hat\beta\) is a weighted average of all possible 2×2 DiDs, including ones in which already-treated units serve as controls for later-treated units: the forbidden comparison. de Chaisemartin & D’Haultfœuille (2020) further prove that under treatment-effect heterogeneity some of the implicit weights can be strictly negative, so \(\hat\beta\) need not even be a convex average of the underlying \(ATT(g, t)\)s. Either way, when treatment effects grow over time (the textbook policy story), neither the magnitude nor the sign of \(\hat\beta\) is a reliable summary.
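A noiseless toy panel makes the forbidden comparison concrete. The sketch below is a hypothetical two-cohort example, not the chapter's data: one early adopter, one late adopter, no never-treated unit, and an effect that grows by one unit per period of exposure. All names (`panel`, `tau`, `beta_twfe`) are illustrative.

```r
# Hypothetical noiseless panel: two units, five periods, staggered adoption.
panel <- expand.grid(id = c("early", "late"), t = 1:5,
                     stringsAsFactors = FALSE)
panel$g    <- ifelse(panel$id == "early", 2, 4)   # first treated period
panel$post <- as.integer(panel$t >= panel$g)

# True effect grows with exposure: 1 on impact, 2 one period later, and so on.
panel$tau <- ifelse(panel$post == 1, panel$t - panel$g + 1, 0)

# Untreated outcome is normalised to zero, so the outcome is the effect itself.
panel$y <- panel$tau

# TWFE regression with unit and period dummies (base R only).
twfe      <- lm(y ~ post + factor(id) + factor(t), data = panel)
beta_twfe <- unname(coef(twfe)["post"])

true_att <- mean(panel$tau[panel$post == 1])   # average effect over treated cells
min_cell <- min(panel$tau[panel$post == 1])    # smallest treated-cell effect

round(c(twfe = beta_twfe, true_avg_att = true_att, smallest_cell = min_cell), 3)
```

Every treated cell has an effect of at least 1, yet the TWFE coefficient falls well below 1. In periods 4 and 5 the early adopter acts as the control for the late adopter, and its still-growing effect is subtracted from the contrast.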

This chapter walks through the modern toolkit for repairing that damage. The methods build on Callaway & Sant’Anna (2021): estimate group-time average treatment effects \(ATT(g, t)\) directly, then aggregate them with weights you can defend. We will use the shorthand \(\widehat{\tau}_{\text{CS}}\) for the Callaway-Sant’Anna estimator, distinguishing \(\widehat{\tau}_{\text{CS, overall}}\) (the cohort-size-weighted summary) from \(\widehat{\tau}_{\text{CS, dyn}}(e)\) (the event-study aggregation at event time \(e = t - g\)). Along the way we look at two companion ideas: the doubly-robust DiD estimator from Callaway & Sant’Anna (2021), and the Rambachan-Roth sensitivity analysis that quantifies how much parallel-trends violation it would take to overturn the conclusion (Rambachan & Roth, 2023). The Sun-Abraham interaction-weighted event study (Sun & Abraham, 2021) is a closely related estimator built on a different regression-based decomposition; see Roth et al. (2023) for a side-by-side review.

The dataset is no longer Proposition 99. Staggered DiD requires variation in treatment timing, which a single-treated-state panel cannot provide. We switch to the Callaway-Sant’Anna minimum-wage panel: 1,745 US counties × 2003–2007, with cohorts \(G \in \{0, 2004, 2006\}\) (after dropping a small late-2007 cohort; see Setup) indexing the year the county’s state first raised its minimum wage above the federal $5.15/h floor. The outcome is log teen employment.

10.3 Setup and data

Code: Load packages, source helpers, and set the ggplot theme.
library(tidyverse)
library(did)
library(fixest)
library(HonestDiD)
source("R/table_helpers.R")
source("R/honest_did.R")

set.seed(42)  # did's default SEs come from a multiplier bootstrap; the seed keeps them reproducible

knitr::opts_chunk$set(dev.args = list(bg = "transparent"))

theme_set(
  theme_minimal(base_size = 12) +
    theme(
      plot.background  = element_rect(fill = "transparent", color = NA),
      panel.background = element_rect(fill = "transparent", color = NA),
      panel.grid.major = element_line(color = "#94a3b8", linewidth = 0.25),
      panel.grid.minor = element_line(color = "#94a3b8", linewidth = 0.15),
      text             = element_text(color = "#94a3b8"),
      axis.text        = element_text(color = "#94a3b8"),
      strip.text       = element_text(color = "#94a3b8"),
      legend.text      = element_text(color = "#94a3b8")
    )
)

The dataset ships as data/cs_minwage.rds in the chapter bundle. We follow the source-post convention of restricting to cohorts \(G \in \{0, 2004, 2006, 2007\}\), dropping the Northeast region (region == "1") for comparability, and then carving out a clean working panel without the late-2007 cohort and starting in 2003.

Code: Load the minimum-wage panel and build the working sample.
mw_raw <- readRDS("data/cs_minwage.rds") |> as_tibble()

# Step 1: drop Northeast and keep cohorts of interest.
mw <- mw_raw |>
  filter(G %in% c(0, 2004, 2006, 2007), region != "1")

# Step 2: working sample for the main analysis.
data2 <- mw |>
  filter(G != 2007, year >= 2003)

dim(data2)
[1] 8725   20

The working panel has 8725 rows = 1745 counties × 5 years, balanced across the 2003–2007 window.

Code: Tabulate county counts by treatment cohort.
data2 |>
  filter(year == 2003) |>
  count(G, name = "counties") |>
  rename(`Treatment cohort (G)` = G) |>
  gt_pretty()
Table 10.1: Cohort sizes in the working sample. G = 0 is the never-treated control pool; cohorts 2004 and 2006 are the staggered treated groups.
Treatment cohort (G) counties
0 1,417
2,004 102
2,006 226

10.4 The TWFE baseline

The natural first move is the TWFE regression. The variable post in the dataset is 1 in periods where the county’s state has already raised its minimum wage (i.e., \(t \ge g\) and \(g \ne 0\)), and 0 otherwise.

Code: Estimate the TWFE baseline regression with clustered SEs.
twfe_res <- fixest::feols(lemp ~ post | id + year,
                          data = data2, cluster = "id")
ms_pretty(list("TWFE (county + year FE)" = twfe_res),
          coef_map = c("post" = "Post (any cohort)"),
          notes    = "SEs clustered at the county level.")
Table 10.2: Two-way fixed-effects regression. The point estimate suggests minimum-wage increases cut log teen employment by a few percent — but see the discussion below.
TWFE (county + year FE)
Post (any cohort) -0.038***
(0.008)
Num.Obs. 8725
R2 0.994
R2 Within 0.004
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
SEs clustered at the county level.

The TWFE coefficient is roughly -0.038. Read literally, the policy cut teen employment by about 3.8 percent. The next section re-estimates this effect with an aggregator that is unbiased under staggered adoption — and gets a substantially larger magnitude. The gap between the two is the contamination problem made concrete.

10.5 Group-time ATTs: the Callaway-Sant’Anna approach

The fix is to estimate the primitive objects directly. For each cohort \(g\) (a year a group of counties was first treated) and each calendar year \(t\), define

\[ATT(g, t) = \mathbb{E}\!\left[Y_{it}(1) - Y_{it}(\infty) \mid G_i = g\right],\]

the average effect on cohort \(g\) in year \(t\) relative to its own never-treated potential outcome \(Y_{it}(\infty)\). We use the \(\infty\) argument that Callaway & Sant’Anna (2021) adopt to emphasise never treated (as opposed to not-yet treated); it is the same object as Part I’s \(Y_{it}(0)\), just notation that survives the distinction between cohorts.

Identification relies on parallel trends: in the absence of treatment, the average change in \(Y_{it}(\infty)\) would have been the same for cohort \(g\) as for the never-treated pool. Under that assumption, did::att_gt() estimates each \(ATT(g, t)\) from a clean 2×2 DiD using only cohort \(g\) and an appropriate comparison group (here, the never-treated \(G = 0\)), so no contamination from already-treated units sneaks in.
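In symbols, the assumption and the resulting 2×2 contrast can be sketched as follows. The display uses the chapter's notation and assumes no anticipation and the never-treated comparison group; it summarises the setup and is not quoted from Callaway & Sant’Anna (2021). For \(t \ge g\), parallel trends says

\[\mathbb{E}\!\left[Y_{it}(\infty) - Y_{i,g-1}(\infty) \mid G_i = g\right] = \mathbb{E}\!\left[Y_{it}(\infty) - Y_{i,g-1}(\infty) \mid G_i = 0\right],\]

and, because cohort \(g\) is still untreated in period \(g - 1\), the unobserved counterfactual change can be replaced by the observed change among the never-treated:

\[ATT(g, t) = \mathbb{E}\!\left[Y_{it} - Y_{i,g-1} \mid G_i = g\right] - \mathbb{E}\!\left[Y_{it} - Y_{i,g-1} \mid G_i = 0\right].\]

Each \(ATT(g, t)\) is therefore a long difference from the base period \(g - 1\), which is why that period carries no estimate of its own under a universal base period.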

Code: Estimate group-time ATTs with the did package.
attgt <- did::att_gt(yname = "lemp", idname = "id", gname = "G",
                     tname = "year", data = data2,
                     control_group = "nevertreated",
                     base_period = "universal")

attgt_df <- tibble(
  Cohort       = attgt$group,
  Year         = attgt$t,
  `ATT(g,t)`   = attgt$att,
  SE           = as.numeric(attgt$se)
) |>
  mutate(Phase = ifelse(Year >= Cohort, "post (t ≥ g)", "pre (t < g)")) |>
  arrange(Cohort, Year)

gt_pretty(attgt_df, decimals = 4)
Table 10.3: Group-time average treatment effects \(ATT(g, t)\) for the minimum-wage panel. Cohorts 2004 and 2006, each year 2003 through 2007. Pre-treatment cells should hover near zero if parallel trends holds; post-treatment cells are the effects we want.
Cohort Year ATT(g,t) SE Phase
2,004 2,003 0 NA pre (t < g)
2,004 2,004 −0.0327 0.0196 post (t ≥ g)
2,004 2,005 −0.0683 0.0211 post (t ≥ g)
2,004 2,006 −0.1234 0.0204 post (t ≥ g)
2,004 2,007 −0.1311 0.0221 post (t ≥ g)
2,006 2,003 −0.0341 0.0125 pre (t < g)
2,006 2,004 −0.0167 0.0087 pre (t < g)
2,006 2,005 0 NA pre (t < g)
2,006 2,006 −0.0194 0.0086 post (t ≥ g)
2,006 2,007 −0.0661 0.009 post (t ≥ g)

Cohort 2006 carries a visible pre-trend: \(\widehat{ATT}(2006, 2003) \approx\) -0.034 with SE \(\approx\) 0.012 (so \(t \approx\) -2.7). The 2006 counties were already on a downward log-employment path relative to the never-treated pool three years before their state raised the minimum wage. That pre-treatment gap is larger in magnitude than the on-impact effect of about -0.024 estimated below, which is exactly what motivates the formal sensitivity analysis at the end of the chapter. Naming the pre-trend now keeps the rest of the story honest.

10.5.1 Aggregation: overall ATT and event study

The eight estimated cells in Table 10.3 (its other two rows are reference periods normalised to zero) are the primitives. They are not the headline. Aggregating them produces summaries that remain valid under staggered adoption.

The overall ATT first averages each cohort’s post-treatment \(ATT(g, t)\) values into a single cohort-specific summary, then averages those summaries across cohorts weighted by cohort size. It answers: across treated cohorts, what is the average post-treatment effect on a typical treated county? This is the Callaway & Sant’Anna (2021) recommended summary; it differs from a cohort-size-weighted mean over all post-treatment cells (type = "simple") when cohorts have different post-treatment horizons, as they do here (cohort 2004 has 4 post cells, cohort 2006 has 2), because the cell-level mean gives longer-exposed cohorts more weight.

Code: Aggregate group-time ATTs into the overall ATT summary.
attO <- did::aggte(attgt, type = "group")

tibble(
  Estimator             = "Callaway-Sant'Anna overall ATT",
  `Estimate`            = attO$overall.att,
  `SE`                  = attO$overall.se,
  `CI lower`            = attO$overall.att - 1.96 * attO$overall.se,
  `CI upper`            = attO$overall.att + 1.96 * attO$overall.se
) |> gt_pretty(decimals = 4)
Table 10.4: Callaway-Sant’Anna overall ATT (sample-weighted aggregation across cohorts). This is the staggered-DiD analogue of chapter 3’s single DiD coefficient.
Estimator Estimate SE CI lower CI upper
Callaway-Sant'Anna overall ATT −0.0571 0.0084 −0.0736 −0.0406

The Callaway-Sant’Anna overall estimate \(\widehat{\tau}_{\text{CS, overall}} \approx\) -0.057 is roughly 50 percent larger in magnitude than the TWFE coefficient (-0.038). The two estimands answer different questions, so the gap is not a pure measure of bias, but its direction and size are what the contamination problem predicts when effects grow over time.
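The two-step group aggregation can be checked by hand. The sketch below copies the post-treatment cells from Table 10.3 and the cohort sizes from Table 10.1 as hard-coded inputs; it does not call did::aggte(), and the cell-level weighting in the last step is an assumption made for contrast and is not taken from the chapter's output.

```r
# Post-treatment ATT(g,t) cells copied from Table 10.3 (rounded to 4 decimals).
cells <- data.frame(
  g   = c(2004, 2004, 2004, 2004, 2006, 2006),
  t   = c(2004, 2005, 2006, 2007, 2006, 2007),
  att = c(-0.0327, -0.0683, -0.1234, -0.1311, -0.0194, -0.0661)
)

# Cohort sizes copied from Table 10.1.
n_g <- c("2004" = 102, "2006" = 226)

# Step 1: average each cohort's post-treatment cells.
cohort_att <- tapply(cells$att, cells$g, mean)

# Step 2: weight the cohort averages by cohort size.
overall_group <- weighted.mean(as.numeric(cohort_att),
                               w = n_g[names(cohort_att)])

# Contrast (assumed weighting): weight every cell by its cohort's size,
# which gives cohort 2004's four cells more total weight.
cell_weighted <- weighted.mean(cells$att, w = n_g[as.character(cells$g)])

round(c(group = overall_group, cell_weighted = cell_weighted), 4)
```

The group aggregation reproduces the point estimate in Table 10.4 up to rounding of the inputs. The cell-weighted mean is more negative, because cohort 2004 contributes four cells, including its largest late-horizon effects.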

The event-study aggregation averages \(ATT(g, t)\) within each event time \(e = t - g\), giving a curve of dynamic effects since treatment. We write the result as \(\widehat{\tau}_{\text{CS, dyn}}(e)\).

Code: Aggregate ATTs into the dynamic event-study table.
attes <- did::aggte(attgt, type = "dynamic")

tibble(
  `Event time` = attes$egt,
  `ATT(e)`     = attes$att.egt,
  SE           = attes$se.egt,
  `CI lower`   = attes$att.egt - 1.96 * attes$se.egt,
  `CI upper`   = attes$att.egt + 1.96 * attes$se.egt
) |> gt_pretty(decimals = 4)
Table 10.5: Callaway-Sant’Anna event-study aggregation. Event time -1 is the omitted reference period under the universal base.
Event time ATT(e) SE CI lower CI upper
−3 −0.0341 0.0122 −0.058 −0.0102
−2 −0.0167 0.0081 −0.0326 −0.0008
−1 0 NA NA NA
0 −0.0235 0.0093 −0.0418 −0.0052
1 −0.0668 0.0095 −0.0853 −0.0482
2 −0.1234 0.019 −0.1606 −0.0861
3 −0.1311 0.0221 −0.1744 −0.0877

The on-impact effect \(\widehat{\tau}_{\text{CS, dyn}}(0) \approx\) -0.024 is modest, but the trajectory deepens quickly: \(\widehat{\tau}_{\text{CS, dyn}}(1) \approx\) -0.067, \(\widehat{\tau}_{\text{CS, dyn}}(2) \approx\) -0.123, and \(\widehat{\tau}_{\text{CS, dyn}}(3) \approx\) -0.131 — three years after a state first raised its minimum wage, treated counties’ log teen employment is roughly 13.1 log points below the never-treated counterfactual.
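The event-study rows can be rebuilt from the same primitives. The sketch below hard-codes the post-treatment cells from Table 10.3 and the cohort sizes from Table 10.1, and averages within event time using cohort-size weights. It also converts log points to approximate percent changes via \(100 \times (e^{x} - 1)\). The object names are illustrative.

```r
# Post-treatment ATT(g,t) cells from Table 10.3 and cohort sizes from Table 10.1.
cells <- data.frame(
  g   = c(2004, 2004, 2004, 2004, 2006, 2006),
  t   = c(2004, 2005, 2006, 2007, 2006, 2007),
  att = c(-0.0327, -0.0683, -0.1234, -0.1311, -0.0194, -0.0661)
)
n_g <- c("2004" = 102, "2006" = 226)

cells$e <- cells$t - cells$g                     # event time e = t - g
cells$w <- unname(n_g[as.character(cells$g)])    # cohort-size weight

# Weighted average of the cells observed at each event time.
dyn <- sapply(split(cells, cells$e),
              function(d) weighted.mean(d$att, d$w))

# Log-point effects expressed as percent changes.
pct <- 100 * (exp(dyn) - 1)

round(rbind(att = dyn, percent = pct), 4)
```

Event times 0 and 1 mix both cohorts, while event times 2 and 3 are identified by cohort 2004 alone, so the tail of the event study rests on 102 counties. In percent terms the effect at \(e = 3\) is a little over 12 percent, slightly smaller than the 13.1 log-point figure.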

Code: Plot the event-study estimates with 95 percent CIs.
cs_es_df <- tibble(
  egt   = attes$egt,
  est   = attes$att.egt,
  se    = attes$se.egt,
  phase = ifelse(attes$egt < 0, "Pre", "Post")
)

ggplot(cs_es_df, aes(x = egt, y = est, color = phase)) +
  geom_hline(yintercept = 0, color = "#94a3b8", linetype = "dashed") +
  geom_vline(xintercept = -0.5, color = "#94a3b8", linetype = "dashed") +
  geom_point(size = 2.8) +
  geom_errorbar(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
                width = 0.15) +
  scale_color_manual(values = c("Pre"  = "#e2e8f0",
                                "Post" = "#22d3ee")) +
  labs(x = "Event time (years from first minimum-wage increase)",
       y = "Effect on log teen employment", color = NULL)
Figure 10.1: Callaway-Sant’Anna event study. The on-impact effect is small; effects accumulate over event time and reach the magnitudes reported in Table 10.5.

10.8 Recap

The methods reconciled. TWFE on this dataset returned \(\hat\beta \approx\) -0.038. The Callaway-Sant’Anna overall ATT is \(\widehat{\tau}_{\text{CS, overall}} \approx\) -0.057. The doubly-robust conditional overall ATT is -0.065. The event-study trajectory runs from \(\widehat{\tau}_{\text{CS, dyn}}(0) \approx\) -0.024 on impact to \(\widehat{\tau}_{\text{CS, dyn}}(3) \approx\) -0.131 by event-time \(+3\). The HonestDiD sensitivity analysis shows the result is fragile: the relative-magnitudes breakdown \(\bar M\) for the on-impact effect is at most 0.50 on the coarse grid, reflecting the visible cohort-2006 pre-trend. The estimators agree on sign, though not on magnitude: minimum-wage increases reduced teen employment in these counties, and the effect grew over time. But parallel trends does not hold cleanly, and the on-impact effect would not survive a post-treatment violation half the size of the largest pre-treatment one.

The gap between the TWFE estimate and the modern aggregators is the contamination problem of Goodman-Bacon (2021) made concrete: TWFE silently uses already-treated units as controls for later-treated units, and treatment-effect heterogeneity over time leaks into the coefficient with unintended signs. The Callaway-Sant’Anna aggregator avoids that by estimating the \(ATT(g, t)\) primitives directly and weighting them in a defensible way.

Every method in this chapter — TWFE, \(ATT(g, t)\), conditional DiD, even HonestDiD’s bounds — leans on parallel trends as the identifying assumption. The next chapter relaxes that assumption by modelling \(Y_{it}(\infty)\) with an interactive fixed-effects factor structure:

\[Y_{it}(\infty) = \alpha_i + \xi_t + \lambda_i' f_t + \varepsilon_{it}.\]

The \(\lambda_i' f_t\) term absorbs time-varying unobserved heterogeneity that no county or year fixed effect can net out. Chapter 11 fits this model with matrix completion and interactive fixed effects on the same minimum-wage panel and provides a second, assumption-distinct read on the same substantive question.

10.9 Common pitfall

Running TWFE on staggered data and reporting the coefficient as if it were a clean ATT. The bias is mechanical — already-treated units get used as controls for later-treated units, and treatment-effect heterogeneity over time then leaks into the coefficient with unintended signs. What to do instead. Estimate the \(\{ATT(g, t)\}\) primitives directly with did::att_gt(), look at the cells, and only then aggregate with aggte() using a target that matches the question you actually want to answer. If you must run TWFE for a referee, also report the Callaway-Sant’Anna overall ATT for comparison so the magnitude of the contamination is visible.

10.10 Further reading

The Callaway-Sant’Anna framework, the Goodman-Bacon decomposition, and the Sun-Abraham interaction-weighted estimator are the three modern reference points for staggered DiD (Callaway & Sant’Anna, 2021; Goodman-Bacon, 2021; Sun & Abraham, 2021). The did package vignettes (https://bcallaway11.github.io/did/) are the canonical implementation reference. For sensitivity analysis, the Rambachan & Roth (2023) paper plus the HonestDiD package documentation cover both the smoothness and relative-magnitude bounds. The de Chaisemartin & D’Haultfœuille (2020) paper is the parallel critique of TWFE from the DIDmultiplegt perspective. Callaway (2022) is a textbook-level synthesis, and Roth et al. (2023) is a recent review-style synthesis covering staggered DiD, event studies, sensitivity analysis, and the relationships between them.

For a longer R walkthrough that this chapter is adapted from, see the companion post at https://cmg777.github.io/post/r_did/.

10.11 Key takeaways

Methods:

  • Staggered DiD targets the group-time average treatment effect \(ATT(g, t) = \mathbb{E}[Y_{it}(1) - Y_{it}(\infty) \mid G_i = g]\) — the effect on cohort \(g\) at calendar year \(t\) relative to its own never-treated potential outcome — identified under parallel trends conditional on cohort between the treated cohort and an explicitly chosen comparison set of never-treated (\(G = 0\)) or not-yet-treated (\(G > t\)) units.
  • The Callaway-Sant’Anna procedure estimates each \(ATT(g, t)\) from a clean 2×2 DiD, then aggregates the cells into the overall summary \(\widehat{\tau}_{\text{CS, overall}}\) (cohort-size-weighted across cohort-specific post-treatment averages) or an event-study curve \(\widehat{\tau}_{\text{CS, dyn}}(e)\) indexed by \(e = t - g\); conditional on covariates the estimator splits into regression, IPW, and doubly-robust variants, with the DR version consistent if either the outcome model or the propensity score for cohort membership is correctly specified.
  • The Rambachan-Roth HonestDiD sensitivity analysis bounds post-treatment parallel-trends violations by tying them to the observed pre-trend — either as a smooth continuation (\(\Delta^{SD}(M)\), capping second differences) or as at most \(\bar M\) times the largest pre-treatment violation (\(\Delta^{RM}(\bar M)\)) — and reports the breakdown \(\bar M\) at which the robust CI first contains zero as a fragility metric, not a pass/fail verdict.

Lessons:

  • Differences-in-differences (DiD) compares the pre-to-post change in a treated group to the pre-to-post change in a control group; two-way fixed-effects (TWFE) regression \(Y_{it} = \alpha_i + \gamma_t + \beta\,\text{post}_{it} + \varepsilon_{it}\) is the natural generalization to many units and many periods, but it is not a clean ATT estimator when treatment is staggered.
  • Under staggered adoption with heterogeneous, time-varying effects, TWFE silently uses already-treated units as controls for later-treated units — the forbidden comparison of Goodman-Bacon — and some implicit weights can be strictly negative (de Chaisemartin-D’Haultfœuille), so \(\hat\beta\) is not even a convex average of the underlying \(ATT(g, t)\)s; on this minimum-wage panel TWFE returns -0.038 while the Callaway-Sant’Anna overall ATT is -0.057, a 50-percent gap that is the contamination problem made concrete.
  • Always inspect the pre-treatment \(ATT(g, t)\) cells before aggregating — here cohort 2006 carries \(\widehat{ATT}(2006, 2003) \approx\) -0.034 (\(t \approx\) -2.7), a pre-trend roughly as large in magnitude as the on-impact effect, which is precisely the warning sign that motivates running the formal sensitivity analysis rather than reading the headline number at face value.
  • The Rambachan-Roth idea, in plain language: if the worst unobserved post-treatment violation of parallel trends is no larger than \(\bar M\) times the worst pre-treatment violation you can see in the data, then the true effect lies in a wider, “honest” confidence interval — and the breakdown \(\bar M\) tells you how much pre-trend-like noise the conclusion can absorb before it flips.

Caveats:

  • Every estimator in the chapter (TWFE, \(ATT(g, t)\), conditional DiD, even the HonestDiD bounds) still leans on a parallel-trends-style identifying assumption; the on-impact effect here is fragile, with a relative-magnitudes breakdown \(\bar M \le\) 0.50 on the coarse grid, so under mild parallel-trends violations the on-impact conclusion does not survive. The larger -0.131 effect at \(e = +3\) is not automatically safer: the Exercise 4 sweep returns the same breakdown there. Chapter 11 relaxes parallel trends in a different way, by modelling \(Y_{it}(\infty)\) as a factor structure.
  • Cohort sizes here are modest (especially cohort 2004, with 102 counties), the panel is short (2003–2007), and the never-treated pool is the comparison group of choice; switching to not-yet-treated controls trades a larger comparison pool for the additional assumption that later-treated counties are valid controls for already-treated ones in the interim. The choice should be defended, not defaulted to.

10.12 Exercises

These exercises drill on the chapter’s four central design choices: the comparison-group definition, the aggregation target, the conditional-trends covariate set, and the sensitivity-analysis bound. All reuse data2, attgt, attO, attes, and cs_dr from the setup chunks above.

10.12.1 Exercise 1: Not-yet-treated control group

Re-estimate the overall ATT using control_group = "notyettreated" instead of "nevertreated". The not-yet-treated cohort 2006 enters as a control for cohort 2004 in 2004–2005, then exits when it becomes treated. Compare the overall ATT and its SE to the chapter’s never-treated baseline. Why does the SE usually shrink, and what assumption are you now relying on?

Code
attgt_nyt <- did::att_gt(yname = "lemp", idname = "id", gname = "G",
                         tname = "year", data = data2,
                         control_group = "notyettreated",
                         base_period   = "universal")
attO_nyt <- did::aggte(attgt_nyt, type = "group")

tibble(
  control_group = c("nevertreated (chapter)", "notyettreated"),
  estimate      = c(attO$overall.att, attO_nyt$overall.att),
  se            = c(attO$overall.se,  attO_nyt$overall.se)
) |>
  gt_pretty(decimals = 4)
control_group estimate se
nevertreated (chapter) −0.0571 0.0084
notyettreated −0.0576 0.0086

The point estimate moves only slightly. In general the SE tends to shrink because the comparison pool is larger: cohort 2006 joins the never-treated counties as a control for the cohort-2004 contrasts in 2004 and 2005. Here the gain does not materialise (0.0086 versus 0.0084, a gap small enough to be bootstrap noise), because 226 extra counties add little to a never-treated pool of 1,417 and affect only two of the six post-treatment cells. The trade-off is the identification assumption: not-yet-treated counties must share parallel trends with currently-treated counties, not just never-treated counties. In a panel where cohort 2006 was already drifting downward before 2006 (as it was here), the not-yet-treated comparison exposes the cohort-2004 estimates to the cohort-2006 pre-trend the chapter flagged, for little precision in return.

10.12.2 Exercise 2: Calendar-time aggregation

Use did::aggte(attgt, type = "calendar") to aggregate \(ATT(g, t)\) by calendar year rather than by event time. Which calendar year shows the largest treatment effect, and what substantive story does that suggest?

Code
attcal <- did::aggte(attgt, type = "calendar")

tibble(
  year   = attcal$egt,
  att    = attcal$att.egt,
  se     = attcal$se.egt,
  lower  = attcal$att.egt - 1.96 * attcal$se.egt,
  upper  = attcal$att.egt + 1.96 * attcal$se.egt
) |>
  gt_pretty(decimals = 4)
year att se lower upper
2,004 −0.0327 0.0206 −0.0731 0.0078
2,005 −0.0683 0.0216 −0.1106 −0.026
2,006 −0.0517 0.0094 −0.0701 −0.0334
2,007 −0.0863 0.0092 −0.1043 −0.0683

The most negative calendar-year effect is in 2007. Both cohorts are post-treatment in 2006 and in 2007, but in 2006 the calendar aggregate mixes a mature cohort-2004 effect (event time 2) with the on-impact cohort-2006 effect, whereas by 2007 cohort 2004 is at event time 3 and cohort 2006 at event time 1. The event-study aggregation in the chapter showed the same accumulation in event time; the calendar aggregation shows it in wall-clock time. Each is the right answer to a different policy question: “how big is the effect \(e\) years after a county is first treated?” (event time) versus “how big was the effect in 2007 on the counties treated by then?” (calendar).
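The calendar-year rows can be traced back to individual cells. The sketch below hard-codes the post-treatment cells from Table 10.3 and the cohort sizes from Table 10.1, lists which cohort and event time feeds each calendar year, and forms the cohort-size-weighted average by year. The names are illustrative and nothing here calls did::aggte().

```r
# Post-treatment ATT(g,t) cells from Table 10.3 and cohort sizes from Table 10.1.
cells <- data.frame(
  g   = c(2004, 2004, 2004, 2004, 2006, 2006),
  t   = c(2004, 2005, 2006, 2007, 2006, 2007),
  att = c(-0.0327, -0.0683, -0.1234, -0.1311, -0.0194, -0.0661)
)
n_g <- c("2004" = 102, "2006" = 226)
cells$w <- unname(n_g[as.character(cells$g)])

# Which cohorts (and event times) are post-treatment in each calendar year?
cells$label <- paste0("g=", cells$g, " (e=", cells$t - cells$g, ")")
contributors <- tapply(cells$label, cells$t, paste, collapse = " + ")

# Cohort-size-weighted average of the cells available in each year.
cal <- sapply(split(cells, cells$t),
              function(d) weighted.mean(d$att, d$w))

data.frame(year = names(cal), contributors = unname(contributors),
           att = round(unname(cal), 4))
```

Years 2004 and 2005 are cohort 2004 alone. From 2006 on, cohort 2006 carries more than two thirds of the weight (226 of 328 counties), which pulls the 2006 aggregate toward that cohort's small on-impact effect.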

10.12.3 Exercise 3: Drop one covariate from the doubly-robust estimator

Refit the doubly-robust estimator with xformla = ~lpop only — dropping log average pay. Compare the overall ATT and SE to the chapter’s two-covariate doubly-robust estimate, and discuss whether the covariate adjustment is doing real work or just adding noise.

Code
cs_dr_one <- did::att_gt(yname="lemp", tname="year", idname="id", gname="G",
                         xformla = ~lpop,
                         control_group = "nevertreated",
                         base_period   = "universal",
                         est_method    = "dr", data = data2)
attO_dr_one <- did::aggte(cs_dr_one, type = "group")
attO_dr_two <- did::aggte(cs_dr,     type = "group")

tibble(
  spec     = c("DR + lpop + lavg_pay (chapter)", "DR + lpop only"),
  estimate = c(attO_dr_two$overall.att, attO_dr_one$overall.att),
  se       = c(attO_dr_two$overall.se,  attO_dr_one$overall.se)
) |>
  gt_pretty(decimals = 4)
spec estimate se
DR + lpop + lavg_pay (chapter) −0.0646 0.0081
DR + lpop only −0.0638 0.008

The estimate and SE move only modestly. Log average pay was carrying some conditional-trend signal — dropping it shifts the doubly-robust estimate toward the unconditional one — but the two-covariate adjustment was not load-bearing for the headline. The doubly-robust estimator is most worth running when at least one covariate has a credible mechanism for shaping trends; the SE comparison is the easiest way to see whether including it costs you precision without buying credibility.

10.12.4 Exercise 4: HonestDiD relative-magnitudes breakdown sweep

The chapter reported the coarse-grid breakdown \(\bar M \le\) 0.50. Sweep \(\bar M \in \{0.0, 0.05, 0.10, \dots, 1.5\}\) to pin the breakdown point more precisely — the smallest \(\bar M\) at which the robust 95% CI first contains zero — and contrast it with a parallel sweep on the \(e = +3\) horizon (where the trajectory has matured).

Code
sweep_mbar <- function(e_target, grid = seq(0, 1.5, by = 0.05)) {
  hd <- honest_did(es      = attes,
                   e       = e_target,
                   type    = "relative_magnitude",
                   method  = "C-LF",
                   Mbarvec = grid)
  contains <- hd |>
    mutate(contains_zero = lb <= 0 & 0 <= ub) |>
    filter(contains_zero)
  if (nrow(contains) == 0) NA_real_ else min(contains$Mbar)
}

tibble(
  horizon         = c("On impact (e = 0)", "Three years post (e = 3)"),
  breakdown_Mbar  = c(sweep_mbar(0), sweep_mbar(3))
) |>
  gt_pretty(decimals = 3)
horizon breakdown_Mbar
On impact (e = 0) 0.4
Three years post (e = 3) 0.4

On impact, the robust CI first contains zero at \(\bar M = 0.40\): a post-treatment violation 40 percent as large as the largest observed pre-period violation is enough to overturn statistical significance. The sweep returns the same breakdown at \(e = +3\). The point estimate there is several times larger, but the relative-magnitudes restriction bounds each period-to-period change in the violation, so the allowed cumulative violation grows with the horizon and the robust CI widens along with the estimate. This is the precise version of the chapter’s qualitative warning: on this grid the larger three-year effect does not carry a more comfortable safety margin than the on-impact one.
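The way the relative-magnitudes bound depends on the horizon can be sketched in two lines. The display below restates the Rambachan & Roth (2023) restriction in this chapter's event-time indexing, which is an assumption of the sketch: \(\delta_e\) denotes the parallel-trends violation at event time \(e\), and the reference period is normalised so that \(\delta_{-1} = 0\).

\[\Delta^{RM}(\bar M) = \left\{\delta : |\delta_e - \delta_{e-1}| \le \bar M \cdot \max_{s \le -1} |\delta_s - \delta_{s-1}| \ \text{ for all } e \ge 0 \right\}\]

Telescoping from the reference period and applying the triangle inequality gives

\[|\delta_e| = \Bigl|\sum_{k=0}^{e} (\delta_k - \delta_{k-1})\Bigr| \le (e + 1)\,\bar M \cdot \max_{s \le -1} |\delta_s - \delta_{s-1}|.\]

The restriction allows one step of violation at \(e = 0\) and four at \(e = +3\). As a rough illustration that ignores sampling error, take the pre-period point estimates in Table 10.5 as stand-ins for the true violations: the largest pre-period step is \(|-0.0167 - (-0.0341)| = 0.0174\), so at \(\bar M = 0.4\) the allowed violation is about 0.007 on impact and about 0.028 three years out.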

10.12.5 Exercise 5 (stretch): Cohort-2004-only 2×2 DiD

Restrict to cohort-2004 counties and never-treated counties only, and fit a 2×2 DiD with fixest::feols using a treat × post interaction. Compare the interaction coefficient with \(\widehat{ATT}(g=2004, t=2004)\) from the chapter’s attgt$att vector and with the average of cohort 2004’s four post-treatment cells. Because post pools 2004–2007 against the single pre-period 2003, the coefficient should reconcile with the latter; restricting the sample to 2003–2004 would recover the former. This is a sanity check that the Callaway-Sant’Anna primitives are the same 2×2 DiDs we already know from chapter 3: generalised, not replaced.

Code
data_2x2 <- data2 |>
  filter(G %in% c(0, 2004)) |>
  mutate(treat = as.integer(G == 2004),
         post  = as.integer(year >= 2004))

fit_2x2 <- fixest::feols(lemp ~ treat:post | id + year,
                         data    = data_2x2,
                         cluster = "id")

attgt_2004_2004 <- attgt$att[attgt$group == 2004 & attgt$t == 2004]

tibble(
  estimator = c("Hand 2x2 DiD (treat x post)", "Callaway-Sant'Anna ATT(2004, 2004)"),
  estimate  = c(coef(fit_2x2)[["treat:post"]], attgt_2004_2004)
) |>
  gt_pretty(decimals = 4)
estimator estimate
Hand 2x2 DiD (treat x post) −0.0888
Callaway-Sant'Anna ATT(2004, 2004) −0.0327

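The 2×2 DiD coefficient (-0.0888) does not match \(\widehat{ATT}(2004, 2004) = -0.0327\), and it should not. The post indicator switches on for all of 2004–2007, so with 2003 as the only pre-period the pooled coefficient equals the average of cohort 2004’s four post-treatment cells in Table 10.3: \((-0.0327 - 0.0683 - 0.1234 - 0.1311)/4 \approx -0.0889\), which agrees with the regression up to rounding of the table entries. Restricting data_2x2 to 2003–2004 before fitting targets \(\widehat{ATT}(2004, 2004)\) itself. Either way, the Callaway-Sant’Anna machinery is not a different statistical idea from chapter 3’s DiD: it is the same 2×2 contrast computed for every \((g, t)\) cell using only the cohort-\(g\) and never-treated rows, and then aggregated in a way that protects against the staggered-adoption contamination problem.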
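The link between a pooled treat × post coefficient and the period-by-period 2×2 contrasts can be checked on simulated data. The sketch below is a hypothetical balanced panel (not the minimum-wage data) with one pre-period, a common adoption date, and an effect that grows over time. It uses base R's lm() with dummy variables in place of fixest, and all names are illustrative.

```r
set.seed(1)  # hypothetical simulated panel, not the chapter's data

n_treat <- 30
n_ctrl  <- 70
sim <- expand.grid(id = seq_len(n_treat + n_ctrl), t = 1:5)
sim$treat <- as.integer(sim$id <= n_treat)
sim$post  <- as.integer(sim$t >= 2)          # period 1 is the only pre-period

unit_fe <- rnorm(n_treat + n_ctrl)
sim$y <- unit_fe[sim$id] + 0.1 * sim$t -
  0.03 * (sim$t - 1) * sim$treat * sim$post +  # effect deepens each period
  rnorm(nrow(sim), sd = 0.1)

# Pooled 2x2: one interaction coefficient with unit and period dummies.
fit <- lm(y ~ treat:post + factor(id) + factor(t), data = sim)
beta_pooled <- unname(coef(fit)["treat:post"])

# Period-by-period 2x2 contrasts against the pre-period (period 1).
cell  <- tapply(sim$y, list(sim$treat, sim$t), mean)
did_t <- (cell["1", -1] - cell["1", 1]) - (cell["0", -1] - cell["0", 1])

c(pooled = beta_pooled, mean_of_cells = mean(did_t), first_cell = did_t[["2"]])
```

In a balanced panel with a single pre-period, the pooled coefficient equals the simple mean of the per-period contrasts. When the effect grows over time it therefore differs from the first post-period contrast, which is the pattern in the exercise output.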