# geometrics

> Regional growth, convergence, and inequality analysis on the PySAL stack
> (libpysal, esda, giddy, inequality, mapclassify, spreg, mgwr) with Plotly
> figures, Great Tables, and plain-language interpretation on every result.

Three inputs: gdf (geometry with ONLY the entity ID; shapefile / zipped shapefile / GeoJSON / GeoPackage via read_gdf), df (long-form panel declared with set_panel / set_labels), df_dict (6-column data dictionary: var_name, var_def, label, type, role, can_be_na). Every public function returns a frozen result dataclass with .df, .fig and/or .gt, .interpret() and .explain(). Install: pip install geometrics (extras: [dynamics] for Markov via giddy, [png] for static export, [all]).

## Docs

- [Explore](https://quarcs-lab.github.io/geometrics/explore.html): maps, weights, Moran/LISA, space-time views — India
- [Analyze](https://quarcs-lab.github.io/geometrics/analyze.html): beta/sigma/club convergence, spatial models with impacts, Markov, inequality — Bolivia
- [Learn](https://quarcs-lab.github.io/geometrics/learn.html): the learn_* concept sandboxes and the explainer registry
- [The data model](https://quarcs-lab.github.io/geometrics/articles/data-model.html): the (gdf, df, df_dict) contract
- [Convergence](https://quarcs-lab.github.io/geometrics/articles/convergence.html): beta/sigma convergence and clubs
- [Spatial dependence](https://quarcs-lab.github.io/geometrics/articles/spatial-dependence.html): weights, Moran, LISA
- [Spatial spillovers](https://quarcs-lab.github.io/geometrics/articles/spillovers.html): the spreg suite and impacts
- [Regional inequality](https://quarcs-lab.github.io/geometrics/articles/inequality.html): Gini/Theil and decompositions
- [Distribution dynamics](https://quarcs-lab.github.io/geometrics/articles/dynamics.html): Markov and spatial Markov
- [The India case study](https://quarcs-lab.github.io/geometrics/articles/india-case-study.html): the full replication arc
- [The Bolivia dataset](https://quarcs-lab.github.io/geometrics/articles/bolivia-dataset.html): PWT-anchored local GDP at three scales
- [For AI / LLMs](https://quarcs-lab.github.io/geometrics/use-with-llms.html): how AI agents should install and call geometrics
- [Changelog](https://quarcs-lab.github.io/geometrics/changelog.html): release notes

## API

- explore_*: explore_choropleth_map, explore_connectivity_map, explore_moran_plot, explore_lisa_cluster_map, explore_moran_over_time, explore_distribution_over_time, explore_spacetime_heatmap
- analyze_*: analyze_beta_convergence, analyze_sigma_convergence, analyze_convergence_clubs, analyze_spatial_model, analyze_spatial_diagnostics, analyze_spatial_model_by_weights, analyze_markov_transitions, analyze_spatial_markov, analyze_inequality_over_time, analyze_theil_decomposition, analyze_gwr, analyze_mgwr
- learn_*: learn_spatial_autocorrelation, learn_spatial_weights, learn_lisa_clusters, learn_spatial_spillovers, learn_omitted_spatial_lag, learn_beta_convergence, learn_sigma_convergence, learn_convergence_clubs, learn_markov_chains, learn_spatial_markov, learn_theil_decomposition
- utilities: read_gdf, make_weights, growth_cross_section, set_panel, resolve_panel, set_labels, resolve_label, set_roles, build_data_dict, set_palette, get_palette, explain, list_topics
- geometrics.data: load_india, load_india_states, load_india_raw, load_bolivia, load_bolivia_departments, load_bolivia_grid, load_bolivia_raw, clear_cache

## Source

- [Repository](https://github.com/quarcs-lab/geometrics)
- [API reference](https://quarcs-lab.github.io/geometrics/reference/index.html)
- [llms-full.txt](https://quarcs-lab.github.io/geometrics/llms-full.txt): full docs text + signatures

# ===== Docs pages (source) =====

## ----- explore.qmd -----

---
title: "Explore regional data"
aliases:
  - /quickstart.html
---

```{=html}
```

The **Explore** module is your first look at a regional dataset — before you estimate a single model.
This page is a **case study**: you have just been handed 520 Indian districts observed by DMSP-OLS satellite nighttime lights between 1996 and 2010 (from [Mendez, Kabiraj & Li](https://github.com/quarcs-lab/project2025s-py)) and asked three questions an analyst always starts with: *is development spatially clustered, where exactly, and how did the whole regional distribution move over time?*

Every Explore function takes the panel and returns a small **result object** carrying a tidy `.df` plus an interactive [Plotly](https://plotly.com/python/) figure (`.fig`), and most offer a plain-language `.interpret()`. Read this page top to bottom: the functions are ordered as a **workflow** — *load the three inputs → map the level → encode the neighborhood → test and localize clustering → watch the distribution move in time and space*.

::: {.callout-note}
This is **exploratory** analysis: every reading below describes an *association*, never a cause. The [Analyze](analyze.qmd) module turns these patterns into estimates, and [Learn](learn.qmd) explains the ideas behind them with simulations you control.
:::

## Stage 0 — Load the three inputs

geometrics separates geometry, data, and metadata: a geometry with **only the entity ID** (`gdf`), a **long-form panel** (`df`), and a **data dictionary** (`df_dict`). The bundled India case study ships all three; `set_labels` attaches the dictionary's labels to every future figure and declares the (entity, time) coordinates once.
```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)  # labels + entity/time + roles, once
df.head(3)
```

The dictionary is data too — it documents every column and drives the labels on every figure:

```{python}
df_dict.head(8)
```

## Stage 1 — See the map

`explore_choropleth_map` classifies with mapclassify (Fisher-Jenks by default, `k` classes) and draws one legend entry per class, so the legend *is* the classification:

```{python}
gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, period=2010).fig
```

Pass `animate=True` instead of a single `period` to play the whole 1996–2010 film, or switch `scheme` (`"quantiles"`, `"equalinterval"`, …) to see how much the story depends on the classification — `gm.explain("choropleth_classification")` explains why.

## Stage 2 — Encode the neighborhood

Everything spatial starts with a weights matrix **W** — the formal answer to "who is whose neighbor?". The paper uses 6 nearest neighbors; `explore_connectivity_map` draws the graph so you can inspect it *before* trusting it:

```{python}
w = gm.make_weights(gdf, method="knn", k=6)
gm.explore_connectivity_map(gdf, w=w).fig
```

Contiguity is the common alternative (`method="queen"`) — see [Spatial dependence and LISA](articles/spatial-dependence.qmd) for how to choose, and [Analyze](analyze.qmd) for checking that results survive the choice.

## Stage 3 — Is development spatially clustered?

The Moran scatterplot puts each district's (standardized) value against the average of its neighbors; the slope is **global Moran's I**, the workhorse test of spatial autocorrelation:

```{python}
moran = gm.explore_moran_plot(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996)
moran.fig
```

```{python}
print(moran.interpret())
```

## Stage 4 — Where exactly? (LISA)

Global Moran's I says *whether* the map clusters; **LISA** (local Moran) says *where* — each district is classified as a High-High hot spot, Low-Low cold spot, or a spatial outlier (High-Low / Low-High), masked at 5% significance:

```{python}
lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996)
lisa.fig
```

```{python}
print(lisa.interpret())
```

## Stage 5 — The whole distribution, year by year

Convergence questions are distribution questions. The ridgeline stacks the cross-sectional density of each year on one shared grid, so you can watch the shape — not just the mean — move:

```{python}
gm.explore_distribution_over_time(df, "log_ntl_pc_1996").fig
```

(`kind="animated"` plays the same densities as an animation instead.)

## Stage 6 — Every district, every year

The space-time heatmap keeps *every* unit visible: one row per district, one column per year. Sorting the rows by latitude turns geography itself into the y-axis — a north–south transect of the whole panel:

```{python}
gm.explore_spacetime_heatmap(
    df, "log_ntl_pc_1996", gdf=gdf, sort_by="north_south"
).fig
```

Rows that keep their shading left to right are persistent; rows that lighten or darken are mobile. `sort_by="value"` orders by the first period instead.

## Stage 7 — Does the clustering strengthen or fade?

Stage 3 tested one year. Running Moran's I per year closes the loop — is the spatial structure of development deepening or dissolving?

```{python}
mot = gm.explore_moran_over_time(df, "log_ntl_pc_1996", gdf=gdf, w=w)
mot.fig
```

```{python}
print(mot.interpret())
```

## Where next

You now know the map clusters, where it clusters, and how the distribution moved.

- [Analyze](analyze.qmd) — estimate it: β/σ/club convergence, spatial models with spillovers, Markov dynamics, inequality decompositions (on the Bolivia case study)
- [The India case study](articles/india-case-study.qmd) — the full replication arc on this same panel
- [Learn](learn.qmd) — the ideas behind W, Moran's I and LISA, taught with simulations where you plant the truth
- [The data model](articles/data-model.qmd) — bring your own `(gdf, df, df_dict)`

## ----- analyze.qmd -----

---
title: "Analyze convergence and inequality"
---

```{=html}
```

The **Analyze** module estimates what [Explore](explore.qmd) described. This page is a **case study** on the bundled Bolivia panel — 112 provinces with PWT-anchored GDP per capita, 2012–2022 (from [Rossi-Hansberg & Zhang's local GDP](articles/bolivia-dataset.qmd), rescaled to Penn World Table 11.0) — asking the three standing questions of the convergence literature: *are poorer provinces catching up, do spillovers carry growth across borders, and is the national gap narrowing?*

The functions appear in the order an analysis actually runs: *build the growth cross-section → estimate β without space → add spillovers → let the diagnostics pick the model → check robustness to W → σ-convergence → clubs → distribution dynamics → inequality and its decomposition → local heterogeneity.* (The [India case study](articles/india-case-study.qmd) runs this same arc on the flagship 520-district panel.)

::: {.callout-note}
Every `.interpret()` below reads an *association*, never a cause — estimates from observational regional data describe patterns, not policy effects. The [Learn](learn.qmd) module demonstrates each estimator on simulated data where the truth is planted.
:::

## Stage 0 — Load and declare

```{python}
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import geometrics as gm

gdf, df, df_dict = gm.data.load_bolivia()  # 112 provinces x 2012-2022
df = gm.set_labels(df, df_dict, set_panel=True)  # labels + (gid, year), once
w = gm.make_weights(gdf)  # queen contiguity, row-standardized
df.head(3)
```

Five provinces are fully censored in the source product (polygons but no panel rows) — geometrics warns and carries on; see [the Bolivia dataset](articles/bolivia-dataset.qmd).

## Stage 1 — The growth cross-section

Every convergence regression starts from the same frame: one row per province, its initial level and its annualized log growth. `growth_cross_section` builds it explicitly, so the β regression that follows has no hidden steps:

```{python}
cs = gm.growth_cross_section(df, "gdppc")
cs.head()
```

## Stage 2 — β-convergence, without space

Do initially-poorer provinces grow faster? A negative slope of growth on the initial (log) level says yes; `speed` and `half_life` translate it into years:

```{python}
ols = gm.analyze_beta_convergence(df, "gdppc", model="ols")
print(
    f"beta = {ols.beta_total:.4f} (speed {ols.speed:.3f}, "
    f"half-life {ols.half_life:.0f} yr)"
)
ols.fig
```

```{python}
print(ols.interpret())
```

## Stage 3 — β with spillovers (the spatial Durbin model)

Provinces are not islands: `model="sdm"` adds the spatial lags of outcome and covariates, and the convergence estimate becomes a LeSage-Pace **total impact** with direct and indirect (spillover) components — Monte-Carlo standard errors included:

```{python}
sdm = gm.analyze_beta_convergence(
    df, "gdppc", model="sdm", gdf=gdf, w=w, n_draws=1000
)
print(
    f"SDM total: {sdm.beta_total:.4f} = direct {sdm.beta_direct:.4f} "
    f"+ indirect {sdm.beta_indirect:.4f} (rho = {sdm.rho:.2f})"
)
```

```{python}
print(sdm.interpret())
```

## Stage 4 — Which spatial model do the data ask for?

Rather than assuming the SDM, let the Lagrange-multiplier diagnostics inspect the OLS residuals and apply the Anselin-Florax decision rule:

```{python}
cs["ln_initial"] = np.log(cs["initial"])
cs["year"] = 2012
cs = gm.set_panel(cs, entity="gid", time="year")

diag = gm.analyze_spatial_diagnostics(
    cs, outcome="growth", covariates=["ln_initial"], gdf=gdf, w=w
)
print(f"Recommendation: {diag.recommendation}\n")
print(diag.reasoning)
```

The recommended specification is one `model=` switch away — and its full spreg table with the impact decomposition comes from `analyze_spatial_model`:

```{python}
model = gm.analyze_spatial_model(
    cs, outcome="growth", covariates=["ln_initial"], gdf=gdf, w=w,
    model="durbin", n_draws=1000,
)
model.gt
```

## Stage 5 — Robust to the weights choice?

A conclusion that only holds under one definition of "neighbor" is fragile. `analyze_spatial_model_by_weights` re-estimates the same model under alternative weights and compares the impacts side by side:

```{python}
robust = gm.analyze_spatial_model_by_weights(
    cs, outcome="growth", covariates=["ln_initial"], gdf=gdf,
    weights={
        "queen": gm.make_weights(gdf, method="queen"),
        "knn4": gm.make_weights(gdf, method="knn", k=4),
        "knn6": gm.make_weights(gdf, method="knn", k=6),
    },
    model="durbin", n_draws=1000,
)
robust.fig
```

```{python}
print(robust.interpret())
```

## Stage 6 — Is the gap narrowing? (σ-convergence)

β asks about catch-up; σ asks whether cross-sectional **dispersion** actually shrank. Both matter — fast catch-up can coexist with stable dispersion:

```{python}
sigma = gm.analyze_sigma_convergence(df, "gdppc")
sigma.fig
```

```{python}
print(sigma.interpret())
```

::: {.callout-tip}
Dispersion is measured on logs, so the series must be strictly positive — `gdppc` is. For a panel with zeros (India's night lights), filter first: the [India case study](articles/india-case-study.qmd) shows the pattern.
:::

## Stage 7 — One Bolivia or several? (convergence clubs)

Global convergence can fail while **clubs** of provinces converge to different steady states. The Phillips-Sul log(t) test and clustering find them from the data:

```{python}
clubs = gm.analyze_convergence_clubs(df, "ln_gdppc", gdf=gdf)
print(clubs.interpret())
```

```{python}
clubs.fig_map if clubs.fig_map is not None else clubs.fig
```

## Stage 8 — Distribution dynamics (Markov chains)

How mobile are provinces across the income distribution — and does mobility depend on the neighbors? (Requires the `dynamics` extra: `pip install "geometrics[dynamics]"`.)

```{python}
mkv = gm.analyze_markov_transitions(df, "gdppc", k=4)
mkv.gt
```

The spatially conditioned chains need every mapped province observed in every period — so drop the five censored polygons from the geometry first (their weights are rebuilt on the subset automatically):

```{python}
gdf_obs = gdf[gdf["gid"].isin(df["gid"])]
spm = gm.analyze_spatial_markov(df, "gdppc", gdf=gdf_obs, k=4, m=4)
print(spm.interpret())
```

## Stage 9 — Inequality: trend and decomposition

The same panel, read through inequality indices — including the spatial Gini, which splits inequality into neighbor and non-neighbor pairs:

```{python}
ineq = gm.analyze_inequality_over_time(df, "gdppc", gdf=gdf, w=w)
ineq.fig
```

```{python}
print(ineq.interpret())
```

How much of provincial inequality is *between departments* rather than within them? The Theil index decomposes exactly:

```{python}
theil = gm.analyze_theil_decomposition(df, "gdppc", "name1")
theil.fig
```

```{python}
print(theil.interpret())
```

## Stage 10 — Local heterogeneity (GWR, briefly)

One β for all of Bolivia may hide local stories.
Geographically weighted regression lets the convergence coefficient vary over space and maps it:

```{python}
gwr = gm.analyze_gwr(
    cs, outcome="growth", covariates=["ln_initial"], gdf=gdf
)
gwr.figs["ln_initial"]
```

Multiscale GWR (`analyze_mgwr`) lets each term choose its own bandwidth — see the [reference](reference/analyze_mgwr.qmd) and the India article for a full run.

## Where next

- [Explore](explore.qmd) — the descriptive workflow that should precede all of this
- [Learn](learn.qmd) — every estimator above, demonstrated on simulated data with a planted truth
- [Spatial spillovers](articles/spillovers.qmd) — the spreg suite in depth; [Distribution dynamics](articles/dynamics.qmd); [Regional inequality](articles/inequality.qmd)
- [The India case study](articles/india-case-study.qmd) — this arc on 520 districts

## ----- learn.qmd -----

---
title: "Learn spatial analysis"
---

```{=html}
```

The **Learn** module is geometrics' teaching layer, and it works two complementary ways:

1. **Every result speaks.** Each `explore_*` / `analyze_*` result carries `.interpret()` — a plain-language reading of *that* result — and `.explain()`, the concept behind the method.
2. **Sandboxes with a planted truth.** The `learn_*` functions simulate data from a known data-generating process, run the *real* geometrics estimator on it, and show whether the truth you planted comes back. Turn the knobs (`rho=`, `shift=`, `convergence_rate=`, …) and watch the concept respond.

::: {.callout-note}
Sandboxes are for learning, never for your data — they *generate* their own. And even here, where the truth is literally known, `.interpret()` keeps its associational discipline: the habit should transfer to real data, where no truth is planted.
:::

## Stage 1 — Read a real result in plain language

Any result can explain itself.
A quick β-convergence on the Bolivia provinces:

```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_bolivia()
df = gm.set_labels(df, df_dict, set_panel=True)

res = gm.analyze_beta_convergence(df, "gdppc", model="ols")
print(res.interpret())
```

And the concept behind it, from the built-in explainer registry:

```{python}
print(res.explain().to_markdown()[:600], "...")
```

## Stage 2 — The browsable concept index

Thirty topics ship with the package — ESDA, weights, spatial models and impacts, convergence, distribution dynamics, inequality, and foundations. Every key works with `gm.explain(...)`:

```{python}
gm.list_topics()
```

```{python}
print(gm.explain("spatial_autocorrelation").to_markdown())
```

## Stage 3 — Sandbox: *seeing* spatial autocorrelation

What does ρ actually look like? Plant no dependence, then strong dependence — the left panel is one simulated map, the right panel tracks Moran's I across planted ρ:

```{python}
gm.learn_spatial_autocorrelation(rho=0.0).fig
```

```{python}
strong = gm.learn_spatial_autocorrelation(rho=0.8)
strong.fig
```

```{python}
print(strong.interpret())
```

## Stage 4 — Sandbox: why spatial econometrics exists

Simulate outcomes that spill over (`y = (I - ρW)⁻¹(βx + ε)`), then fit OLS as if space did not exist. The omitted spatial lag inflates the slope; the SAR model recovers both β and ρ:

```{python}
omit = gm.learn_omitted_spatial_lag(rho=0.7)
omit.fig
```

```{python}
print(omit.interpret())
```

## Stage 5 — Sandbox: spillovers you planted, impacts recovered

In a spatial Durbin world the true direct and indirect effects are known in closed form — so you can watch the LeSage-Pace decomposition earn its keep.
This is the idea behind [Analyze Stage 3](analyze.qmd#stage-3-β-with-spillovers-the-spatial-durbin-model):

```{python}
spill = gm.learn_spatial_spillovers(rho=0.5, gamma=0.5)
spill.fig
```

```{python}
print(spill.interpret())
```

## Stage 6 — Sandbox: convergence at a known speed

Plant a 2% convergence rate; the growth-on-initial regression should hand it back — slope, speed, and half-life:

```{python}
beta = gm.learn_beta_convergence(convergence_rate=0.02)
beta.fig
```

```{python}
beta.df
```

## Stage 7 — The full sandbox catalog

Eleven sandboxes cover the package's method families — each links to its reference page with every knob documented:

| Sandbox | The lesson |
|---|---|
| [`learn_spatial_autocorrelation`](reference/learn_spatial_autocorrelation.qmd) | What ρ looks like, and how Moran's I tracks it |
| [`learn_spatial_weights`](reference/learn_spatial_weights.qmd) | The same field under queen / rook / knn — W is a choice |
| [`learn_lisa_clusters`](reference/learn_lisa_clusters.qmd) | Planted hot/cold spots, recovered (and false positives counted) |
| [`learn_spatial_spillovers`](reference/learn_spatial_spillovers.qmd) | Direct/indirect/total impacts vs a closed-form truth |
| [`learn_omitted_spatial_lag`](reference/learn_omitted_spatial_lag.qmd) | The bias of ignoring Wy — and how SAR repairs it |
| [`learn_beta_convergence`](reference/learn_beta_convergence.qmd) | A planted convergence rate, recovered |
| [`learn_sigma_convergence`](reference/learn_sigma_convergence.qmd) | A planted dispersion path; trend = ln ρ exactly |
| [`learn_convergence_clubs`](reference/learn_convergence_clubs.qmd) | Two planted clubs; Phillips-Sul finds them |
| [`learn_markov_chains`](reference/learn_markov_chains.qmd) | A planted transition matrix, recovered cell by cell |
| [`learn_spatial_markov`](reference/learn_spatial_markov.qmd) | Mobility that depends on the neighbors — detected |
| [`learn_theil_decomposition`](reference/learn_theil_decomposition.qmd) | A planted between/within split, decomposed exactly |

(The two Markov sandboxes need the `dynamics` extra: `pip install "geometrics[dynamics]"`.)

## Prefer sliders?

The Learn app wraps every sandbox knob in a slider and pairs it with the explainer browser — no install needed:

```{=html}

📚 Launch the Learn app

```

## Where next

- [Explore](explore.qmd) — apply the ESDA ideas to the India panel
- [Analyze](analyze.qmd) — the estimators these sandboxes demystify, on Bolivia
- [API reference](reference/index.qmd) — every knob of every sandbox

## ----- articles/data-model.qmd -----

---
title: "The data model (gdf, df, df_dict)"
subtitle: "One ID-only geometry, one long panel, one data dictionary — declared once, used everywhere"
---

Every geometrics analysis takes the same three inputs:

1. **`gdf`** — the entity geometry, carrying **only the entity ID** (plus an optional human-readable name) and the geometry column. Loaded and validated by `read_gdf`.
2. **`df`** — a **long-form panel**: one row per (entity, period), one column per variable. Its identifiers are declared once with `set_panel` (or in the same call as the labels, below).
3. **`df_dict`** — a six-column **data dictionary** with one row per `df` column. The dictionary is data too: it supplies the labels on every figure and table, and it can declare the panel IDs and analytical roles in one step.

Geometry is a *join table*, not a data table — the variables live in `df`, and every spatial function matches the two on the entity ID. The bundled India case study ships all three inputs in exactly this shape:

```{python}
import warnings
warnings.filterwarnings("ignore")

import geopandas as gpd
import pandas as pd
import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)  # labels + entity/time + roles, once

print(f"gdf: {gdf.shape[0]} units x {list(gdf.columns)}")
print(f"df: {df.shape[0]} rows (520 districts x 6 years), {df.shape[1]} columns")
print(f"df_dict: {df_dict.shape[0]} rows — one per df column")
```

Note what the geometry table holds: the ID and the polygons, nothing else.

```{python}
gdf.head(3)
```

## `read_gdf`: the geometry entry point

`gm.read_gdf` is the single door for user geometry.
It accepts a GeoDataFrame or a path in any of four formats — a **shapefile** (`.shp`), a **zipped shapefile** (`.zip`), **GeoJSON** (`.geojson` / `.json`), or a **GeoPackage** (`.gpkg`, with `layer=` for multi-layer files) — and enforces the geometry contract every spatial function relies on: a declared CRS, valid non-empty geometries (invalid ones are repaired with a warning), and a **unique** entity ID. Round-trip a toy grid through three of them:

```{python}
import shutil
import tempfile
from pathlib import Path

from shapely.geometry import box

cells = [box(x, y, x + 1, y + 1) for y in range(2) for x in range(3)]
grid = gpd.GeoDataFrame(
    {"cell_id": [f"c{i}" for i in range(6)]}, geometry=cells, crs="EPSG:4326"
)

tmp = Path(tempfile.mkdtemp())
grid.to_file(tmp / "grid.geojson")
grid.to_file(tmp / "grid.gpkg", layer="grid")
(tmp / "shp").mkdir()
grid.to_file(tmp / "shp" / "grid.shp")
shutil.make_archive(str(tmp / "grid"), "zip", tmp / "shp")  # -> grid.zip

for name in ("grid.geojson", "grid.gpkg", "grid.zip"):
    back = gm.read_gdf(tmp / name)
    print(f"{name:13} -> {back.shape[0]} units, CRS EPSG:{back.crs.to_epsg()}, "
          f"entity = {back.attrs['geometrics_geo']['entity']!r}")
```

Two things happened silently there. First, the **ID-only rule**: because `cell_id` is the sole non-geometry column, it *is* the entity ID — no argument needed. When a file carries several columns, `read_gdf` looks for name hints (`id`, `code`, `region`, `district`, ...) and asks you to pass `entity=` when the choice is ambiguous; an `entity_name=` column (a readable label such as a district name next to a census code) can ride along. Everything else belongs in `df` — trim your geometry down to the ID (and optional name) before analysis. Second, the resolved IDs were stored on `gdf.attrs["geometrics_geo"]`, so every later call can omit `entity=`.

**CRS handling** follows one principle: `read_gdf` *declares*, it never *reprojects*.
A source with no CRS is an error until you say what the coordinates mean:

```{python}
naked = gpd.GeoDataFrame({"cell_id": ["a"]}, geometry=[box(0, 0, 1, 1)])  # no CRS
try:
    gm.read_gdf(naked)
except ValueError as err:
    print(err)

gm.read_gdf(naked, crs="EPSG:4326").crs
```

Reprojection happens later, on demand: metric operations (k-nearest-neighbor centroids, distance bands) project to an estimated UTM CRS automatically — or keep raw lon/lat coordinates with `crs=None`, as the India paper does.

## Declare once: `set_panel`, `set_labels`, `set_roles`

Rather than repeat `entity="statedist", time="year"` on every call, geometrics stashes structural metadata on `df.attrs` — the **declare-once pattern**. Three helpers write it: `set_panel` (entity / time / entity-name IDs), `set_labels` (`{column: label}` display names), and `set_roles` (the default `outcome` and `covariates`). When the labels come from a `df_dict`, one call does all three:

```{python}
df = gm.set_labels(df, df_dict, set_panel=True)
print(df.attrs["geometrics_panel"])
print(df.attrs["geometrics_roles"]["outcome"],
      df.attrs["geometrics_roles"]["covariates"][:3], "...")
```

Two rules make this safe to rely on:

- **Explicit arguments always win.** `gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, entity="statedist")` uses the argument, not the stored default — the attrs are a convenience, never a constraint.
- **attrs can be dropped by pandas.** Some operations (merges, certain column selections) return a frame without them.
  The fix is one line — call `set_labels` again:

```{python}
lookup = pd.DataFrame({"state": df["state"].unique()})
merged = df.merge(lookup, on="state")
print("attrs after merge:", merged.attrs)

merged = gm.set_labels(merged, df_dict, set_panel=True)  # recall: declare again
print("attrs after recall:", merged.attrs["geometrics_panel"])
```

## The `df_dict` contract

A data dictionary is a plain DataFrame with **six columns, in this order**: `var_name`, `var_def` (the long definition), `label` (the short display name), `type`, `role`, `can_be_na`. Its vocabulary is fixed:

- `type` ∈ `entity` / `time` / `factor` / `logical` / `numeric`
- `role` ∈ `""` (blank — no special role) / `outcome` / `covariate` / `entity_name`
- `can_be_na` ∈ `True` / `False`

Six rows of the India dictionary show the vocabulary at work — the entity ID, a factor tagged as the readable entity name, the time ID, a plain numeric column, the declared covariate, and the declared outcome (this panel happens to have no `logical` column; the inferred dictionary below has one):

```{python}
df_dict.iloc[[0, 2, 3, 6, 8, 9]]
```

```{python}
print("type vocabulary:", sorted(df_dict["type"].unique()))
print("role vocabulary:", sorted(df_dict["role"].fillna("").unique()))
```

(Blank roles read back from CSV as missing values — treat blank and `NaN` as the same "no role".) The `label` column titles every axis, legend and table header; the `type` column lets `set_labels(..., set_panel=True)` find the entity and time IDs; the `role` column feeds the default outcome/covariates and the `Name (id)` hover labels.

## No dictionary? Infer a starting point

`gm.build_data_dict` produces a best-guess dictionary for any frame, from column names and dtypes — entity hints like `iso`/`id`/`region`, time hints like `year`/`date`, calendar-year integer detection, and a text column that is ~1:1 with the entity becomes the `entity_name`:

```{python}
import numpy as np

rng = np.random.default_rng(7)
iso = ["BOL", "COL", "ECU", "PER"]
names = {"BOL": "Bolivia", "COL": "Colombia", "ECU": "Ecuador", "PER": "Peru"}
years = list(range(2000, 2011))
raw = pd.DataFrame(
    {
        "iso3": np.repeat(iso, len(years)),
        "country": np.repeat([names[i] for i in iso], len(years)),
        "year": years * len(iso),
        "gdp_pc": rng.uniform(3_000, 12_000, size=len(iso) * len(years)).round(0),
        "landlocked": np.repeat([True, False, False, False], len(years)),
    }
)
ddict = gm.build_data_dict(raw)
ddict
```

The inference is deliberately conservative: it is a *guess* to edit, not a verdict. Pass `entity=` / `time=` to pin the IDs when the hints misfire, rewrite the humanized labels, and mark the analytical roles yourself — `outcome` and `covariate` are **never** guessed. Then declare it the usual way:

```{python}
raw = gm.set_labels(raw, ddict, set_panel=True)
print(gm.resolve_panel(raw))
```

## How `df` meets `gdf`: alignment

You never merge the panel onto the geometry yourself. Every spatial function funnels through one alignment path that joins the two on the **entity ID** and returns rows in `gdf` order (killing the classic row-order mismatch bug):

1. slice the requested `period` (defaulting to the latest, with a note);
2. join on the IDs — exact match first, and if there is **zero** overlap, one retry with string-normalized keys (`str.strip` on both sides), reported as a warning;
3. warn about **unmatched IDs** on either side (naming examples, so typos surface);
4. drop rows with **missing values** in the analysis columns, with a warning;
5. when rows dropped and a `w` was given, **restrict the weights** to the kept units (`libpysal`'s `w_subset`) and re-apply the row-standardization, so `w.n` always equals the number of analyzed rows.

Watch all of it on a deliberately messy copy of the panel — one district's 2010 value blanked out, plus a "Ghost district" that has no polygon:

```{python}
w = gm.make_weights(gdf, method="knn", k=6, crs=None)

messy = df.copy()
one = messy["statedist"].iloc[0]
messy.loc[(messy["statedist"] == one) & (messy["year"] == 2010), "ntl_total"] = np.nan
ghost = messy[messy["year"] == 2010].iloc[[1]].copy()
ghost["statedist"] = "Ghost district"
messy = gm.set_labels(
    pd.concat([messy, ghost], ignore_index=True), df_dict, set_panel=True
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lisa = gm.explore_lisa_cluster_map(messy, "ntl_total", gdf=gdf, w=w, period=2010)
for c in caught:
    if c.category.__name__ == "GeometricsWarning":
        print("warning:", c.message)
```

Nothing crashed and nothing was silently misaligned: the ghost was named, the incomplete district was dropped, and the weights were subset to match. The result object keeps the full audit trail in `notes`, and `w_spec` records the weights recipe every spatial result carries:

```{python}
print("rows analyzed:", len(lisa.df))
print("w_spec:", lisa.w_spec)
for note in lisa.notes:
    print("-", note)
```

## Bring your own data: the checklist

- **Geometry** — load through `gm.read_gdf(...)` (shapefile, zipped shapefile, GeoJSON, or GeoPackage); one row per entity, a **unique** ID, a declared CRS (pass `crs=` when the file carries none), and no other columns beyond the ID and an optional name.
- **Panel** — reshape to long form: one row per (entity, period). Entity IDs must be written exactly as in the geometry (pure whitespace differences are retried automatically; anything else is warned about, unit by unit).
- **Dictionary** — author the six columns (`var_name, var_def, label, type, role, can_be_na`) with the vocabulary above, or bootstrap with `gm.build_data_dict(df)` and edit. Mark `outcome` / `covariate` yourself.
- **Declare once** — `df = gm.set_labels(df, df_dict, set_panel=True)`; re-declare after merges or column subsets, since pandas can drop `attrs`.
- **Weights** — build from the *same* geometry (`gm.make_weights(gdf, ...)`) and inspect the graph with `gm.explore_connectivity_map(gdf, w=w)` before trusting it.

From here, the [Explore page](../explore.qmd) runs the whole workflow in a page, [Beta and sigma convergence](convergence.qmd) covers the convergence toolkit, and [the India case study](india-case-study.qmd) shows the three inputs carrying a full replication.

## ----- articles/convergence.qmd -----

---
title: "Beta and sigma convergence (and clubs)"
subtitle: "Do laggards catch up, is the gap narrowing, and does everyone converge to the same place?"
---

Three questions, three tools. **β-convergence** asks whether units that start behind grow faster (`analyze_beta_convergence`); **σ-convergence** asks whether the cross-sectional spread is actually narrowing (`analyze_sigma_convergence`); and when neither gives a clean verdict, **convergence clubs** ask whether the panel splits into groups that each converge to their own path (`analyze_convergence_clubs`). This article runs all three on the bundled India panel — 520 districts observed by satellite nighttime lights, 1996-2010 — using the raw panel variable `ntl_total` (the paper's per-capita replication lives in [the India case study](india-case-study.qmd)).

```{python}
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)
```

Every concept in the library ships a built-in explainer.
Here is how `gm.explain` introduces the β-convergence idea:

```{python}
print(gm.explain("beta_convergence").to_markdown()[:960], "...")
```

## β-convergence, first ignoring space

`analyze_beta_convergence` builds the growth cross-section internally: for each district, total luminosity in levels at 1996 and 2010, and the annualized log growth between them, regressed on the **initial log level**. A negative slope is convergence.

```{python}
ols = gm.analyze_beta_convergence(df, "ntl_total", model="ols")
ols.fig
```

The result object carries the headline scalars — the slope, the implied structural speed λ = -ln(1 + β·T)/T, and the half-life ln 2 / λ (the years needed to close half of an initial gap):

```{python}
print(
    f"beta = {ols.beta_total:.4f} (SE {ols.se_total:.4f}), R2 = {ols.r2:.3f}, "
    f"N = {ols.n_obs}\n"
    f"speed = {ols.speed:.4f} per year -> half-life = {ols.half_life:.0f} years"
)
```

```{python}
print(ols.interpret())
```

## Adding spillovers: the spatial Durbin model

Districts are not islands — initial luminosity and its growth are both spatially clustered (see [the India case study](india-case-study.qmd) for the LISA maps). The `model` switch re-estimates the same regression with `spreg`'s spatial family; the paper's choice is the **spatial Durbin model** (SDM) on 6-nearest-neighbor weights built, like the paper, on plain lon/lat centroids (`crs=None`):

```{python}
w = gm.make_weights(gdf, method="knn", k=6, crs=None)
sdm = gm.analyze_beta_convergence(
    df, "ntl_total", model="sdm", gdf=gdf, w=w, n_draws=2000
)
```

With a spatial lag in the model, the raw coefficient is no longer the answer. The convergence estimate becomes a LeSage-Pace **impact**: a **direct** part (a district's own initial level and its own growth), an **indirect** part (the neighborhood's initial level — the spillover), and their **total**, with Monte-Carlo standard errors from `n_draws` draws (2,000 here for speed; the default is 10,000).
`sdm.impacts` tabulates the decomposition for every regressor:

```{python}
sdm.impacts
```

Side by side with OLS — the pattern of the source paper's Table 1:

```{python}
comparison = pd.DataFrame(
    {
        "OLS": [ols.beta_total, np.nan, ols.beta_total, np.nan, ols.speed, ols.half_life],
        "SDM": [sdm.beta_direct, sdm.beta_indirect, sdm.beta_total, sdm.rho, sdm.speed, sdm.half_life],
    },
    index=["direct", "indirect", "total", "rho (spatial lag)", "speed (per yr)", "half-life (yr)"],
).round(4)
comparison
```

```{python}
print(sdm.interpret())
```

The OLS β understates catch-up: once the spatial lag (ρ ≈ 0.7) and the neighbors' initial levels enter, the total impact is larger in magnitude than the OLS slope, and the implied convergence speed rises — part of every district's catch-up is associated with its *neighborhood*, a channel OLS has no way to represent. How these impacts are computed, tested and stress-checked against other weights is the subject of the [spatial spillovers article](spillovers.qmd).

## Mapping who grew: `growth_cross_section`

The same one-row-per-unit growth table the regression uses is available directly — handy for mapping the dependent variable before modelling it. It returns a plain DataFrame (entity, `initial`, `final`, `growth`) with the panel entity already declared, so it feeds straight into `explore_choropleth_map`:

```{python}
cs = gm.growth_cross_section(df, "ntl_total")
cs = gm.set_labels(cs, {"growth": "NTL growth (annualized log), 1996-2010"})
gm.explore_choropleth_map(cs, "growth", gdf=gdf).fig
```

## σ-convergence: is the gap actually narrowing?

β-convergence is necessary but not sufficient for the distribution to compress — new shocks can re-spread it even while laggards catch up. `analyze_sigma_convergence` tracks the cross-sectional dispersion of the **log** of the variable per period (the standard deviation, the Gini, the coefficient of variation) and tests the trend of the log dispersion over time.
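
The idea behind that dispersion series can be sketched in a few lines of pandas. This is a minimal illustration on a made-up three-unit panel, not the geometrics implementation: the column names `unit`, `year` and `y` and all values are hypothetical, and only the standard deviation of logs is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: three units observed in three periods
panel = pd.DataFrame(
    {
        "unit": ["a", "b", "c"] * 3,
        "year": [2000] * 3 + [2005] * 3 + [2010] * 3,
        "y": [10.0, 40.0, 160.0, 20.0, 60.0, 180.0, 40.0, 90.0, 200.0],
    }
)

# Cross-sectional dispersion: standard deviation of log(y) within each period
sd_log = panel.assign(log_y=np.log(panel["y"])).groupby("year")["log_y"].std()

# Trend ingredient: OLS slope of log dispersion on time (negative = narrowing)
years = sd_log.index.to_numpy(dtype=float)
slope = np.polyfit(years, np.log(sd_log.to_numpy()), 1)[0]

print(sd_log.round(3))
print(f"slope of log dispersion on year: {slope:.4f}")
```

In this toy panel the laggard grows fastest and the spread of logs shrinks in every period, so the slope is negative. A real test also needs a standard error for that slope, which the sketch leaves out.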
Because dispersion is measured on logs, the series must be strictly positive — and one district (Lahul and Spiti, Himachal Pradesh) records zero luminosity in some years. Pass the full panel and geometrics refuses, telling you exactly why:

```{python}
try:
    gm.analyze_sigma_convergence(df, "ntl_total")
except ValueError as err:
    print(err)
```

So filter to the always-positive balanced panel first, exactly as the [Explore page](../explore.qmd) does — dropping the offending *district* (all its years, keeping the panel balanced), not just the offending rows:

```{python}
bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique()
pos = gm.set_labels(
    df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True
)
sigma = gm.analyze_sigma_convergence(pos, "ntl_total")
sigma.fig
```

```{python}
print(sigma.interpret())
```

Both answers agree here: dimmer districts grew faster (β), *and* the distribution narrowed (σ). That is not guaranteed — report both.

## Convergence clubs: one destination, or several?

A single β can also paper over a split panel: some districts converging to a high path, others to a low one. The Phillips-Sul **log(t)** machinery tests whole-panel convergence and, when it is rejected, clusters the districts into data-driven **convergence clubs** from their relative transition paths. It runs on the same always-positive subset (the HP smoothing and relative transitions need a gap-free, strictly positive series); pass the matching geometry to get the club map.
The clustering sieve refits thousands of log(t) regressions, so expect roughly a minute for the 519 districts — that is normal:

```{python}
gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy()
clubs = gm.analyze_convergence_clubs(pos, "ntl_total", gdf=gdf_pos)
print(
    f"global log-t = {clubs.global_tstat:.1f} (converged: {clubs.converged}) -> "
    f"{clubs.n_clubs} clubs, {clubs.n_divergent} divergent districts"
)
```

Global convergence is emphatically rejected — the districts sort into clubs, each converging to its own path. The membership map shows where those paths live:

```{python}
clubs.fig
```

```{python}
clubs.fig_map
```

```{python}
print(clubs.interpret())
```

## Where next

- [Spatial spillovers](spillovers.qmd) — the full spreg suite behind `model="sdm"`: specification diagnostics, impact inference, and robustness to the weights choice
- [The India case study](india-case-study.qmd) — this toolkit inside the complete replication arc, on the paper's exact per-capita growth variable
- [The data model](data-model.qmd) — bring your own `(gdf, df, df_dict)`

## ----- articles/spatial-dependence.qmd -----

---
title: "Spatial dependence and LISA"
subtitle: "Who counts as a neighbor, whether the map clusters, and where"
---

Every spatial statistic answers a question *relative to a definition of neighborhood*. This article walks the exploratory spatial data analysis (ESDA) arc on the bundled India case study — 520 districts observed by satellite nighttime lights, 1996–2010: encode the neighborhood as a weights matrix, audit it visually, test whether luminosity clusters globally (Moran's I), locate the clusters locally (LISA), and track the clustering over time.
```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)

# The paper's weights: 6 nearest neighbors on plain lon/lat centroids (crs=None)
w = gm.make_weights(gdf, method="knn", k=6, crs=None)
```

## What a weights matrix is

The spatial weights matrix **W** is the formal answer to "who is whose neighbor": cell (i, j) is nonzero when district j counts as a neighbor of district i. It is not a nuisance parameter — Moran's I, LISA, and every spatial regression are defined *conditional on the chosen W*. The built-in explainer says it best:

```{python}
print(gm.explain("spatial_weights").to_markdown()[:800], "...")
```

## Building weights: `make_weights`

`make_weights` covers the standard families with the library's conventions baked in: entity ids as the weight ids, row standardization, and a human-readable one-liner recorded on `w.geometrics_meta["spec"]` (every spatial result carries it forward as `w_spec`, so figures and tables always document their W). Three variants on the India map:

```{python}
variants = {
    "queen": gm.make_weights(gdf, method="queen"),
    "knn6": w,
    "inverse distance": gm.make_weights(gdf, method="inverse_distance", power=2),
}
for name, wx in variants.items():
    print(f"{name:>18}: {wx.geometrics_meta['spec']}")
```

Each spec string tells a small story:

- **Queen contiguity** connects districts that share a border or corner. One district — Mumbai, an island city that shares no boundary in this topology — has no queen neighbor at all, and `make_weights` **attaches it to its nearest neighbor automatically** (with a `GeometricsWarning` naming it), because a zero-neighbor unit has no spatial lag and breaks estimators downstream. The fix is recorded in the spec.
- **k-nearest neighbors** guarantees every district exactly k neighbors — useful when polygons vary wildly in size.
`crs=None` reproduces the source paper's lon/lat-centroid construction; the geometrics default (`crs="auto"`) would project to a metric CRS first.
- **Inverse distance** encodes smooth decay, $1/d^p$ within a radius. With `threshold=None` the band is the smallest radius that leaves no district isolated — about 163 km here.

## Audit the graph before trusting it

`explore_connectivity_map` draws the neighbor graph over the map and summarizes its health — the standard visual check of a W before any statistic is computed on it:

```{python}
conn = gm.explore_connectivity_map(gdf, w=w)
print(f"{conn.n_units} units, {conn.mean_neighbors:.0f} neighbors each "
      f"({conn.min_neighbors}-{conn.max_neighbors}); "
      f"{conn.pct_nonzero:.2f}% of pairs connected; "
      f"{conn.n_components} connected component(s); islands: {list(conn.islands)}")
```

```{python}
conn.fig
```

The scalars are the audit: a **single connected component** (global statistics implicitly assume every unit can reach every other through a chain of neighbors), **no islands**, and a sparse graph — each spatial lag averages over a small, local set of six districts. A companion histogram of neighbor counts is available as `conn.fig_hist` (trivially a spike at 6 for k-NN weights, more informative for contiguity).

```{python}
print(conn.interpret())
```

## Is initial luminosity clustered? The Moran scatterplot

`explore_moran_plot` z-standardizes the variable, plots each district's value against the average of its neighbors' values (the **spatial lag**), and colors the four quadrants: **High-High** and **Low-Low** (clustering of similars) on the diagonal, **Low-High** and **High-Low** (spatial outliers) off it.
Under row-standardized weights the slope of the fitted line *equals* global Moran's I, and a stat box in the corner reports I, its expectation under spatial randomness, and the permutation pseudo p-value:

```{python}
moran = gm.explore_moran_plot(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996)
moran.fig
```

```{python}
moran.glance().round(3)
```

Initial luminosity per capita is strongly and positively autocorrelated — **Moran's I ≈ 0.73** against an expectation near zero, the value the source paper reports. Most districts sit in the two clustering quadrants:

```{python}
print(moran.df["quadrant"].value_counts().to_string())
print()
print(moran.interpret())
```

## Where does it cluster? LISA

Global Moran's I says *that* the map clusters; local Moran statistics (LISA) say *where*. Each district gets a local statistic and its own permutation pseudo p-value; districts significant at `alpha` receive their quadrant's cluster label, everything else is "Not significant":

```{python}
lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996)
lisa.fig
```

```{python}
print(lisa.interpret())
```

The picture matches the literature: a bright **High-High belt** and a dim **Low-Low block**, with only a handful of spatial outliers.

### How sensitive are the clusters to alpha?

The cluster labels are a significance mask, so they move with the threshold. Re-running at `alpha=0.01` keeps the geography but shrinks the significant set:

```{python}
import pandas as pd

lisa01 = gm.explore_lisa_cluster_map(
    df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996, alpha=0.01
)
pd.DataFrame(
    {
        "alpha = 0.05": [lisa.n_hh, lisa.n_ll, lisa.n_lh, lisa.n_hl, lisa.n_ns],
        "alpha = 0.01": [lisa01.n_hh, lisa01.n_ll, lisa01.n_lh, lisa01.n_hl, lisa01.n_ns],
    },
    index=["High-High", "Low-Low", "Low-High", "High-Low", "Not significant"],
)
```

Tightening alpha from 0.05 to 0.01 drops the significant count from 281 to 175 districts.
The hot and cold *cores* survive; the borderline fringe does not — which is exactly how a LISA map should be read.

## Is the clustering strengthening? Moran's I over time

`explore_moran_over_time` pivots the panel wide, keeps the one entity set with complete data in every period (so the same W applies throughout and the values are comparable), and re-tests each of the six satellite years:

```{python}
mot = gm.explore_moran_over_time(df, "ntl_total", gdf=gdf, w=w)
mot.df.round(3)
```

```{python}
mot.fig
```

```{python}
print(mot.interpret())
```

The answer is: **strong throughout, but not strengthening**. Moran's I for total luminosity fluctuates in a narrow band (roughly 0.39–0.45) and every year rejects spatial randomness at the 1% level — spatial dependence is a persistent feature of India's economic geography over the window, not a trend.

## A caveat, and where next

LISA runs one permutation test **per district** — 520 tests here, with no multiple-testing adjustment, so at `alpha = 0.05` roughly 26 "significant" districts would be expected even under complete spatial randomness. Treat the cluster map as *descriptive* of where dependence concentrates (the alpha sensitivity table above is the practical check), not as 520 confirmatory hypothesis tests. The same warning is built into `lisa.interpret()`.

Detecting dependence is the exploratory half. Modeling it — letting growth respond to neighbors' outcomes and characteristics, and decomposing associations into direct and spillover components — is the job of the spreg suite, covered in [Spatial spillovers: the spreg suite](spillovers.qmd).

## ----- articles/spillovers.qmd -----

---
title: "Spatial spillovers: the spreg suite"
subtitle: "SAR, SEM, SLX, and the spatial Durbin model — diagnosed, estimated, and read through impacts"
---

[Spatial dependence and LISA](spatial-dependence.qmd) established that Indian district luminosity clusters in space.
This article models that dependence: it runs the Lagrange-multiplier specification tests, estimates the spatial Durbin model of the paper's convergence regression, decomposes the association into direct and spillover (indirect) impacts, and checks that the conclusion survives alternative weights.

```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)

# The paper's weights: 6 nearest neighbors on plain lon/lat centroids (crs=None)
w = gm.make_weights(gdf, method="knn", k=6, crs=None)
```

## The specification family

Space can enter a regression through three channels, and the classic cross-sectional models are just the on/off combinations.

The **spatial lag model (SAR)**, $y = \rho W y + X\beta + \varepsilon$, lets the outcome respond to neighbors' outcomes — one district's growth feeds its neighbors' growth and echoes back, a *global* feedback governed by $\rho$.

The **spatial error model (SEM)**, $y = X\beta + u$ with $u = \lambda W u + \varepsilon$, puts the dependence in the disturbances instead: spatially correlated omitted factors, with no substantive spillover in the outcome equation.

The **SLX model**, $y = X\beta + WX\gamma + \varepsilon$, is the *local* alternative — neighbors' characteristics matter directly (each $\gamma$ is itself the spillover), but there is no outcome feedback.

The **spatial Durbin model (SDM)** combines the lag and SLX channels, $y = \rho W y + X\beta + WX\gamma + \varepsilon$, and is the workhorse of the convergence-with-spillovers literature precisely because it nests the others and lets the data separate the channels.

The built-in explainer:

```{python}
print(gm.explain("spatial_durbin_model").to_markdown()[:700], "...")
```

All of these are one `model=` switch away in `analyze_spatial_model` (`"ols"`, `"lag"`, `"error"`, `"slx"`, `"durbin"`, `"durbin_error"`), estimated by spreg.
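
The global feedback in the SAR channel can be made concrete with a small NumPy sketch. Everything in it is an assumption for illustration: a hypothetical 4-unit ring with row-standardized weights and a made-up $\rho$. It uses no geometrics or spreg calls. It solves the reduced form $y = (I - \rho W)^{-1}(X\beta + \varepsilon)$ for a shock that hits one unit only.

```python
import numpy as np

# Hypothetical 4-unit ring: each unit's neighbors are the two adjacent units,
# and every row sums to one (row standardization).
W = np.array(
    [
        [0.0, 0.5, 0.0, 0.5],
        [0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.5],
        [0.5, 0.0, 0.5, 0.0],
    ]
)
rho = 0.8  # illustrative value, not an estimate

# Spatial multiplier from the SAR reduced form: (I - rho W)^{-1}
multiplier = np.linalg.inv(np.eye(4) - rho * W)

# A one-unit shock to unit 0 only
shock = np.array([1.0, 0.0, 0.0, 0.0])
response = multiplier @ shock

# With row-standardized W, each row of the multiplier sums to 1 / (1 - rho)
row_sums = multiplier.sum(axis=1)

print("response by unit:", response.round(3))
print("row sums:", row_sums.round(3))
```

The shock reaches every unit, including the one that is not a neighbor of unit 0, which is what "global" means here. Under SLX the same shock would stop at the immediate neighbors.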
## Which model do the data ask for?

`analyze_spatial_diagnostics` estimates the non-spatial OLS benchmark — here the paper's unconditional convergence regression, growth of luminosity per capita 1996–2010 on its initial level, on the 1996 cross-section — then runs Moran's I on the residuals and the five LM tests, and applies the Anselin-Florax decision rule:

```{python}
growth = df.query("year == 1996")
diag = gm.analyze_spatial_diagnostics(
    growth,
    outcome="growth_ntl_pc_9610",
    covariates=["log_ntl_pc_1996"],
    gdf=gdf,
    w=w,
)
diag.df
```

```{python}
print(f"Recommendation: {diag.recommendation}\n")
print(diag.reasoning)
```

Residual Moran's I of 0.58 means OLS leaves a lot of spatial structure on the table. Both simple LM tests reject overwhelmingly, so the *robust* forms decide, and robust LM error wins by a wide margin — the mechanical rule points to the **error model**. The paper (and this article) nonetheless proceeds with the **spatial Durbin model**: the SDM nests the error model under a common-factor restriction and keeps the spatially lagged covariates that theory suggests, so it is the safe superset when any robust test fires. The diagnostics tell you dependence is real and must be modeled; they do not have the last word on which superset to estimate.

## The spatial Durbin model

`analyze_spatial_model` aligns the cross-section to the geometry and weights, estimates by maximum likelihood, and computes the LeSage-Pace impact decomposition from `betas` + `vm` with Monte-Carlo standard errors (`n_draws`; the result records how many were used):

```{python}
sdm = gm.analyze_spatial_model(
    growth,
    outcome="growth_ntl_pc_9610",
    covariates=["log_ntl_pc_1996"],
    gdf=gdf,
    w=w,
    model="durbin",
    n_draws=2000,
)
sdm.gt
```

The coefficient table is only the raw material.
The spatial autoregressive parameter is large — with $\rho \approx 0.8$, every local difference is amplified through the neighborhood multiplier $1/(1-\rho) \approx 5$ — and the reportable quantities are the impacts:

```{python}
print(f"rho = {sdm.rho:.3f} pseudo-R2 = {sdm.r2:.2f} "
      f"AIC = {sdm.aic:.0f} n = {sdm.n_obs}")
sdm.impacts.round(4)
```

```{python}
print(sdm.interpret())
```

The convergence reading: dimmer districts grow faster (**direct impact −0.026**), and the **total impact of −0.022** — the number that maps to the speed of convergence — folds in the spillover that arrives through the neighborhood. This is the machinery behind `analyze_beta_convergence(..., model="sdm")` in the [Analyze page](../analyze.qmd).

## SLX for contrast: local spillovers only

Setting `model="slx"` keeps the spatially lagged covariates but switches off the outcome feedback ($\rho = 0$). Impacts are then **analytic** — no Monte-Carlo draws needed — because direct = $\beta$ and indirect = $\gamma$ exactly, with no multiplier to simulate:

```{python}
slx = gm.analyze_spatial_model(
    growth,
    outcome="growth_ntl_pc_9610",
    covariates=["log_ntl_pc_1996"],
    gdf=gdf,
    w=w,
    model="slx",
)
slx.impacts.round(4)
```

The contrast is instructive: without the $\rho W y$ channel, the estimated spillover of initial luminosity is small and statistically indistinguishable from zero, and the model has no way to represent growth diffusing between neighbors. The SDM attributed most of the spatial action to the global feedback loop — exactly the channel SLX rules out by construction. AIC comparisons across `sdm.aic` and `slx.aic` (same sample, same W) make the same point.

## Is the result an artifact of the weights?

Every impact above is conditional on the 6-nearest-neighbor W.
`analyze_spatial_model_by_weights` re-estimates the same model under alternative specifications and compares the focal covariate's direct/indirect/total impacts against a baseline — the source paper's notebook-c07 robustness check:

```{python}
alt = {
    "knn4": gm.make_weights(gdf, method="knn", k=4, crs=None),
    "knn6": w,
    "knn8": gm.make_weights(gdf, method="knn", k=8, crs=None),
    "queen": gm.make_weights(gdf, method="queen"),
}
robust = gm.analyze_spatial_model_by_weights(
    growth,
    outcome="growth_ntl_pc_9610",
    covariates=["log_ntl_pc_1996"],
    gdf=gdf,
    weights=alt,
    baseline="knn6",
    n_draws=1000,
)
robust.fig
```

```{python}
print(robust.interpret())
```

The dot-whisker figure is the headline: the total impact keeps its sign in all four specifications and every 95% interval covers the baseline — the convergence conclusion does not hinge on the weights choice. The per-W detail (including each `w_spec` and AIC) lives in `robust.df`, and `robust.gt` renders the comparison table.

## Coefficients are not impacts

The closing rule of the spreg suite. In any model with $\rho W y$, a change in one district's covariate propagates to its neighbors and feeds back, so no single regression coefficient describes the association — reading $\beta$ off the SDM coefficient table understates or misstates everything the model has to say. The explainer:

```{python}
print(gm.explain("spatial_impacts").to_markdown()[:800], "...")
```

Report **direct** (own-district, feedback included), **indirect** (spillover to the rest of the map), and **total** impacts — never raw coefficients — and say which W they are conditional on. Every geometrics result does the second part for you via `w_spec`.
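
To see the gap between a coefficient and an impact in numbers, the sketch below computes the three LeSage-Pace summary measures by hand for one covariate in an SDM. It is an illustration under stated assumptions, not the geometrics or spreg implementation: the 4-unit ring and the values of `rho`, `beta` and `gamma` are hypothetical, and no Monte-Carlo standard errors are produced.

```python
import numpy as np

# Hypothetical row-standardized ring of 4 units
W = np.array(
    [
        [0.0, 0.5, 0.0, 0.5],
        [0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.5],
        [0.5, 0.0, 0.5, 0.0],
    ]
)
n = W.shape[0]

# Made-up SDM parameters for y = rho W y + x beta + W x gamma + eps
rho, beta, gamma = 0.8, -0.03, 0.02

# Matrix of partial derivatives dy_i / dx_j implied by the reduced form
S = np.linalg.inv(np.eye(n) - rho * W) @ (beta * np.eye(n) + gamma * W)

direct = np.trace(S) / n    # average own-unit effect, feedback included
total = S.sum() / n         # average row sum of the derivative matrix
indirect = total - direct   # average spillover

print(f"beta (raw coefficient) = {beta:.4f}")
print(f"direct = {direct:.4f}, indirect = {indirect:.4f}, total = {total:.4f}")
```

With row-standardized weights the total collapses to $(\beta + \gamma)/(1 - \rho)$, so a modest coefficient can imply a much larger total once the multiplier is applied. The direct impact also differs from $\beta$ because it includes the feedback that returns through the neighbors.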
## Where next

- [The India case study](india-case-study.qmd) — the full replication arc these pieces come from
- [Spatial dependence and LISA](spatial-dependence.qmd) — the exploratory half: weights, Moran's I, and cluster maps

## ----- articles/inequality.qmd -----

---
title: "Regional inequality"
subtitle: "Gini and Theil over time, the spatial Gini, and the between/within split"
---

Convergence asks whether poor regions catch up; **inequality analysis asks how far apart regions are right now, and whether that gap is closing**. This article tracks regional inequality in the bundled Indian district panel — 520 districts observed by DMSP-OLS nighttime lights, 1996–2010 — with the PySAL `inequality` stack: level measures over time, the Rey–Smith spatial Gini, and the Theil between/within decomposition by state.

```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)

# log-based measures need strictly positive values (see below)
bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique()
pos = gm.set_labels(
    df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True
)
gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy()
w_pos = gm.make_weights(gdf_pos, method="knn", k=6, crs=None)
```

## Gini or Theil?

The **Gini index** compares every pair of regions: it is the mean absolute difference between two randomly drawn units, scaled by twice the mean, running from 0 (everyone equal) to (nearly) 1 (one unit holds everything). It is the most widely reported inequality measure, robust to how you group the data, and most sensitive to transfers around the middle of the distribution — but it does not decompose cleanly into parts.

The **Theil index** is an entropy measure: it weighs each region's *share* of the total by the log of that share relative to an equal split.
It is more sensitive to the top of the distribution, and its killer feature is **exact additive decomposability** — total inequality splits into a between-group plus a within-group component, with nothing left over. That is what makes it the workhorse for regional hierarchies like districts within states:

```{python}
print(gm.explain("theil_decomposition").to_markdown()[:600], "...")
```

## Strictly positive, or geometrics tells you why not

The Theil index takes logarithms of shares, so it is undefined at zero. One Indian district (Lahul and Spiti, up in the Himalayas) records **zero luminosity** in some years — and instead of a cryptic numpy warning, geometrics raises a `ValueError` that names the offenders:

```{python}
try:
    gm.analyze_inequality_over_time(
        df, "ntl_total", measures=("gini", "theil", "cv")
    )
except ValueError as err:
    print(err)
```

That is why the first chunk filtered to the always-positive panel (`pos`, 519 districts) exactly as the [Explore page](../explore.qmd) does, and rebuilt the geometry and weights to match.

## Three measures, one trend test

`analyze_inequality_over_time` computes each requested measure per period, then regresses the **log** of each measure on time (OLS, HC1 standard errors). A negative, significant slope means inequality is narrowing — the inequality-narrative complement of σ-convergence:

```{python}
ineq = gm.analyze_inequality_over_time(
    pos, "ntl_total", measures=("gini", "theil", "cv")
)
ineq.df.round(3)
```

```{python}
ineq.fig
```

The trend table makes the verdict explicit:

```{python}
ineq.gt
```

```{python}
print(ineq.interpret())
```

All three measures agree: district-level luminosity inequality is high (Gini around 0.54) and essentially flat over 1996–2010 — no narrowing the trend test can distinguish from noise.

## The spatial Gini: inequality between neighbors

The Gini is a sum over *pairs* of regions, and every pair is either a **neighbor pair** or a **non-neighbor pair** under a spatial weights matrix.
Rey & Smith (2013) split it exactly along that line. Pass the geometry and weights and `analyze_inequality_over_time` adds the decomposition per period:

```{python}
sg = gm.analyze_inequality_over_time(
    pos,
    "ntl_total",
    measures=("gini", "theil", "cv"),
    gdf=gdf_pos,
    w=w_pos,
)
sg.df[["time", "n_units", "gini", "gini_spatial", "gini_spatial_p"]].round(4)
```

`gini_spatial` is the component of the overall Gini owed to differences between **neighboring** districts — here well under 1% of a Gini of ~0.54, so almost all pairwise inequality lives between districts that are *not* neighbors. Neighbors resemble each other; the big gaps are long-distance gaps. `gini_spatial_p` is a permutation pseudo p-value (99 permutations by default) testing whether the non-neighbor component exceeds what spatial randomness would produce — at p = 0.01 in every year, the spatial structure of inequality is no accident. The `w_spec` field records the weights used:

```{python}
print(sg.w_spec)
```

## How much inequality is *between states*?

Districts nest inside states, so the Theil index splits exactly into a **between-state** component (inequality across state means) and a **within-state** component (inequality among districts inside each state). With `permutations=99`, districts are randomly reassigned to states and `p_between` reports how often a random partition captures a between share at least as large:

```{python}
theil = gm.analyze_theil_decomposition(
    pos, "ntl_total", "state", permutations=99
)
theil.df.round(4)
```

```{python}
theil.fig
```

```{python}
print(theil.interpret())
```

About 60% of district luminosity inequality is a *between-state* phenomenon, and that share drifted down (62% → 59%) over the window — state means pulled slightly closer together while within-state gaps held.
The permutation p-values (0.01 in every period) say the state partition captures far more inequality than chance groupings do: in India, geography — which state you are in — is the dominant layer of regional inequality.

## A lightweight companion: 32 states, one year

`load_india_states()` ships a small state-level cross-section (32 states and union territories, corrected DMSP-OLS lights over gridded population, 1992). With a single year there is no trend to test — `analyze_inequality_over_time` needs at least two periods — but a one-off snapshot takes only a few lines:

```{python}
import pandas as pd
from inequality.gini import Gini
from inequality.theil import Theil

gdf_s, df_s, dict_s = gm.data.load_india_states()
df_s = gm.set_labels(df_s, dict_s, set_panel=True)

y = df_s["ntl_pc"].to_numpy(dtype=float)
pd.DataFrame(
    {
        "measure": ["Gini", "Theil", "CV"],
        "value": [Gini(y).g, Theil(y).T, y.std(ddof=1) / y.mean()],
    }
).round(3)
```

State-level inequality (Gini 0.39) is well below district-level inequality (0.54) — aggregation averages away the within-state gaps, which is precisely the between/within story again. The map shows where the bright and dark states are:

```{python}
gm.explore_choropleth_map(df_s, "log_ntl_pc", gdf=gdf_s, period=1992).fig
```

## Where next

- [Distribution dynamics](dynamics.qmd) — the same panel through Markov and spatial Markov chains: who moves within the distribution, and does the neighborhood condition mobility?
- [The India case study](india-case-study.qmd) — the full replication arc, with σ-convergence and this Theil decomposition in context
- `gm.explain("gini")`, `gm.explain("theil_index")`, `gm.explain("theil_decomposition")` — the concept explainers quoted above

## ----- articles/dynamics.qmd -----

---
title: "Distribution dynamics: Markov and spatial Markov"
subtitle: "Following the whole cross-sectional distribution — and asking whether geography conditions mobility"
---

A convergence regression compresses regional growth into one slope. Quah's critique is that the slope can mislead: β-convergence is perfectly compatible with a widening or *polarizing* distribution — poor regions can grow faster on average while the cross-section splits into "twin peaks" of rich and poor clubs. Distribution dynamics therefore follows the **entire cross-sectional distribution** over time — its shape period by period, and the movement of individual regions within it, summarized by Markov transition matrices:

```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)

# relative (mean-normalized) measures use the always-positive panel
bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique()
pos = gm.set_labels(
    df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True
)
gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy()
w_pos = gm.make_weights(gdf_pos, method="knn", k=6, crs=None)

print(gm.explain("distribution_dynamics").to_markdown()[:600], "...")
```

::: {.callout-note}
The two `analyze_*` functions in this article require the optional **giddy** dependency: `pip install "geometrics[dynamics]"` (it is included in `geometrics[all]`).
:::

## The shape of the distribution, period by period

`explore_distribution_over_time` estimates one kernel density per period on a shared grid.
With `relative=True` each district is divided by the period's cross-sectional mean first — the distribution-dynamics convention — so 1.0 marks the period average and the plot isolates changes in *shape* from changes in the overall level:

```{python}
gm.explore_distribution_over_time(pos, "ntl_total", relative=True).fig
```

The relative distribution is strongly right-skewed and stable: a heavy mass of districts below the mean and a long bright tail, with no sign of the mass narrowing toward 1. The same densities animate over the slider if you prefer one curve at a time:

```{python}
gm.explore_distribution_over_time(
    pos, "ntl_total", relative=True, kind="animated"
).fig
```

## Every district keeps its row

Densities hide *who* is where. The space-time heatmap pivots the panel to one row per district and one column per year, so persistence (rows keeping their shading left to right) is visible unit by unit. `sort_by="north_south"` orders rows by centroid latitude using the geometry, and `relative=True` compares districts within each year rather than tracking the level:

```{python}
gm.explore_spacetime_heatmap(
    pos, "ntl_total", gdf=gdf_pos, sort_by="north_south", relative=True
).fig
```

Read top to bottom, the bright bands are not scattered — brightness comes in latitudinal blocks that persist across all six years, a first hint that both the distribution *and* the map are sticky.

## Markov transitions: who moves between quintiles?
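
As background for this section, the following is a minimal NumPy sketch of what a pooled transition matrix, its steady state, and the Shorrocks index are. It is an illustration on made-up class codes, not the giddy or geometrics implementation; the helper names (`transition_matrix`, `steady_state`, `shorrocks`) and the toy data are assumptions introduced here.

```python
import numpy as np


def transition_matrix(classes: np.ndarray, k: int) -> np.ndarray:
    """Pool every period-to-period move into a row-stochastic k x k matrix.

    `classes` holds integer class codes 0..k-1, one row per unit and one
    column per period (hypothetical layout chosen for this sketch).
    """
    counts = np.zeros((k, k))
    origin = classes[:, :-1].ravel()       # class at t
    destination = classes[:, 1:].ravel()   # class at t + 1
    np.add.at(counts, (origin, destination), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows that are never visited stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


def steady_state(p: np.ndarray) -> np.ndarray:
    """Left eigenvector of P for the eigenvalue closest to 1, summing to 1."""
    values, vectors = np.linalg.eig(p.T)
    v = np.real(vectors[:, np.argmin(np.abs(values - 1.0))])
    return v / v.sum()


def shorrocks(p: np.ndarray) -> float:
    """Shorrocks mobility index: (k - trace(P)) / (k - 1)."""
    k = p.shape[0]
    return float((k - np.trace(p)) / (k - 1))


# Toy panel (made up): 4 units observed over 4 periods, 3 classes.
toy = np.array(
    [
        [0, 0, 1, 1],
        [1, 1, 1, 2],
        [2, 2, 2, 2],
        [0, 1, 0, 0],
    ]
)
P = transition_matrix(toy, k=3)
pi = steady_state(P)
mobility = shorrocks(P)

# Arithmetic check on the figures quoted in this article: with k = 5 classes,
# an average stay probability of 0.83 implies a Shorrocks index near 0.21.
doc_check = (5 - 5 * 0.83) / (5 - 1)
print(P.round(2), pi.round(2), round(mobility, 2), round(doc_check, 2))
```

The index only uses the diagonal, which is why a matrix dominated by its diagonal reads as low mobility.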
`analyze_markov_transitions` discretizes each year's relative luminosity into `k=5` quintiles and pools every year-to-year move into a transition-probability matrix — row `Q1`, column `Q2` is the probability that a bottom-quintile district climbs one class by the next period: ```{python} mk = gm.analyze_markov_transitions(pos, "ntl_total", k=5, relative=True) mk.fig ``` The result carries the long-run implications of the matrix — the ergodic (steady-state) distribution, the expected sojourn time in each class, and scalar mobility indices: ```{python} import pandas as pd long_run = pd.DataFrame( {"steady state": mk.steady_state, "sojourn (periods)": mk.sojourn} ) print(long_run.round(3)) print( f"\nmobility: Shorrocks {mk.shorrocks:.3f}, Prais {mk.prais:.3f}, " f"Bartholomew {mk.bartholomew:.3f} ({mk.n_transitions:,} transitions)" ) ``` ```{python} print(mk.interpret()) ``` The diagonal dominates: a district's position in the luminosity distribution is highly persistent (average stay probability 0.83, Shorrocks 0.21), and the extremes are the stickiest — a district entering the top quintile stays roughly fifteen periods. Low mobility is the transition-matrix face of the flat inequality trend in the [regional inequality article](inequality.qmd). ## Spatial Markov: does the neighborhood condition mobility? The classic chain treats every district's move as exchangeable. Rey's **spatial Markov** re-estimates the transition matrix *conditional on the spatial lag* — the average state of each district's 6 nearest neighbors — giving one matrix per neighborhood class, and the Bickenbach–Bode LR / Q tests of whether those conditional matrices differ from the pooled one: ```{python} smk = gm.analyze_spatial_markov(pos, "ntl_total", gdf=gdf_pos, w=w_pos, k=4) smk.fig ``` ```{python} smk.gt ``` ```{python} print(smk.interpret()) ``` Both homogeneity tests reject decisively (LR = 73.6, Q = 73.5, p < 0.001): transition dynamics are **not** the same in every neighborhood. 
Districts surrounded by dim neighbors move around more (and find it harder to hold a high rank), while districts embedded in bright neighborhoods stay put — geography conditions mobility, the distribution-dynamics counterpart of the spatial spillovers found by the SDM in the [India case study](india-case-study.qmd). ## Where next - [Regional inequality](inequality.qmd) — Gini/Theil levels and the between-state decomposition behind this persistence - [The India case study](india-case-study.qmd) — convergence regressions, spillovers, and convergence clubs on the same panel - `gm.explain("markov_chains")`, `gm.explain("spatial_markov")`, `gm.explain("mobility_measures")` — the concept explainers ## ----- articles/india-case-study.qmd ----- --- title: "The India case study" subtitle: "Regional growth, convergence, and spatial spillovers — a reproducible view from outer space" --- This article replicates and extends the analysis of *"Regional growth, convergence, and spatial spillovers in India"* ([Mendez, Kabiraj & Li](https://github.com/quarcs-lab/project2025s-py); building on Chanda & Kabiraj 2020, *World Development*): 520 Indian districts observed by radiance-calibrated DMSP-OLS **nighttime lights** between 1996 and 2010, used as a satellite proxy for economic activity. Three questions organize everything: 1. **Convergence** — do dimmer (poorer) districts grow faster than brighter ones? 2. **Spatial dependence** — do neighboring districts light up together? 3. **Spillovers** — does a neighborhood's brightness help local growth? ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) print(f"{gdf.shape[0]} districts x {df['year'].nunique()} years; " f"{df_dict.shape[0]} documented variables") ``` ## A view from space Total district luminosity, classified with Fisher-Jenks. 
The animation steps through all six satellite years with a **pooled** classification, so colors are comparable across frames: ```{python} gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, animate=True).fig ``` ## Spatial dependence The paper's weights are 6 nearest neighbors (built, like the paper, on plain lon/lat centroids — pass `crs=None`; the geometrics default would project first): ```{python} w = gm.make_weights(gdf, method="knn", k=6, crs=None) lisa_initial = gm.explore_lisa_cluster_map( df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996 ) lisa_initial.fig ``` Initial luminosity is strongly clustered — the paper reports Moran's I = 0.73: ```{python} print(f"Moran's I (initial log luminosity pc): {lisa_initial.moran_i:.2f} " f"(pseudo p = {lisa_initial.p_sim_global:.3f})") print(f"High-High districts: {lisa_initial.n_hh}, " f"Low-Low: {lisa_initial.n_ll}") ``` And so is growth: ```{python} growth_lisa = gm.explore_lisa_cluster_map( df.query("year == 1996"), "growth_ntl_pc_9610", gdf=gdf, w=w ) print(f"Moran's I (growth 1996-2010): {growth_lisa.moran_i:.2f} " f"(pseudo p = {growth_lisa.p_sim_global:.3f})") growth_lisa.fig ``` ## Convergence: OLS vs the spatial Durbin model The paper's dependent variable is the **per-capita** luminosity growth rate 1996-2010, shipped verbatim by `load_india()` (an honest per-capita *panel* is impossible — district population exists only for 1996 and 2001 — so the paper's pre-computed columns are carried unchanged). 
To run the paper's exact cross-section through the panel API, rebuild a two-period panel whose growth reproduces the paper's dependent variable identically: ```{python} import numpy as np import pandas as pd HORIZON = 14 # 1996 -> 2010 base = df.query("year == 1996")[ ["statedist", "state", "district", "ntl_pc_1996", "growth_ntl_pc_9610"] ] paper_panel = pd.concat( [ base.assign(year=1996, ntl_pc=base["ntl_pc_1996"]), base.assign( year=2010, ntl_pc=base["ntl_pc_1996"] * np.exp(HORIZON * base["growth_ntl_pc_9610"]), ), ], ignore_index=True, ) paper_panel = gm.set_panel(paper_panel, entity="statedist", time="year") ``` Unconditional convergence, first ignoring space, then with the SDM (the paper's Table 1, Model 1): ```{python} ols = gm.analyze_beta_convergence(paper_panel, "ntl_pc", model="ols") sdm = gm.analyze_beta_convergence( paper_panel, "ntl_pc", model="sdm", gdf=gdf, w=w, n_draws=5000 ) summary = pd.DataFrame( { "OLS": [ols.beta_total, np.nan, ols.beta_total, ols.speed, ols.half_life], "SDM": [sdm.beta_direct, sdm.beta_indirect, sdm.beta_total, sdm.speed, sdm.half_life], }, index=["direct", "indirect", "total", "speed (per yr)", "half-life (yr)"], ).round(4) summary ``` The headline finding: **spatial spillovers raise the estimated speed of convergence**. Part of every district's catch-up arrives through its neighborhood — the indirect impact — which OLS attributes to nothing. ```{python} sdm.fig ``` ```{python} print(sdm.interpret()) ``` ## Which spatial model do the data ask for? ```{python} diag = gm.analyze_spatial_diagnostics( df.query("year == 1996"), outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, w=w, ) print(diag.recommendation) print(diag.reasoning) diag.gt ``` ## Robustness to the weights choice The paper re-estimates its preferred SDM under seven alternative weights (notebook c07). 
Here are four: ```{python} alt = { "knn4": gm.make_weights(gdf, method="knn", k=4, crs=None), "knn6": w, "knn8": gm.make_weights(gdf, method="knn", k=8, crs=None), "queen": gm.make_weights(gdf, method="queen"), } robust = gm.analyze_spatial_model_by_weights( df.query("year == 1996"), outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, weights=alt, baseline="knn6", n_draws=2000, ) robust.fig ``` ## Local convergence: GWR Is the growth-initial association uniform across India? Geographically weighted regression maps the local convergence coefficient: ```{python} gwr = gm.analyze_gwr( df.query("year == 1996"), outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, ) print(f"adaptive bandwidth: {gwr.bw:.0f} neighbors; local R2 mean " f"{gwr.df['local_r2'].mean():.2f}") gwr.figs["log_ntl_pc_1996"] ``` ## Distribution dynamics Beyond the regression slope: how does the whole distribution move? (One district records zero luminosity in some years; log-based and relative measures use the always-positive panel.) 
```{python} bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique() pos = gm.set_labels( df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True ) gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy() w_pos = gm.make_weights(gdf_pos, method="knn", k=6, crs=None) gm.explore_distribution_over_time(pos, "ntl_total", relative=True).fig ``` Quintile-to-quintile mobility, then conditioned on the neighborhood: ```{python} mk = gm.analyze_markov_transitions(pos, "ntl_total", k=5, relative=True) print(f"Shorrocks mobility: {mk.shorrocks:.2f} " f"(diagonal persistence {np.diag(mk.p).mean():.2f})") mk.fig ``` ```{python} smk = gm.analyze_spatial_markov(pos, "ntl_total", gdf=gdf_pos, w=w_pos, k=4) print(f"Homogeneity LR test: {smk.lr_stat:.1f} (p = {smk.lr_p:.2g}) — " "transition dynamics differ by neighborhood") smk.fig ``` ```{python} print(smk.interpret()) ``` ## Regional inequality σ-convergence and the between/within split — how much of district inequality is *between states*? ```{python} sigma = gm.analyze_sigma_convergence(pos, "ntl_total") sigma.fig ``` ```{python} theil = gm.analyze_theil_decomposition(pos, "ntl_total", "state") theil.fig ``` ```{python} print(theil.interpret()) ``` ## Convergence clubs Finally, the Phillips-Sul log(t) machinery asks whether all districts share one steady-state path or sort into clubs: ```{python} clubs = gm.analyze_convergence_clubs(pos, "ntl_total", gdf=gdf_pos) print(f"{clubs.n_clubs} clubs, {clubs.n_divergent} divergent districts " f"(whole-panel log-t = {clubs.global_tstat:.1f})") clubs.fig_map ``` ## Sources - Mendez, C., Kabiraj, S., & Li, J. — *Regional growth, convergence, and spatial spillovers in India: A reproducible view from outer space* ([repository](https://github.com/quarcs-lab/project2025s-py), [interactive manuscript](https://quarcs-lab.github.io/project2025s-py/)) - Chanda, A., & Kabiraj, S. (2020). Shedding light on regional growth and convergence in India. *World Development*, 133. 
- Data: DMSP-OLS radiance-calibrated nighttime lights (NOAA/NGDC), district boundaries from the 2001 Census geography.

## ----- articles/bolivia-dataset.qmd -----

---
title: "The Bolivia dataset"
subtitle: "PWT-anchored local GDP at three spatial scales, 2012-2022"
---

geometrics ships a second case study: **BOL-005popAdj-PWTscaled**, a Bolivia subnational GDP product built from the 0.25° gridded estimates of Rossi-Hansberg & Zhang (2026) under their most aggressive low-density censoring (`0_05`), **rescaled so the national totals exactly equal Penn World Table 11.0** (`rgdpo` and `pop`). GDP and population are therefore in interpretable **2021 PPP US dollars**, while the model's relative spatial pattern is preserved exactly. The collection is delivered at three analysis scales — **departments** (ADM1, n=9), **provinces** (ADM2, n=112), and the raw **grid cells** (n=1,603) — for 2012–2022, with GADM 4.10 boundaries.

If you use it, cite the underlying estimates, the benchmark, and the boundaries: **Rossi-Hansberg & Zhang (2026, *J. Urban Economics* 154)**, **Feenstra, Inklaar & Timmer (2015, *AER* 105(10); PWT 11.0)**, and **GADM 4.10**. The full method (with the rescaling math) is documented in [`datasets/BOL-005popAdj-PWTscaled/README.md`](https://github.com/quarcs-lab/geometrics/tree/main/datasets/BOL-005popAdj-PWTscaled).
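
The rescaling idea described above can be sketched in a few lines of pandas. This is an illustration of proportional benchmarking under stated assumptions, not the dataset's build script (the README linked above is the authoritative description): the helper `rescale_to_benchmark`, the input column `gdp_model`, and every number are hypothetical.

```python
import pandas as pd


def rescale_to_benchmark(cells, value, benchmark, by="year", out="scaled"):
    """Scale `value` within each `by` group so group totals equal `benchmark`.

    One factor per group is applied to every cell, so each cell's share of
    the group total (the relative spatial pattern) is unchanged.
    """
    totals = cells.groupby(by)[value].transform("sum")
    factor = cells[by].map(benchmark) / totals
    return cells.assign(**{out: cells[value] * factor})


# Made-up model output for three cells over two years.
cells = pd.DataFrame(
    {
        "cell": ["a", "b", "c", "a", "b", "c"],
        "year": [2012, 2012, 2012, 2013, 2013, 2013],
        "gdp_model": [1.0, 3.0, 6.0, 2.0, 2.0, 4.0],
    }
)
national = {2012: 50.0, 2013: 40.0}  # made-up national benchmark totals

scaled = rescale_to_benchmark(cells, "gdp_model", national, out="gdp_pwt")

# Shares of the yearly total before and after rescaling.
shares_before = cells["gdp_model"] / cells.groupby("year")["gdp_model"].transform("sum")
shares_after = scaled["gdp_pwt"] / scaled.groupby("year")["gdp_pwt"].transform("sum")
print(scaled.groupby("year")["gdp_pwt"].sum())
```

Because a single factor multiplies every cell in a year, ratios between cells are untouched, which is the sense in which the relative pattern is preserved.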
## Three scales, one contract

Each loader returns the usual geometrics trio — ID-only geometry, long panel, data dictionary:

```{python}
import warnings
warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_bolivia()            # 112 provinces (ADM2)
df = gm.set_labels(df, df_dict, set_panel=True)
print(f"provinces: {gdf.shape[0]} polygons, panel {df.shape}")

gdf1, df1, dd1 = gm.data.load_bolivia_departments()  # 9 departments (ADM1)
df1 = gm.set_labels(df1, dd1, set_panel=True)
```

The dictionaries ship with the data and document every column, including the scaling provenance:

```{python}
df_dict[df_dict["var_name"].isin(["gid", "year", "gdp_pwt", "gdppc", "ln_gdppc"])]
```

One data fact to know: **five provinces have boundary polygons but no panel rows** — all of their grid cells fall below the `0_05` population-density censoring threshold. geometrics' alignment machinery warns about them and carries on; the warnings below are expected, not a bug.

## The map

```{python}
gm.explore_choropleth_map(df, "ln_gdppc", gdf=gdf, period=2022).fig
```

## Convergence across provinces, 2012-2022

Did poorer provinces grow faster? First aspatially, then with the spatial Durbin model on queen-contiguity weights:

```{python}
w = gm.make_weights(gdf)  # queen contiguity, row-standardized

ols = gm.analyze_beta_convergence(df, "gdppc", model="ols")
sdm = gm.analyze_beta_convergence(
    df, "gdppc", model="sdm", gdf=gdf, w=w, n_draws=2000
)
print(
    f"OLS beta: {ols.beta_total:.4f} (half-life {ols.half_life:.0f} yr)\n"
    f"SDM total: {sdm.beta_total:.4f} = direct {sdm.beta_direct:.4f} "
    f"+ indirect {sdm.beta_indirect:.4f} (rho = {sdm.rho:.2f})"
)
```

```{python}
sdm.fig
```

```{python}
print(sdm.interpret())
```

## How much inequality lies between departments?
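
For readers who want to see what a between/within split computes, here is a minimal NumPy sketch of the Theil T decomposition. It is not the `inequality` or geometrics implementation; the function names and toy values are assumptions introduced here, and the formula requires strictly positive values.

```python
import numpy as np


def theil(y) -> float:
    """Theil T index: mean of (y / mean) * log(y / mean)."""
    y = np.asarray(y, dtype=float)
    r = y / y.mean()
    return float(np.mean(r * np.log(r)))


def theil_decomposition(y, groups):
    """Return (between, within) components that add up to theil(y).

    between = sum_g s_g * log(s_g / p_g)
    within  = sum_g s_g * T_g
    where s_g is group g's share of the total and p_g its share of units.
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    n, total = y.size, y.sum()
    between = 0.0
    within = 0.0
    for g in np.unique(groups):
        yg = y[groups == g]
        share = yg.sum() / total
        between += share * np.log(share / (yg.size / n))
        within += share * theil(yg)
    return float(between), float(within)


# Made-up values: six regions nested in two parent groups.
y = [1.0, 2.0, 3.0, 6.0, 8.0, 10.0]
groups = ["A", "A", "A", "B", "B", "B"]

total_theil = theil(y)
between, within = theil_decomposition(y, groups)
print(f"T = {total_theil:.4f}, between share = {between / total_theil:.2f}")
```

The between term compares each group's share of the total with its share of units, so it vanishes when all group means are equal.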
The province panel carries its parent department (`name1`), so the Theil between/within decomposition is one call: ```{python} theil = gm.analyze_theil_decomposition(df, "gdppc", "name1") theil.fig ``` ```{python} print(theil.interpret()) ``` ## Down to the grid The raw 0.25° cells behind the admin aggregates — useful when administrative boundaries are themselves part of the question: ```{python} gdfg, dfg, ddg = gm.data.load_bolivia_grid() dfg = gm.set_labels(dfg, ddg, set_panel=True) print(f"cells: {gdfg.shape[0]}, panel {dfg.shape}") gm.explore_choropleth_map(dfg, "ln_gdppc", gdf=gdfg, period=2022, tiles=None).fig ``` And spatial structure at cell level — LISA clusters of log GDP per capita: ```{python} wg = gm.make_weights(gdfg) # queen on the grid = rook + corners lisa = gm.explore_lisa_cluster_map(dfg, "ln_gdppc", gdf=gdfg, w=wg, period=2022) print(f"Moran's I = {lisa.moran_i:.2f} (p = {lisa.p_sim_global:.3f}); " f"HH cells: {lisa.n_hh}, LL cells: {lisa.n_ll}") lisa.fig ``` ## Where next - [Beta and sigma convergence](convergence.qmd) — the convergence toolkit in depth - [Spatial spillovers](spillovers.qmd) — diagnostics, the spreg suite, robustness - [The data model](data-model.qmd) — bring your own (gdf, df, df_dict) - The other bundled case study: [India](india-case-study.qmd) ## ----- use-with-llms.qmd ----- --- title: "geometrics for AI agents and LLMs" --- This page is the contract an AI agent needs to use geometrics correctly: where the machine-readable documentation lives, how to install the package, the three-input data model, how to pick a function, and what every result object guarantees. ## Machine-readable entry points - **[llms.txt](https://quarcs-lab.github.io/geometrics/llms.txt)** — the curated index ([llmstxt.org](https://llmstxt.org) convention): what the package is, the docs pages, and the full public API grouped by prefix. Fetch this first. 
- **[llms-full.txt](https://quarcs-lab.github.io/geometrics/llms-full.txt)** — the full dump: every docs page's source (prose + code) plus every public signature and docstring. Use it when you need exact parameter names and defaults without importing the package.

Every page on this site also advertises both files through `<link>` tags in its `<head>`.

## Installing

```bash
pip install geometrics               # core (maps, ESDA, convergence, spreg, GWR)
pip install "geometrics[dynamics]"   # + Markov / spatial Markov (giddy)
pip install "geometrics[all]"        # everything, incl. static PNG export
```

- Python **3.11+**.
- Bleeding edge: `pip install "git+https://github.com/quarcs-lab/geometrics.git"`.
- In scripts, never call `.fig.show()` — persist figures with `res.fig.write_html("out.html")` (interactive) or `res.fig.write_image("out.png")` (needs the `png` extra; tile-free maps via `tiles=None` export deterministically).

## The three-input contract

geometrics separates geometry, data, and metadata. Users supply all three:

| Input | What it is | How it enters |
|---|---|---|
| `gdf` | Geometry with **only the entity ID** (+ optional name) — shapefile, zipped shapefile, GeoJSON, GeoPackage, or a GeoDataFrame | `gm.read_gdf("districts.gpkg", entity="district_id")` |
| `df` | A **long-form panel** — one row per (entity, time) | `gm.set_panel(df, entity="district_id", time="year")` |
| `df_dict` | A **6-column data dictionary** — `var_name, var_def, label, type, role, can_be_na` | `gm.set_labels(df, df_dict, set_panel=True)` |

Declare once, use everywhere:

```python
import geometrics as gm

gdf, df, df_dict = gm.data.load_india()          # or the user's own three inputs
df = gm.set_labels(df, df_dict, set_panel=True)  # wires entity/time into df.attrs
w = gm.make_weights(gdf, method="knn", k=6)      # weights are built once, explicitly
```

Rules an agent must follow:

- The vocabulary is always **`entity`** and **`time`**.
  After `set_panel` / `set_labels(..., set_panel=True)` they resolve automatically from `df.attrs`; an explicit `entity=` / `time=` argument always wins.
- **`gdf` and `w` are always explicit keyword arguments** — geometry is data, never hidden state. Any function that draws a map or estimates a spatial model takes `gdf=` (and usually `w=`).
- `type` in the dictionary is one of `entity / time / factor / logical / numeric`; `role` is one of `"" / outcome / covariate / entity_name`.

## Picking a function

The public API is grouped by prefix — the module map of this site:

- **`explore_*`** — describe and visualize: choropleths, the weights connectivity graph, Moran scatterplots, LISA cluster maps, Moran's I over time, distribution ridgelines, space-time heatmaps.
- **`analyze_*`** — estimate and test: β/σ/club convergence, the spreg suite (OLS/SAR/SEM/SLX/SDM) with LeSage-Pace impacts and LM diagnostics, Markov and spatial Markov transitions, Gini/Theil inequality, GWR/MGWR.
- **`learn_*`** — teaching sandboxes that simulate data from a known data-generating process so a learner can watch an estimator recover a planted parameter. **Never point them at user data** — they take knobs (e.g. `rho=`, `seed=`), not DataFrames.
- Unprefixed **utilities** — `read_gdf`, `make_weights`, `set_panel`, `set_labels`, `build_data_dict`, `explain`, `list_topics`, ...

The full membership of each group is listed in [llms.txt](https://quarcs-lab.github.io/geometrics/llms.txt) and the [API reference](reference/index.qmd).

## The result-object contract

Every public function returns a **frozen dataclass**. Depending on the function it exposes:

| Attribute | What it is |
|---|---|
| `.df` | The tidy result DataFrame (always present) |
| `.fig` | An interactive Plotly figure (themed; never auto-shown) |
| `.gt` | A publication-ready Great Tables object (estimation tables) |
| named scalars | e.g. `beta`, `speed`, `half_life`, `rho`, `moran_i`, `p_value` |
| `.interpret()` | A plain-language reading of this specific result |
| `.explain()` | The concept explainer behind the method |
| `notes` | Advisory notes accumulated during estimation |
| `w_spec` | A human-readable description of the spatial weights used |

Results are immutable — build a new call rather than mutating a result.

## Interpretation guardrails

- Lead with `res.interpret()` when reporting to a user — it is written in **association-only** language. Follow it: "is associated with", never "causes" or "the effect of".
- Ground concepts with the built-in explainer registry: `gm.list_topics()` → `gm.explain("spatial_autocorrelation")`.

## One runnable recipe

```python
import geometrics as gm

gdf, df, df_dict = gm.data.load_india()  # ID-only geometry, long panel, dictionary
df = gm.set_labels(df, df_dict, set_panel=True)
w = gm.make_weights(gdf, method="knn", k=6)

lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w)
print(lisa.interpret())                  # where the hot/cold spots are

res = gm.analyze_beta_convergence(df, "ntl_total", model="sdm", gdf=gdf, w=w)
print(res.interpret())                   # catch-up, speed, spillovers
res.fig.write_html("convergence.html")
```

## Links

- [Repository](https://github.com/quarcs-lab/geometrics) · [API reference](reference/index.qmd) · [Changelog](changelog.qmd)
- Bundled case studies: `gm.data.load_india()`, `gm.data.load_bolivia()` / `load_bolivia_departments()` / `load_bolivia_grid()`

## ----- changelog.qmd -----

---
title: "Changelog"
---

## v0.1.3 (2026-07-02)

- **Three no-code Streamlit apps** — one per module, sharing a lean shell (bundled case-study picker + spatial-weights controls): the **Explore app** (choropleths, connectivity, Moran/LISA, distributions over time), the **Analyze app** (β/σ convergence, clubs, spatial models with impacts and LM diagnostics, weights robustness, Markov dynamics, inequality/Theil, button-gated GWR), and the **Learn app** (all 11
sandboxes with sliders + the searchable explainer browser). Pages gate themselves on the active dataset (panel length, unit count, optional extras). - New `streamlit` extra (`pip install "geometrics[streamlit]"`, included in `[all]`), Python launchers `ExploreApp` / `AnalyzeApp` / `LearnApp`, console scripts `geometrics-explore` / `-analyze` / `-learn`, and repo-root entry points (`app_explore.py`, `app_analyze.py`, `app_learn.py`, `streamlit_app.py` chooser) for Streamlit Community Cloud. - The site now presents the library in the three modules (v0.1.2/v0.1.3 together): per-module pedagogical pages, Colab notebooks, and the rewritten landing page. ## v0.1.2 (2026-07-02) - **The Learn module: 11 `learn_*` concept sandboxes.** Each simulates data from a known data-generating process so a learner can watch the estimator recover a planted parameter, returning a frozen `SandboxResult` (`.df` estimated-vs-truth table, `.fig`, `.summary` scalar facts, `.data` raw simulated frame, `.interpret()` / `.explain()`): `learn_spatial_autocorrelation`, `learn_spatial_weights`, `learn_lisa_clusters`, `learn_spatial_spillovers` (closed-form LeSage-Pace truth), `learn_omitted_spatial_lag`, `learn_beta_convergence`, `learn_sigma_convergence`, `learn_convergence_clubs`, `learn_markov_chains`, `learn_spatial_markov` (both via the `dynamics` extra), and `learn_theil_decomposition` (independent numpy truth). - The shared synthetic geographies and planted-parameter processes now live in `geometrics.sandbox._dgp`; the test suite's known-answer fixtures delegate to them. - Visual identity: the "classified lattice" logo/favicon and a hero image built from the real India LISA cluster map; new [For AI / LLMs](use-with-llms.qmd) page. ## v0.1.1 (2026-07-02) - **New dataset: Bolivia (BOL-005popAdj-PWTscaled)** — the 0.25° gridded GDP of Rossi-Hansberg & Zhang (2026) under `0_05` censoring, rescaled so national totals equal Penn World Table 11.0 (2021 PPP US$), 2012–2022, on GADM 4.10 boundaries. 
Committed under `datasets/` in this repository and served from pinned, hash-verified raw URLs. - New loaders: `load_bolivia()` (112 provinces; 5 fully-censored provinces have polygons but no panel rows — documented), `load_bolivia_departments()` (9 departments), `load_bolivia_grid()` (1,603 cells with a synthesized single `cell` entity id), `load_bolivia_raw()` (any level incl. ADM0). - New article: [The Bolivia dataset](articles/bolivia-dataset.qmd). ## v0.1.0 (2026-07-02) First public release. - **The three-input data contract**: `read_gdf` (shapefile / zipped shapefile / GeoJSON / GeoPackage → ID-only geometry), a long-form panel declared with `set_panel` / `set_labels`, and a six-column data dictionary (`df_dict`, inferable with `build_data_dict`). - **Maps & ESDA**: `explore_choropleth_map` (classified, animated), `explore_connectivity_map`, `explore_moran_plot`, `explore_lisa_cluster_map`, `explore_moran_over_time`, `explore_distribution_over_time`, `explore_spacetime_heatmap`. - **Convergence**: `analyze_beta_convergence` (OLS / SAR / SEM / SLX / SDM with LeSage-Pace impacts and Monte-Carlo standard errors), `analyze_sigma_convergence`, `analyze_convergence_clubs` (Phillips-Sul log(t) with club maps). - **Spatial econometrics**: `analyze_spatial_model`, `analyze_spatial_diagnostics` (Anselin-Florax recommendation), `analyze_spatial_model_by_weights`. - **Distribution dynamics**: `analyze_markov_transitions`, `analyze_spatial_markov` (via the `dynamics` extra). - **Inequality**: `analyze_inequality_over_time` (Gini / Theil / CV, spatial Gini), `analyze_theil_decomposition`. - **Local models**: `analyze_gwr`, `analyze_mgwr`. - **Pedagogy**: every result carries `.interpret()` and `.explain()`; 30 concept explainers registered. 
- **Data**: the Indian district case study (520 districts, DMSP-OLS nighttime lights 1996-2010) and a 32-state demo, fetched from [quarcs-lab/project2025s-py](https://github.com/quarcs-lab/project2025s-py) at a pinned commit with hash verification. # ===== Public API signatures ===== ## explore_* ### explore_choropleth_map(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', period: 'Any' = None, animate: 'bool' = False, entity: 'str | None' = None, time: 'str | None' = None, scheme: 'str | None' = 'fisherjenks', k: 'int' = 5, bins: 'Sequence[float] | None' = None, tiles: 'str | None' = 'carto-positron', hover: 'str | Sequence[str] | None' = None, simplify: 'float | str | None' = 'auto', title: 'str | None' = None) -> 'ChoroplethMapResult' Map one variable across entities as a classed (or continuous) choropleth. The panel ``df`` is aligned to the entity geometry ``gdf`` for one cross section (the latest period by default) and drawn as a Plotly choropleth with one legend-togglable trace per class. With ``animate=True`` every period becomes an animation frame, classified on **pooled** breaks (computed from all periods together) so colors are comparable over time. Parameters ---------- df Long panel (or cross section) holding ``var`` per entity. var Numeric column of ``df`` to map. gdf Entity geometry; must carry the same entity-id column as ``df``. period Period to map. Defaults to the latest period when ``df`` has a time dimension (a note records this). Ignored when ``animate=True``. animate Draw every period as an animation frame with a slider and play button. entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. scheme A mapclassify scheme name (``"fisherjenks"``, ``"quantiles"``, ...), or ``None`` for a continuous colorbar map. k Number of classes for ``scheme`` (ignored when ``bins`` are given). bins Explicit upper class bounds (overrides ``scheme`` / ``k``). 
tiles MapLibre base-map style (default ``"carto-positron"``) or ``None`` for the vector backend (deterministic PNG export). hover Extra ``df`` column(s) appended to the hover box. simplify Geometry simplification: ``"auto"`` (metric tolerance = max bounding-box dimension / 2000), a float tolerance in meters, or ``None`` to disable. title Figure title. Defaults to the variable label plus the mapped period. Returns ------- ChoroplethMapResult Frozen result with ``df`` (entity, value, class label), ``fig``, ``gdf_plotted`` (the WGS84 geometry actually drawn, value and class attached), the applied ``bins``, and ``notes``. Examples -------- Map a small two-period panel (latest period by default): ```python import geopandas as gpd import pandas as pd from shapely.geometry import box from geometrics.maps import explore_choropleth_map gdf = gpd.GeoDataFrame( {"region": ["a", "b", "c", "d"]}, geometry=[box(i % 2, i // 2, i % 2 + 1, i // 2 + 1) for i in range(4)], crs="EPSG:4326", ) df = pd.DataFrame( { "region": ["a", "b", "c", "d"] * 2, "year": [2000] * 4 + [2010] * 4, "gdppc": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5], } ) res = explore_choropleth_map( df, "gdppc", gdf=gdf, entity="region", time="year", k=2, tiles=None ) print(res.period, res.bins) ``` ### explore_connectivity_map(gdf: 'gpd.GeoDataFrame', *, w: 'W | None' = None, entity: 'str | None' = None, tiles: 'str | None' = 'carto-positron', title: 'str | None' = None) -> 'ConnectivityMapResult' Draw the spatial weights graph over the map and summarize its connectivity. The figure overlays the neighbor graph (edges between adjacent centroids, one node per unit) on a light-grey polygon layer, and the companion histogram shows the neighbor-cardinality distribution — the standard visual audit of a ``W`` before any spatial statistic is computed on it. Parameters ---------- gdf Geometry frame (see :func:`geometrics.read_gdf`). w ``libpysal`` weights aligned to the gdf entity ids. 
``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. entity Entity id column of ``gdf``; resolved automatically when ``None``. tiles MapLibre basemap style (e.g. ``"carto-positron"``). ``None`` draws a vector (tile-free) figure suitable for deterministic PNG export. title Figure title (a default naming the weights is used when ``None``). Returns ------- ConnectivityMapResult The per-entity neighbor-cardinality frame, the graph figure (``fig``), the cardinality histogram (``fig_hist``), the connectivity scalars and ``w_spec``. Examples -------- Connectivity of a two-cell map (each unit has exactly one neighbor): ```python import geopandas as gpd from shapely.geometry import box from geometrics.weights import explore_connectivity_map, make_weights gdf = gpd.GeoDataFrame( {"region": ["A", "B"]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)], crs="EPSG:4326", ) res = explore_connectivity_map(gdf, w=make_weights(gdf), tiles=None) (res.n_units, res.mean_neighbors) ``` ### explore_moran_plot(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, permutations: 'int' = 999, seed: 'int | None' = 12345, title: 'str | None' = None) -> 'MoranPlotResult' Draw the Moran scatterplot and test global spatial autocorrelation in ``var``. The panel is aligned to the geometry for one cross-section (the latest period by default), the variable is z-standardized, and its row-standardized spatial lag is plotted against it, colored by scatter quadrant (HH, LH, LL, HL). The OLS slope of the fitted line equals global Moran's I under row-standardized weights, whose significance is assessed with ``permutations`` conditional permutations (:class:`esda.moran.Moran`). Parameters ---------- df Long panel (or cross-section) holding ``var`` per entity. var Numeric column of ``df`` to test. 
gdf Entity geometry; must carry the same entity-id column as ``df``. w ``libpysal`` weights aligned to the gdf entity ids. ``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. esda row-standardizes the weights for the statistic (its ``transformation="r"`` convention), which is also what makes the scatter slope equal Moran's I. period Period to analyze. Defaults to the latest period when ``df`` has a time dimension (a note records this). entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. permutations Number of conditional permutations behind ``p_sim`` / ``z_sim``. seed esda's global :class:`~esda.moran.Moran` draws its permutations from NumPy's **global** random state and exposes no seed argument, so when ``seed`` is not ``None`` geometrics calls ``numpy.random.seed(seed)`` immediately before the test to make ``p_sim`` reproducible. Pass ``None`` to leave the global state untouched. title Figure title. Defaults to ``"Moran scatterplot: ()"``. Returns ------- MoranPlotResult Frozen result with ``df`` (``entity``, standardized ``value``, spatial ``lag``, ``quadrant``), the quadrant-colored scatter ``fig``, the global Moran scalars (``moran_i``, ``expected_i``, ``p_sim``, ``z_sim``) and ``w_spec``. 
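The slope identity described for the Moran scatterplot (the OLS slope equals global Moran's I under row-standardized weights) can be checked by hand. What follows is a minimal NumPy sketch, not geometrics code: it assumes a four-cell strip in which only adjacent cells are neighbors, and the names `adjacency`, `w_row`, `z` and `lag` are illustrative.

```python
import numpy as np

# Hypothetical four-cell strip a-b-c-d: each cell neighbors the adjacent cell(s).
adjacency = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)
# Row-standardize so that every row sums to 1.
w_row = adjacency / adjacency.sum(axis=1, keepdims=True)

values = np.array([1.0, 2.0, 3.0, 4.0])
z = (values - values.mean()) / values.std()  # z-standardized variable
lag = w_row @ z  # spatial lag: mean z over each cell's neighbors

# Global Moran's I: (n / S0) * z'Wz / z'z, where S0 is the sum of all weights.
n = len(z)
s0 = w_row.sum()
moran_i = (n / s0) * (z @ lag) / (z @ z)

# OLS slope of the spatial lag on the standardized variable.
slope, intercept = np.polyfit(z, lag, 1)

# Scatter quadrant of each cell from the signs of (z, lag).
quadrant = np.where(
    z >= 0,
    np.where(lag >= 0, "HH", "HL"),
    np.where(lag >= 0, "LH", "LL"),
)

print(round(float(moran_i), 3), round(float(slope), 3), quadrant.tolist())
```

With row-standardized weights `S0` equals `n`, so the statistic reduces to `z'Wz / z'z`, which is exactly the regression slope of the lag on `z`.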
Examples -------- Moran's I on a four-cell strip where value increases smoothly west to east: ```python import geopandas as gpd import pandas as pd from shapely.geometry import box from geometrics.dependence import explore_moran_plot from geometrics.weights import make_weights gdf = gpd.GeoDataFrame( {"region": ["a", "b", "c", "d"]}, geometry=[box(i, 0, i + 1, 1) for i in range(4)], crs="EPSG:4326", ) df = pd.DataFrame({"region": ["a", "b", "c", "d"], "gdppc": [1.0, 2.0, 3.0, 4.0]}) res = explore_moran_plot( df, "gdppc", gdf=gdf, w=make_weights(gdf), entity="region", permutations=99 ) print(res.df["quadrant"].tolist(), round(res.moran_i, 3)) ``` ### explore_lisa_cluster_map(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, permutations: 'int' = 999, seed: 'int | None' = 12345, alpha: 'float' = 0.05, tiles: 'str | None' = 'carto-positron', title: 'str | None' = None) -> 'LisaClusterMapResult' Map local Moran (LISA) clusters of ``var`` and the matching Moran scatterplot. :class:`esda.moran.Moran_Local` assigns each entity a scatter quadrant (via its ``q``: 1=HH, 2=LH, 3=LL, 4=HL) and a permutation pseudo p-value; entities with ``p_sim < alpha`` receive their quadrant's cluster label (High-High hot spots, Low-Low cold spots, Low-High / High-Low spatial outliers) and everything else is ``"Not significant"``. The cluster map uses the ecosystem-fixed LISA colors (GeoDa / splot convention); ``fig_scatter`` is the same Moran scatter as :func:`explore_moran_plot`, colored by cluster. Global Moran's I accompanies the local statistics (``moran_i``, ``p_sim_global``). Parameters ---------- df Long panel (or cross-section) holding ``var`` per entity. var Numeric column of ``df`` to analyze. gdf Entity geometry; must carry the same entity-id column as ``df``. w ``libpysal`` weights aligned to the gdf entity ids. 
``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. period Period to analyze. Defaults to the latest period when ``df`` has a time dimension (a note records this). entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. permutations Number of conditional permutations behind the local and global pseudo p-values. seed Reproducibility seed. ``Moran_Local`` accepts it directly; the global :class:`~esda.moran.Moran` has no seed argument, so ``numpy.random.seed(seed)`` is set immediately before it (see :func:`explore_moran_plot`). ``None`` leaves the random state untouched. alpha Significance level masking the cluster labels (``p_sim < alpha``). tiles MapLibre base-map style for the cluster map (default ``"carto-positron"``) or ``None`` for the vector backend (deterministic PNG export). title Cluster-map title. Defaults to ``"LISA clusters: ()"``; the scatter always uses its own composed title. Returns ------- LisaClusterMapResult Frozen result with the per-entity frame (``entity``, standardized ``value``, ``lag``, ``local_i``, ``quadrant``, ``p_sim``, ``cluster``), the cluster map ``fig``, the cluster-colored ``fig_scatter``, the global test scalars, the per-class counts and ``w_spec``. 
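The masking rule for the cluster labels (``p_sim < alpha`` keeps the quadrant label, everything else is ``"Not significant"``) can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the arrays `q` and `p_sim` are made-up values shaped like ``Moran_Local`` output, and the exact label strings other than ``"Not significant"`` are assumed.

```python
import numpy as np

# Hypothetical Moran_Local-style outputs for five entities (illustrative values).
q = np.array([1, 3, 2, 4, 1])  # quadrant codes: 1=HH, 2=LH, 3=LL, 4=HL
p_sim = np.array([0.01, 0.03, 0.20, 0.04, 0.50])  # permutation pseudo p-values
alpha = 0.05

# Assumed label strings; the library's exact wording may differ.
labels = {1: "High-High", 2: "Low-High", 3: "Low-Low", 4: "High-Low"}

# Keep the quadrant label only where the local statistic is significant.
cluster = [
    labels[int(code)] if p < alpha else "Not significant"
    for code, p in zip(q, p_sim)
]
print(cluster)
```

Raising or lowering `alpha` changes only the mask, never the underlying quadrant assignment.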
Examples -------- LISA on a four-cell strip (tiny n, so nothing is significant at 5%): ```python import geopandas as gpd import pandas as pd from shapely.geometry import box from geometrics.dependence import explore_lisa_cluster_map from geometrics.weights import make_weights gdf = gpd.GeoDataFrame( {"region": ["a", "b", "c", "d"]}, geometry=[box(i, 0, i + 1, 1) for i in range(4)], crs="EPSG:4326", ) df = pd.DataFrame({"region": ["a", "b", "c", "d"], "gdppc": [1.0, 2.0, 3.0, 4.0]}) res = explore_lisa_cluster_map( df, "gdppc", gdf=gdf, w=make_weights(gdf), entity="region", permutations=99, tiles=None, ) print(res.df["cluster"].tolist()) ``` ### explore_moran_over_time(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, entity: 'str | None' = None, time: 'str | None' = None, permutations: 'int' = 999, seed: 'int | None' = 12345, title: 'str | None' = None) -> 'MoranOverTimeResult' Track global Moran's I in ``var`` period by period on a fixed entity set. The long panel is pivoted to one row per entity and one column per period, and a **single** entity set — the entities with complete data across every kept period — is used throughout, so the same (possibly restricted) weights apply to every period and the Moran's I values are comparable over time. Periods with no data at all and entities with incomplete series are dropped with a note. Each period's test uses ``permutations`` conditional permutations (:class:`esda.moran.Moran`). Parameters ---------- df Long panel holding ``var`` per (entity, period). var Numeric column of ``df`` to track. gdf Entity geometry; must carry the same entity-id column as ``df``. w ``libpysal`` weights aligned to the gdf entity ids. ``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. When entities drop, the weights are restricted (``w_subset``) and their transform re-applied. 
entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. Both are required here. permutations Number of conditional permutations behind each period's ``p_sim`` / ``z_sim``. seed esda's :class:`~esda.moran.Moran` draws its permutations from NumPy's **global** random state and exposes no seed argument, so when ``seed`` is not ``None`` geometrics calls ``numpy.random.seed(seed)`` immediately before *each* period's test — every period's p-value is then reproducible on its own, independent of which other periods are present. ``None`` leaves the random state untouched. title Figure title. Defaults to ``"Global Moran's I over time: "``. Returns ------- MoranOverTimeResult Frozen result with one row per period (``period``, ``moran_i``, ``z_sim``, ``p_sim``, ``n_obs``), the line-and-marker ``fig`` (filled markers flag ``p_sim < 0.05``; the dashed line marks E[I] under spatial randomness) and ``w_spec``. Examples -------- Two periods on a four-cell strip with a smooth west-east gradient: ```python import geopandas as gpd import pandas as pd from shapely.geometry import box from geometrics.dependence import explore_moran_over_time from geometrics.weights import make_weights gdf = gpd.GeoDataFrame( {"region": ["a", "b", "c", "d"]}, geometry=[box(i, 0, i + 1, 1) for i in range(4)], crs="EPSG:4326", ) df = pd.DataFrame( { "region": ["a", "b", "c", "d"] * 2, "year": [2000] * 4 + [2010] * 4, "gdppc": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5], } ) res = explore_moran_over_time( df, "gdppc", gdf=gdf, w=make_weights(gdf), entity="region", time="year", permutations=99, ) print(res.df[["period", "n_obs"]].to_dict("list")) ``` ### explore_distribution_over_time(df: 'pd.DataFrame', var: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, relative: 'bool' = False, periods: 'Sequence[Any] | None' = None, kind: "Literal['ridgeline', 'animated']" = 'ridgeline', bandwidth: 'float | str | None' = None, title: 'str | None' = None) -> 
'DistributionOverTimeResult' Track how the cross-sectional distribution of one variable evolves over time. A Gaussian kernel density of ``var`` is estimated per period (:class:`scipy.stats.gaussian_kde`) and evaluated on a single grid shared by all periods, so the densities are directly comparable. ``kind="ridgeline"`` stacks one filled density per period with a subtle vertical offset (newest period on top); ``kind="animated"`` shows a single density animated over the periods with a play button and slider. Parameters ---------- df Long panel holding ``var`` per entity and period. var Numeric column of ``df`` whose distribution is tracked. entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. A ``time`` id is required. relative Divide ``var`` by its cross-sectional mean per period before density estimation (the distribution-dynamics convention): 1.0 marks the period average and a dashed vertical line is drawn at 1. periods Subset of periods to include (default: all periods in ``df``). Unknown periods raise :class:`ValueError`. kind ``"ridgeline"`` (stacked filled densities, one trace per period) or ``"animated"`` (one density trace animated over periods with a slider). bandwidth Kernel bandwidth passed to :class:`scipy.stats.gaussian_kde` as ``bw_method`` (a scalar factor or ``"scott"`` / ``"silverman"``). ``None`` uses scipy's default (Scott's rule). title Figure title. Defaults to a description built from the variable label. Returns ------- DistributionOverTimeResult Frozen result with the tidy evaluation frame ``df`` (columns ``time``, ``value``, ``density``), the themed ``fig``, and ``notes``. Raises ------ KeyError If ``var`` is not a column of ``df``. TypeError If ``var`` is not numeric. ValueError If no ``time`` id resolves, ``kind`` is unknown, a requested period is absent, or a period has fewer than 2 distinct values. 
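The shared-grid idea behind the per-period densities can be sketched directly with :class:`scipy.stats.gaussian_kde`. This is a minimal illustration, not the library's code: the `panel` dictionary, the grid size of 400 points and the padding of one full data range are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical two-period cross-sections (eight units each).
panel = {
    2000: np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
    2010: np.array([2.0, 2.5, 3.5, 4.5, 5.0, 6.5, 7.0, 7.5]),
}
relative = False  # True divides each period by its own cross-sectional mean

if relative:
    panel = {period: values / values.mean() for period, values in panel.items()}

# One grid shared by all periods, padded so the kernel tails are covered.
# Padding by one full data range is an arbitrary choice for this sketch.
pooled = np.concatenate(list(panel.values()))
span = pooled.max() - pooled.min()
grid = np.linspace(pooled.min() - span, pooled.max() + span, 400)

# bw_method=None falls back to scipy's default (Scott's rule).
densities = {
    period: gaussian_kde(values, bw_method=None)(grid)
    for period, values in panel.items()
}

# Each density should integrate to roughly 1 over the shared grid.
step = grid[1] - grid[0]
for period, density in densities.items():
    print(period, round(float(density.sum() * step), 3))
```

Because every period is evaluated at the same grid points, the curves can be overlaid or stacked without any rescaling.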
Examples -------- Ridgeline of a small two-period panel: ```python import pandas as pd from geometrics.spacetime import explore_distribution_over_time df = pd.DataFrame( { "region": list("abcdefgh") * 2, "year": [2000] * 8 + [2010] * 8, "gdppc": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0] + [2.0, 2.5, 3.5, 4.5, 5.0, 6.5, 7.0, 7.5], } ) res = explore_distribution_over_time(df, "gdppc", entity="region", time="year") len(res.fig.data) ``` ### explore_spacetime_heatmap(df: 'pd.DataFrame', var: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, gdf: 'gpd.GeoDataFrame | None' = None, sort_by: "Literal['value', 'name', 'north_south', 'east_west']" = 'value', relative: 'bool' = False, title: 'str | None' = None) -> 'SpacetimeHeatmapResult' Draw one variable as an entity-by-time heatmap (every unit keeps its row). The long panel is pivoted to one row per entity and one column per period and drawn as a heatmap on the library's sequential scale, so persistence (rows that keep their shading left to right) and mobility (rows that change shade) are visible unit by unit. Parameters ---------- df Long panel holding ``var`` per entity and period. var Numeric column of ``df`` to draw. entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. Both are required. gdf Entity geometry, required for the geographic row orders (``sort_by="north_south"`` / ``"east_west"``); entities are matched on the gdf's entity-id column. sort_by Row order: ``"value"`` (mean value per unit, highest first), ``"name"`` (alphabetical), ``"north_south"`` (metric-CRS centroid latitude, north first) or ``"east_west"`` (centroid longitude, west first). relative Divide each column by its period mean, so 1.0 marks the period average and shading compares units within a period rather than tracking the level. title Figure title. Defaults to a description built from the variable label. 
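The pivot, the ``sort_by="value"`` row order and the ``relative`` rescaling described in these parameters can be sketched with plain pandas. This is a hedged illustration of the described behavior, not the function's internals; the variable names `pivot` and `relative` are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["a", "b", "a", "b"],
        "year": [2000, 2000, 2010, 2010],
        "gdppc": [1.0, 2.0, 1.5, 2.5],
    }
)

# One row per entity, one column per period.
pivot = df.pivot(index="region", columns="year", values="gdppc")

# sort_by="value": order rows by each unit's mean value, highest first.
pivot = pivot.loc[pivot.mean(axis=1).sort_values(ascending=False).index]

# relative=True: divide each column by its period mean, so 1.0 is the period average.
relative = pivot / pivot.mean(axis=0)

print(relative.round(3))
```

In the relative view every column averages to 1.0, so shading compares units within a period instead of tracking the level.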
Returns ------- SpacetimeHeatmapResult Frozen result with ``df`` (the entity-by-time pivot in display order, entities as the index), the themed ``fig``, and ``notes``. Raises ------ KeyError If ``var`` is not a column of ``df``. TypeError If ``var`` is not numeric, or ``gdf`` is not a GeoDataFrame. ValueError If the panel ids do not resolve, ``sort_by`` is unknown, a geographic sort is requested without a ``gdf``, or no complete observations remain. Examples -------- Heatmap of a small two-period panel: ```python import pandas as pd from geometrics.spacetime import explore_spacetime_heatmap df = pd.DataFrame( { "region": ["a", "b", "a", "b"], "year": [2000, 2000, 2010, 2010], "gdppc": [1.0, 2.0, 1.5, 2.5], } ) res = explore_spacetime_heatmap(df, "gdppc", entity="region", time="year") res.df.shape ``` ## analyze_* ### analyze_beta_convergence(df: 'pd.DataFrame', var: 'str', controls: 'Sequence[str] | str | None' = None, *, entity: 'str | None' = None, time: 'str | None' = None, start: 'float | None' = None, end: 'float | None' = None, model: 'str' = 'ols', gdf: 'gpd.GeoDataFrame | None' = None, w: 'W | None' = None, fixed_effects: 'Sequence[str] | str | None' = None, vcov: 'str' = 'hetero', n_draws: 'int' = 10000, seed: 'int' = 20250620, min_obs: 'int' = 10, title: 'str | None' = None) -> 'BetaConvergenceResult' β-convergence of a panel variable, from plain OLS to the spatial Durbin model. Builds the growth cross-section (:func:`growth_cross_section`) over a common window and regresses each unit's annualized log growth on its **initial log level** (plus initial-period ``controls`` and ``fixed_effects`` dummies). A **negative** slope β is convergence: units that start lower grow faster. The slope maps to a speed ``λ = -ln(1 + β·T)/T`` and a half-life ``ln 2 / λ``. 
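As a worked instance of the speed and half-life formulas just stated, the sketch below evaluates them for an assumed slope of β = -0.02 over a T = 10 year window. The helper `convergence_speed` is hypothetical and written only for this illustration; it is not part of geometrics.

```python
import math


def convergence_speed(beta: float, years: float) -> tuple[float, float]:
    """Return (speed, half_life) implied by a beta-convergence slope.

    Hypothetical helper for illustration; not part of geometrics.
    """
    inside = 1.0 + beta * years
    if inside <= 0.0:
        raise ValueError("1 + beta*T must be positive for the log to exist")
    speed = -math.log(inside) / years
    # A non-positive speed means no convergence, hence no finite half-life.
    half_life = math.log(2.0) / speed if speed > 0.0 else math.inf
    return speed, half_life


speed, half_life = convergence_speed(beta=-0.02, years=10.0)
print(round(speed, 4), round(half_life, 1))
```

Here `1 + β·T = 0.8`, so the speed is about 2.2% per year and the gap to the steady state halves in roughly 31 years.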
With a spatial ``model`` the cross-section is first aligned to the entity geometry ``gdf`` (and the weights ``w``) so rows, polygons and weights always match, and the regression is estimated with :mod:`spreg`: ``"sar"`` (ML spatial lag), ``"sem"`` (ML spatial error), ``"slx"`` (spatially lagged regressors) or ``"sdm"`` (spatial Durbin). The initial-level term is decomposed into **direct**, **indirect** (spillover) and **total** impacts with Monte-Carlo standard errors from ``n_draws`` draws (the ``impacts`` table covers every regressor); speed and half-life derive from the **total** impact. An OLS baseline on the same sample is always reported alongside. Parameters ---------- df Long panel data frame. var Numeric, strictly positive variable in **levels** (the log is taken internally; growth is the annualized log-difference). controls Optional control name(s), entering at their initial-period values (conditional convergence). entity, time Panel identifiers. Default to those declared via :func:`geometrics.set_panel`. start, end Growth window endpoints; default to the earliest and latest period. model ``"ols"`` (default), ``"sar"``, ``"sem"``, ``"slx"`` or ``"sdm"``. gdf Entity geometry (required for the spatial models; optional for OLS, where it only adds the growth map and restricts the sample to matched units). w ``libpysal`` weights aligned to the ``gdf`` ids. ``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. fixed_effects Optional categorical column name(s) (e.g. a state id) entered as dummy variables (first level dropped), valued at the initial period. vcov Standard errors of the OLS baseline: ``"hetero"`` (HC1, default) or ``"iid"``. The spatial models use their spreg (ML) covariance. n_draws Monte-Carlo draws for the SAR/SDM impact standard errors. seed Seed for the Monte-Carlo draws (reproducible). min_obs Minimum number of cross-section units required. 
title Title for the growth-vs-initial scatter. Returns ------- BetaConvergenceResult The growth cross-section ``df``; the scatter ``fig``, the Frisch-Waugh-Lovell partial scatter ``fig_conditional`` (``None`` without controls/fixed effects) and the growth choropleth ``fig_map`` (``None`` without ``gdf``); the estimate table ``gt`` / ``summary`` (OLS column always, an impact column for spatial models); the fitted ``models``; the ``beta_direct`` / ``beta_indirect`` / ``beta_total`` triple with standard errors; ``rho`` / ``lam``; ``speed`` and ``half_life``; and the per-regressor ``impacts`` table for spatial models. Raises ------ KeyError If ``var``, a control or a fixed-effect column is missing from ``df``. TypeError If ``var`` or a control is not numeric. ValueError For an unknown ``model`` / ``vcov``, a spatial model without ``gdf``, an empty or inverted window, non-positive values of ``var``, too few units, or a zero-variance initial level. Examples -------- Unconditional convergence across 12 regions over one decade (β < 0 by construction — low-income regions grow faster): ```python import numpy as np import pandas as pd from geometrics.convergence import analyze_beta_convergence ids = [f"r{i}" for i in range(12)] y0 = np.linspace(1000.0, 12000.0, 12) growth = 0.05 - 0.02 * np.log(y0) / np.log(y0).mean() # poorer -> faster df = pd.concat( [ pd.DataFrame({"region": ids, "year": 2000, "gdppc": y0}), pd.DataFrame( {"region": ids, "year": 2010, "gdppc": y0 * np.exp(10 * growth)} ), ] ) res = analyze_beta_convergence(df, "gdppc", entity="region", time="year") round(res.beta_total, 4), res.model ``` ### analyze_sigma_convergence(df: 'pd.DataFrame', var: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, start: 'float | None' = None, end: 'float | None' = None, min_periods: 'int' = 3, vcov: 'str' = 'hetero', title: 'str | None' = None) -> 'SigmaConvergenceResult' σ-convergence: track and test the cross-sectional dispersion of a panel variable. 
For each period the function measures how spread out the **log** of ``var`` is across units — the standard deviation (the classic σ), the Gini index and the coefficient of variation — and then asks whether that dispersion shrinks over time by regressing the **log dispersion** on time. A **negative** trend slope is σ-convergence: the cross-sectional distribution is narrowing (units are becoming more alike). This is the distributional complement to β-convergence (:func:`analyze_beta_convergence`). The variable is taken in **levels** and logged internally (so pass GDP per capita, not log GDP per capita). The panel must be **balanced** (every unit present in every period) so the dispersion is comparable across periods. Parameters ---------- df Long panel data frame. var Numeric, strictly positive variable in levels whose log dispersion is tracked. entity, time Panel identifiers. Default to those declared via :func:`geometrics.set_panel`. start, end Optional first and last period to include. Default to the full range; the retained window must still be balanced. min_periods Minimum number of periods required to estimate a dispersion trend (at least 3). vcov Standard errors of the trend regressions: ``"hetero"`` (HC1, default) or ``"iid"``. Does not change the point estimates. title Title for the dual-axis figure. Returns ------- SigmaConvergenceResult The per-period dispersion table ``df`` (``time``, ``n_units``, ``mean``, ``std``, ``gini``, ``cv`` — all on log values); the dual-axis ``fig`` (std on the left axis, Gini on the right, with fitted trend overlays); the trend table ``gt`` / ``summary``; the fitted trend ``models``; and the headline trend scalars (``std_slope`` / ``std_se`` / ``std_pvalue`` / ``std_r2`` plus the ``gini_*`` and ``cv_*`` counterparts). ``notes`` records any degraded measure. Raises ------ KeyError If ``var`` is not a column of ``df``. TypeError If ``var`` is not numeric. 
ValueError If no usable rows remain, ``var`` has non-positive values (the log is undefined), the panel is unbalanced, or there are too few units/periods. Notes ----- For a dispersion measure :math:`D_t` computed cross-sectionally at each period ``t``, the trend is the OLS slope ``b`` in :math:`\ln D_t = a + b t + \varepsilon_t`, so ``b`` is the average proportional change in dispersion per period and ``b < 0`` is σ-convergence. The standard deviation uses ``ddof = 1``; the Gini index is the mean absolute difference divided by twice the mean; the coefficient of variation is the standard deviation over the mean. See Barro & Sala-i-Martin, *Economic Growth*, ch. 11. Examples -------- Dispersion shrinks by construction (each unit's log deviation from the mean shrinks by 20% per period): ```python import numpy as np import pandas as pd from geometrics.convergence import analyze_sigma_convergence ids = [f"r{i}" for i in range(8)] y0 = np.linspace(1000.0, 8000.0, 8) frames = [ pd.DataFrame( { "region": ids, "year": 2000 + t, "gdppc": np.exp( np.log(y0).mean() + (np.log(y0) - np.log(y0).mean()) * 0.8**t ), } ) for t in range(5) ] res = analyze_sigma_convergence( pd.concat(frames), "gdppc", entity="region", time="year" ) res.std_slope < 0 ``` ### analyze_convergence_clubs(df: 'pd.DataFrame', var: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, gdf: 'gpd.GeoDataFrame | None' = None, hp_filter: 'bool' = True, hp_lambda: 'float' = 400.0, trim: 'float' = 0.3, tcrit: 'float' = -1.65, cr: 'float' = 0.0, increment: 'float' = 0.05, max_cr: 'float' = 3.0, fraction: 'float' = 0.0, adjust: 'bool' = False, merge: 'str' = 'ps', tiles: 'str | None' = 'carto-positron', title: 'str | None' = None) -> 'ConvergenceClubsResult' Phillips-Sul log(t) convergence test and data-driven club clustering for a panel.
Runs the full club-convergence workflow on one variable: optionally smooth each unit's series with the **Hodrick-Prescott filter** (``lambda = 400`` for annual data); form the **relative transition path** ``h_it = x_it / mean_i(x_it)``; run the **log(t) regression test** for the whole panel; and, when global convergence is rejected, apply the **clustering algorithm** to split the units into convergence **clubs**, then **merge** adjacent clubs that jointly converge. This answers the descriptive question "do these units form one converging group, several catch-up clubs, or none?". The variable is used **as supplied** — no log is taken — so for the canonical income case pass *log* GDP per capita (or log labor productivity). The panel must be **balanced** (every unit present in every period) because the HP filter needs a gap-free series. Parameters ---------- df Balanced panel data frame. var Numeric variable to analyse (e.g. ``"log_gdppc"``). Used as supplied. entity, time Panel identifiers. Default to those declared via :func:`geometrics.set_panel`. gdf Optional entity geometry; when given, the result carries a club-membership choropleth ``fig_map`` (``None`` otherwise). hp_filter Apply the Hodrick-Prescott filter per unit and analyse the **trend** (default). ``False`` analyses the variable as given (already smooth). hp_lambda HP smoothing parameter (``400`` for annual data, the convergence-literature default). trim Initiating sample fraction ``r`` of the log(t) regression: the first ``round(r*T)`` periods are discarded. Phillips-Sul recommend ``0.3`` for small/moderate ``T`` and ``0.2`` for large ``T``. tcrit One-sided convergence critical value for the t-statistic (``-1.65``, the 5% level). cr Sieve inclusion threshold ``c*`` for adding units to a core group. increment Increment by which ``cr`` is raised (original PS-2007 refinement rule) when the assembled club fails its joint test. max_cr Ceiling for the raised ``cr``. 
fraction Cross-section sort key: ``0`` (default) sorts by the last period; ``> 0`` sorts by the mean of the last ``(1 - fraction)`` share of periods (for noisy endpoints). adjust Use the Schnurbus et al. (2016) club refinement (add the best candidate one at a time) instead of the original Phillips-Sul ``cr``-increment rule. merge Adjacent-club merging after clustering: ``"ps"`` (default) applies the Phillips-Sul (2009) merge test iteratively until no clubs merge, ``"single"`` does one pass, ``"none"`` reports the raw clusters. tiles MapLibre basemap style for ``fig_map`` (``None`` draws the vector backend). title Title for the headline figure. Returns ------- ConvergenceClubsResult The tidy long ``df`` (``entity``, ``time``, ``value`` = HP trend, ``relative`` = ``h_it``, ``club`` with ``0`` = divergent); the within-club average figure ``fig``; the all-paths figure ``fig_paths``; the per-club small-multiples ``fig_clubs``; the membership choropleth ``fig_map`` (``None`` without ``gdf``); the classification table ``gt`` / ``summary`` and the ``membership`` frame; the whole-panel ``global_beta`` / ``global_tstat`` and ``converged`` flag; and the club counts and run parameters. Raises ------ KeyError If ``var`` is not a column of ``df``. TypeError If ``var`` is not numeric. ValueError If ``trim`` is out of ``(0, 1)``, ``merge`` is unknown, the panel is unbalanced or too short/small, the per-period cross-sectional mean is (near) zero, or the global log(t) test is not estimable. Notes ----- The log(t) test regresses, for :math:`t = [rT] \ldots T`, .. math:: \log(H_1 / H_t) - 2 \log(\log t) = a + b \log t + \varepsilon_t, where :math:`H_t = N^{-1} \sum_i (h_{it} - 1)^2` is the cross-sectional variance of the relative transition paths. Under the null of convergence ``b = 2*alpha >= 0``; a one-sided ``t_b > -1.65`` fails to reject it. The standard error is the Phillips-Sul scalar long-run variance form with an Andrews (1991) quadratic-spectral HAC of the residuals. 
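The point estimate of the log(t) regression can be sketched with NumPy. This is a simplified illustration, not the library's routine: it uses plain OLS and omits the HAC standard error, so it yields only the slope ``b`` and not the t-statistic that the test compares with ``-1.65``. The helper `logt_slope` is hypothetical, and it assumes periods are indexed ``t = 1..T`` with the first ``round(r*T)`` of them discarded.

```python
import numpy as np


def logt_slope(panel: np.ndarray, trim: float = 0.3) -> float:
    """Plain-OLS slope b of the log(t) regression (no HAC standard error).

    panel has shape (N units, T periods); periods are indexed t = 1..T.
    Hypothetical helper for illustration; not the library's routine.
    """
    _, n_periods = panel.shape
    h = panel / panel.mean(axis=0)  # relative transition paths h_it
    big_h = ((h - 1.0) ** 2).mean(axis=0)  # cross-sectional variance H_t
    t = np.arange(1, n_periods + 1)
    keep = t > round(trim * n_periods)  # discard the first round(r*T) periods
    y = np.log(big_h[0] / big_h[keep]) - 2.0 * np.log(np.log(t[keep]))
    slope, _intercept = np.polyfit(np.log(t[keep]), y, 1)
    return float(slope)


# Ten units whose deviations from a common level of 10 decay by 10% per period.
dev = np.linspace(-0.4, 0.4, 10)[:, None]
t = np.arange(1, 31)[None, :]
converging = 10.0 + dev * 0.9 ** (t - 1)

b = logt_slope(converging)
print(b > 0)
```

A panel whose deviations shrink gives a positive slope; one whose deviations grow over time gives a negative slope, which is the pattern the one-sided test is designed to detect.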
The clustering sorts units by their final value, forms a core group by maximising ``t_b``, sieves in the remaining units, and recurses on the residual; adjacent clubs are then merged when they jointly converge. This is a faithful port of the Stata ``psecta`` package (Du 2017); see Phillips & Sul (2007, 2009) and Schnurbus et al. (2016). Examples -------- Two planted clubs (units converge within their group, not across groups): ```python import numpy as np import pandas as pd from geometrics.clubs import analyze_convergence_clubs rng = np.random.default_rng(0) rows = [] for k, mu in enumerate((10.0, 8.5), start=1): for j in range(10): dev = rng.uniform(-0.4, 0.4) for t in range(1, 31): rows.append((f"c{k}u{j}", t, mu + dev * 0.9 ** (t - 1))) df = pd.DataFrame(rows, columns=["unit", "year", "log_y"]) res = analyze_convergence_clubs(df, "log_y", entity="unit", time="year") res.n_clubs, res.converged ``` ### analyze_spatial_model(df: 'pd.DataFrame', outcome: 'str | None' = None, covariates: 'str | Sequence[str] | None' = None, *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, model: 'str' = 'durbin', method: 'str' = 'ml', period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, fixed_effects: 'str | None' = None, impacts: 'bool' = True, n_draws: 'int' = 10000, seed: 'int | None' = 20250620, spat_diag: 'bool' = False, title: 'str | None' = None) -> 'SpatialModelResult' Estimate a cross-sectional spatial econometric model with impact decomposition. One period of the panel (the latest by default) is aligned to the geometry and weights, the requested :mod:`spreg` model is estimated, and — for models with a spatially lagged outcome or regressors — the per-covariate LeSage-Pace direct/indirect/total impacts are computed from ``betas`` + ``vm`` (Monte-Carlo standard errors for lag/Durbin models, analytic for SLX/Durbin-error). Parameters ---------- df Long panel (or cross-section) holding the outcome and covariates per entity. outcome Dependent variable. 
Defaults to the outcome declared via :func:`geometrics.set_roles`. covariates Regressor column(s). Default to the covariates declared via :func:`geometrics.set_roles`. gdf Entity geometry carrying the same entity ids as ``df``. w ``libpysal`` weights aligned to the gdf ids. ``None`` builds the default weights (queen contiguity for polygons) with a :class:`~geometrics.GeometricsWarning`. model ``"ols"``, ``"lag"`` (SAR), ``"error"`` (SEM), ``"slx"``, ``"durbin"`` (SDM) or ``"durbin_error"`` (SDEM). method ``"ml"`` (maximum likelihood) or ``"gm"`` (method of moments / GMM; not available for ``durbin_error``). OLS-based models (``ols`` / ``slx``) ignore it. period Period to model when ``df`` has a time dimension; ``None`` uses the latest period and records a note. entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. fixed_effects Categorical column expanded to ``drop_first`` dummy regressors (e.g. state fixed effects). Dummies are never spatially lagged when their lag is collinear (full-rank check). impacts Compute the impact table where defined (lag/durbin: Monte-Carlo; slx/ durbin_error: analytic). OLS and pure error models have no impact decomposition (``impacts`` is ``None``). n_draws Monte-Carlo draws for the impact standard errors. seed Seed for the Monte-Carlo draws (reproducible by default). spat_diag For ``model="ols"`` only: attach spreg's spatial diagnostics to the fitted object (see :func:`analyze_spatial_diagnostics` for the full workflow). title Header for the coefficient table. Defaults to the model and outcome labels. Returns ------- SpatialModelResult Frozen result with the tidy coefficient frame (``df``), the Great Tables coefficient table (``gt``), the fitted spreg object (``model_obj``), the spatial parameters (``rho`` / ``lam``), fit scalars, the impact table (``impacts``) and ``w_spec``. Raises ------ KeyError If a requested column is not in ``df``. TypeError If the outcome or a covariate is not numeric. 
ValueError For an unknown ``model`` / ``method``, the unsupported ``durbin_error`` + ``gm`` combination, an unknown ``period``, or too few / degenerate observations. Examples -------- A spatial lag model on a small constructed lattice: ```python import geopandas as gpd import numpy as np import pandas as pd from shapely.geometry import box from geometrics.spatial_models import analyze_spatial_model from geometrics.weights import make_weights cells = [box(i % 4, i // 4, i % 4 + 1, i // 4 + 1) for i in range(16)] gdf = gpd.GeoDataFrame( {"id": [f"r{i}" for i in range(16)]}, geometry=cells, crs="EPSG:4326" ) rng = np.random.default_rng(0) df = pd.DataFrame({"id": gdf["id"], "x": rng.normal(size=16)}) df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=16) res = analyze_spatial_model( df, "y", ["x"], gdf=gdf, w=make_weights(gdf), model="lag", entity="id", n_draws=200, ) print(res.model, res.n_obs, res.impacts.shape) ``` ### analyze_spatial_diagnostics(df: 'pd.DataFrame', outcome: 'str | None' = None, covariates: 'str | Sequence[str] | None' = None, *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, fixed_effects: 'str | None' = None, alpha: 'float' = 0.05) -> 'SpatialDiagnosticsResult' Run the LM specification tests on OLS residuals and recommend a spatial model. Estimates the non-spatial OLS benchmark, computes Moran's I on its residuals and the five Lagrange-multiplier tests (LM lag / LM error, their robust forms, and LM SARMA), then applies the Anselin-Florax decision rule: no LM rejection keeps OLS; otherwise the significant *robust* test picks the lag or error model, and when both robust tests reject the larger statistic wins (with a pointer to the spatial Durbin model, which nests both channels). Parameters ---------- df Long panel (or cross-section) holding the outcome and covariates per entity. 
outcome, covariates Dependent variable and regressors; default to the roles declared via :func:`geometrics.set_roles`. gdf Entity geometry carrying the same entity ids as ``df``. w ``libpysal`` weights aligned to the gdf ids; ``None`` builds the default weights with a :class:`~geometrics.GeometricsWarning`. period Period to test; ``None`` uses the latest period and records a note. entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. fixed_effects Categorical column expanded to ``drop_first`` dummies in the OLS design. alpha Significance level for the decision rule. Returns ------- SpatialDiagnosticsResult Frozen result with one row per test (``test``, ``statistic``, ``df``, ``p``), the rendered table, the residual Moran's I, the ``recommendation`` and its ``reasoning``, the fitted OLS benchmark and ``w_spec``. Raises ------ KeyError If a requested column is not in ``df``. TypeError If the outcome or a covariate is not numeric. ValueError If ``alpha`` is not in (0, 1), the period is unknown, or the aligned cross-section is too small or degenerate. 
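The decision rule described for this function can be sketched as a small pure-Python function. This is an assumption-laden illustration, not the library's logic: the dictionary keys (`lm_lag`, `lm_error`, `rlm_lag`, `rlm_error`) are hypothetical, only the lag and error tests are consulted, and the fallback used when neither robust test rejects is a guess that the description does not cover.

```python
def recommend_model(tests: dict[str, tuple[float, float]], alpha: float = 0.05) -> str:
    """Pick 'ols', 'lag' or 'error' from LM test results.

    tests maps a test name to (statistic, p_value). The keys used here
    ('lm_lag', 'lm_error', 'rlm_lag', 'rlm_error') are hypothetical.
    """

    def rejects(name: str) -> bool:
        return tests[name][1] < alpha

    # No LM rejection: keep OLS.
    if not (rejects("lm_lag") or rejects("lm_error")):
        return "ols"

    robust_lag, robust_error = rejects("rlm_lag"), rejects("rlm_error")
    if robust_lag and robust_error:
        # Both robust tests reject: the larger statistic wins.
        return "lag" if tests["rlm_lag"][0] >= tests["rlm_error"][0] else "error"
    if robust_lag:
        return "lag"
    if robust_error:
        return "error"

    # Assumption (not stated in the description): when neither robust test
    # rejects, fall back to the plain LM test with the larger statistic.
    return "lag" if tests["lm_lag"][0] >= tests["lm_error"][0] else "error"


example = {
    "lm_lag": (9.1, 0.003),
    "lm_error": (6.4, 0.011),
    "rlm_lag": (4.2, 0.040),
    "rlm_error": (1.5, 0.220),
}
print(recommend_model(example))
```

In the example only the robust lag test rejects at 5%, so the sketch returns `"lag"`.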
Examples
--------
Diagnostics on a small constructed lattice:

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import box

from geometrics.spatial_models import analyze_spatial_diagnostics
from geometrics.weights import make_weights

cells = [box(i % 4, i // 4, i % 4 + 1, i // 4 + 1) for i in range(16)]
gdf = gpd.GeoDataFrame(
    {"id": [f"r{i}" for i in range(16)]}, geometry=cells, crs="EPSG:4326"
)
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": gdf["id"], "x": rng.normal(size=16)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=16)

res = analyze_spatial_diagnostics(df, "y", ["x"], gdf=gdf, w=make_weights(gdf), entity="id")
print(res.recommendation)
```

### analyze_spatial_model_by_weights(df: 'pd.DataFrame', outcome: 'str | None' = None, covariates: 'str | Sequence[str] | None' = None, *, gdf: 'gpd.GeoDataFrame', weights: 'Mapping[str, W] | None' = None, baseline: 'str | None' = None, focal: 'str | None' = None, model: 'str' = 'durbin', period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, fixed_effects: 'str | None' = None, n_draws: 'int' = 10000, seed: 'int | None' = 20250620, title: 'str | None' = None) -> 'WeightsRobustnessResult'

Re-estimate a spatial model under alternative weights and compare the impacts.

The weights-choice robustness check of the source paper (notebook c07): the same model is re-estimated under each spatial weights specification, and the focal regressor's direct/indirect/total impacts are compared across specifications in a table and a three-facet dot-whisker figure (95% Monte-Carlo confidence intervals, baseline highlighted with a dashed reference line).

Parameters
----------
df
    Long panel (or cross-section) holding the outcome and covariates per entity.
outcome, covariates
    Dependent variable and regressors; default to the roles declared via :func:`geometrics.set_roles`.
gdf
    Entity geometry carrying the same entity ids as ``df``.
weights
    Mapping of specification name to ``libpysal`` weights. ``None`` builds the paper's suite from the geometry: 4/6/8-nearest-neighbor, queen and rook contiguity, and inverse distance with powers 1 and 2 (all row-standardized).
baseline
    Name of the reference specification (highlighted in the figure). Defaults to the first key of ``weights``.
focal
    Covariate whose impacts are compared. Defaults to the first covariate.
model
    ``"lag"``, ``"slx"``, ``"durbin"`` (default) or ``"durbin_error"`` — a model with a defined impact decomposition. Estimated by ML.
period
    Period to model; ``None`` uses the latest period and records a note.
entity, time
    Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`.
fixed_effects
    Categorical column expanded to ``drop_first`` dummies in each design.
n_draws
    Monte-Carlo draws for the impact standard errors (the RNG is re-seeded per weights specification, so rows are reproducible individually).
seed
    Seed for the Monte-Carlo draws.
title
    Figure title. Defaults to a title naming the focal regressor.

Returns
-------
WeightsRobustnessResult
    Frozen result with one row per specification (``weights``, ``rho``, ``direct`` / ``indirect`` / ``total`` and their standard errors, ``aic``, ``n_obs``, ``w_spec``), the dot-whisker figure and the comparison table.

Raises
------
KeyError
    If a requested column is not in ``df``.
TypeError
    If the outcome or a covariate is not numeric.
ValueError
    For a model without impacts (``ols`` / ``error``), an empty ``weights`` mapping, an unknown ``baseline`` or ``focal``, or degenerate data.
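To see why the direct/indirect/total impacts compared by this function depend on the weights, the NumPy sketch below computes closed-form spatial-lag impacts under two lattice graphs. It is an illustration only, not what `analyze_spatial_model_by_weights` runs: the helper names `lattice_weights` and `lag_impacts` are hypothetical, and `rho` and `beta` are held fixed by assumption, whereas the real function re-estimates the model by ML under each specification.

```python
import numpy as np


def lattice_weights(side, rook=False):
    """Hypothetical helper: row-standardized contiguity weights on a side x side lattice."""
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    if rook and dr != 0 and dc != 0:
                        continue  # rook contiguity drops the diagonal neighbors
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < side and 0 <= cc < side:
                        W[r * side + c, rr * side + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)


def lag_impacts(W, rho, beta):
    """Direct, indirect and total impacts of x in y = rho W y + beta x + eps."""
    n = W.shape[0]
    S = beta * np.linalg.inv(np.eye(n) - rho * W)  # matrix of dy_i / dx_j
    direct = np.trace(S) / n                       # average own-unit effect
    total = S.sum() / n                            # average row sum
    return direct, total - direct, total


rho, beta = 0.5, 2.0  # assumed values, not estimates
impacts = {
    "queen": lag_impacts(lattice_weights(4), rho, beta),
    "rook": lag_impacts(lattice_weights(4, rook=True), rho, beta),
}
for name, (direct, indirect, total) in impacts.items():
    print(name, round(direct, 3), round(indirect, 3), round(total, 3))
```

With row-standardized weights and a fixed `rho`, the total impact equals `beta / (1 - rho)` under both graphs, while its split into direct and indirect parts changes with the neighbor definition.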
Examples
--------
Compare queen contiguity against 4-nearest-neighbor weights:

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import box

from geometrics.spatial_models import analyze_spatial_model_by_weights
from geometrics.weights import make_weights

cells = [box(i % 4, i // 4, i % 4 + 1, i // 4 + 1) for i in range(16)]
gdf = gpd.GeoDataFrame(
    {"id": [f"r{i}" for i in range(16)]}, geometry=cells, crs="EPSG:4326"
)
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": gdf["id"], "x": rng.normal(size=16)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.1, size=16)

res = analyze_spatial_model_by_weights(
    df, "y", ["x"], gdf=gdf, model="lag", entity="id", n_draws=200,
    weights={
        "queen": make_weights(gdf, method="queen"),
        "knn4": make_weights(gdf, method="knn", k=4),
    },
)
print(res.baseline, list(res.df["weights"]))
```

### analyze_markov_transitions(df: 'pd.DataFrame', var: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, k: 'int' = 5, scheme: 'str' = 'quantiles', bins: 'Sequence[float] | None' = None, per_period: 'bool' = True, relative: 'bool' = False, title: 'str | None' = None) -> 'MarkovTransitionsResult'

Estimate a discrete Markov chain of movement between distribution states.

Each region's ``var`` is discretized into ``k`` states (per period by default, so a state is a *rank* within that period's cross-section) and every period-to-period move is pooled into a ``k``-by-``k`` transition-probability matrix (:class:`giddy.markov.Markov`). The result carries the ergodic (steady-state) distribution, expected sojourn times, and the Shorrocks / Prais / Bartholomew mobility indices of the matrix.

Parameters
----------
df
    Long-form panel with entity, time and ``var`` columns. The panel must be balanced in ``var`` (every entity observed in every period).
var
    Numeric variable whose distribution dynamics are analyzed.
entity
    Entity (unit) id column; defaults to the panel declared via :func:`geometrics.set_panel`.
time
    Time id column; defaults to the declared panel.
k
    Number of states (classes) to discretize into (default 5).
scheme
    Classification scheme: ``"quantiles"`` (default), ``"equal_interval"`` or ``"fisher_jenks"``. Ignored when ``bins`` is given.
bins
    Explicit upper class bounds (:class:`mapclassify.UserDefined`); the same fixed intervals apply in every period and ``scheme`` / ``per_period`` are ignored.
per_period
    Classify each period's cross-section separately (default ``True``, the distribution-dynamics convention: states are positions *within* the period's distribution). ``False`` pools all ``n*t`` values into one classification.
relative
    Divide ``var`` by its cross-sectional mean per period first (so 1.0 marks the period average). Default ``False``.
title
    Figure title (a default naming the variable is used when ``None``).

Returns
-------
MarkovTransitionsResult
    The long panel with each (entity, period) ``state``, the labelled transition matrix ``p`` and ``counts``, the annotated heatmap ``fig``, the summary table ``gt``, the ``steady_state`` and ``sojourn`` series, and the ``shorrocks`` / ``prais`` / ``bartholomew`` mobility indices.

Raises
------
ImportError
    If the optional ``giddy`` dependency is not installed.
KeyError
    If ``var`` is not a column of ``df``.
TypeError
    If ``var`` is not numeric.
ValueError
    If ``k < 2``, the scheme is unknown, the panel is unbalanced, or fewer than two periods are observed.

Notes
-----
Mobility indices use :func:`giddy.mobility.markov_mobility` measure codes: ``shorrocks`` is measure ``"P"`` (the trace index :math:`(k - \mathrm{tr}\,P)/(k-1)`), ``prais`` is measure ``"D"`` (the determinant index :math:`1 - |\det P|`), and ``bartholomew`` is measure ``"B1"`` (the trace index weighted by the first period's observed state distribution).
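The two closed-form indices in the Notes, and the ergodic distribution, can be checked by hand on a small matrix. The sketch below uses plain NumPy rather than giddy, on an illustrative 3-state transition matrix chosen for the example; the variable names are hypothetical and the Bartholomew index is left out because it needs an initial state distribution.

```python
import numpy as np

# Illustrative 3-state transition matrix (rows sum to 1).
P = np.array(
    [
        [0.80, 0.15, 0.05],
        [0.10, 0.80, 0.10],
        [0.05, 0.15, 0.80],
    ]
)
k = P.shape[0]

# Trace index (k - tr P) / (k - 1): 0 means no mobility at all.
trace_index = (k - np.trace(P)) / (k - 1)

# Determinant index 1 - |det P|.
det_index = 1.0 - abs(np.linalg.det(P))

# Steady state: solve pi P = pi together with sum(pi) = 1 by least squares.
A = np.vstack([P.T - np.eye(k), np.ones(k)])
b = np.append(np.zeros(k), 1.0)
steady_state = np.linalg.lstsq(A, b, rcond=None)[0]

print(round(trace_index, 4), round(det_index, 4), steady_state.round(4))
```

For this matrix the trace is 2.4, so the trace index is (3 - 2.4) / 2 = 0.3, and the determinant is 0.4875, so the determinant index is 0.5125.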
Examples
--------
Three groups of regions that keep their income rank from year to year:

```python
import numpy as np
import pandas as pd

from geometrics.distribution_dynamics import analyze_markov_transitions

rng = np.random.default_rng(0)
units = [f"r{i}" for i in range(9)]
base = np.repeat([1.0, 2.0, 3.0], 3)
df = pd.DataFrame(
    [
        {"region": u, "year": y, "income": b + rng.normal(0, 0.5)}
        for y in (2000, 2001, 2002, 2003)
        for u, b in zip(units, base)
    ]
)
res = analyze_markov_transitions(df, "income", entity="region", time="year", k=3)
res.p.round(2)
```

### analyze_spatial_markov(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, entity: 'str | None' = None, time: 'str | None' = None, k: 'int' = 5, m: 'int | None' = None, fixed: 'bool' = True, relative: 'bool' = True, title: 'str | None' = None) -> 'SpatialMarkovResult'

Estimate a spatial Markov chain: transitions conditioned on the neighbors' state.

Rey's (2001) spatial Markov splits the classic transition matrix by the *spatial lag* of each region — the (weighted) average of its neighbors — discretized into ``m`` classes. One ``k``-by-``k`` matrix per neighbor class shows whether upward or downward moves happen at different rates in rich versus poor neighborhoods, and the Bickenbach-Bode LR / Q tests ask whether those conditional dynamics differ from the pooled (unconditional) matrix.

Parameters
----------
df
    Long-form panel with entity, time and ``var`` columns. The panel must be balanced in ``var`` (every entity observed in every period).
var
    Numeric variable whose distribution dynamics are analyzed.
gdf
    Geometry frame carrying the entity ids (see :func:`geometrics.read_gdf`); the panel is aligned to the weights' row order through it.
w
    ``libpysal`` weights aligned to the gdf entity ids. ``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`.
entity
    Entity (unit) id column of ``df``; defaults to the declared panel.
time
    Time id column; defaults to the declared panel.
k
    Number of states for the variable itself (default 5).
m
    Number of classes for the spatial lag (default: same as ``k``).
fixed
    Pool the ``n*t`` values into one quantile classification (default ``True``, giddy's convention); ``False`` re-classifies each period separately.
relative
    Divide ``var`` by its cross-sectional mean per period first (default ``True``, the distribution-dynamics convention for income data).
title
    Figure title (a default naming the variable is used when ``None``).

Returns
-------
SpatialMarkovResult
    The long panel with each (entity, period) ``state`` and ``neighbor_state``, the unconditional ``p_global``, the tuple of ``m`` conditional matrices ``p_conditional`` (with ``steady_states`` stacking their ergodic distributions), the small-multiple heatmap ``fig``, the homogeneity-test table ``gt``, and the LR / Q statistics with their p-values and ``dof``.

Raises
------
ImportError
    If the optional ``giddy`` dependency is not installed.
KeyError
    If ``var`` is not a column of ``df``.
TypeError
    If ``var`` is not numeric.
ValueError
    If ``k < 2`` or ``m < 2``, the panel is unbalanced, fewer than two periods are observed, or the weights ids do not match the geometry.
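The conditioning step behind a spatial Markov chain can be sketched in NumPy. This is a minimal illustration under stated assumptions, not giddy's or geometrics' code: the helper names are hypothetical, classes come from pooled quantiles (the `fixed=True` convention), each move from period t to t+1 is filed under the neighbor class observed at t, and the toy weights link units along a line.

```python
import numpy as np


def quantile_states(values, n_classes):
    """Hypothetical helper: classes 0..n_classes-1 from pooled quantile cut points."""
    cuts = np.quantile(values, np.arange(1, n_classes) / n_classes)
    return np.searchsorted(cuts, values, side="left")


def conditional_counts(Y, W, k, m):
    """Y is (n_units, n_periods); W is a row-standardized (n_units, n_units) matrix."""
    states = quantile_states(Y, k)           # own state of every unit and period
    lag_states = quantile_states(W @ Y, m)   # class of the neighbors' average
    counts = np.zeros((m, k, k), dtype=int)
    for t in range(Y.shape[1] - 1):
        # One transition per unit, filed under its neighbor class at time t.
        np.add.at(counts, (lag_states[:, t], states[:, t], states[:, t + 1]), 1)
    return counts


# Toy setup (assumed): 9 units on a line, each linked to its adjacent units.
n_units, n_periods = 9, 6
W = np.zeros((n_units, n_units))
for i in range(n_units - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Y = 1.0 + rng.normal(0.0, 0.3, size=(n_units, n_periods))

counts = conditional_counts(Y, W, k=3, m=2)
row_totals = counts.sum(axis=2, keepdims=True)
P_conditional = counts / np.maximum(row_totals, 1)  # empty rows stay all-zero
pooled_counts = counts.sum(axis=0)                  # the unconditional count matrix
print(counts.shape, pooled_counts.sum())
```

Summing the conditional count matrices over the neighbor classes gives back the pooled matrix, which is the comparison the homogeneity tests are built on.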
Examples
--------
A 3x3 lattice where income levels follow a smooth spatial gradient:

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import box

from geometrics.distribution_dynamics import analyze_spatial_markov
from geometrics.weights import make_weights

cells = [box(x, y, x + 1, y + 1) for y in range(3) for x in range(3)]
gdf = gpd.GeoDataFrame(
    {"region": [f"r{i}" for i in range(9)]}, geometry=cells, crs="EPSG:4326"
)
w = make_weights(gdf, method="queen")
rng = np.random.default_rng(1)
df = pd.DataFrame(
    [
        {"region": f"r{i}", "year": y, "income": 1.0 + i / 4 + rng.normal(0, 0.3)}
        for y in (2000, 2001, 2002, 2003)
        for i in range(9)
    ]
)
res = analyze_spatial_markov(
    df, "income", gdf=gdf, w=w, entity="region", time="year", k=2, m=2
)
res.p_global.round(2)
```

### analyze_inequality_over_time(df: 'pd.DataFrame', var: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, measures: 'Sequence[str]' = ('gini', 'theil'), gdf: 'gpd.GeoDataFrame | None' = None, w: 'W | None' = None, permutations: 'int' = 99, start: 'Any' = None, end: 'Any' = None, title: 'str | None' = None) -> 'InequalityOverTimeResult'

Track cross-sectional inequality measures over time and test their trend.

For every period the function computes the requested inequality measures of ``var`` across units — the **Gini index** (:class:`inequality.gini.Gini`), the **Theil index** (:class:`inequality.theil.Theil`) and the **coefficient of variation** (sample std over mean) — then regresses the **log** of each measure on time (OLS, HC1 standard errors). A negative, significant slope means inequality is narrowing: the inequality-narrative complement of σ-convergence.
When geometry is supplied (``gdf``, with ``w`` optional), the per-period **spatial Gini decomposition** of Rey & Smith (2013) (:class:`inequality.gini.Gini_Spatial`) is added: ``gini_spatial`` is the component of the overall Gini owed to *neighbor* pairs under ``w`` (so ``gini_spatial <= gini``, with the remainder owed to non-neighbor pairs), and ``gini_spatial_p`` is the permutation pseudo p-value testing whether the non-neighbor component exceeds its expectation under spatial randomness. Units are aligned to the geometry per period with the **same** entity set across periods (the intersection of the per-period complete cases).

Parameters
----------
df
    Long-form panel data frame.
var
    Numeric variable whose cross-sectional inequality is tracked (e.g. GDP per capita in levels). Used as supplied; the Theil index requires strictly positive values.
entity, time
    Panel identifiers. Default to those declared via :func:`geometrics.set_panel`.
measures
    Measures to compute per period, from ``"gini"``, ``"theil"`` and ``"cv"``.
gdf
    Geometry frame (see :func:`geometrics.read_gdf`) enabling the spatial Gini decomposition. ``None`` skips it.
w
    ``libpysal`` weights aligned to the ``gdf`` entity ids (only its neighbor structure is used, as a binary graph). ``None`` with a ``gdf`` builds the default weights (queen contiguity for polygons) with a :class:`~geometrics.GeometricsWarning`.
permutations
    Number of permutations for the spatial-Gini inference (``0`` disables it; ``gini_spatial_p`` is then ``NaN``).
start, end
    Optional first and last period to include (inclusive, on the scale of the time column). Default to the full range.
title
    Title for the figure.
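The per-period measures and the log-trend described for this function can be reproduced by hand on a tiny panel. The sketch below is a hedged illustration with the textbook definitions of the Gini and Theil indices in plain NumPy; it is not the PySAL `inequality` code path, the helper names `gini` and `theil` are hypothetical, and it reports only the OLS slope, without the HC1 standard errors the function computes.

```python
import numpy as np


def gini(x):
    """Mean absolute difference over all ordered pairs, scaled by twice the mean."""
    x = np.asarray(x, dtype=float)
    pair_diffs = np.abs(x[:, None] - x[None, :]).sum()
    return pair_diffs / (2.0 * x.size**2 * x.mean())


def theil(x):
    """Theil T index: sum of s_i * ln(n * s_i), with s_i the share of unit i."""
    x = np.asarray(x, dtype=float)
    shares = x / x.sum()
    return float(np.sum(shares * np.log(x.size * shares)))


# Three regions over three years (toy panel of strictly positive values).
panel = {
    2000: [10.0, 20.0, 40.0],
    2001: [12.0, 21.0, 38.0],
    2002: [14.0, 22.0, 36.0],
}
years = np.array(sorted(panel))
gini_path = np.array([gini(panel[y]) for y in years])
theil_path = np.array([theil(panel[y]) for y in years])

# Log-linear trend: regress log(measure) on elapsed time.
elapsed = years - years[0]
gini_slope = np.polyfit(elapsed, np.log(gini_path), 1)[0]
theil_slope = np.polyfit(elapsed, np.log(theil_path), 1)[0]
print(gini_path.round(3), gini_slope < 0, theil_slope < 0)
```

Both slopes are negative for this panel: the gap between the richest and poorest region shrinks every year, which is the pattern the trend test is meant to flag as narrowing inequality.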
Returns
-------
InequalityOverTimeResult
    Per-period measures ``df`` (``time``, ``n_units`` and one column per measure, plus ``gini_spatial`` / ``gini_spatial_p`` when spatial); the measures-over-time ``fig`` with dashed fitted trends; the per-measure trend table ``gt`` / ``summary`` (``measure``, ``slope``, ``se``, ``pvalue``, ``r2``, ``converging``); the fitted trend ``models``; ``n_periods`` / ``n_units``; and ``w_spec`` describing the weights (``None`` without geometry).

Raises
------
KeyError
    If ``var`` is not a column of ``df``.
TypeError
    If ``var`` is not numeric.
ValueError
    For unknown ``measures``, a Theil request on non-positive values (the offending entities are named), ``w`` without ``gdf``, fewer than two periods, or a period with fewer than two complete observations.

Notes
-----
``Gini_Spatial`` draws permutations from NumPy's global RNG and has no seed parameter, so the global seed is set to a fixed value (12345) at the start of the spatial loop to make ``gini_spatial_p`` reproducible.

Examples
--------
Inequality trend across three regions over three years:

```python
import pandas as pd

from geometrics.regional_inequality import analyze_inequality_over_time

df = pd.DataFrame(
    {
        "region": ["A", "B", "C"] * 3,
        "year": [2000] * 3 + [2001] * 3 + [2002] * 3,
        "gdppc": [10.0, 20.0, 40.0, 12.0, 21.0, 38.0, 14.0, 22.0, 36.0],
    }
)
res = analyze_inequality_over_time(
    df, "gdppc", entity="region", time="year", measures=("gini", "theil")
)
(res.df["gini"].round(3).tolist(), bool(res.summary.loc[0, "converging"]))
```

### analyze_theil_decomposition(df: 'pd.DataFrame', var: 'str', group: 'str', *, entity: 'str | None' = None, time: 'str | None' = None, permutations: 'int' = 0, seed: 'int' = 12345, title: 'str | None' = None) -> 'TheilDecompositionResult'

Decompose the Theil index between and within a group partition, per period.

For every period the Theil index of ``var`` across units is split additively (:class:`inequality.theil.TheilD`) into a **between-group** component (inequality across the mean levels of the ``group`` partition, e.g. states) and a **within-group** component (inequality among units inside each group): ``theil = between + within`` exactly. The ``between_share`` tracks how much of total inequality is a group-level phenomenon.

With ``permutations > 0`` the between component gets a permutation pseudo p-value (:class:`inequality.theil.TheilDSim`): units are randomly reassigned to groups and ``p_between`` reports how often a random partition yields a between share at least as large.

Parameters
----------
df
    Long-form panel data frame.
var
    Numeric variable to decompose (strictly positive — the Theil index takes logarithms of shares).
group
    Partition column (e.g. a state id for district units). It must be constant within each entity across periods, and define at least two groups.
entity, time
    Panel identifiers. Default to those declared via :func:`geometrics.set_panel`.
permutations
    Number of permutations for the between-component inference (``0`` disables it and omits the ``p_between`` column).
seed
    Seed for the permutation draws. ``TheilDSim`` has no seed parameter and draws from NumPy's global RNG, so ``np.random.seed(seed)`` is called once before the per-period loop.
title
    Title for the figure.

Returns
-------
TheilDecompositionResult
    Per-period frame ``df`` (``time``, ``theil``, ``between``, ``within``, ``between_share``, plus ``p_between`` when ``permutations > 0``); the stacked between/within area ``fig`` with the between-share line on the secondary axis; the per-period ``gt`` table; and ``group`` / ``n_groups`` / ``permutations``.

Raises
------
KeyError
    If ``var`` or ``group`` is not a column of ``df``.
TypeError
    If ``var`` is not numeric.
ValueError
    If ``group`` varies within an entity (the offenders are named), defines fewer than two groups, or ``var`` has non-positive values (the offending entities are named).

Examples
--------
Two states with two districts each, over two years:

```python
import pandas as pd

from geometrics.regional_inequality import analyze_theil_decomposition

df = pd.DataFrame(
    {
        "district": ["d1", "d2", "d3", "d4"] * 2,
        "state": ["north", "north", "south", "south"] * 2,
        "year": [2000] * 4 + [2001] * 4,
        "income": [10.0, 12.0, 30.0, 36.0, 11.0, 13.0, 33.0, 40.0],
    }
)
res = analyze_theil_decomposition(
    df, "income", "state", entity="district", time="year"
)
res.df[["time", "between_share"]].round(3)
```

### analyze_gwr(df: 'pd.DataFrame', outcome: 'str | None' = None, covariates: 'str | Sequence[str] | None' = None, *, gdf: 'gpd.GeoDataFrame', period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, bw: 'float | None' = None, fixed: 'bool' = False, kernel: 'str' = 'bisquare', criterion: 'str' = 'AICc', standardize: 'bool' = False, alpha: 'float' = 0.05, tiles: 'str | None' = None, title: 'str | None' = None) -> 'GWRResult'

Fit a geographically weighted regression and map each local surface.

A separate distance-weighted regression of ``outcome`` on ``covariates`` is calibrated at every entity (mgwr's ``GWR``), so each term's coefficient becomes a local surface. The bandwidth is selected by golden-section search on ``criterion`` when ``bw`` is ``None``; local significance applies the da Silva & Fotheringham multiple-testing correction (corrected alpha → critical t) at the nominal ``alpha`` level, and non-significant units are greyed on the coefficient maps.

Parameters
----------
df
    Long panel (or cross section) holding the outcome and covariates per entity.
outcome
    Numeric outcome column. Defaults to the outcome declared via :func:`geometrics.set_roles`.
covariates
    Numeric covariate column(s). Default to the covariates declared via :func:`geometrics.set_roles`.
gdf
    Entity geometry; must carry the same entity-id column as ``df``. Calibration coordinates are the polygon centroids in a metric CRS.
period
    Period to analyze. Defaults to the latest period when ``df`` has a time dimension (a note records this).
entity, time
    Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`.
bw
    Bandwidth: number of nearest neighbors when ``fixed=False`` (adaptive), a distance in metric-CRS units when ``fixed=True``. ``None`` (default) selects it by golden-section search on ``criterion``.
fixed
    Use a fixed-distance kernel instead of an adaptive nearest-neighbor one.
kernel
    Kernel weighting function: ``"bisquare"`` (default), ``"gaussian"`` or ``"exponential"``.
criterion
    Bandwidth-selection criterion: ``"AICc"`` (default), ``"AIC"``, ``"BIC"`` or ``"CV"``.
standardize
    Z-standardize the outcome and covariates before fitting, so local coefficients are comparable across terms (a note records this).
alpha
    Nominal significance level for the corrected local t-tests.
tiles
    MapLibre base-map style for the coefficient maps, or ``None`` (default) for the vector backend (deterministic PNG export).
title
    Title for the local-R² map; per-term maps append the term label.

Returns
-------
GWRResult
    Frozen result with the per-entity local frame (``df``), one diverging coefficient map per term (``figs``), the local-R² map (``fig``), the global summary table (``gt``) and the fitted mgwr results (``model_obj``).

Raises
------
KeyError
    If the outcome or a covariate is not a column of ``df``.
TypeError
    If a focal variable is not numeric, or ``df``/``gdf`` have the wrong type.
ValueError
    If roles cannot be resolved, arguments are invalid, or too few complete observations remain after alignment.
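The core idea of a geographically weighted regression, one distance-weighted least-squares fit per location, can be sketched in NumPy. This is a simplified illustration and not mgwr's implementation: the helper names are hypothetical, coordinates are assumed to be planar already, the adaptive bandwidth is taken as the distance to the k-th nearest point (the point itself included), and bandwidth search, inference and the multiple-testing correction are all left out.

```python
import numpy as np


def bisquare_weights(dist, bandwidth):
    """Bisquare kernel: (1 - (d/b)^2)^2 inside the bandwidth, 0 outside."""
    u = dist / bandwidth
    return np.where(u < 1.0, (1.0 - u**2) ** 2, 0.0)


def local_coefficients(coords, X, y, k):
    """Hypothetical helper: one weighted least-squares fit per location."""
    n = coords.shape[0]
    design = np.column_stack([np.ones(n), X])  # intercept plus covariates
    betas = np.empty((n, design.shape[1]))
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        # Adaptive bandwidth, nudged up so the k-th point keeps a tiny weight.
        bandwidth = np.sort(dist)[k - 1] * (1.0 + 1e-7)
        root_w = np.sqrt(bisquare_weights(dist, bandwidth))
        # Weighted least squares via row scaling by sqrt(weight).
        betas[i] = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)[0]
    return betas


# 3x3 lattice of cell centers and a covariate with no spatial drift in its effect.
coords = np.array([[i % 3, i // 3] for i in range(9)], dtype=float)
x1 = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.3, 0.6])
y = 1.0 + 2.0 * x1  # the same relationship everywhere, no noise

betas = local_coefficients(coords, x1, y, k=6)
print(betas.round(6))
```

Because the planted relationship is identical at every location, each local fit recovers the same intercept and slope; local surfaces only start to differ once the true coefficients vary over space.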
Examples
--------
A 3x3 lattice with an explicit bandwidth (skipping the search):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

from geometrics.gwr import analyze_gwr

gdf = gpd.GeoDataFrame(
    {"cell": [f"c{i}" for i in range(9)]},
    geometry=[box(78 + i % 3, 20 + i // 3, 79 + i % 3, 21 + i // 3) for i in range(9)],
    crs="EPSG:4326",
)
df = pd.DataFrame(
    {
        "cell": [f"c{i}" for i in range(9)],
        "x1": [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.3, 0.6],
        "y": [0.2, 1.4, 0.8, 1.5, 0.6, 2.0, 1.4, 1.0, 1.9],
    }
)
res = analyze_gwr(df, "y", ["x1"], gdf=gdf, entity="cell", bw=8, tiles=None)
print(res.bw, sorted(res.figs))
```

### analyze_mgwr(df: 'pd.DataFrame', outcome: 'str | None' = None, covariates: 'str | Sequence[str] | None' = None, *, gdf: 'gpd.GeoDataFrame', period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, kernel: 'str' = 'bisquare', criterion: 'str' = 'AICc', max_iter: 'int' = 200, tiles: 'str | None' = None, title: 'str | None' = None) -> 'MGWRResult'

Fit a multiscale GWR: every term gets its own spatial scale (bandwidth).

MGWR relaxes GWR's single shared bandwidth: a backfitting algorithm (mgwr's ``Sel_BW(multi=True)`` + ``MGWR``) selects one adaptive bandwidth per term, so each covariate's association can vary at its own spatial scale. Following the MGWR requirement, the outcome and covariates are **always z-standardized**, so local coefficients are on the standardized scale (a note records this). Local significance applies per-term da Silva & Fotheringham corrected alphas at the 5% nominal level.

Parameters
----------
df
    Long panel (or cross section) holding the outcome and covariates per entity.
outcome
    Numeric outcome column. Defaults to the outcome declared via :func:`geometrics.set_roles`.
covariates
    Numeric covariate column(s). Default to the covariates declared via :func:`geometrics.set_roles`.
gdf
    Entity geometry; must carry the same entity-id column as ``df``. Calibration coordinates are the polygon centroids in a metric CRS.
period
    Period to analyze. Defaults to the latest period when ``df`` has a time dimension (a note records this).
entity, time
    Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`.
kernel
    Kernel weighting function: ``"bisquare"`` (default), ``"gaussian"`` or ``"exponential"``.
criterion
    Bandwidth-selection criterion: ``"AICc"`` (default), ``"AIC"``, ``"BIC"`` or ``"CV"``.
max_iter
    Maximum number of multiscale backfitting iterations.
tiles
    MapLibre base-map style for the coefficient maps, or ``None`` (default) for the vector backend (deterministic PNG export).
title
    Title for the residual map; per-term maps append the term label.

Returns
-------
MGWRResult
    Frozen result with the per-entity local frame (``df``), one diverging coefficient map per term (``figs``), the residual map (``fig`` — mgwr does not define a local R² under multiple bandwidths), the summary and bandwidth tables (``gt`` / ``gt_bw``), the per-term ``bw`` / ``adj_alpha`` / ``critical_t`` dicts and the fitted mgwr results (``model_obj``).

Raises
------
KeyError
    If the outcome or a covariate is not a column of ``df``.
TypeError
    If a focal variable is not numeric, or ``df``/``gdf`` have the wrong type.
ValueError
    If roles cannot be resolved, arguments are invalid, or too few complete observations remain after alignment.
Examples
--------
A 3x3 lattice cross-section (bandwidths selected by backfitting):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

from geometrics.gwr import analyze_mgwr

gdf = gpd.GeoDataFrame(
    {"cell": [f"c{i}" for i in range(9)]},
    geometry=[box(78 + i % 3, 20 + i // 3, 79 + i % 3, 21 + i // 3) for i in range(9)],
    crs="EPSG:4326",
)
df = pd.DataFrame(
    {
        "cell": [f"c{i}" for i in range(9)],
        "x1": [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.3, 0.6],
        "y": [0.2, 1.4, 0.8, 1.5, 0.6, 2.0, 1.4, 1.0, 1.9],
    }
)
res = analyze_mgwr(df, "y", ["x1"], gdf=gdf, entity="cell", tiles=None)
print(sorted(res.bw))
```

## learn_*

### learn_spatial_autocorrelation(*, side: 'int' = 12, rho: 'float' = 0.6, n_sims: 'int' = 10, permutations: 'int' = 199, seed: 'int' = 0) -> 'SandboxResult'

See what spatial autocorrelation looks like — and how Moran's I tracks it.

Simulates fields ``y = (I - rho W)^-1 eps`` on a ``side x side`` lattice with row-standardized queen weights, sweeping the planted dependence ρ over a grid that includes the focal ``rho``. The figure pairs the focal simulated field (left) with the Moran's I recovered at each planted ρ (right): at ρ = 0 the statistic sits at its null expectation E[I] = -1/(n-1); as ρ rises, neighbors look alike and I climbs.

Parameters
----------
side
    Lattice side length (n = side²).
rho
    The focal planted spatial dependence, |ρ| < 1. The left panel draws a field at this value and the sweep curve highlights it.
n_sims
    Simulated fields per ρ (the faint markers behind the mean curve).
permutations
    Conditional permutations behind each Moran's I pseudo p-value.
seed
    Random seed.

Returns
-------
SandboxResult
    ``df`` (one row per ρ and simulation), ``fig``, ``summary``, ``topic`` and the focal simulated field in ``data``.
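The simulation this sandbox describes can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the sandbox's code: the helper names are hypothetical, the weights are built by hand instead of through libpysal, and the permutation inference is omitted.

```python
import numpy as np


def queen_weights(side):
    """Hypothetical helper: row-standardized queen contiguity on a side x side lattice."""
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < side and 0 <= cc < side:
                        W[r * side + c, rr * side + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)


def simulate_field(W, rho, rng):
    """Draw y = (I - rho W)^-1 eps with standard-normal eps."""
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - rho * W, rng.normal(size=n))


def morans_i(y, W):
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z the centered field."""
    z = y - y.mean()
    return (y.size / W.sum()) * (z @ W @ z) / (z @ z)


side = 12
W = queen_weights(side)
rng = np.random.default_rng(0)

# Average Moran's I over 20 simulated fields at each planted rho.
mean_i = {
    rho: float(np.mean([morans_i(simulate_field(W, rho, rng), W) for _ in range(20)]))
    for rho in (0.0, 0.4, 0.8)
}
expected_null = -1.0 / (side * side - 1)
print(mean_i, expected_null)
```

At ρ = 0 the average statistic sits near the null expectation, and it rises as the planted dependence grows.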
Examples
--------
The knob variation is the lesson — compare no dependence with strong dependence:

```python
import geometrics as gm

gm.learn_spatial_autocorrelation(rho=0.0).fig
gm.learn_spatial_autocorrelation(rho=0.8).fig
```

### learn_spatial_weights(*, side: 'int' = 12, rho: 'float' = 0.6, k: 'int' = 4, permutations: 'int' = 199, seed: 'int' = 0) -> 'SandboxResult'

See how the choice of spatial weights changes what "neighbors" means.

Simulates one field with dependence ``rho`` under **queen** contiguity (the true graph), then re-tests the same field under queen, rook and k-nearest-neighbor weights. All three detect the clustering, but the statistic shifts with the graph — the substantive conclusion should not hinge on one W, which is why :func:`geometrics.analyze_spatial_model_by_weights` exists.

Parameters
----------
side
    Lattice side length (n = side²).
rho
    Planted spatial dependence under the queen graph, |ρ| < 1.
k
    Neighbors for the k-nearest-neighbor variant.
permutations
    Conditional permutations behind each pseudo p-value.
seed
    Random seed.

Returns
-------
SandboxResult
    ``df`` (one row per weights choice), ``fig``, ``summary``, ``topic`` and the simulated field in ``data``.

Examples
--------
```python
import geometrics as gm

res = gm.learn_spatial_weights(rho=0.6, k=8)
res.df
```

### learn_lisa_clusters(*, side: 'int' = 12, block: 'int' = 3, shift: 'float' = 2.0, alpha: 'float' = 0.05, permutations: 'int' = 999, seed: 'int' = 0) -> 'SandboxResult'

Plant hot and cold spots, then watch LISA find them (and sometimes cry wolf).

Draws iid noise on a ``side x side`` lattice, shifts a ``block x block`` **hot** block up by ``shift`` and a **cold** block down by the same amount, and runs local Moran (LISA) with significance masking at ``alpha``.
The map shows the recovered High-High / Low-Low clusters with the planted blocks outlined; the summary reports the hit rates and the false-positive share — a reminder that with hundreds of local tests, some cells are flagged by chance alone.

Parameters
----------
side
    Lattice side length (n = side²); must fit two disjoint blocks.
block
    Side length of each planted block.
shift
    Size of the planted level shift (in standard deviations of the noise).
alpha
    Significance level masking the cluster labels (``p_sim < alpha``).
permutations
    Conditional permutations behind the local pseudo p-values.
seed
    Random seed (drives both the noise and the permutations).

Returns
-------
SandboxResult
    ``df`` (one row per cell with planted and detected labels), ``fig``, ``summary``, ``topic`` and the raw field in ``data``.

Examples
--------
```python
import geometrics as gm

res = gm.learn_lisa_clusters(shift=2.0)
print(res.summary["sensitivity_hot"], res.summary["false_positive_rate"])
```

### learn_spatial_spillovers(*, side: 'int' = 10, beta: 'float' = 1.0, gamma: 'float' = 0.5, rho: 'float' = 0.5, noise: 'float' = 0.5, n_draws: 'int' = 5000, seed: 'int' = 0) -> 'SandboxResult'

Plant direct and indirect effects, then recover them as LeSage-Pace impacts.

Simulates ``y = (I - rho W)^-1 (beta x + gamma Wx + eps)`` on a lattice, so the true impacts are known in closed form — direct = tr[(I-ρW)⁻¹(βI+γW)]/n and total = (β+γ)/(1-ρ) — then estimates a spatial Durbin model with :func:`geometrics.analyze_spatial_model` and compares its Monte-Carlo impact decomposition against the truth. This is why spatial-model coefficients are read through impacts: β alone is not the marginal effect once feedback via ρ exists.

Parameters
----------
side
    Lattice side length (n = side²).
beta
    Planted own-place coefficient on ``x``.
gamma
    Planted neighbor coefficient on ``Wx`` (drives spillovers alongside ρ).
rho
    Planted spatial-lag parameter, |ρ| < 1.
noise
    Standard deviation of the innovation.
n_draws
    Monte-Carlo draws behind the estimated impact standard errors.
seed
    Random seed (also passed to the estimator's Monte-Carlo step).

Returns
-------
SandboxResult
    ``df`` (direct/indirect/total, estimated vs true), ``fig``, ``summary``, ``topic`` and the simulated cross-section in ``data``.

Examples
--------
```python
import geometrics as gm

res = gm.learn_spatial_spillovers(rho=0.7)
res.df
```

### learn_omitted_spatial_lag(*, side: 'int' = 10, beta: 'float' = 1.0, rho: 'float' = 0.7, noise: 'float' = 0.5, seed: 'int' = 0) -> 'SandboxResult'

Show why ignoring spatial dependence biases OLS — and how SAR repairs it.

Simulates ``y = (I - rho W)^-1 (beta x + eps)``: outcomes spill over through the spatial multiplier, so OLS on ``y ~ x`` (which omits the spatial lag ``Wy``) absorbs the feedback into its slope and overstates β. The ML spatial-lag (SAR) estimator models the dependence and recovers both β and ρ.

Parameters
----------
side
    Lattice side length (n = side²).
beta
    Planted coefficient on ``x``.
rho
    Planted spatial-lag parameter, |ρ| < 1 (drives the OLS bias).
noise
    Standard deviation of the innovation.
seed
    Random seed.

Returns
-------
SandboxResult
    ``df`` (OLS vs SAR vs true coefficient), ``fig``, ``summary``, ``topic`` and the simulated cross-section in ``data``.

Examples
--------
```python
import geometrics as gm

res = gm.learn_omitted_spatial_lag(rho=0.7)
print(res.summary["ols_coef"], res.summary["sar_beta"])
```

### learn_beta_convergence(*, n_units: 'int' = 60, n_periods: 'int' = 6, convergence_rate: 'float' = 0.02, growth_const: 'float' = 0.05, noise: 'float' = 0.005, seed: 'int' = 0) -> 'SandboxResult'

Plant a convergence rate, then watch the growth-on-initial regression find it.

Simulates ``log y_it = log y_i0 + t (a - b log y_i0) + noise`` so annualized growth is exactly ``a - b log y_i0`` (up to noise) and the β-convergence slope is ``-b`` by construction.
The figure is the real :func:`geometrics.analyze_beta_convergence` scatter with the planted-truth line drawn on top; the summary compares estimated and true β, speed and half-life. Parameters ---------- n_units Number of entities. n_periods Number of periods (the horizon is ``n_periods - 1``). convergence_rate The planted ``b`` > 0: poorer units grow faster by ``b`` per unit of initial log level, so the regression slope is ``-b``. growth_const The common growth constant ``a``. noise Standard deviation of the per-period log noise. seed Random seed. Returns ------- SandboxResult ``df`` (β / speed / half-life, estimated vs true), ``fig``, ``summary``, ``topic`` and the simulated panel in ``data``. Examples -------- ```python import geometrics as gm res = gm.learn_beta_convergence(convergence_rate=0.03) print(res.summary["est_beta"], res.summary["true_beta"]) ``` ### learn_sigma_convergence(*, n_units: 'int' = 60, n_periods: 'int' = 21, rho: 'float' = 0.93, noise: 'float' = 0.0, seed: 'int' = 0) -> 'SandboxResult' Plant a shrinking dispersion path, then watch the σ trend recover it. Simulates ``log y_it = mu + (log y_i0 - mu) rho^t`` so the cross-sectional standard deviation of ``log y`` contracts geometrically — ``sigma_t = sigma_0 rho^t`` — and the log-dispersion trend of the standard deviation is exactly ``ln(rho)`` per period. :func:`geometrics.analyze_sigma_convergence` fits that trend (plus Gini and CV variants, which track it only approximately because they are computed on levels). Parameters ---------- n_units Number of entities. n_periods Number of periods. rho Per-period contraction factor of deviations, 0 < ρ < 1 (smaller = faster σ-convergence; the planted trend slope is ``ln ρ``). noise Standard deviation of optional per-period log noise (0 keeps the geometric path exact). seed Random seed. Returns ------- SandboxResult ``df`` (trend slope per measure vs the planted ``ln ρ``), ``fig``, ``summary``, ``topic`` and the simulated panel in ``data``. 
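
To see why the planted trend slope is ``ln(rho)``, the geometric path can be checked by hand. This is a small standard-library sketch under stated assumptions (a hypothetical helper `sigma_path`, normally distributed initial deviations, and an arbitrary long-run level `mu`); it does not call geometrics. Every period-to-period change in the log of the cross-sectional standard deviation should equal ``ln(rho)``.

```python
import math
import random
import statistics


def sigma_path(n_units=60, n_periods=21, rho=0.93, mu=9.0, seed=0):
    """Cross-sectional std of log y_it = mu + (log y_i0 - mu) * rho**t, per period."""
    rng = random.Random(seed)
    # Assumption of this sketch: initial deviations from mu are N(0, 0.5^2).
    dev0 = [rng.gauss(0.0, 0.5) for _ in range(n_units)]
    sigmas = []
    for t in range(n_periods):
        log_y = [mu + d * rho**t for d in dev0]
        sigmas.append(statistics.pstdev(log_y))
    return sigmas


sigmas = sigma_path(rho=0.9)
# Successive log differences of sigma_t; each one should equal ln(rho).
slopes = [math.log(sigmas[t + 1]) - math.log(sigmas[t]) for t in range(len(sigmas) - 1)]
print(round(slopes[0], 6), round(math.log(0.9), 6))
```

With ``noise > 0`` the equality holds only approximately, which is why the sandbox keeps ``noise=0.0`` as its default.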
Examples -------- ```python import geometrics as gm res = gm.learn_sigma_convergence(rho=0.9) print(res.summary["std_slope"], res.summary["true_slope"]) ``` ### learn_convergence_clubs(*, n_per_club: 'int' = 15, levels: 'tuple[float, ...]' = (10.0, 9.0), n_periods: 'int' = 35, rho: 'float' = 0.9, spread: 'float' = 0.4, noise: 'float' = 0.002, seed: 'int' = 0) -> 'SandboxResult' Plant convergence clubs, then watch the Phillips-Sul algorithm find them. Each club ``k`` converges to its own level: unit ``j`` follows ``x_jt = levels[k] + dev_j rho^t + noise`` with ``dev_j ~ U(-spread, spread)``, so within a club the transition paths contract while the between-club gaps persist — global convergence should be rejected and the clustering should recover the planted groups. The summary reports the detected club count and the assignment accuracy. Parameters ---------- n_per_club Units per planted club. levels The clubs' long-run (log) levels — one entry per club, at least two. n_periods Number of periods (the log(t) test needs a long panel). rho Per-period contraction of within-club deviations, 0 < ρ < 1. spread Half-width of the uniform initial deviations around each club level. noise Standard deviation of the per-period noise. seed Random seed. Returns ------- SandboxResult ``df`` (unit, planted club, detected club), ``fig`` (the within-club transition paths from the real estimator), ``summary``, ``topic`` and the simulated panel in ``data``. Examples -------- ```python import geometrics as gm res = gm.learn_convergence_clubs(levels=(10.0, 9.2, 8.5)) print(res.summary["detected_clubs"], res.summary["accuracy"]) ``` ### learn_markov_chains(*, n_units: 'int' = 100, n_periods: 'int' = 30, p: 'tuple[tuple[float, ...], ...]' = ((0.8, 0.15, 0.05), (0.1, 0.8, 0.1), (0.05, 0.15, 0.8)), seed: 'int' = 0) -> 'SandboxResult' Plant a transition matrix, then watch the estimated chain recover it. 
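
The plant-then-recover idea can be sketched with the standard library alone. The block below is an illustration, not the sandbox's code: `simulate_chain` and `estimate_transitions` are hypothetical helpers, the matrix is the default ``p`` from the signature above, and starting states are simply cycled across units (an assumption of this sketch). Transition probabilities are estimated as row-normalized transition counts.

```python
import random

P = ((0.8, 0.15, 0.05), (0.1, 0.8, 0.1), (0.05, 0.15, 0.8))


def simulate_chain(p, n_periods, rng, start=0):
    """Draw one state path of length n_periods from transition matrix p."""
    state = start
    path = [state]
    for _ in range(n_periods - 1):
        state = rng.choices(range(len(p)), weights=p[state])[0]
        path.append(state)
    return path


def estimate_transitions(paths, k):
    """Row-normalized counts of observed one-step transitions."""
    counts = [[0] * k for _ in range(k)]
    for path in paths:
        for i, j in zip(path[:-1], path[1:]):
            counts[i][j] += 1
    return [[c / sum(row) for c in row] for row in counts]


rng = random.Random(0)
paths = [simulate_chain(P, 30, rng, start=u % 3) for u in range(200)]
P_hat = estimate_transitions(paths, k=3)
max_abs_error = max(abs(P_hat[i][j] - P[i][j]) for i in range(3) for j in range(3))
print(round(max_abs_error, 3))
```

With 200 chains of 30 periods there are 200 x 29 = 5800 observed transitions, so the largest cell error should be small; it shrinks further as ``n_units`` or ``n_periods`` grows.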
Simulates ``n_units`` independent chains from the planted row-stochastic ``p`` (started at its ergodic distribution), maps states to well-separated continuous values, and estimates the chain with :func:`geometrics.analyze_markov_transitions` using explicit class bins so the discretization is exact. The figure compares every estimated transition probability with its planted value; the summary reports the largest cell error and the ergodic-distribution error. Requires the ``dynamics`` extra (``pip install "geometrics[dynamics]"``). Parameters ---------- n_units Number of independent chains (entities). n_periods Chain length; ``n_units * (n_periods - 1)`` transitions are observed. p The planted k x k row-stochastic transition matrix (k >= 2). seed Random seed. Returns ------- SandboxResult ``df`` (one row per matrix cell, planted vs estimated), ``fig``, ``summary``, ``topic`` and the simulated panel in ``data``. Examples -------- ```python import geometrics as gm res = gm.learn_markov_chains(n_units=200) print(res.summary["max_abs_error"]) ``` ### learn_spatial_markov(*, side: 'int' = 10, n_periods: 'int' = 30, base_move: 'float' = 0.1, contextual: 'float' = 0.25, seed: 'int' = 0) -> 'SandboxResult' Plant neighbor-dependent mobility, then watch the spatial Markov test flag it. Simulates three-state dynamics on a lattice where each unit moves **toward its neighbors' average state** with probability ``base_move + contextual`` but away with only ``base_move`` — so transition probabilities genuinely depend on the spatial context (set ``contextual=0`` to restore homogeneity). Rey's spatial Markov (:func:`geometrics.analyze_spatial_markov`) conditions the transition matrix on the neighbors' class; its LR test should reject homogeneity and the upward-move probability should rise with richer neighbors. Requires the ``dynamics`` extra (``pip install "geometrics[dynamics]"``). Parameters ---------- side Lattice side length (n = side²). n_periods Number of simulated periods. 
base_move Baseline per-period probability of moving a state in either direction. contextual Extra probability of moving *toward* the neighbors' average state — the planted spatial effect the test should detect. seed Random seed. Returns ------- SandboxResult ``df`` (upward/stay/downward probabilities of the middle state by neighbor class), ``fig``, ``summary``, ``topic`` and the simulated panel in ``data``. Examples -------- ```python import geometrics as gm res = gm.learn_spatial_markov(contextual=0.25) print(res.summary["lr_p"], res.summary["contextual_gap_est"]) ``` ### learn_theil_decomposition(*, n_groups: 'int' = 3, n_per_group: 'int' = 40, gaps: 'tuple[float, ...]' = (0.0, 0.25, 0.5, 0.75, 1.0), within_sd: 'float' = 0.5, jitter: 'float' = 0.0, seed: 'int' = 0) -> 'SandboxResult' Plant a between/within inequality split, then watch Theil decompose it. Builds group log-income distributions on a deterministic quantile grid — group ``g`` at gap level ``tau`` is centered at ``g * tau`` with within-group spread ``within_sd`` — and sweeps ``gaps`` as successive "periods" of a panel, so one call to :func:`geometrics.analyze_theil_decomposition` traces how the between-group share rises with the planted gap. The truth is computed with an independent numpy implementation of the Theil-T decomposition. Parameters ---------- n_groups Number of groups. n_per_group Units per group. gaps The planted between-group log gaps, swept as periods (at least two values). within_sd Within-group standard deviation of log income (drives the within share). jitter Standard deviation of optional extra log noise (0 keeps the grid exact). seed Random seed (only used when ``jitter > 0``). Returns ------- SandboxResult ``df`` (per gap: Theil, between, within, estimated and true between share), ``fig``, ``summary``, ``topic`` and the simulated panel in ``data``. 
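
For readers who want to check the decomposition by hand, here is a minimal standard-library sketch of the Theil-T split. It is not the library's reference implementation: `theil_t`, `theil_decompose` and `make_groups` are hypothetical names, and the normal-quantile grid is one assumed way to build a deterministic within-group distribution. It uses the standard identities T = (1/N) Σ (y_i/ȳ) ln(y_i/ȳ), between = Σ_g s_g ln(ȳ_g/ȳ) and within = Σ_g s_g T_g, where s_g is group g's share of total income.

```python
import math
from statistics import NormalDist


def theil_t(y):
    """Theil-T index of a list of strictly positive values."""
    mean = sum(y) / len(y)
    return sum((v / mean) * math.log(v / mean) for v in y) / len(y)


def theil_decompose(groups):
    """Return (total, between, within) for a list of groups of positive incomes."""
    all_y = [v for g in groups for v in g]
    total_income = sum(all_y)
    grand_mean = total_income / len(all_y)
    between = 0.0
    within = 0.0
    for g in groups:
        share = sum(g) / total_income          # income share of the group
        group_mean = sum(g) / len(g)
        between += share * math.log(group_mean / grand_mean)
        within += share * theil_t(g)
    return theil_t(all_y), between, within


def make_groups(n_groups=3, n_per_group=40, gap=0.5, within_sd=0.5):
    """Group g has log incomes centered at g * gap on a normal-quantile grid."""
    z = [NormalDist().inv_cdf((i + 0.5) / n_per_group) for i in range(n_per_group)]
    return [[math.exp(g * gap + within_sd * zi) for zi in z] for g in range(n_groups)]


for gap in (0.0, 0.5, 1.0):
    total, between, within = theil_decompose(make_groups(gap=gap))
    print(gap, round(between / total, 3))
```

Because Theil-T is scale invariant, shifting a group's log incomes leaves its own index unchanged, so in this construction the within component stays fixed and only the between component grows with the gap.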
Examples -------- ```python import geometrics as gm res = gm.learn_theil_decomposition(within_sd=0.3) res.df[["gap", "between_share_est", "between_share_true"]] ``` ## utilities ### read_gdf(source: 'gpd.GeoDataFrame | str | Path', *, entity: 'str | None' = None, entity_name: 'str | None' = None, layer: 'str | None' = None, crs: 'Any | None' = None, make_valid: 'bool' = True) -> 'gpd.GeoDataFrame' Read user geometry into a validated GeoDataFrame (the geometry entry point). Accepts a GeoDataFrame or a path to a shapefile (plain or zipped), GeoJSON, or GeoPackage, and enforces the geometry contract every spatial function relies on: a declared CRS, no empty/missing geometries (invalid ones repaired), and a unique entity id. The resolved ids are stored on ``gdf.attrs["geometrics_geo"]`` so later calls can omit ``entity=``. Parameters ---------- source A :class:`geopandas.GeoDataFrame` (copied, never mutated) or a path to a ``.shp`` / ``.zip`` / ``.geojson`` / ``.json`` / ``.gpkg`` file. entity Name of the entity (unit) id column. When ``None`` it is resolved automatically: the sole non-geometry column if there is exactly one, else a column whose name matches the entity-id hints (``id`` / ``code`` / ``region`` / ...). entity_name Optional column holding a human-readable label for each unit (e.g. a district name next to a census code). layer Layer name for multi-layer sources (GeoPackage), forwarded to :func:`geopandas.read_file`. crs Coordinate reference system to *declare* (``set_crs``) when the source carries none. ``read_gdf`` never reprojects — use :func:`ensure_metric_crs` for that. make_valid Repair invalid geometries with :func:`shapely.make_valid` (a :class:`~geometrics.GeometricsWarning` reports how many were repaired). Returns ------- geopandas.GeoDataFrame The validated geometry, with ``attrs["geometrics_geo"]`` recording the resolved ``entity`` (and ``entity_name``). Raises ------ TypeError If ``source`` is neither a GeoDataFrame nor a path. 
KeyError If an explicit ``entity`` / ``entity_name`` column is absent. ValueError If the format is unsupported, no CRS is declared and ``crs`` is ``None``, the entity cannot be resolved, ids are duplicated, or geometries are empty/missing. Examples -------- Validate an in-memory GeoDataFrame (the entity id is auto-resolved): ```python import geopandas as gpd from shapely.geometry import box from geometrics._geo import read_gdf gdf = gpd.GeoDataFrame( {"region": ["A", "B"]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)], crs="EPSG:4326", ) validated = read_gdf(gdf) validated.attrs["geometrics_geo"]["entity"] ``` ### make_weights(gdf: 'gpd.GeoDataFrame', *, method: 'str' = 'queen', k: 'int' = 6, threshold: 'float | None' = None, power: 'float' = 1.0, row_standardize: 'bool' = True, attach_islands: 'bool' = True, entity: 'str | None' = None, crs: 'Any' = 'auto') -> 'W' Build a spatial weights matrix from geometry with the library's conventions. The weights are keyed by the gdf's entity ids (so results are auditable by unit), contiguity islands are attached to their nearest neighbor, rows are standardized to sum to one, and a machine- and human-readable description is stored on ``w.geometrics_meta`` (``spec`` is the one-liner every spatial result records as ``w_spec``). Parameters ---------- gdf Geometry frame (see :func:`geometrics.read_gdf`); its entity column supplies the weight ids. method ``"queen"`` / ``"rook"`` (shared-boundary contiguity), ``"knn"`` (k-nearest-neighbor centroids), ``"distance_band"`` (binary within a radius) or ``"inverse_distance"`` (:math:`1/d^{p}` within a radius). k Number of neighbors for ``method="knn"``. threshold Radius for the distance-based methods. ``None`` uses the smallest distance that leaves no unit isolated (``min_threshold_distance``). power Distance-decay exponent :math:`p` for ``method="inverse_distance"``. row_standardize Standardize each row of ``W`` to sum to one (the convention for spatial lags). 
attach_islands For contiguity methods, connect units with no shared-boundary neighbor to their nearest neighbor (a :class:`~geometrics.GeometricsWarning` names them). entity Entity id column of ``gdf``; resolved automatically when ``None``. crs CRS handling for centroid distances (knn / distance methods): ``"auto"`` projects to an estimated UTM CRS, ``None`` keeps the raw coordinates (reproducing lat/lon-centroid k-NN analyses), anything else is passed to ``to_crs``. Returns ------- libpysal.weights.W The weights, with ``w.geometrics_meta`` recording ``method``, ``k``, ``threshold``, ``power``, ``crs``, ``islands_attached``, ``row_standardized``, ``n`` and the human-readable ``spec``. Raises ------ ValueError For an unknown ``method``, duplicate entity ids, or an out-of-range ``k``. Examples -------- Queen contiguity on a two-cell map (each cell has one neighbor): ```python import geopandas as gpd from shapely.geometry import box from geometrics.weights import make_weights gdf = gpd.GeoDataFrame( {"region": ["A", "B"]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)], crs="EPSG:4326", ) w = make_weights(gdf, method="queen") (w.neighbors["A"], w.geometrics_meta["spec"]) ``` ### growth_cross_section(df: 'pd.DataFrame', var: 'str', controls: 'Sequence[str] | str | None' = None, *, entity: 'str | None' = None, time: 'str | None' = None, start: 'float | None' = None, end: 'float | None' = None, annualize: 'bool' = True) -> 'pd.DataFrame' Build the per-unit growth cross-section a convergence analysis starts from. For each unit observed at both endpoints of a common window, the function records the ``initial`` and ``final`` level of ``var`` and the log growth between them: ``growth = (log(final) - log(initial)) / T`` when ``annualize`` (the average per-period log growth over the horizon ``T = end - start``), or the raw log-difference otherwise. Controls are attached at their **initial-period** values. Parameters ---------- df Long panel data frame. 
var Numeric, strictly positive variable in **levels** (e.g. GDP per capita); the log is taken internally. controls Optional column name(s) whose initial-period values are carried into the cross-section (the conditional-convergence controls). entity, time Panel identifiers. Default to those declared via :func:`geometrics.set_panel`. start, end First and last period of the growth window. Default to the earliest and latest period in the panel; only units observed at **both** endpoints are kept. annualize Divide the log-difference by the horizon ``T`` (default). ``False`` returns the total log growth over the window. Returns ------- pandas.DataFrame One row per unit with columns ``entity``, ``initial``, ``final``, ``growth`` and one column per control, with the panel entity re-declared on ``df.attrs`` (:func:`geometrics.set_panel`). Raises ------ KeyError If ``var`` or a control is not a column of ``df``. TypeError If ``var`` or a control is not numeric. ValueError If the window is empty or inverted, no unit spans it, or ``var`` has non-positive endpoint values (the log is undefined). Examples -------- Two units over a 10-period window; the low-income unit grows faster: ```python import pandas as pd from geometrics.convergence import growth_cross_section df = pd.DataFrame( { "region": ["A", "A", "B", "B"], "year": [2000, 2010, 2000, 2010], "gdppc": [1000.0, 2000.0, 4000.0, 5000.0], } ) cs = growth_cross_section(df, "gdppc", entity="region", time="year") cs[["region", "initial", "final", "growth"]] ``` ### set_panel(df: 'pd.DataFrame', *, entity: 'str | None' = None, time: 'str | None' = None, entity_name: 'str | None' = None) -> 'pd.DataFrame' Declare the panel's ``entity``, ``time`` (and optional ``entity_name``) columns on ``df``. The ids are stored under ``df.attrs["geometrics_panel"]`` so that subsequent ``explore_*`` / ``analyze_*`` calls can omit them. Explicit arguments to those functions still take precedence. 
Parameters ---------- df The panel data frame (modified in place — its ``attrs`` are updated and the same object is returned). entity Name of the cross-sectional (unit) identifier column, or ``None`` to leave it unset. time Name of the time identifier column, or ``None`` to leave it unset. entity_name Name of a column holding a human-readable label for each unit (e.g. ``"country"`` when ``entity`` is an ISO code). When declared, figures render units as ``Name (id)``. ``None`` leaves it unset. Returns ------- pandas.DataFrame The same ``df``, with ``df.attrs["geometrics_panel"]`` updated. Examples -------- Declare the panel once, then explore without repeating the ids: ```python import pandas as pd import geometrics as gm df = pd.DataFrame( { "region": ["A", "A", "B", "B"], "year": [2000, 2001, 2000, 2001], "gdp_pc": [1.0, 1.1, 2.0, 2.1], } ) df = gm.set_panel(df, entity="region", time="year") ``` ### resolve_panel(df: 'pd.DataFrame', entity: 'str | None' = None, time: 'str | None' = None, *, require_entity: 'bool' = False, require_time: 'bool' = False) -> 'tuple[str | None, str | None]' Resolve the ``(entity, time)`` ids for ``df``: explicit args win, else ``df.attrs``. Parameters ---------- df The panel data frame. entity, time Explicit identifiers. When ``None``, fall back to the values stored by :func:`set_panel` (if any). require_entity, require_time When ``True``, raise :class:`ValueError` if the corresponding id cannot be resolved. Returns ------- tuple of (str or None, str or None) The resolved ``(entity, time)`` column names. Raises ------ ValueError If a resolved column is not present in ``df``, or a required id is unresolved. ### set_labels(df: 'pd.DataFrame', labels: 'Mapping[str, str] | pd.DataFrame | None' = None, *, set_panel: 'bool' = False) -> 'pd.DataFrame' Declare human-readable variable labels on ``df`` and return it. 
The labels are stored under ``df.attrs["geometrics_labels"]`` so that subsequent ``explore_*`` / ``analyze_*`` calls can title axes, legends and table headers with them. Explicit ``label=`` arguments to those functions still take precedence. Parameters ---------- df The data frame (modified in place — its ``attrs`` are updated and the same object is returned). labels Either a ``{column_name: label}`` mapping, or a data-dictionary frame (``df_dict``) whose ``label`` / ``var_def`` columns supply the labels. ``None`` leaves the stored mapping unchanged. set_panel When ``True`` and ``labels`` is a ``df_dict``, also declare the structural metadata it carries: the panel (``entity`` / ``time``, plus an ``entity_name`` column tagged ``role == "entity_name"``) via :func:`~geometrics.set_panel`, and the analytical roles (``role`` of ``outcome`` / ``covariate``) via :func:`~geometrics.set_roles`. Returns ------- pandas.DataFrame The same ``df``, with ``df.attrs["geometrics_labels"]`` updated. Examples -------- Declare labels once, then explore with readable titles: ```python import pandas as pd import geometrics as gm df = pd.DataFrame({"region": ["A", "B"], "gini": [0.42, 0.35]}) df = gm.set_labels(df, {"gini": "Regional inequality (Gini)"}) ``` ### resolve_label(df: 'pd.DataFrame', name: 'str', *, label: 'str | None' = None) -> 'str' Resolve the display label for ``name``: explicit ``label`` wins, else ``attrs``, else name. Never raises on an unknown ``name`` (regression terms such as ``log_gdp_pc_sq`` are not columns), so it is safe to call on any axis variable or model term. Parameters ---------- df The data frame whose ``attrs`` may hold the stored labels. name The column or term name to label. label An explicit override; when given it is returned unchanged. Returns ------- str The resolved label. 
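
The precedence described for ``resolve_label`` (explicit ``label``, then stored labels, then the name itself) can be sketched in a few lines. This is an illustration only: `resolve_label_sketch` is a hypothetical function, a plain dict stands in for ``df.attrs``, and the assumption is that the stored labels form a simple ``{name: label}`` mapping under the ``"geometrics_labels"`` key.

```python
from typing import Mapping, Optional


def resolve_label_sketch(attrs: Mapping, name: str, label: Optional[str] = None) -> str:
    """Explicit label wins, else the stored label, else the name itself."""
    if label is not None:
        return label
    stored = attrs.get("geometrics_labels", {})
    # Unknown names (e.g. regression terms) fall back to the name, never raise.
    return stored.get(name, name)


attrs = {"geometrics_labels": {"gini": "Regional inequality (Gini)"}}
print(resolve_label_sketch(attrs, "gini"))                # Regional inequality (Gini)
print(resolve_label_sketch(attrs, "gini", label="Gini"))  # Gini
print(resolve_label_sketch(attrs, "log_gdp_pc_sq"))       # log_gdp_pc_sq
```

Falling back to the name rather than raising is what makes the lookup safe for model terms that are not columns of the frame.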
### set_roles(df: 'pd.DataFrame', *, outcome: 'str | None' = None, covariates: 'str | Sequence[str] | None' = None) -> 'pd.DataFrame' Declare the main ``outcome`` and ``covariates`` on ``df`` and return it. The roles are stored under ``df.attrs["geometrics_roles"]`` so that subsequent ``explore_*`` / ``analyze_*`` calls (and the no-code apps) can default to them when their primary variable argument is omitted. Explicit arguments to those functions still take precedence. Parameters ---------- df The data frame (modified in place — its ``attrs`` are updated and the same object is returned). outcome Name of the main outcome (dependent) variable, or ``None`` to leave it unset. covariates Name(s) of the main covariate(s) — a single column or a sequence — or ``None`` to leave them unset. Returns ------- pandas.DataFrame The same ``df``, with ``df.attrs["geometrics_roles"]`` updated. Examples -------- Declare the key variables once, then explore/analyze without repeating them: ```python import pandas as pd import geometrics as gm df = pd.DataFrame( { "region": ["A", "A", "B", "B"], "year": [2000, 2001, 2000, 2001], "gini": [0.42, 0.41, 0.35, 0.34], "log_gdp_pc": [8.1, 8.2, 9.0, 9.1], } ) df = gm.set_panel(df, entity="region", time="year") df = gm.set_roles(df, outcome="gini", covariates=["log_gdp_pc"]) ``` ### build_data_dict(df: 'pd.DataFrame', *, entity: 'str | Sequence[str] | None' = None, time: 'str | None' = None, factor_cutoff: 'int' = 10) -> 'pd.DataFrame' Infer a best-guess data dictionary (``df_dict``) for ``df``. Produces one row per column with an inferred ``type`` and a humanized ``label``, ready to pass to :func:`~geometrics.set_labels`. 
Column-name hints and dtypes drive the guess: a column is typed ``entity`` (name hints like ``country`` / ``iso`` / ``id``, or — failing that — the column that uniquely keys the rows together with the time id), ``time`` (name hints like ``year`` / ``date``, a datetime dtype, or an integer column in the calendar-year range), ``logical`` (boolean or two-valued), ``factor`` (categorical/object, or numeric with at most ``factor_cutoff`` distinct values), else ``numeric``. A best-guess ``role`` is also filled: a text column that is constant within the entity and ~1:1 with it (a readable label for the unit, e.g. a country name beside an ISO code) is tagged ``entity_name``; all other rows are left blank. The analytical roles ``outcome`` / ``covariate`` are never guessed — mark them yourself (in the dictionary or via :func:`~geometrics.set_roles`). Parameters ---------- df The data frame to describe. entity Explicit entity (unit) identifier column name(s); when given, these win over detection (and are validated against ``df``). time Explicit time identifier column name; when given, it wins over detection. factor_cutoff Numeric columns with at most this many distinct values are typed ``factor``. Returns ------- pandas.DataFrame A dictionary frame with columns ``var_name``, ``var_def``, ``label``, ``type``, ``role`` and ``can_be_na`` (one row per column of ``df``, in column order). Examples -------- Build a dictionary for any frame, then attach labels + declare the panel in one step: ```python import pandas as pd import geometrics as gm df = pd.DataFrame( { "region": ["A", "A", "B", "B"], "year": [2000, 2001, 2000, 2001], "gdp_pc": [1.0, 1.1, 2.0, 2.1], } ) ddict = gm.build_data_dict(df) df = gm.set_labels(df, ddict, set_panel=True) ddict.head() ``` ### set_palette(mode: 'str') -> 'None' Switch the global geometrics color palette. 
The palette is **process-global**: it affects every subsequent geometrics figure — grouped series colors (via :func:`color_for`), the heatmap / scatter color scales, and the registered Plotly template's colorway. The default look is unchanged until you opt in, so existing figures keep their colors unless you call this. Parameters ---------- mode ``"default"`` (the Tableau 10 palette, today's look) or ``"colorblind"`` (the Okabe-Ito colorblind-safe qualitative palette plus colorblind-safe sequential and diverging scales). Raises ------ ValueError If ``mode`` is not a known palette. Examples -------- Switch to the colorblind-safe palette, then restore the default: ```python import geometrics as gm gm.set_palette("colorblind") # every later figure uses the Okabe-Ito palette print(gm.get_palette()) gm.set_palette("default") # restore the Tableau 10 default ``` ### get_palette() -> 'str' Return the name of the currently active palette (``"default"`` / ``"colorblind"``). ### explain(topic: 'str', *, lang: 'str' = 'en') -> 'Explainer' Return the :class:`Explainer` for a method or concept. Parameters ---------- topic A topic key or alias (see :func:`list_topics`). lang Language code. Only ``"en"`` ships today; the parameter is reserved so that adding translations later is non-breaking. Returns ------- Explainer The matching explainer. Raises ------ KeyError If ``topic`` is unknown; the message lists the available topics. Examples -------- ```python from geometrics.pedagogy import explain explain("beta_convergence").title ``` ### list_topics() -> 'list[str]' Return the sorted list of canonical topic keys (for app menus and docs). ## geometrics.data ### load_india() -> 'tuple[gpd.GeoDataFrame, pd.DataFrame, pd.DataFrame]' Load the India district (n=520) nighttime lights case study. Downloads (or reads from the local cache) the source files of quarcs-lab/project2025s-py, pinned to a commit, and reshapes them into the three geometrics inputs. 
Returns ------- gdf : geopandas.GeoDataFrame 520 district geometries with columns ``["statedist", "geometry"]``, CRS EPSG:4326. df : pandas.DataFrame Long panel of 3120 rows (520 districts x 6 years: 1996, 1999, 2000, 2004, 2005, 2010). Year-varying nighttime luminosity (``ntl_rural``, ``ntl_urban``, ``ntl_total``), the paper-replication columns (``ntl_pc_1996``, ``log_ntl_pc_1996``, ``growth_ntl_pc_9610``; carried verbatim from the source), two population columns, and the 16 conditional controls (repeated per year). Sorted by ``statedist`` then ``year``. df_dict : pandas.DataFrame Data dictionary with one row per ``df`` column, in ``df`` column order, with columns ``var_name, var_def, label, type, role, can_be_na``. Raises ------ GeometricsDataError If a source file cannot be downloaded or fails hash verification. See Also -------- load_india_raw : The same source files without any reshaping. load_india_states : State-level (n=32) companion dataset. Examples -------- >>> from geometrics.data import load_india >>> gdf, df, df_dict = load_india() # doctest: +SKIP >>> df.shape # doctest: +SKIP (3120, 28) ### load_india_states() -> 'tuple[gpd.GeoDataFrame, pd.DataFrame, pd.DataFrame]' Load the India states (n=32) nighttime lights cross-section for 1992. Regional sums of corrected DMSP-OLS nighttime lights (CCNL v1) over GlobPOP gridded population, computed in Google Earth Engine by the authors of quarcs-lab/project2025s-py. Returns ------- gdf : geopandas.GeoDataFrame 32 state/union territory geometries with columns ``["region", "geometry"]``, CRS EPSG:4326. df : pandas.DataFrame 32 rows with columns ``["region", "year", "ntl_sum", "pop", "ntl_pc", "log_ntl_pc"]`` (``year`` is always 1992). df_dict : pandas.DataFrame Data dictionary with one row per ``df`` column, in ``df`` column order, with columns ``var_name, var_def, label, type, role, can_be_na``. Raises ------ GeometricsDataError If a source file cannot be downloaded or fails hash verification. 
See Also -------- load_india : District-level (n=520) panel case study. ### load_india_raw() -> 'tuple[gpd.GeoDataFrame, pd.DataFrame]' Load the untouched source files of the India district case study. Returns ------- gdf : geopandas.GeoDataFrame india520.geojson with all of its properties (25 columns including ``geometry``), CRS EPSG:4326. df : pandas.DataFrame india520.dta read as-is: 520 rows, 341 columns. Raises ------ GeometricsDataError If a source file cannot be downloaded or fails hash verification. See Also -------- load_india : The reshaped ``(gdf, df, df_dict)`` version of this data. ### load_bolivia() -> 'tuple[gpd.GeoDataFrame, pd.DataFrame, pd.DataFrame]' Load the Bolivia province (ADM2, n=112) PWT-anchored GDP panel. Subnational GDP for 2012--2022 derived from the 0.25-degree gridded estimates of Rossi-Hansberg & Zhang (2026) under their most aggressive low-population-density censoring (``0_05``), proportionally rescaled so Bolivian national totals equal Penn World Table 11.0 (``rgdpo`` and ``pop``), and aggregated to GADM 4.10 provinces. GDP and population are therefore in interpretable 2021 PPP US$ units and the relative spatial pattern of the underlying model is preserved exactly. The product is derived data; cite the underlying GDP estimates, the national benchmark, and the boundaries: - Rossi-Hansberg, E., & Zhang, J. (2026). Local GDP estimates around the world. *Journal of Urban Economics*, 154, 103871. - Feenstra, R. C., Inklaar, R., & Timmer, M. P. (2015). The Next Generation of the Penn World Table. *American Economic Review*, 105(10), 3150-3182. Data: Penn World Table 11.0. - GADM (2022). Database of Global Administrative Areas, version 4.10. Full methodological documentation (the proportional rescaling to PWT national totals, the 0_05 low-density censoring, and per-level dictionaries) lives in ``datasets/BOL-005popAdj-PWTscaled/README.md`` of the geometrics repository. 
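
As a rough illustration of what "proportionally rescaled" means, the sketch below multiplies every cell by one common factor so that the cells sum to a benchmark total. The numbers and the helper `rescale_to_total` are hypothetical, and the actual pipeline is the one documented in the README cited above; the point is only that a common factor changes the total while leaving each cell's share untouched.

```python
def rescale_to_total(values, target_total):
    """Scale all values by one factor so that they sum to target_total."""
    factor = target_total / sum(values)
    return [v * factor for v in values]


cell_gdp = [4.0, 1.0, 3.0, 2.0]  # hypothetical raw cell estimates
national_total = 25.0            # hypothetical national benchmark
scaled = rescale_to_total(cell_gdp, national_total)

shares_before = [v / sum(cell_gdp) for v in cell_gdp]
shares_after = [v / sum(scaled) for v in scaled]
print(scaled)         # [10.0, 2.5, 7.5, 5.0]
print(shares_before)  # [0.4, 0.1, 0.3, 0.2]
```

This share-preserving property is why the description can say the relative spatial pattern of the underlying estimates is preserved.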
Returns ------- gdf : geopandas.GeoDataFrame 112 province geometries with columns ``["gid", "geometry"]``, CRS EPSG:4326. Five provinces (``BOL.2.1_2``, ``BOL.2.8_2``, ``BOL.2.11_2``, ``BOL.2.13_2``, ``BOL.5.16_2``) have **no panel rows**: all of their grid cells are censored at the 0_05 threshold. geometrics' alignment warns about them, which is expected. df : pandas.DataFrame Balanced panel of 1177 rows (107 provinces x 11 years, 2012--2022). Key variables: ``gdp_pwt`` (millions of 2021 PPP US$), ``pop_pwt`` (millions of persons), ``gdppc`` (2021 PPP US$ per person) and ``ln_gdppc``, plus provenance/scaling columns documented in the dictionary. Sorted by ``gid`` then ``year``. df_dict : pandas.DataFrame Data dictionary with one row per ``df`` column, in ``df`` column order (``gid`` is the entity, ``name`` the entity name, ``year`` the time id). Raises ------ GeometricsDataError If a source file cannot be downloaded or fails hash verification. See Also -------- load_bolivia_departments : Department-level (ADM1, n=9) version. load_bolivia_grid : The underlying 0.25-degree grid cells (n=1603). load_bolivia_raw : Untouched files of any level, including ADM0. Examples -------- >>> from geometrics.data import load_bolivia >>> gdf, df, df_dict = load_bolivia() # doctest: +SKIP >>> df.shape # doctest: +SKIP (1177, 21) ### load_bolivia_departments() -> 'tuple[gpd.GeoDataFrame, pd.DataFrame, pd.DataFrame]' Load the Bolivia department (ADM1, n=9) PWT-anchored GDP panel. The department-level aggregation of the same product as :func:`load_bolivia`: Rossi-Hansberg & Zhang (2026) gridded GDP under 0_05 censoring, rescaled to Penn World Table 11.0 national totals, on GADM 4.10 boundaries, 2012--2022, in 2021 PPP US$. The product is derived data; cite the underlying GDP estimates, the national benchmark, and the boundaries: - Rossi-Hansberg, E., & Zhang, J. (2026). Local GDP estimates around the world. *Journal of Urban Economics*, 154, 103871. - Feenstra, R. 
C., Inklaar, R., & Timmer, M. P. (2015). The Next Generation of the Penn World Table. *American Economic Review*, 105(10), 3150-3182. Data: Penn World Table 11.0. - GADM (2022). Database of Global Administrative Areas, version 4.10. Full methodological documentation (the proportional rescaling to PWT national totals, the 0_05 low-density censoring, and per-level dictionaries) lives in ``datasets/BOL-005popAdj-PWTscaled/README.md`` of the geometrics repository. Returns ------- gdf : geopandas.GeoDataFrame 9 department geometries with columns ``["gid", "geometry"]``, CRS EPSG:4326. df : pandas.DataFrame Balanced panel of 99 rows (9 departments x 11 years). Key variables as in :func:`load_bolivia`. df_dict : pandas.DataFrame Data dictionary with one row per ``df`` column, in ``df`` column order. Raises ------ GeometricsDataError If a source file cannot be downloaded or fails hash verification. See Also -------- load_bolivia : Province-level (ADM2, n=112) version. load_bolivia_grid : The underlying 0.25-degree grid cells (n=1603). ### load_bolivia_grid() -> 'tuple[gpd.GeoDataFrame, pd.DataFrame, pd.DataFrame]' Load the Bolivia 0.25-degree grid cells (n=1603) PWT-anchored GDP panel. The raw cells of the same product as :func:`load_bolivia` before any administrative aggregation: Rossi-Hansberg & Zhang (2026) gridded GDP under 0_05 censoring, rescaled to Penn World Table 11.0 national totals, 2012--2022, in 2021 PPP US$. The source keys cells by the compound (``cell_id``, ``subcell_id``, ``subcell_id_0_25``); geometrics needs a single entity id shared between ``gdf`` and ``df``, so the loader synthesizes ``cell`` as ``cell_id.subcell_id.subcell_id_0_25`` and joins the geometry on the cells' unique (longitude, latitude) centers. The product is derived data; cite the underlying GDP estimates, the national benchmark, and the boundaries: - Rossi-Hansberg, E., & Zhang, J. (2026). Local GDP estimates around the world. *Journal of Urban Economics*, 154, 103871. - Feenstra, R. 
C., Inklaar, R., & Timmer, M. P. (2015). The Next Generation of the Penn World Table. *American Economic Review*, 105(10), 3150-3182. Data: Penn World Table 11.0.
- GADM (2022). Database of Global Administrative Areas, version 4.10.

Full methodological documentation (the proportional rescaling to PWT national totals, the 0_05 low-density censoring, and per-level dictionaries) lives in ``datasets/BOL-005popAdj-PWTscaled/README.md`` of the geometrics repository.

Returns
-------
gdf : geopandas.GeoDataFrame
    1603 grid-cell polygons with columns ``["cell", "geometry"]``, CRS EPSG:4326.
df : pandas.DataFrame
    Balanced panel of 17633 rows (1603 cells x 11 years) with ``cell`` first, then every source column. Sorted by ``cell`` then ``year``.
df_dict : pandas.DataFrame
    Data dictionary with one row per ``df`` column, in ``df`` column order (``cell`` is the sole entity-typed row; the three source key components are kept as factors).

Raises
------
GeometricsDataError
    If a source file cannot be downloaded or fails hash verification.

See Also
--------
load_bolivia : Province-level (ADM2, n=112) aggregation.
load_bolivia_departments : Department-level (ADM1, n=9) aggregation.

### load_bolivia_raw(level: "Literal['adm0', 'adm1', 'adm2', 'grid']" = 'adm2') -> 'tuple[pd.DataFrame, gpd.GeoDataFrame]'

Load the untouched files of one Bolivia level (including ADM0).

Parameters
----------
level : {"adm0", "adm1", "adm2", "grid"}
    Which level of the collection to load. ``"adm0"`` is the national aggregate (one unit; useful for checking the PWT anchoring).

Returns
-------
df : pandas.DataFrame
    The level's long panel CSV read as-is.
gdf : geopandas.GeoDataFrame
    The level's geometry with all of its attribute columns (the boundaries GeoPackage for admin levels; the cells GeoPackage for ``"grid"``), CRS EPSG:4326.

Raises
------
GeometricsDataError
    If a source file cannot be downloaded or fails hash verification.

See Also
--------
load_bolivia : The reshaped ``(gdf, df, df_dict)`` province version.

### clear_cache() -> 'None'

Remove the local cache of downloaded case-study files.

Deletes the pooch cache directory used by the loaders (the OS cache directory for ``geometrics``, or the directory named by the ``GEOMETRICS_DATA_DIR`` environment variable). The next loader call downloads the files again.
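
The cache behavior described for ``clear_cache`` can be illustrated with a minimal standard-library sketch. It is not geometrics' implementation: ``resolve_cache_dir`` and ``remove_cache`` are hypothetical helper names, the placeholder file name is invented, and the OS-default location is passed in explicitly because this document does not show how the package derives it. Only the ``GEOMETRICS_DATA_DIR`` variable name comes from the text above.

```python
import os
import shutil
import tempfile
from pathlib import Path


def resolve_cache_dir(default: Path) -> Path:
    """Return the directory named by GEOMETRICS_DATA_DIR, else `default`.

    `default` stands in for the OS cache directory (an assumption of this
    sketch; the real lookup happens inside the package).
    """
    override = os.environ.get("GEOMETRICS_DATA_DIR")
    return Path(override) if override else default


def remove_cache(cache_dir: Path) -> bool:
    """Delete `cache_dir` and its contents; report whether it existed."""
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstrate on a throwaway directory, never on a real cache.
scratch = Path(tempfile.mkdtemp())
fake_cache = scratch / "geometrics-cache"
fake_cache.mkdir()
# Hypothetical placeholder standing in for a downloaded case-study file.
(fake_cache / "placeholder_panel.csv").write_text("gid,year\n")

# Point the environment variable at the throwaway directory.
os.environ["GEOMETRICS_DATA_DIR"] = str(fake_cache)
resolved = resolve_cache_dir(default=scratch / "unused-default")
removed = remove_cache(resolved)

# Tidy up the environment and the scratch directory.
os.environ.pop("GEOMETRICS_DATA_DIR", None)
shutil.rmtree(scratch, ignore_errors=True)
```

With the package installed, the corresponding call is simply ``from geometrics.data import clear_cache`` followed by ``clear_cache()``; setting ``GEOMETRICS_DATA_DIR`` beforehand is what the docstring describes as the way to relocate the cache.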