# geometrics > Regional growth, convergence, and inequality analysis on the PySAL stack > (libpysal, esda, giddy, inequality, mapclassify, spreg, mgwr) with Plotly > figures, Great Tables, and plain-language interpretation on every result. Three inputs: gdf (geometry with ONLY the entity ID; shapefile / zipped shapefile / GeoJSON / GeoPackage via read_gdf), df (long-form panel declared with set_panel / set_labels), df_dict (6-column data dictionary: var_name, var_def, label, type, role, can_be_na). Every public function returns a frozen result dataclass with .df, .fig and/or .gt, .interpret() and .explain(). Install: pip install geometrics (extras: [dynamics] for Markov via giddy, [png] for static export, [all]). ## Docs - [Explore](https://quarcs-lab.github.io/geometrics/explore.html): maps, weights, Moran/LISA, space-time views β India - [Analyze](https://quarcs-lab.github.io/geometrics/analyze.html): beta/sigma/club convergence, spatial models with impacts, Markov, inequality β Bolivia - [Learn](https://quarcs-lab.github.io/geometrics/learn.html): the learn_* concept sandboxes and the explainer registry - [The data model](https://quarcs-lab.github.io/geometrics/articles/data-model.html): the (gdf, df, df_dict) contract - [Convergence](https://quarcs-lab.github.io/geometrics/articles/convergence.html): beta/sigma convergence and clubs - [Spatial dependence](https://quarcs-lab.github.io/geometrics/articles/spatial-dependence.html): weights, Moran, LISA - [Spatial spillovers](https://quarcs-lab.github.io/geometrics/articles/spillovers.html): the spreg suite and impacts - [Regional inequality](https://quarcs-lab.github.io/geometrics/articles/inequality.html): Gini/Theil and decompositions - [Distribution dynamics](https://quarcs-lab.github.io/geometrics/articles/dynamics.html): Markov and spatial Markov - [The India case study](https://quarcs-lab.github.io/geometrics/articles/india-case-study.html): the full replication arc - [The Bolivia dataset](https://quarcs-lab.github.io/geometrics/articles/bolivia-dataset.html): PWT-anchored local GDP at three scales - [For AI / LLMs](https://quarcs-lab.github.io/geometrics/use-with-llms.html): how AI agents should install and call geometrics - [Changelog](https://quarcs-lab.github.io/geometrics/changelog.html): release notes ## API - explore_*: explore_choropleth_map, explore_connectivity_map, explore_moran_plot, explore_lisa_cluster_map, explore_moran_over_time, explore_distribution_over_time, explore_spacetime_heatmap - analyze_*: analyze_beta_convergence, analyze_sigma_convergence, analyze_convergence_clubs, analyze_spatial_model, analyze_spatial_diagnostics, analyze_spatial_model_by_weights, analyze_markov_transitions, analyze_spatial_markov, analyze_inequality_over_time, analyze_theil_decomposition, analyze_gwr, analyze_mgwr - learn_*: learn_spatial_autocorrelation, learn_spatial_weights, learn_lisa_clusters, learn_spatial_spillovers, learn_omitted_spatial_lag, learn_beta_convergence, learn_sigma_convergence, learn_convergence_clubs, learn_markov_chains, learn_spatial_markov, learn_theil_decomposition - utilities: read_gdf, make_weights, growth_cross_section, set_panel, resolve_panel, set_labels, resolve_label, set_roles, build_data_dict, set_palette, get_palette, explain, list_topics - geometrics.data: load_india, load_india_states, load_india_raw, load_bolivia, load_bolivia_departments, load_bolivia_grid, load_bolivia_raw, clear_cache ## Source - [Repository](https://github.com/quarcs-lab/geometrics) - [API reference](https://quarcs-lab.github.io/geometrics/reference/index.html) - [llms-full.txt](https://quarcs-lab.github.io/geometrics/llms-full.txt): full docs text + signatures # ===== Docs pages (source) ===== ## ----- explore.qmd ----- --- title: "Explore regional data" aliases: - /quickstart.html --- ```{=html}
``` The **Explore** module is your first look at a regional dataset β before you estimate a single model. This page is a **case study**: you have just been handed 520 Indian districts observed by DMSP-OLS satellite nighttime lights between 1996 and 2010 (from [Mendez, Kabiraj & Li](https://github.com/quarcs-lab/project2025s-py)) and asked three questions an analyst always starts with: *is development spatially clustered, where exactly, and how did the whole regional distribution move over time?* Every Explore function takes the panel and returns a small **result object** carrying a tidy `.df` plus an interactive [Plotly](https://plotly.com/python/) figure (`.fig`), and most offer a plain-language `.interpret()`. Read this page top to bottom: the functions are ordered as a **workflow** β *load the three inputs β map the level β encode the neighborhood β test and localize clustering β watch the distribution move in time and space*. ::: {.callout-note} This is **exploratory** analysis: every reading below describes an *association*, never a cause. The [Analyze](analyze.qmd) module turns these patterns into estimates, and [Learn](learn.qmd) explains the ideas behind them with simulations you control. ::: ## Stage 0 β Load the three inputs geometrics separates geometry, data, and metadata: a geometry with **only the entity ID** (`gdf`), a **long-form panel** (`df`), and a **data dictionary** (`df_dict`). The bundled India case study ships all three; `set_labels` attaches the dictionary's labels to every future figure and declares the (entity, time) coordinates once. ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) # labels + entity/time + roles, once df.head(3) ``` The dictionary is data too β it documents every column and drives the labels on every figure: ```{python} df_dict.head(8) ``` ## Stage 1 β See the map `explore_choropleth_map` classifies with mapclassify (Fisher-Jenks by default, `k` classes) and draws one legend entry per class, so the legend *is* the classification: ```{python} gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, period=2010).fig ``` Pass `animate=True` instead of a single `period` to play the whole 1996β2010 film, or switch `scheme` (`"quantiles"`, `"equalinterval"`, β¦) to see how much the story depends on the classification β `gm.explain("choropleth_classification")` explains why. ## Stage 2 β Encode the neighborhood Everything spatial starts with a weights matrix **W** β the formal answer to "who is whose neighbor?". The paper uses 6 nearest neighbors; `explore_connectivity_map` draws the graph so you can inspect it *before* trusting it: ```{python} w = gm.make_weights(gdf, method="knn", k=6) gm.explore_connectivity_map(gdf, w=w).fig ``` Contiguity is the common alternative (`method="queen"`) β see [Spatial dependence and LISA](articles/spatial-dependence.qmd) for how to choose, and [Analyze](analyze.qmd) for checking that results survive the choice. ## Stage 3 β Is development spatially clustered? The Moran scatterplot puts each district's (standardized) value against the average of its neighbors; the slope is **global Moran's I**, the workhorse test of spatial autocorrelation: ```{python} moran = gm.explore_moran_plot(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996) moran.fig ``` ```{python} print(moran.interpret()) ``` ## Stage 4 β Where exactly? (LISA) Global Moran's I says *whether* the map clusters; **LISA** (local Moran) says *where* β each district is classified as a High-High hot spot, Low-Low cold spot, or a spatial outlier (High-Low / Low-High), masked at 5% significance: ```{python} lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996) lisa.fig ``` ```{python} print(lisa.interpret()) ``` ## Stage 5 β The whole distribution, year by year Convergence questions are distribution questions. The ridgeline stacks the cross-sectional density of each year on one shared grid, so you can watch the shape β not just the mean β move: ```{python} gm.explore_distribution_over_time(df, "log_ntl_pc_1996").fig ``` (`kind="animated"` plays the same densities as an animation instead.) ## Stage 6 β Every district, every year The space-time heatmap keeps *every* unit visible: one row per district, one column per year. Sorting the rows by latitude turns geography itself into the y-axis β a northβsouth transect of the whole panel: ```{python} gm.explore_spacetime_heatmap( df, "log_ntl_pc_1996", gdf=gdf, sort_by="north_south" ).fig ``` Rows that keep their shading left to right are persistent; rows that lighten or darken are mobile. `sort_by="value"` orders by the first period instead. ## Stage 7 β Does the clustering strengthen or fade? Stage 3 tested one year. Running Moran's I per year closes the loop β is the spatial structure of development deepening or dissolving? ```{python} mot = gm.explore_moran_over_time(df, "log_ntl_pc_1996", gdf=gdf, w=w) mot.fig ``` ```{python} print(mot.interpret()) ``` ## Where next You now know the map clusters, where it clusters, and how the distribution moved. - [Analyze](analyze.qmd) β estimate it: Ξ²/Ο/club convergence, spatial models with spillovers, Markov dynamics, inequality decompositions (on the Bolivia case study) - [The India case study](articles/india-case-study.qmd) β the full replication arc on this same panel - [Learn](learn.qmd) β the ideas behind W, Moran's I and LISA, taught with simulations where you plant the truth - [The data model](articles/data-model.qmd) β bring your own `(gdf, df, df_dict)` ## ----- analyze.qmd ----- --- title: "Analyze convergence and inequality" --- ```{=html} ``` The **Analyze** module estimates what [Explore](explore.qmd) described. This page is a **case study** on the bundled Bolivia panel β 112 provinces with PWT-anchored GDP per capita, 2012β2022 (from [Rossi-Hansberg & Zhang's local GDP](articles/bolivia-dataset.qmd), rescaled to Penn World Table 11.0) β asking the three standing questions of the convergence literature: *are poorer provinces catching up, do spillovers carry growth across borders, and is the national gap narrowing?* The functions appear in the order an analysis actually runs: *build the growth cross-section β estimate Ξ² without space β add spillovers β let the diagnostics pick the model β check robustness to W β Ο-convergence β clubs β distribution dynamics β inequality and its decomposition β local heterogeneity.* (The [India case study](articles/india-case-study.qmd) runs this same arc on the flagship 520-district panel.) ::: {.callout-note} Every `.interpret()` below reads an *association*, never a cause β estimates from observational regional data describe patterns, not policy effects. The [Learn](learn.qmd) module demonstrates each estimator on simulated data where the truth is planted. ::: ## Stage 0 β Load and declare ```{python} import warnings warnings.filterwarnings("ignore") import numpy as np import geometrics as gm gdf, df, df_dict = gm.data.load_bolivia() # 112 provinces x 2012-2022 df = gm.set_labels(df, df_dict, set_panel=True) # labels + (gid, year), once w = gm.make_weights(gdf) # queen contiguity, row-standardized df.head(3) ``` Five provinces are fully censored in the source product (polygons but no panel rows) β geometrics warns and carries on; see [the Bolivia dataset](articles/bolivia-dataset.qmd). ## Stage 1 β The growth cross-section Every convergence regression starts from the same frame: one row per province, its initial level and its annualized log growth. `growth_cross_section` builds it explicitly, so the Ξ² regression that follows has no hidden steps: ```{python} cs = gm.growth_cross_section(df, "gdppc") cs.head() ``` ## Stage 2 β Ξ²-convergence, without space Do initially-poorer provinces grow faster? A negative slope of growth on the initial (log) level says yes; `speed` and `half_life` translate it into years: ```{python} ols = gm.analyze_beta_convergence(df, "gdppc", model="ols") print( f"beta = {ols.beta_total:.4f} (speed {ols.speed:.3f}, " f"half-life {ols.half_life:.0f} yr)" ) ols.fig ``` ```{python} print(ols.interpret()) ``` ## Stage 3 β Ξ² with spillovers (the spatial Durbin model) Provinces are not islands: `model="sdm"` adds the spatial lags of outcome and covariates, and the convergence estimate becomes a LeSage-Pace **total impact** with direct and indirect (spillover) components β Monte-Carlo standard errors included: ```{python} sdm = gm.analyze_beta_convergence( df, "gdppc", model="sdm", gdf=gdf, w=w, n_draws=1000 ) print( f"SDM total: {sdm.beta_total:.4f} = direct {sdm.beta_direct:.4f} " f"+ indirect {sdm.beta_indirect:.4f} (rho = {sdm.rho:.2f})" ) ``` ```{python} print(sdm.interpret()) ``` ## Stage 4 β Which spatial model do the data ask for? Rather than assuming the SDM, let the Lagrange-multiplier diagnostics inspect the OLS residuals and apply the Anselin-Florax decision rule: ```{python} cs["ln_initial"] = np.log(cs["initial"]) cs["year"] = 2012 cs = gm.set_panel(cs, entity="gid", time="year") diag = gm.analyze_spatial_diagnostics( cs, outcome="growth", covariates=["ln_initial"], gdf=gdf, w=w ) print(f"Recommendation: {diag.recommendation}\n") print(diag.reasoning) ``` The recommended specification is one `model=` switch away β and its full spreg table with the impact decomposition comes from `analyze_spatial_model`: ```{python} model = gm.analyze_spatial_model( cs, outcome="growth", covariates=["ln_initial"], gdf=gdf, w=w, model="durbin", n_draws=1000, ) model.gt ``` ## Stage 5 β Robust to the weights choice? A conclusion that only holds under one definition of "neighbor" is fragile. `analyze_spatial_model_by_weights` re-estimates the same model under alternative weights and compares the impacts side by side: ```{python} robust = gm.analyze_spatial_model_by_weights( cs, outcome="growth", covariates=["ln_initial"], gdf=gdf, weights={ "queen": gm.make_weights(gdf, method="queen"), "knn4": gm.make_weights(gdf, method="knn", k=4), "knn6": gm.make_weights(gdf, method="knn", k=6), }, model="durbin", n_draws=1000, ) robust.fig ``` ```{python} print(robust.interpret()) ``` ## Stage 6 β Is the gap narrowing? (Ο-convergence) Ξ² asks about catch-up; Ο asks whether cross-sectional **dispersion** actually shrank. Both matter β fast catch-up can coexist with stable dispersion: ```{python} sigma = gm.analyze_sigma_convergence(df, "gdppc") sigma.fig ``` ```{python} print(sigma.interpret()) ``` ::: {.callout-tip} Dispersion is measured on logs, so the series must be strictly positive β `gdppc` is. For a panel with zeros (India's night lights), filter first: the [India case study](articles/india-case-study.qmd) shows the pattern. ::: ## Stage 7 β One Bolivia or several? (convergence clubs) Global convergence can fail while **clubs** of provinces converge to different steady states. The Phillips-Sul log(t) test and clustering find them from the data: ```{python} clubs = gm.analyze_convergence_clubs(df, "ln_gdppc", gdf=gdf) print(clubs.interpret()) ``` ```{python} clubs.fig_map if clubs.fig_map is not None else clubs.fig ``` ## Stage 8 β Distribution dynamics (Markov chains) How mobile are provinces across the income distribution β and does mobility depend on the neighbors? (Requires the `dynamics` extra: `pip install "geometrics[dynamics]"`.) ```{python} mkv = gm.analyze_markov_transitions(df, "gdppc", k=4) mkv.gt ``` The spatially conditioned chains need every mapped province observed in every period β so drop the five censored polygons from the geometry first (their weights are rebuilt on the subset automatically): ```{python} gdf_obs = gdf[gdf["gid"].isin(df["gid"])] spm = gm.analyze_spatial_markov(df, "gdppc", gdf=gdf_obs, k=4, m=4) print(spm.interpret()) ``` ## Stage 9 β Inequality: trend and decomposition The same panel, read through inequality indices β including the spatial Gini, which splits inequality into neighbor and non-neighbor pairs: ```{python} ineq = gm.analyze_inequality_over_time(df, "gdppc", gdf=gdf, w=w) ineq.fig ``` ```{python} print(ineq.interpret()) ``` How much of provincial inequality is *between departments* rather than within them? The Theil index decomposes exactly: ```{python} theil = gm.analyze_theil_decomposition(df, "gdppc", "name1") theil.fig ``` ```{python} print(theil.interpret()) ``` ## Stage 10 β Local heterogeneity (GWR, briefly) One Ξ² for all of Bolivia may hide local stories. Geographically weighted regression lets the convergence coefficient vary over space and maps it: ```{python} gwr = gm.analyze_gwr( cs, outcome="growth", covariates=["ln_initial"], gdf=gdf ) gwr.figs["ln_initial"] ``` Multiscale GWR (`analyze_mgwr`) lets each term choose its own bandwidth β see the [reference](reference/analyze_mgwr.qmd) and the India article for a full run. ## Where next - [Explore](explore.qmd) β the descriptive workflow that should precede all of this - [Learn](learn.qmd) β every estimator above, demonstrated on simulated data with a planted truth - [Spatial spillovers](articles/spillovers.qmd) β the spreg suite in depth; [Distribution dynamics](articles/dynamics.qmd); [Regional inequality](articles/inequality.qmd) - [The India case study](articles/india-case-study.qmd) β this arc on 520 districts ## ----- learn.qmd ----- --- title: "Learn spatial analysis" --- ```{=html} ``` The **Learn** module is geometrics' teaching layer, and it works two complementary ways: 1. **Every result speaks.** Each `explore_*` / `analyze_*` result carries `.interpret()` β a plain-language reading of *that* result β and `.explain()`, the concept behind the method. 2. **Sandboxes with a planted truth.** The `learn_*` functions simulate data from a known data-generating process, run the *real* geometrics estimator on it, and show whether the truth you planted comes back. Turn the knobs (`rho=`, `shift=`, `convergence_rate=`, β¦) and watch the concept respond. ::: {.callout-note} Sandboxes are for learning, never for your data β they *generate* their own. And even here, where the truth is literally known, `.interpret()` keeps its associational discipline: the habit should transfer to real data, where no truth is planted. ::: ## Stage 1 β Read a real result in plain language Any result can explain itself. A quick Ξ²-convergence on the Bolivia provinces: ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_bolivia() df = gm.set_labels(df, df_dict, set_panel=True) res = gm.analyze_beta_convergence(df, "gdppc", model="ols") print(res.interpret()) ``` And the concept behind it, from the built-in explainer registry: ```{python} print(res.explain().to_markdown()[:600], "...") ``` ## Stage 2 β The browsable concept index Thirty topics ship with the package β ESDA, weights, spatial models and impacts, convergence, distribution dynamics, inequality, and foundations. Every key works with `gm.explain(...)`: ```{python} gm.list_topics() ``` ```{python} print(gm.explain("spatial_autocorrelation").to_markdown()) ``` ## Stage 3 β Sandbox: *seeing* spatial autocorrelation What does Ο actually look like? Plant no dependence, then strong dependence β the left panel is one simulated map, the right panel tracks Moran's I across planted Ο: ```{python} gm.learn_spatial_autocorrelation(rho=0.0).fig ``` ```{python} strong = gm.learn_spatial_autocorrelation(rho=0.8) strong.fig ``` ```{python} print(strong.interpret()) ``` ## Stage 4 β Sandbox: why spatial econometrics exists Simulate outcomes that spill over (`y = (I - ΟW)β»ΒΉ(Ξ²x + Ξ΅)`), then fit OLS as if space did not exist. The omitted spatial lag inflates the slope; the SAR model recovers both Ξ² and Ο: ```{python} omit = gm.learn_omitted_spatial_lag(rho=0.7) omit.fig ``` ```{python} print(omit.interpret()) ``` ## Stage 5 β Sandbox: spillovers you planted, impacts recovered In a spatial Durbin world the true direct and indirect effects are known in closed form β so you can watch the LeSage-Pace decomposition earn its keep. This is the idea behind [Analyze Stage 3](analyze.qmd#stage-3-Ξ²-with-spillovers-the-spatial-durbin-model): ```{python} spill = gm.learn_spatial_spillovers(rho=0.5, gamma=0.5) spill.fig ``` ```{python} print(spill.interpret()) ``` ## Stage 6 β Sandbox: convergence at a known speed Plant a 2% convergence rate; the growth-on-initial regression should hand it back β slope, speed, and half-life: ```{python} beta = gm.learn_beta_convergence(convergence_rate=0.02) beta.fig ``` ```{python} beta.df ``` ## Stage 7 β The full sandbox catalog Eleven sandboxes cover the package's method families β each links to its reference page with every knob documented: | Sandbox | The lesson | |---|---| | [`learn_spatial_autocorrelation`](reference/learn_spatial_autocorrelation.qmd) | What Ο looks like, and how Moran's I tracks it | | [`learn_spatial_weights`](reference/learn_spatial_weights.qmd) | The same field under queen / rook / knn β W is a choice | | [`learn_lisa_clusters`](reference/learn_lisa_clusters.qmd) | Planted hot/cold spots, recovered (and false positives counted) | | [`learn_spatial_spillovers`](reference/learn_spatial_spillovers.qmd) | Direct/indirect/total impacts vs a closed-form truth | | [`learn_omitted_spatial_lag`](reference/learn_omitted_spatial_lag.qmd) | The bias of ignoring Wy β and how SAR repairs it | | [`learn_beta_convergence`](reference/learn_beta_convergence.qmd) | A planted convergence rate, recovered | | [`learn_sigma_convergence`](reference/learn_sigma_convergence.qmd) | A planted dispersion path; trend = ln Ο exactly | | [`learn_convergence_clubs`](reference/learn_convergence_clubs.qmd) | Two planted clubs; Phillips-Sul finds them | | [`learn_markov_chains`](reference/learn_markov_chains.qmd) | A planted transition matrix, recovered cell by cell | | [`learn_spatial_markov`](reference/learn_spatial_markov.qmd) | Mobility that depends on the neighbors β detected | | [`learn_theil_decomposition`](reference/learn_theil_decomposition.qmd) | A planted between/within split, decomposed exactly | (The two Markov sandboxes need the `dynamics` extra: `pip install "geometrics[dynamics]"`.) ## Prefer sliders? The Learn app wraps every sandbox knob in a slider and pairs it with the explainer browser β no install needed: ```{=html} ``` ## Where next - [Explore](explore.qmd) β apply the ESDA ideas to the India panel - [Analyze](analyze.qmd) β the estimators these sandboxes demystify, on Bolivia - [API reference](reference/index.qmd) β every knob of every sandbox ## ----- articles/data-model.qmd ----- --- title: "The data model (gdf, df, df_dict)" subtitle: "One ID-only geometry, one long panel, one data dictionary β declared once, used everywhere" --- Every geometrics analysis takes the same three inputs: 1. **`gdf`** β the entity geometry, carrying **only the entity ID** (plus an optional human-readable name) and the geometry column. Loaded and validated by `read_gdf`. 2. **`df`** β a **long-form panel**: one row per (entity, period), one column per variable. Its identifiers are declared once with `set_panel` (or in the same call as the labels, below). 3. **`df_dict`** β a six-column **data dictionary** with one row per `df` column. The dictionary is data too: it supplies the labels on every figure and table, and it can declare the panel IDs and analytical roles in one step. Geometry is a *join table*, not a data table β the variables live in `df`, and every spatial function matches the two on the entity ID. The bundled India case study ships all three inputs in exactly this shape: ```{python} import warnings warnings.filterwarnings("ignore") import geopandas as gpd import pandas as pd import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) # labels + entity/time + roles, once print(f"gdf: {gdf.shape[0]} units x {list(gdf.columns)}") print(f"df: {df.shape[0]} rows (520 districts x 6 years), {df.shape[1]} columns") print(f"df_dict: {df_dict.shape[0]} rows β one per df column") ``` Note what the geometry table holds: the ID and the polygons, nothing else. ```{python} gdf.head(3) ``` ## `read_gdf`: the geometry entry point `gm.read_gdf` is the single door for user geometry. It accepts a GeoDataFrame or a path in any of four formats β a **shapefile** (`.shp`), a **zipped shapefile** (`.zip`), **GeoJSON** (`.geojson` / `.json`), or a **GeoPackage** (`.gpkg`, with `layer=` for multi-layer files) β and enforces the geometry contract every spatial function relies on: a declared CRS, valid non-empty geometries (invalid ones are repaired with a warning), and a **unique** entity ID. Round-trip a toy grid through three of them: ```{python} import shutil import tempfile from pathlib import Path from shapely.geometry import box cells = [box(x, y, x + 1, y + 1) for y in range(2) for x in range(3)] grid = gpd.GeoDataFrame( {"cell_id": [f"c{i}" for i in range(6)]}, geometry=cells, crs="EPSG:4326" ) tmp = Path(tempfile.mkdtemp()) grid.to_file(tmp / "grid.geojson") grid.to_file(tmp / "grid.gpkg", layer="grid") (tmp / "shp").mkdir() grid.to_file(tmp / "shp" / "grid.shp") shutil.make_archive(str(tmp / "grid"), "zip", tmp / "shp") # -> grid.zip for name in ("grid.geojson", "grid.gpkg", "grid.zip"): back = gm.read_gdf(tmp / name) print(f"{name:13} -> {back.shape[0]} units, CRS EPSG:{back.crs.to_epsg()}, " f"entity = {back.attrs['geometrics_geo']['entity']!r}") ``` Two things happened silently there. First, the **ID-only rule**: because `cell_id` is the sole non-geometry column, it *is* the entity ID β no argument needed. When a file carries several columns, `read_gdf` looks for name hints (`id`, `code`, `region`, `district`, ...) and asks you to pass `entity=` when the choice is ambiguous; an `entity_name=` column (a readable label such as a district name next to a census code) can ride along. Everything else belongs in `df` β trim your geometry down to the ID (and optional name) before analysis. Second, the resolved IDs were stored on `gdf.attrs["geometrics_geo"]`, so every later call can omit `entity=`. **CRS handling** follows one principle: `read_gdf` *declares*, it never *reprojects*. A source with no CRS is an error until you say what the coordinates mean: ```{python} naked = gpd.GeoDataFrame({"cell_id": ["a"]}, geometry=[box(0, 0, 1, 1)]) # no CRS try: gm.read_gdf(naked) except ValueError as err: print(err) gm.read_gdf(naked, crs="EPSG:4326").crs ``` Reprojection happens later, on demand: metric operations (k-nearest-neighbor centroids, distance bands) project to an estimated UTM CRS automatically β or keep raw lon/lat coordinates with `crs=None`, as the India paper does. ## Declare once: `set_panel`, `set_labels`, `set_roles` Rather than repeat `entity="statedist", time="year"` on every call, geometrics stashes structural metadata on `df.attrs` β the **declare-once pattern**. Three helpers write it: `set_panel` (entity / time / entity-name IDs), `set_labels` (`{column: label}` display names), and `set_roles` (the default `outcome` and `covariates`). When the labels come from a `df_dict`, one call does all three: ```{python} df = gm.set_labels(df, df_dict, set_panel=True) print(df.attrs["geometrics_panel"]) print(df.attrs["geometrics_roles"]["outcome"], df.attrs["geometrics_roles"]["covariates"][:3], "...") ``` Two rules make this safe to rely on: - **Explicit arguments always win.** `gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, entity="statedist")` uses the argument, not the stored default β the attrs are a convenience, never a constraint. - **attrs can be dropped by pandas.** Some operations (merges, certain column selections) return a frame without them. The fix is one line β call `set_labels` again: ```{python} lookup = pd.DataFrame({"state": df["state"].unique()}) merged = df.merge(lookup, on="state") print("attrs after merge:", merged.attrs) merged = gm.set_labels(merged, df_dict, set_panel=True) # recall: declare again print("attrs after recall:", merged.attrs["geometrics_panel"]) ``` ## The `df_dict` contract A data dictionary is a plain DataFrame with **six columns, in this order**: `var_name`, `var_def` (the long definition), `label` (the short display name), `type`, `role`, `can_be_na`. Its vocabulary is fixed: - `type` β `entity` / `time` / `factor` / `logical` / `numeric` - `role` β `""` (blank β no special role) / `outcome` / `covariate` / `entity_name` - `can_be_na` β `True` / `False` Six rows of the India dictionary show the vocabulary at work β the entity ID, a factor tagged as the readable entity name, the time ID, a plain numeric column, the declared covariate, and the declared outcome (this panel happens to have no `logical` column; the inferred dictionary below has one): ```{python} df_dict.iloc[[0, 2, 3, 6, 8, 9]] ``` ```{python} print("type vocabulary:", sorted(df_dict["type"].unique())) print("role vocabulary:", sorted(df_dict["role"].fillna("").unique())) ``` (Blank roles read back from CSV as missing values β treat blank and `NaN` as the same "no role".) The `label` column titles every axis, legend and table header; the `type` column lets `set_labels(..., set_panel=True)` find the entity and time IDs; the `role` column feeds the default outcome/covariates and the `Name (id)` hover labels. ## No dictionary? Infer a starting point `gm.build_data_dict` produces a best-guess dictionary for any frame, from column names and dtypes β entity hints like `iso`/`id`/`region`, time hints like `year`/`date`, calendar-year integer detection, and a text column that is ~1:1 with the entity becomes the `entity_name`: ```{python} import numpy as np rng = np.random.default_rng(7) iso = ["BOL", "COL", "ECU", "PER"] names = {"BOL": "Bolivia", "COL": "Colombia", "ECU": "Ecuador", "PER": "Peru"} years = list(range(2000, 2011)) raw = pd.DataFrame( { "iso3": np.repeat(iso, len(years)), "country": np.repeat([names[i] for i in iso], len(years)), "year": years * len(iso), "gdp_pc": rng.uniform(3_000, 12_000, size=len(iso) * len(years)).round(0), "landlocked": np.repeat([True, False, False, False], len(years)), } ) ddict = gm.build_data_dict(raw) ddict ``` The inference is deliberately conservative: it is a *guess* to edit, not a verdict. Pass `entity=` / `time=` to pin the IDs when the hints misfire, rewrite the humanized labels, and mark the analytical roles yourself β `outcome` and `covariate` are **never** guessed. Then declare it the usual way: ```{python} raw = gm.set_labels(raw, ddict, set_panel=True) print(gm.resolve_panel(raw)) ``` ## How `df` meets `gdf`: alignment You never merge the panel onto the geometry yourself. Every spatial function funnels through one alignment path that joins the two on the **entity ID** and returns rows in `gdf` order (killing the classic row-order mismatch bug): 1. slice the requested `period` (defaulting to the latest, with a note); 2. join on the IDs β exact match first, and if there is **zero** overlap, one retry with string-normalized keys (`str.strip` on both sides), reported as a warning; 3. warn about **unmatched IDs** on either side (naming examples, so typos surface); 4. drop rows with **missing values** in the analysis columns, with a warning; 5. when rows dropped and a `w` was given, **restrict the weights** to the kept units (`libpysal`'s `w_subset`) and re-apply the row-standardization, so `w.n` always equals the number of analyzed rows. Watch all of it on a deliberately messy copy of the panel β one district's 2010 value blanked out, plus a "Ghost district" that has no polygon: ```{python} w = gm.make_weights(gdf, method="knn", k=6, crs=None) messy = df.copy() one = messy["statedist"].iloc[0] messy.loc[(messy["statedist"] == one) & (messy["year"] == 2010), "ntl_total"] = np.nan ghost = messy[messy["year"] == 2010].iloc[[1]].copy() ghost["statedist"] = "Ghost district" messy = gm.set_labels( pd.concat([messy, ghost], ignore_index=True), df_dict, set_panel=True ) with warnings.catch_warnings(record=True) as caught: warnings.simplefilter("always") lisa = gm.explore_lisa_cluster_map(messy, "ntl_total", gdf=gdf, w=w, period=2010) for c in caught: if c.category.__name__ == "GeometricsWarning": print("warning:", c.message) ``` Nothing crashed and nothing was silently misaligned: the ghost was named, the incomplete district was dropped, and the weights were subset to match. The result object keeps the full audit trail in `notes`, and `w_spec` records the weights recipe every spatial result carries: ```{python} print("rows analyzed:", len(lisa.df)) print("w_spec:", lisa.w_spec) for note in lisa.notes: print("-", note) ``` ## Bring your own data: the checklist - **Geometry** β load through `gm.read_gdf(...)` (shapefile, zipped shapefile, GeoJSON, or GeoPackage); one row per entity, a **unique** ID, a declared CRS (pass `crs=` when the file carries none), and no other columns beyond the ID and an optional name. - **Panel** β reshape to long form: one row per (entity, period). Entity IDs must be written exactly as in the geometry (pure whitespace differences are retried automatically; anything else is warned about, unit by unit). - **Dictionary** β author the six columns (`var_name, var_def, label, type, role, can_be_na`) with the vocabulary above, or bootstrap with `gm.build_data_dict(df)` and edit. Mark `outcome` / `covariate` yourself. - **Declare once** β `df = gm.set_labels(df, df_dict, set_panel=True)`; re-declare after merges or column subsets, since pandas can drop `attrs`. - **Weights** β build from the *same* geometry (`gm.make_weights(gdf, ...)`) and inspect the graph with `gm.explore_connectivity_map(gdf, w=w)` before trusting it. From here, the [Explore page](../explore.qmd) runs the whole workflow in a page, [Beta and sigma convergence](convergence.qmd) covers the convergence toolkit, and [the India case study](india-case-study.qmd) shows the three inputs carrying a full replication. ## ----- articles/convergence.qmd ----- --- title: "Beta and sigma convergence (and clubs)" subtitle: "Do laggards catch up, is the gap narrowing, and does everyone converge to the same place?" --- Three questions, three tools. **Ξ²-convergence** asks whether units that start behind grow faster (`analyze_beta_convergence`); **Ο-convergence** asks whether the cross-sectional spread is actually narrowing (`analyze_sigma_convergence`); and when neither gives a clean verdict, **convergence clubs** ask whether the panel splits into groups that each converge to their own path (`analyze_convergence_clubs`). This article runs all three on the bundled India panel β 520 districts observed by satellite nighttime lights, 1996-2010 β using the raw panel variable `ntl_total` (the paper's per-capita replication lives in [the India case study](india-case-study.qmd)). ```{python} import warnings warnings.filterwarnings("ignore") import numpy as np import pandas as pd import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) ``` Every concept in the library ships a built-in explainer. Here is how `gm.explain` introduces the Ξ²-convergence idea: ```{python} print(gm.explain("beta_convergence").to_markdown()[:960], "...") ``` ## Ξ²-convergence, first ignoring space `analyze_beta_convergence` builds the growth cross-section internally: for each district, total luminosity in levels at 1996 and 2010, and the annualized log growth between them, regressed on the **initial log level**. A negative slope is convergence. ```{python} ols = gm.analyze_beta_convergence(df, "ntl_total", model="ols") ols.fig ``` The result object carries the headline scalars β the slope, the implied structural speed Ξ» = -ln(1 + Ξ²Β·T)/T, and the half-life ln 2 / Ξ» (the years needed to close half of an initial gap): ```{python} print( f"beta = {ols.beta_total:.4f} (SE {ols.se_total:.4f}), R2 = {ols.r2:.3f}, " f"N = {ols.n_obs}\n" f"speed = {ols.speed:.4f} per year -> half-life = {ols.half_life:.0f} years" ) ``` ```{python} print(ols.interpret()) ``` ## Adding spillovers: the spatial Durbin model Districts are not islands β initial luminosity and its growth are both spatially clustered (see [the India case study](india-case-study.qmd) for the LISA maps). The `model` switch re-estimates the same regression with `spreg`'s spatial family; the paper's choice is the **spatial Durbin model** (SDM) on 6-nearest-neighbor weights built, like the paper, on plain lon/lat centroids (`crs=None`): ```{python} w = gm.make_weights(gdf, method="knn", k=6, crs=None) sdm = gm.analyze_beta_convergence( df, "ntl_total", model="sdm", gdf=gdf, w=w, n_draws=2000 ) ``` With a spatial lag in the model, the raw coefficient is no longer the answer. The convergence estimate becomes a LeSage-Pace **impact**: a **direct** part (a district's own initial level and its own growth), an **indirect** part (the neighborhood's initial level β the spillover), and their **total**, with Monte-Carlo standard errors from `n_draws` draws (2,000 here for speed; the default is 10,000). `res.impacts` tabulates the decomposition for every regressor: ```{python} sdm.impacts ``` Side by side with OLS β the pattern of the source paper's Table 1: ```{python} comparison = pd.DataFrame( { "OLS": [ols.beta_total, np.nan, ols.beta_total, np.nan, ols.speed, ols.half_life], "SDM": [sdm.beta_direct, sdm.beta_indirect, sdm.beta_total, sdm.rho, sdm.speed, sdm.half_life], }, index=["direct", "indirect", "total", "rho (spatial lag)", "speed (per yr)", "half-life (yr)"], ).round(4) comparison ``` ```{python} print(sdm.interpret()) ``` The OLS Ξ² understates catch-up: once the spatial lag (Ο β 0.7) and the neighbors' initial levels enter, the total impact is larger in magnitude than the OLS slope, and the implied convergence speed rises β part of every district's catch-up is associated with its *neighborhood*, which OLS attributes to nothing. How these impacts are computed, tested and stress-checked against other weights is the subject of the [spatial spillovers article](spillovers.qmd). ## Mapping who grew: `growth_cross_section` The same one-row-per-unit growth table the regression uses is available directly β handy for mapping the dependent variable before modelling it. It returns a plain DataFrame (entity, `initial`, `final`, `growth`) with the panel entity already declared, so it feeds straight into `explore_choropleth_map`: ```{python} cs = gm.growth_cross_section(df, "ntl_total") cs = gm.set_labels(cs, {"growth": "NTL growth (annualized log), 1996-2010"}) gm.explore_choropleth_map(cs, "growth", gdf=gdf).fig ``` ## Ο-convergence: is the gap actually narrowing? Ξ²-convergence is necessary but not sufficient for the distribution to compress β new shocks can re-spread it even while laggards catch up. `analyze_sigma_convergence` tracks the cross-sectional dispersion of the **log** of the variable per period (the standard deviation, the Gini, the coefficient of variation) and tests the trend of the log dispersion over time. Because dispersion is measured on logs, the series must be strictly positive β and one district (Lahul and Spiti, Himachal Pradesh) records zero luminosity in some years. Pass the full panel and geometrics refuses, telling you exactly why: ```{python} try: gm.analyze_sigma_convergence(df, "ntl_total") except ValueError as err: print(err) ``` So filter to the always-positive balanced panel first, exactly as the [Explore page](../explore.qmd) does β dropping the offending *district* (all its years, keeping the panel balanced), not just the offending rows: ```{python} bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique() pos = gm.set_labels( df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True ) sigma = gm.analyze_sigma_convergence(pos, "ntl_total") sigma.fig ``` ```{python} print(sigma.interpret()) ``` Both answers agree here: dimmer districts grew faster (Ξ²), *and* the distribution narrowed (Ο). That is not guaranteed β report both. ## Convergence clubs: one destination, or several? A single Ξ² can also paper over a split panel: some districts converging to a high path, others to a low one. The Phillips-Sul **log(t)** machinery tests whole-panel convergence and, when it is rejected, clusters the districts into data-driven **convergence clubs** from their relative transition paths. It runs on the same always-positive subset (the HP smoothing and relative transitions need a gap-free, strictly positive series); pass the matching geometry to get the club map. The clustering sieve refits thousands of log(t) regressions, so expect roughly a minute for the 519 districts β that is normal: ```{python} gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy() clubs = gm.analyze_convergence_clubs(pos, "ntl_total", gdf=gdf_pos) print( f"global log-t = {clubs.global_tstat:.1f} (converged: {clubs.converged}) -> " f"{clubs.n_clubs} clubs, {clubs.n_divergent} divergent districts" ) ``` Global convergence is emphatically rejected β the districts sort into clubs, each converging to its own path. The membership map shows where those paths live: ```{python} clubs.fig ``` ```{python} clubs.fig_map ``` ```{python} print(clubs.interpret()) ``` ## Where next - [Spatial spillovers](spillovers.qmd) β the full spreg suite behind `model="sdm"`: specification diagnostics, impact inference, and robustness to the weights choice - [The India case study](india-case-study.qmd) β this toolkit inside the complete replication arc, on the paper's exact per-capita growth variable - [The data model](data-model.qmd) β bring your own `(gdf, df, df_dict)` ## ----- articles/spatial-dependence.qmd ----- --- title: "Spatial dependence and LISA" subtitle: "Who counts as a neighbor, whether the map clusters, and where" --- Every spatial statistic answers a question *relative to a definition of neighborhood*. This article walks the exploratory spatial data analysis (ESDA) arc on the bundled India case study β 520 districts observed by satellite nighttime lights, 1996β2010: encode the neighborhood as a weights matrix, audit it visually, test whether luminosity clusters globally (Moran's I), locate the clusters locally (LISA), and track the clustering over time. ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) # The paper's weights: 6 nearest neighbors on plain lon/lat centroids (crs=None) w = gm.make_weights(gdf, method="knn", k=6, crs=None) ``` ## What a weights matrix is The spatial weights matrix **W** is the formal answer to "who is whose neighbor": cell (i, j) is nonzero when district j counts as a neighbor of district i. It is not a nuisance parameter β Moran's I, LISA, and every spatial regression are defined *conditional on the chosen W*. The built-in explainer says it best: ```{python} print(gm.explain("spatial_weights").to_markdown()[:800], "...") ``` ## Building weights: `make_weights` `make_weights` covers the standard families with the library's conventions baked in: entity ids as the weight ids, row standardization, and a human-readable one-liner recorded on `w.geometrics_meta["spec"]` (every spatial result carries it forward as `w_spec`, so figures and tables always document their W). Three variants on the India map: ```{python} variants = { "queen": gm.make_weights(gdf, method="queen"), "knn6": w, "inverse distance": gm.make_weights(gdf, method="inverse_distance", power=2), } for name, wx in variants.items(): print(f"{name:>18}: {wx.geometrics_meta['spec']}") ``` Each spec string tells a small story: - **Queen contiguity** connects districts that share a border or corner. One district β Mumbai, an island city that shares no boundary in this topology β has no queen neighbor at all, and `make_weights` **attaches it to its nearest neighbor automatically** (with a `GeometricsWarning` naming it), because a zero-neighbor unit has no spatial lag and breaks estimators downstream. The fix is recorded in the spec. - **k-nearest neighbors** guarantees every district exactly k neighbors β useful when polygons vary wildly in size. `crs=None` reproduces the source paper's lon/lat-centroid construction; the geometrics default (`crs="auto"`) would project to a metric CRS first. - **Inverse distance** encodes smooth decay, $1/d^p$ within a radius. With `threshold=None` the band is the smallest radius that leaves no district isolated β about 163 km here. ## Audit the graph before trusting it `explore_connectivity_map` draws the neighbor graph over the map and summarizes its health β the standard visual check of a W before any statistic is computed on it: ```{python} conn = gm.explore_connectivity_map(gdf, w=w) print(f"{conn.n_units} units, {conn.mean_neighbors:.0f} neighbors each " f"({conn.min_neighbors}-{conn.max_neighbors}); " f"{conn.pct_nonzero:.2f}% of pairs connected; " f"{conn.n_components} connected component(s); islands: {list(conn.islands)}") ``` ```{python} conn.fig ``` The scalars are the audit: a **single connected component** (global statistics implicitly assume every unit can reach every other through a chain of neighbors), **no islands**, and a sparse graph β each spatial lag averages over a small, local set of six districts. A companion histogram of neighbor counts is available as `conn.fig_hist` (trivially a spike at 6 for k-NN weights, more informative for contiguity). ```{python} print(conn.interpret()) ``` ## Is initial luminosity clustered? The Moran scatterplot `explore_moran_plot` z-standardizes the variable, plots each district's value against the average of its neighbors' values (the **spatial lag**), and colors the four quadrants: **High-High** and **Low-Low** (clustering of similars) on the diagonal, **Low-High** and **High-Low** (spatial outliers) off it. Under row-standardized weights the slope of the fitted line *equals* global Moran's I, and a stat box in the corner reports I, its expectation under spatial randomness, and the permutation pseudo p-value: ```{python} moran = gm.explore_moran_plot(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996) moran.fig ``` ```{python} moran.glance().round(3) ``` Initial luminosity per capita is strongly and positively autocorrelated β **Moran's I β 0.73** against an expectation near zero, the value the source paper reports. Most districts sit in the two clustering quadrants: ```{python} print(moran.df["quadrant"].value_counts().to_string()) print() print(moran.interpret()) ``` ## Where does it cluster? LISA Global Moran's I says *that* the map clusters; local Moran statistics (LISA) say *where*. Each district gets a local statistic and its own permutation pseudo p-value; districts significant at `alpha` receive their quadrant's cluster label, everything else is "Not significant": ```{python} lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996) lisa.fig ``` ```{python} print(lisa.interpret()) ``` The picture matches the literature: a bright **High-High belt** and a dim **Low-Low block**, with only a handful of spatial outliers. ### How sensitive are the clusters to alpha? The cluster labels are a significance mask, so they move with the threshold. Re-running at `alpha=0.01` keeps the geography but shrinks the significant set: ```{python} import pandas as pd lisa01 = gm.explore_lisa_cluster_map( df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996, alpha=0.01 ) pd.DataFrame( { "alpha = 0.05": [lisa.n_hh, lisa.n_ll, lisa.n_lh, lisa.n_hl, lisa.n_ns], "alpha = 0.01": [lisa01.n_hh, lisa01.n_ll, lisa01.n_lh, lisa01.n_hl, lisa01.n_ns], }, index=["High-High", "Low-Low", "Low-High", "High-Low", "Not significant"], ) ``` Tightening alpha from 0.05 to 0.01 drops the significant count from 281 to 175 districts. The hot and cold *cores* survive; the borderline fringe does not β which is exactly how a LISA map should be read. ## Is the clustering strengthening? Moran's I over time `explore_moran_over_time` pivots the panel wide, keeps the one entity set with complete data in every period (so the same W applies throughout and the values are comparable), and re-tests each of the six satellite years: ```{python} mot = gm.explore_moran_over_time(df, "ntl_total", gdf=gdf, w=w) mot.df.round(3) ``` ```{python} mot.fig ``` ```{python} print(mot.interpret()) ``` The answer is: **strong throughout, but not strengthening**. Moran's I for total luminosity fluctuates in a narrow band (roughly 0.39β0.45) and every year rejects spatial randomness at the 1% level β spatial dependence is a persistent feature of India's economic geography over the window, not a trend. ## A caveat, and where next LISA runs one permutation test **per district** β 520 tests here, with no multiple-testing adjustment, so at `alpha = 0.05` roughly 26 "significant" districts would be expected even under complete spatial randomness. Treat the cluster map as *descriptive* of where dependence concentrates (the alpha sensitivity table above is the practical check), not as 520 confirmatory hypothesis tests. The same warning is built into `lisa.interpret()`. Detecting dependence is the exploratory half. Modeling it β letting growth respond to neighbors' outcomes and characteristics, and decomposing associations into direct and spillover components β is the job of the spreg suite, covered in [Spatial spillovers: the spreg suite](spillovers.qmd). ## ----- articles/spillovers.qmd ----- --- title: "Spatial spillovers: the spreg suite" subtitle: "SAR, SEM, SLX, and the spatial Durbin model β diagnosed, estimated, and read through impacts" --- [Spatial dependence and LISA](spatial-dependence.qmd) established that Indian district luminosity clusters in space. This article models that dependence: it runs the Lagrange-multiplier specification tests, estimates the spatial Durbin model of the paper's convergence regression, decomposes the association into direct and spillover (indirect) impacts, and checks that the conclusion survives alternative weights. ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) # The paper's weights: 6 nearest neighbors on plain lon/lat centroids (crs=None) w = gm.make_weights(gdf, method="knn", k=6, crs=None) ``` ## The specification family Space can enter a regression through three channels, and the classic cross-sectional models are just the on/off combinations. The **spatial lag model (SAR)**, $y = \rho W y + X\beta + \varepsilon$, lets the outcome respond to neighbors' outcomes β one district's growth feeds its neighbors' growth and echoes back, a *global* feedback governed by $\rho$. The **spatial error model (SEM)**, $y = X\beta + u$ with $u = \lambda W u + \varepsilon$, puts the dependence in the disturbances instead: spatially correlated omitted factors, with no substantive spillover in the outcome equation. The **SLX model**, $y = X\beta + WX\gamma + \varepsilon$, is the *local* alternative β neighbors' characteristics matter directly (each $\gamma$ is itself the spillover), but there is no outcome feedback. The **spatial Durbin model (SDM)** combines the lag and SLX channels, $y = \rho W y + X\beta + WX\gamma + \varepsilon$, and is the workhorse of the convergence-with-spillovers literature precisely because it nests the others and lets the data separate the channels. The built-in explainer: ```{python} print(gm.explain("spatial_durbin_model").to_markdown()[:700], "...") ``` All of these are one `model=` switch away in `analyze_spatial_model` (`"ols"`, `"lag"`, `"error"`, `"slx"`, `"durbin"`, `"durbin_error"`), estimated by spreg. ## Which model do the data ask for? `analyze_spatial_diagnostics` estimates the non-spatial OLS benchmark β here the paper's unconditional convergence regression, growth of luminosity per capita 1996β2010 on its initial level, on the 1996 cross-section β then runs Moran's I on the residuals and the five LM tests, and applies the Anselin-Florax decision rule: ```{python} growth = df.query("year == 1996") diag = gm.analyze_spatial_diagnostics( growth, outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, w=w, ) diag.df ``` ```{python} print(f"Recommendation: {diag.recommendation}\n") print(diag.reasoning) ``` Residual Moran's I of 0.58 means OLS leaves a lot of spatial structure on the table. Both simple LM tests reject overwhelmingly, so the *robust* forms decide, and robust LM error wins by a wide margin β the mechanical rule points to the **error model**. The paper (and this article) nonetheless proceeds with the **spatial Durbin model**: the SDM nests the error model under a common-factor restriction and keeps the spatially lagged covariates that theory suggests, so it is the safe superset when any robust test fires. The diagnostics tell you dependence is real and must be modeled; they do not have the last word on which superset to estimate. ## The spatial Durbin model `analyze_spatial_model` aligns the cross-section to the geometry and weights, estimates by maximum likelihood, and computes the LeSage-Pace impact decomposition from `betas` + `vm` with Monte-Carlo standard errors (`n_draws`; the result records how many were used): ```{python} sdm = gm.analyze_spatial_model( growth, outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, w=w, model="durbin", n_draws=2000, ) sdm.gt ``` The coefficient table is only the raw material. The spatial autoregressive parameter is large β with $\rho \approx 0.8$, every local difference is amplified through the neighborhood multiplier $1/(1-\rho) \approx 5$ β and the reportable quantities are the impacts: ```{python} print(f"rho = {sdm.rho:.3f} pseudo-R2 = {sdm.r2:.2f} " f"AIC = {sdm.aic:.0f} n = {sdm.n_obs}") sdm.impacts.round(4) ``` ```{python} print(sdm.interpret()) ``` The convergence reading: dimmer districts grow faster (**direct impact β0.026**), and the **total impact of β0.022** β the number that maps to the speed of convergence β folds in the spillover that arrives through the neighborhood. This is the machinery behind `analyze_beta_convergence(..., model="sdm")` in the [Analyze page](../analyze.qmd). ## SLX for contrast: local spillovers only Setting `model="slx"` keeps the spatially lagged covariates but switches off the outcome feedback ($\rho = 0$). Impacts are then **analytic** β no Monte-Carlo draws needed β because direct = $\beta$ and indirect = $\gamma$ exactly, with no multiplier to simulate: ```{python} slx = gm.analyze_spatial_model( growth, outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, w=w, model="slx", ) slx.impacts.round(4) ``` The contrast is instructive: without the $\rho W y$ channel, the estimated spillover of initial luminosity is small and statistically indistinguishable from zero, and the model has no way to represent growth diffusing between neighbors. The SDM attributed most of the spatial action to the global feedback loop β exactly the channel SLX rules out by construction. AIC comparisons across `sdm.aic` and `slx.aic` (same sample, same W) make the same point. ## Is the result an artifact of the weights? Every impact above is conditional on the 6-nearest-neighbor W. `analyze_spatial_model_by_weights` re-estimates the same model under alternative specifications and compares the focal covariate's direct/indirect/total impacts against a baseline β the source paper's notebook-c07 robustness check: ```{python} alt = { "knn4": gm.make_weights(gdf, method="knn", k=4, crs=None), "knn6": w, "knn8": gm.make_weights(gdf, method="knn", k=8, crs=None), "queen": gm.make_weights(gdf, method="queen"), } robust = gm.analyze_spatial_model_by_weights( growth, outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, weights=alt, baseline="knn6", n_draws=1000, ) robust.fig ``` ```{python} print(robust.interpret()) ``` The dot-whisker figure is the headline: the total impact keeps its sign in all four specifications and every 95% interval covers the baseline β the convergence conclusion does not hinge on the weights choice. The per-W detail (including each `w_spec` and AIC) lives in `robust.df`, and `robust.gt` renders the comparison table. ## Coefficients are not impacts The closing rule of the spreg suite. In any model with $\rho W y$, a change in one district's covariate propagates to its neighbors and feeds back, so no single regression coefficient describes the association β reading $\beta$ off the SDM coefficient table understates or misstates everything the model has to say. The explainer: ```{python} print(gm.explain("spatial_impacts").to_markdown()[:800], "...") ``` Report **direct** (own-district, feedback included), **indirect** (spillover to the rest of the map), and **total** impacts β never raw coefficients β and say which W they are conditional on. Every geometrics result does the second part for you via `w_spec`. ## Where next - [The India case study](india-case-study.qmd) β the full replication arc these pieces come from - [Spatial dependence and LISA](spatial-dependence.qmd) β the exploratory half: weights, Moran's I, and cluster maps ## ----- articles/inequality.qmd ----- --- title: "Regional inequality" subtitle: "Gini and Theil over time, the spatial Gini, and the between/within split" --- Convergence asks whether poor regions catch up; **inequality analysis asks how far apart regions are right now, and whether that gap is closing**. This article tracks regional inequality in the bundled Indian district panel β 520 districts observed by DMSP-OLS nighttime lights, 1996β2010 β with the PySAL `inequality` stack: level measures over time, the ReyβSmith spatial Gini, and the Theil between/within decomposition by state. ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) # log-based measures need strictly positive values (see below) bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique() pos = gm.set_labels( df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True ) gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy() w_pos = gm.make_weights(gdf_pos, method="knn", k=6, crs=None) ``` ## Gini or Theil? The **Gini index** compares every pair of regions: it is the mean absolute difference between two randomly drawn units, scaled by twice the mean, running from 0 (everyone equal) to (nearly) 1 (one unit holds everything). It is the most widely reported inequality measure, robust to how you group the data, and most sensitive to transfers around the middle of the distribution β but it does not decompose cleanly into parts. The **Theil index** is an entropy measure: it weighs each region's *share* of the total by the log of that share relative to an equal split. It is more sensitive to the top of the distribution, and its killer feature is **exact additive decomposability** β total inequality splits into a between-group plus a within-group component, with nothing left over. That is what makes it the workhorse for regional hierarchies like districts within states: ```{python} print(gm.explain("theil_decomposition").to_markdown()[:600], "...") ``` ## Strictly positive, or geometrics tells you why not The Theil index takes logarithms of shares, so it is undefined at zero. One Indian district (Lahul and Spiti, up in the Himalayas) records **zero luminosity** in some years β and instead of a cryptic numpy warning, geometrics raises a `ValueError` that names the offenders: ```{python} try: gm.analyze_inequality_over_time( df, "ntl_total", measures=("gini", "theil", "cv") ) except ValueError as err: print(err) ``` That is why the first chunk filtered to the always-positive panel (`pos`, 519 districts) exactly as the [Explore page](../explore.qmd) does, and rebuilt the geometry and weights to match. ## Three measures, one trend test `analyze_inequality_over_time` computes each requested measure per period, then regresses the **log** of each measure on time (OLS, HC1 standard errors). A negative, significant slope means inequality is narrowing β the inequality-narrative complement of Ο-convergence: ```{python} ineq = gm.analyze_inequality_over_time( pos, "ntl_total", measures=("gini", "theil", "cv") ) ineq.df.round(3) ``` ```{python} ineq.fig ``` The trend table makes the verdict explicit: ```{python} ineq.gt ``` ```{python} print(ineq.interpret()) ``` All three measures agree: district-level luminosity inequality is high (Gini around 0.54) and essentially flat over 1996β2010 β no narrowing the trend test can distinguish from noise. ## The spatial Gini: inequality between neighbors The Gini is a sum over *pairs* of regions, and every pair is either a **neighbor pair** or a **non-neighbor pair** under a spatial weights matrix. Rey & Smith (2013) split it exactly along that line. Pass the geometry and weights and `analyze_inequality_over_time` adds the decomposition per period: ```{python} sg = gm.analyze_inequality_over_time( pos, "ntl_total", measures=("gini", "theil", "cv"), gdf=gdf_pos, w=w_pos, ) sg.df[["time", "n_units", "gini", "gini_spatial", "gini_spatial_p"]].round(4) ``` `gini_spatial` is the component of the overall Gini owed to differences between **neighboring** districts β here well under 1% of a Gini of ~0.54, so almost all pairwise inequality lives between districts that are *not* neighbors. Neighbors resemble each other; the big gaps are long-distance gaps. `gini_spatial_p` is a permutation pseudo p-value (99 permutations by default) testing whether the non-neighbor component exceeds what spatial randomness would produce β at p = 0.01 in every year, the spatial structure of inequality is no accident. The `w_spec` field records the weights used: ```{python} print(sg.w_spec) ``` ## How much inequality is *between states*? Districts nest inside states, so the Theil index splits exactly into a **between-state** component (inequality across state means) and a **within-state** component (inequality among districts inside each state). With `permutations=99`, districts are randomly reassigned to states and `p_between` reports how often a random partition captures a between share at least as large: ```{python} theil = gm.analyze_theil_decomposition( pos, "ntl_total", "state", permutations=99 ) theil.df.round(4) ``` ```{python} theil.fig ``` ```{python} print(theil.interpret()) ``` About 60% of district luminosity inequality is a *between-state* phenomenon, and that share drifted down (62% β 59%) over the window β state means pulled slightly closer together while within-state gaps held. The permutation p-values (0.01 in every period) say the state partition captures far more inequality than chance groupings do: in India, geography β which state you are in β is the dominant layer of regional inequality. ## A lightweight companion: 32 states, one year `load_india_states()` ships a small state-level cross-section (32 states and union territories, corrected DMSP-OLS lights over gridded population, 1992). With a single year there is no trend to test β `analyze_inequality_over_time` needs at least two periods β but a one-off snapshot takes three lines: ```{python} import pandas as pd from inequality.gini import Gini from inequality.theil import Theil gdf_s, df_s, dict_s = gm.data.load_india_states() df_s = gm.set_labels(df_s, dict_s, set_panel=True) y = df_s["ntl_pc"].to_numpy(dtype=float) pd.DataFrame( { "measure": ["Gini", "Theil", "CV"], "value": [Gini(y).g, Theil(y).T, y.std(ddof=1) / y.mean()], } ).round(3) ``` State-level inequality (Gini 0.39) is well below district-level inequality (0.54) β aggregation averages away the within-state gaps, which is precisely the between/ within story again. The map shows where the bright and dark states are: ```{python} gm.explore_choropleth_map(df_s, "log_ntl_pc", gdf=gdf_s, period=1992).fig ``` ## Where next - [Distribution dynamics](dynamics.qmd) β the same panel through Markov and spatial Markov chains: who moves within the distribution, and does the neighborhood condition mobility? - [The India case study](india-case-study.qmd) β the full replication arc, with Ο-convergence and this Theil decomposition in context - `gm.explain("gini")`, `gm.explain("theil_index")`, `gm.explain("theil_decomposition")` β the concept explainers quoted above ## ----- articles/dynamics.qmd ----- --- title: "Distribution dynamics: Markov and spatial Markov" subtitle: "Following the whole cross-sectional distribution β and asking whether geography conditions mobility" --- A convergence regression compresses regional growth into one slope. Quah's critique is that the slope can mislead: Ξ²-convergence is perfectly compatible with a widening or *polarizing* distribution β poor regions can grow faster on average while the cross-section splits into "twin peaks" of rich and poor clubs. Distribution dynamics therefore follows the **entire cross-sectional distribution** over time β its shape period by period, and the movement of individual regions within it, summarized by Markov transition matrices: ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) # relative (mean-normalized) measures use the always-positive panel bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique() pos = gm.set_labels( df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True ) gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy() w_pos = gm.make_weights(gdf_pos, method="knn", k=6, crs=None) print(gm.explain("distribution_dynamics").to_markdown()[:600], "...") ``` ::: {.callout-note} The two `analyze_*` functions in this article require the optional **giddy** dependency: `pip install "geometrics[dynamics]"` (it is included in `geometrics[all]`). ::: ## The shape of the distribution, period by period `explore_distribution_over_time` estimates one kernel density per period on a shared grid. With `relative=True` each district is divided by the period's cross-sectional mean first β the distribution-dynamics convention β so 1.0 marks the period average and the plot isolates changes in *shape* from changes in the overall level: ```{python} gm.explore_distribution_over_time(pos, "ntl_total", relative=True).fig ``` The relative distribution is strongly right-skewed and stable: a heavy mass of districts below the mean and a long bright tail, with no sign of the mass narrowing toward 1. The same densities animate over the slider if you prefer one curve at a time: ```{python} gm.explore_distribution_over_time( pos, "ntl_total", relative=True, kind="animated" ).fig ``` ## Every district keeps its row Densities hide *who* is where. The space-time heatmap pivots the panel to one row per district and one column per year, so persistence (rows keeping their shading left to right) is visible unit by unit. `sort_by="north_south"` orders rows by centroid latitude using the geometry, and `relative=True` compares districts within each year rather than tracking the level: ```{python} gm.explore_spacetime_heatmap( pos, "ntl_total", gdf=gdf_pos, sort_by="north_south", relative=True ).fig ``` Read top to bottom, the bright bands are not scattered β brightness comes in latitudinal blocks that persist across all six years, a first hint that both the distribution *and* the map are sticky. ## Markov transitions: who moves between quintiles? `analyze_markov_transitions` discretizes each year's relative luminosity into `k=5` quintiles and pools every year-to-year move into a transition-probability matrix β row `Q1`, column `Q2` is the probability that a bottom-quintile district climbs one class by the next period: ```{python} mk = gm.analyze_markov_transitions(pos, "ntl_total", k=5, relative=True) mk.fig ``` The result carries the long-run implications of the matrix β the ergodic (steady-state) distribution, the expected sojourn time in each class, and scalar mobility indices: ```{python} import pandas as pd long_run = pd.DataFrame( {"steady state": mk.steady_state, "sojourn (periods)": mk.sojourn} ) print(long_run.round(3)) print( f"\nmobility: Shorrocks {mk.shorrocks:.3f}, Prais {mk.prais:.3f}, " f"Bartholomew {mk.bartholomew:.3f} ({mk.n_transitions:,} transitions)" ) ``` ```{python} print(mk.interpret()) ``` The diagonal dominates: a district's position in the luminosity distribution is highly persistent (average stay probability 0.83, Shorrocks 0.21), and the extremes are the stickiest β a district entering the top quintile stays roughly fifteen periods. Low mobility is the transition-matrix face of the flat inequality trend in the [regional inequality article](inequality.qmd). ## Spatial Markov: does the neighborhood condition mobility? The classic chain treats every district's move as exchangeable. Rey's **spatial Markov** re-estimates the transition matrix *conditional on the spatial lag* β the average state of each district's 6 nearest neighbors β giving one matrix per neighborhood class, and the BickenbachβBode LR / Q tests of whether those conditional matrices differ from the pooled one: ```{python} smk = gm.analyze_spatial_markov(pos, "ntl_total", gdf=gdf_pos, w=w_pos, k=4) smk.fig ``` ```{python} smk.gt ``` ```{python} print(smk.interpret()) ``` Both homogeneity tests reject decisively (LR = 73.6, Q = 73.5, p < 0.001): transition dynamics are **not** the same in every neighborhood. Districts surrounded by dim neighbors move around more (and find it harder to hold a high rank), while districts embedded in bright neighborhoods stay put β geography conditions mobility, the distribution-dynamics counterpart of the spatial spillovers found by the SDM in the [India case study](india-case-study.qmd). ## Where next - [Regional inequality](inequality.qmd) β Gini/Theil levels and the between-state decomposition behind this persistence - [The India case study](india-case-study.qmd) β convergence regressions, spillovers, and convergence clubs on the same panel - `gm.explain("markov_chains")`, `gm.explain("spatial_markov")`, `gm.explain("mobility_measures")` β the concept explainers ## ----- articles/india-case-study.qmd ----- --- title: "The India case study" subtitle: "Regional growth, convergence, and spatial spillovers β a reproducible view from outer space" --- This article replicates and extends the analysis of *"Regional growth, convergence, and spatial spillovers in India"* ([Mendez, Kabiraj & Li](https://github.com/quarcs-lab/project2025s-py); building on Chanda & Kabiraj 2020, *World Development*): 520 Indian districts observed by radiance-calibrated DMSP-OLS **nighttime lights** between 1996 and 2010, used as a satellite proxy for economic activity. Three questions organize everything: 1. **Convergence** β do dimmer (poorer) districts grow faster than brighter ones? 2. **Spatial dependence** β do neighboring districts light up together? 3. **Spillovers** β does a neighborhood's brightness help local growth? ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_india() df = gm.set_labels(df, df_dict, set_panel=True) print(f"{gdf.shape[0]} districts x {df['year'].nunique()} years; " f"{df_dict.shape[0]} documented variables") ``` ## A view from space Total district luminosity, classified with Fisher-Jenks. The animation steps through all six satellite years with a **pooled** classification, so colors are comparable across frames: ```{python} gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, animate=True).fig ``` ## Spatial dependence The paper's weights are 6 nearest neighbors (built, like the paper, on plain lon/lat centroids β pass `crs=None`; the geometrics default would project first): ```{python} w = gm.make_weights(gdf, method="knn", k=6, crs=None) lisa_initial = gm.explore_lisa_cluster_map( df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996 ) lisa_initial.fig ``` Initial luminosity is strongly clustered β the paper reports Moran's I = 0.73: ```{python} print(f"Moran's I (initial log luminosity pc): {lisa_initial.moran_i:.2f} " f"(pseudo p = {lisa_initial.p_sim_global:.3f})") print(f"High-High districts: {lisa_initial.n_hh}, " f"Low-Low: {lisa_initial.n_ll}") ``` And so is growth: ```{python} growth_lisa = gm.explore_lisa_cluster_map( df.query("year == 1996"), "growth_ntl_pc_9610", gdf=gdf, w=w ) print(f"Moran's I (growth 1996-2010): {growth_lisa.moran_i:.2f} " f"(pseudo p = {growth_lisa.p_sim_global:.3f})") growth_lisa.fig ``` ## Convergence: OLS vs the spatial Durbin model The paper's dependent variable is the **per-capita** luminosity growth rate 1996-2010, shipped verbatim by `load_india()` (an honest per-capita *panel* is impossible β district population exists only for 1996 and 2001 β so the paper's pre-computed columns are carried unchanged). To run the paper's exact cross-section through the panel API, rebuild a two-period panel whose growth reproduces the paper's dependent variable identically: ```{python} import numpy as np import pandas as pd HORIZON = 14 # 1996 -> 2010 base = df.query("year == 1996")[ ["statedist", "state", "district", "ntl_pc_1996", "growth_ntl_pc_9610"] ] paper_panel = pd.concat( [ base.assign(year=1996, ntl_pc=base["ntl_pc_1996"]), base.assign( year=2010, ntl_pc=base["ntl_pc_1996"] * np.exp(HORIZON * base["growth_ntl_pc_9610"]), ), ], ignore_index=True, ) paper_panel = gm.set_panel(paper_panel, entity="statedist", time="year") ``` Unconditional convergence, first ignoring space, then with the SDM (the paper's Table 1, Model 1): ```{python} ols = gm.analyze_beta_convergence(paper_panel, "ntl_pc", model="ols") sdm = gm.analyze_beta_convergence( paper_panel, "ntl_pc", model="sdm", gdf=gdf, w=w, n_draws=5000 ) summary = pd.DataFrame( { "OLS": [ols.beta_total, np.nan, ols.beta_total, ols.speed, ols.half_life], "SDM": [sdm.beta_direct, sdm.beta_indirect, sdm.beta_total, sdm.speed, sdm.half_life], }, index=["direct", "indirect", "total", "speed (per yr)", "half-life (yr)"], ).round(4) summary ``` The headline finding: **spatial spillovers raise the estimated speed of convergence**. Part of every district's catch-up arrives through its neighborhood β the indirect impact β which OLS attributes to nothing. ```{python} sdm.fig ``` ```{python} print(sdm.interpret()) ``` ## Which spatial model do the data ask for? ```{python} diag = gm.analyze_spatial_diagnostics( df.query("year == 1996"), outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, w=w, ) print(diag.recommendation) print(diag.reasoning) diag.gt ``` ## Robustness to the weights choice The paper re-estimates its preferred SDM under seven alternative weights (notebook c07). Here are four: ```{python} alt = { "knn4": gm.make_weights(gdf, method="knn", k=4, crs=None), "knn6": w, "knn8": gm.make_weights(gdf, method="knn", k=8, crs=None), "queen": gm.make_weights(gdf, method="queen"), } robust = gm.analyze_spatial_model_by_weights( df.query("year == 1996"), outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, weights=alt, baseline="knn6", n_draws=2000, ) robust.fig ``` ## Local convergence: GWR Is the growth-initial association uniform across India? Geographically weighted regression maps the local convergence coefficient: ```{python} gwr = gm.analyze_gwr( df.query("year == 1996"), outcome="growth_ntl_pc_9610", covariates=["log_ntl_pc_1996"], gdf=gdf, ) print(f"adaptive bandwidth: {gwr.bw:.0f} neighbors; local R2 mean " f"{gwr.df['local_r2'].mean():.2f}") gwr.figs["log_ntl_pc_1996"] ``` ## Distribution dynamics Beyond the regression slope: how does the whole distribution move? (One district records zero luminosity in some years; log-based and relative measures use the always-positive panel.) ```{python} bad = df.loc[df["ntl_total"] <= 0, "statedist"].unique() pos = gm.set_labels( df[~df["statedist"].isin(bad)].copy(), df_dict, set_panel=True ) gdf_pos = gdf[~gdf["statedist"].isin(bad)].copy() w_pos = gm.make_weights(gdf_pos, method="knn", k=6, crs=None) gm.explore_distribution_over_time(pos, "ntl_total", relative=True).fig ``` Quintile-to-quintile mobility, then conditioned on the neighborhood: ```{python} mk = gm.analyze_markov_transitions(pos, "ntl_total", k=5, relative=True) print(f"Shorrocks mobility: {mk.shorrocks:.2f} " f"(diagonal persistence {np.diag(mk.p).mean():.2f})") mk.fig ``` ```{python} smk = gm.analyze_spatial_markov(pos, "ntl_total", gdf=gdf_pos, w=w_pos, k=4) print(f"Homogeneity LR test: {smk.lr_stat:.1f} (p = {smk.lr_p:.2g}) β " "transition dynamics differ by neighborhood") smk.fig ``` ```{python} print(smk.interpret()) ``` ## Regional inequality Ο-convergence and the between/within split β how much of district inequality is *between states*? ```{python} sigma = gm.analyze_sigma_convergence(pos, "ntl_total") sigma.fig ``` ```{python} theil = gm.analyze_theil_decomposition(pos, "ntl_total", "state") theil.fig ``` ```{python} print(theil.interpret()) ``` ## Convergence clubs Finally, the Phillips-Sul log(t) machinery asks whether all districts share one steady-state path or sort into clubs: ```{python} clubs = gm.analyze_convergence_clubs(pos, "ntl_total", gdf=gdf_pos) print(f"{clubs.n_clubs} clubs, {clubs.n_divergent} divergent districts " f"(whole-panel log-t = {clubs.global_tstat:.1f})") clubs.fig_map ``` ## Sources - Mendez, C., Kabiraj, S., & Li, J. β *Regional growth, convergence, and spatial spillovers in India: A reproducible view from outer space* ([repository](https://github.com/quarcs-lab/project2025s-py), [interactive manuscript](https://quarcs-lab.github.io/project2025s-py/)) - Chanda, A., & Kabiraj, S. (2020). Shedding light on regional growth and convergence in India. *World Development*, 133. - Data: DMSP-OLS radiance-calibrated nighttime lights (NOAA/NGDC), district boundaries from the 2001 Census geography. ## ----- articles/bolivia-dataset.qmd ----- --- title: "The Bolivia dataset" subtitle: "PWT-anchored local GDP at three spatial scales, 2012-2022" --- geometrics ships a second case study: **BOL-005popAdj-PWTscaled**, a Bolivia subnational GDP product built from the 0.25Β° gridded estimates of Rossi-Hansberg & Zhang (2026) under their most aggressive low-density censoring (`0_05`), **rescaled so the national totals exactly equal Penn World Table 11.0** (`rgdpo` and `pop`). GDP and population are therefore in interpretable **2021 PPP US dollars**, while the model's relative spatial pattern is preserved exactly. The collection is delivered at three analysis scales β **departments** (ADM1, n=9), **provinces** (ADM2, n=112), and the raw **grid cells** (n=1,603) β for 2012β2022, with GADM 4.10 boundaries. If you use it, cite the underlying estimates, the benchmark, and the boundaries: **Rossi-Hansberg & Zhang (2026, *J. Urban Economics* 154)**, **Feenstra, Inklaar & Timmer (2015, *AER* 105(10); PWT 11.0)**, and **GADM 4.10**. The full method (with the rescaling math) is documented in [`datasets/BOL-005popAdj-PWTscaled/README.md`](https://github.com/quarcs-lab/geometrics/tree/main/datasets/BOL-005popAdj-PWTscaled). ## Three scales, one contract Each loader returns the usual geometrics trio β ID-only geometry, long panel, data dictionary: ```{python} import warnings warnings.filterwarnings("ignore") import geometrics as gm gdf, df, df_dict = gm.data.load_bolivia() # 112 provinces (ADM2) df = gm.set_labels(df, df_dict, set_panel=True) print(f"provinces: {gdf.shape[0]} polygons, panel {df.shape}") gdf1, df1, dd1 = gm.data.load_bolivia_departments() # 9 departments (ADM1) df1 = gm.set_labels(df1, dd1, set_panel=True) ``` The dictionaries ship with the data and document every column, including the scaling provenance: ```{python} df_dict[df_dict["var_name"].isin(["gid", "year", "gdp_pwt", "gdppc", "ln_gdppc"])] ``` One data fact to know: **five provinces have boundary polygons but no panel rows** β all of their grid cells fall below the `0_05` population-density censoring threshold. geometrics' alignment machinery warns about them and carries on; the warnings below are expected, not a bug. ## The map ```{python} gm.explore_choropleth_map(df, "ln_gdppc", gdf=gdf, period=2022).fig ``` ## Convergence across provinces, 2012-2022 Did poorer provinces grow faster? First aspatially, then with the spatial Durbin model on queen-contiguity weights: ```{python} w = gm.make_weights(gdf) # queen contiguity, row-standardized ols = gm.analyze_beta_convergence(df, "gdppc", model="ols") sdm = gm.analyze_beta_convergence( df, "gdppc", model="sdm", gdf=gdf, w=w, n_draws=2000 ) print( f"OLS beta: {ols.beta_total:.4f} (half-life {ols.half_life:.0f} yr)\n" f"SDM total: {sdm.beta_total:.4f} = direct {sdm.beta_direct:.4f} " f"+ indirect {sdm.beta_indirect:.4f} (rho = {sdm.rho:.2f})" ) ``` ```{python} sdm.fig ``` ```{python} print(sdm.interpret()) ``` ## How much inequality lies between departments? The province panel carries its parent department (`name1`), so the Theil between/within decomposition is one call: ```{python} theil = gm.analyze_theil_decomposition(df, "gdppc", "name1") theil.fig ``` ```{python} print(theil.interpret()) ``` ## Down to the grid The raw 0.25Β° cells behind the admin aggregates β useful when administrative boundaries are themselves part of the question: ```{python} gdfg, dfg, ddg = gm.data.load_bolivia_grid() dfg = gm.set_labels(dfg, ddg, set_panel=True) print(f"cells: {gdfg.shape[0]}, panel {dfg.shape}") gm.explore_choropleth_map(dfg, "ln_gdppc", gdf=gdfg, period=2022, tiles=None).fig ``` And spatial structure at cell level β LISA clusters of log GDP per capita: ```{python} wg = gm.make_weights(gdfg) # queen on the grid = rook + corners lisa = gm.explore_lisa_cluster_map(dfg, "ln_gdppc", gdf=gdfg, w=wg, period=2022) print(f"Moran's I = {lisa.moran_i:.2f} (p = {lisa.p_sim_global:.3f}); " f"HH cells: {lisa.n_hh}, LL cells: {lisa.n_ll}") lisa.fig ``` ## Where next - [Beta and sigma convergence](convergence.qmd) β the convergence toolkit in depth - [Spatial spillovers](spillovers.qmd) β diagnostics, the spreg suite, robustness - [The data model](data-model.qmd) β bring your own (gdf, df, df_dict) - The other bundled case study: [India](india-case-study.qmd) ## ----- use-with-llms.qmd ----- --- title: "geometrics for AI agents and LLMs" --- This page is the contract an AI agent needs to use geometrics correctly: where the machine-readable documentation lives, how to install the package, the three-input data model, how to pick a function, and what every result object guarantees. ## Machine-readable entry points - **[llms.txt](https://quarcs-lab.github.io/geometrics/llms.txt)** β the curated index ([llmstxt.org](https://llmstxt.org) convention): what the package is, the docs pages, and the full public API grouped by prefix. Fetch this first. - **[llms-full.txt](https://quarcs-lab.github.io/geometrics/llms-full.txt)** β the full dump: every docs page's source (prose + code) plus every public signature and docstring. Use it when you need exact parameter names and defaults without importing the package. Every page on this site also advertises both files through `` tags in its ``. ## Installing ```bash pip install geometrics # core (maps, ESDA, convergence, spreg, GWR) pip install "geometrics[dynamics]" # + Markov / spatial Markov (giddy) pip install "geometrics[all]" # everything, incl. static PNG export ``` - Python **3.11+**. - Bleeding edge: `pip install "git+https://github.com/quarcs-lab/geometrics.git"`. - In scripts, never call `.fig.show()` β persist figures with `res.fig.write_html("out.html")` (interactive) or `res.fig.write_image("out.png")` (needs the `png` extra; tile-free maps via `tiles=None` export deterministically). ## The three-input contract geometrics separates geometry, data, and metadata. Users supply all three: | Input | What it is | How it enters | |---|---|---| | `gdf` | Geometry with **only the entity ID** (+ optional name) β shapefile, zipped shapefile, GeoJSON, GeoPackage, or a GeoDataFrame | `gm.read_gdf("districts.gpkg", entity="district_id")` | | `df` | A **long-form panel** β one row per (entity, time) | `gm.set_panel(df, entity="district_id", time="year")` | | `df_dict` | A **6-column data dictionary** β `var_name, var_def, label, type, role, can_be_na` | `gm.set_labels(df, df_dict, set_panel=True)` | Declare once, use everywhere: ```python import geometrics as gm gdf, df, df_dict = gm.data.load_india() # or the user's own three inputs df = gm.set_labels(df, df_dict, set_panel=True) # wires entity/time into df.attrs w = gm.make_weights(gdf, method="knn", k=6) # weights are built once, explicitly ``` Rules an agent must follow: - The vocabulary is always **`entity`** and **`time`**. After `set_panel` / `set_labels(..., set_panel=True)` they resolve automatically from `df.attrs`; an explicit `entity=` / `time=` argument always wins. - **`gdf` and `w` are always explicit keyword arguments** β geometry is data, never hidden state. Any function that draws a map or estimates a spatial model takes `gdf=` (and usually `w=`). - `type` in the dictionary is one of `entity / time / factor / logical / numeric`; `role` is one of `"" / outcome / covariate / entity_name`. ## Picking a function The public API is grouped by prefix β the module map of this site: - **`explore_*`** β describe and visualize: choropleths, the weights connectivity graph, Moran scatterplots, LISA cluster maps, Moran's I over time, distribution ridgelines, space-time heatmaps. - **`analyze_*`** β estimate and test: Ξ²/Ο/club convergence, the spreg suite (OLS/SAR/SEM/SLX/SDM) with LeSage-Pace impacts and LM diagnostics, Markov and spatial Markov transitions, Gini/Theil inequality, GWR/MGWR. - **`learn_*`** β teaching sandboxes that simulate data from a known data-generating process so a learner can watch an estimator recover a planted parameter. **Never point them at user data** β they take knobs (e.g. `rho=`, `seed=`), not DataFrames. - Unprefixed **utilities** β `read_gdf`, `make_weights`, `set_panel`, `set_labels`, `build_data_dict`, `explain`, `list_topics`, ... The full membership of each group is listed in [llms.txt](https://quarcs-lab.github.io/geometrics/llms.txt) and the [API reference](reference/index.qmd). ## The result-object contract Every public function returns a **frozen dataclass**. Depending on the function it exposes: | Attribute | What it is | |---|---| | `.df` | The tidy result DataFrame (always present) | | `.fig` | An interactive Plotly figure (themed; never auto-shown) | | `.gt` | A publication-ready Great Tables object (estimation tables) | | named scalars | e.g. `beta`, `speed`, `half_life`, `rho`, `moran_i`, `p_value` | | `.interpret()` | A plain-language reading of this specific result | | `.explain()` | The concept explainer behind the method | | `notes` | Advisory notes accumulated during estimation | | `w_spec` | A human-readable description of the spatial weights used | Results are immutable β build a new call rather than mutating a result. ## Interpretation guardrails - Lead with `res.interpret()` when reporting to a user β it is written in **association-only** language. Follow it: "is associated with", never "causes" or "the effect of". - Ground concepts with the built-in explainer registry: `gm.list_topics()` β `gm.explain("spatial_autocorrelation")`. ## One runnable recipe ```python import geometrics as gm gdf, df, df_dict = gm.data.load_india() # ID-only geometry, long panel, dictionary df = gm.set_labels(df, df_dict, set_panel=True) w = gm.make_weights(gdf, method="knn", k=6) lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w) print(lisa.interpret()) # where the hot/cold spots are res = gm.analyze_beta_convergence(df, "ntl_total", model="sdm", gdf=gdf, w=w) print(res.interpret()) # catch-up, speed, spillovers res.fig.write_html("convergence.html") ``` ## Links - [Repository](https://github.com/quarcs-lab/geometrics) Β· [API reference](reference/index.qmd) Β· [Changelog](changelog.qmd) - Bundled case studies: `gm.data.load_india()`, `gm.data.load_bolivia()` / `load_bolivia_departments()` / `load_bolivia_grid()` ## ----- changelog.qmd ----- --- title: "Changelog" --- ## v0.1.3 (2026-07-02) - **Three no-code Streamlit apps** β one per module, sharing a lean shell (bundled case-study picker + spatial-weights controls): the **Explore app** (choropleths, connectivity, Moran/LISA, distributions over time), the **Analyze app** (Ξ²/Ο convergence, clubs, spatial models with impacts and LM diagnostics, weights robustness, Markov dynamics, inequality/Theil, button-gated GWR), and the **Learn app** (all 11 sandboxes with sliders + the searchable explainer browser). Pages gate themselves on the active dataset (panel length, unit count, optional extras). - New `streamlit` extra (`pip install "geometrics[streamlit]"`, included in `[all]`), Python launchers `ExploreApp` / `AnalyzeApp` / `LearnApp`, console scripts `geometrics-explore` / `-analyze` / `-learn`, and repo-root entry points (`app_explore.py`, `app_analyze.py`, `app_learn.py`, `streamlit_app.py` chooser) for Streamlit Community Cloud. - The site now presents the library in the three modules (v0.1.2/v0.1.3 together): per-module pedagogical pages, Colab notebooks, and the rewritten landing page. ## v0.1.2 (2026-07-02) - **The Learn module: 11 `learn_*` concept sandboxes.** Each simulates data from a known data-generating process so a learner can watch the estimator recover a planted parameter, returning a frozen `SandboxResult` (`.df` estimated-vs-truth table, `.fig`, `.summary` scalar facts, `.data` raw simulated frame, `.interpret()` / `.explain()`): `learn_spatial_autocorrelation`, `learn_spatial_weights`, `learn_lisa_clusters`, `learn_spatial_spillovers` (closed-form LeSage-Pace truth), `learn_omitted_spatial_lag`, `learn_beta_convergence`, `learn_sigma_convergence`, `learn_convergence_clubs`, `learn_markov_chains`, `learn_spatial_markov` (both via the `dynamics` extra), and `learn_theil_decomposition` (independent numpy truth). - The shared synthetic geographies and planted-parameter processes now live in `geometrics.sandbox._dgp`; the test suite's known-answer fixtures delegate to them. - Visual identity: the "classified lattice" logo/favicon and a hero image built from the real India LISA cluster map; new [For AI / LLMs](use-with-llms.qmd) page. ## v0.1.1 (2026-07-02) - **New dataset: Bolivia (BOL-005popAdj-PWTscaled)** β the 0.25Β° gridded GDP of Rossi-Hansberg & Zhang (2026) under `0_05` censoring, rescaled so national totals equal Penn World Table 11.0 (2021 PPP US$), 2012β2022, on GADM 4.10 boundaries. Committed under `datasets/` in this repository and served from pinned, hash-verified raw URLs. - New loaders: `load_bolivia()` (112 provinces; 5 fully-censored provinces have polygons but no panel rows β documented), `load_bolivia_departments()` (9 departments), `load_bolivia_grid()` (1,603 cells with a synthesized single `cell` entity id), `load_bolivia_raw()` (any level incl. ADM0). - New article: [The Bolivia dataset](articles/bolivia-dataset.qmd). ## v0.1.0 (2026-07-02) First public release. - **The three-input data contract**: `read_gdf` (shapefile / zipped shapefile / GeoJSON / GeoPackage β ID-only geometry), a long-form panel declared with `set_panel` / `set_labels`, and a six-column data dictionary (`df_dict`, inferable with `build_data_dict`). - **Maps & ESDA**: `explore_choropleth_map` (classified, animated), `explore_connectivity_map`, `explore_moran_plot`, `explore_lisa_cluster_map`, `explore_moran_over_time`, `explore_distribution_over_time`, `explore_spacetime_heatmap`. - **Convergence**: `analyze_beta_convergence` (OLS / SAR / SEM / SLX / SDM with LeSage-Pace impacts and Monte-Carlo standard errors), `analyze_sigma_convergence`, `analyze_convergence_clubs` (Phillips-Sul log(t) with club maps). - **Spatial econometrics**: `analyze_spatial_model`, `analyze_spatial_diagnostics` (Anselin-Florax recommendation), `analyze_spatial_model_by_weights`. - **Distribution dynamics**: `analyze_markov_transitions`, `analyze_spatial_markov` (via the `dynamics` extra). - **Inequality**: `analyze_inequality_over_time` (Gini / Theil / CV, spatial Gini), `analyze_theil_decomposition`. - **Local models**: `analyze_gwr`, `analyze_mgwr`. - **Pedagogy**: every result carries `.interpret()` and `.explain()`; 30 concept explainers registered. - **Data**: the Indian district case study (520 districts, DMSP-OLS nighttime lights 1996-2010) and a 32-state demo, fetched from [quarcs-lab/project2025s-py](https://github.com/quarcs-lab/project2025s-py) at a pinned commit with hash verification. # ===== Public API signatures ===== ## explore_* ### explore_choropleth_map(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', period: 'Any' = None, animate: 'bool' = False, entity: 'str | None' = None, time: 'str | None' = None, scheme: 'str | None' = 'fisherjenks', k: 'int' = 5, bins: 'Sequence[float] | None' = None, tiles: 'str | None' = 'carto-positron', hover: 'str | Sequence[str] | None' = None, simplify: 'float | str | None' = 'auto', title: 'str | None' = None) -> 'ChoroplethMapResult' Map one variable across entities as a classed (or continuous) choropleth. The panel ``df`` is aligned to the entity geometry ``gdf`` for one cross section (the latest period by default) and drawn as a Plotly choropleth with one legend-togglable trace per class. With ``animate=True`` every period becomes an animation frame, classified on **pooled** breaks (computed from all periods together) so colors are comparable over time. Parameters ---------- df Long panel (or cross section) holding ``var`` per entity. var Numeric column of ``df`` to map. gdf Entity geometry; must carry the same entity-id column as ``df``. period Period to map. Defaults to the latest period when ``df`` has a time dimension (a note records this). Ignored when ``animate=True``. animate Draw every period as an animation frame with a slider and play button. entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. scheme A mapclassify scheme name (``"fisherjenks"``, ``"quantiles"``, ...), or ``None`` for a continuous colorbar map. k Number of classes for ``scheme`` (ignored when ``bins`` are given). bins Explicit upper class bounds (overrides ``scheme`` / ``k``). tiles MapLibre base-map style (default ``"carto-positron"``) or ``None`` for the vector backend (deterministic PNG export). hover Extra ``df`` column(s) appended to the hover box. simplify Geometry simplification: ``"auto"`` (metric tolerance = max bounding-box dimension / 2000), a float tolerance in meters, or ``None`` to disable. title Figure title. Defaults to the variable label plus the mapped period. Returns ------- ChoroplethMapResult Frozen result with ``df`` (entity, value, class label), ``fig``, ``gdf_plotted`` (the WGS84 geometry actually drawn, value and class attached), the applied ``bins``, and ``notes``. Examples -------- Map a small two-period panel (latest period by default): ```python import geopandas as gpd import pandas as pd from shapely.geometry import box from geometrics.maps import explore_choropleth_map gdf = gpd.GeoDataFrame( {"region": ["a", "b", "c", "d"]}, geometry=[box(i % 2, i // 2, i % 2 + 1, i // 2 + 1) for i in range(4)], crs="EPSG:4326", ) df = pd.DataFrame( { "region": ["a", "b", "c", "d"] * 2, "year": [2000] * 4 + [2010] * 4, "gdppc": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5], } ) res = explore_choropleth_map( df, "gdppc", gdf=gdf, entity="region", time="year", k=2, tiles=None ) print(res.period, res.bins) ``` ### explore_connectivity_map(gdf: 'gpd.GeoDataFrame', *, w: 'W | None' = None, entity: 'str | None' = None, tiles: 'str | None' = 'carto-positron', title: 'str | None' = None) -> 'ConnectivityMapResult' Draw the spatial weights graph over the map and summarize its connectivity. The figure overlays the neighbor graph (edges between adjacent centroids, one node per unit) on a light-grey polygon layer, and the companion histogram shows the neighbor-cardinality distribution β the standard visual audit of a ``W`` before any spatial statistic is computed on it. Parameters ---------- gdf Geometry frame (see :func:`geometrics.read_gdf`). w ``libpysal`` weights aligned to the gdf entity ids. ``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. entity Entity id column of ``gdf``; resolved automatically when ``None``. tiles MapLibre basemap style (e.g. ``"carto-positron"``). ``None`` draws a vector (tile-free) figure suitable for deterministic PNG export. title Figure title (a default naming the weights is used when ``None``). Returns ------- ConnectivityMapResult The per-entity neighbor-cardinality frame, the graph figure (``fig``), the cardinality histogram (``fig_hist``), the connectivity scalars and ``w_spec``. Examples -------- Connectivity of a two-cell map (each unit has exactly one neighbor): ```python import geopandas as gpd from shapely.geometry import box from geometrics.weights import explore_connectivity_map, make_weights gdf = gpd.GeoDataFrame( {"region": ["A", "B"]}, geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)], crs="EPSG:4326", ) res = explore_connectivity_map(gdf, w=make_weights(gdf), tiles=None) (res.n_units, res.mean_neighbors) ``` ### explore_moran_plot(df: 'pd.DataFrame', var: 'str', *, gdf: 'gpd.GeoDataFrame', w: 'W | None' = None, period: 'Any' = None, entity: 'str | None' = None, time: 'str | None' = None, permutations: 'int' = 999, seed: 'int | None' = 12345, title: 'str | None' = None) -> 'MoranPlotResult' Draw the Moran scatterplot and test global spatial autocorrelation in ``var``. The panel is aligned to the geometry for one cross-section (the latest period by default), the variable is z-standardized, and its row-standardized spatial lag is plotted against it, colored by scatter quadrant (HH, LH, LL, HL). The OLS slope of the fitted line equals global Moran's I under row-standardized weights, whose significance is assessed with ``permutations`` conditional permutations (:class:`esda.moran.Moran`). Parameters ---------- df Long panel (or cross-section) holding ``var`` per entity. var Numeric column of ``df`` to test. gdf Entity geometry; must carry the same entity-id column as ``df``. w ``libpysal`` weights aligned to the gdf entity ids. ``None`` builds the default weights (queen contiguity for polygons, 6-nearest-neighbor otherwise) with a :class:`~geometrics.GeometricsWarning`. esda row-standardizes the weights for the statistic (its ``transformation="r"`` convention), which is also what makes the scatter slope equal Moran's I. period Period to analyze. Defaults to the latest period when ``df`` has a time dimension (a note records this). entity, time Panel identifiers; default to the ids declared via :func:`geometrics.set_panel`. permutations Number of conditional permutations behind ``p_sim`` / ``z_sim``. seed esda's global :class:`~esda.moran.Moran` draws its permutations from NumPy's **global** random state and exposes no seed argument, so when ``seed`` is not ``None`` geometrics calls ``numpy.random.seed(seed)`` immediately before the test to make ``p_sim`` reproducible. Pass ``None`` to leave the global state untouched. title Figure title. Defaults to ``"Moran scatterplot: