Explore regional data

The Explore module is your first look at a regional dataset — before you estimate a single model. This page is a case study: you have just been handed 520 Indian districts observed by DMSP-OLS satellite nighttime lights between 1996 and 2010 (from Mendez, Kabiraj & Li) and asked three questions an analyst always starts with: is development spatially clustered, where exactly, and how did the whole regional distribution move over time?

Every Explore function takes the panel and returns a small result object carrying a tidy .df plus an interactive Plotly figure (.fig), and most offer a plain-language .interpret(). Read this page top to bottom: the functions are ordered as a workflow — load the three inputs → map the level → encode the neighborhood → test and localize clustering → watch the distribution move in time and space.

Note

This is exploratory analysis: every reading below describes an association, never a cause. The Analyze module turns these patterns into estimates, and Learn explains the ideas behind them with simulations you control.

Stage 0 — Load the three inputs

geometrics separates geometry, data, and metadata into three inputs: a geometry table that carries only the entity ID (gdf), a long-form panel with one row per district-year (df), and a data dictionary (df_dict). The bundled India case study ships all three; set_labels attaches the dictionary’s labels to every subsequent figure and declares the (entity, time) coordinates once.

import warnings

warnings.filterwarnings("ignore")

import geometrics as gm

gdf, df, df_dict = gm.data.load_india()
df = gm.set_labels(df, df_dict, set_panel=True)  # labels + entity/time + roles, once
df.head(3)

	statedist	state	district	year	ntl_rural	ntl_urban	ntl_total	ntl_pc_1996	log_ntl_pc_1996	growth_ntl_pc_9610	...	latitude	rural_share	log_pop_density	sc_share	st_share	work_share	literacy_share	higher_edu_share	electricity_share	log_paved_roads
0	Andhra PradeshAdilabad	Andhra Pradesh	Adilabad	1996	35532.074	8465.7354	43997.809	0.019194	-3.953148	0.039751	...	19.25	0.948856	5.205684	0.188309	0.16231	0.439483	0.414134	0.043333	0.454379	4.301359
1	Andhra PradeshAdilabad	Andhra Pradesh	Adilabad	1999	51730.660	9121.8799	60852.539	0.019194	-3.953148	0.039751	...	19.25	0.948856	5.205684	0.188309	0.16231	0.439483	0.414134	0.043333	0.454379	4.301359
2	Andhra PradeshAdilabad	Andhra Pradesh	Adilabad	2000	63759.672	10821.1310	74580.805	0.019194	-3.953148	0.039751	...	19.25	0.948856	5.205684	0.188309	0.16231	0.439483	0.414134	0.043333	0.454379	4.301359

3 rows × 28 columns

The dictionary is data too — it documents every column and drives the labels on every figure:

df_dict.head(8)

	var_name	var_def	label	type	role	can_be_na
0	statedist	Unique district identifier formed by concatena...	State-district ID	entity	NaN	False
1	state	Name of the Indian state or union territory th...	State	factor	NaN	False
2	district	Name of the district under 1991-census boundar...	District	factor	entity_name	False
3	year	Observation year of the radiance-calibrated DM...	Year	time	NaN	False
4	ntl_rural	Radiance-calibrated DMSP-OLS total nighttime l...	Rural NTL	numeric	NaN	False
5	ntl_urban	Radiance-calibrated DMSP-OLS total nighttime l...	Urban NTL	numeric	NaN	False
6	ntl_total	Radiance-calibrated DMSP-OLS total nighttime l...	Total NTL	numeric	NaN	False
7	ntl_pc_1996	Radiance-calibrated nighttime luminosity per c...	NTL per capita (1996)	numeric	NaN	False
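To make the role of the dictionary concrete, here is a minimal sketch in plain Python. It does not use the geometrics API: `label_for` and `panel_coordinates` are hypothetical helpers, and the three rows are copied from the table above. It only illustrates the idea that labels and the (entity, time) coordinates can be read off the dictionary instead of being hard-coded.

```python
# Hypothetical stand-in for a few df_dict rows (values copied from the table above).
data_dictionary = [
    {"var_name": "statedist", "label": "State-district ID", "type": "entity"},
    {"var_name": "year", "label": "Year", "type": "time"},
    {"var_name": "ntl_total", "label": "Total NTL", "type": "numeric"},
]


def label_for(var_name, dictionary):
    """Return the display label for a column, falling back to the raw name."""
    for row in dictionary:
        if row["var_name"] == var_name:
            return row["label"]
    return var_name


def panel_coordinates(dictionary):
    """Read the entity and time columns from the 'type' field."""
    entity = next(r["var_name"] for r in dictionary if r["type"] == "entity")
    time = next(r["var_name"] for r in dictionary if r["type"] == "time")
    return entity, time


print(label_for("ntl_total", data_dictionary))
print(panel_coordinates(data_dictionary))
```

A column missing from the dictionary simply keeps its raw name in this sketch; how geometrics itself handles that case is not stated on this page.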

Stage 1 — See the map

explore_choropleth_map classifies with mapclassify (Fisher-Jenks by default, k classes) and draws one legend entry per class, so the legend is the classification:

gm.explore_choropleth_map(df, "ntl_total", gdf=gdf, period=2010).fig

Pass animate=True instead of a single period to play the whole 1996–2010 film, or switch scheme ("quantiles", "equalinterval", …) to see how much the story depends on the classification — gm.explain("choropleth_classification") explains why.
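The dependence on the scheme is easy to see on a toy example. The sketch below is plain Python, not mapclassify's implementation, and it covers only equal-interval and quantile breaks (Fisher-Jenks is more involved and is left out). The function names and the eight sample values are made up for illustration.

```python
def equal_interval_breaks(values, k):
    """Upper bounds of k classes of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Pin the last bound to the maximum so rounding never leaves it out.
    return [lo + width * (i + 1) for i in range(k - 1)] + [hi]


def quantile_breaks(values, k):
    """Upper bounds of k classes with roughly equal counts (linear interpolation)."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for i in range(1, k + 1):
        pos = (n - 1) * i / k
        lower = int(pos)
        upper = min(lower + 1, n - 1)
        frac = pos - lower
        breaks.append(ordered[lower] + frac * (ordered[upper] - ordered[lower]))
    return breaks


def classify(values, breaks):
    """Give each value the index of the first class whose upper bound covers it."""
    labels = []
    for v in values:
        label = len(breaks) - 1
        for i, bound in enumerate(breaks):
            if v <= bound:
                label = i
                break
        labels.append(label)
    return labels


skewed = [1, 2, 3, 4, 5, 6, 7, 100]  # one extreme value, like a bright metropolis
print(classify(skewed, equal_interval_breaks(skewed, 4)))
print(classify(skewed, quantile_breaks(skewed, 4)))
```

With a skewed variable, equal intervals put almost every unit in the lowest class, while quantiles spread units evenly across classes. The same data can therefore produce two very different maps.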

Stage 2 — Encode the neighborhood

Everything spatial starts with a weights matrix W — the formal answer to “who is whose neighbor?”. The paper uses 6 nearest neighbors; explore_connectivity_map draws the graph so you can inspect it before trusting it:

w = gm.make_weights(gdf, method="knn", k=6)
gm.explore_connectivity_map(gdf, w=w).fig

Contiguity is the common alternative (method="queen") — see Spatial dependence and LISA for how to choose, and Analyze for checking that results survive the choice.
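As a rough illustration of what a k-nearest-neighbor W encodes, the sketch below builds row-standardized weights from centroids in plain Python. It is not how make_weights is implemented; `knn_weights` and the four toy centroids are assumptions, and coordinates are taken to be in a projected (metric) system.

```python
import math


def knn_weights(centroids, k):
    """Map each unit to its k nearest neighbors with row-standardized weights.

    centroids: dict of unit id -> (x, y) in projected (metric) coordinates.
    """
    weights = {}
    for i, (xi, yi) in centroids.items():
        # Sort every other unit by Euclidean distance to unit i.
        distances = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in centroids.items()
            if j != i
        )
        nearest = [j for _, j in distances[:k]]
        # Row standardization: the weights of each unit sum to one.
        weights[i] = {j: 1.0 / len(nearest) for j in nearest}
    return weights


centroids = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (10.0, 0.0)}
w_toy = knn_weights(centroids, k=2)
print(w_toy["d"])  # the remote unit still gets two neighbors
print(w_toy["c"])  # but it is not a neighbor of anyone in return
```

Two properties are worth checking on the connectivity map: every unit has exactly k neighbors, however isolated it is, and the relation is not necessarily symmetric.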

Stage 3 — Is development spatially clustered?

The Moran scatterplot plots each district’s standardized value against the weighted average of its neighbors’ values (the spatial lag); under row-standardized weights, the slope of the fitted line is global Moran’s I, the workhorse test of spatial autocorrelation:

moran = gm.explore_moran_plot(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996)
moran.fig

print(moran.interpret())

Global Moran's I for **log_ntl_pc_1996** in 1996 is 0.724, against an expectation of -0.00193 under spatial randomness — statistically significant at the 1% level (pseudo p = 0.001 from 999 permutations, under 6-nearest-neighbor (metric centroids), row-standardized, n=520).

The dependence is **positive**: similar values cluster in space — high values sit next to high values and low next to low, so the map shows contiguous patches rather than a random scatter.

448 of 520 regions (86%) fall in the clustering quadrants of the scatter (High-High or Low-Low); the rest are surrounded by neighbors unlike themselves. The slope of the fitted line equals Moran's I under row-standardized weights, so a steeper line means stronger clustering.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
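The numbers in this reading can be reproduced in spirit with a few lines of plain Python. The sketch below is not the library's code: it assumes row-standardized weights stored as nested dicts, uses a made-up six-unit chain, and runs a simple one-sided permutation test. `chain_weights`, `morans_i`, `expected_under_randomness` and `permutation_p_value` are illustrative names.

```python
import random


def chain_weights(n):
    """Row-standardized weights for n units on a line (neighbors = adjacent units)."""
    weights = {}
    for i in range(n):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
        weights[i] = {j: 1.0 / len(neighbors) for j in neighbors}
    return weights


def morans_i(values, weights):
    """Global Moran's I, assuming every row of the weights sums to one."""
    mean = sum(values.values()) / len(values)
    z = {i: v - mean for i, v in values.items()}
    # Spatial lag: weighted average of the neighbors' deviations.
    lag = {i: sum(w * z[j] for j, w in weights[i].items()) for i in z}
    # With row-standardized weights this ratio is also the slope of lag on z.
    return sum(z[i] * lag[i] for i in z) / sum(d * d for d in z.values())


def expected_under_randomness(n):
    """Expected value of Moran's I when values are spatially random."""
    return -1.0 / (n - 1)


def permutation_p_value(values, weights, permutations=999, seed=0):
    """One-sided pseudo p-value: share of shuffles at least as clustered as observed."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    ids = list(values)
    shuffled = list(values.values())
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(dict(zip(ids, shuffled)), weights) >= observed:
            extreme += 1
    return observed, (extreme + 1) / (permutations + 1)


toy_values = {0: 1, 1: 2, 2: 3, 3: 10, 4: 11, 5: 12}
toy_i, toy_p = permutation_p_value(toy_values, chain_weights(6))
print(round(toy_i, 3), toy_p)
print(round(expected_under_randomness(520), 5))
```

Two details from the printed reading follow directly from these formulas: with n = 520 the expectation is -1/519, which rounds to -0.00193, and with 999 permutations the smallest attainable pseudo p-value is 1/1000 = 0.001.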

Stage 4 — Where exactly? (LISA)

Global Moran’s I says whether the map clusters; LISA (local Moran) says where — each district is classified as a High-High hot spot, Low-Low cold spot, or a spatial outlier (High-Low / Low-High), masked at 5% significance:

lisa = gm.explore_lisa_cluster_map(df, "log_ntl_pc_1996", gdf=gdf, w=w, period=1996)
lisa.fig

print(lisa.interpret())

Local Moran statistics (LISA) locate *where* **log_ntl_pc_1996** in 1996 clusters or stands out, under 6-nearest-neighbor (metric centroids), row-standardized, n=520. The accompanying global Moran's I is 0.724 (pseudo p = 0.001), consistent with overall clustering of similar values.

At the 0.05 significance level, 281 of 520 regions show significant local association: **169 High-High** hot spots (high values surrounded by high neighbors) and **101 Low-Low** cold spots (low surrounded by low) mark clustering, while **9 High-Low** and **2 Low-High** regions are spatial outliers that break with their surroundings. The remaining 239 regions are not significant — their local pattern is compatible with randomness.

LISA pseudo p-values are computed region by region without a multiple-testing adjustment, so treat borderline clusters cautiously and read the map as descriptive of where dependence concentrates.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
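The four labels come from the signs of two quantities: a unit's own deviation from the mean and the deviation of its spatial lag. The sketch below shows only that classification step in plain Python, on a made-up five-unit chain. It leaves out the conditional permutation test that geometrics uses to mask non-significant units, and `local_moran` is an illustrative name rather than a library function.

```python
def local_moran(values, weights):
    """Return {unit: (local Moran statistic, quadrant label)}.

    Assumes row-standardized weights; significance testing is left out.
    """
    n = len(values)
    mean = sum(values.values()) / n
    z = {i: v - mean for i, v in values.items()}
    m2 = sum(d * d for d in z.values()) / n  # variance of the deviations
    result = {}
    for i, zi in z.items():
        lag = sum(w * z[j] for j, w in weights[i].items())
        own = "High" if zi > 0 else "Low"
        around = "High" if lag > 0 else "Low"
        result[i] = (zi * lag / m2, f"{own}-{around}")
    return result


# Five units on a line; the middle one is a bright spot among dim neighbors.
line_weights = {
    0: {1: 1.0},
    1: {0: 0.5, 2: 0.5},
    2: {1: 0.5, 3: 0.5},
    3: {2: 0.5, 4: 0.5},
    4: {3: 1.0},
}
line_values = {0: 1, 1: 1, 2: 10, 3: 1, 4: 1}
lisa_toy = local_moran(line_values, line_weights)
print([label for _, label in lisa_toy.values()])
```

A negative local statistic marks an outlier (High-Low or Low-High) and a positive one marks clustering. In this sketch the local statistics sum to n times the global Moran's I, which is why the two readings agree in direction.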

Stage 5 — The whole distribution, year by year

Convergence questions are distribution questions. The ridgeline stacks the cross-sectional density of each year on one shared grid, so you can watch the shape — not just the mean — move:

gm.explore_distribution_over_time(df, "log_ntl_pc_1996").fig

(kind="animated" plays the same densities as an animation instead.)
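What "one shared grid" buys you can be sketched with a hand-rolled Gaussian kernel density estimate. The code below is an illustration in plain Python, not the estimator geometrics uses: the fixed bandwidth, the grid, and the six toy observations are all assumptions.

```python
import math


def gaussian_kde(sample, grid, bandwidth):
    """Gaussian kernel density of a sample, evaluated at every grid point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in sample)
        for g in grid
    ]


def densities_by_period(panel, grid, bandwidth):
    """panel: long-form rows of (period, value). Returns {period: density list}."""
    by_period = {}
    for period, value in panel:
        by_period.setdefault(period, []).append(value)
    return {
        period: gaussian_kde(sample, grid, bandwidth)
        for period, sample in sorted(by_period.items())
    }


grid = [i / 10 for i in range(-100, 101)]  # the same x-axis for every period
toy_panel = [
    (1996, -0.5), (1996, 0.0), (1996, 0.5),
    (2010, 0.5), (2010, 1.0), (2010, 1.5),
]
ridges = densities_by_period(toy_panel, grid, bandwidth=0.5)
for period, density in ridges.items():
    peak = grid[density.index(max(density))]
    print(period, "peak at", peak)
```

Because every period is evaluated on the same grid, a shift of the peak or a change in spread between rows of the ridgeline reflects the data rather than a rescaled axis.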

Stage 6 — Every district, every year

The space-time heatmap keeps every unit visible: one row per district, one column per year. Sorting the rows by latitude turns geography itself into the y-axis — a north–south transect of the whole panel:

gm.explore_spacetime_heatmap(
    df, "log_ntl_pc_1996", gdf=gdf, sort_by="north_south"
).fig

Rows that keep their shading left to right are persistent; rows that lighten or darken are mobile. sort_by="value" orders by the first period instead.
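Underneath, a space-time heatmap is a long-to-wide reshape with a chosen row order. The sketch below does that reshape in plain Python on three made-up units; `spacetime_matrix` is a hypothetical helper and the north-first ordering is an assumption about what a north-south sort means.

```python
def spacetime_matrix(panel, latitude):
    """Reshape long-form rows into one row per unit and one column per period.

    panel: list of (unit, period, value); latitude: unit -> degrees north.
    Rows are ordered north to south; missing cells are left as None.
    """
    periods = sorted({period for _, period, _ in panel})
    lookup = {(unit, period): value for unit, period, value in panel}
    units = sorted(latitude, key=latitude.get, reverse=True)
    matrix = [[lookup.get((unit, period)) for period in periods] for unit in units]
    return units, periods, matrix


toy_rows = [
    ("south", 1996, 1.0), ("south", 2010, 1.1),
    ("north", 1996, 3.0), ("north", 2010, 3.0),
    ("middle", 1996, 2.0), ("middle", 2010, 4.0),
]
toy_latitude = {"south": 9.5, "middle": 19.25, "north": 31.0}
units, periods, matrix = spacetime_matrix(toy_rows, toy_latitude)
for unit, row in zip(units, matrix):
    print(unit, row)
```

In this toy matrix the "north" row is persistent (the same value in both periods), while "middle" is mobile. That is the left-to-right reading described above.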

Stage 7 — Does the clustering strengthen or fade?

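Stage 3 tested one year. Running Moran’s I per year closes the loop: is the spatial structure of development deepening or dissolving? One caveat for the call below: log_ntl_pc_1996 looks like a 1996 baseline repeated on every row (the Stage 0 preview shows the same value in 1996, 1999 and 2000), so expect a flat series from it; a column that changes by year, such as ntl_total, is the kind of input that can reveal a trend.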

mot = gm.explore_moran_over_time(df, "log_ntl_pc_1996", gdf=gdf, w=w)
mot.fig

print(mot.interpret())

Global Moran's I for **log_ntl_pc_1996** is tracked over 6 periods (1996 to 2010) on a fixed set of regions, under 6-nearest-neighbor (metric centroids), row-standardized, n=520: it moves from 0.724 to 0.724.

The series is broadly **stable**: the degree to which similar values cluster in space changes little over the window.

Per-period permutation tests flag 6 of 6 periods as significant at the 5% level (filled markers in the figure); open markers are periods where the pattern is compatible with spatial randomness.

_These are associations, not causal effects. A causal reading needs a research design — see `explain('correlation_vs_causation')`._
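A per-period Moran series is just the Stage 3 statistic recomputed on each cross-section with W held fixed. The sketch below shows that loop in plain Python on a made-up four-unit panel. It is not the library's implementation, and the names `moran_by_period`, `baseline` and `current` are hypothetical. It contrasts a column that is constant within each unit with one that changes by year.

```python
def morans_i(values, weights):
    """Global Moran's I, assuming every row of the weights sums to one."""
    mean = sum(values.values()) / len(values)
    z = {i: v - mean for i, v in values.items()}
    lag = {i: sum(w * z[j] for j, w in weights[i].items()) for i in z}
    return sum(z[i] * lag[i] for i in z) / sum(d * d for d in z.values())


def moran_by_period(panel, column, weights):
    """panel: list of dict rows with 'unit', 'period' and the chosen column."""
    by_period = {}
    for row in panel:
        by_period.setdefault(row["period"], {})[row["unit"]] = row[column]
    # The same weights are reused for every cross-section.
    return {p: morans_i(v, weights) for p, v in sorted(by_period.items())}


fixed_w = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {1: 0.5, 3: 0.5}, 3: {2: 1.0}}
baseline = [1, 2, 3, 4]  # one value per unit, repeated in every period
yearly = {1996: [1, 2, 3, 4], 2010: [1, 4, 2, 3]}  # changes over time
toy_panel = [
    {"unit": u, "period": p, "baseline": baseline[u], "current": vals[u]}
    for p, vals in yearly.items()
    for u in range(4)
]
flat = moran_by_period(toy_panel, "baseline", fixed_w)
moving = moran_by_period(toy_panel, "current", fixed_w)
print(flat)
print(moving)
```

A time-invariant column returns the same statistic in every period, which is the pattern in the reading above (0.724 to 0.724). Only a column that varies by year can show clustering strengthening or fading.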

Where next

You now know the map clusters, where it clusters, and how the distribution moved.

Analyze — estimate it: β/σ/club convergence, spatial models with spillovers, Markov dynamics, inequality decompositions (on the Bolivia case study)
The India case study — the full replication arc on this same panel
Learn — the ideas behind W, Moran’s I and LISA, taught with simulations where you plant the truth
The data model — bring your own (gdf, df, df_dict)