build_data_dict

build_data_dict(df, *, entity=None, time=None, factor_cutoff=10)

Infer a best-guess data dictionary (df_dict) for df.

Produces one row per column with an inferred type and a humanized label, ready to pass to `geometrics.set_labels`. Column-name hints and dtypes drive the guess, and each column gets one of five types: entity (name hints such as country / iso / id or, failing that, the column that uniquely keys the rows together with the time id); time (name hints such as year / date, a datetime dtype, or an integer column in the calendar-year range); logical (boolean or two-valued); factor (categorical/object, or numeric with at most `factor_cutoff` distinct values); otherwise numeric.
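The typing rules above can be pictured as an ordered cascade. The sketch below is a plain-pandas illustration only, not the library's implementation: `guess_type`, the hint lists, the whole-token matching and the 1800-2100 year range are all assumptions, and the fallback that picks the entity as the column keying the rows together with the time id is left out.

```python
import pandas as pd

# Hypothetical hint lists; the hints the library actually uses may differ.
ENTITY_HINTS = {"country", "iso", "id"}
TIME_HINTS = {"year", "date"}


def guess_type(s: pd.Series, factor_cutoff: int = 10) -> str:
    """Illustrative cascade: entity, time, logical, factor, else numeric."""
    tokens = set(str(s.name).lower().split("_"))  # match whole name tokens
    n_distinct = s.nunique(dropna=True)
    if tokens & ENTITY_HINTS:
        return "entity"
    if tokens & TIME_HINTS or pd.api.types.is_datetime64_any_dtype(s):
        return "time"
    # Assumed calendar-year range for integer columns.
    if pd.api.types.is_integer_dtype(s) and s.between(1800, 2100).all():
        return "time"
    if pd.api.types.is_bool_dtype(s) or n_distinct == 2:
        return "logical"
    if not pd.api.types.is_numeric_dtype(s) or n_distinct <= factor_cutoff:
        return "factor"
    return "numeric"


df = pd.DataFrame(
    {
        "iso": ["FRA", "FRA", "DEU", "DEU", "ITA", "ITA"],
        "year": [2000, 2001, 2000, 2001, 2000, 2001],
        "treated": [True, False, True, False, True, False],
        "sector": ["a", "b", "c", "a", "b", "c"],
        "gdp_pc": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1],
    }
)
types = {col: guess_type(df[col], factor_cutoff=3) for col in df.columns}
```

With a cutoff of 3, `gdp_pc` (six distinct values) falls through to numeric. Under the default cutoff of 10 the same column would be typed factor, which is why short example frames can need a lower `factor_cutoff`.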

A best-guess role is also filled in: a text column that is constant within the entity and ~1:1 with it (a readable label for the unit, e.g. a country name beside an ISO code) is tagged entity_name; the role is left blank for every other column. The analytical roles outcome / covariate are never guessed, so mark them yourself, either in the dictionary or via `geometrics.set_roles`.
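The entity_name rule can be checked with two group-wise counts. The following is a minimal sketch in plain pandas, not the library's code: `find_entity_names` is a hypothetical helper, and it requires an exact 1:1 match where the description above only asks for an approximate one.

```python
import pandas as pd


def find_entity_names(df: pd.DataFrame, entity: str) -> list[str]:
    """Text columns that are constant within `entity` and 1:1 with it."""
    n_entities = df[entity].nunique()
    found = []
    for col in df.columns:
        if col == entity or not pd.api.types.is_string_dtype(df[col]):
            continue
        # Constant within the entity: every unit maps to exactly one value.
        constant = (df.groupby(entity)[col].nunique() == 1).all()
        # 1:1 with the entity: as many distinct labels as there are units.
        one_to_one = df[col].nunique() == n_entities
        if constant and one_to_one:
            found.append(col)
    return found


df = pd.DataFrame(
    {
        "iso": ["FRA", "FRA", "DEU", "DEU"],
        "country": ["France", "France", "Germany", "Germany"],
        "continent": ["Europe", "Europe", "Europe", "Europe"],
        "year": [2000, 2001, 2000, 2001],
    }
)
names = find_entity_names(df, "iso")
```

Here `country` qualifies, while `continent` is constant within each unit but shared across units, so it does not label the entity.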

Parameters

Name	Type	Description	Default
df	pd.DataFrame	The data frame to describe.	required
entity	str | Sequence[str] | None	Explicit entity (unit) identifier column name(s); when given, these win over detection (and are validated against `df`).	`None`
time	str \| None	Explicit time identifier column name; when given, it wins over detection.	`None`
factor_cutoff	int	Numeric columns with at most this many distinct values are typed `factor`.	`10`
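Because `entity` accepts a single name, a sequence of names, or `None`, a caller-side check can normalize the value before passing it in. This is a sketch of one plausible form of the validation mentioned in the table; `normalize_entity` is a hypothetical helper, and the checks and exception type the library actually uses are not documented here.

```python
import pandas as pd


def normalize_entity(df: pd.DataFrame, entity) -> list[str]:
    """Return the entity column names as a list, checking they exist in df."""
    if entity is None:
        return []  # nothing explicit: detection would take over
    names = [entity] if isinstance(entity, str) else list(entity)
    missing = [name for name in names if name not in df.columns]
    if missing:
        # Assumed error type; the library may raise something else.
        raise KeyError(f"entity column(s) not found in df: {missing}")
    return names


df = pd.DataFrame({"region": ["A", "B"], "year": [2000, 2000]})
single = normalize_entity(df, "region")
several = normalize_entity(df, ["region", "year"])
```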

Returns

Name	Type	Description
	pandas.DataFrame	A dictionary frame with columns `var_name`, `var_def`, `label`, `type`, `role` and `can_be_na` (one row per column of `df`, in column order).

Examples

Build a dictionary for any frame, then attach the labels and declare the panel in one step:

import pandas as pd

import geometrics as gm

df = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "year": [2000, 2001, 2000, 2001],
        "gdp_pc": [1.0, 1.1, 2.0, 2.1],
    }
)
ddict = gm.build_data_dict(df)
df = gm.set_labels(df, ddict, set_panel=True)
ddict.head()