build_data_dict

build_data_dict(df, *, entity=None, time=None, factor_cutoff=10)

Infer a best-guess data dictionary (df_dict) for df.

Produces one row per column with an inferred type and a humanized label, ready to pass to geometrics.set_labels. Column-name hints and dtypes drive the guess. A column is typed as one of:

- entity: the name carries a hint such as country / iso / id, or, failing that, it is the column that uniquely keys the rows together with the time identifier.
- time: the name carries a hint such as year / date, the dtype is datetime, or it is an integer column in the calendar-year range.
- logical: boolean or two-valued.
- factor: categorical/object, or numeric with at most factor_cutoff distinct values.
- numeric: anything else.
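The typing rules above can be read as a cascade of checks. The following is a minimal sketch of that reading in plain Python, not the library's implementation: guess_type, the hint tuples, the order of the checks, and the 1800-2100 year window are all assumptions made for illustration, and the cross-column fallback for entity (uniquely keying the rows with time) is left out.

```python
import datetime as dt

# Name hints quoted from the description; the real hint lists may differ.
ENTITY_HINTS = ("country", "iso", "id")
TIME_HINTS = ("year", "date")
# Assumed calendar-year window; the source does not state the bounds.
YEAR_RANGE = range(1800, 2101)


def guess_type(name, values, factor_cutoff=10):
    """Return a best-guess type for one column (hypothetical helper)."""
    # Match hints against underscore-separated tokens, so "id" does not hit "width".
    tokens = name.lower().split("_")
    if any(tok in ENTITY_HINTS for tok in tokens):
        return "entity"
    if any(tok in TIME_HINTS for tok in tokens):
        return "time"

    present = [v for v in values if v is not None]
    if not present:
        return "numeric"  # nothing to inspect; an arbitrary choice in this sketch
    distinct = set(present)

    # Stand-in for a datetime dtype check.
    if all(isinstance(v, (dt.date, dt.datetime)) for v in present):
        return "time"
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in present)
    if is_int and all(v in YEAR_RANGE for v in present):
        return "time"
    if all(isinstance(v, bool) for v in present) or len(distinct) == 2:
        return "logical"
    if all(isinstance(v, str) for v in present):
        return "factor"  # stand-in for categorical/object dtype
    if len(distinct) <= factor_cutoff:
        return "factor"
    return "numeric"
```

Under these assumptions, guess_type("rating", [1, 2, 3, 1, 2]) yields "factor" with the default cutoff and "numeric" with factor_cutoff=2, which is the effect the factor_cutoff parameter is described as having.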

A best-guess role is also filled in: a text column that is constant within the entity and approximately one-to-one with it (a readable label for the unit, e.g. a country name beside an ISO code) is tagged entity_name; all other rows are left blank. The analytical roles outcome / covariate are never guessed; mark them yourself, either in the dictionary or via geometrics.set_roles.
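To make the entity_name condition concrete, here is a small sketch of the two checks it combines: the candidate is constant within each entity, and no two entities share the same value. The helper is_entity_name and the list-of-dicts row format are hypothetical, and the sketch requires an exact one-to-one mapping, whereas the description only asks for an approximate one.

```python
from collections import defaultdict


def is_entity_name(rows, entity, candidate):
    """Check whether `candidate` labels `entity` (hypothetical helper).

    `rows` is a list of dicts, one per observation.
    """
    names_per_entity = defaultdict(set)
    entities_per_name = defaultdict(set)
    for row in rows:
        value = row[candidate]
        if not isinstance(value, str):
            return False  # only text columns qualify
        names_per_entity[row[entity]].add(value)
        entities_per_name[value].add(row[entity])

    # Constant within the entity: each entity maps to exactly one name.
    constant_within = all(len(s) == 1 for s in names_per_entity.values())
    # One-to-one: each name maps back to exactly one entity.
    one_to_one = all(len(s) == 1 for s in entities_per_name.values())
    return bool(names_per_entity) and constant_within and one_to_one
```

With an iso column as the entity, a country column holding "France" for every FRA row passes, while a continent column fails because several entities share "Europe".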

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The data frame to describe. | required |
| entity | str \| Sequence[str] \| None | Explicit entity (unit) identifier column name(s); when given, these win over detection (and are validated against df). | None |
| time | str \| None | Explicit time identifier column name; when given, it wins over detection. | None |
| factor_cutoff | int | Numeric columns with at most this many distinct values are typed factor. | 10 |
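The description does not say what validating explicit identifiers against df involves. One plausible reading, sketched here purely as an assumption, is that the named columns must exist and that entity and time together should key the rows. The helper check_panel_keys, the KeyError, and the list-of-dicts row format are hypothetical and not part of geometrics.

```python
def check_panel_keys(rows, entity, time):
    """Return True if (entity..., time) uniquely keys `rows` (hypothetical helper).

    `rows` is a list of dicts; `entity` is one column name or a sequence of names.
    Raises KeyError if any named column is absent.
    """
    entity_cols = [entity] if isinstance(entity, str) else list(entity)
    key_cols = (*entity_cols, time)

    columns = set(rows[0]) if rows else set()
    missing = [c for c in key_cols if c not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")

    # A valid panel key has no repeated (entity, time) combination.
    keys = [tuple(row[c] for c in key_cols) for row in rows]
    return len(keys) == len(set(keys))
```

A check like this is cheap to run before passing entity= and time= explicitly, since an override that does not key the rows usually signals a wrong column choice.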

Returns

| Name | Type | Description |
|---|---|---|
| | pandas.DataFrame | A dictionary frame with columns var_name, var_def, label, type, role and can_be_na (one row per column of df, in column order). |

Examples

Build a dictionary for any frame, then attach the labels and declare the panel in one step:

import pandas as pd

import geometrics as gm

df = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "year": [2000, 2001, 2000, 2001],
        "gdp_pc": [1.0, 1.1, 2.0, 2.1],
    }
)
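# Infer the dictionary: one row per column of df.
ddict = gm.build_data_dict(df)
# Attach the labels and declare the panel in the same call.
df = gm.set_labels(df, ddict, set_panel=True)
ddict.head()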