build_data_dict
build_data_dict(df, *, entity=None, time=None, factor_cutoff=10)

Infer a best-guess data dictionary (df_dict) for df.
Produces one row per column with an inferred type and a humanized label, ready to pass to geometrics.set_labels. Column-name hints and dtypes drive the guess. A column is typed as:

- entity: name hints such as country / iso / id or, failing that, the column that uniquely keys the rows together with the time id.
- time: name hints such as year / date, a datetime dtype, or an integer column in the calendar-year range.
- logical: boolean or two-valued.
- factor: categorical/object, or numeric with at most factor_cutoff distinct values.
- numeric: anything else.
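The typing rules above can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: `guess_type`, the hint tuples, and the year bounds are hypothetical, the rules are applied in the order they are listed, and the "uniquely keys the rows" fallback for entity columns is left out.

```python
import pandas as pd
from pandas.api import types as pdt

# Assumed hint lists and year bounds; the library's real values may differ.
ENTITY_HINTS = ("country", "iso", "id")
TIME_HINTS = ("year", "date")
YEAR_RANGE = (1800, 2100)


def guess_type(s: pd.Series, factor_cutoff: int = 10) -> str:
    """Return a rough type guess for one column (illustrative only)."""
    # Match hints against whole underscore-separated tokens, not substrings.
    tokens = str(s.name).lower().split("_")
    if any(h in tokens for h in ENTITY_HINTS):
        return "entity"
    if any(h in tokens for h in TIME_HINTS) or pdt.is_datetime64_any_dtype(s):
        return "time"
    if pdt.is_integer_dtype(s) and s.between(*YEAR_RANGE).all():
        return "time"
    n_distinct = s.nunique(dropna=True)
    if pdt.is_bool_dtype(s) or n_distinct == 2:
        return "logical"
    if (
        isinstance(s.dtype, pd.CategoricalDtype)
        or pdt.is_object_dtype(s)
        or pdt.is_string_dtype(s)
    ):
        return "factor"
    if pdt.is_numeric_dtype(s) and n_distinct <= factor_cutoff:
        return "factor"
    return "numeric"


df = pd.DataFrame(
    {
        "country": ["FR", "FR", "FR", "DE", "DE", "DE"],
        "year": [2000, 2001, 2002, 2000, 2001, 2002],
        "treated": [True, False, True, False, True, False],
        "sector": ["agri", "mfg", "svc", "agri", "mfg", "svc"],
        "rating": [1, 2, 3, 1, 2, 3],
        "gdp_pc": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
    }
)
# A small cutoff keeps the six-row numeric column from being read as a factor.
guesses = {col: guess_type(df[col], factor_cutoff=3) for col in df.columns}
```

On a short frame, almost every numeric column has at most 10 distinct values, so lowering factor_cutoff (or checking the inferred types by hand) may be worthwhile before relying on the guess.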
A best-guess role is also filled in: a text column that is constant within the entity and roughly 1:1 with it (a readable label for the unit, e.g. a country name beside an ISO code) is tagged entity_name; the role is left blank for every other row. The analytical roles outcome / covariate are never guessed; mark them yourself, either in the dictionary or via geometrics.set_roles.
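The entity_name condition can be checked with two group-wise counts. The sketch below is an assumption about how such a test might look, not the library's code: `find_entity_names` is a hypothetical helper, and it requires a strict 1:1 mapping where the text above allows an approximate one.

```python
import pandas as pd
from pandas.api import types as pdt


def find_entity_names(df: pd.DataFrame, entity: str) -> list[str]:
    """Return text columns that look like a readable label for `entity`."""
    found = []
    for col in df.columns:
        if col == entity or not pdt.is_string_dtype(df[col]):
            continue
        # Constant within the entity: each entity value has a single label.
        constant_within = df.groupby(entity)[col].nunique().le(1).all()
        # 1:1 in the other direction: each label belongs to a single entity.
        one_to_one = df.groupby(col)[entity].nunique().le(1).all()
        if constant_within and one_to_one:
            found.append(col)
    return found


df = pd.DataFrame(
    {
        "iso": ["FRA", "FRA", "DEU", "DEU"],
        "country_name": ["France", "France", "Germany", "Germany"],
        "continent": ["Europe", "Europe", "Europe", "Europe"],
        "year": [2000, 2001, 2000, 2001],
    }
)
names = find_entity_names(df, "iso")
```

Here continent is constant within each iso code but shared by both, so it fails the 1:1 check and only country_name qualifies.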
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The data frame to describe. | required |
| entity | str \| Sequence[str] \| None | Explicit entity (unit) identifier column name(s); when given, these win over detection (and are validated against df). | None |
| time | str \| None | Explicit time identifier column name; when given, it wins over detection. | None |
| factor_cutoff | int | Numeric columns with at most this many distinct values are typed factor. | 10 |
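
The table says explicit entity names are validated against df but does not say how. One plausible reading, sketched here as an assumption, is a check that every named column exists; `resolve_id_columns` is a hypothetical helper and not part of geometrics.

```python
import pandas as pd


def resolve_id_columns(df: pd.DataFrame, entity=None, time=None):
    """Normalize explicit id arguments and check that they exist in df."""
    if entity is None:
        entity_cols = []
    elif isinstance(entity, str):
        entity_cols = [entity]  # a single name becomes a one-item list
    else:
        entity_cols = list(entity)
    requested = entity_cols + ([time] if time is not None else [])
    missing = [c for c in requested if c not in df.columns]
    if missing:
        # Assumed failure mode; the library may raise a different error.
        raise KeyError(f"columns not found in df: {missing}")
    return entity_cols, time


df = pd.DataFrame({"region": ["A", "B"], "year": [2000, 2000], "gdp_pc": [1.0, 2.0]})
ids = resolve_id_columns(df, entity="region", time="year")
```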
Returns
| Name | Type | Description |
|---|---|---|
| | pandas.DataFrame | A dictionary frame with columns var_name, var_def, label, type, role and can_be_na (one row per column of df, in column order). |
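
Since the outcome / covariate roles are left for you to mark, one option is ordinary pandas assignment on the role column. The frame below is a hand-written stand-in with the documented column names, not real build_data_dict output; the empty-string blanks and the values in var_def, label and can_be_na are assumptions.

```python
import pandas as pd

# Stand-in dictionary frame; every value here is illustrative.
ddict = pd.DataFrame(
    {
        "var_name": ["region", "year", "gdp_pc", "pop"],
        "var_def": ["", "", "", ""],
        "label": ["Region", "Year", "GDP per capita", "Population"],
        "type": ["entity", "time", "numeric", "numeric"],
        "role": ["", "", "", ""],  # assumed representation of a blank role
        "can_be_na": [False, False, True, True],
    }
)

# Map variable names to the analytical roles you want to declare.
roles = {"gdp_pc": "outcome", "pop": "covariate"}
mask = ddict["var_name"].isin(list(roles))
ddict.loc[mask, "role"] = ddict.loc[mask, "var_name"].map(roles)
```

Rows not named in roles keep whatever role they already had, so a guessed entity_name tag would survive this step.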
Examples
Build a dictionary for any frame, then attach labels and declare the panel in one step:
import pandas as pd
import geometrics as gm
df = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "year": [2000, 2001, 2000, 2001],
        "gdp_pc": [1.0, 1.1, 2.0, 2.1],
    }
)
ddict = gm.build_data_dict(df)
df = gm.set_labels(df, ddict, set_panel=True)
ddict.head()