Stage 1 — Build the GAUL lookup parquet + enrichment module + tests
Part of the FAO global delivery plan — umbrella: #20. Purely additive — zero existing files modified. Can start immediately, in parallel with Stage 0 (#LINK_C) and the datafactory region work (views-platform/views-datafactory#159).
Why this exists
This is the ADR-011 implementation artifact (docs/ADRs/011_replace_runtime_mapper_with_precomputed_lookup.md, assessment §10). The runtime mapper recomputes the same gid→geography answers on every run from 774 MB of shapefiles; the lookup precomputes them once into a small parquet, and runtime enrichment becomes a merge. Decisive property: it serves both input paths identically — historical (from the datafactory) and forecast (from Appwrite, which never touches the datafactory) — because both key on priogrid_gid.
The data prerequisites are met: views-datafactory regenerated all 7 GAUL parquets on 2026-06-11 as area-majority (the FAO-contracted rule, Release Note 02), 259,200 rows each, mutually consistent. Verified counts: 64,736 of 64,818 land cells fully complete; for africa_me_legacy, 13,105 of 13,110 (5 ocean cells unassigned).
Source data
From a local views-datafactory checkout, data/raw/gaul_admin/ (all June-11, area-majority, schema (gid: int32, value)):
| File | Value type | Becomes column |
|---|---|---|
| gaul0_code.parquet | int32 | admin1_gaul0_code |
| gaul1_code.parquet | int32 | admin1_gaul1_code |
| gaul2_code.parquet | int32 | admin2_gaul2_code |
| gaul0_name.parquet | string | admin1_gaul0_name |
| gaul1_name.parquet | string | admin1_gaul1_name |
| gaul2_name.parquet | string | admin2_gaul2_name |
| iso3_code.parquet | string | country_iso_a3 |
Plus two computed columns (PRIO-GRID is a fixed 0.5° global grid, 720 columns):
pg_xcoord = -180 + ((gid - 1) % 720) * 0.5 + 0.25
pg_ycoord = -90 + ((gid - 1) // 720) * 0.5 + 0.25
(The same formula views-pipeline-core already uses for row/col, dataloaders.py:1201-1206.)
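As a quick sanity check of the two formulas, here is a minimal plain-Python sketch that turns a gid into its cell centroid. The function name `pg_centroid` and the bounds check are illustrative assumptions, not part of the spec; the build script would presumably apply the same arithmetic column-wise.

```python
N_COLS = 720  # 0.5-degree cells per grid row (360 / 0.5)
N_ROWS = 360  # 0.5-degree cells per grid column (180 / 0.5)


def pg_centroid(gid: int) -> tuple[float, float]:
    """Return (pg_xcoord, pg_ycoord) for a 1-based PRIO-GRID gid."""
    if not 1 <= gid <= N_COLS * N_ROWS:
        raise ValueError(f"gid {gid} is outside 1..{N_COLS * N_ROWS}")
    index = gid - 1  # gids start at 1, grid arithmetic starts at 0
    x = -180 + (index % N_COLS) * 0.5 + 0.25   # column offset, then cell centre
    y = -90 + (index // N_COLS) * 0.5 + 0.25   # row offset, then cell centre
    return x, y
```

Under these formulas gid 1 maps to (-179.75, -89.75) and gid 259200 (the last of the 720 × 360 cells) to (179.75, 89.75).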
⚠ The schema contract — exactly these 9 names, nothing else renamed
The 9 columns are hard-validated at three independent points: this repo's unfao.py:141-153 (selection) and unfao.py:189-197 (validation), and views-faoapi handlers.py:1146-1156 (FAO_PGMDataset._METADATA_COLS). Reproducing them exactly is what makes views-faoapi require zero changes — that is the umbrella's acceptance criterion. The broader FAO-contract-naming question (register C-24) is explicitly out of scope.
Spec
1. scripts/build_gaul_lookup.py
- Joins the 7 parquets by gid; applies the rename map above; computes pg_xcoord/pg_ycoord.
- Keeps only fully-complete rows (drops gid where code == -1 or name/iso == ""). Rationale: an unmatched gid at enrichment time must surface as nulls (left-merge produces NaN → _validate() crashes loudly). Never carry -1/"" into the lookup: -1 is non-null and would sail through validation into FAO data (register C-30 narrative).
- Embeds provenance in the parquet metadata: source file digests (from the datafactory's provenance/gaul_admin/ingestion_ledger.jsonl), generation date, row count.
- Output: views_postprocessing/data/gaul_lookup.parquet (global, all complete cells; ~95k rows, a few MB; committed to git, NOT LFS).
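The join, rename, and filter steps could look roughly like the sketch below. It is an illustration under assumptions, not the script itself: `build_lookup` and its `frames` argument (source file stem mapped to an already-loaded `(gid, value)` frame) are hypothetical, the extra null check goes beyond the stated -1/"" rule, and reading the parquets and writing provenance metadata are left out.

```python
from functools import reduce

import pandas as pd

# Source file stem -> lookup column, following the rename table above.
RENAME_MAP = {
    "gaul0_code": "admin1_gaul0_code",
    "gaul1_code": "admin1_gaul1_code",
    "gaul2_code": "admin2_gaul2_code",
    "gaul0_name": "admin1_gaul0_name",
    "gaul1_name": "admin1_gaul1_name",
    "gaul2_name": "admin2_gaul2_name",
    "iso3_code": "country_iso_a3",
}
CODE_COLS = ["admin1_gaul0_code", "admin1_gaul1_code", "admin2_gaul2_code"]
TEXT_COLS = ["admin1_gaul0_name", "admin1_gaul1_name", "admin2_gaul2_name",
             "country_iso_a3"]


def build_lookup(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Join the seven (gid, value) frames and keep only fully-complete rows."""
    renamed = [frames[stem].rename(columns={"value": col})
               for stem, col in RENAME_MAP.items()]
    # Inner-join on gid; validate= raises if any source repeats a gid.
    joined = reduce(
        lambda left, right: left.merge(right, on="gid", validate="one_to_one"),
        renamed,
    )
    incomplete = (
        (joined[CODE_COLS] == -1).any(axis=1)      # unassigned numeric code
        | (joined[TEXT_COLS] == "").any(axis=1)    # unassigned name / ISO code
        | joined.isna().any(axis=1)                # assumption: also drop nulls
    )
    lookup = joined.loc[~incomplete].copy()
    cell = lookup["gid"] - 1
    lookup["pg_xcoord"] = -180 + (cell % 720) * 0.5 + 0.25
    lookup["pg_ycoord"] = -90 + (cell // 720) * 0.5 + 0.25
    return lookup.reset_index(drop=True)
```

Dropping incomplete rows here, rather than carrying sentinels forward, is what lets the later left merge turn every unassigned gid into nulls.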
2. views_postprocessing/unfao/enrichment.py
- Small class/function: load lookup once, enrich(df, pg_id_col) -> df via left merge on gid.
- Categorical dtype for the 4 string columns (country_iso_a3, three *_name), per register C-32: at global scale the metadata broadcasts to ~28M rows; object-dtype strings cost 8–20 GB, categoricals ~10× less. Build this in from day one.
- Dtype parity with current mapper output: numeric codes numeric, coords float; the existing boundary contract tests (tests/test_integration.py:79-107) define the expectations.
- No spatial logic, no shapefiles, no geopandas import.
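One possible shape for the enrichment module is sketched below. The class name `GaulEnricher`, the default `pg_id_col`, and passing the lookup in as a frame (instead of reading the parquet) are assumptions made for illustration; the spec only fixes the `enrich(df, pg_id_col) -> df` signature and the left merge on gid.

```python
import pandas as pd

STRING_COLS = ["country_iso_a3", "admin1_gaul0_name",
               "admin1_gaul1_name", "admin2_gaul2_name"]


class GaulEnricher:
    """Holds the lookup in memory and enriches frames by left merge."""

    def __init__(self, lookup: pd.DataFrame) -> None:
        # Cast the four string columns to categorical once, at load time.
        self._lookup = lookup.astype({col: "category" for col in STRING_COLS})

    def enrich(self, df: pd.DataFrame,
               pg_id_col: str = "priogrid_gid") -> pd.DataFrame:
        # Rename the lookup key so the result has a single id column.
        right = self._lookup.rename(columns={"gid": pg_id_col})
        # Left merge: gids missing from the lookup get null metadata.
        # validate= raises if the lookup ever contains a duplicated gid.
        return df.merge(right, on=pg_id_col, how="left",
                        validate="many_to_one")
```

Because the merge is a left join, a gid that is absent from the lookup keeps its row and receives nulls in all metadata columns; integer code columns become float in that case, since NaN has no integer representation.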
3. Tests (extend the existing suites' style; synthetic fixtures stay untouched)
- Coverage contract (register C-34): every gid in the datafactory's land_gaul set enriches 100% complete (64,736); every africa_me_legacy gid except the 5 known ocean cells enriches complete (13,105).
- Schema contract: output contains exactly the 9 columns with correct dtypes.
- Fail-loud pinned: an unknown gid (e.g., 999999) and a known-unassigned gid (e.g., 62356) both yield null metadata after enrichment, which is the behavior _validate() depends on.
- Lookup integrity: row count pinned; no nulls, no -1, no empty strings inside the lookup itself.
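The lookup-integrity check could be expressed as a small helper that a test asserts on. This is a sketch under assumptions: `lookup_integrity_problems` is a hypothetical name, the column groups are inferred from the nine-column schema, and the pinned row count is passed in because the exact value is only known once the lookup is built.

```python
import pandas as pd


def lookup_integrity_problems(lookup: pd.DataFrame,
                              expected_rows: int) -> list[str]:
    """Return human-readable integrity violations; empty list means clean."""
    problems = []
    if len(lookup) != expected_rows:
        problems.append(f"row count {len(lookup)} != pinned {expected_rows}")
    if lookup.isna().any().any():
        problems.append("null values present")
    # Numeric admin codes use -1 as the 'unassigned' sentinel in the sources.
    code_cols = [c for c in lookup.columns if c.endswith("_code")]
    if (lookup[code_cols] == -1).any().any():
        problems.append("-1 sentinel present in a code column")
    # Names and the ISO code use "" as the 'unassigned' sentinel.
    text_cols = [c for c in lookup.columns
                 if c.endswith("_name") or c == "country_iso_a3"]
    if (lookup[text_cols].astype(str) == "").any().any():
        problems.append("empty string present in a text column")
    return problems
```

A test in the existing suites' style would then assert that the helper returns an empty list for the committed parquet, so a failure message names every violated rule at once.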
Definition of done
- gaul_lookup.parquet committed with embedded provenance
- enrichment.py with categorical dtypes, no geo dependencies
- git diff --stat against the base branch shows zero modified files, only additions