Stage 1 — Build GAUL lookup parquet + enrichment module + tests

# Stage 1 — Build the GAUL lookup parquet + enrichment module + tests

Part of the FAO global delivery plan — umbrella: #20. **Purely additive — zero existing files modified.** Can start immediately, in parallel with Stage 0 (#LINK_C) and the datafactory region work (views-platform/views-datafactory#159).

## Why this exists

This is the ADR-011 implementation artifact (`docs/ADRs/011_replace_runtime_mapper_with_precomputed_lookup.md`, assessment §10). The runtime mapper recomputes the same gid→geography answers on every run from 774 MB of shapefiles; the lookup precomputes them once into a small parquet, and runtime enrichment becomes a merge. Decisive property: it serves **both** input paths identically — historical (from the datafactory) and forecast (from Appwrite, which never touches the datafactory) — because both key on `priogrid_gid`.

The data prerequisites are met: views-datafactory regenerated all 7 GAUL parquets on 2026-06-11 as area-majority (the FAO-contracted rule, Release Note 02), 259,200 rows each, mutually consistent. Verified counts: 64,736 of 64,818 `land` cells fully complete; for africa_me_legacy, 13,105 of 13,110 (5 ocean cells unassigned).

## Source data

From a local views-datafactory checkout, `data/raw/gaul_admin/` (all regenerated 2026-06-11 as area-majority; schema `(gid: int32, value)`):

| File | Value type | Becomes column |
|---|---|---|
| `gaul0_code.parquet` | int32 | `admin1_gaul0_code` |
| `gaul1_code.parquet` | int32 | `admin1_gaul1_code` |
| `gaul2_code.parquet` | int32 | `admin2_gaul2_code` |
| `gaul0_name.parquet` | string | `admin1_gaul0_name` |
| `gaul1_name.parquet` | string | `admin1_gaul1_name` |
| `gaul2_name.parquet` | string | `admin2_gaul2_name` |
| `iso3_code.parquet` | string | `country_iso_a3` |

Plus two computed columns, the cell-centre coordinates (PRIO-GRID is a fixed 0.5° global grid of 720 columns × 360 rows, with `gid` numbered from 1):

```
pg_xcoord = -180 + ((gid - 1) %  720) * 0.5 + 0.25
pg_ycoord =  -90 + ((gid - 1) // 720) * 0.5 + 0.25
```

(The same formula views-pipeline-core already uses for row/col, `dataloaders.py:1201-1206`.)
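As a quick sanity check, the formula can be written as a vectorised helper. This is a minimal sketch: the function name `pg_centroid` and the constants are illustrative, not taken from views-pipeline-core.

```python
import numpy as np

N_COLS = 720   # PRIO-GRID columns at 0.5 degree resolution
CELL = 0.5     # cell size in degrees


def pg_centroid(gid):
    """Return (pg_xcoord, pg_ycoord) cell-centre arrays for 1-based gids."""
    idx = np.asarray(gid, dtype=np.int64) - 1
    x = -180 + (idx % N_COLS) * CELL + CELL / 2
    y = -90 + (idx // N_COLS) * CELL + CELL / 2
    return x, y


# First cell, end of first row, start of second row, last cell of the grid.
x, y = pg_centroid([1, 720, 721, 259200])
print(list(zip(x.tolist(), y.tolist())))
```

The corner cases are worth pinning in a test, since an off-by-one in `gid - 1` shifts every coordinate by one cell without raising anything.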

## ⚠ The schema contract — exactly these 9 names, nothing else renamed

The 9 columns are hard-validated at three independent points: this repo's `unfao.py:141-153` (selection) and `unfao.py:189-197` (validation), and views-faoapi `handlers.py:1146-1156` (`FAO_PGMDataset._METADATA_COLS`). Reproducing them exactly is what makes views-faoapi require **zero changes** — that is the umbrella's acceptance criterion. The broader FAO-contract-naming question (register C-24) is explicitly out of scope.
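For orientation, the contract can be expressed as a single constant plus a check. This is a hedged sketch only: `METADATA_COLS` and `check_metadata_columns` are hypothetical names, the column order is an assumption, and the real checks in `unfao.py` and `FAO_PGMDataset._METADATA_COLS` remain authoritative.

```python
import pandas as pd

# The nine contract names: seven renamed source columns plus two computed
# coordinates. The order here is illustrative, not copied from either repo.
METADATA_COLS = (
    "country_iso_a3",
    "admin1_gaul0_code", "admin1_gaul0_name",
    "admin1_gaul1_code", "admin1_gaul1_name",
    "admin2_gaul2_code", "admin2_gaul2_name",
    "pg_xcoord", "pg_ycoord",
)


def check_metadata_columns(df: pd.DataFrame) -> None:
    """Raise if any contract column is absent or contains nulls."""
    missing = [c for c in METADATA_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    with_nulls = [c for c in METADATA_COLS if df[c].isna().any()]
    if with_nulls:
        raise ValueError(f"null metadata in columns: {with_nulls}")
```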

## Spec

### 1. `scripts/build_gaul_lookup.py`
- Joins the 7 parquets by gid; applies the rename map above; computes pg_xcoord/pg_ycoord.
- **Keeps only fully-complete rows**: drops any gid where any code is `-1` or any name/ISO value is `""`. Rationale: an unmatched gid at enrichment time must surface as **nulls**, because the left merge then produces NaN and `_validate()` crashes loudly. Never carry `-1`/`""` into the lookup: `-1` is non-null and would sail through validation into FAO data (register C-30 narrative).
- Embeds provenance in the parquet metadata: source file digests (from the datafactory's `provenance/gaul_admin/ingestion_ledger.jsonl`), generation date, row count.
- Output: `views_postprocessing/data/gaul_lookup.parquet` (global, all complete cells — ~95k rows, a few MB; committed to git, NOT LFS).

### 2. `views_postprocessing/unfao/enrichment.py`
- A small class or function that loads the lookup once and exposes `enrich(df, pg_id_col) -> df`, implemented as a left merge of `df[pg_id_col]` against the lookup's `gid`.
- **Categorical dtype for the 4 string columns** (`country_iso_a3`, three `*_name`) — register C-32: at global scale the metadata broadcasts to ~28M rows; object-dtype strings cost 8–20 GB, categoricals ~10× less. Build this in from day one.
- Dtype parity with current mapper output: numeric codes numeric, coords float — the existing boundary contract tests (`tests/test_integration.py:79-107`) define the expectations.
- No spatial logic, no shapefiles, no geopandas import.
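One possible shape for the module, as a minimal sketch rather than the final design. The class name `GaulEnricher`, the `from_parquet` constructor, and the default `pg_id_col` are assumptions, and the sketch assumes the gid is a regular column of `df` (not an index level).

```python
import pandas as pd

STRING_COLS = [
    "country_iso_a3",
    "admin1_gaul0_name",
    "admin1_gaul1_name",
    "admin2_gaul2_name",
]


class GaulEnricher:
    """Holds the lookup in memory and enriches frames by left merge."""

    def __init__(self, lookup: pd.DataFrame):
        lookup = lookup.copy()
        # Categoricals keep the broadcast string columns small (register C-32).
        for col in STRING_COLS:
            lookup[col] = lookup[col].astype("category")
        self._lookup = lookup

    @classmethod
    def from_parquet(cls, path):
        return cls(pd.read_parquet(path))

    def enrich(self, df: pd.DataFrame, pg_id_col: str = "priogrid_gid") -> pd.DataFrame:
        # Rename the lookup key so the merge yields a single key column.
        right = self._lookup.rename(columns={"gid": pg_id_col})
        # Left merge: unmatched gids keep their row and get null metadata.
        return df.merge(right, on=pg_id_col, how="left", validate="many_to_one")
```

Two details to keep in mind when comparing against the boundary contract tests: a merge on a column returns a fresh integer index, and integer code columns are upcast to float in any result that contains unmatched gids, because NaN has no integer representation.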

### 3. Tests (extend the existing suites' style; synthetic fixtures stay untouched)
- **Coverage contract** (register C-34): every gid in the datafactory's `land_gaul` set enriches 100% complete (64,736); every africa_me_legacy gid except the 5 known ocean cells enriches complete (13,105).
- **Schema contract:** output contains exactly the 9 columns with correct dtypes.
- **Fail-loud pinned:** an unknown gid (e.g., 999999) and a known-unassigned gid (e.g., 62356) both yield null metadata after enrichment — the behavior `_validate()` depends on.
- **Lookup integrity:** row count pinned; no nulls, no -1, no empty strings inside the lookup itself.
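The fail-loud and integrity checks can be written as small reusable helpers that the real test functions call with the committed lookup, the pinned row count, and the gids named above. This is a sketch: the helper names are hypothetical, and `enrich` stands for whatever callable the enrichment module ends up exposing.

```python
import pandas as pd

CODE_COLS = ["admin1_gaul0_code", "admin1_gaul1_code", "admin2_gaul2_code"]
TEXT_COLS = [
    "country_iso_a3",
    "admin1_gaul0_name",
    "admin1_gaul1_name",
    "admin2_gaul2_name",
]


def check_lookup_integrity(lookup: pd.DataFrame, expected_rows: int) -> None:
    """The lookup itself must contain no nulls and no sentinel values."""
    assert len(lookup) == expected_rows, "row count drifted from the pinned value"
    assert lookup["gid"].is_unique, "duplicate gid in lookup"
    assert not lookup.isna().any().any(), "nulls inside the lookup"
    assert not (lookup[CODE_COLS] == -1).any().any(), "-1 inside the lookup"
    assert not (lookup[TEXT_COLS] == "").any().any(), "empty string inside the lookup"


def check_fails_loud(enrich, gids, pg_id_col="priogrid_gid") -> None:
    """Gids absent from the lookup must come back with all-null metadata."""
    out = enrich(pd.DataFrame({pg_id_col: gids}), pg_id_col)
    metadata = out.drop(columns=pg_id_col)
    assert metadata.shape[1] > 0, "enrichment added no columns"
    assert metadata.isna().all().all(), "unmatched gid picked up non-null metadata"
```

In the suite, `check_fails_loud(enrich, [999999, 62356])` would pin the behavior `_validate()` relies on for both the unknown and the known-unassigned case.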

## Definition of done

- [ ] `gaul_lookup.parquet` committed with embedded provenance
- [ ] `enrichment.py` with categorical dtypes, no geo dependencies
- [ ] All new tests green; full existing suite still green
- [ ] `git diff --stat` against the base branch shows **zero modified files** — only additions
- [ ] Umbrella #20 updated

