[UMBRELLA] FAO global delivery: lookup-based enrichment (ADR-011) + global coverage #20

@Polichinel

Why this exists

The UN FAO postprocessing pipeline must deliver global historical conflict data to the FAO Appwrite prediction store, followed shortly by two or three additional Appwrite-based prediction stores. Today the pipeline covers only africa_me_legacy (13,110 PRIO-GRID cells) and enriches predictions with geographic metadata via a 3,171-line runtime spatial mapper (mapping.py) that loads 774 MB of Git-LFS shapefiles at import time.

A full cross-repo investigation (2026-06-12, see docs/cross_repo_integration_report.md in this repo) established that scaling the runtime mapper to global coverage is the higher-risk path. Its behavior at 64,818 cells has never been observed, and it cannot be tested outside the production machine (shapefiles are LFS stubs elsewhere). Its failure modes (unknown runtime, unknown memory, unknown Natural-Earth coverage gaps) would surface at runtime, hours into a delivery run. By contrast, the precomputed lookup table proposed by ADR-011 can be verified completely in advance: views-datafactory shipped area-majority GAUL assignments (its issue #115 → PR #127, v1.2.28/29), and all 7 source parquets were regenerated 2026-06-11 with 259,200 rows each. We verified directly that 64,736 of the 64,818 land cells have complete metadata; the remaining 82 cells (sub-Antarctic islands not covered by FAO GAUL 2024) are unassigned.
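The coverage figures above (64,818 land cells, 64,736 complete, 82 unassigned) come down to a null check on the lookup table. The sketch below shows one way such a check could look. It is not the verification that was actually run: the key column `gid`, the function `coverage_report`, and the toy data are all assumptions made for illustration. Only the nine metadata column names come from this issue.

```python
import pandas as pd

# The nine metadata columns named in the schema contract of this issue.
METADATA_COLUMNS = [
    "pg_xcoord", "pg_ycoord", "country_iso_a3",
    "admin1_gaul1_code", "admin1_gaul1_name",
    "admin1_gaul0_code", "admin1_gaul0_name",
    "admin2_gaul2_code", "admin2_gaul2_name",
]


def coverage_report(lookup: pd.DataFrame, land_gids: set) -> dict:
    """Count land cells whose nine metadata columns are all non-null.

    `gid` is a hypothetical key column name, not confirmed by the source.
    """
    land = lookup[lookup["gid"].isin(land_gids)]
    complete = land[METADATA_COLUMNS].notna().all(axis=1)
    covered = set(land.loc[complete, "gid"])
    return {
        "land_cells": len(land_gids),
        "complete": len(covered),
        "unassigned": sorted(set(land_gids) - covered),
    }


# Toy lookup: four cells, three of them land, one land cell without GAUL data.
toy = pd.DataFrame({"gid": [1, 2, 3, 4]})
for col in METADATA_COLUMNS:
    toy[col] = ["a", "b", "c", "d"]
toy.loc[toy["gid"] == 3, "admin1_gaul1_code"] = None

report = coverage_report(toy, land_gids={1, 2, 3})
print(report)
```

On the real lookup, the expectation stated in this issue is `land_cells == 64818`, `complete == 64736`, and 82 entries in `unassigned`.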

This umbrella tracks the agreed plan ("Plan C"): test the unchanged pipeline on africa_me first, use that run's output as the verification baseline, swap the enrichment engine, then go global.

The plan (Plan C)

| Stage | Issue | What happens | Behavioral change? | Definition of done |
|---|---|---|---|---|
| 0. Baseline | (this repo, see child issues) | Run the pipeline unchanged on africa_me_legacy on the production machine; archive output as ground truth | None | Green run; baseline parquet + schema archived |
| 1. Build | (this repo) | Lookup parquet from datafactory's 7 area-majority parquets + merge-by-gid enricher + tests | None (purely additive) | pytest green incl. coverage tests |
| (parallel) | views-datafactory | New bundled curated region land_gaul (land ∩ GAUL coverage, 64,736 cells) | None (purely additive) | Test pins count; version released |
| 2. Diff | (this repo) | Shadow comparison: Stage-0 baseline vs enricher output, same 13,110 cells, all 9 columns | None | Diff report, zero unexplained differences |
| 3. Swap | (this repo) | _append_metadata uses the enricher (one method body) | The only behavioral change in the plan | africa_me green; schema identical to baseline |
| 4. Go global | views-models | REGION = "land_gaul" (one config line); dry run; deliver; FAO release note | Coverage change | faoapi serves global historical; FAO notified |
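Stages 1 and 2 could be sketched roughly as follows: a merge-by-gid enricher and a shadow diff against the Stage-0 baseline. This is an illustration under stated assumptions, not the implementation in this repo. The names `enrich`, `shadow_diff`, and the key columns `gid` and `month_id` are hypothetical; only the nine metadata column names are taken from this issue.

```python
import pandas as pd

# The nine metadata columns named in the schema contract of this issue.
METADATA_COLUMNS = [
    "pg_xcoord", "pg_ycoord", "country_iso_a3",
    "admin1_gaul1_code", "admin1_gaul1_name",
    "admin1_gaul0_code", "admin1_gaul0_name",
    "admin2_gaul2_code", "admin2_gaul2_name",
]


def enrich(predictions: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Left-join metadata onto predictions by grid-cell id (hypothetical `gid`)."""
    return predictions.merge(
        lookup[["gid", *METADATA_COLUMNS]],
        on="gid",
        how="left",
        validate="many_to_one",  # raises if the lookup has duplicate gids
    )


def shadow_diff(baseline, candidate, keys=("gid", "month_id")):
    """Return a boolean frame of rows where any metadata column differs.

    Two nulls in the same position count as equal.
    """
    b = baseline.set_index(list(keys))[METADATA_COLUMNS].sort_index()
    c = candidate.set_index(list(keys))[METADATA_COLUMNS].sort_index()
    if not b.index.equals(c.index):
        raise ValueError("baseline and candidate cover different rows")
    differs = ~((b == c) | (b.isna() & c.isna()))
    return differs[differs.any(axis=1)]


# Toy data: two cells in the lookup, three prediction rows.
lookup = pd.DataFrame({"gid": [1, 2]})
for col in METADATA_COLUMNS:
    lookup[col] = [f"{col}_1", f"{col}_2"]
predictions = pd.DataFrame(
    {"gid": [1, 1, 2], "month_id": [500, 501, 500], "pred": [0.1, 0.2, 0.3]}
)

candidate = enrich(predictions, lookup)
baseline = candidate.copy()
baseline.loc[2, "admin2_gaul2_name"] = "old mapper name"  # simulated mismatch

report = shadow_diff(baseline, candidate)
print(report.index.tolist())
```

An empty `report` would correspond to the Stage-2 definition of done. Any non-empty report lists the rows that need an explanation before the swap.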

Child issues:

Dependency graph

views-datafactory A (region, additive) ──→ pip update datafactory_query on prod machine ──→ G (config flip, LAST)
C (baseline run) ──┐
                   ├──→ E (shadow diff) ──→ F (swap) ──→ G
D (lookup build) ──┘

A, C, and D are fully independent and can run in parallel starting immediately. The only ordering constraints: the diff (E) needs both the baseline (C) and the enricher (D); the swap (F) is gated on a clean diff; the flip (G) goes last and requires both the datafactory release installed on the production machine and the swap verified.
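The ordering constraints can be checked mechanically with the standard library's `graphlib`. In this sketch the step letters follow the graph above, and `prod_install` is a made-up label for the "pip update datafactory_query on prod machine" step.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
dependencies = {
    "A": set(),                 # views-datafactory region (additive)
    "C": set(),                 # baseline run
    "D": set(),                 # lookup build
    "prod_install": {"A"},      # hypothetical label for the pip update on prod
    "E": {"C", "D"},            # shadow diff
    "F": {"E"},                 # swap
    "G": {"prod_install", "F"}, # config flip, last
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect "waves": steps within one wave can run in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The first wave contains A, C, and D, and G is alone in the last wave, which matches the constraints described above.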

Splash zone

| Repo | Change | Risk profile |
|---|---|---|
| views-datafactory | New bundled region (json + regions.py entry + script + test). Purely additive. | Low: land and africa_me_legacy untouched; cannot break existing consumers |
| views-postprocessing | Lookup + enricher + tests (additive), then one method body in unfao.py:_append_metadata | Contained: one commit, instantly revertible, gated by the Stage-2 diff |
| views-models | One config line (config_queryset.py:20) | Trivial code-wise; it IS the go-global act, so it goes last |
| views-pipeline-core | FROZEN: zero changes. The dataloader passes the region string through untouched (dataloaders.py:1182). | Its known issues (postprocessing register C-26 fillna, C-27 swallowed exceptions, C-28 zarr timeout, C-29 cache side-channel) are real and deliberately deferred: pipeline-core has the widest blast radius in the platform (every model depends on it), and none of these is on the delivery critical path. |
| views-faoapi | FROZEN: zero changes. | This is the acceptance criterion, not an accident: the lookup must reproduce the 9 metadata columns exactly (they are hard-validated at handlers.py:1146-1163), making the entire engine swap invisible downstream. |
| Appwrite | No schema/bucket/collection changes; new files through the existing path only | |

The schema contract (do not touch)

The 9 metadata columns are independently hard-coded at three enforcement points — postprocessor selection (unfao.py:141-153), postprocessor validation (unfao.py:189-197), and faoapi dataset validation (handlers.py:1146-1156):

pg_xcoord, pg_ycoord, country_iso_a3,
admin1_gaul1_code, admin1_gaul1_name, admin1_gaul0_code, admin1_gaul0_name,
admin2_gaul2_code, admin2_gaul2_name

Any rename anywhere breaks views-faoapi with HTTP 500 on the first request. No renaming is in scope for this delivery (the separate FAO-contract-names question, register C-24/D-06, is explicitly deferred).
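Because the column list is hard-coded in three places, a small guard in the test suite could catch an accidental rename before it reaches views-faoapi. The helper below is a hypothetical sketch, not code from unfao.py or handlers.py. It only checks that the nine names are present.

```python
# The nine contract columns, copied from the list above.
REQUIRED_METADATA_COLUMNS = (
    "pg_xcoord", "pg_ycoord", "country_iso_a3",
    "admin1_gaul1_code", "admin1_gaul1_name",
    "admin1_gaul0_code", "admin1_gaul0_name",
    "admin2_gaul2_code", "admin2_gaul2_name",
)


def missing_metadata_columns(columns):
    """Return the contract columns absent from `columns`, in contract order."""
    present = set(columns)
    return [name for name in REQUIRED_METADATA_COLUMNS if name not in present]


# A frame that still honors the contract (extra columns are allowed here).
ok_columns = ["gid", "month_id", *REQUIRED_METADATA_COLUMNS]

# A frame where one column was renamed (the new name is invented for the demo).
renamed = [
    "iso3" if name == "country_iso_a3" else name for name in ok_columns
]

print(missing_metadata_columns(ok_columns))
print(missing_metadata_columns(renamed))
```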

Risk register cross-references (this repo, reports/technical_risk_register.md)

  • C-30 (Tier 1): the 82 GAUL-uncovered land cells — resolved upstream via the datafactory region (D-10)
  • C-31 (Tier 2): runtime mapper unverified at global scale — the reason for the lookup-first adjudication (D-08)
  • C-32 (Tier 2): memory at ~28M rows — categorical dtypes in the enricher + mandatory dry run before delivery
  • C-33 (Tier 2): hardcoded store identity — deferred to the multi-store phase (D-09)
  • C-34 (Tier 2): coverage has no contract — addressed by the coverage tests in Stage 1
  • D-08: global via mapper vs lookup — adjudicated lookup-first
  • D-09: multi-store now vs after — adjudicated after; only exception: delete dead config blocks during the swap
  • D-10: the 82 cells — resolved: upstream curated region, postprocessor keeps zero spatial knowledge
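The C-32 mitigation (categorical dtypes) relies on the fact that repeated strings such as admin names can be stored once and referenced by small integer codes. A minimal sketch of the effect on a toy frame follows. The helper name `to_categorical` and the data are illustrative assumptions, and actual savings at ~28M rows would need to be measured in the dry run.

```python
import pandas as pd


def to_categorical(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy with the given columns converted to the category dtype."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Toy data: 100,000 rows drawn from only 50 distinct region names.
names = [f"Region {i % 50}" for i in range(100_000)]
frame = pd.DataFrame({"admin1_gaul1_name": names})

compact = to_categorical(frame, ["admin1_gaul1_name"])

before = frame.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The values themselves are unchanged by the conversion, so the schema contract on column names is unaffected. Whether downstream consumers accept the categorical dtype in the delivered parquet is a separate question this sketch does not address.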

Explicitly out of scope (with their "when")

  1. views-pipeline-core fixes (C-26 fillna data fabrication, C-27 swallowed constructor exceptions, C-28 zarr timeout, C-29 disk side-channel): after delivery; C-26 first — it is Tier 1.
  2. Multi-store DeliveryProfile (C-33): the calm week after FAO global ships. One manager class, N store configs; never copy-paste the manager.
  3. Deleting mapping.py: after one verified production cycle on the lookup.
  4. Column renaming to FAO contract names (C-24): separate coordinated 2-repo change + FAO sign-off, never bundled with this delivery.

Definition of green for the whole umbrella

  1. faoapi /data/historical/latest serves global (64,736-cell) historical data with the baseline column list.
  2. Every stage's DoD met and checked off in its child issue.
  3. FAO has received one combined release note: global coverage, area-majority attribution changes, the 82 excluded island cells, and the historical-global vs forecast-regional coverage asymmetry.
