3 changes: 3 additions & 0 deletions .gitignore
@@ -216,3 +216,6 @@ cython_debug/
# generated artifacts
graphify-out/
.claude/

# ADR-011 precomputed GAUL lookup (small, must be tracked despite *.parquet rule)
!views_postprocessing/data/gaul_lookup.parquet
50 changes: 48 additions & 2 deletions docs/ADR-011_implementation_assessment.md
@@ -1,7 +1,7 @@
# ADR-011 Implementation Assessment

**Last updated:** 2026-06-03
**Status:** Paused — implementation deferred pending infrastructure and verification
**Last updated:** 2026-06-12
**Status:** Data prerequisites MET — see §10. Implementation unblocked on the data side; verification infrastructure (Appwrite, pipeline-core E2E) still required before switching the pipeline.

---

@@ -496,3 +496,49 @@ The alternative is **Option 4 (do nothing)** — which is also legitimate. The c
1. The column naming question is settled — keep current names
2. The architecture decision (precomputed table) is confirmed — area-majority, sequential allocation, Parquet lookup
3. What blocks progress is infrastructure: LFS for precomputation, views-pipeline-core for E2E testing, Appwrite for verification

---

## 10. June 2026 Status Update — Datafactory Prerequisites MET

**Added:** 2026-06-12. Full detail in `docs/cross_repo_integration_report.md`.

### 10.1 What changed upstream

The views-datafactory completed the area-majority work (issue #115 → PR #127, ADR-039 accepted, shipped v1.2.28/29):

| Artifact | State (verified 2026-06-12) |
|---|---|
| `gaul0/1/2_code.parquet` | 259,200 rows each, int32, **area-majority** (regenerated Jun 11) |
| `gaul0/1/2_name.parquet` | 259,200 rows each, string, **area-majority** (regenerated Jun 11 — the earlier 86,091-row centroid gap is closed) |
| `iso3_code.parquet` | 259,200 rows, string, **area-majority** (Jun 11) |
| Assembled grid/zarr | gaul codes at channels 72-74, area-majority, fill = -1 (assembled Jun 8) |

Consistency for africa_me_legacy (13,110 cells): 13,105 fully consistent (code + name + iso3); 5 pure-ocean cells unassigned (gids 62356, 94776, 99027, 107733, 107742).

### 10.2 What this resolves

- **The "regeneration requires LFS + 774 MB shapefiles" blocker is gone.** The lookup table no longer needs this repo's mapper or shapefiles to be built — it is a join of the 7 factory parquets plus a coordinate formula (`xcoord = -180 + ((gid-1) % 720) * 0.5 + 0.25`, `ycoord = -90 + ((gid-1) // 720) * 0.5 + 0.25`).
- **The area-majority requirement is satisfied at the source.** The factory's single L2-based join yields all three admin levels from the same winning polygon; empirically 0 cells diverge from a direct L0 computation (sequential-allocation invariant preserved).
- **D-05 (platform mapping divergence) is resolved upstream** — the datafactory now ships area-majority, the same algorithm family FAO's contract requires.
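
The coordinate formula quoted above can be sketched as a small helper. This is a minimal illustration, not the build script: the function name `cell_centroid` and the range check are assumptions made for the example; the 720 × 360 grid size follows from the 259,200 rows reported in §10.1.

```python
def cell_centroid(gid: int) -> tuple[float, float]:
    """Return (xcoord, ycoord) of a PRIO-GRID cell centre from its 1-based gid.

    Hypothetical helper: only the arithmetic comes from the formula in the text.
    """
    # Assumption: valid gids run from 1 to 720 * 360 = 259,200.
    if not 1 <= gid <= 720 * 360:
        raise ValueError(f"gid out of range: {gid}")
    col = (gid - 1) % 720   # position along a row, west to east
    row = (gid - 1) // 720  # row number, south to north
    xcoord = -180 + col * 0.5 + 0.25
    ycoord = -90 + row * 0.5 + 0.25
    return xcoord, ycoord


print(cell_centroid(1))       # south-west corner cell
print(cell_centroid(259200))  # north-east corner cell
```

Because the cells are 0.5° wide, the `+ 0.25` term shifts each value from the cell's lower-left corner to its centre.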

### 10.3 Revised implementation path (supersedes §6 options)

Build the lookup FROM the datafactory parquets (Option 1 in the integration report):

1. Offline script joins the 7 parquets by gid, computes pg_xcoord/pg_ycoord, renames to the 9 contract columns (`gaul0_code → admin1_gaul0_code`, `iso3_code → country_iso_a3`, …).
2. Commit the resulting parquet (~1-2 MB africa_me / ~15 MB global) to this repo.
3. New enricher = merge-by-gid; replaces `mapper.enrich_dataframe_with_pg_info()` in `_append_metadata()`.
4. Serves BOTH historical and forecast paths (decisive — forecast data bypasses the factory entirely).
5. **Fail-loud preservation:** unassigned gids (factory code = -1) must surface as nulls so `_validate()` still crashes the pipeline rather than shipping wrong attributions. Do not pass -1/"" through as values.
6. **Acceptance criterion:** output schema (column names, dtypes, index) identical to current mapper output. faoapi then requires zero changes.
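
Steps 1 and 5 above can be sketched as follows. This is an illustration under stated assumptions, not the actual build script: `build_lookup` is a hypothetical name, the inputs are in-memory series rather than the factory parquets, and only the two renames spelled out in step 1 are included because the remaining mappings are elided in the text.

```python
import pandas as pd

# Only these two renames are stated in the text; the others are elided there.
RENAMES = {"gaul0_code": "admin1_gaul0_code", "iso3_code": "country_iso_a3"}


def build_lookup(columns: dict[str, pd.Series]) -> pd.DataFrame:
    """Join per-column series on gid, drop incomplete cells, add centroids."""
    # Inner join on the shared gid index; dict keys become column names.
    table = pd.concat(columns, axis=1, join="inner")
    table.index.name = "priogrid_gid"

    # Drop unassigned cells (code -1, empty string, or null) so that they
    # surface as nulls at merge time instead of passing through as values.
    codes = table.select_dtypes(include="number")
    texts = table.select_dtypes(exclude="number")
    incomplete = (
        table.isna().any(axis=1)
        | codes.eq(-1).any(axis=1)
        | texts.eq("").any(axis=1)
    )
    table = table.loc[~incomplete].copy()

    # Cell-centre coordinates from the formula in section 10.2.
    gid0 = table.index.to_numpy() - 1
    table["pg_xcoord"] = -180 + (gid0 % 720) * 0.5 + 0.25
    table["pg_ycoord"] = -90 + (gid0 // 720) * 0.5 + 0.25
    return table.rename(columns=RENAMES)


# Made-up values for illustration only; gid 2 plays the unassigned cell.
example = {
    "gaul0_code": pd.Series([10, -1, 30], index=[1, 2, 721]),
    "iso3_code": pd.Series(["AAA", "", "CCC"], index=[1, 2, 721]),
}
print(build_lookup(example))
```

Dropping incomplete rows at build time, rather than carrying `-1` forward, is what lets a plain left merge produce the nulls that step 5 requires.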

### 10.4 What still blocks the switch (unchanged)

- Verification infrastructure: Appwrite access + views-pipeline-core for an E2E run and an old-vs-new diff on real data.
- FAO communication: attribution values change for ~711 border cells (5.4%) and 149 recovered coastal cells — contractually correct (area-majority per Release Note 02) but a visible data-version change.
- Open: deployed-zarr sync status at the remote server; whether the 5 ocean cells ever appear in prediction inputs.

### 10.5 Semantic change to flag

`country_iso_a3` switches source from Natural Earth to GAUL. The current pipeline mixes boundary datasets (NE country + GAUL admin); the lookup makes each row internally consistent from a single GAUL polygon. This is an improvement, but it changes what the ISO column means at disputed borders.
145 changes: 145 additions & 0 deletions docs/CICs/GaulLookupEnricher.md
@@ -0,0 +1,145 @@
# Class Intent Contract: GaulLookupEnricher

**Status:** Draft
**Owner:** PRIO MD&D Team
**Last reviewed:** 2026-06-18
**Related ADRs:** ADR-011 (replace runtime mapper with precomputed lookup)

---

## 1. Purpose

> Attach the 9 geographic metadata columns to a prediction frame by merging a
> precomputed GAUL lookup table on the PRIO-GRID cell id.

It is the lookup-based replacement for `PriogridCountryMapper`'s runtime spatial
enrichment: the spatial computation has already happened upstream (the
views-datafactory area-majority join), so this class does only a table join.

---

## 2. Non-Goals (Explicit Exclusions)

- This class does **not** perform spatial computation — no geometry, no
shapefiles, no geopandas, no area-majority calculation.
- This class does **not** build the lookup table (that is
`scripts/build_gaul_lookup.py`, run offline).
- This class does **not** fill, impute, or invent metadata for unmatched cells.
- This class does **not** validate the result — null/coverage validation is the
manager's `_validate()` responsibility.
- This class does **not** read from the datafactory, viewser, or Appwrite.

---

## 3. Responsibilities and Guarantees

- Loads exactly one lookup Parquet at construction and verifies it carries the 9
contract columns; missing columns raise at construction.
- Returns the input frame augmented with exactly the 9 columns of
`gaul_schema.METADATA_COLS`, with their dtypes preserved from the lookup
(codes numeric, coordinates float, names/iso categorical).
- A cell id present in the lookup is enriched with that cell's metadata.
- A cell id **absent** from the lookup yields **null** metadata for that row —
never a sentinel, never a fabricated value (fail-loud downstream).
- Row count and row order of the input are preserved (left merge).
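
The merge guarantees above can be sketched with plain pandas. This is a minimal illustration, not the class itself: `enrich_by_gid` is a hypothetical function, and the lookup shown carries a single made-up metadata column instead of the 9 contract columns.

```python
import pandas as pd


def enrich_by_gid(
    df: pd.DataFrame, lookup: pd.DataFrame, pg_id_col: str = "priogrid_gid"
) -> pd.DataFrame:
    """Left-join lookup metadata onto df by cell id (hypothetical sketch)."""
    if pg_id_col not in df.columns:
        raise ValueError(f"input frame has no column {pg_id_col!r}")
    # DataFrame.join matches df[pg_id_col] against the lookup's index and
    # keeps every input row in its original order (a left join).
    return df.join(lookup, on=pg_id_col, how="left")


# Made-up data: gid 5 is absent from the lookup.
lookup = pd.DataFrame(
    {"country_iso_a3": ["AAA", "CCC"]},
    index=pd.Index([1, 721], name="priogrid_gid"),
)
frame = pd.DataFrame({"priogrid_gid": [721, 5, 1], "month_id": [500, 500, 500]})
print(enrich_by_gid(frame, lookup))
```

The row for gid 5 comes back with a null `country_iso_a3`, which is the behavior the contract asks for. One pandas detail to keep in mind: a plain integer column gains nulls only by being upcast to float, so code columns stay numeric but may not stay integer when unmatched cells are present.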

---

## 4. Inputs and Assumptions

- A lookup Parquet exists at the configured path (default: the committed
`views_postprocessing/data/gaul_lookup.parquet`), indexed by `priogrid_gid`,
containing only fully-complete cells (no nulls, no `-1`, no empty strings).
- The input DataFrame has a column named by `pg_id_col` holding PRIO-GRID cell
ids; a missing `pg_id_col` raises `ValueError`.
- The lookup is the single source of geographic truth — the caller does not
expect this class to reconcile it against any other source.

---

## 5. Outputs and Side Effects

- Output: the input frame (or, with `only_metadata=True`, just `pg_id_col` +
`time_id_col`) left-merged with the 9 metadata columns.
- Side effects: logs the lookup size at construction (INFO); logs a WARNING with
the count and sample of unmatched cell ids when any occur; logs ignored
mapper-only kwargs at DEBUG. No file writes, no network.

---

## 6. Failure Modes and Loudness

- **Raises** at construction if the lookup file is missing or lacks a contract
column.
- **Raises** `ValueError` if `pg_id_col` is not in the input.
- **Does not raise** on unmatched cells — it surfaces them as nulls and logs a
WARNING. This is deliberate: the manager's `_validate()` null gate is the
single enforcement point, so a coverage hole fails loudly there (one place),
not in two. Passing a sentinel for unmatched cells would be a **bug** (it would
bypass that gate). Aligns with ADR-003 (fail loud on semantic ambiguity).
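
Why a sentinel bypasses the gate can be shown with a toy null check. The function `null_gate` below is a hypothetical stand-in for the manager's `_validate()`, whose real implementation is not shown in this contract; the data is made up.

```python
import pandas as pd


def null_gate(df: pd.DataFrame, required_cols: list[str]) -> None:
    """Raise if any required column is missing or contains nulls.

    Hypothetical stand-in for the manager's null validation.
    """
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    null_counts = df[required_cols].isna().sum()
    bad = null_counts[null_counts > 0]
    if not bad.empty:
        raise ValueError(f"null metadata found: {bad.to_dict()}")


with_nulls = pd.DataFrame({"country_iso_a3": ["AAA", None]})
with_sentinel = pd.DataFrame({"country_iso_a3": ["AAA", "unknown"]})

try:
    null_gate(with_nulls, ["country_iso_a3"])
except ValueError as err:
    print("blocked:", err)

# The sentinel frame passes the same gate without complaint, which is
# exactly the failure the contract rules out.
null_gate(with_sentinel, ["country_iso_a3"])
```

A check that only looks for nulls cannot distinguish `"unknown"` from a real ISO code, so the unmatched cell would be uploaded as if it were valid.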

---

## 7. Boundaries and Interactions

- Allowed to depend on: pandas, `gaul_schema`, and a local Parquet file.
- Must **not** depend on: geopandas/shapely, the runtime mapper, the
datafactory, viewser, Appwrite, or any network resource.
- Treats the lookup table as an opaque, trusted artifact produced by the build
script; it does not re-validate the table's spatial correctness.

---

## 8. Examples of Correct Usage

```python
enricher = GaulLookupEnricher()
out = enricher.enrich_dataframe_with_pg_info(
    df.reset_index(),
    pg_id_col="priogrid_gid",
    time_id_col="month_id",
    only_metadata=True,
)
# out has the 9 metadata columns; unmatched cells are null.
```

Drop-in for the manager's existing call (same method name and key kwargs).

---

## 9. Examples of Incorrect Usage

- Filling unmatched cells with `-1`/`""`/`"unknown"` to "avoid validation
errors" — defeats the fail-loud contract and ships wrong data to FAO.
- Using it to enrich against a lookup built for a different region without
expecting nulls for out-of-region cells.
- Calling it expecting spatial recomputation when the lookup is stale — regenerate
the lookup with the build script instead.

---

## 10. Test Alignment

`tests/test_enrichment.py`:
- **Green:** the 9 columns present; values match the lookup; row count preserved;
codes numeric / coords float.
- **Beige:** lookup integrity (cell count, no nulls, no `-1`, dtypes); coordinate
formula (independent oracle).
- **Red:** unknown cell id and excluded ocean cells yield null; missing
`pg_id_col` raises.

---

## 11. Evolution Notes

- Stable: the 9-column contract and the fail-loud-on-unmatched behavior (changing
either is a coordinated change across this repo and views-faoapi).
- Expected to change: the lookup's cell-set/region and its source GAUL version,
via re-running the build script. Such regenerations must not change the schema.

---

## End of Contract

This document defines the **intended meaning** of `GaulLookupEnricher`.

Changes to behavior that violate this intent are bugs.
Changes to intent must update this contract.
20 changes: 10 additions & 10 deletions docs/CICs/UNFAOPostProcessorManager.md
@@ -12,15 +12,15 @@

> **What is this class for?**

`UNFAOPostProcessorManager` orchestrates the end-to-end postprocessing pipeline that reads VIEWS conflict predictions, enriches them with geographic metadata via the Spatial Mapping Engine, validates the output schema, and delivers the enriched data to the UN FAO via Appwrite cloud storage.
`UNFAOPostProcessorManager` orchestrates the end-to-end postprocessing pipeline that reads VIEWS conflict predictions, enriches them with geographic metadata via the precomputed GAUL lookup (`GaulLookupEnricher`, ADR-011), validates the output schema, and delivers the enriched data to the UN FAO via Appwrite cloud storage.

It is the single entrypoint for producing and delivering UN FAO-formatted prediction data.

---

## 2. Non-Goals (Explicit Exclusions)

- This class does **not** perform spatial mapping logic — it delegates to `PriogridCountryMapper`
- This class does **not** perform spatial mapping logic — it delegates enrichment to `GaulLookupEnricher` (a merge against the precomputed GAUL lookup)
- This class does **not** train, evaluate, or modify prediction models
- This class does **not** define the spatial assignment algorithm
- This class does **not** manage shapefile data or geographic reference assets
@@ -33,7 +33,7 @@ It is the single entrypoint for producing and delivering UN FAO-formatted predic
- Guarantees a 4-stage pipeline: read → transform → validate → save
- Guarantees that historical data is sourced from ViewsER via `ViewsDataLoader`
- Guarantees that forecast data is sourced from the Appwrite production forecasts bucket
- Guarantees that geographic metadata is added via `PriogridCountryMapper.enrich_dataframe_with_pg_info()`
- Guarantees that geographic metadata is added via `GaulLookupEnricher.enrich_dataframe_with_pg_info()` (a cell-id merge against the precomputed lookup)
- Guarantees that required metadata columns are validated before upload
- Guarantees that both historical and forecast datasets are uploaded to the UN FAO Appwrite bucket with correct metadata (name, loa, type, category)
- Guarantees that all structural failures are logged and raised (ADR-008)
@@ -47,7 +47,7 @@ It is the single entrypoint for producing and delivering UN FAO-formatted predic
- Requires environment variables for Appwrite connectivity (endpoint, project ID, API key, bucket/collection IDs)
- Requires the ensemble's `.env` file to be loadable via `dotenv`
- Requires the Appwrite production forecasts bucket to contain at least one file with `category="forecast"`
- Requires the `PriogridCountryMapper` to be initialized (via module-level `set_default_mapper()`)
- Requires the precomputed GAUL lookup parquet to be present so `GaulLookupEnricher` can load it at construction

Assumptions that are not met **must cause failure**, not fallback behavior.

@@ -88,17 +88,17 @@ The following **must never** fail silently:
## 7. Boundaries and Interactions

**Allowed interactions:**
- Delegates spatial mapping to `PriogridCountryMapper` (Spatial Mapping Engine layer)
- Delegates geographic enrichment to `GaulLookupEnricher` (a merge against the precomputed GAUL lookup)
- Uses `views-pipeline-core` managers for path resolution, data loading, and Appwrite integration
- Reads environment variables for external service configuration
- Writes to local filesystem and Appwrite cloud storage

**Must not depend on:**
- Shapefile loading or spatial intersection logic directly
- PRIO-GRID geometry details
- Cache management internals of the mapper
- The internals of how the lookup table was built

This anchors the class within ADR-002 (topology): it sits at the Pipeline Manager layer, above the Spatial Mapping Engine, consuming its outputs without knowledge of its internals.
This anchors the class within ADR-002 (topology): it sits at the Pipeline Manager layer, above the enrichment layer, consuming its outputs without knowledge of its internals.

---

@@ -127,7 +127,7 @@ manager._save()

- **Calling `_transform()` before `_read()`** — datasets will be None, causing AttributeError
- **Calling `_save()` without `_validate()`** — may upload incomplete data to partners
- **Accessing `_mapper` directly to bypass the enrichment pipeline** — violates the orchestration boundary
- **Accessing `_enricher` directly to bypass the enrichment pipeline** — violates the orchestration boundary
- **Hardcoding Appwrite configuration instead of reading from environment** — violates ADR-009

---
@@ -138,7 +138,7 @@ manager._save()
- **Beige tests:** Missing ensemble name in config; None environment variables; empty forecast bucket; DataFrames with unexpected index structure
- **Red tests:** Corrupted parquet downloads; network timeouts during upload; DataFrames where all cells map to None (all-ocean input)

Currently: **no tests exist** (C-03 in risk register). This contract defines what tests must verify when written.
Currently: the manager cannot be instantiated without `views-pipeline-core`, so its stage logic is covered by **replica tests** that mirror the real methods — `tests/test_validation.py` (`_validate`) and `tests/test_append_metadata.py` (`_append_metadata`). A full end-to-end test against the live manager (C-03) still requires a production-like environment.

---

@@ -148,7 +148,7 @@ Currently: **no tests exist** (C-03 in risk register). This contract defines wha
- Partner-specific output formats are **evolving** — the UN FAO schema may change (see C-24, D-06 for schema divergence investigation)
- The source of forecast data (Appwrite bucket/collection) is **evolving** — operational configuration
- Null validation is **active** (C-01 resolved 2026-06-02)
- The enrichment source is **transitioning** from runtime mapper to precomputed lookup table (ADR-011)
- The enrichment source is the **precomputed GAUL lookup table** (`GaulLookupEnricher`, ADR-011), as of the Stage 3 swap; the runtime mapper (`mapping.py`) remains in the repo but is no longer used by this manager and is slated for removal after one verified production cycle

---
