Stage 2 — Shadow diff: runtime mapper baseline vs lookup enricher on africa_me_legacy #23

@Polichinel

Part of the FAO global delivery plan — umbrella: #20. Gated on Stage 0 (#21, provides the baseline) and Stage 1 (#22, provides the enricher). This is the verification gate before any behavioral change — Stage 3 (#LINK_F) does not start until this report shows zero unexplained differences.

Why this exists

The engine swap changes not just how the 9 metadata columns are computed but, for a minority of cells, what values they hold. That is intentional: the lookup implements the FAO-contracted area-majority rule from a single consistent boundary source. Even so, every changed value must be explained before it ships, not discovered by FAO. The africa_me_legacy baseline from Stage 0 is real production output on real input; diffing against it characterizes the swap completely at the current coverage, before global delivery multiplies the stakes.

Spec

A diff harness script (scripts/diff_enrichment.py or similar, committed):

  1. Load the Stage-0 baseline enriched output (historical dataframe; 13,110 unique gids).
  2. Run the Stage-1 enricher on the same gid list.
  3. Compare all 9 metadata columns per gid. Output: per-column match counts, and a per-gid record for every difference.
  4. Classify every difference into one of the expected classes:
| Class | Cause | Expected scale (from the cross-repo investigation) |
| --- | --- | --- |
| ISO source switch | country_iso_a3: mapper uses Natural Earth, lookup uses GAUL; affects disputed or differently drawn borders | Small; concentrated at known disputed territories (e.g. Western Sahara, Abyei, Hala'ib) |
| Border-cell algorithm | Mapper = NE-country-first then GAUL-within-country (sequential); lookup = single GAUL L2 area-majority | Subset of the ~711 cells (5.4%) where the platform's algorithms historically disagreed |
| Coastal completion | Cells where one engine assigns and the other doesn't (incl. how the baseline handled the 5 ocean cells; see Stage 0 notes) | Up to ~149 coastal-class cells |
| UNEXPLAINED | Anything not in the above | Must be zero |
  5. Write reports/enrichment_diff_report.md: counts per class, methodology, and the full per-cell difference list as a CSV artifact alongside.
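
The compare, classify, and export steps above could be sketched roughly as follows. This is a minimal stdlib-only illustration, not the committed harness: the `{gid: {column: value}}` row layout, the `admin1_name` column, the toy gids and values, and the three pre-computed gid sets (`disputed_gids`, `border_gids`, `coastal_gids`) are all assumptions made for the example, and the classification rules are one plausible reading of the table. A real harness would more likely operate on the dataframes directly.

```python
import csv
import io
from collections import Counter


def diff_enrichment(baseline, candidate, columns):
    """Compare two {gid: {column: value}} mappings column by column.

    Returns per-column match counts and one record per differing cell.
    """
    match_counts = Counter()
    differences = []
    for gid in sorted(baseline):
        old_row = baseline[gid]
        new_row = candidate.get(gid, {})
        for col in columns:
            old, new = old_row.get(col), new_row.get(col)
            if old == new:
                match_counts[col] += 1
            else:
                differences.append(
                    {"gid": gid, "column": col, "baseline": old, "lookup": new}
                )
    return match_counts, differences


def classify(diff, disputed_gids, border_gids, coastal_gids):
    """Assign a difference to a pre-declared class (hypothetical rules)."""
    gid = diff["gid"]
    # Exactly one engine assigned a value for this cell.
    one_sided = (diff["baseline"] is None) != (diff["lookup"] is None)
    if one_sided and gid in coastal_gids:
        return "Coastal completion"
    if diff["column"] == "country_iso_a3" and gid in disputed_gids:
        return "ISO source switch"
    if gid in border_gids:
        return "Border-cell algorithm"
    return "UNEXPLAINED"


def differences_to_csv(differences):
    """Render the classified per-cell difference list as CSV text."""
    buf = io.StringIO()
    fields = ["gid", "column", "baseline", "lookup", "class"]
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(differences)
    return buf.getvalue()


# Toy data: gids, codes, and names are placeholders, not real cells.
columns = ["country_iso_a3", "admin1_name"]
baseline = {
    101: {"country_iso_a3": "AAA", "admin1_name": "north"},
    102: {"country_iso_a3": "BBB", "admin1_name": "east"},
    103: {"country_iso_a3": None, "admin1_name": None},
    104: {"country_iso_a3": "CCC", "admin1_name": "south"},
}
candidate = {
    101: {"country_iso_a3": "ZZZ", "admin1_name": "north"},
    102: {"country_iso_a3": "BBB", "admin1_name": "west"},
    103: {"country_iso_a3": "DDD", "admin1_name": "coast"},
    104: {"country_iso_a3": "CCC", "admin1_name": "south"},
}

match_counts, differences = diff_enrichment(baseline, candidate, columns)
for d in differences:
    d["class"] = classify(
        d, disputed_gids={101}, border_gids={102}, coastal_gids={103}
    )
class_counts = Counter(d["class"] for d in differences)
csv_text = differences_to_csv(differences)
```

Declaring the gid sets before running the diff keeps the classification honest: a difference either falls into a class that was fixed in advance or lands in UNEXPLAINED.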

Reading the result

  • Zero unexplained → Stage 3 is unblocked. The explained classes are not regressions; they are the contractually-correct behavior arriving (area-majority per FAO Release Note 02). Their counts feed the FAO release note (Stage 4, #LINK_G).
  • Any unexplained difference → stop. Diagnose in this issue. Plausible causes: dtype drift in the lookup build, a rename-map error, a stale source parquet. Do not rationalize after the fact — if it does not fit a pre-declared class, it is a bug until proven otherwise.
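
As a rough illustration of the gate, the check below returns a non-zero exit code whenever any difference is classed UNEXPLAINED, and includes a small helper for triaging one of the plausible causes listed above (dtype drift). Both function names and the drift heuristic are assumptions made for this sketch; the issue does not prescribe how the gate or the diagnosis is implemented.

```python
def stage3_gate(classes):
    """Return 0 if Stage 3 may proceed, 1 if any difference is unexplained."""
    unexplained = sum(1 for c in classes if c == "UNEXPLAINED")
    return 0 if unexplained == 0 else 1


def looks_like_dtype_drift(old, new):
    """Heuristic: True when two values differ only by type, e.g. 12 vs "12".

    A hit suggests the lookup build changed a column's dtype rather than
    its content. It is a triage hint, not proof.
    """
    if old is None or new is None:
        return False
    if type(old) is type(new):
        return False
    try:
        return float(old) == float(new)
    except (TypeError, ValueError):
        return str(old) == str(new)


# Toy inputs for illustration only.
explained_only = ["ISO source switch", "Coastal completion"]
with_unexplained = explained_only + ["UNEXPLAINED"]

exit_ok = stage3_gate(explained_only)
exit_blocked = stage3_gate(with_unexplained)
```

A script could pass the gate's return value to `sys.exit` so that CI fails on any unexplained difference.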

Definition of done
