feat(evaluation): migrate evaluation harnesses from playground #279

Draft
miguelgfierro wants to merge 75 commits into main from feat/evaluation-framework

miguelgfierro (Contributor) commented Jun 19, 2026

Problem

The first migration of evaluation code into fireflyframework_agentic/evaluation/ brought over too much infrastructure alongside the metrics: a CLI (flyeval), a five-gate framework (G1–G5), a registry/corpus/matcher pipeline, a scorecard renderer, champion persistence, run-config snapshotting, and statistical helpers. The actual value — the LLM judge metric functions — was buried under 10 supporting files.

The goal of this PR (revised) is to gut that infrastructure and keep only the measurement code.

What We Keep

All metric functions from both evaluation systems:

Flyradar G4:

  • [D] deterministic: source_coverage, excerpt_fill_rate
  • [E] embedding-based: semantic_recovery
  • [J] LLM judge: faithfulness, numeric_temporal_fidelity, citation_relevance, nc_semantic_precision, fabricated_entity, contradiction, open_gap, actionability, severity_calibration, answer_relevancy, surface_deduplication, comparative_vs_champion (champion passed as optional parameter — no persistence)

Flycanon:

  • Custom: contains_answer, addresses_question (median of N LLM calls per item)
  • RAGAS: answer_correctness, answer_relevancy, faithfulness, context_recall, context_precision

Retrieval (lab/retrieval_metrics.py): unchanged.

What We Delete

From fireflyframework_agentic/evaluation/:

| File | Reason |
| --- | --- |
| cli.py | flyeval CLI: experiment orchestration, not measurement |
| gates.py | G1–G5 gate framework: pipeline infrastructure |
| corpus.py | Corpus loader: pipeline infrastructure |
| registry.py | Registry management: pipeline infrastructure |
| matcher.py | Anchored matching utilities: pipeline infrastructure |
| scorecard.py | Scorecard renderer: reporting, not measurement |
| run_config_snapshot.py | Run config capture: pipeline infrastructure |
| models.py | EvalConfig, GateVerdict: only used by deleted files |
| stats.py | aa_band, aggregate_grounding: only used by deleted files |
| champion.py | Champion persistence: comparative_vs_champion accepts champion data as a parameter instead |

Tests for deleted modules also removed: test_champion.py, test_gates.py, test_matcher.py, test_stats.py.

Target Package Layout

fireflyframework_agentic/evaluation/
├── __init__.py       # exports: EvalContext, AdvisoryReport, all metric functions
├── judge_client.py   # JudgeClient — async LLM scoring client (httpx.AsyncClient)
└── judge.py          # ALL metric functions + EvalContext + AdvisoryReport

Three files. No CLI. No gates. No registry.

Unified Interface

Every metric — flyradar [D], [E], [J], flycanon custom, and RAGAS — shares the same async signature:

async def metric_name(item: dict, ctx: EvalContext) -> float | None

item is a plain dict with a normalized schema:

{
    "question": str,
    "answer": str,
    "reference": str,
    "contexts": list[str],
    # flyradar extras (optional):
    "sources": list[str],
    "excerpts": list[str],
    # for comparative_vs_champion (optional):
    "champion_answer": str | None,
}

EvalContext is a Pydantic model carrying all dependencies:

from pydantic import BaseModel, ConfigDict

class EvalContext(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    client: JudgeClient                     # async LLM call client (all metrics)
    embedder: OllamaEmbedder | None = None  # [E] metrics + RAGAS embeddings
    runs: int = 3                           # flycanon multi-run median

EvalContext has no ragas_llm or ragas_embeddings fields: RAGAS metrics wrap ctx.client and ctx.embedder in LangChain adapters internally, so callers see one client and one embedder.

Composable type alias:

from collections.abc import Awaitable, Callable

Metric = Callable[[dict, EvalContext], Awaitable[float | None]]

Example:

ctx = EvalContext(client=JudgeClient(model="claude-sonnet-4-6", api_key=KEY))
metrics: list[Metric] = [faithfulness, contains_answer, answer_correctness]
scores = await asyncio.gather(*[m(item, ctx) for m in metrics])

judge_client.py

Contains only JudgeClient — a thin async HTTP client for Anthropic/OpenAI/Ollama scoring calls:

class JudgeClient:
    async def chat_json(self, system: str, user: str, max_tokens: int = 200) -> dict: ...

It uses httpx.AsyncClient and retries 429 and 5xx responses, parsing the Retry-After header. It contains no embedding logic: embeddings come from embeddings/providers/ollama.py (the existing async OllamaEmbedder), and cosine_similarity is imported from embeddings/similarity.py.

Dependencies

evaluation optional extra changes:

  • Remove: scipy (only used by deleted stats.py)
  • Add: ragas, langchain-anthropic, langchain-ollama
  • Keep: numpy

No changes to embeddings/ — existing async OllamaEmbedder used as-is.

Test plan

  • pytest tests/unit/evaluation/test_judge.py tests/unit/lab/test_retrieval_metrics.py — all passing
  • Each metric callable independently with a mocked EvalContext
  • asyncio.gather(*[m(item, ctx) for m in metrics]) composes correctly across families

miguelgfierro and others added 11 commits June 18, 2026 23:33
…try point (#268)

* feat(evaluation): add evaluation subpackage __init__ with gate/champion/judge/retrieval exports

* feat(evaluation): add EvalConfig and GateVerdict models

* feat(evaluation): add evaluation optional-deps and flyeval CLI entry point to pyproject.toml

* feat(evaluation): note evaluation as optional subpackage in top-level __init__ docstring

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add matcher primitives (anchored, matches, source_stem, tokens)

* feat(evaluation): add statistics helpers (aa_band, aggregate_grounding, left_skew_flag)

* feat(evaluation): export matcher and stats primitives from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add corpus loader and evidence verification module

* feat(evaluation): add lean-1 registry loader and RegistryItem/Registry models

* feat(evaluation): re-export corpus and registry symbols from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add G1-G5 gate framework (GateResult, run_gates, g2_recall_precision)

* feat(evaluation): export g2_recall_precision from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add scorecard renderer

* feat(evaluation): export render_scorecard, verdict, VERDICT_PROMOTE/HOLD from scorecard module

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add JudgeClient and OllamaEmbedder (judge_client.py)

* feat(evaluation): add AdvisoryReport and run_judge with [D]/[E]/[J] metric families (judge.py)

* feat(evaluation): import cosine from judge_client in matcher.py

* feat(evaluation): export JudgeClient, OllamaEmbedder, build_embedder, cosine from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add ChampionRecord and champion management functions

* feat(evaluation): add run_config_snapshot for flyradar run configuration capture

* feat(evaluation): add flyeval CLI with gate, aa-band, day-zero, invalidate subcommands

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
)

* feat(lab): add retrieval_metrics module with compute_retrieval_metrics and RetrieverMetrics

* feat(lab): export RetrieverMetrics and compute_retrieval_metrics from lab package

* feat(evaluation): import RetrieverMetrics and compute_retrieval_metrics from lab.retrieval_metrics

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add flyradar gate evaluation example

* feat(evaluation): add flycanon RAG retrieval evaluation example

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
… metrics (#277)

* feat(evaluation): add tests/unit/evaluation package init

* feat(evaluation): add unit tests for matcher (anchored, source_stem, tokens, matches)

* feat(evaluation): add unit tests for stats (aa_band, aggregate_grounding, left_skew_flag)

* feat(evaluation): add unit tests for gates (GateResult, verdict, render_scorecard, g5_no_regression)

* feat(evaluation): add unit tests for champion (ChampionRecord, load/save/invalidate, input_hash)

* feat(evaluation): add unit tests for retrieval_metrics (compute_retrieval_metrics, RetrieverMetrics)

* feat(evaluation): fix boundary test for left_skew_flag (floating-point precision)

* feat(evaluation): fix no_answer_rate test to match implementation behaviour

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add evaluation package documentation

* docs(evaluation): mention evaluation subpackage in README

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Comment thread fireflyframework_agentic/evaluation/corpus.py Fixed
Comment thread fireflyframework_agentic/evaluation/scorecard.py Fixed
Comment thread examples/flycanon_eval_example.py Fixed
Comment thread tests/unit/evaluation/test_champion.py Fixed
Comment thread tests/unit/evaluation/test_matcher.py Fixed
Comment thread tests/unit/lab/test_retrieval_metrics.py Fixed
Comment thread tests/unit/lab/test_retrieval_metrics.py Fixed
fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example)
Comment thread fireflyframework_agentic/evaluation/corpus.py Fixed
miguelgfierro and others added 30 commits June 19, 2026 15:00
…ics-as-functions

refactor(evaluation): replace RetrieverMetrics class with plain functions
…etrieval-metrics

Remove compute_retrieval_metrics() aggregate from evaluation
… ragas_faithfulness matches faithfulness column
…mbeddings instead of always falling back to Anthropic
…eline

The evaluation package no longer ships gates, verdict, champion/challenger, or
the flyeval CLI — those modules were removed in this branch. Rewrite the guide
to document the actual public surface: LLM-as-judge metrics (judge.py) and
deterministic retrieval metrics (retrieval_metrics.py). Fix the README mention
to match.
… metric

mean_latency_ms measured search_ms/answer_ms latency, which is operational
telemetry rather than an evaluation of output quality. Remove the function, its
export, its tests, and the search_ms/answer_ms row-schema fields it was the sole
consumer of.

The top-level docstring still described the deleted gate/champion/challenger
infrastructure. Correct it to match the shipped surface: LLM-as-judge metrics,
RAGAS, and retrieval metrics.

Mirror flycanon's embedding_service factory: add build_embedder(spec) resolving
a '<provider>:<model>' spec to a fireflyframework_agentic embedder (8 providers,
deferred per-provider imports). Widen EvalContext.embedder to BaseEmbedder and
feed it into RAGAS via LangchainEmbeddingsWrapper, so the evaluator embeds with
the same provider as the pipeline. Removes the broken AnthropicEmbeddings branch.
Rename _make_ragas_embeddings -> _build_embeddings to decouple the name from RAGAS
for future refactoring.

The azure provider was grouped with openai and built a public-OpenAI ChatOpenAI
client (api.openai.com + OPENAI_API_KEY), sending the azure deployment name as an
OpenAI model id. Split azure out to AzureChatOpenAI using AZURE_OPENAI_ENDPOINT/
AZURE_OPENAI_API_KEY/AZURE_OPENAI_API_VERSION, mirroring judge_client._azure, and
add langchain-openai to the [evaluation] extra so the openai/azure paths import.

Drop the dead 'calibrated' field (only ever set to False, never read) and the
'details' field (never written or read), and rewrite the docstring to remove the
'G4 output / GateResult' gate-era framing. Keeps the live fields: judge_model,
same_provider_caveat, runs, metrics, errors.

Revert the top-level __init__.py docstring addition: it duplicated the README and
docs/evaluation.md, pulled in unrelated lab/experiments, and already went stale
(it described the deleted gates). This leaves the package root untouched by the PR.

Replace the hand-rolled multi-provider httpx client with the framework's
FireflyAgent (pydantic-ai, a core dep). JudgeClient.chat_json(system,user)->dict
becomes judge(system,user,output_type)->validated pydantic model; each of the 13
call-shapes gets a typed output model, so the LLM's structured output is
schema-checked instead of parsed via _first_json_object. Agents are built lazily
and cached per (system, output_type, max_tokens); temperature pinned to 0.0;
retries handled by FireflyAgent/pydantic-ai.

Deletes the bespoke _anthropic/_openai/_azure/_ollama methods, _first_json_object,
_env, and _coerce_float. Fixes the _gather_chat bug: failed judge calls no longer
collapse to {} and get scored as verdicts; they propagate and are recorded in
report.errors (new _judge_all). Adds tests for agent caching, failure propagation,
and the previously-untested run_judge orchestrator.

Align the run_judge output DTO with the framework convention: *Result/*Report
types (EvalReport, EvalResult, BenchmarkResult, PipelineResult, ...) and the
module's own EvalContext are all pydantic BaseModel, leaving AdvisoryReport the
lone dataclass. Switching gains free model_dump_json() for logging/persistence at
no cost (internal output, mutated in place).

After the FireflyAgent refactor the client shrank to ~90 lines and is used only
by judge.py. Fold JudgeClient + parse_model + same_provider into judge.py and drop
the separate file: no standalone transport to justify it anymore. Public imports
are unchanged (still re-exported from the package).

Replace the 35 one-symbol 'from X import (Y as Y)' re-export blocks with three
grouped imports and an explicit __all__, matching the agents/__init__ convention.
__all__ marks the public re-exports so ruff doesn't flag them as unused.

Switch the example default and doc snippets from the pinned
claude-haiku-4-5-20251001 to the floating claude-haiku-4-5 alias.