Draft
75 commits
a2d6770
feat(evaluation): add evaluation subpackage skeleton and pyproject en…
miguelgfierro Jun 18, 2026
8676b6a
feat(evaluation): add matcher primitives and statistics helpers (#269)
miguelgfierro Jun 18, 2026
8eb2110
feat(evaluation): add corpus loader and registry modules (#270)
miguelgfierro Jun 18, 2026
ee64cfa
feat(evaluation): add G1-G5 gate framework (#271)
miguelgfierro Jun 18, 2026
d964ba1
feat(evaluation): add scorecard renderer (#272)
miguelgfierro Jun 18, 2026
09cfc34
feat(evaluation): add LLM-as-judge and judge client (#273)
miguelgfierro Jun 18, 2026
1906ede
feat(evaluation): add champion tracking and flyeval CLI (#274)
miguelgfierro Jun 18, 2026
4ab1d85
feat(lab): add retrieval metrics (hit@k, recall@k, MRR, MAP, nDCG) (#…
miguelgfierro Jun 18, 2026
0acac37
feat(examples): add flyradar and flycanon evaluation examples (#276)
miguelgfierro Jun 18, 2026
cc048cf
test(evaluation): add unit tests for evaluation package and retrieval…
miguelgfierro Jun 18, 2026
f79439b
docs(evaluation): add evaluation package documentation (#278)
miguelgfierro Jun 18, 2026
a1d28a5
remove examples/flyradar_eval_example.py
miguelgfierro Jun 19, 2026
6161718
ci: add --extra evaluation to typecheck and test sync steps
miguelgfierro Jun 19, 2026
203134c
fix(evaluation): resolve all ruff lint errors (import sort, SIM108, B…
miguelgfierro Jun 19, 2026
ceaba78
Merge pull request #280 from fireflyframework/fix/eval-ci-gate
miguelgfierro Jun 19, 2026
9c3555d
chore(evaluation): delete cli.py
miguelgfierro Jun 19, 2026
e9fd965
chore(evaluation): delete gates.py
miguelgfierro Jun 19, 2026
38c3f60
chore(evaluation): delete corpus.py
miguelgfierro Jun 19, 2026
f819923
chore(evaluation): delete registry.py
miguelgfierro Jun 19, 2026
3bc0786
chore(evaluation): delete matcher.py
miguelgfierro Jun 19, 2026
9c43a32
chore(evaluation): delete scorecard.py
miguelgfierro Jun 19, 2026
a3673b5
chore(evaluation): delete run_config_snapshot.py
miguelgfierro Jun 19, 2026
a51115e
chore(evaluation): delete models.py
miguelgfierro Jun 19, 2026
5074d14
chore(evaluation): delete stats.py
miguelgfierro Jun 19, 2026
8716be9
chore(evaluation): delete champion.py
miguelgfierro Jun 19, 2026
5c8fe8e
chore(evaluation): delete test_champion.py
miguelgfierro Jun 19, 2026
fdc0277
chore(evaluation): delete test_gates.py
miguelgfierro Jun 19, 2026
0732f85
chore(evaluation): delete test_matcher.py
miguelgfierro Jun 19, 2026
f769ef1
chore(evaluation): delete test_stats.py
miguelgfierro Jun 19, 2026
2516052
feat(evaluation): rewrite judge_client.py as async (httpx.AsyncClient)
miguelgfierro Jun 19, 2026
5609ab6
feat(evaluation): rewrite judge.py — async metrics + EvalContext + fl…
miguelgfierro Jun 19, 2026
7799185
feat(evaluation): slim __init__.py to 3-file exports
miguelgfierro Jun 19, 2026
9526f43
chore(evaluation): update pyproject.toml — drop scipy, add ragas deps…
miguelgfierro Jun 19, 2026
d567552
test(evaluation): add unit tests for judge.py metrics
miguelgfierro Jun 19, 2026
0dd9bac
chore: merge feat/evaluation-framework, keep simplification
miguelgfierro Jun 19, 2026
561f9b5
Merge pull request #282 from fireflyframework/feat/eval-simplification
miguelgfierro Jun 19, 2026
5646974
fix(lab): type-annotate out dict, remove quoted return type in retrie…
miguelgfierro Jun 19, 2026
582d1c0
fix(lab): remove unused import math, fix import sort in test_retrieva…
miguelgfierro Jun 19, 2026
3e62b1f
fix(evaluation): add type: ignore for pyright errors on RAGAS/langcha…
miguelgfierro Jun 19, 2026
a7e44d1
Merge pull request #283 from fireflyframework/chore/eval-ci-fixes
miguelgfierro Jun 19, 2026
6dd8575
Merge remote-tracking branch 'origin/main' into chore/sync-dev-with-main
miguelgfierro Jun 19, 2026
3679dbc
refactor(evaluation): move retrieval_metrics.py from lab/ to evaluation/
miguelgfierro Jun 19, 2026
6bce374
refactor(evaluation): update imports — retrieval_metrics now in evalu…
miguelgfierro Jun 19, 2026
9229c43
refactor(evaluation): move test_retrieval_metrics.py to tests/unit/ev…
miguelgfierro Jun 19, 2026
4d9353d
Merge pull request #284 from fireflyframework/refactor/move-retrieval…
miguelgfierro Jun 19, 2026
6cdd3db
refactor(evaluation): replace RetrieverMetrics class with plain funct…
miguelgfierro Jun 19, 2026
3a3c35f
refactor(evaluation): update __init__.py exports — replace RetrieverM…
miguelgfierro Jun 19, 2026
26bfe3b
test(evaluation): rewrite test_retrieval_metrics for individual metri…
miguelgfierro Jun 19, 2026
b029d36
Merge pull request #285 from fireflyframework/refactor/retrieval-metr…
miguelgfierro Jun 19, 2026
feadcbd
Remove compute_retrieval_metrics() and KS constant from retrieval_met…
miguelgfierro Jun 19, 2026
d54814f
Remove compute_retrieval_metrics export from evaluation __init__
miguelgfierro Jun 19, 2026
0853698
Remove test_compute_retrieval_metrics_* tests
miguelgfierro Jun 19, 2026
a7b1b91
Update flycanon_eval_example to use plain metric functions instead of…
miguelgfierro Jun 19, 2026
0c911b3
Apply ruff format to retrieval_metrics.py
miguelgfierro Jun 19, 2026
ef16882
Apply ruff format to test_retrieval_metrics.py
miguelgfierro Jun 19, 2026
5a9926b
Merge pull request #286 from fireflyframework/refactor/drop-compute-r…
miguelgfierro Jun 19, 2026
e9e97d1
fix(evaluation): deepcopy base in _median_runs to prevent mutation of…
miguelgfierro Jun 25, 2026
eeb315f
fix(evaluation): strip ragas_ prefix in _ragas_score column lookup so…
miguelgfierro Jun 25, 2026
ef092be
fix(evaluation): use provider-appropriate embeddings in _make_ragas_e…
miguelgfierro Jun 25, 2026
d25d2ce
fix(evaluation): use n_gold as MAP denominator instead of min(n_gold, k)
miguelgfierro Jun 25, 2026
c690977
refactor(examples): replace flycanon_eval_example with simpler generi…
miguelgfierro Jun 25, 2026
efdcbf7
refactor(examples): replace rag_eval_example with llm_eval_example us…
miguelgfierro Jun 25, 2026
a1519b9
fix(examples): use SI units only in sample reference
miguelgfierro Jun 25, 2026
d3f53a7
docs(evaluation): rewrite guide around metrics, drop deleted gate pip…
miguelgfierro Jun 29, 2026
a6db4ff
refactor(evaluation): drop mean_latency_ms — telemetry, not a quality…
miguelgfierro Jun 29, 2026
9c78ae5
docs: fix stale evaluation subpackage description in package docstring
miguelgfierro Jun 29, 2026
0d2476b
feat(evaluation): build RAGAS embeddings from the framework embedder
miguelgfierro Jun 29, 2026
7c14351
fix(evaluation): use AzureChatOpenAI for the azure RAGAS LLM
miguelgfierro Jun 29, 2026
c02b10c
refactor(evaluation): strip gate-era baggage from AdvisoryReport
miguelgfierro Jun 29, 2026
4d262ce
docs: drop optional-subpackages block from package docstring
miguelgfierro Jun 29, 2026
2dc1054
refactor(evaluation): back JudgeClient with FireflyAgent + typed outputs
miguelgfierro Jun 29, 2026
dd86d74
refactor(evaluation): make AdvisoryReport a pydantic model
miguelgfierro Jun 29, 2026
30e5fa6
refactor(evaluation): merge judge_client into judge
miguelgfierro Jun 29, 2026
648b20e
style(evaluation): simplify __init__ to grouped imports + __all__
miguelgfierro Jun 29, 2026
d8a48d5
docs(evaluation): use claude-haiku-4-5 alias in example and guide
miguelgfierro Jun 29, 2026
4 changes: 2 additions & 2 deletions .github/workflows/pr-gate.yml
@@ -57,7 +57,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version: '3.13'
-      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings
+      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings --extra evaluation
       - run: uv run pyright

   test:
@@ -72,7 +72,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version: '3.13'
-      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings
+      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings --extra evaluation
       - run: uv run pytest -m "not nightly" --cov --cov-report=term-missing

   build:
7 changes: 7 additions & 0 deletions README.md
@@ -412,6 +412,12 @@ classDiagram
`EvalDataset` loads/saves test cases from JSON. `ModelComparison` runs the
same prompts across multiple agents for side-by-side analysis.

- **Evaluation** — LLM-as-judge metrics (faithfulness, relevancy, answer correctness,
RAGAS, …) and deterministic retrieval metrics (recall@k, MRR, MAP, nDCG, …) for
assessing LLM and pipeline outputs. Each metric is a plain function you call directly.
Install with `pip install "fireflyframework-agentic[evaluation]"`.
See [docs/evaluation.md](docs/evaluation.md) for the full guide.

> **Optional developer tooling.** `fireflyframework_agentic.experiments` (A/B
> experiments) and `fireflyframework_agentic.lab` (offline evaluation /
> benchmarking) are leaf modules — nothing in the core imports them and they add
@@ -817,6 +823,7 @@ Detailed guides for each module:
- [Security](docs/security.md) — Prompt/output guards, at-rest encryption
- [Experiments](docs/experiments.md) — A/B testing, variant comparison
- [Lab](docs/lab.md) — Benchmarks, datasets, evaluators
- [Evaluation](docs/evaluation.md) — LLM-as-judge metrics, RAGAS, retrieval metrics
- Studio — moved to [fireflyframework-agentic-studio](https://github.com/fireflyframework/fireflyframework-agentic-studio)
---

301 changes: 301 additions & 0 deletions docs/evaluation.md
@@ -0,0 +1,301 @@
# Evaluation Guide

Copyright 2026 Firefly Software Foundation. Licensed under the Apache License 2.0.

The Evaluation subpackage provides **metrics for assessing LLM and pipeline outputs**:
LLM-as-judge metrics (faithfulness, relevancy, answer correctness, …) and deterministic
information-retrieval metrics (recall@k, nDCG, MRR, …). Every metric is a plain function
you call directly and combine however your harness needs — there is no gate, verdict, or
promotion machinery to opt into.

---

## Installation

The evaluation subpackage needs `numpy` for the embedding path and `ragas` (plus its
LangChain providers) for the RAGAS metrics. Install the optional extra:

```bash
pip install "fireflyframework-agentic[evaluation]"
```

Everything except the RAGAS metrics works without `ragas` installed; the RAGAS functions
import it lazily and only fail if you call them without the extra.
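
The lazy-import behaviour described above can be sketched as follows. This is a minimal illustration of the general pattern, not the package's actual code: `require_optional` and `ragas_style_metric` are hypothetical names introduced here.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error.

    Hypothetical helper for illustration; not part of the library.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this metric. "
            f'Install it with: pip install "fireflyframework-agentic[{extra}]"'
        ) from exc


def ragas_style_metric() -> str:
    # The import happens inside the function body, so importing this module
    # never fails; only calling the metric without the extra does.
    ragas = require_optional("ragas", "evaluation")
    return ragas.__name__
```

With this shape, importing the module that defines the metric is always safe, and the failure surfaces only at call time with a message that names the extra to install.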

---

## Two metric families

| Family | Module | Needs an LLM? | Use it to evaluate… |
|--------|--------|---------------|---------------------|
| **LLM-as-judge** | `evaluation.judge` | Most metrics yes (a few are deterministic or embedding-based) | The semantic quality of a model's answers and reports: faithfulness, relevancy, correctness, hallucination. |
| **Retrieval** | `evaluation.retrieval_metrics` | No (pure functions, no network) | The ranked retrieval that feeds the LLM: recall@k, precision@k, MRR, MAP, nDCG. |

Both are re-exported from `fireflyframework_agentic.evaluation`.

---

## LLM-as-judge metrics

Each judge metric is an **async function** with the same signature:

```python
async def metric(item: dict, ctx: EvalContext) -> dict | float | None
```

- `item` — a plain dict of the output under evaluation (see schema below).
- `ctx` — an `EvalContext` carrying the judge client, optional embedder, and run count.
- The return value is a small summary dict, a single float, or `None` when the metric
  cannot run (e.g. an embedding metric with no embedder, or a missing field).
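
Because every metric shares this signature, a custom metric can follow the same shape. The sketch below is an assumption-laden illustration: `StubContext` is a hypothetical stand-in for `EvalContext` (so the example runs without the package installed), and `token_overlap` is an invented metric, not one the library ships.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class StubContext:
    """Hypothetical stand-in for EvalContext; not the library class."""

    client: Any = None
    embedder: Any = None
    runs: int = 1


async def token_overlap(item: dict, ctx: StubContext) -> float | None:
    """Fraction of reference tokens that also appear in the answer."""
    answer, reference = item.get("answer"), item.get("reference")
    if not answer or not reference:
        return None  # the metric cannot run: a required field is missing
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    return len(ref_tokens & ans_tokens) / len(ref_tokens)


score = asyncio.run(
    token_overlap({"answer": "a b", "reference": "a c"}, StubContext())
)
```

Returning `None` instead of raising keeps a missing field from aborting a batch of metrics that run over the same item.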

### EvalContext and JudgeClient

```python
from fireflyframework_agentic.evaluation import EvalContext, JudgeClient, build_embedder

ctx = EvalContext(
    client=JudgeClient("anthropic:claude-haiku-4-5"),
    runs=3,         # metrics that repeat use the median of this many calls
    embedder=None,  # optional framework embedder; needed by semantic_recovery and some RAGAS metrics
)
```

`embedder` is any `fireflyframework_agentic` embedder. Build one from a
`"<provider>:<model>"` spec with `build_embedder` (openai, azure, cohere, google,
mistral, voyage, bedrock, ollama):

```python
ctx = EvalContext(
    client=JudgeClient("anthropic:claude-haiku-4-5"),
    embedder=build_embedder("ollama:nomic-embed-text"),
)
```

The RAGAS metrics reuse this same framework embedder (wrapped for RAGAS), so the
evaluator embeds with the same provider as the rest of your pipeline.

`JudgeClient` is an async multi-provider judge backed by the framework's `FireflyAgent`
(pydantic-ai). The model spec is `"<provider>:<model>"`, where provider is one of
`anthropic`, `openai`, `azure`, `ollama`. Each call returns a **validated, typed** Pydantic
model (the LLM's structured output is schema-checked rather than hand-parsed), and
`temperature` is pinned to `0.0` for stable verdicts. The provider reads its credentials or
endpoint (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `AZURE_OPENAI_*`, `OLLAMA_HOST`) when the
underlying agent is first built, so constructing a `JudgeClient` never requires a secret.
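
To make the `"<provider>:<model>"` convention concrete, here is a minimal sketch of splitting a spec and checking whether two specs share a provider. The function names `split_model_spec` and `share_provider` are hypothetical and only approximate what the exported `parse_model` and `same_provider` helpers are described as doing; the `llama3:8b` model name is an illustrative assumption.

```python
KNOWN_PROVIDERS = {"anthropic", "openai", "azure", "ollama"}


def split_model_spec(spec: str) -> tuple[str, str]:
    """Split 'provider:model' at the first colon only."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>:<model>', got {spec!r}")
    return provider, model


def share_provider(spec_a: str, spec_b: str) -> bool:
    """True when both specs use the same known provider prefix."""
    provider_a, _ = split_model_spec(spec_a)
    provider_b, _ = split_model_spec(spec_b)
    return provider_a == provider_b and provider_a in KNOWN_PROVIDERS


# Splitting at the first colon keeps model names that contain colons intact.
provider, model = split_model_spec("ollama:llama3:8b")
```

A check like `share_provider` is what allows a report to flag self-grading risk when the judge and the evaluated pipeline come from the same provider.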

### Item schema

The judge metrics read whichever keys they need and ignore the rest, so one `item` dict can
serve many metrics.

**RAG / Q&A items** (single answer under test):

```python
item = {
    "question": "What is the boiling point of water at sea level?",
    "answer": "Water boils at 100 degrees Celsius at sea level.",
    "reference": "Water boils at 100 °C at standard atmospheric pressure.",
    "contexts": ["...retrieved passage...", "..."],  # used by RAGAS metrics
}
```

**Report / discovery items** (a structured pipeline output):

```python
item = {
    "findings": [{"id": ..., "title": ..., "description": ..., "severity": ...,
                  "evidence_refs": [{"evidence_id": ...}], ...}],
    "evidence_index": [{"id": ..., "locator": "doc.md#L1", "excerpt": "..."}],
    "process_graph": {"processes": [{"name": ..., "activities": [...], "decisions": [...]}]},
    "proposed_actions": [{"title": ..., "finding_id": ..., "expected_savings_fte": ...}],
    "workspace": {"name": ..., "description": ...},
    "nc_items": [{"id": ..., "description": "a statement that is factually false"}],
    "lexical_missed_ids": ["..."],          # ids the lexical pass missed (semantic_recovery)
    "champion": { ... another item ... },   # baseline for comparative_vs_champion
}
```

### Quick start — scoring Q&A pairs

```python
import asyncio
from fireflyframework_agentic.evaluation import (
    EvalContext, JudgeClient, contains_answer, addresses_question,
)

item = {
    "question": "Who wrote Romeo and Juliet?",
    "reference": "Romeo and Juliet was written by William Shakespeare around 1594–1596.",
    "answer": "It was written by Shakespeare.",
}

async def main():
    ctx = EvalContext(client=JudgeClient("anthropic:claude-haiku-4-5"), runs=3)
    contains = await contains_answer(item, ctx)      # float in 0.0–1.0, or None
    addresses = await addresses_question(item, ctx)  # float in 0.0–1.0, or None
    print(contains, addresses)

asyncio.run(main())
```

See `examples/llm_eval_example.py` for a runnable version that scores a list of items
(built-in sample data or a JSONL file) and prints a table.

### Metric catalog

**Deterministic** — no LLM call, always available:

| Metric | Returns | Measures |
|--------|---------|----------|
| `source_coverage` | `{cited, total, orphaned}` | Distinct source documents cited by ≥1 finding vs. all sources; `orphaned` lists uncited stems. |
| `excerpt_fill_rate` | `{populated, total}` | Fraction of `evidence_index` entries that carry a non-empty excerpt. |
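
As a rough illustration of what a deterministic metric of this kind computes, the sketch below counts populated excerpts in `evidence_index`. It is a synchronous approximation written for this guide, not the library's implementation; the name `excerpt_fill` and the choice to treat whitespace-only excerpts as empty are assumptions.

```python
def excerpt_fill(item: dict) -> dict:
    """Count evidence entries that carry a non-empty excerpt.

    Assumption: whitespace-only excerpts count as empty.
    """
    entries = item.get("evidence_index", [])
    populated = sum(1 for entry in entries if str(entry.get("excerpt") or "").strip())
    return {"populated": populated, "total": len(entries)}


sample = {
    "evidence_index": [
        {"id": "e1", "locator": "doc.md#L1", "excerpt": "Invoices are keyed manually."},
        {"id": "e2", "locator": "doc.md#L9", "excerpt": "   "},
        {"id": "e3", "locator": "doc.md#L14"},
    ]
}
fill = excerpt_fill(sample)
```

Returning the raw counts rather than a ratio leaves the caller free to aggregate across items before dividing.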

**Embedding** — requires `ctx.embedder`:

| Metric | Returns | Measures |
|--------|---------|----------|
| `semantic_recovery` | `{lexical_recall, recovered_recall, recovered, tau, scored_denominator}` or `None` | Context-recall: recovers lexically-missed items via embedding similarity above `tau` (default 0.70). Returns `None` when no embedder is set. |
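
The thresholding idea behind an embedding-recovery metric can be sketched in plain Python. This is an illustration under stated assumptions, not the library's algorithm: vectors are given directly instead of coming from an embedder, and `cosine` and `recovered_ids` are hypothetical names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recovered_ids(
    missed: dict[str, list[float]],
    output_vectors: list[list[float]],
    tau: float = 0.70,
) -> list[str]:
    """Ids of lexically missed items whose best match reaches tau."""
    return [
        item_id
        for item_id, vector in missed.items()
        if any(cosine(vector, out) >= tau for out in output_vectors)
    ]


missed = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
outputs = [[0.9, 0.1]]
recovered = recovered_ids(missed, outputs)
```

Here item `a` is nearly parallel to the output vector and is recovered, while item `b` is nearly orthogonal and stays missed. Raising `tau` trades recovered recall for fewer false recoveries.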

**LLM-as-judge** — requires `ctx.client`:

| Metric | Returns | Measures |
|--------|---------|----------|
| `faithfulness` | `{supported, total, unsupported_ids}` | Does each finding's cited evidence entail its claim? |
| `numeric_temporal_fidelity` | `{mismatches, count}` | Numbers/dates asserted in a finding that don't match its evidence. |
| `citation_relevance` | `{precision, relevant, total}` | Context precision: fraction of cited passages actually relevant to the claim. |
| `nc_semantic_precision` | `{asserted, total, asserted_ids}` | How many negative-control falsehoods (`nc_items`) the output asserts or endorses. |
| `fabricated_entity` | `{count, entities}` | Systems/orgs/metrics named in the output but absent from the corpus. |
| `contradiction` | `{count, pairs}` | Internally contradictory finding pairs. |
| `open_gap` | `{gap}` | G-Eval open probe: the most important issue the output missed (free-text, no score). |
| `actionability` | `{score, rated}` | Average 0–1 rating of whether proposed actions are specific, quantified, and linked. |
| `severity_calibration` | `{miscalibrated, total, verdicts}` | Whether each finding's stated severity matches its evidence (under/over/calibrated). |
| `answer_relevancy` | `{score}` | Does the output address the stated workspace intention? |
| `surface_deduplication` | `{distinct, redundant, total, distinct_rate, redundant_pairs}` | Fraction of near-duplicate process-graph nodes that are genuinely distinct. |
| `comparative_vs_champion` | `{candidate, champion, more_consistent}` or `None` | Pairwise five-axis review of candidate vs. `item["champion"]`. `None` if no champion. |

**RAG Q&A** — requires `ctx.client`; repeats `ctx.runs` times and returns the median:

| Metric | Returns | Measures |
|--------|---------|----------|
| `contains_answer` | `float` or `None` | Does the answer contain the correct information from the reference? |
| `addresses_question` | `float` or `None` | Does the answer directly address what the question asks? |
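
The repeat-and-take-the-median behaviour can be sketched with the standard library alone. This is a minimal illustration, assuming the repeats may run concurrently and that `None` results are dropped before taking the median; neither detail is confirmed by the guide, and `median_of_runs` is a hypothetical name.

```python
from __future__ import annotations

import asyncio
import statistics
from typing import Awaitable, Callable


async def median_of_runs(
    score_once: Callable[[], Awaitable[float | None]],
    runs: int,
) -> float | None:
    """Call an async scorer `runs` times and return the median score."""
    results = await asyncio.gather(*(score_once() for _ in range(runs)))
    scores = [r for r in results if r is not None]
    return statistics.median(scores) if scores else None
```

Taking the median rather than the mean limits the influence of a single outlier verdict, which is why an odd `runs` value such as 3 is a natural choice.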

**RAGAS** — requires the `ragas` extra and `ctx.client` (plus an embedder for some):

| Metric | Returns | Measures |
|--------|---------|----------|
| `answer_correctness` | `float` or `None` | Semantic F1 of the answer against the reference. |
| `ragas_faithfulness` | `float` or `None` | Answer grounded in the retrieved `contexts`. |
| `context_recall` | `float` or `None` | Reference coverage by the retrieved `contexts`. |
| `context_precision` | `float` or `None` | Retrieved `contexts` relevant to the question. |

### Running every metric at once

`run_judge()` runs all metrics concurrently and collects them into an `AdvisoryReport`. It
is best-effort and never raises — any metric that fails is recorded in `report.errors`
instead of propagating.

```python
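import asyncio
from fireflyframework_agentic.evaluation import run_judge, EvalContext, JudgeClient

# Any item dict from the schema above; a minimal Q&A item is used here.
item = {
    "question": "Who wrote Romeo and Juliet?",
    "reference": "Romeo and Juliet was written by William Shakespeare around 1594–1596.",
    "answer": "It was written by Shakespeare.",
}

async def main():
    ctx = EvalContext(client=JudgeClient("anthropic:claude-haiku-4-5"), runs=3)
    report = await run_judge(item, ctx, pipeline_model="anthropic:claude-sonnet-4-6")
    print(report.metrics)  # {metric_name: result, ...}
    print(report.errors)   # ["metric: ExceptionType: message", ...]

asyncio.run(main())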
```

`AdvisoryReport` fields:

| Field | Type | Description |
|-------|------|-------------|
| `judge_model` | `str` | The judge model spec used. |
| `same_provider_caveat` | `bool` | `True` when the judge and the evaluated pipeline share a provider (self-grading risk). |
| `runs` | `int` | Judge runs per repeated metric. |
| `metrics` | `dict` | Per-metric results, keyed by metric name. |
| `errors` | `list[str]` | Per-metric failures captured best-effort. |
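
One common way to get this best-effort behaviour is `asyncio.gather(..., return_exceptions=True)`. The sketch below shows that pattern under the assumption that errors are formatted as `"metric: ExceptionType: message"`, as in the example above; `run_best_effort` is a hypothetical name and this is not the library's orchestrator.

```python
import asyncio


async def run_best_effort(
    metrics: dict, item: dict, ctx: object
) -> tuple[dict, list[str]]:
    """Run async metrics concurrently, collecting failures instead of raising."""
    names = list(metrics)
    outcomes = await asyncio.gather(
        *(metrics[name](item, ctx) for name in names),
        return_exceptions=True,  # exceptions come back as values
    )
    results: dict = {}
    errors: list[str] = []
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors.append(f"{name}: {type(outcome).__name__}: {outcome}")
        else:
            results[name] = outcome
    return results, errors
```

A failing metric therefore costs only its own entry: the remaining results are still returned, and the error list records what went wrong.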

---

## Retrieval metrics

Deterministic IR metrics over ranked retrieval results, with no LLM and no network involved,
in the same spirit as scikit-learn's metric functions or the MS MARCO evaluation scripts.
Each is a plain function over a list of result rows.

### Row schema

```python
results = [
    {
        "retrieved": [{"rank": 1, "source_id": "SOP-002.md", "is_gold": True},
                      {"rank": 2, "source_id": "SOP-001.md", "is_gold": False}],
        "gold": ["SOP-002.md"],   # gold source identifiers
        # optional:
        "no_answer": False,       # model refused / produced no answer
        "answer": "...",          # used for no_answer detection if no_answer absent
        "citations": [{"is_gold": True}],
    },
]
```

`rank` is 1-based (rank 1 is the top hit). Duplicate sources are de-duplicated by
`source_id`, keeping the best-ranked chunk.
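
The de-duplication rule can be sketched as follows. This is an illustration only: `dedupe_by_source` is a hypothetical name, and whether the library renumbers ranks after de-duplication is not stated here, so the sketch leaves the original `rank` values untouched.

```python
def dedupe_by_source(retrieved: list[dict]) -> list[dict]:
    """Keep one hit per source_id: the one with the lowest (best) rank."""
    best: dict[str, dict] = {}
    for hit in retrieved:
        source_id = hit["source_id"]
        if source_id not in best or hit["rank"] < best[source_id]["rank"]:
            best[source_id] = hit
    # Return the surviving hits in rank order.
    return sorted(best.values(), key=lambda hit: hit["rank"])


hits = [
    {"rank": 1, "source_id": "SOP-002.md", "is_gold": True},
    {"rank": 2, "source_id": "SOP-001.md", "is_gold": False},
    {"rank": 3, "source_id": "SOP-002.md", "is_gold": True},
]
unique_hits = dedupe_by_source(hits)
```

Without this step, several chunks from the same gold document would be counted as separate hits and inflate precision-style metrics.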

### Metric catalog

| Function | Signature | Measures |
|----------|-----------|----------|
| `hit_at_k` | `(results, k) -> float` | Fraction of queries with ≥1 gold document in top-k. |
| `recall_at_k` | `(results, k) -> float` | Mean fraction of gold documents found in top-k. |
| `precision_at_k` | `(results, k) -> float` | Mean fraction of top-k results that are gold. |
| `mrr` | `(results, k=10) -> float` | Mean reciprocal rank of the first gold hit. |
| `map_score` | `(results, k=10) -> float` | Mean average precision at k. |
| `ndcg` | `(results, k=10) -> float` | Mean normalised discounted cumulative gain at k. |
| `no_answer_rate` | `(results) -> float \| None` | Fraction of queries with no answer. `None` if no results. |
| `citation_precision` | `(results) -> float \| None` | Precision of in-answer citations vs. the gold set. `None` if no citations. |
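
For readers who want the per-query definitions behind these names, here is a minimal sketch with binary relevance. It is not the library's code: the `*_query` functions are hypothetical, they take a de-duplicated list of ranked source ids plus a gold set, and the corpus-level metrics above would be the mean of such values over all queries. The average-precision denominator is the number of gold documents, following the commit in this PR that replaces `min(n_gold, k)` with `n_gold`.

```python
import math


def recall_at_k_query(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top k."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0


def reciprocal_rank_query(ranked: list[str], gold: set[str], k: int) -> float:
    """1 / position of the first gold hit in the top k, else 0."""
    for position, source_id in enumerate(ranked[:k], start=1):
        if source_id in gold:
            return 1.0 / position
    return 0.0


def average_precision_query(ranked: list[str], gold: set[str], k: int) -> float:
    """Sum of precision at each gold hit, divided by n_gold."""
    if not gold:
        return 0.0
    hits, total = 0, 0.0
    for position, source_id in enumerate(ranked[:k], start=1):
        if source_id in gold:
            hits += 1
            total += hits / position
    return total / len(gold)


def ndcg_query(ranked: list[str], gold: set[str], k: int) -> float:
    """DCG with binary gains, normalised by the ideal DCG at k."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, source_id in enumerate(ranked[:k], start=1)
        if source_id in gold
    )
    ideal = sum(1.0 / math.log2(p + 1) for p in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked, gold = ["b", "a", "c", "d"], {"a", "d"}
```

With `n_gold` as the denominator, a query whose gold documents fall outside the top k is penalised for them, whereas `min(n_gold, k)` would cap the penalty at k.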

### Example

```python
from fireflyframework_agentic.evaluation import recall_at_k, ndcg, mrr

print(f"Recall@5: {recall_at_k(results, 5):.3f}")
print(f"nDCG@10: {ndcg(results):.3f}")
print(f"MRR@10: {mrr(results):.3f}")
```

---

## Reference

All symbols below are importable from `fireflyframework_agentic.evaluation`.

### Core types

| Symbol | Kind | Description |
|--------|------|-------------|
| `EvalContext` | Pydantic model | Carries `client`, optional `embedder`, and `runs` for the judge metrics. |
| `build_embedder` | Function | Build a framework embedder from a `"<provider>:<model>"` spec (openai/azure/cohere/google/mistral/voyage/bedrock/ollama). |
| `JudgeClient` | Class | Async multi-provider (`anthropic`/`openai`/`azure`/`ollama`) judge backed by `FireflyAgent`; returns validated typed output. |
| `AdvisoryReport` | Pydantic model | Aggregated `run_judge` output: `metrics`, `errors`, and run metadata. |
| `Metric` | Type alias | `Callable[[dict, EvalContext], Awaitable[dict \| float \| None]]`. |
| `parse_model` | Function | Split `"provider:model"` into `(provider, model)`. |
| `same_provider` | Function | `True` if two model specs share a known provider prefix. |

### Judge metrics

`source_coverage`, `excerpt_fill_rate`, `semantic_recovery`, `faithfulness`,
`numeric_temporal_fidelity`, `citation_relevance`, `nc_semantic_precision`,
`fabricated_entity`, `contradiction`, `open_gap`, `actionability`,
`severity_calibration`, `answer_relevancy`, `surface_deduplication`,
`comparative_vs_champion`, `contains_answer`, `addresses_question`,
`answer_correctness`, `ragas_faithfulness`, `context_recall`, `context_precision`,
and the orchestrator `run_judge`.

### Retrieval metrics

`hit_at_k`, `recall_at_k`, `precision_at_k`, `mrr`, `map_score`, `ndcg`,
`no_answer_rate`, `citation_precision`.