Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
243 changes: 243 additions & 0 deletions docs/guide/benchmarking/2026-06-11-measurement-coverage-audit.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,243 @@
# ELF Benchmark Measurement Coverage Audit - June 11, 2026

Goal: Record what is actually measured today, where competitor comparisons are still
not comparable, and which measurement reports should guide future ELF iteration.
Read this when: You need to answer whether ELF has enough empirical evidence to
claim a win, tie, loss, or non-claim against tracked memory, RAG, graph, and
agent-continuity projects.
Inputs: Fresh local runs of `cargo make real-world-memory` and
`cargo make real-world-memory-live-adapters` on commit `286af8b`, plus
`apps/elf-eval/fixtures/real_world_external_adapters/memory_projects_manifest.json`,
`2026-06-11-competitor-strength-evidence-matrix.md`, and
`2026-06-11-elf-iteration-direction-from-competitor-benchmarks.md`.
Outputs: Fresh measured counters, scenario coverage, project coverage, and the next
measurement reports needed before stronger ELF claims.

## Executive Judgment

The benchmark program is useful and already prevents misleading claims, but the
current measured comparison is not complete enough to say ELF beats or ties every
tracked project's strongest scenario.

What is proven today:

- ELF has a strong fixture-backed real-world benchmark contract: 38 jobs, 36 pass,
2 blocked operator boundaries, and no wrong results in the fixture aggregate.
- ELF and qmd have comparable full-suite live real-world sweeps. They are effectively
tied on pass/fail shape: each has 38 jobs, 18 pass, 5 wrong_result, 2 blocked, and
13 not_encoded.
- ELF is ahead on production-operation evidence among tracked systems because it has
checked-in provider synthetic, stress, backfill, backup/restore, and Qdrant rebuild
evidence.
- The current comparison still undermeasures most competitor strengths. OpenViking
trajectory, mem0/OpenMemory entity history and UI, Letta core-vs-archival memory,
Graphiti/Zep temporal graph behavior, graph/RAG navigation, agentmemory and
claude-mem capture/continuity, and knowledge-page workflows remain non-claims.

So the current adoption decision can remain "credible for bounded personal
production," but the competitiveness objective remains open.

## Fresh Runs

These commands were run from an isolated report worktree based on `origin/main`:

| Command | Result | Runtime |
| --- | --- | ---: |
| `cargo make real-world-memory` | pass | 42.38 seconds |
| `cargo make real-world-memory-live-adapters` | pass | 121.93 seconds |

The live adapter run emitted repeated Qdrant client/server compatibility warnings, but
the command completed successfully and produced ELF and qmd JSON/Markdown reports.
Treat that warning as a measurement-harness risk to keep visible, not as a current run
failure.

## Fixture Aggregate

`cargo make real-world-memory` produced:

| Metric | Value |
| --- | ---: |
| Jobs | `38` |
| Encoded suites | `11` |
| Pass | `36` |
| Blocked | `2` |
| Wrong result | `0` |
| Lifecycle fail | `0` |
| Incomplete | `0` |
| Not encoded | `0` |
| Unsupported claim | `0` |
| Mean score | `0.947` |
| Mean latency | `4.411 ms` |
| Expected evidence recall | `77/77` |
| Evidence coverage | `84/84` |
| Source-ref coverage | `84/84` |
| Quote coverage | `84/84` |

This proves fixture contract breadth and scoring behavior. It does not prove every
live adapter or competitor runtime can complete those jobs.

## Live ELF/qmd Sweep

`cargo make real-world-memory-live-adapters` produced:

| Adapter | Jobs | Pass | Wrong result | Blocked | Not encoded | Mean score | Mean latency | Evidence recall | Evidence coverage |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| ELF live service adapter | `38` | `18` | `5` | `2` | `13` | `0.525` | `5.100 ms` | `41/77` | `48/84` |
| qmd live CLI adapter | `38` | `18` | `5` | `2` | `13` | `0.512` | `719.758 ms` | `41/77` | `48/84` |

This supports a narrow tie on the currently encoded live real-world suite shape. It
does not support a broad ELF-over-qmd claim because qmd remains the stronger
retrieval-debug UX reference and its deep profile is still not encoded.

### Live Suite Breakdown

ELF and qmd had the same suite status shape:

| Suite | Jobs | Status breakdown |
| --- | ---: | --- |
| `trust_source_of_truth` | `1` | `pass:1` |
| `work_resume` | `5` | `pass:5` |
| `retrieval` | `5` | `pass:5` |
| `project_decisions` | `5` | `pass:5` |
| `personalization` | `1` | `pass:1` |
| `memory_evolution` | `6` | `pass:1`, `wrong_result:5` |
| `capture_integration` | `2` | `not_encoded:2` |
| `consolidation` | `4` | `not_encoded:4` |
| `knowledge_compilation` | `2` | `not_encoded:2` |
| `operator_debugging_ux` | `1` | `not_encoded:1` |
| `production_ops` | `6` | `blocked:2`, `not_encoded:4` |

The five live wrong results are all memory-evolution jobs. The live adapters retrieve
current evidence but do not yet provide the required historical conflict evidence
links for current-vs-historical reasoning.

## External Adapter Ledger

The checked-in manifest records 21 adapter records across 17 unique project names.

| Evidence class | Adapter records | Meaning |
| --- | ---: | --- |
| `fixture_backed` | `1` | ELF fixture scoring only. |
| `live_baseline_only` | `6` | Docker same-corpus or lifecycle evidence without real-world job scoring. |
| `live_real_world` | `2` | ELF and qmd live real-world sweeps. |
| `research_gate` | `12` | Setup, source, resource, or output-contract gate only. |

| Overall status | Adapter records |
| --- | ---: |
| `pass` | `1` |
| `wrong_result` | `6` |
| `lifecycle_fail` | `1` |
| `blocked` | `6` |
| `not_encoded` | `7` |

The generated JSON report also emits `external_project_count: 19`, while the unique
project-name count from the manifest is 17. The runner currently computes that field
as adapter records whose project is not `ELF`, not as unique external project names.
Interpret the unique manifest project list as the project coverage count.

## Project Coverage

| Project | Best current evidence | Current measured state | Strongest unproven scenario | Next measurement before claim |
| --- | --- | --- | --- | --- |
| ELF | `fixture_backed` plus `live_real_world` | Fixture aggregate passes except 2 blocked operator boundaries; live full sweep is `wrong_result`. | Full live memory evolution, live consolidation, live knowledge pages, live capture, live production ops. | Memory-evolution diagnostic report, then live operator/capture/consolidation reports. |
| qmd | `live_real_world` plus `live_baseline_only` | Same live sweep shape as ELF; same-corpus baseline passes. | Deep retrieval-debug ergonomics and trace replay. | qmd/ELF deep retrieval-debug profile with expansion, fusion, rerank, and dropped-candidate traces. |
| agentmemory | `live_baseline_only` | `lifecycle_fail`. | Durable coding-agent continuity and capture hooks. | Durable lifecycle and work-resume/capture adapter report. |
| mem0/OpenMemory | `live_baseline_only` | `wrong_result`. | Entity history, lifecycle UI, OpenMemory inspection. | Same-corpus repair first, then entity-history and UI-readback report. |
| memsearch | `live_baseline_only` | `wrong_result`; source-of-truth is `incomplete`. | Markdown canonical store and local reindex clarity. | Reindex/update/delete/reload plus source-of-truth report. |
| OpenViking | `live_baseline_only` plus `research_gate` | Same-corpus retrieval is `wrong_result`; trajectory is `not_encoded`. | Hierarchical staged context trajectory. | Evidence-bearing retrieval fix, then staged trajectory report. |
| claude-mem | `live_baseline_only` | `wrong_result`. | Progressive disclosure and automatic capture review. | Work-resume, operator-debugging, and capture/write-policy report. |
| RAGFlow | `research_gate` | `blocked`. | RAG app workflow with document/chunk references. | Tiny Docker evidence-smoke with `reference.chunks` mapped to evidence ids. |
| LightRAG | `research_gate` | `blocked`. | Graph/RAG context export with source-path citations. | Docker context-export report with explicit provider config and source citation mapping. |
| GraphRAG | `research_gate` | `blocked`. | Graph summaries and document/text-unit evidence tables. | Cost-bounded Docker adapter report over a tiny corpus. |
| Graphiti/Zep | `research_gate` | `blocked`. | Temporal graph facts and validity windows. | Docker-local temporal graph adapter report for current and historical facts. |
| Letta | `research_gate` | `not_encoded`. | Core memory blocks versus archival memory. | Contained export contract, then core-vs-archival and decision-memory report. |
| LangGraph | `research_gate` | `not_encoded`; direct memory backend is unsupported. | Checkpoint replay and fork/regression debugging. | Treat as benchmark-infra reference unless a memory-output contract emerges. |
| nanograph | `research_gate` | `not_encoded`; full memory backend is unsupported. | Typed graph schema and query ergonomics. | Typed relation query report only if evidence ids can be emitted. |
| llm-wiki | `research_gate` | `not_encoded`. | Wiki/page generation, query-save, lint and repair loops. | Contained page-generation report with citation and unsupported-claim lint. |
| gbrain | `research_gate` | `not_encoded`; setup path is blocked. | Compiled truth pages, timelines, and brain navigation. | Docker-local brain repo setup proof, then compiled-truth/timeline report. |
| graphify | `research_gate` | `blocked`. | Graph-compressed navigation with `graph.json` and `GRAPH_REPORT`. | Docker graph/report output report mapped to benchmark evidence ids. |

## Scenario Coverage And Claims

| Scenario | Current measured position | Claim allowed today | Missing measurement |
| --- | --- | --- | --- |
| Retrieval/debug | ELF and qmd live retrieval pass; qmd same-corpus baseline passes. | Tie on encoded live retrieval; no ELF-over-qmd UX claim. | qmd/ELF deep trace replay and debug ergonomics scoring. |
| Work resume | ELF and qmd live pass. | ELF is credible on encoded work resume. | agentmemory, claude-mem, and OpenViking comparable continuity adapters. |
| Project decisions | ELF and qmd live pass. | ELF is credible on encoded project-decision recovery. | Letta core/archival decision memory comparison. |
| Source of truth | ELF and qmd live pass; ELF has stronger production restore/rebuild evidence. | ELF has strongest measured source-of-truth discipline. | memsearch source-of-truth reindex/reload evidence. |
| Memory evolution | ELF and qmd live fail 5/6 jobs; fixture aggregate passes. | No live superiority claim. | Historical conflict evidence links and Graphiti/Zep temporal comparison. |
| Consolidation | Fixture aggregate passes; live adapters are not encoded. | Fixture-only claim. | Live proposal generation with lineage, confidence, and review-action audit. |
| Knowledge pages | Fixture aggregate passes; live adapters are not encoded. | Fixture-only claim. | Live page rebuild/lint plus llm-wiki, gbrain, GraphRAG, and graphify comparisons. |
| Operator debugging | Fixture aggregate passes; live adapters are not encoded. | Fixture-only claim. | Trace hydration, stage attribution, dropped-candidate, and repair-action scoring. |
| Capture/write policy | Fixture aggregate passes; live adapters are not encoded. | Fixture-only claim. | agentmemory/claude-mem style capture with redaction and evidence binding. |
| Production ops | ELF has separate production-provider/backfill/restore evidence; live sweep is not a full production-ops pass. | Bounded personal-production adoption claim with caveats. | Private corpus manifest and credentialed provider gates. |
| Personalization | ELF and qmd live pass one scoped preference job. | Narrow encoded pass only. | mem0/OpenMemory and Letta entity/preference history comparison. |
| Context trajectory | Not comparable. | No claim. | OpenViking staged hierarchy/trajectory scoring. |
| Core-vs-archival memory | Not comparable. | No claim. | Letta contained export and ELF core-block benchmark. |
| Graph/RAG navigation | Research gates and blocked adapters only. | No claim. | RAGFlow, LightRAG, GraphRAG, Graphiti/Zep, and graphify Docker reports. |

## Next Measurement Reports

Order these by decision value, not implementation convenience:

1. ELF/qmd retrieval-debug deep profile
- Why: qmd is the closest measured live competitor and still stronger as a
debugging reference.
- Output: trace-level comparison of expansion, dense/sparse retrieval, fusion,
rerank, dropped candidates, and command-line replay.

2. ELF/qmd live memory-evolution diagnostic
- Why: both systems currently fail 5/6 live memory-evolution jobs.
- Output: per-job evidence-link failure analysis for current-vs-historical facts,
supersession, and relation temporal validity.

3. Live operator-debugging and capture/write-policy report
- Why: these are daily-use agent-memory qualities, currently fixture-only or
not_encoded in live sweeps.
- Output: trace hydration, raw-SQL avoidance, redaction, exclusion, write-policy,
and repair-action scoring.

4. Continuity and context-trajectory report
- Why: agentmemory, claude-mem, and OpenViking represent real user expectations
around automatic capture, progressive disclosure, and staged context.
- Output: comparable work-resume/capture/trajectory jobs or typed blockers.

5. Personalization and core-memory report
- Why: mem0/OpenMemory and Letta represent product expectations ELF should absorb
before claiming better personalization or operating context.
- Output: entity history, preference correction, UI/readback, core-vs-archival,
and project-decision scoring.

6. Knowledge and graph/RAG report pack
- Why: llm-wiki, gbrain, graphify, GraphRAG, LightRAG, RAGFlow, and Graphiti/Zep
cover knowledge synthesis and graph navigation that ELF currently cannot claim.
- Output: Docker-contained artifacts mapped to evidence ids, or typed setup and
resource blockers.

Before publishing the next aggregate report, clarify or rename the generated
`external_project_count` field so readers do not confuse non-ELF adapter records with
unique external projects.

## Fail Criteria

Use these criteria for future reports:

- `pass`: comparable scenario is encoded, run, and evidence-backed.
- `wrong_result`: the system ran but answered with wrong, stale, unsupported, or
insufficiently evidenced memory.
- `not_encoded`: the runner does not yet exercise the scenario. This is not a win or
loss.
- `blocked`: safe measurement needs missing credentials, private data, resource
envelope acceptance, setup proof, or an export contract.
- `unsupported`: the project shape is not a direct memory-system comparison target.
- Fixture evidence cannot be promoted into live runtime evidence.
- Live baseline evidence cannot be promoted into real-world job evidence.
- Research-gate evidence cannot be promoted into pass/fail product quality evidence.

## Bottom Line

ELF is on a strong path because its benchmark methodology is stricter than a normal
leaderboard, and its production evidence is unusually concrete. The next work is not
to declare victory. The next work is to measure the strongest user-facing patterns in
adjacent projects, then decide which ones ELF should absorb behind fresh benchmark
gates.
4 changes: 4 additions & 0 deletions docs/guide/benchmarking/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,10 @@ cleanup, use `docs/guide/single_user_production.md`.
- `2026-06-11-elf-iteration-direction-from-competitor-benchmarks.md`: current
optimization-direction report that translates measured benchmark data and competitor
strengths into prioritized ELF iteration themes and explicit non-claims.
- `2026-06-11-measurement-coverage-audit.md`: fresh coverage audit that separates
current measured ELF/qmd data, fixture evidence, external adapter ledger coverage,
scenario non-claims, and the next measurement reports needed before stronger
competitor claims.
- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world
agent memory benchmark contract, including suite taxonomy, typed report states,
knowledge-compilation fixture tasks, and the production-ops fixture target.
Expand Down
Loading