diff --git a/docs/guide/benchmarking/2026-06-11-temporal-history-competitor-gap-report.md b/docs/guide/benchmarking/2026-06-11-temporal-history-competitor-gap-report.md new file mode 100644 index 0000000..d48a02f --- /dev/null +++ b/docs/guide/benchmarking/2026-06-11-temporal-history-competitor-gap-report.md @@ -0,0 +1,279 @@ +# Temporal/History Competitor Gap Report - June 11, 2026 + +Goal: Turn the latest live measurements into a clear competitor-gap report and +future optimization direction for ELF without implementing optimization changes here. +Read this when: You need to decide whether ELF currently wins, ties, loses, or has +no comparable claim against qmd, mem0/OpenMemory, Graphiti/Zep, Letta, and adjacent +agent-memory projects on temporal history, lifecycle, and real-world memory use. +Inputs: Fresh local runs of Graphiti/Zep temporal smoke, ELF+mem0 live baseline, +fixture memory evolution, and ELF/qmd live real-world adapters on commit +`d6d9051`. +Outputs: Evidence-class boundaries, scenario judgments, claim limits, and a +prioritized benchmark-driven optimization plan. + +## Executive Judgment + +The overall goal is not complete. ELF does not yet have complete, comparable +benchmark wins across all tracked memory projects and all user-important memory +scenarios. + +The current evidence supports a narrower judgment: + +- ELF remains a strong personal-production foundation because its core source of + truth, typed evidence, rebuild/backfill/restore story, and fixture benchmark + coverage are much more disciplined than most competitors. +- ELF now ties or beats mem0 only on the fresh basic local lifecycle smoke shape: + the combined Docker run passed `12/12` checks across ELF and mem0. This does not + measure OpenMemory UI, hosted behavior, entity history quality, optional graph + memory, or real-world temporal jobs. 
+- ELF narrowly beats qmd on the fresh live memory-evolution slice because ELF passes + the delete/TTL tombstone job that qmd fails, and ELF retrieves all required + memory-evolution evidence. This is still not a production-quality temporal memory + win because ELF fails five current-vs-historical jobs. +- Graphiti/Zep remains the strongest temporal-validity design reference, but the + local live smoke is typed `blocked` because no explicit provider API key was + configured. No ELF-over-Graphiti/Zep claim is allowed. +- Letta remains a core-vs-archival memory design reference. There is no contained + comparable live benchmark here, so no win, tie, or loss claim is allowed. + +The highest-value ELF direction is temporal reconciliation and lifecycle readback, +not more generic retrieval. In the five failing temporal jobs, ELF finds the +evidence but does not turn current, historical, superseded, and deleted facts into a +clear answer and trace. + +## Fresh Runs + +| Command | Result | Runtime | Main artifact | +| --- | --- | ---: | --- | +| `ELF_GRAPHITI_ZEP_SMOKE_START=1 ELF_GRAPHITI_ZEP_SMOKE_RUN=1 cargo make graphiti-zep-docker-temporal-smoke` | typed blocked | 3.5 seconds | `tmp/real-world-memory/graphiti-zep-smoke/summary.json` | +| `ELF_BASELINE_PROJECTS=ELF,mem0 cargo make baseline-live-docker` | pass | 50.14 seconds | `tmp/live-baseline/live-baseline-report.json` | +| `cargo make real-world-memory-evolution` | pass | 59.65 seconds | `tmp/real-world-memory/evolution-report.json` | +| `cargo make real-world-memory-live-adapters` | pass | 166.61 seconds | `tmp/real-world-memory/live-adapters/` | + +The Graphiti/Zep command did not use a hosted Zep service or unrecorded credentials. +It recorded a typed blocker: `provider_api_key_missing`. + +The ELF+mem0 baseline loaded the repository `.env` from the main checkout so the +container had the configured embedding environment. 
The report artifact still records +the local smoke embedding mode for this baseline path, so do not cite this run as a +4096-dimensional production-embedding quality test. + +## Evidence-Class Boundary + +| Evidence class | What it proves | What it does not prove | +| --- | --- | --- | +| Fixture memory-evolution pass | The benchmark contract can score current facts, historical facts, conflicts, update rationales, and history readback. | Live ELF or competitor runtime quality. | +| ELF/qmd live real-world adapters | Comparable live behavior for encoded suites in the checked-in runner. | Full memory-system superiority or unencoded suites. | +| ELF+mem0 live baseline | Basic Docker local same-corpus, update, delete, and reload lifecycle smoke. | OpenMemory UI, hosted behavior, real-world jobs, temporal history quality, or graph memory. | +| Graphiti/Zep typed blocker | The adapter has a Docker-local temporal smoke contract and typed provider boundary. | Live Graphiti/Zep search quality or ELF superiority over Graphiti/Zep. | +| Letta research-only state | Core-vs-archival memory is a relevant product pattern for ELF to borrow. | Comparable live results. | + +## Basic Local Lifecycle: ELF And mem0 + +The fresh `ELF,mem0` live-baseline run passed. + +| Project | Status | Checks | Runtime | What passed | +| --- | --- | ---: | ---: | --- | +| ELF | pass | `8/8` | 11 seconds | resumable backfill, same-corpus retrieval, async worker indexing, update, delete, cold-start reload, concurrent writes, resource envelope | +| mem0 | pass | `4/4` | 36 seconds | same-corpus retrieval, update, delete, cold-start reload | + +This updates the older mem0 local-baseline picture. For the basic Docker local +lifecycle smoke, mem0 should no longer be described as currently failing. + +It remains a limited comparison. 
ELF's smoke covers more local operational checks, +while mem0's strongest product claims are elsewhere: entity-scoped memory history, +OpenMemory inspection UX, hosted ecosystem behavior, and optional graph memory. Those +are not measured by this run. + +## Live Temporal Memory: ELF And qmd + +The fixture memory-evolution suite passed `5/5` with mean score `1.000`, expected +evidence `11/11`, conflict detection `5`, and update rationale count `5`. + +The fresh live adapters still fail the real temporal-history behavior. + +| Adapter | Jobs | Pass | Wrong-result jobs | Mean score | Expected evidence recall | Evidence coverage | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | +| ELF live service adapter | `38` | `18` | `5` | `0.525` | `41/77` | `48/84` | +| qmd live CLI adapter | `38` | `17` | `6` | `0.486` | `38/77` | `45/84` | + +For the `memory_evolution` suite: + +| Adapter | Encoded jobs | Job statuses | Score mean | Evidence recall | Diagnosis | +| --- | ---: | --- | ---: | ---: | --- | +| ELF live service adapter | `6` | `1` pass, `5` wrong_result | `0.492` | `1.000` | Finds the evidence, but does not narrate current-vs-historical conflict and lifecycle state. | +| qmd live CLI adapter | `6` | `0` pass, `6` wrong_result | `0.325` | `0.769` | Same lifecycle gap, plus missed evidence including the delete tombstone. | + +### Job-Level Pattern + +| Job | ELF | qmd | What the result means | +| --- | --- | --- | --- | +| `memory-evolution-benchmark-verdict-001` | wrong_result, `0.40`, evidence `3/3` | wrong_result, `0.15`, evidence `2/3` | ELF found current verdict, caveat, and rationale but did not represent the superseded verdict as historical. | +| `memory-evolution-deploy-method-001` | wrong_result, `0.40`, evidence `2/2` | wrong_result, `0.40`, evidence `2/2` | Both found current runbook and rationale, but neither preserved the old quickstart path as historical. 
| +| `memory-evolution-issue-state-001` | wrong_result, `0.40`, evidence `2/2` | wrong_result, `0.40`, evidence `2/2` | Both found current done state and rationale, but neither surfaced the earlier blocked state. | +| `memory-evolution-preference-001` | wrong_result, `0.40`, evidence `2/2` | wrong_result, `0.15`, evidence `1/2` | ELF found current preference and rationale, but did not preserve old preference history. | +| `memory-evolution-relation-temporal-001` | wrong_result, `0.35`, evidence `2/2` | wrong_result, `0.35`, evidence `2/2` | Both found current and old owners, but did not emit scored temporal-validity explanation. | +| `memory-evolution-delete-ttl-001` | pass, `1.00`, evidence `2/2` | wrong_result, `0.50`, evidence `1/2` | ELF found tombstone and current plan. qmd missed the tombstone. | + +The key ELF failure is not retrieval. The five wrong-result jobs all have evidence +grounding `1.0`, trap avoidance `1.0`, answer correctness `0.0`, and lifecycle +behavior `0.0`. ELF needs to reconcile and explain lifecycle state, not merely return +the right snippets. + +## Competitor Strengths And Current ELF Position + +| Scenario | Competitor/reference strength | Current evidence | ELF position | +| --- | --- | --- | --- | +| Basic local lifecycle | mem0 update/delete/reload | Fresh Docker baseline: ELF `8/8`, mem0 `4/4`, combined `12/12` | ELF ties or exceeds the encoded smoke surface, but does not beat OpenMemory UI/history/hosted claims. | +| Retrieval/debug | qmd transparent CLI, expansion/fusion/rerank/replay ergonomics | ELF/qmd live adapters pass retrieval suites; previous qmd debug profile exists | ELF is not clearly stronger. qmd remains the debug-UX bar. | +| Current-vs-historical memory | Graphiti/Zep temporal validity; mem0 history surfaces | ELF/qmd live memory-evolution wrong_result; Graphiti/Zep blocked; mem0 real-world history not encoded | ELF has a measured gap. It only narrowly beats qmd's current run. 
| +| Delete/tombstone lifecycle | ELF production ops and qmd local replay | ELF passes delete/TTL job; qmd misses tombstone | ELF has a narrow measured win over qmd on this job. | +| Entity preference history | mem0/OpenMemory | Only basic mem0 lifecycle smoke passed | Not comparable. Need mem0/OpenMemory history and UI/export benchmark. | +| Core-vs-archival memory | Letta core memory blocks versus archival memory | Research-only, no contained live output | Not comparable. Borrow design only. | +| Context trajectory | OpenViking staged context and hierarchy | Existing adapter remains not encoded or wrong_result for trajectory | Not comparable. Need staged trajectory benchmark. | +| Capture and continuity | agentmemory, claude-mem hooks/viewers | Existing adapters are baseline-only and undermeasured | Not comparable. Need capture/write-policy and work-resume adapters. | +| Knowledge pages and graph/RAG navigation | llm-wiki, gbrain, graphify, RAGFlow, LightRAG, GraphRAG | Research-gate or blocked adapter state | Not comparable. Need Docker-contained evidence-linked adapters. | +| Production operation discipline | ELF backfill, restore, typed gates | Existing production adoption reports plus current benchmark discipline | ELF has the strongest measured local production-operation story, with private/provider gates still typed blocked. | + +## What ELF Should Borrow + +| Source | Best idea to absorb | Benchmark gate before any claim | +| --- | --- | --- | +| Graphiti/Zep | Validity windows, `valid_at`/`invalid_at`, current/historical/future fact separation, temporal relation provenance | Provider-backed Docker temporal smoke must map current, historical, and rationale facts to scored evidence ids. | +| mem0/OpenMemory | Entity-scoped memory history, user-visible lifecycle inspection, update/delete ergonomics | mem0/OpenMemory adapter must score preference history, correction, deletion, and UI/export readback. 
| +| Letta | Always-loaded core memory blocks separated from archival search | Add core-vs-archival jobs for attachment scope, provenance, fallback, and stale-core avoidance. | +| qmd | Local replay, candidate inspection, expansion/fusion/rerank debug knobs | ELF trace artifacts must show candidate generation, rerank, dropped evidence, conflict candidates, and replay commands. | +| OpenViking | Staged context trajectory and hierarchy | Encode trajectory jobs after evidence-bearing same-corpus output passes. | +| agentmemory and claude-mem | Capture breadth, continuity hooks, and viewer comfort | Live capture/write-policy benchmark must prove redaction, exclusion, source ids, and no secret leakage. | +| memsearch | User-inspectable canonical files and rebuild clarity | Source-of-truth/reindex benchmark must prove update/delete/reload without making derived vectors authoritative. | +| llm-wiki, gbrain, graphify, GraphRAG | Cited knowledge pages, timelines, graph reports, rebuild/lint loops | Knowledge-page rebuild/lint jobs must catch unsupported claims and stale sections. | + +## Optimization Direction + +These are future optimization directions, not implemented changes in this report. + +### P0 - Temporal Reconciliation Contract + +ELF should add an answer and trace contract for current-vs-historical memory: + +1. Identify current winner, historical loser, and update rationale for the same claim. +2. Preserve superseded facts as history instead of dropping or silently demoting them. +3. Expose tombstones and TTL invalidations as answerable lifecycle evidence. +4. Emit trace fields for conflict candidates, current selection, historical selection, + tombstone selection, and rationale selection. +5. Add scorer gates so a retrieved-but-not-narrated conflict remains `wrong_result`. + +Target benchmark: ELF live `memory_evolution` should pass all six jobs before any +claim that ELF has solved temporal memory. 
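The contract items above can be sketched as a small reconciliation step. This is a minimal illustration only, not ELF's implementation: the `Fact` record, its field names, the `reconcile` and `gate` helpers, and the sample evidence ids are all hypothetical, and the trace keys simply mirror the five trace fields named in item 4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    # Hypothetical record shape; these field names are assumptions, not ELF's schema.
    evidence_id: str
    claim_key: str
    recorded_at: int                  # write order; larger means newer
    supersedes: Optional[str] = None  # evidence_id this record replaces or deletes
    rationale: Optional[str] = None   # why the update happened
    tombstone: bool = False           # delete/TTL invalidation marker


def reconcile(facts, claim_key):
    """Split the records for one claim into current, historical, and tombstone state."""
    candidates = sorted(
        (f for f in facts if f.claim_key == claim_key),
        key=lambda f: f.recorded_at,
    )
    tombstones = [f for f in candidates if f.tombstone]
    live = [f for f in candidates if not f.tombstone]
    deleted_ids = {t.supersedes for t in tombstones if t.supersedes}
    superseded_ids = {f.supersedes for f in live if f.supersedes}
    # A record is current only if nothing replaced it and no tombstone deleted it.
    current = [
        f for f in live
        if f.evidence_id not in superseded_ids and f.evidence_id not in deleted_ids
    ]
    winner = current[-1] if current else None
    return {
        "conflict_candidates": [f.evidence_id for f in candidates],
        "current_selection": winner.evidence_id if winner else None,
        # Superseded records are kept as history instead of being dropped.
        "historical_selection": [
            f.evidence_id for f in live if f.evidence_id in superseded_ids
        ],
        "tombstone_selection": [t.evidence_id for t in tombstones],
        "rationale_selection": winner.rationale if winner else None,
    }


def gate(trace, expects_history):
    """Scorer gate: evidence that was found but not narrated as history still fails."""
    if expects_history and not trace["historical_selection"]:
        return "wrong_result"
    return "pass"


facts = [
    Fact("pref-old", "editor-preference", 1),
    Fact("pref-new", "editor-preference", 2,
         supersedes="pref-old", rationale="user corrected the preference"),
    Fact("plan-old", "rollout-plan", 1),
    Fact("plan-tombstone", "rollout-plan", 2, supersedes="plan-old", tombstone=True),
    Fact("plan-new", "rollout-plan", 3),
]
preference_trace = reconcile(facts, "editor-preference")
plan_trace = reconcile(facts, "rollout-plan")
```

In this sketch the preference claim yields a current winner, one historical record, and a rationale, while the rollout-plan claim surfaces the tombstone alongside the current plan. The gate is kept separate from reconciliation so that a scorer can fail a job whose evidence was retrieved but whose history was never represented.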
+ +### P0 - mem0/OpenMemory History Comparison + +The fresh mem0 pass means the next useful comparison is no longer basic update/delete. +It should move to the product behavior users actually care about: + +1. preference history across correction events; +2. entity-scoped memory lookup and update; +3. user-visible inspection/export of memory lifecycle; +4. deletion versus historical audit readback; +5. optional graph-memory behavior only if the OSS path is reproducible in Docker. + +Target benchmark: mem0/OpenMemory and ELF both run comparable history jobs; claims are +made per scenario, not per project brand. + +### P0 - qmd-Level Debugging And Replay + +ELF should match qmd's practical debugging strengths: + +1. show query expansion, sparse/dense retrieval, fusion, rerank, and final selection; +2. mark candidate-drop reasons; +3. include replay commands that do not require raw SQL; +4. connect wrong-result scores to specific missing stages; +5. keep artifacts local and reproducible. + +Target benchmark: every wrong temporal or retrieval answer has a replayable trace that +explains whether evidence was absent, retrieved but dropped, selected but not narrated, +or contradicted by a higher-priority lifecycle fact. + +### P1 - Core Memory Blocks + +ELF should evaluate Letta-style core memory without weakening ELF's source-of-truth +discipline: + +1. scoped read-only core blocks; +2. provenance and source ids on every core assertion; +3. explicit attach/detach rules; +4. stale-core detection when archival evidence supersedes a core statement; +5. fallback to archival search when core memory is insufficient. + +Target benchmark: core-vs-archival jobs prove correct attachment, sharing, update +visibility, and stale-core avoidance. + +### P1 - Capture, Consolidation, And Knowledge Pages + +A good memory system is not only retrieval. ELF should benchmark and later optimize: + +1. safe capture/write policy with redaction and exclusion proof; +2. 
reviewable consolidation proposals with source lineage and unsupported-claim flags; +3. project/entity knowledge pages that rebuild from authoritative notes; +4. timelines for changed decisions, ownership, and production state; +5. operator UX that explains failures without raw database inspection. + +Target benchmark: live capture, consolidation, knowledge, and operator-debugging suites +must move from `not_encoded` or fixture-only to comparable live evidence. + +### P2 - Graph/RAG And Context-Trajectory Adapters + +Graph/RAG and context trajectory should be measured, not assumed: + +1. Graphiti/Zep for temporal graph facts; +2. RAGFlow, LightRAG, and GraphRAG for document/chunk/graph evidence handles; +3. graphify for graph-compressed navigation reports; +4. OpenViking for staged context trajectory; +5. llm-wiki and gbrain for maintained knowledge workflows. + +Target benchmark: each adapter must emit evidence-linked outputs from Docker-contained +or explicitly typed provider-backed runs before any ELF win/loss claim. + +## Claim Boundaries + +Allowed: + +- ELF+mem0 basic local lifecycle smoke passed in the fresh Docker baseline. +- ELF narrowly outperformed qmd on the fresh memory-evolution slice because ELF passed + delete/TTL and qmd did not. +- ELF still failed five of six live memory-evolution jobs. +- Graphiti/Zep temporal smoke is typed blocked due to a missing explicit provider key. +- Letta is a design reference, not a measured comparable competitor in this report. +- The next work should be benchmark/report driven before implementation work is + claimed successful. + +Not allowed: + +- Do not claim all goals are complete. +- Do not claim ELF beats all tracked memory projects. +- Do not claim ELF beats mem0/OpenMemory on UI, hosted behavior, entity history, or + graph memory. +- Do not claim ELF beats Graphiti/Zep on temporal validity. +- Do not claim ELF beats Letta on core-vs-archival memory. 
+- Do not treat fixture pass, baseline smoke pass, and live real-world pass as the + same evidence class. + +## Next Concrete Report/Issue Directions + +1. Open or refine a P0 issue for ELF live temporal reconciliation and trace contract. +2. Open a P0 benchmark issue for mem0/OpenMemory history and UI/export readback. +3. Open a P0 benchmark issue for ELF/qmd trace-level replay and wrong-result + diagnosis. +4. Open a P1 benchmark issue for Letta-style core-vs-archival memory. +5. Keep Graphiti/Zep provider-backed temporal smoke blocked until explicit provider + credentials are available, then rerun and compare validity-window behavior. +6. Keep graph/RAG and knowledge-page adapters as P2 until Docker-contained evidence + mappings are available. + +## Bottom Line + +ELF is not done competing. The evidence says ELF should keep its strict +source-of-truth and production-operation core, then absorb the best competitor ideas +behind benchmark gates. The immediate product-quality gap is temporal and lifecycle +memory: users need to know what is current, what changed, what was deleted, what is +historical, and why the system believes that answer. diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md index 1cc0563..e7b0cde 100644 --- a/docs/guide/benchmarking/index.md +++ b/docs/guide/benchmarking/index.md @@ -65,6 +65,11 @@ cleanup, use `docs/guide/single_user_production.md`. memory-evolution diagnostic showing fixture pass, live ELF/qmd current-vs-historical wrong-result patterns, qmd tombstone evidence miss, and temporal-reconciliation iteration directions. +- `2026-06-11-temporal-history-competitor-gap-report.md`: fresh report-only + temporal/history competitor-gap report that updates the mem0 basic lifecycle result, + records Graphiti/Zep and Letta claim boundaries, and turns qmd, mem0/OpenMemory, + Graphiti/Zep, Letta, and adjacent project strengths into benchmark-gated ELF + optimization directions. 
- `real_world_agent_memory_benchmark.md`: operator overview for the v1 real-world agent memory benchmark contract, including suite taxonomy, typed report states, knowledge-compilation fixture tasks, and the production-ops fixture target. diff --git a/docs/research/2026-06-11-temporal-history-competitor-gap-report.json b/docs/research/2026-06-11-temporal-history-competitor-gap-report.json new file mode 100644 index 0000000..fe95e72 --- /dev/null +++ b/docs/research/2026-06-11-temporal-history-competitor-gap-report.json @@ -0,0 +1,347 @@ +{ + "schema": "elf.temporal_history_competitor_gap_report/v1", + "run_id": "2026-06-11-temporal-history-competitor-gap-report", + "commit": "d6d9051f9e28384410308ac952936fcdb021dbc2", + "created_at": "2026-06-11", + "scope": "Report-only competitor gap assessment for temporal/history memory, lifecycle smoke, and future ELF optimization direction", + "role_boundary": "No ELF optimization implementation is included; this report records evidence, claim boundaries, and future optimization directions.", + "commands": [ + { + "command": "ELF_GRAPHITI_ZEP_SMOKE_START=1 ELF_GRAPHITI_ZEP_SMOKE_RUN=1 cargo make graphiti-zep-docker-temporal-smoke", + "status": "blocked", + "typed_status": "provider_api_key_missing", + "runtime_seconds": 3.5, + "artifact": "tmp/real-world-memory/graphiti-zep-smoke/summary.json" + }, + { + "command": "ELF_BASELINE_PROJECTS=ELF,mem0 cargo make baseline-live-docker", + "status": "pass", + "runtime_seconds": 50.14, + "artifact": "tmp/live-baseline/live-baseline-report.json" + }, + { + "command": "cargo make real-world-memory-evolution", + "status": "pass", + "runtime_seconds": 59.65, + "artifact": "tmp/real-world-memory/evolution-report.json" + }, + { + "command": "cargo make real-world-memory-live-adapters", + "status": "pass", + "runtime_seconds": 166.61, + "artifact": "tmp/real-world-memory/live-adapters/" + } + ], + "executive_judgment": { + "goal_complete": false, + "summary": "ELF is a credible 
personal-production foundation, but the current evidence does not prove broad superiority across all tracked memory projects or all user-important scenarios.", + "highest_priority_gap": "temporal_reconciliation_and_lifecycle_readback", + "main_reason": "In live memory-evolution jobs, ELF retrieves the required evidence but does not represent current, historical, superseded, and deleted facts as explicit answer and trace state." + }, + "basic_local_lifecycle": { + "run_id": "live-baseline-20260611010431", + "project_filter": "ELF,mem0", + "verdict": "pass", + "summary": { + "total": 2, + "pass": 2, + "wrong_result": 0, + "lifecycle_fail": 0, + "incomplete": 0, + "blocked": 0, + "not_encoded": 0 + }, + "same_corpus_summary": { + "total": 2, + "pass": 2, + "fail": 0 + }, + "full_check_summary": { + "total": 12, + "pass": 12, + "fail": 0, + "wrong_result": 0, + "lifecycle_fail": 0, + "incomplete": 0, + "blocked": 0, + "not_encoded": 0 + }, + "projects": [ + { + "project": "ELF", + "status": "pass", + "elapsed_seconds": 11, + "checks": 8, + "checks_passed": 8, + "passed_capabilities": [ + "resumable_backfill_no_duplicates", + "same_corpus_retrieval", + "async_worker_indexing_e2e", + "update_replaces_note_text", + "delete_suppresses_retrieval", + "cold_start_recovery_search", + "concurrent_write_search_e2e", + "resource_envelope" + ] + }, + { + "project": "mem0", + "status": "pass", + "elapsed_seconds": 36, + "checks": 4, + "checks_passed": 4, + "passed_capabilities": [ + "same_corpus_retrieval", + "update_replaces_note_text", + "delete_suppresses_retrieval", + "cold_start_recovery_search" + ], + "not_measured": [ + "OpenMemory UI", + "hosted ecosystem behavior", + "entity history quality", + "optional graph memory", + "real-world memory_evolution jobs" + ] + } + ], + "claim": "ELF and mem0 both pass the encoded local Docker lifecycle smoke; this does not prove ELF beats mem0/OpenMemory on its strongest product surfaces." 
+ }, + "fixture_memory_evolution": { + "job_count": 5, + "pass": 5, + "wrong_result": 0, + "mean_score": 1.0, + "expected_evidence_total": 11, + "expected_evidence_matched": 11, + "conflict_detection_count": 5, + "update_rationale_available_count": 5, + "history_readback_encoded_count": 1 + }, + "live_real_world_context": { + "elf": { + "job_count": 38, + "encoded_suite_count": 11, + "pass": 18, + "wrong_result": 5, + "wrong_result_signal_count": 6, + "blocked": 2, + "not_encoded": 13, + "mean_score": 0.525, + "mean_latency_ms": 9.888, + "expected_evidence_total": 77, + "expected_evidence_matched": 41, + "evidence_required_count": 84, + "evidence_covered_count": 48 + }, + "qmd": { + "job_count": 38, + "encoded_suite_count": 11, + "pass": 17, + "wrong_result": 6, + "wrong_result_signal_count": 11, + "blocked": 2, + "not_encoded": 13, + "mean_score": 0.486, + "mean_latency_ms": 1132.646, + "expected_evidence_total": 77, + "expected_evidence_matched": 38, + "evidence_required_count": 84, + "evidence_covered_count": 45 + } + }, + "live_memory_evolution": { + "elf": { + "encoded_jobs": 6, + "pass": 1, + "wrong_result_jobs": 5, + "score_mean": 0.492, + "expected_evidence_recall": 1.0, + "diagnosis": "ELF retrieved all required memory-evolution evidence but did not emit lifecycle-aware current-vs-historical answer behavior on five jobs." + }, + "qmd": { + "encoded_jobs": 6, + "pass": 0, + "wrong_result_jobs": 6, + "score_mean": 0.325, + "expected_evidence_recall": 0.769, + "diagnosis": "qmd had the same missing temporal-conflict pattern and additionally missed evidence, including the delete tombstone." + }, + "job_matrix": [ + { + "job_id": "memory-evolution-benchmark-verdict-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "elf_evidence": "3/3", + "qmd_status": "wrong_result", + "qmd_score": 0.15, + "qmd_evidence": "2/3", + "diagnosis": "ELF found current verdict, caveat, and rationale but did not represent the superseded verdict as historical." 
+ }, + { + "job_id": "memory-evolution-deploy-method-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "elf_evidence": "2/2", + "qmd_status": "wrong_result", + "qmd_score": 0.4, + "qmd_evidence": "2/2", + "diagnosis": "Both found current runbook and rationale, but neither preserved the old quickstart path as historical." + }, + { + "job_id": "memory-evolution-issue-state-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "elf_evidence": "2/2", + "qmd_status": "wrong_result", + "qmd_score": 0.4, + "qmd_evidence": "2/2", + "diagnosis": "Both found current done state and rationale, but neither surfaced the earlier blocked state as history." + }, + { + "job_id": "memory-evolution-preference-001", + "elf_status": "wrong_result", + "elf_score": 0.4, + "elf_evidence": "2/2", + "qmd_status": "wrong_result", + "qmd_score": 0.15, + "qmd_evidence": "1/2", + "diagnosis": "ELF found current preference and rationale, but did not preserve old preference history." + }, + { + "job_id": "memory-evolution-relation-temporal-001", + "elf_status": "wrong_result", + "elf_score": 0.35, + "elf_evidence": "2/2", + "qmd_status": "wrong_result", + "qmd_score": 0.35, + "qmd_evidence": "2/2", + "diagnosis": "Both found current and old owners, but did not emit temporal-validity explanation." + }, + { + "job_id": "memory-evolution-delete-ttl-001", + "elf_status": "pass", + "elf_score": 1.0, + "elf_evidence": "2/2", + "qmd_status": "wrong_result", + "qmd_score": 0.5, + "qmd_evidence": "1/2", + "diagnosis": "ELF found tombstone and current plan; qmd missed tombstone." 
+ } + ] + }, + "graphiti_zep_temporal_smoke": { + "run_id": "graphiti-zep-docker-smoke-20260611010309", + "evidence_class": "research_gate", + "status": "blocked", + "failure_class": "provider_api_key_missing", + "failure_reason": "Graphiti/Zep live temporal search requires an explicit provider API key; no hosted Zep service or unrecorded provider credentials were used.", + "expected_evidence_ids": [ + "graphiti-zep-old-owner", + "graphiti-zep-current-owner", + "graphiti-zep-owner-rationale" + ], + "claim": "Graphiti/Zep remains a temporal-validity reference, but no live pass or ELF superiority claim is supported." + }, + "scenario_judgments": [ + { + "scenario": "basic_local_lifecycle", + "current_judgment": "elf_and_mem0_both_pass_encoded_smoke", + "claim_strength": "limited_tie_or_elf_broader_smoke_surface", + "next_gate": "mem0/OpenMemory history and UI/export readback benchmark" + }, + { + "scenario": "retrieval_debug", + "current_judgment": "qmd_remains_debug_ux_reference", + "claim_strength": "no_elf_win_claim", + "next_gate": "ELF/qmd trace-level replay and wrong-result diagnosis" + }, + { + "scenario": "current_vs_historical_memory", + "current_judgment": "elf_narrowly_beats_qmd_but_still_fails_temporal_product_quality", + "claim_strength": "narrow_job_slice_only", + "next_gate": "ELF live memory_evolution pass for all six jobs" + }, + { + "scenario": "temporal_graph_validity", + "current_judgment": "graphiti_zep_blocked_reference", + "claim_strength": "no_comparable_claim", + "next_gate": "provider-backed Graphiti/Zep Docker temporal smoke" + }, + { + "scenario": "core_vs_archival_memory", + "current_judgment": "letta_research_only_reference", + "claim_strength": "no_comparable_claim", + "next_gate": "contained Letta export path and core-vs-archival jobs" + }, + { + "scenario": "production_operation_discipline", + "current_judgment": "elf_strongest_measured_local_story", + "claim_strength": "bounded_by_private_and_provider_gates", + "next_gate": 
"private-corpus and credentialed production-ops evidence only when operator inputs exist" + } + ], + "optimization_direction_order": [ + { + "priority": "P0", + "direction": "temporal_reconciliation_contract", + "description": "Add answer and trace semantics for current winner, historical loser, update rationale, tombstone, and supersession state.", + "benchmark_gate": "ELF live memory_evolution pass for all six jobs." + }, + { + "priority": "P0", + "direction": "mem0_openmemory_history_comparison", + "description": "Move past basic update/delete smoke into preference history, entity memory, lifecycle inspection, deletion audit, and UI/export readback.", + "benchmark_gate": "Comparable ELF and mem0/OpenMemory history jobs with typed evidence classes." + }, + { + "priority": "P0", + "direction": "qmd_level_debugging_and_replay", + "description": "Expose query expansion, sparse/dense retrieval, fusion, rerank, dropped candidates, conflict candidates, and replay commands.", + "benchmark_gate": "Every wrong result has a replayable trace that localizes absent, dropped, selected-but-not-narrated, or contradicted evidence." + }, + { + "priority": "P1", + "direction": "core_memory_blocks", + "description": "Evaluate Letta-style core memory blocks with provenance, attachment rules, stale-core detection, and archival fallback.", + "benchmark_gate": "Core-vs-archival jobs prove correct attachment, sharing, update visibility, and stale-core avoidance." + }, + { + "priority": "P1", + "direction": "capture_consolidation_knowledge_pages", + "description": "Score safe capture, reviewable consolidation, cited knowledge pages, timelines, and operator UX as live surfaces.", + "benchmark_gate": "Live capture, consolidation, knowledge, and operator-debugging suites move from not_encoded or fixture-only to comparable evidence." 
+ }, + { + "priority": "P2", + "direction": "graph_rag_and_context_trajectory_adapters", + "description": "Measure Graphiti/Zep, RAGFlow, LightRAG, GraphRAG, graphify, OpenViking, llm-wiki, and gbrain with evidence-linked output contracts.", + "benchmark_gate": "Docker-contained or explicitly typed provider-backed adapters emit scored evidence outputs." + } + ], + "claim_boundaries": { + "allowed": [ + "ELF+mem0 basic local lifecycle smoke passed in the fresh Docker baseline.", + "ELF narrowly outperformed qmd on the fresh memory-evolution slice because ELF passed delete/TTL and qmd did not.", + "ELF still failed five of six live memory-evolution jobs.", + "Graphiti/Zep temporal smoke is typed blocked due to a missing explicit provider key.", + "Letta is a design reference, not a measured comparable competitor in this report." + ], + "not_allowed": [ + "All goals are complete.", + "ELF beats all tracked memory projects.", + "ELF beats mem0/OpenMemory on UI, hosted behavior, entity history, or graph memory.", + "ELF beats Graphiti/Zep on temporal validity.", + "ELF beats Letta on core-vs-archival memory.", + "Fixture pass, baseline smoke pass, and live real-world pass are interchangeable evidence classes." + ] + }, + "next_issue_directions": [ + "P0 ELF live temporal reconciliation and trace contract", + "P0 mem0/OpenMemory history and UI/export readback benchmark", + "P0 ELF/qmd trace-level replay and wrong-result diagnosis", + "P1 Letta-style core-vs-archival memory benchmark", + "P2 Graphiti/Zep provider-backed temporal smoke after explicit provider credentials exist", + "P2 graph/RAG and knowledge-page Docker-contained evidence adapters" + ] +}