Calendar Day
Thursday, June 12, 2026
Planned Effort
3 story points — sprint item #7 (Medium): Performance benchmarks for parse/export path
One issue → one PR.
Depends on: Mon–Wed week-2 work merged or rebased (stable CI matrix + parser hardening). Independent of export-warning response shape — benchmarks measure throughput, not HTTP headers.
Out of scope: perf regression gates in CI, caching architecture changes, frontend benchmarks.
Problem
No benchmarks/, perf/, or performance tests exist. The app re-parses JSONL from disk on session detail (api/sessions.py), search (api/search.py), and bulk export (api/export_api.py → run_bulk_export). Long sessions (thousands of lines) and large bulk exports have no latency or memory baselines, so regressions in the parse boundary pipeline go undetected.
Goal
Establish repeatable, local performance measurements with:
pytest-benchmark harness under tests/benchmarks/
- Synthetic corpora including a 5,000+ line session file
- tracemalloc peak-memory check on large parse
- Non-gating CI job that uploads benchmark JSON artifacts
benchmarks/README.md documenting local runs
Scope
1. Dependencies and layout
Touch points: requirements-dev.txt, tests/benchmarks/, benchmarks/README.md, optional benchmarks/baselines.json, pyproject.toml (pytest marker)
- Add
pytest-benchmark>=4.0.0 to dev dependencies.
- Create
tests/benchmarks/ with modules for parse, export, search, and memory.
2. Synthetic fixtures
Build on patterns in tests/conftest.py and tests/fixtures/session_with_tools.jsonl:
| Fixture |
Size |
Purpose |
| small |
~10 JSONL lines |
Fast sanity bench |
| medium |
~500 lines |
Typical long session |
| large |
≥ 5,000 lines |
Memory pressure + worst-case parse |
| export corpus |
10 / 50 / 100 session files |
Bulk export scaling |
| search corpus |
multi-session project tree |
Full linear scan search |
Large file may be generated at test session scope (tmp_path_factory) rather than committed, as long as generation always produces ≥ 5,000 lines.
3. Benchmark scenarios
| Scenario |
Target |
Tool |
| Single-session parse (small/medium/large) |
utils/jsonl_parser.parse_session |
pytest-benchmark |
| Bulk export (10 / 50 / 100 sessions) |
utils.export_engine.run_bulk_export + NoopSink |
pytest-benchmark |
| Search across corpus |
GET /api/search via Flask test client or equivalent loop in api/search.py |
pytest-benchmark |
| Large-parse memory |
parse_session on large file |
tracemalloc assert (regular pytest test) |
Use @pytest.mark.benchmark on timing tests. Parametrize export counts with distinct benchmark ids.
4. Memory ceiling
- Wrap large-file
parse_session in tracemalloc.start() / get_traced_memory().
- Assert peak allocated memory < 10× on-disk file size (document in test if ceiling adjusted).
5. CI (informational only)
Touch points: .github/workflows/ci.yml
Add benchmarks job on ubuntu-latest:
pytest tests/benchmarks/
--benchmark-only
--benchmark-json=benchmark-results.json
-o addopts=
- Upload
benchmark-results.json via actions/upload-artifact.
- No
--benchmark-compare fail gate — baselines stabilize first.
- Run with
-o addopts= to disable coverage overhead from pyproject.toml addopts.
test_parse_memory.py runs in main pytest job (not --benchmark-only).
6. Documentation
benchmarks/README.md — local commands, scenario table, CI artifact note, how to refresh baselines.json.
- One-line link from
CONTRIBUTING.md testing section.
Acceptance Criteria
Calendar Day
Thursday, June 12, 2026
Planned Effort
3 story points — sprint item #7 (Medium): Performance benchmarks for parse/export path
One issue → one PR.
Depends on: Mon–Wed week-2 work merged or rebased (stable CI matrix + parser hardening). Independent of export-warning response shape — benchmarks measure throughput, not HTTP headers.
Out of scope: perf regression gates in CI, caching architecture changes, frontend benchmarks.
Problem
No
benchmarks/,perf/, or performance tests exist. The app re-parses JSONL from disk on session detail (api/sessions.py), search (api/search.py), and bulk export (api/export_api.py→run_bulk_export). Long sessions (thousands of lines) and large bulk exports have no latency or memory baselines, so regressions in the parse boundary pipeline go undetected.Goal
Establish repeatable, local performance measurements with:
pytest-benchmarkharness undertests/benchmarks/benchmarks/README.mddocumenting local runsScope
1. Dependencies and layout
Touch points:
requirements-dev.txt,tests/benchmarks/,benchmarks/README.md, optionalbenchmarks/baselines.json,pyproject.toml(pytest marker)pytest-benchmark>=4.0.0to dev dependencies.tests/benchmarks/with modules for parse, export, search, and memory.2. Synthetic fixtures
Build on patterns in
tests/conftest.pyandtests/fixtures/session_with_tools.jsonl:Large file may be generated at test session scope (
tmp_path_factory) rather than committed, as long as generation always produces ≥ 5,000 lines.3. Benchmark scenarios
utils/jsonl_parser.parse_sessionpytest-benchmarkutils.export_engine.run_bulk_export+NoopSinkpytest-benchmarkGET /api/searchvia Flask test client or equivalent loop inapi/search.pypytest-benchmarkparse_sessionon large filetracemallocassert (regular pytest test)Use
@pytest.mark.benchmarkon timing tests. Parametrize export counts with distinct benchmarkids.4. Memory ceiling
parse_sessionintracemalloc.start()/get_traced_memory().5. CI (informational only)
Touch points:
.github/workflows/ci.ymlAdd
benchmarksjob onubuntu-latest:benchmark-results.jsonviaactions/upload-artifact.--benchmark-comparefail gate — baselines stabilize first.-o addopts=to disable coverage overhead frompyproject.tomladdopts.test_parse_memory.pyruns in mainpytestjob (not--benchmark-only).6. Documentation
benchmarks/README.md— local commands, scenario table, CI artifact note, how to refreshbaselines.json.CONTRIBUTING.mdtesting section.Acceptance Criteria
tests/benchmarks/usingpytest-benchmarktracemalloc); peak under documented ceilingbenchmarks/README.mdor CONTRIBUTING section explains local runspytest -q,mypy,ruff check .pass in main CI jobs