feat: dump and classify api round-trip failures#217
Merged
Conversation
gen-quality
|
Coverage reportCore-dev coverage
|
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 77.9% | 28539 / 36624 |
| Functions | 74.4% | 6360 / 8550 |
| Branches | 50.7% | 22672 / 44725 |
API coverage src/private/mx/{api,impl,utility}/
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 72.1% | 5328 / 7390 |
| Functions | 60.1% | 1819 / 3029 |
| Branches | 43.0% | 4447 / 10333 |
Core HTML report | API HTML report
Commit fda04bdf35f07340ce89359792240ca158ce74ed.
This was referenced Jun 20, 2026
gen-quality
|
Coverage reportCore-dev coverage
|
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 77.9% | 28539 / 36624 |
| Functions | 74.4% | 6360 / 8550 |
| Branches | 50.7% | 22672 / 44725 |
API coverage src/private/mx/{api,impl,utility}/
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 72.1% | 5328 / 7390 |
| Functions | 60.1% | 1819 / 3029 |
| Branches | 43.0% | 4447 / 10333 |
Core HTML report | API HTML report
Commit 2fdd68002c9a67c717cce0d09369e45b382892d3.
Owner
Author
|
@copilot rebase |
Contributor
Done — merged current |
gen-quality
|
Coverage reportCore-dev coverage
|
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 77.9% | 28539 / 36624 |
| Functions | 74.4% | 6360 / 8550 |
| Branches | 50.7% | 22672 / 44725 |
API coverage src/private/mx/{api,impl,utility}/
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 72.1% | 5328 / 7390 |
| Functions | 60.1% | 1819 / 3029 |
| Branches | 43.0% | 4447 / 10333 |
Core HTML report | API HTML report
Commit f604535dc246b72c57953921cb6cea25683a0f3c.
Document the multiset-first, layered diff design for audit/classify.py, superseding the naive "walk both trees in parallel until the first mismatch" sketch. A positional walk cannot survive a deletion (the dominant mx::api failure signal): one drop desynchronizes every later sibling, so only the first divergence is trustworthy. Replace it with a collections.Counter multiset difference that enumerates all missing element classes in O(n), reorder-invariant, stdlib-only. Includes a cited research appendix surveying tree edit distance, XML-specific diff algorithms, sequence alignment, Python libraries, and yield-based ranking.
Phase 1 (#210): add a --dump <dir> flag to the api round-trip harness' discovery mode. For every non-PASS, non-SKIP file it re-runs the pipeline and writes the fully-normalized expected (and, when produced, actual) documents to <dir>/<flat>.expected.xml / .actual.xml. Pipeline errors (LOADFAIL/GETDATAFAIL/ CREATEFAIL) have no actual document, so a <flat>.status sidecar records the exact code. runRoundtrip() is left untouched; dumpDocuments() replicates the normalization sequence. New make target: dump-api-roundtrip. Phase 2 (#211): add audit/classify.py and the `classify` subcommand. It diffs each dumped pair as an order-free element multiset (Counter(expected) - Counter(actual)), which enumerates every dropped element class in O(n) and is reorder-invariant -- fixing the positional-walk cascade where one drop desynchronizes all later siblings. It cross-references data/api.features.xml and assigns each file a root-cause category (B drop-only, C reorder-only, D enum-bug, E missing-attribute, F pipeline-error, or unknown), emitting build/api/classified.json plus a stdout worklist ranked by files unblocked. New make targets: classify-api-roundtrip, test-audit (wired into CI). Tests in audit/tests/test_classify.py cover all categories, the multiset completeness fix, the single-blocker low-hanging-fruit metric, and ranking.
Drives the failure classifier (dump -> classify) and turns build/api/classified.json into a prioritized, plain-language explanation of what is wrong with the mx::api round-trip and what to fix next, grouped by failure mode (crash, supported-element drop, reorder, by-design drop, audit blind spot).
b83931e to
2f3675a
Compare
gen-quality
|
webern
commented
Jun 20, 2026
gen-quality
|
Coverage reportCore-dev coverage
|
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 77.9% | 28539 / 36624 |
| Functions | 74.4% | 6360 / 8550 |
| Branches | 50.7% | 22672 / 44725 |
API coverage src/private/mx/{api,impl,utility}/
| Metric | Coverage | Covered / Total |
|---|---|---|
| Lines | 72.1% | 5328 / 7390 |
| Functions | 60.1% | 1819 / 3029 |
| Branches | 43.0% | 4447 / 10333 |
Core HTML report | API HTML report
Commit 2cd996bbfcd8dc03492f330b9bb0d15268346705.
This was referenced Jun 20, 2026
webern
added a commit
that referenced
this pull request
Jun 20, 2026
…pported-element drops (#224) ## Summary Progresses #219 in two parts: fix part-group round-trip, and make the round-trip classifier surface the dropped-supported-element signal #219 is about (instead of burying it in `unknown`). ### part-group round-trip (mx::api / mx::impl) #219 flagged part-group as the most-dropped `support=full` element on api round-trip (373 files). It was two problems: 1. Misleading corpus signal. All 373 drops were synthetic files with an unmatched `<part-group type="start">` (no stop) -- schema-valid but semantically invalid, a start/stop pairing constraint XSD cannot express. `mx::api` correctly drops an unmatched start (it models a complete start..stop span), and zero real-world files drop part-group. Fixed by making the synthetic part-groups well-formed (trailing `start` -> `stop`, 386 files; schema-valid per version, audit surface unchanged). 2. Overstated support. Even well-formed groups lost data on write: `group-abbreviation` read but never written, `group-barline` fabricated as a constant `yes`, `displayName`/`displayAbbreviation` dead. Now: model `group-barline` (`api::GroupBarline`), write `group-abbreviation`, wire the display names to `group-name-display`/`group-abbreviation-display` via a shared `NameDisplayFunctions` helper. `api.features.xml` corrected `full` -> `partial`; `group-time`/editorial stay unmodeled by design. ### classifier (audit/classify.py) The classifier put every file dropping a `support=full`/`partial` element into `unknown` -- category B only fires when every dropped class is `support=none`, so any file mixing a supported drop with unsupported ones (nearly all) fell through. That buried #219's premise: 759 of 828 files were `unknown`. Added category G (supported-element drop): an actionable impl-bug-or-audit-overstatement signal, evaluated after B/C/D/E (a precise enum/attribute finding still wins) and listing the dropped supported classes as `blocking_features`. On the corpus this moves 575 files out of `unknown` (-> 183) and ranks the real offenders: staff (280), lyric (117), text (115), voice (96), measure-numbering (88). ## Testing - [x] New `partGroupRoundTrip` test: red before the part-group fix, green after - [x] Full api/impl suite: 4113 assertions / 271 cases - [x] corert over the corpus: 830 files (the 386 synthetic edits round-trip in `mx::core`) - [x] `make test-audit`: 13 cases incl. 2 new category-G tests - [x] api round-trip: files dropping part-group 373 -> 0; classifier `unknown` 759 -> 183 - [x] Changed synthetic files schema-valid (3.0/3.1/4.0); `make check` and the api-roundtrip regression gate pass No corpus files become green (each still drops `footnote`/`level`/`staff`), so the pinned baseline is unchanged; the targeted test is the demonstration. ## References - Progresses #219 - Part of #208 - Surfaced by #211 / #217
webern
added a commit
that referenced
this pull request
Jun 20, 2026
## Summary The api pipeline crashed outright (no output at all) on 16 corpus metronome/tempo files — 8 GETDATAFAIL on read, 8 CREATEFAIL on write — caused by two long-standing MX_THROW sites in the impl layer (present in the original library too; the round-trip classifier from #217 is what surfaced them). Read (MetronomeReader): - parseNoteRelationNote() threw "wtf is this" on the metronome-note form. The api has no representation for it, so the tempo is now left unspecified instead of crashing the whole document. - parseMetronomeModulation() was an empty stub that left the tempo unspecified (which then crashed the writer). Implemented it — the two beat-units are read into api::MetricModulation. Write (DirectionWriter): - The tempo loop threw on any non-beatsPerMinute tempo. Replaced with a switch that writes beatsPerMinute and metricModulation, and skips tempoText / unspecified tempos gracefully. Result: metric modulation now round-trips through the api. The metronome-note form and a non-numeric per-minute (a legal xs:string, e.g. "ca. 76") no longer crash — their tempo is dropped, producing imperfect output instead of none. Corpus discovery confirms GETDATAFAIL and CREATEFAIL both drop from 8 to 0. No file reaches a full PASS: all 16 still fail the strict DOM compare on pre-existing, by-design fidelity losses unrelated to metronomes (identification / encoding / part-list reordering), so there are no new live-corpus baseline entries — as the issue anticipated. ## Testing - [x] New targeted red/green tests cover both former crash sites plus metric-modulation fidelity (`*_MetronomeApi`: 17 assertions in 4 test cases) - [x] Full api test suite passes (4114 assertions in 272 test cases) - [x] Corpus discovery: 0 GETDATAFAIL, 0 CREATEFAIL (was 8 + 8) ## References - Closes #218 - Progresses #208 - Surfaced by #211 / #217
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements Phases 1 and 2 of the api round-trip triage plan (#208).
Phase 1 (#210) adds a
--dump <dir>flag to the api round-trip harness's discovery mode. For every non-PASS, non-SKIP file it re-runs the pipeline and writes the fully-normalized expected and actual documents.runRoundtrip()is left untouched; a newdumpDocuments()helper replays the normalization sequence. Pipeline errors produce no actual document, so a small.statussidecar records the exact error.New target:
make dump-api-roundtrip.Phase 2 (#211) adds
audit/classify.pyand thepython3 -m audit classifysubcommand.It diffs each dumped pair as an order-free element count because the dominant signal, deletion, would desynchronize after the first drop. It provides (
distinct_missing_count = len(missing)).It cross-references
data/api.features.xmlto assign each file a root-cause category and writesbuild/api/classified.json. It also prints a worklist ranked by files unblocked with a single-blocker (low-hanging-fruit).Design rationale (defect analysis, layered algorithm, library survey, cited research) is in
docs/ai/design/api-roundtrip-classifier.md.Testing
make test-audit— 11 classifier tests pass (all categories, the multiset-completeness fix, single-blocker metric, ranking)make dump-api-roundtrip && make classify-api-roundtripclassified 828 non-passing files into B 1 / C 51 / F 16 / unknown 760. The largeunknownbucket is the correct result on this pre-api: run discovery and pin currently-passing corpus files to the round-trip baseline #209 branch: those files drop elements that aresupport="full"(e.g.part-group,staff,voice) — genuine correctness signals, not feature gaps — which the design deliberately surfaces asunknownrather than mislabeling drop-only..statussidecar round-trips: a CREATEFAIL file reports"pipeline_error_kind": "CREATEFAIL".git statusclean after a full dump + classify run (all artifacts gitignored).mxtest-api-roundtrip regressionpasses (runRoundtrip()and regression mode untouched; the change only adds a discovery-mode branch).Sample stdout from the real run:
References