Drop label-duplicating synonyms (seeder fix + clean 39 records) #116
Merged
The METPO seeder copied each class label into a same-text RELATED/EXACT synonym (source: metpo.owl), which is redundant per OBO convention (the label already represents that string; no information beyond it). 39 trait records carried such a synonym.
- seed_from_metpo.py: at emit time, skip any synonym whose text equals the label (case-insensitive) and de-dupe repeated synonym_texts, so the seeder no longer introduces them.
- Migrated the 39 existing records: removed the redundant synonym (37 RELATED + 2 EXACT), REMOVE_REDUNDANT_SYNONYM curation event per file.
syn==label now 0. validate-strict 0 errors; 90 tests pass; seeder dry-run clean. Pages regenerated.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
A quality diagnostic found 39 trait records with a synonym whose text exactly equals the label (e.g. aerobic.yaml: label "aerobic" + synonym "aerobic"). All were seeder-introduced (source: metpo.owl): the METPO seeder copied each class label into a same-text RELATED/EXACT synonym. Per OBO convention these are redundant noise (the label already represents that string; the synonym adds no information).
- seed_from_metpo.py: at emit time, skip any synonym whose text equals the label (case-insensitive) and de-dupe repeated synonym_texts, so the seeder no longer creates them.
- Migrated the 39 existing records: removed the redundant synonym (37 RELATED + 2 EXACT), with a REMOVE_REDUNDANT_SYNONYM curation event per file.
Also checked (no action needed)
The same diagnostic confirmed the corpus is otherwise clean: 0 malformed evidence references, 0 duplicate synonyms, 0 duplicate causal edges, 0 duplicate labels across records. (The "105 TRAIT-not-an-edge-target" signal is a non-issue — those trait nodes are connected as edge subjects, a legitimate modeling direction.)
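For reviewers, the seeder rule described under What and the syn==label check can be sketched roughly as follows. This is a minimal illustration, not the code in seed_from_metpo.py: the helper names `filter_synonyms` and `count_label_duplicates` and the record keys `label`, `synonyms`, `synonym_text` and `synonym_type` are assumptions made for the example. De-duplication is assumed to compare exact text and keep the first occurrence, and the count uses the same case-insensitive comparison as the seeder rule.

```python
from typing import Iterable


def filter_synonyms(label: str, synonyms: Iterable[dict]) -> list[dict]:
    """Drop synonyms that repeat the label and de-dupe repeated texts.

    Hypothetical helper: the dict keys used here are assumed for
    illustration and may not match the real seeder's data model.
    """
    label_key = label.casefold()
    seen: set[str] = set()
    kept: list[dict] = []
    for syn in synonyms:
        text = syn["synonym_text"]
        if text.casefold() == label_key:
            continue  # redundant: the label already represents this string
        if text in seen:
            continue  # repeated synonym text: keep the first occurrence only
        seen.add(text)
        kept.append(syn)
    return kept


def count_label_duplicates(records: Iterable[dict]) -> int:
    """Count synonyms whose text equals their record's label, ignoring case."""
    return sum(
        1
        for rec in records
        for syn in rec.get("synonyms", [])
        if syn["synonym_text"].casefold() == rec["label"].casefold()
    )
```

Applied to a record shaped like the aerobic.yaml example, `filter_synonyms("aerobic", [...])` would drop a synonym with text "aerobic", and `count_label_duplicates` over the filtered records would then report 0 for that record.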
Verification
- syn==label now 0.
- just validate-strict: 477 files, 0 errors.
- 90 tests pass.
- Seeder dry-run clean (writes nothing).
- Pages regenerated.

🤖 Generated with Claude Code