Sort rows across books when aligning by Enkidu93 · Pull Request #309 · sillsdev/machine.py

Enkidu93 · 2026-06-04T18:26:38Z

I encountered this while helping Michael with another incorrectly extracted project (a Georgian translation that uses the Russian Orthodox versification). Although the new machine.py code extracted the project with the correct number of lines (previously it had too many), Michael noticed it was missing all content after a certain verse in Daniel (it was a full, published Bible). It turns out that this is because those verses in Daniel in the Russian Orthodox versification are mapped to DAG. Because the TextCorpusEnumerator only guarantees correct ordering per book and the alignment code relies on this, it was causing the project rows to advance all the way to the deuterocanon before aligning any rows.

The only clear way I can see to fix this is to order the rows overall (not just per book), but I realize that this is a hit to efficiency. It was a difference of a second or two with this project. If you're OK with it for now, as we have time, I would like to revisit how to do this more efficiently. One option might be to collect the rows in lists per book and then only sort per book as needed, but that also wouldn't respect the enumerator-type approach. Let me know what you think.

I spent a long time on this since I was puzzled as to why we haven't hit this before, but I think I've come up with a satisfactory answer. We haven't noticed this because:

This cross-book mapping only results in large numbers of missing verses if the section maps to a later book. If it maps to an earlier book, then we just advance past that chunk and continue. So there would be missing verses but not very many likely.
The only standard versifications that have cross-book mappings to later books are the Russian Orthodox versification and the Vulgate. It looks like we have virtually no projects using these versifications. It also looks like there is some self-awareness even in the standard files that this mapping is a bad idea. In fact, the project that prompted this exploration actually does have DAG in a separate book and does not follow the Russian Orthodox versification in this spot, but unfortunately, its custom.vrs does not correctly capture this intent.

I have adjusted the custom.vrs in that project on the bucket and Michael has a good extract to work with for the time-being, but we should try and fix this ASAP.

This change is

ddaspit

One of the design philosophies for the corpus classes in Machine is to avoid loading the entire corpus in memory, so that it can efficiently handle large corpora. If any verse could be ordered before a previously iterated verse without respect to book, then we have to load the entire corpus into memory before sorting. The only other option that avoids loading the entire corpus into memory is to sort in batches and offload those batches to disk. This just isn't worthwhile for corpora the size of Bible translations. I don't think we can avoid loading everything into memory. I would hate to do this for every translation when this is an issue for only a small number of cases. Is there some way we could detect when this is needed by looking at the versification?

Also, it would be good to add a unit test.

@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on Enkidu93).

codecov-commenter · 2026-06-05T14:00:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.56%. Comparing base (2f04902) to head (566fab8).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #309      +/-   ##
==========================================
+ Coverage   91.53%   91.56%   +0.03%     
==========================================
  Files         355      355              
  Lines       23061    23127      +66     
==========================================
+ Hits        21109    21177      +68     
+ Misses       1952     1950       -2

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Enkidu93

@Enkidu93 made 2 comments.
Reviewable status: 1 of 6 files reviewed, all discussions resolved (waiting on ddaspit).

machine/scripture/canon.py line 155 at r3 (raw file):

def book_number_to_id(number: int, error_value: str = "***") -> str:
    index = number - 1
    if index < 0 or index >= len(ALL_BOOK_IDS):

This is a fix for a bug I found where is_canonical(123) was false but is_canonical("LAO") (LAO is book 123) was true which was causing some downstream bugs. The version in C# is already correct.

machine/scripture/verse_ref.py line 879 at r3 (raw file):

        return True

    def all_included_verses(self) -> Generator[VerseRef, None, None]:

I'm keeping this 'public' because I will need this method for the versification error detector in a future PR.

Enkidu93 · 2026-06-08T15:23:58Z

I've gone ahead and implemented the method we discussed. Now the rows will only be sorted across books if there are cross-book mappings (which is most but not all of the time).

I also added tests and fixed a small bug in canon.py.

ddaspit

@ddaspit reviewed 6 files and all commit messages, and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on Enkidu93).

Sort rows across books if needed

70d3f49

Enkidu93 requested a review from ddaspit June 4, 2026 18:26

ddaspit requested changes Jun 4, 2026

View reviewed changes

Only sort books when needed; add test

87a3a46

Check for cross book mappings before sorting rows across books

61331fa

Enkidu93 commented Jun 8, 2026

View reviewed changes

Enkidu93 requested a review from ddaspit June 8, 2026 15:22

Update library version to 1.8.10

566fab8

ddaspit approved these changes Jun 8, 2026

View reviewed changes

Enkidu93 merged commit 25e2fbf into main Jun 8, 2026
14 checks passed

Enkidu93 deleted the fix-corpus-alignment-cross-book-mappings branch June 8, 2026 15:44

Enkidu93 mentioned this pull request Jun 8, 2026

Port 'Sort rows across books when aligning' sillsdev/machine#425

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Sort rows across books when aligning#309

Sort rows across books when aligning#309
Enkidu93 merged 4 commits into
mainfrom
fix-corpus-alignment-cross-book-mappings

Enkidu93 commented Jun 4, 2026 •

edited by ddaspit

Loading

Uh oh!

ddaspit left a comment

Uh oh!

codecov-commenter commented Jun 5, 2026 •

edited

Loading

Uh oh!

Enkidu93 left a comment

Uh oh!

Enkidu93 commented Jun 8, 2026

Uh oh!

ddaspit left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Enkidu93 commented Jun 4, 2026 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 commented Jun 8, 2026

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Enkidu93 commented Jun 4, 2026 •

edited by ddaspit

Loading

codecov-commenter commented Jun 5, 2026 •

edited

Loading