Skip to content

Sort rows across books when aligning#309

Merged
Enkidu93 merged 4 commits into
mainfrom
fix-corpus-alignment-cross-book-mappings
Jun 8, 2026
Merged

Sort rows across books when aligning#309
Enkidu93 merged 4 commits into
mainfrom
fix-corpus-alignment-cross-book-mappings

Conversation

@Enkidu93

@Enkidu93 Enkidu93 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

I encountered this while helping Michael with another incorrectly extracted project (a Georgian translation that uses the Russian Orthodox versification). Although the new machine.py code extracted the project with the correct number of lines (previously it had too many), Michael noticed it was missing all content after a certain verse in Daniel (it was a full, published Bible). It turns out that this is because those verses in Daniel in the Russian Orthodox versification are mapped to DAG. Because the TextCorpusEnumerator only guarantees correct ordering per book and the alignment code relies on this, it was causing the project rows to advance all the way to the deuterocanon before aligning any rows.

The only clear way I can see to fix this is to order the rows overall (not just per book), but I realize that this is a hit to efficiency. It was a difference of a second or two with this project. If you're OK with it for now, as we have time, I would like to revisit how to do this more efficiently. One option might be to collect the rows in lists per book and then only sort per book as needed, but that also wouldn't respect the enumerator-type approach. Let me know what you think.

I spent a long time on this since I was puzzled as to why we haven't hit this before, but I think I've come up with a satisfactory answer. We haven't noticed this because:

  1. This cross-book mapping only results in large numbers of missing verses if the section maps to a later book. If it maps to an earlier book, then we just advance past that chunk and continue. So there would be missing verses but not very many likely.
  2. The only standard versifications that have cross-book mappings to later books are the Russian Orthodox versification and the Vulgate. It looks like we have virtually no projects using these versifications. It also looks like there is some self-awareness even in the standard files that this mapping is a bad idea. In fact, the project that prompted this exploration actually does have DAG in a separate book and does not follow the Russian Orthodox versification in this spot, but unfortunately, its custom.vrs does not correctly capture this intent.

I have adjusted the custom.vrs in that project on the bucket and Michael has a good extract to work with for the time-being, but we should try and fix this ASAP.


This change is Reviewable

@Enkidu93 Enkidu93 requested a review from ddaspit June 4, 2026 18:26

@ddaspit ddaspit left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the design philosophies for the corpus classes in Machine is to avoid loading the entire corpus in memory, so that it can efficiently handle large corpora. If any verse could be ordered before a previously iterated verse without respect to book, then we have to load the entire corpus into memory before sorting. The only other option that avoids loading the entire corpus into memory is to sort in batches and offload those batches to disk. This just isn't worthwhile for corpora the size of Bible translations. I don't think we can avoid loading everything into memory. I would hate to do this for every translation when this is an issue for only a small number of cases. Is there some way we could detect when this is needed by looking at the versification?

Also, it would be good to add a unit test.

@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on Enkidu93).

@codecov-commenter

codecov-commenter commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.56%. Comparing base (2f04902) to head (566fab8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #309      +/-   ##
==========================================
+ Coverage   91.53%   91.56%   +0.03%     
==========================================
  Files         355      355              
  Lines       23061    23127      +66     
==========================================
+ Hits        21109    21177      +68     
+ Misses       1952     1950       -2     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Enkidu93 Enkidu93 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 made 2 comments.
Reviewable status: 1 of 6 files reviewed, all discussions resolved (waiting on ddaspit).


machine/scripture/canon.py line 155 at r3 (raw file):

def book_number_to_id(number: int, error_value: str = "***") -> str:
    index = number - 1
    if index < 0 or index >= len(ALL_BOOK_IDS):

This is a fix for a bug I found where is_canonical(123) was false but is_canonical("LAO") (LAO is book 123) was true which was causing some downstream bugs. The version in C# is already correct.


machine/scripture/verse_ref.py line 879 at r3 (raw file):

        return True

    def all_included_verses(self) -> Generator[VerseRef, None, None]:

I'm keeping this 'public' because I will need this method for the versification error detector in a future PR.

@Enkidu93 Enkidu93 requested a review from ddaspit June 8, 2026 15:22
@Enkidu93

Enkidu93 commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

I've gone ahead and implemented the method we discussed. Now the rows will only be sorted across books if there are cross-book mappings (which is most but not all of the time).

I also added tests and fixed a small bug in canon.py.

@ddaspit ddaspit left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit reviewed 6 files and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on Enkidu93).

@Enkidu93 Enkidu93 merged commit 25e2fbf into main Jun 8, 2026
14 checks passed
@Enkidu93 Enkidu93 deleted the fix-corpus-alignment-cross-book-mappings branch June 8, 2026 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants