gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080sethmlarson wants to merge 2 commits into
Conversation
…ze() Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
|
Reviewers: Note that there are pending changes from previous reviews. |
serhiy-storchaka
left a comment
There was a problem hiding this comment.
There is a potential for optimization, but in general LGTM. 👍
| Py_ssize_t i, o, osize; | ||
| int kind; | ||
| const void *data; | ||
| int input_kind, result_kind; |
There was a problem hiding this comment.
Why not reuse the same variable?
There was a problem hiding this comment.
IIRC, I asked to have two different variables for readability purposes. We could reuse it but when reading the code, it was cleaner when I saw the separation. But it can be reverted if you insist.
|
Ideas for optimization:
|
tim-one
left a comment
There was a problem hiding this comment.
I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough,, and thank you for your care and patience!
|
@sethmlarson a little ping, there are some suggestions above that need applying. |
|
Most of my suggestions can wait to the following PRs. Although some of them can be easy to implement, so you can include them in this PR if you wish. |
encukou
left a comment
There was a problem hiding this comment.
Converting @maurycy's comments to GitHub suggestions:
|
Following Petr's example, I wrote sethmlarson#1 to apply all open suggestions (except #149080 (comment), which I think is fine as-is, and Serhiy's future performance plans). |
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
|
Sorry for the delay folks, I've merged the changes from @StanFromIreland 🙏 |
Documentation build overview
118 files changed ·
|
Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.
unicodedata.normalize("NFC")canonical ordering #149079