gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize() by sethmlarson · Pull Request #149080 · python/cpython

sethmlarson · 2026-04-27T21:38:23Z

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings that contain many combining characters with a large number of inversions in combining class order.
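For readers who want to see the shape of the hybrid approach described above, here is a minimal Python sketch. It is not the C code from this PR: the function names (`canonical_order`, `_insertion_sort`, `_counting_sort`) and the constant `SHORT_RUN` are made up for illustration, and only the threshold of 20 and the choice of the two sorts come from the description. It reorders runs of combining characters by canonical combining class (CCC) and does no decomposition.

```python
import unicodedata

SHORT_RUN = 20  # threshold taken from the PR description; the name is hypothetical


def _insertion_sort(run):
    """Stable insertion sort of a list of characters by combining class."""
    for i in range(1, len(run)):
        ch = run[i]
        ccc = unicodedata.combining(ch)
        j = i - 1
        # Strict '>' keeps characters of equal class in their original order.
        while j >= 0 and unicodedata.combining(run[j]) > ccc:
            run[j + 1] = run[j]
            j -= 1
        run[j + 1] = ch


def _counting_sort(run):
    """Stable counting sort; combining classes fit in 0..255."""
    starts = [0] * 256
    for ch in run:
        starts[unicodedata.combining(ch)] += 1
    # Turn per-class counts into starting offsets (exclusive prefix sums).
    total = 0
    for ccc in range(256):
        starts[ccc], total = total, total + starts[ccc]
    out = [None] * len(run)
    for ch in run:
        ccc = unicodedata.combining(ch)
        out[starts[ccc]] = ch
        starts[ccc] += 1
    run[:] = out


def canonical_order(text):
    """Reorder each run of combining characters (CCC != 0) by CCC."""
    chars = list(text)
    result = []
    i, n = 0, len(chars)
    while i < n:
        if unicodedata.combining(chars[i]) == 0:
            result.append(chars[i])  # starters are never moved
            i += 1
            continue
        j = i
        while j < n and unicodedata.combining(chars[j]) != 0:
            j += 1
        run = chars[i:j]
        if len(run) < SHORT_RUN:
            _insertion_sort(run)
        else:
            _counting_sort(run)
        result.extend(run)
        i = j
    return "".join(result)
```

Both sorts are stable, which canonical ordering requires: characters with the same combining class must keep their relative order. The counting sort makes one pass to count, one over the 256 classes, and one to place, so its cost grows linearly with the run length.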

Issue: O(n²) insertion sort in unicodedata.normalize("NFC") canonical ordering #149079

…ze() Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
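As an illustration of what "alternating CCC order" can look like, the following Python sketch builds such a string and checks how `unicodedata.normalize()` reorders it. The helper names are hypothetical, and the choice of U+0301 (combining class 230) and U+0323 (combining class 220) is an assumption made for this example, not the input used in the PR's tests.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def alternating_ccc_input(pairs):
    """Return 'a' followed by `pairs` repetitions of (class 230, class 220)."""
    return "a" + (ACUTE + DOT_BELOW) * pairs


def count_inversions(text):
    """Count pairs (i < j) whose combining classes are out of order.

    Quadratic on purpose: this mirrors the amount of work an insertion
    sort does on the run of combining characters.
    """
    classes = [unicodedata.combining(ch) for ch in text if unicodedata.combining(ch)]
    return sum(
        1
        for i in range(len(classes))
        for j in range(i + 1, len(classes))
        if classes[i] > classes[j]
    )


def expected_nfd(pairs):
    """Canonical order puts every class-220 mark before every class-230 mark."""
    return "a" + DOT_BELOW * pairs + ACUTE * pairs
```

The number of inversions grows quadratically with the number of pairs, which is why a sort whose cost depends on inversions is a concern for long runs. NFD is used for the check because NFC would additionally compose the base letter with some of the marks.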

StanFromIreland · 2026-04-27T21:46:49Z

Reviewers: Note that there are pending changes from previous reviews.

serhiy-storchaka

There is a potential for optimization, but in general LGTM. 👍

serhiy-storchaka · 2026-04-28T10:43:51Z

    Py_ssize_t i, o, osize;
-    int kind;
-    const void *data;
+    int input_kind, result_kind;


Why not reuse the same variable?

IIRC, I asked for two different variables for readability. We could reuse a single one, but the code read more cleanly to me with the separation. It can be reverted if you insist.

serhiy-storchaka · 2026-04-28T10:59:57Z

Ideas for optimization:

We already have the Py_UCS4 output buffer. It is better to sort it, without using more costly PyUnicode_READ and PyUnicode_WRITE.
It is perhaps possible to combine sorting routines with the code that determines the length. This will reduce the number of costly _getrecord_ex() calls but requires heavy rewriting.
Since Unicode characters only need 21 of the 32 bits, each one can be stored together with its 8-bit combining class in the temporary buffer, reducing the number of costly _getrecord_ex() calls. But this will make the code more difficult to read.
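The last idea can be sketched in Python as follows. This only illustrates the packing arithmetic and is not code from the PR: the names `pack`, `unpack_char`, `unpack_ccc`, and `sort_run_packed` are hypothetical, and `unicodedata.combining()` stands in for the C-level record lookup.

```python
import unicodedata

CODEPOINT_BITS = 21                         # 0x10FFFF < 2**21
CODEPOINT_MASK = (1 << CODEPOINT_BITS) - 1


def pack(ch):
    """Store the combining class above the code point in one integer.

    21 bits of code point plus 8 bits of class is 29 bits, so the
    value fits in a 32-bit slot.
    """
    return (unicodedata.combining(ch) << CODEPOINT_BITS) | ord(ch)


def unpack_char(packed):
    return chr(packed & CODEPOINT_MASK)


def unpack_ccc(packed):
    return packed >> CODEPOINT_BITS


def sort_run_packed(run):
    """Sort a run of combining characters, looking up each class only once."""
    packed = [pack(ch) for ch in run]
    # Sort on the class bits only. Comparing whole packed values would
    # also order by code point within a class, which breaks stability.
    packed.sort(key=unpack_ccc)
    return "".join(unpack_char(p) for p in packed)
```

The class of each character is looked up once, when it is packed, and every later comparison reads it back with a shift instead of another lookup.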

tim-one

I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough, and thank you for your care and patience!

StanFromIreland · 2026-05-13T06:54:40Z

@sethmlarson a little ping, there are some suggestions above that need applying.

serhiy-storchaka · 2026-05-13T08:58:20Z

Most of my suggestions can wait for follow-up PRs. Some of them may be easy to implement, though, so you can include them in this PR if you wish.

encukou

Converting @maurycy's comments to GitHub suggestions:

StanFromIreland · 2026-05-14T19:50:14Z

Following Petr's example, I wrote sethmlarson#1 to apply all open suggestions (except #149080 (comment), which I think is fine as-is, and Serhiy's future performance plans).

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>

sethmlarson · 2026-05-26T21:29:14Z

Sorry for the delay folks, I've merged the changes from @StanFromIreland 🙏

read-the-docs-community · 2026-05-26T21:31:54Z

Documentation build overview

📚 cpython-previews | 🛠️ Build #32860334 | 📁 Comparing 4a29545 against main (776573c)

🔍 Preview build

118 files changed · ± 114 modified · - 4 deleted

sethmlarson requested review from malemburg and tim-one April 27, 2026 21:38

bedevere-app Bot added the awaiting review label Apr 27, 2026

sethmlarson added the type-security and topic-unicode labels and removed the awaiting review label Apr 27, 2026

bedevere-app Bot mentioned this pull request Apr 27, 2026

O(n²) insertion sort in unicodedata.normalize("NFC") canonical ordering #149079

Open

maurycy reviewed Apr 27, 2026

Comment thread Lib/test/test_unicodedata.py Outdated

Comment thread Lib/test/test_unicodedata.py Outdated

Comment thread Modules/unicodedata.c Outdated

serhiy-storchaka self-requested a review April 27, 2026 22:16

serhiy-storchaka approved these changes Apr 28, 2026

bedevere-app Bot added the awaiting merge label Apr 28, 2026

picnixz reviewed Apr 28, 2026

Comment thread Modules/unicodedata.c

picnixz reviewed Apr 28, 2026

Comment thread Modules/unicodedata.c Outdated

picnixz reviewed Apr 28, 2026

Comment thread Misc/NEWS.d/next/Security/2026-04-27-16-36-11.gh-issue-149079.vKl-LM.rst Outdated

tim-one approved these changes Apr 30, 2026

encukou reviewed May 13, 2026

Comment thread Lib/test/test_unicodedata.py Outdated

Comment thread Lib/test/test_unicodedata.py Outdated

Comment thread Modules/unicodedata.c Outdated

Apply all review suggestions

4a29545

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>

sethmlarson requested a review from encukou May 26, 2026 21:29

serhiy-storchaka approved these changes May 27, 2026


8 participants
