Handle supplementary code points in StringUtils.splitByCharacterType() #1734

Merged
garydgregory merged 1 commit into apache:master from alhudz:splitbycharactertype-codepoint on Jun 28, 2026

@alhudz (Contributor) commented Jun 28, 2026

Repro: splitByCharacterType("A" + boldA), where boldA is U+1D400 MATHEMATICAL BOLD CAPITAL A.
Expected: ["A𝐀"], one token, since A and the bold A are both upper-case letters.
Actual: ["A", "𝐀"].
Cause: the shared worker iterates one char at a time and calls Character.getType(char), so each half of a surrogate pair reads as SURROGATE rather than the real category of the code point. Same-type neighbours get split, and in the camelCase path pos - 1 lands inside the pair. splitByCharacterType("5" + boldFive) splits two decimal digits the same way.
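
The misclassification described above can be reproduced in isolation. The following is a minimal sketch, not code from the PR; the class and method names (`SurrogateTypeDemo`, `typeByChar`, `typeByCodePoint`) are hypothetical and only the `java.lang.Character` and `String` calls are real JDK API.

```java
// Illustrative only: class and method names are assumptions, not part of Commons Lang.
final class SurrogateTypeDemo {

    /** Category a char-at-a-time loop sees for the UTF-16 unit at index i. */
    static int typeByChar(String s, int i) {
        return Character.getType(s.charAt(i));
    }

    /** Category of the whole code point that starts at index i. */
    static int typeByCodePoint(String s, int i) {
        return Character.getType(s.codePointAt(i));
    }

    public static void main(String[] args) {
        // U+1D400 MATHEMATICAL BOLD CAPITAL A, built without relying on source encoding.
        String boldA = new String(Character.toChars(0x1D400));
        String s = "A" + boldA; // one BMP char followed by a surrogate pair

        // Per-char view: the high and low surrogates both report SURROGATE.
        System.out.println(typeByChar(s, 1) == Character.SURROGATE);
        System.out.println(typeByChar(s, 2) == Character.SURROGATE);

        // Per-code-point view: the pair reports the same category as 'A'.
        System.out.println(typeByCodePoint(s, 1) == typeByCodePoint(s, 0));
    }
}
```

Because the per-char category of index 1 differs from that of index 0, a loop that compares neighbouring `char` categories sees a type change between `A` and the bold A and emits a token boundary there.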
Fix: iterate by code point with Character.codePointAt/charCount and classify the whole code point; the camelCase boundary backs up by Character.charCount(Character.codePointBefore(...)). BMP input is unchanged.
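
A rough sketch of the approach described in the fix, for readers who want to see the shape of the loop. This is not the merged commit: the class name `CodePointSplitSketch`, the method `split`, and the `List<String>` return type are assumptions made for illustration (the real method returns `String[]` and returns `null` for `null` input).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and return type are assumptions, not the Commons Lang code.
final class CodePointSplitSketch {

    static List<String> split(String str, boolean camelCase) {
        List<String> tokens = new ArrayList<>();
        if (str == null || str.isEmpty()) {
            return tokens; // simplification: the real API distinguishes null from empty
        }
        int tokenStart = 0;
        int first = str.codePointAt(0);
        int currentType = Character.getType(first);
        int pos = Character.charCount(first);
        while (pos < str.length()) {
            int cp = str.codePointAt(pos);       // whole code point, not one char
            int type = Character.getType(cp);    // classify the code point
            if (type != currentType) {
                if (camelCase && type == Character.LOWERCASE_LETTER
                        && currentType == Character.UPPERCASE_LETTER) {
                    // Back up over the previous code point, which may be 1 or 2 chars wide.
                    int newTokenStart = pos - Character.charCount(str.codePointBefore(pos));
                    if (newTokenStart != tokenStart) {
                        tokens.add(str.substring(tokenStart, newTokenStart));
                        tokenStart = newTokenStart;
                    }
                } else {
                    tokens.add(str.substring(tokenStart, pos));
                    tokenStart = pos;
                }
                currentType = type;
            }
            pos += Character.charCount(cp);      // advance by 1 or 2 UTF-16 units
        }
        tokens.add(str.substring(tokenStart));
        return tokens;
    }

    public static void main(String[] args) {
        String boldA = new String(Character.toChars(0x1D400));
        System.out.println(split("A" + boldA, false).size()); // 1
        System.out.println(split("fooBar", true));            // [foo, Bar]
    }
}
```

For BMP-only input `Character.charCount` is always 1, so the loop visits the same indices as a char-based one, which is consistent with the PR's statement that BMP input is unchanged.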

@garydgregory garydgregory changed the title handle supplementary code points in splitByCharacterType Handle supplementary code points in StringUtils.splitByCharacterType() Jun 28, 2026
@garydgregory garydgregory merged commit 37a2be8 into apache:master Jun 28, 2026
20 of 21 checks passed