Handle supplementary code points in StringUtils.splitByCharacterType() by alhudz · Pull Request #1734 · apache/commons-lang

alhudz · 2026-06-28T10:55:11Z

Repro: splitByCharacterType("A" + boldA), where boldA is U+1D400 MATHEMATICAL BOLD CAPITAL A.
Expected: ["A𝐀"], one token, since A and the bold A are both upper-case letters.
Actual: ["A", "𝐀"].
Cause: the shared worker iterates one char at a time and calls Character.getType(char), so each half of a surrogate pair reads as SURROGATE rather than the real category of the code point. Same-type neighbours get split, and in the camelCase path pos - 1 lands inside the pair. splitByCharacterType("5" + boldFive) splits two decimal digits the same way.
Fix: iterate by code point with Character.codePointAt/charCount and classify the whole code point; the camelCase boundary backs up by Character.charCount(Character.codePointBefore(...)). BMP input is unchanged.
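
A minimal sketch of the cause and the fix described above. This is not the merged diff: the class `CodePointSplitSketch` and the helpers `cp` and `split` are hypothetical names for illustration, and the sketch returns a `List<String>` (empty for null input) rather than following the `StringUtils` `String[]` and null conventions. It uses only standard `java.lang.Character` and `String` calls.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplitSketch {

    /** Builds a one-code-point string, e.g. cp(0x1D400) is MATHEMATICAL BOLD CAPITAL A. */
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** Hypothetical stand-in for the shared worker, iterating by code point. */
    static List<String> split(String str, boolean camelCase) {
        List<String> tokens = new ArrayList<>();
        if (str == null || str.isEmpty()) {
            return tokens;
        }
        int tokenStart = 0;
        int first = str.codePointAt(0);
        int currentType = Character.getType(first);
        int pos = Character.charCount(first);
        while (pos < str.length()) {
            int codePoint = str.codePointAt(pos);
            // Classify the whole code point, never a lone surrogate half.
            int type = Character.getType(codePoint);
            if (type != currentType) {
                if (camelCase && type == Character.LOWERCASE_LETTER
                        && currentType == Character.UPPERCASE_LETTER) {
                    // Back up over the previous code point, which may be 1 or 2 chars wide.
                    int newTokenStart = pos - Character.charCount(str.codePointBefore(pos));
                    if (newTokenStart != tokenStart) {
                        tokens.add(str.substring(tokenStart, newTokenStart));
                        tokenStart = newTokenStart;
                    }
                } else {
                    tokens.add(str.substring(tokenStart, pos));
                    tokenStart = pos;
                }
                currentType = type;
            }
            pos += Character.charCount(codePoint);
        }
        tokens.add(str.substring(tokenStart));
        return tokens;
    }

    public static void main(String[] args) {
        String boldA = cp(0x1D400);
        // Per-char classification sees a surrogate half; per-code-point sees the letter.
        System.out.println(Character.getType(boldA.charAt(0)) == Character.SURROGATE);              // true
        System.out.println(Character.getType(boldA.codePointAt(0)) == Character.UPPERCASE_LETTER);  // true
        System.out.println(split("A" + boldA, false).size());                                       // 1
        System.out.println(split("foo" + boldA + boldA + "bar", true).size());                      // 3
    }
}
```

In the camelCase case, `"foo" + boldA + boldA + "bar"` comes out as `foo`, one bold A, then the second bold A followed by `bar`, because the boundary backs up by a full code point instead of by `pos - 1`.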

handle supplementary code points in splitByCharacterType

17eeefa

garydgregory changed the title ~~handle supplementary code points in splitByCharacterType~~ Handle supplementary code points in StringUtils.splitByCharacterType() Jun 28, 2026

garydgregory merged commit 37a2be8 into apache:master Jun 28, 2026
20 of 21 checks passed

garydgregory added a commit that referenced this pull request Jun 28, 2026

Handle supplementary code points in StringUtils.splitByCharacterType()

98bc7e5

(#1734).

garydgregory added a commit that referenced this pull request Jun 28, 2026

Handle supplementary code points in StringUtils.splitByCharacterType()

7b11dcf

(#1734). Javadoc

garydgregory merged 1 commit into apache:master from alhudz:splitbycharactertype-codepoint
