Compile language encoder regexes once (Caverphone1/2, MatchRating)#440

Closed
nishantmehta wants to merge 3 commits into apache:master from nishantmehta:pr/precompile-language-regexes
@nishantmehta

Summary

Three phonetic encoders in the language package applied their regular expressions via String.replaceAll, which recompiles its regex argument on every call. A single encode() call therefore compiled between 9 and roughly 25 Patterns, depending on the encoder, and that work was repeated for every input.

This hoists those regexes into static final Pattern constants (compiled once) and applies them with Matcher.replaceAll. The regexes, their order, and the replacements are unchanged, so the produced codes are identical.
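The change can be sketched as follows. This is a minimal illustration, not the actual diff: the class name, method names, and the two regexes are hypothetical stand-ins for the encoders' real rule lists.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; names and regexes are illustrative, not taken from the PR diff.
final class PrecompiledRegexSketch {

    // Compiled once at class initialization. Pattern instances are immutable
    // and safe to share between threads.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-z]");
    private static final Pattern TRAILING_E = Pattern.compile("e$");

    // Before: every String.replaceAll call compiles its regex argument again.
    static String stripBefore(String input) {
        String txt = input.toLowerCase(Locale.ENGLISH);
        txt = txt.replaceAll("[^a-z]", "");
        txt = txt.replaceAll("e$", "");
        return txt;
    }

    // After: same regexes, same order, same replacements, no per-call compilation.
    static String stripAfter(String input) {
        String txt = input.toLowerCase(Locale.ENGLISH);
        txt = NON_LETTERS.matcher(txt).replaceAll("");
        txt = TRAILING_E.matcher(txt).replaceAll("");
        return txt;
    }

    public static void main(String[] args) {
        // Both variants should produce the same string for any input.
        System.out.println(stripBefore("O'Brine").equals(stripAfter("O'Brine")));
    }
}
```

Only the Matcher objects are still allocated per call; the compiled Pattern is reused.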

  • Caverphone2 — ~25 patterns per encode
  • Caverphone1 — 17 patterns per encode
  • MatchRatingApproachEncoder — 9 patterns per encode (cleanName's five punctuation regexes + a whitespace-collapse applied three times + a boundary whitespace-run in removeVowels)

Measurement

Allocation per encode, measured with a ThreadMXBean.getThreadAllocatedBytes driver over 200k warmed operations:

| Encoder | Before (B/op) | After (B/op) | Change |
|---|---|---|---|
| Caverphone2.encode | 19670 | 6304 | −68% |
| Caverphone1.encode | 14005 | 4672 | −67% |
| MatchRatingApproachEncoder.encode | ~5800 | ~1800 | −70% |
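For reference, an allocation driver of this kind might look like the sketch below. It is not the driver used for the numbers above: the class and method names are hypothetical, and it assumes a HotSpot JVM where the management bean can be cast to com.sun.management.ThreadMXBean with thread allocation accounting enabled.

```java
import java.lang.management.ManagementFactory;
import java.util.function.Supplier;

// Hypothetical sketch of a per-thread allocation driver; assumes a HotSpot JVM.
final class AllocationDriverSketch {

    // Returns the average number of bytes allocated by the calling thread per call of op.
    static long bytesPerOp(Supplier<?> op, int warmup, int iterations) {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();

        Object sink = null;
        // Warm up so that JIT compilation and class loading do not skew the count.
        for (int i = 0; i < warmup; i++) {
            sink = op.get();
        }

        long before = bean.getThreadAllocatedBytes(tid);
        for (int i = 0; i < iterations; i++) {
            sink = op.get();
        }
        long after = bean.getThreadAllocatedBytes(tid);

        // Touch the last result so the measured work cannot be discarded as dead code.
        if (sink == null) {
            throw new IllegalStateException("operation returned null");
        }
        return (after - before) / iterations;
    }

    public static void main(String[] args) {
        long perOp = bytesPerOp(() -> "Thompson".replaceAll("e$", ""), 20_000, 200_000);
        System.out.println(perOp + " B/op");
    }
}
```

With commons-codec on the classpath, the supplier could instead wrap an encoder call such as new Caverphone2().encode("Thompson"), run once against each build.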

Testing

Caverphone1Test, Caverphone2Test and MatchRatingApproachEncoderTest pass unchanged.

Three commits, one per encoder, in case you prefer to take them separately.

encode() applied more than twenty regular expressions via String.replaceAll,
each of which compiles its pattern on every call. A single encode therefore
compiled 25 patterns, repeated for every input.

Hoist the patterns into static final Pattern constants and apply them with
Matcher.replaceAll. The regexes and their order are unchanged, so the produced
codes are identical.

Measured with a ThreadMXBean allocation driver (200k warmed ops, encoding
"Thompson"): 19670 B/op -> 6304 B/op (-68%). Caverphone2Test passes
unchanged.

Signed-off-by: Nishant Mehta <nishantmehta.n@gmail.com>

As with Caverphone2, encode() applied seventeen regular expressions via
String.replaceAll, recompiling each pattern on every call.

Hoist them into static final Pattern constants applied with Matcher.replaceAll.
The regexes and their order are unchanged, so the produced codes are identical.

Measured with a ThreadMXBean allocation driver (200k warmed ops, encoding
"Thompson"): 14005 B/op -> 4672 B/op (-67%). Caverphone1Test passes
unchanged.

Signed-off-by: Nishant Mehta <nishantmehta.n@gmail.com>

cleanName() and the right-to-left comparison built and compiled regexes on
every encode: five punctuation-trimming regexes (in a per-call String[]),
a whitespace-collapse regex applied three times, and a boundary
whitespace-run regex in removeVowels(). String.replaceAll recompiles its
regex argument on every call, so a single encode compiled nine Patterns.

Hoist them into static final Pattern constants (the five punctuation
regexes into a Pattern[]) and apply them with Matcher.replaceAll. The
regexes, their order, and the empty/space replacements are unchanged, so
the produced codes are identical.
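
A minimal sketch of the Pattern[] approach, assuming five punctuation regexes of roughly this shape. The class name and constants are hypothetical, accent removal is omitted, and the whitespace step here replaces with the empty string only.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; the regex list is an assumption, not copied from the PR diff.
final class PunctuationPatternsSketch {

    // Compiled once and applied in array order, mirroring a per-call String[] of regexes.
    private static final Pattern[] PUNCTUATION = {
        Pattern.compile("\\-"),
        Pattern.compile("[&]"),
        Pattern.compile("\\'"),
        Pattern.compile("\\."),
        Pattern.compile("[\\,]"),
    };

    // One shared instance covers every place the whitespace regex is applied.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String cleanName(String name) {
        String upper = name.toUpperCase(Locale.ENGLISH);
        for (Pattern p : PUNCTUATION) {
            upper = p.matcher(upper).replaceAll("");
        }
        return WHITESPACE_RUN.matcher(upper).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(cleanName("O'Brien-Smith, Jr."));
    }
}
```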

Measured with ThreadMXBean.getThreadAllocatedBytes (200k warmed iterations):
encode("Smith") 5612 -> 1597 B/op, encode("O'Brien") 6040 -> 1952 B/op,
encode("Thompson") 5872 -> 1784 B/op (about -70%). MatchRatingApproachEncoderTest
(100 tests) passes.

Signed-off-by: Nishant Mehta <nishantmehta.n@gmail.com>
@garydgregory

Member

Closing. No one's asking for this. Fix bugs in Jira if you want to help improve things.
