Compile language encoder regexes once (Caverphone1/2, MatchRating) #440
Closed
nishantmehta wants to merge 3 commits into
Caverphone2.encode() applied more than twenty regular expressions via String.replaceAll, which compiles its pattern on every call. A single encode therefore compiled 25 patterns, repeated for every input.

Hoist the patterns into static final Pattern constants and apply them with Matcher.replaceAll. The regexes and their order are unchanged, so the produced codes are identical.

Measured with a ThreadMXBean allocation driver (200k warmed ops, encoding "Thompson"): 19670 B/op -> 6304 B/op (-68%). Caverphone2Test passes unchanged.

Signed-off-by: Nishant Mehta <nishantmehta.n@gmail.com>
As with Caverphone2, Caverphone1.encode() applied seventeen regular expressions via String.replaceAll, recompiling each pattern on every call.

Hoist them into static final Pattern constants applied with Matcher.replaceAll. The regexes and their order are unchanged, so the produced codes are identical.

Measured with a ThreadMXBean allocation driver (200k warmed ops, encoding "Thompson"): 14005 B/op -> 4672 B/op (-67%). Caverphone1Test passes unchanged.

Signed-off-by: Nishant Mehta <nishantmehta.n@gmail.com>
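The hoisting described in the two commits above can be sketched as follows. This is a minimal illustration, not the actual Caverphone source: the class name, method names, and the two regexes are assumptions chosen for the example. It relies only on the documented fact that String.replaceAll(regex, repl) behaves like Pattern.compile(regex).matcher(str).replaceAll(repl).

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; names and regexes are illustrative, not the encoder's own.
public class HoistedPatternSketch {

    // Compiled once, when the class is initialized.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-z]");
    private static final Pattern TRAILING_E = Pattern.compile("e$");

    // Before: each String.replaceAll call compiles its regex again.
    static String before(String input) {
        String txt = input.toLowerCase(Locale.ENGLISH);
        txt = txt.replaceAll("[^a-z]", "");
        txt = txt.replaceAll("e$", "");
        return txt;
    }

    // After: same regexes, same order, same replacements, no per-call compile.
    static String after(String input) {
        String txt = input.toLowerCase(Locale.ENGLISH);
        txt = NON_LETTERS.matcher(txt).replaceAll("");
        txt = TRAILING_E.matcher(txt).replaceAll("");
        return txt;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Thompson", "O'Brie", "Smith"}) {
            System.out.println(name + ": " + before(name).equals(after(name)));
        }
    }
}
```

Because only the point of compilation moves, the two methods should agree on every input, which is the property the unchanged test suites check for the real encoders.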
cleanName() and the right-to-left comparison built and compiled regexes on
every encode: five punctuation-trimming regexes (in a per-call String[]),
a whitespace-collapse regex applied three times, and a boundary
whitespace-run regex in removeVowels(). String.replaceAll recompiles its
regex argument on every call, so a single encode compiled nine Patterns.
Hoist them into static final Pattern constants (the five punctuation
regexes into a Pattern[]) and apply them with Matcher.replaceAll. The
regexes, their order, and the empty/space replacements are unchanged, so
the produced codes are identical.
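
A rough sketch of the Pattern[] variant, assuming a cleanName-style helper that strips punctuation and then whitespace. The class name, the exact regexes, and the simplified method body are assumptions for illustration and may not match MatchRatingApproachEncoder.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; regexes and structure are illustrative only.
public class PatternArraySketch {

    // Replaces a per-call String[] of regexes with an array compiled once.
    private static final Pattern[] PUNCTUATION = {
        Pattern.compile("\\-"),
        Pattern.compile("[&]"),
        Pattern.compile("\\'"),
        Pattern.compile("\\."),
        Pattern.compile("[\\,]")
    };

    // Shared whitespace pattern, also compiled once.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String cleanName(String name) {
        String upper = name.toUpperCase(Locale.ENGLISH);
        // Apply the patterns in array order, mirroring the original loop order.
        for (Pattern p : PUNCTUATION) {
            upper = p.matcher(upper).replaceAll("");
        }
        return WHITESPACE.matcher(upper).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(cleanName("O'Brien"));
        System.out.println(cleanName("Smith-Jones & Co."));
    }
}
```

Keeping the patterns in an array preserves the iteration order of the original String[], so the sequence of replacements does not change.
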
Measured with ThreadMXBean.getThreadAllocatedBytes (200k warmed iters):
encode("Smith") 5612 -> 1597 B/op, encode("O'Brien") 6040 -> 1952,
encode("Thompson") 5872 -> 1784 (about -70%). MatchRatingApproachEncoderTest
(100 tests) passes.
Signed-off-by: Nishant Mehta <nishantmehta.n@gmail.com>
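The actual measurement driver is not included in the PR. A minimal sketch of how such a driver might look, assuming a HotSpot JVM where the platform ThreadMXBean can be cast to com.sun.management.ThreadMXBean; the class name, the bytesPerOp helper, and the sample regex are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical allocation driver; not the one used for the numbers above.
public class AllocationDriverSketch {

    private static final Pattern NON_LETTERS = Pattern.compile("[^A-Za-z]");

    // Accumulates results so the JIT cannot discard the measured work.
    static long sink;

    static long bytesPerOp(Supplier<String> op, int warmup, int iters) {
        // HotSpot-specific extension of java.lang.management.ThreadMXBean.
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();
        for (int i = 0; i < warmup; i++) {
            sink += op.get().length();
        }
        long start = bean.getThreadAllocatedBytes(tid);
        for (int i = 0; i < iters; i++) {
            sink += op.get().length();
        }
        long end = bean.getThreadAllocatedBytes(tid);
        return (end - start) / iters;
    }

    public static void main(String[] args) {
        long recompiled = bytesPerOp(
                () -> "O'Brien".replaceAll("[^A-Za-z]", ""), 200_000, 200_000);
        long hoisted = bytesPerOp(
                () -> NON_LETTERS.matcher("O'Brien").replaceAll(""), 200_000, 200_000);
        System.out.println("recompiled: " + recompiled + " B/op");
        System.out.println("hoisted:    " + hoisted + " B/op");
    }
}
```

Absolute figures depend on the JVM version and flags, so numbers from a sketch like this are only comparable within a single run.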
Member
Closing. No one's asking for this. Fix bugs in Jira if you want to help improve things.
Summary

Three phonetic encoders in the language package applied their regular expressions via String.replaceAll, which recompiles its regex argument on every call. A single encode therefore compiled many Patterns, repeated for every input.

This hoists those regexes into static final Pattern constants (compiled once) and applies them with Matcher.replaceAll. The regexes, their order, and the replacements are unchanged, so the produced codes are identical.

- Caverphone2: 25 patterns per encode
- Caverphone1: 17 patterns per encode
- MatchRatingApproachEncoder: 9 patterns per encode (cleanName's five punctuation regexes + a whitespace-collapse applied three times + a boundary whitespace-run in removeVowels)

Measurement

ThreadMXBean allocation driver, 200k warmed ops:

| Encoder | Input | Before (B/op) | After (B/op) | Change |
| --- | --- | --- | --- | --- |
| Caverphone2.encode | "Thompson" | 19670 | 6304 | -68% |
| Caverphone1.encode | "Thompson" | 14005 | 4672 | -67% |
| MatchRatingApproachEncoder.encode | "Thompson" | 5872 | 1784 | about -70% |

Testing

Caverphone1Test, Caverphone2Test and MatchRatingApproachEncoderTest pass unchanged.

Three commits, one per encoder, in case you prefer to take them separately.