Sort nodeset on demand#330
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adjusts REXML XPath evaluation to return consistently ordered node-sets (via centralized sorting) and updates tests to handle XPath primitive results returned as single-element arrays.
Changes:
- Centralizes ordering by using
XPathParser.sort(...)inmatchand some call sites, and makessorta class method. - Simplifies
step(...)by always de-duplicating via identitySetand removing theaxis_orderparameter. - Updates the Jaxen test helper to unwrap primitive XPath results before calling
REXML::Functions.string.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/test_jaxen.rb | Unwraps primitive XPath results (single-element arrays) to keep Functions.string calls compatible. |
| lib/rexml/xpath_parser.rb | Reworks node-set ordering and de-duplication; changes sort to a class method; removes reverse-axis handling in step. |
| lib/rexml/functions.rb | Sorts node-sets before iterating / stringifying to make results deterministic. |
Comments suppressed due to low confidence (1)
lib/rexml/xpath_parser.rb:1
stepno longer sorts the merged node-set (it used to callsort(...)for the multi-nodeset case). Returningnodes.to_amakes ordering dependent on insertion order throughSet, which can change predicate behavior where ordering is significant (e.g., later[1]filters orposition()), and can introduce non-determinism across Ruby versions/Set behavior. Consider sorting the merged result before returning (and keep de-duplication by identity).
# frozen_string_literal: false
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| XPathParser.sort(node_set.to_a).each do |node| | ||
| result << yield(node) if node.respond_to?(:namespace) | ||
| end |
Refactor `step` so that nodeset materialization is deferred. Instead of building the full nodeset up front and filtering through predicates, each axis returns a *scan descriptor* (`[generator_name, generator_argument]`), and `step` picks a scan strategy based on the predicates' shape. Predicates are classified into three groups: | kind | examples | strategy passed to the generator | |---|---|---| | position-independent | `[@A="1"]`, `[name()="foo"]`, `[@A=@b]` | `:uniq` — emit deduplicated matching nodes | | simple positional | `[N]`, `[position()=N]`, `[position()>N]`, `[position()<N]` | `[op, value]` — positional scan with one comparison | | complex / position-dependent | `[position()*@A]`, `[last()-1]`, ... | `:nodesets` — fall back to per-anchor nodesets + the previous `evaluate_predicate` pipeline | Mixed predicate lists are split: position-independent predicates *before* the first positional predicate are folded into the node test; predicates *after* it are applied per-node on the result. Each axis can implement zero, one, or all of the three strategies. If a strategy is not implemented, the generator falls back to producing `:nodesets` and the common slow path (`non_optimized_nodesets_select`) handles dedup / positional filtering on flattened nodesets — i.e. the same behavior as before this PR. This pull request adds fast paths for: - `descendant` / `descendant-or-self`: `:uniq` (single DFS with a seen-set; this is what speeds up `//a//a//a//a`) - `ancestor` / `ancestor-or-self`: `:uniq` (parent-chain walk with a seen-set) - `preceding-sibling` / `following-sibling`: `:uniq` and `[op, value]` (sibling scan with anchor-index tracking) Other axes (`child`, `parent`, `self`, `attribute`, `preceding`, `following`, etc) keeps the previous behavior via the fallback path; they can be optimized in follow-ups without changing call sites. ## Detail For `//a//a//a` style queries, the previous code built nodesets keyed by each anchor, including the same descendant once per anchor. The new `:uniq` path scans every node at most once per step. For `[position() > N]` style predicates on wide trees (e.g. `//a/preceding-sibling::*[position()>2]`), we previously built the full preceding-sibling nodeset for each anchor and then ran `evaluate_predicate`. The new `[op, value]` path scans children once per parent and uses anchor-index bookkeeping to recover per-anchor positions. Note: general XPath cannot be linear — e.g. `*[position() * number(@A) % number(@b) = 1]` is genuinely O(n²) — so the goal is only to add a fast-path for specific case: position-independent predicates and simple-positional predicates. ## Benchmark of best case ```ruby DEPTH = 500 xml = '<a>' * DEPTH + '</a>' * DEPTH doc = REXML::Document.new(xml) WIDTH = 1000 xml_wide = '<root>' + '<child/>' * WIDTH + '</root>' doc_wide = REXML::Document.new(xml_wide) REXML::XPath.match(doc, "//a//a"); # processing time: 30.756939s → 0.126807s REXML::XPath.match(doc_wide, "//*/preceding-sibling::*[position()=10]"); # processing time: 2.446333s → 0.083954s ``` ## Benchmark of various case ### Scenario ```yaml prelude: | require "rexml" xml_wide = "<root>" + (1..1000).map { |i| "<item id='#{i}'/>" }.join + "</root>" wide = REXML::Document.new(xml_wide) xml_deep = "<root>" + (1..1000).map { |i| "<item id='#{i}'>" }.join + '</item>'*1000 + "</root>" deep = REXML::Document.new(xml_deep) benchmark: child: REXML::XPath.match(wide, "root/item") descendant: REXML::XPath.match(deep, "//item") descendant-descendant: REXML::XPath.match(deep, "//item//item") descendant-descendant-wildcard: REXML::XPath.match(deep, "//*//*") ancestor-descendant: REXML::XPath.match(deep, "descendant::*/ancestor::*/descendant::*") preceding-following-sibling: REXML::XPath.match(wide, "//*/preceding-sibling::*/following-sibling::*") preceding-following-sibling-positional: REXML::XPath.match(wide, "//*/preceding-sibling::*[10]/following-sibling::*[10]") ``` ### Compares master, xpath_step_optimize (this pull), sort_on_demand(#330), sort_improve(Emulate ideal sort computation time), and its combinations. There's no implementation of `sort_improve` yet, so I used the code below to emulate the computational cost of ideal sort. ```ruby def sort(array_of_nodes) # Just spend time to emulate the ideal computational cost of sorting nodes parents = Set.new.compare_by_identity array_of_nodes.each { parents << it.parent if it.parent } 4.times do # find the common ancestor nodes = array_of_nodes seen = Set.new.compare_by_identity while nodes.size >= 2 new_nodes = Set.new.compare_by_identity nodes.map(&:parent).each do |parent| if parent && !seen.include?(parent) seen << parent new_nodes << parent end end nodes = new_nodes end # iterate each node's siblings parents.each{it.children.each{}} end array_of_nodes # not sorted end ``` ### Result ``` Comparison: child master: 1288.1 i/s master_sort_improve: 1190.4 i/s - 1.08x slower xpath_step_optimize_sort_on_demand_sort_improve: 875.3 i/s - 1.47x slower xpath_step_optimize_sort_improve: 861.3 i/s - 1.50x slower xpath_step_optimize_sort_on_demand: 92.3 i/s - 13.96x slower xpath_step_optimize: 91.7 i/s - 14.05x slower sort_on_demand: 90.6 i/s - 14.21x slower descendant master_sort_improve: 75.5 i/s xpath_step_optimize_sort_on_demand_sort_improve: 75.1 i/s - 1.01x slower xpath_step_optimize_sort_improve: 68.8 i/s - 1.10x slower sort_on_demand: 21.4 i/s - 3.52x slower xpath_step_optimize_sort_on_demand: 21.4 i/s - 3.52x slower master: 20.9 i/s - 3.61x slower xpath_step_optimize: 11.7 i/s - 6.45x slower descendant-descendant xpath_step_optimize_sort_on_demand_sort_improve: 47.5 i/s xpath_step_optimize_sort_improve: 41.9 i/s - 1.13x slower xpath_step_optimize_sort_on_demand: 17.9 i/s - 2.65x slower master_sort_improve: 8.6 i/s - 5.54x slower sort_on_demand: 6.7 i/s - 7.07x slower xpath_step_optimize: 6.1 i/s - 7.84x slower master: 4.6 i/s - 10.24x slower descendant-descendant-wildcard xpath_step_optimize_sort_on_demand_sort_improve: 339.5 i/s xpath_step_optimize_sort_improve: 155.9 i/s - 2.18x slower xpath_step_optimize_sort_on_demand: 26.4 i/s - 12.86x slower master_sort_improve: 10.0 i/s - 33.96x slower sort_on_demand: 7.7 i/s - 44.30x slower xpath_step_optimize: 6.8 i/s - 50.06x slower master: 4.9 i/s - 68.58x slower ancestor-descendant xpath_step_optimize_sort_on_demand_sort_improve: 377.9 i/s xpath_step_optimize_sort_improve: 203.7 i/s - 1.85x slower xpath_step_optimize_sort_on_demand: 26.3 i/s - 14.39x slower xpath_step_optimize: 8.7 i/s - 43.24x slower master_sort_improve: 7.8 i/s - 48.55x slower sort_on_demand: 6.3 i/s - 59.93x slower master: 5.0 i/s - 75.46x slower preceding-following-sibling xpath_step_optimize_sort_on_demand_sort_improve: 684.1 i/s xpath_step_optimize_sort_improve: 424.5 i/s - 1.61x slower xpath_step_optimize_sort_on_demand: 85.8 i/s - 7.98x slower xpath_step_optimize: 23.7 i/s - 28.91x slower master_sort_improve: 20.9 i/s - 32.72x slower sort_on_demand: 19.3 i/s - 35.39x slower master: 13.9 i/s - 49.11x slower preceding-following-sibling-positional xpath_step_optimize_sort_on_demand_sort_improve: 425.4 i/s xpath_step_optimize_sort_improve: 315.0 i/s - 1.35x slower xpath_step_optimize_sort_on_demand: 84.3 i/s - 5.05x slower xpath_step_optimize: 23.3 i/s - 18.22x slower master_sort_improve: 2.1 i/s - 201.38x slower sort_on_demand: 2.1 i/s - 204.75x slower master: 1.9 i/s - 222.08x slower ``` In scenario "child" and "descendant", this PR is slower than master because it adds one additional `sort` call. The difference will be small when `sort` is improved. In most case, this PR itself does not unleash its full potential because sort is the next bottleneck. Combining with `sort` improvement is important. The difference of "descendant-descendant" and "descendant-descendant-wildcard" shows that after optimizing sort, the bottleneck will be namespace lookup in qname check for deeply nested xml.
|
@tompng |
| unless matched.all? { |node| node.is_a?(REXML::Node) } | ||
| assert_equal(1, matched.size, 'Primitive value should be a single value') | ||
| matched = matched.first | ||
| end |
|
Resolve done, updated PR description and added some test that fails when |
| doc = Document.new("<a><b>1</b><c>2</c><d>3</d><e/></a>") | ||
| assert_equal(["b", "c", "d"], XPath.match(doc, "//e/preceding-sibling::*").map(&:name)) | ||
| assert_equal(["d"], XPath.match(doc, "//e/preceding-sibling::*[1]").map(&:name)) | ||
| assert_equal(["b"], XPath.match(doc, "(//e/preceding-sibling::*)[1]").map(&:name)) | ||
| assert_equal(["b"], XPath.match(doc, "//e/preceding-sibling::*/self::*[1]").map(&:name)) |
There was a problem hiding this comment.
> doc = Nokogiri::XML("<a><b>1</b><c>2</c><d>3</d><e/></a>")
> nodes = doc.xpath("//e/preceding-sibling::*").map(&:name)
=> ["b", "c", "d"] 🙆♂️
> nodes = doc.xpath("//e/preceding-sibling::*[1]").map(&:name)
=> ["d"] 🙆♂️
> nodes = doc.xpath("(//e/preceding-sibling::*)[1]").map(&:name)
=> ["b"] 🙆♂️
> nodes = doc.xpath("//e/preceding-sibling::*/self::*[1]").map(&:name)
=> ["b", "c", "d"] 🙅♀️ I think this is a potential issue, but the results differ from those of nokogiri.
I think nokogiri's behavior is the correct one—what do you think? @tompng @kou
https://www.w3.org/TR/1999/REC-xpath-19991116/#location-paths
Each node in that set is used as a context node for the following step. The sets of nodes identified by that step are unioned together.
https://www.w3.org/TR/1999/REC-xpath-19991116/#predicates
For each node in the node-set to be filtered, the PredicateExpr is evaluated with that node as the context node, with the number of nodes in the node-set as the context size, and with the proximity position of the node in the node-set with respect to the axis as the context position
There was a problem hiding this comment.
What happens with:
//e/preceding-sibling::*/self::*(//e/preceding-sibling::*/self::*)[1]
There was a problem hiding this comment.
- rexml 3.4.4
> REXML::XPath.match(doc, "//e/preceding-sibling::*/self::*").map(&:name)
=> ["b", "c", "d"]
> REXML::XPath.match(doc, "(//e/preceding-sibling::*/self::*)[1]").map(&:name)
=> ["d"]- master & this PR
> REXML::XPath.match(doc, "//e/preceding-sibling::*/self::*").map(&:name)
=> ["b", "c", "d"]
> REXML::XPath.match(doc, "(//e/preceding-sibling::*/self::*)[1]").map(&:name)
=> ["b"]- nokogiri
> doc = Nokogiri::XML("<a><b>1</b><c>2</c><d>3</d><e/></a>")
> nodes = doc.xpath("//e/preceding-sibling::*/self::*").map(&:name)
=> ["b", "c", "d"]
> nodes = doc.xpath("(//e/preceding-sibling::*/self::*)[1]").map(&:name)
=> ["b"]There was a problem hiding this comment.
Thanks.
//e/preceding-sibling::*/self::*[1] should be evaluated as //e/preceding-sibling::*/(self::*[1]), right?
If so, I also think that Nokogiri's behavior is correct.
There was a problem hiding this comment.
Nice catch! I also think nokogiri's behavior is correct. I reverted the change to :self and removed this assertion, leave the behavior as is in this PR.
I've opened #336 to fix the bug.
There was a problem hiding this comment.
//e/preceding-sibling::*/(self::*[1])
This is an invalid XPath.
https://www.w3.org/TR/1999/REC-xpath-19991116/#node-sets
[3] RelativeLocationPath ::= Step
| RelativeLocationPath '/' Step
| AbbreviatedRelativeLocationPath
The text to the right of / must always be Step.
[4] Step ::= AxisSpecifier NodeTest Predicate*
| AbbreviatedStep
Step does not contain parentheses.
> REXML::XPath.match(doc, "//e/preceding-sibling::*/(self::*[1])").map(&:name)
/Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/rexml-3.4.4/lib/rexml/parsers/xpathparser.rb:29:in 'REXML::Parsers::XPathParser#parse': Garbage component exists at the end: <(self::*[1])>: <//e/preceding-sibling::*/(self::*[1])> (REXML::ParseException)
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/rexml-3.4.4/lib/rexml/xpath_parser.rb:80:in 'REXML::XPathParser#parse'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/rexml-3.4.4/lib/rexml/xpath.rb:67:in 'REXML::XPath.match'
from (irb):4:in '<main>'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/irb-1.18.0/exe/irb:9:in '<top (required)>'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/4.0.0/rubygems.rb:303:in 'Kernel#load'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/4.0.0/rubygems.rb:303:in 'Gem.activate_and_load_bin_path'
from /Users/naitoh/.rbenv/versions/4.0.0/bin/irb:25:in '<main>'> doc = Nokogiri::XML("<a><b>1</b><c>2</c><d>3</d><e/></a>")
> nodes = doc.xpath("//e/preceding-sibling::*/(self::*[1])")
/Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/nokogiri-1.19.3-arm64-darwin/lib/nokogiri/xml/searchable.rb:270:in 'Nokogiri::XML::XPathContext#evaluate': ERROR: Invalid expression: //e/preceding-sibling::*/(self::*[1]) (Nokogiri::XML::XPath::SyntaxError)
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/nokogiri-1.19.3-arm64-darwin/lib/nokogiri/xml/searchable.rb:270:in 'Nokogiri::XML::Searchable#xpath_impl'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/nokogiri-1.19.3-arm64-darwin/lib/nokogiri/xml/searchable.rb:253:in 'Nokogiri::XML::Searchable#xpath_internal'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/nokogiri-1.19.3-arm64-darwin/lib/nokogiri/xml/searchable.rb:182:in 'Nokogiri::XML::Searchable#xpath'
from (irb):4:in '<main>'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/gems/4.0.0/gems/irb-1.18.0/exe/irb:9:in '<top (required)>'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/4.0.0/rubygems.rb:303:in 'Kernel#load'
from /Users/naitoh/.rbenv/versions/4.0.0/lib/ruby/4.0.0/rubygems.rb:303:in 'Gem.activate_and_load_bin_path'
from /Users/naitoh/.rbenv/versions/4.0.0/bin/irb:25:in '<main>'For now, it seems that nokogiri's behavior is the correct one.
Thank you.
Delay sorting, only sort when it is needed.
In most case, sorting nodeset is not needed. Sort is only required in:
Number of sort operations
/a/b/c/d/e(a/b/c/d)[position()>1]/e/f/gnumber(/a/b/c/d/e)count(/a/b/c/d/e)//a//b//c//d//e/a[1]/b[1]/c[1]/d[1]/e#315 removed one
nodesets.size == 1optimization path. This pull request will reduce the performance regression.To reduce more sort calls, we need to mark nodeset ordering: introducing
Nodeset = Struct.new(:nodes, :order)but IMO, it shouldn't be done now. If
sortis optimized, one extra sort won't be a problem. Optimizingstepwill be harder and the code may be complicated.Note
This pull request will slightly add complexity and a risk to forgot sorting the nodeset in some path.
The effect may seem drastic in some case for now, but it's just because
sortis currently worstO(n^2). We can improvesortperformance, so there's an option to leave the sort strategy simple.