Optimize id()/start_id()/end_id() with direct column access#2295
Optimize id()/start_id()/end_id() with direct column access#2295jrgemignani wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes the execution of id(), start_id(), and end_id() functions by avoiding full vertex/edge object reconstruction. Instead of calling these functions on rebuilt _agtype_build_vertex/_agtype_build_edge expressions, the implementation directly accesses the underlying graphid columns and wraps them with graphid_to_agtype().
Key Changes:
- Added hidden column export mechanism that stores id, start_id, and end_id as target entries when entities pass between clauses
- Implemented cross-clause and current-clause optimization patterns in
try_optimize_id_funcs() - Added validation logic to prevent stale Var references after WITH clauses
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/include/parser/cypher_transform_entity.h | Added id_var, start_id_var, end_id_var, and props_var fields to transform_entity struct to store exposed column Vars |
| src/backend/parser/cypher_transform_entity.c | Initialized the new Var fields to NULL in make_transform_entity |
| src/backend/parser/cypher_expr.c | Implemented optimization logic with try_optimize_id_funcs, extract_id_var_from_entity_expr, and find_entity_in_current_cpstate; includes unrelated whitespace changes |
| src/backend/parser/cypher_clause.c | Added export_entity_hidden_columns to export id columns as hidden target entries and update_entity_vars_from_rte to update entity Vars from RTE columns |
| regress/sql/cypher_match.sql | Added 13 test cases covering WHERE clause optimizations for both current-clause and cross-clause scenarios |
| regress/sql/cypher_with.sql | Added 12 test cases verifying id functions work correctly through WITH clauses, including chained WITH and aliasing |
| regress/expected/cypher_match.out | Updated expected output showing optimized query plans using graphid_to_agtype instead of age_id |
| regress/expected/cypher_with.out | Updated expected output with correct results for WITH clause id function tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
eac9cf1 to
3ffb136
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Optimize id()/start_id()/end_id() with direct graphid column access. NOTE: This PR is created with AI tools and a human. Add optimization to avoid rebuilding full vertex/edge objects when only the id, start_id, or end_id is needed. Instead of calling age_id() on a reconstructed _agtype_build_vertex/_agtype_build_edge, directly access the underlying graphid column. This enables PostgreSQL to use B-tree indexes on graphid columns for significantly improved query performance. Implementation: - Export hidden columns as raw graphid type (not agtype-wrapped) when entities pass between clauses via export_entity_hidden_columns() - Store Var references (id_var, start_id_var, end_id_var) in transform_entity - try_optimize_id_funcs() returns raw graphid Var for optimizable patterns - Add cross-type comparison operators (graphid vs int8) with btree support - Update graphid_ops operator class to include cross-type operators, enabling Index Cond scans instead of post-scan Filter evaluation - Add graphid support to agtype_volatile_wrapper() for SET/MERGE contexts Optimized patterns: MATCH (p) WHERE id(p) > 0 RETURN p MATCH ()-[e]->() WHERE id(e) > 0 RETURN e MATCH ()-[e]->() WHERE start_id(e) > 0 RETURN e MATCH ()-[e]->() WHERE end_id(e) > 0 RETURN e All regression tests passed. Added regression tests for the new graphid operators. Files changed: - src/include/parser/cypher_transform_entity.h: Add id_var, start_id_var, end_id_var fields to transform_entity struct - src/backend/parser/cypher_clause.c: Export raw graphid columns - src/backend/parser/cypher_expr.c: Add try_optimize_id_funcs(), extract_id_var_from_entity_expr(); return raw graphid Var - src/backend/utils/adt/graphid.c: Add 12 cross-type comparison functions (graphid_eq_int8, etc.) and 2 btree comparison functions (graphid_btree_cmp_int8, int8_btree_cmp_graphid) - src/backend/utils/adt/agtype.c: Add GRAPHIDOID to agtype_volatile_wrapper() - sql/age_main.sql: Add cross-type operators, functions, and update graphid_ops operator class with cross-type btree support - age--1.6.0--y.y.y.sql: Add upgrade definitions for new functions, operators, and operator class - regress/sql/graphid.sql: Add cross-type operator and index tests - regress/sql/cypher_match.sql: Add WHERE optimization tests - regress/expected/*.out: Update expected output
3ffb136 to
ae06570
Compare
gregfelice
left a comment
There was a problem hiding this comment.
Reviewed with a focus on the two things that could actually go wrong here: cross-type btree index correctness and the safety of the id()/start_id()/end_id() rewrite. Both hold up — nice work. Details and a few findings below.
Cross-type btree is safe. My main concern with a new cross-type opclass is a signedness/ordering inconsistency that could produce wrong index results. But graphid is typedef int64 (and F_GRAPHIDEQ == F_INT8EQ), so graphid vs int8 is literally int64 vs int64. graphid_btree_cmp_int8 / int8_btree_cmp_graphid do plain signed compares returning -1/0/1 — a consistent total order identical to int8 and to id()'s existing agtype-integer semantics. The graphid_ops wiring is well-formed: same-type strategies 1-5, cross-type (graphid,int8) strategies 1-5, and support FUNCTION 1 for both (graphid,graphid) and (graphid,int8). So the indexed path returns the same answers as the old reconstruct-and-compare path.
Rewrite guards are appropriately conservative. try_optimize_id_funcs only fires on a bare single-part ColumnRef argument, looks the entity up only in the current cpstate (avoiding invalid varnos inside EXISTS/subqueries), and explicitly bails in EXPR_KIND_INSERT_TARGET (MERGE/CREATE). Null semantics are preserved too: an unmatched OPTIONAL MATCH entity yields a NULL graphid column, so WHERE id(p) > 0 filters it the same as id(null) > 0 -> null.
Findings (all minor / non-correctness):
-
Cross-clause propagation gap (also the subject of one of the Copilot notes).
export_entity_hidden_columnsskips entities not declared in the current clause (if (!entity->declared_in_current_clause) continue;), yet the doc comment says it re-exports previous-clause entities via their Vars. For chains of 3+ clauses theid_varlikely won't propagate past the second boundary, so the optimization silently falls back to the slow path. Results stay correct — it's a missed optimization plus an inaccurate comment. Worth confirming whether multi-hop propagation is intended, and fixing the comment either way. -
Memory leak (Copilot):
col_namefrompsprintfis reassigned three times withoutpfree. Per-query context so it's bounded, but worth tidying. -
Minor (Copilot): the misplaced/duplicated doc block before
find_entity_in_current_cpstate, andexported_namesappendingentity_namewithoutpstrdup(safe only while those pointers are stable). -
Possible opfamily asymmetry: I don't see the reverse
(int8, graphid)operators /int8_btree_cmp_graphidadded tographid_ops, soWHERE <int8col> OP id(p)may not get an index scan. The commutator covers the common0 < id(p)form, so this is a missed-opt at most — just flagging in case both directions were intended.
Mechanical blockers before merge: the PR is currently CONFLICTING with master and needs a rebase (it dates to Dec 2025 and the parser/agtype areas it touches have moved quite a bit since), and CI is stale — worth a fresh green run after the rebase.
Overall I'd support this after a rebase, a green CI run, and the comment/leak tidy-ups; the core approach and the index-correctness story are solid.
NOTE: This PR was created with AI tools and a human
Optimize id()/start_id()/end_id() with direct graphid column access.
Add optimization to avoid rebuilding full vertex/edge objects when only
the id, start_id, or end_id is needed. Instead of calling age_id() on a
reconstructed _agtype_build_vertex/_agtype_build_edge, directly access
the underlying graphid column. This enables PostgreSQL to use B-tree
indexes on graphid columns for significantly improved query performance.
Implementation:
entities pass between clauses via export_entity_hidden_columns()
transform_entity
patterns
support
enabling Index Cond scans instead of post-scan Filter evaluation
contexts
Optimized patterns:
MATCH (p) WHERE id(p) > 0 RETURN p
MATCH ()-[e]->() WHERE id(e) > 0 RETURN e
MATCH ()-[e]->() WHERE start_id(e) > 0 RETURN e
MATCH ()-[e]->() WHERE end_id(e) > 0 RETURN e
All regression tests passed.
Added regression tests for the new graphid operators.
Files changed:
end_id_var fields to transform_entity struct
extract_id_var_from_entity_expr(); return raw graphid Var
(graphid_eq_int8, etc.) and 2 btree comparison functions
(graphid_btree_cmp_int8, int8_btree_cmp_graphid)
agtype_volatile_wrapper()
graphid_ops operator class with cross-type btree support
operators, and operator class