Describe the bug
The pthread_mutex in manage_GRAPH_global_contexts() causes a permanent self-deadlock on VLE queries. When ereport(ERROR) is raised while the mutex is held (e.g., statement timeout, query cancellation, OOM), PostgreSQL's siglongjmp jumps to the error handler, skipping pthread_mutex_unlock(). The mutex remains permanently locked. Any subsequent VLE query on the same backend connection deadlocks on itself: the process hangs forever in pthread_mutex_lock() with __owner equal to the backend's own PID.
How are you accessing AGE (Command line, driver, etc.)?
psql (command line), but the bug affects any client/driver.
What data setup do we need to do?
LOAD 'age';
SET search_path = ag_catalog, "$user", public;
SELECT create_graph('test_deadlock');
SELECT * FROM cypher('test_deadlock', $$
    UNWIND range(1, 50000) AS i
    CREATE (:Node {id: i})
$$) AS (v agtype);
SELECT * FROM cypher('test_deadlock', $$
    MATCH (a:Node), (b:Node)
    WHERE b.id = a.id + 1
    CREATE (a)-[:LINK {weight: a.id}]->(b)
$$) AS (e agtype);
-- Load graph context into cache first
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..2]->(b)
    RETURN path LIMIT 1
$$) AS (path agtype);
-- Invalidate cached context by modifying the graph
SELECT * FROM cypher('test_deadlock', $$
    CREATE (:Dummy {x: 1})
$$) AS (v agtype);
Repeat cache-invalidate + timeout in a loop. The timeout must hit during graph context reload (while the mutex is held). It may take a few iterations depending on machine speed.
-- Repeat: invalidate cache, then cancel VLE query via statement_timeout.
-- Each round has a chance of hitting the mutex-held window.
-- Once it hits, every subsequent VLE query on this connection hangs forever.

-- Round 1
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T1 {x: 1}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;

-- Round 2
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T2 {x: 2}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;

-- Round 3
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T3 {x: 3}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;

-- (add more rounds if needed)

-- Final test: if any round above hit the mutex window,
-- this query hangs forever (self-deadlock).
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..2]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
-- If it returns results, add more rounds above and retry.
Expected behavior
VLE queries should continue to work normally after a query error or cancellation. A statement timeout on one query should not permanently break the backend connection.
Environment (please complete the following information):
Additional context
The mutex was introduced in PR #1881 (fix for issue #1878). However, it is both unnecessary and harmful:
Unnecessary: The protected variable is a process-local static, so no concurrent access exists. The test failure in issue #1878 ("Flaky test age_global_graph fails on slow machines") was a catalog-level race, already fixed by the Assert→runtime check and strndup in the same PR. For cross-backend cache invalidation, PostgreSQL syscache uses sinval callbacks, and AGE PR #2376 ("VLE cache: replace snapshot invalidation with per-graph") already uses lock-free pg_atomic_uint64 version counters in shared memory for this.
Harmful: pthread_mutex is incompatible with PostgreSQL's error handling. ereport(ERROR) uses siglongjmp to jump directly to the error handler, bypassing all code between the error site and the handler, including pthread_mutex_unlock(). Once the unlock is skipped, the mutex is permanently locked for that backend process, and any subsequent VLE query self-deadlocks.
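The "Unnecessary" point above refers to lock-free version counters for cross-backend invalidation. As a rough illustration of that general idea only, here is a sketch using C11 atomics. It is not the code from PR #2376: all `Demo*` and `demo_*` names are hypothetical, and inside PostgreSQL the counter would live in shared memory and use `pg_atomic_uint64` rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state: one version counter per graph. */
typedef struct DemoGraphVersion
{
    _Atomic uint64_t version;
} DemoGraphVersion;

/* Hypothetical process-local cache header. */
typedef struct DemoLocalCache
{
    uint64_t seen_version; /* counter value the cache was built against */
    bool     valid;        /* false until the first build */
} DemoLocalCache;

/* Writer side: bump the counter after modifying the graph. */
static void demo_bump_version(DemoGraphVersion *g)
{
    atomic_fetch_add(&g->version, 1);
}

/* Reader side: the cache is stale if it was never built or the counter moved. */
static bool demo_cache_is_stale(DemoGraphVersion *g, const DemoLocalCache *c)
{
    return !c->valid || atomic_load(&g->version) != c->seen_version;
}

/*
 * Reader side: record the version before rebuilding the cached data,
 * so a concurrent bump during the rebuild is detected on the next check.
 * The rebuild itself is omitted from this sketch.
 */
static void demo_cache_rebuild(DemoGraphVersion *g, DemoLocalCache *c)
{
    c->seen_version = atomic_load(&g->version);
    c->valid = true;
}
```

Because neither side ever holds a lock, an error raised between the staleness check and the rebuild leaves nothing to release; the next query simply sees a stale cache and rebuilds it.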
We will submit a PR with a fix.