pthread_mutex in manage_GRAPH_global_contexts causes permanent self-deadlock on VLE queries #2432

Description

@crdv7

Describe the bug

When ereport(ERROR) is raised while the pthread_mutex in manage_GRAPH_global_contexts() is held (e.g., statement timeout, query cancellation, OOM), PostgreSQL's siglongjmp jumps to the error handler and skips pthread_mutex_unlock(). The mutex then stays locked for the lifetime of that backend. Any subsequent VLE query on the same backend connection deadlocks on itself: the process hangs forever in pthread_mutex_lock() with __owner equal to its own PID.
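The skipped unlock can be reproduced outside PostgreSQL with a few lines of plain C. This is a minimal sketch, not AGE code: every demo_* name is hypothetical, demo_raise_error() only stands in for ereport(ERROR), and the mutex is a default (non-recursive) pthread mutex, which is assumed to match the one in AGE.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for sigsetjmp/siglongjmp */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <setjmp.h>

/* Hypothetical stand-ins; none of these names come from AGE or PostgreSQL. */
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static sigjmp_buf demo_error_handler;

/* Stand-in for ereport(ERROR): jumps straight to the handler, never returns. */
static void demo_raise_error(void)
{
    siglongjmp(demo_error_handler, 1);
}

/* Stand-in for the critical section guarded by the mutex. */
static void demo_locked_work(int raise_error)
{
    pthread_mutex_lock(&demo_mutex);
    if (raise_error)
        demo_raise_error();             /* the unlock below is skipped */
    pthread_mutex_unlock(&demo_mutex);
}

/*
 * Runs the critical section once and reports whether the mutex was still
 * held afterwards: 1 if it was left locked, 0 if it was released.
 */
int demo_mutex_held_after(int raise_error)
{
    int rc;

    if (sigsetjmp(demo_error_handler, 1) == 0)
        demo_locked_work(raise_error);
    /* Reached by normal return, or by siglongjmp on the error path. */

    /* trylock instead of lock, so the demo reports the state and cannot hang. */
    rc = pthread_mutex_trylock(&demo_mutex);
    assert(rc == 0 || rc == EBUSY);

    /*
     * Either trylock just acquired the mutex, or this same thread still
     * holds it from the skipped unlock. In both cases this thread is the
     * owner, so one unlock resets the state and the demo can be repeated.
     */
    pthread_mutex_unlock(&demo_mutex);
    return rc == EBUSY;
}
```

Calling demo_mutex_held_after(1) takes the error path and reports the mutex as still locked, while demo_mutex_held_after(0) reports it as released. In a real backend the next caller uses pthread_mutex_lock() rather than trylock, which is where the hang comes from.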

How are you accessing AGE (Command line, driver, etc.)?

  • psql (command line), but the bug affects any client/driver.

What data setup do we need to do?

LOAD 'age';
SET search_path = ag_catalog, "$user", public;

SELECT create_graph('test_deadlock');

SELECT * FROM cypher('test_deadlock', $$
    UNWIND range(1, 50000) AS i
    CREATE (:Node {id: i})
$$) AS (v agtype);

SELECT * FROM cypher('test_deadlock', $$
    MATCH (a:Node), (b:Node)
    WHERE b.id = a.id + 1
    CREATE (a)-[:LINK {weight: a.id}]->(b)
$$) AS (e agtype);

-- Load graph context into cache first
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..2]->(b)
    RETURN path LIMIT 1
$$) AS (path agtype);

-- Invalidate cached context by modifying the graph
SELECT * FROM cypher('test_deadlock', $$
    CREATE (:Dummy {x: 1})
$$) AS (v agtype);

What is the necessary configuration info needed?

What is the command that caused the error?

Repeat the sequence of cache invalidation followed by a timed-out VLE query. The timeout must fire during graph context reload, while the mutex is held, so it may take a few iterations depending on machine speed.

-- Repeat: invalidate cache, then cancel VLE query via statement_timeout.
-- Each round has a chance of hitting the mutex-held window.
-- Once it hits, every subsequent VLE query on this connection hangs forever.

-- Round 1
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T1 {x: 1}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;

-- Round 2
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T2 {x: 2}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;

-- Round 3
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T3 {x: 3}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;

-- (add more rounds if needed)

-- Final test: if any round above hit the mutex window,
-- this query hangs forever (self-deadlock).
SELECT * FROM cypher('test_deadlock', $$
    MATCH path = (a)-[r*1..2]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
-- If it returns results, add more rounds above and retry.

To confirm with GDB:

gdb -batch -p <hung_pid> \
  -ex "print global_graph_contexts_container.mutex_lock.__data.__owner"
# Output: $1 = <hung_pid>   (owner == self → self-deadlock)
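For readers unfamiliar with the field GDB prints here: on Linux with glibc, a locked default mutex records the kernel thread ID of its owner in __data.__owner, and in a single-threaded process such as a PostgreSQL backend that thread ID equals the PID. The sketch below checks this for the calling thread. It relies on a glibc-internal field, so treat it as an illustration rather than portable code; demo_owner_is_self is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for syscall() */
#endif
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__linux__) && defined(__GLIBC__)
#include <sys/syscall.h>
#endif

/*
 * Returns 1 if the owner field of a locked mutex equals the caller's
 * kernel thread ID, 0 if it does not, and -1 on platforms where the
 * glibc-internal field is not available.
 */
int demo_owner_is_self(void)
{
#if defined(__linux__) && defined(__GLIBC__)
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pid_t tid = (pid_t) syscall(SYS_gettid);
    int rc;
    int owner;

    rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    (void) rc;

    /* glibc-internal layout; this is the same field GDB reads above. */
    owner = m.__data.__owner;

    pthread_mutex_unlock(&m);
    return owner == (int) tid;
#else
    return -1;
#endif
}
```

A process that is blocked in pthread_mutex_lock() on a mutex whose owner is its own thread ID cannot be waiting for anyone else, which is why the GDB output above is enough to diagnose a self-deadlock.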

Expected behavior

VLE queries should continue to work normally after a query error or cancellation. A statement timeout on one query should not permanently break the backend connection.

Environment (please complete the following information):

Additional context

The mutex was introduced in PR #1881 (fix for issue #1878). However, it is both unnecessary and harmful:

  1. Unnecessary: The protected variable is a process-local static, so no concurrent access exists. The test failure in #1878 ("Flaky test age_global_graph fails on slow machines") was a catalog-level race, already fixed by the Assert→runtime check and strndup in the same PR. For cross-backend cache invalidation, PostgreSQL syscache uses sinval callbacks, and AGE PR #2376 ("VLE cache: replace snapshot invalidation with per-graph") already uses lock-free pg_atomic_uint64 version counters in shared memory for this.

  2. Harmful: pthread_mutex is incompatible with PostgreSQL's error handling. ereport(ERROR) uses siglongjmp to jump directly to the error handler, bypassing all code between the error site and the handler — including pthread_mutex_unlock(). Once skipped, the mutex is permanently locked for that backend process, and any subsequent VLE query self-deadlocks.
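To illustrate why a version counter needs no lock that an error could leave held, here is a rough sketch of the general pattern. It is not the code from PR #2376: C11 atomics stand in for pg_atomic_uint64, an ordinary static stands in for shared memory, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a per-graph version counter living in shared memory. */
static _Atomic uint64_t demo_graph_version = 0;

/* Process-local cache state; only this process touches it, so no lock. */
static uint64_t demo_cached_version = 0;
static bool demo_cache_valid = false;

/* Called by whoever modifies the graph: bump the shared version. */
void demo_graph_modified(void)
{
    atomic_fetch_add(&demo_graph_version, 1);
}

/*
 * Called before a query uses the cached context.
 * Returns true if the cache had to be rebuilt, false if it was current.
 */
bool demo_ensure_context(void)
{
    uint64_t current = atomic_load(&demo_graph_version);

    if (demo_cache_valid && demo_cached_version == current)
        return false;

    /* The counter only grows, so a valid cache is never ahead of it. */
    assert(!demo_cache_valid || demo_cached_version < current);

    /*
     * Mark the cache invalid before rebuilding. If an error aborts the
     * rebuild at this point, the cache simply stays invalid and the next
     * call rebuilds it; there is nothing left locked.
     */
    demo_cache_valid = false;

    /* ... the real context rebuild would happen here ... */

    demo_cached_version = current;
    demo_cache_valid = true;
    return true;
}
```

The first call rebuilds, a second call with no intervening change is a no-op, and a call after demo_graph_modified() rebuilds again. The only state an aborted rebuild leaves behind is a flag that says "rebuild me", which is safe under siglongjmp.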

We will submit a PR with a fix.

Labels: bug (Something isn't working)
