Skip to content

Extended-protocol fixes: Flush handling + prepared-statement client-cache consistency#953

Open
ikeboy003 wants to merge 4 commits into
postgresml:mainfrom
YCWagerWise:fix/extended-protocol-flush-and-prepared-cache
Open

Extended-protocol fixes: Flush handling + prepared-statement client-cache consistency#953
ikeboy003 wants to merge 4 commits into
postgresml:mainfrom
YCWagerWise:fix/extended-protocol-flush-and-prepared-cache

Conversation

@ikeboy003
Copy link
Copy Markdown

Four commits. Two distinct fixes that share a theme: pgcat losing extended-protocol state in ways the client cannot detect or recover from.

1. Flush ('H') handling — 3 commits

Drivers using describeFirst (e.g. postgres.js with prepare: false + params) send Flush between Parse/Describe and Bind/Execute. pgcat logged Unexpected code: H and dropped it, hanging the client.

  • Handle Flush ('H') extended-protocol message — drains buffered extended-protocol messages into the send buffer, forwards the Flush byte, keeps the server checked out for the follow-up Bind/Execute/Sync.
  • Read Flush responses without waiting for ReadyForQueryServer::recv() only breaks on 'Z', which the server never sends after Flush. Adds recv_flush_response() that breaks on the actual terminal markers (RowDescription / NoData / ErrorResponse / BindComplete).
  • Flush response: drop double-copy parse and clone — perf: inspect framing byte without re-parsing length or copying the body for messages we only need to identify.

2. Prepared-statement client-cache desync — 1 commit

ensure_prepared_statement_is_on_server on PreparedStatementError:

  1. Removed the entry from the client's prepared_statements map
  2. Returned Ok(())

The client never received Close('S') so it still believed the statement was valid. The next Bind for that name missed in buffer_bind and pgcat dropped the entire TCP conn with ClientError(\"Prepared statement sN doesn't exist\").

For drivers that cache prepared statements per logical connection (tokio-postgres, asyncpg, pgx) and reuse them across many transactions, this is unrecoverable — the client has no way to know its view diverged from pgcat's.

Fix:

  • Keep the client cache intact — it mirrors what the client believes; dropping it without Close is a protocol-level lie.
  • Mark the offending server bad so the pool replaces it on next checkout.
  • Propagate PreparedStatementError so the caller fails this one op cleanly and retries on a fresh backend.

Test plan

  • cargo build clean
  • cargo fmt --check
  • postgres.js prepare: false + params: completes in ~500ms (was hanging)
  • Happy to add an integration test for Flush + describeFirst if you point me at the right harness

Nathanael Aninweze added 4 commits May 26, 2026 19:16
postgres.js (and likely other drivers) sends Flush between
Parse/Describe and Bind/Execute when describeFirst is enabled
(the default for parameterized non-prepared queries). pgcat was
logging 'Unexpected code: H' and dropping the message, leaving the
client hung waiting for the ParameterDescription/RowDescription
response that never arrived.

Add a Flush arm that drains the buffered extended-protocol
messages into self.buffer (same as Sync), appends the Flush byte,
sends to the server, forwards the server response, and keeps the
server checked out so the subsequent Bind/Execute/Sync can
complete on the same backend.
The previous commit added a Flush arm in the client loop but reused
send_and_receive_loop, which calls Server::recv() — that recv loop
only terminates on ReadyForQuery ('Z'). Servers do NOT send
ReadyForQuery in response to Flush, so the read hung forever.

Add Server::recv_flush_response() that breaks on RowDescription,
NoData, ErrorResponse, or BindComplete (the terminal describe /
bind response markers). Wire it into the Flush arm via a direct
send + recv_flush_response call path.

Verified locally: postgres.js with prepare:false + params now
completes in ~500ms through pgcat (previously hung).
- recv_flush_response: skip body parse for non-S/non-1 codes (5–N byte
  protocol frames; framing byte is enough to dispatch).
- recv_flush_response: replace self.buffer.clone() with std::mem::take
  to avoid one memcpy+alloc per Flush response.
- Merge separate saw_terminal flag into single matches!() check on the
  framing byte.

Net effect on cloudrest describe-heavy workload: removes one full-buffer
clone and one redundant get_u8/get_i32/read_string pass per Flush. At
intra-DC RTT (~0.3ms) and 10k+ qps the saved CPU is the dominant win.
…re-prepare failure

When a checked-out backend rejected re-Parse of a previously-prepared
statement (e.g. after schema-change cleanup, DEALLOCATE, or transient
backend state), the old code removed the entry from the *client*
prepared_statements HashMap and returned Ok(). This left tokio-postgres
clients in an unrecoverable state: the client still believed the
statement was valid (it never received a Close), so its very next Bind
for that name missed in buffer_bind and the whole TCP conn dropped with
'Prepared statement sN doesn't exist'.

For long-lived ETL clients (pgraft's tokio-postgres prepares many shapes
via execute() and caches them per-conn by SQL), this killed every write
inflight on that conn and orphaned any open COPY-IN, leaving
AccessExclusiveLock on temp staging tables until session termination.

New behavior: keep the client cache intact (it mirrors the client's own
view), mark the misbehaving server bad so the pool replaces it, and
propagate PreparedStatementError so the caller can fail this one
operation and retry on a fresh backend.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant