Vertex mTLS async path (AsyncAuthorizedSession) caches event loop -> 'Event loop is closed' on 2nd request in per-request asyncio.run() servers (Agent Engine) #2532

@coleam00

Description

On Vertex AI, when the async Google-auth / mTLS path is selected
(BaseApiClient._use_google_auth_async() returns True), the client creates an
aiohttp AsyncAuthorizedSession and caches it on self._aiohttp_session
without any closed-loop recreation check. The non-auth aiohttp branch in the
same method does check self._aiohttp_session._loop.is_closed() and recreates, but
the async-auth branch returns the cached session unconditionally.

In any server that handles each request on a fresh event loop (e.g. a
sync-to-async bridge that calls asyncio.run() per request, which is exactly what
Vertex AI Agent Engine / ADK's deployed runner does), the cached
AsyncAuthorizedSession is bound to the first request's now-closed loop. The
second request fails with RuntimeError: Event loop is closed (raised from aiohttp
DNS resolution / getaddrinfo), and the call returns nothing.
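The failure needs nothing Vertex-specific to show up. Below is a minimal stdlib-only sketch of the server pattern described above. Each request runs in its own asyncio.run() on a worker thread, and a client caches a session that remembers the loop it was created on, as the async-auth branch does with AsyncAuthorizedSession. CachingClient, LoopBoundSession and handle are hypothetical names for illustration, not google-genai code. loop.run_in_executor stands in for the DNS lookup, because asyncio's loop.getaddrinfo is itself built on run_in_executor. aiohttp's threaded resolver calls loop.getaddrinfo on the loop it stores.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class LoopBoundSession:
    """Hypothetical session that keeps the loop it was created on."""

    def __init__(self) -> None:
        self._loop = asyncio.get_running_loop()

    async def request(self) -> str:
        # Work is scheduled on the *stored* loop, not the currently running one
        # (stand-in for the resolver's getaddrinfo call).
        return await self._loop.run_in_executor(None, lambda: "ok")


class CachingClient:
    """Hypothetical client that caches its session with no loop check."""

    def __init__(self) -> None:
        self._session = None

    async def generate(self) -> str:
        if self._session is None:
            self._session = LoopBoundSession()
        return await self._session.request()


def handle(client: CachingClient) -> str:
    # Sync-to-async bridge: a fresh event loop per request, closed afterwards.
    return asyncio.run(client.generate())


client = CachingClient()
with ThreadPoolExecutor(max_workers=1) as worker:
    print(worker.submit(handle, client).result())  # ok
    try:
        worker.submit(handle, client).result()
    except RuntimeError as exc:
        print(exc)  # Event loop is closed
```

The first request succeeds. The second request reaches the cached session, whose loop asyncio.run() has already closed, and fails with the same RuntimeError described in this report.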

This is the same class of bug as #1083 and #1518, but specific to the
Vertex + aiohttp + mTLS client-cert path.

Environment

  • google-genai 1.75.0 (also present in nearby 1.7x)
  • Vertex AI (vertexai=True), aiohttp installed
  • mtls.should_use_client_cert() is True (i.e. a client certificate is present —
    e.g. an agent deployed to Vertex AI Agent Engine with agent identity, which
    provisions a SPIFFE x509 client cert)
  • Server model: one asyncio.run() per request on a worker thread

Relevant code (google/genai/_api_client.py, ~v1.75.0)

_use_google_auth_async() (~line 859) returns True when
has_aiohttp and self.vertexai and mtls.should_use_client_cert().

In _get_aiohttp_session():

  • async-auth branch (~lines 879-911): if self._aiohttp_session is None and
    self._use_google_auth_async(), it creates an AsyncAuthorizedSession, caches it,
    and returns it. Later calls get the cached session back with no loop or
    closed check.
  • non-auth branch (~lines 915-918): recreates when
    self._aiohttp_session is None or .closed or ._loop.is_closed().

Steps to reproduce

  1. Create a Vertex client with aiohttp installed and a client certificate
    available, so _use_google_auth_async() returns True.
  2. Issue a streaming/generate request inside one asyncio.run(...) call on a
    worker thread; let that loop close.
  3. Issue a second request inside a new asyncio.run(...) on a new loop, reusing
    the same client.

Expected: the second request succeeds (session recreated on the current loop,
as the non-auth branch already does).

Actual: RuntimeError: Event loop is closed; the call returns no data.

Real-world impact

ADK agents deployed to Vertex AI Agent Engine with agent identity return a
correct answer to the first query and then an empty response to every subsequent
query on the same warm instance. It is invisible in local dev because no client
cert is present there (so the self-healing non-auth branch is used).
Reproduced on multiple independent agents; identical signature.

Suggested fix

Apply the same closed-loop guard the non-auth branch uses to the async-auth
branch — recreate the AsyncAuthorizedSession when its bound loop is closed (or
when the running loop differs from the one it was created on).

Workaround

Wrap _get_aiohttp_session to drop the cached session when the event loop changes:

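```python
import asyncio
from google.genai import _api_client as g

# Keep a reference to the original (async) method.
_orig = g.BaseApiClient._get_aiohttp_session

async def _loop_safe(self):
    # Loop this call is running on (None if called outside a loop).
    try:
        running = asyncio.get_running_loop()
    except RuntimeError:
        running = None
    # If the cached session was obtained on a different loop, drop it so the
    # original method builds a new one bound to the current loop.
    if (getattr(self, "_bound_loop", None) is not None
            and self._bound_loop is not running
            and getattr(self, "_aiohttp_session", None) is not None):
        self._aiohttp_session = None  # recreate on the current loop
    session = await _orig(self)
    # Remember which loop the returned session belongs to.
    self._bound_loop = running
    return session

g.BaseApiClient._get_aiohttp_session = _loop_safe
```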

This keeps the mTLS auth intact and only repairs the loop binding. Verified to fix
repeated queries on a deployed Agent Engine instance (5/5 successful queries,
versus 1/5 before).
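The workaround only reads and writes _aiohttp_session and a new _bound_loop attribute. Because of that, its loop-tracking logic can be sanity-checked without Vertex credentials. The sketch below applies the same wrapper to a hypothetical StubClient whose _get_aiohttp_session caches unconditionally, like the async-auth branch. StubClient and current_session_loop_matches are illustrative only and are not part of google-genai.

```python
import asyncio


class StubClient:
    """Hypothetical stand-in whose session cache has no loop check (like the async-auth branch)."""

    def __init__(self) -> None:
        self._aiohttp_session = None

    async def _get_aiohttp_session(self):
        if self._aiohttp_session is None:
            # Record the loop the "session" was created on.
            self._aiohttp_session = {"loop": asyncio.get_running_loop()}
        return self._aiohttp_session


_orig = StubClient._get_aiohttp_session


async def _loop_safe(self):
    # Same logic as the workaround, applied to the stub instead of BaseApiClient.
    try:
        running = asyncio.get_running_loop()
    except RuntimeError:
        running = None
    if (getattr(self, "_bound_loop", None) is not None
            and self._bound_loop is not running
            and getattr(self, "_aiohttp_session", None) is not None):
        self._aiohttp_session = None
    session = await _orig(self)
    self._bound_loop = running
    return session


StubClient._get_aiohttp_session = _loop_safe


async def current_session_loop_matches(client: StubClient) -> bool:
    session = await client._get_aiohttp_session()
    return session["loop"] is asyncio.get_running_loop()


client = StubClient()
print(asyncio.run(current_session_loop_matches(client)))  # True
print(asyncio.run(current_session_loop_matches(client)))  # True: rebuilt on the new loop
```

Without the wrapper, the second call would return the session created on the first (now closed) loop and print False.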

Metadata

Labels

  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
