fix: Retry transient connection errors on all HTTP requests #2148

Open
iMicknl wants to merge 4 commits into main from fix/retry-connection-failure-all-requests

Conversation

iMicknl commented Jun 22, 2026

Owner

Summary

Fixes #2147.

Previously only fetch_events was decorated with @retry_on_connection_failure, so a transient TimeoutError / aiohttp.ClientConnectorError raised from any other method, including the command path (execute_action_group / _execute_action_group_direct) and all setup/state/refresh calls, propagated raw on the very first occurrence instead of being retried.

This surfaced in Home Assistant (home-assistant/core#173155): a ConnectionTimeoutError (subclass of the builtin TimeoutError) raised from a cover close command escaped as an unhandled traceback.

Change

Centralize the retry on the _get/_post/_put/_delete request helpers (the approach suggested in the issue), so every request gets uniform transient-connection retry. This removes the need to remember to add the decorator to each method.
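
For readers who want the shape of the change without opening the diff, here is a minimal, stdlib-only sketch of putting the retry on the request helpers. It is not the code in this PR: the decorator body, the `FlakyTransport` class, the `get_setup` method, the `max_tries`/`base_delay` parameters and the endpoint paths are all illustrative assumptions, and `ConnectionError` stands in for `aiohttp.ClientConnectorError` so the example has no third-party imports.

```python
import asyncio
import functools

# Exceptions treated as transient. ConnectionError is a stand-in for
# aiohttp.ClientConnectorError (assumption, to keep this stdlib-only).
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def retry_on_connection_failure(max_tries=3, base_delay=0.0):
    """Retry an async callable when it raises a transient connection error."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return await func(*args, **kwargs)
                except TRANSIENT_ERRORS:
                    if attempt == max_tries:
                        raise  # out of attempts: surface the error
                    # Exponential wait between attempts (0 by default here).
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


class FlakyTransport:
    """Hypothetical transport: fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def send(self, method, path):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated connection timeout")
        return {"method": method, "path": path}


class Client:
    def __init__(self, transport):
        self._transport = transport

    # The retry lives on the low-level helpers, not on each public method.
    @retry_on_connection_failure()
    async def _get(self, path):
        return await self._transport.send("GET", path)

    @retry_on_connection_failure()
    async def _post(self, path):
        return await self._transport.send("POST", path)

    # Public methods carry no connection-retry decorator of their own.
    async def execute_action_group(self):
        return await self._post("exec/apply")  # illustrative path

    async def get_setup(self):
        return await self._get("setup")  # illustrative path


transport = FlakyTransport(failures=1)
result = asyncio.run(Client(transport).execute_action_group())
print(result, transport.calls)
```

With one simulated failure, the command succeeds on the second attempt even though `execute_action_group` itself is undecorated, which is the point of moving the retry down a layer.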

The now-redundant @retry_on_connection_failure is dropped from fetch_events to avoid nested double-retry — the outer retry_on_concurrent_requests / retry_on_auth_error / retry_on_listener_error decorators still wrap the request helpers correctly (connection retry now sits innermost, closest to the request).
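
To illustrate why the nested decorator is worth removing, the following sketch counts attempts when the same retry is stacked twice. The `retry` decorator and the two request functions are hypothetical stand-ins, not the library's decorators, and sleeps are omitted so only the attempt count is visible.

```python
import asyncio
import functools


def retry(max_tries):
    """Hypothetical retry decorator: re-run on TimeoutError, no sleeping."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return await func(*args, **kwargs)
                except TimeoutError:
                    if attempt == max_tries:
                        raise

        return wrapper

    return decorator


calls = {"single": 0, "nested": 0}


@retry(max_tries=3)
async def request_single():
    calls["single"] += 1
    raise TimeoutError


# Same retry applied at two layers, e.g. on a public method and on the
# request helper it calls.
@retry(max_tries=3)
@retry(max_tries=3)
async def request_nested():
    calls["nested"] += 1
    raise TimeoutError


async def run_until_failure(func):
    try:
        await func()
    except TimeoutError:
        pass


asyncio.run(run_until_failure(request_single))
asyncio.run(run_until_failure(request_nested))
print(calls)  # {'single': 3, 'nested': 9}
```

Each outer attempt re-runs the full inner retry loop, so attempts (and, with real backoff, sleep time) multiply instead of adding.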

Tests

Added two regression tests in tests/test_client.py that inject the failure at the session level so it flows through the decorated helpers:

  • test_backoff_retries_command_on_connection_failure — the motivating command-path case (TimeoutError)
  • test_backoff_retries_get_on_connection_failure — GET path (ClientConnectorError)

Both were confirmed to fail before the fix and pass after (TDD).
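
The idea of injecting the failure at the session level can be sketched as below. This is not the test code from tests/test_client.py: the `Client` class, the stand-in decorator, the payloads and the endpoint path are assumptions, and the mocked `session.post` is awaited directly, which simplifies how a real aiohttp session is used.

```python
import asyncio
import functools
from unittest.mock import AsyncMock


def retry_on_connection_failure(func):
    """Stand-in decorator: up to 3 attempts on TimeoutError, no sleep."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        for attempt in range(1, 4):
            try:
                return await func(*args, **kwargs)
            except TimeoutError:
                if attempt == 3:
                    raise

    return wrapper


class Client:
    """Hypothetical client with the retry on the request helper."""

    def __init__(self, session):
        self.session = session

    @retry_on_connection_failure
    async def _post(self, path, payload):
        return await self.session.post(path, json=payload)

    async def execute_action_group(self, payload):
        return await self._post("exec/apply", payload)  # illustrative path


async def test_command_retries_on_connection_failure():
    session = AsyncMock()
    # First call raises, second call returns a response: the failure enters
    # below the decorated helper, so it must pass through the retry.
    session.post.side_effect = [TimeoutError(), {"execId": "abc"}]

    client = Client(session)
    result = await client.execute_action_group({"actions": []})

    assert result == {"execId": "abc"}
    assert session.post.await_count == 2
    return result


test_result = asyncio.run(test_command_retries_on_connection_failure())
```

Patching the session rather than the helper matters: mocking `_post` itself would replace the decorated function and the retry would never be exercised.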

Verification

  • 549 tests pass
  • ruff clean
  • mypy clean

Previously only fetch_events was decorated with
@retry_on_connection_failure, so a transient TimeoutError or
ClientConnectorError raised from any other method — including the
command path (execute_action_group) and all setup/state/refresh
calls — propagated raw on the first occurrence.

Centralize the retry on the _get/_post/_put/_delete helpers so every
request gets uniform transient-connection retry, and drop the now
redundant decorator from fetch_events.

Fixes #2147
iMicknl requested a review from tetienne as a code owner June 22, 2026 14:54
github-actions (bot) added the bug label Jun 22, 2026
iMicknl added 2 commits June 22, 2026 14:57
Tighten retry_on_connection_failure from 5 tries / 120s to 3 tries /
30s so a flaky connection gives up faster (~3s worst-case sleep)
instead of blocking a command or poll for up to ~15s. Add a test
covering the give-up-after-max-tries path.
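
The two sleep figures in the commit message above can be reproduced with a short calculation. This assumes an exponential schedule that starts at 1s and doubles, with any jitter only shortening a wait; the source does not state the schedule, but these assumptions match both quoted numbers. The `worst_case_sleep` helper is hypothetical, and the 30s / 120s total-time caps are not modeled.

```python
def worst_case_sleep(max_tries, base=2, factor=1):
    """Sum of the longest possible waits between consecutive attempts."""
    # There is one wait between each pair of attempts: max_tries - 1 waits.
    return sum(factor * base ** n for n in range(max_tries - 1))


print(worst_case_sleep(5))  # 1 + 2 + 4 + 8 = 15
print(worst_case_sleep(3))  # 1 + 2 = 3
```

These totals cover sleeping only; time spent waiting for each request to time out comes on top.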
The give-up-after-max-tries test already exercises _get through the
decorator, so the GET retry-once test added no coverage beyond the
command-path (_post) regression test.
iMicknl changed the title from "fix: retry transient connection errors on all HTTP requests" to "fix: Retry transient connection errors on all HTTP requests" Jun 22, 2026
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Transient connection errors (TimeoutError/ClientConnectorError) are only retried in fetch_events