Transient connection errors (TimeoutError/ClientConnectorError) are only retried in fetch_events #2147

Description

@iMicknl

Summary

Only fetch_events is decorated with @retry_on_connection_failure. Every other HTTP-making method on OverkizClient — including the command path (_execute_action_group_direct) and all setup/state/refresh calls — lacks it, so a transient TimeoutError / aiohttp.ClientConnectorError propagates out raw on the very first occurrence instead of being retried.

This surfaced in Home Assistant: a ConnectionTimeoutError (a subclass of the builtin TimeoutError) raised from a cover close command escaped as an unhandled traceback, because the execute_action_group → _execute_action_group_direct path carries only @retry_on_too_many_executions and @retry_on_auth_error.

Decorator definition

retry_on_connection_failure = backoff.on_exception(
    backoff.expo,
    (TimeoutError, ClientConnectorError),
    max_tries=5,
    max_time=120,
    ...
)
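
To illustrate the gap, here is a minimal stdlib-only sketch of what a decorator of this shape does. It is not the library's code: retry_on is a hypothetical stand-in for backoff.on_exception(backoff.expo, ...), the builtin ConnectionError stands in for aiohttp's ClientConnectorError, the local ConnectionTimeoutError class stands in for aiohttp's exception of the same name, and both close_cover_* coroutines are invented for the demo.

```python
import asyncio
import functools


def retry_on(exceptions, max_tries=5, base_delay=0.01):
    """Hypothetical stand-in for backoff.on_exception(backoff.expo, ...)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return await func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of attempts: surface the original error
                    # Exponential delay: base, 2*base, 4*base, ...
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


class ConnectionTimeoutError(TimeoutError):
    """Hypothetical stand-in for aiohttp's ConnectionTimeoutError."""


calls = {"undecorated": 0, "decorated": 0}


async def close_cover_undecorated():
    # No retry decorator: the first transient failure escapes to the caller.
    calls["undecorated"] += 1
    if calls["undecorated"] < 3:
        raise ConnectionTimeoutError("transient")
    return "ok"


@retry_on((TimeoutError, ConnectionError))
async def close_cover_decorated():
    # Same failure pattern, but the decorator absorbs the first two errors.
    calls["decorated"] += 1
    if calls["decorated"] < 3:
        raise ConnectionTimeoutError("transient")
    return "ok"


async def demo():
    try:
        first = await close_cover_undecorated()
    except TimeoutError as exc:
        first = type(exc).__name__
    second = await close_cover_decorated()
    return first, second


result = asyncio.run(demo())
print(result)
```

Because the subclass is matched by the TimeoutError entry in the tuple, the decorated call recovers on its third attempt, while the undecorated one raises on its first.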

Audit of HTTP methods missing @retry_on_connection_failure

fetch_events is the only method that has it. Everything below makes an HTTP request (_get/_post/_put/_delete) and does not:

  • Command / execution: _execute_action_group_direct, execute_persisted_action_group, schedule_persisted_action_group, cancel_execution
  • Core polling/setup: get_setup, get_devices, get_gateways, get_state, refresh_states, refresh_device_states, get_current_execution, get_current_executions, get_execution_history, register_event_listener, unregister_event_listener, get_api_version
  • Everything else: get_diagnostic_data, get_device_definition, get_action_groups, get_places, all get_reference_*, all firmware methods, all local-token methods, all developer-mode methods, search_reference_devices, etc.

Note: ServerDisconnectedError is retried (once) only as a side effect of retry_on_auth_error's relogin path; plain TimeoutError / ClientConnectorError are retried nowhere except fetch_events.

Suggested fix

Apply @retry_on_connection_failure consistently to the request-issuing methods — or, more robustly, centralize it in the _get/_post/_put/_delete helpers so every call gets uniform transient-connection retry, removing the need to remember it per-method. The command path in particular should retry transient timeouts before giving up.
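
A rough sketch of the centralized variant, assuming a single private request helper that the verb helpers delegate to. Everything here is hypothetical: SketchClient is not OverkizClient, _request, flaky_transport and the paths are made up, and the builtin ConnectionError again stands in for ClientConnectorError. The point is only that public methods need no per-method decorator once the retry lives in the helper.

```python
import asyncio

# ConnectionError is a stand-in for aiohttp.ClientConnectorError (assumption).
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


class SketchClient:
    """Hypothetical client used only to show where the retry could live."""

    def __init__(self, transport, max_tries=5, base_delay=0.01):
        self._transport = transport  # async callable(method, path) -> response
        self._max_tries = max_tries
        self._base_delay = base_delay

    async def _request(self, method, path):
        # Single place where transient connection errors are retried.
        for attempt in range(1, self._max_tries + 1):
            try:
                return await self._transport(method, path)
            except TRANSIENT_ERRORS:
                if attempt == self._max_tries:
                    raise
                await asyncio.sleep(self._base_delay * 2 ** (attempt - 1))

    async def _get(self, path):
        return await self._request("GET", path)

    async def _post(self, path):
        return await self._request("POST", path)

    # Public methods carry no retry decorator of their own.
    async def get_setup(self):
        return await self._get("setup")

    async def execute(self):
        return await self._post("exec/apply")


attempts = []


async def flaky_transport(method, path):
    # Fails the first two attempts for each (method, path) pair.
    attempts.append((method, path))
    if attempts.count((method, path)) < 3:
        raise TimeoutError("transient")
    return f"{method} {path} ok"


async def demo():
    client = SketchClient(flaky_transport)
    return await client.get_setup(), await client.execute()


results = asyncio.run(demo())
print(results)
```

One design point to weigh before adopting this shape: a timeout on a POST may occur after the server has already accepted the request, so blanket retries on the command path could in principle submit a command twice.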

Context

Found while reviewing home-assistant/core#173155. On the HA side we added a catch that converts these to HomeAssistantError so the user sees a clean error instead of a traceback, but the underlying retry gap belongs here in the library.

Labels: bug
