Skip to content

ci(quality): retry ollama model pull on transient network failure#1234

Open
planetf1 wants to merge 2 commits into
generative-computing:mainfrom
planetf1:ci/ollama-pull-retry
Open

ci(quality): retry ollama model pull on transient network failure#1234
planetf1 wants to merge 2 commits into
generative-computing:mainfrom
planetf1:ci/ollama-pull-retry

Conversation

@planetf1

@planetf1 planetf1 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Fixes #1233

What

The Pull models step in .github/workflows/quality.yml calls ollama pull granite4.1:3b with no retry logic. A single connection reset by peer from Ollama's CDN fails all three Python matrix jobs and requires a manual re-run.

Wraps the pull in a retry loop: 5 attempts, 20-second backoff, ~2 minutes of total headroom — enough to ride out transient CDN blips while still failing fast on real errors (wrong model name, Ollama not running, sustained outage).

Why not cache?

Caching the model blobs is a follow-on improvement. The retry loop addresses the flakiness with a two-line change; caching requires knowing the correct system path for Ollama's model store on the GitHub-hosted runner and managing cache invalidation. Filed separately.

Testing

Observed the failure mode on PR #1174 (connection reset by peer on granite4.1:3b). Re-run passed. The retry loop was added and the subsequent CI run for #1174 is currently in progress with the fix in place.

planetf1 added 2 commits June 9, 2026 10:44
A single connection reset from registry.ollama.ai fails the entire
quality run. Wrap the pull in a retry loop (3 attempts, 15s backoff)
to handle transient network errors without requiring a manual re-run.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
3×15s was conservative; 5×20s gives ~2 minutes of retry headroom for
sustained brief outages without masking real failures.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
@planetf1 planetf1 marked this pull request as ready for review June 9, 2026 10:45
@planetf1 planetf1 requested a review from a team as a code owner June 9, 2026 10:45

@akihikokuroda akihikokuroda left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@markstur markstur left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like on the 5, it will echo "retrying..." and continue successfully (to whatever error happens later), when it probably should instead exit 1 on that last attempt.

Something like this (untested):

for i in 1 2 3 4 5; do
  ollama pull granite4.1:3b && break
  if [ $i -lt 5 ]; then
    echo "Attempt $i failed, retrying in 20s..."
    sleep 20
  else
    echo "Attempt $i failed, no more retries"
    exit 1
  fi
done

This fixes:

  1. Exits with code 1 if all attempts fail
  2. Shows meaningful message on final attempt instead of retrying...

But I'm not going to -1 this because I'm sure what you have generally works and solves the problem well enough. Also, looking forward to future caching!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci: intermittent quality CI failure — ollama model pull connection reset

3 participants