ci(quality): retry ollama model pull on transient network failure #1234
Open
planetf1 wants to merge 2 commits into
A single connection reset from registry.ollama.ai fails the entire quality run. Wrap the pull in a retry loop (3 attempts, 15s backoff) to handle transient network errors without requiring a manual re-run. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> Assisted-by: Claude Code
3×15s was conservative; 5×20s gives ~2 minutes of retry headroom for sustained brief outages without masking real failures. Signed-off-by: Nigel Jones <jonesn@uk.ibm.com> Assisted-by: Claude Code
markstur (Contributor) reviewed on Jun 10, 2026 and left a comment:
It looks like on the 5th attempt, a failure will still echo "retrying..." and the step will continue as if it had succeeded (on to whatever error happens later), when it should probably exit 1 on that last attempt instead.
Something like this (untested):
for i in 1 2 3 4 5; do
  ollama pull granite4.1:3b && break
  if [ "$i" -lt 5 ]; then
    echo "Attempt $i failed, retrying in 20s..."
    sleep 20
  else
    echo "Attempt $i failed, no more retries"
    exit 1
  fi
done
This fixes:
- Exits with code 1 if all attempts fail
- Shows a meaningful message on the final attempt instead of "retrying..."
But I'm not going to -1 this because I'm sure what you have generally works and solves the problem well enough. Also, looking forward to future caching!
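To make the concern in this review concrete, here is a minimal sketch under stated assumptions. `always_fail` is a hypothetical stand-in for `ollama pull granite4.1:3b`, the attempt count is shortened to 3, and the sleeps are set to 0 so it runs instantly. `naive_retry` is an assumed shape for a loop with no final-attempt check, not necessarily the exact code in this PR.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for `ollama pull granite4.1:3b` that always fails.
always_fail() { return 1; }

# Assumed naive shape: nothing signals failure once attempts run out.
naive_retry() {
  local i
  for i in 1 2 3; do
    always_fail && break
    echo "Attempt $i failed, retrying..."
    sleep 0   # a real workflow would wait between attempts
  done
}

# Shape suggested in the review: return non-zero after the last attempt.
strict_retry() {
  local i
  for i in 1 2 3; do
    always_fail && break
    if [ "$i" -lt 3 ]; then
      echo "Attempt $i failed, retrying..."
      sleep 0
    else
      echo "Attempt $i failed, no more retries"
      return 1
    fi
  done
}

# Capture each exit status without aborting under `set -e`.
naive_status=0
naive_retry >/dev/null || naive_status=$?
strict_status=0
strict_retry >/dev/null || strict_status=$?

echo "naive=$naive_status strict=$strict_status"
```

The naive loop reports status 0 because a `for` loop's exit status is that of the last command it ran (here `sleep`), so a step built this way would carry on even though every pull failed. The strict variant reports 1.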
Fixes #1233
What
The `Pull models` step in `.github/workflows/quality.yml` calls `ollama pull granite4.1:3b` with no retry logic. A single `connection reset by peer` from Ollama's CDN fails all three Python matrix jobs and requires a manual re-run.
Wraps the pull in a retry loop: 5 attempts, 20-second backoff, ~2 minutes of total headroom. That is enough to ride out transient CDN blips while still failing fast on real errors (wrong model name, Ollama not running, sustained outage).
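The retry behaviour described above could be sketched as a small reusable function. This is an illustration rather than the code in this PR: the `retry` helper, its argument order, and the message wording are all assumptions. It sleeps only between attempts, so with 5 attempts and a 20-second delay this sketch sleeps for at most 4 × 20 = 80 seconds, plus however long the attempts themselves take.

```shell
#!/usr/bin/env bash

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1
  local delay=$2
  local i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Run the command exactly as given; stop as soon as it succeeds.
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then
      echo "Attempt $i/$attempts failed, retrying in ${delay}s..." >&2
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "All $attempts attempts failed: $*" >&2
  return 1
}

# In the workflow step this would be called as (not run here):
#   retry 5 20 ollama pull granite4.1:3b
```

Returning non-zero after the final attempt is what lets the step fail on real errors such as a wrong model name, instead of continuing to a later, less obvious failure.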
Why not cache?
Caching the model blobs is a follow-on improvement. The retry loop addresses the flakiness with a two-line change; caching requires knowing the correct system path for Ollama's model store on the GitHub-hosted runner and managing cache invalidation. Filed separately.
Testing
Observed the failure mode on PR #1174 (`connection reset by peer` on `granite4.1:3b`). Re-run passed. The retry loop was added, and the subsequent CI run for #1174 is currently in progress with the fix in place.