
Fix/mirror snapshot repo races #1573

Open
neolynx wants to merge 7 commits into fix/publish-races from fix/mirror-snapshot-repo-races
@neolynx
Member

@neolynx neolynx commented May 25, 2026

Fixes #

Requirements

All new code should be covered with tests, documentation should be updated. CI should pass.

Also, to speed up things, if you could kindly "Allow edits and access to secrets by maintainers" in the
PR settings, as this allows us to rebase the PR on master, fix conflicts, run coverage and help with
implementing code and tests.

Description of the Change

Checklist

  • allow Maintainers to edit PR (rebase, run coverage, help with tests, ...)
  • unit-test added (if change is algorithm)
  • functional test added/updated (if change is functional)
  • man page updated (if applicable)
  • bash completion updated (if applicable)
  • documentation updated
  • author name in AUTHORS

@neolynx neolynx self-assigned this May 25, 2026
…ask closures

Affected endpoints: apiReposDrop, apiReposPackagesAddDelete,
apiReposPackageFromDir, apiReposCopyPackage, apiReposIncludePackageFromDir,
apiReposEdit, apiReposCreate.

All seven endpoints shared the same architectural flaw as the previously
fixed publish endpoints: operations were performed outside the task lock,
with stale DB state used inside the lock.

Issues Fixed:

1. apiReposDrop - Collections created before task lock
   Problem: snapshotCollection, publishedCollection captured from pre-task
   factory. Concurrent snapshot/published modifications not detected.

   Fix: Create fresh taskCollectionFactory inside task, re-read repo after
   lock acquired, use fresh collections for checks.

2. apiReposPackagesAddDelete - Repo and factory stale before lock
   Problem: repo loaded outside task, collectionFactory created before lock.
   Concurrent add/delete operations both load same pre-task state, last
   write wins, packages lost.

   Fix: Create fresh taskCollectionFactory inside task, re-read repo after
   lock acquired, use fresh factory for all operations.

3. apiReposPackageFromDir - Repo and factory stale before lock
   Problem: repo loaded outside task, collectionFactory created before lock.
   Concurrent file imports both load same pre-task state, last write wins.

   Fix: Create fresh taskCollectionFactory inside task, re-read repo after
   lock acquired, use fresh factory for imports.

4. apiReposCopyPackage - Both repos and factory stale before lock
   Problem: dstRepo and srcRepo loaded outside task, collectionFactory
   created before lock. Concurrent copy operations race on stale state.

   Fix: Create fresh taskCollectionFactory inside task, re-read both repos
   after lock acquired, use fresh factory for all operations.

5. apiReposIncludePackageFromDir - Repo and factory stale before lock
   Problem: repo loaded outside task, collectionFactory created before lock.
   Concurrent .changes file processing races on stale state.

   Fix: Create fresh taskCollectionFactory inside task, use fresh factory
   for import operations.

6. apiReposEdit - No serialization, concurrent modification race
   Problem: Direct update without task locking. Two concurrent renames can
   both pass duplicate check, second overwrites first.

   Fix: Convert to async task. Duplicate check and update now atomic inside
   lock, after fresh load from DB.

7. apiReposCreate - No serialization, TOCTOU on duplicate check
   Problem: Duplicate check outside task lock, add outside lock. Two
   concurrent creates with same name both pass check, second overwrites first.

   Fix: Convert to async task. Duplicate check and add now atomic inside
   lock, after fresh load from DB.

Root cause analysis:

The fundamental issue is the split between pre-task work and task-protected
work. Collections and objects were being loaded before lock acquisition, then
stale copies used inside the lock.

Correct pattern (now applied consistently across all 7 endpoints):

1. HTTP Handler (before task lock):
   - Shallow load for 404 check only
   - Extract resource keys
   - Submit task with resources

2. Task Closure (after lock acquired):
   - Create fresh collectionFactory
   - Fresh load of all objects
   - LoadComplete on fresh copies
   - All mutations on fresh state
   - All checks atomic inside lock
   - Save using fresh collections

This ensures:
- Concurrent operations are serialized by task queue
- No stale DB state used for mutations
- No lost updates from concurrent modifications
- No TOCTOU races on duplicate checks
- No DB handle issues from pre-task factory capture
@neolynx neolynx force-pushed the fix/mirror-snapshot-repo-races branch from ca63adc to bf6dc8c May 25, 2026 16:37
neolynx added 3 commits May 25, 2026 19:57
…e task closures

Affected endpoints: apiSnapshotsCreate, apiSnapshotsUpdate, apiSnapshotsDrop,
apiSnapshotsMerge, apiSnapshotsPull.

All five endpoints shared the same architectural flaw as the previously fixed
repos and publish endpoints: operations were performed outside the task lock,
with stale DB state used inside the lock.

Issues Fixed:

1. apiSnapshotsCreate - Source snapshots loaded before task lock
   Problem: snapshotCollection and collectionFactory created before task lock.
   Source snapshots and destination check done with stale factory.
   Concurrent creates both load pre-task state, second overwrites first.

   Fix: Create fresh taskCollectionFactory inside task, fresh loads of all
   sources after lock acquired, pre-task duplicate check for destination,
   use fresh sources and collections for snapshot creation.

2. apiSnapshotsUpdate - Snapshot loaded before task lock
   Problem: snapshot loaded outside task, duplicate check with stale factory.
   Concurrent renames both load pre-task state, both pass check, second
   overwrites first.

   Fix: Create fresh taskCollectionFactory inside task, fresh load of snapshot
   after lock acquired, fresh duplicate check inside lock, pre-task validation
   of new name, atomic rename with fresh copy.

3. apiSnapshotsDrop - Collections created before task lock
   Problem: snapshotCollection and publishedCollection created before task lock.
   Concurrent snapshot/published modifications not detected. Can delete snapshot
   that becomes published between pre-task and task.

   Fix: Create fresh taskCollectionFactory inside task, fresh load of snapshot,
   fresh collections for all checks (published, source dependency), all checks
   inside lock.

4. apiSnapshotsMerge - Source snapshots loaded before task lock
   Problem: snapshotCollection created before task lock. Source snapshots
   loaded outside task, LoadComplete called on stale copies. Concurrent
   merges both load pre-task state, merge result doesn't include source changes.

   Fix: Create fresh taskCollectionFactory inside task, fresh load of all
   sources after lock acquired, LoadComplete on fresh copies, merge using
   fresh RefLists, save using fresh factory.

5. apiSnapshotsPull - Snapshots loaded before task lock
   Problem: toSnapshot and sourceSnapshot loaded outside task,
   collectionFactory created before task. LoadComplete called on stale copies.
   Concurrent pulls load pre-task state, pull doesn't include source changes.

   Fix: Create fresh taskCollectionFactory inside task, fresh load of both
   snapshots after lock acquired, LoadComplete on fresh copies, all filtering
   and pulling on fresh RefLists, save using fresh factory.
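
The drop case in item 3 can be sketched as follows. The names (`snapshotState`, `publish`, `drop`) are hypothetical and a mutex stands in for the task lock. The published check reads the current state under the same lock that guards the delete, so a snapshot published in the meantime is not dropped.

```go
package main

import (
	"errors"
	"sync"
)

var (
	errSnapshotNotFound  = errors.New("snapshot not found")
	errSnapshotPublished = errors.New("unable to drop: snapshot is published")
)

// snapshotState is a hypothetical in-memory stand-in for the snapshot and
// published collections.
type snapshotState struct {
	mu        sync.Mutex        // plays the role of the task lock (assumption)
	snapshots map[string]bool   // snapshot name -> exists
	published map[string]string // published prefix -> snapshot it serves
}

func newSnapshotState() *snapshotState {
	return &snapshotState{
		snapshots: make(map[string]bool),
		published: make(map[string]string),
	}
}

func (s *snapshotState) create(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.snapshots[name] = true
}

func (s *snapshotState) publish(prefix, snapshot string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.snapshots[snapshot] {
		return errSnapshotNotFound
	}
	s.published[prefix] = snapshot
	return nil
}

// drop re-reads the snapshot and the published state inside the lock,
// then deletes only if nothing depends on the snapshot.
func (s *snapshotState) drop(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.snapshots[name] {
		return errSnapshotNotFound
	}
	for _, snap := range s.published {
		if snap == name {
			return errSnapshotPublished
		}
	}
	delete(s.snapshots, name)
	return nil
}
```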

Root cause analysis:

The fundamental issue is the split between pre-task work and task-protected
work. Collections and objects were being loaded before lock acquisition, then
stale copies used inside the lock.

Correct pattern (from fixed publish.go and repos.go):

1. HTTP Handler (before task lock):
   - Shallow load for 404 check only
   - Extract resource keys
   - Submit task with resources

2. Task Closure (after lock acquired):
   - Create fresh collectionFactory
   - Fresh load of all objects
   - LoadComplete on fresh copies
   - All mutations on fresh state
   - All checks atomic inside lock
   - Save using fresh collections

This ensures:
- Concurrent operations are serialized by task queue
- No stale DB state used for mutations
- No lost updates from concurrent modifications
- No TOCTOU races on duplicate checks
- No DB handle issues from pre-task factory capture
…task closures

Affected endpoints: apiMirrorsDrop, apiMirrorsUpdate.

Both endpoints shared the same architectural flaw as the previously fixed
publish, repos, and snapshot endpoints: operations were performed outside
the task lock, with stale DB state used inside the lock.

Issues Fixed:

1. apiMirrorsDrop - Collections created before task lock
   Problem: mirrorCollection and snapshotCollection created before task lock.
   Snapshot dependency check done with stale factory. Concurrent drops both
   load pre-task state, both see same snapshot dependencies. If a snapshot is
   created after the pre-task check, a mirror still used by snapshots can be
   deleted.

   Fix: Create fresh taskCollectionFactory inside task, fresh load of mirror
   after lock acquired, fresh snapshot check with current factory, drop using
   fresh collections.

2. apiMirrorsUpdate - Mirror loaded before task lock
   Problem: remote loaded outside task, rename duplicate check with stale
   factory. Concurrent updates both load pre-task state, long-running update
   uses stale mirror reference. TOCTOU race: rename check passes, another
   creates mirror with same name, update saves with stale data.

   Fix: Create fresh taskCollectionFactory inside task, fresh load of mirror
   after lock acquired, pre-task rename validation, fresh rename check inside
   lock, use fresh mirror and collections for all operations.

Root cause analysis:

The fundamental issue is the split between pre-task work and task-protected
work. Collections and objects were being loaded before lock acquisition, then
stale copies used inside the lock.

Correct pattern (from fixed publish.go, repos.go, and snapshot.go):

1. HTTP Handler (before task lock):
   - Shallow load for 404 check only
   - Extract resource keys
   - Submit task with resources

2. Task Closure (after lock acquired):
   - Create fresh collectionFactory
   - Fresh load of all objects
   - LoadComplete on fresh copies
   - All mutations on fresh state
   - All checks atomic inside lock
   - Save using fresh collections

This ensures:
- Concurrent operations are serialized by task queue
- No stale DB state used for mutations
- No lost updates from concurrent modifications
- No TOCTOU races on duplicate checks
- No loss of mirrors used by snapshots
- No stale data in long-running updates

The gin context (c) may be recycled after the HTTP handler returns 202
for async tasks. Accessing c.Params.ByName() inside the task closure
returns an empty string, causing 'mirror with name  not found' errors.

Capture the URL :name parameter into a local variable before the
closure so it is safely captured by value.

Affected endpoints:
- PUT /api/mirrors/:name (apiMirrorsUpdate)
- POST/DELETE /api/repos/:name/packages (apiReposPackagesAddDelete)
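
The capture-by-value fix can be sketched without gin itself. `reqContext` below is a hypothetical stand-in for a pooled request context, and `reset` simulates the recycling described above; neither is gin's API. The unsafe closure reads the parameter when the task runs, the safe one copies it before the closure is built.

```go
package main

// reqContext is a hypothetical stand-in for a pooled request context whose
// params are cleared when the server reuses it for another request.
type reqContext struct {
	params map[string]string
}

func (c *reqContext) param(key string) string { return c.params[key] }

// reset simulates the context being recycled after the handler returned.
func (c *reqContext) reset() { c.params = map[string]string{} }

// unsafeTask reads the parameter lazily, inside the closure, so it sees
// whatever the context holds when the task eventually runs.
func unsafeTask(c *reqContext) func() string {
	return func() string {
		return "mirror with name " + c.param("name") + " not found"
	}
}

// safeTask copies the parameter into a local variable first, so the
// closure no longer depends on the context.
func safeTask(c *reqContext) func() string {
	name := c.param("name")
	return func() string {
		return "mirror with name " + name + " not found"
	}
}
```

After `reset`, the unsafe closure produces the message with the empty name, while the safe closure still reports the original name.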
@neolynx neolynx force-pushed the fix/mirror-snapshot-repo-races branch from 3c88e59 to b8373b0 May 25, 2026 17:57
neolynx added 2 commits May 25, 2026 20:41
The SnapshotsAPITestCreateUpdate test expects that PUT /api/snapshots/:name
with the same Name in the body returns a conflict error. The previous fix
added 'b.Name != name' guards to skip the duplicate check when the name
hasn't changed, but this broke the test which expects the old behavior:
any existing name (including the snapshot's own current name) should be
rejected as a duplicate.

Remove the 'b.Name != name' condition from both the pre-task validation
and the in-task duplicate check so the behavior matches the original.

The pre-task validation in apiSnapshotsUpdate was incorrectly rejecting
PUT requests that set the Name to the snapshot's current name. This caused
a 409 response before creating a task, which broke the system test
SnapshotsAPITestCreateUpdate that expects a task to be created and then
fail inside the task.

The fix restores the 'b.Name != name' condition in the pre-task check so
that same-name updates pass through to the task, where the in-task
duplicate check will properly fail them (returning a failed task state
instead of a direct 409).
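
The resulting split between the two checks, as these two commit messages describe it, might look like the sketch below. The types and function names (`snapshots`, `preTaskCheck`, `inTaskRename`) are hypothetical and this is not the code in api/snapshot.go. The pre-task check skips the same-name case, while the in-task check rejects any existing name, including the snapshot's own.

```go
package main

import (
	"errors"
	"sync"
)

var (
	errConflict = errors.New("snapshot with that name already exists")
	errNotFound = errors.New("snapshot not found")
)

// snapshots is a hypothetical in-memory stand-in for the snapshot collection.
type snapshots struct {
	mu    sync.Mutex // plays the role of the task lock (assumption)
	names map[string]bool
}

func newSnapshots(names ...string) *snapshots {
	s := &snapshots{names: make(map[string]bool)}
	for _, n := range names {
		s.names[n] = true
	}
	return s
}

func (s *snapshots) exists(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.names[name]
}

// preTaskCheck is the handler-side validation. It reports a conflict only
// when renaming to a different name that already exists, so a same-name
// update passes through to the task.
func preTaskCheck(s *snapshots, name, newName string) error {
	if newName != name && s.exists(newName) {
		return errConflict
	}
	return nil
}

// inTaskRename is the check inside the lock. It has no same-name guard, so
// renaming a snapshot to its own current name fails here, inside the task.
func inTaskRename(s *snapshots, name, newName string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.names[name] {
		return errNotFound
	}
	if s.names[newName] {
		return errConflict
	}
	delete(s.names, name)
	s.names[newName] = true
	return nil
}
```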
@codecov

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 74.86339% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.28%. Comparing base (68814ff) to head (d5f1929).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| api/repos.go | 74.48% | 14 Missing and 11 partials ⚠️ |
| api/snapshot.go | 79.03% | 7 Missing and 6 partials ⚠️ |
| api/mirror.go | 65.21% | 5 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@                  Coverage Diff                  @@
##           fix/publish-races    #1573      +/-   ##
=====================================================
+ Coverage              77.22%   77.28%   +0.05%     
=====================================================
  Files                    161      161              
  Lines                  15143    15230      +87     
=====================================================
+ Hits                   11694    11770      +76     
+ Misses                  2297     2291       -6     
- Partials                1152     1169      +17     

