nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843
nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843jmsperu wants to merge 2 commits into
Conversation
1728663 to
937e646
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #12843 +/- ##
============================================
+ Coverage 18.10% 18.11% +0.01%
- Complexity 16746 16758 +12
============================================
Files 6037 6037
Lines 542933 542783 -150
Branches 66487 66454 -33
============================================
+ Hits 98283 98322 +39
+ Misses 433602 433416 -186
+ Partials 11048 11045 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@jmsperu can you please check on 4.21? |
Thanks for the heads up! I see 4.22 already has quiesce and improved error handling. I'll rebase onto 4.22 and update the PR to only include what's still missing: backup timeout, trap handler with VM resume, and free space check. The other PRs (compression, verification, bandwidth throttle, encryption, parallel execution) are still new features on top of 4.22. |
|
Belated reply @abh1sar — you were right, there's significant overlap with what's already landed on main. The quiesce support and EXIT_CLEANUP_FAILED handling from this PR have since been merged upstream (likely via the broader nasbackup.sh improvements in #12822 and others), so those bits are now redundant. What's still uniquely unaddressed by what landed:
I'll rebase against main to drop the overlapping changes and keep only those three, then push again. If you'd rather close this and open a fresh narrower PR with just those bits, also happy to do that — your call. |
…M resume Three independent reliability fixes for the KVM NAS backup script, layered on top of the existing quiesce + EXIT_CLEANUP_FAILED groundwork: 1. BACKUP_TIMEOUT env var (default 6h) bounds the libvirt domjobinfo wait loop in backup_running_vm. Today a stuck QEMU backup holds the agent's command slot until the orchestrator-level timeout fires. The new guard issues domjobabort and exits non-zero so the agent reclaims the slot promptly. 2. MIN_FREE_SPACE env var (default 1 GiB) + check_free_space() runs after mount and before any qemu-img convert in both backup_running_vm and backup_stopped_vm. Fail-fast on a near-full NAS instead of failing mid-write halfway through a multi-GiB convert. 3. trap cleanup EXIT replaces the six explicit cleanup() call sites as the primary cleanup mechanism so orphan NFS mounts no longer accumulate when the script dies to SIGTERM, SIGINT, or any uncaught set -e failure between the explicit call sites. cleanup() is now guarded by CLEANUP_DONE so the trap doesn't re-run an already-completed cleanup from an explicit call. cleanup() additionally resumes the VM if it's still paused — backup-begin holds the guest paused briefly and a failed backup mid-pause currently leaves the guest stuck in 'paused' state until an operator intervenes. Targets main; supersedes the 4.20-targeted version of this PR.
937e646 to
0deb4c4
Compare
|
Rebased. Force-pushed 0deb4c4 and changed the base from What landed in main since this PR was opened (via other PRs):
So this PR is now reduced to just the three improvements those didn't cover:
Total: 71 +/3 - on one file. Ready for another look @abh1sar. |
|
Friendly nudge — the new push is sitting on GitHub Actions |
I’ve added you to the collaborators list @jmsperu . With great privileges comes great responsibility… ;) |
|
Friendly bump — this adds defensive hardening around nasbackup.sh:
Latest CI had a single shard failure in |
|
I’ll look at this and your other PRs tomorrow, bandwith permitting @jmsperu . |
There was a problem hiding this comment.
Pull request overview
This PR updates the KVM NAS backup helper script (nasbackup.sh) to improve reliability by adding a backup polling timeout, introducing an EXIT cleanup trap to avoid orphaned mounts/paused VMs, adding a pre-flight free-space check, and supporting optional guest filesystem quiescing via qemu-guest-agent.
Changes:
- Add
BACKUP_TIMEOUTand enforce it in thedomjobinfopolling loop (abort viadomjobaborton timeout). - Add
check_free_space()and call it before writing backup data. - Add an idempotent
cleanup()and installtrap cleanup EXITto ensure cleanup on all exit paths.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Failed) | ||
| echo "Virsh backup job failed" | ||
| cleanup ;; |
There was a problem hiding this comment.
this remark makes sense; either catch and evaluate the return of the cleanup() call or add a break statement.
| if [[ -n "$mount_point" && -d "$mount_point" ]]; then | ||
| umount "$mount_point" 2>/dev/null || { echo "Failed to unmount $mount_point"; status=1; } | ||
| rmdir "$mount_point" 2>/dev/null || true | ||
| fi |
| fi | ||
| } |
| mount_operation | ||
| check_free_space | ||
| mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 1; } | ||
|
|
||
| IFS="," |
DaanHoogland
left a comment
There was a problem hiding this comment.
co-pilot has some smarter comments than I ;)
| # Idempotent: skip if a prior explicit call already ran. Without this guard, | ||
| # the EXIT trap would re-run cleanup and fail on the already-unmounted point. | ||
| [[ $CLEANUP_DONE -eq 1 ]] && return 0 | ||
| CLEANUP_DONE=1 |
There was a problem hiding this comment.
should this line move to the end of the method, when cleanup is actually done?
| Failed) | ||
| echo "Virsh backup job failed" | ||
| cleanup ;; |
There was a problem hiding this comment.
this remark makes sense; either catch and evaluate the return of the cleanup() call or add a break statement.
Summary
BACKUP_TIMEOUT(default 6 hours) to prevent indefinitely stuck backup jobs; aborts viadomjobabortwhen exceededEXITtrap withcleanup()that resumes paused VMs, removes temp dirs, and unmounts NFS on any exit (error, signal, or normal) — prevents orphan mounts from accumulatingcheck_free_space()pre-flight check (default 1 GB minimum) before writing backup data-q/--quiesceflag for optionalfsfreeze/thawvia qemu-guest-agent for application-consistent backupsset -eo pipefailfor stricter error handlingmount_operation(): properif mountinstead of broken$?check after pipeumount/rmdirfrom functions (now handled by trap)Motivation
The current
nasbackup.shhas several reliability issues observed in production:until domjobinfo --completed) runs forever if backup stalls, blocking the agent/tmp/csbackup.XXXXX) and paused VMs that never resume (related: KVM NAS backup: VM remains paused indefinitely when backup job fails #12821)$?after a pipe always returns the pipe's exit status, not the command'sTest plan
BACKUP_TIMEOUT=30— verify timeout anddomjobabort-qflag and qemu-guest-agent installed — verify fsfreeze/thaw in logs-qflag without qemu-guest-agent — verify graceful fallback