Skip to content

nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843

Open
jmsperu wants to merge 2 commits into
apache:mainfrom
jmsperu:fix/nasbackup-timeout-spacecheck-trap
Open

nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843
jmsperu wants to merge 2 commits into
apache:mainfrom
jmsperu:fix/nasbackup-timeout-spacecheck-trap

Conversation

@jmsperu
Copy link
Copy Markdown
Collaborator

@jmsperu jmsperu commented Mar 17, 2026

Summary

  • Add BACKUP_TIMEOUT (default 6 hours) to prevent indefinitely stuck backup jobs; aborts via domjobabort when exceeded
  • Add EXIT trap with cleanup() that resumes paused VMs, removes temp dirs, and unmounts NFS on any exit (error, signal, or normal) — prevents orphan mounts from accumulating
  • Add check_free_space() pre-flight check (default 1 GB minimum) before writing backup data
  • Add -q/--quiesce flag for optional fsfreeze/thaw via qemu-guest-agent for application-consistent backups
  • Use set -eo pipefail for stricter error handling
  • Fix mount_operation(): proper if mount instead of broken $? check after pipe
  • Quote all variable expansions to prevent word splitting
  • Remove manual umount/rmdir from functions (now handled by trap)

Motivation

The current nasbackup.sh has several reliability issues observed in production:

  1. No timeout — backup polling loop (until domjobinfo --completed) runs forever if backup stalls, blocking the agent
  2. No cleanup on failure — failed backups leave orphan NFS mounts (/tmp/csbackup.XXXXX) and paused VMs that never resume (related: KVM NAS backup: VM remains paused indefinitely when backup job fails #12821)
  3. No space check — backups silently fail or produce corrupt output when the NAS is full
  4. No quiesce — running VM backups are crash-consistent only, even when qemu-guest-agent is available
  5. Broken error handling$? after a pipe always returns the pipe's exit status, not the command's

Test plan

  • Backup a running VM with sufficient space — verify backup completes normally
  • Backup a running VM with BACKUP_TIMEOUT=30 — verify timeout and domjobabort
  • Backup with NAS at <1GB free — verify pre-flight rejection
  • Kill backup mid-run — verify cleanup resumes VM and unmounts NFS
  • Backup with -q flag and qemu-guest-agent installed — verify fsfreeze/thaw in logs
  • Backup with -q flag without qemu-guest-agent — verify graceful fallback
  • Delete backup — verify cleanup trap handles unmount
  • Backup a stopped VM — verify space check and error handling on disk convert

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.11%. Comparing base (a0aafe2) to head (9cbf928).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12843      +/-   ##
============================================
+ Coverage     18.10%   18.11%   +0.01%     
- Complexity    16746    16758      +12     
============================================
  Files          6037     6037              
  Lines        542933   542783     -150     
  Branches      66487    66454      -33     
============================================
+ Hits          98283    98322      +39     
+ Misses       433602   433416     -186     
+ Partials      11048    11045       -3     
Flag Coverage Δ
uitests 3.51% <ø> (ø)
unittests 19.28% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@abh1sar
Copy link
Copy Markdown
Contributor

abh1sar commented Mar 18, 2026

@jmsperu can you please check on 4.21?
There were lots of improvement done to the nasbackup.sh script
Quiesce support was also added (which can also be opt out of)

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented Mar 18, 2026

@jmsperu can you please check on 4.21? There were lots of improvement done to the nasbackup.sh script Quiesce support was also added (which can also be opt out of)

Thanks for the heads up! I see 4.22 already has quiesce and improved error handling. I'll rebase onto 4.22 and update the PR to only include what's still missing: backup timeout, trap handler with VM resume, and free space check. The other PRs (compression, verification, bandwidth throttle, encryption, parallel execution) are still new features on top of 4.22.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Belated reply @abh1sar — you were right, there's significant overlap with what's already landed on main. The quiesce support and EXIT_CLEANUP_FAILED handling from this PR have since been merged upstream (likely via the broader nasbackup.sh improvements in #12822 and others), so those bits are now redundant.

What's still uniquely unaddressed by what landed:

  • BACKUP_TIMEOUT env var (default 21600s = 6h) wrapping the long-running virsh backup-begin poll loop so a stuck backup eventually fails rather than holding the agent forever
  • MIN_FREE_SPACE check against the mounted NAS before qemu-img convert runs, with a fast-fail when the NAS doesn't have enough free space — currently we discover this mid-write
  • trap cleanup EXIT so cleanup() runs even on SIGTERM / shell death (today cleanup() only runs on graceful exits)

I'll rebase against main to drop the overlapping changes and keep only those three, then push again. If you'd rather close this and open a fresh narrower PR with just those bits, also happy to do that — your call.

…M resume

Three independent reliability fixes for the KVM NAS backup script, layered on
top of the existing quiesce + EXIT_CLEANUP_FAILED groundwork:

1. BACKUP_TIMEOUT env var (default 6h) bounds the libvirt domjobinfo wait
   loop in backup_running_vm. Today a stuck QEMU backup holds the agent's
   command slot until the orchestrator-level timeout fires. The new guard
   issues domjobabort and exits non-zero so the agent reclaims the slot
   promptly.

2. MIN_FREE_SPACE env var (default 1 GiB) + check_free_space() runs after
   mount and before any qemu-img convert in both backup_running_vm and
   backup_stopped_vm. Fail-fast on a near-full NAS instead of failing
   mid-write halfway through a multi-GiB convert.

3. trap cleanup EXIT replaces the six explicit cleanup() call sites as the
   primary cleanup mechanism so orphan NFS mounts no longer accumulate when
   the script dies to SIGTERM, SIGINT, or any uncaught set -e failure
   between the explicit call sites. cleanup() is now guarded by
   CLEANUP_DONE so the trap doesn't re-run an already-completed cleanup
   from an explicit call.

cleanup() additionally resumes the VM if it's still paused — backup-begin
holds the guest paused briefly and a failed backup mid-pause currently
leaves the guest stuck in 'paused' state until an operator intervenes.

Targets main; supersedes the 4.20-targeted version of this PR.
@jmsperu jmsperu force-pushed the fix/nasbackup-timeout-spacecheck-trap branch from 937e646 to 0deb4c4 Compare May 22, 2026 21:56
@jmsperu jmsperu changed the base branch from 4.20 to main May 22, 2026 21:56
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Rebased. Force-pushed 0deb4c4 and changed the base from 4.20main.

What landed in main since this PR was opened (via other PRs):

  • QUIESCE arg + freeze/thaw
  • EXIT_CLEANUP_FAILED exit code
  • Basic cleanup() function (called explicitly at 6 sites)

So this PR is now reduced to just the three improvements those didn't cover:

  1. BACKUP_TIMEOUT env var (default 6h) bounds the domjobinfo wait loop in backup_running_vm so a stuck QEMU backup hits domjobabort and frees the agent's command slot rather than holding it open until the orchestrator-level timeout.

  2. MIN_FREE_SPACE env var (default 1 GiB) + check_free_space() runs after mount and before any qemu-img convert in both backup paths — fail-fast on a near-full NAS instead of failing mid-write.

  3. trap cleanup EXIT replaces the explicit calls as the primary cleanup mechanism so orphan NFS mounts no longer accumulate on SIGTERM/SIGINT or set -e failures between the explicit call sites. cleanup() is guarded by a CLEANUP_DONE flag so the trap doesn't re-run an already-completed cleanup from an explicit call. cleanup() also now resumes the VM if it's still paused — a backup that dies mid-pause currently leaves the guest stuck paused until an operator intervenes.

Total: 71 +/3 - on one file. Ready for another look @abh1sar.

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Friendly nudge — the new push is sitting on GitHub Actions action_required, so no CI has actually run yet against the rebase. Could a committer (@abh1sar @DaanHoogland) click Approve and run when you have a moment? Thanks!

@DaanHoogland
Copy link
Copy Markdown
Contributor

Friendly nudge — the new push is sitting on GitHub Actions action_required, so no CI has actually run yet against the rebase. Could a committer (@abh1sar @DaanHoogland) click Approve and run when you have a moment? Thanks!

I’ve added you to the collaborators list @jmsperu . With great privileges comes great responsibility… ;)

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 26, 2026

Friendly bump — this adds defensive hardening around nasbackup.sh:

  • Timeout on the long-running operations (so a stuck mount/backup doesn't pin the agent thread)
  • Free-space pre-flight on the backup target (fail early with a clear message vs. partial-write garbage)
  • Trap-based cleanup (always unmount, never leak the temp dir or the mount even on signal/error)
  • Resume the VM if backup failed mid-flight (paired with NAS backup: resume paused VM on backup failure and fix missing exit #12822)
  • Quiesce support fixes

Latest CI had a single shard failure in component/test_affinity_groups_projects — unrelated to this script (patch only touches nasbackup.sh). Just retriggered. @DaanHoogland @sureshanaparti — review welcome.

@DaanHoogland
Copy link
Copy Markdown
Contributor

I’ll look at this and your other PRs tomorrow, bandwith permitting @jmsperu .

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the KVM NAS backup helper script (nasbackup.sh) to improve reliability by adding a backup polling timeout, introducing an EXIT cleanup trap to avoid orphaned mounts/paused VMs, adding a pre-flight free-space check, and supporting optional guest filesystem quiescing via qemu-guest-agent.

Changes:

  • Add BACKUP_TIMEOUT and enforce it in the domjobinfo polling loop (abort via domjobabort on timeout).
  • Add check_free_space() and call it before writing backup data.
  • Add an idempotent cleanup() and install trap cleanup EXIT to ensure cleanup on all exit paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 186 to 188
Failed)
echo "Virsh backup job failed"
cleanup ;;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this remark makes sense; either catch and evaluate the return of the cleanup() call or add a break statement.

Comment on lines +330 to +333
if [[ -n "$mount_point" && -d "$mount_point" ]]; then
umount "$mount_point" 2>/dev/null || { echo "Failed to unmount $mount_point"; status=1; }
rmdir "$mount_point" 2>/dev/null || true
fi
Comment on lines 287 to 288
fi
}
Comment on lines 230 to 234
mount_operation
check_free_space
mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 1; }

IFS=","
Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

co-pilot has some smarter comments than I ;)

# Idempotent: skip if a prior explicit call already ran. Without this guard,
# the EXIT trap would re-run cleanup and fail on the already-unmounted point.
[[ $CLEANUP_DONE -eq 1 ]] && return 0
CLEANUP_DONE=1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this line move to the end of the method, when cleanup is actually done?

Comment on lines 186 to 188
Failed)
echo "Virsh backup job failed"
cleanup ;;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this remark makes sense; either catch and evaluate the return of the cleanup() call or add a break statement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants