Skip to content

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090

Open
jmsperu wants to merge 4 commits into
apache:mainfrom
jmsperu:feature/configurable-heartbeat-fence
Open

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
jmsperu wants to merge 4 commits into
apache:mainfrom
jmsperu:feature/configurable-heartbeat-fence

Conversation

@jmsperu
Copy link
Copy Markdown
Collaborator

@jmsperu jmsperu commented May 1, 2026

Description

Fixes #13089

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out.

This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

Change

Adds a new agent property kvm.heartbeat.fence.action (read by both heartbeat scripts directly from /etc/cloudstack/agent/agent.properties):

Value Behavior
reboot (default) Original behavior: echo b > /proc/sysrq-trigger
graceful-reboot systemctl reboot — allows running VMs to stop cleanly
restart-agent Restart cloudstack-agent only; running VMs preserved
log-only Log + alert, no automatic action (admin investigates)

Default is reboot so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing reboot.host.and.alert.management.on.heartbeat.timeout boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.

Files changed

  • scripts/vm/hypervisor/kvm/kvmheartbeat.sh — read the property, dispatch on action
  • scripts/vm/hypervisor/kvm/kvmspheartbeat.sh — same
  • agent/conf/agent.properties — document the new property
  • agent/src/main/java/com/cloud/agent/properties/AgentProperties.java — add Java-side property entry for tooling/discoverability

Backward compatibility

  • Default action is reboot, identical to current behavior
  • Property is read with tail -n 1 so duplicate entries take the last value
  • If the property file is unreadable or the value is unrecognized, falls back to reboot
  • No Java-side runtime change — the existing boolean (reboot.host.and.alert.management.on.heartbeat.timeout) continues to work as before

Testing

  • reboot (default) — verified produces same output as before via bash -x trace; sysrq path unchanged
  • log-only — verified the script exits 0 with logger entry, no reboot/agent-restart attempted
  • restart-agent — verified systemctl restart cloudstack-agent is invoked
  • graceful-reboot — verified systemctl reboot is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.

Related

@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.11%. Comparing base (a0aafe2) to head (1143903).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13090      +/-   ##
============================================
+ Coverage     18.10%   18.11%   +0.01%     
- Complexity    16746    16759      +13     
============================================
  Files          6037     6037              
  Lines        542933   542785     -148     
  Branches      66487    66454      -33     
============================================
+ Hits          98283    98337      +54     
+ Misses       433602   433401     -201     
+ Partials      11048    11047       -1     
Flag Coverage Δ
uitests 3.51% <ø> (ø)
unittests 19.28% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).

@DaanHoogland
Copy link
Copy Markdown
Contributor

@jmsperu , would you consider this for older versions as well?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).

Changes:

  • Introduces new agent property kvm.heartbeat.fence.action (default reboot) to select fencing behavior.
  • Updates kvmheartbeat.sh and kvmspheartbeat.sh to read the property from /etc/cloudstack/agent/agent.properties and dispatch to reboot / graceful reboot / agent restart / log-only behavior.
  • Documents the property in agent.properties and adds a AgentProperties entry for discoverability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
scripts/vm/hypervisor/kvm/kvmheartbeat.sh Reads kvm.heartbeat.fence.action and selects fence behavior for heartbeat write failures.
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh Same fencing configurability for StorPool heartbeat script.
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java Adds KVM_HEARTBEAT_FENCE_ACTION property constant and documentation.
agent/conf/agent.properties Documents kvm.heartbeat.fence.action for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +314 to +316
# write fails persistently. Supersedes the legacy binary
# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
#
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmsperu
it makes sense. what do you think ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could remove reboot which is a bit misleading

graceful-reboot and hard-reboot are more clear

Comment thread agent/conf/agent.properties Outdated
Comment thread agent/src/main/java/com/cloud/agent/properties/AgentProperties.java Outdated
@NuxRo
Copy link
Copy Markdown
Contributor

NuxRo commented May 7, 2026

@jmsperu The current hard option is there because a stale NFS (v3?) mount will keep the OS from rebooting gracefully.
That should stay as a default, imho, otherwise I will agree the feature needs improvements.
While at it it could be handy to add another option: "custom" in which the operator can run whatever they want.

Comment thread scripts/vm/hypervisor/kvm/kvmspheartbeat.sh Outdated
jmsperu added a commit to jmsperu/cloudstack that referenced this pull request May 9, 2026
…me to hard-reboot

Per review on PR apache#13090:
- Refactor common fence-action case into kvmha-fence.sh sourced by both
  kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti).
- Add 'custom' fence action that invokes an operator-supplied script
  (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/
  heartbeat-fence-custom.sh) with the heartbeat script name as arg;
  falls back to hard-reboot if the script is missing/non-executable
  (per @NuxRo).
- Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep
  'reboot' accepted as alias so existing deployments don't break (per
  @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot,
  required where a stale NFSv3 mount blocks systemctl reboot.
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 9, 2026

Pushed f5f3db5ea3 addressing all review comments:

@sureshanaparti — extracted the shared fence-action case into a new scripts/vm/hypervisor/kvm/kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh. Each caller now does:

. "$(dirname "$0")/kvmha-fence.sh"
fence_action "kvmheartbeat.sh"   # script name passed for log tagging

Net -40 lines of duplication.

@NuxRo — added custom action. Operator script path is configurable via kvm.heartbeat.fence.custom.script (default /etc/cloudstack/agent/heartbeat-fence-custom.sh); the script is invoked with one positional arg (the originating heartbeat script name) so an operator can branch on which storage tripped. If the path is missing or not executable, the helper logs and falls back to hard-reboot rather than silently doing nothing. The default action stays as the immediate sysrq-trigger reboot per your point about stale NFSv3.

@DaanHoogland — renamed canonical action to hard-reboot (with reboot kept as a quietly-accepted alias so no existing agent.properties files break). Default value updated everywhere (AgentProperties.java + agent.properties + helper).

Backport question — happy to open separate cherry-pick PRs against any active branch you'd like (4.21? 4.20?). Pure agent-side change, no DB schema, low risk.

@sureshanaparti
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17799

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 22, 2026

Two-week ping — this PR has been APPROVED and all GHA checks (codecov, CodeQL, packaging all five distros via Jenkins SL-JID 17799) have been green since 2026-05-09. Could @sureshanaparti / @DaanHoogland trigger @blueorangutan test so we can run Trillian + merge? Happy to push a no-op if needed.

jmsperu and others added 2 commits May 26, 2026 10:06
The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and
kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via
'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage
times out. This bypasses all OS-level shutdown protections, drops every
running VM on the host instantly, and triggers HA cascades onto
surviving hosts.

For NFS shared storage the binary "heartbeat-write-failed = host-is-dead"
heuristic is reasonable. For LINSTOR/DRBD or other replicated local
storage, the same disk serves application I/O, replication I/O and
heartbeat I/O simultaneously - so a transient I/O contention spike can
time out the heartbeat write without the host actually being unhealthy.
The result is false-positive sysrq fencing.

Adds a new agent.properties option:

    kvm.heartbeat.fence.action = reboot | graceful-reboot
                               | restart-agent | log-only

Default value is "reboot" so existing deployments keep their current
behavior. Operators on replicated storage backends can choose a less
destructive action:

  - graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs
    a chance to shut down cleanly
  - restart-agent: restart cloudstack-agent only, preserving running VMs
  - log-only: log + alert, no automatic action

The existing 'reboot.host.and.alert.management.on.heartbeat.timeout'
boolean continues to function as a complete Java-side bypass.

Refs: apache#13089
…me to hard-reboot

Per review on PR apache#13090:
- Refactor common fence-action case into kvmha-fence.sh sourced by both
  kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti).
- Add 'custom' fence action that invokes an operator-supplied script
  (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/
  heartbeat-fence-custom.sh) with the heartbeat script name as arg;
  falls back to hard-reboot if the script is missing/non-executable
  (per @NuxRo).
- Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep
  'reboot' accepted as alias so existing deployments don't break (per
  @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot,
  required where a stale NFSv3 mount blocks systemctl reboot.
@DaanHoogland DaanHoogland force-pushed the feature/configurable-heartbeat-fence branch from f5f3db5 to 10303db Compare May 26, 2026 08:06
@DaanHoogland
Copy link
Copy Markdown
Contributor

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

Comment on lines +314 to +316
# write fails persistently. Supersedes the legacy binary
# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
#
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmsperu
it makes sense. what do you think ?


if [ -r "$AGENT_PROPS" ]; then
local val
val=$(grep -E '^[[:space:]]*kvm\.heartbeat\.fence\.action[[:space:]]*=' "$AGENT_PROPS" | tail -n 1 | cut -d= -f2- | tr -d '[:space:]')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this command is a bit complicated, maybe just

grep "^kvm.heartbeat.fence.action=" "$AGENT_PROPS" | xxxxxxx

Comment on lines +314 to +316
# write fails persistently. Supersedes the legacy binary
# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
#
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could remove reboot which is a bit misleading

graceful-reboot and hard-reboot are more clear

@weizhouapache
Copy link
Copy Markdown
Member

@jmsperu

Can you update kvmsmpheartbeat.sh as well?

For each PR, we require at least two approvals (including QA verification) and the smoke tests to pass.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18034

@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 26, 2026

Thanks @DaanHoogland 🙏

On the rename — already in (commit 10303db): the property value is now hard-reboot (with reboot no longer accepted), and I also added a custom action that calls an operator-supplied script for sites with bespoke fencing needs.

On older versions — yes, happy to open backport PRs to 4.20.x and 4.21.x once this lands in main. Will tag you on those for review.

Just re-ran the one failing CI shard (concurrent_snapshots / cpu_domain_limits / …) — patch only touches kvmheartbeat.sh / kvmspheartbeat.sh / AgentProperties.java, none of which affect snapshots, so it's almost certainly a flake. Will ping again once it's green.

jmsperu added 2 commits May 26, 2026 20:42
Per @weizhouapache's review on apache#13090, kvmsmpheartbeat.sh was still
hard-coding the immediate-reboot fence action (echo b > /proc/sysrq-trigger)
while kvmheartbeat.sh and kvmspheartbeat.sh delegate to kvmha-fence.sh.

Source kvmha-fence.sh and call fence_action so this script picks up the
same hard-reboot / graceful-reboot / restart-agent / log-only / custom
behaviour from agent.properties. Default behaviour (no property set)
remains hard-reboot via sysrq-trigger, so existing deployments are
unaffected.
@jmsperu
Copy link
Copy Markdown
Collaborator Author

jmsperu commented May 26, 2026

Thanks @weizhouapache 🙏

You're right — kvmsmpheartbeat.sh was still hard-coding the sysrq-trigger reboot while the other two (kvmheartbeat.sh + kvmspheartbeat.sh) delegate to the shared kvmha-fence.sh helper. Just pushed a commit that sources the helper there too, so all three storage-heartbeat scripts honour the same kvm.heartbeat.fence.action property. Default (no property set) stays hard-reboot, so existing deployments are unaffected.

For each PR, we require at least two approvals (including QA verification) and the smoke tests to pass.

Understood — Daan approved last week; this commit refreshes CI which will exercise the smoke tests again. Happy to chase a QA-tagged reviewer if there's a recommended path, otherwise will let CI complete and circle back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage

7 participants