Skip to content

<fix>[kvm]: TIC-5776 backport memory hotfix#4066

Closed
ZStack-Robot wants to merge 2 commits into
feature-ky10-gfbfrom
sync/haidong.pang/fix/TIC-5776@@3
Closed

<fix>[kvm]: TIC-5776 backport memory hotfix#4066
ZStack-Robot wants to merge 2 commits into
feature-ky10-gfbfrom
sync/haidong.pang/fix/TIC-5776@@3

Conversation

@ZStack-Robot
Copy link
Copy Markdown
Collaborator

Summary

Backport TIC-5776 kvmagent memory hotfix into feature-ky10-gfb.

Includes the ZSTAC-71000 management-node KVM memory monitor implementation and the ZSTAC-72552 host ping/restart safeguard.

Changes

  • Add kvmagent physical memory statistics command path, configs, and host handling.
  • Add RestartKvmAgent message/reply plumbing.
  • Add host ping skip/cancel handling while kvmagent is busy.

Testing

  • Full zstack + premium build: scripts/backend-build-container/container-run full-build
  • zstacklib package: scripts/backend-build-container/container-run python-package zstack-utility/zstacklib
  • kvmagent package: scripts/backend-build-container/container-run python-package zstack-utility/kvmagent
  • CI pipeline
  • Live TIC-5776 environment verification

Related: TIC-5776
Backports: ZSTAC-71000, ZSTAC-72552

sync from gitlab !9965

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Review Change Stack

Warning

Rate limit exceeded

@MatheMatrix has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 55 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d0d58b12-163e-4aee-9a56-9373dfab8c2d

📥 Commits

Reviewing files that changed from the base of the PR and between 6197976 and fb9e84e.

⛔ Files ignored due to path filters (2)
  • conf/globalConfig/kvm.xml is excluded by !**/*.xml
  • test/src/test/resources/globalConfig/kvm.xml is excluded by !**/*.xml
📒 Files selected for processing (12)
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
  • core/src/main/java/org/zstack/core/CoreManagerImpl.java
  • header/src/main/java/org/zstack/header/core/GetLocalTaskMsg.java
  • header/src/main/java/org/zstack/header/host/HostCanonicalEvents.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMConstant.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
  • plugin/kvm/src/main/java/org/zstack/kvm/RestartKvmAgentMsg.java
  • plugin/kvm/src/main/java/org/zstack/kvm/RestartKvmAgentReply.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy

功能概览

该 PR 为 ZStack KVM 插件引入 KVM 代理物理内存监控与自适应 ping 跳过机制。当代理内存使用超过硬限制时自动触发重启流程,当代理处于繁忙状态时动态跳过主机 ping,同时支持跨管理节点的任务队列空闲检查以保证重启安全性。

变更内容

KVM 代理内存监控与 Ping 自适应控制

层 / 文件 摘要
消息与事件契约定义
header/src/main/java/org/zstack/header/core/GetLocalTaskMsg.java, plugin/kvm/src/main/java/org/zstack/kvm/RestartKvmAgentMsg.java, plugin/kvm/src/main/java/org/zstack/kvm/RestartKvmAgentReply.java, header/src/main/java/org/zstack/header/host/HostCanonicalEvents.java
GetLocalTaskMsg 增加 onlyRunningTask 标志用于任务队列过滤;新增 RestartKvmAgentMsg/Reply 消息类;HostCanonicalEvents 定义 HOST_PING_SKIP、HOST_PING_CANCEL_SKIP、HOST_PROCESS_PHYSICAL_MEMORY_USAGE_ABNORMAL 三个事件路径与 HostPingSkipData、HostProcessPhysicalMemoryUsageAlarmData 数据承载类。
全局配置与常量扩展
plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java, plugin/kvm/src/main/java/org/zstack/kvm/KVMConstant.java, plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java
KVMGlobalConfig 新增 RECONNECT_HOST_RESTART_LIBVIRTD_SERVICE、KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD、KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT 三项配置;KVMConstant 新增 HOST_PROCESS_PHYSICAL_MEMORY_USAGE_ALARM_PATH、HOST_KVMAGENT_STATUS_PATH 两个路径常量;PingCmd 扩展 configs 字段,新增 HostProcessPhysicalMemoryUsageAlarmCmd 和 HostKvmagentStatusCmd 两个命令类。
KVMHostFactory 告警处理与配置管理
plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
新增线程安全 configs 缓存容器与 skipHostPingTimeWhenKvmagentBusy 参数;start() 中初始化内存阈值配置并监听更新;新增 processKvmagentPhysicalMemUsageAbnormal 方法在超限时触发重启;实现 HOST_PROCESS_PHYSICAL_MEMORY_USAGE_ALARM_PATH 和 HOST_KVMAGENT_STATUS_PATH 同步回调处理器,分别处理内存告警与代理状态变化。
KVMHost 代理重启流程
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
新增 handleLocalMessage() 对 RestartKvmAgentMsg 的分支处理;实现 handle(RestartKvmAgentMsg) 和 restartKvmAgentOnHost() 完整重启流程,包含主机所有权验证、可选的跨节点任务队列空闲检查、SSH 重启命令执行、kvmagent 就绪轮询、ReconnectHostMsg 发送;pingHook 为 PingCmd 填充 configs。
主机 Ping 跳过缓存与事件监听
compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
新增 skippedPingHostDeadline 缓存(1小时写入过期);Tracker.track() 检查缓存跳过 ping;onHostPingSkip() 监听 HOST_PING_SKIP/HOST_PING_CANCEL_SKIP 事件,动态更新/失效跳过截止时间。
本地任务过滤扩展
core/src/main/java/org/zstack/core/CoreManagerImpl.java
handle(GetLocalTaskMsg) 新增条件分支:当 isOnlyRunningTask() 为真时清空返回结果中各 ChainInfo 的 pendingTask,支持跨节点任务队列检查。
配置与内存阈值测试验证
test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy
testHostReconnectAfterPingFailure 中从 PingCmd.configs 提取物理内存告警和硬限制值;验证初始字节值;更新全局配置后断言缓存同步刷新;新增 GlobalConfigFacadeImpl bean 注入与配置更新测试。

🎯 4 (复杂) | ⏱️ ~45 分钟

小兔蹦蹦跳,监控忙碌的代理小伙;
内存超限时重启别烦恼,
Ping 自适应跳过显智巧,
任务队列检查保安全,
缓存过期自失效,妙哉~ 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题清晰简洁地反映了本次变更的核心内容——回溯一个针对 KVM agent 内存管理的修复补丁到指定分支。
Description check ✅ Passed 描述详细阐述了此次回溯的目标、包含的功能变更(内存监控与 ping 跳过机制)以及已完成的测试项,与文件摘要内容完全相关。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/haidong.pang/fix/TIC-5776@@3

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java`:
- Around line 130-133: Read skippedPingHostDeadline.getIfPresent(uuid) once into
a local Long variable (e.g., Long deadline =
skippedPingHostDeadline.getIfPresent(uuid)) in HostTrackImpl (the method
containing the shown block), then check deadline != null &&
System.currentTimeMillis() / 1000 <= deadline and use that same deadline
variable in the logger.debug call; this avoids repeated cache reads, possible
NPE/auto-unboxing, and ensures the logged value matches the checked value while
keeping the continueToRunThisTimer() and return behavior unchanged.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java`:
- Around line 879-895: The current flow always calls trigger.next() even when
ErrorCodeList shows all retry probes failed (the block in the
done(ErrorCodeList) uses errorCodeList.getCauses().size() == retryCount but
still calls trigger.next()), and then sends a fire-and-forget ReconnectHostMsg
which can cause RestartKvmAgentReply to be reported successful prematurely;
change the done(ErrorCodeList) handler to call trigger.fail(...) or set an error
in the workflow when probes exhausted (use the ErrorCode from ErrorCodeList to
construct the failure), and change the subsequent reconnect step to send
ReconnectHostMsg with a callback (bus.send with a reply handler) and only call
trigger.next() when the reconnect reply indicates success; reference
ErrorCodeList, retryCount, trigger.next()/trigger.fail(), ReconnectHostMsg and
the reconnect run(FlowTrigger, Map) to locate and update the code.
- Around line 747-756: GetLocalTaskReply gr may not contain an entry for the
syncSignature so ChainInfo chainInfo = gr.getResults().get(syncSignature) can be
null and calling chainInfo.getRunningTask() causes an NPE; update the block
around GetLocalTaskReply/chainInfo to null-check the result from
gr.getResults().get(syncSignature) (using the local variables gr, syncSignature,
ChainInfo) and treat a null ChainInfo as "no running tasks" (i.e. call
compl.done() and continue) while preserving the existing branch where a
non-empty chainInfo.getRunningTask() sets queueIsEmpty to false, logs via
logger.debug (including self.getUuid(), mnId, buildTaskInfo(...)) and calls
compl.allDone()/return.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java`:
- Around line 544-550: The switch directly uses data.getProcessName() which may
be null and cause an NPE; update KVMHostFactory to first store
data.getProcessName() into a local variable, check for null/empty (log a
suitable message via logger.debug or handle as default) and only then perform
the switch on that non-null value; keep the existing case "zstack-kvmagent"
calling processKvmagentPhysicalMemUsageAbnormal(cmd) and the default branch
logging unknown process names.

In
`@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`:
- Around line 126-137: This test mutates
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD and
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT without restoring
them, risking cross-test interference; wrap the modifications and assertions
(the blocks using SizeUtils.sizeStringToBytes and retryInSecs) in a try/finally
where you capture the original values of
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD and
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT before changing them
and restore those originals in the finally block so the global config is
returned to its prior state regardless of test outcome.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5dadea20-c8c4-4580-9182-df8d1b308a89

📥 Commits

Reviewing files that changed from the base of the PR and between 12bb2b7 and 6197976.

⛔ Files ignored due to path filters (2)
  • conf/globalConfig/kvm.xml is excluded by !**/*.xml
  • test/src/test/resources/globalConfig/kvm.xml is excluded by !**/*.xml
📒 Files selected for processing (12)
  • compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java
  • core/src/main/java/org/zstack/core/CoreManagerImpl.java
  • header/src/main/java/org/zstack/header/core/GetLocalTaskMsg.java
  • header/src/main/java/org/zstack/header/host/HostCanonicalEvents.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMAgentCommands.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMConstant.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMGlobalConfig.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
  • plugin/kvm/src/main/java/org/zstack/kvm/RestartKvmAgentMsg.java
  • plugin/kvm/src/main/java/org/zstack/kvm/RestartKvmAgentReply.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy

Comment on lines +130 to +133
if (skippedPingHostDeadline.getIfPresent(uuid) != null && System.currentTimeMillis() / 1000 <= skippedPingHostDeadline.getIfPresent(uuid)) {
logger.debug(String.format("skip tracking host[uuid:%s] this time, deadline %s", uuid, skippedPingHostDeadline.getIfPresent(uuid)));
continueToRunThisTimer();
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

避免重复读取缓存导致潜在空指针

Line [130]-Line [131] 对 getIfPresent(uuid) 进行了重复读取;在并发 invalidate/expire 时,后续读取可能变为 null,触发自动拆箱异常。建议先读取到局部变量再判断与打印。

建议修复
-            if (skippedPingHostDeadline.getIfPresent(uuid) != null && System.currentTimeMillis() / 1000 <= skippedPingHostDeadline.getIfPresent(uuid)) {
-                logger.debug(String.format("skip tracking host[uuid:%s] this time, deadline %s", uuid, skippedPingHostDeadline.getIfPresent(uuid)));
+            Long deadline = skippedPingHostDeadline.getIfPresent(uuid);
+            if (deadline != null && System.currentTimeMillis() / 1000 <= deadline) {
+                logger.debug(String.format("skip tracking host[uuid:%s] this time, deadline %s", uuid, deadline));
                 continueToRunThisTimer();
                 return;
             }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (skippedPingHostDeadline.getIfPresent(uuid) != null && System.currentTimeMillis() / 1000 <= skippedPingHostDeadline.getIfPresent(uuid)) {
logger.debug(String.format("skip tracking host[uuid:%s] this time, deadline %s", uuid, skippedPingHostDeadline.getIfPresent(uuid)));
continueToRunThisTimer();
return;
Long deadline = skippedPingHostDeadline.getIfPresent(uuid);
if (deadline != null && System.currentTimeMillis() / 1000 <= deadline) {
logger.debug(String.format("skip tracking host[uuid:%s] this time, deadline %s", uuid, deadline));
continueToRunThisTimer();
return;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/host/HostTrackImpl.java` around
lines 130 - 133, Read skippedPingHostDeadline.getIfPresent(uuid) once into a
local Long variable (e.g., Long deadline =
skippedPingHostDeadline.getIfPresent(uuid)) in HostTrackImpl (the method
containing the shown block), then check deadline != null &&
System.currentTimeMillis() / 1000 <= deadline and use that same deadline
variable in the logger.debug call; this avoids repeated cache reads, possible
NPE/auto-unboxing, and ensures the logged value matches the checked value while
keeping the continueToRunThisTimer() and return behavior unchanged.

Comment on lines +747 to +756
GetLocalTaskReply gr = r.castReply();
ChainInfo chainInfo = gr.getResults().get(syncSignature);
if (!chainInfo.getRunningTask().isEmpty()) {
logger.debug(String.format("%s tasks exist on the running task of host %s, mnNodeId %s: %s...", chainInfo.getRunningTask().size(),
self.getUuid(), mnId, buildTaskInfo(chainInfo.getRunningTask())));
queueIsEmpty.set(false);
compl.allDone();
return;
}
compl.done();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

补上空结果保护,避免跨节点查询在无任务节点上直接 NPE。

这里默认每个管理节点都会返回 syncSignature 对应的 ChainInfo,但没有本地任务的节点很可能不会带这个 key。此时 gr.getResults().get(syncSignature) 会是 null,后面的 chainInfo.getRunningTask() 会让重启流程在正常场景下直接失败。

建议修改
                     GetLocalTaskReply gr = r.castReply();
-                    ChainInfo chainInfo = gr.getResults().get(syncSignature);
-                    if (!chainInfo.getRunningTask().isEmpty()) {
+                    ChainInfo chainInfo = gr.getResults() == null ? null : gr.getResults().get(syncSignature);
+                    List<? extends TaskInfo> runningTasks = chainInfo == null || chainInfo.getRunningTask() == null
+                            ? Collections.emptyList()
+                            : chainInfo.getRunningTask();
+                    if (!runningTasks.isEmpty()) {
                         logger.debug(String.format("%s tasks exist on the running task of host %s, mnNodeId %s: %s...", chainInfo.getRunningTask().size(),
-                                self.getUuid(), mnId, buildTaskInfo(chainInfo.getRunningTask())));
+                                self.getUuid(), mnId, buildTaskInfo(runningTasks)));
                         queueIsEmpty.set(false);
                         compl.allDone();
                         return;
                     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 747 - 756,
GetLocalTaskReply gr may not contain an entry for the syncSignature so ChainInfo
chainInfo = gr.getResults().get(syncSignature) can be null and calling
chainInfo.getRunningTask() causes an NPE; update the block around
GetLocalTaskReply/chainInfo to null-check the result from
gr.getResults().get(syncSignature) (using the local variables gr, syncSignature,
ChainInfo) and treat a null ChainInfo as "no running tasks" (i.e. call
compl.done() and continue) while preserving the existing branch where a
non-empty chainInfo.getRunningTask() sets queueIsEmpty to false, logs via
logger.debug (including self.getUuid(), mnId, buildTaskInfo(...)) and calls
compl.allDone()/return.

Comment on lines +879 to +895
public void done(ErrorCodeList errorCodeList) {
if (errorCodeList.getCauses().size() == retryCount) {
logger.debug("waiting for kvmagent to start timeout: " + errorCodeList.getCauses().get(0).getDetails());
}
trigger.next();
}
});
}
}).then(new NoRollbackFlow() {
String __name__ = String.format("reconnect host %s after restart kvmagent", self.getUuid());

public void run(FlowTrigger trigger, Map data) {
ReconnectHostMsg rmsg = new ReconnectHostMsg();
rmsg.setHostUuid(self.getUuid());
bus.makeTargetServiceIdByResourceUuid(rmsg, HostConstant.SERVICE_ID, self.getUuid());
bus.send(rmsg);
trigger.next();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

不要在 agent 仍未就绪时继续返回成功。

这里即使 60 次探测全部失败也会 trigger.next(),随后 ReconnectHostMsg 还是 fire-and-forget 发送,没有回调校验。结果是 RestartKvmAgentReply 可能在 agent 没起来、甚至重连失败时仍然返回成功,调用方会误判修复已经完成。

建议修改
                 }).run(new WhileDoneCompletion(trigger) {
                     `@Override`
                     public void done(ErrorCodeList errorCodeList) {
                         if (errorCodeList.getCauses().size() == retryCount) {
-                            logger.debug("waiting for kvmagent to start timeout: " + errorCodeList.getCauses().get(0).getDetails());
+                            trigger.fail(operr("kvmagent on host %s did not become ready after %s retries: %s",
+                                    self.getUuid(), retryCount, errorCodeList.getCauses().get(0).getDetails()));
+                            return;
                         }
                         trigger.next();
                     }
                 });
             }
         }).then(new NoRollbackFlow() {
             String __name__ = String.format("reconnect host %s after restart kvmagent", self.getUuid());

             public void run(FlowTrigger trigger, Map data) {
                 ReconnectHostMsg rmsg = new ReconnectHostMsg();
                 rmsg.setHostUuid(self.getUuid());
                 bus.makeTargetServiceIdByResourceUuid(rmsg, HostConstant.SERVICE_ID, self.getUuid());
-                bus.send(rmsg);
-                trigger.next();
+                bus.send(rmsg, new CloudBusCallBack(trigger) {
+                    `@Override`
+                    public void run(MessageReply reply) {
+                        if (reply.isSuccess()) {
+                            trigger.next();
+                        } else {
+                            trigger.fail(reply.getError());
+                        }
+                    }
+                });
             }
         })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java` around lines 879 - 895,
The current flow always calls trigger.next() even when ErrorCodeList shows all
retry probes failed (the block in the done(ErrorCodeList) uses
errorCodeList.getCauses().size() == retryCount but still calls trigger.next()),
and then sends a fire-and-forget ReconnectHostMsg which can cause
RestartKvmAgentReply to be reported successful prematurely; change the
done(ErrorCodeList) handler to call trigger.fail(...) or set an error in the
workflow when probes exhausted (use the ErrorCode from ErrorCodeList to
construct the failure), and change the subsequent reconnect step to send
ReconnectHostMsg with a callback (bus.send with a reply handler) and only call
trigger.next() when the reconnect reply indicates success; reference
ErrorCodeList, retryCount, trigger.next()/trigger.fail(), ReconnectHostMsg and
the reconnect run(FlowTrigger, Map) to locate and update the code.

Comment on lines +544 to +550
switch (data.getProcessName()) {
case "zstack-kvmagent":
processKvmagentPhysicalMemUsageAbnormal(cmd);
break;
default:
logger.debug(String.format("unknown process name[%s] in host[uuid:%s]", cmd.getProcessName(), cmd.getHostUuid()));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

避免在 switch 中直接使用可能为 nullprocessName

Line 544 当前对 data.getProcessName() 直接 switch,上报缺失该字段时会抛出 NPE,导致该回调处理失败。建议先做空值校验,再进行分支判断。

🔧 建议修改
-            switch (data.getProcessName()) {
-                case "zstack-kvmagent":
-                    processKvmagentPhysicalMemUsageAbnormal(cmd);
-                    break;
-                default:
-                    logger.debug(String.format("unknown process name[%s] in host[uuid:%s]", cmd.getProcessName(), cmd.getHostUuid()));
-            }
+            String processName = cmd.getProcessName();
+            if (StringUtils.isBlank(processName)) {
+                logger.warn(String.format("empty process name in host[uuid:%s]", cmd.getHostUuid()));
+                return null;
+            }
+
+            if ("zstack-kvmagent".equals(processName)) {
+                processKvmagentPhysicalMemUsageAbnormal(cmd);
+            } else {
+                logger.debug(String.format("unknown process name[%s] in host[uuid:%s]", processName, cmd.getHostUuid()));
+            }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java` around lines 544
- 550, The switch directly uses data.getProcessName() which may be null and
cause an NPE; update KVMHostFactory to first store data.getProcessName() into a
local variable, check for null/empty (log a suitable message via logger.debug or
handle as default) and only then perform the switch on that non-null value; keep
the existing case "zstack-kvmagent" calling
processKvmagentPhysicalMemUsageAbnormal(cmd) and the default branch logging
unknown process names.

Comment on lines +126 to +137
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("100M"))
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("200M"))
retryInSecs {
assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("200M")
assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("100M")
}
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("5G"))
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("20G"))
retryInSecs {
assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("20G")
assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("5G")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

测试修改全局配置后应恢复,避免用例串扰

Line [126]-Line [137] 在测试中多次修改全局配置,但未恢复原值,后续用例可能受到影响。建议在该测试内用 try/finally 还原。

建议修复
+        def oldAlarm = KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.value(Long.class)
+        def oldHardLimit = KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.value(Long.class)
+        try {
         KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("100M"))
         KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("200M"))
         retryInSecs {
             assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("200M")
             assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("100M")
         }
         KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("5G"))
         KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("20G"))
         retryInSecs {
             assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("20G")
             assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("5G")
         }
+        } finally {
+            KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(oldAlarm)
+            KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(oldHardLimit)
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("100M"))
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("200M"))
retryInSecs {
assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("200M")
assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("100M")
}
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("5G"))
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("20G"))
retryInSecs {
assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("20G")
assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("5G")
}
def oldAlarm = KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.value(Long.class)
def oldHardLimit = KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.value(Long.class)
try {
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("100M"))
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("200M"))
retryInSecs {
assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("200M")
assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("100M")
}
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(SizeUtils.sizeStringToBytes("5G"))
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(SizeUtils.sizeStringToBytes("20G"))
retryInSecs {
assert kvmagent_mem_hard_limit == SizeUtils.sizeStringToBytes("20G")
assert kvmagent_mem_threshold == SizeUtils.sizeStringToBytes("5G")
}
} finally {
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD.updateValue(oldAlarm)
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT.updateValue(oldHardLimit)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`
around lines 126 - 137, This test mutates
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD and
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT without restoring
them, risking cross-test interference; wrap the modifications and assertions
(the blocks using SizeUtils.sizeStringToBytes and retryInSecs) in a try/finally
where you capture the original values of
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_ALARM_THRESHOLD and
KVMGlobalConfig.KVMAGENT_PHYSICAL_MEMORY_USAGE_HARD_LIMIT before changing them
and restore those originals in the finally block so the global config is
returned to its prior state regardless of test outcome.

shan.wu added 2 commits May 22, 2026 18:09
1、support agent memory monitoring and alarm
2、support automatic dump of thread stack and tracking of objects when the memory continues to increase
3、automatically restart the process when the memory exceeds the threshold

Resolves/Related: ZSTAC-71000

Change-Id: I676f656f6f6d6b6565656e737677666667767170
(cherry picked from commit 8fc2e9d)
1、skip host ping when dumping kvmagent
2、Wait for the kvmagent service to fully start before reconnecting to the host to avoid restarting twice.

Resolves/Related: ZSTAC-72552

Change-Id: I676f656f6f6d6b6565656e737677666667767176
(cherry picked from commit 914853d)

# Conflicts:
#	plugin/kvm/src/main/java/org/zstack/kvm/KVMHostFactory.java
@MatheMatrix MatheMatrix force-pushed the sync/haidong.pang/fix/TIC-5776@@3 branch from 6197976 to fb9e84e Compare May 22, 2026 10:12
@MatheMatrix MatheMatrix deleted the sync/haidong.pang/fix/TIC-5776@@3 branch May 29, 2026 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants