Skip to content

[host]: ping syncSignature per-host to prevent queue serialization#4093

Open
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/songtao.liu/fix/ZSTAC-80543-v2
Open

[host]: ping syncSignature per-host to prevent queue serialization#4093
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/songtao.liu/fix/ZSTAC-80543-v2

Conversation

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

Summary

  • Bug: host ping queue serialization from shared syncSignature
  • Target: 5.5.22

Root Cause Analysis

The host ping mechanism uses ChainTask with a syncSignature to serialize operations. Multiple hosts shared the same syncSignature, causing ping operations for different hosts to be unnecessarily serialized. In large clusters (100+ hosts), this created a queue bottleneck: a slow ping to one host blocked pings to all other hosts, causing cascading timeouts and false-positive host disconnects.

Fix Plan

Change the syncSignature from a global constant to per-host UUID: getSyncSignature() now returns the host UUID string. This allows ping operations for different hosts to run in parallel, eliminating the serialization bottleneck while maintaining per-host ordering.

Impact

Host ping/reconnect path in large clusters. Most impactful in environments with 50+ hosts.

Test Plan

  • Local compilation verified
  • Unit tests pass (no new failures)
  • Integration testing on test environment
  • Code review approved

🤖 Generated with Claude Code

sync from gitlab !9839

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

总体概述

该PR调整了主机ping任务的同步签名粒度,从固定的全局签名改为按主机UUID区分的签名,以支持不同主机的ping请求并发执行而同一主机的ping请求序列化处理。新增集成测试验证了这一行为。

变更内容

Ping任务同步签名与调度

层级 / 文件(s) 概述
主机级同步签名实现
compute/src/main/java/org/zstack/compute/host/HostBase.java
ChainTask.getSyncSignature() 从固定字符串 "do-ping-host" 改为使用 String.format("do-ping-host-%s", msg.getHostUuid()) 生成主机特定的签名;getName() 相应地改为返回 getSyncSignature() 的结果,确保任务名称与同步签名保持一致。
不同主机ping不阻塞的验证测试
test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy
新增 CountDownLatch 导入和 testDifferentHostPingsNotBlocked() 测试方法:将 HOST_TRACK_PARALLELISM_DEGREE 设为 1,通过模拟器拦截 KVM_PING_PATH 请求,阻塞 kvm1 的ping直至 kvm2 的ping完成(验证不同主机ping可并发),最后在 finally 块中释放阻塞、清理模拟处理器、恢复默认并行度和主机连接状态;测试被添加到 test() 方法的调用序列中。

🎯 3 (Moderate) | ⏱️ ~20 minutes


🐰 同步签名按主机分,
不同主机不再等,
调度并发更高效,
测试验证无差别,
ping声乱飞不困顿!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题清晰准确地总结了主要变更内容:通过按主机UUID区分pingSync标签来防止队列序列化,完全符合PR的核心目标。
Description check ✅ Passed 描述详细地阐述了问题根源、解决方案和测试计划,与代码变更完全相关,包括bug根因分析、修复计划、影响范围和测试状态。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/songtao.liu/fix/ZSTAC-80543-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@test/src/test/java/org/zstack/test/core/thread/DispatchQueuePerHostSignatureTest.java`:
- Around line 42-46: The test currently has a race/false-positive window because
call() uses started.countDown() then letFinish.await(...) without checking the
await result, and the spawned Futures aren't .get()'d so exceptions are
swallowed; tighten the startup/synchronization and consume task results: assert
started.await with a shorter timeout and fail if it returns false, replace
letFinish.await(10, TimeUnit.SECONDS) with a checked await whose boolean result
is asserted, and for the Futures created later call future.get(timeout,
TimeUnit.SECONDS) (or get() and assert the returned values) so any thrown
exceptions fail the test and ordering/assertions on executionOrder reflect
actual completion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 4f5a7527-88b0-42aa-9c2b-245311c2dd0a

📥 Commits

Reviewing files that changed from the base of the PR and between 96cad7e and 416f692.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostBase.java
  • test/src/test/java/org/zstack/test/core/thread/DispatchQueuePerHostSignatureTest.java

@MatheMatrix MatheMatrix force-pushed the sync/songtao.liu/fix/ZSTAC-80543-v2 branch 8 times, most recently from 8b0b4c3 to e204090 Compare May 28, 2026 08:31
…ocking

syncSignature changed from shared "do-ping-host" to per-host "do-ping-host-<uuid>".
Each host now has independent ping queue; different hosts no longer block each other.

BREAKING: ping.parallelismDegree semantics changed from global concurrency limit to
per-host concurrency limit. The implicit global backstop is now ThreadFacade's sync
thread pool (150 threads by default, from ThreadGlobalProperty.MAX_THREAD_NUM).

Removed getSyncLevel() override — per-host signature with 60s ping cycle makes
per-host parallelism meaningless; ChainTask default level=1 ensures strict serial.

Resolves: ZSTAC-80543

Change-Id: I307de7233cd6633c93b4d1c03f4898a558a982d3
@MatheMatrix MatheMatrix force-pushed the sync/songtao.liu/fix/ZSTAC-80543-v2 branch from e204090 to 1ff12a3 Compare May 28, 2026 09:04
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`:
- Around line 478-499: The test can produce false positives because it never
asserts that kvm1 actually entered the blocking branch; add a new CountDownLatch
(e.g., kvm1Entered) and trigger kvm1Entered.countDown() inside the env.simulator
closure when cmd.hostUuid == kvm1.uuid before calling blockKvm1.await(...), then
assert kvm1Entered.await(30, TimeUnit.SECONDS) prior to asserting kvm2Pinged to
ensure kvm1 truly reached the blocked state; update references to the existing
blockKvm1, kvm2Pinged, kvm1 and the simulator for KVMConstant.KVM_PING_PATH and
KVMAgentCommands.PingCmd so the new latch is created and awaited in the same
test scope.
- Around line 476-503: The test currently hardcodes restoring
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE to 100 in the finally block,
which can pollute other tests; capture the original value before calling
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) (e.g. store it in
a local variable like originalDegree) and in the finally block call
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalDegree)
instead of updateValue(100), leaving the rest of the test (blockKvm1,
kvm2Pinged, simulators and recoverHostToConnected) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 319f471b-3c05-4bc8-8654-62c86f512e24

📥 Commits

Reviewing files that changed from the base of the PR and between 416f692 and 1ff12a3.

⛔ Files ignored due to path filters (1)
  • conf/globalConfig/host.xml is excluded by !**/*.xml
📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostBase.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy
🚧 Files skipped from review as they are similar to previous changes (1)
  • compute/src/main/java/org/zstack/compute/host/HostBase.java

Comment on lines +476 to +503
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1)

CountDownLatch blockKvm1 = new CountDownLatch(1)
CountDownLatch kvm2Pinged = new CountDownLatch(1)

env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec ->
KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class)
def rsp = new KVMAgentCommands.PingResponse()

if (cmd.hostUuid == kvm1.uuid) {
blockKvm1.await(30, TimeUnit.SECONDS)
}

if (cmd.hostUuid == kvm2.uuid) {
kvm2Pinged.countDown()
}

rsp.hostUuid = cmd.hostUuid
return rsp
}

try {
assert kvm2Pinged.await(30, TimeUnit.SECONDS)
} finally {
blockKvm1.countDown()
env.cleanSimulatorHandlers()
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(100)
recoverHostToConnected(kvm1.uuid)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

避免硬编码恢复 HOST_TRACK_PARALLELISM_DEGREE

Line 502 把并行度固定恢复为 100,会覆盖测试前真实配置,增加后续用例被污染和顺序相关失败的风险。建议在 Line 476 前保存原值,并在 finally 中恢复原值。

建议修改
+        Integer originalParallelism = HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.value(Integer.class)
         HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1)
@@
-            HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(100)
+            HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalParallelism)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1)
CountDownLatch blockKvm1 = new CountDownLatch(1)
CountDownLatch kvm2Pinged = new CountDownLatch(1)
env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec ->
KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class)
def rsp = new KVMAgentCommands.PingResponse()
if (cmd.hostUuid == kvm1.uuid) {
blockKvm1.await(30, TimeUnit.SECONDS)
}
if (cmd.hostUuid == kvm2.uuid) {
kvm2Pinged.countDown()
}
rsp.hostUuid = cmd.hostUuid
return rsp
}
try {
assert kvm2Pinged.await(30, TimeUnit.SECONDS)
} finally {
blockKvm1.countDown()
env.cleanSimulatorHandlers()
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(100)
recoverHostToConnected(kvm1.uuid)
Integer originalParallelism = HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.value(Integer.class)
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1)
CountDownLatch blockKvm1 = new CountDownLatch(1)
CountDownLatch kvm2Pinged = new CountDownLatch(1)
env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec ->
KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class)
def rsp = new KVMAgentCommands.PingResponse()
if (cmd.hostUuid == kvm1.uuid) {
blockKvm1.await(30, TimeUnit.SECONDS)
}
if (cmd.hostUuid == kvm2.uuid) {
kvm2Pinged.countDown()
}
rsp.hostUuid = cmd.hostUuid
return rsp
}
try {
assert kvm2Pinged.await(30, TimeUnit.SECONDS)
} finally {
blockKvm1.countDown()
env.cleanSimulatorHandlers()
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalParallelism)
recoverHostToConnected(kvm1.uuid)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`
around lines 476 - 503, The test currently hardcodes restoring
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE to 100 in the finally block,
which can pollute other tests; capture the original value before calling
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) (e.g. store it in
a local variable like originalDegree) and in the finally block call
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalDegree)
instead of updateValue(100), leaving the rest of the test (blockKvm1,
kvm2Pinged, simulators and recoverHostToConnected) unchanged.

Comment on lines +478 to +499
CountDownLatch blockKvm1 = new CountDownLatch(1)
CountDownLatch kvm2Pinged = new CountDownLatch(1)

env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec ->
KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class)
def rsp = new KVMAgentCommands.PingResponse()

if (cmd.hostUuid == kvm1.uuid) {
blockKvm1.await(30, TimeUnit.SECONDS)
}

if (cmd.hostUuid == kvm2.uuid) {
kvm2Pinged.countDown()
}

rsp.hostUuid = cmd.hostUuid
return rsp
}

try {
assert kvm2Pinged.await(30, TimeUnit.SECONDS)
} finally {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

补充对 kvm1 已进入阻塞分支的断言,避免假阳性。

当前只在 Line 498 断言 kvm2 收到 ping;若 kvm1 分支未实际进入阻塞,这个用例仍可能通过,无法证明“一个主机阻塞时另一个主机不受影响”。建议增加一个 kvm1Entered latch,并先断言其触发。

建议修改
         CountDownLatch blockKvm1 = new CountDownLatch(1)
+        CountDownLatch kvm1Entered = new CountDownLatch(1)
         CountDownLatch kvm2Pinged = new CountDownLatch(1)
@@
             if (cmd.hostUuid == kvm1.uuid) {
+                kvm1Entered.countDown()
                 blockKvm1.await(30, TimeUnit.SECONDS)
             }
@@
         try {
+            assert kvm1Entered.await(30, TimeUnit.SECONDS)
             assert kvm2Pinged.await(30, TimeUnit.SECONDS)
         } finally {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`
around lines 478 - 499, The test can produce false positives because it never
asserts that kvm1 actually entered the blocking branch; add a new CountDownLatch
(e.g., kvm1Entered) and trigger kvm1Entered.countDown() inside the env.simulator
closure when cmd.hostUuid == kvm1.uuid before calling blockKvm1.await(...), then
assert kvm1Entered.await(30, TimeUnit.SECONDS) prior to asserting kvm2Pinged to
ensure kvm1 truly reached the blocked state; update references to the existing
blockKvm1, kvm2Pinged, kvm1 and the simulator for KVMConstant.KVM_PING_PATH and
KVMAgentCommands.PingCmd so the new latch is created and awaited in the same
test scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant