[host]: ping syncSignature per-host to prevent queue serialization#4093
[host]: ping syncSignature per-host to prevent queue serialization#4093zstack-robot-2 wants to merge 1 commit into
Conversation
总体概述该PR调整了主机ping任务的同步签名粒度,从固定的全局签名改为按主机UUID区分的签名,以支持不同主机的ping请求并发执行而同一主机的ping请求序列化处理。新增集成测试验证了这一行为。 变更内容Ping任务同步签名与调度
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@test/src/test/java/org/zstack/test/core/thread/DispatchQueuePerHostSignatureTest.java`:
- Around line 42-46: The test currently has a race/false-positive window because
call() uses started.countDown() then letFinish.await(...) without checking the
await result, and the spawned Futures aren't .get()'d so exceptions are
swallowed; tighten the startup/synchronization and consume task results: assert
started.await with a shorter timeout and fail if it returns false, replace
letFinish.await(10, TimeUnit.SECONDS) with a checked await whose boolean result
is asserted, and for the Futures created later call future.get(timeout,
TimeUnit.SECONDS) (or get() and assert the returned values) so any thrown
exceptions fail the test and ordering/assertions on executionOrder reflect
actual completion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 4f5a7527-88b0-42aa-9c2b-245311c2dd0a
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/host/HostBase.javatest/src/test/java/org/zstack/test/core/thread/DispatchQueuePerHostSignatureTest.java
8b0b4c3 to
e204090
Compare
…ocking syncSignature changed from shared "do-ping-host" to per-host "do-ping-host-<uuid>". Each host now has independent ping queue; different hosts no longer block each other. BREAKING: ping.parallelismDegree semantics changed from global concurrency limit to per-host concurrency limit. The implicit global backstop is now ThreadFacade's sync thread pool (150 threads by default, from ThreadGlobalProperty.MAX_THREAD_NUM). Removed getSyncLevel() override — per-host signature with 60s ping cycle makes per-host parallelism meaningless; ChainTask default level=1 ensures strict serial. Resolves: ZSTAC-80543 Change-Id: I307de7233cd6633c93b4d1c03f4898a558a982d3
e204090 to
1ff12a3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`:
- Around line 478-499: The test can produce false positives because it never
asserts that kvm1 actually entered the blocking branch; add a new CountDownLatch
(e.g., kvm1Entered) and trigger kvm1Entered.countDown() inside the env.simulator
closure when cmd.hostUuid == kvm1.uuid before calling blockKvm1.await(...), then
assert kvm1Entered.await(30, TimeUnit.SECONDS) prior to asserting kvm2Pinged to
ensure kvm1 truly reached the blocked state; update references to the existing
blockKvm1, kvm2Pinged, kvm1 and the simulator for KVMConstant.KVM_PING_PATH and
KVMAgentCommands.PingCmd so the new latch is created and awaited in the same
test scope.
- Around line 476-503: The test currently hardcodes restoring
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE to 100 in the finally block,
which can pollute other tests; capture the original value before calling
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) (e.g. store it in
a local variable like originalDegree) and in the finally block call
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalDegree)
instead of updateValue(100), leaving the rest of the test (blockKvm1,
kvm2Pinged, simulators and recoverHostToConnected) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 319f471b-3c05-4bc8-8654-62c86f512e24
⛔ Files ignored due to path filters (1)
conf/globalConfig/host.xmlis excluded by!**/*.xml
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/host/HostBase.javatest/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy
🚧 Files skipped from review as they are similar to previous changes (1)
- compute/src/main/java/org/zstack/compute/host/HostBase.java
| HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) | ||
|
|
||
| CountDownLatch blockKvm1 = new CountDownLatch(1) | ||
| CountDownLatch kvm2Pinged = new CountDownLatch(1) | ||
|
|
||
| env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec -> | ||
| KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class) | ||
| def rsp = new KVMAgentCommands.PingResponse() | ||
|
|
||
| if (cmd.hostUuid == kvm1.uuid) { | ||
| blockKvm1.await(30, TimeUnit.SECONDS) | ||
| } | ||
|
|
||
| if (cmd.hostUuid == kvm2.uuid) { | ||
| kvm2Pinged.countDown() | ||
| } | ||
|
|
||
| rsp.hostUuid = cmd.hostUuid | ||
| return rsp | ||
| } | ||
|
|
||
| try { | ||
| assert kvm2Pinged.await(30, TimeUnit.SECONDS) | ||
| } finally { | ||
| blockKvm1.countDown() | ||
| env.cleanSimulatorHandlers() | ||
| HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(100) | ||
| recoverHostToConnected(kvm1.uuid) |
There was a problem hiding this comment.
避免硬编码恢复 HOST_TRACK_PARALLELISM_DEGREE。
Line 502 把并行度固定恢复为 100,会覆盖测试前真实配置,增加后续用例被污染和顺序相关失败的风险。建议在 Line 476 前保存原值,并在 finally 中恢复原值。
建议修改
+ Integer originalParallelism = HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.value(Integer.class)
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1)
@@
- HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(100)
+ HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalParallelism)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) | |
| CountDownLatch blockKvm1 = new CountDownLatch(1) | |
| CountDownLatch kvm2Pinged = new CountDownLatch(1) | |
| env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec -> | |
| KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class) | |
| def rsp = new KVMAgentCommands.PingResponse() | |
| if (cmd.hostUuid == kvm1.uuid) { | |
| blockKvm1.await(30, TimeUnit.SECONDS) | |
| } | |
| if (cmd.hostUuid == kvm2.uuid) { | |
| kvm2Pinged.countDown() | |
| } | |
| rsp.hostUuid = cmd.hostUuid | |
| return rsp | |
| } | |
| try { | |
| assert kvm2Pinged.await(30, TimeUnit.SECONDS) | |
| } finally { | |
| blockKvm1.countDown() | |
| env.cleanSimulatorHandlers() | |
| HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(100) | |
| recoverHostToConnected(kvm1.uuid) | |
| Integer originalParallelism = HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.value(Integer.class) | |
| HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) | |
| CountDownLatch blockKvm1 = new CountDownLatch(1) | |
| CountDownLatch kvm2Pinged = new CountDownLatch(1) | |
| env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec -> | |
| KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class) | |
| def rsp = new KVMAgentCommands.PingResponse() | |
| if (cmd.hostUuid == kvm1.uuid) { | |
| blockKvm1.await(30, TimeUnit.SECONDS) | |
| } | |
| if (cmd.hostUuid == kvm2.uuid) { | |
| kvm2Pinged.countDown() | |
| } | |
| rsp.hostUuid = cmd.hostUuid | |
| return rsp | |
| } | |
| try { | |
| assert kvm2Pinged.await(30, TimeUnit.SECONDS) | |
| } finally { | |
| blockKvm1.countDown() | |
| env.cleanSimulatorHandlers() | |
| HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalParallelism) | |
| recoverHostToConnected(kvm1.uuid) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`
around lines 476 - 503, The test currently hardcodes restoring
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE to 100 in the finally block,
which can pollute other tests; capture the original value before calling
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(1) (e.g. store it in
a local variable like originalDegree) and in the finally block call
HostGlobalConfig.HOST_TRACK_PARALLELISM_DEGREE.updateValue(originalDegree)
instead of updateValue(100), leaving the rest of the test (blockKvm1,
kvm2Pinged, simulators and recoverHostToConnected) unchanged.
| CountDownLatch blockKvm1 = new CountDownLatch(1) | ||
| CountDownLatch kvm2Pinged = new CountDownLatch(1) | ||
|
|
||
| env.simulator(KVMConstant.KVM_PING_PATH) { HttpEntity<String> e, EnvSpec espec -> | ||
| KVMAgentCommands.PingCmd cmd = JSONObjectUtil.toObject(e.getBody(), KVMAgentCommands.PingCmd.class) | ||
| def rsp = new KVMAgentCommands.PingResponse() | ||
|
|
||
| if (cmd.hostUuid == kvm1.uuid) { | ||
| blockKvm1.await(30, TimeUnit.SECONDS) | ||
| } | ||
|
|
||
| if (cmd.hostUuid == kvm2.uuid) { | ||
| kvm2Pinged.countDown() | ||
| } | ||
|
|
||
| rsp.hostUuid = cmd.hostUuid | ||
| return rsp | ||
| } | ||
|
|
||
| try { | ||
| assert kvm2Pinged.await(30, TimeUnit.SECONDS) | ||
| } finally { |
There was a problem hiding this comment.
补充对 kvm1 已进入阻塞分支的断言,避免假阳性。
当前只在 Line 498 断言 kvm2 收到 ping;若 kvm1 分支未实际进入阻塞,这个用例仍可能通过,无法证明“一个主机阻塞时另一个主机不受影响”。建议增加一个 kvm1Entered latch,并先断言其触发。
建议修改
CountDownLatch blockKvm1 = new CountDownLatch(1)
+ CountDownLatch kvm1Entered = new CountDownLatch(1)
CountDownLatch kvm2Pinged = new CountDownLatch(1)
@@
if (cmd.hostUuid == kvm1.uuid) {
+ kvm1Entered.countDown()
blockKvm1.await(30, TimeUnit.SECONDS)
}
@@
try {
+ assert kvm1Entered.await(30, TimeUnit.SECONDS)
assert kvm2Pinged.await(30, TimeUnit.SECONDS)
} finally {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/src/test/groovy/org/zstack/test/integration/kvm/host/KVMPingCase.groovy`
around lines 478 - 499, The test can produce false positives because it never
asserts that kvm1 actually entered the blocking branch; add a new CountDownLatch
(e.g., kvm1Entered) and trigger kvm1Entered.countDown() inside the env.simulator
closure when cmd.hostUuid == kvm1.uuid before calling blockKvm1.await(...), then
assert kvm1Entered.await(30, TimeUnit.SECONDS) prior to asserting kvm2Pinged to
ensure kvm1 truly reached the blocked state; update references to the existing
blockKvm1, kvm2Pinged, kvm1 and the simulator for KVMConstant.KVM_PING_PATH and
KVMAgentCommands.PingCmd so the new latch is created and awaited in the same
test scope.
Summary
Root Cause Analysis
The host ping mechanism uses
ChainTaskwith asyncSignatureto serialize operations. Multiple hosts shared the same syncSignature, causing ping operations for different hosts to be unnecessarily serialized. In large clusters (100+ hosts), this created a queue bottleneck: a slow ping to one host blocked pings to all other hosts, causing cascading timeouts and false-positive host disconnects.Fix Plan
Change the syncSignature from a global constant to per-host UUID:
getSyncSignature()now returns the host UUID string. This allows ping operations for different hosts to run in parallel, eliminating the serialization bottleneck while maintaining per-host ordering.Impact
Host ping/reconnect path in large clusters. Most impactful in environments with 50+ hosts.
Test Plan
🤖 Generated with Claude Code
sync from gitlab !9839