fix(sched): break two-CPU off-CPU deadlock in wake() + reproduction test#127
Closed
FlareCoding wants to merge 2 commits into
Closed
fix(sched): break two-CPU off-CPU deadlock in wake() + reproduction test#127FlareCoding wants to merge 2 commits into
FlareCoding wants to merge 2 commits into
Conversation
Adds a deterministic SMP unit test that fabricates the exact precondition for the wake() off-CPU spin deadlock: two controller tasks pin themselves to distinct CPUs with interrupts disabled, each parks a victim task in its own CPU's pending_off_cpu slot (on_cpu=1), then cross-wakes the other CPU's victim. With the current wake() implementation both controllers spin forever on each other's on_cpu; the BSP detects the stall via a bounded watchdog, fails the test, and recovers the wedged CPUs so the suite can continue. This test fails on the current code and is expected to pass once wake() publishes its own deferred off-CPU task before the cross-CPU spin. Co-authored-by: Albert Slepak <FlareCoding@users.noreply.github.com>
wake() spin-waits on a remote task's exec.on_cpu before enqueuing it, but a switched-out task's on_cpu is cleared only lazily at the owning CPU's next scheduler trap (finalize_pending_off_cpu()). Two CPUs that each wake a task still pending-off-CPU on the other, while running with interrupts disabled, spin forever: neither reaches a trap to clear the flag the other awaits. Publish this CPU's own deferred off-CPU task before the spin so the cycle cannot form. finalize_pending_off_cpu() does a non-atomic read-modify-clear of the per-CPU pending_off_cpu_task and is otherwise only invoked from the IRQ-masked scheduler trap paths; wake() runs with interrupts enabled on the futex/wait-queue/mutex paths, so the call is wrapped in irq_save/irq_restore to preserve that contract and prevent a timer tick from corrupting the pending slot. Verified by tests/sched/wake_off_cpu_deadlock.test.cpp, which deadlocks without this change and passes with it, on both x86_64 and aarch64. Co-authored-by: Albert Slepak <FlareCoding@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a deterministic SMP unit test that reproduces a real two-CPU deadlock in
sched::wake(), then fixes the deadlock in an interrupt-safe way.The bug
A task's
exec.on_cpuis set to1when it is switched in and is cleared only lazily at the owning CPU's next scheduler trap (the switched-out task is parked in the per-CPUpending_off_cpu_taskand published byfinalize_pending_off_cpu()).sched::wake()spin-waits on a remote task'son_cpubefore enqueuing it.If two CPUs each hold a just-switched-out task in their pending slot (
on_cpustill1) and each tries to wake the task parked on the other CPU while running with interrupts disabled (e.g. from an MSI/IRQ-contextwake_one), neither CPU can reach a scheduler trap to clear the flag the other is spinning on → permanent mutual stall.The test
kernel/tests/sched/wake_off_cpu_deadlock.test.cppfabricates the exact precondition deterministically: two controller tasks pin themselves to distinct non-BSP CPUs with interrupts disabled, each parks a victim in its own CPU's pending slot viasched::defer_off_cpu_finalize(), and after a barrier each cross-wakes the other CPU's victim. The BSP detects a stall via a bounded watchdog and, on failure, recovers the wedged CPUs so the rest of the suite still runs. The test fails on the unfixed scheduler and passes oncewake()is fixed.The fix
sched::wake()now publishes this CPU's own deferred off-CPU task (finalize_pending_off_cpu()) before entering the cross-CPUon_cpuspin, breaking the cycle. Becausefinalize_pending_off_cpu()performs a non-atomic read-modify-clear of the per-CPUpending_off_cpu_taskand is otherwise only ever called from the IRQ-masked scheduler trap paths, the call is wrapped incpu::irq_save()/cpu::irq_restore().wake()is routinely called with interrupts enabled (futex / wait-queue / mutex wakeups), so without the guard a timer tick landing between the read and the clear could drop a different task's pending entry — an equally severe regression. The guard preserves the function's required execution context.Testing
make test ARCH=x86_64make test ARCH=aarch64