fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985) by not-matthias · Pull Request #20 · CodSpeedHQ/valgrind-codspeed

not-matthias · 2026-06-29T19:19:03Z

Summary

On AArch64/PPC the call instruction leaves SP unchanged (return address in the
link register), so a callee's shadow-stack entry frame sits at SP equal to its
return target rather than strictly below it as on x86. Three defects in
callgrind's call/return classifier corrupted Simulation flamegraphs on AArch64
only — inverting call edges (malloc → core::slice::sort::stable::driftsort_main,
__rdl_dealloc/sin parenting the workload) and fabricating '2 recursion
clones of non-recursive functions.

Fixes COD-2985.

The three defects (`callgrind/callstack.c`, `callgrind/bbcc.c`)

CLG_(unwind_call_stack) — minpops accounting. minpops is meant to bound
only SP-equal pops, but was decremented for SP-lower pops too. A callee's
still-open sub-call frames exhausted the budget before its own SP-equal entry
frame was reached, leaving that frame stuck (keeping the callee's context
active → inverted edges; never decrementing its recursion depth → '2 clones).
SP-lower frames now pop unconditionally without consuming the budget.
setup_bbcc return classifier. When the returning function had made
SP-lower sub-calls, popcount_on_return defaulted to 1 and left a PLT
stub+callee pair or a tail-call chain (which share the SP-equal block) stuck.
The classifier is now keyed on where the top frame sits: an SP-lower returning
frame (every x86 return) is handled exactly as before; an SP-equal returning
frame (AArch64/PPC) pops the SP-equal block down to the deepest frame whose
recorded return address is the return target — covering both a PLT stub +
callee pair (same return address) and a tail-call chain a→b→c (c's ret
lands past b and a). The downstream assertion is relaxed to
popcount_on_return >= 0, since a return may now legitimately vacate only
SP-lower frames.
reconstruct_call_stack_from_native — seeded ret_addr off-by-one.
VG_(get_StackTrace) reports caller frames at the last byte of the call
instruction (return_PC − 1), but the classifier matches against the return
PC. Normalized with +1 so returns across frames seeded at a mid-run
CALLGRIND_START_INSTRUMENTATION (as CodSpeed always does) aren't demoted to
jumps.

Only the SP-equal code paths are changed; those exist only where the call
instruction leaves SP unchanged (AArch64 bl/blr, PPC b(c)l), so x86 is
preserved exactly.

Validation

AArch64 (the bug, fractal e2e benchmark): malloc / __rdl_dealloc were
mis-rooted blobs (55% / 47% of the benchmark) and are now thin leaves
(5.5% / 3.7%) containing only allocator internals; driftsort_main is back under
graph_benchmark::analyze_fractal_tree at 27% (matches the x86 reference's 28%);
0 infra→workload inversions; non-recursive functions no longer get '2 clones.
Rust, C++ and Python e2e flamegraphs all correct.

x86 (no regression):

callgrind regression suite (this CI builds the branch and runs it): 22/22 pass
on ubuntu-22.04 and ubuntu-24.04, identical to master.
A logic-equivalence harness over 64M realizable x86 return/jump events: the new
classifier + unwind are bit-identical to the originals.
e2e rust/cpp/python on ubuntu-latest built against this branch: all flamegraphs
correct.

🤖 Generated with Claude Code

…l frames On AArch64/PPC the call instruction leaves SP unchanged (return address in the link register), so a callee's shadow-stack entry frame sits at SP *equal* to its return target rather than strictly below it as on x86. Three defects in the call/return classifier corrupted Simulation flamegraphs on AArch64 only (malloc -> driftsort_main, dealloc/sin parenting the workload, etc.) by inverting call edges and fabricating '2 recursion clones: 1. CLG_(unwind_call_stack): minpops is meant to bound only SP-equal pops, but was decremented for SP-lower pops too. A callee's still-open sub-call frames exhausted the budget before its own SP-equal entry frame was reached, leaving it stuck. SP-lower frames now pop unconditionally without consuming the budget. 2. setup_bbcc return classifier: when the returning function had made SP-lower sub-calls, popcount_on_return defaulted to 1 and left a PLT stub+callee pair or a tail-call chain (which all share the SP-equal block) stuck. It now skips the SP-lower frames and pops the SP-equal block down to the deepest frame matching the return target. Keyed on where the top frame sits, so the x86 paths (SP-lower-top return, SP-deeper jump) are preserved exactly; only the AArch64/PPC SP-equal case is changed. 3. reconstruct_call_stack_from_native: seeded ret_addr was ips[frame+1], i.e. return_PC - 1 (VG_(get_StackTrace) reports the call insn, not the return PC), but the classifier matches against the return PC. Normalize with +1 so returns across mid-run-seeded frames are not demoted to jumps. COD-2985. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codspeed-hq · 2026-06-29T19:25:46Z

Merging this PR will degrade performance by 64.09%

❌ 9 regressed benchmarks
✅ 33 untouched benchmarks
⏩ 88 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
❌	`test_valgrind[valgrind-3.25.1, python3 testdata/test.py, no-inline]`	4.5 s	59 s	-92.43%
❌	`test_valgrind[valgrind.codspeed, python3 testdata/test.py, full-no-inline]`	6.9 s	85.9 s	-91.98%
❌	`test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, no-inline]`	2.6 s	6.1 s	-58.11%
❌	`test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, cycle-estimation]`	2.7 s	6 s	-54.19%
❌	`test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, full-no-inline]`	3.1 s	6.4 s	-51.51%
❌	`test_valgrind[valgrind.codspeed, python3 testdata/test.py, no-inline]`	4.4 s	8 s	-44.33%
❌	`test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, inline]`	6.8 s	10.1 s	-32.4%
❌	`test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, full-with-inline]`	7.3 s	10.7 s	-31.91%
❌	`test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, full-with-inline-with-cycle-estimation]`	7.5 s	11 s	-31.38%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing codex/stack-fix (7cb97c6) with master (ce9d871)}

88 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985)#20

fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985)#20
not-matthias wants to merge 1 commit into
masterfrom
codex/stack-fix

not-matthias commented Jun 29, 2026

Uh oh!

codspeed-hq Bot commented Jun 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

not-matthias commented Jun 29, 2026

Summary

The three defects (callgrind/callstack.c, callgrind/bbcc.c)

Validation

Uh oh!

codspeed-hq Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will degrade performance by 64.09%

Performance Changes

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

The three defects (`callgrind/callstack.c`, `callgrind/bbcc.c`)

codspeed-hq Bot commented Jun 29, 2026 •

edited

Loading