Skip to content

fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985)#20

Draft
not-matthias wants to merge 1 commit into
masterfrom
codex/stack-fix
Draft

fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985)#20
not-matthias wants to merge 1 commit into
masterfrom
codex/stack-fix

Conversation

@not-matthias

Copy link
Copy Markdown
Member

Summary

On AArch64/PPC the call instruction leaves SP unchanged (return address in the
link register), so a callee's shadow-stack entry frame sits at SP equal to its
return target rather than strictly below it as on x86. Three defects in
callgrind's call/return classifier corrupted Simulation flamegraphs on AArch64
only
— inverting call edges (malloc → core::slice::sort::stable::driftsort_main,
__rdl_dealloc/sin parenting the workload) and fabricating '2 recursion
clones of non-recursive functions.

Fixes COD-2985.

The three defects (callgrind/callstack.c, callgrind/bbcc.c)

  1. CLG_(unwind_call_stack) — minpops accounting. minpops is meant to bound
    only SP-equal pops, but was decremented for SP-lower pops too. A callee's
    still-open sub-call frames exhausted the budget before its own SP-equal entry
    frame was reached, leaving that frame stuck (keeping the callee's context
    active → inverted edges; never decrementing its recursion depth → '2 clones).
    SP-lower frames now pop unconditionally without consuming the budget.

  2. setup_bbcc return classifier. When the returning function had made
    SP-lower sub-calls, popcount_on_return defaulted to 1 and left a PLT
    stub+callee pair or a tail-call chain (which share the SP-equal block) stuck.
    The classifier is now keyed on where the top frame sits: an SP-lower returning
    frame (every x86 return) is handled exactly as before; an SP-equal returning
    frame (AArch64/PPC) pops the SP-equal block down to the deepest frame whose
    recorded return address is the return target
    — covering both a PLT stub +
    callee pair (same return address) and a tail-call chain a→b→c (c's ret
    lands past b and a). The downstream assertion is relaxed to
    popcount_on_return >= 0, since a return may now legitimately vacate only
    SP-lower frames.

  3. reconstruct_call_stack_from_native — seeded ret_addr off-by-one.
    VG_(get_StackTrace) reports caller frames at the last byte of the call
    instruction (return_PC − 1), but the classifier matches against the return
    PC. Normalized with +1 so returns across frames seeded at a mid-run
    CALLGRIND_START_INSTRUMENTATION (as CodSpeed always does) aren't demoted to
    jumps.

Only the SP-equal code paths are changed; those exist only where the call
instruction leaves SP unchanged (AArch64 bl/blr, PPC b(c)l), so x86 is
preserved exactly.

Validation

AArch64 (the bug, fractal e2e benchmark): malloc / __rdl_dealloc were
mis-rooted blobs (55% / 47% of the benchmark) and are now thin leaves
(5.5% / 3.7%) containing only allocator internals; driftsort_main is back under
graph_benchmark::analyze_fractal_tree at 27% (matches the x86 reference's 28%);
0 infra→workload inversions; non-recursive functions no longer get '2 clones.
Rust, C++ and Python e2e flamegraphs all correct.

x86 (no regression):

  • callgrind regression suite (this CI builds the branch and runs it): 22/22 pass
    on ubuntu-22.04 and ubuntu-24.04
    , identical to master.
  • A logic-equivalence harness over 64M realizable x86 return/jump events: the new
    classifier + unwind are bit-identical to the originals.
  • e2e rust/cpp/python on ubuntu-latest built against this branch: all flamegraphs
    correct.

🤖 Generated with Claude Code

…l frames

On AArch64/PPC the call instruction leaves SP unchanged (return address in the
link register), so a callee's shadow-stack entry frame sits at SP *equal* to
its return target rather than strictly below it as on x86. Three defects in the
call/return classifier corrupted Simulation flamegraphs on AArch64 only
(malloc -> driftsort_main, dealloc/sin parenting the workload, etc.) by
inverting call edges and fabricating '2 recursion clones:

1. CLG_(unwind_call_stack): minpops is meant to bound only SP-equal pops, but
   was decremented for SP-lower pops too. A callee's still-open sub-call frames
   exhausted the budget before its own SP-equal entry frame was reached,
   leaving it stuck. SP-lower frames now pop unconditionally without consuming
   the budget.

2. setup_bbcc return classifier: when the returning function had made SP-lower
   sub-calls, popcount_on_return defaulted to 1 and left a PLT stub+callee pair
   or a tail-call chain (which all share the SP-equal block) stuck. It now skips
   the SP-lower frames and pops the SP-equal block down to the deepest frame
   matching the return target. Keyed on where the top frame sits, so the x86
   paths (SP-lower-top return, SP-deeper jump) are preserved exactly; only the
   AArch64/PPC SP-equal case is changed.

3. reconstruct_call_stack_from_native: seeded ret_addr was ips[frame+1], i.e.
   return_PC - 1 (VG_(get_StackTrace) reports the call insn, not the return
   PC), but the classifier matches against the return PC. Normalize with +1 so
   returns across mid-run-seeded frames are not demoted to jumps.

COD-2985.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented Jun 29, 2026

Copy link
Copy Markdown

Merging this PR will degrade performance by 64.09%

❌ 9 regressed benchmarks
✅ 33 untouched benchmarks
⏩ 88 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
test_valgrind[valgrind-3.25.1, python3 testdata/test.py, no-inline] 4.5 s 59 s -92.43%
test_valgrind[valgrind.codspeed, python3 testdata/test.py, full-no-inline] 6.9 s 85.9 s -91.98%
test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, no-inline] 2.6 s 6.1 s -58.11%
test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, cycle-estimation] 2.7 s 6 s -54.19%
test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, full-no-inline] 3.1 s 6.4 s -51.51%
test_valgrind[valgrind.codspeed, python3 testdata/test.py, no-inline] 4.4 s 8 s -44.33%
test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, inline] 6.8 s 10.1 s -32.4%
test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, full-with-inline] 7.3 s 10.7 s -31.91%
test_valgrind[valgrind.codspeed, testdata/take_strings-aarch64 varbinview_non_null, full-with-inline-with-cycle-estimation] 7.5 s 11 s -31.38%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing codex/stack-fix (7cb97c6) with master (ce9d871)

Open in CodSpeed

Footnotes

  1. 88 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant