fix(hugepages): size 1G reservation by NUMA nodes, not sockets by VijitSingh97 · Pull Request #111 · p2pool-starter-stack/rigforge

VijitSingh97 · 2026-06-13T19:25:29Z

The bug

RandomX fast mode keeps a NUMA-local copy of the ~2080 MB dataset per NUMA node (XMRig allocates one dataset per node). But util/proposed-grub.sh sized the 1 GB HugePage reservation as 3 × Socket(s), and a single-socket EPYC 7642 exposes 4 NUMA nodes (NPS / L3-as-NUMA). So setup reserved 3× 1G instead of 12×, and after a reboot three of the four nodes would lose their 1 GB backing, which is a large RandomX hashrate hit.
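To make the arithmetic concrete, here is a minimal sketch of the sizing rule applied to both multipliers. The per-dataset count of 3 is the figure given in this PR (it is consistent with rounding ~2080 MB up to whole 1 GB pages); the function and variable names are hypothetical and are not taken from util/proposed-grub.sh.

```shell
#!/usr/bin/env bash
# Illustrative only: names are hypothetical, not from proposed-grub.sh.
GB_PAGES_PER_DATASET=3   # per-dataset 1 GB page count stated in this PR

# Total 1 GB pages = pages per dataset x number of dataset copies.
total_gb_pages() { echo $(( GB_PAGES_PER_DATASET * $1 )); }

# Single-socket EPYC 7642 that exposes 4 NUMA nodes:
sockets=1
numa_nodes=4
echo "sized by sockets:    $(total_gb_pages "$sockets")"     # old rule
echo "sized by NUMA nodes: $(total_gb_pages "$numa_nodes")"  # new rule
```

The two results differ by exactly the ratio of NUMA nodes to sockets, so the old rule only matched on machines where each socket is a single NUMA node.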

It's latent because affected boxes keep running on whatever reservation their last boot applied; it only bites on the next reboot after a fresh setup. Present in released v1.0.0 (and 0.1.0).

Evidence from a live EPYC (miner-2):

randomx  -- allocated 12288 MB huge pages 100% 12/12   # 3× 1G × 4 NUMA nodes
randomx  #0/#1/#2/#3 dataset ready

The fix

Detect NUMA nodes (not sockets) and scale the per-node dataset reservation by them:

- Detection order: lscpu "NUMA node(s):" → count of /sys/devices/system/node/node* (overridable via NODE_SYS for tests) → socket count → 1.
- TOTAL_GB_PAGES = 3 × NUMA_NODES; the pure-2 MB fallback (BASE_2MB_PAGES × NUMA_NODES) scales too.
- 2 MB scratchpad sizing is a per-thread total and is unchanged (confirmed adequate at 266 on the EPYC).
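The detection order above could be sketched roughly as follows. This is not the code from util/proposed-grub.sh: the helper names (lscpu_field, is_positive_int, detect_numa_nodes) are hypothetical, and only NODE_SYS and the lscpu field names come from the description.

```shell
#!/usr/bin/env bash
# Sketch of the fallback chain; helper names are assumptions, not the script's own.

is_positive_int() { [[ "$1" =~ ^[1-9][0-9]*$ ]]; }

# Print the numeric value of an lscpu "Key: value" line, or nothing at all.
lscpu_field() {
  command -v lscpu >/dev/null 2>&1 || return 0
  lscpu 2>/dev/null | awk -F: -v key="$1" '
    { k = $1; sub(/^[ \t]+/, "", k) }          # tolerate indented lscpu output
    k == key { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

detect_numa_nodes() {
  local node_sys="${NODE_SYS:-/sys/devices/system/node}" n d

  # 1. lscpu "NUMA node(s):"
  n=$(lscpu_field 'NUMA node(s)')

  # 2. Count node<N> directories under sysfs (NODE_SYS overrides the path).
  if ! is_positive_int "$n"; then
    n=0
    for d in "$node_sys"/node[0-9]*; do
      if [ -d "$d" ]; then n=$((n + 1)); fi
    done
  fi

  # 3. Socket count, then 4. a floor of 1.
  is_positive_int "$n" || n=$(lscpu_field 'Socket(s)')
  is_positive_int "$n" || n=1
  echo "$n"
}

echo "NUMA nodes: $(detect_numa_nodes)"
```

Validating each step with is_positive_int means an empty or zero result falls through to the next source instead of producing a reservation of 0 pages.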

Tests

The stub lscpu now emits NUMA node(s) (defaulting to the socket count, so existing assertions are unchanged). Added cases: multi-NUMA 1G scaling (1 socket / 4 nodes → 12), 2 MB fallback scaling, and both detection fallbacks (sysfs node count, then sockets). make lint + 610 tests green.
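As a rough illustration of the stub behaviour described above, the following sketch emits a NUMA node count that defaults to the socket count. The function name and the STUB_SOCKETS / STUB_NUMA_NODES variables are invented for this example; the real test suite's stub may be structured quite differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for lscpu; variable names are assumptions for illustration.
stub_lscpu() {
  local sockets="${STUB_SOCKETS:-1}"
  # Default NUMA nodes to the socket count so older assertions see no change.
  local numa="${STUB_NUMA_NODES:-$sockets}"
  printf 'Socket(s):       %s\n' "$sockets"
  printf 'NUMA node(s):    %s\n' "$numa"
}

# Multi-NUMA case from this PR: 1 socket, 4 nodes.
STUB_SOCKETS=1 STUB_NUMA_NODES=4 stub_lscpu
```

Defaulting the node count to the socket count is what keeps pre-existing test cases valid: for them, 3 × nodes equals 3 × sockets.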

Real-hardware validation

On a real 4-NUMA EPYC 7642, the calculator now reports:

NUMA Nodes:    4
... hugepagesz=1G hugepages=12 hugepagesz=2M hugepages=266 ...
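To check a proposed or live kernel command line against the expected counts, a small parser such as the one below can pair each hugepages= value with the hugepagesz= that precedes it. The function name hugepage_pairs is hypothetical and is not part of rigforge.

```shell
#!/usr/bin/env bash
# Print "<size> <count>" for every hugepagesz=/hugepages= pair in a command line.
# A hugepages= value with no preceding hugepagesz= is reported as "default".
hugepage_pairs() {
  local size="default" tok
  local -a toks
  read -ra toks <<< "$1"
  for tok in "${toks[@]}"; do
    case "$tok" in
      hugepagesz=*) size="${tok#hugepagesz=}" ;;
      hugepages=*)  echo "$size ${tok#hugepages=}" ;;
    esac
  done
}

hugepage_pairs "hugepagesz=1G hugepages=12 hugepagesz=2M hugepages=266"
```

After a reboot the same function can be pointed at the live command line with `hugepage_pairs "$(cat /proc/cmdline)"`, and the 1 GB pool actually allocated can be read from /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages.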

Follow-up for a 1.0.1. (The live fleet's GRUB was already hand-corrected to 12, so those boxes are reboot-safe today; this fixes the code so a fresh setup is correct.)

🤖 Generated with Claude Code

RandomX fast mode keeps a NUMA-local copy of the ~2080 MB dataset per NUMA node (XMRig allocates one dataset per node). util/proposed-grub.sh multiplied the per-dataset 1G page count (3) by the SOCKET count, but a single-socket EPYC 7642 exposes 4 NUMA nodes, so setup reserved 3x 1G instead of 12x. The boxes ran fine on an older boot's reservation, but a fresh setup + reboot would leave 3 of 4 nodes without 1G backing and tank hashrate.

Detect NUMA nodes (lscpu "NUMA node(s)", then count /sys/devices/system/node, then fall back to sockets, then 1) and scale both the 1G reservation and the pure-2M fallback by it. 2M scratchpad sizing is a per-thread total and is unchanged.

- proposed-grub.sh: NUMA_NODES detection + use it for TOTAL_GB_PAGES and TOTAL_2MB_FALLBACK; verbose output shows the NUMA node count.
- tests: stub lscpu now emits "NUMA node(s)" (defaults to socket count, so existing assertions are unchanged); added cases for the multi-NUMA 1G scaling, the 2M fallback scaling, and the sysfs/socket detection fallbacks.

Verified on a real 4-NUMA EPYC 7642: lscpu reports 4 nodes, calculator now emits hugepagesz=1G hugepages=12 (was 3). Found while upgrading a fleet to v1.0.0; the bug is in released v1.0.0 (and 0.1.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

diff-cover flagged the new verbose "NUMA Nodes:" output line (proposed-grub.sh:119) as uncovered, because every other proposed-grub test runs with -q or --runtime. Add a verbose-mode assertion exercising it (and the sockets line alongside it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…NUMA sizing

Two follow-ups found while validating the NUMA fix with a clean install on a real EPYC (autotune disabled in that worker's config):

- tests/e2e-real.sh: the #92 re-own check did `op=$(systemctl cat rigforge-autotune.service ...)`. When autotune is disabled the unit doesn't exist, systemctl exits non-zero, and under the gate's `set -Eeuo pipefail` the bare assignment aborted the whole verify phase right before the SKIP branch that was meant to handle exactly this case. Add `|| true` so it reaches the skip. (Earlier gate runs all had autotune enabled, so this never surfaced.)
- docs/hardware.md: `randomx.numa` was described as "a no-op on single-socket", but a single-socket EPYC exposes several NUMA nodes, which is the misconception behind the 1G HugePage sizing bug. Clarify that the reservation scales per NUMA node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
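The `set -e` behaviour behind the e2e fix can be reproduced without systemd. In the sketch below, missing_unit is a hypothetical stand-in for `systemctl cat` on a unit that does not exist (no output, non-zero exit), and demo is a hypothetical helper that runs a snippet in a fresh bash with the same strict-mode flags; neither appears in tests/e2e-real.sh.

```shell
#!/usr/bin/env bash
# Run a snippet in a fresh bash under the strict mode named in the commit message.
demo() {
  bash -c 'set -Eeuo pipefail
missing_unit() { return 1; }   # stand-in for systemctl cat on an absent unit
'"$1"
}

# Bare assignment: the failed command substitution aborts the script,
# so the line after it never runs.
demo 'op=$(missing_unit); echo "reached the skip check"' \
  || echo "bare assignment: aborted before the SKIP branch"

# Guarded assignment: `|| true` absorbs the failure and op stays empty,
# so the SKIP branch is reachable.
demo 'op=$(missing_unit) || true
if [ -z "$op" ]; then echo "SKIP: unit not installed"; fi'
```

Each snippet runs in its own bash process on purpose: bash ignores `set -e` inside a function or subshell that is itself called on the left of `||`, which would hide the abort being demonstrated.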

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Patch release. Roll CHANGELOG [Unreleased] -> [1.0.1] and bump VERSION to 1.0.1.

Fixes NUMA-unaware 1 GB HugePage sizing (#111): a multi-NUMA CPU (e.g. EPYC) keeps one RandomX dataset copy per NUMA node, so the reservation must scale with NUMA nodes, not sockets. A fresh setup on a 4-NUMA EPYC now reserves 12x 1G (was 3), so it stays 100%-backed across a reboot.

Validated end to end by the real-hardware gate on a 4-NUMA EPYC 7642: clean install -> reboot -> 12x 1G reserved, full hashrate, 12/12 NUMA-backed -> verify (38 checks) -> teardown (13), all PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

VijitSingh97 and others added 4 commits June 13, 2026 14:25

docs(changelog): reference #111 on the NUMA HugePage fix entry

d338d38

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

VijitSingh97 merged commit bc375ea into main Jun 13, 2026
5 checks passed

VijitSingh97 deleted the fix/numa-hugepage-sizing branch June 13, 2026 20:05


