fix(hugepages): size 1G reservation by NUMA nodes, not sockets #111

Merged
VijitSingh97 merged 4 commits into main from fix/numa-hugepage-sizing
Jun 13, 2026

@VijitSingh97

The bug

RandomX fast mode keeps a NUMA-local copy of the ~2080 MB dataset per NUMA node (XMRig allocates one dataset per node). But util/proposed-grub.sh sized 1 GB HugePages as 3 × Socket(s) — and a single-socket EPYC 7642 exposes 4 NUMA nodes (NPS / L3-as-NUMA). So setup reserved 3× 1G instead of 12×, and after a reboot three of four nodes would lose 1 GB backing — a large RandomX hashrate hit.

It's latent because affected boxes keep running on whatever reservation their last boot applied; it only bites on the next reboot after a fresh setup. Present in released v1.0.0 (and 0.1.0).
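
The sizing arithmetic behind the bug can be sketched as follows. This is an illustration only: the function and variable names are hypothetical (not taken from proposed-grub.sh), and it assumes the factor of 3 comes from rounding the ~2080 MB dataset up to whole 1 GB pages.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only; names are hypothetical, not from proposed-grub.sh.

DATASET_MB=2080   # approximate RandomX fast-mode dataset size quoted above
PAGE_MB=1024      # size of one 1 GB HugePage in MB

# Round up: number of 1 GB pages needed to back one dataset copy.
pages_per_dataset() {
  echo $(( (DATASET_MB + PAGE_MB - 1) / PAGE_MB ))
}

# Total 1 GB pages for a given number of dataset copies.
total_gb_pages() {
  local copies=$1
  echo $(( $(pages_per_dataset) * copies ))
}

echo "pages per dataset:        $(pages_per_dataset)"
echo "sized by sockets (1):     $(total_gb_pages 1)"
echo "sized by NUMA nodes (4):  $(total_gb_pages 4)"
```

With one socket but four NUMA nodes, the two multipliers differ by a factor of four, which is the 3 versus 12 gap described above.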

Evidence from a live EPYC (miner-2):

randomx  -- allocated 12288 MB huge pages 100% 12/12   # 3× 1G × 4 NUMA nodes
randomx  #0/#1/#2/#3 dataset ready

The fix

Detect NUMA nodes (not sockets) and scale the per-node dataset reservation by them:

  • Detection order: the lscpu NUMA node(s): field → the count of /sys/devices/system/node/node* entries (base path overridable via NODE_SYS for tests) → the socket count → 1.
  • TOTAL_GB_PAGES = 3 × NUMA_NODES; the pure-2 MB fallback (BASE_2MB_PAGES × NUMA_NODES) scales too.
  • 2 MB scratchpad sizing is per-thread total and unchanged (confirmed adequate at 266 on the EPYC).
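
The detection order and the 1G scaling in the list above can be sketched roughly as below. This is not the code from proposed-grub.sh: the helper names (`lscpu_field`, `is_pos_int`, `detect_numa_nodes`) are hypothetical, and only `NODE_SYS`, `NUMA_NODES`, and `TOTAL_GB_PAGES` are names mentioned in this PR.

```shell
#!/usr/bin/env bash
# Sketch of the fallback chain described above; helper names are assumptions.

NODE_SYS="${NODE_SYS:-/sys/devices/system/node}"  # overridable for tests

# True only for a positive integer without leading zeros.
is_pos_int() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Print the value of an lscpu "Key: value" line (tolerates indented keys).
lscpu_field() {
  lscpu 2>/dev/null | awk -F: -v key="$1" '
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
    k == key { v = $2; gsub(/[ \t]/, "", v); print v; exit }
  '
}

detect_numa_nodes() {
  local n d

  # 1. lscpu "NUMA node(s)" field.
  n=$(lscpu_field "NUMA node(s)")
  if is_pos_int "$n"; then echo "$n"; return; fi

  # 2. Count node<N> directories under the sysfs node path.
  n=0
  for d in "$NODE_SYS"/node[0-9]*; do
    if [ -d "$d" ]; then n=$((n + 1)); fi
  done
  if is_pos_int "$n"; then echo "$n"; return; fi

  # 3. Fall back to the socket count.
  n=$(lscpu_field "Socket(s)")
  if is_pos_int "$n"; then echo "$n"; return; fi

  # 4. Last resort: assume a single node.
  echo 1
}

NUMA_NODES=$(detect_numa_nodes)
TOTAL_GB_PAGES=$(( 3 * NUMA_NODES ))   # 3 x 1G pages per NUMA-local dataset
echo "NUMA Nodes:    $NUMA_NODES"
echo "hugepagesz=1G hugepages=$TOTAL_GB_PAGES"
```

Validating each candidate as a positive integer before accepting it means an empty or malformed lscpu field simply falls through to the next source rather than producing a zero-page reservation.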

Tests

The stub lscpu now emits NUMA node(s) (defaulting to the socket count, so existing assertions are unchanged). Added cases: multi-NUMA 1G scaling (1 socket / 4 nodes → 12), 2 MB fallback scaling, and both detection fallbacks (sysfs node count, then sockets). make lint + 610 tests green.

Real-hardware validation

On a real 4-NUMA EPYC 7642, the calculator now reports:

NUMA Nodes:    4
... hugepagesz=1G hugepages=12 hugepagesz=2M hugepages=266 ...

Follow-up for a 1.0.1. (The live fleet's GRUB was already hand-corrected to 12, so those boxes are reboot-safe today; this fixes the code so a fresh setup is correct.)
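
A post-reboot spot check of the per-node reservation could look like the sketch below. It is not part of this PR: the function name and `PAGES_PER_NODE` are hypothetical, the sysfs path is the standard per-node HugePages counter, and the check assumes the kernel spread the boot-time reservation evenly across nodes, which is typical but not guaranteed.

```shell
#!/usr/bin/env bash
# Hypothetical post-reboot check; not taken from this repository.

NODE_SYS="${NODE_SYS:-/sys/devices/system/node}"
PAGES_PER_NODE=3   # 1G pages needed per NUMA-local dataset (from the PR text)

# Report the 1 GB HugePage count on each NUMA node; return 1 if any node is short.
check_node_hugepages() {
  local rc=0 d f n
  for d in "$NODE_SYS"/node[0-9]*; do
    f="$d/hugepages/hugepages-1048576kB/nr_hugepages"
    [ -r "$f" ] || continue          # node without 1G page support: skip
    n=$(cat "$f")
    if [ "$n" -ge "$PAGES_PER_NODE" ]; then
      echo "${d##*/}: $n x 1G OK"
    else
      echo "${d##*/}: $n x 1G SHORT (want $PAGES_PER_NODE)"
      rc=1
    fi
  done
  return $rc
}

check_node_hugepages || echo "at least one NUMA node is under-reserved" >&2
```

On a correctly sized 4-node machine, this would be expected to list four nodes with at least 3 pages each, matching the 12/12 allocation XMRig reports.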

🤖 Generated with Claude Code

VijitSingh97 and others added 4 commits June 13, 2026 14:25
RandomX fast mode keeps a NUMA-local copy of the ~2080 MB dataset per NUMA node
(XMRig allocates one dataset per node). util/proposed-grub.sh multiplied the
per-dataset 1G page count (3) by the SOCKET count, but a single-socket EPYC 7642
exposes 4 NUMA nodes — so setup reserved 3x 1G instead of 12x. The boxes ran fine
on an older boot's reservation, but a fresh setup + reboot would leave 3 of 4
nodes without 1G backing and tank hashrate.

Detect NUMA nodes (lscpu "NUMA node(s)", then count /sys/devices/system/node,
then fall back to sockets, then 1) and scale both the 1G reservation and the
pure-2M fallback by it. 2M scratchpad sizing is per-thread total and unchanged.

- proposed-grub.sh: NUMA_NODES detection + use it for TOTAL_GB_PAGES and
  TOTAL_2MB_FALLBACK; verbose output shows the NUMA node count.
- tests: stub lscpu now emits "NUMA node(s)" (defaults to socket count, so
  existing assertions are unchanged); added cases for the multi-NUMA 1G scaling,
  the 2M fallback scaling, and the sysfs/socket detection fallbacks.

Verified on a real 4-NUMA EPYC 7642: lscpu reports 4 nodes, calculator now emits
hugepagesz=1G hugepages=12 (was 3). Found while upgrading a fleet to v1.0.0; the
bug is in released v1.0.0 (and 0.1.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff-cover flagged the new verbose "NUMA Nodes:" output line (proposed-grub.sh:119)
as uncovered — every other proposed-grub test runs with -q or --runtime. Add a
verbose-mode assertion exercising it (and the sockets line alongside it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…NUMA sizing

Two follow-ups found while validating the NUMA fix with a clean install on a real
EPYC (autotune disabled in that worker's config):

- tests/e2e-real.sh: the #92 re-own check did `op=$(systemctl cat
  rigforge-autotune.service ...)`. When autotune is disabled the unit doesn't
  exist, systemctl exits non-zero, and under the gate's `set -Eeuo pipefail` the
  bare assignment aborted the whole verify phase right before the SKIP branch that
  was meant to handle exactly this. Add `|| true` so it reaches the skip. (Earlier
  gate runs all had autotune enabled, so this never surfaced.)
- docs/hardware.md: `randomx.numa` was described as "a no-op on single-socket",
  but a single-socket EPYC exposes several NUMA nodes — the misconception behind
  the 1G HugePage sizing bug. Clarify that the reservation scales per NUMA node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
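
The `set -e` pitfall described in the commit above can be reproduced in isolation. This is a minimal sketch, not the code from tests/e2e-real.sh: `missing_unit` is a hypothetical stand-in for `systemctl cat` on a unit that does not exist, and `false` plays the same role in the child shell.

```shell
#!/usr/bin/env bash
set -Eeuo pipefail

# Stand-in for `systemctl cat <missing unit>`: prints nothing, exits non-zero.
missing_unit() { return 1; }

# A bare assignment takes the exit status of its command substitution, so
# under `set -e` it aborts the shell. Shown in a child shell so this script
# itself survives to report the result.
aborted=no
bash -c 'set -Eeuo pipefail; op=$(false); echo "not reached"' || aborted=yes
echo "bare assignment aborted: $aborted"

# With `|| true` the failure is absorbed and the skip branch is reachable.
op=$(missing_unit) || true
if [ -z "$op" ]; then
  result="SKIP"
else
  result="CHECK"
fi
echo "result: $result"
```

The `|| true` form keeps the empty output as the signal for the skip branch, instead of letting the non-zero exit status end the run first.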
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@VijitSingh97 VijitSingh97 merged commit bc375ea into main Jun 13, 2026
5 checks passed
@VijitSingh97 VijitSingh97 deleted the fix/numa-hugepage-sizing branch June 13, 2026 20:05
VijitSingh97 added a commit that referenced this pull request Jun 13, 2026
Patch release. Roll CHANGELOG [Unreleased] -> [1.0.1] and bump VERSION to 1.0.1.

Fixes NUMA-unaware 1 GB HugePage sizing (#111): a multi-NUMA CPU (e.g. EPYC) keeps
one RandomX dataset copy per NUMA node, so the reservation must scale with NUMA
nodes, not sockets — a fresh setup on a 4-NUMA EPYC now reserves 12x 1G (was 3),
so it stays 100%-backed across a reboot.

Validated end to end by the real-hardware gate on a 4-NUMA EPYC 7642: clean
install -> reboot -> 12x 1G reserved, full hashrate, 12/12 NUMA-backed -> verify
(38 checks) -> teardown (13), all PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>