Arbitrary-precision integer library in C++20. Header-light with a thin static-library shell. 64-bit limbs, 3-prime CRT NTT multiplication (multithreaded by default), Newton–Raphson division with cached reciprocals, divide-and-conquer base conversion. Includes a REPL calculator built on top.
Target: portable C++ (with optional NEON intrinsics on aarch64, scalar fallbacks everywhere else) that reaches GMP parity or better across the large-operand bands — no hand-written assembly.
BigMath vs GMP 6.3, Apple M1 Max, paired same-run measurements (canonical run 2026-06-12 at v13.0). Below the dashed line BigMath is faster than GMP:
- Multiplication beats GMP at every balanced size from 500k digits up — peaking at 3.2× faster (10M×10M: 64 vs 209 ms) and staying at or below GMP through 200M digits warm (0.89–0.97×) — and at every measured skewed shape ≥500k×50k (0.40–0.73×).
- Division beats GMP from 20M-digit dividends (0.45–0.82×) and reaches parity at 5M (200M÷40M: 3.1 s vs GMP's 4.5 s).
- Decimal I/O beats GMP from 500k digits in both directions: parse 0.38–0.69× (20M digits: 306 vs 815 ms, 2.7× faster), warm ToString 0.49–0.62× (1M digits: 28 vs 50 ms).
- Sub-NTT sizes (≲ 13k digits) remain 2–3× behind GMP's hand-tuned basecase — the documented portable-C++ wall.
The margins come from the 2026 optimization campaign — NEON Shoup NTT butterflies, fused-MFA transforms, the wraparound-Newton division family, the Burnikel–Ziegler odd-size padding fix, parallel decimal I/O fan-outs, and CRT-NTT squaring — followed by a full-codebase correctness audit. The condensed history with per-step wins is in BENCHMARK.md; releases in CHANGELOG.md.
Full tables, methodology, and per-PR history: BENCHMARK.md.
- 64-bit limb representation (
BIGMATH_LIMB_64=1, default).DataTstores true 64-bit values; every carry/borrow chain uses__uint128_taccumulators. Halves loop iteration counts in scalar paths vs the legacy 32-bit-in-64-bit layout. Opt out via-DBIGMATH_LIMB_64=0. - Multiplication: Classic schoolbook → Karatsuba (64-bit-hybrid leaf) → 3-prime CRT NTT (2013265921 / 469762049 / 1811939329, 32-bit coefficient splitting) from 1280 total limbs (~13k digits). Toom-3 and the single-prime Goldilocks NTT remain as cross-check alternates; the NEON-accelerated CRT path beats both at every measured size.
- Radix-4 + radix-8 fused NTT butterflies (PRs #59, #60): 3× fewer memory passes vs radix-2; ~1.6× wall-clock at ≥2M limbs.
- NEON Shoup butterflies on aarch64 (PRs #96–#99,
BIGMATH_NEON=1default on Apple Silicon): the radix-8/4/2 layers and 1/n scaling run 4 lanes wide using Shoup's precomputed-reciprocal multiplication —(w·x) mod Pfrom one widening multiply, one low multiply, one conditional subtract, with interleaved(twiddle, w′)tables keeping the strided gathers at one cache line per pair. Bit-exact with the scalar path; ~4× per butterfly pass, −26…40% end-to-end on NTT-bound ops. Scalar fallback everywhere else. - MFA / Bailey 6-step CRT NTT (
BIGMATH_NTT_MFA_THRESHOLD=2^20, default). Large CRT transforms switch to a cache-friendly matrix Fourier layout with fully fused stages: the transpose, the operand packing (straight from the limb arrays), and the pointwise product all ride the tiled stage passes, and the forward/inverse stages run as row-chunkedParallelDo(6)work units. The gate sat at2^24until the fused stages flipped the break-even (PR #109); cyclic wrap-around products use the same pipeline up to the CRT ceiling2^26(PR #110). - Multithreaded NTT (
BIGMATH_USE_THREADS=1, default). Small thread pool (sizemin(hw_concurrency, BIGMATH_MAX_THREADS=8)) parallelizes the CRT path: in the fused MFA window each transform stage runs as six (operand×prime or prime×row-range) work units; the non-MFA path batches 6 forwards + 3 inverses. 2.3-3.4× speedup on large mul / skewed div / parse. Opt out via-DBIGMATH_USE_THREADS=0to drop pthread linkage. - Division: Classic short division → Knuth Algorithm D (
FastDivision, Möller-Granlund qhat, bit-shift normalization) → Burnikel–Ziegler (small near-balanced) → Newton–Raphson with cached reciprocals and wrap-around cyclic products (half-length transforms for the remainder, quotient-estimate, and reciprocal-iteration steps — GMPmu_div/invertapprstyle) → quotient-sized division for short-quotient shapes (cost scales with the quotient, not the divisor). Burnikel–Ziegler pads operands with bottom zero limbs so its recursion never hits an odd split (PR #116 — before that, odd divisor limb counts fell back to Knuth D at full size, 1.3–10× slower on ~half of real shapes). Newton dispatch bands form a ratio/size frontier (8/1 @ 896 limbs, 7/2 @ 1280, 14/5 @ 1792, 5/2 @ 2560, 2/1 @ 4096, 8/5 @ 8192, 4/3 @ 131072) with fractional ratios chosen to keep integer digit-ratio knife-edges (2.0000, 3.0000, 4.0000 ± 1 limb) off the band boundaries. Identityq·b + r == ais cross-checked intests/div_correctness.cpp. - Squaring: Specialized Classic / Karatsuba squarers below 640 limbs; above, the CRT NTT self-multiply (skips the duplicate forward transforms — 2.5–5.8× over the former Goldilocks
NTTSquare, which remains the-DBIGMATH_NTT_CRT=0fallback). - String I/O: Linear chunked parser/formatter for small inputs, divide-and-conquer with cached Newton reciprocals at scale; from 100k digits both directions fan their D&C subtrees out over the thread pool (the subtree outputs are fixed-width fields at precomputable offsets). Asymptotic
O(M(L) · log L)both directions. - BigDecimal: Java-style fixed-point decimal (unscaled BigInteger + int scale) with exact +, −, *; rounded division taking 8 rounding modes; parse/format covering plain and scientific notation.
- Calculator REPL: Variables, hex/bin/dec output, multi-line continuation, comments,
:help :quit :vars :reset :base :digits :time :load :savedirectives. Built as a separate executable. - Thread-safe by construction for concurrent use of distinct objects from distinct threads. See docs/THREAD_SAFETY.md.
include/biginteger/ public BigInteger headers (algorithms, ops, common)
include/bigdecimal/ public BigDecimal headers
src/ non-inline BigInteger implementations
bigdecimal/ BigDecimal implementation
tests/ correctness, performance, unit tests
calculator/ REPL frontend
docs/ technical references (see below)
- CMake ≥ 3.16
- C++20 compiler with
__uint128_t+__builtin_*_overflow— GCC ≥ 10, Clang ≥ 11, or Apple Clang. MSVC is not supported. - (Optional)
libgmpfor the GMP comparison benchmark.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -jOutputs:
build/libbigmath.a— static librarybuild/calculator— REPLbuild/{mult_correctness, div_correctness, unit_tests}— test binariesbuild/{multperf_simple, divperf_simple, bench_vs_gmp}— performance harnesses
See BUILDING.md for shared-library mode, threshold overrides, and installation.
cd build && ctest --output-on-failuremult_correctness is fast; div_correctness runs a broad matrix and takes a few minutes; unit_tests is sub-second.
The standalone perf harnesses multperf_simple and divperf_simple (BigMath-internal, no GMP) take an optional run-count k (default 3); each size is timed k times and the mean reported:
./build/multperf_simple 5 # average of 5 runs per size
./build/divperf_simple 5See BENCHMARK.md for the latest k=5 numbers.
#include "biginteger/BigInteger.h"
#include "biginteger/common/Builder.h"
#include "biginteger/common/Parser.h"
#include "biginteger/ops/Operations.h"
using namespace BigMath;
BigInteger a = BigIntegerBuilder::From("123456789012345678901234567890");
BigInteger b = BigIntegerBuilder::From("987654321098765432109876543210");
BigInteger c = a * b;
std::cout << ToString(c) << '\n';CMake consumers:
find_package(bigmath REQUIRED)
target_link_libraries(my_app PRIVATE bigmath::bigmath)./build/calculator
> x = 2^256 - 1
> x * x
> :base 16
> x
> :quitOperators: + - * / % ^ (with ^ right-associative). Literals: decimal, 0x… hex, 0b… binary, underscore separators (1_000_000). # to end of line is a comment. Trailing \ continues a line.
- docs/BASE.md — number representation, 64-bit limbs, why little-endian
- docs/MULTIPLICATION.md — Classic / Karatsuba / Toom-3 / NTT (Goldilocks + multi-prime CRT) and their tradeoffs
- docs/DIVISION.md — Classic / Fast (Knuth D + Möller-Granlund qhat) / Burnikel–Ziegler / Newton / Reciprocal-cached
- docs/STRING_CONVERSION.md — chunked decimal I/O, D&C parse/format, Newton-divider chain
- docs/BIGDECIMAL.md — fixed-point decimal model, rounding modes, performance
- docs/THREAD_SAFETY.md — concurrency model, opt-in internal parallelism
- BUILDING.md — CMake build, install, build flags, threshold tuning
Each doc covers algorithms, dispatch, benchmark numbers vs GMP, optimizations that landed, and approaches that were tried and rejected with reasons.
Apple M1 Max, vs GMP 6.3.0, -O3 -march=native, full default stack (BIGMATH_LIMB_64=1 + BIGMATH_NTT_CRT=1 + BIGMATH_USE_THREADS=1, 8-thread pool). Canonical bench_vs_gmp run, 2026-06-12 at v13.0; rows marked warm are steady-state best-of-3 (see BENCHMARK.md methodology):
| operation | size (digits) | BigMath | GMP | ratio |
|---|---|---|---|---|
| mul | 100 000 × 100 000 | 0.80 ms | 0.62 ms | 1.29× |
| mul | 1 000 000 × 1 000 000 | 7.9 ms | 9.3 ms | 0.85× ← BigMath faster |
| mul | 5 000 000 × 5 000 000 | 32.9 ms | 63.5 ms | 0.52× ← 1.9× faster |
| mul | 10 000 000 × 10 000 000 | 64 ms | 209 ms | 0.31× ← 3.2× faster |
| mul | 20 000 000 × 20 000 000 | 129 ms | 275 ms | 0.47× ← 2.1× faster |
| mul | 200 000 000 × 200 000 000 | 2 809 ms (warm) | 2 907 ms | 0.97× |
| mul (skewed) | 1 000 000 / 100 000 | 2.8 ms | 4.6 ms | 0.61× ← BigMath faster |
| mul (skewed) | 20 000 000 / 2 000 000 | 94 ms | 161 ms | 0.58× |
| mul (skewed) | 50 000 000 / 5 000 000 | 272 ms | 677 ms | 0.40× ← 2.5× faster |
| div (skewed) | 500 000 / 100 000 | 7.7 ms | 4.8 ms | 1.59× |
| div (skewed) | 5 000 000 / 1 000 000 | 69 ms | 69 ms | 1.00× ← parity |
| div (skewed) | 50 000 000 / 10 000 000 | 590 ms | 1 318 ms | 0.45× ← 2.2× faster |
| div (skewed) | 200 000 000 / 40 000 000 | 3 119 ms | 4 544 ms | 0.69× ← BigMath faster |
| parse | 1 000 000 | 12.5 ms | 20.9 ms | 0.60× ← BigMath faster |
| parse | 20 000 000 | 306 ms | 815 ms | 0.38× ← 2.7× faster |
| ToString | 100 000 | 3.4 ms (warm) | 2.3 ms | 1.46× |
| ToString | 1 000 000 | 27.7 ms (warm) | 50.3 ms | 0.55× ← BigMath faster |
| ToString | 10 000 000 | 488 ms (warm) | 901 ms | 0.54× |
What remains behind GMP — division below ~200k-digit dividends and everything below ~13k digits — is the hand-tuned-assembly basecase band, documented as the accepted portable-C++ wall in the subsystem docs. See BENCHMARK.md for full tables, warm/cold methodology, and the condensed optimization history.
Opt-out flags (-DBIGMATH_USE_THREADS=0 / -DBIGMATH_NTT_CRT=0 / -DBIGMATH_LIMB_64=0) revert any subset of the defaults — useful for embedded targets, header-only-strict consumers, or A/B comparison.
See LICENSE.
S. M. Mahbub Murshed (murshed@gmail.com)

