TensorRT 11 support (Blackwell / strongly-typed networks), incl. FP16 #1206

Open
gweber wants to merge 2 commits into lightvector:master from gweber:trt11-strongly-typed

gweber commented Jun 6, 2026

Summary

Makes the TensorRT backend build and run against TensorRT 11 (e.g. on Blackwell / sm_120/sm_121 with CUDA 13), including FP16. Currently the backend fails to compile against TRT 11, and TRT 10 doesn't support these GPUs, so there is no working TRT path on the newest NVIDIA hardware today. Related: #1041 (RTX 5090 / Blackwell).

The change is split into two commits: first the FP32 port (which gets the backend compiling and correct), then the restoration of FP16.

Why

TensorRT 11 removes weakly-typed networks and the entire builder-driven mixed-precision mechanism this backend relied on: BuilderFlag::kFP16, IBuilder::platformHasFastFp16(), per-layer ILayer::setPrecision()/setOutputType(), ITensor::setType(), kOBEY/kPREFER_PRECISION_CONSTRAINTS, and the kEXPLICIT_BATCH network-creation flag are all gone. Every network is now strongly typed: a tensor's precision is whatever the network (or parsed ONNX graph) declares, and the builder will not silently insert reformats.

What changed

1. FP32 strongly-typed port (neuralnet/trtbackend.cpp, command/sandbox.cpp)

  • createNetworkV2(0U) instead of the removed kEXPLICIT_BATCH flag.
  • Drop the kFP16/precision-constraint flags, platformHasFastFp16(), the post-parse FP32 pin loop, and all setPrecision/setType calls (the graph carries its own types now).
  • Outputs constrained to linear layout via setAllowedFormats only.

2. FP16 in the ONNX graph (neuralnet/onnxmodelbuilder.{cpp,h})
Under strongly-typed networks FP16 has to live in the graph, so OnnxModelBuilder gains a convertGraphToFloat16 post-pass: after the FP32 graph is built, every node is rewritten to FP16 except the numerically-sensitive regions, namely the RMSNorm square -> reduce -> sqrt reductions and the trunk tip + policy/value heads (exactly the node sets the old weakly-typed path pinned via setPrecision + kOBEY_PRECISION_CONSTRAINTS). Cast nodes are inserted in topological position on every edge that crosses an FP16<->FP32 boundary, and float weight initializers consumed only by FP16 nodes are converted to FP16. Graph inputs/outputs stay FP32, so InputBuffers/getOutput and the device bindings are unchanged.
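The boundary-cast step can be illustrated on a toy graph. This is only a sketch under stated assumptions, not the PR's convertGraphToFloat16: the `Node` struct, `DType` enum, `insertBoundaryCasts` function, and the `_to_fp16` / `_to_fp32` tensor names are all hypothetical, the node list is assumed to be topologically sorted already, and weight initializers are left out entirely.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class DType { FP32, FP16 };

// Toy stand-in for an ONNX node: activation tensors are referred to by name.
struct Node {
  std::string name;
  std::string op;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Walks a topologically sorted node list. Nodes named in keepFP32 stay FP32,
// all others become FP16. Whenever a node consumes a tensor of the other type,
// a Cast node is emitted immediately before it and the input is rewired.
std::vector<Node> insertBoundaryCasts(const std::vector<Node>& nodes,
                                      const std::set<std::string>& graphInputs,
                                      const std::set<std::string>& keepFP32) {
  std::map<std::string, DType> tensorType;
  for (const std::string& in : graphInputs) tensorType[in] = DType::FP32;  // inputs stay FP32

  // (source tensor, target type) -> name of the cast output, so each cast is built once.
  std::map<std::pair<std::string, DType>, std::string> castCache;
  std::vector<Node> out;

  for (Node node : nodes) {  // copy: the inputs may be rewired
    const DType want = keepFP32.count(node.name) ? DType::FP32 : DType::FP16;
    for (std::string& in : node.inputs) {
      auto it = tensorType.find(in);
      assert(it != tensorType.end() && "inputs must be produced earlier in the list");
      if (it->second == want) continue;  // same precision: no boundary here

      const auto key = std::make_pair(in, want);
      auto cached = castCache.find(key);
      if (cached == castCache.end()) {
        const std::string castOut = in + (want == DType::FP16 ? "_to_fp16" : "_to_fp32");
        out.push_back(Node{"Cast_" + castOut, "Cast", {in}, {castOut}});
        tensorType[castOut] = want;
        cached = castCache.emplace(key, castOut).first;
      }
      in = cached->second;  // consume the cast result instead of the original tensor
    }
    for (const std::string& o : node.outputs) tensorType[o] = want;
    out.push_back(node);
  }
  return out;
}
```

Because each Cast is appended right before its first consumer, the output list stays topologically sorted. In this sketch graph outputs end up FP32 only if their producing node is in `keepFP32`, which mirrors the description above, where the policy/value heads are among the pinned regions.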

floatToHalf clamps finite out-of-range values to the max finite half (±65504) instead of promoting to Inf. KataGo uses a 1e9 sentinel as the off-board attention mask bias; an Inf there yields 0*Inf = NaN ("nonfinite policy sum") in the attention softmax. Clamping preserves the masking semantics (a huge finite negative bias still drives softmax to ~0) and matches onnxconverter-common.
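A minimal sketch of a saturating float-to-half conversion, assuming IEEE-754 binary32 input and round-to-nearest-even. The name `floatToHalfClamped` and the spot-check helper are hypothetical, and the PR's actual floatToHalf may differ in detail (for example in how it handles NaN or subnormals).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Converts a float to IEEE-754 binary16 bits. Finite values outside the half
// range saturate to +-65504 instead of becoming +-Inf; Inf and NaN pass through.
inline uint16_t floatToHalfClamped(float f) {
  constexpr float kMaxHalf = 65504.0f;
  if (std::isnan(f)) return 0x7E00;  // quiet NaN
  if (std::isfinite(f)) {
    if (f > kMaxHalf) f = kMaxHalf;
    if (f < -kMaxHalf) f = -kMaxHalf;
  }

  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const uint32_t sign = (bits >> 16) & 0x8000u;
  const uint32_t absBits = bits & 0x7FFFFFFFu;
  if (absBits >= 0x7F800000u) return static_cast<uint16_t>(sign | 0x7C00u);  // true Inf

  const int e = static_cast<int>(absBits >> 23) - 112;  // rebias exponent: 127 -> 15
  uint32_t mant = absBits & 0x007FFFFFu;

  if (e <= 0) {
    if (e < -10) return static_cast<uint16_t>(sign);  // too small: signed zero
    mant |= 0x00800000u;                              // restore the implicit leading 1
    const int shift = 14 - e;                         // 14..24
    uint32_t h = mant >> shift;                       // subnormal half mantissa
    const uint32_t rem = mant & ((1u << shift) - 1u);
    const uint32_t half = 1u << (shift - 1);
    if (rem > half || (rem == half && (h & 1u))) ++h; // round to nearest even
    return static_cast<uint16_t>(sign | h);
  }

  uint32_t h = (static_cast<uint32_t>(e) << 10) | (mant >> 13);
  const uint32_t rem = mant & 0x1FFFu;
  if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) ++h;  // cannot overflow after clamping
  return static_cast<uint16_t>(sign | h);
}

// Spot checks for the sketch above.
inline void checkFloatToHalfClamped() {
  assert(floatToHalfClamped(1.0f) == 0x3C00);
  assert(floatToHalfClamped(1e9f) == 0x7BFF);   // saturates to +65504
  assert(floatToHalfClamped(-1e9f) == 0xFBFF);  // saturates to -65504
}
```

Clamping before rounding is what keeps the result finite: a finite input never reaches the exponent value reserved for Inf, while an input that is already infinite is passed through unchanged.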

tuneSalt is bumped in each commit (8 -> 9 in the FP32 port, 9 -> 10 when FP16 is restored) to invalidate plan/timing caches written by older builds.
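Why a salt bump invalidates caches can be shown with a small sketch. The function `planCacheKey`, its parameters, and the key layout are hypothetical and are not taken from the KataGo source; the only point is that the salt is part of the cache key, so changing it makes every previously written file unreachable.

```cpp
#include <cassert>
#include <string>

// Hypothetical cache-key builder: any change to tuneSalt yields a new key,
// so plan/timing caches written under the old salt are never looked up again.
inline std::string planCacheKey(const std::string& modelHash,
                                const std::string& gpuName,
                                bool usingFP16,
                                int tuneSalt) {
  assert(!modelHash.empty());
  return "trt_" + modelHash + "_" + gpuName + "_" + (usingFP16 ? "fp16" : "fp32") +
         "_salt" + std::to_string(tuneSalt) + ".plan";
}
```

With this layout, `planCacheKey("abc", "GB10", true, 9)` and `planCacheKey("abc", "GB10", true, 10)` name different files, so an engine built by the older code path is rebuilt rather than reused.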

Validation

Built on a GB10 (Grace-Blackwell, sm_121) with TensorRT 11.0 / CUDA 13. katago testgpuerror against an Eigen CPU reference, 2179 positions at 9x9, on both a convnet (g170 b10c128) and an nbt-transformer (b4c256 ... rsnh):

| Network | FP32 vs ref (winrate err avg / max) | FP16 vs ref (winrate err avg / max) |
|---|---|---|
| convnet | 0.031% / 0.39% | 0.090% / 0.93% |
| nbt-transformer | 0.029% / 0.40% | 0.081% / 0.50% |

FP32 errors are at the TF32-tactic level; FP16 errors are of normal half-precision magnitude, with no NaN. The analysis engine returns valid policy/value/ownership. FP16 delivers ~2.4x the FP32 nnEval throughput on the 256-channel transformer at 19x19.

Notes

  • The hand-built ModelParser path (trtDisableOnnx) is ported too but is FP32-only; the default ONNX-emitter path supports FP16. useFP16=false forces a fully-FP32 engine.
  • Build deps unchanged except that the TRT backend needs protobuf (already required by the ONNX-emitter path).
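The precision selection described in the first note (and in the second commit message, where useFP16 auto/true selects FP16 on the ONNX path) can be sketched as a small decision function. The `Enabled` and `BuildPath` enums and the name `resolveUsingFP16` are hypothetical stand-ins, not the backend's actual types.

```cpp
// Hypothetical stand-ins for the tri-state useFP16 setting and the two build paths.
enum class Enabled { False, True, Auto };
enum class BuildPath { OnnxEmitter, ModelParser };

// FP16 is used only on the ONNX-emitter path, and only when useFP16 is auto or true.
constexpr bool resolveUsingFP16(Enabled useFP16, BuildPath path) {
  if (path == BuildPath::ModelParser) return false;  // hand-built path is FP32-only
  return useFP16 != Enabled::False;
}

static_assert(resolveUsingFP16(Enabled::Auto, BuildPath::OnnxEmitter), "auto selects FP16");
static_assert(!resolveUsingFP16(Enabled::True, BuildPath::ModelParser), "ModelParser stays FP32");
```

Keeping the rule in one place makes it easy to see that trtDisableOnnx implies an FP32 engine regardless of the useFP16 setting.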

gweber added 2 commits June 7, 2026 00:15
…rks)

TensorRT 11 removes weakly-typed networks and the builder-driven mixed-precision
machinery the backend relied on: the kFP16 builder flag, platformHasFastFp16(),
per-layer ILayer::setPrecision()/setOutputType(), ITensor::setType(), the
kOBEY/kPREFER_PRECISION_CONSTRAINTS flags, and the kEXPLICIT_BATCH
network-creation flag are all gone. Networks are now always explicit-batch and
strongly typed, so precision is whatever the network / parsed ONNX graph declares.

Port both build paths (the default ONNX-emitter path and the legacy hand-built
ModelParser) to a fully FP32 strongly-typed network:
- createNetworkV2(0U) instead of the kEXPLICIT_BATCH flag
- drop the kFP16 / precision-constraint flags and the post-parse FP32 pin loop
- drop the per-layer setPrecision/setType calls (the graph is uniformly FP32)
- constrain outputs to a linear layout via setAllowedFormats only (setType is gone)
- same createNetworkV2 fix in command/sandbox.cpp
- bump tuneSalt 8->9 to invalidate the old FP16-pinned plan/timing caches

The engine runs in FP32 (TF32 tactics still apply). FP16 can be reintroduced
later by emitting an explicitly-typed FP16 graph with casts around the
numerically-sensitive regions (RMSNorm reductions, policy/value heads); the
forceFP32 markers and the onnxResult node-name sets still identify those regions.

Validated on a GB10 (Blackwell sm_121) with TensorRT 11.0 / CUDA 13: testgpuerror
against an Eigen CPU reference matches to TF32-tactic precision (~0.03% average
winrate error) on both a convnet and an nbt-transformer net.

Stage A made the TensorRT 11 backend build and run in FP32. This restores FP16,
which under strongly-typed networks must be expressed in the ONNX graph itself
(the kFP16 builder flag and per-layer setPrecision are gone).

OnnxModelBuilder grows a convertGraphToFloat16 post-pass: after the FP32 graph is
built, every node is rewritten to FP16 except the numerically-sensitive regions
(the RMSNorm square/reduce/sqrt reductions and the trunk tip + policy/value heads,
i.e. the same node sets the old weakly-typed path pinned via setPrecision +
kOBEY_PRECISION_CONSTRAINTS). Casts are inserted in topological position on every
edge that crosses an FP16<->FP32 boundary, and float weight initializers consumed
only by FP16 nodes are converted to FP16. Graph inputs and outputs stay FP32, so
InputBuffers / getOutput and the device bindings are unchanged.

floatToHalf clamps finite out-of-range values to the max finite half (+-65504)
instead of promoting them to Inf: KataGo uses a 1e9 sentinel as the off-board
attention mask bias, and an Inf there produces 0*Inf = NaN ("nonfinite policy
sum") in the attention softmax. Clamping preserves the masking semantics (a huge
finite negative bias still drives softmax to ~0) and matches onnxconverter-common.

The backend re-enables usingFP16 (useFP16 auto/true -> FP16 on the ONNX path;
false, or the non-ONNX ModelParser path, stays FP32) and passes it to the emitter.
tuneSalt 9->10 to invalidate FP32-only plan/timing caches.

Validated on a GB10 (Blackwell sm_121, TensorRT 11.0 / CUDA 13): testgpuerror vs an
Eigen CPU reference shows normal FP16 precision (avg ~0.08% winrate error, no NaN)
on both a convnet and an nbt-transformer net, and FP16 is ~2.4x the FP32 nnEval
throughput on the 256-channel transformer at 19x19.