TensorRT 11 support (Blackwell / strongly-typed networks), incl. FP16 by gweber · Pull Request #1206 · lightvector/KataGo

gweber · 2026-06-06T22:40:11Z

Summary

Makes the TensorRT backend build and run against TensorRT 11 (e.g. on Blackwell / sm_120/sm_121 with CUDA 13), including FP16. Currently the backend fails to compile against TRT 11, and TRT 10 doesn't support these GPUs, so there's no working TRT path on the newest NVIDIA hardware today. Related: #1041 (RTX 5090 / Blackwell).

The change is in two commits that tell the story: first the FP32 port (gets it compiling + correct), then FP16 restored.

Why

TensorRT 11 removes weakly-typed networks and the entire builder-driven mixed-precision mechanism this backend relied on: BuilderFlag::kFP16, IBuilder::platformHasFastFp16(), per-layer ILayer::setPrecision()/setOutputType(), ITensor::setType(), kOBEY/kPREFER_PRECISION_CONSTRAINTS, and the kEXPLICIT_BATCH network-creation flag are all gone. Every network is now strongly typed: a tensor's precision is whatever the network (or parsed ONNX graph) declares, and the builder will not silently insert reformats.

What changed

1. FP32 strongly-typed port (neuralnet/trtbackend.cpp, command/sandbox.cpp)

createNetworkV2(0U) instead of the removed kEXPLICIT_BATCH flag.
Drop the kFP16/precision-constraint flags, platformHasFastFp16(), the post-parse FP32 pin loop, and all setPrecision/setType calls (the graph carries its own types now).
Outputs constrained to linear layout via setAllowedFormats only.

2. FP16 in the ONNX graph (neuralnet/onnxmodelbuilder.{cpp,h})
Under strongly-typed networks FP16 has to live in the graph, so OnnxModelBuilder gains a convertGraphToFloat16 post-pass: after the FP32 graph is built, every node is rewritten to FP16 except the numerically-sensitive regions — the RMSNorm square -> reduce -> sqrt reductions and the trunk tip + policy/value heads (exactly the node sets the old weakly-typed path pinned via setPrecision + kOBEY_PRECISION_CONSTRAINTS). Cast nodes are inserted in topological position on every FP16<->FP32 boundary edge, and FP16-only weight initializers are converted to FP16. Graph inputs/outputs stay FP32, so InputBuffers/getOutput and the device bindings are unchanged.
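
The boundary-cast idea can be sketched on a toy graph. This is only an illustration under stated assumptions, not the PR's convertGraphToFloat16: the Node struct, DType enum, convertToMixedPrecision function and the cast naming scheme are all hypothetical stand-ins for the real ONNX protobuf structures. It assumes a topologically sorted node list, one output per node, and treats any input that is neither a graph input nor a node output as a weight initializer that follows its consumer's type.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; the real pass operates on ONNX protobuf messages.
enum class DType { F32, F16 };

struct Node {
  std::string name;
  std::string op;
  std::vector<std::string> inputs;  // tensor names consumed by this node
  std::string output;               // single output tensor name
  DType dtype;                      // precision this node computes in
};

// Rewrites a topologically sorted FP32 graph to mixed precision.
// Nodes named in keepFp32 stay FP32, every other node becomes FP16, and a
// Cast node is emitted immediately before the first consumer of any tensor
// whose precision differs from that consumer's precision.
std::vector<Node> convertToMixedPrecision(
  const std::vector<Node>& graph,
  const std::set<std::string>& graphInputs,
  const std::set<std::string>& keepFp32
) {
  // Precision of every activation tensor seen so far. Graph inputs are FP32.
  std::map<std::string, DType> tensorType;
  for(const std::string& in : graphInputs) tensorType[in] = DType::F32;

  // (source tensor, target precision) -> cast output, so casts are reused.
  std::map<std::pair<std::string, DType>, std::string> castOutputs;

  std::vector<Node> result;
  for(Node node : graph) {
    node.dtype = keepFp32.count(node.name) ? DType::F32 : DType::F16;
    for(std::string& input : node.inputs) {
      auto produced = tensorType.find(input);
      // Unknown names are treated as weight initializers (assumption) and
      // are left alone; same-precision edges need no cast either.
      if(produced == tensorType.end() || produced->second == node.dtype) continue;

      const std::pair<std::string, DType> key(input, node.dtype);
      auto cached = castOutputs.find(key);
      if(cached == castOutputs.end()) {
        const std::string castName =
          "cast_" + input + (node.dtype == DType::F16 ? "_to_f16" : "_to_f32");
        result.push_back(Node{castName, "Cast", {input}, castName + "_out", node.dtype});
        tensorType[castName + "_out"] = node.dtype;
        cached = castOutputs.emplace(key, castName + "_out").first;
      }
      input = cached->second;  // rewire the consumer to the cast output
    }
    tensorType[node.output] = node.dtype;
    result.push_back(node);
  }
  return result;
}
```

In this sketch the graph outputs stay FP32 only if the nodes producing them are listed in keepFp32, which mirrors the description above where the policy/value heads are among the pinned regions.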

floatToHalf clamps finite out-of-range values to the max finite half (±65504) instead of promoting to Inf. KataGo uses a 1e9 sentinel as the off-board attention mask bias; an Inf there yields 0*Inf = NaN ("nonfinite policy sum") in the attention softmax. Clamping preserves the masking semantics (a huge finite negative bias still drives softmax to ~0) and matches onnxconverter-common.
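
A minimal sketch of such a clamping conversion is shown below. It is not the PR's floatToHalf: the name floatToHalfClamped and the bit-level details (round-to-nearest-even, NaN mapped to a quiet NaN) are assumptions chosen for illustration. The only behaviour taken from the description above is that finite values beyond the half range become ±65504 while a true Inf stays Inf.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: converts an IEEE-754 binary32 value to binary16 bits
// using round-to-nearest-even, except that finite values whose magnitude
// exceeds the largest finite half (65504) are clamped instead of becoming Inf.
uint16_t floatToHalfClamped(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const uint32_t sign = (bits >> 16) & 0x8000u;
  const uint32_t absBits = bits & 0x7FFFFFFFu;
  uint32_t half;

  if(absBits > 0x7F800000u) {
    half = 0x7E00u;                       // NaN stays NaN (quiet)
  } else if(absBits == 0x7F800000u) {
    half = 0x7C00u;                       // a true Inf stays Inf
  } else if(absBits > 0x477FE000u) {
    half = 0x7BFFu;                       // finite but > 65504: clamp
  } else if(absBits >= 0x38800000u) {
    // Normal half range: rebias the exponent and drop 13 mantissa bits.
    half = (absBits - 0x38000000u) >> 13;
    const uint32_t rem = absBits & 0x1FFFu;
    if(rem > 0x1000u || (rem == 0x1000u && (half & 1u))) half++;
  } else {
    // Result is a half subnormal or zero.
    const uint32_t exp = absBits >> 23;
    if(exp < 102u) {
      half = 0;                           // below 2^-25: rounds to zero
    } else {
      const uint32_t mant = (absBits & 0x007FFFFFu) | 0x00800000u;
      const uint32_t shift = 126u - exp;  // between 14 and 24
      half = mant >> shift;
      const uint32_t rem = mant & ((1u << shift) - 1u);
      const uint32_t halfway = 1u << (shift - 1u);
      if(rem > halfway || (rem == halfway && (half & 1u))) half++;
    }
  }
  return static_cast<uint16_t>(sign | half);
}
```

With this rule the 1e9 sentinel maps to the bit pattern 0x7BFF (65504) and -1e9 to 0xFBFF, rather than to the Inf patterns 0x7C00 and 0xFC00.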

tuneSalt bumped to invalidate FP16-pinned plan/timing caches from older builds.

Validation

Built on a GB10 (Grace-Blackwell, sm_121) with TensorRT 11.0 / CUDA 13. katago testgpuerror against an Eigen CPU reference, 2179 positions at 9x9, on both a convnet (g170 b10c128) and an nbt-transformer (b4c256 ... rsnh):

| Net | FP32 vs ref (winrate err avg/max) | FP16 vs ref (winrate err avg/max) |
| --- | --- | --- |
| convnet | 0.031% / 0.39% | 0.090% / 0.93% |
| nbt-transformer | 0.029% / 0.40% | 0.081% / 0.50% |

FP32 errors are at the TF32-tactic level; FP16 errors are normal half-precision magnitude, no NaN. The analysis engine returns valid policy/value/ownership. FP16 is ~2.4x the FP32 nnEval throughput on the 256-channel transformer at 19x19.

Notes

The hand-built ModelParser path (trtDisableOnnx) is ported too but is FP32-only; the default ONNX-emitter path supports FP16. useFP16=false forces a fully-FP32 engine.
Build deps unchanged except that the TRT backend needs protobuf (already required by the ONNX-emitter path).

…rks)

TensorRT 11 removes weakly-typed networks and the builder-driven mixed-precision machinery the backend relied on: the kFP16 builder flag, platformHasFastFp16(), per-layer ILayer::setPrecision()/setOutputType(), ITensor::setType(), the kOBEY/kPREFER_PRECISION_CONSTRAINTS flags, and the kEXPLICIT_BATCH network-creation flag are all gone. Networks are now always explicit-batch and strongly typed, so precision is whatever the network / parsed ONNX graph declares.

Port both build paths (the default ONNX-emitter path and the legacy hand-built ModelParser) to a fully FP32 strongly-typed network:

- createNetworkV2(0U) instead of the kEXPLICIT_BATCH flag
- drop the kFP16 / precision-constraint flags and the post-parse FP32 pin loop
- drop the per-layer setPrecision/setType calls (the graph is uniformly FP32)
- constrain outputs to a linear layout via setAllowedFormats only (setType is gone)
- same createNetworkV2 fix in command/sandbox.cpp
- bump tuneSalt 8->9 to invalidate the old FP16-pinned plan/timing caches

The engine runs in FP32 (TF32 tactics still apply). FP16 can be reintroduced later by emitting an explicitly-typed FP16 graph with casts around the numerically-sensitive regions (RMSNorm reductions, policy/value heads); the forceFP32 markers and the onnxResult node-name sets still identify those regions.

Validated on a GB10 (Blackwell sm_121) with TensorRT 11.0 / CUDA 13: testgpuerror against an Eigen CPU reference matches to TF32-tactic precision (~0.03% average winrate error) on both a convnet and an nbt-transformer net.

Stage A made the TensorRT 11 backend build and run in FP32. This restores FP16, which under strongly-typed networks must be expressed in the ONNX graph itself (the kFP16 builder flag and per-layer setPrecision are gone).

OnnxModelBuilder grows a convertGraphToFloat16 post-pass: after the FP32 graph is built, every node is rewritten to FP16 except the numerically-sensitive regions (the RMSNorm square/reduce/sqrt reductions and the trunk tip + policy/value heads, i.e. the same node sets the old weakly-typed path pinned via setPrecision + kOBEY_PRECISION_CONSTRAINTS). Casts are inserted in topological position on every edge that crosses an FP16<->FP32 boundary, and float weight initializers consumed only by FP16 nodes are converted to FP16. Graph inputs and outputs stay FP32, so InputBuffers / getOutput and the device bindings are unchanged.

floatToHalf clamps finite out-of-range values to the max finite half (+-65504) instead of promoting them to Inf: KataGo uses a 1e9 sentinel as the off-board attention mask bias, and an Inf there produces 0*Inf = NaN ("nonfinite policy sum") in the attention softmax. Clamping preserves the masking semantics (a huge finite negative bias still drives softmax to ~0) and matches onnxconverter-common.

The backend re-enables usingFP16 (useFP16 auto/true -> FP16 on the ONNX path; false, or the non-ONNX ModelParser path, stays FP32) and passes it to the emitter. tuneSalt 9->10 to invalidate FP32-only plan/timing caches.

Validated on a GB10 (Blackwell sm_121, TensorRT 11.0 / CUDA 13): testgpuerror vs an Eigen CPU reference shows normal FP16 precision (avg ~0.08% winrate error, no NaN) on both a convnet and an nbt-transformer net, and FP16 is ~2.4x the FP32 nnEval throughput on the 256-channel transformer at 19x19.

gweber added 2 commits June 7, 2026 00:15

gweber wants to merge 2 commits into lightvector:master from gweber:trt11-strongly-typed
