TensorRT 11 support (Blackwell / strongly-typed networks), incl. FP16 #1206
Open
gweber wants to merge 2 commits into
…rks)
TensorRT 11 removes weakly-typed networks and the builder-driven mixed-precision machinery the backend relied on: the kFP16 builder flag, platformHasFastFp16(), per-layer ILayer::setPrecision()/setOutputType(), ITensor::setType(), the kOBEY/kPREFER_PRECISION_CONSTRAINTS flags, and the kEXPLICIT_BATCH network-creation flag are all gone. Networks are now always explicit-batch and strongly typed, so precision is whatever the network / parsed ONNX graph declares.
Port both build paths (the default ONNX-emitter path and the legacy hand-built ModelParser) to a fully FP32 strongly-typed network:
- createNetworkV2(0U) instead of the kEXPLICIT_BATCH flag
- drop the kFP16 / precision-constraint flags and the post-parse FP32 pin loop
- drop the per-layer setPrecision/setType calls (the graph is uniformly FP32)
- constrain outputs to a linear layout via setAllowedFormats only (setType is gone)
- same createNetworkV2 fix in command/sandbox.cpp
- bump tuneSalt 8->9 to invalidate the old FP16-pinned plan/timing caches
The engine runs in FP32 (TF32 tactics still apply). FP16 can be reintroduced later by emitting an explicitly-typed FP16 graph with casts around the numerically-sensitive regions (RMSNorm reductions, policy/value heads); the forceFP32 markers and the onnxResult node-name sets still identify those regions.
Validated on a GB10 (Blackwell sm_121) with TensorRT 11.0 / CUDA 13: testgpuerror against an Eigen CPU reference matches to TF32-tactic precision (~0.03% average winrate error) on both a convnet and an nbt-transformer net.
Stage A made the TensorRT 11 backend build and run in FP32. This restores FP16,
which under strongly-typed networks must be expressed in the ONNX graph itself
(the kFP16 builder flag and per-layer setPrecision are gone).
OnnxModelBuilder grows a convertGraphToFloat16 post-pass: after the FP32 graph is
built, every node is rewritten to FP16 except the numerically-sensitive regions
(the RMSNorm square/reduce/sqrt reductions and the trunk tip + policy/value heads,
i.e. the same node sets the old weakly-typed path pinned via setPrecision +
kOBEY_PRECISION_CONSTRAINTS). Casts are inserted in topological position on every
edge that crosses an FP16<->FP32 boundary, and float weight initializers consumed
only by FP16 nodes are converted to FP16. Graph inputs and outputs stay FP32, so
InputBuffers / getOutput and the device bindings are unchanged.
floatToHalf clamps finite out-of-range values to the max finite half (+-65504)
instead of promoting them to Inf: KataGo uses a 1e9 sentinel as the off-board
attention mask bias, and an Inf there produces 0*Inf = NaN ("nonfinite policy
sum") in the attention softmax. Clamping preserves the masking semantics (a huge
finite negative bias still drives softmax to ~0) and matches onnxconverter-common.
The backend re-enables usingFP16 (useFP16 auto/true -> FP16 on the ONNX path;
false, or the non-ONNX ModelParser path, stays FP32) and passes it to the emitter.
tuneSalt 9->10 to invalidate FP32-only plan/timing caches.
Validated on a GB10 (Blackwell sm_121, TensorRT 11.0 / CUDA 13): testgpuerror vs an
Eigen CPU reference shows normal FP16 precision (avg ~0.08% winrate error, no NaN)
on both a convnet and an nbt-transformer net, and FP16 is ~2.4x the FP32 nnEval
throughput on the 256-channel transformer at 19x19.
Summary
Makes the TensorRT backend build and run against TensorRT 11 (e.g. on Blackwell / sm_120/sm_121 with CUDA 13), including FP16. Currently the backend fails to compile against TRT 11, and TRT 10 doesn't support these GPUs, so there's no working TRT path on the newest NVIDIA hardware today. Related: #1041 (RTX 5090 / Blackwell).
The change is in two commits that tell the story: first the FP32 port (gets it compiling + correct), then FP16 restored.
Why
TensorRT 11 removes weakly-typed networks and the entire builder-driven mixed-precision mechanism this backend relied on: BuilderFlag::kFP16, IBuilder::platformHasFastFp16(), per-layer ILayer::setPrecision()/setOutputType(), ITensor::setType(), kOBEY/kPREFER_PRECISION_CONSTRAINTS, and the kEXPLICIT_BATCH network-creation flag are all gone. Every network is now strongly typed: a tensor's precision is whatever the network (or parsed ONNX graph) declares, and the builder will not silently insert reformats.
What changed
1. FP32 strongly-typed port (neuralnet/trtbackend.cpp, command/sandbox.cpp)
- createNetworkV2(0U) instead of the removed kEXPLICIT_BATCH flag.
- Drop the kFP16/precision-constraint flags, platformHasFastFp16(), the post-parse FP32 pin loop, and all setPrecision/setType calls (the graph carries its own types now).
- Constrain outputs to a linear layout via setAllowedFormats only.
2. FP16 in the ONNX graph (neuralnet/onnxmodelbuilder.{cpp,h})
Under strongly-typed networks FP16 has to live in the graph, so OnnxModelBuilder gains a convertGraphToFloat16 post-pass: after the FP32 graph is built, every node is rewritten to FP16 except the numerically-sensitive regions, namely the RMSNorm square -> reduce -> sqrt reductions and the trunk tip + policy/value heads (exactly the node sets the old weakly-typed path pinned via setPrecision + kOBEY_PRECISION_CONSTRAINTS). Cast nodes are inserted in topological position on every FP16<->FP32 boundary edge, and FP16-only weight initializers are converted to FP16. Graph inputs/outputs stay FP32, so InputBuffers/getOutput and the device bindings are unchanged.
floatToHalf clamps finite out-of-range values to the max finite half (±65504) instead of promoting to Inf. KataGo uses a 1e9 sentinel as the off-board attention mask bias; an Inf there yields 0*Inf = NaN ("nonfinite policy sum") in the attention softmax. Clamping preserves the masking semantics (a huge finite negative bias still drives softmax to ~0) and matches onnxconverter-common.
tuneSalt bumped to invalidate FP16-pinned plan/timing caches from older builds.
Validation
Built on a GB10 (Grace-Blackwell, sm_121) with TensorRT 11.0 / CUDA 13. katago testgpuerror against an Eigen CPU reference, 2179 positions at 9x9, on both a convnet (g170 b10c128) and an nbt-transformer (b4c256 ... rsnh):
FP32 errors are at the TF32-tactic level; FP16 errors are normal half-precision magnitude, no NaN. The analysis engine returns valid policy/value/ownership. FP16 is ~2.4x the FP32 nnEval throughput on the 256-channel transformer at 19x19.
Notes
- The legacy ModelParser path (trtDisableOnnx) is ported too but is FP32-only; the default ONNX-emitter path supports FP16.
- useFP16=false forces a fully-FP32 engine.