test fix: Skip NVLink version checks for inactive links by rwgk · Pull Request #2154 · NVIDIA/cuda-python

rwgk · 2026-05-29T16:41:51Z

Context

This PR fixes a cuda.core system-test failure that was first observed while reviewing PR 2130:

Workflow run: https://github.com/NVIDIA/cuda-python/actions/runs/26611106556?pr=2130
Failed job: Test win-64 / Python 3.14, CUDA 13.3.0 (wheels), GPU h100 (x2) (MCDM)
Failed test: tests/system/test_system_device.py::test_nvlink
Error: RuntimeError: Invalid NvLink version returned for device

The failure was seen in the original CI attempt for PR #2130. PR #2130 itself adds coverage-oriented tests in other areas and does not modify tests/system/test_system_device.py, so the failure reflects an existing system-test fragility rather than a regression introduced by that PR.

CI log with full failure details:

pr2130_h100_win_py314_cuda133_attempt1_job78418596734.log

What Failed

The failing traceback showed that test_nvlink queried nvlink_info.version for link 0 and received NvlinkVersion.VERSION_INVALID from NVML:

tests\system\test_system_device.py:774:
>   version = nvlink_info.version

cuda\core\system\_nvlink.pxi:46:
>   raise RuntimeError("Invalid NvLink version returned for device")
E   RuntimeError: Invalid NvLink version returned for device

The relevant local values in the failure were:

link       = 0
max_links  = 18

The old test iterated over every index in range(NvlinkInfo.max_links) and queried the version before checking whether the link was active. On the failing H100 PCIe/MCDM runner, NVML reported an invalid version for at least one link slot. That is consistent with an inactive or unavailable NVLink slot, and the test should not assume that every slot up to max_links has a valid version.

Fix

This PR changes test_nvlink to query nvlink_info.state before querying nvlink_info.version.

The updated test now:

Retrieves the NvlinkInfo object for each possible link index.
Queries and validates nvlink_info.state.
Skips version validation for inactive links.
Keeps the existing version validation for active links.

This preserves the useful test invariant: if a link is active, its version should be available and well-formed. It avoids treating inactive link slots as failures.
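The ordering described above can be sketched without a GPU. This is a minimal illustration and not the actual test code: FakeNvlinkInfo and check_links are hypothetical names, and the boolean state and (major, minor) version tuple are assumptions about the shapes of the real properties. Link 0 is modelled as inactive, which is the inferred (not confirmed) condition on the failing runner.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeNvlinkInfo:
    """Hypothetical stand-in for NvlinkInfo; not the real cuda.core class."""

    active: bool
    raw_version: Optional[Tuple[int, int]]  # None models VERSION_INVALID

    @property
    def state(self) -> bool:
        return self.active

    @property
    def version(self) -> Tuple[int, int]:
        # Mirrors the behavior in the traceback: an invalid version raises.
        if self.raw_version is None:
            raise RuntimeError("Invalid NvLink version returned for device")
        return self.raw_version


def check_links(links):
    """Check state for every slot; check version only for active links."""
    validated = []
    for link, nvlink_info in enumerate(links):
        state = nvlink_info.state
        assert isinstance(state, bool)
        if not state:
            continue  # inactive slot: never touch nvlink_info.version
        version = nvlink_info.version
        assert len(version) == 2 and all(isinstance(v, int) for v in version)
        validated.append(link)
    return validated


# Link 0 is modelled as inactive with an invalid version; link 1 is active.
links = [FakeNvlinkInfo(False, None), FakeNvlinkInfo(True, (4, 0))]
print(check_links(links))  # prints [1]
```

With the old ordering (version queried before state), the same input would raise RuntimeError on link 0.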

copy-pr-bot · 2026-05-29T16:41:58Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rwgk · 2026-05-29T16:44:37Z

PR 2130 CI Flake Report: `test_system_device.py::test_nvlink`

TL;DR: Look for "The strongest signal is:" below.

Workflow run: https://github.com/NVIDIA/cuda-python/actions/runs/26611106556?pr=2130

PR: #2130

2026-05-29T01:12:35.9053712Z [command]"C:\Program Files\Git\bin\git.exe" -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +c549988f4215b2bbb703d76ffb47b66d82c28e63:refs/remotes/origin/pull-request/2130

commit c549988f4215b2bbb703d76ffb47b66d82c28e63 (HEAD -> rluo8→main, upstream/pull-request/2130, rluo8/main)
Merge: d865e33e1d 88363f8f17
Author: Rui Luo <ruluo@nvidia.com>
Date:   Thu May 28 17:46:43 2026 -0700

    Merge branch 'main' into main

Summary

PR 2130 added coverage-oriented tests under cuda_core/tests, but the observed CI failure was not in any of the newly added tests.

The original failed job was:

Job: Test win-64 / Python 3.14, CUDA 13.3.0 (wheels), GPU h100 (x2) (MCDM)
Job ID: 78418596734
Step: Run cuda.core tests
Failing test: tests/system/test_system_device.py::test_nvlink
Error: RuntimeError: Invalid NvLink version returned for device

Two reruns were observed:

Job ID 78435298937: cancelled by the 60-minute job timeout.
Job ID 78445744407: completed successfully.

pytest-randomly was active in all three cuda.core attempts.

Original Failure

In the original attempt, test_nvlink failed at 14% progress through the cuda.core test suite:

2026-05-29T01:20:11Z tests/system/test_system_device.py::test_nvlink FAILED [ 14%]

The failure traceback showed:

tests\system\test_system_device.py:774:
>   version = nvlink_info.version

cuda\core\system\_nvlink.pxi:46:
>   raise RuntimeError("Invalid NvLink version returned for device")
E   RuntimeError: Invalid NvLink version returned for device

The local variables shown in the traceback were:

link       = 0
max_links  = 18
nvlink_info = <cuda.core.system._device.NvlinkInfo object at ...>

The implementation in cuda/core/system/_nvlink.pxi calls nvml.device_get_nvlink_version(...) and raises if NVML returns NvlinkVersion.VERSION_INVALID.

The test in cuda_core/tests/system/test_system_device.py iterates all links from 0 to NvlinkInfo.max_links - 1 and asks for both version and state. The failure indicates that, on this runner, link 0 returned VERSION_INVALID rather than a usable NVLink version or a handled unsupported condition.

Rerun Behavior

The first rerun timed out at the job level, but test_nvlink had already run before cancellation:

2026-05-29T05:08:28Z tests/system/test_system_device.py::test_nvlink SKIPPED (Unsupported...) [ 14%]
2026-05-29T05:11:36Z ##[error]The operation was canceled.

So the timed-out rerun did not hang before reaching test_nvlink; it reached that test, which was skipped as unsupported.

The second rerun completed successfully. In that run, test_nvlink also skipped:

2026-05-29T06:01:00Z tests/system/test_system_device.py::test_nvlink SKIPPED (Unsupported...) [ 14%]
2026-05-29T06:01:34Z ========= 3247 passed, 332 skipped, 3 xfailed, 89 warnings in 50.73s ==========

`pytest-randomly` State

pytest-randomly was active in all three cuda.core attempts:

Original failed attempt:

pytest-randomly      4.1.0
Using --randomly-seed=3632140741
plugins: benchmark-5.2.3, mock-3.15.1, randomly-4.1.0, repeat-0.9.4, rerunfailures-16.3, timeout-2.4.0

First rerun, cancelled by timeout:

Using --randomly-seed=4141722146
plugins: benchmark-5.2.3, mock-3.15.1, randomly-4.1.0, repeat-0.9.4, rerunfailures-16.3, timeout-2.4.0

Second rerun, successful:

Using --randomly-seed=295967675
plugins: benchmark-5.2.3, mock-3.15.1, randomly-4.1.0, repeat-0.9.4, rerunfailures-16.3, timeout-2.4.0

Observation: each attempt used a different random seed, so test order could in principle differ, but test_nvlink ran at about 14% progress in all three attempts. Test ordering does not by itself explain the difference in the hardware/NVML return value.

Runner Comparison

The exact runner instance differed across all three attempts:

| Attempt | Job ID | Result | Runner name | Reported GPU |
| --- | --- | --- | --- | --- |
| Original | `78418596734` | Failed | `24ba-w-amd-g-h100-l-2-hm6cl-runner-v7tp5` | `NVIDIA H100 PCIe` |
| Rerun 1 | `78435298937` | Cancelled by timeout | `24ba-w-amd-g-h100-l-2-hm6cl-runner-j7l6c` | `NVIDIA H100 NVL` |
| Rerun 2 | `78445744407` | Passed | `24ba-w-amd-g-h100-l-2-hm6cl-runner-9mx8s` | `NVIDIA H100 NVL` |

All three jobs used:

Current runner version: '2.334.0'
Runner group name: 'nv-gpu-amd64-h100-2gpu'
Machine name: 'NV_RUNNER'
Driver Version: 581.15
CUDA Version: 13.0
Driver mode: MCDM

The important difference is that the failing original attempt landed on an NVIDIA H100 PCIe runner, while both reruns landed on NVIDIA H100 NVL runners.

Interpretation

The failure appears to be an existing cuda.core.system test fragility or platform-specific NVML behavior, not a regression caused by PR 2130.

The strongest signal is:

Original run on H100 PCIe: test_nvlink failed because link 0 returned VERSION_INVALID.
Rerun 1 on H100 NVL (job cancelled by timeout): test_nvlink skipped as unsupported.
Rerun 2 on H100 NVL (job passed): test_nvlink skipped as unsupported.

This suggests the H100 PCIe MCDM runner exposed a different NVML response path than the H100 NVL MCDM runners. The test assumes that every index in range(NvlinkInfo.max_links) has a valid version. On the failing H100 PCIe runner, at least link 0 did not.

Possible follow-up for the owning code:

Adjust test_nvlink to treat VERSION_INVALID similarly to an unsupported/inactive link, or only query version after verifying the link state or availability.
Add diagnostic logging around device name, link index, device_get_nvlink_state, and raw device_get_nvlink_version when the version is invalid.
If H100 PCIe should never expose valid NVLink links in this configuration, skip test_nvlink earlier based on device/platform capability.
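The diagnostic-logging follow-up could look roughly like the sketch below. It is an illustration under stated assumptions: describe_invalid_links is a hypothetical helper, the get_state and get_raw_version callables stand in for device_get_nvlink_state and the raw device_get_nvlink_version call (their real signatures are not shown here), and the INVALID sentinel is made up rather than taken from NVML.

```python
def describe_invalid_links(device_name, max_links, get_state, get_raw_version, invalid):
    """Return one diagnostic line per link whose raw version equals `invalid`."""
    lines = []
    for link in range(max_links):
        raw_version = get_raw_version(link)
        if raw_version != invalid:
            continue
        try:
            state = get_state(link)
        except Exception as exc:  # keep reporting even if the state query fails
            state = f"<state query failed: {exc!r}>"
        lines.append(
            f"device={device_name!r} link={link} state={state} raw_version={raw_version}"
        )
    return lines


# Made-up values for illustration only: link 0 is invalid and inactive.
INVALID = -1  # hypothetical sentinel, not the real VERSION_INVALID value
report = describe_invalid_links(
    "NVIDIA H100 PCIe",
    max_links=2,
    get_state=lambda link: link == 1,
    get_raw_version=lambda link: INVALID if link == 0 else 4,
    invalid=INVALID,
)
for line in report:
    print(line)
```

Passing the query functions in as callables keeps the helper testable without a GPU; a real version would presumably bind them to the device under test.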

PR 2130 Relevance

PR 2130 only adds tests for coverage in areas such as memory, launcher, linker, program, graph memory resource, and utilities.

The failing test was:

tests/system/test_system_device.py::test_nvlink

That file was not modified by PR 2130. Based on the logs, this should be treated as an unrelated CI/platform flake rather than evidence against the PR's added tests.

rwgk · 2026-05-29T16:46:14Z

@mdboom I'll defer running the tests until you're back.

Avoid querying inactive NVLink versions

349998b

github-actions bot added the "cuda.core" label (Everything related to the cuda.core module) May 29, 2026

rwgk self-assigned this May 29, 2026

rwgk requested a review from mdboom May 29, 2026 16:45

rwgk added the "P1" label (Medium priority - Should do) May 29, 2026

rwgk added this to the "cuda.core next" milestone May 29, 2026

rwgk added the "test" label (Improvements or additions to tests) May 29, 2026

rwgk wants to merge 1 commit into NVIDIA:main from rwgk:test_system_device_test_nvlink_fix
