test fix: Skip NVLink version checks for inactive links #2154
PR 2130 CI Flake Report:
| Attempt | Job ID | Result | Runner name | Reported GPU |
|---|---|---|---|---|
| Original | 78418596734 | Failed | 24ba-w-amd-g-h100-l-2-hm6cl-runner-v7tp5 | NVIDIA H100 PCIe |
| Rerun 1 | 78435298937 | Cancelled by timeout | 24ba-w-amd-g-h100-l-2-hm6cl-runner-j7l6c | NVIDIA H100 NVL |
| Rerun 2 | 78445744407 | Passed | 24ba-w-amd-g-h100-l-2-hm6cl-runner-9mx8s | NVIDIA H100 NVL |
All three jobs used:
Current runner version: '2.334.0'
Runner group name: 'nv-gpu-amd64-h100-2gpu'
Machine name: 'NV_RUNNER'
Driver Version: 581.15
CUDA Version: 13.0
Driver mode: MCDM
The important difference is that the failing original attempt landed on an NVIDIA H100 PCIe runner, while both reruns landed on NVIDIA H100 NVL runners.
Interpretation
The failure appears to be an existing cuda.core.system test fragility or platform-specific NVML behavior, not a regression caused by PR 2130.
The strongest signal is:
- Original run on H100 PCIe: `test_nvlink` failed because link `0` returned `VERSION_INVALID`.
- Rerun on H100 NVL: `test_nvlink` skipped as unsupported.
- Successful rerun on H100 NVL: `test_nvlink` skipped as unsupported.
This suggests the H100 PCIe MCDM runner exposed a different NVML response path than the H100 NVL MCDM runners. The test assumes that every index in range(NvlinkInfo.max_links) has a valid version. On the failing H100 PCIe runner, at least link 0 did not.
Possible follow-up for the owning code:
- Adjust `test_nvlink` to treat `VERSION_INVALID` similarly to an unsupported/inactive link, or only query `version` after verifying the link state or availability.
- Add diagnostic logging around device name, link index, `device_get_nvlink_state`, and raw `device_get_nvlink_version` when the version is invalid.
- If H100 PCIe should never expose valid NVLink links in this configuration, skip `test_nvlink` earlier based on device/platform capability.
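As a rough illustration of the first two follow-ups, the sketch below logs the context of an invalid version and treats it as a failure only when the link is active. `LinkVersion`, `LinkReading`, and `classify_link` are hypothetical names made up for this example; they are not part of `cuda.core` or NVML, and the enum values are placeholders.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger("nvlink_diagnostics")


class LinkVersion(Enum):
    """Hypothetical stand-in for an NVLink version enum (values are placeholders)."""
    VERSION_INVALID = 0
    VERSION_4_0 = 1


@dataclass
class LinkReading:
    """Hypothetical record of the raw values queried for one link slot."""
    device_name: str
    link: int
    active: bool
    version: LinkVersion


def classify_link(reading: LinkReading) -> str:
    """Return "ok", "inactive", or "error" for a single link reading."""
    if reading.version is LinkVersion.VERSION_INVALID:
        # Record everything needed to diagnose the runner after the fact.
        logger.warning(
            "invalid NVLink version: device=%s link=%d active=%s",
            reading.device_name, reading.link, reading.active,
        )
        # An invalid version is only a real failure on an active link.
        return "error" if reading.active else "inactive"
    return "ok"
```

With this shape, a reading such as `LinkReading("NVIDIA H100 PCIe", 0, False, LinkVersion.VERSION_INVALID)` would be logged and classified as `"inactive"` rather than failing the test.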
PR 2130 Relevance
PR 2130 only adds tests for coverage in areas such as memory, launcher, linker, program, graph memory resource, and utilities.
The failing test was:
tests/system/test_system_device.py::test_nvlink
That file was not modified by PR 2130. Based on the logs, this should be treated as an unrelated CI/platform flake rather than evidence against the PR's added tests.
@mdboom I'll defer running the tests until you're back.
Context
This PR fixes a `cuda.core` system-test failure that was first observed while reviewing PR 2130:
- Test win-64 / Python 3.14, CUDA 13.3.0 (wheels), GPU h100 (x2) (MCDM)
- `tests/system/test_system_device.py::test_nvlink`
- `RuntimeError: Invalid NvLink version returned for device`

The failure was seen in the original CI attempt for PR #2130. PR 2130 itself was adding coverage-oriented tests in other areas and did not modify `tests/system/test_system_device.py`, so the failing test was an existing system-test fragility rather than a regression introduced by that PR.

CI log with full failure details:
What Failed
The failing traceback showed that `test_nvlink` queried `nvlink_info.version` for link `0` and received `NvlinkVersion.VERSION_INVALID` from NVML:

The relevant local values in the failure were:
The old test iterated over every index in `range(NvlinkInfo.max_links)` and queried the version before checking whether the link was active. On the failing H100 PCIe/MCDM runner, NVML reported an invalid version for at least one link slot. That is consistent with an inactive or unavailable NVLink slot, and the test should not assume that every slot up to `max_links` has a valid version.
Fix
This PR changes `test_nvlink` to query `nvlink_info.state` before querying `nvlink_info.version`.

The updated test now:
- Creates an `NvlinkInfo` object for each possible link index.
- Queries `nvlink_info.state`.

This preserves the useful test invariant: if a link is active, its version should be available and well-formed. It avoids treating inactive link slots as failures.
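The state-before-version ordering can be sketched roughly as follows. This is not the actual test code from the PR: `FakeNvlinkInfo` is a hypothetical stand-in that only mimics the behavior described here (a `state` attribute, and a `version` attribute that raises `RuntimeError` when invalid). Representing the state as a boolean and the version as a `(major, minor)` tuple are assumptions made for the example.

```python
class FakeNvlinkInfo:
    """Hypothetical stand-in for one NVLink slot; not the cuda.core class."""

    def __init__(self, active, version=None):
        self._active = active
        self._version = version

    @property
    def state(self):
        # Assumed representation: True means the link is active.
        return self._active

    @property
    def version(self):
        # Mimics the failure seen in CI when NVML reports an invalid version.
        if self._version is None:
            raise RuntimeError("Invalid NvLink version returned for device")
        return self._version


def check_active_link_versions(links):
    """Check the version of active links only; return the indices checked."""
    checked = []
    for index, info in enumerate(links):
        if not info.state:
            # Inactive slot: never touch `version`, so it cannot fail here.
            continue
        major, minor = info.version
        assert isinstance(major, int) and isinstance(minor, int)
        checked.append(index)
    return checked
```

An inactive slot with no valid version is skipped, while an active link with an invalid version still raises, which keeps the invariant that active links must report a well-formed version.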