KeyError when cancelling blocked pending jobs #3694

@ekouts

Reported by @toxa81:

I’m running the master branch of reframe (commit 36c052725281cb1a1e9583b8413015cedafbce61) with the following command on starlex:

uv run reframe -J=--account=csstaff -J=--reservation=uss140-shs131-nv590-staging -J=--gpus-per-node=4 -C /capstor/store/cscs/cscs/public/reframe/reframe-stable/starlex/cscs-reframe-tests.git/config/cscs.py --mode=maintenance --report-junit=report.xml --run --name CPUNodeBurnStreamCE --name CudaNodeBurnStreamCE --name DcgmRpmCheck --name PyTorchDdpCeNv

The command fails:

[==========] Running 6 check(s)
[==========] Started on Mon Jun 29 10:35:01 2026+0200

[----------] start processing checks
[ RUN      ] DcgmRpmCheck /989abc80 @starlex:normal+builtin
[ RUN      ] CudaNodeBurnStreamCE /af7164be @starlex:normal+builtin
[ RUN      ] CPUNodeBurnStreamCE /4872bcfd @starlex:normal+builtin
[ RUN      ] PyTorchDdpCeNv %num_nodes=1 %aws_ofi_nccl=True %image=nvcr.io#nvidia/pytorch:25.06-py3 /d1772459 @starlex:normal+builtin
[ RUN      ] PyTorchDdpCeNvlarge %num_nodes=3 %aws_ofi_nccl=True %image=nvcr.io#nvidia/pytorch:25.06-py3 /367e5166 @starlex:normal+builtin
[ RUN      ] PyTorchDdpCeNvlarge %num_nodes=8 %aws_ofi_nccl=True %image=nvcr.io#nvidia/pytorch:25.06-py3 /d5dbe538 @starlex:normal+builtin
[  PASSED  ] Ran 0/6 test case(s) from 6 check(s) (0 failure(s), 0 expected failure(s), 0 skipped, 0 aborted)
[==========] Finished on Mon Jun 29 10:35:08 2026+0200
ERROR: run session stopped: key error: <reframe.core.schedulers.slurm._SlurmJob object at 0xffffae3ba850>
ERROR: Traceback (most recent call last):
  File "/users/antonk/reframe/reframe/frontend/cli.py", line 1780, in main
    runner.runall(testcases, restored_cases)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/antonk/reframe/reframe/core/logging.py", line 1138, in _fn
    return fn(*args, **kwargs)
  File "/users/antonk/reframe/reframe/frontend/executors/__init__.py", line 731, in runall
    self._runall(testcases)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/users/antonk/reframe/reframe/frontend/executors/__init__.py", line 824, in _runall
    self._policy.exit()
    ~~~~~~~~~~~~~~~~~^^
  File "/users/antonk/reframe/reframe/frontend/executors/policies.py", line 430, in exit
    self._poll_tasks()
    ~~~~~~~~~~~~~~~~^^
  File "/users/antonk/reframe/reframe/frontend/executors/policies.py", line 480, in _poll_tasks
    sched.poll(*jobs)
    ~~~~~~~~~~^^^^^^^
  File "/users/antonk/reframe/reframe/core/schedulers/slurm.py", line 561, in poll
    self._cancel_if_blocked(jobs)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/users/antonk/reframe/reframe/core/schedulers/slurm.py", line 606, in _cancel_if_blocked
    pending_reasons[pending_job].setdefault([])
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: <reframe.core.schedulers.slurm._SlurmJob object at 0xffffae3ba850>

Log file(s) saved in '/users/antonk/reframe/reframe.log', '/users/antonk/reframe/reframe.out'

I think the bug was introduced in #3690: pending_reasons[pending_job] is not initialized correctly for the SlurmJobScheduler, so the lookup in _cancel_if_blocked raises a KeyError. It is initialized fine for the SqueueJobScheduler. I will make a PR to fix it.
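
For illustration, here is a minimal standalone sketch of the failure mode, not the actual reframe code. It assumes `pending_reasons` is a plain dict mapping job objects to lists of reason strings; `FakeJob`, `record_reason_buggy`, `record_reason_fixed`, and the reason string are hypothetical names used only for this example. The first helper mirrors the line from the traceback, where the subscript lookup fails before `setdefault` is ever called. The second shows one possible way to create the entry on first use.

```python
class FakeJob:
    """Hypothetical stand-in for a scheduler job object (hashed by identity)."""

    def __init__(self, jobid):
        self.jobid = jobid


def record_reason_buggy(pending_reasons, job, reason):
    # Mirrors the failing line in the traceback: pending_reasons[job]
    # raises KeyError for an unseen job before setdefault() can run.
    pending_reasons[job].setdefault([])
    pending_reasons[job].append(reason)


def record_reason_fixed(pending_reasons, job, reason):
    # Assumed fix: call setdefault() on the dict itself, so the per-job
    # list is created on first use and reused afterwards.
    pending_reasons.setdefault(job, []).append(reason)


job = FakeJob(1234)          # illustrative job id
reasons = {}

try:
    record_reason_buggy(reasons, job, 'ReqNodeNotAvail')
    buggy_raised = False
except KeyError:
    buggy_raised = True

record_reason_fixed(reasons, job, 'ReqNodeNotAvail')
record_reason_fixed(reasons, job, 'ReqNodeNotAvail')
```

Under the same assumption, a `collections.defaultdict(list)` would have a similar effect, at the cost of creating entries on every read of a missing key.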
