
Debug GROMACS issue on Snellius #245

Draft
bedroge wants to merge 5 commits into EESSI:main from bedroge:gromacs_debug

Conversation

@bedroge
Contributor

bedroge commented Jun 3, 2026

Trying to debug the issue from EESSI/software-layer#1482 here.

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf Bot commented Jun 3, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.06/pr_245/23417907

date job status comment
Jun 03 08:38:25 UTC 2026 submitted job id 23417907 will be eligible to start in about 20 seconds
Jun 03 08:38:37 UTC 2026 received job awaits launch by Slurm scheduler
Jun 03 08:38:51 UTC 2026 running job 23417907 is running
Jun 03 08:40:28 UTC 2026 finished job id 23417907 was cancelled
Jun 03 08:40:38 UTC 2026 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job23417907.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Jun 03 08:40:38 UTC 2026 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job23417907.test does not exist in job directory, or parsing it failed.

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

bot:cancel jobid:23417907

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf Bot commented Jun 3, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.06/pr_245/23417964

date job status comment
Jun 03 08:40:52 UTC 2026 submitted job id 23417964 will be eligible to start in about 20 seconds
Jun 03 08:41:03 UTC 2026 received job awaits launch by Slurm scheduler
Jun 03 08:41:27 UTC 2026 running job 23417964 is running
Jun 03 08:49:24 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-23417964.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17804763670.tar.zst
size: 0 MiB (28870 bytes)
entries: 1
modules under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80
2025.06/init/easybuild/eb_hooks.py
Jun 03 08:49:24 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node %device_type=gpu /15d6e239 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node %device_type=gpu /5471f15a @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node %device_type=gpu /526cd259 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node %device_type=gpu /1dc400ef @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 5/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node %device_type=gpu /9715dde6 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 6/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node %device_type=gpu /416eaee1 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 7/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node /ed938ed4 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] ( 8/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node /8d24cea9 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] ( 9/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node /73a202f1 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] (10/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node /946648aa @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] (11/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node /9eb3f1e9 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] (12/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node /7f04eb2b @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ PASSED ] Ran 0/12 test case(s) from 12 check(s) (0 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-23417964.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

Easystack filename was wrong, let's try again

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf Bot commented Jun 3, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.06/pr_245/23418760

date job status comment
Jun 03 09:05:38 UTC 2026 submitted job id 23418760 will be eligible to start in about 20 seconds
Jun 03 09:05:50 UTC 2026 received job awaits launch by Slurm scheduler
Jun 03 09:06:04 UTC 2026 running job 23418760 is running
Jun 03 10:29:33 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-23418760.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17804823860.tar.zst
size: 0 MiB (28963 bytes)
entries: 1
modules under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80
2025.06/init/easybuild/eb_hooks.py
Jun 03 10:29:33 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node %device_type=gpu /15d6e239 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node %device_type=gpu /5471f15a @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node %device_type=gpu /526cd259 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node %device_type=gpu /1dc400ef @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 5/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node %device_type=gpu /9715dde6 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 6/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node %device_type=gpu /416eaee1 @BotBuildTests:gpu_a100+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 7/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node /ed938ed4 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] ( 8/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node /8d24cea9 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] ( 9/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node /73a202f1 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] (10/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5.1-gompi-2025b-CUDA-12.9.1 %scale=1_4_node /946648aa @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] (11/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a-CUDA-12.8.0 %scale=1_4_node /9eb3f1e9 @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ SKIP ] (12/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_4_node /7f04eb2b @BotBuildTests:gpu_a100+default [Skipping test : 1 GPU(s) available for this test case, need exactly 2]
[ PASSED ] Ran 0/12 test case(s) from 12 check(s) (0 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-23418760.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

The log file confirms that ulimit -l is somehow set to 8192 instead of unlimited when the test step is run. Let's try again and print the value at the start of the job.
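
For reference, the same check can be done from Python with the standard `resource` module. This is only a minimal sketch, not what the job script in this PR does; the helper names `format_limit` and `memlock_report` are made up for illustration. `getrlimit` returns bytes, while `ulimit -l` reports kbytes, hence the division by 1024.

```python
import resource


def format_limit(value_bytes):
    """Render an rlimit value the way `ulimit -l` does: kbytes, or 'unlimited'."""
    if value_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value_bytes // 1024)


def memlock_report():
    """Return a one-line summary of the soft and hard RLIMIT_MEMLOCK of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return (
        "max locked memory (kbytes): "
        f"soft={format_limit(soft)} hard={format_limit(hard)}"
    )


# Printing this at the very start of a job shows what the job inherited.
print(memlock_report())
```

A soft value of 8192 here corresponds to the 8192 kbytes seen in the log.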

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf Bot commented Jun 3, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.06/pr_245/23423556

date job status comment
Jun 03 11:27:50 UTC 2026 submitted job id 23423556 will be eligible to start in about 20 seconds
Jun 03 11:27:58 UTC 2026 received job awaits launch by Slurm scheduler
Jun 03 11:28:42 UTC 2026 running job 23423556 is running
Jun 03 11:29:11 UTC 2026 finished job id 23423556 was cancelled
Jun 03 11:29:27 UTC 2026 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job23423556.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Jun 03 11:29:27 UTC 2026 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job23423556.test does not exist in job directory, or parsing it failed.

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

bot:cancel jobid:23423556

@bedroge
Contributor Author

bedroge commented Jun 3, 2026

Indeed, at the start it's already 8192:

ULIMITS:
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2062171
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) 125829120
open files                          (-n) 8192
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1030690
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

@casparvl
Contributor

casparvl commented Jun 3, 2026

This is what the event-handler process looks like on the node where it runs:

[eessibot@tcn1 ~]$ cat /proc/2398391/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1030690              1030690              processes
Max open files            8192                 8192                 files
Max locked memory         8388608              8388608              bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1030690              1030690              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

So yes, that's limited. I guess this may be due to how the bot process is started (by our config management system). If I just log in myself in an interactive shell, it's unlimited.
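
To compare this listing with the `ulimit` output earlier in the thread, note that `/proc/<pid>/limits` reports locked memory in bytes while `ulimit -l` reports kbytes. Below is a small sketch of parsing the listing; `parse_limits` and `SAMPLE` are illustrative names, and the sample rows are copied from the output above. It assumes columns are separated by at least two spaces, which holds for that output.

```python
import re

# A few rows copied from the /proc/<pid>/limits output above.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max stack size            8388608              unlimited            bytes
Max open files            8192                 8192                 files
Max locked memory         8388608              8388608              bytes
Max nice priority         0                    0
"""


def parse_limits(text):
    """Map each limit name to a (soft, hard, units) tuple of strings."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        name, soft, hard = fields[:3]
        units = fields[3] if len(fields) > 3 else ""  # some rows have no unit
        limits[name] = (soft, hard, units)
    return limits


soft, hard, units = parse_limits(SAMPLE)["Max locked memory"]
print(f"soft={soft} hard={hard} {units} -> {int(soft) // 1024} kbytes")
# prints: soft=8388608 hard=8388608 bytes -> 8192 kbytes
```

So the 8388608 bytes of the event handler match the 8192 kbytes seen inside the job.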

@casparvl
Contributor

casparvl commented Jun 3, 2026

It turns out that this ulimit is inherited from the salt minion that starts the bot processes. In a regular login session, ulimit -l is set to unlimited, and that's why we couldn't reproduce this interactively.
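
The inheritance itself is easy to demonstrate in isolation. The sketch below has nothing to do with salt or the bot code; it only shows that a resource limit applied before `exec` is what the started program sees. The function name `soft_memlock_seen_by_child` is hypothetical, and a soft limit of 0 is used purely because it is always allowed (it can never exceed the hard limit).

```python
import resource
import subprocess
import sys

# Child command: report the soft RLIMIT_MEMLOCK (in bytes) that the child sees.
CHILD_CODE = "import resource; print(resource.getrlimit(resource.RLIMIT_MEMLOCK)[0])"


def soft_memlock_seen_by_child(new_soft_bytes):
    """Start a child with a lowered soft RLIMIT_MEMLOCK and return the value it reports."""

    def lower_soft_limit():
        # Runs in the forked child just before exec, standing in for a launcher
        # that itself runs with a restricted limit. The hard limit is left alone.
        _, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (new_soft_bytes, hard))

    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        preexec_fn=lower_soft_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# The child never sets a limit itself, yet it reports the lowered value.
print(soft_memlock_seen_by_child(0))
```

The parent's own limit is untouched, which mirrors why an interactive login shell shows a different value than processes started by the launcher.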

@casparvl
Contributor

casparvl commented Jun 3, 2026

The workaround is to try raising the limit with ulimit -l unlimited in the site config script for the SURF bot. I'll also discuss with the sysadmins whether that ulimit for the salt minion is even desirable.
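
One caveat when trying this: an unprivileged process can raise its soft limit only up to its hard limit, and the `/proc` listing earlier in this thread shows the hard limit for locked memory at 8388608 bytes as well. The sketch below (Python rather than the actual site config script, with a made-up function name) raises the soft limit as far as is permitted and reports the outcome instead of assuming it reached unlimited.

```python
import resource


def raise_memlock_soft_limit():
    """Raise the soft RLIMIT_MEMLOCK to the hard limit; return (soft, hard) afterwards."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft != hard:
        try:
            # Raising soft up to hard needs no privileges; going beyond hard does.
            resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the current limits and report them below
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


soft, hard = raise_memlock_soft_limit()
if hard == resource.RLIM_INFINITY:
    print("locked memory is now unlimited")
else:
    print(f"locked memory is still capped at {hard // 1024} kbytes (hard limit)")
```

If the second message appears, the limit has to be lifted where the process is started rather than inside the job.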



2 participants