Summary
On a self-hosted deployment (serving LibreChat), the sandbox-runner process leaks host file descriptors over time. After sustained use, its open-fd count climbs to the process RLIMIT_NOFILE, and from then on every sandbox execution fails with EMFILE ("Too many open files"). Even trivial commands (cat, sleep) and Python startup fail because shared libraries can no longer be open()ed. Restarting sandbox-runner clears the condition and executions succeed again, but the fd count then slowly climbs back.
Environment
- Self-hosted from source, commit 3fa1f6c, KVM available (real libkrun microVM path, not the NsJail fallback).
- sandbox-runner runs as a single long-lived process; /host-packages (the Python/Node package cache) is shared into each guest as a virtio-fs mount (launcher/src/main.rs → krun_set_root_disk_remount, "Mounted … as virtio-fs 'packages'").
Evidence
Measured on the sandbox-runner main PID after ~6 days of uptime, right before failures began:
- Open fds: 65,507 vs RLIMIT_NOFILE soft = 65,536 (hard = 524,288). Effectively exhausted.
- Host-wide fd limits were not the constraint (fs.file-max effectively unlimited; /proc/sys/fs/file-nr low). This is the per-process limit.
- fd type breakdown (top entries) was dominated by the virtio-fs package mount plus eventfds:
58 anon_inode:[eventfd]
22 /host-packages/python/<ver>/lib/python<ver>/test/.../msg_N.txt
16 /host-packages/python/<ver>/.../sklearn/datasets/tests/data/openml/.../*.json.gz
14 /host-packages/node/<ver>/node_modules/pino/test/fixtures/.../fileN.js
... (thousands more /host-packages/{python,node}/... entries)
The descriptors are overwhelmingly host handles to files under /host-packages — i.e. inodes the guests looked up through the virtio-fs packages mount — accumulating across executions in the persistent runner process.
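A per-category count like the breakdown above can be produced with something along these lines. This is a minimal sketch, not the exact command used for the numbers in this report; the category names, the `classify_fd_targets` helper, and the `PID` default (the current shell, for a harmless dry run) are assumptions for illustration.

```shell
# Classify a process's open fds by link target to see which category dominates.
# Assumption: PID is the sandbox-runner main PID; it defaults to this shell.
PID="${PID:-$$}"

classify_fd_targets() {
  # Reads one fd target per line (as printed by readlink) and prints
  # "<count> <category>" lines, largest count first.
  awk '
    /^anon_inode:\[eventfd\]/ { c["eventfd"]++; next }
    /^\/host-packages\//      { c["host-packages"]++; next }
    /^socket:/                { c["socket"]++; next }
    /^pipe:/                  { c["pipe"]++; next }
                              { c["other"]++ }
    END { for (k in c) printf "%d %s\n", c[k], k }
  ' | sort -rn
}

# Resolve every fd symlink; fds that close mid-scan are silently skipped.
for fd in /proc/"$PID"/fd/*; do
  readlink "$fd" 2>/dev/null
done | classify_fd_targets
```

Run it once shortly after a restart and again after many executions: if the leak is as described, the host-packages and eventfd rows should be the ones that keep growing.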
Likely mechanism
The virtio-fs backend keeps a host fd/handle per inode a guest looks up, held until a FUSE FORGET. It looks like these per-lookup handles (and per-VM eventfds) are not released when a microVM is torn down, so they accumulate in the long-lived sandbox-runner process. Over many executions this reaches RLIMIT_NOFILE → EMFILE.
Note the launcher already raises RLIMIT_NOFILE toward LAUNCHER_NOFILE_LIMIT (launcher/src/main.rs), which delays exhaustion but does not stop the accumulation — suggesting the fd growth is a known pressure point but the underlying retention isn't being reclaimed.
Reproduce
Run many executions that import packages (e.g. Python import numpy, pandas, sklearn) against one sandbox-runner instance and watch its fd count grow monotonically and not return to baseline between runs:
PID=$(docker inspect -f '{{.State.Pid}}' sandbox-runner)
ls /proc/$PID/fd | wc -l # grows over time, never drops back
grep 'open files' /proc/$PID/limits
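To make the growth easier to see, the count can be sampled at a fixed interval while executions run. The sketch below is one way to do that; `INTERVAL`, `SAMPLES`, and the helper names are hypothetical, and `PID` defaults to the current shell so the script can be tried safely. For the real check, set `PID` from the `docker inspect` line above.

```shell
# Sample the open-fd count of one process at a fixed interval, to confirm that
# it grows and does not drop back between executions.
PID="${PID:-$$}"            # assumption: override with the sandbox-runner PID
INTERVAL="${INTERVAL:-1}"   # seconds between samples (hypothetical default)
SAMPLES="${SAMPLES:-3}"     # number of samples to take (hypothetical default)

count_fds() {
  # Number of entries in /proc/<pid>/fd, i.e. currently open descriptors.
  ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

soft_nofile() {
  # Soft "Max open files" limit as reported in /proc/<pid>/limits.
  awk '/^Max open files/ { print $4 }' "/proc/$1/limits" 2>/dev/null
}

i=0
while [ "$i" -lt "$SAMPLES" ]; do
  # One line per sample: epoch seconds, open fds, soft limit.
  printf '%s fds=%s soft_limit=%s\n' \
    "$(date +%s)" "$(count_fds "$PID")" "$(soft_nofile "$PID")"
  i=$((i + 1))
  if [ "$i" -lt "$SAMPLES" ]; then
    sleep "$INTERVAL"
  fi
done
```

With a longer interval and a larger sample count, redirecting the output to a file gives a simple time series that shows whether the count ever returns to the post-restart baseline.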
Workaround
Restart sandbox-runner (fd count drops back to ~1.3k and executions succeed). We now run a daily systemd timer to restart it as a stop-gap.
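As a possible refinement of the stop-gap, the timer could run a check that restarts only when fd usage is close to the soft limit. The following is a sketch under stated assumptions, not what we currently run: `THRESHOLD_PCT` and `DRY_RUN` are hypothetical knobs, `PID` defaults to the current shell so the script is safe to try, and `docker restart sandbox-runner` assumes the runner is a Docker container with that name (as in the reproduce commands).

```shell
# Restart only when fd usage nears the soft limit (sketch; dry run by default).
THRESHOLD_PCT="${THRESHOLD_PCT:-80}"   # hypothetical knob
DRY_RUN="${DRY_RUN:-1}"                # hypothetical knob; 1 = only report
PID="${PID:-$$}"                       # assumption: set to the sandbox-runner PID

fd_usage_pct() {
  # $1 = open fd count, $2 = soft RLIMIT_NOFILE; prints the integer percentage.
  echo $(( $1 * 100 / $2 ))
}

needs_restart() {
  # Exit status 0 when usage is at or above THRESHOLD_PCT.
  [ "$(fd_usage_pct "$1" "$2")" -ge "$THRESHOLD_PCT" ]
}

open=$(ls "/proc/$PID/fd" 2>/dev/null | wc -l | tr -d ' ')
soft=$(awk '/^Max open files/ { print $4 }' "/proc/$PID/limits" 2>/dev/null)

case "$soft" in
  ''|*[!0-9]*)
    # Missing or "unlimited" soft limit: nothing meaningful to compare against.
    echo "soft limit not numeric ('$soft'); skipping check" ;;
  *)
    if needs_restart "$open" "$soft"; then
      if [ "$DRY_RUN" = "1" ]; then
        echo "would restart: $open/$soft fds in use"
      else
        docker restart sandbox-runner   # assumption: container name
      fi
    else
      echo "ok: $open/$soft fds in use"
    fi ;;
esac
```

This only narrows when the restart happens; like raising the nofile ceiling, it does not address the underlying retention.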
Ask
- Is the virtio-fs / microVM teardown expected to release these host fds, and if so where is that reclaim happening (or not)?
- Can the per-execution /host-packages lookup handles (and eventfds) be released on guest/microVM shutdown so the count returns to baseline, rather than only raising the nofile ceiling?
Happy to provide more detail (full fd dump, timing, config) if useful.