sandbox-runner leaks host file descriptors (virtio-fs /host-packages) → EMFILE "Too many open files" after sustained use #15

Description

@Odrec

Summary

On a self-hosted deployment (serving LibreChat), the sandbox-runner process leaks host file descriptors over time. After sustained use, its open-fd count climbs to the process RLIMIT_NOFILE, and from then on every sandbox execution fails with EMFILE ("Too many open files"). Even trivial commands (cat, sleep) and Python startup fail because shared libraries can no longer be open()ed. Restarting sandbox-runner clears the condition and executions work again, but the fd count then slowly climbs back.

Environment

  • Self-hosted from source, commit 3fa1f6c, KVM available (real libkrun microVM path, not the NsJail fallback).
  • sandbox-runner runs as a single long-lived process; /host-packages (the Python/Node package cache) is shared into each guest as a virtio-fs mount (launcher/src/main.rs → krun_set_root_disk_remount, "Mounted … as virtio-fs 'packages'").

Evidence

Measured on the sandbox-runner main PID after ~6 days of uptime, right before failures began:

  • Open fds: 65,507 vs RLIMIT_NOFILE soft = 65,536 (hard = 524,288). Effectively exhausted.
  • Host-wide fd limits were not the constraint (fs.file-max effectively unlimited; /proc/sys/fs/file-nr low). This is the per-process limit.
  • fd type breakdown (top entries) was dominated by the virtio-fs package mount plus eventfds:
 58  anon_inode:[eventfd]
 22  /host-packages/python/<ver>/lib/python<ver>/test/.../msg_N.txt
 16  /host-packages/python/<ver>/.../sklearn/datasets/tests/data/openml/.../*.json.gz
 14  /host-packages/node/<ver>/node_modules/pino/test/fixtures/.../fileN.js
 ...  (thousands more /host-packages/{python,node}/... entries)

The descriptors are overwhelmingly host handles to files under /host-packages — i.e. inodes the guests looked up through the virtio-fs packages mount — accumulating across executions in the persistent runner process.
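
For anyone who wants to produce a comparable breakdown on their own deployment, here is a minimal sketch. It assumes a Linux host with /proc and enough privileges to read the target's fd directory (typically root); the helper name fd_breakdown is hypothetical and not part of the project. It groups by exact target path, so it will not collapse similar paths into patterns the way the summarized table above does.

```shell
# Hypothetical helper: count what a process's open fds point at.
# Usage: fd_breakdown <pid> [top_n]
fd_breakdown() {
    local pid="$1" top="${2:-20}" fd
    # Each entry in /proc/<pid>/fd is a symlink to the open file,
    # socket, pipe, or anon inode (e.g. anon_inode:[eventfd]).
    for fd in /proc/"$pid"/fd/*; do
        readlink "$fd" 2>/dev/null
    done | sort | uniq -c | sort -rn | head -n "$top"
}
```

For example, `fd_breakdown "$PID" 30` prints the 30 most frequent targets with their counts.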

Likely mechanism

The virtio-fs backend keeps a host fd/handle per inode a guest looks up, held until a FUSE FORGET. It looks like these per-lookup handles (and per-VM eventfds) are not released when a microVM is torn down, so they accumulate in the long-lived sandbox-runner process. Over many executions this reaches RLIMIT_NOFILE → EMFILE.

Note the launcher already raises RLIMIT_NOFILE toward LAUNCHER_NOFILE_LIMIT (launcher/src/main.rs), which delays exhaustion but does not stop the accumulation — suggesting the fd growth is a known pressure point but the underlying retention isn't being reclaimed.

Reproduce

Run many executions that import packages (e.g. Python import numpy, pandas, sklearn) against one sandbox-runner instance and watch its fd count grow monotonically and not return to baseline between runs:

PID=$(docker inspect -f '{{.State.Pid}}' sandbox-runner)
ls /proc/$PID/fd | wc -l          # grows over time, never drops back
grep 'open files' /proc/$PID/limits
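
To make the growth easier to see over a longer run, the same two checks can be sampled on an interval. This is a rough sketch, not tooling from the project: the helper names (fd_count, fd_soft_limit, watch_fds) are hypothetical, and it assumes GNU date plus read access to /proc/<pid>.

```shell
# Hypothetical helpers for sampling a process's fd usage over time.

# Number of open fds for a PID.
fd_count() {
    ls /proc/"$1"/fd 2>/dev/null | wc -l
}

# Soft RLIMIT_NOFILE for a PID, read from /proc/<pid>/limits.
# The line looks like: "Max open files  <soft>  <hard>  files".
fd_soft_limit() {
    awk '/^Max open files/ {print $4}' /proc/"$1"/limits
}

# Print one timestamped sample every <interval> seconds, <samples> times.
# Usage: watch_fds <pid> [interval_seconds] [samples]
watch_fds() {
    local pid="$1" interval="${2:-60}" samples="${3:-10}" i
    for i in $(seq "$samples"); do
        printf '%s fds=%s soft_limit=%s\n' \
            "$(date -u +%FT%TZ)" "$(fd_count "$pid")" "$(fd_soft_limit "$pid")"
        sleep "$interval"
    done
}
```

Running `watch_fds "$PID" 300 288` while executions are being submitted would give a day of 5-minute samples; a leak shows up as a count that only ever rises.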

Workaround

Restart sandbox-runner (fd count drops back to ~1.3k and executions succeed). We now run a daily systemd timer to restart it as a stop-gap.
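
A threshold check could replace the blind daily restart, so the runner is only bounced when it is actually close to the limit. The sketch below is an assumption about how one might do that, not what we currently run: needs_restart is a hypothetical helper and the 80% default is arbitrary.

```shell
# Hypothetical helper: succeed (exit 0) when a process's open-fd count
# has reached <threshold_pct> percent of its soft RLIMIT_NOFILE.
# Usage: needs_restart <pid> [threshold_pct]
needs_restart() {
    local pid="$1" threshold_pct="${2:-80}" used limit
    used=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' /proc/"$pid"/limits 2>/dev/null)
    # No readable or numeric limit: do not trigger a restart.
    case "$limit" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Integer comparison of used/limit against the threshold percentage.
    [ $(( used * 100 )) -ge $(( limit * threshold_pct )) ]
}
```

From a timer or cron job this could be used as `needs_restart "$PID" 80 && docker restart sandbox-runner`, with PID obtained via the docker inspect command from the Reproduce section. It remains a stop-gap and does nothing about the underlying retention.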

Ask

  • Is the virtio-fs / microVM teardown expected to release these host fds, and if so where is that reclaim happening (or not)?
  • Can the per-execution /host-packages lookup handles (and eventfds) be released on guest/microVM shutdown so the count returns to baseline, rather than only raising the nofile ceiling?

Happy to provide more detail (full fd dump, timing, config) if useful.
