Skip to content

Fix race in concurrent process.Wait() calls on shared exec.Cmd#327

Merged
jwilder merged 2 commits into
masterfrom
repo-agent/repo-agent/concurrent-processwait-calls-on-the-same-execcmd-can-race-and-break-shutdown-handling-1-findings
Jun 11, 2026
Merged

Fix race in concurrent process.Wait() calls on shared exec.Cmd#327
jwilder merged 2 commits into
masterfrom
repo-agent/repo-agent/concurrent-processwait-calls-on-the-same-execcmd-can-race-and-break-shutdown-handling-1-findings

Conversation

@jwilder

@jwilder jwilder commented Jun 11, 2026

Copy link
Copy Markdown
Owner

What changed

Guarded the call to exec.Cmd.Wait() in exec.go so it is invoked exactly once per command, regardless of how many callers attempt to wait on the process. Concurrent callers now coordinate safely and observe a consistent result instead of independently invoking Wait().

Added a regression test (exec_signal_test.go) that exercises concurrent process.Wait() calls on the same exec.Cmd and verifies shutdown handling behaves correctly.

Why

Go's exec.Cmd.Wait() is not safe to call more than once or from multiple goroutines simultaneously. When more than one path (for example, a signal/shutdown handler and the normal completion flow) called Wait() on the same Cmd, the calls would race. This led to inconsistent process state and broken shutdown handling.

This change serializes access so only a single Wait() is performed, eliminating the race and making shutdown handling deterministic.

How to verify

  1. Run the new test with the race detector enabled:
    go test -race -run TestConcurrentWait ./...
    
    The test fails (or reports a data race) without the fix and passes with it.
  2. Run the full package test suite to confirm no regressions:
    go test -race ./...
    

jwilder added 2 commits June 11, 2026 10:38
… declare `waitDone := make(chan struct{})` before the signal-handling goroutine, pass it as a new argument to `signalProcessWithTimeout(process, sig, waitDone)`, and add `close(waitDone)` immediately after the single `err = process.Wait()` returns (before/around the existing `cancel()`). Change `signalProcessWithTimeout` to the signature `func signalProcessWithTimeout(process *exec.Cmd, sig os.Signal, waitDone <-chan struct{})`, delete the internal goroutine that called `process.Wait()` and `close(done)`, keep the existing `process.Process.Signal(sig)` call (and its existing comment), and replace the `select` so it waits on `<-waitDone` (return) vs `<-time.After(10 * time.Second)` (log + `process.Process.Kill()`). Make ONLY these changes — do not alter imports, exit-code handling, or add `signal.Stop`. CRITICAL: the file must remain valid Go starting with `package main` and ending with a normal newline — do NOT insert any ```go opening fence, closing ``` fence, or backticks anywhere (this was the cause of the prior 'patch corrupted' rejection).
…e main, NO markdown fences or backticks anywhere). Add a subprocess-helper pattern that compiles against `runCmd` (whose signature is unchanged) so it can detect the data race under `-race`: (a) a helper test `TestRunCmdSignalShutdownHelper` that calls `t.Skip(...)` unless an env var like `RUNCMD_SIGNAL_SHUTDOWN_HELPER=1` is set, otherwise runs `runCmd(ctx, cancel, "sleep", "30")`; (b) a driver test `TestRunCmdSignalShutdown` that re-execs the test binary via `exec.Command(os.Args[0], "-test.run=^TestRunCmdSignalShutdownHelper$")`, sets `Env` to include `RUNCMD_SIGNAL_SHUTDOWN_HELPER=1` and `GORACE=halt_on_error=1`, captures the child's stderr into a `bytes.Buffer`, starts it, sleeps ~1s, sends `syscall.SIGTERM` to the child, then waits (with a ~20s timeout guarding against hangs) for the child to exit and asserts the captured stderr does NOT contain "DATA RACE". With the unfixed concurrent `process.Wait()` this test fails under `-race` (DATA RACE reported); with the fix it passes. Use only standard-library imports (context, os, os/exec, syscall, testing, time, bytes, strings).
@jwilder jwilder merged commit a1f49c2 into master Jun 11, 2026
12 checks passed
@jwilder jwilder deleted the repo-agent/repo-agent/concurrent-processwait-calls-on-the-same-execcmd-can-race-and-break-shutdown-handling-1-findings branch June 11, 2026 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant