fix(codeapi-service): poll job results when queue events lag #10

Merged: rschlaefli merged 1 commit into main from fix/codeapi-bullmq-wait-fallback on Jun 29, 2026

@rschlaefli

What this fixes

This PR fixes the CodeAPI service path where a worker completes a BullMQ job but the API never receives the QueueEvents finish notification, so the request keeps waiting until JOB_TIMEOUT.

The API now waits for each job with an abortable queue-event listener plus a bounded Redis polling fallback. If the event path is missed or delayed, the service can still observe the completed or failed job state and return the worker result.

How it works

  • Adds waitForJobFinished(job, queue, queueEvents, timeoutMs) for API-side job waits.
  • Races completed / failed queue events against polling queue.getJob(jobId) and getState().
  • Returns the stored job returnvalue for completed jobs and throws the failed reason for failed jobs.
  • Fast-fails if the job disappears before a terminal state, instead of waiting for the full timeout.
  • Increases short-lived completed-job retention from count: 1 to count: 100, with a 60-second age bound (removeOnComplete: { age: 60, count: 100 }), so the fallback can still read results under concurrent completions. Failed-job retention is unchanged.
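
For reviewers who want the shape of the wait logic without opening the diff, here is a minimal sketch of the race described above. It is not the merged implementation: the `JobLike`, `QueueLike`, and `QueueEventsLike` interfaces are hypothetical stand-ins for the BullMQ `Job`, `Queue`, and `QueueEvents` objects, the `pollIntervalMs` parameter and the error messages are assumptions, and the event path here resolves with the event payload for brevity.

```typescript
// Hypothetical minimal shapes so the sketch runs without Redis or BullMQ.
type JobState = "completed" | "failed" | "active" | "waiting" | "delayed" | "unknown";

interface JobLike {
  id: string;
  returnvalue: unknown;
  failedReason: string;
  getState(): Promise<JobState>;
}

interface QueueLike {
  getJob(id: string): Promise<JobLike | undefined>;
}

type FinishedArgs = { jobId: string; returnvalue?: unknown; failedReason?: string };
type FinishedListener = (args: FinishedArgs) => void;

interface QueueEventsLike {
  on(event: "completed" | "failed", listener: FinishedListener): unknown;
  off(event: "completed" | "failed", listener: FinishedListener): unknown;
}

function waitForJobFinished(
  job: JobLike,
  queue: QueueLike,
  queueEvents: QueueEventsLike,
  timeoutMs: number,
  pollIntervalMs = 250, // assumption: the real interval is not stated in the PR
): Promise<unknown> {
  return new Promise<unknown>((resolve, reject) => {
    let settled = false;
    let pollTimer: ReturnType<typeof setTimeout> | undefined;
    let timeoutTimer: ReturnType<typeof setTimeout> | undefined;

    // Single exit point: stop timers and detach listeners before settling.
    const settle = (finish: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(pollTimer);
      clearTimeout(timeoutTimer);
      queueEvents.off("completed", onCompleted);
      queueEvents.off("failed", onFailed);
      finish();
    };

    const onCompleted: FinishedListener = (args) => {
      if (args.jobId === job.id) settle(() => resolve(args.returnvalue));
    };
    const onFailed: FinishedListener = (args) => {
      if (args.jobId === job.id) settle(() => reject(new Error(args.failedReason ?? "job failed")));
    };

    // Fallback path: read the job state directly in case the event was missed.
    const poll = async (): Promise<void> => {
      try {
        const current = await queue.getJob(job.id);
        if (settled) return;
        if (!current) {
          // Fast-fail: the job was removed before we saw a terminal state.
          settle(() => reject(new Error(`job ${job.id} disappeared before reaching a terminal state`)));
          return;
        }
        const state = await current.getState();
        if (state === "completed") { settle(() => resolve(current.returnvalue)); return; }
        if (state === "failed") { settle(() => reject(new Error(current.failedReason))); return; }
      } catch {
        // Assumption: a failed read is treated as transient and retried.
      }
      if (!settled) pollTimer = setTimeout(() => void poll(), pollIntervalMs);
    };

    timeoutTimer = setTimeout(
      () => settle(() => reject(new Error(`job ${job.id} timed out after ${timeoutMs} ms`))),
      timeoutMs,
    );
    queueEvents.on("completed", onCompleted);
    queueEvents.on("failed", onFailed);
    void poll();
  });
}
```

Routing every outcome through one `settle` function is what keeps the wait bounded: whichever of the event, poll, or timeout paths wins, the other two are cancelled and both listeners are detached.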

Branch coverage

  • Base: main
  • Head: 2e451bc
  • Reviewed: 1 commit, fix(codeapi-service): poll job results when queue events lag
  • Diff: 5 files changed, 315 insertions, 9 deletions
  • Covered: queue wait helper, focused unit tests, queue export, REST execute route, programmatic execute route, replay iteration route

Review focus

  • Check the event listener and polling fallback cancellation behavior.
  • Check whether removeOnComplete: { age: 60, count: 100 } is the right short retention bound for this service.
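
To make the retention question concrete, here is a deliberately simplified model of an age-plus-count bound. It is an illustration only, not BullMQ's actual pruning logic: `retained`, `KeepJobs`, and `Completed` are hypothetical names, and timestamps are plain seconds.

```typescript
// Simplified model: keep completed jobs that are young enough, newest first,
// then cap the list at `count`. Not BullMQ's real implementation.
interface KeepJobs { age?: number; count?: number }
interface Completed { id: string; finishedAt: number } // seconds

function retained(completed: Completed[], now: number, keep: KeepJobs): string[] {
  const fresh = completed
    .filter((j) => keep.age === undefined || now - j.finishedAt <= keep.age)
    .sort((a, b) => b.finishedAt - a.finishedAt);
  const capped = keep.count === undefined ? fresh : fresh.slice(0, keep.count);
  return capped.map((j) => j.id);
}

// Three jobs complete within two seconds of each other.
const finished: Completed[] = [
  { id: "a", finishedAt: 0 },
  { id: "b", finishedAt: 1 },
  { id: "c", finishedAt: 2 },
];

const withCountOne = retained(finished, 2, { age: 60, count: 1 });
const withCountHundred = retained(finished, 2, { age: 60, count: 100 });
```

Under this model, `count: 1` leaves only the newest job readable, so a poll for `a` or `b` would find nothing and hit the fast-fail path, while `count: 100` keeps all three until the age bound expires.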

Verification

Current head:

  • bun run test from service -> 276 pass, 0 fail
  • bun run build from service -> passed. Existing Rollup warnings and existing TS2352 warnings in src/egress-grant.ts:441 and src/service/replay-state.ts:302 remain.
  • git diff --cached --check before commit -> passed
  • opencode GLM 5.2 review -> no blocking findings after the abortable event/listener and polling cleanup

Failed or warning:

  • The first sandboxed bun run test attempt failed because Redis DNS was blocked in the sandbox. Rerunning the same suite with network access passed.

Security / privacy

  • No secret, auth, or logging behavior changes.
  • The fallback reads BullMQ job metadata already used by the API.
  • Completed-job retention is still bounded by age and count.

Blocking before merge

None.

Follow-up after merge

  • Bump the df-cloud CodeAPI target revision to the merged commit.
  • Run the df-cloud MR preview against stg.
  • Sync the CodeAPI app on staging.
  • Rerun the staging e2e tests with /tmp/codeapi-jwt.env.

@rschlaefli rschlaefli merged commit b66e87e into main Jun 29, 2026
2 checks passed
@rschlaefli rschlaefli deleted the fix/codeapi-bullmq-wait-fallback branch June 29, 2026 23:27