improvement(schedules): retries, concurrency limits#4755

Merged
icecrasher321 merged 6 commits into staging from improvement/sched-disp on May 27, 2026

Conversation

@icecrasher321
Collaborator

@icecrasher321 icecrasher321 commented May 27, 2026

Summary

Schedule backpressure / discovery improvements.

Type of Change

  • Other: Performance

Testing

Tested in Staging environment

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | May 27, 2026 5:59pm |

@cursor

cursor Bot commented May 27, 2026

PR Summary

High Risk
Large changes to cron scheduling, locking, and async job recovery affect when workflows run and can duplicate or skip runs if claim logic regresses; migration and new env knobs need coordinated rollout.

Overview
This PR hardens scheduled workflow execution with env-tunable concurrency, enqueue budgets, jitter, and infrastructure retries, plus safer claim handling across Trigger.dev and database-backed async jobs.

The cron /api/schedules/execute path is reworked: claims carry workspace context, lastQueuedAt is compared on updates, stale runs are recovered or cancelled, database fallback jobs respect a global processing cap (advisory locks + pending resume), and the API returns processedCount instead of executedCount. infra_retry_count on schedules tracks queue/setup failures separately from failed_count, with backoff and exhaustion into normal failure/disable paths. A migration adds that column plus indexes for due schedules and schedule-execution async_jobs.
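
The "lastQueuedAt is compared on updates" guard is a compare-and-set: a write applies only if the row still carries the claim timestamp the worker observed. The sketch below illustrates the idea against an in-memory map instead of the real table. `ScheduleRow` and `applyGuardedUpdate` are hypothetical names for illustration, not identifiers from this PR.

```typescript
// Hypothetical in-memory stand-in for a row in the schedules table.
interface ScheduleRow {
  id: string
  lastQueuedAt: Date | null
  nextRunAt: Date
}

// Applies `patch` only if the row's lastQueuedAt still matches the value the
// claimant observed. In SQL this would be a conditional UPDATE whose WHERE
// clause also checks the last-queued timestamp.
function applyGuardedUpdate(
  table: Map<string, ScheduleRow>,
  id: string,
  expectedLastQueuedAt: Date | null,
  patch: Partial<Omit<ScheduleRow, 'id'>>
): boolean {
  const row = table.get(id)
  if (!row) return false
  const current = row.lastQueuedAt?.getTime() ?? null
  const expected = expectedLastQueuedAt?.getTime() ?? null
  if (current !== expected) return false // claim changed since we read it
  table.set(id, { ...row, ...patch })
  return true
}
```

A caller would treat a `false` return as "someone else owns this schedule now" and skip its own update rather than retrying the write.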

schedule-execution background work gains timezone-aware next-run calculation, retryable preprocessing errors (driver codes via a new helper), claim guards, and a Trigger.dev queue concurrency limit of 50. Schedule disable/reactivate clears lastQueuedAt / infraRetryCount as appropriate.

Smaller changes: workspace env routes invalidate a shorter-lived decrypted-env cache; copilot integration tool schemas use a brief LRU cache; envNumber supports integers; personal env API uses toError for 500 responses; cache size telemetry is dropped from memory monitoring.
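
The thread does not show how envNumber handles integers, so the following is only a guess at what integer-aware env parsing could look like. `envInteger` and its `min`/`max` options are hypothetical and are not the PR's API.

```typescript
// Hypothetical helper: parse an integer env value, falling back to a default
// when the value is missing, non-integer, or outside the allowed range.
function envInteger(
  raw: string | undefined,
  fallback: number,
  opts: { min?: number; max?: number } = {}
): number {
  if (raw === undefined || raw.trim() === '') return fallback
  const parsed = Number(raw)
  if (!Number.isInteger(parsed)) return fallback // rejects "2.5", "abc", NaN
  if (opts.min !== undefined && parsed < opts.min) return fallback
  if (opts.max !== undefined && parsed > opts.max) return fallback
  return parsed
}
```

A limit such as SCHEDULE_EXECUTION_CONCURRENCY_LIMIT could then be read with something like `envInteger(process.env.SCHEDULE_EXECUTION_CONCURRENCY_LIMIT, fallback, { min: 1 })`, where the fallback value is whatever default the module chooses.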

Reviewed by Cursor Bugbot for commit d3103b7. Configure here.

@greptile-apps
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR introduces schedule execution backpressure and reliability improvements: a configurable concurrency limit (SCHEDULE_EXECUTION_CONCURRENCY_LIMIT) gates how many schedule jobs can run in parallel, a per-tick enqueue budget (SCHEDULE_WORKFLOW_ENQUEUE_LIMIT) prevents runaway claim storms, and a new infra-retry path defers schedules that fail due to transient DB/network errors instead of immediately counting them as failures.
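
How the two limits interact can be pictured as taking the minimum of free processing slots, the remaining per-tick budget, and the number of due schedules. This is a simplified sketch under that assumption. `TickState` and `computeClaimCount` are illustrative names, and the real route may combine the limits differently.

```typescript
// Hypothetical snapshot of one cron tick's limits.
interface TickState {
  concurrencyLimit: number // e.g. the value of SCHEDULE_EXECUTION_CONCURRENCY_LIMIT
  runningJobs: number // schedule jobs currently processing
  remainingWorkflowBudget: number // what is left of SCHEDULE_WORKFLOW_ENQUEUE_LIMIT
}

// Number of schedules this tick may still claim: bounded by free slots,
// by the remaining enqueue budget, and by how many schedules are due.
function computeClaimCount(state: TickState, dueSchedules: number): number {
  const freeSlots = Math.max(0, state.concurrencyLimit - state.runningJobs)
  const budget = Math.max(0, state.remainingWorkflowBudget)
  return Math.min(freeSlots, budget, Math.max(0, dueSchedules))
}
```

With a limit of 50, 47 jobs running, and a budget of 10, only 3 schedules would be claimed even if 100 are due. The rest wait for a later tick.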

  • Concurrency + budget control (execute/route.ts, execution-limits.ts): The main scheduling loop now queries available processing slots before claiming new schedules, serialises job starts via an in-process turn gate backed by a PostgreSQL advisory lock, and tracks a per-tick remainingWorkflowBudget to bound total enqueue work per cron invocation.
  • Infra-retry path (schedule-execution.ts, retryable-infrastructure.ts, preprocessing.ts): Retryable errors detected before workflow core starts are deferred with exponential backoff instead of incrementing failedCount; a new infraRetryCount column tracks attempts, and retry exhaustion falls through to the normal failure path.
  • Cache tuning (environment/utils.ts, copilot/chat/payload.ts): The effective-env cache TTL is reduced from 15 s to 2 s with targeted invalidation on writes; the tool-schema cache TTL drops from 30 s to 5 s; both adopt a promise-deduplication pattern to coalesce concurrent misses.
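
The promise-deduplication pattern mentioned for both caches can be sketched as a small TTL cache that shares one in-flight load among concurrent callers. `DedupedTtlCache` is a hypothetical class written for illustration. The PR's actual cache code is not shown in this thread.

```typescript
// Hypothetical TTL cache that coalesces concurrent misses for the same key.
class DedupedTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>()
  private readonly inFlight = new Map<string, Promise<V>>()
  private readonly ttlMs: number
  private readonly now: () => number

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.now = now
  }

  async get(key: string, load: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key)
    if (hit && hit.expiresAt > this.now()) return hit.value

    // A load is already running for this key: share its promise.
    const pending = this.inFlight.get(key)
    if (pending) return pending

    const promise: Promise<V> = load()
      .then((value) => {
        // Skip the write if invalidate() ran while the load was in flight.
        if (this.inFlight.get(key) === promise) {
          this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
        }
        return value
      })
      .finally(() => {
        if (this.inFlight.get(key) === promise) this.inFlight.delete(key)
      })
    this.inFlight.set(key, promise)
    return promise
  }

  // Targeted invalidation, e.g. after an env write.
  invalidate(key: string): void {
    this.entries.delete(key)
    this.inFlight.delete(key)
  }
}
```

A short TTL bounds staleness even when an invalidation is missed (for example, a write handled by another process), while deduplication keeps a burst of concurrent misses from each triggering its own load.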

Confidence Score: 5/5

Safe to merge. The concurrency-limit and infra-retry logic is well-guarded by PostgreSQL advisory locks, and the DB migration adds a non-breaking column with a default value.

All core state transitions (claim, defer, fail, recover) are guarded by expectedLastQueuedAt conditions that prevent double-processing. The advisory lock in tryStartDatabaseScheduleJob provides cross-process enforcement of the concurrency cap. The one edge case (null claimedAt bypassing the stale-claim check) is not reachable from any normal code path since payload.now is always a valid ISO timestamp.
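
The null claimedAt edge case can be made concrete with two variants of a claim check. Both functions are hypothetical reconstructions: the real isScheduleClaimCurrent is not shown here, and the assumption that a claim is "current" when the row's lastQueuedAt equals the claim timestamp is mine.

```typescript
// Permissive variant: a missing claimedAt is treated as "nothing to compare",
// so the check passes. This mirrors the behaviour the review describes.
function isClaimCurrentPermissive(rowLastQueuedAt: Date | null, claimedAt: string | null): boolean {
  if (claimedAt === null) return true
  return rowLastQueuedAt !== null && rowLastQueuedAt.getTime() === Date.parse(claimedAt)
}

// Strict variant: a missing or unparseable claimedAt fails the check.
function isClaimCurrentStrict(rowLastQueuedAt: Date | null, claimedAt: string | null): boolean {
  if (claimedAt === null || rowLastQueuedAt === null) return false
  const claimedMs = Date.parse(claimedAt)
  return !Number.isNaN(claimedMs) && rowLastQueuedAt.getTime() === claimedMs
}
```

The two variants only disagree when claimedAt is null, which the review considers unreachable because the payload always carries a valid ISO timestamp.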

No files require special attention. The migration is additive and the schedule-execution flow changes are isolated to the scheduler path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/app/api/schedules/execute/route.ts | Core scheduling loop rewritten with concurrency-limit backpressure, infra-retry deferral, and a per-tick enqueue budget. |
| apps/sim/background/schedule-execution.ts | Added retryable-infra-failure handling, stale-claim detection pre-core, and claim-guarded applyScheduleUpdate. Minor: isScheduleClaimCurrent returns true on null claimedAt. |
| apps/sim/lib/workflows/schedules/execution-limits.ts | New module centralising schedule concurrency/retry/jitter constants, all env-var backed with sensible defaults. |
| apps/sim/lib/core/errors/retryable-infrastructure.ts | New utility classifying DB/network error codes as retryable by walking the error cause chain. |
| apps/sim/lib/environment/utils.ts | Cache TTL shortened from 15 s to 2 s with targeted invalidation on env writes. |
| apps/sim/lib/copilot/chat/payload.ts | Tool schema cache TTL reduced from 30 s to 5 s, refactored to promise-deduplication pattern. |
| packages/db/schema.ts | Adds infraRetryCount column and two partial indexes on asyncJobs for schedule-execution rows. |
| apps/sim/lib/execution/preprocessing.ts | Error result now carries retryable and cause fields for all 500-class preprocessing failures. |
| packages/db/migrations/0215_flowery_hellcat.sql | Additive migration: new column with default and four targeted partial indexes. |
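
Classifying an error as retryable "by walking the error cause chain" could look roughly like the sketch below. The set of codes is an assumption chosen from well-known Node.js socket errors and PostgreSQL SQLSTATE values. The list actually used in retryable-infrastructure.ts is not shown in this thread, and `isRetryableInfrastructureError` is an illustrative name.

```typescript
// Assumed set of transient codes; the PR's real list may differ.
const RETRYABLE_CODES = new Set<string>([
  'ECONNRESET', // Node.js: connection reset by peer
  'ETIMEDOUT', // Node.js: operation timed out
  'ECONNREFUSED', // Node.js: connection refused
  '40001', // PostgreSQL: serialization_failure
  '40P01', // PostgreSQL: deadlock_detected
  '53300', // PostgreSQL: too_many_connections
  '57P01', // PostgreSQL: admin_shutdown
  '08006', // PostgreSQL: connection_failure
])

// Walks error -> error.cause -> ... and reports whether any link carries a
// retryable code. maxDepth guards against cyclic cause chains.
function isRetryableInfrastructureError(error: unknown, maxDepth = 5): boolean {
  let current: unknown = error
  for (let depth = 0; depth < maxDepth && current !== null && typeof current === 'object'; depth++) {
    const code = (current as { code?: unknown }).code
    if (typeof code === 'string' && RETRYABLE_CODES.has(code)) return true
    current = (current as { cause?: unknown }).cause
  }
  return false
}
```

Walking the chain matters because ORM and driver wrappers often rethrow with a generic message and keep the original driver error, with its code, in `cause`.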

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Cron tick]) --> B[recoverStaleDatabaseScheduleJobs]
    B --> C[getDatabaseScheduleExecutionSlots]
    C --> D[resumePendingDatabaseScheduleJobs]
    D --> E[getDatabaseScheduleExecutionSlots updated]
    E --> F{slots > 0 AND budget > 0?}
    F -- No --> G[schedulesExhausted = true]
    F -- Yes --> H[claimWorkflowSchedules + claimJobSchedules]
    H --> I[processScheduleItem per claimed schedule]
    I --> J{existing job?}
    J -- stale --> K[cancelJob / releaseScheduleLock]
    J -- pending DB --> L[executeDatabaseScheduleJob]
    J -- none --> M[jobQueue.enqueue]
    M --> N{useDatabaseFallback?}
    N -- Yes --> L
    N -- No --> O[Trigger.dev handles execution]
    L --> P[tryStartDatabaseScheduleJob via advisory lock]
    P -- capacity_full --> Q[job stays pending / resumed next tick]
    P -- started --> R[executeScheduleJob]
    R --> S[preprocessExecution]
    S -- retryable 500 --> T[retryScheduleAfterInfraFailure]
    T --> U{infraRetryCount > MAX_ATTEMPTS?}
    U -- Yes --> V[markClaimedScheduleFailed]
    U -- No --> W[set nextRunAt with backoff]
    S -- success --> X[runWorkflowExecution]
    X --> Y{retryable_setup_failure?}
    Y -- Yes --> T
    Y -- No --> Z[update schedule: success / failure / skip]
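
The retry branch of the flowchart (nodes T, U, V, W) can be read as a pure decision function: increment the infra retry counter, give up once it exceeds the maximum, otherwise push nextRunAt out with exponential backoff. The sketch below assumes the counter is incremented before the comparison and leaves out jitter to stay deterministic. `InfraRetryPolicy` and `decideInfraRetry` are hypothetical names, and the delay values are placeholders.

```typescript
// Hypothetical policy; real values would come from execution-limits.ts.
interface InfraRetryPolicy {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
}

type InfraRetryDecision =
  | { kind: 'retry'; infraRetryCount: number; nextRunAt: Date }
  | { kind: 'exhausted' } // caller falls through to the normal failure path

function decideInfraRetry(
  currentInfraRetryCount: number,
  now: Date,
  policy: InfraRetryPolicy
): InfraRetryDecision {
  const attempt = currentInfraRetryCount + 1
  if (attempt > policy.maxAttempts) return { kind: 'exhausted' }
  // Delay doubles per attempt (base, 2x base, 4x base, ...) up to maxDelayMs.
  const delayMs = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1))
  return { kind: 'retry', infraRetryCount: attempt, nextRunAt: new Date(now.getTime() + delayMs) }
}
```

Keeping this counter separate from failedCount means a burst of transient database or network errors delays a schedule without moving it toward being disabled, until the retries are exhausted.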

Reviews (3): Last reviewed commit: "retryable errs cleanup"

Comment thread apps/sim/app/api/schedules/execute/route.ts
Comment thread apps/sim/lib/environment/utils.ts
Comment thread apps/sim/lib/copilot/chat/payload.ts
Comment thread apps/sim/app/api/schedules/execute/route.ts
Comment thread apps/sim/app/api/schedules/execute/route.ts
icecrasher321 and others added 3 commits May 27, 2026 10:18
Drop the pre-merge generated 0213 migration so it can be regenerated after syncing with staging.

Co-authored-by: Cursor <cursoragent@cursor.com>
@icecrasher321
Collaborator Author

bugbot run

@icecrasher321
Collaborator Author

@greptile

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit d3103b7. Configure here.

@icecrasher321
Copy link
Copy Markdown
Collaborator Author

@greptile

@icecrasher321 icecrasher321 merged commit 81bcdf2 into staging May 27, 2026
13 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/sched-disp branch May 27, 2026 23:29