perf: field stats more performant for high volume ingestion#1679
Conversation
WalkthroughRemoves the async semaphore concurrency gate from field-stats collection and switches to Rayon-only parallel batching. ChangesField Stats and Upload Pipeline Refactor
Sequence Diagram(s)sequenceDiagram
rect rgba(100, 150, 200, 0.5)
Note over upload_files_from_staging: Upload & collect
participant upload_files_from_staging
participant process_parquet_files
participant collect_upload_results
upload_files_from_staging->>process_parquet_files: spawn per-file upload tasks
process_parquet_files->>collect_upload_results: aggregate results
collect_upload_results-->>process_parquet_files: Vec<UploadedParquetFile>
process_parquet_files-->>upload_files_from_staging: Vec<UploadedParquetFile>
end
rect rgba(100, 200, 150, 0.5)
Note over upload_files_from_staging: Post-upload steps
participant update_snapshot_with_manifests
participant calculate_stats_for_uploaded_files
participant cleanup_uploaded_staged_files
upload_files_from_staging->>update_snapshot_with_manifests: derive manifests, update stream snapshot
upload_files_from_staging->>calculate_stats_for_uploaded_files: spawn field-stat tasks per file
calculate_stats_for_uploaded_files->>collect_all_field_stats_from_parquet: Rayon parallel batching
upload_files_from_staging->>cleanup_uploaded_staged_files: parallel remove_file for staged parquets
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/storage/object_storage.rs`:
- Around line 1253-1265: The loop spawning tasks in the
`calculate_stats_if_enabled` block creates unbounded concurrency with no
control, which can exhaust the blocking thread pool. Apply a semaphore-based
concurrency limit (following the same pattern already used in
`process_parquet_files`) to restrict the number of concurrent stats calculation
tasks. Create a semaphore with a reasonable bound (e.g., 100), acquire a permit
before spawning each task in the loop, and ensure the permit is held until the
spawned task completes so that at most N tasks run concurrently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0ce9dfda-86be-444e-b545-a7a4604d6c5a
📒 Files selected for processing (2)
src/storage/field_stats.rssrc/storage/object_storage.rs
Summary by CodeRabbit