Skip to content

perf: field stats more performant for high volume ingestion#1679

Merged
nikhilsinhaparseable merged 1 commit into
parseablehq:mainfrom
nikhilsinhaparseable:field-stats-hv
Jun 15, 2026
Merged

perf: field stats more performant for high volume ingestion#1679
nikhilsinhaparseable merged 1 commit into
parseablehq:mainfrom
nikhilsinhaparseable:field-stats-hv

Conversation

@nikhilsinhaparseable

@nikhilsinhaparseable nikhilsinhaparseable commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • Refactor
    • Improved error reporting and logging when calculating dataset field statistics.
    • Optimized parallel processing of Parquet files during uploads.
    • Enhanced handling of upload failures with more reliable file cleanup.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Removes the async semaphore concurrency gate from field-stats collection and switches to Rayon-only parallel batching. FieldCountState gains batch-aware constructors, a merge_state method, and conditional cardinality-cap logging. HyperLogLog gains a merge method. In object_storage, field-stat calculation is removed from individual file upload and deferred to a new post-snapshot batch step using a new UploadedParquetFile type.

Changes

Field Stats and Upload Pipeline Refactor

Layer / File(s) Summary
HyperLogLog merge and FieldCountState restructure
src/storage/field_stats.rs
Adds HyperLogLog::merge combining registers via per-register maxima. Extends FieldCountState with log_cardinality_cap flag and three constructors (new, new_for_batch, new_with_logging). Replaces count-merge approach with record_value and merge_state (approximate-state propagation, eviction by incoming count). Moves merge_counts behind #[cfg(test)].
Parquet scan pipeline: Rayon-only parallel batching
src/storage/field_stats.rs
Removes semaphore/Arc gating and serial-vs-parallel branching from collect_all_field_stats_from_parquet. Spawns blocking without a semaphore permit. Per-batch aggregation always uses parallel counting with max_field_statistics threaded through. Parallel helpers construct FieldCountState via new_for_batch and merge via merge_state.
Stats calculation result logging
src/storage/field_stats.rs
calculate_field_stats passes the full Result reference to log_stats_calculation_time. log_stats_calculation_time now accepts &Result<bool, PostError> and includes the underlying error in the failure warning log.
UploadedParquetFile type and deferred field-stat calculation
src/storage/object_storage.rs
Adds internal UploadedParquetFile struct. Removes schema parameter and in-upload stats from upload_single_parquet_file. Updates upload_files_from_staging to collect UploadedParquetFile results, update snapshot, calculate stats post-snapshot, then clean up staged files in parallel. Introduces update_snapshot_with_manifests and calculate_stats_for_uploaded_files.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(100, 150, 200, 0.5)
        Note over upload_files_from_staging: Upload & collect
        participant upload_files_from_staging
        participant process_parquet_files
        participant collect_upload_results
        upload_files_from_staging->>process_parquet_files: spawn per-file upload tasks
        process_parquet_files->>collect_upload_results: aggregate results
        collect_upload_results-->>process_parquet_files: Vec<UploadedParquetFile>
        process_parquet_files-->>upload_files_from_staging: Vec<UploadedParquetFile>
    end
    rect rgba(100, 200, 150, 0.5)
        Note over upload_files_from_staging: Post-upload steps
        participant update_snapshot_with_manifests
        participant calculate_stats_for_uploaded_files
        participant cleanup_uploaded_staged_files
        upload_files_from_staging->>update_snapshot_with_manifests: derive manifests, update stream snapshot
        upload_files_from_staging->>calculate_stats_for_uploaded_files: spawn field-stat tasks per file
        calculate_stats_for_uploaded_files->>collect_all_field_stats_from_parquet: Rayon parallel batching
        upload_files_from_staging->>cleanup_uploaded_staged_files: parallel remove_file for staged parquets
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • parseablehq/parseable#1675: Directly overlaps at collect_all_field_stats_from_parquet, FieldCountState batching/merge/capping, and HyperLogLog register merging in field_stats.rs.
  • parseablehq/parseable#1340: Adds field-level stats generation during staging uploads in object_storage.rs, the same flow being restructured in this PR.
  • parseablehq/parseable#1369: Modifies calculate_field_stats in field_stats.rs, touching the same top-level entry point refactored here.

Suggested reviewers

  • parmesant
  • nitisht

Poem

🐇 Hop, hop — no semaphore to wait!
Rayon spins the threads at Parseable's gate.
Each batch now merges with merge_state flair,
HyperLogLog registers split the air.
Stats flow after snapshots, tidy and bright —
This bunny approves: the pipeline's right! 🌿

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request has no description provided by the author, failing to meet the repository's template requirements for goals, solutions, rationale, key changes, and testing/documentation checklists. Add a comprehensive description following the template: explain the performance improvement goal, describe the refactoring approach, list key changes (Rayon thread pool, HyperLogLog merge, upload flow changes), and complete the testing and documentation checklists.
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'perf: field stats more performant for high volume ingestion' clearly summarizes the main optimization focus of the changeset, aligning with the refactoring of field statistics calculation for better performance.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/storage/object_storage.rs`:
- Around line 1253-1265: The loop spawning tasks in the
`calculate_stats_if_enabled` block creates unbounded concurrency with no
control, which can exhaust the blocking thread pool. Apply a semaphore-based
concurrency limit (following the same pattern already used in
`process_parquet_files`) to restrict the number of concurrent stats calculation
tasks. Create a semaphore with a reasonable bound (e.g., 100), acquire a permit
before spawning each task in the loop, and ensure the permit is held until the
spawned task completes so that at most N tasks run concurrently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0ce9dfda-86be-444e-b545-a7a4604d6c5a

📥 Commits

Reviewing files that changed from the base of the PR and between cc0dd62 and 6b67596.

📒 Files selected for processing (2)
  • src/storage/field_stats.rs
  • src/storage/object_storage.rs

Comment thread src/storage/object_storage.rs
@nikhilsinhaparseable nikhilsinhaparseable merged commit ca348be into parseablehq:main Jun 15, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants