
feat(storage): full object checksum: implement rolling checksum and verification in reads resumption strategy #17262

Open
chandra-siri wants to merge 6 commits into feat/stream-checksum-metadata from feat/full-object-rolling-checksum


@chandra-siri chandra-siri commented May 26, 2026

1. Overview of the Solution

This solution implements end-to-end full-object checksum validation in AsyncMultiRangeDownloader, part of the asynchronous Google Cloud Storage Python client library. Multiplexed downloads of non-contiguous ranges run concurrently over a single bidirectional gRPC connection, so the downloader calculates a rolling checksum incrementally as bytes arrive and validates it against the server's authoritative object checksum once the download completes.

The technical approach consists of three coordinated layers:

  • _AsyncReadObjectStream (Stream Ingestion): Safely extracts the authoritative server checksum (full_obj_server_crc32c) and finalization status (is_finalized) from the object metadata received in the first data payload response of the stream.
  • _ReadResumptionStrategy & _DownloadState (Verification Logic): Keeps an isolated, persistent rolling checksum on each individual _DownloadState object so that calculations do not bleed across concurrent multiplexed ranges. Crucially, the rolling hash is updated only after the buffer write succeeds, which prevents state corruption when a retry reconnects. On completion, the strategy raises a DataCorruption exception if the rolling checksum does not match the server's.
  • AsyncMultiRangeDownloader (Orchestration & Cleanup): Detects candidate full-object ranges (e.g., (0, 0) or (0, persisted_size)), propagates checksum settings to the resumption strategy, and guarantees robust cleanup (closing the stream immediately and unregistering IDs) if data corruption or write errors occur.
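
The write-before-update ordering described above can be sketched as follows. This is a minimal illustration, not the PR's code: `DownloadStateSketch`, `apply_chunk`, and `FlakyBuffer` are hypothetical names, and a small pure-Python CRC32C stands in for `google_crc32c.Checksum()` so the example runs without extra dependencies.

```python
import io


def _build_crc32c_table():
    """Build the byte-wise table for CRC32C (reflected polynomial 0x82F63B78)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c_extend(crc, data):
    """Extend a running CRC32C value with more bytes (start from crc=0)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


class DownloadStateSketch:
    """Hypothetical per-range state; each range owns its own rolling checksum."""

    def __init__(self, buffer, enable_checksum=True):
        self.buffer = buffer
        self.enable_checksum = enable_checksum
        self.rolling_crc32c = 0
        self.bytes_written = 0


def apply_chunk(state, data):
    """Write first, then fold the bytes into the rolling checksum.

    If the write raises, neither the byte count nor the checksum moves,
    so a retry can resend the same chunk without double-counting it.
    """
    state.buffer.write(data)  # may raise; nothing below runs in that case
    state.bytes_written += len(data)
    if state.enable_checksum:  # skip the hashing work entirely when disabled
        state.rolling_crc32c = crc32c_extend(state.rolling_crc32c, data)


class FlakyBuffer(io.BytesIO):
    """Test helper (assumption): fails the first write, then behaves normally."""

    def __init__(self):
        super().__init__()
        self.fail_next = True

    def write(self, data):
        if self.fail_next:
            self.fail_next = False
            raise OSError("simulated write failure")
        return super().write(data)


state = DownloadStateSketch(FlakyBuffer())
try:
    apply_chunk(state, b"12345")  # first attempt fails before any state change
except OSError:
    pass
apply_chunk(state, b"12345")  # retry delivers the same bytes
apply_chunk(state, b"6789")
```

Because the checksum only advances after a successful write, the state after the retry is the same as if the failure had never happened, which is what makes the final comparison against the server checksum meaningful.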

2. What This PR Specifically Does

This PR implements Step 2 of the solution, Full-Object Rolling Checksum & Resumption Verification Logic:

  • Upgrades _DownloadState to track is_full_object_read and initialize an isolated google_crc32c.Checksum() rolling instance.
  • Updates _ReadResumptionStrategy.update_state_from_response() to run buffer writes before updating the rolling checksum, ensuring transactional safety during connection failures and retry reconnects.
  • Optimizes performance by bypassing rolling checksum calculations entirely if enable_checksum is False.
  • Performs the final validation at range_end, comparing the rolling checksum against the server's authoritative checksum and raising a DataCorruption exception if they do not match.
  • Adds comprehensive unit tests in test_reads_resumption_strategy.py to verify successful validation, failure exceptions, and bypassed checks when validation is disabled.
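
The end-of-range check and the disabled-checksum bypass listed above might look roughly like this. It is a sketch under stated assumptions: `verify_at_range_end` and its parameters are hypothetical, the local `DataCorruption` class is a stand-in for the library's exception, and treating a missing server checksum as "skip" is an assumption rather than something the PR description states.

```python
class DataCorruption(Exception):
    """Stand-in for the library's DataCorruption exception (defined here for the sketch)."""


def verify_at_range_end(
    *,
    rolling_crc32c,
    server_crc32c,
    bytes_read,
    range_end,
    is_full_object_read,
    enable_checksum=True,
):
    """Return True if verification ran and passed, False if it was skipped.

    Raises DataCorruption when verification runs and the checksums differ.
    """
    # Bypass: validation disabled, or this range does not cover the whole object.
    if not enable_checksum or not is_full_object_read:
        return False
    # Assumption: without a server checksum there is nothing to compare against.
    if server_crc32c is None:
        return False
    # Only compare once every byte up to range_end has been consumed.
    if bytes_read < range_end:
        return False
    if rolling_crc32c != server_crc32c:
        raise DataCorruption(
            f"full-object CRC32C mismatch: computed {rolling_crc32c:#010x}, "
            f"server reported {server_crc32c:#010x}"
        )
    return True


# 0xE3069283 is the CRC32C of b"123456789", used here as a sample value.
ok = verify_at_range_end(
    rolling_crc32c=0xE3069283,
    server_crc32c=0xE3069283,
    bytes_read=9,
    range_end=9,
    is_full_object_read=True,
)
```

The three cases mirror the unit tests the PR describes: a matching checksum passes, a mismatch raises, and a mismatch with `enable_checksum=False` is ignored.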

@chandra-siri chandra-siri requested a review from a team as a code owner May 26, 2026 20:54

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for full-object rolling checksum verification during resumed reads, allowing verification once the stream completes. It also adds a flag to conditionally disable checksum validation. The review feedback suggests optimizing performance by bypassing rolling checksum updates when checksums are disabled, and recommends adding a unit test to verify that full-object checksum mismatches are ignored when checksum validation is disabled.

@chandra-siri chandra-siri changed the title from "feat(storage): implement rolling checksum and verification in reads resumption strategy" to "feat(storage): full object checksum: implement rolling checksum and verification in reads resumption strategy" May 26, 2026

3 participants