fix: Forward storage_options to parquet metadata reads#353
fix: Forward storage_options to parquet metadata reads#353Mattijs De Paepe (mattijsdp) wants to merge 10 commits into
Conversation
Schema/collection/failure-info reads passed `storage_options` (and `credential_provider`) to the data read via `pl.read_parquet` / `pl.scan_parquet`, but the separate embedded-schema metadata read called `pl.read_parquet_metadata` with no options. Against non-AWS S3-compatible stores reached purely through `storage_options` (lakeFS, MinIO, R2, Tigris, …) the metadata read fell back to the default AWS credential chain and endpoint, breaking typed reads. Thread the storage-related options into all metadata reads in `_storage/parquet.py` via a small `_metadata_read_options` helper. Fixes Quantco#352 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`read_parquet_metadata_schema` and `read_parquet_metadata_collection` read parquet metadata from a (possibly remote) source but accepted no options, so they could not reach non-AWS S3-compatible stores either. Accept and forward `**kwargs` (e.g. `storage_options`, `credential_provider`) to `pl.read_parquet_metadata`, matching `read_parquet`/`scan_parquet`. Add s3-marked regression tests covering both helpers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…retries Address review feedback on the storage-options forwarding fix: - Match the file's docstring convention for the public metadata helpers: drop the enumerated `storage_options`/`credential_provider` note and use the same terse "passed directly to :meth:`polars.read_parquet_metadata`" wording as `read_parquet`/`scan_parquet`. - Forward `retries` alongside `storage_options`/`credential_provider` in `_metadata_read_options`, since `read_parquet_metadata` accepts it and it is storage-reaching. Clarify in the docstring why the call sites must filter the scan/read kwargs (the narrower `read_parquet_metadata` signature rejects options like `n_rows`/`columns`) instead of forwarding everything. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Shrink `_metadata_read_options` to a one-line comment matching the other private helpers in the module (no oversized docstring). - Extract an `s3_storage_options` fixture (mirrors `s3_tmp_path`, but strips the AWS_* env vars so the store is reachable *only* via `storage_options`) and use it across the schema, collection and failure-info regression tests. - Add a failure-info regression test covering the `scan_failure_info` metadata read, and split the schema test so the typed read and the standalone `read_parquet_metadata_schema` helper are asserted independently. - Drop the end-to-end collection typed-read test: it cannot pass via `storage_options` alone because member discovery goes through fsspec (`url_to_fs`/`fs.exists`), which does not receive `storage_options` -- a separate limitation from the polars metadata read this PR fixes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation should stand on its own, so remove the issue-tracker links from the `s3_storage_options` fixture and the storage-options regression tests, and shrink their docstrings to a single line in line with the surrounding tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR ensures Parquet metadata reads (schema/collection/failure info) correctly forward storage_options so S3-backed files work even when credentials/endpoints are provided only via storage_options.
Changes:
- Forward metadata read options to
polars.read_parquet_metadataacross schema/collection/failure-info code paths. - Add an
s3_storage_optionsfixture that stripsAWS_*env vars to validate forwarding behavior. - Add S3-marked regression tests covering schema, collection, and failure-info metadata reads.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/schema/test_read_write_parquet.py | Adds S3 regression tests ensuring schema metadata reads forward storage_options. |
| tests/failure_info/test_storage.py | Adds S3 regression test ensuring failure-info metadata reads forward storage_options. |
| tests/conftest.py | Introduces s3_storage_options fixture that requires passing options explicitly. |
| tests/collection/test_storage.py | Adds S3 regression test ensuring collection metadata reads forward storage_options. |
| dataframely/schema.py | Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata. |
| dataframely/collection/collection.py | Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata. |
| dataframely/_storage/parquet.py | Forwards filtered read options to metadata reads and centralizes filtering logic. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #353 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 56 56
Lines 3458 3463 +5
=========================================
+ Hits 3458 3463 +5 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Andreas Albert (AndreasAlbertQC)
left a comment
There was a problem hiding this comment.
Thanks Mattijs De Paepe (@mattijsdp), looks great! What do you think about the inspect-based suggestion by copilot? It seems like a good idea to me.
…nature Introspect pl.read_parquet_metadata's parameters instead of hard-coding the allowlist, so the forwarded options track the installed polars version rather than drifting against it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
de6ad3b to
48793a5
Compare
Andreas Albert (AndreasAlbertQC)
left a comment
There was a problem hiding this comment.
Thanks! Looks good to me assuming CI goes through :) Oliver Borchert (@borchero) did you also want to review this?
| for var in ( | ||
| "AWS_ENDPOINT_URL", | ||
| "AWS_ACCESS_KEY_ID", | ||
| "AWS_SECRET_ACCESS_KEY", | ||
| "AWS_ALLOW_HTTP", | ||
| "AWS_S3_ALLOW_UNSAFE_RENAME", | ||
| "AWS_DEFAULT_REGION", | ||
| "AWS_REGION", | ||
| ): | ||
| monkeypatch.delenv(var, raising=False) |
There was a problem hiding this comment.
I'm not sure this has the desired effect. Polars likely only reads these variables at startup and does not re-read them for every read. Could we check that?
I don't have a good alternative idea but I think all of those tests are moot if I'm right.
There was a problem hiding this comment.
Good point! So yes, the tests were indeed moot as Polars caches an object store per bucket which isn't cleared by this monkeypatch. Other tests use the same bucket so it stayed reachable even without storage_options. This is per bucket though so I fixed this by creating a fresh bucket that is never env configured hence no cache and so requires storage_options forwarding. Ran the tests without the fix which then failed as expected. Let me know what you think.
Polars caches object stores per bucket on the Rust side, where monkeypatch.delenv cannot clear them. A bucket configured once from AWS_* env vars stayed reachable without storage_options, so the regression tests passed even when forwarding was dropped. Replace s3_storage_options with s3_isolated, which hands out a freshly-named bucket that is never env-configured. Reaching it requires forwarding storage_options to every read, so the tests now fail without the fix. Also drop a stale comment in parquet.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Oliver Borchert (borchero)
left a comment
There was a problem hiding this comment.
Thanks for the update! 🚀 just one more comment :)
| boto3.client( | ||
| "s3", endpoint_url=s3_server, aws_access_key_id="", aws_secret_access_key="" | ||
| ).create_bucket(Bucket=bucket) |
There was a problem hiding this comment.
This will keep the bucket around indefinitely which is not ideal for local development & debugging. Could you yield from the fixture and delete the bucket after the yield statement as "teardown" operation?
There was a problem hiding this comment.
Sure, done.
Note: I didn't use a paginator and technically list_objects_v2 caps at 1000 objects but I think this is fine for this use case?
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Motivation
Fixes #352.
Schema.read_parquet/scan_parquet(plus collection and failure-info reads) failed against S3-compatible stores reached viastorage_options. The data read gotstorage_options, but the separate embedded-metadata read didn't — so polars fell back to the default AWS endpoint/credentials.Changes
_storage/parquet.py— forward the storage optionspl.read_parquet_metadataaccepts (storage_options,credential_provider,retries) to every metadata read, via a small_metadata_read_options(kwargs)helper. Only present keys are forwarded, so defaults are unchanged.read_parquet_metadata_schema(schema.py) andread_parquet_metadata_collection(collection/collection.py) now accept and forward**kwargs.Note: limited to polars parquet-metadata reads (covers the issue's single-file repro). A
storage_options-only collection read still fails earlier, at fsspec-based member discovery (url_to_fs/glob), which uses a different option vocabulary than polars and would need a per-backend translation I think.