Skip to content

fix: Forward storage_options to parquet metadata reads#353

Open
Mattijs De Paepe (mattijsdp) wants to merge 10 commits into
Quantco:mainfrom
mattijsdp:fix/parquet-metadata-storage-options
Open

fix: Forward storage_options to parquet metadata reads#353
Mattijs De Paepe (mattijsdp) wants to merge 10 commits into
Quantco:mainfrom
mattijsdp:fix/parquet-metadata-storage-options

Conversation

@mattijsdp

@mattijsdp Mattijs De Paepe (mattijsdp) commented Jun 2, 2026

Copy link
Copy Markdown

Motivation

Fixes #352.

Schema.read_parquet/scan_parquet (plus collection and failure-info reads) failed against S3-compatible stores reached via storage_options. The data read got storage_options, but the separate embedded-metadata read didn't — so polars fell back to the default AWS endpoint/credentials.

Changes

  • _storage/parquet.py — forward the storage options pl.read_parquet_metadata accepts (storage_options, credential_provider, retries) to every metadata read, via a small _metadata_read_options(kwargs) helper. Only present keys are forwarded, so defaults are unchanged.
  • read_parquet_metadata_schema (schema.py) and read_parquet_metadata_collection (collection/collection.py) now accept and forward **kwargs.

Note: limited to polars parquet-metadata reads (covers the issue's single-file repro). A storage_options-only collection read still fails earlier, at fsspec-based member discovery (url_to_fs/glob), which uses a different option vocabulary than polars and would need a per-backend translation I think.

Schema/collection/failure-info reads passed `storage_options` (and
`credential_provider`) to the data read via `pl.read_parquet` /
`pl.scan_parquet`, but the separate embedded-schema metadata read called
`pl.read_parquet_metadata` with no options. Against non-AWS S3-compatible
stores reached purely through `storage_options` (lakeFS, MinIO, R2, Tigris,
…) the metadata read fell back to the default AWS credential chain and
endpoint, breaking typed reads.

Thread the storage-related options into all metadata reads in
`_storage/parquet.py` via a small `_metadata_read_options` helper.

Fixes Quantco#352

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`read_parquet_metadata_schema` and `read_parquet_metadata_collection` read
parquet metadata from a (possibly remote) source but accepted no options, so
they could not reach non-AWS S3-compatible stores either. Accept and forward
`**kwargs` (e.g. `storage_options`, `credential_provider`) to
`pl.read_parquet_metadata`, matching `read_parquet`/`scan_parquet`.

Add s3-marked regression tests covering both helpers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…retries

Address review feedback on the storage-options forwarding fix:

- Match the file's docstring convention for the public metadata helpers: drop
  the enumerated `storage_options`/`credential_provider` note and use the same
  terse "passed directly to :meth:`polars.read_parquet_metadata`" wording as
  `read_parquet`/`scan_parquet`.
- Forward `retries` alongside `storage_options`/`credential_provider` in
  `_metadata_read_options`, since `read_parquet_metadata` accepts it and it is
  storage-reaching. Clarify in the docstring why the call sites must filter the
  scan/read kwargs (the narrower `read_parquet_metadata` signature rejects
  options like `n_rows`/`columns`) instead of forwarding everything.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Shrink `_metadata_read_options` to a one-line comment matching the other
  private helpers in the module (no oversized docstring).
- Extract an `s3_storage_options` fixture (mirrors `s3_tmp_path`, but strips the
  AWS_* env vars so the store is reachable *only* via `storage_options`) and use
  it across the schema, collection and failure-info regression tests.
- Add a failure-info regression test covering the `scan_failure_info` metadata
  read, and split the schema test so the typed read and the standalone
  `read_parquet_metadata_schema` helper are asserted independently.
- Drop the end-to-end collection typed-read test: it cannot pass via
  `storage_options` alone because member discovery goes through fsspec
  (`url_to_fs`/`fs.exists`), which does not receive `storage_options` -- a
  separate limitation from the polars metadata read this PR fixes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation should stand on its own, so remove the issue-tracker links from
the `s3_storage_options` fixture and the storage-options regression tests, and
shrink their docstrings to a single line in line with the surrounding tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR ensures Parquet metadata reads (schema/collection/failure info) correctly forward storage_options so S3-backed files work even when credentials/endpoints are provided only via storage_options.

Changes:

  • Forward metadata read options to polars.read_parquet_metadata across schema/collection/failure-info code paths.
  • Add an s3_storage_options fixture that strips AWS_* env vars to validate forwarding behavior.
  • Add S3-marked regression tests covering schema, collection, and failure-info metadata reads.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/schema/test_read_write_parquet.py Adds S3 regression tests ensuring schema metadata reads forward storage_options.
tests/failure_info/test_storage.py Adds S3 regression test ensuring failure-info metadata reads forward storage_options.
tests/conftest.py Introduces s3_storage_options fixture that requires passing options explicitly.
tests/collection/test_storage.py Adds S3 regression test ensuring collection metadata reads forward storage_options.
dataframely/schema.py Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata.
dataframely/collection/collection.py Allows forwarding kwargs (e.g., storage_options) to pl.read_parquet_metadata.
dataframely/_storage/parquet.py Forwards filtered read options to metadata reads and centralizes filtering logic.

Comment thread tests/failure_info/test_storage.py
Comment thread tests/failure_info/test_storage.py
Comment thread dataframely/_storage/parquet.py Outdated
@mattijsdp Mattijs De Paepe (mattijsdp) marked this pull request as draft June 2, 2026 19:04
@mattijsdp Mattijs De Paepe (mattijsdp) changed the title Fix/parquet metadata storage options fix: Forward storage_options to parquet metadata reads Jun 2, 2026
@github-actions github-actions Bot added the fix label Jun 2, 2026
@mattijsdp Mattijs De Paepe (mattijsdp) marked this pull request as ready for review June 2, 2026 20:25
@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (50e18a8) to head (e33a50a).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #353   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           56        56           
  Lines         3458      3463    +5     
=========================================
+ Hits          3458      3463    +5     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mattijs De Paepe (@mattijsdp), looks great! What do you think about the inspect-based suggestion by copilot? It seems like a good idea to me.

Comment thread dataframely/_storage/parquet.py Outdated
…nature

Introspect pl.read_parquet_metadata's parameters instead of hard-coding the
allowlist, so the forwarded options track the installed polars version rather
than drifting against it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mattijsdp Mattijs De Paepe (mattijsdp) force-pushed the fix/parquet-metadata-storage-options branch from de6ad3b to 48793a5 Compare June 3, 2026 15:40

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks good to me assuming CI goes through :) Oliver Borchert (@borchero) did you also want to review this?

Comment thread dataframely/_storage/parquet.py Outdated
Comment thread tests/conftest.py
Comment on lines +56 to +65
for var in (
"AWS_ENDPOINT_URL",
"AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY",
"AWS_ALLOW_HTTP",
"AWS_S3_ALLOW_UNSAFE_RENAME",
"AWS_DEFAULT_REGION",
"AWS_REGION",
):
monkeypatch.delenv(var, raising=False)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this has the desired effect. Polars likely only reads these variables at startup and does not re-read them for every read. Could we check that?

I don't have a good alternative idea but I think all of those tests are moot if I'm right.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! So yes, the tests were indeed moot as Polars caches an object store per bucket which isn't cleared by this monkeypatch. Other tests use the same bucket so it stayed reachable even without storage_options. This is per bucket though so I fixed this by creating a fresh bucket that is never env configured hence no cache and so requires storage_options forwarding. Ran the tests without the fix which then failed as expected. Let me know what you think.

Polars caches object stores per bucket on the Rust side, where
monkeypatch.delenv cannot clear them. A bucket configured once from
AWS_* env vars stayed reachable without storage_options, so the
regression tests passed even when forwarding was dropped.

Replace s3_storage_options with s3_isolated, which hands out a
freshly-named bucket that is never env-configured. Reaching it requires
forwarding storage_options to every read, so the tests now fail without
the fix. Also drop a stale comment in parquet.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! 🚀 just one more comment :)

Comment thread tests/conftest.py Outdated
Comment on lines +71 to +73
boto3.client(
"s3", endpoint_url=s3_server, aws_access_key_id="", aws_secret_access_key=""
).create_bucket(Bucket=bucket)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will keep the bucket around indefinitely which is not ideal for local development & debugging. Could you yield from the fixture and delete the bucket after the yield statement as "teardown" operation?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done.

Note: I didn't use a paginator and technically list_objects_v2 caps at 1000 objects but I think this is fine for this use case?

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet metadata read ignores storage_options, breaking typed reads from non-AWS S3 stores

4 participants