feat(waterdata): add waterdata.xarray module returning CF datasets #297
Draft
thodson-usgs wants to merge 7 commits into
Add dataretrieval.waterdata.xarray, optional-dependency wrappers that mirror the Water Data time-series getters but return CF-conventions xarray.Dataset objects instead of bare DataFrames.

- Ragged (CF contiguous ragged array) layout by default; pass dense=True for the NaN-filled (monitoring_location_id, time) grid with one named variable per parameter.
- CF metadata is derived from columns the getters already return (unit_of_measure -> units, statistic_id -> cell_methods, parameter_code -> standard_name/vertical_datum), plus a cached parameter-name lookup; sites carry cf_role=timeseries_id with lon/lat.
- Coverage: get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, get_samples, and preliminary get_stats_por / get_stats_date_range.
- xarray is an optional extra (pip install dataretrieval[xarray]); the core package never imports it.

Hash-valued ID columns are dropped inside the xarray builders, so the plain getters are left untouched. CF vocabulary maps live in waterdata.types (xarray-free, plain data). Adds a demo notebook + docs entry and offline converter unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the (monitoring_location_id, time) dense grid the default layout (pass dense=False for the contiguous ragged array), add select_series() and clear_metadata_cache(), and fix defects found in review:

- NaN-truthiness: a present-but-null parameter_name/description no longer masks the fallback (was naming a variable "nan" / dropping long_name)
- dense (site, time) collisions dedup deterministically (smallest value) rather than keeping an arbitrary upstream-order row
- partial point geometry keeps a numeric NaN-filled lon/lat coordinate instead of a CF-invalid object array
- list-valued flag columns (qualifier) are flattened to strings so the datasets serialize to netCDF
- omit CF featureType on the preliminary flat stats layout

Rewrite the demo notebook around the dense default and refresh the README. 56 offline tests pass; demo executes clean against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
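The NaN-truthiness defect mentioned in this commit is easy to reproduce in plain Python: float("nan") is truthy, so a `name or fallback` expression keeps the NaN instead of falling through. The sketch below is only an illustration; `none_if_nan` and `variable_name` are hypothetical stand-ins, not the module's actual helpers.

```python
import math


def none_if_nan(value):
    """Return None for None or a float NaN; pass anything else through."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def variable_name(parameter_name, parameter_code):
    """Prefer the human-readable name, falling back to the parameter code."""
    # A bare `parameter_name or parameter_code` would keep NaN, because
    # bool(float("nan")) is True; normalizing first restores the fallback.
    return none_if_nan(parameter_name) or parameter_code


nan = float("nan")
naive = str(nan or "00060")          # the buggy pattern: NaN wins
fixed = variable_name(nan, "00060")  # the normalized pattern: fallback wins
```

Normalizing once at the point where metadata is ingested, rather than at every use, keeps the fallback logic in one place.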
Address the second code-review pass:

- dense (site, time) collision dedup is now fully deterministic: stable sort on value then the flag columns, so the retained ancillary is order-independent
- _scalarize handles numpy arrays and nested/array elements without raising (was list/tuple-only and called pd.notna on possibly-non-scalar elements)
- _none_if_nan is array-safe via an is_scalar guard (new _is_missing helper)
- normalize NaN name descriptors once in _MetadataCache._ingest
- only map _scalarize over flag columns that are actually sequence-valued
- fix the stale examples/index.rst (dense is the default; dense=False ragged)

58 offline tests pass; live get_daily/get_samples + to_netcdf verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
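The order-independent dedup described in this commit can be sketched without pandas: sort on the full tie-break key (value, then the flag columns) and keep the first row per (site, time) cell. This is an assumption-laden illustration, not the branch's code; the function name, the row dictionaries, and the site ID are all hypothetical.

```python
def dedup_collisions(rows):
    """Keep one row per (site, time), chosen independently of input order."""
    # Sort on the full tie-break key: value first, then the flag columns,
    # so equal values are still resolved the same way on every run.
    ordered = sorted(
        rows,
        key=lambda r: (
            r["site"], r["time"], r["value"],
            r["qualifier"], r["approval_status"],
        ),
    )
    kept = {}
    for row in ordered:
        # setdefault keeps the first row seen per cell, i.e. the smallest.
        kept.setdefault((row["site"], row["time"]), row)
    return list(kept.values())


rows = [
    {"site": "USGS-01", "time": "2024-01-01", "value": 5.0,
     "qualifier": "ICE", "approval_status": "Provisional"},
    {"site": "USGS-01", "time": "2024-01-01", "value": 3.0,
     "qualifier": "", "approval_status": "Approved"},
    {"site": "USGS-01", "time": "2024-01-02", "value": 4.0,
     "qualifier": "", "approval_status": "Approved"},
]
forward = dedup_collisions(rows)
backward = dedup_collisions(rows[::-1])
```

Because the sort key covers every retained column, reversing or shuffling the upstream rows cannot change which row survives.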
- samples now surface station longitude/latitude (mapped from Location_Longitude/Location_Latitude; _point_coords reads explicit lon/lat columns in addition to an OGC geometry column)
- metadata cache: a single large pull is no longer subject to within-batch FIFO eviction (the call's result is built from the freshly-parsed entries), and sites with no metadata are no longer negatively cached, so they retry
- dense variable naming is deterministic and unambiguous: a bare name (e.g. discharge) is used only when unique; same-named series are all disambiguated by cell method / statistic / parameter code
- dense multi-unit label is deterministic (sorted) instead of row-order dependent
- row_size is int64 (was int32) to avoid overflow / cumsum truncation
- select_series rejects descriptor coords as keys (lon/lat float-equality footgun) and can match a null instance key

64 offline tests pass; live samples lon/lat + dense naming verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
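The dense naming rule in this commit (a bare name only when it is unique, otherwise every same-named series gets a suffix) can be illustrated with a small sketch. The function name and the `base_suffix` naming format are assumptions for illustration; the branch's actual suffixes come from cell method, statistic, or parameter code.

```python
from collections import Counter


def name_variables(series):
    """Map (base_name, suffix) pairs to unambiguous variable names."""
    counts = Counter(base for base, _ in series)
    names = []
    for base, suffix in series:
        # A bare name is used only when no other series shares it;
        # otherwise all same-named series are suffixed, not just the extras.
        names.append(base if counts[base] == 1 else f"{base}_{suffix}")
    return names


names = name_variables(
    [("discharge", "mean"), ("discharge", "maximum"), ("gage_height", "mean")]
)
```

Suffixing every member of a colliding group, rather than leaving the first one bare, means a variable's name does not depend on the order in which series arrive.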
Convert a ragged (dense=False) Dataset to a one-record-per-series
awkward.Array: row_size is awkward's offsets and obs is its flat content, so
it is a near-zero-copy re-view -- each series carries its scalar identity
metadata plus jagged time/value/flag fields, no NaN fill, with per-series ops
vectorized across the whole collection (e.g. ak.mean(arr.value, axis=1)).
awkward is NOT a dependency: to_awkward lazy-imports it and raises an
informative ModuleNotFoundError ("pip install awkward") when absent. Object
columns route through ak.from_iter (NaN -> missing) so flags become a clean
option[string]; numeric/datetime content stays numpy.
Adds offline tests (importorskip awkward; the missing-dep error is tested by
simulating the absent import) and a demo note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
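The lazy-import pattern described in this commit message can be sketched as follows. This is a generic illustration rather than the branch's to_awkward; `lazy_import` is a hypothetical helper, and the missing dependency is simulated with a module name that is assumed not to exist.

```python
import importlib


def lazy_import(module_name, install_hint):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        # Re-raise with the install command so the caller knows the fix.
        raise ModuleNotFoundError(
            f"{module_name} is required for this feature; install it with "
            f"'{install_hint}'"
        ) from err


# Simulate the absent dependency with a name assumed not to be installed.
try:
    lazy_import("surely_not_an_installed_module", "pip install awkward")
    message = None
except ModuleNotFoundError as exc:
    message = str(exc)
```

Keeping the import inside the function means merely importing the package never touches the optional dependency; only callers of the conversion pay for it.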
…s` field
Group each series' per-observation fields (time, value, qualifier,
approval_status) into a single jagged `obs` list of records, instead of
several parallel top-level jagged fields. Each series now reads as
"identity metadata (scalar fields) + obs (a list of {time, value, ...}
observations)" -- `arr.obs.value`, `ak.mean(arr.obs.value, axis=1)` -- which
is clearer and the conventional awkward idiom.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
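A pure-Python analogue may help picture the "identity metadata + obs list of records" shape described in this commit. It is not awkward code: `group_obs` is a hypothetical helper that slices flat per-observation columns by row_size and zips them into records, and the final list comprehension plays the role of ak.mean(arr.obs.value, axis=1).

```python
def group_obs(row_size, time, value, qualifier):
    """Slice flat per-observation columns into one list of records per series."""
    series = []
    start = 0
    for size in row_size:
        stop = start + size
        series.append(
            [
                {"time": t, "value": v, "qualifier": q}
                for t, v, q in zip(
                    time[start:stop], value[start:stop], qualifier[start:stop]
                )
            ]
        )
        start = stop
    return series


obs = group_obs(
    row_size=[2, 1],
    time=["2024-01-01", "2024-01-02", "2024-01-01"],
    value=[1.0, 3.0, 10.0],
    qualifier=[None, "ICE", None],
)
# Per-series mean over the jagged records (no NaN fill involved).
means = [sum(rec["value"] for rec in s) / len(s) for s in obs]
```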
…simplify)

- _point_coords: merge the explicit-longitude/latitude path and the geometry path into one dedup/loop/return scaffold with a swappable per-row extractor, and route the lon/lat float coercion through _lonlat instead of a second open-coded try/except (kills the duplicated branch).
- _DenseBuilder._disambiguate: replace the `name == base` proxy (a confusing stand-in for "no suffix") with an explicit "suffix didn't separate them" condition, and use collections.Counter for the base-name counts.

No behavior change; 67 tests pass, both spatial-coordinate paths verified live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data time-series getters but returns CF-conventions xarray.Dataset objects with series metadata populated, instead of bare DataFrames.
Coverage
get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, get_samples, and (preliminary) get_stats_por / get_stats_date_range.
Layout
The default is a CF contiguous ragged array (featureType = "timeSeries"): every observation is concatenated along a single obs dimension, with one (monitoring location, parameter, statistic) series per timeseries instance and row_size linking them. Only real observations are stored (no NaN fill), so it scales to large, very ragged multi-site pulls. Pass dense=True for the alternative (monitoring_location_id, time) grid, with one named NaN-filled variable per parameter; it is ergonomic for a few overlapping series but memory-costly for ragged collections.
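The trade-off between the two layouts can be seen in a small pure-Python sketch. It is an illustration only: `to_ragged` and `to_dense` are hypothetical helpers, the series key is simplified to a single string, and plain lists stand in for the obs dimension and the grid.

```python
import math


def to_ragged(rows):
    """Group (series_key, time, value) rows into a contiguous ragged layout."""
    by_series = {}
    for key, time, value in rows:
        by_series.setdefault(key, []).append((time, value))
    keys = sorted(by_series)
    # row_size[i] is the number of observations belonging to series i.
    row_size = [len(by_series[k]) for k in keys]
    # flat holds every observation, series after series, with no padding.
    flat = [pair for k in keys for pair in sorted(by_series[k])]
    return keys, row_size, flat


def to_dense(rows):
    """Build a NaN-filled (series, time) grid from the same rows."""
    keys = sorted({k for k, _, _ in rows})
    times = sorted({t for _, t, _ in rows})
    lookup = {(k, t): v for k, t, v in rows}
    grid = [[lookup.get((k, t), math.nan) for t in times] for k in keys]
    return keys, times, grid


rows = [
    ("site-A", "2024-01-01", 1.0),
    ("site-A", "2024-01-02", 2.0),
    ("site-A", "2024-01-03", 3.0),
    ("site-B", "2024-01-02", 9.0),
]
keys, row_size, flat = to_ragged(rows)
_, times, grid = to_dense(rows)
stored_ragged = len(flat)                  # real observations only
stored_dense = sum(len(r) for r in grid)   # includes NaN padding
```

Here the ragged layout stores 4 values and the dense grid stores 6; the gap grows with the number of sites and with how little their time ranges overlap.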
How it works
- CF metadata is derived from columns the getters already return: unit_of_measure → units (UDUNITS where mapped), statistic_id → cell_methods, parameter_code → standard_name / vertical_datum / usgs_parameter_code. Only the human-readable parameter name comes from a small, cached parameter_code-keyed metadata lookup.
- Sites carry cf_role=timeseries_id (the synthesized timeseries_id coordinate when ragged, monitoring_location_id when dense), with longitude/latitude per site from point geometry, qualifier/approval_status as ancillary variables, and hydrologic_unit_code/state_name when the metadata call already provides them.
- xarray is an optional dependency (pip install dataretrieval[xarray]); it is not imported by dataretrieval.waterdata, so the core package stays xarray-free.
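The column-to-attribute mapping described above can be sketched with plain dictionaries. The entries below are illustrative assumptions, not a copy of the vocabulary in waterdata.types, and `cf_attrs` is a hypothetical helper; passing unmapped units through unchanged is also an assumption about what "UDUNITS where mapped" implies.

```python
# Illustrative vocabulary entries only (assumed, not taken from the branch).
UNITS = {"ft^3/s": "ft3 s-1"}
CELL_METHODS = {
    "00001": "time: maximum",
    "00002": "time: minimum",
    "00003": "time: mean",
}
STANDARD_NAMES = {"00060": "water_volume_transport_in_river_channel"}


def cf_attrs(unit_of_measure, statistic_id, parameter_code):
    """Derive CF-style variable attributes from columns already returned."""
    attrs = {"usgs_parameter_code": parameter_code}
    # Assumption: units without a mapping are passed through as-is.
    attrs["units"] = UNITS.get(unit_of_measure, unit_of_measure)
    # Only attach attributes that have a known mapping; never guess one.
    if statistic_id in CELL_METHODS:
        attrs["cell_methods"] = CELL_METHODS[statistic_id]
    if parameter_code in STANDARD_NAMES:
        attrs["standard_name"] = STANDARD_NAMES[parameter_code]
    return attrs


daily_mean_discharge = cf_attrs("ft^3/s", "00003", "00060")
unmapped = cf_attrs("mg/l", "99999", "00000")
```

Keeping the maps as plain data is what lets them live in an xarray-free module: nothing here needs the optional dependency.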
Design note: the plain getters are unchanged
An earlier iteration of this branch made the get_* getters drop hash/UUID columns by default. That was reverted: the hash-dropping now lives entirely inside the xarray builders, which surface only the columns they convert, so opaque per-record UUIDs and per-series join keys never reach the Dataset. The DataFrame-returning getters and their public API are untouched. The wrappers accept (and ignore) an include_hash argument for call-compatibility; it does not apply to the xarray path.
Status
Draft. Known gaps to harden before merge:
percentile / day-of-year structure);
properties= subsets (both currently guarded with a warning / empty-Dataset fallback).
NaT-time rows are dropped with a warning; a failed (supplementary) metadata lookup degrades to a dataset without parameter names rather than discarding the data; the per-process metadata cache is bounded (FIFO) with a public clear_metadata_cache() opt-out; and the doc extra installs xarray + netCDF4 so the demo notebook renders in the docs build.
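A bounded FIFO cache with no negative caching, as described for the metadata lookup, might look like the sketch below. `FifoCache` and its methods are hypothetical names for illustration; the branch's _MetadataCache may be structured differently.

```python
from collections import OrderedDict


class FifoCache:
    """Bounded cache that evicts the oldest insertion first."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        if value is None:
            # No negative caching: a missing entry is simply retried later.
            return
        if key not in self._data and len(self._data) >= self.maxsize:
            self._data.popitem(last=False)  # drop the oldest insertion
        self._data[key] = value

    def clear(self):
        """Public opt-out, analogous in spirit to clear_metadata_cache()."""
        self._data.clear()

    def __len__(self):
        return len(self._data)


cache = FifoCache(maxsize=2)
cache.put("00060", "Discharge")
cache.put("00065", "Gage height")
cache.put("00010", "Temperature, water")  # evicts "00060"
cache.put("99999", None)                  # not stored, so it will be retried
```

Unlike an LRU cache, reads do not refresh an entry's position, which keeps eviction order a function of insertion order alone.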