feat(waterdata): add waterdata.xarray module returning CF datasets by thodson-usgs · Pull Request #297 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-05-30T13:15:21Z

Supersedes #281 (closed when its branch worktree-waterdata-drop-hash-ids was renamed to xarray-extension).

Summary

Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data
time-series getters but returns CF-conventions xarray.Datasets with series
metadata populated, instead of bare DataFrames.

from dataretrieval.waterdata import xarray as wdx
# dense=True shows the readable gridded view; the default is a ragged array
ds = wdx.get_daily(monitoring_location_id="USGS-05407000", parameter_code="00060",
                   time="2024-06-01/2024-06-05", dense=True)

discharge (monitoring_location_id, time)
    long_name:     Discharge, cubic feet per second
    units:         ft3 s-1
    cell_methods:  time: mean
    standard_name: water_volume_transport_in_river_channel
coords: monitoring_location_id (cf_role=timeseries_id), time, longitude, latitude
attrs: Conventions=CF-1.11, institution, source, references(URL)

Coverage

get_daily, get_continuous, get_latest_continuous, get_latest_daily,
get_nearest_continuous, get_peaks, get_field_measurements, get_samples,
and (preliminary) get_stats_por / get_stats_date_range.

Layout

The default is a CF contiguous ragged array (featureType = "timeSeries"):
every observation is concatenated along a single obs dimension, one
(monitoring location, parameter, statistic) series per timeseries instance,
with row_size linking them. Only real observations are stored (no NaN fill),
so it scales to large, very ragged multi-site pulls. Pass dense=True for
the alternative (monitoring_location_id, time) grid — one named variable per
parameter, NaN-filled — ergonomic for a few overlapping series but memory-costly
for ragged collections.
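
As a rough illustration of the ragged layout described above, the sketch below shows how a row_size vector partitions a flat obs array into per-series slices. It uses plain Python lists rather than xarray, and the values and the split_ragged helper are made up for the example; they are not part of the module.

```python
from itertools import accumulate

# Hypothetical flat observation values for three series, concatenated along "obs".
obs_values = [10.0, 11.0, 12.0, 5.0, 7.0, 7.5, 8.0, 8.5, 9.0]
# row_size[i] is the number of observations that belong to series i.
row_size = [3, 1, 5]


def split_ragged(values, sizes):
    """Slice a flat obs sequence into one list per series using row_size."""
    if sum(sizes) != len(values):
        raise ValueError("row_size must sum to the length of the obs dimension")
    # Cumulative offsets: series i spans values[offsets[i]:offsets[i + 1]].
    offsets = [0, *accumulate(sizes)]
    return [values[start:stop] for start, stop in zip(offsets, offsets[1:])]


series = split_ragged(obs_values, row_size)
```

Because only the offsets are derived, no padding is needed: a series with one observation costs one stored value, where a dense grid would cost a full row.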

How it works

CF attributes are derived from columns the getter already returns:
unit_of_measure → units (UDUNITS where mapped), statistic_id →
cell_methods, parameter_code → standard_name / vertical_datum /
usgs_parameter_code. Only the human-readable parameter name comes from a
small, cached parameter_code-keyed metadata lookup.
The timeseries identity carries cf_role=timeseries_id (the synthesized
timeseries_id coordinate when ragged, monitoring_location_id when dense),
with longitude / latitude per site from point geometry, qualifier /
approval_status as ancillary variables, and hydrologic_unit_code /
state_name when the metadata call already provides them.
xarray is an optional dependency (pip install dataretrieval[xarray]);
it is not imported by dataretrieval.waterdata, so the core package stays
xarray-free.
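
The column-to-attribute derivation can be pictured as a few dictionary lookups. The sketch below is a minimal illustration under stated assumptions: the cf_attrs function, the table names, and the table entries (including the "ft^3/s" unit spelling) are hypothetical and are not the module's real vocabulary maps.

```python
# Illustrative entries only; these are NOT the module's real lookup tables.
UNITS = {"ft^3/s": "ft3 s-1"}  # assumed API spelling -> UDUNITS-style string
CELL_METHODS = {
    "00001": "time: maximum",
    "00002": "time: minimum",
    "00003": "time: mean",
}
STANDARD_NAMES = {"00060": "water_volume_transport_in_river_channel"}


def cf_attrs(record):
    """Build CF-style variable attributes from one series' descriptor columns."""
    attrs = {"usgs_parameter_code": record["parameter_code"]}
    unit = record.get("unit_of_measure")
    if unit:
        # Fall back to the raw unit string when no mapping is known.
        attrs["units"] = UNITS.get(unit, unit)
    cell_method = CELL_METHODS.get(record.get("statistic_id"))
    if cell_method:
        attrs["cell_methods"] = cell_method
    standard_name = STANDARD_NAMES.get(record["parameter_code"])
    if standard_name:
        attrs["standard_name"] = standard_name
    return attrs


attrs = cf_attrs(
    {"parameter_code": "00060", "statistic_id": "00003", "unit_of_measure": "ft^3/s"}
)
```

Attributes with no known mapping are simply omitted (or, for units, passed through unchanged), which mirrors the "where mapped" wording above.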

Design note: the plain getters are unchanged

An earlier iteration of this branch made the get_* getters drop hash/UUID
columns by default. That was reverted: the hash-dropping now lives entirely
inside the xarray builders, which surface only the columns they convert, so
opaque per-record UUIDs and per-series join keys never reach the Dataset. The
DataFrame-returning getters and their public API are untouched. The wrappers
accept (and ignore) an include_hash argument for call-compatibility; it does
not apply to the xarray path.

Status

Draft. Known gaps to harden before merge:

- the statistics conversion is a preliminary flat layout (not yet a
  percentile / day-of-year structure);
- broader coverage for mixed-unit groups and properties= subsets (both
  currently guarded with a warning / empty-Dataset fallback).

NaT-time rows are dropped with a warning; a failed (supplementary) metadata
lookup degrades to a dataset without parameter names rather than discarding the
data; the per-process metadata cache is bounded (FIFO) with a public
clear_metadata_cache() opt-out; and the doc extra installs xarray +
netCDF4 so the demo notebook renders in the docs build.
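
For readers unfamiliar with the term, a bounded FIFO cache evicts the oldest inserted entry once it is full, regardless of how recently an entry was read or updated. The sketch below is a generic illustration of that policy; the FIFOCache class and its method names are hypothetical and do not reflect the module's internal cache.

```python
from collections import OrderedDict


class FIFOCache:
    """A size-bounded mapping that evicts the oldest inserted key first."""

    def __init__(self, maxsize=128):
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key, default=None):
        # Reads do not change eviction order (this is FIFO, not LRU).
        return self._data.get(key, default)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.maxsize:
            self._data.popitem(last=False)  # drop the oldest inserted entry
        # Updating an existing key keeps its original position.
        self._data[key] = value

    def clear(self):
        self._data.clear()

    def __len__(self):
        return len(self._data)


cache = FIFOCache(maxsize=2)
cache.put("00060", "Discharge")
cache.put("00065", "Gage height")
cache.put("00010", "Temperature")  # evicts "00060", the oldest entry
```

A clear() method of this kind is what lets a long-running process drop cached metadata on demand.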

Add dataretrieval.waterdata.xarray, optional-dependency wrappers that mirror the Water Data time-series getters but return CF-conventions xarray.Dataset objects instead of bare DataFrames.

- Ragged (CF contiguous ragged array) layout by default; pass dense=True for the NaN-filled (monitoring_location_id, time) grid with one named variable per parameter.
- CF metadata is derived from columns the getters already return (unit_of_measure -> units, statistic_id -> cell_methods, parameter_code -> standard_name/vertical_datum), plus a cached parameter-name lookup; sites carry cf_role=timeseries_id with lon/lat.
- Coverage: get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, get_samples, and preliminary get_stats_por / get_stats_date_range.
- xarray is an optional extra (pip install dataretrieval[xarray]); the core package never imports it.

Hash-valued ID columns are dropped inside the xarray builders, so the plain getters are left untouched. CF vocabulary maps live in waterdata.types (xarray-free, plain data). Adds a demo notebook + docs entry and offline converter unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the (monitoring_location_id, time) dense grid the default layout (pass dense=False for the contiguous ragged array), add select_series() and clear_metadata_cache(), and fix defects found in review:

- NaN-truthiness: a present-but-null parameter_name/description no longer masks the fallback (was naming a variable "nan" / dropping long_name)
- dense (site, time) collisions dedup deterministically (smallest value) rather than keeping an arbitrary upstream-order row
- partial point geometry keeps a numeric NaN-filled lon/lat coordinate instead of a CF-invalid object array
- list-valued flag columns (qualifier) are flattened to strings so the datasets serialize to netCDF
- omit CF featureType on the preliminary flat stats layout

Rewrite the demo notebook around the dense default and refresh the README. 56 offline tests pass; demo executes clean against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
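
The NaN-truthiness defect mentioned in this commit comes from a general Python behavior: float NaN is truthy, so an expression like `name or fallback` keeps the NaN. The sketch below reproduces that pitfall and one way around it; none_if_nan and pick_name are hypothetical helpers written for this example, not the module's own functions.

```python
import math


def none_if_nan(value):
    """Return None for None or a float NaN; otherwise return value unchanged."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def pick_name(parameter_name, fallback):
    """Prefer parameter_name, but treat a null (None/NaN) name as missing."""
    return none_if_nan(parameter_name) or fallback


nan = float("nan")
buggy = nan or "discharge"           # NaN is truthy, so the fallback is skipped
fixed = pick_name(nan, "discharge")  # NaN is normalized to None first
```

Note that `or` also treats an empty string as missing, which is usually the desired behavior for a display name.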

Address the second code-review pass:

- dense (site, time) collision dedup is now fully deterministic: stable sort on value then the flag columns, so the retained ancillary is order-independent
- _scalarize handles numpy arrays and nested/array elements without raising (was list/tuple-only and called pd.notna on possibly-non-scalar elements)
- _none_if_nan is array-safe via an is_scalar guard (new _is_missing helper)
- normalize NaN name descriptors once in _MetadataCache._ingest
- only map _scalarize over flag columns that are actually sequence-valued
- fix the stale examples/index.rst (dense is the default; dense=False ragged)

58 offline tests pass; live get_daily/get_samples + to_netcdf verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- samples now surface station longitude/latitude (mapped from Location_Longitude/Location_Latitude; _point_coords reads explicit lon/lat columns in addition to an OGC geometry column)
- metadata cache: a single large pull is no longer subject to within-batch FIFO eviction (the call's result is built from the freshly-parsed entries), and sites with no metadata are no longer negatively cached, so they retry
- dense variable naming is deterministic and unambiguous: a bare name (e.g. discharge) is used only when unique; same-named series are all disambiguated by cell method / statistic / parameter code
- dense multi-unit label is deterministic (sorted) instead of row-order dependent
- row_size is int64 (was int32) to avoid overflow / cumsum truncation
- select_series rejects descriptor coords as keys (lon/lat float-equality footgun) and can match a null instance key

64 offline tests pass; live samples lon/lat + dense naming verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Convert a ragged (dense=False) Dataset to a one-record-per-series awkward.Array: row_size is awkward's offsets and obs is its flat content, so it is a near-zero-copy re-view -- each series carries its scalar identity metadata plus jagged time/value/flag fields, no NaN fill, with per-series ops vectorized across the whole collection (e.g. ak.mean(arr.value, axis=1)).

awkward is NOT a dependency: to_awkward lazy-imports it and raises an informative ModuleNotFoundError ("pip install awkward") when absent. Object columns route through ak.from_iter (NaN -> missing) so flags become a clean option[string]; numeric/datetime content stays numpy.

Adds offline tests (importorskip awkward; the missing-dep error is tested by simulating the absent import) and a demo note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
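
The lazy-import pattern described in this commit can be sketched generically as follows. The require_optional helper and its message wording are assumptions made for illustration; the actual to_awkward implementation may differ.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import an optional dependency on first use, with a helpful error."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name != module_name:
            # The module exists but one of its own imports is missing;
            # re-raise unchanged so the real cause stays visible.
            raise
        raise ModuleNotFoundError(
            f"{module_name} is required for this feature; "
            f"install it with `{install_hint}`",
            name=module_name,
        ) from exc


# Usage with a standard-library module, which is always importable.
json = require_optional("json", "pip install json")
```

Deferring the import to call time keeps the optional package out of the import path of everything else, so users who never call the converter never need it installed.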

…s` field

Group each series' per-observation fields (time, value, qualifier, approval_status) into a single jagged `obs` list of records, instead of several parallel top-level jagged fields. Each series now reads as "identity metadata (scalar fields) + obs (a list of {time, value, ...} observations)" -- `arr.obs.value`, `ak.mean(arr.obs.value, axis=1)` -- which is clearer and the conventional awkward idiom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…simplify)

- _point_coords: merge the explicit-longitude/latitude path and the geometry path into one dedup/loop/return scaffold with a swappable per-row extractor, and route the lon/lat float coercion through _lonlat instead of a second open-coded try/except (kills the duplicated branch).
- _DenseBuilder._disambiguate: replace the `name == base` proxy (a confusing stand-in for "no suffix") with an explicit "suffix didn't separate them" condition, and use collections.Counter for the base-name counts.

No behavior change; 67 tests pass, both spatial-coordinate paths verified live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
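
The Counter-based naming rule (a bare name only when it is unique, a suffix otherwise) might look roughly like the sketch below. The disambiguate function, the (base, suffix) input shape, and the underscore separator are all hypothetical; the real _DenseBuilder._disambiguate also handles the case where the suffix fails to separate two series, which this sketch omits.

```python
from collections import Counter


def disambiguate(series):
    """Map (base_name, suffix) pairs to variable names.

    A base name is used bare only when exactly one series has it;
    every series sharing a base name gets its suffix appended.
    """
    counts = Counter(base for base, _ in series)
    return [
        base if counts[base] == 1 else f"{base}_{suffix}"
        for base, suffix in series
    ]


names = disambiguate(
    [("discharge", "mean"), ("discharge", "maximum"), ("gage_height", "mean")]
)
```

Suffixing all colliding series, rather than all but the first, keeps the result independent of input order.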

thodson-usgs mentioned this pull request May 30, 2026

feat(waterdata): add waterdata.xarray module returning CF datasets #281

Closed

thodson-usgs and others added 6 commits June 5, 2026 08:34

thodson-usgs wants to merge 7 commits into DOI-USGS:main from thodson-usgs:xarray-extension
