
feat(waterdata): add waterdata.xarray module returning CF datasets #297

Draft
thodson-usgs wants to merge 7 commits into DOI-USGS:main from thodson-usgs:xarray-extension

@thodson-usgs
Collaborator

Supersedes #281 (closed when its branch worktree-waterdata-drop-hash-ids was renamed to xarray-extension).

Summary

Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data
time-series getters but returns CF-conventions xarray.Datasets with series
metadata populated, instead of bare DataFrames.

from dataretrieval.waterdata import xarray as wdx
# dense=True shows the readable gridded view; the default is a ragged array
ds = wdx.get_daily(monitoring_location_id="USGS-05407000", parameter_code="00060",
                   time="2024-06-01/2024-06-05", dense=True)
discharge (monitoring_location_id, time)
    long_name:     Discharge, cubic feet per second
    units:         ft3 s-1
    cell_methods:  time: mean
    standard_name: water_volume_transport_in_river_channel
coords: monitoring_location_id (cf_role=timeseries_id), time, longitude, latitude
attrs: Conventions=CF-1.11, institution, source, references(URL)

Coverage

get_daily, get_continuous, get_latest_continuous, get_latest_daily,
get_nearest_continuous, get_peaks, get_field_measurements, get_samples,
and (preliminary) get_stats_por / get_stats_date_range.

Layout

The default is a CF contiguous ragged array (featureType = "timeSeries"):
every observation is concatenated along a single obs dimension, one
(monitoring location, parameter, statistic) series per timeseries instance,
with row_size linking them. Only real observations are stored (no NaN fill),
so it scales to large, very ragged multi-site pulls. Pass dense=True for
the alternative (monitoring_location_id, time) grid — one named variable per
parameter, NaN-filled — ergonomic for a few overlapping series but memory-costly
for ragged collections.
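
To make the row_size linkage concrete, here is a minimal plain-Python sketch of how a contiguous ragged array is sliced back into per-series runs. The values are invented for illustration and the variable names are not taken from the module.

```python
from itertools import accumulate

# Hypothetical ragged collection: three series with 3, 1, and 2 observations.
row_size = [3, 1, 2]
obs_value = [10.0, 11.0, 12.0, 5.5, 7.0, 9.0]  # flat along the obs dimension

# Offsets are the running total of row_size with a leading zero.
offsets = [0, *accumulate(row_size)]

# Slice the flat array back into one list per series.
series = [obs_value[start:stop] for start, stop in zip(offsets, offsets[1:])]
```

Here offsets is [0, 3, 4, 6] and series holds three lists of unequal length. No padding is stored, which is why the ragged layout stays compact when record lengths differ widely.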

How it works

  • CF attributes are derived from columns the getter already returns:
    unit_of_measure -> units (UDUNITS where mapped), statistic_id ->
    cell_methods, parameter_code -> standard_name / vertical_datum /
    usgs_parameter_code. Only the human-readable parameter name comes from a
    small, cached parameter_code-keyed metadata lookup.
  • the timeseries identity carries cf_role=timeseries_id (the synthesized
    timeseries_id coordinate when ragged, monitoring_location_id when dense),
    with longitude / latitude per site from point geometry, qualifier /
    approval_status as ancillary variables, and hydrologic_unit_code /
    state_name when the metadata call already provides them.
  • xarray is an optional dependency (pip install dataretrieval[xarray]);
    it is not imported by dataretrieval.waterdata, so the core package stays
    xarray-free.
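
A rough sketch of the column-to-attribute derivation described above. The lookup tables, the cf_attrs helper, and the "ft^3/s" input string are assumptions made for illustration; the actual maps live in waterdata.types and may be shaped differently. Falling back to the raw unit string when no UDUNITS mapping exists is also an assumption.

```python
# Hypothetical lookup tables; the real maps in waterdata.types may differ.
UNITS = {"ft^3/s": "ft3 s-1"}
CELL_METHODS = {"00003": "time: mean"}
STANDARD_NAMES = {"00060": "water_volume_transport_in_river_channel"}


def cf_attrs(row: dict) -> dict:
    """Build CF variable attributes from one row of getter output."""
    attrs = {"usgs_parameter_code": row["parameter_code"]}
    unit = row.get("unit_of_measure")
    if unit is not None:
        # Assumed behavior: keep the raw unit string when no mapping is known.
        attrs["units"] = UNITS.get(unit, unit)
    method = CELL_METHODS.get(row.get("statistic_id"))
    if method is not None:
        attrs["cell_methods"] = method
    name = STANDARD_NAMES.get(row["parameter_code"])
    if name is not None:
        attrs["standard_name"] = name
    return attrs


row = {"parameter_code": "00060", "statistic_id": "00003", "unit_of_measure": "ft^3/s"}
attrs = cf_attrs(row)
```

Unmapped codes simply omit the corresponding attribute rather than guessing, so an unknown parameter still yields a valid (if sparser) variable.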

Design note: the plain getters are unchanged

An earlier iteration of this branch made the get_* getters drop hash/UUID
columns by default. That was reverted: the hash-dropping now lives entirely
inside the xarray builders, which surface only the columns they convert, so
opaque per-record UUIDs and per-series join keys never reach the Dataset. The
DataFrame-returning getters and their public API are untouched. The wrappers
accept (and ignore) an include_hash argument for call-compatibility; it does
not apply to the xarray path.

Status

Draft. Known gaps to harden before merge:

  • the statistics conversion is a preliminary flat layout (not yet a
    percentile / day-of-year structure);
  • broader coverage for mixed-unit groups and properties= subsets (both
    currently guarded with a warning / empty-Dataset fallback).

NaT-time rows are dropped with a warning; a failed (supplementary) metadata
lookup degrades to a dataset without parameter names rather than discarding the
data; the per-process metadata cache is bounded (FIFO) with a public
clear_metadata_cache() opt-out; and the doc extra installs xarray +
netCDF4 so the demo notebook renders in the docs build.
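
For readers unfamiliar with the pattern, a bounded FIFO cache can be sketched as below. FifoCache and its methods are hypothetical names for illustration, not the module's _MetadataCache; only the eviction policy (oldest insertion first, reads do not refresh) is taken from the description above.

```python
from collections import OrderedDict


class FifoCache:
    """Bounded cache that evicts the oldest inserted key first."""

    def __init__(self, maxsize):
        self._maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key):
        # Reads do not reorder entries (FIFO, not LRU).
        return self._data.get(key)

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self._maxsize:
            self._data.popitem(last=False)  # drop the oldest insertion
        self._data[key] = value

    def clear(self):
        self._data.clear()


cache = FifoCache(maxsize=2)
cache.put("00060", {"name": "Discharge"})
cache.put("00065", {"name": "Gage height"})
cache.get("00060")                            # does not protect the entry
cache.put("00010", {"name": "Temperature"})   # evicts "00060"
```

FIFO is simpler than LRU and is usually adequate when entries are small and cheap to re-fetch; the trade-off is that a frequently read entry can still be evicted.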

Add dataretrieval.waterdata.xarray, optional-dependency wrappers that
mirror the Water Data time-series getters but return CF-conventions
xarray.Dataset objects instead of bare DataFrames.

- Ragged (CF contiguous ragged array) layout by default; pass dense=True
  for the NaN-filled (monitoring_location_id, time) grid with one named
  variable per parameter.
- CF metadata is derived from columns the getters already return
  (unit_of_measure -> units, statistic_id -> cell_methods,
  parameter_code -> standard_name/vertical_datum), plus a cached
  parameter-name lookup; sites carry cf_role=timeseries_id with lon/lat.
- Coverage: get_daily, get_continuous, get_latest_continuous,
  get_latest_daily, get_nearest_continuous, get_peaks,
  get_field_measurements, get_samples, and preliminary
  get_stats_por / get_stats_date_range.
- xarray is an optional extra (pip install dataretrieval[xarray]); the
  core package never imports it. Hash-valued ID columns are dropped
  inside the xarray builders, so the plain getters are left untouched.

CF vocabulary maps live in waterdata.types (xarray-free, plain data).
Adds a demo notebook + docs entry and offline converter unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thodson-usgs and others added 6 commits June 5, 2026 08:34
Make the (monitoring_location_id, time) dense grid the default layout
(pass dense=False for the contiguous ragged array), add select_series()
and clear_metadata_cache(), and fix defects found in review:

- NaN-truthiness: a present-but-null parameter_name/description no longer
  masks the fallback (was naming a variable "nan" / dropping long_name)
- dense (site, time) collisions dedup deterministically (smallest value)
  rather than keeping an arbitrary upstream-order row
- partial point geometry keeps a numeric NaN-filled lon/lat coordinate
  instead of a CF-invalid object array
- list-valued flag columns (qualifier) are flattened to strings so the
  datasets serialize to netCDF
- omit CF featureType on the preliminary flat stats layout

Rewrite the demo notebook around the dense default and refresh the README.
56 offline tests pass; demo executes clean against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
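
The NaN-truthiness defect is easy to reproduce outside the module. The sketch below uses a stand-in helper named none_if_nan, written for this illustration and not copied from the branch, to show why an `or` fallback never fires for a present-but-null value.

```python
import math


def none_if_nan(value):
    """Return None for None or a float NaN, else the value unchanged."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return value


parameter_name = float("nan")  # present-but-null, as a DataFrame cell would be

# NaN is truthy, so the fallback is skipped and the name becomes NaN.
naive = parameter_name or "00060"

# Normalizing NaN to None first lets the fallback fire as intended.
fixed = none_if_nan(parameter_name) or "00060"
```

str(naive) is the literal "nan", which matches the symptom named in the commit message (a variable called "nan").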

Address the second code-review pass:
- dense (site, time) collision dedup is now fully deterministic: stable sort
  on value then the flag columns, so the retained ancillary is order-independent
- _scalarize handles numpy arrays and nested/array elements without raising
  (was list/tuple-only and called pd.notna on possibly-non-scalar elements)
- _none_if_nan is array-safe via an is_scalar guard (new _is_missing helper)
- normalize NaN name descriptors once in _MetadataCache._ingest
- only map _scalarize over flag columns that are actually sequence-valued
- fix the stale examples/index.rst (dense is the default; dense=False ragged)

58 offline tests pass; live get_daily/get_samples + to_netcdf verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
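
One way to picture the order-independent dedup is the plain-Python sketch below: sort on the key, then value, then flag, and keep the first row per (site, time). The row dictionaries and the dedup function are hypothetical; the branch presumably does this on DataFrames, and this sketch assumes the flags are already plain strings.

```python
rows = [
    {"site": "A", "time": "2024-06-01", "value": 12.0, "qualifier": "e"},
    {"site": "A", "time": "2024-06-01", "value": 10.0, "qualifier": "p"},
    {"site": "A", "time": "2024-06-01", "value": 10.0, "qualifier": "a"},
    {"site": "A", "time": "2024-06-02", "value": 11.0, "qualifier": "a"},
]


def dedup(rows):
    """Keep one row per (site, time): smallest value, ties broken by flag."""
    ordered = sorted(
        rows, key=lambda r: (r["site"], r["time"], r["value"], r["qualifier"])
    )
    kept = {}
    for r in ordered:
        kept.setdefault((r["site"], r["time"]), r)  # first row per key wins
    return list(kept.values())


result = dedup(rows)
shuffled_result = dedup(list(reversed(rows)))
```

Because every field that distinguishes the colliding rows is part of the sort key, both the retained value and the retained flag are the same regardless of upstream row order.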

- samples now surface station longitude/latitude (mapped from
  Location_Longitude/Location_Latitude; _point_coords reads explicit lon/lat
  columns in addition to an OGC geometry column)
- metadata cache: a single large pull is no longer subject to within-batch
  FIFO eviction (the call's result is built from the freshly-parsed entries),
  and sites with no metadata are no longer negatively cached, so they retry
- dense variable naming is deterministic and unambiguous: a bare name
  (e.g. discharge) is used only when unique; same-named series are all
  disambiguated by cell method / statistic / parameter code
- dense multi-unit label is deterministic (sorted) instead of row-order dependent
- row_size is int64 (was int32) to avoid overflow / cumsum truncation
- select_series rejects descriptor coords as keys (lon/lat float-equality
  footgun) and can match a null instance key

64 offline tests pass; live samples lon/lat + dense naming verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
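
The naming rule (bare name only when unique, otherwise every same-named series gets a suffix) can be illustrated as follows. The descriptor tuples and the underscore suffix format are assumptions; the branch may also fold in statistic or parameter code.

```python
from collections import Counter

# Hypothetical series descriptors: (base name, cell method).
series = [
    ("discharge", "mean"),
    ("discharge", "maximum"),
    ("gage_height", "mean"),
]

# Count how many series share each base name.
counts = Counter(base for base, _ in series)

# A bare name is used only when unique; all same-named series get a suffix.
names = [
    base if counts[base] == 1 else f"{base}_{method}"
    for base, method in series
]
```

Suffixing all colliding series, rather than leaving the first one bare, keeps the result independent of the order in which series arrive.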

Convert a ragged (dense=False) Dataset to a one-record-per-series
awkward.Array: row_size is awkward's offsets and obs is its flat content, so
it is a near-zero-copy re-view -- each series carries its scalar identity
metadata plus jagged time/value/flag fields, no NaN fill, with per-series ops
vectorized across the whole collection (e.g. ak.mean(arr.value, axis=1)).

awkward is NOT a dependency: to_awkward lazy-imports it and raises an
informative ModuleNotFoundError ("pip install awkward") when absent. Object
columns route through ak.from_iter (NaN -> missing) so flags become a clean
option[string]; numeric/datetime content stays numpy.

Adds offline tests (importorskip awkward; the missing-dep error is tested by
simulating the absent import) and a demo note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
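
The lazy-import pattern described in this commit might look roughly like the sketch below. lazy_import is a hypothetical helper, and the error wording is illustrative rather than the message to_awkward actually raises.

```python
import importlib


def lazy_import(module_name, install_hint):
    """Import a module on first use, with an actionable error when absent."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name} is required for this feature; "
            f"install it with '{install_hint}'"
        ) from exc


# A stdlib module stands in for the optional dependency here.
json_module = lazy_import("json", "pip install json")
```

Deferring the import to call time is what allows the package to be installed and imported without the optional dependency present.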

…s` field

Group each series' per-observation fields (time, value, qualifier,
approval_status) into a single jagged `obs` list of records, instead of
several parallel top-level jagged fields. Each series now reads as
"identity metadata (scalar fields) + obs (a list of {time, value, ...}
observations)" -- `arr.obs.value`, `ak.mean(arr.obs.value, axis=1)` -- which
is clearer and the conventional awkward idiom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
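
As a plain-Python analogue of the record shape described here (no awkward required), the sketch below groups flat observation fields into one obs list per series and takes a per-series mean. The site identifiers and values are invented, and the list-of-dicts form only approximates what an awkward.Array would hold.

```python
from itertools import accumulate
from statistics import fmean

row_size = [2, 1]
time = ["2024-06-01", "2024-06-02", "2024-06-01"]
value = [10.0, 12.0, 5.5]
site = ["USGS-A", "USGS-B"]  # hypothetical identifiers

offsets = [0, *accumulate(row_size)]

# One record per series: scalar identity plus a jagged list of observations.
records = [
    {
        "monitoring_location_id": site[i],
        "obs": [{"time": t, "value": v} for t, v in zip(time[a:b], value[a:b])],
    }
    for i, (a, b) in enumerate(zip(offsets, offsets[1:]))
]

# Rough equivalent of ak.mean(arr.obs.value, axis=1).
means = [fmean(o["value"] for o in rec["obs"]) for rec in records]
```

Grouping the per-observation fields under obs keeps time, value, and flags aligned by construction, instead of relying on several parallel jagged fields sharing the same offsets.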

…simplify)

- _point_coords: merge the explicit-longitude/latitude path and the geometry
  path into one dedup/loop/return scaffold with a swappable per-row extractor,
  and route the lon/lat float coercion through _lonlat instead of a second
  open-coded try/except (kills the duplicated branch).
- _DenseBuilder._disambiguate: replace the `name == base` proxy (a confusing
  stand-in for "no suffix") with an explicit "suffix didn't separate them"
  condition, and use collections.Counter for the base-name counts.

No behavior change; 67 tests pass, both spatial-coordinate paths verified live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
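
The "one scaffold, swappable per-row extractor" idea can be sketched as below. All function names, the column names longitude/latitude, and the GeoJSON-style geometry dictionary are assumptions for illustration; they are not the branch's _point_coords or _lonlat.

```python
def to_float(x):
    """Coerce to float, mapping missing or malformed input to NaN."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return float("nan")


def from_columns(row):
    """Extractor for rows with explicit lon/lat columns."""
    return row.get("longitude"), row.get("latitude")


def from_geometry(row):
    """Extractor for rows with a GeoJSON-style Point geometry (assumed shape)."""
    geom = row.get("geometry") or {}
    coords = geom.get("coordinates") or (None, None)
    return coords[0], coords[1]


def point_coords(rows, extract):
    """One (lon, lat) pair per site; the first occurrence wins."""
    out = {}
    for row in rows:
        site = row["monitoring_location_id"]
        if site not in out:
            lon, lat = extract(row)
            out[site] = (to_float(lon), to_float(lat))
    return out


by_columns = point_coords(
    [{"monitoring_location_id": "USGS-A", "longitude": "-89.5", "latitude": 43.1}],
    from_columns,
)
by_geometry = point_coords(
    [
        {"monitoring_location_id": "USGS-A",
         "geometry": {"type": "Point", "coordinates": [-89.5, 43.1]}},
        {"monitoring_location_id": "USGS-B", "geometry": None},
    ],
    from_geometry,
)
```

Only the extractor differs between the two paths; dedup, float coercion, and the NaN fallback for missing geometry are shared, which is the duplication the refactor removes.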