feat: support geometry and geography types#2653
Conversation
There was a problem hiding this comment.
Very cool!
I will take a closer look tomorrow, but note that there is a PR into parquet-geospatial to fix the translation between GeoArrow and Parquet types, which mostly was incorrect for Geography but there were a few subtle issues otherwise. There are a number of test cases parameterized nicely at the bottom that you can reuse here. apache/arrow-rs#10065
Just because I happen to be reviewing it, it may also be useful to reference the initial Go PR which has a similar scope: apache/iceberg-go#1138
|
@paleolimbot Thanks for pointing out those edge cases. I will take a closer look. |
huan233usc
left a comment
There was a problem hiding this comment.
Really nice to see this — the type model, Parquet geometry/geography logical types, and especially skipping byte min/max for spatial columns all line up with the spec. A couple of design-alignment points inline.
|
|
||
| impl GeographyType { | ||
| /// Creates a geography type with an optional coordinate reference system and edge interpolation algorithm. | ||
| pub fn new(crs: Option<String>, algorithm: WkbEdges) -> Result<Self> { |
There was a problem hiding this comment.
The public iceberg::spec API exposes parquet_geospatial::WkbEdges (GeographyType::new/algorithm()), coupling the type layer to the parquet-geospatial crate. The set of edge algorithms is defined by the spec itself (edge-interpolation-algorithm) — could we define an Iceberg-owned enum mirroring the spec and convert to WkbEdges only at the Parquet boundary, keeping the type layer format-agnostic?
There was a problem hiding this comment.
Done in 3c8a92e. GeographyType now uses an Iceberg-owned EdgeInterpolationAlgorithm, and Arrow schema conversion maps to/from parquet_geospatial::WkbEdges only at the WKB extension metadata boundary.
| PrimitiveType::Fixed(_) | ||
| | PrimitiveType::Binary | ||
| | PrimitiveType::Geometry(_) | ||
| | PrimitiveType::Geography(_) => PrimitiveLiteral::Binary(Vec::from(bytes)), |
There was a problem hiding this comment.
The single-value path decodes geo bytes as opaque WKB. Per the spec, geo bounds are a single point encoded as x:y[:z][:m] little-endian f64s, not WKB (bound-serialization, bounds-for-geometry-and-geography). Both implementations skip spatial bounds today so it's latent, but worth agreeing the bound codec follows the spec point encoding so they don't diverge later.
There was a problem hiding this comment.
Agreed. This PR still omits spatial lower/upper bounds from the Parquet writer, so it does not claim WKB-based bound support. If we add spatial bounds later, the bound codec should use the spec point encoding (x:y[:z][:m] as little-endian f64 values), not WKB.
Summary
This draft PR adds Iceberg Geometry/Geography primitive type support by reusing arrow-rs/parquet-geospatial support instead of introducing a local geospatial model.
Related issues
Related to #2411 and #1884.