Skip to content

fix: sanitize Avro field names on write, respect iceberg-field-name on read#2540

Open
SreeramGarlapati wants to merge 3 commits into
apache:mainfrom
SreeramGarlapati:fix/avro-field-name-sanitization
Open

fix: sanitize Avro field names on write, respect iceberg-field-name on read#2540
SreeramGarlapati wants to merge 3 commits into
apache:mainfrom
SreeramGarlapati:fix/avro-field-name-sanitization

Conversation

@SreeramGarlapati

@SreeramGarlapati SreeramGarlapati commented May 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Iceberg field names can be anything (123column, field.with.dots, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java sanitizes invalid names on write and records the original in the iceberg-field-name custom Avro property (TypeToSchema, BuildAvroProjection). On read, Java recovers the authoritative name by field-id projection against the table schema, so in Java that property is write-only provenance rather than something it reads back. iceberg-rust was doing neither half of the write contract — it wrote invalid Avro names directly, and on the standalone Avro→Iceberg schema conversion it had no way to recover the original name.

This causes two interop failures:

  1. Write path: iceberg-rust produces invalid Avro when field names have leading digits or special chars → strict parsers (Java, Python) reject the file
  2. Read path: Avro files carrying sanitized names + iceberg-field-name are converted back to the sanitized name (e.g. _123column) instead of the original

Changes

Write path (Iceberg→Avro conversion in SchemaToAvroSchema::field):

  • Added is_valid_avro_name() — checks [A-Za-z_][A-Za-z0-9_]*
  • Added sanitize_avro_name() — follows Java AvroSchemaUtil.sanitize()'s escape scheme, but is intentionally ASCII-strict so output is always a spec-valid Avro name (Java's unicode Character.isLetter/isLetterOrDigit/isDigit can leave names like café/_٠x that violate Avro's ASCII-only grammar):
    • Leading digit: prefix _ (e.g., 123col_123col)
    • Anything outside ASCII alnum/_: _x<HEX> (e.g., field.namefield_x2Ename)
    • Operates on UTF-16 code units to match Java's charAt() for supplementary chars
  • When sanitization is needed: stores original name in iceberg-field-name property

Read path (Avro→Iceberg conversion in AvroSchemaToSchema::record):

  • Checks iceberg-field-name custom attribute first, falls back to the Avro field name. Note this is deliberately more faithful than Java's SchemaToType, which uses field.name() directly; iceberg-rust restores the original on the standalone conversion path too, while staying field-id-based for real table reads.

Java reference

Closes #2535

Test plan

  • test_is_valid_avro_name — validates detection of invalid names
  • test_sanitize_avro_name — ASCII edge cases (match Java for the chars both escape)
  • test_sanitize_avro_name_unicode — BMP and supplementary char handling (surrogate pairs)
  • test_sanitization_round_trip — Iceberg→Avro→Iceberg preserves original names
  • test_avro_to_iceberg_uses_iceberg_field_name_property — reads sanitized schemas back to the original name
  • All existing tests pass (no regressions in bidirectional schema conversion tests)

@xanderbailey xanderbailey left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks for the contribution, left a couple of questions

// This const may better to maintain in avro-rs.
const LOGICAL_TYPE: &str = "logicalType";

fn is_valid_avro_name(name: &str) -> bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could use something from https://github.com/apache/avro-rs/blob/4edb1ce1ae1ab5bd3fafb08ca3f622946c01c9fd/avro/src/validator.rs#L4 but it looks like validate_record_field_name is pub(crate). Is it work filing an issue upstream to see if we could expose something that would allow us to validate against their implementation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call. i'd keep the local check regardless though — is_valid_avro_name is the exact Avro grammar [A-Za-z_][A-Za-z0-9_]* in ~6 lines with no coupling to avro-rs internals, and the sanitize half is the iceberg iceberg-field-name convention which avro-rs doesn't cover anyway. happy to file an upstream issue to expose validate_record_field_name as good-citizen cleanup and link it here — lmk if you'd like that.

Comment on lines +105 to +107
fn is_ascii_digit_u16(c: u16) -> bool {
matches!(c, 0x30..=0x39)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    /**
     * Determines if the specified character is a digit.
     * <p>
     * A character is a digit if its general category type, provided
     * by {@code Character.getType(ch)}, is
     * {@code DECIMAL_DIGIT_NUMBER}.
     * <p>
     * Some Unicode character ranges that contain digits:
     * <ul>
     * <li>{@code '\u005Cu0030'} through {@code '\u005Cu0039'},
     *     ISO-LATIN-1 digits ({@code '0'} through {@code '9'})
     * <li>{@code '\u005Cu0660'} through {@code '\u005Cu0669'},
     *     Arabic-Indic digits
     * <li>{@code '\u005Cu06F0'} through {@code '\u005Cu06F9'},
     *     Extended Arabic-Indic digits
     * <li>{@code '\u005Cu0966'} through {@code '\u005Cu096F'},
     *     Devanagari digits
     * <li>{@code '\u005CuFF10'} through {@code '\u005CuFF19'},
     *     Fullwidth digits
     * </ul>
     *
     * Many other character ranges contain digits as well.
     *
     * <p><b>Note:</b> This method cannot handle <a
     * href="#supplementary"> supplementary characters</a>. To support
     * all Unicode characters, including supplementary characters, use
     * the {@link #isDigit(int)} method.
     *
     * @param   ch   the character to be tested.
     * @return  {@code true} if the character is a digit;
     *          {@code false} otherwise.
     * @see     Character#digit(char, int)
     * @see     Character#forDigit(int, int)
     * @see     Character#getType(char)
     */
    public static boolean isDigit(char ch) {
        return isDigit((int)ch);
    }

https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/AvroSchemaUtil.java#L551
Java's isDigit() actually covers more than just ascii digits. This is a pretty niche edge-case.

Trying to reason about this in my head and I don't think it matters for interop that the representations are identical because both will correctly underscore the field and restore it from the map? Can you check my reasoning here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your conclusion holds — representational parity with Java isn't needed — but let me correct the reasoning (incl. something I'd have gotten wrong).

this impl is deliberately not matching Java's isDigit. Java's validAvroName/sanitize use the unicode Character.isLetter/isLetterOrDigit/isDigit, so Java keeps non-ASCII letters/digits as-is (café, leading ٠, …) — which actually violates Avro's ASCII-only name grammar [A-Za-z_][A-Za-z0-9_]*. matching Java's isDigit here would re-introduce the exact invalid-Avro bug this PR fixes. so we stay ASCII-strict: anything outside ASCII alnum/_ is escaped _x<HEX>, output is always spec-valid, original preserved in iceberg-field-name.

correction to the framing — it's not "both restore from the map". I checked the Java source: iceberg-field-name is write-only in Java (written by TypeToSchema + BuildAvroProjection; no read site). Java's SchemaToType just uses field.name(), and the authoritative name is recovered by field-id projection against the table schema. so correctness doesn't depend on the escape representation either way: within iceberg-rust the round-trip is lossless because we write and read the prop; cross-engine, names resolve by field-id. our read path honoring the prop is actually a touch more faithful than Java's SchemaToType on the standalone conversion path.

net: not changing the digit logic. I did fix the doc, which wrongly claimed it "matches Java semantics" — it follows Java's escape scheme but is intentionally ASCII-stricter.

SreeramGarlapati and others added 3 commits June 21, 2026 21:09
…ld-name

Iceberg field names can be arbitrary (leading digits, dots, spaces, etc.)
but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles
this by sanitizing invalid names on write and storing the original in an
"iceberg-field-name" custom property, then checking that property on read.

iceberg-rust was writing unsanitized names directly, which causes Avro
validation failures (or produces files unreadable by strict Avro parsers)
when field names don't conform to Avro's naming rules.

This adds:
- sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize() logic
  (prefix _ for leading digits, _x<HEX> for special chars)
- Write path: sanitizes the field name and stores the original in
  iceberg-field-name when sanitization was needed
- Read path: checks iceberg-field-name property first, falls back to
  the Avro field name

Closes apache#2535

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
…tion

Adds test cases for:
- Non-ASCII BMP characters (U+00E9, U+4E2D)
- Supplementary characters (surrogate pair handling, matching Java's UTF-16)
- Empty string edge case
- Read-path: iceberg-field-name property resolution from Java-written schemas
- Verify iceberg-field-name property is set on dotted field names

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
The previous doc claimed it "matches Java AvroSchemaUtil.sanitize()
semantics". It does not: Java's validAvroName/sanitize use Unicode
Character.isLetter/isLetterOrDigit/isDigit and can emit names that
violate Avro's ASCII-only grammar (e.g. it leaves `café` and `_٠x`
intact). This impl is ASCII-strict so output is always a spec-valid
Avro name, with the exact original preserved via iceberg-field-name.
Document the divergence and that the escape scheme still matches Java
for any character both sides escape.

Co-authored-by: Sreeram Garlapati <sreeramg@salesforce.com>
@SreeramGarlapati SreeramGarlapati force-pushed the fix/avro-field-name-sanitization branch from 68c1e33 to 6e5c644 Compare June 22, 2026 04:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

avro schema writer does not sanitize field names that violate avro naming rules

2 participants