Skip to content

fix(transformer): escape < and > in extracted statute text (#200 H2)#207

Merged
williamzujkowski merged 1 commit into
mainfrom
fix/transformer-escape-html
Jun 23, 2026
Merged

fix(transformer): escape < and > in extracted statute text (#200 H2)#207
williamzujkowski merged 1 commit into
mainfrom
fix/transformer-escape-html

Conversation

@williamzujkowski

Copy link
Copy Markdown
Collaborator

Addresses H2 from #200 (the source-side / authoritative half).

Problem

The XML→Markdown transformer writes statute body text into .md files that the web app renders via <Content /> with no rehype-sanitize. The parser decodes XML entities (processEntities), so a source &lt;img onerror=...&gt; becomes the literal <img onerror=...> in extracted text — which Astro's Markdown renderer would treat as live HTML (stored-XSS surface; low likelihood today since OLRC is trusted, but unenforced).

Fix

Escape </>&lt;/&gt; in extractTextFromNodes — the single choke point all rendered statute text flows through — so the generated Markdown is provably HTML-free regardless of upstream USLM content.

  • The recursive walk does not double-escape: once <&lt;, the entity has no < for a parent level to re-match.
  • & is intentionally left untouched (no entity double-encoding; < alone is the tag-injection vector).
  • Markers, headings, and title numbers are alphanumeric → unaffected. Section paths derive from element identifiers, not this text → unaffected.

Verification

  • Golden-snapshot tests unchanged — real statute fixtures carry no raw angle brackets, so normal output is byte-identical (no rendering regression).
  • New xml-utils.test.ts: escaping, no-double-escape across nested nodes, ampersand passthrough.
  • pnpm --filter @civic-source/transformer test70 pass; pnpm build8/8.

This is the authoritative source-side fix. The render-side backstop (rehype-sanitize) and the CSP 'unsafe-inline' item (M1) remain noted in #200.

🤖 Generated with Claude Code

…pth)

Statute body text is written into Markdown that the web app renders with
no HTML sanitization (#200/H2). The XML parser decodes entities, so a
source `&lt;img onerror=...&gt;` becomes the literal `<img onerror=...>`
in extracted text and would render as live HTML.

Escape `<`/`>` to entities in extractTextFromNodes — the single choke
point all rendered statute text flows through — so the generated
Markdown is provably HTML-free regardless of upstream USLM content. The
recursive walk does not double-escape (an escaped `&lt;` contains no `<`
for a parent level to re-match), and `&` is left untouched to avoid
entity double-encoding. Markers, headings and title numbers are
alphanumeric and unaffected; section paths derive from element
identifiers, not this text.

Golden-snapshot tests unchanged (real fixtures carry no raw angle
brackets). New xml-utils tests cover escaping, no-double-escape, and
ampersand passthrough. transformer: 70 pass; monorepo builds.

Refs #200 (H2)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@williamzujkowski williamzujkowski requested a review from a team as a code owner June 23, 2026 04:07
@williamzujkowski

Copy link
Copy Markdown
Collaborator Author

Ran an independent multi-reviewer consensus panel on this change. Net: the escape is correct, idempotent (no double-escape across the recursive walk), and inert for the num/heading/path code paths — but the "provably HTML-free" claim is overstated and is now scoped:

  • This neutralizes raw HTML tags (<img onerror=…>) — the H2 vector as filed.
  • It does not neutralize Markdown-syntax injection — e.g. literal [x](javascript:…) link/autolink/image text in statute content still compiles to a live <a>/<img>. (The transformer doesn't emit link syntax, so this only applies to literal such text passing through — very low likelihood, but real.)

The complete control for that residual is the render-side rehype-sanitize backstop, which I've elevated in #200. This PR remains a correct, tested defense-in-depth improvement for the raw-tag vector, so merging it; the render-side control closes the rest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant