fix(transformer): escape < and > in extracted statute text (#200 H2) #207
Merged
…pth)

Statute body text is written into Markdown that the web app renders with no HTML sanitization (#200/H2). The XML parser decodes entities, so a source `&lt;img onerror=...&gt;` becomes the literal `<img onerror=...>` in extracted text and would render as live HTML.

Escape `<`/`>` to entities in extractTextFromNodes (the single choke point all rendered statute text flows through) so the generated Markdown is provably HTML-free regardless of upstream USLM content. The recursive walk does not double-escape (an escaped `&lt;` contains no `<` for a parent level to re-match), and `&` is left untouched to avoid entity double-encoding. Markers, headings and title numbers are alphanumeric and unaffected; section paths derive from element identifiers, not this text.

Golden-snapshot tests unchanged (real fixtures carry no raw angle brackets). New xml-utils tests cover escaping, no-double-escape, and ampersand passthrough. transformer: 70 pass; monorepo builds.

Refs #200 (H2)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
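The idempotence argument in the commit message can be illustrated with a minimal sketch. The helper names `escapeAngleBrackets` and `escapeWithAmpersand` are hypothetical and are not taken from the repository; the second exists only to show what would go wrong if `&` were escaped as well.

```typescript
// Hypothetical helper: escapes only angle brackets, leaving `&` untouched.
function escapeAngleBrackets(text: string): string {
  return text.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical contrast case: also escapes `&`, which makes the
// function non-idempotent when it runs more than once on the same text.
function escapeWithAmpersand(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Text as it looks after the XML parser has decoded entities.
const decoded = "<img onerror=...>";

// One pass neutralizes the tag: "&lt;img onerror=...&gt;"
const once = escapeAngleBrackets(decoded);

// A second pass finds no `<` or `>` to match, so nothing changes.
const twice = escapeAngleBrackets(once);

// With `&` escaped too, a second pass re-encodes the entity:
// "&amp;lt;img onerror=...&amp;gt;"
const doubleEncoded = escapeWithAmpersand(escapeWithAmpersand(decoded));
```

Because `&lt;` and `&gt;` contain neither `<` nor `>`, applying the bracket-only escape at several levels of a recursive walk gives the same result as applying it once.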
Ran an independent multi-reviewer consensus panel on this change. Net: the escape is correct, idempotent (no double-escape across the recursive walk), and inert for the num/heading/path code paths — but the "provably HTML-free" claim is overstated and is now scoped:
The complete control for that residual is the render-side |
This was referenced Jun 23, 2026
Addresses H2 from #200 (the source-side / authoritative half).
Problem
The XML→Markdown transformer writes statute body text into `.md` files that the web app renders via `<Content />` with no rehype-sanitize. The parser decodes XML entities (`processEntities`), so a source `&lt;img onerror=...&gt;` becomes the literal `<img onerror=...>` in extracted text, which Astro's Markdown renderer would treat as live HTML (stored-XSS surface; low likelihood today since OLRC is trusted, but unenforced).
Fix
Escape `<`/`>` → `&lt;`/`&gt;` in `extractTextFromNodes` (the single choke point all rendered statute text flows through) so the generated Markdown is provably HTML-free regardless of upstream USLM content.
The recursive walk does not double-escape: after `<` → `&lt;`, the entity has no `<` for a parent level to re-match.
`&` is intentionally left untouched (no entity double-encoding; `<` alone is the tag-injection vector).
Verification
`xml-utils.test.ts`: escaping, no-double-escape across nested nodes, ampersand passthrough.
`pnpm --filter @civic-source/transformer test` → 70 pass; `pnpm build` → 8/8.
This is the authoritative source-side fix. The render-side backstop (rehype-sanitize) and the CSP `'unsafe-inline'` item (M1) remain noted in #200.
🤖 Generated with Claude Code
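The three cases listed under Verification (escaping, no double-escape across nested nodes, ampersand passthrough) can be sketched against a simplified recursive walk. The `TextNode` shape and the `extractText` function below are assumptions made for illustration; they are not the repository's `extractTextFromNodes` or the parser's actual node format.

```typescript
// Hypothetical node shape: either raw text or a container of child nodes.
type TextNode = string | { children: TextNode[] };

// Escape only angle brackets; `&` is deliberately left alone.
function escapeAngleBrackets(text: string): string {
  return text.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Simplified recursive walk that escapes its result at every level,
// so a parent level re-processes text its children already escaped.
function extractText(nodes: TextNode[]): string {
  const joined = nodes
    .map((node) =>
      typeof node === "string" ? node : extractText(node.children)
    )
    .join("");
  return escapeAngleBrackets(joined);
}

// Nested input: the innermost text is escaped three times on the way up.
const nested: TextNode[] = [
  "if a < b, ",
  { children: ["then b > a ", { children: ["<img onerror=...>"] }] },
  " (R&D)",
];

// "if a &lt; b, then b &gt; a &lt;img onerror=...&gt; (R&D)"
const rendered = extractText(nested);
```

The innermost string passes through the escape at three levels yet comes out escaped exactly once, and the `&` in `R&D` is unchanged.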