Skip to content

fix(zoom): iteratively strip tags in transcript parser to close incomplete-sanitization gap#4745

Merged
waleedlatif1 merged 2 commits into
stagingfrom
waleedlatif1/zoom-transcript-sanitization
May 26, 2026
Merged

fix(zoom): iteratively strip tags in transcript parser to close incomplete-sanitization gap#4745
waleedlatif1 merged 2 commits into
stagingfrom
waleedlatif1/zoom-transcript-sanitization

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

Why

Speaker labels in WebVTT transcripts originate from Zoom meeting attendees' display names — user-controllable content. A single-pass tag regex can be bypassed by inputs that, once partially stripped, reconstruct a tag from the surrounding fragments. Iterating until the string stabilises guarantees no nested tag layers remain.

Type of Change

  • Bug fix (security hardening)

Testing

  • Typecheck and lint clean.
  • Manually verified the loop terminates on plain transcripts (one iteration) and on nested inputs (multiple iterations until stable).

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 26, 2026 9:23pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 26, 2026

PR Summary

Medium Risk
Security hardening on user-influenced transcript text before it is stored or displayed; scope is limited to the Zoom VTT parser with new regression tests.

Overview
Hardens Zoom WebVTT transcript parsing by replacing a single-pass tag strip with a loop that removes </?[^>]+> tags until the string stops changing, closing CodeQL “incomplete multi-character sanitization” bypasses where partial stripping could leave reconstructed markup (including in attendee-controlled speaker labels).

Exports parseVtt from zoom.ts only for tests (documented as non-public) and adds zoom.test.ts covering normal VTT behavior plus adversarial/nested tag inputs.

Reviewed by Cursor Bugbot for commit 79c057e. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR hardens the WebVTT transcript parser in the Zoom connector against a CodeQL "Incomplete multi-character sanitization" finding by replacing a single-pass tag-stripping regex with a do-while loop that re-strips until the string reaches a stable fixed point. Regression tests are added to cover the hardened path.

  • zoom.ts: parseVtt now runs /<\/?[^>]+>/g iteratively until no match remains, preventing crafted inputs (e.g. <<b>b>text</<b>b>) from reconstructing a tag after one pass. The function is exported solely to enable the new tests.
  • zoom.test.ts: 14 concurrent cases exercise both normal parsing (voice tags, karaoke timestamps, whitespace collapsing) and adversarial inputs (overlapping tags, nested script fragments, crafted speaker names, deeply-nested bracket sequences).

Confidence Score: 5/5

Safe to merge — the change is tightly scoped to a single pure function with no side effects, and the new tests verify both normal operation and the adversarial inputs the fix targets.

The do-while loop terminates correctly because each pass can only shorten or preserve the string, guaranteeing convergence. The regex [^>]+ uses a simple negated character class with no nested quantifiers, so there is no ReDoS exposure. Zoom display-name length limits bound nesting depth in practice, and the tests pin the behavior for future refactors.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/connectors/zoom/zoom.ts Replaces single-pass regex tag stripping with a do-while loop that runs until stable; exports parseVtt for testing. Logic is correct and terminates reliably.
apps/sim/connectors/zoom/zoom.test.ts New test file with 14 concurrent cases covering normal parsing, voice-tag handling, and adversarial inputs (overlapping tags, nested script fragments, deeply-nested brackets).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Raw cue text lines joined] --> B[Apply voice-tag regex\n replace v Speaker text /v → Speaker: text]
    B --> C[withoutTags = withSpeakers]
    C --> D{Loop iteration:\nprevious = withoutTags\nwithoutTags = strip tags}
    D --> E{withoutTags === previous?}
    E -- No, tags were found --> D
    E -- Yes, stable fixed-point --> F[Collapse whitespace + trim]
    F --> G{Non-empty?}
    G -- Yes --> H[Push to segments]
    G -- No --> I[Skip]
Loading

Reviews (2): Last reviewed commit: "test(zoom): cover iterative sanitization..." | Re-trigger Greptile

Comment thread apps/sim/connectors/zoom/zoom.ts
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 79c057e. Configure here.

@waleedlatif1 waleedlatif1 merged commit e66daa1 into staging May 26, 2026
10 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/zoom-transcript-sanitization branch May 26, 2026 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant