feat(voice-transcription): speech-to-text plugin for inbound voice notes #9

Merged: rmyndharis merged 1 commit into main from feat/voice-transcription-plugin on Jun 25, 2026

@rmyndharis (Owner)

Voice Note Transcription plugin

Adds the voice-transcription marketplace extension — transcribes inbound WhatsApp voice notes via an OpenAI-compatible STT backend and delivers a message.transcription event out-of-band, so bots/AI can read & reply to audio. Implements the request in rmyndharis/OpenWA#365.

Design

  • Off the message-delivery critical path. The message:received hook returns {continue:true} immediately and STT runs as an un-awaited task — so it never blocks or delays delivery, and isn't bound by the 5s sandbox hook budget. (A regression test pins this: the hook resolves even when ctx.net.fetch hangs.)
  • Binary multipart upload. The audio is sent as a Buffer body — it survives the sandbox→host structuredClone boundary intact (a string body would corrupt binary). The part is labeled voice.ogg/audio/ogg, so OpenAI-compatible servers accept WhatsApp's OGG/Opus with no transcoding.
  • Provider-agnostic. Any OpenAI-compatible /v1/audio/transcriptions endpoint: self-hosted Speaches/faster-whisper (default, free, local) or hosted Groq/OpenAI by changing one URL.
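
The "off the critical path" behaviour above can be sketched as a fire-and-forget handler. This is a minimal illustration, not the plugin's actual code: `makeHook`, `VoiceMessage`, `HookResult`, and the callback names are hypothetical, and only the `{continue: true}` return value comes from the description.

```typescript
// Hypothetical shapes: the real plugin SDK types are not shown in this PR.
interface HookResult { continue: boolean }
interface VoiceMessage { id: string; audio: Uint8Array }
type Transcribe = (msg: VoiceMessage) => Promise<string>;

// Builds a message:received-style handler that never awaits STT.
function makeHook(
  transcribe: Transcribe,
  onDone: (id: string, text: string) => void,
  onError: (id: string, err: unknown) => void,
) {
  return async function onMessageReceived(msg: VoiceMessage): Promise<HookResult> {
    // Fire and forget: the chain is deliberately not awaited. Starting from
    // Promise.resolve() also captures a synchronous throw inside transcribe,
    // and the catch handler prevents an unhandled rejection.
    void Promise.resolve()
      .then(() => transcribe(msg))
      .then((text) => onDone(msg.id, text))
      .catch((err) => onError(msg.id, err));
    // Delivery continues immediately, regardless of how long STT takes.
    return { continue: true };
  };
}
```

With this shape, a transcriber that never settles still lets the handler resolve, which is the property the regression test in the description pins.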

Delivery

  • Webhook (deliveryWebhookUrl): POSTs the event to your endpoint, HMAC-SHA256 signed in X-OpenWA-Signature (same scheme as core webhooks) when a secret is set.
  • In-chat (chatDelivery: off | self | reply, default off): self sends the transcript to your own number without leaking it to the sender; reply quote-replies to the sender.
  • Both channels are optional and independent (chat-only works).
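
A receiver can check the webhook signature roughly as follows. This is a sketch under assumptions: it treats the `X-OpenWA-Signature` value as a bare hex-encoded HMAC-SHA256 of the raw request body, which this PR does not spell out, so the exact format should be checked against the core webhook documentation. `signPayload` and `verifySignature` are hypothetical helper names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: hex-encoded HMAC-SHA256 over the raw body. The exact header
// value format used by OpenWA core webhooks is not stated in this PR.
function signPayload(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Receiver side: recompute the digest and compare in constant time.
function verifySignature(rawBody: string, secret: string, received: string): boolean {
  const expected = Buffer.from(signPayload(rawBody, secret), "hex");
  const actual = Buffer.from(received, "hex");
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Verification should run against the raw bytes of the request body as received, before any JSON parsing, because re-serialising the payload can change the bytes and invalidate the digest.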

Events & guards

  • Status events: completed (transcript) / failed (STT errored) / skipped (too large, rate-limited, empty).
  • Exact maxSizeBytes cost guard, best-effort per-session hourly rate limit, best-effort idempotency (suppresses #466-style engine re-fires), and an STT circuit breaker. Fail-open throughout.
  • The transcript is marked untrusted: true — downstream LLM consumers must treat it as user-role input.
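
The pre-STT guards above could be arranged along these lines. This is a sketch, not the plugin's implementation: the `Guards` class, the `Decision` type, the reason strings, and the fixed one-hour window are all assumptions made for illustration, and the circuit breaker is left out for brevity.

```typescript
// Hypothetical names and reason strings; none of these come from the PR.
type Decision =
  | { action: "proceed" }
  | { action: "suppress" } // duplicate delivery: do nothing
  | { action: "skip"; reason: "too_large" | "rate_limited" };

const HOUR_MS = 3_600_000;

class Guards {
  // In-memory and unbounded here; a real plugin would cap or expire entries.
  private seen = new Set<string>();
  private windows = new Map<string, { start: number; count: number }>();
  private maxSizeBytes: number;
  private maxPerHour: number;
  private now: () => number;

  constructor(maxSizeBytes: number, maxPerHour: number, now: () => number = Date.now) {
    this.maxSizeBytes = maxSizeBytes;
    this.maxPerHour = maxPerHour;
    this.now = now;
  }

  check(sessionId: string, messageId: string, sizeBytes: number): Decision {
    // Best-effort idempotency: ignore a message id that was already handled.
    if (this.seen.has(messageId)) return { action: "suppress" };
    this.seen.add(messageId);

    // Exact size guard, applied before any STT cost is incurred.
    if (sizeBytes > this.maxSizeBytes) return { action: "skip", reason: "too_large" };

    // Fixed-window hourly limit per session.
    const t = this.now();
    const w = this.windows.get(sessionId);
    if (!w || t - w.start >= HOUR_MS) {
      this.windows.set(sessionId, { start: t, count: 1 });
      return { action: "proceed" };
    }
    if (w.count >= this.maxPerHour) return { action: "skip", reason: "rate_limited" };
    w.count += 1;
    return { action: "proceed" };
  }
}
```

Because the state lives in worker memory, it is lost on restart, which matches the "best-effort" wording in the description.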

Contract change

Widens the vendored types to match the sandbox runtime: PluginNetResponse.body (the real field — the .json()/.text() method forms don't cross the worker structuredClone boundary) and IncomingMessage.media. Also updates the group-translate test fixture for the now-required body.

Note: this surfaced that group-translate calls res.json(), which doesn't exist at runtime in the sandbox — tracked separately; not touched here beyond keeping it compiling.
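
Reading the response through the `body` field, rather than calling `res.json()`, might look like the sketch below. Only the existence of `PluginNetResponse.body` is confirmed by this PR: the `PluginNetResponseLike` interface, its `status` field, and the assumption that `body` arrives as either a string or raw bytes are illustrative.

```typescript
// Assumed shape: only `body` is confirmed by the PR. `status` and the
// runtime type of `body` (string or bytes) are assumptions.
interface PluginNetResponseLike {
  status: number;
  body: string | Uint8Array;
}

// Parses JSON from plain data, since methods such as .json() are functions
// and cannot be carried across a structuredClone boundary.
function parseJsonBody<T>(res: PluginNetResponseLike): T {
  const text =
    typeof res.body === "string" ? res.body : new TextDecoder().decode(res.body);
  return JSON.parse(text) as T;
}
```

For an OpenAI-compatible transcription endpoint using the default JSON response format, the transcript would then be read as `parseJsonBody<{ text: string }>(res).text`.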

Tests

TDD throughout — 32 plugin tests (multipart binary integrity, STT client + circuit breaker, HMAC delivery, coordinator gate/guards/status-events/chat-delivery, non-blocking hook). Full repo gate green: tsc clean, 139/139, catalog in sync.

Status: beta (best-effort by design — at-most-once-while-worker-alive, no backpressure; documented in the README). Upgrade path for exactly-once would be a future core message.transcription event.

Adds the voice-transcription marketplace extension: transcribes inbound
WhatsApp voice notes via an OpenAI-compatible STT backend (self-hosted
Speaches/faster-whisper, or hosted Groq/OpenAI) and delivers a
message.transcription event out-of-band, so bots and AI can read and
reply to audio. Implements the request in rmyndharis/OpenWA#365.

- Off the message-delivery critical path: the message:received hook
  returns immediately and STT runs as an un-awaited task, so it never
  blocks or delays delivery (and is not bound by the 5s hook budget).
- Audio uploaded as a binary multipart Buffer body (intact across the
  sandbox boundary); part labeled voice.ogg so OGG/Opus needs no transcode.
- Delivery: configurable webhook (HMAC-SHA256 signed in X-OpenWA-Signature,
  matching core webhooks) and/or optional in-chat (off|self|reply, default
  off; self avoids leaking to the sender). Either is optional.
- Status events: completed / failed / skipped(reason).
- Guards: exact maxSizeBytes, per-session hourly rate limit, best-effort
  idempotency (suppresses #466-style engine re-fires), STT circuit breaker.
  Fail-open throughout.
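
The binary multipart upload described above can be sketched as follows. This is an illustration rather than the plugin's code: `buildMultipart` and the boundary prefix are hypothetical, while the `file` and `model` form fields follow the OpenAI transcription API and the `voice.ogg` / `audio/ogg` labels come from the PR description.

```typescript
import { randomBytes } from "node:crypto";

// Builds a multipart/form-data body as one Buffer, so the audio bytes are
// never coerced to a string on their way to the STT server.
function buildMultipart(
  audio: Buffer,
  model: string,
): { body: Buffer; contentType: string } {
  // Hypothetical boundary prefix; the random suffix makes it unlikely to
  // collide with the payload.
  const boundary = "----openwa-" + randomBytes(12).toString("hex");
  const head = Buffer.from(
    `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="model"\r\n\r\n` +
      `${model}\r\n` +
      `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="file"; filename="voice.ogg"\r\n` +
      `Content-Type: audio/ogg\r\n\r\n`,
    "utf8",
  );
  const tail = Buffer.from(`\r\n--${boundary}--\r\n`, "utf8");
  return {
    // Concatenate at the byte level: text header, raw audio, closing boundary.
    body: Buffer.concat([head, audio, tail]),
    contentType: `multipart/form-data; boundary=${boundary}`,
  };
}
```

The returned `contentType` must be sent as the request's Content-Type header so the server can locate the boundary.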

Contract: widen the vendored types to match the sandbox runtime —
PluginNetResponse.body (the real field; the .json()/.text() methods do not
cross the worker boundary) and IncomingMessage.media. Also updates the
group-translate test fixture for the now-required body field.
@rmyndharis merged commit d1d44a9 into main on Jun 25, 2026
1 check passed
@rmyndharis deleted the feat/voice-transcription-plugin branch on June 25, 2026 15:47