feat(voice-transcription): speech-to-text plugin for inbound voice notes by rmyndharis · Pull Request #9 · rmyndharis/OpenWA-plugins

rmyndharis · 2026-06-25T15:45:40Z

Voice Note Transcription plugin

Adds the voice-transcription marketplace extension — transcribes inbound WhatsApp voice notes via an OpenAI-compatible STT backend and delivers a message.transcription event out-of-band, so bots/AI can read & reply to audio. Implements the request in rmyndharis/OpenWA#365.

Design

Off the message-delivery critical path. The message:received hook returns {continue:true} immediately and STT runs as an un-awaited task — so it never blocks or delays delivery, and isn't bound by the 5s sandbox hook budget. (A regression test pins this: the hook resolves even when ctx.net.fetch hangs.)
Binary multipart upload. The audio is sent as a Buffer body — it survives the sandbox→host structuredClone boundary intact (a string body would corrupt binary). The part is labeled voice.ogg/audio/ogg, so OpenAI-compatible servers accept WhatsApp's OGG/Opus with no transcoding.
Provider-agnostic. Any OpenAI-compatible /v1/audio/transcriptions endpoint: self-hosted Speaches/faster-whisper (default, free, local) or hosted Groq/OpenAI by changing one URL.
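The binary multipart upload described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `buildMultipart` is a hypothetical helper name, the boundary is supplied by the caller, and the `file` part name follows the OpenAI transcription API convention. The audio bytes are concatenated as a Buffer and never pass through a string.

```typescript
// Hypothetical helper (not taken from the plugin source): assembles a
// multipart/form-data body as a single Buffer so the audio stays binary.
function buildMultipart(
  audio: Buffer,
  fields: Record<string, string>,
  boundary: string,
): { body: Buffer; contentType: string } {
  const chunks: Buffer[] = [];

  // Plain text fields (for example the model name) come first.
  for (const [name, value] of Object.entries(fields)) {
    chunks.push(
      Buffer.from(
        `--${boundary}\r\n` +
          `Content-Disposition: form-data; name="${name}"\r\n\r\n` +
          `${value}\r\n`,
        "utf8",
      ),
    );
  }

  // File part header, labeled voice.ogg / audio/ogg as in the PR description.
  chunks.push(
    Buffer.from(
      `--${boundary}\r\n` +
        `Content-Disposition: form-data; name="file"; filename="voice.ogg"\r\n` +
        `Content-Type: audio/ogg\r\n\r\n`,
      "utf8",
    ),
  );

  // Raw audio bytes are appended untouched, with no string conversion.
  chunks.push(audio);

  // Closing boundary.
  chunks.push(Buffer.from(`\r\n--${boundary}--\r\n`, "utf8"));

  return {
    body: Buffer.concat(chunks),
    contentType: `multipart/form-data; boundary=${boundary}`,
  };
}
```

The returned `body` and `contentType` would then be passed to the sandbox's fetch call as the request body and `Content-Type` header.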

Delivery

Webhook (deliveryWebhookUrl) — POSTs the event to your endpoint; HMAC-SHA256 signed in X-OpenWA-Signature (same scheme as core webhooks) when a secret is set.
In-chat (chatDelivery: off | self | reply, default off) — self notes it to your own number without leaking to the sender; reply quote-replies to the sender.
Either is optional (chat-only works).
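For webhook consumers, signing and verification might look like the sketch below. The PR only states that the signature is HMAC-SHA256 in `X-OpenWA-Signature`; the hex encoding, the absence of a prefix, and the function names `signPayload` and `verifySignature` are assumptions made for illustration, so check the core webhook documentation for the exact format.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: the signature is the hex-encoded HMAC-SHA256 of the raw body.
function signPayload(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Receiver side: recompute the signature over the raw request body and
// compare in constant time against the X-OpenWA-Signature header value.
function verifySignature(
  secret: string,
  rawBody: string,
  received: string,
): boolean {
  const expected = Buffer.from(signPayload(secret, rawBody), "hex");
  const actual = Buffer.from(received, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Verification should run against the raw request bytes, before any JSON parsing, because re-serializing the payload can change the bytes that were signed.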

Events & guards

Status events: completed (transcript) / failed (STT errored) / skipped (too large, rate-limited, empty).
Exact maxSizeBytes cost guard, best-effort per-session hourly rate limit, best-effort idempotency (suppresses #466-style engine re-fires), and an STT circuit breaker. Fail-open throughout.
The transcript is marked untrusted: true — downstream LLM consumers must treat it as user-role input.
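As an illustration of the circuit-breaker guard, here is a minimal sketch. The class name, the failure threshold, the cooldown, and the injectable clock are all hypothetical; the PR does not describe how the plugin's breaker is actually implemented.

```typescript
// Hypothetical circuit breaker: after `threshold` consecutive failures it
// opens and rejects attempts until `cooldownMs` has elapsed, then allows a
// probe request through.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    threshold: number,
    cooldownMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a request may be sent to the STT backend.
  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure at or past the threshold (re)opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

In a fail-open design such as the one described above, a message that arrives while the breaker is open would not block delivery; it would presumably just produce a skipped or failed status event.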

Contract change

Widens the vendored types to match the sandbox runtime: PluginNetResponse.body (the real field — the .json()/.text() method forms don't cross the worker structuredClone boundary) and IncomingMessage.media. Also updates the group-translate test fixture for the now-required body.

Note: this surfaced that group-translate calls res.json(), which doesn't exist at runtime in the sandbox — tracked separately; not touched here beyond keeping it compiling.
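A sketch of reading the response through `body` rather than `.json()`. The interface below is an assumed shape for illustration only: the PR confirms that `PluginNetResponse.body` exists, but the `status` field and the `string` type of `body` are guesses, and `parseTranscript` is a hypothetical helper. The `{ "text": ... }` payload is the default JSON response of OpenAI-compatible transcription endpoints.

```typescript
// Assumed shape; only the existence of `body` is confirmed by the PR.
interface PluginNetResponse {
  status: number;
  body: string;
}

// Hypothetical helper: extracts the transcript from a response, returning
// null instead of throwing so the caller can fail open.
function parseTranscript(res: PluginNetResponse): string | null {
  if (res.status < 200 || res.status >= 300) return null;
  try {
    // Parse the plain data field; method forms like res.json() are not
    // available on the sandbox side of the worker boundary.
    const parsed: unknown = JSON.parse(res.body);
    if (typeof parsed === "object" && parsed !== null && "text" in parsed) {
      const text = (parsed as { text: unknown }).text;
      return typeof text === "string" ? text : null;
    }
    return null;
  } catch {
    return null;
  }
}
```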

Tests

TDD throughout: 32 plugin tests (multipart binary integrity, STT client + circuit breaker, HMAC delivery, coordinator gate/guards/status-events/chat-delivery, non-blocking hook). Full repo gate green: tsc clean, 139/139 tests passing, catalog in sync.
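The non-blocking hook test could look roughly like the sketch below. All names here (`HookContext`, `transcribe`, `onMessageReceived`, `resolvesWithin`) and the placeholder URL are hypothetical, and the real plugin and its test harness may differ. The point is that the hook's promise settles even though the stubbed fetch never does.

```typescript
type HookResult = { continue: true };

// Assumed minimal slice of the plugin context, for illustration only.
interface HookContext {
  net: {
    fetch: (
      url: string,
      init?: { method?: string; body?: Buffer },
    ) => Promise<unknown>;
  };
}

// Stand-in for the STT call; the URL is a placeholder.
async function transcribe(ctx: HookContext, audio: Buffer): Promise<void> {
  await ctx.net.fetch("http://localhost:8000/v1/audio/transcriptions", {
    method: "POST",
    body: audio,
  });
}

// The hook starts transcription without awaiting it and returns at once.
async function onMessageReceived(
  ctx: HookContext,
  audio: Buffer,
): Promise<HookResult> {
  // Errors are swallowed here so a failing backend never affects delivery.
  void transcribe(ctx, audio).catch(() => undefined);
  return { continue: true };
}

// Resolves to true if `p` fulfils within `ms` milliseconds, false otherwise.
function resolvesWithin<T>(p: Promise<T>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    p.then(
      () => {
        clearTimeout(timer);
        resolve(true);
      },
      () => {
        clearTimeout(timer);
        resolve(false);
      },
    );
  });
}

// A context whose fetch never settles, simulating a hung STT backend.
const hangingCtx: HookContext = {
  net: { fetch: () => new Promise<unknown>(() => {}) },
};
```

A test would then assert that `resolvesWithin(onMessageReceived(hangingCtx, audio), 200)` yields `true`.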

Status: beta (best-effort by design — at-most-once-while-worker-alive, no backpressure; documented in the README). Upgrade path for exactly-once would be a future core message.transcription event.

Adds the voice-transcription marketplace extension: transcribes inbound WhatsApp voice notes via an OpenAI-compatible STT backend (self-hosted Speaches/faster-whisper, or hosted Groq/OpenAI) and delivers a message.transcription event out-of-band, so bots and AI can read and reply to audio. Implements the request in rmyndharis/OpenWA#365.

- Off the message-delivery critical path: the message:received hook returns immediately and STT runs as an un-awaited task, so it never blocks or delays delivery (and is not bound by the 5s hook budget).
- Audio uploaded as a binary multipart Buffer body (intact across the sandbox boundary); part labeled voice.ogg so OGG/Opus needs no transcode.
- Delivery: configurable webhook (HMAC-SHA256 signed in X-OpenWA-Signature, matching core webhooks) and/or optional in-chat (off|self|reply, default off; self avoids leaking to the sender). Either is optional.
- Status events: completed / failed / skipped(reason).
- Guards: exact maxSizeBytes, per-session hourly rate limit, best-effort idempotency (suppresses #466-style engine re-fires), STT circuit breaker. Fail-open throughout.

Contract: widen the vendored types to match the sandbox runtime: PluginNetResponse.body (the real field; the .json()/.text() methods do not cross the worker boundary) and IncomingMessage.media. Also updates the group-translate test fixture for the now-required body field.

rmyndharis mentioned this pull request Jun 25, 2026

group-translate: ctx.net.fetch has no .json() at runtime → translation silently fails #10

Closed

rmyndharis merged commit d1d44a9 into main Jun 25, 2026
1 check passed

rmyndharis deleted the feat/voice-transcription-plugin branch June 25, 2026 15:47


rmyndharis merged 1 commit into main from feat/voice-transcription-plugin
