feat(voice-transcription): speech-to-text plugin for inbound voice notes #9
Merged
Adds the voice-transcription marketplace extension: it transcribes inbound WhatsApp voice notes via an OpenAI-compatible STT backend (self-hosted Speaches/faster-whisper, or hosted Groq/OpenAI) and delivers a message.transcription event out-of-band, so bots and AI can read and reply to audio. Implements the request in rmyndharis/OpenWA#365.

- Off the message-delivery critical path: the message:received hook returns immediately and STT runs as an un-awaited task, so it never blocks or delays delivery (and is not bound by the 5s hook budget).
- Audio uploaded as a binary multipart Buffer body (intact across the sandbox boundary); part labeled voice.ogg so OGG/Opus needs no transcode.
- Delivery: configurable webhook (HMAC-SHA256 signed in X-OpenWA-Signature, matching core webhooks) and/or optional in-chat (off|self|reply, default off; self avoids leaking to the sender). Either is optional.
- Status events: completed / failed / skipped(reason).
- Guards: exact maxSizeBytes, per-session hourly rate limit, best-effort idempotency (suppresses #466-style engine re-fires), STT circuit breaker. Fail-open throughout.

Contract: widen the vendored types to match the sandbox runtime: PluginNetResponse.body (the real field; the .json()/.text() methods do not cross the worker boundary) and IncomingMessage.media. Also updates the group-translate test fixture for the now-required body field.
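The "off the critical path" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the `IncomingMessage` and `HookResult` shapes, the `makeOnMessageReceived` factory, and the `transcribe` / `deliver` callbacks are all hypothetical stand-ins for the real SDK types and the STT and delivery steps.

```typescript
// Hypothetical shapes for illustration only; the real plugin SDK types may differ.
type IncomingMessage = {
  id: string;
  media?: { mimetype: string; data: Uint8Array };
};
type HookResult = { continue: boolean };

// Stand-ins for the slow STT call and the out-of-band delivery step.
type Transcribe = (audio: Uint8Array) => Promise<string>;
type Deliver = (messageId: string, text: string) => Promise<void>;

function makeOnMessageReceived(
  transcribe: Transcribe,
  deliver: Deliver,
  onError: (err: unknown) => void = () => {},
) {
  return async function onMessageReceived(msg: IncomingMessage): Promise<HookResult> {
    const media = msg.media;
    if (media && media.mimetype.startsWith("audio/")) {
      // Deliberately NOT awaited: the hook must not wait for STT.
      // Errors are routed to onError so a failed transcription never
      // surfaces as an unhandled rejection or affects message delivery.
      void transcribe(media.data)
        .then((text) => deliver(msg.id, text))
        .catch(onError);
    }
    // Returned immediately, regardless of how long transcription takes.
    return { continue: true };
  };
}
```

Because the background promise is detached with `void` and terminated with `.catch`, a backend that hangs forever leaves the hook's own result unaffected, which is the property the regression test in this PR is said to pin.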
Voice Note Transcription plugin
Adds the `voice-transcription` marketplace extension: it transcribes inbound WhatsApp voice notes via an OpenAI-compatible STT backend and delivers a `message.transcription` event out-of-band, so bots/AI can read and reply to audio. Implements the request in rmyndharis/OpenWA#365.

Design
- Off the critical path: the `message:received` hook returns `{continue:true}` immediately and STT runs as an un-awaited task, so it never blocks or delays delivery and isn't bound by the 5s sandbox hook budget. (A regression test pins this: the hook resolves even when `ctx.net.fetch` hangs.)
- Audio is uploaded as a binary multipart `Buffer` body, which survives the sandbox→host `structuredClone` boundary intact (a string body would corrupt binary). The part is labeled `voice.ogg` / `audio/ogg`, so OpenAI-compatible servers accept WhatsApp's OGG/Opus with no transcoding.
- Works with any OpenAI-compatible `/v1/audio/transcriptions` endpoint: self-hosted Speaches/faster-whisper (default, free, local) or hosted Groq/OpenAI by changing one URL.

Delivery
- Webhook (`deliveryWebhookUrl`): POSTs the event to your endpoint; HMAC-SHA256 signed in `X-OpenWA-Signature` (same scheme as core webhooks) when a secret is set.
- In-chat (`chatDelivery`: `off|self|reply`, default `off`): `self` notes it to your own number without leaking to the sender; `reply` quote-replies to the sender.

Events & guards
- Status events: `completed` (transcript) / `failed` (STT errored) / `skipped` (too large, rate-limited, empty).
- Guards: exact `maxSizeBytes` cost guard, best-effort per-session hourly rate limit, best-effort idempotency (suppresses #466-style engine re-fires), and an STT circuit breaker. Fail-open throughout.
- Transcripts are marked `untrusted: true`: downstream LLM consumers must treat them as user-role input.

Contract change
Widens the vendored types to match the sandbox runtime: `PluginNetResponse.body` (the real field; the `.json()`/`.text()` method forms don't cross the worker `structuredClone` boundary) and `IncomingMessage.media`. Also updates the `group-translate` test fixture for the now-required `body`.

Tests
TDD throughout: 32 plugin tests (multipart binary integrity, STT client + circuit breaker, HMAC delivery, coordinator gate/guards/status-events/chat-delivery, non-blocking hook). Full repo gate green: `tsc` clean, 139/139, catalog in sync.

Status: beta (best-effort by design: at-most-once-while-worker-alive, no backpressure; documented in the README). The upgrade path for exactly-once would be a future core `message.transcription` event.
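For reference, the binary-safe upload described under Design could look roughly like the sketch below. It is an illustration under assumptions, not the plugin's implementation: `buildTranscriptionBody` is a hypothetical helper name, the boundary format is arbitrary, and the `file` and `model` field names follow the OpenAI transcription API, which the PR text does not spell out.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical helper: builds a multipart/form-data body as a single Buffer,
// so the audio bytes are never decoded into (and corrupted by) a string.
function buildTranscriptionBody(
  audio: Buffer,
  model: string,
  boundary: string = "----voice" + randomBytes(12).toString("hex"),
): { body: Buffer; contentType: string } {
  const CRLF = "\r\n";

  // Plain text field carrying the model name.
  const modelPart =
    `--${boundary}${CRLF}` +
    `Content-Disposition: form-data; name="model"${CRLF}${CRLF}` +
    `${model}${CRLF}`;

  // File part header; labeled voice.ogg / audio/ogg as described in the PR.
  const fileHeader =
    `--${boundary}${CRLF}` +
    `Content-Disposition: form-data; name="file"; filename="voice.ogg"${CRLF}` +
    `Content-Type: audio/ogg${CRLF}${CRLF}`;

  // Closing delimiter that terminates the multipart body.
  const closing = `${CRLF}--${boundary}--${CRLF}`;

  // Only the textual framing is encoded from strings; the audio is
  // concatenated byte-for-byte.
  const body = Buffer.concat([
    Buffer.from(modelPart, "utf8"),
    Buffer.from(fileHeader, "utf8"),
    audio,
    Buffer.from(closing, "utf8"),
  ]);

  return { body, contentType: `multipart/form-data; boundary=${boundary}` };
}
```

The returned `contentType` has to be sent as the request's `Content-Type` header so the server can find the boundary. Keeping the whole payload as one `Buffer` is what lets it cross a `structuredClone` boundary without any re-encoding.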