Fix/voice stt speech events user state tracking #5797
Open
dhruvladia-sarvam wants to merge 6 commits into
Co-authored-by: Cursor <cursoragent@cursor.com>
Problem
When no external VAD is configured, but the STT provider emits speech-boundary events (SpeechEventType.START_OF_SPEECH/END_OF_SPEECH), those STT speech events are not always propagated into the session user-state machine.
This causes AgentSession.user_state to remain "listening" while the user is actually speaking. As a result, user_away_timeout can fire mid-utterance and mark the user as "away" even though speech is ongoing.
This is especially visible with STT providers such as Sarvam that expose internal VAD signals over the STT stream.
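The failure mode above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: UserStateTracker, simulate, and the integer clock are hypothetical stand-ins, not the actual AgentSession implementation. It assumes the away timer is armed while the user is "listening" and cancelled when speech starts.

```python
class UserStateTracker:
    """Toy stand-in for the session user-state machine (hypothetical)."""

    def __init__(self, away_timeout):
        self.away_timeout = away_timeout
        self.state = "listening"
        self.away_deadline = away_timeout  # timer armed from t=0
        self.history = []  # (time, new_state) transitions

    def _set(self, now, state):
        if state != self.state:
            self.state = state
            self.history.append((now, state))

    def on_start_of_speech(self, now):
        self.away_deadline = None  # cancel the away timer
        self._set(now, "speaking")

    def on_end_of_speech(self, now):
        self._set(now, "listening")
        self.away_deadline = now + self.away_timeout  # re-arm the timer

    def tick(self, now):
        if self.away_deadline is not None and now >= self.away_deadline:
            self.away_deadline = None
            self._set(now, "away")


def simulate(forward_stt_events, away_timeout=5.0):
    """User speaks from t=3 to t=9; STT emits the boundary events."""
    tracker = UserStateTracker(away_timeout)
    stt_events = {3: "START_OF_SPEECH", 9: "END_OF_SPEECH"}
    for now in range(16):
        event = stt_events.get(now)
        if forward_stt_events and event == "START_OF_SPEECH":
            tracker.on_start_of_speech(now)
        elif forward_stt_events and event == "END_OF_SPEECH":
            tracker.on_end_of_speech(now)
        tracker.tick(now)
    return tracker.history


if __name__ == "__main__":
    print(simulate(forward_stt_events=False))  # [(5, 'away')]
    print(simulate(forward_stt_events=True))   # [(3, 'speaking'), (9, 'listening'), (14, 'away')]
```

When the STT events are dropped, the toy tracker marks the user "away" at t=5, in the middle of the utterance. When they are forwarded, "away" only appears after the post-speech silence exceeds the timeout.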
Root Cause
AudioRecognition._on_stt_event previously forwarded STT speech-boundary events into the user-state path only when the turn-detection mode was "stt". That meant STT START_OF_SPEECH/END_OF_SPEECH drove _hooks.on_start_of_speech(...)/_hooks.on_end_of_speech(...) only for turn_detection="stt".
When no external VAD is configured and turn detection is MultilingualModel(), _turn_detection_mode is not "stt", so STT speech-boundary events are ignored for user-state purposes.
The user-state machine is updated only through _hooks.on_start_of_speech(...)/_hooks.on_end_of_speech(...), which eventually update the session's user state. Without those calls, _user_away_timer is not cancelled when the user starts speaking.
Fix
Allow STT speech-boundary events to drive user-state transitions whenever no external VAD is configured.
The STT event handlers now run when either the turn-detection mode is "stt" or no external VAD is configured.
This lets STT-internal VAD act as the available speech activity source for user-state tracking when there is no external VAD.
At the same time, STT events are not given unconditional turn-commit authority. The commit path remains scoped to turn_detection_mode == "stt".
So for manual, model-based, or omitted turn detection with no external VAD, STT speech events update user state but do not commit the turn.
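The forwarding and commit rules described in this section can be sketched as two small predicates. This is a hedged illustration, not the code in AudioRecognition: the function names, the has_external_vad parameter, and the "model" label standing in for a non-"stt" mode such as MultilingualModel() are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def forwards_stt_speech_events(
    turn_detection_mode: Optional[str], has_external_vad: bool
) -> bool:
    """Whether STT START/END_OF_SPEECH may drive speaking/listening state.

    Assumed rule: forward when the mode is "stt", or when there is no
    external VAD to act as the speech-activity source.
    """
    return turn_detection_mode == "stt" or not has_external_vad


def stt_may_commit_turn(turn_detection_mode: Optional[str]) -> bool:
    """Whether an STT end-of-speech event may commit the user turn.

    Assumed rule: commit authority stays scoped to the "stt" mode.
    """
    return turn_detection_mode == "stt"


def behavior_matrix() -> List[Tuple[Optional[str], bool, bool, bool]]:
    """Rows of (mode, has_external_vad, forwards_events, may_commit)."""
    modes = ["stt", "model", "manual", None]  # None = turn detection omitted
    return [
        (
            mode,
            has_vad,
            forwards_stt_speech_events(mode, has_vad),
            stt_may_commit_turn(mode),
        )
        for mode in modes
        for has_vad in (False, True)
    ]


if __name__ == "__main__":
    for row in behavior_matrix():
        print(row)
```

Keeping the two decisions in separate predicates mirrors the intent stated above: widening the source of speech-activity signals should not widen who is allowed to end the turn.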
Behavior Matrix
[Table: speaking/listening user-state behavior for turn detection "stt", MultilingualModel(), and "manual"]
What This PR Does Not Include
This PR intentionally does not include the metrics-only STT EOS timestamp preservation fix. That belongs to the separate PR:
So this branch should not include:
- _stt_end_of_speech_received
- FINAL_TRANSCRIPT/PREFLIGHT_TRANSCRIPT fallback logic for EOU metrics
Manual Verification
Validated the important combinations manually:
- vad=None + turn_detection="stt"
  - START_SPEECH produced User State Changed: speaking
  - END_SPEECH produced User State Changed: listening
  - no "away" during speech
- vad=None + MultilingualModel()
  - speaking/listening
  - user_away_timeout
  - "away" did not fire during speech
  - "away" fired only after speech ended and the user was silent
- vad=None + turn_detection="manual"
  - speaking/listening
  - no "away" mid-utterance
- vad=None + turn_detection omitted
  - speaking/listening
  - no "away" mid-utterance
- vad=Silero + MultilingualModel()
  - no "away" mid-utterance
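A manual check like the ones above could be partly automated by scanning a captured session log. The helper below is a minimal sketch: it assumes each log line contains one of the markers quoted in this description (START_SPEECH, END_SPEECH, User State Changed: away), and the function name and sample lines are hypothetical.

```python
def away_fired_during_speech(log_lines):
    """Return True if an "away" transition appears between a START_SPEECH
    marker and the following END_SPEECH marker."""
    in_speech = False
    for line in log_lines:
        if "START_SPEECH" in line:
            in_speech = True
        elif "END_SPEECH" in line:
            in_speech = False
        elif "User State Changed: away" in line and in_speech:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical log excerpts, written only for this example.
    buggy_log = [
        "START_SPEECH",
        "User State Changed: away",
        "END_SPEECH",
    ]
    fixed_log = [
        "START_SPEECH",
        "User State Changed: speaking",
        "END_SPEECH",
        "User State Changed: listening",
        "User State Changed: away",
    ]
    print(away_fired_during_speech(buggy_log))
    print(away_fired_during_speech(fixed_log))
```

The check keys on the STT speech markers rather than on the reported user state, since in the buggy case the state never leaves "listening" while the user is speaking.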