Fix/voice stt speech events user state tracking #5797

Open
dhruvladia-sarvam wants to merge 6 commits into livekit:main from dhruvladia-sarvam:fix/voice-stt-speech-events-user-state-tracking

@dhruvladia-sarvam
Contributor

Problem

When no external VAD is configured but the STT provider emits speech-boundary events (SpeechEventType.START_OF_SPEECH / END_OF_SPEECH), those events reach the session user-state machine only when turn detection is "stt". In every other turn-detection mode they are dropped.

This causes AgentSession.user_state to remain "listening" while the user is actually speaking. As a result, user_away_timeout can fire mid-utterance and mark the user as "away" even though speech is ongoing.

This is especially visible with STT providers such as Sarvam that expose internal VAD signals over the STT stream.

Root Cause

AudioRecognition._on_stt_event previously only forwarded STT speech-boundary events into the user-state path when:

self._turn_detection_mode == "stt"

That meant STT START_OF_SPEECH / END_OF_SPEECH drove _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...) only for turn_detection="stt".

When no external VAD is configured and turn detection is:

  • model-based, e.g. MultilingualModel()
  • omitted / auto
  • manual

then _turn_detection_mode is not "stt", so STT speech-boundary events are ignored for user-state purposes.

The user-state machine is updated only through _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...), which eventually call:

self._session._update_user_state("speaking", ...)
self._session._update_user_state("listening", ...)

Without those calls, _user_away_timer is not cancelled when the user starts speaking.
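The effect on the away timer can be illustrated with a small, self-contained model. This is only a sketch: `ToyUserState`, `simulate`, and the 15-second timeout are hypothetical names and values chosen for illustration, not LiveKit code, and the model uses an explicit clock instead of real timers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyUserState:
    """Hypothetical stand-in for the session's user-state tracking (not LiveKit code)."""

    away_timeout: float
    state: str = "listening"
    away_deadline: Optional[float] = None  # None means the away timer is not armed

    def update(self, new_state: str, now: float) -> None:
        self.state = new_state
        if new_state == "listening":
            # Entering "listening" arms the away timer.
            self.away_deadline = now + self.away_timeout
        else:
            # Any other state (e.g. "speaking") cancels it.
            self.away_deadline = None

    def tick(self, now: float) -> None:
        # Fire the away transition once an armed deadline has passed.
        if self.away_deadline is not None and now >= self.away_deadline:
            self.state = "away"
            self.away_deadline = None


def simulate(forward_speech_events: bool) -> str:
    """The user starts talking at t=1s and keeps talking; return the state seen at t=16s."""
    user = ToyUserState(away_timeout=15.0)  # illustrative timeout value
    user.update("listening", now=0.0)  # timer armed, deadline at t=15s
    if forward_speech_events:
        # START_OF_SPEECH reaches the state machine and cancels the timer.
        user.update("speaking", now=1.0)
    user.tick(now=16.0)
    return user.state
```

In this model `simulate(False)` ends in "away" even though the user is still talking, while `simulate(True)` stays in "speaking", which mirrors the mid-utterance timeout described above.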

Fix

Allow STT speech-boundary events to drive user-state transitions whenever no external VAD is configured.

The STT event handlers now run when either:

self._turn_detection_mode == "stt"

or:

self._vad is None

This lets STT-internal VAD act as the available speech activity source for user-state tracking when there is no external VAD.

At the same time, STT events are not given unconditional turn-commit authority. The commit path remains scoped to turn_detection_mode == "stt":

if self._turn_detection_mode == "stt":
    self._user_turn_committed = True
    chat_ctx = self._hooks.retrieve_chat_ctx().copy()
    self._run_eou_detection(chat_ctx)

So for manual/model/omitted turn detection:

  • STT events update user state
  • STT events do not auto-commit turns
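The split between user-state updates and turn commits can be sketched as a toy class. Everything here is an assumption made for illustration: `ToyRecognition`, its method names, and the `log` list are hypothetical, and the ordering of the two steps inside the end-of-speech handler is not taken from the actual `AudioRecognition` code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyRecognition:
    """Hypothetical reduction of the STT event gate (not the LiveKit implementation)."""

    turn_detection_mode: Optional[str]  # e.g. "stt", "manual", or None when omitted
    vad: Optional[object]  # None means no external VAD is configured
    log: List[str] = field(default_factory=list)

    def _stt_drives_user_state(self) -> bool:
        # Gate after the fix: STT turn detection, or no external VAD available.
        return self.turn_detection_mode == "stt" or self.vad is None

    def on_stt_start_of_speech(self) -> None:
        if self._stt_drives_user_state():
            self.log.append("user_state=speaking")

    def on_stt_end_of_speech(self) -> None:
        if not self._stt_drives_user_state():
            return
        self.log.append("user_state=listening")
        # Turn-commit authority stays scoped to STT turn detection.
        if self.turn_detection_mode == "stt":
            self.log.append("turn_committed")


def run(mode: Optional[str], vad: Optional[object]) -> List[str]:
    """Feed one START_OF_SPEECH / END_OF_SPEECH pair through the toy gate."""
    rec = ToyRecognition(turn_detection_mode=mode, vad=vad)
    rec.on_stt_start_of_speech()
    rec.on_stt_end_of_speech()
    return rec.log
```

With no external VAD, `run("manual", None)` records only the two user-state transitions, whereas `run("stt", None)` also records a turn commit. With an external VAD object present and a non-"stt" mode, the toy gate records nothing, leaving user state to the VAD.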

Behavior Matrix

| External VAD | Turn detection | Before | After |
| --- | --- | --- | --- |
| Yes | any | External VAD drives speaking / listening | Same |
| No | "stt" | STT events drive user-state and turn commit | Same |
| No | model-based, e.g. MultilingualModel() | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; turn detector still controls turn commit |
| No | omitted / auto | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; existing turn handling remains unchanged |
| No | "manual" | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; manual commit remains required |

What This PR Does Not Include

This PR intentionally does not include the metrics-only STT EOS timestamp preservation fix. That belongs to the separate PR:

fix/preserve-stt-eos-timestamp-for-metrics

So this branch should not include:

  • _stt_end_of_speech_received
  • STT EOS timing tests
  • changes to FINAL_TRANSCRIPT / PREFLIGHT_TRANSCRIPT fallback logic for EOU metrics

Manual Verification

Validated the important combinations manually:

vad=None + turn_detection="stt"

  • STT START_SPEECH produced User State Changed: speaking
  • STT END_SPEECH produced User State Changed: listening
  • Long speech did not trigger away during speech

vad=None + MultilingualModel()

  • STT internal VAD drove speaking / listening
  • User spoke for ~20s, longer than user_away_timeout
  • away did not fire during speech
  • away fired only after speech ended and the user was silent

vad=None + turn_detection="manual"

  • STT internal VAD drove speaking / listening
  • Long speech did not trigger away mid-utterance
  • No automatic turn commit / agent response occurred; manual behavior preserved

vad=None + turn_detection omitted

  • STT internal VAD drove speaking / listening
  • Long speech did not trigger away mid-utterance
  • Away sequence cancellation worked when the user spoke again

vad=Silero + MultilingualModel()

  • Existing external VAD behavior remained healthy
  • External VAD continued driving user-state transitions
  • Long speech did not trigger away mid-utterance
  • No obvious regression from STT event gate changes

dhruvladia-sarvam and others added 2 commits May 29, 2026 14:06
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>