Fix/voice stt speech events user state tracking by dhruvladia-sarvam · Pull Request #5797 · livekit/agents

dhruvladia-sarvam · 2026-05-21T09:22:50Z

Problem

When no external VAD is configured but the STT provider emits speech-boundary events (SpeechEventType.START_OF_SPEECH / END_OF_SPEECH), those events reach the session user-state machine only when turn_detection="stt". In every other turn-detection mode they are dropped.

This causes AgentSession.user_state to remain "listening" while the user is actually speaking. As a result, user_away_timeout can fire mid-utterance and mark the user as "away" even though speech is ongoing.

This is especially visible with STT providers such as Sarvam that expose internal VAD signals over the STT stream.

Root Cause

AudioRecognition._on_stt_event previously only forwarded STT speech-boundary events into the user-state path when:

self._turn_detection_mode == "stt"

That meant STT START_OF_SPEECH / END_OF_SPEECH drove _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...) only for turn_detection="stt".

When no external VAD is configured and turn detection is:

model-based, e.g. MultilingualModel()
omitted / auto
manual

then _turn_detection_mode is not "stt", so STT speech-boundary events are ignored for user-state purposes.

The user-state machine is updated only through _hooks.on_start_of_speech(...) / _hooks.on_end_of_speech(...), which eventually call:

self._session._update_user_state("speaking", ...)
self._session._update_user_state("listening", ...)

Without those calls, _user_away_timer is not cancelled when the user starts speaking.
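The dependency between the user-state update and the away timer can be illustrated with a small model. This is a simplified sketch, not the session's actual code: `FakeTimer` and `UserStateTracker` are hypothetical names, and the only behavior taken from the description above is that a transition to "speaking" cancels the pending away timer.

```python
from typing import Optional


class FakeTimer:
    """Hypothetical stand-in for a scheduled timer handle; it only records cancellation."""

    def __init__(self) -> None:
        self.cancelled = False

    def cancel(self) -> None:
        self.cancelled = True


class UserStateTracker:
    """Simplified model: an away timer is armed while the user is 'listening'."""

    def __init__(self) -> None:
        self.user_state = "listening"
        self.away_timer: Optional[FakeTimer] = FakeTimer()

    def update_user_state(self, state: str) -> None:
        if state == "speaking" and self.away_timer is not None:
            # Speech started: the pending away timer must not fire.
            self.away_timer.cancel()
            self.away_timer = None
        elif state == "listening" and self.away_timer is None:
            # Speech ended: start counting silence again.
            self.away_timer = FakeTimer()
        self.user_state = state


tracker = UserStateTracker()
pending = tracker.away_timer

# If the STT START_OF_SPEECH event is dropped, update_user_state is never
# called, so the timer stays armed and can fire mid-utterance.
assert pending is not None and not pending.cancelled

# Once the event is forwarded, the transition cancels the timer.
tracker.update_user_state("speaking")
assert pending.cancelled
```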

Fix

Allow STT speech-boundary events to drive user-state transitions whenever no external VAD is configured.

The STT event handlers now run when either:

self._turn_detection_mode == "stt"

or:

self._vad is None

This lets STT-internal VAD act as the available speech activity source for user-state tracking when there is no external VAD.
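The change to the gate can be written as two standalone predicates. This is a sketch for illustration only: the function names are hypothetical, and the mode labels other than "stt" and "manual" (here `"model"` and `None` for omitted) are assumptions about how the non-STT modes might be represented.

```python
from typing import Optional


def forwards_stt_speech_events_before(mode: Optional[str], has_external_vad: bool) -> bool:
    """Old gate: only turn_detection="stt" lets STT speech events through."""
    return mode == "stt"


def forwards_stt_speech_events_after(mode: Optional[str], has_external_vad: bool) -> bool:
    """New gate: also let them through when there is no external VAD."""
    return mode == "stt" or not has_external_vad


# Without an external VAD, every mode now forwards STT speech events.
for mode in ("stt", "model", "manual", None):
    assert forwards_stt_speech_events_after(mode, has_external_vad=False)

# Before the change, only "stt" did.
assert forwards_stt_speech_events_before("stt", has_external_vad=False)
assert not forwards_stt_speech_events_before("model", has_external_vad=False)

# With an external VAD, the gate is unchanged for non-"stt" modes.
assert not forwards_stt_speech_events_after("model", has_external_vad=True)
```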

At the same time, STT events are not given unconditional turn-commit authority. The commit path remains scoped to turn_detection_mode == "stt":

if self._turn_detection_mode == "stt":
    self._user_turn_committed = True
    chat_ctx = self._hooks.retrieve_chat_ctx().copy()
    self._run_eou_detection(chat_ctx)

So for manual/model/omitted turn detection:

STT events update user state
STT events do not auto-commit turns
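The split between user-state updates and turn commits can be sketched as a single handler. This is a hedged illustration, not the actual `AudioRecognition` code: `Recorder` and `on_stt_end_of_speech` are hypothetical names, and the recorder simply counts what the real hooks would do.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Recorder:
    """Hypothetical sink that records the side effects of the handler."""

    user_states: List[str] = field(default_factory=list)
    commits: int = 0


def on_stt_end_of_speech(mode: Optional[str], has_external_vad: bool, rec: Recorder) -> None:
    # Gate: STT speech events count only in "stt" mode or without an external VAD.
    if not (mode == "stt" or not has_external_vad):
        return

    # User-state transition happens for every mode that passes the gate.
    rec.user_states.append("listening")

    # Turn commit stays scoped to turn_detection="stt".
    if mode == "stt":
        rec.commits += 1


manual = Recorder()
on_stt_end_of_speech("manual", has_external_vad=False, rec=manual)
assert manual.user_states == ["listening"]
assert manual.commits == 0  # manual commit is still required

stt = Recorder()
on_stt_end_of_speech("stt", has_external_vad=False, rec=stt)
assert stt.user_states == ["listening"]
assert stt.commits == 1
```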

Behavior Matrix

| External VAD | Turn detection | Before | After |
| --- | --- | --- | --- |
| Yes | any | External VAD drives `speaking` / `listening` | Same |
| No | `"stt"` | STT events drive user state and turn commit | Same |
| No | model-based, e.g. `MultilingualModel()` | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; turn detector still controls turn commit |
| No | omitted / auto | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; existing turn handling remains unchanged |
| No | `"manual"` | STT speech events ignored for user state; away timer can fire mid-speech | STT events drive user state; manual commit remains required |

What This PR Does Not Include

This PR intentionally does not include the metrics-only STT EOS timestamp preservation fix. That belongs to the separate PR:

fix/preserve-stt-eos-timestamp-for-metrics

So this branch should not include:

_stt_end_of_speech_received
STT EOS timing tests
changes to FINAL_TRANSCRIPT / PREFLIGHT_TRANSCRIPT fallback logic for EOU metrics

Manual Verification

Validated the important combinations manually:

`vad=None + turn_detection="stt"`

STT START_SPEECH produced User State Changed: speaking
STT END_SPEECH produced User State Changed: listening
Long speech did not trigger away during speech

`vad=None + MultilingualModel()`

STT internal VAD drove speaking / listening
User spoke for ~20s, longer than user_away_timeout
away did not fire during speech
away fired only after speech ended and the user was silent

`vad=None + turn_detection="manual"`

STT internal VAD drove speaking / listening
Long speech did not trigger away mid-utterance
No automatic turn commit / agent response occurred; manual behavior preserved

`vad=None + turn_detection omitted`

STT internal VAD drove speaking / listening
Long speech did not trigger away mid-utterance
Away sequence cancellation worked when the user spoke again

`vad=Silero + MultilingualModel()`

Existing external VAD behavior remained healthy
External VAD continued driving user-state transitions
Long speech did not trigger away mid-utterance
No obvious regression from STT event gate changes

Co-authored-by: Cursor <cursoragent@cursor.com>

dhruvladia-sarvam added 3 commits May 18, 2026 07:35

initial

77c064e

ruff fix

05075f0

initial

b0164a9

dhruvladia-sarvam and others added 2 commits May 29, 2026 14:06

fix(voice): keep user-state branch focused

55cd481

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge upstream/main into user-state tracking branch

a340683

Co-authored-by: Cursor <cursoragent@cursor.com>

fix(voice): run EOU after STT speech end without external VAD

4c051a2

Co-authored-by: Cursor <cursoragent@cursor.com>

dhruvladia-sarvam wants to merge 6 commits into livekit:main from dhruvladia-sarvam:fix/voice-stt-speech-events-user-state-tracking