This is an automated archive made by the Lemmit Bot.
The original was posted on /r/homeassistant by /u/Leather_Idea_2122 on 2026-06-12 10:56:24+00:00.
Hey all, HA Voice PE user here running local STT on an N100.
I've been frustrated with the long delay between finishing a sentence and hearing the assistant respond, and tracked it down to how wyoming-faster-whisper works: it buffers your entire spoken utterance into a WAV file and starts inference only after you stop talking.
I added streaming ASR support using sherpa-onnx's OnlineRecognizer. The model now decodes audio chunks as they arrive, so on my N100 most of the inference is already done by the time I stop speaking.
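For anyone wondering what "decoding chunks as they arrive" looks like in code, here is a rough sketch of the loop. It is not the fork's actual code. The method names (create_stream, accept_waveform, is_ready, decode_stream, input_finished, get_result) follow sherpa-onnx's Python streaming API as I understand it; FakeRecognizer and FakeStream are hypothetical stand-ins so the snippet runs without a model, and the 16 kHz sample rate and 100 ms chunks are assumptions.

```python
from typing import Iterable, List, Sequence


def transcribe_streaming(recognizer, chunks: Iterable[Sequence[float]],
                         sample_rate: int = 16000) -> str:
    """Feed audio chunk by chunk and decode whenever the recognizer is ready."""
    stream = recognizer.create_stream()
    for samples in chunks:
        stream.accept_waveform(sample_rate, samples)
        # Decode while the user is still talking instead of waiting for the end.
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    # End of speech: only the tail of the audio is left to decode.
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)


class FakeStream:
    """Hypothetical stand-in for a recognizer stream; just queues chunks."""

    def __init__(self) -> None:
        self.pending: List[Sequence[float]] = []
        self.decoded = 0
        self.finished = False

    def accept_waveform(self, sample_rate: int, samples: Sequence[float]) -> None:
        self.pending.append(samples)

    def input_finished(self) -> None:
        self.finished = True


class FakeRecognizer:
    """Hypothetical stand-in; counts decoded chunks instead of recognizing speech."""

    def create_stream(self) -> FakeStream:
        return FakeStream()

    def is_ready(self, stream: FakeStream) -> bool:
        return bool(stream.pending)

    def decode_stream(self, stream: FakeStream) -> None:
        stream.pending.pop(0)
        stream.decoded += 1

    def get_result(self, stream: FakeStream) -> str:
        return f"{stream.decoded} chunks decoded"


# 30 chunks of 1600 samples = 3 s of silence at the assumed 16 kHz.
silence = [[0.0] * 1600 for _ in range(30)]
print(transcribe_streaming(FakeRecognizer(), silence))  # 30 chunks decoded
```

With a real recognizer in place of the stand-in, the same loop would be driven by the audio chunks coming in from the voice satellite.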
In day-to-day use it makes a real difference, and the assistant feels much more responsive. HA Assist debug now typically reports an STT time of only 0-0.5 s. Previously, processing took about twice the length of the recorded audio after I stopped speaking (a 3 s spoken command meant 6 s of processing before it even reached the LLM/local processing).
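The numbers above can be reproduced with a toy latency model. This is a back-of-the-envelope sketch, not a benchmark. The real-time factor (RTF, processing time divided by audio length) of 2.0 for the buffered case comes from the 3 s -> 6 s figure; the RTF of 0.5 for the streaming model and the 100 ms chunk size are assumed values for illustration, and both function names are made up.

```python
def buffered_wait_ms(audio_ms: int, rtf: float) -> float:
    """Wait after end of speech when inference starts only once speech ends."""
    return audio_ms * rtf


def streaming_wait_ms(audio_ms: int, rtf: float, chunk_ms: int = 100) -> float:
    """Wait after end of speech when each chunk is decoded as it arrives."""
    decoder_free_at = 0.0  # time at which the decoder finishes its current chunk
    arrived = 0            # milliseconds of audio received so far
    while arrived < audio_ms:
        step = min(chunk_ms, audio_ms - arrived)
        arrived += step
        # A chunk can start decoding once it has arrived and the decoder is idle.
        start = max(float(arrived), decoder_free_at)
        decoder_free_at = start + step * rtf
    return decoder_free_at - audio_ms


print(buffered_wait_ms(3000, 2.0))   # 6000.0
print(streaming_wait_ms(3000, 0.5))  # 50.0
```

In this model streaming only hides the inference time if the model keeps up with real time (RTF below 1). At an RTF of 2.0 the streaming decoder would fall behind and still leave about 3.1 s of work after the last chunk.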
To try it:
Pull the Docker image:
docker pull ghcr.io/pkrahmer/wyoming-faster-whisper:latest
Run it with the streaming English model:
--stt-library sherpa --model sherpa-onnx-streaming-zipformer-en-2023-06-26 --language en
I use this German model at home:
--stt-library sherpa --model sherpa-onnx-streaming-zipformer-de-kroko-2025-08-06 --language de
Fork and details: https://github.com/pkrahmer/wyoming-faster-whisper
Happy to answer questions. Would love to know if others notice the same improvement.