For converting your spoken words into text, it taps into OpenAI’s Whisper model, an automatic speech recognition system renowned for its accuracy and ability to handle various accents and background noise.
Have the hardware requirements of Whisper dropped significantly over the last few months? I played around with it in the context of Home Assistant's Year of the Voice. Despite running it on a (4-year-old) ThinkPad with 32 GB of RAM and a 4-core (8-thread) i7, the accuracy and performance of Whisper were still not at a point where I'd put it to productive use.
A rather simple sentence like 'turn the light in the living room on' was recognized correctly in maybe 70% of cases when I sat right next to the microphone with no background noise. With music playing in the background or other people talking at the same time, accuracy dropped to ~25%.
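For anyone who wants to put a number on this for their own setup, here is a rough sketch of how a per-command success rate could be tallied from a batch of transcripts. It does not call Whisper itself; the helper names (`normalize`, `command_success_rate`) and the sample transcripts are made up for illustration, and "success" is assumed to mean an exact match after stripping case and punctuation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def command_success_rate(expected: str, transcripts: list[str]) -> float:
    """Fraction of transcripts equal to the expected command after normalization."""
    if not transcripts:
        return 0.0
    target = normalize(expected)
    hits = sum(1 for t in transcripts if normalize(t) == target)
    return hits / len(transcripts)


# Hypothetical transcripts, invented for illustration only.
sample = [
    "Turn the light in the living room on.",
    "turn the light in the living room on",
    "Turn the night in the living room on",
    "Turn the light in the living room on!",
]

rate = command_success_rate("turn the light in the living room on", sample)
print(f"{rate:.0%}")  # 3 of 4 match, so this prints 75%
```

Exact matching is a strict criterion; a voice assistant's intent parser may tolerate small transcription errors, so the real-world hit rate could differ from what this counts.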
If it now runs just fine on a Raspberry Pi Zero, that would be a massive improvement!