Why hallucinations happen
Speech-to-text models are trained to predict text from audio. When the audio signal is unclear because of silence, noise, or too little speech, the model still produces output, defaulting to statistically likely words rather than returning nothing. Three situations make this especially likely:
- Silence or long pauses: Some models, particularly local ones, fill silence with filler text rather than stopping.
- Background noise: Noise masks your voice and lowers the model’s confidence in every word.
- Short recordings with too little acoustic context: Auto-detect language needs enough speech to make a reliable guess; short clips may produce gibberish or the wrong language entirely.
How to reduce hallucinations
1. Enable Voice Activity Detection (VAD)
VAD is the most direct fix for silence-induced hallucinations. It analyzes your audio after you stop recording, strips leading and trailing silence, and passes only the speech-containing portion to the transcription provider.
Go to Settings → Sound and turn on Remove silence before transcription.
What happens when it’s on:
- HyperWhisper runs the Silero VAD model on the recorded audio.
- Leading and trailing silence is detected and stripped from the file.
- The trimmed audio is sent to the transcription provider instead of the original.
- For large WAV recordings (≥ 25 MB), the trimmed file is also converted to M4A before upload to reduce transfer time and API costs.
Conditions and safeguards:
- Only applies to recordings of 30 seconds or longer. Shorter recordings have minimal silence and the processing overhead isn’t worth it.
- Skips M4A conversion if your original file was already a compressed format (M4A, MP3) to avoid re-encoding artifacts.
- Falls back silently to the original audio if the trimmed result is too short (under 0.3 s of speech remaining) or too small (under 5 KB), or if less than 0.5 s of silence was found.
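The trimming rules above can be sketched roughly as follows. This is a minimal illustration, not HyperWhisper's actual code: `plan_trim` and `TrimPlan` are hypothetical names, the speech segments are assumed to be `(start_s, end_s)` pairs of the kind a VAD model reports, "25 MB" is assumed to mean 25 × 1024 × 1024 bytes, and the size check is assumed to apply to the original recording.

```python
from dataclasses import dataclass

MIN_DURATION_S = 30.0               # shorter recordings skip VAD entirely
LARGE_WAV_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" means 25 MiB


@dataclass
class TrimPlan:
    run_vad: bool         # whether silence trimming is attempted at all
    start_s: float        # start of the kept audio, in seconds
    end_s: float          # end of the kept audio, in seconds
    convert_to_m4a: bool  # whether the trimmed file is re-encoded before upload


def plan_trim(duration_s, size_bytes, extension, speech_segments):
    """Decide what to keep, given (start_s, end_s) speech segments from a VAD."""
    if duration_s < MIN_DURATION_S:
        # Too short to be worth processing: keep the original untouched.
        return TrimPlan(False, 0.0, duration_s, False)
    if not speech_segments:
        # How the app reports "no speech" is outside the scope of this sketch.
        raise ValueError("no speech segments supplied")

    # Only leading and trailing silence is stripped; pauses in between are kept.
    start_s = min(start for start, _ in speech_segments)
    end_s = max(end for _, end in speech_segments)

    # Only large WAV files are converted; M4A and MP3 originals are never re-encoded.
    is_large_wav = extension.lower() == ".wav" and size_bytes >= LARGE_WAV_BYTES
    return TrimPlan(True, start_s, end_s, is_large_wav)
```

For example, a 45-second, 30 MiB WAV file whose speech runs from 2.0 s to 41.0 s would be trimmed to that span and flagged for M4A conversion, while a 12-second clip would be passed through unchanged.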
2. Specify your language — don’t rely on auto-detect for short recordings
When language is set to auto-detect, the model needs enough speech to identify which language you’re speaking. For recordings under about 10–15 seconds, it may not have enough signal, and the result can be gibberish, a different language entirely, or empty output.
Fix: Create a dedicated mode for each language you use regularly, and select it before recording. All providers support per-language selection and produce more accurate results with an explicit setting. See Transcription Modes for how to set up and switch between modes quickly.
3. Add custom vocabulary
When you speak uncommon terms (product names, technical jargon, acronyms, colleague names), the model may substitute something phonetically similar but wrong. Vocabulary entries tell the model what terms to expect. How entries are applied depends on the provider:
- Deepgram: sent as keyword boosting parameters (up to 100 terms; a warning appears if you exceed this limit).
- ElevenLabs Scribe v2: sent as keyterms (up to 100 terms; terms longer than 50 characters are dropped).
- OpenAI Whisper, HyperWhisper Cloud: sent as prompt vocabulary.
- Soniox: sent as context vocabulary (terms joined into a single comma-separated string).
- AssemblyAI, Gemini: sent as key-term context.
- Local models on macOS (Parakeet, Nemotron) and local Parakeet on Windows: matched phonetically using the Beider-Morse algorithm. Terms of two characters or fewer are excluded to avoid false positives.
- All providers: all vocabulary entries are injected into the AI post-processing system prompt, which helps clean up errors the raw transcript missed.
Good candidates for vocabulary entries:
- Brand names and product names with unusual spellings
- Colleague and client names
- Technical terms, abbreviations, and domain jargon (e.g., Kubernetes, ETA, SCRUM)
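The per-provider limits listed above can be illustrated with a small sketch. `prepare_vocabulary` and the provider keys are hypothetical names chosen for this example, not part of any provider SDK. Two details are assumptions rather than documented behavior: extra Deepgram terms are truncated here (the text only says a warning appears), and the Soniox string is joined with a comma and a space.

```python
DEEPGRAM_MAX_TERMS = 100
ELEVENLABS_MAX_TERMS = 100
ELEVENLABS_MAX_CHARS = 50
LOCAL_MIN_CHARS = 3  # terms of two characters or fewer are excluded


def prepare_vocabulary(terms, provider):
    """Shape a vocabulary list for one provider; returns (payload, warning)."""
    terms = [t.strip() for t in terms if t.strip()]
    if provider == "deepgram":
        warning = "over 100 terms" if len(terms) > DEEPGRAM_MAX_TERMS else None
        # Assumption: extra terms are truncated; the docs only mention a warning.
        return terms[:DEEPGRAM_MAX_TERMS], warning
    if provider == "elevenlabs":
        # Terms longer than 50 characters are dropped before the 100-term cap.
        kept = [t for t in terms if len(t) <= ELEVENLABS_MAX_CHARS]
        return kept[:ELEVENLABS_MAX_TERMS], None
    if provider == "soniox":
        # Assumption: the separator is a comma followed by a space.
        return ", ".join(terms), None
    if provider == "local":
        # Very short terms are excluded to avoid phonetic false positives.
        return [t for t in terms if len(t) >= LOCAL_MIN_CHARS], None
    return terms, None  # other providers: passed through unchanged in this sketch
```

With the list `["Kubernetes", "ETA", "QA", "SCRUM"]`, the `"local"` branch would drop `"QA"` because it has only two characters, while the other branches would keep all four terms.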
4. Use a dedicated microphone
Every transcription provider, cloud and local, degrades with background noise. Built-in laptop microphones pick up keyboard typing, fans, and ambient sound, all of which reduce the clarity of your voice signal and increase hallucination risk. Practical improvements:
- Position the microphone 6–12 inches from your mouth.
- Use a USB microphone (such as a Blue Yeti or Audio-Technica ATR2100x) or a headset microphone.
- Check your system input level: it should reach roughly 50–75% on normal speech without clipping.
Turn on Automatically increase microphone volume in Settings → Sound to set the microphone input volume to 90% automatically when recording starts.
Turn on Keep microphone warm between recordings in Settings → Sound if you use Bluetooth microphones. It keeps the audio session open to reduce connection lag on each push-to-talk press.
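If you want to check the 50–75% input-level guideline against a test recording rather than the system meter, a rough sketch follows. It assumes 16-bit PCM samples and treats "level" as peak amplitude relative to full scale, which may not match exactly how your operating system's meter is computed. The function names and the 99% clipping cutoff are illustrative choices, not HyperWhisper settings.

```python
FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def peak_level_percent(samples):
    """Peak amplitude as a percentage of full scale for 16-bit PCM samples."""
    if not samples:
        return 0.0
    peak = max(abs(s) for s in samples)
    # abs(-32768) exceeds FULL_SCALE by one, so cap the result at 100%.
    return min(peak, FULL_SCALE) / FULL_SCALE * 100.0


def describe_level(percent):
    """Classify a peak level against the rough 50-75% guideline."""
    if percent >= 99.0:  # assumption: treat near-full-scale peaks as clipping
        return "clipping"
    if percent > 75.0:
        return "hot"
    if percent >= 50.0:
        return "ok"
    return "low"
```

A recording whose loudest sample is around 20000 would land near 61% and be classified as "ok"; one that peaks at full scale would be flagged as clipping.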
5. Choose the right model for noisy conditions
Some models handle noise and accents better than others:
- Cloud models (ElevenLabs Scribe v2, Deepgram Nova-3, Grok STT): generally more robust to background noise and accents than local models.
- Local models (Parakeet, Nemotron, Whisper): faster and private, but more sensitive to noise. Nemotron supports 40+ languages offline and tends to outperform the Whisper family for multilingual use.
- Larger models: within any family, larger models produce fewer errors but require more processing time and memory.
6. Optimize your recording environment
Even small environment changes help:
- Close windows to reduce wind noise.
- Turn off fans and air conditioning during dictation.
- Avoid typing while recording.
- Record in rooms with carpet or soft furnishings to reduce echo.
- Mute system notifications and media playback before you start.
Settings → Sound → Media control while recording can pause or mute system audio automatically when you start recording, which prevents music or video from bleeding into your dictation.
Validation and fallback behavior
HyperWhisper’s VAD processing is designed to never make things worse. If any step of the process fails or produces a result that doesn’t meet quality thresholds, it silently falls back to the original audio:
| Check | Threshold | What happens if it fails |
|---|---|---|
| Silence removed | > 0.5 s | Falls back to original (trimming wasn’t worth it) |
| Trimmed duration | ≥ 0.3 s | Falls back to original (VAD over-trimmed) |
| Trimmed file size | > 5 KB | Falls back to original (file has no usable content) |
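The three checks in the table can be read as a simple decision function. The sketch below is one possible reading, not the app's implementation: `choose_audio` is a hypothetical name, "silence removed" is assumed to be the difference between the original and trimmed durations, and "5 KB" is assumed to mean 5 × 1024 bytes.

```python
MIN_SILENCE_REMOVED_S = 0.5   # must remove more than this much silence
MIN_TRIMMED_DURATION_S = 0.3  # trimmed audio must be at least this long
MIN_TRIMMED_BYTES = 5 * 1024  # assumption: "5 KB" means 5 * 1024 bytes


def choose_audio(original_s, trimmed_s, trimmed_bytes):
    """Return ("trimmed", None) or ("original", reason), following the table."""
    # Assumption: silence removed = original duration minus trimmed duration.
    if original_s - trimmed_s <= MIN_SILENCE_REMOVED_S:
        return "original", "trimming was not worth it"
    if trimmed_s < MIN_TRIMMED_DURATION_S:
        return "original", "VAD over-trimmed"
    if trimmed_bytes <= MIN_TRIMMED_BYTES:
        return "original", "file has no usable content"
    return "trimmed", None
```

Every failing branch returns the original audio, which mirrors the guarantee that VAD never produces a worse input than the untouched recording.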
Testing your setup
Record in a quiet environment
Record at least 30 seconds of natural speech in a quiet room. Include a few seconds of silence at the start and end.
Enable VAD and transcribe
On macOS, make sure Remove silence before transcription is on in Settings → Sound, then transcribe the recording.
Check the result
If the transcript is clean, VAD is working. If you still see hallucinated filler text, compare the transcript timing against your recording to identify whether it appears at the silence points.
If you consistently see “No speech detected” after enabling VAD, the audio contains too much noise relative to speech. Improve the recording environment or raise your microphone input level before trying again.
