Voice Activity Detection (VAD) analyzes your recording and removes silence before sending audio to your transcription provider. Shorter audio means lower API costs, faster results, and — in many cases — better accuracy.

Overview

When you finish recording, HyperWhisper can detect which portions of the audio contain speech. Silence is stripped out, producing a trimmed version of your audio that goes to the transcription provider instead of the full original. VAD is disabled by default on macOS. You opt in through Settings.

How It Works

VAD uses the Silero VAD model, bundled with the macOS app (~864 KB). Processing happens entirely on your device. VAD runs only when both of these conditions are met:
  • VAD is enabled in Settings, and
  • The recording is at least 30 seconds long
Recordings under 30 seconds are sent to the provider as-is, because they’re short enough that silence trimming offers little benefit.

Processing flow:
1. Detect speech segments

The Silero VAD model scans your audio and identifies which portions contain speech.
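
As a rough illustration of this step, the sketch below turns per-frame speech probabilities into speech segments. It is not HyperWhisper's code. The frame size (512 samples at 16 kHz), the 0.5 threshold, and the function name `frames_to_segments` are all assumptions made for the example, and a production detector would typically add smoothing and minimum-duration rules.

```python
from typing import List, Tuple

# Assumed for illustration: 512-sample frames at 16 kHz (32 ms per frame).
FRAME_SECONDS = 512 / 16000
# Hypothetical cutoff; not a documented HyperWhisper setting.
SPEECH_THRESHOLD = 0.5


def frames_to_segments(probs: List[float]) -> List[Tuple[float, float]]:
    """Group consecutive frames at or above the threshold into (start, end) times in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the segment currently open
    for i, p in enumerate(probs):
        if p >= SPEECH_THRESHOLD and start is None:
            start = i  # speech begins
        elif p < SPEECH_THRESHOLD and start is not None:
            segments.append((start * FRAME_SECONDS, i * FRAME_SECONDS))
            start = None  # speech ends
    if start is not None:
        # Audio ended while speech was still active.
        segments.append((start * FRAME_SECONDS, len(probs) * FRAME_SECONDS))
    return segments
```

Each returned pair marks where a run of speech frames starts and ends; everything outside those pairs is treated as silence.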
2. Remove silence

All detected speech segments are extracted and concatenated into a trimmed audio file. Any gap between segments larger than 200 ms is removed — this includes leading and trailing silence as well as mid-recording pauses.
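
The 200 ms rule can be sketched as follows: segments separated by a short gap are merged so the pause is kept, and everything else outside the segments is dropped. This is a minimal sketch, not the app's implementation. The function names are hypothetical, and it assumes that all leading and trailing silence is dropped regardless of length.

```python
from typing import List, Sequence, Tuple

MAX_KEPT_GAP_S = 0.2  # from the text: gaps larger than 200 ms are removed


def merge_close_segments(segments: Sequence[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Merge segments whose separating gap is 200 ms or less, so short pauses survive."""
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= MAX_KEPT_GAP_S:
            # Short gap: extend the previous segment across it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def trim_silence(samples: Sequence[float],
                 segments: Sequence[Tuple[float, float]],
                 sample_rate: int) -> List[float]:
    """Concatenate the samples that fall inside the merged speech segments."""
    trimmed: List[float] = []
    for start, end in merge_close_segments(segments):
        trimmed.extend(samples[int(start * sample_rate):int(end * sample_rate)])
    return trimmed
```

For example, segments at 1.0–2.0 s and 2.1–3.0 s are joined into one 1.0–3.0 s segment, while a segment starting at 5.0 s stays separate and the two seconds before it are cut.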
3. Validate the result

HyperWhisper checks that the trimmed file meets minimum quality thresholds (see Validation & Quality Checks below). If it doesn’t, the original audio is used instead.
4. Prepare for upload

If the trimmed file is 25 MB or larger, it is converted to M4A to reduce upload size. Files under 25 MB are sent as WAV. Imported files that were already in a compressed format (M4A, MP3, etc.) skip re-encoding to avoid quality loss.
The original audio is always kept on disk. The trimmed version is stored alongside it and used only for transcription and playback — the original is never overwritten.
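
The upload decision in the last step can be expressed as a small function. This is a sketch under stated assumptions: 1 MB is taken as 1024 × 1024 bytes, the compressed-format check is assumed to take precedence over the size check, only the two formats named in the text are listed, and `upload_action` and its return labels are hypothetical.

```python
M4A_THRESHOLD_BYTES = 25 * 1024 * 1024  # assumed: 1 MB = 1024 * 1024 bytes
# The text lists "M4A, MP3, etc."; only the two named formats are included here.
COMPRESSED_EXTENSIONS = {"m4a", "mp3"}


def upload_action(trimmed_size_bytes: int, source_extension: str) -> str:
    """Return a label describing how the audio is prepared for upload."""
    if source_extension.lower().lstrip(".") in COMPRESSED_EXTENSIONS:
        # Imported file was already compressed: avoid a second lossy encode.
        return "skip-reencode"
    if trimmed_size_bytes >= M4A_THRESHOLD_BYTES:
        # 25 MB or larger: convert to M4A to reduce upload size.
        return "convert-to-m4a"
    return "send-wav"
```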

Benefits

  • Lower API costs — you’re billed for less audio when silence is removed.
  • Faster transcription — smaller files upload and process more quickly.
  • Potentially better accuracy — less background noise and silence for the model to work through.
  • Useful for any recording with pauses — interviews, dictation with thinking gaps, or recordings started early and stopped late.

Enabling VAD

  1. Open Settings (click the menu bar icon → Settings).
  2. Go to the Sound section.
  3. Turn on Remove silence before transcription.
The toggle takes effect immediately — your next recording will use VAD trimming if it meets the minimum duration.

Viewing Original vs. Trimmed Audio

When a recording was trimmed, the History detail view shows an Original / Trimmed toggle above the audio player. Select Trimmed to hear the version with silence removed, or Original to hear the full recording.

Both files are stored on disk. Switching the toggle changes only which one plays; nothing is deleted.

See Viewing Transcription History for more about the audio player.

Validation & Quality Checks

After trimming, HyperWhisper validates the result before using it. If any check fails, the original audio is used for transcription automatically — you will not see an error.
| Check | Threshold | Why |
| --- | --- | --- |
| Silence removed | More than 0.5 seconds | Avoids unnecessary file duplication when there is almost no silence |
| Trimmed duration | At least 0.3 seconds | Guards against over-aggressive trimming that would remove actual speech |
| Trimmed file size | More than 5 KB | Ensures the output file contains real audio content, not just a WAV header |
Failures during VAD processing are logged as breadcrumbs to Sentry to help diagnose edge cases. The original audio is always the fallback, so a VAD failure never blocks transcription.
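
The three checks combine into a single yes/no decision, sketched below. The thresholds come from the list of checks above; the function name `use_trimmed_audio` is hypothetical, and 1 KB is assumed to mean 1024 bytes.

```python
MIN_SILENCE_REMOVED_S = 0.5        # silence removed must be more than this
MIN_TRIMMED_DURATION_S = 0.3       # trimmed audio must be at least this long
MIN_TRIMMED_SIZE_BYTES = 5 * 1024  # assumed: 1 KB = 1024 bytes


def use_trimmed_audio(original_s: float, trimmed_s: float, trimmed_size_bytes: int) -> bool:
    """Return True only if the trimmed file passes all three checks."""
    silence_removed_s = original_s - trimmed_s
    return (
        silence_removed_s > MIN_SILENCE_REMOVED_S
        and trimmed_s >= MIN_TRIMMED_DURATION_S
        and trimmed_size_bytes > MIN_TRIMMED_SIZE_BYTES
    )
```

When the function returns False, the caller would fall back to the original recording, matching the behavior described above.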

Platform Support

| Platform | Status |
| --- | --- |
| macOS | Fully supported; enable in Settings → Sound → Remove silence before transcription |
| Windows | No silence trimming; no user-facing toggle |
| iOS | Amplitude-based leading/trailing silence trimming runs automatically on every recording |
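
For contrast with model-based VAD, amplitude-based leading/trailing trimming can be sketched in a few lines. This illustrates the general technique only; it is not the iOS app's code. The 0.01 threshold, the function name `trim_edges`, and the choice to return an empty list for an all-silent input are assumptions.

```python
from typing import List, Sequence


def trim_edges(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose absolute amplitude is below the threshold.

    Quiet stretches in the middle of the recording are left untouched.
    """
    # Index of the first sample loud enough to count as sound.
    first = next((i for i, s in enumerate(samples) if abs(s) >= threshold), None)
    if first is None:
        return []  # assumed behavior: nothing exceeds the threshold
    # Index of the last such sample, found by scanning backwards.
    last = next(i for i in range(len(samples) - 1, -1, -1) if abs(samples[i]) >= threshold)
    return list(samples[first:last + 1])
```

Unlike the macOS flow, this approach cannot tell speech from other loud sound, and it never removes pauses in the middle of a recording.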