Local Whisper Transcription

When a video has no YouTube captions, B-Roll Me can transcribe its audio locally with OpenAI's Whisper model. Transcription runs entirely on your device and requires no cloud API calls.

When Whisper Is Used

During the search phase, B-Roll Me fetches YouTube captions for each video. Some videos don't have captions available. When this happens:

  • If Auto-transcribe is enabled in Settings, Whisper will automatically transcribe the audio.
  • The transcription runs locally on your machine, so there are no additional API costs.
  • The resulting transcript is used for keyword matching, just like YouTube captions.
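
The fallback described above can be sketched roughly as follows. This is only an illustration of the decision order, not B-Roll Me's actual code: get_transcript, fetch_captions, and transcribe_locally are hypothetical names, and the two callables stand in for the caption fetcher and the local Whisper runner.

```python
from typing import Callable, Optional


def get_transcript(
    video_id: str,
    fetch_captions: Callable[[str], Optional[str]],  # hypothetical caption fetcher
    transcribe_locally: Callable[[str], str],        # hypothetical local Whisper runner
    auto_transcribe: bool,                           # mirrors the Auto-transcribe setting
) -> Optional[str]:
    """Return text for keyword matching, or None if no transcript is available."""
    captions = fetch_captions(video_id)
    if captions:
        # YouTube captions exist, so Whisper is not needed.
        return captions
    if auto_transcribe:
        # No captions: fall back to local transcription.
        return transcribe_locally(video_id)
    # No captions and Auto-transcribe is off: nothing to match against.
    return None
```

In this sketch, a video without captions yields no transcript at all when Auto-transcribe is disabled, so it cannot contribute keyword matches.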

Available Models

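Choose a model based on the speed/accuracy tradeoff you want. Smaller models are generally faster but less accurate. The quantized large-v3-turbo-q5_0 is the exception: it offers the best accuracy with a smaller download than medium.en.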

Model                 Size      Speed vs. Accuracy
tiny.en               ~75 MB    Fastest, lowest accuracy
base.en               ~142 MB   Fast, decent accuracy
small.en              ~466 MB   Good balance
medium.en             ~1.5 GB   High accuracy, slower
large-v3-turbo-q5_0   ~1.1 GB   Best accuracy, quantized for efficiency
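
As a rough aid for choosing, the table can be expressed as data and filtered by how much disk space you are willing to spend. This is a sketch only: the sizes are the approximate figures from the table (with 1 GB treated as 1000 MB), the list is assumed to be ordered from least to most accurate as the table suggests, and most_accurate_within is a hypothetical helper, not part of the app.

```python
from typing import NamedTuple, Optional


class WhisperModel(NamedTuple):
    name: str
    size_mb: int  # approximate download size from the table
    note: str


# Assumed to be ordered from least to most accurate, following the table.
MODELS = [
    WhisperModel("tiny.en", 75, "Fastest, lowest accuracy"),
    WhisperModel("base.en", 142, "Fast, decent accuracy"),
    WhisperModel("small.en", 466, "Good balance"),
    WhisperModel("medium.en", 1500, "High accuracy, slower"),
    WhisperModel("large-v3-turbo-q5_0", 1100, "Best accuracy, quantized for efficiency"),
]


def most_accurate_within(budget_mb: int) -> Optional[WhisperModel]:
    """Return the most accurate model whose download fits in budget_mb, if any."""
    fitting = [m for m in MODELS if m.size_mb <= budget_mb]
    # The list is in ascending accuracy order, so the last fitting entry wins.
    return fitting[-1] if fitting else None
```

For example, most_accurate_within(500) selects small.en, while a budget of 1200 MB already admits large-v3-turbo-q5_0 because quantization makes it smaller than medium.en.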

Downloading Models

Models must be downloaded before use. Go to Settings > Transcription, select a model, and click Download. The model is saved locally and can be deleted later to free space.

Apple Silicon Acceleration

On Apple Silicon Macs (M1/M2/M3/M4), Whisper uses Metal GPU acceleration for significantly faster transcription. On Intel Macs and Windows, it runs on the CPU, which is slower but still functional.
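
If you are unsure which case applies to your machine, the check can be sketched with Python's standard platform module. This is an illustration of the rule above, not how B-Roll Me itself detects hardware; acceleration_backend is a hypothetical helper, and the optional arguments exist only so the logic can be tried with other values.

```python
import platform
from typing import Optional


def acceleration_backend(
    system: Optional[str] = None, machine: Optional[str] = None
) -> str:
    """Return "metal" for Apple Silicon Macs and "cpu" for everything else."""
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()
    # macOS reports "Darwin"; Apple Silicon reports the "arm64" architecture.
    # Caveat: a Python interpreter running under Rosetta reports "x86_64".
    if system == "Darwin" and machine == "arm64":
        return "metal"
    return "cpu"


if __name__ == "__main__":
    print(acceleration_backend())
```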

Tips

  • Start with base.en, which is fast and reasonably accurate on most machines. Move up to small.en if you need better accuracy.
  • If you have an Apple Silicon Mac with 16+ GB RAM, large-v3-turbo-q5_0 provides excellent accuracy with Metal acceleration.
  • Most YouTube videos have captions. Whisper is mainly needed for less popular content, unlisted videos, or non-English content. The .en models are English-only, so use large-v3-turbo-q5_0 for non-English audio.
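
The first two tips amount to a simple rule of thumb, sketched below. The function name recommended_model and its parameters are hypothetical, and the 16 GB threshold is taken directly from the tip above rather than from any measured requirement.

```python
def recommended_model(is_apple_silicon: bool, ram_gb: float) -> str:
    """Suggest a starting model following the tips in this section."""
    if is_apple_silicon and ram_gb >= 16:
        # Metal acceleration plus enough memory: use the most accurate model.
        return "large-v3-turbo-q5_0"
    # Otherwise start with the fast default suggested for most machines.
    return "base.en"
```

Treat the result as a starting point only; you can always download a different model from Settings > Transcription.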