Whisper Transcription

B-Roll Me can transcribe video audio using OpenAI's Whisper model when YouTube captions aren't available. You can choose between running Whisper locally on your device or using the OpenAI Whisper API. Configure your preferred method in Settings > Transcription.

Transcription Methods

B-Roll Me supports two Whisper transcription methods. Choose the one that fits your workflow:

Method	Local Whisper	OpenAI Whisper API
Cost	Free	Billed per minute of audio by OpenAI
Privacy	Fully on-device, no data sent externally	Audio sent to OpenAI servers
Setup	Download model (75 MB – 1.5 GB)	OpenAI API key required, no model download
Performance	Metal-accelerated on Apple Silicon	Cloud-based, consistent speed
File limit	No limit	Max 25 MB per file

Select your preferred method in Settings > Transcription.

[Screenshot: Settings > Transcription, showing the method toggle between Local Whisper and the OpenAI Whisper API.]

When Whisper Is Used

During the search phase, B-Roll Me fetches YouTube captions for each video. Some videos don't have captions available. When this happens:

• If Auto-transcribe is enabled in Settings, Whisper will automatically transcribe the audio using your selected method.
• The resulting transcript is used for keyword matching, just like YouTube captions.
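
The fallback order described above can be sketched in a few lines. This is an illustrative sketch, not B-Roll Me's actual code: `TranscriptionSettings`, `get_transcript`, and the `transcribers` mapping are hypothetical names, and returning no transcript when Auto-transcribe is off is an assumption about the app's behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TranscriptionSettings:
    # Hypothetical mirror of the options in Settings > Transcription.
    auto_transcribe: bool = True
    method: str = "local"  # "local" or "openai"


def get_transcript(
    captions: Optional[str],
    settings: TranscriptionSettings,
    transcribers: Dict[str, Callable[[], str]],
) -> Optional[str]:
    """Return text for keyword matching, or None if nothing is available."""
    if captions:
        # YouTube captions are used directly when they exist.
        return captions
    if not settings.auto_transcribe:
        # Assumption: with Auto-transcribe off, the video gets no transcript.
        return None
    # Otherwise run Whisper with the method selected in Settings.
    return transcribers[settings.method]()
```

Passing the transcribers as callables keeps the fallback decision separate from how each method produces text, so the same function covers both Local Whisper and the OpenAI Whisper API.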

OpenAI Whisper API

The OpenAI Whisper API sends audio to OpenAI's servers for transcription — no model download required.

• Requires an OpenAI API key configured in Settings > API Keys.
• Audio is converted to MP3 and sent to OpenAI's whisper-1 model.
• Maximum file size is 25 MB per audio file.
• Billed per minute of audio at OpenAI's standard rates.
• Audio is processed on OpenAI's servers — do not use this method if your content is confidential.
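
To make the size limit and per-minute billing concrete, here is a minimal standalone sketch of a pre-upload check, a cost estimate, and a call to whisper-1 through the official `openai` Python package. It is not B-Roll Me's implementation. The helper names are hypothetical, reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption, and the per-minute rate is left as a parameter because it must come from OpenAI's current pricing.

```python
import os

# Assumption: "25 MB" is interpreted as 25 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def fits_upload_limit(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """True if a file of this size is within the per-file upload limit."""
    return size_bytes <= limit


def estimate_cost(duration_seconds: float, rate_per_minute: float) -> float:
    """Rough charge for one file; look up rate_per_minute on OpenAI's pricing page."""
    return (duration_seconds / 60.0) * rate_per_minute


def transcribe_with_api(mp3_path: str) -> str:
    """Send one MP3 file to the whisper-1 model and return the transcript text."""
    if not fits_upload_limit(os.path.getsize(mp3_path)):
        raise ValueError(f"{mp3_path} is larger than the 25 MB upload limit")
    # Imported lazily so the helpers above work without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable
    with open(mp3_path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text
```

Checking the size before uploading avoids sending a file that the API would reject anyway.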

Local Whisper

Local Whisper runs entirely on your machine — no cloud API calls required for transcription.

• Completely free and private — audio never leaves your device.
• Requires downloading a Whisper model (75 MB – 1.5 GB depending on model size).
• Metal-accelerated on Apple Silicon Macs for significantly faster transcription.

Available Models (Local)

Choose a model based on the accuracy/speed tradeoff you need. Smaller models are faster but less accurate. Models with the .en suffix are English-only:

Model	Size	Speed vs Accuracy
tiny.en	~75 MB	Fastest, lowest accuracy
base.en	~142 MB	Fast, decent accuracy
small.en	~466 MB	Good balance
medium.en	~1.5 GB	High accuracy, slower
large-v3-turbo-q5_0	~1.1 GB	Best accuracy, quantized for efficiency
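
As an illustration of the tradeoff, the table can be written as data and queried for the most accurate model that fits a given download budget. This is a sketch only: the sizes are the approximate values from the table above, and `best_model_within` is a hypothetical helper, not something B-Roll Me exposes.

```python
from typing import Optional

# Approximate download sizes in MB, copied from the table above.
LOCAL_MODELS = {
    "tiny.en": 75,
    "base.en": 142,
    "small.en": 466,
    "medium.en": 1500,
    "large-v3-turbo-q5_0": 1100,
}

# Accuracy order as described in the table, lowest to highest.
ACCURACY_ORDER = ["tiny.en", "base.en", "small.en", "medium.en", "large-v3-turbo-q5_0"]


def best_model_within(budget_mb: float) -> Optional[str]:
    """Most accurate model whose download fits the budget, or None if none fits."""
    for name in reversed(ACCURACY_ORDER):
        if LOCAL_MODELS[name] <= budget_mb:
            return name
    return None
```

For example, `best_model_within(500)` returns `"small.en"`, because both larger models exceed 500 MB. Download size is only one constraint; transcription speed on your hardware also matters.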

Downloading Models

Models must be downloaded before use. Go to Settings > Transcription, select a model, and click Download. The model is saved locally and can be deleted later to free space.

Apple Silicon Acceleration

On Apple Silicon Macs (M1/M2/M3/M4), local Whisper runs with Metal acceleration for significantly faster transcription. On Intel Macs and Windows, it runs on the CPU, which is slower but still functional.
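
If you are unsure which path your machine will take, a check like the following can tell Apple Silicon apart from other platforms. It is a sketch using Python's standard `platform` module, not how B-Roll Me detects hardware, and `uses_metal` and `describe_backend` are hypothetical names. One caveat: a Python interpreter running under Rosetta typically reports `x86_64` even on an Apple Silicon Mac.

```python
import platform


def uses_metal(system: str, machine: str) -> bool:
    """True for Apple Silicon Macs, which report Darwin and arm64."""
    return system == "Darwin" and machine == "arm64"


def describe_backend() -> str:
    """Name the backend local Whisper would be expected to use on this machine."""
    if uses_metal(platform.system(), platform.machine()):
        return "Metal (Apple Silicon)"
    return "CPU"


if __name__ == "__main__":
    print(describe_backend())
```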

Tips

• Most YouTube videos have captions. Whisper is mainly needed for less popular content, unlisted videos, or non-English content.
• If privacy matters, use Local Whisper: audio never leaves your machine.
• For local transcription, start with base.en for a good speed/accuracy balance on most machines.
• If you have an Apple Silicon Mac with 16+ GB RAM, large-v3-turbo-q5_0 provides excellent accuracy with Metal acceleration.
• Use the OpenAI Whisper API if you want fast transcription without downloading a model and don't mind the per-minute cost.
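
The tips above can be condensed into a small decision helper. This is only a sketch of the guidance: `recommend` and its parameters are hypothetical, and the priority order (privacy first, then willingness to pay, then hardware) is an assumption about how to combine the tips.

```python
def recommend(
    privacy_required: bool,
    accepts_api_cost: bool,
    apple_silicon: bool,
    ram_gb: float,
) -> str:
    """Suggest a transcription setup based on the tips above."""
    # The API is only an option when audio may leave the machine
    # and the per-minute cost is acceptable.
    if not privacy_required and accepts_api_cost:
        return "OpenAI Whisper API"
    # Local Whisper: use the largest model on well-equipped Apple Silicon Macs.
    if apple_silicon and ram_gb >= 16:
        return "Local Whisper: large-v3-turbo-q5_0"
    # Otherwise start with base.en, as suggested for most machines.
    return "Local Whisper: base.en"
```

For example, `recommend(True, True, True, 32)` returns `"Local Whisper: large-v3-turbo-q5_0"`, since the privacy requirement rules out the API regardless of cost.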