How to Transcribe Video Content: Extract Audio and Convert for Text
Step-by-step workflow for transcribing videos: extract audio from MP4, convert to WAV or MP3, and feed it into transcription tools accurately and efficiently.
Alex Thompson·April 21, 2026·9 min read
Transcription starts with clean audio. Whether you're generating captions for a YouTube series, creating searchable meeting notes, or producing subtitles for a documentary, the quality of your transcript depends heavily on what you feed into the transcription tool. A compressed, noisy audio track produces mediocre results; a clean WAV file with proper levels produces near-perfect output.
This guide walks through the complete workflow: extracting audio from video files, converting to the right format for your transcription tool, and troubleshooting the common issues that produce bad transcripts.
Why Audio Format Matters for Transcription
Transcription engines — Whisper, Otter.ai, Descript, Deepgram, Google Speech-to-Text — all have preferences, and some have hard requirements.
Sample rate: Most transcription tools perform best at 16kHz mono audio. Higher sample rates (44.1kHz, 48kHz) work fine but don't improve accuracy and create larger files. Lower rates (8kHz, telephone quality) significantly hurt accuracy.
Channel count: Mono audio processes faster and is often at least as accurate as stereo for single-speaker content. For multi-speaker content, stereo can help tools distinguish speakers when each speaker was recorded on a separate channel.
Format: WAV (PCM) is universal — every transcription tool accepts it. MP3 is widely supported but introduces compression artifacts that can hurt accuracy at low bitrates. FLAC is lossless and smaller than WAV, supported by most modern tools.
Codec: Some video files use compressed audio codecs (AAC, AC3, Opus) internally. Extracting to WAV avoids any re-compression artifacts.
| Format | Transcription Compatibility | File Size (10 min) | Notes |
| --- | --- | --- | --- |
| WAV (16kHz mono) | Universal | ~19MB | Best compatibility |
| WAV (44.1kHz stereo) | Universal | ~100MB | Larger, same accuracy |
| MP3 (128kbps) | Very high | ~10MB | Minor quality loss |
| FLAC | High | ~25MB | Lossless, good choice |
| OGG/Opus | Medium | ~7MB | Some tools don't support |
| M4A/AAC | High | ~10MB | May need conversion |
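The WAV and MP3 sizes in the table follow from simple arithmetic. The sketch below assumes 16-bit PCM for WAV and constant-bitrate MP3; the function names are illustrative, and the small WAV header is ignored.

```python
def pcm_wav_size_bytes(sample_rate_hz: int, channels: int, seconds: float,
                       bits_per_sample: int = 16) -> int:
    """Approximate size of uncompressed PCM audio data (WAV header ignored)."""
    bytes_per_sample = bits_per_sample // 8
    return int(sample_rate_hz * channels * bytes_per_sample * seconds)


def compressed_size_bytes(bitrate_kbps: int, seconds: float) -> int:
    """Approximate size of constant-bitrate audio such as 128 kbps MP3."""
    return int(bitrate_kbps * 1000 / 8 * seconds)


ten_minutes = 10 * 60
print(pcm_wav_size_bytes(16_000, 1, ten_minutes) / 1e6)  # 16 kHz mono WAV, in MB
print(pcm_wav_size_bytes(44_100, 2, ten_minutes) / 1e6)  # 44.1 kHz stereo WAV, in MB
print(compressed_size_bytes(128, ten_minutes) / 1e6)     # 128 kbps MP3, in MB
```

The 44.1kHz stereo case works out to about 106MB, which the table rounds to ~100MB. FLAC and Opus sizes depend on the content, so they cannot be computed this way.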
Step 1: Identify Your Source Audio
Before extracting, it's useful to know what you're working with. Different video sources have different audio characteristics:
Screen recordings (Zoom, Google Meet, Teams): Typically AAC or Opus audio at 48kHz stereo, encoded at 64–128kbps. Clean speech but moderate compression.
YouTube downloads: Usually AAC or Opus at 128–256kbps. Generally clean for speech recognition.
Smartphone videos: H.264 video with AAC audio at 48kHz. Good quality for transcription.
Broadcast/professional video (MXF, MOV files from cameras): Often uncompressed PCM or high-bitrate AAC. Excellent source quality.
Screencasts with system audio: Variable quality, may have background noise from computer fans or notification sounds.
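If you would rather check a file than guess from its source, ffprobe (part of FFmpeg) can report the audio codec, sample rate, channel count, and bitrate. This is a minimal sketch that assumes ffprobe is installed and on your PATH; the helper names are hypothetical.

```python
import json
import subprocess


def ffprobe_command(video_path: str) -> list[str]:
    """Build an ffprobe call that describes the first audio stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,sample_rate,channels,bit_rate",
        "-of", "json",
        video_path,
    ]


def parse_audio_info(ffprobe_json: str) -> dict:
    """Turn ffprobe's JSON output into a small dict of audio properties."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        raise ValueError("no audio stream found")
    stream = streams[0]
    return {
        "codec": stream.get("codec_name"),
        # ffprobe reports sample_rate and bit_rate as strings; some streams omit bit_rate.
        "sample_rate_hz": int(stream["sample_rate"]) if "sample_rate" in stream else None,
        "channels": stream.get("channels"),
        "bit_rate_bps": int(stream["bit_rate"]) if "bit_rate" in stream else None,
    }


def probe_audio(video_path: str) -> dict:
    """Run ffprobe on a file and return its audio properties."""
    result = subprocess.run(ffprobe_command(video_path),
                            capture_output=True, text=True, check=True)
    return parse_audio_info(result.stdout)
```

Calling `probe_audio("meeting.mp4")` on a typical screen recording should show something like AAC at 48000 Hz with two channels, which tells you a downmix to 16kHz mono is worthwhile.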
Step 2: Extract the Audio
The cleanest approach is to extract the audio track without lossy re-encoding: either copy the audio stream as-is or decode it to uncompressed WAV. Both preserve the audio quality exactly as it was in the video.
Use /extract-audio to pull the audio from your video file. Upload the MP4, MKV, MOV, or other video format, and the tool extracts the audio stream directly.
For the output format:
Choose WAV for maximum compatibility and lossless quality
Choose MP3 at 192kbps or higher if file size is a concern and your tool supports it
Choose FLAC for lossless output at smaller sizes than WAV
If your source is a long video (conference recording, webinar, lecture), the extracted audio file will be proportionally large. A 2-hour meeting extracted as 16-bit WAV at 48kHz stereo runs roughly 1.4GB (48,000 samples/s × 2 channels × 2 bytes × 7,200 s). Convert to 16kHz mono WAV and it drops to about 230MB.
Pro Tip: Always extract first, then convert format and sample rate. Do it in two steps rather than trying to do everything at once — this makes it easier to troubleshoot if something goes wrong.
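If you prefer the command line to the /extract-audio tool, FFmpeg can do the same extraction. The sketch below is an alternative route, not how the web tool works internally; it assumes ffmpeg is installed, and the function names are made up for illustration.

```python
import subprocess


def extract_copy_command(video_path: str, audio_path: str) -> list[str]:
    """Copy the audio stream untouched (no re-encoding).

    The output container must suit the codec, e.g. .m4a for AAC audio.
    """
    return ["ffmpeg", "-i", video_path, "-vn", "-c:a", "copy", audio_path]


def extract_wav_command(video_path: str, wav_path: str) -> list[str]:
    """Decode the audio stream to 16-bit PCM WAV, dropping the video (-vn)."""
    return ["ffmpeg", "-i", video_path, "-vn", "-c:a", "pcm_s16le", wav_path]


def run(command: list[str]) -> None:
    """Execute a command and raise if ffmpeg reports an error."""
    subprocess.run(command, check=True)
```

For example, `run(extract_wav_command("lecture.mp4", "lecture.wav"))` produces a WAV at the source sample rate and channel count, leaving format conversion for the next step.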
Step 3: Optimize Audio for Transcription
Once you have the extracted audio, a few adjustments can dramatically improve transcript accuracy:
Convert to 16kHz Mono
Most transcription engines (especially Whisper-based tools) internally downsample to 16kHz anyway. Sending a 16kHz mono file skips that step and reduces upload time.
Use the /audio-converter to convert your extracted WAV to 16kHz mono. Select "WAV" as the output format, set the sample rate to 16000 Hz, and convert to mono.
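The same conversion can be scripted with FFmpeg, and Python's standard `wave` module can confirm the result. Treat this as a sketch under the assumption that ffmpeg is installed; the helper names are hypothetical.

```python
import wave


def to_16k_mono_command(src_path: str, dst_path: str) -> list[str]:
    """Resample to 16 kHz (-ar), downmix to mono (-ac 1), write 16-bit PCM."""
    return ["ffmpeg", "-i", src_path, "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", dst_path]


def wav_format(path: str) -> tuple[int, int, int]:
    """Return (sample rate in Hz, channel count, bits per sample) of a PCM WAV."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def is_transcription_ready(path: str) -> bool:
    """True if the file is already 16 kHz, mono, 16-bit."""
    return wav_format(path) == (16000, 1, 16)
```

Running `is_transcription_ready` before uploading is a cheap way to catch a file that skipped the conversion step.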
Check and Normalize Volume
Transcription accuracy drops for audio that's too quiet or has inconsistent levels. Normalize the audio to a peak of about -3dB before sending it to your transcription tool.
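To make the -3dB target concrete, here is a minimal peak-normalization sketch for 16-bit samples. It assumes the samples are already loaded as signed 16-bit integers, and it only adjusts overall gain; it does not even out quiet and loud passages, which would need a compressor or loudness normalization.

```python
import math
from array import array

FULL_SCALE_16BIT = 32768  # magnitude of the largest 16-bit sample


def peak_dbfs(samples) -> float:
    """Peak level relative to full scale, in dB (0 dB is the maximum)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / FULL_SCALE_16BIT)


def gain_db_for_target(samples, target_dbfs: float = -3.0) -> float:
    """Gain in dB that moves the current peak to the target level."""
    current = peak_dbfs(samples)
    if current == float("-inf"):
        raise ValueError("audio is silent; nothing to normalize")
    return target_dbfs - current


def apply_gain(samples, gain_db: float) -> array:
    """Scale samples by gain_db, clamping to the 16-bit range."""
    factor = 10 ** (gain_db / 20)
    return array("h", (max(-32768, min(32767, round(s * factor))) for s in samples))
```

A recording that peaks at half of full scale sits near -6dB, so it needs roughly 3dB of gain to reach the target.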
Split Long Recordings
Most cloud transcription APIs have file size limits (typically 25MB–1GB) and some perform better on shorter segments. For recordings longer than 30 minutes, consider splitting the audio into segments before transcription.
Split at natural breaks — topic changes, speaker transitions, pauses. Transcripts from shorter segments are easier to correct and organize.
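When natural breaks are hard to find, fixed-length segments are a workable fallback. The sketch below computes segment boundaries and builds an FFmpeg command using its segment muxer; it assumes ffmpeg is installed, the function names are illustrative, and it does not detect pauses or topic changes.

```python
def segment_bounds(total_seconds: float, max_segment_seconds: float = 1800.0):
    """Return (start, end) pairs covering the recording in fixed-length chunks."""
    if total_seconds <= 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def split_command(src_path: str, out_pattern: str, segment_seconds: int = 1800) -> list[str]:
    """Split audio into numbered files without re-encoding.

    out_pattern is a printf-style name such as "part_%03d.wav".
    """
    return ["ffmpeg", "-i", src_path, "-f", "segment",
            "-segment_time", str(segment_seconds), "-c", "copy", out_pattern]
```

A 75-minute recording split at 30 minutes yields three segments, with the last one 15 minutes long. Keep the start offsets so you can shift timestamps when merging the transcripts.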
Step 4: Choose Your Transcription Tool
The right tool depends on your accuracy requirements, budget, and whether you need speaker identification.
| Tool | Best For | Accuracy | Speaker ID | Price |
| --- | --- | --- | --- | --- |
| OpenAI Whisper (local) | Privacy-sensitive content | Excellent | No | Free |
| Otter.ai | Meetings, collaboration | Very good | Yes | Freemium |
| Descript | Podcast/video editing | Excellent | Yes | Paid |
| Deepgram | API integration | Excellent | Yes | Pay-per-use |
| Google Speech-to-Text | High volume | Very good | Yes | Pay-per-use |
| Rev | Human + AI hybrid | Near-perfect | Yes | Paid |
For technical content (code, jargon, product names), tools that allow custom vocabulary, such as Deepgram or AssemblyAI, typically outperform generic transcription.
Step 5: Post-Process the Transcript
Raw transcripts always need cleanup. Common issues:
Filler words: Transcription tools faithfully capture every "um," "uh," and "you know." Decide whether to keep these (for accurate quotes) or remove them (for readability).
Speaker names: Tools that detect speakers label them as "Speaker 1," "Speaker 2." Replace with actual names for usable transcripts.
Technical terms: Proper nouns, product names, and industry terms are often transcribed phonetically. A pass through the transcript searching for known terms catches most errors.
Punctuation: Transcription tools handle punctuation inconsistently. Commas and periods often need adjustment for readability.
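Two of these cleanup passes are easy to script. The following is a rough sketch: the filler list and the speaker mapping are assumptions to adapt, and the filler pattern will also remove a genuine "you know" (as in "do you know"), so review the result.

```python
import re

# Matches a filler plus the commas that usually surround it.
FILLER_PATTERN = re.compile(r",?\s*\b(?:um+|uh+|you know)\b,?", re.IGNORECASE)


def remove_fillers(text: str) -> str:
    """Strip common filler words and tidy leftover whitespace."""
    cleaned = FILLER_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def rename_speakers(text: str, names: dict[str, str]) -> str:
    """Replace generic labels such as "Speaker 1" with real names."""
    for label, name in names.items():
        # Word boundaries keep "Speaker 1" from matching inside "Speaker 10".
        text = re.sub(rf"\b{re.escape(label)}\b", lambda _m, n=name: n, text)
    return text
```

For transcripts that will be quoted verbatim, skip `remove_fillers` and only apply the speaker renaming.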
For subtitle/caption workflows, the transcript needs to be timed and formatted as SRT or VTT. Several tools (including Whisper with the --output_format srt flag) produce timed output directly.
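If your tool returns timed segments but not SRT, the format is simple enough to generate yourself. This sketch assumes each segment is a `(start_seconds, end_seconds, text)` tuple, which is a hypothetical shape; adapt it to whatever your tool actually returns.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start, end, text) segments, numbered from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

VTT is close to this layout but starts with a `WEBVTT` header line and uses a period instead of a comma in timestamps.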
See the subtitle formats guide for details on SRT vs VTT vs ASS formatting and when to use each.
Troubleshooting Poor Transcription Quality
Problem: Lots of "[inaudible]" or wrong words
Cause: Background noise, too much reverb, or low volume.
Solution: Apply noise reduction before transcription. Tools like Audacity, Adobe Podcast, or Cleanvoice.ai handle noise removal. Re-extract if the source video has loud background music.
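For a quick scripted first pass before reaching for a dedicated tool, FFmpeg's built-in filters can remove low-frequency rumble and steady background noise. This is a sketch assuming ffmpeg is installed; the 80 Hz cutoff is a starting guess rather than a recommended setting, and heavy denoising can itself hurt accuracy.

```python
def denoise_command(src_path: str, dst_path: str, highpass_hz: int = 80) -> list[str]:
    """Apply a high-pass filter, then FFmpeg's FFT-based denoiser (afftdn)."""
    filters = f"highpass=f={highpass_hz},afftdn"
    return ["ffmpeg", "-i", src_path, "-af", filters, dst_path]
```

Compare a short transcribed sample before and after filtering to confirm the cleanup actually helps.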
Problem: Speaker words bleed together
Cause: Multiple speakers too close together in the audio, or overlapping dialogue.
Solution: Use speaker diarization (speaker identification) features in your transcription tool. For interviews, consider using separate audio tracks if you have access to the original recording.
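If the recording happens to have one speaker per stereo channel (common with some interview setups), you can split the channels and transcribe each one separately. The sketch below assumes a stereo source and an installed ffmpeg; the function name is illustrative.

```python
def split_channels_command(src_path: str, left_path: str, right_path: str) -> list[str]:
    """Write the left and right channels of a stereo file to separate mono files."""
    return [
        "ffmpeg", "-i", src_path,
        "-filter_complex", "channelsplit=channel_layout=stereo[left][right]",
        "-map", "[left]", left_path,
        "-map", "[right]", right_path,
    ]
```

This only helps when the speakers were genuinely recorded on separate channels; a stereo mix of a single room microphone gains nothing from it.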
Problem: Technical terms consistently wrong
Cause: Generic transcription models aren't trained on domain-specific vocabulary.
Solution: Use custom vocabulary features, or manually correct with find-and-replace for known terms. Create a glossary of common terms in your domain and run a pass after transcription.
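A glossary pass can be as small as a dictionary of known mis-transcriptions. The entries below are invented examples; build your own list from the errors you actually see.

```python
import re


def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace each known mis-transcription with the correct term.

    Matching is case-insensitive and limited to whole words or phrases.
    """
    for wrong, right in glossary.items():
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", re.IGNORECASE)
        # A function replacement avoids backslash handling in the replacement text.
        text = pattern.sub(lambda _m, r=right: r, text)
    return text


glossary = {"cooper netties": "Kubernetes", "post gress": "Postgres"}
print(apply_glossary("We deployed cooper netties on post gress.", glossary))
```

Keeping the glossary in a shared file lets it grow with each transcript you correct.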
Problem: File won't upload to transcription service
Cause: Wrong format, file too large, or codec incompatibility.
Solution: Convert to WAV PCM first — it's accepted everywhere. Split the file if size is the issue.
Full Workflow Summary
Here's the complete pipeline for a typical video transcription job:
Video file → Extract audio → WAV
WAV → Convert to 16kHz mono and normalize → 16kHz WAV
16kHz WAV → Upload to transcription tool → Download raw transcript
Raw transcript → Manual review and cleanup → Final document
For batch processing multiple videos — a lecture series, podcast archive, meeting recordings — this workflow scales well. Process extractions in parallel, then work through transcriptions systematically.
See also our guide on how to add subtitles to video if your goal is to embed the transcript back into the video as captions.
FAQ
What's the best audio format to send to Whisper for transcription?
OpenAI Whisper accepts WAV, MP3, M4A, FLAC, and several other formats. WAV PCM at 16kHz mono gives the cleanest input. Local Whisper decodes its input through FFmpeg, so any common audio format works there.
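For a local run, the open-source `whisper` command-line tool takes the audio file plus a few options. The sketch below only builds the command; it assumes the `openai-whisper` package and ffmpeg are installed, and the default model size here is an arbitrary choice.

```python
import subprocess
from typing import Optional


def whisper_command(audio_path: str, model: str = "small", output_format: str = "srt",
                    output_dir: str = ".", initial_prompt: Optional[str] = None) -> list[str]:
    """Build a call to the open-source Whisper CLI."""
    command = ["whisper", audio_path, "--model", model,
               "--output_format", output_format, "--output_dir", output_dir]
    if initial_prompt:
        # A short prompt listing names or jargon can nudge spelling of those terms.
        command += ["--initial_prompt", initial_prompt]
    return command


def transcribe(audio_path: str, **options) -> None:
    """Run Whisper and raise if it exits with an error."""
    subprocess.run(whisper_command(audio_path, **options), check=True)
```

Larger models are slower but usually more accurate, so it can be worth testing a short clip with two sizes before committing to a long recording.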
Can I transcribe directly from the video without extracting audio first?
Some tools (Descript, Kapwing) accept video files directly and handle audio extraction internally. However, extracting audio first gives you more control over quality and lets you optimize the audio before transcription.
How long does it take to transcribe an hour of audio?
Cloud services typically take 1–5 minutes for an hour of audio. Local Whisper on a GPU takes 2–10 minutes depending on model size and hardware. Local Whisper on CPU can take 30–60 minutes for an hour of audio.
Do I need stereo audio for better transcription?
For single-speaker content (solo podcasts, lectures), mono produces equivalent or better results than stereo. For multi-speaker interviews or meetings, stereo can help with speaker diarization, though modern tools handle mono multi-speaker content well.
What should I do with background music in the source video?
Transcription accuracy drops significantly with background music. If possible, get a version of the source without music. If not, use AI audio separation tools (Demucs, LALAL.ai) to extract the vocal track before transcription.
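With Demucs, vocal separation is a single command. This sketch assumes the `demucs` package is installed and uses its two-stem mode; the output folder name is a placeholder, and the exact layout of the output directory depends on the model Demucs selects.

```python
def demucs_vocals_command(audio_path: str, output_dir: str = "separated") -> list[str]:
    """Ask Demucs for two stems only: vocals and everything else."""
    return ["demucs", "--two-stems=vocals", "-o", output_dir, audio_path]
```

Feed the resulting vocals file into the 16kHz mono conversion step as you would any other extracted audio.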
Final Thoughts
Clean audio is the foundation of accurate transcription. The investment in extracting properly, converting to the right format, and checking levels pays back immediately in transcript quality and less correction time.
The workflow is straightforward: extract audio from your video using /extract-audio, optimize the format with the /audio-converter, and send clean audio to your transcription tool of choice. What used to require professional audio software now takes a few minutes and produces professional results.
About the Author
Alex Thompson
Software engineer and content creator focused on web technologies, image optimization, and developer tooling.