Automated pipeline for producing accurate speech transcripts from video URLs. Downloads media, transcribes with multiple Whisper models, and merges all available sources — Whisper, YouTube captions, and optional external transcripts — into a single "critical text" using LLM-based adjudication.
The approach applies principles from textual criticism: multiple independent "witnesses" to the same speech are aligned, compared, and merged by an LLM that judges each difference on its merits, without knowing which source produced which reading. This builds on earlier work applying similar techniques to OCR (Ringger & Lund, 2014; Lund et al., 2013), replacing trained classifiers with an LLM as the eclectic editor.
- Critical text merging: Combines 2–3+ transcript sources into the most accurate version using blind, anonymous presentation to an LLM — no source receives preferential treatment
- wdiff-based alignment: Uses longest common subsequence alignment (via `wdiff`) to keep chunks properly aligned across sources of different lengths, replacing naive proportional slicing
- Multi-model Whisper ensembling: Runs multiple Whisper models (e.g., small + medium) and resolves disagreements via LLM
- External transcript support: Merges in human-edited transcripts (e.g., from publisher websites) as an additional source
- Structured transcript preservation: When external transcripts have speaker labels and timestamps, the merged output preserves that structure
- Slide extraction and analysis: Automatic scene detection for presentation slides, with optional vision API descriptions
- Make-style DAG pipeline: Each stage checks whether its outputs are newer than its inputs, skipping unnecessary work
- Checkpoint resumption: Long merge operations save per-chunk checkpoints, resuming from where they left off after interruption
- Cost estimation: Shows estimated API costs before running (`--dry-run` for estimation only)
- Local-only mode: `--no-api` for completely free operation (Whisper only)
```bash
# Required tools
brew install ffmpeg wdiff
pip install yt-dlp mlx-whisper

# Required for merge/ensemble features
pip install anthropic
```

This tool is optimized for Apple Silicon Macs using `mlx-whisper`. On other platforms, it falls back to `openai-whisper`:

```bash
pip install openai-whisper  # For non-Apple Silicon
```

```bash
# Basic: Whisper transcription + YouTube caption merge
python transcriber.py "https://youtube.com/watch?v=..."
# With an external human-edited transcript for three-way merge
python transcriber.py "https://youtube.com/watch?v=..." \
--external-transcript "https://example.com/transcript"
# Dry run: see what would happen and estimated costs
python transcriber.py "https://youtube.com/watch?v=..." --dry-run
# Free/local only (no API calls)
python transcriber.py "https://youtube.com/watch?v=..." --no-api
```

```bash
# Podcast or interview — skip slide extraction
python transcriber.py "https://youtube.com/watch?v=..." --no-slides
# With external transcript for higher accuracy
python transcriber.py "https://youtube.com/watch?v=..." \
--no-slides \
--external-transcript "https://example.com/transcript"
```

```bash
# Extract slides and interleave with transcript
python transcriber.py "https://youtube.com/watch?v=..."
# Also describe slide content with vision API
python transcriber.py "https://youtube.com/watch?v=..." --analyze-slides
```

```bash
# Custom output directory
python transcriber.py "https://youtube.com/watch?v=..." -o ./my_transcript
# Use specific Whisper models
python transcriber.py "https://youtube.com/watch?v=..." --whisper-models large
# Adjust slide detection sensitivity (0.0–1.0, lower = more slides)
python transcriber.py "https://youtube.com/watch?v=..." --scene-threshold 0.15
# Force re-processing (ignore existing files)
python transcriber.py "https://youtube.com/watch?v=..." --no-skip
# Verbose output
python transcriber.py "https://youtube.com/watch?v=..." -v
```

```
output_dir/
├── metadata.json             # Source URL, title, duration, etc.
├── audio.mp3                 # Downloaded audio
├── video.mp4                 # Downloaded video (if slides enabled)
├── captions.en.vtt           # YouTube captions (if available)
├── small.txt                 # Whisper small transcript
├── medium.txt                # Whisper medium transcript
├── ensembled.txt             # Ensembled from multiple Whisper models
├── medium.json               # Transcript with timestamps
├── transcript_merged.txt     # Critical text (merged from all sources)
├── analysis.md               # Source survival analysis
├── transcript.md             # Final markdown output
├── merge_chunks/             # Per-chunk checkpoints (resumable)
│   ├── .version
│   ├── chunk_000.json
│   └── ...
└── slides/                   # (if slides enabled)
    ├── slide_0001.png
    ├── slide_timestamps.json
    ├── slides_transcript.json   # (if --analyze-slides)
    └── ...
```
| Stage | Tool | API Required |
|---|---|---|
| 1. Download media | yt-dlp | No |
| 2. Transcribe audio | mlx-whisper | No |
| 3. Extract slides | ffmpeg | No |
| 4a. Ensemble Whisper models | Claude + wdiff | Yes |
| 4b. Merge transcript sources | Claude + wdiff | Yes |
| 5. Generate markdown | Python | No |
| 6. Source survival analysis | wdiff | No |
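Stage 3 relies on ffmpeg's scene-change detection, driven by `--scene-threshold`. A minimal sketch of that step, with hypothetical file names and an illustrative filter invocation rather than the pipeline's actual code:

```python
# Sketch of scene-change slide extraction with ffmpeg (illustrative; the real
# stage also records per-slide timestamps in slide_timestamps.json).
import subprocess
from pathlib import Path

def extract_slides(video_path: str, out_dir: str, threshold: float = 0.15) -> None:
    """Write one PNG per detected scene change. Lower thresholds emit more slides."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            # keep only frames whose scene-change score exceeds the threshold
            "-vf", f"select='gt(scene,{threshold})'",
            "-vsync", "vfr",                     # drop the unselected frames
            f"{out_dir}/slide_%04d.png",
        ],
        check=True,
    )

# extract_slides("video.mp4", "slides", threshold=0.15)
```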
The core innovation is treating transcript merging as textual criticism. Given 2–3+ independent "witnesses" to the same speech (the chunk-and-adjudicate loop is sketched in code below):
- Align all sources against an anchor text using `wdiff` (longest common subsequence), producing word-position maps that keep chunks synchronized even when sources differ in length
- Chunk the aligned sources into ~500-word segments
- Present each chunk to Claude with anonymous labels (Source 1, Source 2, Source 3) — source names are never revealed, preventing provenance bias
- Adjudicate — Claude chooses the best reading at each point of disagreement, preferring proper nouns, grammatical correctness, and contextual fit
- Reassemble the merged chunks, restoring speaker labels and timestamps from the structured source (if available)
When an external transcript has structure (speaker labels, timestamps), the merge preserves that skeleton while improving the text content from all sources.
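A minimal sketch of the chunk-and-adjudicate loop, assuming the sources have already been word-aligned so that chunk i of every source covers the same stretch of speech. The prompt wording, model ID, and helper names are illustrative, not the tool's actual implementation:

```python
# Illustrative chunk-and-adjudicate loop, not the tool's actual code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def merge_chunk(aligned_chunks: list[str], model: str = "claude-sonnet-4-20250514") -> str:
    """Present one aligned chunk from each source under anonymous labels
    and ask the model to produce a single best reading."""
    # Sources are numbered, never named, so provenance cannot bias the judgment.
    witnesses = "\n\n".join(
        f"Source {i + 1}:\n{chunk}" for i, chunk in enumerate(aligned_chunks)
    )
    prompt = (
        "The following are independent transcripts of the same stretch of speech.\n"
        "Merge them into the single most accurate text. Where they disagree, prefer "
        "correct proper nouns, grammatical readings, and contextual fit. "
        "Return only the merged text.\n\n" + witnesses
    )
    response = client.messages.create(
        model=model,            # model choice here is an assumption
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# merged = [merge_chunk(list(chunks)) for chunks in zip(*aligned_sources)]
```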
After merging, `wdiff -s` compares each source against the merged output:

```
Source                 Words     Common    % of Merged   % of Source
---------------------  --------  --------  ------------  ------------
Whisper (ensembled)    28,277    27,441    90%           97%
YouTube captions       30,668    28,741    94%           94%
External transcript    33,122    30,245    99%           91%
Merged output          30,524
```
This shows how much each source contributed to the final text and which source the merged output most closely resembles.
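A rough sketch of how that comparison can be produced with wdiff's statistics mode. The file names are assumptions about the working directory, and the VTT captions would first need converting to plain text:

```python
# Sketch of the survival analysis: wdiff --statistics for each source vs. the merged text.
import subprocess

SOURCES = {
    "Whisper (ensembled)": "ensembled.txt",
    "YouTube captions": "captions.txt",      # hypothetical plain-text conversion of the VTT file
    "External transcript": "external.txt",
}
MERGED = "transcript_merged.txt"

for label, path in SOURCES.items():
    # -s prints word counts and the share of common/changed/inserted/deleted words;
    # -1 -2 -3 suppress the diff body so only the statistics remain.
    result = subprocess.run(
        ["wdiff", "-s", "-1", "-2", "-3", path, MERGED],
        capture_output=True, text=True,      # wdiff exits non-zero when the files differ
    )
    print(f"{label}:\n{result.stdout.strip()}\n")
```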
When using multiple Whisper models (default: `small,medium`):
- Runs each model independently
- Uses `wdiff` to identify differences (normalized: no caps, no punctuation; see the sketch below)
- Claude resolves disagreements, preferring real words over transcription errors and proper nouns over generic alternatives
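A minimal sketch of the normalize-then-diff step. The helper names are hypothetical, and the normalization rules are an approximation of "no caps, no punctuation":

```python
# Illustrative normalize-then-diff step for the Whisper ensemble.
import re
import subprocess
import tempfile

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so only the words themselves are compared."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)      # keep word characters and apostrophes
    return re.sub(r"\s+", " ", text).strip()

def word_differences(transcript_a: str, transcript_b: str) -> str:
    """Run wdiff on the normalized transcripts and return its marked-up output.

    Deleted words appear as [-...-] and inserted words as {+...+}, which is
    enough to locate every point of disagreement for the LLM to adjudicate."""
    # Temporary files are left behind in this sketch; a real version would clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fa, \
         tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fb:
        fa.write(normalize(transcript_a))
        fb.write(normalize(transcript_b))
        paths = (fa.name, fb.name)
    # wdiff exits 1 when the files differ, so don't treat that as an error.
    result = subprocess.run(["wdiff", *paths], capture_output=True, text=True)
    return result.stdout
```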
Every stage checks `is_up_to_date(output, *inputs)` — if the output file is newer than all input files, the stage is skipped. This means you can re-run the pipeline after changing options and only the affected stages will execute.
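A plausible shape for that check (the real helper may differ in details):

```python
# Make-style freshness check: a stage runs only if its output is missing or stale.
import os

def is_up_to_date(output: str, *inputs: str) -> bool:
    """Return True if `output` exists and is newer than every input file,
    in which case the stage that produces it can be skipped."""
    if not os.path.exists(output):
        return False
    output_mtime = os.path.getmtime(output)
    return all(
        os.path.exists(path) and os.path.getmtime(path) <= output_mtime
        for path in inputs
    )

# Example: skip transcription if the transcript is newer than the audio.
# if not is_up_to_date("medium.txt", "audio.mp3"):
#     run_whisper("medium", "audio.mp3")   # hypothetical stage function
```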
```
==================================================
ESTIMATED API COSTS
==================================================
Source merging:    3 sources × 59 chunks = $1.03
Whisper ensemble:  2 models  × 59 chunks = $0.92
TOTAL: $1.95 (estimate)
==================================================
```
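A back-of-the-envelope version of how such an estimate can be computed. The per-chunk token counts and prices below are illustrative assumptions, not the tool's actual rates:

```python
# Rough cost model: each chunk is one API call whose input grows with the number of sources.
INPUT_PRICE_PER_MTOK = 3.00     # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00   # USD per million output tokens (assumed)

def estimate_merge_cost(n_chunks: int, n_sources: int,
                        tokens_per_source_chunk: int = 700,
                        output_tokens_per_chunk: int = 700) -> float:
    input_tokens = n_chunks * n_sources * tokens_per_source_chunk
    output_tokens = n_chunks * output_tokens_per_chunk
    return (input_tokens / 1e6) * INPUT_PRICE_PER_MTOK + \
           (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK

print(f"Source merging (3 sources, 59 chunks): ~${estimate_merge_cost(59, 3):.2f}")
```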
| Feature | 20-min speech | 3-hour podcast |
|---|---|---|
| Whisper ensemble | $0.05–$0.15 | $0.50–$1.00 |
| Source merging (2 sources) | $0.10–$0.30 | $0.50–$1.00 |
| Source merging (3 sources) | $0.15–$0.40 | $1.00–$2.00 |
| Slide analysis | $0.50–$2.00 | N/A |
| `--no-api` | Free | Free |
This tool applies the principles of textual criticism — the scholarly discipline of comparing multiple manuscript witnesses to reconstruct an authoritative text — to the problem of speech transcription.
The approach has roots in earlier work applying noisy-channel models and multi-source correction to speech and OCR:
- Ringger & Allen (1996) — Error Correction via a Post-Processor for Continuous Speech Recognition (ICASSP). Introduced SpeechPP, a noisy-channel post-processor that corrects ASR output using language and channel models with Viterbi beam search, developed as part of the TRAINS/TRIPS spoken dialogue systems at the University of Rochester. Extended with a fertility channel model in Ringger & Allen, ICSLP 1996.
- Ringger & Lund (2014) — How Well Does Multiple OCR Error Correction Generalize? Demonstrated that aligning and merging outputs from multiple OCR engines significantly reduces word error rates.
- Lund et al. (2013) — Error Correction with In-Domain Training Across Multiple OCR System Outputs. Used A* alignment and trained classifiers (CRFs, MaxEnt) to choose the best reading from multiple OCR witnesses — a 52% relative decrease in word error rate.
The OCR work used A* alignment because page layout provides natural line boundaries, making alignment a series of short, bounded search problems. Speech has no such boundaries — different ASR systems segment a continuous audio stream arbitrarily — so this tool uses wdiff (LCS-based global alignment) instead. It also replaces the trained classifiers with an LLM, which brings world knowledge and contextual reasoning without requiring task-specific training data. The blind/anonymous presentation of sources is borrowed from peer review and prevents the LLM from developing source-level biases.
Related work in speech:
- ROVER (Fiscus, 1997) — Statistical voting across multiple ASR outputs via word transition networks
- Ensemble Methods for ASR (Lehmann) — Random Forest classifier for selecting words from multiple ASR systems
If Whisper is not installed:

```bash
pip install mlx-whisper      # Apple Silicon (recommended)
pip install openai-whisper   # Other platforms
```

Required for alignment-based merging:

```bash
brew install wdiff   # macOS
apt install wdiff    # Ubuntu/Debian
```

The tool retries on timeouts (120s per attempt, up to 5 retries with exponential backoff). Long merges save per-chunk checkpoints, so interrupted runs resume from the last completed chunk.
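A sketch of the retry and checkpoint pattern described above; the function names, paths, and constants are illustrative rather than the tool's actual code:

```python
# Illustrative retry-with-backoff and per-chunk checkpointing.
import json
import time
from pathlib import Path

MAX_RETRIES = 5

def call_with_backoff(fn, *args):
    """Retry a flaky API call, sleeping 1s, 2s, 4s, ... between attempts."""
    for attempt in range(MAX_RETRIES):
        try:
            return fn(*args)
        except TimeoutError:                  # the real code would catch the SDK's timeout error
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(2 ** attempt)

def merge_with_checkpoints(chunks, merge_fn, checkpoint_dir="merge_chunks"):
    """Merge chunk by chunk, skipping any chunk whose checkpoint already exists."""
    out_dir = Path(checkpoint_dir)
    out_dir.mkdir(exist_ok=True)
    merged = []
    for i, chunk in enumerate(chunks):
        ckpt = out_dir / f"chunk_{i:03d}.json"
        if ckpt.exists():                     # resume: reuse completed work
            merged.append(json.loads(ckpt.read_text())["text"])
            continue
        text = call_with_backoff(merge_fn, chunk)
        ckpt.write_text(json.dumps({"text": text}))
        merged.append(text)
    return "\n".join(merged)
```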
If slide detection captures too many or too few frames, adjust the scene threshold:

```bash
python transcriber.py "..." --scene-threshold 0.05   # More slides
python transcriber.py "..." --scene-threshold 0.20   # Fewer slides
```

License: MIT
- OpenAI Whisper — Speech recognition
- MLX Whisper — Apple Silicon optimization
- yt-dlp — Media downloading
- Anthropic Claude — LLM-based adjudication and vision analysis
- wdiff — Word-level diff for alignment and comparison