Automated pipeline for producing accurate speech transcripts from video URLs. Downloads media, transcribes with multiple Whisper models, and merges all available sources — Whisper, YouTube captions, and optional external transcripts — into a single "critical text" using LLM-based adjudication.
The approach applies principles from textual criticism: multiple independent "witnesses" to the same speech are aligned, compared, and merged by an LLM that judges each difference on its merits, without knowing which source produced which reading. This builds on earlier work applying similar techniques to OCR (Ringger & Lund, 2014; Lund et al., 2013), replacing trained classifiers with an LLM as the eclectic editor.
- Critical text merging: Combines 2–3+ transcript sources into the most accurate version using blind, anonymous presentation to an LLM — no source receives preferential treatment
- wdiff-based alignment: Uses longest common subsequence alignment (via `wdiff`) to keep chunks properly aligned across sources of different lengths, replacing naive proportional slicing
- Multi-model Whisper ensembling: Runs multiple Whisper models (e.g., small + medium) and resolves disagreements via LLM
- External transcript support: Merges in human-edited transcripts (e.g., from publisher websites) as an additional source
- Structured transcript preservation: When external transcripts have speaker labels and timestamps, the merged output preserves that structure
- Slide extraction and analysis: Automatic scene detection for presentation slides, with optional vision API descriptions
- Make-style DAG pipeline: Each stage checks whether its outputs are newer than its inputs, skipping unnecessary work
- Checkpoint resumption: Long merge operations save per-chunk checkpoints, resuming from where they left off after interruption
- Cost estimation: Shows estimated API costs before running (`--dry-run` for estimation only)
- Local-only mode: `--no-api` for completely free operation (Whisper only)
```bash
# Required tools
brew install ffmpeg wdiff
pip install yt-dlp mlx-whisper

# Required for merge/ensemble features
pip install anthropic
```

This tool is optimized for Apple Silicon Macs using `mlx-whisper`. On other platforms, it falls back to `openai-whisper`:

```bash
pip install openai-whisper  # For non-Apple Silicon
```

```bash
# Basic: Whisper transcription + YouTube caption merge
python transcriber.py "https://youtube.com/watch?v=..."
# With an external human-edited transcript for three-way merge
python transcriber.py "https://youtube.com/watch?v=..." \
--external-transcript "https://example.com/transcript"
# Dry run: see what would happen and estimated costs
python transcriber.py "https://youtube.com/watch?v=..." --dry-run
# Free/local only (no API calls)
python transcriber.py "https://youtube.com/watch?v=..." --no-api
```

```bash
# Podcast or interview — skip slide extraction
python transcriber.py "https://youtube.com/watch?v=..." --no-slides
# With external transcript for higher accuracy
python transcriber.py "https://youtube.com/watch?v=..." \
--no-slides \
--external-transcript "https://example.com/transcript"
```

```bash
# Extract slides and interleave with transcript
python transcriber.py "https://youtube.com/watch?v=..."
# Also describe slide content with vision API
python transcriber.py "https://youtube.com/watch?v=..." --analyze-slides
```

```bash
# Custom output directory
python transcriber.py "https://youtube.com/watch?v=..." -o ./my_transcript
# Use specific Whisper models
python transcriber.py "https://youtube.com/watch?v=..." --whisper-models large
# Adjust slide detection sensitivity (0.0–1.0, lower = more slides)
python transcriber.py "https://youtube.com/watch?v=..." --scene-threshold 0.15
# Force re-processing (ignore existing files)
python transcriber.py "https://youtube.com/watch?v=..." --no-skip
# Verbose output
python transcriber.py "https://youtube.com/watch?v=..." -v
```

```
output_dir/
├── metadata.json             # Source URL, title, duration, etc.
├── audio.mp3                 # Downloaded audio
├── video.mp4                 # Downloaded video (if slides enabled)
├── captions.en.vtt           # YouTube captions (if available)
├── small.txt                 # Whisper small transcript
├── medium.txt                # Whisper medium transcript
├── ensembled.txt             # Ensembled from multiple Whisper models
├── medium.json               # Transcript with timestamps
├── transcript_merged.txt     # Critical text (merged from all sources)
├── analysis.md               # Source survival analysis
├── transcript.md             # Final markdown output
├── merge_chunks/             # Per-chunk checkpoints (resumable)
│   ├── .version
│   ├── chunk_000.json
│   └── ...
└── slides/                   # (if slides enabled)
    ├── slide_0001.png
    ├── slide_timestamps.json
    ├── slides_transcript.json   # (if --analyze-slides)
    └── ...
```
| Stage | Tool | API Required |
|---|---|---|
| 1. Download media | yt-dlp | No |
| 2. Transcribe audio | mlx-whisper | No |
| 3. Extract slides | ffmpeg | No |
| 4a. Ensemble Whisper models | Claude + wdiff | Yes |
| 4b. Merge transcript sources | Claude + wdiff | Yes |
| 5. Generate markdown | Python | No |
| 6. Source survival analysis | wdiff | No |
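Stage 3 relies on ffmpeg's scene-change detection, driven by `--scene-threshold`. A minimal sketch of that step, with hypothetical file names and an illustrative filter invocation rather than the pipeline's actual code:

```python
# Sketch of scene-change slide extraction with ffmpeg (illustrative; the real
# stage also records per-slide timestamps in slide_timestamps.json).
import subprocess
from pathlib import Path

def extract_slides(video_path: str, out_dir: str, threshold: float = 0.15) -> None:
    """Write one PNG per detected scene change. Lower thresholds emit more slides."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            # keep only frames whose scene-change score exceeds the threshold
            "-vf", f"select='gt(scene,{threshold})'",
            "-vsync", "vfr",                     # drop the unselected frames
            f"{out_dir}/slide_%04d.png",
        ],
        check=True,
    )

# extract_slides("video.mp4", "slides", threshold=0.15)
```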
The core innovation is treating transcript merging as textual criticism. Given 2–3+ independent "witnesses" to the same speech (the chunk-and-adjudicate loop is sketched in code below):
- Align all sources against an anchor text using `wdiff` (longest common subsequence), producing word-position maps that keep chunks synchronized even when sources differ in length
- Chunk the aligned sources into ~500-word segments
- Present each chunk to Claude with anonymous labels (Source 1, Source 2, Source 3) — source names are never revealed, preventing provenance bias
- Adjudicate — Claude chooses the best reading at each point of disagreement, preferring proper nouns, grammatical correctness, and contextual fit
- Reassemble the merged chunks, restoring speaker labels and timestamps from the structured source (if available)
When an external transcript has structure (speaker labels, timestamps), the merge preserves that skeleton while improving the text content from all sources.
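A minimal sketch of the chunk-and-adjudicate loop, assuming the sources have already been word-aligned so that chunk i of every source covers the same stretch of speech. The prompt wording, model ID, and helper names are illustrative, not the tool's actual implementation:

```python
# Illustrative chunk-and-adjudicate loop, not the tool's actual code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def merge_chunk(aligned_chunks: list[str], model: str = "claude-sonnet-4-20250514") -> str:
    """Present one aligned chunk from each source under anonymous labels
    and ask the model to produce a single best reading."""
    # Sources are numbered, never named, so provenance cannot bias the judgment.
    witnesses = "\n\n".join(
        f"Source {i + 1}:\n{chunk}" for i, chunk in enumerate(aligned_chunks)
    )
    prompt = (
        "The following are independent transcripts of the same stretch of speech.\n"
        "Merge them into the single most accurate text. Where they disagree, prefer "
        "correct proper nouns, grammatical readings, and contextual fit. "
        "Return only the merged text.\n\n" + witnesses
    )
    response = client.messages.create(
        model=model,            # model choice here is an assumption
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# merged = [merge_chunk(list(chunks)) for chunks in zip(*aligned_sources)]
```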
After merging, `wdiff -s` compares each source against the merged output:

```
Source                 Words     Common    % of Merged   % of Source
---------------------  --------  --------  ------------  ------------
Whisper (ensembled)    28,277    27,441    90%           97%
YouTube captions       30,668    28,741    94%           94%
External transcript    33,122    30,245    99%           91%
Merged output          30,524
```
This shows how much each source contributed to the final text and which source the merged output most closely resembles.
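A rough sketch of how that comparison can be produced with wdiff's statistics mode. The file names are assumptions about the working directory, and the VTT captions would first need converting to plain text:

```python
# Sketch of the survival analysis: wdiff --statistics for each source vs. the merged text.
import subprocess

SOURCES = {
    "Whisper (ensembled)": "ensembled.txt",
    "YouTube captions": "captions.txt",      # hypothetical plain-text conversion of the VTT file
    "External transcript": "external.txt",
}
MERGED = "transcript_merged.txt"

for label, path in SOURCES.items():
    # -s prints word counts and the share of common/changed/inserted/deleted words;
    # -1 -2 -3 suppress the diff body so only the statistics remain.
    result = subprocess.run(
        ["wdiff", "-s", "-1", "-2", "-3", path, MERGED],
        capture_output=True, text=True,      # wdiff exits non-zero when the files differ
    )
    print(f"{label}:\n{result.stdout.strip()}\n")
```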
When using multiple Whisper models (default: `small,medium`):
- Runs each model independently
- Uses `wdiff` to identify differences (normalized: no caps, no punctuation; see the sketch below)
- Claude resolves disagreements, preferring real words over transcription errors and proper nouns over generic alternatives
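A minimal sketch of the normalize-then-diff step. The helper names are hypothetical, and the normalization rules are an approximation of "no caps, no punctuation":

```python
# Illustrative normalize-then-diff step for the Whisper ensemble.
import re
import subprocess
import tempfile

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so only the words themselves are compared."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)      # keep word characters and apostrophes
    return re.sub(r"\s+", " ", text).strip()

def word_differences(transcript_a: str, transcript_b: str) -> str:
    """Run wdiff on the normalized transcripts and return its marked-up output.

    Deleted words appear as [-...-] and inserted words as {+...+}, which is
    enough to locate every point of disagreement for the LLM to adjudicate."""
    # Temporary files are left behind in this sketch; a real version would clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fa, \
         tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fb:
        fa.write(normalize(transcript_a))
        fb.write(normalize(transcript_b))
        paths = (fa.name, fb.name)
    # wdiff exits 1 when the files differ, so don't treat that as an error.
    result = subprocess.run(["wdiff", *paths], capture_output=True, text=True)
    return result.stdout
```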
Every stage checks `is_up_to_date(output, *inputs)` — if the output file is newer than all input files, the stage is skipped. This means you can re-run the pipeline after changing options and only the affected stages will execute.
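A plausible shape for that check (the real helper may differ in details):

```python
# Make-style freshness check: a stage runs only if its output is missing or stale.
import os

def is_up_to_date(output: str, *inputs: str) -> bool:
    """Return True if `output` exists and is newer than every input file,
    in which case the stage that produces it can be skipped."""
    if not os.path.exists(output):
        return False
    output_mtime = os.path.getmtime(output)
    return all(
        os.path.exists(path) and os.path.getmtime(path) <= output_mtime
        for path in inputs
    )

# Example: skip transcription if the transcript is newer than the audio.
# if not is_up_to_date("medium.txt", "audio.mp3"):
#     run_whisper("medium", "audio.mp3")   # hypothetical stage function
```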
```
==================================================
ESTIMATED API COSTS
==================================================
Source merging:    3 sources × 59 chunks = $1.03
Whisper ensemble:  2 models  × 59 chunks = $0.92
TOTAL: $1.95 (estimate)
==================================================
```
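A back-of-the-envelope version of how such an estimate can be computed. The per-chunk token counts and prices below are illustrative assumptions, not the tool's actual rates:

```python
# Rough cost model: each chunk is one API call whose input grows with the number of sources.
INPUT_PRICE_PER_MTOK = 3.00     # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00   # USD per million output tokens (assumed)

def estimate_merge_cost(n_chunks: int, n_sources: int,
                        tokens_per_source_chunk: int = 700,
                        output_tokens_per_chunk: int = 700) -> float:
    input_tokens = n_chunks * n_sources * tokens_per_source_chunk
    output_tokens = n_chunks * output_tokens_per_chunk
    return (input_tokens / 1e6) * INPUT_PRICE_PER_MTOK + \
           (output_tokens / 1e6) * OUTPUT_PRICE_PER_MTOK

print(f"Source merging (3 sources, 59 chunks): ~${estimate_merge_cost(59, 3):.2f}")
```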
| Feature | 20-min speech | 3-hour podcast |
|---|---|---|
| Whisper ensemble | $0.05–$0.15 | $0.50–$1.00 |
| Source merging (2 sources) | $0.10–$0.30 | $0.50–$1.00 |
| Source merging (3 sources) | $0.15–$0.40 | $1.00–$2.00 |
| Slide analysis | $0.50–$2.00 | N/A |
| `--no-api` | Free | Free |
This tool applies the principles of textual criticism — the scholarly discipline of comparing multiple manuscript witnesses to reconstruct an authoritative text — to the problem of speech transcription.
The approach has roots in earlier work applying noisy-channel models and multi-source correction to speech and OCR:
- Ringger & Allen (1996) — Error Correction via a Post-Processor for Continuous Speech Recognition (ICASSP). Introduced SpeechPP, a noisy-channel post-processor that corrects ASR output using language and channel models with Viterbi beam search, developed as part of the TRAINS/TRIPS spoken dialogue systems at the University of Rochester. Extended with a fertility channel model in Ringger & Allen, ICSLP 1996.
- Ringger & Lund (2014) — How Well Does Multiple OCR Error Correction Generalize? Demonstrated that aligning and merging outputs from multiple OCR engines significantly reduces word error rates.
- Lund et al. (2013) — Error Correction with In-Domain Training Across Multiple OCR System Outputs. Used A* alignment and trained classifiers (CRFs, MaxEnt) to choose the best reading from multiple OCR witnesses — a 52% relative decrease in word error rate.
The OCR work used A* alignment because page layout provides natural line boundaries, making alignment a series of short, bounded search problems. Speech has no such boundaries — different ASR systems segment a continuous audio stream arbitrarily — so this tool uses wdiff (LCS-based global alignment) instead. It also replaces the trained classifiers with an LLM, which brings world knowledge and contextual reasoning without requiring task-specific training data. The blind/anonymous presentation of sources is borrowed from peer review and prevents the LLM from developing source-level biases.
Related work in speech:
- ROVER (Fiscus, 1997) — Statistical voting across multiple ASR outputs via word transition networks
- Ensemble Methods for ASR (Lehmann) — Random Forest classifier for selecting words from multiple ASR systems
If Whisper is not installed:

```bash
pip install mlx-whisper      # Apple Silicon (recommended)
pip install openai-whisper   # Other platforms
```

Required for alignment-based merging:

```bash
brew install wdiff   # macOS
apt install wdiff    # Ubuntu/Debian
```

The tool retries on timeouts (120s per attempt, up to 5 retries with exponential backoff). Long merges save per-chunk checkpoints, so interrupted runs resume from the last completed chunk.
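A sketch of the retry and checkpoint pattern described above; the function names, paths, and constants are illustrative rather than the tool's actual code:

```python
# Illustrative retry-with-backoff and per-chunk checkpointing.
import json
import time
from pathlib import Path

MAX_RETRIES = 5

def call_with_backoff(fn, *args):
    """Retry a flaky API call, sleeping 1s, 2s, 4s, ... between attempts."""
    for attempt in range(MAX_RETRIES):
        try:
            return fn(*args)
        except TimeoutError:                  # the real code would catch the SDK's timeout error
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(2 ** attempt)

def merge_with_checkpoints(chunks, merge_fn, checkpoint_dir="merge_chunks"):
    """Merge chunk by chunk, skipping any chunk whose checkpoint already exists."""
    out_dir = Path(checkpoint_dir)
    out_dir.mkdir(exist_ok=True)
    merged = []
    for i, chunk in enumerate(chunks):
        ckpt = out_dir / f"chunk_{i:03d}.json"
        if ckpt.exists():                     # resume: reuse completed work
            merged.append(json.loads(ckpt.read_text())["text"])
            continue
        text = call_with_backoff(merge_fn, chunk)
        ckpt.write_text(json.dumps({"text": text}))
        merged.append(text)
    return "\n".join(merged)
```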
If slide detection captures too many or too few frames, adjust the scene threshold:

```bash
python transcriber.py "..." --scene-threshold 0.05   # More slides
python transcriber.py "..." --scene-threshold 0.20   # Fewer slides
```

License: MIT
- OpenAI Whisper — Speech recognition
- MLX Whisper — Apple Silicon optimization
- yt-dlp — Media downloading
- Anthropic Claude — LLM-based adjudication and vision analysis
- wdiff — Word-level diff for alignment and comparison