
Speech Transcriber

Automated pipeline for producing accurate speech transcripts from video URLs. Downloads media, transcribes with multiple Whisper models, and merges all available sources — Whisper, YouTube captions, and optional external transcripts — into a single "critical text" using LLM-based adjudication.

The approach applies principles from textual criticism: multiple independent "witnesses" to the same speech are aligned, compared, and merged by an LLM that judges each difference on its merits, without knowing which source produced which reading. This builds on earlier work applying similar techniques to OCR (Ringger & Lund, 2014; Lund et al., 2013), replacing trained classifiers with an LLM as the eclectic editor.

Features

  • Critical text merging: Combines 2–3+ transcript sources into the most accurate version using blind, anonymous presentation to an LLM — no source receives preferential treatment
  • wdiff-based alignment: Uses longest common subsequence alignment (via wdiff) to keep chunks properly aligned across sources of different lengths, replacing naive proportional slicing
  • Multi-model Whisper ensembling: Runs multiple Whisper models (e.g., small + medium) and resolves disagreements via LLM
  • External transcript support: Merges in human-edited transcripts (e.g., from publisher websites) as an additional source
  • Structured transcript preservation: When external transcripts have speaker labels and timestamps, the merged output preserves that structure
  • Slide extraction and analysis: Automatic scene detection for presentation slides, with optional vision API descriptions
  • Make-style DAG pipeline: Each stage checks whether its outputs are newer than its inputs, skipping unnecessary work
  • Checkpoint resumption: Long merge operations save per-chunk checkpoints, resuming from where they left off after interruption
  • Cost estimation: Shows estimated API costs before running (--dry-run for estimation only)
  • Local-only mode: --no-api for completely free operation (Whisper only)

Installation

Dependencies

# Required tools
brew install ffmpeg wdiff
pip install yt-dlp mlx-whisper

# Required for merge/ensemble features
pip install anthropic

Apple Silicon

This tool is optimized for Apple Silicon Macs using mlx-whisper. On other platforms, it falls back to openai-whisper:

pip install openai-whisper  # For non-Apple Silicon

Quick Start

# Basic: Whisper transcription + YouTube caption merge
python transcriber.py "https://youtube.com/watch?v=..."

# With an external human-edited transcript for three-way merge
python transcriber.py "https://youtube.com/watch?v=..." \
    --external-transcript "https://example.com/transcript"

# Dry run: see what would happen and estimated costs
python transcriber.py "https://youtube.com/watch?v=..." --dry-run

# Free/local only (no API calls)
python transcriber.py "https://youtube.com/watch?v=..." --no-api

Usage Examples

Speech-Only (No Slides)

# Podcast or interview — skip slide extraction
python transcriber.py "https://youtube.com/watch?v=..." --no-slides

# With external transcript for higher accuracy
python transcriber.py "https://youtube.com/watch?v=..." \
    --no-slides \
    --external-transcript "https://example.com/transcript"

Presentation with Slides

# Extract slides and interleave with transcript
python transcriber.py "https://youtube.com/watch?v=..."

# Also describe slide content with vision API
python transcriber.py "https://youtube.com/watch?v=..." --analyze-slides

Custom Options

# Custom output directory
python transcriber.py "https://youtube.com/watch?v=..." -o ./my_transcript

# Use specific Whisper models
python transcriber.py "https://youtube.com/watch?v=..." --whisper-models large

# Adjust slide detection sensitivity (0.0–1.0, lower = more slides)
python transcriber.py "https://youtube.com/watch?v=..." --scene-threshold 0.15

# Force re-processing (ignore existing files)
python transcriber.py "https://youtube.com/watch?v=..." --no-skip

# Verbose output
python transcriber.py "https://youtube.com/watch?v=..." -v

Output Files

output_dir/
├── metadata.json                 # Source URL, title, duration, etc.
├── audio.mp3                     # Downloaded audio
├── video.mp4                     # Downloaded video (if slides enabled)
├── captions.en.vtt               # YouTube captions (if available)
├── small.txt                     # Whisper small transcript
├── medium.txt                    # Whisper medium transcript
├── ensembled.txt                 # Ensembled from multiple Whisper models
├── medium.json                   # Transcript with timestamps
├── transcript_merged.txt         # Critical text (merged from all sources)
├── analysis.md                   # Source survival analysis
├── transcript.md                 # Final markdown output
├── merge_chunks/                 # Per-chunk checkpoints (resumable)
│   ├── .version
│   ├── chunk_000.json
│   └── ...
└── slides/                       # (if slides enabled)
    ├── slide_0001.png
    ├── slide_timestamps.json
    ├── slides_transcript.json    # (if --analyze-slides)
    └── ...

Pipeline Stages

Stage                         Tool            API Required
----------------------------  --------------  ------------
1. Download media             yt-dlp          No
2. Transcribe audio           mlx-whisper     No
3. Extract slides             ffmpeg          No
4a. Ensemble Whisper models   Claude + wdiff  Yes
4b. Merge transcript sources  Claude + wdiff  Yes
5. Generate markdown          Python          No
6. Source survival analysis   wdiff           No

How It Works

Critical Text Merging

The core innovation is treating transcript merging as textual criticism. Given 2–3+ independent "witnesses" to the same speech:

  1. Align all sources against an anchor text using wdiff (longest common subsequence), producing word-position maps that keep chunks synchronized even when sources differ in length
  2. Chunk the aligned sources into ~500-word segments
  3. Present each chunk to Claude with anonymous labels (Source 1, Source 2, Source 3) — source names are never revealed, preventing provenance bias
  4. Adjudicate — Claude chooses the best reading at each point of disagreement, preferring proper nouns, grammatical correctness, and contextual fit (sketched below)
  5. Reassemble the merged chunks, restoring speaker labels and timestamps from the structured source (if available)

When an external transcript has structure (speaker labels, timestamps), the merge preserves that skeleton while improving the text content from all sources.
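
Here is a minimal sketch of the adjudication in steps 3–4, assuming the aligned ~500-word chunks from steps 1–2 are already in hand. The function name, prompt wording, and model ID are illustrative, not this repo's actual internals:

import random
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def adjudicate_chunk(versions: list[str]) -> str:
    """Merge one aligned chunk; sources are shuffled and labeled anonymously."""
    shuffled = random.sample(versions, len(versions))  # hide provenance
    sources = "\n\n".join(f"Source {i + 1}:\n{t}" for i, t in enumerate(shuffled))
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model ID
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "These are independent transcripts of the same passage of speech. "
                "At each point of disagreement, choose the best reading on its "
                "merits (proper nouns, grammar, contextual fit). "
                "Return only the merged text.\n\n" + sources
            ),
        }],
    )
    return response.content[0].text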

Source Survival Analysis

After merging, wdiff -s compares each source against the merged output:

Source                       Words   Common  % of Merged  % of Source
------------------------- -------- -------- ------------ ------------
Whisper (ensembled)         28,277   27,441          90%          97%
YouTube captions            30,668   28,741          94%          94%
External transcript         33,122   30,245          99%          91%
Merged output               30,524

This shows how much each source contributed to the final text and which source the merged output most closely resembles.
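
A rough sketch of how those numbers can be gathered, shelling out to wdiff in statistics mode; the summary-line regex is an assumption about wdiff's output format, not code taken from this repo:

import re
import subprocess

def wdiff_stats(source_path: str, merged_path: str) -> list[tuple[int, int]]:
    """Return (word_count, common_words) pairs for the source and merged files."""
    result = subprocess.run(
        ["wdiff", "-s", "-1", "-2", "-3", source_path, merged_path],
        capture_output=True, text=True,  # wdiff exits 1 when files differ; that's fine
    )
    # Expected summary lines look like: "source.txt: 28277 words  27441 97% common ..."
    return [(int(words), int(common))
            for words, common in re.findall(r"(\d+) words\s+(\d+)\s+\d+% common",
                                            result.stdout)]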

Multi-Model Ensembling

When using multiple Whisper models (default: small,medium):

  1. Runs each model independently
  2. Uses wdiff to identify differences on normalized text (no capitalization, no punctuation; see the sketch below)
  3. Claude resolves disagreements, preferring real words over transcription errors and proper nouns over generic alternatives
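
A sketch of the normalization in step 2, so that wdiff flags only genuine word-level disagreements; the exact rules the tool applies may differ:

import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before diffing, per step 2 above."""
    text = re.sub(r"[^\w\s']", " ", text.lower())  # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Don't worry, Dr. Smith!"))  # -> "don't worry dr smith"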

Make-Style Staleness Checks

Every stage checks is_up_to_date(output, *inputs) — if the output file is newer than all input files, the stage is skipped. This means you can re-run the pipeline after changing options and only the affected stages will execute.
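
A minimal sketch of that check, in the spirit of make; the actual implementation may treat missing inputs differently:

import os

def is_up_to_date(output: str, *inputs: str) -> bool:
    """Skip a stage when its output exists and is newer than every existing input."""
    if not os.path.exists(output):
        return False
    out_mtime = os.path.getmtime(output)
    return all(os.path.getmtime(p) <= out_mtime
               for p in inputs if os.path.exists(p))

# e.g. skip transcription when small.txt is newer than audio.mp3
if is_up_to_date("small.txt", "audio.mp3"):
    print("transcription up to date, skipping")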

Cost Estimation

==================================================
ESTIMATED API COSTS
==================================================
  Source merging: 3 sources × 59 chunks = $1.03
  Whisper ensemble: 2 models × 59 chunks = $0.92

  TOTAL: $1.95 (estimate)
==================================================
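
The arithmetic is straightforward: the word count determines the chunk count, and each chunk costs one API call per merge pass. A back-of-envelope sketch; the per-call price below is a placeholder inferred from the sample output, not a constant from the tool:

import math

WORDS_PER_CHUNK = 500         # the ~500-word chunks used for merging
COST_PER_MERGE_CALL = 0.0175  # placeholder $/call; actual price depends on the model

def estimate_merge_cost(total_words: int) -> float:
    chunks = math.ceil(total_words / WORDS_PER_CHUNK)
    return chunks * COST_PER_MERGE_CALL

print(estimate_merge_cost(29_500))  # 59 chunks -> ~$1.03, as in the sample above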

Typical Costs

Feature                     20-min speech  3-hour podcast
--------------------------  -------------  --------------
Whisper ensemble            $0.05–$0.15    $0.50–$1.00
Source merging (2 sources)  $0.10–$0.30    $0.50–$1.00
Source merging (3 sources)  $0.15–$0.40    $1.00–$2.00
Slide analysis              $0.50–$2.00    N/A
--no-api                    Free           Free

Background

This tool applies the principles of textual criticism — the scholarly discipline of comparing multiple manuscript witnesses to reconstruct an authoritative text — to the problem of speech transcription.

The approach has roots in earlier work applying noisy-channel models and multi-source correction to speech and OCR (Ringger & Lund, 2014; Lund et al., 2013).

The OCR work used A* alignment because page layout provides natural line boundaries, making alignment a series of short, bounded search problems. Speech has no such boundaries — different ASR systems segment a continuous audio stream arbitrarily — so this tool uses wdiff (LCS-based global alignment) instead. It also replaces the trained classifiers with an LLM, which brings world knowledge and contextual reasoning without requiring task-specific training data. The blind/anonymous presentation of sources is borrowed from peer review and prevents the LLM from developing source-level biases.

Related work in speech:

  • ROVER (Fiscus, 1997) — Statistical voting across multiple ASR outputs via word transition networks
  • Ensemble Methods for ASR (Lehmann) — Random Forest classifier for selecting words from multiple ASR systems

Troubleshooting

"No Whisper implementation found"

pip install mlx-whisper    # Apple Silicon (recommended)
pip install openai-whisper # Other platforms

wdiff not found

Required for alignment-based merging:

brew install wdiff  # macOS
apt install wdiff   # Ubuntu/Debian

API timeouts

The tool retries on timeouts (120s per attempt, up to 5 retries with exponential backoff). Long merges save per-chunk checkpoints, so interrupted runs resume from the last completed chunk.
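
A sketch of that retry loop, assuming the anthropic SDK's per-request timeout and its APITimeoutError; the real code may catch a broader set of transient errors:

import time
import anthropic

def call_with_retries(client: anthropic.Anthropic, max_retries: int = 5, **request):
    """Retry timed-out API calls with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(timeout=120.0, **request)
        except anthropic.APITimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)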

ffmpeg scene detection captures too few or too many slides

python transcriber.py "..." --scene-threshold 0.05  # More slides
python transcriber.py "..." --scene-threshold 0.20  # Fewer slides

License

MIT
