A Python CLI tool that transforms audio/video meeting recordings into structured transcriptions and to-do lists using OpenAI APIs (Whisper for transcription and GPT for analysis).
- Automatic transcription of audio and video files using Whisper/GPT-4o
- Intelligent extraction of tasks, decisions, and action items
- Large file handling with automatic chunking
- Multi-format support (MP4, MOV, WAV, MP3, M4A, etc.)
- Parallel processing to speed up transcription
- Automatic retry on API errors with exponential backoff
- Precise timestamps for each section
- Structured output in Markdown format
- 💰 Cost optimization with transcription-only mode (`-t` flag)
- Enhanced error handling with response validation and diagnostics
- Python 3.8+ installed on your system
- ffmpeg installed:
  ```bash
  # macOS
  brew install ffmpeg

  # Ubuntu/Debian
  sudo apt update && sudo apt install ffmpeg

  # Windows
  # Download from https://ffmpeg.org/download.html
  ```
- OpenAI account with API key
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd RecordingToTasks
  ```
- Run the setup script:

  ```bash
  ./setup.sh
  ```
  The script will automatically:
  - Install ffmpeg (if not present)
  - Create a Python virtual environment
  - Install all required dependencies
  - Create the `.env` file from the template
- Configure your API key:

  Edit the `.env` file and insert your OpenAI API key:

  ```bash
  OPENAI_API_KEY=sk-your-actual-api-key-here
  ```
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd RecordingToTasks
  ```
- Create and activate a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables:

  ```bash
  cp env.example .env
  ```

  Edit the `.env` file and insert your OpenAI API key:

  ```bash
  OPENAI_API_KEY=sk-your-actual-api-key-here
  ```
Verify the installation:

```bash
python main.py --help
```

Or run the setup test:

```bash
python test_setup.py
```

Process an audio/video file (transcription + analysis):

```bash
python main.py /path/to/your/recording.mp4
```

If you already have a transcription file, you can skip the expensive transcription step and run only the task analysis:

```bash
# Use existing transcription (SAVES MONEY!)
python main.py -t output/recording_transcription.txt
python main.py --transcription transcription.txt
```

Cost Comparison:
- Full processing (37 min video): ~$0.23 (transcription) + ~$0.01 (analysis) = $0.24
- Transcription-only mode: ~$0.01 (analysis only) = 96% cheaper!
This is perfect for:
- Testing prompt changes
- Debugging analysis issues
- Re-running analysis with different settings
- Processing the same recording multiple times
```bash
# Transcribe a video file
python main.py meeting_2024_01_15.mp4

# Transcribe an audio file
python main.py call_with_client.wav

# Process multiple files in sequence
python main.py file1.mp4 file2.wav file3.m4a

# Use existing transcription (cost-saving mode)
python main.py -t output/meeting_transcription.txt

# Show help
python main.py --help
```

Supported formats:

- Audio: `.wav`, `.mp3`, `.m4a`, `.flac`, `.aac`, `.ogg`, `.wma`
- Video: `.mp4`, `.mov`, `.avi`, `.mkv`, `.wmv`, `.flv`, `.webm`, `.m4v`
- Transcriptions: `.txt` (with `-t`/`--transcription` flag)
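As a rough illustration, routing an input file by extension could look like this (a hypothetical helper, not the actual main.py code; the extension sets are taken from the lists above):

```python
# Hypothetical helper: decide how an input file should be handled.
from pathlib import Path

AUDIO = {".wav", ".mp3", ".m4a", ".flac", ".aac", ".ogg", ".wma"}
VIDEO = {".mp4", ".mov", ".avi", ".mkv", ".wmv", ".flv", ".webm", ".m4v"}

def input_kind(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext in AUDIO:
        return "audio"
    if ext in VIDEO:
        return "video"  # video is converted to audio (ffmpeg) before transcription
    if ext == ".txt":
        return "transcription"  # used with -t/--transcription
    raise ValueError(f"Unsupported format: {ext}")

print(input_kind("meeting_2024_01_15.mp4"))  # video
```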
The tool generates files in the output/ folder:
- `filename_transcription.txt` - Complete transcription with timestamps (not generated in `-t` mode)
- `filename_tasks.md` - Structured AI-powered analysis:
  - Executive Summary - Concise 2-3 sentence summary of concrete decisions
  - Action Items / To-Do List - Tasks organized by:
    - Category (development, testing, documentation, infrastructure, meeting, other)
    - Priority (high 🔴, medium 🟡, low 🟢)
    - Detailed description, responsible party, deadline, and context
    - Markdown checkbox format for tracking: `- [ ] Task`
  - Decisions Made - List of concrete decisions
  - Next Steps - Identified future actions
  - Additional Notes - Technical references, links, relevant information
Intelligent Filtering: The AI distinguishes between generic discussions and actionable tasks, automatically ignoring casual conversation and including only concrete commitments.
```bash
# API Configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_ORG_ID=your_org_id_here  # Optional

# ============================================
# PRESET CONFIGURATIONS (2025 Optimized)
# ============================================

# PRESET 1: PREMIUM (Maximum Quality) ⭐ DEFAULT
# Cost: $0.367/hour | Quality: 10/10
TRANSCRIPTION_MODEL=gpt-4o-transcribe
ANALYSIS_MODEL=gpt-5-mini

# PRESET 2: RECOMMENDED (Best Value)
# Cost: $0.181/hour | Quality: 9/10 (51% cheaper)
# TRANSCRIPTION_MODEL=gpt-4o-mini-transcribe
# ANALYSIS_MODEL=gpt-5-nano

# PRESET 3: BALANCED
# Cost: $0.187/hour | Quality: 9.5/10
# TRANSCRIPTION_MODEL=gpt-4o-mini-transcribe
# ANALYSIS_MODEL=gpt-5-mini
# ============================================

# Language for transcription (ISO-639-1 format, improves accuracy)
# it=Italian, en=English, es=Spanish, fr=French, de=German
LANGUAGE=en

# Speaker Diarization (optional, requires pyannote-audio)
ENABLE_DIARIZATION=false  # Identify speakers (true/false)
NUM_SPEAKERS=2            # Number of speakers (if diarization=true)

# Processing Configuration
MAX_RETRIES=3             # Retry attempts on errors
MAX_PARALLEL_TASKS=3      # Parallel transcription workers
SIZE_LIMIT_MB=20          # File size limit for chunking (MB)
```

| Preset | Transcription | Analysis | Cost/hour | Quality | When to Use |
|---|---|---|---|---|---|
| Premium ⭐ | gpt-4o-transcribe | gpt-5-mini | $0.367 | 10/10 | Default - Critical meetings, maximum accuracy |
| Recommended | gpt-4o-mini-transcribe | gpt-5-nano | $0.181 | 9/10 | Limited budget, frequent use (51% savings) |
| Balanced | gpt-4o-mini-transcribe | gpt-5-mini | $0.187 | 9.5/10 | Complex task analysis, budget-conscious |
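At runtime these presets are just environment variables; reading them could be sketched like this (a stdlib-only illustration — the project itself loads `.env` via python-dotenv, and `env_int` is a hypothetical helper, not the actual main.py code):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default

# Defaults mirror the Premium preset shown in env.example above.
TRANSCRIPTION_MODEL = os.environ.get("TRANSCRIPTION_MODEL", "gpt-4o-transcribe")
ANALYSIS_MODEL = os.environ.get("ANALYSIS_MODEL", "gpt-5-mini")
MAX_RETRIES = env_int("MAX_RETRIES", 3)
MAX_PARALLEL_TASKS = env_int("MAX_PARALLEL_TASKS", 3)
SIZE_LIMIT_MB = env_int("SIZE_LIMIT_MB", 20)
```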
Transcription Models:
| Model | Price | WER | Quality | Notes |
|---|---|---|---|---|
| gpt-4o-transcribe ⭐ | $0.006/min | 2.46% | 10/10 | Near-human accuracy, excellent for difficult audio |
| gpt-4o-mini-transcribe | $0.003/min | 8.9% | 9/10 | Excellent quality/price ratio, 50% cheaper |
| whisper-1 (legacy) | $0.006/min | 7.88% | 7/10 | Deprecated, no advantage vs new models |
WER = Word Error Rate (lower is better)
Analysis Models:
| Model | Input/Output (per 1M tokens) | MMLU | Quality | Notes |
|---|---|---|---|---|
| gpt-5-mini ⭐ | $0.25 / $2.00 | ~88% | 10/10 | Advanced reasoning, excellent for complex tasks |
| gpt-5-nano | $0.05 / $0.40 | ~82% | 9/10 | Perfect for structured task extraction |
| gpt-4o-mini | $0.15 / $0.60 | 82% | 8/10 | Economical alternative, proven and reliable |
MMLU = Massive Multitask Language Understanding (higher is better)
Costs depend on recording length.
- Transcription: gpt-4o-transcribe ($0.006/min)
- Analysis: gpt-5-mini ($0.25/$2.00 per 1M tokens)
- Total 1 hour: $0.367 (maximum quality, WER 2.46%)
| Component | Model | Calculation | Cost |
|---|---|---|---|
| Transcription | gpt-4o-transcribe | 60 min × $0.006/min | $0.360 |
| Analysis | gpt-5-mini | 12,500 input × $0.25/1M + 1,750 output × $2.00/1M | $0.003 + $0.004 |
| TOTAL | | | $0.367 |
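The totals above are simple arithmetic; reproducing the Premium calculation (token counts are the estimates from the table, prices as listed):

```python
minutes = 60
transcription = minutes * 0.006              # gpt-4o-transcribe: $0.006/min -> $0.360
input_tokens, output_tokens = 12_500, 1_750  # typical 1-hour meeting transcript
analysis = input_tokens * 0.25 / 1e6 + output_tokens * 2.00 / 1e6  # ~$0.007
total = transcription + analysis
print(f"${total:.3f}")  # $0.367
```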
| Preset | Transcription | Analysis | Total | Quality | Savings |
|---|---|---|---|---|---|
| Premium ⭐ (default) | gpt-4o-transcribe | gpt-5-mini | $0.367 | 10/10 | - |
| Recommended | gpt-4o-mini-transcribe | gpt-5-nano | $0.181 | 9/10 | 51% |
| Balanced | gpt-4o-mini-transcribe | gpt-5-mini | $0.187 | 9.5/10 | 49% |
| Legacy (old) | whisper-1 | gpt-4o-mini | $0.363 | 7/10 | 1% |
Cost Notes:
- Premium preset offers maximum quality (WER 2.46% vs 8.9%)
- Recommended preset saves 51% while maintaining excellent quality (9/10)
- Costs calculated for 1-hour meetings (~12,500 input tokens, 1,750 output)
- For frequent use, consider Recommended preset to optimize costs
- Use the `-t` flag with existing transcriptions to save 96% on re-processing
The tool automatically handles large files:
- Automatic chunking: Files > 20MB are split into chunks
- Parallel processing: Multiple chunks processed simultaneously
- Timeline reconstruction: Timestamps preserved in final output
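The chunk-and-parallelize strategy can be sketched as follows (illustrative only; `transcribe_chunk` is a stand-in for the real per-chunk API call, not the actual main.py code):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_TASKS = 3  # mirrors the .env setting

def transcribe_chunk(index: int) -> str:
    # Stand-in for the real API call on one audio chunk.
    return f"[chunk {index}]"

def transcribe_all(chunks: list) -> str:
    # pool.map() yields results in input order even though the work runs
    # in parallel -- that is what lets the timeline be reconstructed.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_TASKS) as pool:
        return "\n".join(pool.map(transcribe_chunk, chunks))

print(transcribe_all([0, 1, 2, 3]))
```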
```
RecordingToTasks/
├── main.py            # Main script
├── requirements.txt   # Python dependencies
├── setup.sh           # Automatic installation script
├── test_setup.py      # Setup verification test
├── .env               # Configuration (not committed)
├── env.example        # Configuration template
├── README.md          # Documentation
├── CLAUDE.md          # Technical documentation
├── .gitignore         # Files to ignore
├── venv/              # Virtual environment
├── temp/              # Temporary files
├── output/            # Output files
└── tests/             # Test scripts and samples
```
- openai: OpenAI API client
- python-dotenv: Environment variable management
- ffmpeg: Audio/video processing (external dependency)
- pyannote.audio (optional): Speaker diarization to identify who is speaking
- Requires Hugging Face account and token
- Enable with `ENABLE_DIARIZATION=true` in the `.env` file
- Fork the repository
- Create a feature branch: `git checkout -b feature/feature-name`
- Commit your changes: `git commit -am 'Add new feature'`
- Push the branch: `git push origin feature/feature-name`
- Open a Pull Request
"ffmpeg not found"

```bash
# Verify installation
ffmpeg -version
# If not installed, follow the prerequisites
```

"OpenAI API key not found"

```bash
# Verify the .env file
cat .env
# Make sure the key is correct
```

"File too large"

- The tool automatically handles large files
- Increase `SIZE_LIMIT_MB` in `.env` if needed
Transcription errors
- The tool automatically retries with exponential backoff
- Check internet connection
- Verify OpenAI API rate limits
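The retry behavior follows the standard exponential-backoff pattern, roughly like this (a sketch, not the actual main.py code; `base_delay` is an illustrative parameter):

```python
import time

MAX_RETRIES = 3  # mirrors the .env setting

def with_retries(fn, *args, base_delay=1.0, **kwargs):
    """Call fn, retrying on failure with exponentially growing delays."""
    for attempt in range(MAX_RETRIES):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```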
JSON parsing errors
- Now includes enhanced diagnostics with `finish_reason` validation
- Increased token limit from 3000 to 8000 for long transcriptions
- Check logs for detailed error information
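The `finish_reason` check amounts to refusing to parse a truncated response, along these lines (a sketch using plain dicts in place of the OpenAI response objects; `parse_analysis` is hypothetical):

```python
import json

def parse_analysis(choice: dict) -> dict:
    # finish_reason "length" means the model hit the output token limit,
    # so the JSON is almost certainly truncated -- parsing it would fail
    # or, worse, silently drop tasks.
    if choice["finish_reason"] != "stop":
        raise ValueError(
            f"Incomplete response (finish_reason={choice['finish_reason']}); "
            "try raising the output token limit"
        )
    return json.loads(choice["message"]["content"])
```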
For more detailed debugging, temporarily modify main.py:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

MIT License - see LICENSE file for details
For bug reports or feature requests, open an issue on GitHub.
Note: This tool is optimized for Italian and English meetings. For other languages, you may need to modify the analysis prompts in main.py.