Enable true streaming for Voxtral Realtime model on XNNPACK #17440
mergennachin wants to merge 2 commits into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17440
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job, 1 Pending, 3 Unrelated Failures as of commit fd8137d with merge base c3e60d0. The new failures need attention, the cancelled job should be retried, and the remaining failures were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 3f45a9f to 509bd47 (Compare)
Adds true streaming support for the Voxtral-Mini-4B-Realtime-2602 model, enabling real-time transcription from live audio input.

Key components:
- StreamingAudioEncoderExport: encoder with KV-cached attention and conv state for chunk-by-chunk processing (8 mel frames = 80ms per step)
- StreamingSession C++ API (feed_audio/flush) for incremental audio input
- Streaming mel preprocessor (WhisperAudioProcessor with --streaming flag)
- STFT overlap windowing (320 left + 40 right) for correct mel at chunk boundaries
- Per-component quantization (--qlinear-encoder / --qlinear / --qembedding)
Force-pushed from 509bd47 to 527ae9d (Compare)
Pull request overview
This PR enables true streaming mode for the Voxtral Realtime speech-to-text model on XNNPACK, allowing incremental processing of audio input in 80ms chunks rather than requiring the full audio upfront.
Changes:
- Added streaming mode to the mel spectrogram preprocessor, with mutual-exclusivity validation against stack_output mode
- Implemented StreamingAudioEncoderExport class with KV caches and stateful convolutional layers for incremental encoder processing
- Added StreamingSession C++ class for managing streaming transcription with audio buffering, overlap handling, and text-only decoding after audio ends (a rough interface sketch follows after the file table below)
- Updated CI/build scripts to export and test models with streaming enabled by default
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| extension/audio/mel_spectrogram.py | Added streaming parameter to skip 30-second chunk padding, with validation to ensure mutual exclusivity with stack_output |
| examples/models/voxtral_realtime/model.py | Implemented StreamingAudioEncoderExport class with KV caches, SDPA, and stateful convolutions for 8-frame incremental processing |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Added export_streaming function with per-component quantization and STFT overlap metadata calculation |
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | Declared StreamingSession class with feed_audio/flush API and streaming metadata fields |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | Implemented streaming session with overlapping audio windows, mel extraction, encoder/decoder stepping, and buffer management |
| examples/models/voxtral_realtime/main.cpp | Added --streaming flag with 200ms chunk simulation for testing |
| examples/models/voxtral_realtime/model.md | Documented streaming encoder architecture, conv state management, KV caches, STFT overlap, and quantization |
| examples/models/voxtral_realtime/README.md | Added streaming usage instructions and command examples |
| .ci/scripts/test_model_e2e.sh | Enabled --streaming flag for CI testing |
| .ci/scripts/export_model_artifact.sh | Updated to export with --streaming and --qlinear-encoder flags |
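To make the pieces above easier to follow, here is a rough sketch of what the StreamingSession interface could look like, inferred from the feed_audio/flush API and the metadata fields quoted in the review threads below. The real declaration lives in voxtral_realtime_runner.h and may differ; in particular, pending_audio_ and the exact method signatures are assumptions, not copied from the source.

```cpp
// Hedged sketch only: interface shape inferred from this PR's description and
// the review snippets below, not copied from voxtral_realtime_runner.h.
#include <cstdint>
#include <vector>

class StreamingSession {
 public:
  // Buffer incoming 16 kHz PCM samples and run encoder/decoder steps whenever
  // a full 80 ms step (plus look-ahead) of audio is available.
  void feed_audio(const float* samples, int64_t num_samples);

  // Process any remaining buffered audio, then continue text-only decoding
  // after the audio ends.
  void flush();

 private:
  std::vector<float> pending_audio_;   // hypothetical buffer for unprocessed samples
  int64_t step_samples_ = 1280;        // 8 mel frames = 80 ms per encoder step
  int64_t stft_left_overlap_ = 320;    // STFT edge context, in samples
  int64_t stft_right_lookahead_ = 40;  // 2.5 ms look-ahead, in samples
  int64_t mel_skip_frames_ = 2;        // overlap mel frames dropped each step
};
```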
```cpp
    int64_t n = std::min(
        chunk_size, static_cast<int64_t>(audio_data.size()) - offset);
    session->feed_audio(audio_data.data() + offset, n);
  }
```
This is "simulating" a live microphone by chunking the audio and feeding it into the stream.
The API supports continuously reading an input stream and transcribing it, and it does the "right thing" by looking at a moving window of chunks.
The live microphone feature can be built in the application layer by end customers and demo builders (which is not especially interesting in the context of model enablement).
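To make that concrete, here is a minimal sketch of how an application could pump audio from any source (file, socket, or capture callback) into the session. It assumes the StreamingSession interface sketched earlier and the feed_audio call shown in the hunk above; run_streaming is just an illustrative name, not a function in this PR.

```cpp
// Sketch only: drives StreamingSession from a caller-provided buffer.
// The chunk size is arbitrary; main.cpp uses 200 ms chunks to simulate a mic.
#include <algorithm>
#include <cstdint>
#include <vector>

void run_streaming(StreamingSession* session,
                   const std::vector<float>& audio_data,
                   int64_t chunk_size /* e.g. 3200 samples = 200 ms @ 16 kHz */) {
  for (int64_t offset = 0; offset < static_cast<int64_t>(audio_data.size());
       offset += chunk_size) {
    int64_t n = std::min(
        chunk_size, static_cast<int64_t>(audio_data.size()) - offset);
    session->feed_audio(audio_data.data() + offset, n);  // transcribes incrementally
  }
  session->flush();  // drain buffered audio and finish text-only decoding
}
```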
```md
Instead of zero-padding (offline) or recompute-with-overlap (vLLM),
explicit conv state carries the tail of the previous chunk:

- **Conv1** (kernel=3, stride=1): state = last 2 mel frames from previous
```
Actually, I think we need the last 4 mel frames instead of the last 2.
Our look-back is 52.5 ms, which AFAIU corresponds to
4 log-mel frames (10 ms each), with the last frame looking back a further 12.5 ms because it has a window size of 25 ms => 4 * 10 ms + 12.5 ms = 52.5 ms.
The conv state of 2 is correct. It's kernel_size - 1 for both CausalConv1d layers (kernel_size=3). This is the exact amount needed for a causal convolution to produce identical output to processing contiguous audio.
The streaming_look_back_ms = 52.5 in tekken.json serves a different purpose. The look-back bundles STFT context, conv context, and encoder context into one value.
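For intuition, here is a self-contained, single-channel sketch (illustration only, not the model code, which operates on multi-channel mel frames) of why carrying kernel_size - 1 inputs of state reproduces the contiguous result exactly:

```cpp
// Illustration only: a stride-1 causal conv that carries kernel_size - 1
// inputs between chunks matches running the same conv over the full signal.
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<float> causal_conv1d_step(
    const std::vector<float>& chunk,
    const std::vector<float>& kernel,  // kernel_size = 3, as in the Voxtral convs
    std::vector<float>& state) {       // last kernel_size - 1 inputs (zeros at start)
  assert(state.size() == kernel.size() - 1);
  std::vector<float> padded = state;
  padded.insert(padded.end(), chunk.begin(), chunk.end());

  std::vector<float> out;  // one output per new input (stride 1)
  for (size_t i = 0; i + kernel.size() <= padded.size(); ++i) {
    float acc = 0.f;
    for (size_t k = 0; k < kernel.size(); ++k) {
      acc += padded[i + k] * kernel[k];
    }
    out.push_back(acc);
  }

  // Carry the last kernel_size - 1 inputs into the next chunk's left context.
  state.assign(padded.end() - (kernel.size() - 1), padded.end());
  return out;
}
```

Because each output only ever needs the kernel_size - 1 previous inputs, 2 carried frames per kernel_size=3 layer are sufficient; the 52.5 ms streaming_look_back_ms bundles STFT, conv, and encoder context into one value, as noted above.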
```cpp
  int64_t step_samples_ = 1280;

  // STFT overlap for streaming mel computation (read from model metadata).
  int64_t stft_left_overlap_ = 320;
```
Suggested change:
```diff
- int64_t stft_left_overlap_ = 320;
+ int64_t stft_left_overlap_ = 840;
```
I think our left overlap should be 52.5 ms; see tekken.json:

```json
},
"transcription_delay_ms": 480,
"streaming_look_ahead_ms": 2.5,
"streaming_look_back_ms": 52.5,
"streaming_n_left_pad_tokens": 32,
"transcription_format": "streaming"
}
```
Our design handles these separately. Conv state carries the exact conv context (2 frames), and the encoder KV cache provides full attention history. So the left overlap only needs to cover the STFT window edge effect (n_fft/2 = 200 samples), and 320 is the aligned minimum for that.
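For reference, here is the arithmetic behind that aligned minimum, assuming the Whisper-style front end the numbers above imply (16 kHz audio, n_fft = 400, hop = 160); these constants are restated here for illustration, not read from the code:

```cpp
#include <cstdint>

// Illustration of the overlap arithmetic described above.
constexpr int64_t kHop = 160;         // 10 ms per mel frame at 16 kHz
constexpr int64_t kNFft = 400;        // 25 ms STFT window
constexpr int64_t kEdge = kNFft / 2;  // 200 samples of left context for the first frame

// Round the required edge context up to a whole number of hops so the step's
// mel frames stay aligned with the 1280-sample (8-frame) step size.
constexpr int64_t kLeftOverlap = ((kEdge + kHop - 1) / kHop) * kHop;
static_assert(kLeftOverlap == 320, "aligned minimum left overlap");
```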
```cpp
  // STFT overlap for streaming mel computation (read from model metadata).
  int64_t stft_left_overlap_ = 320;
  int64_t stft_right_lookahead_ = 40;
  int64_t mel_skip_frames_ = 2;
```
Suggested change:
```diff
- int64_t mel_skip_frames_ = 2;
+ int64_t mel_skip_frames_ = 4;
```
I think that this should actually be 4.
There's a slight difference between vLLM and ExecuTorch, but they end up producing the same result.

**vLLM: recompute-with-overlap.** Each step: 2160 raw audio samples → mel (~13 frames) → conv (~6 post-conv frames) → encoder (all frames, with KV cache) → downsample → audio tokens. The look_back region is re-processed every step: its mel, conv output, and encoder output are all recomputed alongside the new frames.

**Ours: state-based.** Each step takes a smaller window, [320 overlap + 1280 step + 40 look_ahead] = 1640 samples, and processes only the new data, using explicit state for context: 1640 raw audio samples → mel (10 frames) → extract 8 aligned frames.

**Why they produce the same result.** Mel spectrogram: both have ≥ n_fft/2 (200 samples) of left context for the step's first mel frame. Our 320 ≥ 200, so the 8 extracted mel frames have identical values to vLLM's corresponding frames. vLLM computes extra mel frames in the look_back region, but they're just for context. I think you could implement a similar approach in vLLM and make it more efficient too.
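A short sketch of the per-step bookkeeping described above, using the figures quoted in this comment (values restated here purely for illustration):

```cpp
#include <cstdint>

// Per-step window composition for the state-based approach described above.
constexpr int64_t kHop = 160;              // 10 ms mel hop at 16 kHz
constexpr int64_t kStftLeftOverlap = 320;  // covers the n_fft/2 edge effect
constexpr int64_t kStepSamples = 1280;     // 80 ms of new audio per step
constexpr int64_t kRightLookahead = 40;    // 2.5 ms look-ahead

constexpr int64_t kWindow =
    kStftLeftOverlap + kStepSamples + kRightLookahead;    // 1640 samples
constexpr int64_t kMelFrames = kWindow / kHop;            // 10 mel frames
constexpr int64_t kSkipFrames = kStftLeftOverlap / kHop;  // 2 overlap frames dropped
constexpr int64_t kNewFrames = kMelFrames - kSkipFrames;  // 8 frames to the encoder

static_assert(kWindow == 1640 && kMelFrames == 10 && kNewFrames == 8,
              "per-step window math");
```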
Follow-up to #17431
Enable Voxtral Realtime streaming on XNNPACK
Adds true streaming support for the Voxtral-Mini-4B-Realtime-2602 model,
enabling real-time transcription from live audio input.
Key components:
- StreamingAudioEncoderExport: encoder with KV-cached attention and conv state for chunk-by-chunk processing (8 mel frames = 80ms per step)
- StreamingSession C++ API (feed_audio/flush) for incremental audio input
- Streaming mel preprocessor (WhisperAudioProcessor with --streaming flag)
- STFT overlap windowing (320 left + 40 right) for correct mel at chunk boundaries
- Per-component quantization (--qlinear-encoder / --qlinear / --qembedding)
Test Plan:
https://github.com/pytorch/executorch/actions/runs/21993469202/job/63546903665?pr=17440