Enable Voxtral Realtime on XNNPACK (CPU)#17431
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17431
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 1 Pending as of commit c8784d8 with merge base f08db65. NEW FAILURES: the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Pull request overview
This PR adds Mistral's Voxtral-Mini-4B-Realtime-2602 streaming speech-to-text model to ExecuTorch with XNNPACK backend support. The implementation is self-contained with direct checkpoint loading (no HuggingFace dependency) and includes three phases: eager model implementation with multi-method export, C++ runner for offline transcription, and hooks for future streaming support.
Changes:
- Introduces a shared quantization module (extension/llm/export/quantize.py) for TorchAO source-transform quantization, supporting 4w/8w/8da4w/8da8w for linear layers and 4w/8w for embeddings (see the sketch after this list)
- Implements the Voxtral Realtime model with a custom audio encoder, text decoder, and element-wise audio+text embedding fusion
- Adds C++ runner with mel spectrogram preprocessing, audio encoding, and autoregressive decoding
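A minimal sketch of how the shared module's 8da4w (linear) and 8w (embedding) options could map onto TorchAO's `quantize_` API; the actual extension/llm/export/quantize.py may use different config classes, option names, and defaults, so treat this as illustrative only:

```python
# Sketch only: TorchAO source-transform quantization for LLM export.
# Config classes and option names here are assumptions; see
# extension/llm/export/quantize.py for the real implementation.
import torch
from torchao.quantization import (
    Int8DynamicActivationInt4WeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)
from torchao.quantization.granularity import PerAxis


def quantize_llm(model: torch.nn.Module, group_size: int = 32) -> torch.nn.Module:
    # Linear layers: 8-bit dynamic activations + 4-bit grouped weights ("8da4w").
    quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=group_size))
    # Embeddings: 8-bit weight-only, per-channel ("8w").
    quantize_(
        model,
        IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerAxis(0)),
        filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding),
    )
    return model
```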
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| extension/llm/export/quantize.py | New shared TorchAO quantization module for LLM export with support for various weight-only and dynamic activation quantization schemes |
| extension/llm/export/BUCK | Added quantize.py to build configuration |
| examples/models/voxtral_realtime/model.py | Complete eager PyTorch implementation of Voxtral Realtime with causal whisper encoder, Mistral decoder, and memory-efficient checkpoint loading |
| examples/models/voxtral_realtime/model.md | Detailed architecture documentation including design choices, ExecuTorch patterns, and checkpoint format |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Multi-method export script supporting dynamic shapes and TorchAO quantization |
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | C++ runner header defining transcription interface and config |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | C++ implementation handling preprocessor execution, audio encoding, and autoregressive text generation |
| examples/models/voxtral_realtime/main.cpp | CLI entry point with gflags configuration and stats reporting |
| examples/models/voxtral_realtime/README.md | User-facing documentation with setup, export, build, and run instructions |
| examples/models/voxtral_realtime/CMakeLists.txt | CMake build configuration with XNNPACK, LLM runner, and tokenizer dependencies |
| examples/models/voxtral_realtime/CMakePresets.json | CMake presets for CPU build configuration |
| examples/models/parakeet/quantize.py | Refactored to re-export from shared quantization module, eliminating code duplication |
| Makefile | Added voxtral-realtime-cpu target for building the runner |
Force-pushed from 2f11a3b to 787ce22
Adds Mistral's Voxtral-Mini-4B-Realtime-2602 (~4B parameter streaming
speech-to-text model) to ExecuTorch with XNNPACK backend support.
Phase 1: Self-contained eager model (model.py) with direct Mistral
checkpoint loading, multi-method export (audio_encoder, text_decoder,
token_embedding) to a single .pte, and TorchAO quantization (8da4w/8w).
Phase 2: C++ runner for offline transcription. Loads preprocessor.pte
for mel spectrogram computation, runs audio encoding, then autoregressive
decoding with element-wise audio+text embedding fusion.
Phase 3: Streaming support (follow-up PR).
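For reference, a minimal sketch of what the Phase 1 multi-method export to a single XNNPACK-delegated .pte could look like; the real export_voxtral_rt.py also handles dynamic shapes, quantization, and CLI flags, and its exact arguments and wiring may differ:

```python
# Sketch only: exporting the three methods into one XNNPACK-lowered .pte.
# Module/example-input wiring is illustrative; see export_voxtral_rt.py for the real flow.
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


def export_voxtral(audio_encoder, text_decoder, token_embedding, example_inputs):
    programs = {
        "audio_encoder": torch.export.export(audio_encoder, example_inputs["audio_encoder"]),
        "text_decoder": torch.export.export(text_decoder, example_inputs["text_decoder"]),
        "token_embedding": torch.export.export(token_embedding, example_inputs["token_embedding"]),
    }
    # Lower all methods in one pass so they end up in a single program file.
    et_program = to_edge_transform_and_lower(
        programs, partitioner=[XnnpackPartitioner()]
    ).to_executorch()
    with open("voxtral_realtime.pte", "wb") as f:
        f.write(et_program.buffer)
```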
Example output (8da4w quantized, 30s LibriSpeech audio):
```
$ cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
--model_path voxtral_realtime.pte \
--tokenizer_path tekken.json \
--preprocessor_path preprocessor.pte \
--audio_path output.wav
Mr. Quilter is the apostle of the middle classes, and we are glad to
welcome his gospel. Nor is Mr. Quilter's manner less interesting than
his matter. He tells us that at this festive season of the year, with
Christmas and roast beef looming before us, similes drawn from eating
and its results occur most readily to the mind. He has grave doubts
whether Sir Frederick Layton's work is really Greek after all, and...
Generated 392 tokens in 44s (~8.8 tok/s) on M1.
```
Force-pushed from 787ce22 to 8d88f36
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
granularity = PerGroup(qlinear_group_size)

if qlinear_config == "4w":
    if qlinear_packing_format:
When qlinear_packing_format is provided, Int4WeightOnlyConfig is constructed with group_size=qlinear_group_size even if qlinear_group_size == 0 (per-axis mode). This likely creates an invalid config; consider rejecting packing_format when group_size==0, or mapping per-axis to a supported group size explicitly.
Suggested change:
if qlinear_packing_format:
    if qlinear_group_size == 0:
        raise ValueError(
            "qlinear_packing_format is not supported when qlinear_group_size == 0 "
            "(per-axis quantization). Please specify a positive group size or "
            "omit qlinear_packing_format."
        )
#include <cstring>
#include <ctime>
#include <vector>
std::min is used later in this file, but <algorithm> isn’t included here. Please include <algorithm> explicitly to avoid relying on transitive includes that can break the build on some toolchains.
Suggested change:
#include <vector>
#include <algorithm>
@@ -0,0 +1,296 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
There is a PR adding this model to transformers (still open): huggingface/transformers#43769
Do we plan to move this to optimum-executorch once that PR is landed?
Yeah, when I first looked at it a few days ago, the model wasn't in transformers. FWIW, vLLM has its own copy of the implementation in their repo, similar to what I'm doing, so implementing it directly seemed the most straightforward.
> Do we plan to move this to optimum-executorch once that PR is landed?
Maybe, once it lands in transformers. There are a few variables, such as upgrading the transformers pin in ET -- they recently had a major 5.0 release, so I assume there will be a few breakages that need fixing. Also, I'm fine keeping it as is if it already works.
(Voxtral author here) BTW, the transformers implementation only supports "offline" streaming for now, where the whole audio file is encoded in one go. The arch and forward-pass logic is definitely still the same, but I think what we're really interested in is the "true" online/realtime use case that we implemented via the realtime API inside vLLM (see: https://docs.vllm.ai/en/latest/examples/online_serving/openai_realtime_client/?h=realtime#openai-realtime-client)
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Import of '_custom_ops' is not used.
Suggested change:
_ = _custom_ops  # Ensure custom ops module is imported for side effects.
Force-pushed from 87b5403 to 99aef4c
Force-pushed from 99aef4c to 0935e29
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
uint64_t prev_token = bos_id_;
int num_generated = 0;
const int64_t max_pos = std::min(
    static_cast<int64_t>(config.max_new_tokens) + t_audio, max_seq_len_);
max_new_tokens is documented/flagged as a token-generation cap, but the loop bound adds t_audio, allowing up to t_audio + max_new_tokens decoding steps (and num_generated increments every step). Consider enforcing the cap based on num_generated (or renaming the field to reflect 'extra positions after audio') so CLI/docs match actual behavior.
Force-pushed from 0935e29 to 9e6c462
Force-pushed from 9e6c462 to c8784d8
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
#include <cstring>
#include <ctime>
#include <vector>

#include <executorch/extension/llm/runner/llm_runner_helper.h>
std::min is used below, but <algorithm> isn’t included. This can cause a compile error depending on the standard library implementation; add #include <algorithm> explicitly.
return from_blob(
    mel_ref.mutable_data_ptr<float>(),
    {static_cast<int>(mel_ref.size(0)),
     static_cast<int>(mel_ref.size(1)),
     static_cast<int>(mel_ref.size(2))},
    ::executorch::aten::ScalarType::Float);
These tensor size casts to int can truncate large dimensions (e.g., long mel sequences). Prefer passing mel_ref.size(n) as int64_t/SizesType without narrowing casts.
// e. Decode token to text and emit via callback.
auto piece =
    tokenizer_->decode(prev_token, static_cast<uint64_t>(next_token));
if (piece.ok()) {
token_cb is invoked unconditionally when piece.ok(). If the caller passes an empty std::function, this will throw/bad_function_call. Consider either requiring a non-empty callback (check and ET_CHECK_MSG(token_cb)), or making the callback optional and guarding before calling it.
Suggested change:
if (piece.ok() && token_cb) {
bool first_token = true;

int num_generated = runner.transcribe(
    audio_data.data(),
any chance that there is a way to feed in audio data iteratively via some kind of generator / iterator?
Yep, follow-up PR coming soon
Here's the true streaming mode:
The `t_cond` is a sinusoidal embedding of `n_delay_tokens` (default 6 = 480ms),
precomputed once and passed to each decoder layer as a constant.
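For illustration, a minimal sketch of how a sinusoidal embedding of `n_delay_tokens` could be precomputed with the standard transformer sin/cos formula; the base, dimension, and layout here are assumptions and may differ from model.py:

```python
# Sketch only: precompute a sinusoidal embedding of the delay (n_delay_tokens) once,
# then pass it to every decoder layer as a constant. Base/dim/layout are assumptions.
import math

import torch


def delay_embedding(n_delay_tokens: int = 6, dim: int = 512, base: float = 10000.0) -> torch.Tensor:
    half = dim // 2
    # Inverse frequencies, as in standard transformer positional encodings.
    inv_freq = torch.exp(-math.log(base) * torch.arange(half, dtype=torch.float32) / half)
    angles = n_delay_tokens * inv_freq
    return torch.cat([torch.sin(angles), torch.cos(angles)])  # shape (dim,), computed once
```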
### Differences from original Voxtral (non-realtime)
}

int VoxtralRealtimeRunner::transcribe(
    const float* audio_data,
any chance to also provide a ::realtime interface?
Adds Mistral's Voxtral-Mini-4B-Realtime-2602 (~4B parameter streaming
speech-to-text model) to ExecuTorch with XNNPACK backend support.
Phase 1: Self-contained eager model (model.py) with direct Mistral
checkpoint loading, multi-method export (audio_encoder, text_decoder,
token_embedding) to a single .pte, and TorchAO quantization (4-bit blockwise weights with 8-bit dynamic activations for linear layers, and 8-bit per-channel weights for embeddings).
Phase 2: C++ runner for offline transcription. Loads preprocessor.pte
for mel spectrogram computation, runs audio encoding, then autoregressive
decoding with element-wise audio+text embedding fusion (sketched below).
Phase 3: Streaming support (follow-up PR: #17440)
Phase 4: Enable on CUDA and Metal (follow-up PR)
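For clarity on the Phase 2 fusion step, a minimal eager-PyTorch sketch of element-wise audio+text embedding fusion; tensor names and shapes are illustrative, and the exported methods may organize this step differently:

```python
# Sketch only: element-wise fusion of audio encoder outputs with token embeddings
# before they are fed to the text decoder. Shapes and names are illustrative.
import torch


def fuse_embeddings(text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    # text_emb:  (1, T, D) embeddings from the token_embedding method
    # audio_emb: (1, T, D) audio encoder outputs aligned to the same T positions
    assert text_emb.shape == audio_emb.shape, "audio and text embeddings must align"
    return text_emb + audio_emb  # element-wise sum consumed by the text decoder
```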
Example output (8da4w quantized, 30s LibriSpeech audio): see the transcript above.
Test Plan: https://github.com/pytorch/executorch/actions/runs/21986674876/job/63522558208?pr=17431