Enable Voxtral Realtime on XNNPACK (CPU)#17431
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17431
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 1 Pending as of commit c8784d8 with merge base f08db65. NEW FAILURES: the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Pull request overview
This PR adds Mistral's Voxtral-Mini-4B-Realtime-2602 streaming speech-to-text model to ExecuTorch with XNNPACK backend support. The implementation is self-contained with direct checkpoint loading (no HuggingFace dependency) and includes three phases: eager model implementation with multi-method export, C++ runner for offline transcription, and hooks for future streaming support.
Changes:
- Introduces a shared quantization module (extension/llm/export/quantize.py) for TorchAO source-transform quantization, supporting 4w/8w/8da4w/8da8w for linear layers and 4w/8w for embeddings (see the sketch after this list)
- Implements the Voxtral Realtime model with a custom audio encoder, text decoder, and element-wise audio+text embedding fusion
- Adds C++ runner with mel spectrogram preprocessing, audio encoding, and autoregressive decoding
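A minimal sketch of how the shared module's 8da4w (linear) and 8w (embedding) options could map onto TorchAO's `quantize_` API; the actual extension/llm/export/quantize.py may use different config classes, option names, and defaults, so treat this as illustrative only:

```python
# Sketch only: TorchAO source-transform quantization for LLM export.
# Config classes and option names here are assumptions; see
# extension/llm/export/quantize.py for the real implementation.
import torch
from torchao.quantization import (
    Int8DynamicActivationInt4WeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)
from torchao.quantization.granularity import PerAxis


def quantize_llm(model: torch.nn.Module, group_size: int = 32) -> torch.nn.Module:
    # Linear layers: 8-bit dynamic activations + 4-bit grouped weights ("8da4w").
    quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=group_size))
    # Embeddings: 8-bit weight-only, per-channel ("8w").
    quantize_(
        model,
        IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerAxis(0)),
        filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding),
    )
    return model
```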
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| extension/llm/export/quantize.py | New shared TorchAO quantization module for LLM export with support for various weight-only and dynamic activation quantization schemes |
| extension/llm/export/BUCK | Added quantize.py to build configuration |
| examples/models/voxtral_realtime/model.py | Complete eager PyTorch implementation of Voxtral Realtime with causal whisper encoder, Mistral decoder, and memory-efficient checkpoint loading |
| examples/models/voxtral_realtime/model.md | Detailed architecture documentation including design choices, ExecuTorch patterns, and checkpoint format |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Multi-method export script supporting dynamic shapes and TorchAO quantization |
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | C++ runner header defining transcription interface and config |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | C++ implementation handling preprocessor execution, audio encoding, and autoregressive text generation |
| examples/models/voxtral_realtime/main.cpp | CLI entry point with gflags configuration and stats reporting |
| examples/models/voxtral_realtime/README.md | User-facing documentation with setup, export, build, and run instructions |
| examples/models/voxtral_realtime/CMakeLists.txt | CMake build configuration with XNNPACK, LLM runner, and tokenizer dependencies |
| examples/models/voxtral_realtime/CMakePresets.json | CMake presets for CPU build configuration |
| examples/models/parakeet/quantize.py | Refactored to re-export from shared quantization module, eliminating code duplication |
| Makefile | Added voxtral-realtime-cpu target for building the runner |
Force-pushed from 2f11a3b to 787ce22
Adds Mistral's Voxtral-Mini-4B-Realtime-2602 (~4B parameter streaming
speech-to-text model) to ExecuTorch with XNNPACK backend support.
Phase 1: Self-contained eager model (model.py) with direct Mistral
checkpoint loading, multi-method export (audio_encoder, text_decoder,
token_embedding) to a single .pte, and TorchAO quantization (8da4w/8w).
Phase 2: C++ runner for offline transcription. Loads preprocessor.pte
for mel spectrogram computation, runs audio encoding, then autoregressive
decoding with element-wise audio+text embedding fusion.
Phase 3: Streaming support (follow-up PR).
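For reference, a minimal sketch of what the Phase 1 multi-method export to a single XNNPACK-delegated .pte could look like; the real export_voxtral_rt.py also handles dynamic shapes, quantization, and CLI flags, and its exact arguments and wiring may differ:

```python
# Sketch only: exporting the three methods into one XNNPACK-lowered .pte.
# Module/example-input wiring is illustrative; see export_voxtral_rt.py for the real flow.
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


def export_voxtral(audio_encoder, text_decoder, token_embedding, example_inputs):
    programs = {
        "audio_encoder": torch.export.export(audio_encoder, example_inputs["audio_encoder"]),
        "text_decoder": torch.export.export(text_decoder, example_inputs["text_decoder"]),
        "token_embedding": torch.export.export(token_embedding, example_inputs["token_embedding"]),
    }
    # Lower all methods in one pass so they end up in a single program file.
    et_program = to_edge_transform_and_lower(
        programs, partitioner=[XnnpackPartitioner()]
    ).to_executorch()
    with open("voxtral_realtime.pte", "wb") as f:
        f.write(et_program.buffer)
```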
Example output (8da4w quantized, 30s LibriSpeech audio):
```
$ cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
--model_path voxtral_realtime.pte \
--tokenizer_path tekken.json \
--preprocessor_path preprocessor.pte \
--audio_path output.wav
Mr. Quilter is the apostle of the middle classes, and we are glad to
welcome his gospel. Nor is Mr. Quilter's manner less interesting than
his matter. He tells us that at this festive season of the year, with
Christmas and roast beef looming before us, similes drawn from eating
and its results occur most readily to the mind. He has grave doubts
whether Sir Frederick Layton's work is really Greek after all, and...
Generated 392 tokens in 44s (~8.8 tok/s) on M1.
```
Force-pushed from 787ce22 to 8d88f36
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
granularity = PerGroup(qlinear_group_size)

if qlinear_config == "4w":
    if qlinear_packing_format:
When qlinear_packing_format is provided, Int4WeightOnlyConfig is constructed with group_size=qlinear_group_size even if qlinear_group_size == 0 (per-axis mode). This likely creates an invalid config; consider rejecting packing_format when group_size==0, or mapping per-axis to a supported group size explicitly.
Suggested change:
if qlinear_packing_format:
    if qlinear_group_size == 0:
        raise ValueError(
            "qlinear_packing_format is not supported when qlinear_group_size == 0 "
            "(per-axis quantization). Please specify a positive group size or "
            "omit qlinear_packing_format."
        )
#include <cstring>
#include <ctime>
#include <vector>
std::min is used later in this file, but <algorithm> isn’t included here. Please include <algorithm> explicitly to avoid relying on transitive includes that can break the build on some toolchains.
Suggested change:
#include <vector>
#include <algorithm>
@@ -0,0 +1,296 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
There is a PR adding this model to transformers (still open): huggingface/transformers#43769
Do we plan to move this to optimum-executorch once that PR is landed?
Yeah, when I first looked at it a few days ago, the model wasn't in transformers. FWIW, vLLM has its own copy of the implementation in their repo, similar to what I'm doing, so implementing it directly seemed the most straightforward.
> Do we plan to move this to optimum-executorch once that PR is landed?
Maybe, once it lands in transformers. There are a few variables, such as upgrading the transformers pin in ET -- they recently had a major 5.0 release, so I assume there will be a few breakages that need fixing. Also, I'm fine keeping it as is if it already works.
(Voxtral author here) BTW, the transformers implementation only supports "offline" streaming for now, where the whole audio file is encoded in one go. The arch and forward-pass logic is definitely still the same, but I think what we're really interested in is the "true" online/realtime use case that we implemented via the realtime API inside vLLM (see: https://docs.vllm.ai/en/latest/examples/online_serving/openai_realtime_client/?h=realtime#openai-realtime-client)
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Import of '_custom_ops' is not used.
Suggested change:
_ = _custom_ops  # Ensure custom ops module is imported for side effects.
Force-pushed from 87b5403 to 99aef4c
Force-pushed from 99aef4c to 0935e29
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
uint64_t prev_token = bos_id_;
int num_generated = 0;
const int64_t max_pos = std::min(
    static_cast<int64_t>(config.max_new_tokens) + t_audio, max_seq_len_);
max_new_tokens is documented/flagged as a token-generation cap, but the loop bound adds t_audio, allowing up to t_audio + max_new_tokens decoding steps (and num_generated increments every step). Consider enforcing the cap based on num_generated (or renaming the field to reflect 'extra positions after audio') so CLI/docs match actual behavior.
Force-pushed from 0935e29 to 9e6c462
Force-pushed from 9e6c462 to c8784d8
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
#include <cstring>
#include <ctime>
#include <vector>

#include <executorch/extension/llm/runner/llm_runner_helper.h>
std::min is used below, but <algorithm> isn’t included. This can cause a compile error depending on the standard library implementation; add #include <algorithm> explicitly.
return from_blob(
    mel_ref.mutable_data_ptr<float>(),
    {static_cast<int>(mel_ref.size(0)),
     static_cast<int>(mel_ref.size(1)),
     static_cast<int>(mel_ref.size(2))},
    ::executorch::aten::ScalarType::Float);
These tensor size casts to int can truncate large dimensions (e.g., long mel sequences). Prefer passing mel_ref.size(n) as int64_t/SizesType without narrowing casts.
// e. Decode token to text and emit via callback.
auto piece =
    tokenizer_->decode(prev_token, static_cast<uint64_t>(next_token));
if (piece.ok()) {
token_cb is invoked unconditionally when piece.ok(). If the caller passes an empty std::function, this will throw/bad_function_call. Consider either requiring a non-empty callback (check and ET_CHECK_MSG(token_cb)), or making the callback optional and guarding before calling it.
Suggested change:
if (piece.ok() && token_cb) {
bool first_token = true;

int num_generated = runner.transcribe(
    audio_data.data(),
any chance that there is a way to feed in audio data iteratively via some kind of generator / iterator?
Yep, follow-up PR coming soon
Here's the true streaming mode:
The `t_cond` is a sinusoidal embedding of `n_delay_tokens` (default 6 = 480ms),
precomputed once and passed to each decoder layer as a constant.
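For illustration, a minimal sketch of how a sinusoidal embedding of `n_delay_tokens` could be precomputed with the standard transformer sin/cos formula; the base, dimension, and layout here are assumptions and may differ from model.py:

```python
# Sketch only: precompute a sinusoidal embedding of the delay (n_delay_tokens) once,
# then pass it to every decoder layer as a constant. Base/dim/layout are assumptions.
import math

import torch


def delay_embedding(n_delay_tokens: int = 6, dim: int = 512, base: float = 10000.0) -> torch.Tensor:
    half = dim // 2
    # Inverse frequencies, as in standard transformer positional encodings.
    inv_freq = torch.exp(-math.log(base) * torch.arange(half, dtype=torch.float32) / half)
    angles = n_delay_tokens * inv_freq
    return torch.cat([torch.sin(angles), torch.cos(angles)])  # shape (dim,), computed once
```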
### Differences from original Voxtral (non-realtime)
}

int VoxtralRealtimeRunner::transcribe(
    const float* audio_data,
any chance to also provide a ::realtime interface?
Adds Mistral's Voxtral-Mini-4B-Realtime-2602 (~4B parameter streaming
speech-to-text model) to ExecuTorch with XNNPACK backend support.
Phase 1: Self-contained eager model (model.py) with direct Mistral
checkpoint loading, multi-method export (audio_encoder, text_decoder,
token_embedding) to a single .pte, and TorchAO quantization (4-bit blockwise weights with 8-bit dynamic activations for linear layers, and 8-bit per-channel weights for embeddings).
Phase 2: C++ runner for offline transcription. Loads preprocessor.pte
for mel spectrogram computation, runs audio encoding, then autoregressive
decoding with element-wise audio+text embedding fusion (sketched below).
Phase 3: Streaming support (follow-up PR: #17440)
Phase 4: Enable on CUDA and Metal (follow-up PR)
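For clarity on the Phase 2 fusion step, a minimal eager-PyTorch sketch of element-wise audio+text embedding fusion; tensor names and shapes are illustrative, and the exported methods may organize this step differently:

```python
# Sketch only: element-wise fusion of audio encoder outputs with token embeddings
# before they are fed to the text decoder. Shapes and names are illustrative.
import torch


def fuse_embeddings(text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    # text_emb:  (1, T, D) embeddings from the token_embedding method
    # audio_emb: (1, T, D) audio encoder outputs aligned to the same T positions
    assert text_emb.shape == audio_emb.shape, "audio and text embeddings must align"
    return text_emb + audio_emb  # element-wise sum consumed by the text decoder
```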
Example output (8da4w quantized, 30s LibriSpeech audio): see the transcript above.
Test Plan: https://github.com/pytorch/executorch/actions/runs/21986674876/job/63522558208?pr=17431