Enable true streaming for Voxtral Realtime model on XNNPACK #17440
mergennachin wants to merge 2 commits into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17440
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job, 1 Pending, 3 Unrelated Failures as of commit fd8137d with merge base c3e60d0. The new failures need attention, the cancelled job should be retried, and the remaining failures were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 3f45a9f to 509bd47 (Compare)
Adds true streaming support for the Voxtral-Mini-4B-Realtime-2602 model, enabling real-time transcription from live audio input.

Key components:
- StreamingAudioEncoderExport: encoder with KV-cached attention and conv state for chunk-by-chunk processing (8 mel frames = 80ms per step)
- StreamingSession C++ API (feed_audio/flush) for incremental audio input
- Streaming mel preprocessor (WhisperAudioProcessor with --streaming flag)
- STFT overlap windowing (320 left + 40 right) for correct mel at chunk boundaries
- Per-component quantization (--qlinear-encoder / --qlinear / --qembedding)
Force-pushed from 509bd47 to 527ae9d (Compare)
Pull request overview
This PR enables true streaming mode for the Voxtral Realtime speech-to-text model on XNNPACK, allowing incremental processing of audio input in 80ms chunks rather than requiring the full audio upfront.
Changes:
- Added streaming mode to the mel spectrogram preprocessor, with mutual-exclusivity validation against stack_output mode
- Implemented StreamingAudioEncoderExport class with KV caches and stateful convolutional layers for incremental encoder processing
- Added StreamingSession C++ class for managing streaming transcription with audio buffering, overlap handling, and text-only decoding after audio ends (a rough interface sketch follows after the file table below)
- Updated CI/build scripts to export and test models with streaming enabled by default
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| extension/audio/mel_spectrogram.py | Added streaming parameter to skip 30-second chunk padding, with validation to ensure mutual exclusivity with stack_output |
| examples/models/voxtral_realtime/model.py | Implemented StreamingAudioEncoderExport class with KV caches, SDPA, and stateful convolutions for 8-frame incremental processing |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Added export_streaming function with per-component quantization and STFT overlap metadata calculation |
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | Declared StreamingSession class with feed_audio/flush API and streaming metadata fields |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | Implemented streaming session with overlapping audio windows, mel extraction, encoder/decoder stepping, and buffer management |
| examples/models/voxtral_realtime/main.cpp | Added --streaming flag with 200ms chunk simulation for testing |
| examples/models/voxtral_realtime/model.md | Documented streaming encoder architecture, conv state management, KV caches, STFT overlap, and quantization |
| examples/models/voxtral_realtime/README.md | Added streaming usage instructions and command examples |
| .ci/scripts/test_model_e2e.sh | Enabled --streaming flag for CI testing |
| .ci/scripts/export_model_artifact.sh | Updated to export with --streaming and --qlinear-encoder flags |
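To make the pieces above easier to follow, here is a rough sketch of what the StreamingSession interface could look like, inferred from the feed_audio/flush API and the metadata fields quoted in the review threads below. The real declaration lives in voxtral_realtime_runner.h and may differ; in particular, pending_audio_ and the exact method signatures are assumptions, not copied from the source.

```cpp
// Hedged sketch only: interface shape inferred from this PR's description and
// the review snippets below, not copied from voxtral_realtime_runner.h.
#include <cstdint>
#include <vector>

class StreamingSession {
 public:
  // Buffer incoming 16 kHz PCM samples and run encoder/decoder steps whenever
  // a full 80 ms step (plus look-ahead) of audio is available.
  void feed_audio(const float* samples, int64_t num_samples);

  // Process any remaining buffered audio, then continue text-only decoding
  // after the audio ends.
  void flush();

 private:
  std::vector<float> pending_audio_;   // hypothetical buffer for unprocessed samples
  int64_t step_samples_ = 1280;        // 8 mel frames = 80 ms per encoder step
  int64_t stft_left_overlap_ = 320;    // STFT edge context, in samples
  int64_t stft_right_lookahead_ = 40;  // 2.5 ms look-ahead, in samples
  int64_t mel_skip_frames_ = 2;        // overlap mel frames dropped each step
};
```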
```cpp
    int64_t n = std::min(
        chunk_size, static_cast<int64_t>(audio_data.size()) - offset);
    session->feed_audio(audio_data.data() + offset, n);
  }
```
This is "simulating" a live microphone by chunking the audio and feeding it into the stream.
The API supports continuously reading an input stream and transcribing it, and it does the "right thing" by looking at a moving window of chunks.
The live microphone feature can be built in the application layer by end customers and demo builders (which is not especially interesting in the context of model enablement).
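To make that concrete, here is a minimal sketch of how an application could pump audio from any source (file, socket, or capture callback) into the session. It assumes the StreamingSession interface sketched earlier and the feed_audio call shown in the hunk above; run_streaming is just an illustrative name, not a function in this PR.

```cpp
// Sketch only: drives StreamingSession from a caller-provided buffer.
// The chunk size is arbitrary; main.cpp uses 200 ms chunks to simulate a mic.
#include <algorithm>
#include <cstdint>
#include <vector>

void run_streaming(StreamingSession* session,
                   const std::vector<float>& audio_data,
                   int64_t chunk_size /* e.g. 3200 samples = 200 ms @ 16 kHz */) {
  for (int64_t offset = 0; offset < static_cast<int64_t>(audio_data.size());
       offset += chunk_size) {
    int64_t n = std::min(
        chunk_size, static_cast<int64_t>(audio_data.size()) - offset);
    session->feed_audio(audio_data.data() + offset, n);  // transcribes incrementally
  }
  session->flush();  // drain buffered audio and finish text-only decoding
}
```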
```md
Instead of zero-padding (offline) or recompute-with-overlap (vLLM),
explicit conv state carries the tail of the previous chunk:

- **Conv1** (kernel=3, stride=1): state = last 2 mel frames from previous
```
Actually, I think we need the last 4 mel frames instead of the last 2.
Our look-back is 52.5 ms, which AFAIU corresponds to
4 log-mel frames (10 ms each), with the last frame looking back a further 12.5 ms because it has a window size of 25 ms => 4 * 10 ms + 12.5 ms = 52.5 ms.
The conv state of 2 is correct. It's kernel_size - 1 for both CausalConv1d layers (kernel_size=3). This is the exact amount needed for a causal convolution to produce identical output to processing contiguous audio.
The streaming_look_back_ms = 52.5 in tekken.json serves a different purpose. The look-back bundles STFT context, conv context, and encoder context into one value.
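For intuition, here is a self-contained, single-channel sketch (illustration only, not the model code, which operates on multi-channel mel frames) of why carrying kernel_size - 1 inputs of state reproduces the contiguous result exactly:

```cpp
// Illustration only: a stride-1 causal conv that carries kernel_size - 1
// inputs between chunks matches running the same conv over the full signal.
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<float> causal_conv1d_step(
    const std::vector<float>& chunk,
    const std::vector<float>& kernel,  // kernel_size = 3, as in the Voxtral convs
    std::vector<float>& state) {       // last kernel_size - 1 inputs (zeros at start)
  assert(state.size() == kernel.size() - 1);
  std::vector<float> padded = state;
  padded.insert(padded.end(), chunk.begin(), chunk.end());

  std::vector<float> out;  // one output per new input (stride 1)
  for (size_t i = 0; i + kernel.size() <= padded.size(); ++i) {
    float acc = 0.f;
    for (size_t k = 0; k < kernel.size(); ++k) {
      acc += padded[i + k] * kernel[k];
    }
    out.push_back(acc);
  }

  // Carry the last kernel_size - 1 inputs into the next chunk's left context.
  state.assign(padded.end() - (kernel.size() - 1), padded.end());
  return out;
}
```

Because each output only ever needs the kernel_size - 1 previous inputs, 2 carried frames per kernel_size=3 layer are sufficient; the 52.5 ms streaming_look_back_ms bundles STFT, conv, and encoder context into one value, as noted above.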
```cpp
  int64_t step_samples_ = 1280;

  // STFT overlap for streaming mel computation (read from model metadata).
  int64_t stft_left_overlap_ = 320;
```
Suggested change:
```diff
- int64_t stft_left_overlap_ = 320;
+ int64_t stft_left_overlap_ = 840;
```
I think our left overlap should be 52.5 ms; see tekken.json:

```json
},
"transcription_delay_ms": 480,
"streaming_look_ahead_ms": 2.5,
"streaming_look_back_ms": 52.5,
"streaming_n_left_pad_tokens": 32,
"transcription_format": "streaming"
}
```
Our design handles these separately. Conv state carries the exact conv context (2 frames), and the encoder KV cache provides full attention history. So the left overlap only needs to cover the STFT window edge effect (n_fft/2 = 200 samples), and 320 is the aligned minimum for that.
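For reference, here is the arithmetic behind that aligned minimum, assuming the Whisper-style front end the numbers above imply (16 kHz audio, n_fft = 400, hop = 160); these constants are restated here for illustration, not read from the code:

```cpp
#include <cstdint>

// Illustration of the overlap arithmetic described above.
constexpr int64_t kHop = 160;         // 10 ms per mel frame at 16 kHz
constexpr int64_t kNFft = 400;        // 25 ms STFT window
constexpr int64_t kEdge = kNFft / 2;  // 200 samples of left context for the first frame

// Round the required edge context up to a whole number of hops so the step's
// mel frames stay aligned with the 1280-sample (8-frame) step size.
constexpr int64_t kLeftOverlap = ((kEdge + kHop - 1) / kHop) * kHop;
static_assert(kLeftOverlap == 320, "aligned minimum left overlap");
```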
```cpp
  // STFT overlap for streaming mel computation (read from model metadata).
  int64_t stft_left_overlap_ = 320;
  int64_t stft_right_lookahead_ = 40;
  int64_t mel_skip_frames_ = 2;
```
Suggested change:
```diff
- int64_t mel_skip_frames_ = 2;
+ int64_t mel_skip_frames_ = 4;
```
I think that this should actually be 4.
There's a slight difference between vLLM and ExecuTorch, but they end up producing the same result.

**vLLM: recompute-with-overlap.** Each step: 2160 raw audio samples → mel (~13 frames) → conv (~6 post-conv frames) → encoder (all frames, with KV cache) → downsample → audio tokens. The look_back region is re-processed every step: its mel, conv output, and encoder output are all recomputed alongside the new frames.

**Ours: state-based.** Each step takes a smaller window, [320 overlap + 1280 step + 40 look_ahead] = 1640 samples, and processes only the new data, using explicit state for context: 1640 raw audio samples → mel (10 frames) → extract 8 aligned frames.

**Why they produce the same result.** Mel spectrogram: both have ≥ n_fft/2 (200 samples) of left context for the step's first mel frame. Our 320 ≥ 200, so the 8 extracted mel frames have identical values to vLLM's corresponding frames. vLLM computes extra mel frames in the look_back region, but they're just for context. I think you could implement a similar approach in vLLM and make it more efficient too.
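A short sketch of the per-step bookkeeping described above, using the figures quoted in this comment (values restated here purely for illustration):

```cpp
#include <cstdint>

// Per-step window composition for the state-based approach described above.
constexpr int64_t kHop = 160;              // 10 ms mel hop at 16 kHz
constexpr int64_t kStftLeftOverlap = 320;  // covers the n_fft/2 edge effect
constexpr int64_t kStepSamples = 1280;     // 80 ms of new audio per step
constexpr int64_t kRightLookahead = 40;    // 2.5 ms look-ahead

constexpr int64_t kWindow =
    kStftLeftOverlap + kStepSamples + kRightLookahead;    // 1640 samples
constexpr int64_t kMelFrames = kWindow / kHop;            // 10 mel frames
constexpr int64_t kSkipFrames = kStftLeftOverlap / kHop;  // 2 overlap frames dropped
constexpr int64_t kNewFrames = kMelFrames - kSkipFrames;  // 8 frames to the encoder

static_assert(kWindow == 1640 && kMelFrames == 10 && kNewFrames == 8,
              "per-step window math");
```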
Follow-up to #17431
Enable Voxtral Realtime streaming on XNNPACK
Adds true streaming support for the Voxtral-Mini-4B-Realtime-2602 model,
enabling real-time transcription from live audio input.
Key components:
- StreamingAudioEncoderExport: encoder with KV-cached attention and conv state for chunk-by-chunk processing (8 mel frames = 80ms per step)
- StreamingSession C++ API (feed_audio/flush) for incremental audio input
- Streaming mel preprocessor (WhisperAudioProcessor with --streaming flag)
- STFT overlap windowing (320 left + 40 right) for correct mel at chunk boundaries
- Per-component quantization (--qlinear-encoder / --qlinear / --qembedding)
Test Plan:
https://github.com/pytorch/executorch/actions/runs/21993469202/job/63546903665?pr=17440