
Enable true streaming for Voxtral Realtime model on XNNPACK #17440

Open
mergennachin wants to merge 2 commits into main from voxtral_realtime_streaming

Conversation


@mergennachin mergennachin commented Feb 13, 2026

Follow-up to #17431

Enable Voxtral Realtime streaming on XNNPACK
Adds true streaming support for the Voxtral-Mini-4B-Realtime-2602 model,
enabling real-time transcription from live audio input.

Key components:

  • StreamingAudioEncoderExport: encoder with KV-cached attention and conv
    state for chunk-by-chunk processing (8 mel frames = 80ms per step); a toy
    sketch of the attention-cache pattern follows this list
  • StreamingSession C++ API (feed_audio/flush) for incremental audio input
  • Streaming mel preprocessor (WhisperAudioProcessor with --streaming flag)
  • STFT overlap windowing (320 left + 40 right) for correct mel at chunk
    boundaries
  • Per-component quantization (--qlinear-encoder / --qlinear / --qembedding)
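
To make the KV-cached attention pattern concrete, here is a minimal runnable sketch (toy dimensions and a single attention layer, not the actual StreamingAudioEncoderExport code): stepping 4 encoder frames at a time against a growing K/V cache reproduces full causal attention over the whole sequence.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 64  # toy dimension for illustration only
Wq, Wk, Wv = (torch.randn(d_model, d_model) / d_model**0.5 for _ in range(3))

x = torch.randn(1, 16, d_model)  # 16 encoder frames (4 streaming steps of 4)

# Offline reference: full causal self-attention over all frames at once.
full = F.scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv, is_causal=True)

# Streaming: each step appends its K/V to the cache and attends to all history.
k_cache = torch.zeros(1, 0, d_model)
v_cache = torch.zeros(1, 0, d_model)
outs = []
for step in x.split(4, dim=1):              # 4 new encoder frames per step
    q = step @ Wq
    k_cache = torch.cat([k_cache, step @ Wk], dim=1)
    v_cache = torch.cat([v_cache, step @ Wv], dim=1)
    L, S = q.shape[1], k_cache.shape[1]
    # Causal mask aligned to the end of the cache: new frame i sees every
    # cached frame plus new frames 0..i.
    mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=S - L)
    outs.append(F.scaled_dot_product_attention(q, k_cache, v_cache, attn_mask=mask))

print(torch.allclose(full, torch.cat(outs, dim=1), atol=1e-5))  # True
```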

Test Plan:

https://github.com/pytorch/executorch/actions/runs/21993469202/job/63546903665?pr=17440

@mergennachin mergennachin requested a review from lucylq as a code owner February 13, 2026 15:53
Copilot AI review requested due to automatic review settings February 13, 2026 15:53

pytorch-bot bot commented Feb 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17440

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 1 Pending, 3 Unrelated Failures

As of commit fd8137d with merge base c3e60d0:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Feb 13, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mergennachin mergennachin force-pushed the voxtral_realtime_streaming branch from 3f45a9f to 509bd47 Compare February 13, 2026 15:55
@mergennachin mergennachin force-pushed the voxtral_realtime_streaming branch from 509bd47 to 527ae9d Compare February 13, 2026 15:59
@mergennachin
Contributor Author

cc @patrickvonplaten

@mergennachin mergennachin changed the title from "Enable true streaming for Voxtral Realtime model" to "Enable true streaming for Voxtral Realtime model on XNNPACK" Feb 13, 2026

Copilot AI left a comment


Pull request overview

This PR enables true streaming mode for the Voxtral Realtime speech-to-text model on XNNPACK, allowing incremental processing of audio input in 80ms chunks rather than requiring the full audio upfront.

Changes:

  • Added streaming mode to mel spectrogram preprocessor with mutual exclusivity validation against stack_output mode
  • Implemented StreamingAudioEncoderExport class with KV caches and stateful convolutional layers for incremental encoder processing
  • Added StreamingSession C++ class for managing streaming transcription with audio buffering, overlap handling, and text-only decoding after audio ends
  • Updated CI/build scripts to export and test models with streaming enabled by default

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| extension/audio/mel_spectrogram.py | Added streaming parameter to skip 30-second chunk padding, with validation to ensure mutual exclusivity with stack_output |
| examples/models/voxtral_realtime/model.py | Implemented StreamingAudioEncoderExport class with KV caches, SDPA, and stateful convolutions for 8-frame incremental processing |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Added export_streaming function with per-component quantization and STFT overlap metadata calculation |
| examples/models/voxtral_realtime/voxtral_realtime_runner.h | Declared StreamingSession class with feed_audio/flush API and streaming metadata fields |
| examples/models/voxtral_realtime/voxtral_realtime_runner.cpp | Implemented streaming session with overlapping audio windows, mel extraction, encoder/decoder stepping, and buffer management |
| examples/models/voxtral_realtime/main.cpp | Added --streaming flag with 200ms chunk simulation for testing |
| examples/models/voxtral_realtime/model.md | Documented streaming encoder architecture, conv state management, KV caches, STFT overlap, and quantization |
| examples/models/voxtral_realtime/README.md | Added streaming usage instructions and command examples |
| .ci/scripts/test_model_e2e.sh | Enabled --streaming flag for CI testing |
| .ci/scripts/export_model_artifact.sh | Updated to export with --streaming and --qlinear-encoder flags |


int64_t n = std::min(
    chunk_size, static_cast<int64_t>(audio_data.size()) - offset);
session->feed_audio(audio_data.data() + offset, n);
}
Contributor Author

@patrickvonplaten

This is "simulating" live microphone by chunking the audio and feeding into the stream.

The API supports being able to continuously read input stream and transcribe. It does the "right thing" by looking at a moving window of chunks.

The live microphone feature can be done in the application layer by end customers and demo builder (which is not super interesting in the context of model enablement)

Instead of zero-padding (offline) or recompute-with-overlap (vLLM),
explicit conv state carries the tail of the previous chunk:

- **Conv1** (kernel=3, stride=1): state = last 2 mel frames from previous


Actually I think we need the last 4 mel frames instead of the last 2.

Our look-back is 52.5ms, which AFAIU corresponds to

4 log mel frames (each 10ms), and then the last frame looks back a further 12.5ms because it has a window size of 25ms => 4 * 10ms + 12.5ms = 52.5ms

Contributor Author

The conv state of 2 is correct. It's kernel_size - 1 for both CausalConv1d layers (kernel_size=3). This is the exact amount needed for a causal convolution to produce identical output to processing contiguous audio.

The streaming_look_back_ms = 52.5 in tekken.json serves a different purpose. The look-back bundles STFT context, conv context, and encoder context into one value.
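
For intuition, a minimal runnable sketch (illustrative sizes; a plain nn.Conv1d with manual causal padding standing in for the model's CausalConv1d): carrying kernel_size - 1 = 2 frames of state makes chunk-by-chunk convolution match processing the contiguous sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes only: 128 mel bins, kernel_size=3, stride=1 (conv1-like).
n_mels, kernel_size = 128, 3
conv = nn.Conv1d(n_mels, n_mels, kernel_size)   # padding handled manually below

x = torch.randn(1, n_mels, 24)                  # 24 contiguous mel frames

# Offline reference: causal left-pad with zeros, process everything at once.
full = conv(F.pad(x, (kernel_size - 1, 0)))

# Streaming: carry only the last kernel_size - 1 = 2 frames between chunks.
state = torch.zeros(1, n_mels, kernel_size - 1)
outs = []
for chunk in x.split(8, dim=-1):                # 8 mel frames = one 80 ms step
    outs.append(conv(torch.cat([state, chunk], dim=-1)))
    state = chunk[..., -(kernel_size - 1):]     # tail becomes next step's state
streamed = torch.cat(outs, dim=-1)

print(torch.allclose(full, streamed, atol=1e-6))  # True
```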

int64_t step_samples_ = 1280;

// STFT overlap for streaming mel computation (read from model metadata).
int64_t stft_left_overlap_ = 320;


Suggested change
int64_t stft_left_overlap_ = 320;
int64_t stft_left_overlap_ = 840;

I think our left overlap should be 52.5ms - see:

    },
    "transcription_delay_ms": 480,
    "streaming_look_ahead_ms": 2.5,
    "streaming_look_back_ms": 52.5,
    "streaming_n_left_pad_tokens": 32,
    "transcription_format": "streaming"
  }

in the tekken.json

Contributor Author

Our design handles these separately. Conv state carries the exact conv context (2 frames), and the encoder KV cache provides full attention history. So the left overlap only needs to cover the STFT window edge effect (n_fft/2 = 200 samples), and 320 is the aligned minimum for that.
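
A runnable sanity check of that claim (a sketch assuming Whisper-style framing, n_fft=400 and hop=160 with center=True, using torch.stft directly rather than the actual WhisperAudioProcessor): with 320 samples of real left context and 40 samples of look-ahead, the step's 8 STFT frames match the full-audio computation.

```python
import torch

torch.manual_seed(0)

# Whisper-style framing assumptions: n_fft=400 (25 ms), hop=160 (10 ms) @ 16 kHz.
n_fft, hop = 400, 160
window = torch.hann_window(n_fft)

def stft_mag(x):
    # center=True pads n_fft//2 = 200 samples on each side, so a frame needs
    # 200 samples of real context on either side of its center to be exact.
    return torch.stft(x, n_fft, hop, window=window, center=True,
                      return_complex=True).abs()

audio = torch.randn(16000)                    # 1 s of audio
full = stft_mag(audio)

# Streaming window: 320 samples (2 hops >= n_fft/2) of left overlap, a
# 1280-sample step, and 40 samples of right look-ahead.
step_start = 6400                             # an arbitrary step boundary
chunk = audio[step_start - 320 : step_start + 1280 + 40]
chunk_frames = stft_mag(chunk)

# The 320-sample overlap spans 2 hops, so the step's 8 frames are chunk
# frames 2..9 (mel_skip_frames_ = 2) and match the full-audio frames.
step_in_full = full[:, step_start // hop : step_start // hop + 8]
step_in_chunk = chunk_frames[:, 2 : 2 + 8]
print(torch.allclose(step_in_full, step_in_chunk, atol=1e-4))  # True
```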

// STFT overlap for streaming mel computation (read from model metadata).
int64_t stft_left_overlap_ = 320;
int64_t stft_right_lookahead_ = 40;
int64_t mel_skip_frames_ = 2;


Suggested change
int64_t mel_skip_frames_ = 2;
int64_t mel_skip_frames_ = 4;

I think that this should actually be 4.

@mergennachin
Contributor Author

@patrickvonplaten

Here's a slight difference between vLLM and ExecuTorch, but they end up producing the same result.

vLLM:
Each step takes a large audio window — [840 look_back + 1280 step + 40 look_ahead] = 2160 samples — and processes the entire window through the full pipeline:

2160 raw audio → mel (~13 frames) → conv (~6 post-conv frames) → encoder (all frames, with KV cache) → downsample → audio tokens

The look_back region is re-processed every step. Its mel, conv output, and encoder output are all recomputed alongside the new frames.

Ours: State-based

Each step takes a smaller window — [320 overlap + 1280 step + 40 look_ahead] = 1640 samples — and processes only the new data, using explicit state for context:

1640 raw audio → mel (10 frames) → extract 8 aligned frames
8 mel frames + conv state (2 frames) → conv → 4 encoder frames
4 encoder frames + KV cache (all history) → encoder → downsample → 1 audio token

Why they produce the same result

Mel spectrogram: Both have ≥ n_fft/2 (200 samples) of left context for the step's first mel frame. Our 320 ≥ 200, so the 8 extracted mel frames have identical values to vLLM's corresponding frames. vLLM computes extra mel frames in the look_back region, but they're just for context.

I think you can implement a similar approach in vLLM and make it more efficient too.
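
For reference, a small sketch of the per-step accounting both descriptions imply (assuming the standard 16 kHz sample rate and 10 ms hop):

```python
# Per-step window and frame accounting (assumes 16 kHz audio, 10 ms hop).
SR, HOP_MS, STEP_MS = 16000, 10, 80
hop = SR * HOP_MS // 1000                   # 160 samples
step = SR * STEP_MS // 1000                 # 1280 samples per 80 ms step

vllm_window = 840 + step + 40               # recompute-with-overlap: 2160 samples
et_window = 320 + step + 40                 # state-based: 1640 samples

mel_per_step = step // hop                  # 8 aligned mel frames kept per step
enc_per_step = mel_per_step // 2            # conv downsamples 2x -> 4 encoder frames
tokens_per_step = enc_per_step // 4         # 4x downsample -> 1 audio token

print(vllm_window, et_window, mel_per_step, enc_per_step, tokens_per_step)
# 2160 1640 8 4 1
```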
