
update-from-whisper-x#1

Open
croquies wants to merge 162 commits into fika-dev:main from m-bain:main

Conversation

@croquies

No description provided.

m-bain and others added 30 commits July 11, 2024 13:01
Update alignment.py - added alignment for sk and sl languages
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Updated Norwegian Bokmål and Norwegian Nynorsk models

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Force ctranslate to version 4.4.0 due to the libcudnn_ops_infer.so.8 error:
SYSTRAN/faster-whisper#729

Co-authored-by: Icaro Bombonato <ibombonatosites@gmail.com>
* Update faster-whisper to 1.0.2 to enable model distil-large-v3

* feat: add hotwords option to default_asr_options

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>


---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
* chore: bump faster-whisper to 1.1.0

* chore: bump pyannote to 3.3.2

* feat: add multilingual option in load_model function

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>


---------

Co-authored-by: Abhishek Sharma <abhishek@zipteams.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
…mode (#867)

Adds a local_files_only parameter (default False for consistency) to whisperx.load_model so the user can skip downloading and instead load from the local cache when the model already exists there.

---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
feat: restrict Python versions to 3.9 - 3.12
Barabazs and others added 30 commits October 16, 2025 07:41
* docs: add troubleshooting guide for cuDNN loading errors

* docs: add cuDNN version incompatibility troubleshooting
The audio_path parameter that the __call__ method of the ResultWriter class takes is a str, not TextIO
* feat: add language-aware sentence tokenization

* feat: add missing punkt languages

---------

Co-authored-by: pulkit <129310466+p1kit@users.noreply.github.com>
Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
* fix: pin huggingface-hub<1.0.0 for pyannote-audio compatibility

pyannote-audio uses the deprecated `use_auth_token` parameter which was removed in huggingface-hub v1.0.0

* fix: upgrade yanked dependencies

* chore: update version to 3.7.5
* chore: drop python 3.9 support

- Update requires-python to >=3.10
- Remove onnxruntime constraint (only needed for 3.9)
- Simplify numpy (remove version markers and upper bound)
- Remove pandas upper bound (<2.3.0 was for 3.9 compat)
- Remove av direct dependency (transitive via faster-whisper)

* chore(ci): remove python 3.9 from workflows

- Update build-and-release to use Python 3.10
- Remove 3.9 from python-compatibility matrix

* chore: bump version to 3.7.6
Replace O(n*m) pandas operations with O(n log m) interval tree queries
for speaker assignment, where n = words/segments and m = diarization segments.

Performance improvement:
- 7-minute video (1185 words, 147 segments): 73.9s -> 0.32s (228x faster)
- 3-hour podcast: Minutes of processing -> Seconds

Changes:
- Add IntervalTree class using sorted array + binary search
- Refactor assign_word_speakers to use interval tree for overlap queries
- Maintain backward compatibility with same function signature
- Identical output to original implementation

The interval tree uses numpy arrays for efficient storage and binary search
(np.searchsorted) for O(log n) candidate finding, then filters candidates
for actual overlaps.

Fixes #1335
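The commit message above can be illustrated with a minimal sketch of the sorted-array approach: binary search (np.searchsorted) narrows the candidates, then each candidate is checked for actual overlap. This is an illustrative reconstruction, not the exact whisperX implementation; the class and method names are assumptions.

```python
import numpy as np

class IntervalTree:
    """Sorted-array interval index: O(log m) candidate lookup via binary
    search on start times, then a linear filter for true overlaps."""

    def __init__(self, starts, ends, labels):
        order = np.argsort(starts)
        self.starts = np.asarray(starts, dtype=float)[order]
        self.ends = np.asarray(ends, dtype=float)[order]
        self.labels = [labels[i] for i in order]
        # Longest interval bounds how far left an overlapping start can be.
        self.max_len = float(np.max(self.ends - self.starts)) if len(self.starts) else 0.0

    def query(self, qstart, qend):
        """Return (label, overlap_duration) for every interval overlapping
        [qstart, qend]."""
        lo = np.searchsorted(self.starts, qstart - self.max_len, side="left")
        hi = np.searchsorted(self.starts, qend, side="right")
        hits = []
        for i in range(lo, hi):
            overlap = min(self.ends[i], qend) - max(self.starts[i], qstart)
            if overlap > 0:
                hits.append((self.labels[i], float(overlap)))
        return hits

# Assigning a word to the speaker with the largest overlap:
tree = IntervalTree([0.0, 5.0, 10.0], [6.0, 9.0, 15.0], ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"])
best = max(tree.query(4.0, 5.5), key=lambda t: t[1])[0]  # -> "SPEAKER_00"
```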
…ssignment

Optimize assign_word_speakers with interval tree for 228x speedup
Fix: pass no_repeat_ngram_size and repetition_penalty to CTranslate2 generate()
[BugFix] The variable I removed was not being used anywhere.
[BugFix] Type hint fix in decode_batch: List[str], not str
* fix: derive SRT/VTT cue times from word-level timestamps (#1315)

Subtitle cue start/end times were sourced from VAD segment boundaries
instead of word-level timestamps from forced alignment. This caused cues
to appear prematurely and could produce backwards chronological ordering
when VAD segments overlap.

Use min(word starts) / max(word ends) for cue timing, falling back to
segment-level times only when all words are unalignable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version to 3.7.7 in pyproject.toml

---------

Co-authored-by: Claude-Assistant <noreply@anthropic.com>
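The min/max fallback logic described in the commit above can be sketched as follows. This is a minimal illustration under the assumption that a segment is a dict with "start", "end", and a "words" list whose entries may or may not carry timestamps; it is not the literal whisperX code.

```python
def cue_times(segment):
    """Derive subtitle cue start/end from word-level timestamps:
    min of word starts / max of word ends, falling back to the
    segment-level (VAD) times only when no word was alignable."""
    words = segment.get("words", [])
    starts = [w["start"] for w in words if "start" in w]
    ends = [w["end"] for w in words if "end" in w]
    if starts and ends:
        return min(starts), max(ends)
    return segment["start"], segment["end"]
```

With word timestamps present, the VAD boundaries are ignored, so cues no longer start early or fall out of chronological order when VAD segments overlap.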
…-1 (#1349)

* feat: upgrade pyannote-audio dependency to v4

* fix: rename use_auth_token to token for pyannote-audio v4 compatibility

* fix: add omegaconf dep

* fix: use structured output API for pyannote-audio v4 diarization

pyannote-audio 4.x no longer returns a plain Annotation (or a tuple when
return_embeddings=True). It now returns a structured output with
speaker_diarization and speaker_embeddings attributes.

* feat: switch default diarization model to speaker-diarization-community-1

Update default from pyannote/speaker-diarization-3.1 to
pyannote/speaker-diarization-community-1 (pyannote-audio v4),
add CC-BY-4.0 attribution, and update README for v4 API changes.

* fix: correct markdown link formatting for silero-vad in README.md

* chore: update version to 3.8.0


Co-authored-by: Giorgio Azzinnaro <giorgio@azzinna.ro>
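The version-compatible handling described above (plain Annotation or tuple in pyannote-audio 3.x vs. a structured output with speaker_diarization / speaker_embeddings attributes in 4.x) could be sketched with duck typing; the helper name is an assumption, not the actual whisperX function:

```python
def extract_diarization(output):
    """Normalize a pyannote pipeline result to (annotation, embeddings),
    accepting both the 3.x return shapes and the 4.x structured output."""
    if hasattr(output, "speaker_diarization"):  # pyannote-audio >= 4
        return output.speaker_diarization, getattr(output, "speaker_embeddings", None)
    if isinstance(output, tuple):  # 3.x with return_embeddings=True
        return output
    return output, None  # 3.x plain Annotation
```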
…g paths (#1285)

- Add `model_cache_only` param to `load_align_model()`, pass as `local_files_only` to HuggingFace `from_pretrained` calls
- Forward `model_dir` and `model_cache_only` to both `load_align_model` call sites (initial load and language-change reload)
- Add `cache_dir` param to `DiarizationPipeline.__init__`, forward to pyannote `Pipeline.from_pretrained`
- Pass `--model_dir` as `cache_dir` when constructing `DiarizationPipeline` in CLI

Previously only the ASR model respected these flags. Alignment and diarization models would always download from HuggingFace to the default cache, breaking offline and custom-cache workflows.


---------

Co-authored-by: Barabazs <31799121+Barabazs@users.noreply.github.com>
Forward the existing --hf_token CLI argument to faster-whisper's
WhisperModel via a new use_auth_token parameter on load_model(),
enabling downloads of gated/private HuggingFace models.
It works with the initial prompt added.

Ran pdb to verify and checked the output.

Long audio works.

Existing logic is correct without the flag.
Added an `and` condition before streams; existing logic is not changed.
[New File] benchmark testing
Pass through the average log probability (transcription confidence score)
from ctranslate2 to the final segment output. The field is NotRequired
so existing code constructing segments without it remains valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>