Pipeline for language-distribution based sampling of tokenized datasets#239
Open
Conversation
…ized data using scores.
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
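The example configuration file itself is not shown in this excerpt. As a rough sketch only, a pydantic-settings-backed config for this kind of pipeline might look like the following YAML; all field names here are hypothetical, not the actual schema from the PR:

```yaml
# Hypothetical sampling config -- field names are illustrative,
# not the schema shipped with the PR.
raw_data_path: /data/raw
annotated_path: /data/annotated
tokenized_path: /data/tokenized
output_path: /data/filtered
seed: 42
language_distribution:
  en: 0.5
  de: 0.3
  fr: 0.2
```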
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…figuration and job submission scripts
… new paths and parameters
This PR introduces a pipeline for sampling and filtering tokenized datasets based on a specified language distribution, with validation to ensure correctness.
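The core idea of distribution-based sampling can be sketched as follows. This is a minimal stand-alone illustration, not the PR's implementation: given per-document language labels and a target distribution, it draws a seeded (and therefore reproducible) sample per language and returns indices in their original order. The function name and signature are invented for this example.

```python
import random
from collections import defaultdict

def sample_by_language(doc_langs, target_dist, total, seed=42):
    """Sample document indices to approximate a target language distribution.

    doc_langs: list of language codes, one per document.
    target_dist: mapping language -> target fraction (should sum to 1.0).
    total: total number of documents to sample.
    """
    # Group document indices by language.
    by_lang = defaultdict(list)
    for idx, lang in enumerate(doc_langs):
        by_lang[lang].append(idx)

    rng = random.Random(seed)  # fixed seed -> reproducible sampling
    sampled = []
    for lang, frac in target_dist.items():
        # Never request more documents than the language actually has.
        k = min(round(total * frac), len(by_lang[lang]))
        sampled.extend(rng.sample(by_lang[lang], k))
    return sorted(sampled)  # preserve the original document ordering
```

Rounding per language means the result can deviate slightly from `total`; a real pipeline would also need a policy for under-represented languages.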
Key features:
- Identifies documents in tokenized `.pbin` files using the `document_id = <file_hash>_<row_index>` convention.
- Filters `.pbin` files to include only the sampled documents while preserving ordering.
- Validates that the filtered `.pbin` data matches the sampled raw data by tokenizing the raw documents and comparing tokens for correctness.
- Handles the dataset types (`raw_data`, `annotated`, `tokenized`) and writes filtered `.pbin` files to a separate output folder.

This pipeline guarantees reproducible, language-balanced sampling, produces correctly filtered tokenized datasets, and validates that the tokenized output corresponds exactly to the sampled documents.
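The `document_id = <file_hash>_<row_index>` convention and the order-preserving filter can be illustrated with a small sketch. The hash algorithm, truncation length, and function names below are assumptions for illustration; the PR does not specify them in this excerpt:

```python
import hashlib

def make_document_id(file_bytes, row_index):
    # Follows the document_id = <file_hash>_<row_index> convention.
    # sha256 and 16-char truncation are assumptions for this sketch.
    file_hash = hashlib.sha256(file_bytes).hexdigest()[:16]
    return f"{file_hash}_{row_index}"

def filter_rows(rows, file_bytes, sampled_ids):
    """Keep only rows whose document_id is in the sampled set,
    preserving the original row order."""
    return [row for i, row in enumerate(rows)
            if make_document_id(file_bytes, i) in sampled_ids]
```

Because the filter iterates rows in file order and only drops entries, ordering is preserved without an explicit sort.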
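The validation step described above (re-tokenizing the sampled raw documents and comparing them against the filtered tokenized output) can be sketched like this; the function is hypothetical and `tokenize` is a stand-in for whatever tokenizer the pipeline was originally run with:

```python
def validate_tokenized(sampled_docs, filtered_token_seqs, tokenize):
    """Check that each filtered token sequence matches the result of
    re-tokenizing its corresponding raw document."""
    if len(sampled_docs) != len(filtered_token_seqs):
        raise ValueError("document count mismatch after filtering")
    for i, (doc, tokens) in enumerate(zip(sampled_docs, filtered_token_seqs)):
        if tokenize(doc) != list(tokens):
            raise ValueError(f"token mismatch at position {i}")
    return True
```

This catches both off-by-one filtering errors (count mismatch) and silent misalignment between raw and tokenized data (per-document token mismatch).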