Pipeline for language-distribution based sampling of tokenized datasets#239
Open
Conversation
…ized data using scores.
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
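The example configuration file itself is not shown in this excerpt. As a rough sketch only, a pydantic-settings-backed config for this kind of pipeline might look like the following YAML; all field names here are hypothetical, not the actual schema from the PR:

```yaml
# Hypothetical sampling config -- field names are illustrative,
# not the schema shipped with the PR.
raw_data_path: /data/raw
annotated_path: /data/annotated
tokenized_path: /data/tokenized
output_path: /data/filtered
seed: 42
language_distribution:
  en: 0.5
  de: 0.3
  fr: 0.2
```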
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…figuration and job submission scripts
… new paths and parameters
This PR introduces a pipeline for sampling and filtering tokenized datasets based on a specified language distribution, with validation to ensure correctness.
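The core idea of distribution-based sampling can be sketched as follows. This is a minimal stand-alone illustration, not the PR's implementation: given per-document language labels and a target distribution, it draws a seeded (and therefore reproducible) sample per language and returns indices in their original order. The function name and signature are invented for this example.

```python
import random
from collections import defaultdict

def sample_by_language(doc_langs, target_dist, total, seed=42):
    """Sample document indices to approximate a target language distribution.

    doc_langs: list of language codes, one per document.
    target_dist: mapping language -> target fraction (should sum to 1.0).
    total: total number of documents to sample.
    """
    # Group document indices by language.
    by_lang = defaultdict(list)
    for idx, lang in enumerate(doc_langs):
        by_lang[lang].append(idx)

    rng = random.Random(seed)  # fixed seed -> reproducible sampling
    sampled = []
    for lang, frac in target_dist.items():
        # Never request more documents than the language actually has.
        k = min(round(total * frac), len(by_lang[lang]))
        sampled.extend(rng.sample(by_lang[lang], k))
    return sorted(sampled)  # preserve the original document ordering
```

Rounding per language means the result can deviate slightly from `total`; a real pipeline would also need a policy for under-represented languages.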
Key features:
- Identifies documents in tokenized `.pbin` files using the `document_id = <file_hash>_<row_index>` convention.
- Filters `.pbin` files to include only the sampled documents while preserving ordering.
- Validates that the filtered `.pbin` data matches the sampled raw data by tokenizing the raw documents and comparing tokens for correctness.
- Handles the dataset types (`raw_data`, `annotated`, `tokenized`) and writes filtered `.pbin` files to a separate output folder.

This pipeline guarantees reproducible, language-balanced sampling, produces correctly filtered tokenized datasets, and validates that the tokenized output corresponds exactly to the sampled documents.
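The `document_id = <file_hash>_<row_index>` convention and the order-preserving filter can be illustrated with a small sketch. The hash algorithm, truncation length, and function names below are assumptions for illustration; the PR does not specify them in this excerpt:

```python
import hashlib

def make_document_id(file_bytes, row_index):
    # Follows the document_id = <file_hash>_<row_index> convention.
    # sha256 and 16-char truncation are assumptions for this sketch.
    file_hash = hashlib.sha256(file_bytes).hexdigest()[:16]
    return f"{file_hash}_{row_index}"

def filter_rows(rows, file_bytes, sampled_ids):
    """Keep only rows whose document_id is in the sampled set,
    preserving the original row order."""
    return [row for i, row in enumerate(rows)
            if make_document_id(file_bytes, i) in sampled_ids]
```

Because the filter iterates rows in file order and only drops entries, ordering is preserved without an explicit sort.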
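The validation step described above (re-tokenizing the sampled raw documents and comparing them against the filtered tokenized output) can be sketched like this; the function is hypothetical and `tokenize` is a stand-in for whatever tokenizer the pipeline was originally run with:

```python
def validate_tokenized(sampled_docs, filtered_token_seqs, tokenize):
    """Check that each filtered token sequence matches the result of
    re-tokenizing its corresponding raw document."""
    if len(sampled_docs) != len(filtered_token_seqs):
        raise ValueError("document count mismatch after filtering")
    for i, (doc, tokens) in enumerate(zip(sampled_docs, filtered_token_seqs)):
        if tokenize(doc) != list(tokens):
            raise ValueError(f"token mismatch at position {i}")
    return True
```

This catches both off-by-one filtering errors (count mismatch) and silent misalignment between raw and tokenized data (per-document token mismatch).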