Conversation
Pull Request Overview
Adds a new translation evaluation feature and updates import paths to the new translation submodule.
- Updates test imports to use `ml_filter.translation.translate`.
- Introduces `translation_evaluation.py` with functions to load references, prepare inputs, and compute COMET scores.
- Extends the CLI (`__main__.py`) with an `evaluate_translations` command.
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_translate.py | Updated import path for Translator. |
| tests/conftest.py | Updated import paths for translation clients and Translator. |
| src/ml_filter/translation/translation_evaluation.py | Added new module for reference loading, input prep, and COMET evaluation. |
| src/ml_filter/__main__.py | Imported evaluate_translations and added a CLI subcommand for translation evaluation. |
Comments suppressed due to low confidence (1)
src/ml_filter/translation/translation_evaluation.py:33
The docstring for `_prepare_translation_input` lists a `lang` parameter that isn't in the signature. Please update the docstring to match the actual parameters.

```python
lang: Language code.
```
src/ml_filter/__main__.py
| @click.option("--gold-path", required=True, help="Path to gold reference JSONL file") | ||
| @click.option("--model-name", default="Unbabel/wmt22-cometkiwi-da", help="COMET model to use") | ||
| @click.option("--languages", type=str, required=True, help="Comma-separated list of supported language codes") | ||
| @click.option("--batch-size", help="Batch size for processing translations") |
CLI option --batch-size has no type specified; it will be parsed as a string. Consider adding type=int to ensure batch_size is passed as an integer.
| @click.option("--batch-size", help="Batch size for processing translations") | |
| @click.option("--batch-size", type=int, help="Batch size for processing translations") |
```python
for filename in os.listdir(data_dir):
    if filename.endswith(".jsonl"):
        file_path = os.path.join(data_dir, filename)
        lang = filename.split("_")[5]
```
Splitting by underscore and accessing index 5 is brittle and may raise IndexError for unexpected filenames. Add a check on segment length or use a more robust parsing method.
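A hedged sketch of a more defensive variant; the expected segment count and the position of the language code are assumptions, since the filename convention is not visible in this diff:

```python
parts = filename.split("_")
if len(parts) < 6:  # assumed minimum so that index 5 exists; adjust to the real convention
    logging.warning("Skipping %s: filename does not match the expected convention", filename)
    continue
lang = parts[5]
```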
le1nux left a comment
Functionality-wise I think it is all correct. There are a few stylistic issues.
I did not get to review the plotting functions themselves yet.
```python
    batch_size: int,
    model_name: str = "Unbabel/wmt22-cometkiwi-da",
) -> None:
    """Evaluate translation quality for a set of files using a COMET model.
```
Maybe explain how we actually want to evaluate it? Like 2-3 sentences pitching the idea.
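For illustration, a hedged sketch of what those sentences could say; the wording would need to be adjusted to how the evaluation actually works, which this diff does not fully show:

```python
"""Evaluate translation quality for a set of files using a COMET model.

Sketch of the idea: each translated segment is scored together with its source
text by a pretrained COMET model (e.g. Unbabel/wmt22-cometkiwi-da, a quality
estimation model that does not need a human reference). Scores are aggregated
per language so that systematically weak language pairs stand out, instead of
relying on manual spot checks.
"""
```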
```python
    target_texts = []
    with open(file_path, "r") as f:
        for line_num, line in enumerate(f, 1):
            if not line:
```
`line` is never a boolean, but it is treated here like one. It also works with `None`, but in my opinion it is bad style.
Suggested change:

```diff
- if not line:
+ if line is None:
```
should we maybe even raise an exception here?
line will never be None since it is a string.
The current solution checks for empty strings.
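One way to make the empty-line check explicit, as a sketch (whether whitespace-only lines should also be skipped is a separate decision):

```python
for line_num, line in enumerate(f, 1):
    # `line` is always a str here, never None; skip blank or whitespace-only lines.
    if not line.strip():
        continue
```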
```python
    languages: str,
    batch_size: int,
):
    """CLI entry point for evaluating translation quality."""
```
We should document that the files in data_dir need to follow a certain convention.
```python
target_texts = _prepare_translation_input(file_path, gold_dict)

if target_texts:
```
Suggested change:

```diff
- if target_texts:
+ if len(target_texts) > 0:
```
```python
            except json.JSONDecodeError as e:
                logging.warning(f"Skipping invalid line {line_num} in {file_path}: {e}")
                continue
    return target_texts
```
We should add a warning if len(target_texts) != len(gold_dict).
If some languages miss certain translations, it introduces biases.
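A hedged sketch of that warning, reusing the names from the snippets above:

```python
if len(target_texts) != len(gold_dict):
    logging.warning(
        "Matched %d of %d gold references in %s; missing translations can bias per-language comparisons.",
        len(target_texts),
        len(gold_dict),
        file_path,
    )
```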
```python
# Step 3: Count translation scores
for sample_id, trans_score in id_to_translation_score.items():
    if sample_id in id_to_gt_quality_score:
```
what if this is not the case? Should we issue a warning?
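A hedged sketch of such a warning, assuming the loop from the snippet above and the standard logging module:

```python
for sample_id, trans_score in id_to_translation_score.items():
    if sample_id in id_to_gt_quality_score:
        ...  # existing counting logic
    else:
        # No ground-truth score for this sample: make the mismatch visible instead of silently skipping it.
        logging.warning("Sample %s has a translation score but no ground-truth quality score.", sample_id)
```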
```python
if sample_id in id_to_gt_quality_score:
    gt_score = float(id_to_gt_quality_score[sample_id])
    trans_score = trans_score.lower()
    if trans_score in score_classes:
```
| if filename.endswith(".json"): | ||
| parts = filename.split("_") | ||
| if len(parts) != 8: | ||
| continue |
```python
    languages: list[str],
    output_dir: Path,
) -> None:
    """Plot histograms for translation quality results.
```
we should also mention the convention of the file names in data_dir here
```python
    output_path=output_path,
)

output_path = os.path.join(output_dir, f"{lang}_translation_quality_vs_gt_histogram.png")
```
If we want this for the paper, I would always plot as PDF
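A minimal sketch of that change, assuming the histograms are written with matplotlib (`fig` is a placeholder for the actual figure object):

```python
# Vector output for the paper: matplotlib infers the format from the file extension.
output_path = os.path.join(output_dir, f"{lang}_translation_quality_vs_gt_histogram.pdf")
fig.savefig(output_path)  # writes a PDF when the path ends in .pdf
```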