
Conversation


@RobotSail RobotSail commented Jan 27, 2026

Summary

  • Adds MLflow as a new logging backend alongside TensorBoard, W&B, and async JSONL
  • Exposes logging configuration through TrainingArgs for programmatic API usage
  • Adds wandb_project and wandb_entity fields to TrainingArgs for consistency

Changes

New TrainingArgs fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| logger_type | str | "async" | Comma-separated loggers: tensorboard, wandb, mlflow, async |
| run_name | str \| None | None | Run name with placeholder support ({time}, {rank}, etc.) |
| mlflow_tracking_uri | str \| None | None | MLflow tracking server URI |
| mlflow_experiment_name | str \| None | None | MLflow experiment name |
| wandb_project | str \| None | None | W&B project name |
| wandb_entity | str \| None | None | W&B team/entity name |
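
The run_name placeholders above are resolved when the logging handlers are created; a minimal sketch of what that substitution could look like (the helper name _substitute_placeholders appears in logger.py, but this body is an assumption, not the actual implementation):

import datetime
import os

def _substitute_placeholders(run_name: str | None) -> str | None:
    """Fill {time}, {utc_time}, {rank}, {local_rank} in a run name (illustrative only)."""
    if run_name is None:
        return None
    now = datetime.datetime.now()
    utc_now = datetime.datetime.now(datetime.timezone.utc)
    return run_name.format(
        time=now.strftime("%Y-%m-%d_%H-%M-%S"),
        utc_time=utc_now.strftime("%Y-%m-%d_%H-%M-%S"),
        rank=os.environ.get("RANK", "0"),
        local_rank=os.environ.get("LOCAL_RANK", "0"),
    )

# e.g. "experiment-{time}" -> "experiment-2026-01-27_14-56-00"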

New MLflowHandler class

Implements the same interface as TensorBoardHandler and WandbHandler:

  • Logs metrics via mlflow.log_metrics()
  • Logs hyperparameters via mlflow.log_params()
  • Supports tracking_uri and experiment_name configuration
  • Falls back to MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME env vars
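
A rough sketch of the handler shape these bullets describe; the actual MLflowHandler in logger.py differs in details (filters, placeholder handling, extra init kwargs), and the "hparams" key used below is an assumption, not a documented record format:

import logging
import os

try:
    import mlflow  # optional dependency, guarded like the other backends
except ImportError:
    mlflow = None

class MLflowHandler(logging.Handler):
    """Illustrative handler: forwards metric dicts emitted as log records to MLflow."""

    def __init__(self, level=logging.INFO, run_name=None,
                 tracking_uri=None, experiment_name=None):
        super().__init__(level)
        # Explicit arguments win; otherwise fall back to MLflow's standard env vars.
        self.tracking_uri = tracking_uri or os.environ.get("MLFLOW_TRACKING_URI")
        self.experiment_name = experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME")
        self.run_name = run_name
        self._run = None

    def _setup(self):
        if mlflow is None:
            raise RuntimeError("mlflow is not installed; run 'pip install mlflow'")
        if self.tracking_uri:
            mlflow.set_tracking_uri(self.tracking_uri)
        if self.experiment_name:
            mlflow.set_experiment(self.experiment_name)
        self._run = mlflow.start_run(run_name=self.run_name)

    def emit(self, record):
        if self._run is None:
            self._setup()
        payload = dict(record.msg) if isinstance(record.msg, dict) else {}
        step = payload.pop("step", None)
        hparams = payload.pop("hparams", None)  # hypothetical key for hyperparameters
        if hparams:
            mlflow.log_params(hparams)
        metrics = {k: float(v) for k, v in payload.items()
                   if isinstance(v, (int, float))}
        if metrics:
            mlflow.log_metrics(metrics, step=int(step) if step is not None else None)

    def close(self):
        if self._run is not None and mlflow is not None:
            mlflow.end_run()
            self._run = None
        super().close()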

Updated run_training() API

Previously, run_training() hardcoded the logger to "async". Now it reads from TrainingArgs:

# Before
setup_metric_logger("async", None, train_args.ckpt_output_dir)

# After
setup_metric_logger(
    train_args.logger_type,
    train_args.run_name,
    train_args.ckpt_output_dir,
    mlflow_tracking_uri=train_args.mlflow_tracking_uri,
    mlflow_experiment_name=train_args.mlflow_experiment_name,
    wandb_project=train_args.wandb_project,
    wandb_entity=train_args.wandb_entity,
)

Example Usage

from instructlab.training import run_training, TrainingArgs, TorchrunArgs

train_args = TrainingArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    # ... other required fields ...
    
    # New logging configuration
    logger_type="wandb,mlflow",
    run_name="experiment-{time}",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="my-experiments",
    wandb_project="my-project",
)

run_training(torch_args, train_args)

Test plan

  • Verify MLflow handler logs metrics correctly to a local MLflow server
  • Verify W&B logging still works with new wandb_project/wandb_entity fields
  • Verify backward compatibility: existing code without logging params defaults to async
  • Verify comma-separated logger_type enables multiple backends simultaneously
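
For the last bullet, the comma-separated field is presumably split into individual backend names before handlers are configured; a small, self-contained illustration of that parsing (not the project's code or tests):

def parse_logger_type(logger_type: str) -> list[str]:
    """Split a comma-separated logger_type string into backend names (illustrative)."""
    return [part.strip() for part in logger_type.split(",") if part.strip()]

assert parse_logger_type("async") == ["async"]
assert parse_logger_type("wandb,mlflow") == ["wandb", "mlflow"]
assert parse_logger_type("tensorboard, wandb , async") == ["tensorboard", "wandb", "async"]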

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Multi-backend metric logging: MLflow, Weights & Biases (W&B), and TensorBoard support.
    • New public training options to configure MLflow (tracking/experiment/run), W&B (project/entity/run), and TensorBoard log directory.
    • CLI flags to set MLflow, W&B, and TensorBoard logging at runtime.
  • Documentation

    • Updated usage examples and docs showing multi-backend metric logging, auto-detection, and configuration.


coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Adds MLflow, Weights & Biases, and TensorBoard logging options: new optional TrainingArgs fields, an MLflowHandler and MLflow integration in the metric logger, an expanded setup_metric_logger signature, and CLI/entrypoint flags propagated to logger and subprocess runs.

Changes

Cohort / File(s) and summary:

  • Config: Training arguments (src/instructlab/training/config.py): Added seven optional TrainingArgs fields: mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, wandb_project, wandb_entity, wandb_run_name, and tensorboard_log_dir.
  • Logging backend & API (src/instructlab/training/logger.py): Added MLflowHandler (safe-import guard for mlflow) and integrated MLflow into the metric logger lifecycle (_setup, emit, close). Expanded the setup_metric_logger(...) signature to accept MLflow/W&B/TensorBoard params, auto-detect backends from args/env, and wire MLflow into dictConfig alongside the existing tensorboard/wandb/async handlers.
  • CLI / Entrypoint integration (src/instructlab/training/main_ds.py): Updated the setup_metric_logger() call to the new signature and threaded the new flags through the subprocess command; added CLI flags: --mlflow_tracking_uri, --mlflow_experiment_name, --mlflow_run_name, --wandb_project, --wandb_entity, --wandb_run_name, and --tensorboard_log_dir.
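
A sketch of how those flags are likely registered in main_ds.py's argument parser (the None defaults and the omission of help text are assumptions; only the flag names come from the summary above):

import argparse

parser = argparse.ArgumentParser()
# New logging flags threaded through the training entrypoint
parser.add_argument("--mlflow_tracking_uri", default=None)
parser.add_argument("--mlflow_experiment_name", default=None)
parser.add_argument("--mlflow_run_name", default=None)
parser.add_argument("--wandb_project", default=None)
parser.add_argument("--wandb_entity", default=None)
parser.add_argument("--wandb_run_name", default=None)
parser.add_argument("--tensorboard_log_dir", default=None)

args = parser.parse_args(["--mlflow_tracking_uri", "http://localhost:5000"])
print(args.mlflow_tracking_uri)  # http://localhost:5000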

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant Logger as MetricLogger
    participant MLflow as MLflow
    participant WandB as Weights&Biases
    participant TB as TensorBoard

    User->>CLI: start training with logging flags
    CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
    Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
    Logger->>MLflow: MLflowHandler._setup() (set_tracking_uri, set_experiment, start_run)
    Logger->>WandB: init run (project, entity, run_name)
    Logger->>TB: create writer (log_dir)
    CLI->>Logger: emit metrics/hparams LogRecord
    Logger->>MLflow: MLflowHandler.emit() -> log_metrics/log_params
    Logger->>WandB: log metrics
    Logger->>TB: write scalars
    CLI->>Logger: shutdown
    Logger->>MLflow: MLflowHandler.close() -> end_run()
    Logger->>WandB: finish run
    Logger->>TB: close writer

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through args and stitched a log,

MLflow tracks while WandB spins a jog,
TensorBoard keeps ribbons neat,
Handlers hum with every beat,
A rabbit saved each metric log.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|--------------------|-----------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective of the PR: adding MLflow support as a new logging backend and exposing logging configuration via TrainingArgs fields. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%. |


@mergify mergify bot added the ci-failure label Jan 27, 2026
RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Jan 27, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

## New Parameters

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

## Backend Support

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
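
The reasoning behind dropping getattr: argparse defines every attribute that has a default, so the defensive lookup can never differ from direct access. A self-contained illustration (the flag names mirror the PR; the snippet itself is not project code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--mlflow_tracking_uri", default=None)
parser.add_argument("--wandb_project", default=None)
args = parser.parse_args([])

# Because argparse always sets the attribute (default=None), getattr is redundant:
assert getattr(args, "mlflow_tracking_uri", None) == args.mlflow_tracking_uri
assert args.wandb_project is None
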
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)

638-665: Unused log_dir parameter.

The log_dir parameter is stored as self.log_dir but never used in _setup() or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location
     def _setup(self):
         """Initialize the MLflow run with the configured settings."""
         if mlflow is None:
             msg = (
                 "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
                 "Please ensure it is installed by running 'pip install mlflow'"
             )
             raise RuntimeError(msg)

         if self.tracking_uri:
             mlflow.set_tracking_uri(self.tracking_uri)

         if self.experiment_name:
-            mlflow.set_experiment(self.experiment_name)
+            mlflow.set_experiment(
+                self.experiment_name,
+                artifact_location=str(self.log_dir),
+            )

         self._mlflow_run = mlflow.start_run(
             run_name=self.run_name, **self.mlflow_init_kwargs
         )
♻️ Option 2: Remove unused parameter
     def __init__(
         self,
         level: int = logging.INFO,
         run_name: str | None = None,
-        log_dir: str | os.PathLike = "logs",
         tracking_uri: str | None = None,
         experiment_name: str | None = None,
         **mlflow_init_kwargs: Any,
     ):
         """Initialize the MLflow logger and check for required dependencies.

         Args:
             level: The logging level for this handler
             run_name: Name of the run, can contain placeholders
-            log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
             tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
             experiment_name: Name of the MLflow experiment
             **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
         """
         super().__init__(level)

         self.run_name = _substitute_placeholders(run_name)
-        self.log_dir = Path(log_dir)
         self.tracking_uri = tracking_uri
         self.experiment_name = experiment_name
         self.mlflow_init_kwargs = mlflow_init_kwargs.copy()

         self._mlflow_run = None

Note: If removing log_dir, also update setup_metric_logger to not pass it to the MLflow handler config.


711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with TensorBoardHandler (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change
         # Filter to only numeric values for metrics
         metrics_dict = {}
         for k, v in flat_dict.items():
             try:
                 metrics_dict[k] = float(v)
             except (ValueError, TypeError):
                 # Skip non-numeric values for metrics
-                pass
+                warnings.warn(
+                    f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+                    stacklevel=2,
+                )

RobotSail and others added 3 commits January 27, 2026 14:56
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs,
defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace defensive getattr() with direct attribute access in main_ds.py
  since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)

862-979: ⚠️ Potential issue | 🟡 Minor

Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set mlflow_experiment_name or mlflow_run_name, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the MLFLOW_EXPERIMENT_NAME environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when wandb_project is set).

Proposed fix
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")

@mergify mergify bot added ci-failure and removed ci-failure labels Feb 3, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)

1-46: ⚠️ Potential issue | 🟡 Minor

Update the module docstring example to match the new setup API.

The example still uses loggers=[...] and run_name=..., but setup_metric_logger now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )

862-985: ⚠️ Potential issue | 🟠 Major

Run name no longer wired to async/tensorboard (likely regression).

setup_metric_logger used to accept run_name, but it’s now missing and both async/tensorboard handlers hardcode run_name=None. This means the new TrainingArgs.run_name won’t affect those backends. If the intent is a global run name, please reintroduce it and apply as a fallback for backend-specific names.

✅ Suggested wiring
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
         "handlers": {
             "async": {
                 "()": AsyncStructuredHandler,
                 "log_dir": output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": async_filters,
             },
             "tensorboard": {
                 "()": TensorBoardHandler,
                 "log_dir": tensorboard_log_dir or output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": ["is_mapping", "is_rank0"],
             },
             "wandb": {
                 "()": WandbHandler,
                 "log_dir": output_dir,
-                "run_name": wandb_run_name,
+                "run_name": wandb_run_name or run_name,
                 "project": wandb_project,
                 "entity": wandb_entity,
                 "filters": ["is_mapping", "is_rank0"],
             },
             "mlflow": {
                 "()": MLflowHandler,
-                "run_name": mlflow_run_name,
+                "run_name": mlflow_run_name or run_name,
                 "tracking_uri": mlflow_tracking_uri
                 or os.environ.get("MLFLOW_TRACKING_URI"),
                 "experiment_name": mlflow_experiment_name
                 or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
                 "filters": ["is_mapping", "is_rank0"],
             },
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
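
A minimal sketch of the guard this prompt asks for, assuming the handler keeps its run on self._mlflow_run and its extra arguments on self.mlflow_init_kwargs (an illustration of the suggestion, not the merged code):

    def _setup(self):
        """Start a new MLflow run, or adopt one the caller already opened."""
        if self.tracking_uri:
            mlflow.set_tracking_uri(self.tracking_uri)
        if self.experiment_name:
            mlflow.set_experiment(self.experiment_name)

        active = mlflow.active_run()
        if active is not None:
            # Reuse the caller's run instead of raising from a second start_run().
            self._mlflow_run = active
        else:
            self._mlflow_run = mlflow.start_run(
                run_name=self.run_name, **self.mlflow_init_kwargs
            )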

RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Feb 3, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Pass tensorboard_log_dir through to TrainingArgs

Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Enable MLflow support for OSFT backend

- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address PR review comments

- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add OSFT MLflow example script

Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add logging and experiment tracking documentation

Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support

Update sidebar to include the new logging guide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove duplicate warnings import in sft.py

The warnings module is already imported at module level (line 2).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address review nitpicks in docs and examples

- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Auto-detect loggers from config parameters

Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)

The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address PR review feedback

- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

remove extraneous api kwargs

removes more extraneous api additions

reintroduce mlflow_run_name into lora

docstrings

drop formatting changes

read envs when present

review comments

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
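
Sketched out, the ownership tracking requested here could look like the following fragment of the handler (the attribute name _owns_mlflow_run comes from the comment; everything else is illustrative):

    def _setup(self):
        active = mlflow.active_run()
        if active is not None:
            # Adopt the caller's run; this handler must not end it later.
            self._mlflow_run = active
            self._owns_mlflow_run = False
        else:
            self._mlflow_run = mlflow.start_run(run_name=self.run_name)
            self._owns_mlflow_run = True

    def close(self):
        # Only end runs this handler started itself, then clear the bookkeeping.
        if self._mlflow_run is not None and self._owns_mlflow_run:
            mlflow.end_run()
        self._mlflow_run = None
        self._owns_mlflow_run = False
        super().close()
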
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)

942-946: Simplify the async logger inclusion logic.

The current logic is a bit convoluted. When detected_loggers is empty, line 943 sets loggers = ["async"], then lines 945-946 check if async is not in loggers (which it already is). When detected_loggers is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]

Or even simpler:

-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.

Comment on lines 873 to 953
def setup_metric_logger(
    output_dir,
    *,
    mlflow_tracking_uri: str | None = None,
    mlflow_experiment_name: str | None = None,
    mlflow_run_name: str | None = None,
    wandb_project: str | None = None,
    wandb_entity: str | None = None,
    wandb_run_name: str | None = None,
    tensorboard_log_dir: str | None = None,
):
    """Configure the metric logging system with auto-detected backends.

    This function sets up a comprehensive logging configuration that supports
    multiple logging backends simultaneously. It configures filters, handlers,
    and loggers for structured metric logging. Backends are automatically
    detected based on the presence of their configuration parameters.

    Args:
        output_dir: Directory where log files will be stored
        mlflow_tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000").
            Falls back to MLFLOW_TRACKING_URI environment variable if not provided.
            When set (or env var present), MLflow logging is automatically enabled.
        mlflow_experiment_name: MLflow experiment name.
            Falls back to MLFLOW_EXPERIMENT_NAME environment variable if not provided.
        mlflow_run_name: MLflow run name. Supports placeholders: {time}, {rank}, {utc_time}, {local_rank}.
        wandb_project: Weights & Biases project name.
            When set (or WANDB_PROJECT env var present), wandb logging is automatically enabled.
        wandb_entity: Weights & Biases team/entity name.
        wandb_run_name: Weights & Biases run name. Supports placeholders: {time}, {rank}, {utc_time}, {local_rank}.
        tensorboard_log_dir: Directory for TensorBoard logs.
            When set, TensorBoard logging is automatically enabled.

    Example:
        ```python
        # Setup logging with MLflow (auto-detected from tracking URI)
        setup_metric_logger(
            output_dir="logs",
            mlflow_tracking_uri="http://localhost:5000",
            mlflow_experiment_name="my_experiment",
            mlflow_run_name="my_run"
        )

        # Setup logging with wandb (auto-detected from project)
        setup_metric_logger(
            output_dir="logs",
            wandb_project="my_project",
            wandb_run_name="my_run"
        )

        # Setup logging with TensorBoard (auto-detected from log_dir)
        setup_metric_logger(
            output_dir="logs",
            tensorboard_log_dir="logs/tensorboard"
        )
        ```
    """
    # Enable package logging
    propagate_package_logs()

    # Auto-detect which loggers to enable based on configuration
    detected_loggers = []
    if (
        mlflow_tracking_uri
        or mlflow_experiment_name
        or mlflow_run_name
        or os.environ.get("MLFLOW_TRACKING_URI")
        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
    ):
        detected_loggers.append("mlflow")
    if wandb_project or os.environ.get("WANDB_PROJECT"):
        detected_loggers.append("wandb")
    if tensorboard_log_dir:
        detected_loggers.append("tensorboard")

    # Always include async logger for file logging
    loggers = detected_loggers if detected_loggers else ["async"]
    # Also include async logger alongside other loggers for file-based logging
    if detected_loggers and "async" not in loggers:
        loggers.append("async")


⚠️ Potential issue | 🟠 Major

Honor explicit logger selection and propagate run_name to all backends.

setup_metric_logger only enables MLflow/TensorBoard/W&B when their config fields are provided and then always adds async. This prevents explicitly enabling a backend with defaults (e.g., logger_type="tensorboard" without tensorboard_log_dir) and ignores a global run_name for async/tensorboard. This looks inconsistent with the new logger_type/run_name intent.

Consider accepting an explicit logger list (or logger_type) and wiring run_name to all handlers, only adding async when requested.

💡 Suggested adjustment (illustrative)
-def setup_metric_logger(
-    output_dir,
-    *,
-    mlflow_tracking_uri: str | None = None,
-    mlflow_experiment_name: str | None = None,
-    mlflow_run_name: str | None = None,
-    wandb_project: str | None = None,
-    wandb_entity: str | None = None,
-    wandb_run_name: str | None = None,
-    tensorboard_log_dir: str | None = None,
-):
+def setup_metric_logger(
+    output_dir,
+    *,
+    loggers: list[str] | None = None,
+    run_name: str | None = None,
+    mlflow_tracking_uri: str | None = None,
+    mlflow_experiment_name: str | None = None,
+    mlflow_run_name: str | None = None,
+    wandb_project: str | None = None,
+    wandb_entity: str | None = None,
+    wandb_run_name: str | None = None,
+    tensorboard_log_dir: str | None = None,
+):
@@
-    detected_loggers = []
+    detected_loggers = []
@@
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    if loggers is None:
+        loggers = detected_loggers if detected_loggers else ["async"]
@@
             "async": {
                 "()": AsyncStructuredHandler,
                 "log_dir": output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,
                 "filters": async_filters,
             },
             "tensorboard": {
                 "()": TensorBoardHandler,
                 "log_dir": tensorboard_log_dir or output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,
                 "filters": ["is_mapping", "is_rank0"],
             },

Also applies to: 968-997

