
Conversation


@RobotSail RobotSail commented Jan 27, 2026

Summary

  • Adds MLflow as a new logging backend alongside TensorBoard, W&B, and async JSONL
  • Exposes logging configuration through TrainingArgs for programmatic API usage
  • Adds wandb_project and wandb_entity fields to TrainingArgs for consistency

Changes

New TrainingArgs fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| logger_type | str | "async" | Comma-separated loggers: tensorboard, wandb, mlflow, async |
| run_name | str \| None | None | Run name with placeholder support ({time}, {rank}, etc.) |
| mlflow_tracking_uri | str \| None | None | MLflow tracking server URI |
| mlflow_experiment_name | str \| None | None | MLflow experiment name |
| wandb_project | str \| None | None | W&B project name |
| wandb_entity | str \| None | None | W&B team/entity name |
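
The run_name placeholders above are resolved when the logging handlers are created; a minimal sketch of what that substitution could look like (the helper name _substitute_placeholders appears in logger.py, but this body is an assumption, not the actual implementation):

import datetime
import os

def _substitute_placeholders(run_name: str | None) -> str | None:
    """Fill {time}, {utc_time}, {rank}, {local_rank} in a run name (illustrative only)."""
    if run_name is None:
        return None
    now = datetime.datetime.now()
    utc_now = datetime.datetime.now(datetime.timezone.utc)
    return run_name.format(
        time=now.strftime("%Y-%m-%d_%H-%M-%S"),
        utc_time=utc_now.strftime("%Y-%m-%d_%H-%M-%S"),
        rank=os.environ.get("RANK", "0"),
        local_rank=os.environ.get("LOCAL_RANK", "0"),
    )

# e.g. "experiment-{time}" -> "experiment-2026-01-27_14-56-00"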

New MLflowHandler class

Implements the same interface as TensorBoardHandler and WandbHandler:

  • Logs metrics via mlflow.log_metrics()
  • Logs hyperparameters via mlflow.log_params()
  • Supports tracking_uri and experiment_name configuration
  • Falls back to MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME env vars
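
A rough sketch of the handler shape these bullets describe; the actual MLflowHandler in logger.py differs in details (filters, placeholder handling, extra init kwargs), and the "hparams" key used below is an assumption, not a documented record format:

import logging
import os

try:
    import mlflow  # optional dependency, guarded like the other backends
except ImportError:
    mlflow = None

class MLflowHandler(logging.Handler):
    """Illustrative handler: forwards metric dicts emitted as log records to MLflow."""

    def __init__(self, level=logging.INFO, run_name=None,
                 tracking_uri=None, experiment_name=None):
        super().__init__(level)
        # Explicit arguments win; otherwise fall back to MLflow's standard env vars.
        self.tracking_uri = tracking_uri or os.environ.get("MLFLOW_TRACKING_URI")
        self.experiment_name = experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME")
        self.run_name = run_name
        self._run = None

    def _setup(self):
        if mlflow is None:
            raise RuntimeError("mlflow is not installed; run 'pip install mlflow'")
        if self.tracking_uri:
            mlflow.set_tracking_uri(self.tracking_uri)
        if self.experiment_name:
            mlflow.set_experiment(self.experiment_name)
        self._run = mlflow.start_run(run_name=self.run_name)

    def emit(self, record):
        if self._run is None:
            self._setup()
        payload = dict(record.msg) if isinstance(record.msg, dict) else {}
        step = payload.pop("step", None)
        hparams = payload.pop("hparams", None)  # hypothetical key for hyperparameters
        if hparams:
            mlflow.log_params(hparams)
        metrics = {k: float(v) for k, v in payload.items()
                   if isinstance(v, (int, float))}
        if metrics:
            mlflow.log_metrics(metrics, step=int(step) if step is not None else None)

    def close(self):
        if self._run is not None and mlflow is not None:
            mlflow.end_run()
            self._run = None
        super().close()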

Updated run_training() API

Previously, run_training() hardcoded the logger to "async". Now it reads from TrainingArgs:

# Before
setup_metric_logger("async", None, train_args.ckpt_output_dir)

# After
setup_metric_logger(
    train_args.logger_type,
    train_args.run_name,
    train_args.ckpt_output_dir,
    mlflow_tracking_uri=train_args.mlflow_tracking_uri,
    mlflow_experiment_name=train_args.mlflow_experiment_name,
    wandb_project=train_args.wandb_project,
    wandb_entity=train_args.wandb_entity,
)

Example Usage

from instructlab.training import run_training, TrainingArgs, TorchrunArgs

train_args = TrainingArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    # ... other required fields ...
    
    # New logging configuration
    logger_type="wandb,mlflow",
    run_name="experiment-{time}",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="my-experiments",
    wandb_project="my-project",
)

run_training(torch_args, train_args)

Test plan

  • Verify MLflow handler logs metrics correctly to a local MLflow server
  • Verify W&B logging still works with new wandb_project/wandb_entity fields
  • Verify backward compatibility: existing code without logging params defaults to async
  • Verify comma-separated logger_type enables multiple backends simultaneously
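
For the last bullet, the comma-separated field is presumably split into individual backend names before handlers are configured; a small, self-contained illustration of that parsing (not the project's code or tests):

def parse_logger_type(logger_type: str) -> list[str]:
    """Split a comma-separated logger_type string into backend names (illustrative)."""
    return [part.strip() for part in logger_type.split(",") if part.strip()]

assert parse_logger_type("async") == ["async"]
assert parse_logger_type("wandb,mlflow") == ["wandb", "mlflow"]
assert parse_logger_type("tensorboard, wandb , async") == ["tensorboard", "wandb", "async"]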

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Multi-backend metric logging: MLflow, Weights & Biases (W&B), and TensorBoard support.
    • New public training options to configure MLflow (tracking/experiment/run), W&B (project/entity/run), and TensorBoard log directory.
    • CLI flags to set MLflow, W&B, and TensorBoard logging at runtime.
  • Documentation

    • Updated usage examples and docs showing multi-backend metric logging, auto-detection, and configuration.


coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Adds MLflow, Weights & Biases, and TensorBoard logging options: new optional TrainingArgs fields, an MLflowHandler and MLflow integration in the metric logger, an expanded setup_metric_logger signature, and CLI/entrypoint flags propagated to logger and subprocess runs.

Changes

Cohort / File(s) and summary:

  • Config: Training arguments (src/instructlab/training/config.py): Added seven optional TrainingArgs fields: mlflow_tracking_uri, mlflow_experiment_name, mlflow_run_name, wandb_project, wandb_entity, wandb_run_name, and tensorboard_log_dir.
  • Logging backend & API (src/instructlab/training/logger.py): Added MLflowHandler (safe-import guard for mlflow) and integrated MLflow into the metric logger lifecycle (_setup, emit, close). Expanded the setup_metric_logger(...) signature to accept MLflow/W&B/TensorBoard params, auto-detect backends from args/env, and wire MLflow into dictConfig alongside the existing tensorboard/wandb/async handlers.
  • CLI / Entrypoint integration (src/instructlab/training/main_ds.py): Updated the setup_metric_logger() call to the new signature and threaded the new flags through the subprocess command; added CLI flags: --mlflow_tracking_uri, --mlflow_experiment_name, --mlflow_run_name, --wandb_project, --wandb_entity, --wandb_run_name, and --tensorboard_log_dir.
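
A sketch of how those flags are likely registered in main_ds.py's argument parser (the None defaults and the omission of help text are assumptions; only the flag names come from the summary above):

import argparse

parser = argparse.ArgumentParser()
# New logging flags threaded through the training entrypoint
parser.add_argument("--mlflow_tracking_uri", default=None)
parser.add_argument("--mlflow_experiment_name", default=None)
parser.add_argument("--mlflow_run_name", default=None)
parser.add_argument("--wandb_project", default=None)
parser.add_argument("--wandb_entity", default=None)
parser.add_argument("--wandb_run_name", default=None)
parser.add_argument("--tensorboard_log_dir", default=None)

args = parser.parse_args(["--mlflow_tracking_uri", "http://localhost:5000"])
print(args.mlflow_tracking_uri)  # http://localhost:5000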

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant Logger as MetricLogger
    participant MLflow as MLflow
    participant WandB as Weights&Biases
    participant TB as TensorBoard

    User->>CLI: start training with logging flags
    CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
    Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
    Logger->>MLflow: MLflowHandler._setup() (set_tracking_uri, set_experiment, start_run)
    Logger->>WandB: init run (project, entity, run_name)
    Logger->>TB: create writer (log_dir)
    CLI->>Logger: emit metrics/hparams LogRecord
    Logger->>MLflow: MLflowHandler.emit() -> log_metrics/log_params
    Logger->>WandB: log metrics
    Logger->>TB: write scalars
    CLI->>Logger: shutdown
    Logger->>MLflow: MLflowHandler.close() -> end_run()
    Logger->>WandB: finish run
    Logger->>TB: close writer

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through args and stitched a log,

MLflow tracks while WandB spins a jog,
TensorBoard keeps ribbons neat,
Handlers hum with every beat,
A rabbit saved each metric log.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|--------------------|-----------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective of the PR: adding MLflow support as a new logging backend and exposing logging configuration via TrainingArgs fields. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%. |


@mergify mergify bot added the ci-failure label Jan 27, 2026
RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Jan 27, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

## New Parameters

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

## Backend Support

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
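
The reasoning behind dropping getattr: argparse defines every attribute that has a default, so the defensive lookup can never differ from direct access. A self-contained illustration (the flag names mirror the PR; the snippet itself is not project code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--mlflow_tracking_uri", default=None)
parser.add_argument("--wandb_project", default=None)
args = parser.parse_args([])

# Because argparse always sets the attribute (default=None), getattr is redundant:
assert getattr(args, "mlflow_tracking_uri", None) == args.mlflow_tracking_uri
assert args.wandb_project is None
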
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)

638-665: Unused log_dir parameter.

The log_dir parameter is stored as self.log_dir but never used in _setup() or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location
     def _setup(self):
         """Initialize the MLflow run with the configured settings."""
         if mlflow is None:
             msg = (
                 "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
                 "Please ensure it is installed by running 'pip install mlflow'"
             )
             raise RuntimeError(msg)

         if self.tracking_uri:
             mlflow.set_tracking_uri(self.tracking_uri)

         if self.experiment_name:
-            mlflow.set_experiment(self.experiment_name)
+            mlflow.set_experiment(
+                self.experiment_name,
+                artifact_location=str(self.log_dir),
+            )

         self._mlflow_run = mlflow.start_run(
             run_name=self.run_name, **self.mlflow_init_kwargs
         )
♻️ Option 2: Remove unused parameter
     def __init__(
         self,
         level: int = logging.INFO,
         run_name: str | None = None,
-        log_dir: str | os.PathLike = "logs",
         tracking_uri: str | None = None,
         experiment_name: str | None = None,
         **mlflow_init_kwargs: Any,
     ):
         """Initialize the MLflow logger and check for required dependencies.

         Args:
             level: The logging level for this handler
             run_name: Name of the run, can contain placeholders
-            log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
             tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
             experiment_name: Name of the MLflow experiment
             **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
         """
         super().__init__(level)

         self.run_name = _substitute_placeholders(run_name)
-        self.log_dir = Path(log_dir)
         self.tracking_uri = tracking_uri
         self.experiment_name = experiment_name
         self.mlflow_init_kwargs = mlflow_init_kwargs.copy()

         self._mlflow_run = None

Note: If removing log_dir, also update setup_metric_logger to not pass it to the MLflow handler config.


711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with TensorBoardHandler (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change
         # Filter to only numeric values for metrics
         metrics_dict = {}
         for k, v in flat_dict.items():
             try:
                 metrics_dict[k] = float(v)
             except (ValueError, TypeError):
                 # Skip non-numeric values for metrics
-                pass
+                warnings.warn(
+                    f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+                    stacklevel=2,
+                )

RobotSail and others added 3 commits January 27, 2026 14:56
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs,
defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace defensive getattr() with direct attribute access in main_ds.py
  since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)

862-979: ⚠️ Potential issue | 🟡 Minor

Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set mlflow_experiment_name or mlflow_run_name, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the MLFLOW_EXPERIMENT_NAME environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when wandb_project is set).

Proposed fix
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")

@mergify mergify bot added ci-failure and removed ci-failure labels Feb 3, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)

1-46: ⚠️ Potential issue | 🟡 Minor

Update the module docstring example to match the new setup API.

The example still uses loggers=[...] and run_name=..., but setup_metric_logger now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )

862-985: ⚠️ Potential issue | 🟠 Major

Run name no longer wired to async/tensorboard (likely regression).

setup_metric_logger used to accept run_name, but it’s now missing and both async/tensorboard handlers hardcode run_name=None. This means the new TrainingArgs.run_name won’t affect those backends. If the intent is a global run name, please reintroduce it and apply as a fallback for backend-specific names.

✅ Suggested wiring
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
         "handlers": {
             "async": {
                 "()": AsyncStructuredHandler,
                 "log_dir": output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": async_filters,
             },
             "tensorboard": {
                 "()": TensorBoardHandler,
                 "log_dir": tensorboard_log_dir or output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": ["is_mapping", "is_rank0"],
             },
             "wandb": {
                 "()": WandbHandler,
                 "log_dir": output_dir,
-                "run_name": wandb_run_name,
+                "run_name": wandb_run_name or run_name,
                 "project": wandb_project,
                 "entity": wandb_entity,
                 "filters": ["is_mapping", "is_rank0"],
             },
             "mlflow": {
                 "()": MLflowHandler,
-                "run_name": mlflow_run_name,
+                "run_name": mlflow_run_name or run_name,
                 "tracking_uri": mlflow_tracking_uri
                 or os.environ.get("MLFLOW_TRACKING_URI"),
                 "experiment_name": mlflow_experiment_name
                 or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
                 "filters": ["is_mapping", "is_rank0"],
             },
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
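
A minimal sketch of the guard this prompt asks for, assuming the handler keeps its run on self._mlflow_run and its extra arguments on self.mlflow_init_kwargs (an illustration of the suggestion, not the merged code):

    def _setup(self):
        """Start a new MLflow run, or adopt one the caller already opened."""
        if self.tracking_uri:
            mlflow.set_tracking_uri(self.tracking_uri)
        if self.experiment_name:
            mlflow.set_experiment(self.experiment_name)

        active = mlflow.active_run()
        if active is not None:
            # Reuse the caller's run instead of raising from a second start_run().
            self._mlflow_run = active
        else:
            self._mlflow_run = mlflow.start_run(
                run_name=self.run_name, **self.mlflow_init_kwargs
            )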

RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Feb 3, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Pass tensorboard_log_dir through to TrainingArgs

Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Enable MLflow support for OSFT backend

- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address PR review comments

- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add OSFT MLflow example script

Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add logging and experiment tracking documentation

Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support

Update sidebar to include the new logging guide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove duplicate warnings import in sft.py

The warnings module is already imported at module level (line 2).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address review nitpicks in docs and examples

- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Auto-detect loggers from config parameters

Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)

The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address PR review feedback

- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

remove extraneous api kwargs

removes more extraneous api additions

reintroduce mlflow_run_name into lora

docstrings

drop formatting changes

read envs when present

review comments

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
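
Sketched out, the ownership tracking requested here could look like the following fragment of the handler (the attribute name _owns_mlflow_run comes from the comment; everything else is illustrative):

    def _setup(self):
        active = mlflow.active_run()
        if active is not None:
            # Adopt the caller's run; this handler must not end it later.
            self._mlflow_run = active
            self._owns_mlflow_run = False
        else:
            self._mlflow_run = mlflow.start_run(run_name=self.run_name)
            self._owns_mlflow_run = True

    def close(self):
        # Only end runs this handler started itself, then clear the bookkeeping.
        if self._mlflow_run is not None and self._owns_mlflow_run:
            mlflow.end_run()
        self._mlflow_run = None
        self._owns_mlflow_run = False
        super().close()
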
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)

942-946: Simplify the async logger inclusion logic.

The current logic is a bit convoluted. When detected_loggers is empty, line 943 sets loggers = ["async"], then lines 945-946 check if async is not in loggers (which it already is). When detected_loggers is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]

Or even simpler:

-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.

Comment on lines 873 to 953
def setup_metric_logger(
    output_dir,
    *,
    mlflow_tracking_uri: str | None = None,
    mlflow_experiment_name: str | None = None,
    mlflow_run_name: str | None = None,
    wandb_project: str | None = None,
    wandb_entity: str | None = None,
    wandb_run_name: str | None = None,
    tensorboard_log_dir: str | None = None,
):
    """Configure the metric logging system with auto-detected backends.

    This function sets up a comprehensive logging configuration that supports
    multiple logging backends simultaneously. It configures filters, handlers,
    and loggers for structured metric logging. Backends are automatically
    detected based on the presence of their configuration parameters.

    Args:
        output_dir: Directory where log files will be stored
        mlflow_tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000").
            Falls back to MLFLOW_TRACKING_URI environment variable if not provided.
            When set (or env var present), MLflow logging is automatically enabled.
        mlflow_experiment_name: MLflow experiment name.
            Falls back to MLFLOW_EXPERIMENT_NAME environment variable if not provided.
        mlflow_run_name: MLflow run name. Supports placeholders: {time}, {rank}, {utc_time}, {local_rank}.
        wandb_project: Weights & Biases project name.
            When set (or WANDB_PROJECT env var present), wandb logging is automatically enabled.
        wandb_entity: Weights & Biases team/entity name.
        wandb_run_name: Weights & Biases run name. Supports placeholders: {time}, {rank}, {utc_time}, {local_rank}.
        tensorboard_log_dir: Directory for TensorBoard logs.
            When set, TensorBoard logging is automatically enabled.

    Example:
        ```python
        # Setup logging with MLflow (auto-detected from tracking URI)
        setup_metric_logger(
            output_dir="logs",
            mlflow_tracking_uri="http://localhost:5000",
            mlflow_experiment_name="my_experiment",
            mlflow_run_name="my_run"
        )

        # Setup logging with wandb (auto-detected from project)
        setup_metric_logger(
            output_dir="logs",
            wandb_project="my_project",
            wandb_run_name="my_run"
        )

        # Setup logging with TensorBoard (auto-detected from log_dir)
        setup_metric_logger(
            output_dir="logs",
            tensorboard_log_dir="logs/tensorboard"
        )
        ```
    """
    # Enable package logging
    propagate_package_logs()

    # Auto-detect which loggers to enable based on configuration
    detected_loggers = []
    if (
        mlflow_tracking_uri
        or mlflow_experiment_name
        or mlflow_run_name
        or os.environ.get("MLFLOW_TRACKING_URI")
        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
    ):
        detected_loggers.append("mlflow")
    if wandb_project or os.environ.get("WANDB_PROJECT"):
        detected_loggers.append("wandb")
    if tensorboard_log_dir:
        detected_loggers.append("tensorboard")

    # Always include async logger for file logging
    loggers = detected_loggers if detected_loggers else ["async"]
    # Also include async logger alongside other loggers for file-based logging
    if detected_loggers and "async" not in loggers:
        loggers.append("async")


⚠️ Potential issue | 🟠 Major

Honor explicit logger selection and propagate run_name to all backends.

setup_metric_logger only enables MLflow/TensorBoard/W&B when their config fields are provided and then always adds async. This prevents explicitly enabling a backend with defaults (e.g., logger_type="tensorboard" without tensorboard_log_dir) and ignores a global run_name for async/tensorboard. This looks inconsistent with the new logger_type/run_name intent.

Consider accepting an explicit logger list (or logger_type) and wiring run_name to all handlers, only adding async when requested.

💡 Suggested adjustment (illustrative)
-def setup_metric_logger(
-    output_dir,
-    *,
-    mlflow_tracking_uri: str | None = None,
-    mlflow_experiment_name: str | None = None,
-    mlflow_run_name: str | None = None,
-    wandb_project: str | None = None,
-    wandb_entity: str | None = None,
-    wandb_run_name: str | None = None,
-    tensorboard_log_dir: str | None = None,
-):
+def setup_metric_logger(
+    output_dir,
+    *,
+    loggers: list[str] | None = None,
+    run_name: str | None = None,
+    mlflow_tracking_uri: str | None = None,
+    mlflow_experiment_name: str | None = None,
+    mlflow_run_name: str | None = None,
+    wandb_project: str | None = None,
+    wandb_entity: str | None = None,
+    wandb_run_name: str | None = None,
+    tensorboard_log_dir: str | None = None,
+):
@@
-    detected_loggers = []
+    detected_loggers = []
@@
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    if loggers is None:
+        loggers = detected_loggers if detected_loggers else ["async"]
@@
             "async": {
                 "()": AsyncStructuredHandler,
                 "log_dir": output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,
                 "filters": async_filters,
             },
             "tensorboard": {
                 "()": TensorBoardHandler,
                 "log_dir": tensorboard_log_dir or output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,
                 "filters": ["is_mapping", "is_rank0"],
             },

Also applies to: 968-997

