Add MLflow support and expose logging configuration in TrainingArgs #680
📝 Walkthrough

Adds MLflow, Weights & Biases, and TensorBoard logging options: new optional `TrainingArgs` fields, an `MLflowHandler` and MLflow integration in the metric logger, an expanded `setup_metric_logger` signature, and CLI/entrypoint flags propagated to the logger and subprocess runs.
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as main_ds.py
participant Setup as setup_metric_logger()
participant Logger as MetricLogger
participant MLflow as MLflow
participant WandB as Weights&Biases
participant TB as TensorBoard
User->>CLI: start training with logging flags
CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
Logger->>MLflow: MLflowHandler._setup() (set_tracking_uri, set_experiment, start_run)
Logger->>WandB: init run (project, entity, run_name)
Logger->>TB: create writer (log_dir)
CLI->>Logger: emit metrics/hparams LogRecord
Logger->>MLflow: MLflowHandler.emit() -> log_metrics/log_params
Logger->>WandB: log metrics
Logger->>TB: write scalars
CLI->>Logger: shutdown
Logger->>MLflow: MLflowHandler.close() -> end_run()
Logger->>WandB: finish run
Logger->>TB: close writer
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.
## New Parameters
- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration
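The `run_name` placeholders above can be understood as simple string formatting against run metadata. The sketch below is a hypothetical stand-in for the library's internal `_substitute_placeholders` helper mentioned later in this PR; the real implementation may use different keys or time formats.

```python
import datetime
import os


def substitute_placeholders(run_name):
    """Fill {time}, {utc_time}, {rank}, and {local_rank} placeholders.

    Hypothetical stand-in for the internal helper; key names follow the
    placeholders documented in this PR, but formats are assumptions.
    """
    if run_name is None:
        return None
    values = {
        "time": datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S"),
        "utc_time": datetime.datetime.now(datetime.timezone.utc).strftime(
            "%Y-%m-%d_%H-%M-%S"
        ),
        # Distributed launchers such as torchrun export RANK/LOCAL_RANK.
        "rank": os.environ.get("RANK", "0"),
        "local_rank": os.environ.get("LOCAL_RANK", "0"),
    }
    return run_name.format(**values)
```

For example, `substitute_placeholders("experiment_{rank}")` yields a rank-qualified name so each distributed worker writes to a distinct run.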
## Backend Support
| Logger | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb | Yes | Yes | Yes |
| tensorboard | Yes | No | Yes |
| mlflow | Yes | No | Yes |
| jsonl | Yes | Yes | No |
OSFT emits warnings for unsupported loggers/params and continues.
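The warn-and-continue behavior for unsupported loggers could look like the sketch below. The constant and function names here are assumptions for illustration, not the actual OSFT backend code.

```python
import warnings

# Hypothetical constant; the real OSFT backend's supported set may differ.
SUPPORTED_LOGGERS = {"wandb", "mlflow", "jsonl"}


def filter_supported_loggers(requested):
    """Drop loggers the backend cannot honor, warning instead of failing."""
    supported = []
    for lg in requested:
        if lg in SUPPORTED_LOGGERS:
            supported.append(lg)
        else:
            # stacklevel=2 points the warning at the caller, matching the
            # review feedback applied elsewhere in this PR.
            warnings.warn(
                f"OSFT backend does not support the '{lg}' logger; ignoring it.",
                stacklevel=2,
            )
    return supported
```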
Depends on: instructlab/training#680
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)
638-665: Unused `log_dir` parameter.

The `log_dir` parameter is stored as `self.log_dir` but never used in `_setup()` or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location

```diff
 def _setup(self):
     """Initialize the MLflow run with the configured settings."""
     if mlflow is None:
         msg = (
             "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
             "Please ensure it is installed by running 'pip install mlflow'"
         )
         raise RuntimeError(msg)
     if self.tracking_uri:
         mlflow.set_tracking_uri(self.tracking_uri)
     if self.experiment_name:
-        mlflow.set_experiment(self.experiment_name)
+        mlflow.set_experiment(
+            self.experiment_name,
+            artifact_location=str(self.log_dir),
+        )
     self._mlflow_run = mlflow.start_run(
         run_name=self.run_name, **self.mlflow_init_kwargs
     )
```

♻️ Option 2: Remove unused parameter

```diff
 def __init__(
     self,
     level: int = logging.INFO,
     run_name: str | None = None,
-    log_dir: str | os.PathLike = "logs",
     tracking_uri: str | None = None,
     experiment_name: str | None = None,
     **mlflow_init_kwargs: Any,
 ):
     """Initialize the MLflow logger and check for required dependencies.

     Args:
         level: The logging level for this handler
         run_name: Name of the run, can contain placeholders
-        log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
         tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
         experiment_name: Name of the MLflow experiment
         **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
     """
     super().__init__(level)
     self.run_name = _substitute_placeholders(run_name)
-    self.log_dir = Path(log_dir)
     self.tracking_uri = tracking_uri
     self.experiment_name = experiment_name
     self.mlflow_init_kwargs = mlflow_init_kwargs.copy()
     self._mlflow_run = None
```

Note: If removing `log_dir`, also update `setup_metric_logger` to not pass it to the MLflow handler config.
711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with `TensorBoardHandler` (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change

```diff
 # Filter to only numeric values for metrics
 metrics_dict = {}
 for k, v in flat_dict.items():
     try:
         metrics_dict[k] = float(v)
     except (ValueError, TypeError):
         # Skip non-numeric values for metrics
-        pass
+        warnings.warn(
+            f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+            stacklevel=2,
+        )
```
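The filtering loop under review can be exercised on its own. Below is a standalone version (the surrounding handler code is not reproduced, and the helper name is made up for illustration) showing how non-numeric values are dropped while a warning is emitted, as the review suggests.

```python
import warnings


def split_numeric_metrics(flat_dict):
    """Keep only values coercible to float, warning about the rest."""
    metrics_dict = {}
    for k, v in flat_dict.items():
        try:
            metrics_dict[k] = float(v)
        except (ValueError, TypeError):
            # Non-numeric values cannot go to mlflow.log_metrics();
            # warn so the user knows why they are missing.
            warnings.warn(
                f"Skipping non-numeric metric '{k}' of type {type(v).__name__}",
                stacklevel=2,
            )
    return metrics_dict
```

Note that strings like `"0.5"` still pass `float()`, so only genuinely non-numeric values (e.g. a model name) are skipped.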
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Replace defensive getattr() with direct attribute access in main_ds.py since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
862-979: ⚠️ Potential issue | 🟡 Minor

Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set `mlflow_experiment_name` or `mlflow_run_name`, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the `MLFLOW_EXPERIMENT_NAME` environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when `wandb_project` is set).

Proposed fix

```diff
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)
1-46: ⚠️ Potential issue | 🟡 Minor

Update the module docstring example to match the new setup API.

The example still uses `loggers=[...]` and `run_name=...`, but `setup_metric_logger` now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix

```diff
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )
```
862-985: ⚠️ Potential issue | 🟠 Major

Run name no longer wired to async/tensorboard (likely regression).

`setup_metric_logger` used to accept `run_name`, but it's now missing and both async/tensorboard handlers hardcode `run_name=None`. This means the new `TrainingArgs.run_name` won't affect those backends. If the intent is a global run name, please reintroduce it and apply as a fallback for backend-specific names.

✅ Suggested wiring

```diff
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
     "handlers": {
         "async": {
             "()": AsyncStructuredHandler,
             "log_dir": output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,  # Uses default template if None
             "filters": async_filters,
         },
         "tensorboard": {
             "()": TensorBoardHandler,
             "log_dir": tensorboard_log_dir or output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,  # Uses default template if None
             "filters": ["is_mapping", "is_rank0"],
         },
         "wandb": {
             "()": WandbHandler,
             "log_dir": output_dir,
-            "run_name": wandb_run_name,
+            "run_name": wandb_run_name or run_name,
             "project": wandb_project,
             "entity": wandb_entity,
             "filters": ["is_mapping", "is_rank0"],
         },
         "mlflow": {
             "()": MLflowHandler,
-            "run_name": mlflow_run_name,
+            "run_name": mlflow_run_name or run_name,
             "tracking_uri": mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"),
             "experiment_name": mlflow_experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
             "filters": ["is_mapping", "is_rank0"],
         },
```
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
Pass tensorboard_log_dir through to TrainingArgs
Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable MLflow support for OSFT backend
- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review comments
- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add OSFT MLflow example script
Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add logging and experiment tracking documentation
Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support
Update sidebar to include the new logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove duplicate warnings import in sft.py
The warnings module is already imported at module level (line 2).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review nitpicks in docs and examples
- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Auto-detect loggers from config parameters
Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)
The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
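The auto-detection rules in the commit above can be expressed as a small pure function. This is a sketch, not the library code: the real `setup_metric_logger` reads `os.environ` directly, while an `env` mapping is injected here so the logic is easy to exercise.

```python
import os


def detect_loggers(
    mlflow_tracking_uri=None,
    wandb_project=None,
    tensorboard_log_dir=None,
    env=None,
):
    """Return the list of backends implied by the given configuration."""
    env = os.environ if env is None else env
    detected = []
    # mlflow_tracking_uri or MLFLOW_TRACKING_URI env -> enables MLflow
    if mlflow_tracking_uri or env.get("MLFLOW_TRACKING_URI"):
        detected.append("mlflow")
    # wandb_project or WANDB_PROJECT env -> enables W&B
    if wandb_project or env.get("WANDB_PROJECT"):
        detected.append("wandb")
    # tensorboard_log_dir -> enables TensorBoard
    if tensorboard_log_dir:
        detected.append("tensorboard")
    return detected
```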
Address PR review feedback
- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
remove extraneous api kwargs
removes more extraneous api additions
reintroduce mlflow_run_name into lora
docstrings
drop formatting changes
read envs when present
review comments
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)
942-946: Simplify the async logger inclusion logic.

The current logic is a bit convoluted. When `detected_loggers` is empty, line 943 sets `loggers = ["async"]`, then lines 945-946 check if async is not in loggers (which it already is). When `detected_loggers` is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]
```

Or even simpler:

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]
```
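The two variants in the suggestion above are behaviorally equivalent as long as `"async"` never appears in `detected_loggers` (which the auto-detection step guarantees, since it only ever appends `"mlflow"`, `"wandb"`, and `"tensorboard"`). A quick side-by-side check, with hypothetical function names:

```python
def include_async_original(detected_loggers):
    # Verbatim logic from the reviewed code.
    loggers = detected_loggers if detected_loggers else ["async"]
    if detected_loggers and "async" not in loggers:
        loggers.append("async")
    return loggers


def include_async_simplified(detected_loggers):
    # The review's one-liner: always append "async" to whatever was detected.
    return [*detected_loggers, "async"]
```

One subtle difference of the original: `loggers = detected_loggers` aliases the input list, so the later `append` mutates the caller's list; the simplified form always builds a fresh list.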
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.
````diff
 def setup_metric_logger(
     output_dir,
     *,
+    mlflow_tracking_uri: str | None = None,
+    mlflow_experiment_name: str | None = None,
+    mlflow_run_name: str | None = None,
+    wandb_project: str | None = None,
+    wandb_entity: str | None = None,
+    wandb_run_name: str | None = None,
+    tensorboard_log_dir: str | None = None,
 ):
     """Configure the metric logging system with auto-detected backends.

     This function sets up a comprehensive logging configuration that supports
     multiple logging backends simultaneously. It configures filters, handlers,
-    and loggers for structured metric logging.
+    and loggers for structured metric logging. Backends are automatically
+    detected based on the presence of their configuration parameters.

     Args:
-        loggers: A string or list of strings specifying which logging backends to use.
-            Supported values: "tensorboard", "wandb", "async"
-        run_name: Name for the current training run. Can include placeholders like
-            {time}, {rank}, {utc_time}, {local_rank}.
         output_dir: Directory where log files will be stored
+        mlflow_tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000").
+            Falls back to MLFLOW_TRACKING_URI environment variable if not provided.
+            When set (or env var present), MLflow logging is automatically enabled.
+        mlflow_experiment_name: MLflow experiment name.
+            Falls back to MLFLOW_EXPERIMENT_NAME environment variable if not provided.
+        mlflow_run_name: MLflow run name. Supports placeholders: {time}, {rank}, {utc_time}, {local_rank}.
+        wandb_project: Weights & Biases project name.
+            When set (or WANDB_PROJECT env var present), wandb logging is automatically enabled.
+        wandb_entity: Weights & Biases team/entity name.
+        wandb_run_name: Weights & Biases run name. Supports placeholders: {time}, {rank}, {utc_time}, {local_rank}.
+        tensorboard_log_dir: Directory for TensorBoard logs.
+            When set, TensorBoard logging is automatically enabled.

     Example:
         ```python
-        # Setup logging with multiple backends
+        # Setup logging with MLflow (auto-detected from tracking URI)
         setup_metric_logger(
-            loggers=["tensorboard", "wandb", "async"],
-            run_name="experiment_{time}",
-            output_dir="logs"
+            output_dir="logs",
+            mlflow_tracking_uri="http://localhost:5000",
+            mlflow_experiment_name="my_experiment",
+            mlflow_run_name="my_run"
         )

-        # Setup logging with a single backend
+        # Setup logging with wandb (auto-detected from project)
         setup_metric_logger(
-            loggers="tensorboard",
-            run_name="my_run",
-            output_dir="logs"
+            output_dir="logs",
+            wandb_project="my_project",
+            wandb_run_name="my_run"
         )
+
+        # Setup logging with TensorBoard (auto-detected from log_dir)
+        setup_metric_logger(
+            output_dir="logs",
+            tensorboard_log_dir="logs/tensorboard"
+        )
         ```
     """
-    if not loggers:
-        return
-
     # Enable package logging
     propagate_package_logs()

-    if isinstance(loggers, str):
-        loggers = loggers.split(",")
-    loggers = [logger.strip() for logger in loggers]
+    # Auto-detect which loggers to enable based on configuration
+    detected_loggers = []
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
+        detected_loggers.append("mlflow")
+    if wandb_project or os.environ.get("WANDB_PROJECT"):
+        detected_loggers.append("wandb")
+    if tensorboard_log_dir:
+        detected_loggers.append("tensorboard")
+
+    # Always include async logger for file logging
+    loggers = detected_loggers if detected_loggers else ["async"]
+    # Also include async logger alongside other loggers for file-based logging
+    if detected_loggers and "async" not in loggers:
+        loggers.append("async")
````
Honor explicit logger selection and propagate run_name to all backends.
setup_metric_logger only enables MLflow/TensorBoard/W&B when their config fields are provided and then always adds async. This prevents explicitly enabling a backend with defaults (e.g., logger_type="tensorboard" without tensorboard_log_dir) and ignores a global run_name for async/tensorboard. This looks inconsistent with the new logger_type/run_name intent.
Consider accepting an explicit logger list (or logger_type) and wiring run_name to all handlers, only adding async when requested.
💡 Suggested adjustment (illustrative)
```diff
-def setup_metric_logger(
-    output_dir,
-    *,
-    mlflow_tracking_uri: str | None = None,
-    mlflow_experiment_name: str | None = None,
-    mlflow_run_name: str | None = None,
-    wandb_project: str | None = None,
-    wandb_entity: str | None = None,
-    wandb_run_name: str | None = None,
-    tensorboard_log_dir: str | None = None,
-):
+def setup_metric_logger(
+    output_dir,
+    *,
+    loggers: list[str] | None = None,
+    run_name: str | None = None,
+    mlflow_tracking_uri: str | None = None,
+    mlflow_experiment_name: str | None = None,
+    mlflow_run_name: str | None = None,
+    wandb_project: str | None = None,
+    wandb_entity: str | None = None,
+    wandb_run_name: str | None = None,
+    tensorboard_log_dir: str | None = None,
+):
@@
-    detected_loggers = []
+    detected_loggers = []
@@
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    if loggers is None:
+        loggers = detected_loggers if detected_loggers else ["async"]
@@
         "async": {
             "()": AsyncStructuredHandler,
             "log_dir": output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,
             "filters": async_filters,
         },
         "tensorboard": {
             "()": TensorBoardHandler,
             "log_dir": tensorboard_log_dir or output_dir,
-            "run_name": None,  # Uses default template
+            "run_name": run_name,
             "filters": ["is_mapping", "is_rank0"],
         },
```

Also applies to: 968-997
## Summary
- Adds new logging configuration fields to `TrainingArgs` for programmatic API usage
- Adds `wandb_project` and `wandb_entity` fields to `TrainingArgs` for consistency

## Changes
### New `TrainingArgs` fields

| Field | Type | Default | Notes |
|---------------------------|---------------|-----------|---------------------------------------------------|
| `logger_type` | `str` | `"async"` | `tensorboard`, `wandb`, `mlflow`, `async` |
| `run_name` | `str \| None` | `None` | Supports placeholders (`{time}`, `{rank}`, etc.) |
| `mlflow_tracking_uri` | `str \| None` | `None` | |
| `mlflow_experiment_name` | `str \| None` | `None` | |
| `wandb_project` | `str \| None` | `None` | |
| `wandb_entity` | `str \| None` | `None` | |
MLflowHandlerclassImplements the same interface as
TensorBoardHandlerandWandbHandler:mlflow.log_metrics()mlflow.log_params()tracking_uriandexperiment_nameconfigurationMLFLOW_TRACKING_URIandMLFLOW_EXPERIMENT_NAMEenv varsUpdated
run_training()APIPreviously,
run_training()hardcoded the logger to"async". Now it reads fromTrainingArgs:Example Usage
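One way `run_training()` might forward these `TrainingArgs` fields to the torchrun subprocess is by appending CLI flags only when a value is set. The flag names below mirror the CLI arguments discussed in this PR, but the builder function itself is an assumption for illustration.

```python
def append_logging_flags(command, args):
    """Append optional logging flags to a subprocess command list.

    `args` can be any object with the (optional) TrainingArgs-style
    attributes below; unset or None values produce no flag.
    """
    optional_flags = {
        "--mlflow_tracking_uri": getattr(args, "mlflow_tracking_uri", None),
        "--mlflow_experiment_name": getattr(args, "mlflow_experiment_name", None),
        "--wandb_project": getattr(args, "wandb_project", None),
        "--wandb_entity": getattr(args, "wandb_entity", None),
        "--tensorboard_log_dir": getattr(args, "tensorboard_log_dir", None),
    }
    for flag, value in optional_flags.items():
        if value:
            command.extend([flag, str(value)])
    return command
```

For example, with only `wandb_project` set, the command gains just `--wandb_project <value>` and the other backends stay disabled on the subprocess side.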
Test plan
wandb_project/wandb_entityfieldsasynclogger_typeenables multiple backends simultaneously🤖 Generated with Claude Code