Weights & Biases Integration#
Nemotron Kit detects your local W&B credentials and settings and passes them to containers launched via nemo-run. This removes manual credential management across local, Docker, Slurm, and cloud executors.
Note: The artifact system currently requires W&B. Backend-agnostic artifact tracking is in development.
Configuration#
env.toml Setup#
Add a [wandb] section to your env.toml:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
| Field | Description |
|---|---|
| project | W&B project name (required to enable tracking) |
| entity | W&B team/entity name |
Authentication#
Authenticate locally before running jobs:
wandb login
Your API key is stored in ~/.netrc and automatically detected by the kit.
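If you want to confirm that a key is present without printing it, a sketch along these lines may help. It uses the standard-library netrc module and assumes the default W&B cloud host api.wandb.ai; self-hosted servers use a different host name. The function read_netrc_password is illustrative only, and the demonstration writes a throwaway file rather than touching the real ~/.netrc.

```python
import netrc
import os
import tempfile


def read_netrc_password(path: str, host: str = "api.wandb.ai"):
    """Return the password stored for `host` in a netrc file, or None."""
    try:
        entry = netrc.netrc(path).authenticators(host)
    except (FileNotFoundError, netrc.NetrcParseError):
        return None
    # authenticators() yields (login, account, password) or None.
    return entry[2] if entry else None


# Demonstrate on a temporary file instead of the real ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "netrc")
    with open(demo_path, "w") as fh:
        fh.write("machine api.wandb.ai\n  login user\n  password dummy-key\n")
    demo_key = read_netrc_password(demo_path)
    missing_key = read_netrc_password(os.path.join(tmp, "absent"))

print("key found" if demo_key else "no key")
```

To inspect your own login, pass os.path.expanduser("~/.netrc") as the path.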
Automatic Environment Variables#
When you run jobs via nemo-run, the kit automatically detects your W&B configuration and passes it to the container as environment variables:
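| Variable | Source | Description |
|---|---|---|
| WANDB_API_KEY | Local wandb login (wandb.api.api_key) | API key from local wandb login |
| WANDB_PROJECT | env.toml [wandb] project | Project name |
| WANDB_ENTITY | env.toml [wandb] entity | Team/entity name |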
This works across all executor types:
Local: Environment variables set directly
Docker: Passed via container env vars
Slurm: Included in job submission
SkyPilot: Set in cloud instance environment
Ray: Passed via runtime_env.env_vars
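As a rough illustration of what "passing" means for two of these executors, the sketch below renders one environment dictionary as docker run arguments (the -e KEY=VALUE flag) and as a Ray runtime_env mapping with an env_vars key. Both helper functions are hypothetical; this page does not show how the kit builds each executor, so treat this as a sketch of the target shapes only.

```python
def to_docker_env_args(env: dict) -> list:
    """Render env vars as repeated `-e KEY=VALUE` arguments for `docker run`."""
    args = []
    for key, value in sorted(env.items()):
        args += ["-e", f"{key}={value}"]
    return args


def to_ray_runtime_env(env: dict) -> dict:
    """Wrap env vars in the `runtime_env` shape that Ray accepts."""
    return {"env_vars": dict(env)}


wandb_env = {"WANDB_PROJECT": "nemotron", "WANDB_ENTITY": "my-team"}
docker_args = to_docker_env_args(wandb_env)
ray_env = to_ray_runtime_env(wandb_env)
print(docker_args)
print(ray_env)
```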
How It Works#
The build_executor() function in nemotron.kit.run handles automatic detection:
# Auto-detect W&B API key from local login
if "WANDB_API_KEY" not in merged_env:
    import wandb
    api_key = wandb.api.api_key
    if api_key:
        merged_env["WANDB_API_KEY"] = api_key
# Load project/entity from env.toml [wandb] section
wandb_config = load_wandb_config()
if wandb_config is not None:
    if wandb_config.project:
        merged_env["WANDB_PROJECT"] = wandb_config.project
    if wandb_config.entity:
        merged_env["WANDB_ENTITY"] = wandb_config.entity
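The precedence rules in that excerpt can be exercised without wandb or nemo-run installed. The sketch below mirrors the same logic in a standalone function; merge_wandb_env and its parameters are hypothetical stand-ins, with the detected key and the env.toml values passed in as plain arguments.

```python
def merge_wandb_env(base_env, api_key=None, project=None, entity=None):
    """Return a copy of base_env with W&B variables filled in.

    Mirrors the excerpt above: an API key that is already present is kept,
    while project and entity from the config are written when they are set.
    """
    merged = dict(base_env)
    if "WANDB_API_KEY" not in merged and api_key:
        merged["WANDB_API_KEY"] = api_key
    if project:
        merged["WANDB_PROJECT"] = project
    if entity:
        merged["WANDB_ENTITY"] = entity
    return merged


explicit = merge_wandb_env({"WANDB_API_KEY": "from-user"}, api_key="from-login")
detected = merge_wandb_env({}, api_key="from-login", project="nemotron")
print(explicit)
print(detected)
```

Note that only the API key is guarded by a "not already set" check in the excerpt; a project or entity configured in env.toml replaces any value already in the environment dictionary.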
Using W&B in Training Scripts#
Initialization from Environment#
Training scripts running inside containers can initialize W&B from environment variables:
from nemotron.kit.train_script import init_wandb_from_env
# Reads WANDB_PROJECT and WANDB_ENTITY from environment
init_wandb_from_env()
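The internals of init_wandb_from_env are not shown on this page. As a rough idea of what such a helper might do, the sketch below maps the two environment variables to the project and entity keyword arguments of wandb.init. The function wandb_init_kwargs_from_env is hypothetical, and it only builds the arguments so that it runs without wandb installed.

```python
import os


def wandb_init_kwargs_from_env(environ=None) -> dict:
    """Hypothetical helper: collect wandb.init() kwargs from the environment."""
    env = os.environ if environ is None else environ
    kwargs = {}
    if env.get("WANDB_PROJECT"):
        kwargs["project"] = env["WANDB_PROJECT"]
    if env.get("WANDB_ENTITY"):
        kwargs["entity"] = env["WANDB_ENTITY"]
    return kwargs


init_kwargs = wandb_init_kwargs_from_env(
    {"WANDB_PROJECT": "nemotron", "WANDB_ENTITY": "my-team"}
)
# With wandb installed, these could be passed on as wandb.init(**init_kwargs).
print(init_kwargs)
```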
Conditional Initialization#
For scripts that support optional W&B tracking:
from nemotron.kit import init_wandb_if_configured
from nemotron.kit.wandb import WandbConfig
# Initialize only if WandbConfig is provided and has a project set
wandb_config = WandbConfig(project="nemotron", entity="my-team")
init_wandb_if_configured(wandb_config, job_type="training")
WandbConfig Dataclass#
The WandbConfig dataclass provides typed configuration:
from nemotron.kit.wandb import WandbConfig
config = WandbConfig(
    project="nemotron",          # Required to enable tracking
    entity="my-team",            # Team/entity name
    run_name="experiment-001",   # Optional run name
    tags=("pretrain", "nano3"),  # Tags for filtering
    notes="First pretrain run",  # Run description
)
# Check if tracking is enabled
if config.enabled:
    print(f"Logging to {config.entity}/{config.project}")
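For readers who want to experiment without the kit installed, the stand-in below has the same fields as the example above. It assumes that enabled simply means "a project is set", which matches the comments on this page but is not confirmed against the kit's source. The class name SketchWandbConfig is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SketchWandbConfig:
    """Illustrative stand-in for WandbConfig; not the kit's implementation."""

    project: Optional[str] = None
    entity: Optional[str] = None
    run_name: Optional[str] = None
    tags: Tuple[str, ...] = ()
    notes: Optional[str] = None

    @property
    def enabled(self) -> bool:
        # Assumption: tracking is on whenever a non-empty project is given.
        return bool(self.project)


with_project = SketchWandbConfig(project="nemotron", entity="my-team")
without_project = SketchWandbConfig(entity="my-team")
print(with_project.enabled, without_project.enabled)
```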
Artifact Lineage#
W&B artifacts provide full lineage tracking. See Artifact Lineage for details on:
End-to-end lineage from raw data to final model
Semantic URIs for artifact references
Viewing lineage in the W&B UI
Advanced Features#
Checkpoint Logging#
The kit provides a patch that makes checkpoint saving log artifacts to W&B automatically:
from nemotron.kit.wandb import patch_wandb_checkpoint_logging
# Patch Megatron-Bridge checkpoint saving
patch_wandb_checkpoint_logging()
This enables:
Automatic artifact creation for each checkpoint
Lineage links to training data artifacts
Version tracking with step numbers
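The general technique behind such a patch is to wrap the save function so that a hook runs after each save. The sketch below shows that pattern on a fake module; patch_save, the save_checkpoint signature, and the on_saved hook are all assumptions made for illustration and do not reflect the Megatron-Bridge API or the kit's actual patch.

```python
import functools
from types import SimpleNamespace


def patch_save(module, attr, on_saved):
    """Wrap module.<attr> so on_saved(path, step) runs after every save."""
    original = getattr(module, attr)
    if getattr(original, "_patched", False):
        return  # already wrapped; avoid logging each checkpoint twice

    @functools.wraps(original)
    def wrapper(path, step, *args, **kwargs):
        result = original(path, step, *args, **kwargs)
        on_saved(path, step)
        return result

    wrapper._patched = True
    setattr(module, attr, wrapper)


# Fake "training library" with a save function (hypothetical signature).
fake_lib = SimpleNamespace(save_checkpoint=lambda path, step: f"{path}@{step}")
logged = []

patch_save(fake_lib, "save_checkpoint", lambda p, s: logged.append((p, s)))
patch_save(fake_lib, "save_checkpoint", lambda p, s: logged.append((p, s)))
saved = fake_lib.save_checkpoint("/ckpt", 100)
print(saved, logged)
```

In a real patch, the hook is where an artifact would be created and logged; the guard against double wrapping keeps repeated calls to the patch function harmless.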
NeMo-RL Checkpoint Logging#
For reinforcement learning with NeMo-RL:
from nemotron.kit.wandb import patch_nemo_rl_checkpoint_logging
# Patch NeMo-RL checkpoint saving
patch_nemo_rl_checkpoint_logging()
Seeded Random Fix#
When using seeded random states (common in RL), W&B’s default run ID generation can fail. The kit provides a patch:
from nemotron.kit.wandb import patch_wandb_runid_for_seeded_random
# Fix "Invalid Client ID digest" errors
patch_wandb_runid_for_seeded_random()
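This page does not describe the failure mechanism inside W&B, so the sketch below only illustrates the general hazard: identifiers drawn from Python's global random generator become repeatable once the generator is seeded, whereas random.SystemRandom draws from operating-system entropy and ignores the seed. The helper make_id is hypothetical and is not how W&B or the kit generates run IDs.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits


def make_id(rng, length=16):
    """Build an identifier from any object that offers a choice() method."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


# Seeding the global generator makes "random" IDs repeat.
random.seed(1234)
first = make_id(random)
random.seed(1234)
second = make_id(random)

# SystemRandom is unaffected by random.seed().
system_rng = random.SystemRandom()
random.seed(1234)
third = make_id(system_rng)
random.seed(1234)
fourth = make_id(system_rng)

print(first == second, third == fourth)
```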
Troubleshooting#
“WANDB_API_KEY not found”#
Ensure you’re logged in locally:
wandb login
“Project not found”#
Verify the project exists in your W&B workspace, or let W&B create it automatically on first run.
Environment variables not passed to container#
Check that your env.toml has a [wandb] section:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
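To see what actually reached the container, a small diagnostic such as the following can be run inside the job. It is a sketch only: report_wandb_env is a hypothetical helper, and it hides the API key so the output is safe to paste into logs.

```python
import os

EXPECTED = ("WANDB_API_KEY", "WANDB_PROJECT", "WANDB_ENTITY")


def report_wandb_env(environ=None) -> dict:
    """Map each expected variable to its value, 'missing', or a masked marker."""
    env = os.environ if environ is None else environ
    report = {}
    for name in EXPECTED:
        value = env.get(name)
        if not value:
            report[name] = "missing"
        elif name == "WANDB_API_KEY":
            report[name] = "set (hidden)"  # never print the secret itself
        else:
            report[name] = value
    return report


for name, status in report_wandb_env().items():
    print(f"{name}: {status}")
```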
Ray workers missing credentials#
For Ray data prep jobs, credentials are passed via runtime_env.env_vars. Ensure your local wandb login is active before submitting the job.
API Reference#
wandb.py Exports#
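| Export | Description |
|---|---|
| WandbConfig | Configuration dataclass |
| init_wandb_if_configured | Conditional W&B initialization |
| patch_wandb_checkpoint_logging | Enable Megatron-Bridge checkpoint artifacts |
| patch_nemo_rl_checkpoint_logging | Enable NeMo-RL checkpoint artifacts |
| patch_wandb_runid_for_seeded_random | Fix seeded random ID generation |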
run.py Exports#
| Export | Description |
|---|---|
| load_wandb_config | Load the [wandb] section from env.toml |
| build_executor | Build executor with auto W&B env vars |
Further Reading#
OmegaConf Configuration — Artifact interpolations and unified logging patches
Artifact Lineage — Full lineage tracking and W&B UI
Nemotron Kit — Core framework overview
Execution through NeMo-Run — Execution profiles and env.toml