Manage Files¶
NeMo Platform provides a file storage interface through the Files service. The Files service supports multiple storage backends and can be used to store datasets for training, evaluation results, model artifacts, and other files.
Concepts¶
- Fileset: A named container that holds files.
Filesets are uniquely identified by a name within a given workspace.
- Storage Backend: Each fileset is backed by a storage backend where the files are actually persisted. Supported backends include:
  - local: Local filesystem storage (default, read/write)
  - s3: Amazon S3 or S3-compatible storage such as MinIO (read/write)
  - ngc: NVIDIA GPU Cloud storage (read-only)
  - huggingface: HuggingFace Hub repositories (read-only)
Read-only backends allow you to create a fileset that acts as a handle to external resources. This provides a unified interface to access files from different sources using the same SDK methods, and allows other platform services to reference external data through a fileset.
- Purpose: A fileset field that indicates the intended use. Each purpose enables specific metadata fields under the corresponding key:
  - purpose="generic" (default): Use for files that don't fit the dataset or model categories. Metadata fields: no purpose-specific metadata fields.
  - purpose="dataset": Use for training and evaluation data. Metadata fields (metadata.dataset.*):

    | Field | Type | Description |
    | --- | --- | --- |
    | metadata.dataset.schema | object | Schema describing the dataset format (e.g., column names and types). |

  - purpose="model": Use for model weights and checkpoints. Metadata fields (metadata.model.*):

    | Field | Type | Description |
    | --- | --- | --- |
    | metadata.model.tool_calling.chat_template | string | Jinja2 chat template for the model. Propagated to the model entity spec by the model-spec background task. |
    | metadata.model.tool_calling.tool_call_parser | string | Name of the tool call parser (e.g., hermes, llama3_json, mistral). |
    | metadata.model.tool_calling.tool_call_plugin | string | Reference to a fileset containing a custom tool call plugin Python file ({workspace}/{fileset_name}). Requires models.tool_call_plugin.enabled at the platform level. |
    | metadata.model.tool_calling.auto_tool_choice | boolean | Whether to enable automatic tool choice. |

    These fields are merged into the model entity spec by the model-spec background task.
- Custom Fields: Arbitrary key-value data attached to a fileset via custom_fields for user-defined metadata.
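As a rough illustration of how a purpose and its metadata line up, the sketch below nests purpose-specific fields under the key that matches the purpose. The helper name build_metadata is hypothetical and not part of the SDK; only the purpose values and field names come from the descriptions above.

```python
from typing import Any

# Purposes described in this section; "generic" has no purpose-specific metadata.
PURPOSES_WITH_METADATA = {"dataset", "model"}


def build_metadata(purpose: str, fields: dict[str, Any]) -> dict[str, Any]:
    """Nest purpose-specific fields under the key that matches the purpose.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    if purpose == "generic":
        if fields:
            raise ValueError("purpose='generic' has no purpose-specific metadata fields")
        return {}
    if purpose not in PURPOSES_WITH_METADATA:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return {purpose: dict(fields)}


model_metadata = build_metadata(
    "model",
    {"tool_calling": {"tool_call_parser": "hermes", "auto_tool_choice": True}},
)
print(model_metadata)
```

How such a dictionary is passed when creating a fileset is not shown in this section, so check the SDK reference before relying on a particular argument name.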
Managing Filesets¶
Fileset management operations (create, retrieve, list, delete) are available through the CLI (nemo files filesets) or the SDK (client.files.filesets).
Tip
CLI commands use the workspace from your current context by default. Use --workspace to specify a different workspace.
Creating Filesets¶
Creating a fileset involves specifying a name and workspace. You can optionally provide a description, purpose, and custom storage configuration.
import os
from nemo_platform import NeMoPlatform
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
# Create a fileset
fileset = client.files.filesets.create(
name="my-files",
description="Training data for model fine-tuning",
)
print(fileset.model_dump_json(indent=2))
{
"id": "fileset-TeufFfapeKBrMtpBb42zdv",
"created_at": "2026-01-20T03:00:00",
"custom_fields": {},
"description": "Training data for model fine-tuning",
"metadata": {
"dataset": null
},
"name": "my-files",
"project": "",
"purpose": "generic",
"storage": {
"path": "/var/mnt/filesets/default/my-files",
"read_chunk_size": 16777216,
"type": "local",
"write_buffer_size": 16777216
},
"updated_at": "2026-01-20T03:00:00",
"workspace": "default"
}
Listing Filesets¶
List all filesets in a given workspace, filter them by purpose or storage type, or use pagination for large result sets.
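The listing calls themselves are not reproduced in this section, so the following is only a sketch of the general pattern: request one page at a time and stop when a short page comes back. The fetch_page callable, the page and page_size parameters, and the in-memory FILESETS list are all assumptions standing in for the real list method on client.files.filesets.

```python
from typing import Callable, Iterator, Optional

# In-memory stand-in for the filesets in a workspace (illustrative data only).
FILESETS = [
    {"name": "my-files", "purpose": "generic", "storage_type": "local"},
    {"name": "s3-training-data", "purpose": "dataset", "storage_type": "s3"},
    {"name": "hf-dataset", "purpose": "dataset", "storage_type": "huggingface"},
]


def fetch_page(page: int, page_size: int, purpose: Optional[str] = None) -> list[dict]:
    """Hypothetical page fetcher; a real one would call the Files service."""
    matches = [f for f in FILESETS if purpose is None or f["purpose"] == purpose]
    start = (page - 1) * page_size
    return matches[start : start + page_size]


def iter_filesets(
    fetch: Callable[..., list[dict]], page_size: int = 2, **filters
) -> Iterator[dict]:
    """Yield every fileset by requesting pages until one comes back short."""
    page = 1
    while True:
        items = fetch(page=page, page_size=page_size, **filters)
        yield from items
        if len(items) < page_size:
            break
        page += 1


for fileset in iter_filesets(fetch_page, purpose="dataset"):
    print(fileset["name"])
```

Wrapping pagination in a generator keeps calling code the same whether a workspace holds three filesets or several thousand.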
Deleting Filesets¶
Delete an entire fileset:
Warning
Deleting a fileset is permanent and cannot be undone. For local and s3 storage backends, this also deletes all underlying files.
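The warning above, read together with the backend list in the Concepts section, suggests a simple lookup that a script could check before deleting anything. The helper below is hypothetical and not an SDK function. That deleting a read-only fileset leaves the external resource untouched is inferred from the text, not stated by it.

```python
# Access mode per backend, as listed in the Concepts section.
BACKEND_ACCESS = {
    "local": "read/write",
    "s3": "read/write",
    "ngc": "read-only",
    "huggingface": "read-only",
}


def deletes_underlying_files(storage_type: str) -> bool:
    """Return True if deleting the fileset also removes the stored files.

    Illustrative helper, not an SDK function. Per the warning above, this
    applies to the local and s3 backends.
    """
    if storage_type not in BACKEND_ACCESS:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    return storage_type in {"local", "s3"}


for backend in BACKEND_ACCESS:
    print(backend, deletes_underlying_files(backend))
```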
Managing Files Within Filesets¶
High-level file operations are available through the CLI (nemo files) or the SDK (client.files), which provide convenient methods for uploading, downloading, and listing files.
For advanced use cases, an fsspec-compatible filesystem is available at client.files.fsspec. Refer to the fsspec documentation for additional methods.
Uploading Files¶
Upload files to a fileset:
# Upload a single file
nemo files upload ./data.jsonl my-files --remote-path training/data.jsonl
# Upload an entire directory
nemo files upload ./training_data/ my-files --remote-path training/
The same uploads with the SDK, including an upload that auto-creates a fileset when none is specified:
# Upload a single file
client.files.upload(
fileset="my-files",
local_path="./data.jsonl",
remote_path="training/data.jsonl",
)
# Upload an entire directory
client.files.upload(
fileset="my-files",
local_path="./training_data/",
remote_path="training/",
)
# Auto-create a new fileset (generates name like "fileset-a1b2c3d4")
result = client.files.upload(
local_path="./data.jsonl",
fileset_auto_create=True,
)
print(f"Uploaded to fileset: {result.name}")
Tip
If fileset is omitted, a new fileset is automatically created with a unique name following the pattern fileset-<8-hex> (e.g., fileset-a1b2c3d4). The generated name is returned so you can reference it in subsequent operations.
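If a script needs to tell auto-generated filesets apart from named ones, the pattern in the tip can be checked with a regular expression. This is a minimal sketch: the helper names are hypothetical, and lowercase hex is an assumption based on the example fileset-a1b2c3d4.

```python
import re
import secrets

# Pattern described in the tip: "fileset-" followed by 8 hex characters.
# Lowercase hex is an assumption based on the example "fileset-a1b2c3d4".
AUTO_NAME_RE = re.compile(r"fileset-[0-9a-f]{8}")


def looks_auto_generated(name: str) -> bool:
    """Return True if a fileset name matches the auto-generated pattern."""
    return AUTO_NAME_RE.fullmatch(name) is not None


def example_auto_name() -> str:
    """Produce a name of the same shape (not the platform's actual generator)."""
    return f"fileset-{secrets.token_hex(4)}"


print(looks_auto_generated("fileset-a1b2c3d4"))
print(looks_auto_generated("my-files"))
```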
Listing Files¶
List all files in a fileset:
List files under a specific directory:
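The listing commands are not reproduced here. To illustrate what "under a specific directory" means for remote paths, the sketch below filters a plain list of paths by directory. The paths and the helper name files_under are hypothetical, and the service may treat prefixes differently.

```python
from pathlib import PurePosixPath


def files_under(paths: list[str], directory: str) -> list[str]:
    """Return the paths located inside `directory` (at any depth)."""
    root = PurePosixPath(directory)
    return [p for p in paths if root in PurePosixPath(p).parents]


remote_paths = [
    "config.json",
    "training/data.jsonl",
    "training/shards/part-0.jsonl",
    "training_v2/data.jsonl",
]
print(files_under(remote_paths, "training/"))
```

Comparing path components instead of raw string prefixes keeps training_v2/ from matching a request for training/.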
Downloading Files¶
Download files to a local path:
Read file content into memory (SDK only):
content = client.files.download_content(
fileset="my-files",
remote_path="config.json",
)
print(content.decode("utf-8"))
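Content read into memory is often parsed straight away. The sketch below assumes download_content returns UTF-8 bytes, as the decode call above suggests, and parses JSON Lines data such as the data.jsonl file from the upload examples. The sample bytes and the parse_jsonl helper are illustrative.

```python
import json


def parse_jsonl(content: bytes) -> list[dict]:
    """Decode UTF-8 bytes and parse one JSON object per non-empty line."""
    records = []
    for line in content.decode("utf-8").splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records


# Stand-in for the bytes returned by download_content (illustrative data).
sample = (
    b'{"prompt": "hi", "completion": "hello"}\n'
    b"\n"
    b'{"prompt": "bye", "completion": "goodbye"}\n'
)
records = parse_jsonl(sample)
print(len(records), records[0]["prompt"])
```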
Deleting Files¶
Delete files from a fileset:
Using Progress Callbacks¶
Note
The CLI displays progress bars automatically during uploads and downloads. This section covers custom progress handling in the SDK.
Track progress during large file transfers using the RichProgressCallback context manager:
from nemo_platform.filesets import RichProgressCallback
# Upload a directory with progress bar
with RichProgressCallback(description="Uploading dataset") as callback:
client.files.upload(
fileset="my-files",
local_path="./large_dataset/",
remote_path="",
callback=callback,
)
# Download all files from a fileset with progress bar
with RichProgressCallback(description="Downloading dataset") as callback:
client.files.download(
fileset="my-files",
remote_path="",
local_path="./downloaded_data/",
callback=callback,
)
Use Cases¶
Using External Storage Backends¶
Connect to files stored in NVIDIA GPU Cloud (NGC):
# Create a secret to store your NGC API key
echo "$NGC_API_KEY" | nemo secrets create my-ngc-api-key --from-file -
# Create a fileset pointing to NGC storage
nemo files filesets create my-nemotron-personas-dataset-en_us \
--description "Nemotron Personas USA" \
--storage '{
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": "my-ngc-api-key"
}'
import os
# Create a secret to store your NGC API key
secret = client.secrets.create(name="my-ngc-api-key", value=os.getenv("NGC_API_KEY"))
# Create a fileset pointing to NGC storage
ngc_fileset = client.files.filesets.create(
name="my-nemotron-personas-dataset-en_us",
description="Nemotron Personas USA",
storage={
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": secret.name,
},
)
Connect to a HuggingFace repository:
# Create a secret to store your HuggingFace token (needed for gated and private repos)
echo "$HF_TOKEN" | nemo secrets create hf_token --from-file -
# Create a fileset pointing to a HuggingFace repo
nemo files filesets create hf-dataset \
--description "Dataset from HuggingFace" \
--storage '{
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": "hf_token"
}'
import os
# Create a secret to store your HuggingFace token (needed for gated and private repos)
secret = client.secrets.create(name="hf_token", value=os.getenv("HF_TOKEN"))
# Create a fileset pointing to a HuggingFace repo
hf_fileset = client.files.filesets.create(
name="hf-dataset",
description="Dataset from HuggingFace",
storage={
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": secret.name, # Optional, needed for gated and private repos
},
)
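Because the token is only needed for gated and private repositories, the storage dictionary can be built conditionally. This is a small sketch with a hypothetical helper, huggingface_storage. The default repo_type of "dataset" is an assumption chosen to match the example above.

```python
from typing import Optional


def huggingface_storage(
    repo_id: str,
    repo_type: str = "dataset",  # assumed default, matching the example above
    token_secret: Optional[str] = None,
) -> dict:
    """Build a huggingface storage config, adding token_secret only when given."""
    storage = {"type": "huggingface", "repo_id": repo_id, "repo_type": repo_type}
    if token_secret is not None:
        storage["token_secret"] = token_secret
    return storage


public = huggingface_storage("nvidia/Nemotron-Personas-Japan")
gated = huggingface_storage("nvidia/Nemotron-Personas-Japan", token_secret="hf_token")
print(public)
print(gated)
```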
Connect to an S3 bucket or S3-compatible storage (e.g., MinIO, Ceph):
# Create a fileset backed by S3 storage using SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
# The "prefix" field is optional - use it to scope the fileset to a folder within the bucket
nemo files filesets create s3-training-data \
--description "Training data stored in S3" \
--storage '{
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training",
"region": "us-east-1",
"use_sdk_auth": true
}'
# Upload data to S3
nemo files upload ./training_data/ s3-training-data
# Download data from S3
nemo files download s3-training-data -o ./downloaded_data/
# Create a fileset backed by S3 storage using SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
s3_fileset = client.files.filesets.create(
name="s3-training-data",
description="Training data stored in S3",
storage={
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training", # Optional: scope to a folder within the bucket
"region": "us-east-1",
"use_sdk_auth": True, # Use AWS SDK credential chain (default)
},
)
# Upload data to S3
client.files.upload(
fileset="s3-training-data",
local_path="./training_data/",
remote_path="",
)
# Download data from S3
client.files.download(
fileset="s3-training-data",
remote_path="",
local_path="./downloaded_data/",
)
For S3-compatible storage like MinIO, use explicit credentials and a custom endpoint:
# Create secrets to store your S3 credentials
echo "$S3_ACCESS_KEY" | nemo secrets create s3_access_key --from-file -
echo "$S3_SECRET_KEY" | nemo secrets create s3_secret_key --from-file -
nemo files filesets create minio-fileset \
--description "Data stored in MinIO" \
--storage '{
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000",
"region": "us-east-1",
"use_sdk_auth": false,
"access_key_id_secret": "s3_access_key",
"secret_access_key_secret": "s3_secret_key"
}'
import os
# Create secrets to store your S3 credentials
access_key = client.secrets.create(
name="s3_access_key", value=os.getenv("S3_ACCESS_KEY")
)
secret_key = client.secrets.create(
name="s3_secret_key", value=os.getenv("S3_SECRET_KEY")
)
s3_fileset = client.files.filesets.create(
name="minio-fileset",
description="Data stored in MinIO",
storage={
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000", # Custom S3 endpoint
"region": "us-east-1",
"use_sdk_auth": False, # Use explicit credentials instead of SDK auth
"access_key_id_secret": access_key.name,
"secret_access_key_secret": secret_key.name,
},
)
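The two S3 variants above differ only in how they authenticate, so one helper can build either configuration. This is a sketch: s3_storage is a hypothetical helper, and the rule that both credential secrets must be supplied together is an assumption, not a documented platform check. The last lines serialize the dictionary into the JSON form that the CLI --storage option takes in the examples above.

```python
import json
import shlex
from typing import Optional


def s3_storage(
    bucket: str,
    region: str,
    prefix: Optional[str] = None,
    endpoint_url: Optional[str] = None,
    access_key_id_secret: Optional[str] = None,
    secret_access_key_secret: Optional[str] = None,
) -> dict:
    """Build an s3 storage config for either authentication mode.

    Illustrative helper, not part of the SDK. If both secret names are given,
    explicit credentials are used; if neither is given, SDK auth is used.
    """
    explicit = (access_key_id_secret, secret_access_key_secret)
    if any(explicit) and not all(explicit):
        raise ValueError("provide both credential secret names or neither")

    storage = {"type": "s3", "bucket": bucket, "region": region}
    if prefix:
        storage["prefix"] = prefix
    if endpoint_url:
        storage["endpoint_url"] = endpoint_url
    if all(explicit):
        storage["use_sdk_auth"] = False
        storage["access_key_id_secret"] = access_key_id_secret
        storage["secret_access_key_secret"] = secret_access_key_secret
    else:
        storage["use_sdk_auth"] = True
    return storage


minio = s3_storage(
    bucket="my-bucket",
    region="us-east-1",
    endpoint_url="http://minio.example.com:9000",
    access_key_id_secret="s3_access_key",
    secret_access_key_secret="s3_secret_key",
)
# The same dictionary, quoted for use as the CLI --storage argument.
print("--storage", shlex.quote(json.dumps(minio)))
```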