Manage Files

NeMo Platform provides a file storage interface through the Files service. The Files service supports multiple storage backends and can be used to store datasets for training, evaluation results, model artifacts, and other files.

Concepts

  • Fileset: A named container that holds files.

Filesets are uniquely identified by a name within a given workspace.

  • Storage Backend: Each fileset is backed by a storage backend where the files are actually persisted. Supported backends include:
    • local: Local filesystem storage (default, read/write)
    • s3: Amazon S3 or S3-compatible storage such as MinIO (read/write)
    • ngc: NVIDIA GPU Cloud storage (read-only)
    • huggingface: HuggingFace Hub repositories (read-only)

Read-only backends allow you to create a fileset that acts as a handle to external resources. This provides a unified interface to access files from different sources using the same SDK methods, and allows other platform services to reference external data through a fileset.

  • Purpose: A fileset field that indicates the intended use. Each purpose enables specific metadata fields under the corresponding key. The available metadata fields for each purpose are:

    Use purpose="generic" (default) for other files that don't fit the dataset or model categories.

    Metadata fields: No purpose-specific metadata fields.

    Use purpose="dataset" for training and evaluation data.

    Metadata fields (metadata.dataset.*):

    Field | Type | Description
    metadata.dataset.schema | object | Schema describing the dataset format (e.g., column names and types).

    Use purpose="model" for model weights and checkpoints.

    Metadata fields (metadata.model.*):

    Field | Type | Description
    metadata.model.tool_calling.chat_template | string | Jinja2 chat template for the model. Propagated to the model entity spec by the model-spec background task.
    metadata.model.tool_calling.tool_call_parser | string | Name of the tool call parser (e.g., hermes, llama3_json, mistral).
    metadata.model.tool_calling.tool_call_plugin | string | Reference to a fileset containing a custom tool call plugin Python file ({workspace}/{fileset_name}). Requires models.tool_call_plugin.enabled at the platform level.
    metadata.model.tool_calling.auto_tool_choice | boolean | Whether to enable automatic tool choice.

    These fields are merged into the model entity spec by the model-spec background task.

  • Custom Fields: Arbitrary key-value data attached to a fileset via custom_fields for user-defined metadata.
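The purpose-specific metadata and custom fields described above can be assembled as plain dictionaries before a fileset is created. The following is a minimal sketch, not the SDK's own API: the helper names plugin_reference and model_fileset_fields are hypothetical, and it is an assumption that client.files.filesets.create accepts purpose, metadata, and custom_fields as keyword arguments in this shape.

```python
from typing import Any, Dict, Optional


def plugin_reference(workspace: str, fileset_name: str) -> str:
    """Format a tool call plugin reference as {workspace}/{fileset_name}."""
    # Assumption: neither part contains "/", so the reference stays unambiguous.
    if "/" in workspace or "/" in fileset_name:
        raise ValueError("workspace and fileset name must not contain '/'")
    return f"{workspace}/{fileset_name}"


def model_fileset_fields(
    tool_call_parser: str,
    chat_template: Optional[str] = None,
    tool_call_plugin: Optional[str] = None,
    auto_tool_choice: bool = False,
) -> Dict[str, Any]:
    """Build the purpose and metadata.model.tool_calling.* fields for a model fileset."""
    tool_calling: Dict[str, Any] = {
        "tool_call_parser": tool_call_parser,
        "auto_tool_choice": auto_tool_choice,
    }
    # Optional fields are only included when a value was supplied.
    if chat_template is not None:
        tool_calling["chat_template"] = chat_template
    if tool_call_plugin is not None:
        tool_calling["tool_call_plugin"] = tool_call_plugin
    return {
        "purpose": "model",
        "metadata": {"model": {"tool_calling": tool_calling}},
    }


fields = model_fileset_fields(
    tool_call_parser="hermes",
    tool_call_plugin=plugin_reference("default", "my-plugins"),
    auto_tool_choice=True,
)
# custom_fields holds arbitrary user-defined key-value data.
fields["custom_fields"] = {"team": "research"}
print(fields)
```

With a configured client, these fields could then be passed along the lines of client.files.filesets.create(name="my-model", **fields), provided the SDK accepts them in this form.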


Managing Filesets

Fileset management operations (create, retrieve, list, delete) are available through the CLI (nemo files filesets) or the SDK (client.files.filesets).

Tip

CLI commands use the workspace from your current context by default. Use --workspace to specify a different workspace:

nemo files filesets list --workspace my-workspace

Creating Filesets

To create a fileset, specify a name. The workspace comes from your current CLI context or from the SDK client configuration unless you override it. You can optionally provide a description, purpose, and custom storage configuration.

nemo files filesets create my-files \
--description "Training data for model fine-tuning"
{
  "id": "fileset-TeufFfapeKBrMtpBb42zdv",
  "created_at": "2026-01-20T03:00:00",
  "custom_fields": {},
  "description": "Training data for model fine-tuning",
  "metadata": {
    "dataset": null
  },
  "name": "my-files",
  "project": "",
  "purpose": "generic",
  "storage": {
    "path": "/var/mnt/filesets/default/my-files",
    "read_chunk_size": 16777216,
    "type": "local",
    "write_buffer_size": 16777216
  },
  "updated_at": "2026-01-20T03:00:00",
  "workspace": "default"
}
import os

from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

# Create a fileset
fileset = client.files.filesets.create(
    name="my-files",
    description="Training data for model fine-tuning",
)

print(fileset.model_dump_json(indent=2))
{
  "id": "fileset-TeufFfapeKBrMtpBb42zdv",
  "created_at": "2026-01-20T03:00:00",
  "custom_fields": {},
  "description": "Training data for model fine-tuning",
  "metadata": {
    "dataset": null
  },
  "name": "my-files",
  "project": "",
  "purpose": "generic",
  "storage": {
    "path": "/var/mnt/filesets/default/my-files",
    "read_chunk_size": 16777216,
    "type": "local",
    "write_buffer_size": 16777216
  },
  "updated_at": "2026-01-20T03:00:00",
  "workspace": "default"
}

Listing Filesets

List all filesets in a given workspace:

nemo files filesets list
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name ┃ workspace ┃ created_at ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ my-files │ default │ 2026-01-20T03:00:00 │
└──────────┴───────────┴────────────────────────────┘
filesets = client.files.filesets.list()

for fileset in filesets:
    print(f"{fileset.name}: {fileset.description}")

Filter filesets by purpose or storage type:

# List only dataset filesets
nemo files filesets list --filter.purpose dataset

# List filesets using local storage
nemo files filesets list --filter.storage-type local
# List only dataset filesets
datasets = client.files.filesets.list(filter={"purpose": "dataset"})

# List filesets using local storage
local_filesets = client.files.filesets.list(filter={"storage_type": "local"})

Use pagination for large result sets:

# The "-" prefix sorts in descending order (newest first)
nemo files filesets list --page 1 --page-size 10 --sort "-created_at"
filesets = client.files.filesets.list(
    page=1,
    page_size=10,
    sort="-created_at",  # The "-" prefix sorts descending (newest first)
)

Deleting Filesets

Delete an entire fileset:

nemo files filesets delete my-files
✓ Deleted successfully
deleted_fileset = client.files.filesets.delete(name="my-files")

print(f"Deleted fileset: {deleted_fileset.name}")

Warning

Deleting a fileset is permanent and cannot be undone. For local and s3 storage backends, this also deletes all underlying files.
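
Before deleting a fileset, it can help to check what the deletion affects for its storage backend. The sketch below encodes the access modes listed under Concepts and the warning above in a small lookup. The helper names are hypothetical. That read-only backends leave the external data untouched is inferred from the warning, not stated outright.

```python
# Access modes as listed under Concepts.
BACKEND_ACCESS = {
    "local": "read/write",
    "s3": "read/write",
    "ngc": "read-only",
    "huggingface": "read-only",
}

# Per the warning above, deleting a fileset also deletes files for these backends.
DELETE_REMOVES_FILES = {"local", "s3"}


def is_writable(storage_type: str) -> bool:
    """Return True when the backend supports uploads and file deletion."""
    if storage_type not in BACKEND_ACCESS:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    return BACKEND_ACCESS[storage_type] == "read/write"


def delete_removes_files(storage_type: str) -> bool:
    """Return True when deleting the fileset also deletes the underlying files."""
    if storage_type not in BACKEND_ACCESS:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    return storage_type in DELETE_REMOVES_FILES


for backend in BACKEND_ACCESS:
    print(backend, is_writable(backend), delete_removes_files(backend))
```

A check like delete_removes_files(fileset.storage.type) could gate a confirmation prompt, assuming the SDK object exposes the storage type the way the JSON output above suggests.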


Managing Files Within Filesets

High-level file operations are available through the CLI (nemo files) or the SDK (client.files), which provide convenient methods for uploading, downloading, and listing files.

For advanced use cases, an fsspec-compatible filesystem is available at client.files.fsspec. Refer to the fsspec documentation for additional methods.

Uploading Files

Upload files to a fileset:

# Upload a single file
nemo files upload ./data.jsonl my-files --remote-path training/data.jsonl

# Upload an entire directory
nemo files upload ./training_data/ my-files --remote-path training/
Uploading ━━━━━━━━━━━━━━━━ 100% • 3/3 files
Completed upload to my-files#training/

Upload without specifying a fileset to auto-create one:

# Auto-creates a new fileset with a generated name (fileset-<8 hex chars>)
nemo files upload ./data.jsonl
Uploading ━━━━━━━━━━━━━━━━ 100% • 1/1 files
Completed upload to fileset-a1b2c3d4
# Upload a single file
client.files.upload(
    fileset="my-files",
    local_path="./data.jsonl",
    remote_path="training/data.jsonl",
)

# Upload an entire directory
client.files.upload(
    fileset="my-files",
    local_path="./training_data/",
    remote_path="training/",
)

# Auto-create a new fileset (generates name like "fileset-a1b2c3d4")
result = client.files.upload(
    local_path="./data.jsonl",
    fileset_auto_create=True,
)
print(f"Uploaded to fileset: {result.name}")

Tip

If fileset is omitted, a new fileset is automatically created with a unique name following the pattern fileset-<8 hex chars> (e.g., fileset-a1b2c3d4). The SDK example above requests this with fileset_auto_create=True. The generated name is returned so you can reference it in subsequent operations.
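
When scripts need to tell auto-created filesets apart from named ones, the naming pattern above can be checked with a regular expression. This is a minimal sketch: is_auto_generated_name is a hypothetical helper, and it assumes the eight hex characters are lowercase, as in the example fileset-a1b2c3d4.

```python
import re

# Assumption: the 8 hex characters are lowercase, as in "fileset-a1b2c3d4".
_AUTO_NAME = re.compile(r"fileset-[0-9a-f]{8}")


def is_auto_generated_name(name: str) -> bool:
    """Return True when the name matches the fileset-<8 hex chars> pattern."""
    return _AUTO_NAME.fullmatch(name) is not None


for candidate in ("fileset-a1b2c3d4", "my-files", "fileset-TeufFfapeKBrMtpBb42zdv"):
    print(candidate, is_auto_generated_name(candidate))
```

The last candidate is the id value from the create output shown earlier. It shares the fileset- prefix but is longer and not hexadecimal, so it does not match.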

Listing Files

List all files in a fileset:

nemo files list my-files
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ PATH ┃ SIZE ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ training/data.jsonl │ 1024 │
│ training/validation.jsonl │ 512 │
└────────────────────────────┴──────┘
response = client.files.list(fileset="my-files")

for file in response.data:
    print(f"{file.path}: {file.size} bytes")

List files under a specific directory:

nemo files list my-files --remote-path training/
training_files = client.files.list(fileset="my-files", remote_path="training/")
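
The path and size attributes shown in the listing can be aggregated, for example to total the bytes stored under each top-level directory. The sketch below is illustrative only: FileEntry is a hypothetical stand-in for the entries in response.data, and size_by_top_level is not part of the SDK.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FileEntry:
    """Hypothetical stand-in for an entry in response.data (path and size only)."""

    path: str
    size: int


def size_by_top_level(entries: Iterable[FileEntry]) -> Dict[str, int]:
    """Sum file sizes per top-level directory; root-level files are grouped under '.'."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in entries:
        top, sep, _ = entry.path.partition("/")
        key = f"{top}/" if sep else "."
        totals[key] += entry.size
    return dict(totals)


entries = [
    FileEntry("training/data.jsonl", 1024),
    FileEntry("training/validation.jsonl", 512),
    FileEntry("config.json", 128),
]
totals = size_by_top_level(entries)
print(totals)
```

Because the helper only reads path and size, the objects returned by client.files.list should work in place of FileEntry, assuming they expose those two attributes as the loop above does.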

Downloading Files

Download files to a local path:

# Download a single file
nemo files download my-files --remote-path training/data.jsonl -o ./data.jsonl

# Download an entire directory
nemo files download my-files --remote-path training/ -o ./training_data/
Downloading ━━━━━━━━━━━━━━━━ 100% • 2/2 files
Downloaded my-files#training/ to './training_data/'
# Download a single file
client.files.download(
    fileset="my-files",
    remote_path="training/data.jsonl",
    local_path="./data.jsonl",
)

# Download an entire directory
client.files.download(
    fileset="my-files",
    remote_path="training/",
    local_path="./training_data/",
)

Read file content into memory (SDK only):

content = client.files.download_content(
    fileset="my-files",
    remote_path="config.json",
)
print(content.decode("utf-8"))
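
Content read into memory can be parsed directly without writing a temporary file. The sketch below parses JSON Lines data such as training/data.jsonl. It assumes download_content returns bytes, which the decode call above suggests. parse_jsonl is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json
from typing import Any, List


def parse_jsonl(content: bytes) -> List[Any]:
    """Decode UTF-8 bytes and parse one JSON document per non-empty line."""
    records = []
    # Split on "\n" only; strip() removes a trailing "\r" from CRLF files.
    for line in content.decode("utf-8").split("\n"):
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Made-up sample standing in for the bytes returned by download_content.
sample = (
    b'{"prompt": "hi", "completion": "hello"}\n'
    b"\n"
    b'{"prompt": "bye", "completion": "goodbye"}\n'
)
records = parse_jsonl(sample)
print(len(records), records[0]["prompt"])
```

For very large files, downloading to disk and streaming line by line is likely a better fit than holding the whole payload in memory.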

Deleting Files

Delete files from a fileset:

nemo files delete my-files --remote-path training/old-data.jsonl
Deleted my-files#training/old-data.jsonl
client.files.delete(
    fileset="my-files",
    remote_path="training/old-data.jsonl",
)

Using Progress Callbacks

Note

The CLI displays progress bars automatically during uploads and downloads. This section covers custom progress handling in the SDK.

Track progress during large file transfers using the RichProgressCallback context manager:

from nemo_platform.filesets import RichProgressCallback

# Upload a directory with progress bar
with RichProgressCallback(description="Uploading dataset") as callback:
    client.files.upload(
        fileset="my-files",
        local_path="./large_dataset/",
        remote_path="",
        callback=callback,
    )

# Download all files from a fileset with progress bar
with RichProgressCallback(description="Downloading dataset") as callback:
    client.files.download(
        fileset="my-files",
        remote_path="",
        local_path="./downloaded_data/",
        callback=callback,
    )

Use Cases

Using External Storage Backends

Connect to files stored in NVIDIA GPU Cloud (NGC):

# Create a secret to store your NGC API key
echo "$NGC_API_KEY" | nemo secrets create my-ngc-api-key --from-file -

# Create a fileset pointing to NGC storage
nemo files filesets create my-nemotron-personas-dataset-en_us \
--description "Nemotron Personas USA" \
--storage '{
"type": "ngc",
"org": "nvidia",
"team": "nemotron-personas",
"resource": "nemotron-personas-dataset-en_us",
"version": "0.0.2",
"api_key_secret": "my-ngc-api-key"
}'
import os

# Create a secret to store your NGC API key
secret = client.secrets.create(name="my-ngc-api-key", value=os.getenv("NGC_API_KEY"))

# Create a fileset pointing to NGC storage
ngc_fileset = client.files.filesets.create(
    name="my-nemotron-personas-dataset-en_us",
    description="Nemotron Personas USA",
    storage={
        "type": "ngc",
        "org": "nvidia",
        "team": "nemotron-personas",
        "resource": "nemotron-personas-dataset-en_us",
        "version": "0.0.2",
        "api_key_secret": secret.name,
    },
)

Connect to a HuggingFace repository:

# Create a secret to store your HuggingFace token (needed for gated and private repos)
echo "$HF_TOKEN" | nemo secrets create hf_token --from-file -

# Create a fileset pointing to a HuggingFace repo
nemo files filesets create hf-dataset \
--description "Dataset from HuggingFace" \
--storage '{
"type": "huggingface",
"repo_id": "nvidia/Nemotron-Personas-Japan",
"repo_type": "dataset",
"token_secret": "hf_token"
}'
import os

# Create a secret to store your HuggingFace token (needed for gated and private repos)
secret = client.secrets.create(name="hf_token", value=os.getenv("HF_TOKEN"))

# Create a fileset pointing to a HuggingFace repo
hf_fileset = client.files.filesets.create(
    name="hf-dataset",
    description="Dataset from HuggingFace",
    storage={
        "type": "huggingface",
        "repo_id": "nvidia/Nemotron-Personas-Japan",
        "repo_type": "dataset",
        "token_secret": secret.name,  # Optional, needed for gated and private repos
    },
)

Connect to an S3 bucket or S3-compatible storage (e.g., MinIO, Ceph):

# Create a fileset backed by S3 storage using the AWS SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
# The "prefix" field is optional - use it to scope the fileset to a folder within the bucket
nemo files filesets create s3-training-data \
--description "Training data stored in S3" \
--storage '{
"type": "s3",
"bucket": "my-ml-bucket",
"prefix": "datasets/training",
"region": "us-east-1",
"use_sdk_auth": true
}'

# Upload data to S3
nemo files upload ./training_data/ s3-training-data

# Download data from S3
nemo files download s3-training-data -o ./downloaded_data/
# Create a fileset backed by S3 storage using the AWS SDK credential chain
# (uses AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars, IRSA, instance profiles, etc.)
s3_fileset = client.files.filesets.create(
    name="s3-training-data",
    description="Training data stored in S3",
    storage={
        "type": "s3",
        "bucket": "my-ml-bucket",
        "prefix": "datasets/training",  # Optional: scope to a folder within the bucket
        "region": "us-east-1",
        "use_sdk_auth": True,  # Use AWS SDK credential chain (default)
    },
)

# Upload data to S3
client.files.upload(
    fileset="s3-training-data",
    local_path="./training_data/",
    remote_path="",
)

# Download data from S3
client.files.download(
    fileset="s3-training-data",
    remote_path="",
    local_path="./downloaded_data/",
)

For S3-compatible storage like MinIO, use explicit credentials and a custom endpoint:

# Create secrets to store your S3 credentials
echo "$S3_ACCESS_KEY" | nemo secrets create s3_access_key --from-file -
echo "$S3_SECRET_KEY" | nemo secrets create s3_secret_key --from-file -

nemo files filesets create minio-fileset \
--description "Data stored in MinIO" \
--storage '{
"type": "s3",
"bucket": "my-bucket",
"endpoint_url": "http://minio.example.com:9000",
"region": "us-east-1",
"use_sdk_auth": false,
"access_key_id_secret": "s3_access_key",
"secret_access_key_secret": "s3_secret_key"
}'
import os

# Create secrets to store your S3 credentials
access_key = client.secrets.create(
    name="s3_access_key", value=os.getenv("S3_ACCESS_KEY")
)
secret_key = client.secrets.create(
    name="s3_secret_key", value=os.getenv("S3_SECRET_KEY")
)

s3_fileset = client.files.filesets.create(
    name="minio-fileset",
    description="Data stored in MinIO",
    storage={
        "type": "s3",
        "bucket": "my-bucket",
        "endpoint_url": "http://minio.example.com:9000",  # Custom S3 endpoint
        "region": "us-east-1",
        "use_sdk_auth": False,  # Use explicit credentials instead of SDK auth
        "access_key_id_secret": access_key.name,
        "secret_access_key_secret": secret_key.name,
    },
)
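
The two S3 configurations above differ only in how credentials are supplied, so the storage dictionary can be built by one helper that keeps use_sdk_auth consistent with the credential fields. This is a minimal sketch: s3_storage is a hypothetical helper, and the rule that both secrets must be given together is an assumption drawn from the MinIO example, not a documented constraint.

```python
from typing import Any, Dict, Optional


def s3_storage(
    bucket: str,
    region: str,
    prefix: Optional[str] = None,
    endpoint_url: Optional[str] = None,
    access_key_id_secret: Optional[str] = None,
    secret_access_key_secret: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an S3 storage config using the field names from the examples above."""
    given = [s for s in (access_key_id_secret, secret_access_key_secret) if s]
    # Assumption: explicit credentials require both secrets, never just one.
    if len(given) == 1:
        raise ValueError("provide both credential secrets or neither")
    explicit = len(given) == 2

    storage: Dict[str, Any] = {
        "type": "s3",
        "bucket": bucket,
        "region": region,
        # Fall back to the SDK credential chain when no secrets are given.
        "use_sdk_auth": not explicit,
    }
    if prefix:
        storage["prefix"] = prefix
    if endpoint_url:
        storage["endpoint_url"] = endpoint_url
    if explicit:
        storage["access_key_id_secret"] = access_key_id_secret
        storage["secret_access_key_secret"] = secret_access_key_secret
    return storage


aws = s3_storage("my-ml-bucket", "us-east-1", prefix="datasets/training")
minio = s3_storage(
    "my-bucket",
    "us-east-1",
    endpoint_url="http://minio.example.com:9000",
    access_key_id_secret="s3_access_key",
    secret_access_key_secret="s3_secret_key",
)
print(aws)
print(minio)
```

Either result could be passed as the storage argument to client.files.filesets.create, as in the examples above.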