Running in Docker

Run the full Safe Synthesizer pipeline in a container with GPU access. No local Python install required -- the container ships everything needed for training, generation, and evaluation.


Prerequisites

  • Docker 20.10+ (BuildKit enabled by default in 23.0+)
  • NVIDIA Container Toolkit installed and configured
  • NVIDIA driver compatible with CUDA 12.8
  • NVIDIA GPU (A100 or better recommended)

Verify GPU access works:

docker run --rm --gpus all nvidia/cuda:12.8.1-runtime-ubuntu22.04 nvidia-smi

Quick Start

The container wraps the safe-synthesizer CLI. Mount your data and Hugging Face cache, then pass CLI arguments after the image name:

docker run --gpus all --shm-size=1g \
  -v /path/to/your/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest \
  run --config /workspace/data/config.yaml --data-source /workspace/data/input.csv

The entrypoint prints helpful warnings if it detects common mistakes (empty /workspace, missing HF_HOME, no GPU access).

More examples:

# Train only
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run train --data-source /workspace/data/input.csv

# Generate from a trained adapter
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run generate --data-source /workspace/data/input.csv --auto-discover-adapter

# Validate a config file (no GPU needed)
docker run \
  -v /path/to/data:/workspace/data \
  nss-gpu:latest config validate --config /workspace/data/config.yaml

Mounting Your Data

The container starts with an empty /workspace. You bring your own data by bind-mounting host directories with -v:

docker run --gpus all --shm-size=1g \
  -v /home/user/project:/workspace/data \
  ...
  nss-gpu:latest run --data-source /workspace/data/input.csv

Docker requires absolute paths for bind mounts. Relative paths like -v data:/workspace/data are silently interpreted as named volumes -- Docker won't error, but you'll get an empty mount instead of your host directory. Use $(pwd) to expand relative paths:

-v $(pwd)/my_data:/workspace/data    # correct
-v my_data:/workspace/data           # wrong -- Docker treats this as a named volume
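
In scripts, one way to guard against the named-volume trap is to resolve the path with realpath before handing it to -v. A minimal sketch (the my_data directory name is just an example):

```shell
# Create a stand-in data directory so the sketch is self-contained.
mkdir -p my_data

# realpath exits nonzero if the path does not exist, so a typo fails
# loudly instead of silently becoming an empty named volume.
DATA_DIR=$(realpath my_data) || exit 1

# The resulting flag always carries an absolute host path.
echo "-v $DATA_DIR:/workspace/data"
```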

You can mount multiple directories at different paths:

docker run --gpus all \
  -v /data/inputs:/workspace/inputs \
  -v /data/configs:/workspace/configs \
  -v /data/output:/workspace/output \
  -e NSS_ARTIFACTS_PATH=/workspace/output \
  ...
  nss-gpu:latest run --config /workspace/configs/my_config.yaml --data-source /workspace/inputs/data.csv

Artifacts are written to /workspace/safe-synthesizer-artifacts/ by default (override with NSS_ARTIFACTS_PATH). Make sure to mount a host directory there if you want to retrieve results after the container exits.


Secrets and API Keys

Pass secrets as environment variables at runtime -- never bake them into the image. The most common ones:

docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  -e HF_TOKEN="hf_..." \
  nss-gpu:latest run --data-source /workspace/data/input.csv

  • HF_TOKEN -- required for gated models. Hugging Face token for downloading gated models (Llama, Mistral, etc.). Get one at hf.co/settings/tokens.
  • NSS_INFERENCE_KEY -- required for PII classification. API key for NSS_INFERENCE_ENDPOINT. Set when using the CLI/SDK for column classification.
  • NSS_INFERENCE_ENDPOINT -- required for PII classification. NIM/OpenAI-compatible endpoint URL (default: https://integrate.api.nvidia.com/v1). Override for a custom endpoint.
  • WANDB_API_KEY -- required for experiment tracking. WandB API key. Only needed when --wandb-mode online is used.

If HF_TOKEN is already stored in your HF cache (~/.cache/huggingface/token), mounting the cache directory is sufficient -- the Hub library reads the token file automatically.
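
A quick host-side check for whether a cached token will be picked up. This is a sketch that follows the Hub library's convention of reading $HF_HOME/token, falling back to ~/.cache/huggingface/token:

```shell
# Locate the token file the huggingface_hub client would read.
token_file="${HF_HOME:-$HOME/.cache/huggingface}/token"

if [ -f "$token_file" ]; then
  echo "Cached token found at $token_file -- mounting the cache is enough"
else
  echo "No cached token -- pass -e HF_TOKEN=... for gated models"
fi
```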

See Environment Variables for the full reference.


Hugging Face Model Cache

Safe Synthesizer downloads models from Hugging Face Hub on first use. Mount a host directory to persist downloads across container runs:

docker run --gpus all \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  ...
  • Host path: ~/.cache/huggingface
  • Container path: /workspace/.hf_cache
  • Env var: HF_HOME=/workspace/.hf_cache
  • Purpose: model weights, tokenizers, configs

Without this mount, models are downloaded into the container's ephemeral filesystem and lost when it exits.

For shared environments (team servers, CI), point at a shared cache:

-v /shared/hf_cache:/workspace/.hf_cache -e HF_HOME=/workspace/.hf_cache

GPU Access

The image declares NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=compute,utility, so the NVIDIA Container Toolkit knows it needs GPU access. You still need --gpus to tell Docker to inject the GPU devices:

# All GPUs
docker run --gpus all ...

# Specific GPUs
docker run --gpus '"device=0,1"' ...

To restrict which GPUs are visible inside the container, override the environment variable:

docker run --gpus all -e NVIDIA_VISIBLE_DEVICES=0,1 ...

Shared Memory (--shm-size)

PyTorch uses /dev/shm for inter-process communication during training (multi-worker data loading). Docker defaults to 64 MB, which causes "Bus error" crashes. Always pass --shm-size=1g (or --ipc=host) when running training workloads:

docker run --gpus all --shm-size=1g ...

The entrypoint script warns if /dev/shm is below 256 MB. Generation-only runs are typically fine without it.
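
The same check is easy to reproduce by hand inside the container. A sketch of the threshold logic (not the entrypoint's actual code):

```shell
# Size of /dev/shm in 1K blocks, as reported by POSIX df.
shm_kb=$(df -Pk /dev/shm | awk 'NR==2 {print $2}')

# 256 MB = 262144 KB -- the threshold the entrypoint warns at.
if [ "$shm_kb" -lt 262144 ]; then
  echo "WARNING: /dev/shm is only $((shm_kb / 1024)) MB; rerun with --shm-size=1g"
else
  echo "/dev/shm OK: $((shm_kb / 1024)) MB"
fi
```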


File Permissions

The container runs as appuser (uid 1000). Bind mounts preserve host ownership, so if your host uid is not 1000, the container user cannot write to the mounted directory, and artifact and output writes fail with "Permission denied".

Fix by matching the container user to your host uid:

docker run --gpus all --user "$(id -u):$(id -g)" \
  -v /path/to/data:/workspace/data \
  ...

This overrides appuser with your host identity. The --user flag also works with the dev image and interactive shells.
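
A quick way to see whether the flag is needed on your machine (a sketch; 1000 is the appuser uid baked into the image):

```shell
# Compare the host uid against the container's default user (uid 1000).
if [ "$(id -u)" -eq 1000 ]; then
  echo "uid matches appuser -- no --user flag needed"
else
  echo "uid $(id -u) != 1000 -- pass --user \"$(id -u):$(id -g)\""
fi
```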


Offline and Air-Gapped Environments

Pre-cache models by running the pipeline once with internet access, then reuse the populated cache in the target environment:

# Step 1: populate cache (internet required)
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run --config /workspace/data/config.yaml --data-source /workspace/data/input.csv

# Step 2: use in offline environment
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v /shared/hf_cache:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  -e HF_HUB_OFFLINE=1 \
  nss-gpu:latest run --config /workspace/data/config.yaml --data-source /workspace/data/input.csv
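
Before relying on HF_HUB_OFFLINE=1, it's worth verifying that the cache directory actually holds the downloaded models. A sketch, assuming the Hub's standard layout of a hub/ subdirectory under HF_HOME:

```shell
# Sanity-check that the cache holds something before going offline.
hub_dir="${HF_HOME:-$HOME/.cache/huggingface}/hub"

if [ -d "$hub_dir" ] && [ -n "$(ls -A "$hub_dir" 2>/dev/null)" ]; then
  echo "Cache populated:"
  ls "$hub_dir" | head -3
else
  echo "Cache empty -- run the pipeline once with internet access first" >&2
fi
```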

See Environment Variables -- Hugging Face Cache for details on HF_HOME, HF_HUB_OFFLINE, and VLLM_CACHE_ROOT.


Building from Source

If a pre-built image is not available to pull, build locally:

make container-build-gpu           # runtime image
make container-build-gpu-dev       # dev image with test tooling

Override build arguments for different CUDA or Python versions:

docker build -f containers/Dockerfile.cuda \
  --build-arg CUDA_VERSION=12.6.3 \
  --build-arg PYTHON_VERSION=3.12.10 \
  --target runtime -t nss-gpu:custom .

See Developer Guide -- Docker for build stages, ARG reference, and customization details.


Interactive Shell

To explore the container or debug issues, override the entrypoint to get a bash shell. Mount your data the same way as a normal run:

docker run -it --gpus all --shm-size=1g \
  -v $(pwd)/my_data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  --entrypoint /bin/bash \
  nss-gpu:latest

Inside the container you can run safe-synthesizer commands directly:

appuser@container:/workspace$ safe-synthesizer run --data-source /workspace/data/input.csv
appuser@container:/workspace$ safe-synthesizer config validate --config /workspace/data/config.yaml

Makefile Shortcuts

For developers with the repo checked out, the Makefile provides convenience targets that handle GPU flags, HF cache mounts, and workspace bind mounts:

  • make container-build-gpu -- build the runtime image
  • make container-run-gpu CMD="run --config ..." -- run a pipeline command
  • make container-build-gpu-dev -- build the dev image
  • make container-run-gpu-dev CMD="make test" -- run a command in the dev container

Override variables as needed:

make container-run-gpu CONTAINER_HF_CACHE=/shared/hf_cache CMD="run --data-source /workspace/data.csv"

Mount data from outside the repo tree with CONTAINER_EXTRA_MOUNTS:

make container-run-gpu \
  CONTAINER_EXTRA_MOUNTS="-v /data/sensitive:/workspace/data" \
  CMD="run --data-source /workspace/data/customers.csv"