Running in Docker

Run the full Safe Synthesizer pipeline in a container with GPU access. No local Python install required -- the container ships everything needed for training, generation, and evaluation.


Prerequisites

  • Docker 20.10+ (BuildKit enabled by default in 23.0+)
  • NVIDIA Container Toolkit installed and configured
  • NVIDIA driver compatible with CUDA 12.8
  • NVIDIA GPU (A100 or better recommended)

Verify GPU access works:

docker run --rm --gpus all nvidia/cuda:12.8.1-runtime-ubuntu22.04 nvidia-smi

Quick Start

The container wraps the safe-synthesizer CLI. Mount your data and Hugging Face cache, then pass CLI arguments after the image name:

docker run --gpus all --shm-size=1g \
  -v /path/to/your/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest \
  run --config /workspace/data/config.yaml --data-source /workspace/data/input.csv

The entrypoint prints helpful warnings if it detects common mistakes (empty /workspace, missing HF_HOME, no GPU access).

More examples:

# Train only
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run train --data-source /workspace/data/input.csv

# Generate from a trained adapter
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run generate --data-source /workspace/data/input.csv --auto-discover-adapter

# Validate a config file (no GPU needed)
docker run \
  -v /path/to/data:/workspace/data \
  nss-gpu:latest config validate --config /workspace/data/config.yaml

Mounting Your Data

The container starts with an empty /workspace. You bring your own data by bind-mounting host directories with -v:

docker run --gpus all --shm-size=1g \
  -v /home/user/project:/workspace/data \
  ...
  nss-gpu:latest run --data-source /workspace/data/input.csv

Docker requires absolute paths for bind mounts. Relative paths like -v data:/workspace/data are silently interpreted as named volumes -- Docker won't error, but you'll get an empty mount instead of your host directory. Use $(pwd) to expand relative paths:

-v $(pwd)/my_data:/workspace/data    # correct
-v my_data:/workspace/data           # wrong -- Docker treats this as a named volume
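
In scripts, one way to guard against the named-volume trap is to resolve the path with realpath before handing it to -v. A minimal sketch (the my_data directory name is just an example):

```shell
# Create a stand-in data directory so the sketch is self-contained.
mkdir -p my_data

# realpath exits nonzero if the path does not exist, so a typo fails
# loudly instead of silently becoming an empty named volume.
DATA_DIR=$(realpath my_data) || exit 1

# The resulting flag always carries an absolute host path.
echo "-v $DATA_DIR:/workspace/data"
```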

You can mount multiple directories at different paths:

docker run --gpus all \
  -v /data/inputs:/workspace/inputs \
  -v /data/configs:/workspace/configs \
  -v /data/output:/workspace/output \
  -e NSS_ARTIFACTS_PATH=/workspace/output \
  ...
  nss-gpu:latest run --config /workspace/configs/my_config.yaml --data-source /workspace/inputs/data.csv

Artifacts are written to /workspace/safe-synthesizer-artifacts/ by default (override with NSS_ARTIFACTS_PATH). Make sure to mount a host directory there if you want to retrieve results after the container exits.


Secrets and API Keys

Pass secrets as environment variables at runtime -- never bake them into the image. The most common ones:

docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  -e HF_TOKEN="hf_..." \
  nss-gpu:latest run --data-source /workspace/data/input.csv

  • HF_TOKEN -- required for gated models. Hugging Face token for downloading gated models (Llama, Mistral, etc.). Get one at hf.co/settings/tokens.
  • NSS_INFERENCE_KEY -- required for PII classification. API key for NSS_INFERENCE_ENDPOINT. Set when using the CLI/SDK for column classification.
  • NSS_INFERENCE_ENDPOINT -- required for PII classification. NIM/OpenAI-compatible endpoint URL (default: https://integrate.api.nvidia.com/v1). Override for a custom endpoint.
  • WANDB_API_KEY -- required for experiment tracking. WandB API key. Only needed when --wandb-mode online is used.

If HF_TOKEN is already stored in your HF cache (~/.cache/huggingface/token), mounting the cache directory is sufficient -- the Hub library reads the token file automatically.
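
A quick host-side check for whether a cached token will be picked up. This is a sketch that follows the Hub library's convention of reading $HF_HOME/token, falling back to ~/.cache/huggingface/token:

```shell
# Locate the token file the huggingface_hub client would read.
token_file="${HF_HOME:-$HOME/.cache/huggingface}/token"

if [ -f "$token_file" ]; then
  echo "Cached token found at $token_file -- mounting the cache is enough"
else
  echo "No cached token -- pass -e HF_TOKEN=... for gated models"
fi
```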

See Environment Variables for the full reference.


Hugging Face Model Cache

Safe Synthesizer downloads models from Hugging Face Hub on first use. Mount a host directory to persist downloads across container runs:

docker run --gpus all \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  ...
  • Host path: ~/.cache/huggingface
  • Container path: /workspace/.hf_cache
  • Env var: HF_HOME=/workspace/.hf_cache
  • Purpose: model weights, tokenizers, configs

Without this mount, models are downloaded into the container's ephemeral filesystem and lost when it exits.

For shared environments (team servers, CI), point at a shared cache:

-v /shared/hf_cache:/workspace/.hf_cache -e HF_HOME=/workspace/.hf_cache

GPU Access

The image declares NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=compute,utility, so the NVIDIA Container Toolkit knows it needs GPU access. You still need --gpus to tell Docker to inject the GPU devices:

# All GPUs
docker run --gpus all ...

# Specific GPUs
docker run --gpus '"device=0,1"' ...

To restrict which GPUs are visible inside the container, override the environment variable:

docker run --gpus all -e NVIDIA_VISIBLE_DEVICES=0,1 ...

Shared Memory (--shm-size)

PyTorch uses /dev/shm for inter-process communication during training (multi-worker data loading). Docker defaults to 64 MB, which causes "Bus error" crashes. Always pass --shm-size=1g (or --ipc=host) when running training workloads:

docker run --gpus all --shm-size=1g ...

The entrypoint script warns if /dev/shm is below 256 MB. Generation-only runs are typically fine without it.
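
The same check is easy to reproduce by hand inside the container. A sketch of the threshold logic (not the entrypoint's actual code):

```shell
# Size of /dev/shm in 1K blocks, as reported by POSIX df.
shm_kb=$(df -Pk /dev/shm | awk 'NR==2 {print $2}')

# 256 MB = 262144 KB -- the threshold the entrypoint warns at.
if [ "$shm_kb" -lt 262144 ]; then
  echo "WARNING: /dev/shm is only $((shm_kb / 1024)) MB; rerun with --shm-size=1g"
else
  echo "/dev/shm OK: $((shm_kb / 1024)) MB"
fi
```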


File Permissions

The container runs as appuser (uid 1000). Bind mounts preserve host ownership, so if your host uid is not 1000, the container user cannot write to the mounted directory, and artifact and output writes fail with "Permission denied".

Fix by matching the container user to your host uid:

docker run --gpus all --user "$(id -u):$(id -g)" \
  -v /path/to/data:/workspace/data \
  ...

This overrides appuser with your host identity. The --user flag also works with the dev image and interactive shells.
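
A quick way to see whether the flag is needed on your machine (a sketch; 1000 is the appuser uid baked into the image):

```shell
# Compare the host uid against the container's default user (uid 1000).
if [ "$(id -u)" -eq 1000 ]; then
  echo "uid matches appuser -- no --user flag needed"
else
  echo "uid $(id -u) != 1000 -- pass --user \"$(id -u):$(id -g)\""
fi
```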


Offline and Air-Gapped Environments

Pre-cache models by running the pipeline once with internet access, then reuse the populated cache in the target environment:

# Step 1: populate cache (internet required)
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run --config /workspace/data/config.yaml --data-source /workspace/data/input.csv

# Step 2: use in offline environment
docker run --gpus all --shm-size=1g \
  -v /path/to/data:/workspace/data \
  -v /shared/hf_cache:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  -e HF_HUB_OFFLINE=1 \
  nss-gpu:latest run --config /workspace/data/config.yaml --data-source /workspace/data/input.csv
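
Before relying on HF_HUB_OFFLINE=1, it's worth verifying that the cache directory actually holds the downloaded models. A sketch, assuming the Hub's standard layout of a hub/ subdirectory under HF_HOME:

```shell
# Sanity-check that the cache holds something before going offline.
hub_dir="${HF_HOME:-$HOME/.cache/huggingface}/hub"

if [ -d "$hub_dir" ] && [ -n "$(ls -A "$hub_dir" 2>/dev/null)" ]; then
  echo "Cache populated:"
  ls "$hub_dir" | head -3
else
  echo "Cache empty -- run the pipeline once with internet access first" >&2
fi
```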

See Environment Variables -- Hugging Face Cache for details on HF_HOME, HF_HUB_OFFLINE, and VLLM_CACHE_ROOT.


Building from Source

If a pre-built image is not available to pull, build locally:

make container-build-gpu           # runtime image
make container-build-gpu-dev       # dev image with test tooling

Override build arguments for different CUDA or Python versions:

docker build -f containers/Dockerfile.cuda \
  --build-arg CUDA_VERSION=12.6.3 \
  --build-arg PYTHON_VERSION=3.12.10 \
  --target runtime -t nss-gpu:custom .

See Developer Guide -- Docker for build stages, ARG reference, and customization details.


Interactive Shell

To explore the container or debug issues, override the entrypoint to get a bash shell. Mount your data the same way as a normal run:

docker run -it --gpus all --shm-size=1g \
  -v $(pwd)/my_data:/workspace/data \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  --entrypoint /bin/bash \
  nss-gpu:latest

Inside the container you can run safe-synthesizer commands directly:

appuser@container:/workspace$ safe-synthesizer run --data-source /workspace/data/input.csv
appuser@container:/workspace$ safe-synthesizer config validate --config /workspace/data/config.yaml

Makefile Shortcuts

For developers with the repo checked out, the Makefile provides convenience targets that handle GPU flags, HF cache mounts, and workspace bind mounts:

  • make container-build-gpu -- build the runtime image
  • make container-run-gpu CMD="run --config ..." -- run a pipeline command
  • make container-build-gpu-dev -- build the dev image
  • make container-run-gpu-dev CMD="make test" -- run a command in the dev container

Override variables as needed:

make container-run-gpu CONTAINER_HF_CACHE=/shared/hf_cache CMD="run --data-source /workspace/data.csv"

Mount data from outside the repo tree with CONTAINER_EXTRA_MOUNTS:

make container-run-gpu \
  CONTAINER_EXTRA_MOUNTS="-v /data/sensitive:/workspace/data" \
  CMD="run --data-source /workspace/data/customers.csv"