Docker: Build and Customize¶
How the CUDA Docker image is built, how to customize it, and how it relates to the CI test image.
For running Safe Synthesizer in a container, see User Guide -- Docker.
Dockerfile Layout¶
containers/Dockerfile.cuda
uses a three-stage multistage build:
flowchart TD
uvImage["ghcr.io/astral-sh/uv:UV_VERSION"]
cudaBuild["nvidia/cuda:CUDA_VERSION-CUDA_IMAGE_TYPE-ubuntu\n(deps stage -- may be runtime or devel)"]
cudaRuntime["nvidia/cuda:CUDA_VERSION-runtime-ubuntu\n(runtime stage -- always runtime)"]
subgraph stages [Build Stages]
deps["deps\nInstalls Python 3.11 via uv\nuv sync cu128+engine"]
runtime["runtime\nCopies venv + Python\nNon-root appuser\ntini + entrypoint.sh"]
dev["dev\nExtends runtime\nAdds git, make, pytest\nRoot user"]
end
uvImage -->|"COPY /uv"| deps
cudaBuild --> deps
deps -->|"COPY venv + toolchain"| runtime
cudaRuntime --> runtime
runtime --> dev
- deps: installs uv, Python, and all cu128+engine dependencies. Uses
--mount=type=cacheto avoid re-downloading ~10 GB of PyTorch/CUDA wheels. - runtime: copies the venv and uv-managed Python into a fresh CUDA runtime
base. Runs as non-root
appuser(uid 1000). GPU access is declared viaNVIDIA_VISIBLE_DEVICES=allandNVIDIA_DRIVER_CAPABILITIES=compute,utilityenvironment variables baked into the image. Uses a wrapper entrypoint (containers/entrypoint.sh) that detects common misconfigurations before delegating tosafe-synthesizer. - dev: extends runtime with git, make, uv, and the full dev dependency group (pytest, ruff, etc.). Runs as root for flexibility. Used for interactive development and running tests inside the container.
Entrypoint Script¶
The runtime stage uses containers/entrypoint.sh instead of a bare
ENTRYPOINT ["safe-synthesizer"]. The script checks for common mistakes
and prints hints to stderr before calling exec safe-synthesizer "$@":
- Empty
/workspace-- user forgot to mount data HF_HOMEnot set or pointing to a nonexistent directory -- models will download to a temporary location and be lost on exitHF_TOKENmissing and no cached token file -- gated models will failnvidia-sminot found -- user may have forgotten--gpus all/dev/shmbelow 256 MB -- training with multi-worker data loading will crash with "Bus error"; hints at--shm-size=1g
These checks run only on stderr and do not interfere with normal CLI
output. For non-GPU commands (--help, config validate), the GPU check
is skipped.
OCI Labels¶
The runtime image includes OCI image metadata
visible via docker inspect:
Labels include title, description, vendor, licenses, source,
documentation, base.name, and usage (a full example docker run
command).
Build Arguments¶
| ARG | Default | Description |
|---|---|---|
CUDA_VERSION |
12.8.1 |
CUDA toolkit version in the base image tag |
UBUNTU_VERSION |
22.04 |
Ubuntu version in the base image tag |
CUDA_IMAGE_TYPE |
runtime |
Base image variant for the deps stage. Change to devel if a dependency requires CUDA headers for compilation |
PYTHON_VERSION |
3.11.13 |
Python version installed via uv python install |
UV_VERSION |
0.9.14 |
uv version (pinned to the lower bound of pyproject.toml required-version) |
TARGETARCH |
(set by BuildKit) | Target architecture (amd64 or arm64). Automatically populated by docker buildx build --platform |
CUDA_ARCH_FLAGS |
80;86;90;90a |
CUDA SM capabilities for nvcc. Override for arm64: 90;90a;120;120a |
Override at build time:
docker build -f containers/Dockerfile.cuda \
--build-arg PYTHON_VERSION=3.12.10 \
--build-arg CUDA_VERSION=12.6.3 \
--target runtime -t nss-gpu:custom .
Key Build Details¶
uv Environment Variables¶
The deps stage sets several uv environment variables for reproducible builds. See the uv Docker guide for full documentation.
| Variable | Value | Why |
|---|---|---|
UV_PROJECT_ENVIRONMENT |
/opt/venv |
Installs into a fixed venv path |
UV_PYTHON_INSTALL_DIR |
/opt/python |
Stable path for cross-stage COPY |
UV_PYTHON_CACHE_DIR |
/root/.cache/uv/python |
Lets Python downloads benefit from the uv cache mount |
UV_LINK_MODE |
copy |
Hardlinks into cache mounts vanish after unmount; copy is safe |
UV_COMPILE_BYTECODE |
1 |
Precompile .pyc for faster container startup |
UV_NO_INSTALLER_METADATA |
1 |
Deterministic layers (no installer/direct_url.json variance) |
UV_FROZEN |
true |
Equivalent to --frozen on every uv command; prevents accidental re-locking |
NVIDIA Runtime Environment¶
The runtime stage sets two environment variables that declare GPU requirements to the NVIDIA Container Toolkit:
| Variable | Value | Why |
|---|---|---|
NVIDIA_VISIBLE_DEVICES |
all |
Tells the toolkit this image needs GPU access. Equivalent to --gpus all intent; the user still passes --gpus to inject devices |
NVIDIA_DRIVER_CAPABILITIES |
compute,utility |
Requests CUDA compute libraries and nvidia-smi. Matches the NMP convention (nmp-gpu-base) |
This follows the same pattern used by NeMo Customizer and nmp-gpu-base.
No video group membership is needed -- the toolkit handles device
permissions when these variables are set.
Python Toolchain Portability¶
UV_PYTHON_INSTALL_DIR=/opt/python puts the uv-managed Python at a
stable, explicit path instead of the default ~/.local/share/uv/python/.
The venv at /opt/venv symlinks into that directory. Both paths are
copied to the runtime stage:
Intermediate Layers¶
The deps stage uses two-pass installation following uv best practices:
uv sync --no-install-project-- installs all dependencies from the lockfile without the project itself. This layer is invalidated only whenpyproject.tomloruv.lockchanges.uv sync --no-editable-- installs the project non-editably into the existing venv. Non-editable means the venv is self-contained and does not need source code at runtime.
APT Cache Mounts¶
APT layers use --mount=type=cache,target=/var/cache/apt,sharing=locked
instead of the traditional rm -rf /var/lib/apt/lists/* pattern. This
caches downloaded .deb files across rebuilds, speeding up layer
re-creation when the apt install list changes.
Why runtime, Not devel¶
All current locked dependencies (torch, vllm, xformers, flashinfer,
etc.) ship pre-built wheels. The CUDA devel image (~3 GB larger) is not
needed. If a future dependency requires source compilation with CUDA headers,
change CUDA_IMAGE_TYPE to devel.
Makefile Targets¶
| Target | Description |
|---|---|
container-build-gpu |
Build the runtime stage |
container-build-gpu-dev |
Build the dev stage |
container-build-gpu-multiarch |
Build multi-arch manifest (requires CONTAINER_GPU_REGISTRY) |
container-run-gpu |
Run a command in the runtime container |
container-run-gpu-dev |
Run a command in the dev container |
For interactive shells, use docker run -it --entrypoint /bin/bash directly --
this gives full control over mounts and flags. See the
user guide for examples.
Overridable variables:
| Variable | Default | Description |
|---|---|---|
CONTAINER_GPU_IMAGE |
nss-gpu:latest |
Runtime image tag |
CONTAINER_GPU_IMAGE_DEV |
nss-gpu-dev:latest |
Dev image tag |
CONTAINER_GPU_PLATFORM |
linux/amd64 |
Target platform (override for arm64) |
CONTAINER_GPU_REGISTRY |
(empty) | Registry for multi-arch manifest pushes |
CONTAINER_GPU_FLAG |
--gpus all |
GPU access flag |
CONTAINER_HF_CACHE |
$(HOME)/.cache/huggingface |
Host HF cache dir |
CONTAINER_EXTRA_MOUNTS |
(empty) | Additional -v flags for data outside the repo tree |
Testing the Image¶
Smoke test (no GPU required):
Run unit tests inside the dev container:
Interactive dev shell (use docker run directly for full mount control):
docker run -it --gpus all --shm-size=1g \
-v $(pwd):/workspace \
-v ~/.cache/huggingface:/workspace/.hf_cache \
-e HF_HOME=/workspace/.hf_cache \
--entrypoint /bin/bash \
nss-gpu-dev:latest
Image Size¶
The runtime image is approximately 15--25 GB, dominated by PyTorch, vllm, and CUDA libraries. This is expected for a full ML inference stack.
To reduce size:
- The runtime stage excludes dev dependencies (
--no-group dev) - The base is
cuda-runtime(notcuda-devel) - Build caches (
--mount=type=cache) stay out of the final image layers
Relationship to Dockerfile.test_ci¶
Dockerfile.test_ci provides a CPU-only test image for CI and local testing.
| Aspect | Dockerfile.cuda |
Dockerfile.test_ci |
|---|---|---|
| Base | nvidia/cuda:12.8.1-runtime-ubuntu22.04 |
python:3.11-slim |
| Extras | cu128 + engine |
cpu + engine |
| GPU | Required | Not needed |
| Stages | deps / runtime / dev |
Single stage |
| Use case | Training, generation, evaluation | CPU-only unit tests and CI checks |
| Build target | make container-build-gpu |
make container-build-test |
Both follow the conventions in STYLE_GUIDE.md -- Dockerfiles.
Multi-Architecture Support¶
The CUDA Dockerfile supports linux/amd64 and linux/arm64
(Grace/Blackwell). The nvidia/cuda base images and uv binaries are
already multi-platform, so the same Dockerfile works for both architectures
without conditional logic.
How it works¶
Docker BuildKit sets the TARGETARCH build argument automatically when
you pass --platform:
docker buildx build --platform linux/arm64 \
-f containers/Dockerfile.cuda --target runtime -t nss-gpu:arm64 .
No code paths branch on TARGETARCH today -- it exists as documentation
and forward-compatibility for when CUDA_IMAGE_TYPE=devel builds need to
pass architecture-specific flags to nvcc.
Building for arm64 (Blackwell)¶
Single-platform arm64 build via Make:
This uses docker build --platform linux/arm64, which works with local
--load and QEMU emulation (or natively on an arm64 host).
Multi-platform manifest¶
A multi-platform manifest contains images for both architectures in a
single tag. Clients pull the correct variant automatically. Because
--load only supports one platform, multi-arch builds must be pushed
directly to a registry:
This runs:
docker buildx build \
--platform linux/amd64,linux/arm64 \
--tag ghcr.io/nvidia-nemo/nss-gpu:latest \
--target runtime --push \
-f containers/Dockerfile.cuda .
docker buildx requirements¶
- Docker 19.03+ with BuildKit enabled (
DOCKER_BUILDKIT=1or Docker 23.0+ where it is the default). - A buildx builder instance with multi-platform support. Create one with:
- For cross-architecture builds on amd64 hosts, QEMU user-static must be
registered:
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes.
CUDA compute capabilities (CUDA_ARCH_FLAGS)¶
When CUDA_IMAGE_TYPE=devel is used and CUDA kernels must be compiled,
pass the appropriate SM values via --build-arg CUDA_ARCH_FLAGS:
| Architecture | CUDA_ARCH_FLAGS |
GPUs |
|---|---|---|
| amd64 | 80;86;90;90a |
A100, A10/3090, H100 |
| arm64 | 90;90a;120;120a |
H100 Grace, Blackwell |
The Dockerfile defaults to the amd64 set. Override for arm64:
docker build -f containers/Dockerfile.cuda \
--build-arg CUDA_IMAGE_TYPE=devel \
--build-arg CUDA_ARCH_FLAGS="90;90a;120;120a" \
--platform linux/arm64 \
--target runtime -t nss-gpu:arm64-devel .
Makefile targets¶
| Target | Description |
|---|---|
container-build-gpu |
Single-platform build (default linux/amd64, override with CONTAINER_GPU_PLATFORM) |
container-build-gpu-multiarch |
Multi-platform manifest build (requires CONTAINER_GPU_REGISTRY) |