Run an Anonymizer Job¶
This tutorial walks through the anonymizer.run job: defining a run spec, executing it locally or on the NeMo Platform Jobs worker, and loading the parquet artifacts it produces.
For detection, rewrite, and replacement strategy details, see the open-source library documentation.
Prerequisites¶
- The Anonymizer plugin installed and the nemo anonymizer CLI available. See the Quick Start.
- An inference provider configured (default examples use nvidia-build).
- A fileset named anonymizer-inputs with anonymizer-input.csv uploaded (created in the Quick Start).
What run Does¶
anonymizer.run executes the full Anonymizer pipeline on every record of an input file and writes the output as job artifacts.
There are three run commands:
| Command | Where it runs | Local paths | model_configs required | Artifacts |
|---|---|---|---|---|
| nemo anonymizer run run | Local CLI process via generated is_local path | Allowed | Optional | Written under persistent/results/artifacts locally |
| nemo anonymizer run submit | NeMo Platform Jobs worker | Rejected | Required | Stored in NeMo Platform job artifact storage; pull with download_artifacts() |
| nemo anonymizer run explain | Local schema introspection | n/a | n/a | Prints job key, submit endpoint, and input/spec schemas |
Job artifacts (under the artifacts/ directory):
| File | Description |
|---|---|
| dataset.parquet | User-facing anonymized dataframe (replace/rewrite output). |
| trace.parquet | Internal trace dataframe with detection details. |
| metadata.json | Run metadata (includes the original text column name). |
| failed_records.json | Per-record failures with reasons. Only written when records failed. |
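As a quick sanity check before loading anything, the artifact layout in the table above can be verified with a few lines of standard-library Python. This is a minimal sketch: check_artifacts is a hypothetical helper, not part of the plugin, and it only tests for the file names listed in the table.

```python
from pathlib import Path

# File names come from the artifact table; failed_records.json is treated as
# optional because it is only written when at least one record failed.
REQUIRED_ARTIFACTS = ("dataset.parquet", "trace.parquet", "metadata.json")
OPTIONAL_ARTIFACTS = ("failed_records.json",)


def check_artifacts(artifacts_dir: Path) -> dict[str, list[str]]:
    """Hypothetical helper: report which expected artifact files exist."""
    expected = REQUIRED_ARTIFACTS + OPTIONAL_ARTIFACTS
    present = [name for name in expected if (artifacts_dir / name).is_file()]
    # Only required files count as missing; the optional one may be absent.
    missing = [name for name in REQUIRED_ARTIFACTS if name not in present]
    return {"present": present, "missing": missing}
```

An empty "missing" list means the three always-written files are in place; whether failed_records.json appears under "present" tells you if any records failed.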
Step 1: Build an AnonymizerRequest¶
AnonymizerRequest contains the execution fields shared by preview and run (config, data, model_configs, and selected_models). A run processes the full input file, so it does not include num_records:
import os
from anonymizer.config.anonymizer_config import AnonymizerConfig
from anonymizer.config.replace_strategies import Redact
from data_designer.config import ModelConfig
from nemo_anonymizer_plugin.app.input import AnonymizerInputSpec
from nemo_anonymizer_plugin.app.task_config import AnonymizerRequest
WORKSPACE = os.environ.get("NMP_WORKSPACE", "default")
MODEL_PROVIDER = os.environ.get("NMP_ANON_PROVIDER", "nvidia-build")
config = AnonymizerConfig(
    replace=Redact(format_template="[REDACTED_{label}]"),
)
model_configs = [
    ModelConfig(alias="gliner-pii-detector", provider=MODEL_PROVIDER, model="nvidia/gliner-pii"),
    ModelConfig(alias="gpt-oss-120b", provider=MODEL_PROVIDER, model="openai/gpt-oss-120b"),
    ModelConfig(alias="nemotron-30b-thinking", provider=MODEL_PROVIDER, model="nvidia/nemotron-3-nano-30b-a3b"),
]
request = AnonymizerRequest(
    config=config,
    data=AnonymizerInputSpec(
        source=f"fileset://{WORKSPACE}/anonymizer-inputs#anonymizer-input.csv",
        text_column="biography",
        id_column="id",
    ),
    model_configs=model_configs,
)
Step 2: Write the Spec to YAML¶
The CLI run commands read a YAML spec file. Serialize the AnonymizerRequest directly:
import yaml
from pathlib import Path
spec_path = Path("/tmp/anonymizer-run.yaml")
spec_path.write_text(yaml.safe_dump(request.model_dump(mode="json", exclude_none=True)))
Step 3: Run the Job¶
Choose one execution path. Option A runs in the local CLI process. Option B submits the same request to the NeMo Platform Jobs worker.
Option A: Run Locally¶
The local job context calls the Anonymizer library's Anonymizer.run(...) in-process, then writes artifacts through the generated local job results manager.
Expected output:
run run does not echo the artifact path on stdout. The local job results manager logs the path to stderr in the form:
Use that path in the next step.
Option B: Submit to the Jobs Worker¶
To execute the same spec on the NeMo Platform Jobs worker instead of in the CLI process, use run submit:
nemo anonymizer run submit \
  --spec-file /tmp/anonymizer-run.yaml \
  --workspace "${NMP_WORKSPACE:-default}" \
  --base-url "${NMP_BASE_URL:-http://localhost:8080}"
The command prints the assigned job name. You need that name to poll status and download artifacts in Step 4.
The SDK equivalent is sdk.anonymizer.run(request). It posts the request to the plugin's /jobs/run endpoint and returns an AnonymizerJobResource:
import os
from nemo_platform import NeMoPlatform
sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace=WORKSPACE,
)
job = sdk.anonymizer.run(request)
Compared to run run, the submit path:
- Rejects local file paths in data.source: use a fileset reference (<fileset>#<path>) or an http(s) URL.
- Requires explicit model_configs referencing Inference Gateway providers, because the job runs outside the CLI process and cannot inherit Data Designer's locally-defined providers.
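The two submit-path rules above can be approximated by a client-side check before calling run submit. The sketch below is hypothetical and is not the plugin's validator: looks_remote_safe and submit_spec_problems are illustrative names, and the heuristic for recognizing a fileset reference (a fileset:// prefix, or a non-path string containing #) is an assumption based on the examples in this tutorial. The plugin's own validation remains authoritative.

```python
from urllib.parse import urlparse


def looks_remote_safe(source: str) -> bool:
    """Heuristic (assumption): http(s) URLs and fileset references pass."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https", "fileset"):
        return True
    # Bare "<fileset>#<path>" form: no scheme, has a fragment, and does not
    # start like a filesystem path.
    return scheme == "" and "#" in source and not source.startswith(("/", ".", "~"))


def submit_spec_problems(spec: dict) -> list[str]:
    """Hypothetical pre-check: list reasons `run submit` would likely reject a spec."""
    problems = []
    source = str(spec.get("data", {}).get("source", ""))
    if not looks_remote_safe(source):
        problems.append(f"data.source looks like a local path: {source!r}")
    if not spec.get("model_configs"):
        problems.append("model_configs is required for remote execution")
    return problems
```

Running this against the dictionary produced by request.model_dump(mode="json", exclude_none=True) in Step 2 gives early feedback, but a clean result here does not guarantee the job will be accepted.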
Step 4: Get Results¶
Option A Results: Local Run¶
For run run, the results are already on the local filesystem. Use the artifact directory logged to stderr:
Then load the parquet artifacts from that directory:
import json
from pathlib import Path
import pandas as pd
artifacts_dir = Path("/path/to/persistent/results/artifacts") # from the stderr log
metadata = json.loads((artifacts_dir / "metadata.json").read_text())
dataset = pd.read_parquet(artifacts_dir / "dataset.parquet", dtype_backend="pyarrow")
trace = pd.read_parquet(artifacts_dir / "trace.parquet", dtype_backend="pyarrow")
failed_path = artifacts_dir / "failed_records.json"
failed_records = json.loads(failed_path.read_text()) if failed_path.exists() else []
print(dataset.head())
print(f"records={len(dataset)} failures={len(failed_records)}")
The trace dataset (and the dataset itself for annotate / substitute strategies) contains pyarrow-backed struct<entities: list<...>> columns. If you need plain Python dict/list values for JSON output, read the file with pyarrow.parquet and convert the rows with to_pylist():
import pyarrow.parquet as pq
table = pq.read_table(artifacts_dir / "dataset.parquet")
records = table.slice(0, 5).to_pylist()
Option B Results: Remote Run¶
For run submit, track the platform job first. The job is ready for artifact download when its status is completed:
# Replace with the job name printed by `run submit`.
nemo jobs get-status <job-name> --workspace "${NMP_WORKSPACE:-default}"
nemo jobs get-logs <job-name> --workspace "${NMP_WORKSPACE:-default}"
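If you script the status check rather than running it by hand, a generic polling loop with a timeout is one way to wait for completion. This is a sketch under assumptions: wait_for_job and its get_status argument are hypothetical (any callable that returns the current status as a string), "completed" is the only status named in this tutorial, and the failure states listed here are placeholders to adjust for your platform.

```python
import time
from typing import Callable

# "completed" comes from the text above. The failure states are assumed
# placeholders; replace them with the statuses your platform reports.
SUCCESS_STATES = {"completed"}
FAILURE_STATES = {"error", "failed", "cancelled"}


def wait_for_job(
    get_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
) -> str:
    """Poll get_status() until a terminal state is seen; return that state."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status().lower()
        if status in SUCCESS_STATES or status in FAILURE_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)
```

Only download artifacts when the returned status is "completed"; for any other terminal state, inspect the job logs first.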
To download from the CLI, fetch the artifacts result and extract it:
nemo jobs results download artifacts \
  --job <job-name> \
  --workspace "${NMP_WORKSPACE:-default}" \
  --output-file /tmp/anonymizer-artifacts.tar.gz
mkdir -p /tmp/anonymizer-artifacts
tar -xzf /tmp/anonymizer-artifacts.tar.gz -C /tmp/anonymizer-artifacts
ls /tmp/anonymizer-artifacts/artifacts
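The mkdir and tar steps can also be done from Python with the standard-library tarfile module, which may be convenient inside a notebook. This is a minimal sketch: extract_artifacts is a hypothetical helper, and it assumes the archive contains a top-level artifacts/ directory, as the ls command above suggests.

```python
import tarfile
from pathlib import Path


def extract_artifacts(archive: Path, dest: Path) -> Path:
    """Hypothetical helper: unpack a .tar.gz and return its artifacts/ directory."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        try:
            # Newer Python versions accept an extraction filter that blocks
            # unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        except TypeError:
            # Older Python versions do not know the filter argument.
            tar.extractall(dest)
    return dest / "artifacts"
```

For example, extract_artifacts(Path("/tmp/anonymizer-artifacts.tar.gz"), Path("/tmp/anonymizer-artifacts")) returns the same directory that the shell commands produce.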
Then point AnonymizerJobResults at the extracted artifacts directory:
from pathlib import Path
from nemo_anonymizer_plugin.sdk.job_results import AnonymizerJobResults
results = AnonymizerJobResults(Path("/tmp/anonymizer-artifacts/artifacts"))
dataset = results.load_dataset()
trace = results.load_trace()
failed = results.load_failed_records()
If you used the SDK, use the AnonymizerJobResource methods directly. get_job_status() reads the current status, check_if_complete() tests whether artifacts are ready, wait_until_done() blocks until a terminal state, and download_artifacts() downloads and extracts the result:
job = sdk.anonymizer.run(request)
status = job.get_job_status()
is_done = job.check_if_complete()
job.wait_until_done()
results = job.download_artifacts()
dataset = results.load_dataset()
trace = results.load_trace()
failed = results.load_failed_records()
AnonymizerJobResults exposes load_dataset(), load_trace(), load_failed_records(), and display_record() over the same underlying files. See SDK Resources.
Inspect the Schema Without Running¶
run explain prints the job key, submit endpoint, and JSON schemas for AnonymizerRequest and the canonical AnonymizerStepConfig:
This is useful when authoring a spec programmatically or wiring the job into another tool.
How the Job Compiles¶
For each request, the plugin:
- Validates the Anonymizer library AnonymizerConfig.
- Validates the input source (rejects local paths on remote execution; checks fileset refs).
- Validates that selected_models overrides also have model_configs.
- Resolves model_configs providers: locally-defined Data Designer providers first, then Inference Gateway providers. Remote execution (run submit) resolves only through the Inference Gateway.
- Renders a unified model_configs YAML body for the library.
- Stores the resolved providers and YAML in the internal AnonymizerStepConfig consumed by the worker (in-process for run run, or on the Jobs worker for run submit).
For run submit, provider endpoints are re-resolved at runtime so the job uses the in-cluster Inference Gateway address rather than the address captured at submission time.
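The provider lookup order described above can be illustrated with a small, self-contained function. This is a sketch of the ordering only, not the plugin's implementation: resolve_provider, the two dictionaries, and the example endpoints are all hypothetical.

```python
def resolve_provider(
    name: str,
    local: dict[str, str],
    gateway: dict[str, str],
    remote: bool,
) -> str:
    """Return an endpoint for `name` using the documented lookup order.

    Local execution checks locally-defined providers first and then the
    Inference Gateway; remote execution consults only the gateway.
    """
    if not remote and name in local:
        return local[name]
    if name in gateway:
        return gateway[name]
    where = "the Inference Gateway" if remote else "local providers or the Inference Gateway"
    raise KeyError(f"provider {name!r} not found in {where}")


# Made-up endpoints for illustration only.
local_providers = {"nvidia-build": "http://localhost:9000/v1"}
gateway_providers = {"nvidia-build": "http://gateway.example/v1"}

local_endpoint = resolve_provider("nvidia-build", local_providers, gateway_providers, remote=False)
remote_endpoint = resolve_provider("nvidia-build", local_providers, gateway_providers, remote=True)
```

Here local_endpoint is the locally-defined address while remote_endpoint is the gateway address, which mirrors why run submit needs model_configs that reference Inference Gateway providers.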
Next Steps¶
- Iterate faster with preview before scaling to a full job.
- Refer to SDK Resources for AnonymizerJobResource and AnonymizerJobResults details.
- Replacement strategy parameters and rewrite mode are documented in the library docs.