Anonymizer NeMo Platform SDK Resources

The anonymizer.config module (from the NVIDIA NeMo Anonymizer library) builds AnonymizerConfig objects in a context-agnostic way. Once you are ready to execute that config against the NeMo Platform Anonymizer service, you use objects from the nemo_platform SDK. This page describes the NeMo Platform-specific objects.

AnonymizerResource

The AnonymizerResource is the entry point for working with Anonymizer on NeMo Platform. It wraps the streaming preview endpoint and job submission for the plugin service.

An AnonymizerResource is accessed directly from a NeMoPlatform instance:

import os
from nemo_platform import NeMoPlatform

sdk = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
anonymizer = sdk.anonymizer  # AnonymizerResource

An AsyncAnonymizerResource with the same surface is available via AsyncNeMoPlatform.anonymizer.

| Method | Description |
| --- | --- |
| preview(request, *, workspace=None) | Runs a streaming preview against the plugin service and returns an AnonymizerPreviewResult after the stream completes. |
| run(request, *, workspace=None, wait_until_done=False) | Submits an anonymizer.run job to the NeMo Platform Jobs worker. Returns an AnonymizerJobResource. When wait_until_done=True, blocks until the job reaches a terminal state. |
| get_job_resource(job_name, workspace=None) | Returns an AnonymizerJobResource for an existing job (by job name). |

request is a PreviewRequest or AnonymizerRequest instance from nemo_anonymizer_plugin.app.task_config. Both accept the same config, data, model_configs, and selected_models fields; PreviewRequest adds num_records.

Both preview and run call the plugin service, so they require model_configs and reject local file paths in data.source — use a fileset reference or http(s) URL.

AnonymizerPreviewResult

AnonymizerResource.preview collects the frame stream and returns an AnonymizerPreviewResult once the stream completes.

| Attribute / Method | Description |
| --- | --- |
| dataset | pandas.DataFrame of anonymized records (the preview_dataset frame contents). |
| trace_dataset | pandas.DataFrame with detection trace columns (the trace_dataset frame contents). |
| failed_records | list[dict] of per-record failures with reasons. Empty when nothing failed. |
| display_record(index=None) | Renders a single trace record as HTML in a notebook. When index is omitted, cycles through records. |

More about preview results

AnonymizerPreviewResult holds everything in memory; nothing is persisted to disk by default. The dataset and trace_dataset fields are regular pandas DataFrames and can be saved with to_csv / to_parquet.

AnonymizerJobResource

AnonymizerResource.run returns an AnonymizerJobResource. You can also use AnonymizerResource.get_job_resource to get one for an existing job.

job = sdk.anonymizer.run(run_request)
job.wait_until_done()
results = job.download_artifacts()
dataset = results.load_dataset()

| Method | Description |
| --- | --- |
| get_job() | Returns the raw job record from the jobs service. |
| get_job_status() | Returns the current PlatformJobStatus. |
| check_if_complete(*, raise_if_not_complete=False) | Returns True when the job has completed. Returns False (or raises, when raise_if_not_complete=True) if the job is still running or ended in a terminal state other than completed. |
| wait_until_done() | Polls the jobs service until the job reaches a terminal state. Logs progress as it goes. |
| get_logs() | Returns logs from the job as a list of dicts. Handles pagination automatically. |
| download_artifacts(path=None) | Downloads the job artifacts tarball and unarchives it. Returns an AnonymizerJobResults object. |

The async variant (AsyncAnonymizerJobResource) exposes the same surface with async def methods.
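
The polling behavior described for wait_until_done and check_if_complete can be pictured with a small, SDK-independent sketch. Everything here is an assumption made for illustration: the helper names, the status strings ("pending", "running", "completed", "error", "cancelled"), and the timeout parameter are hypothetical and are not taken from the nemo_platform SDK or from PlatformJobStatus.

```python
import time
from typing import Callable, Collection, Optional


def wait_until_terminal(
    get_status: Callable[[], str],
    terminal_states: Collection[str],
    poll_interval: float = 1.0,
    timeout: Optional[float] = None,
) -> str:
    """Poll get_status() until it returns one of terminal_states."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout}s")
        time.sleep(poll_interval)


def is_complete(status: str, *, raise_if_not_complete: bool = False) -> bool:
    """True only for the (hypothetical) 'completed' state; optionally raise otherwise."""
    if status == "completed":
        return True
    if raise_if_not_complete:
        raise RuntimeError(f"job is not complete (status: {status!r})")
    return False


# Simulated job that moves through three hypothetical states.
_states = iter(["pending", "running", "completed"])
final_status = wait_until_terminal(
    lambda: next(_states),
    terminal_states={"completed", "error", "cancelled"},
    poll_interval=0.0,
)
```

In real use, get_status would wrap a call such as job.get_job_status(); the sketch only shows the control flow, not how the SDK implements it.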

AnonymizerJobResults

download_artifacts returns an AnonymizerJobResults object that loads parquet / JSON artifacts into memory. The same class also works for the local run flow. Point it at the artifact directory that the local job results manager logs:

from pathlib import Path
from nemo_anonymizer_plugin.sdk.job_results import AnonymizerJobResults

results = AnonymizerJobResults(Path("/path/to/persistent/results/artifacts"))
dataset = results.load_dataset()

| Method | Description |
| --- | --- |
| load_dataset() | Returns the anonymized dataset as a pandas.DataFrame (dataset.parquet). |
| load_trace() | Returns the trace dataframe (trace.parquet). The original_text_column from metadata.json is attached for display_record. |
| load_failed_records() | Returns failed_records.json as list[dict]. Returns [] when the file isn't present. |
| display_record(index=None) | Renders a single trace record as HTML in a notebook. When index is omitted, cycles through records. |

More about job results

AnonymizerJobResults reads files lazily — methods load the corresponding parquet or JSON only when called. The underlying directory layout is:

<artifacts_dir>/
  dataset.parquet
  trace.parquet
  metadata.json
  failed_records.json   # only when there were failures

By default, download_artifacts saves the tarball contents to a local directory named after the job; pass path= to override.
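
To sanity-check an artifact directory before handing it to AnonymizerJobResults, a standard-library sketch along the following lines may help. It relies only on the file names listed in the layout above; the helper name inspect_artifacts and the shape of its return value are hypothetical and are not part of the plugin.

```python
import json
from pathlib import Path

# Files the layout above says are always present.
REQUIRED_FILES = ("dataset.parquet", "trace.parquet", "metadata.json")
# Written only when some records failed.
FAILED_RECORDS_FILE = "failed_records.json"


def inspect_artifacts(artifacts_dir: Path) -> dict:
    """Report missing required files and load failed records if present."""
    missing = [name for name in REQUIRED_FILES if not (artifacts_dir / name).is_file()]

    failed_path = artifacts_dir / FAILED_RECORDS_FILE
    if failed_path.is_file():
        failed_records = json.loads(failed_path.read_text(encoding="utf-8"))
    else:
        # Mirrors the documented load_failed_records() behavior: no file, empty list.
        failed_records = []

    return {"missing": missing, "failed_records": failed_records}
```

An empty "missing" list together with an empty "failed_records" list suggests a run that produced all expected artifacts with no per-record failures.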

Request Models

Both request models live in nemo_anonymizer_plugin.app.task_config.

Request Fields

AnonymizerRequest defines the execution fields below. Run jobs use AnonymizerRequest directly and process the full input file.

| Field | Type | Description |
| --- | --- | --- |
| config | AnonymizerConfig | Upstream library config (replace strategy or rewrite, detection params). |
| data | AnonymizerInputSpec | Input source plus column metadata. See below. |
| model_configs | list[data_designer.config.ModelConfig] \| None | Model pool. provider references an Inference Gateway provider name. |
| selected_models | SelectedModelsOverrides \| None | Optional role overrides on top of bundled defaults. Requires model_configs. |

PreviewRequest extends AnonymizerRequest with num_records:

| Field | Type | Description |
| --- | --- | --- |
| config | AnonymizerConfig | Upstream library config (replace strategy or rewrite, detection params). |
| data | AnonymizerInputSpec | Input source plus column metadata. See below. |
| model_configs | list[data_designer.config.ModelConfig] \| None | Model pool. provider references an Inference Gateway provider name. |
| selected_models | SelectedModelsOverrides \| None | Optional role overrides on top of bundled defaults. Requires model_configs. |
| num_records | int (≥ 1, default 10) | Preview-only. Number of records to preview. Capped by the service's preview_num_records.max. |

AnonymizerInputSpec

The plugin-owned API-boundary input spec:

| Field | Type | Description |
| --- | --- | --- |
| source | str | Local path, http(s) URL, or fileset reference for a CSV / Parquet file. |
| text_column | str (default "text") | Column containing text to anonymize. |
| id_column | str \| None | Optional record identifier column. |
| data_summary | str \| None | Optional short description of the data passed to Anonymizer library prompts. |

Fileset references can take any of the three forms fileset://<workspace>/<fileset>#<path>, <workspace>/<fileset>#<path>, or <fileset>#<path>, and must resolve to a single .csv or .parquet file.
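
As an illustration of the three accepted forms, here is a minimal parser sketch. It is not the plugin's own resolver: the FilesetRef type and the parse_fileset_ref function are hypothetical, and two details are assumptions made for the example (the suffix check is case-insensitive, and the fileset:// form always carries a workspace).

```python
from dataclasses import dataclass
from typing import Optional

SCHEME = "fileset://"
ALLOWED_SUFFIXES = (".csv", ".parquet")


@dataclass(frozen=True)
class FilesetRef:
    workspace: Optional[str]  # None when the reference omits the workspace
    fileset: str
    path: str


def parse_fileset_ref(ref: str) -> FilesetRef:
    """Split a fileset reference into workspace, fileset, and file path."""
    has_scheme = ref.startswith(SCHEME)
    body = ref[len(SCHEME):] if has_scheme else ref

    location, sep, path = body.partition("#")
    if not sep or not path:
        raise ValueError(f"missing '#<path>' in fileset reference: {ref!r}")
    # Assumption: suffix matching ignores case.
    if not path.lower().endswith(ALLOWED_SUFFIXES):
        raise ValueError(f"path must point to a .csv or .parquet file: {path!r}")

    parts = location.split("/")
    if len(parts) == 2 and all(parts):
        workspace, fileset = parts
    elif len(parts) == 1 and parts[0] and not has_scheme:
        # Bare "<fileset>#<path>" form; the workspace is left unresolved here.
        workspace, fileset = None, parts[0]
    else:
        raise ValueError(f"unrecognized fileset location: {location!r}")
    return FilesetRef(workspace, fileset, path)
```

For example, parse_fileset_ref("default/my-data#train.parquet") yields workspace "default", fileset "my-data", and path "train.parquet", while "my-data#notes.txt" is rejected because of its suffix.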

SelectedModelsOverrides

Partial role → alias overrides for the three workflows. Each section is optional and is merged on top of the bundled default selection by the library.

| Field | Type | Description |
| --- | --- | --- |
| detection | dict[str, str \| list[str]] \| None | Role → alias or alias pool for detection. |
| replace | dict[str, str] \| None | Role → alias for replacement (for example replacement_generator). |
| rewrite | dict[str, str] \| None | Role → alias for rewrite mode. |

Supplying overrides without model_configs raises a config validation error.
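
The merge and validation rules above can be sketched with plain dictionaries. This is an illustration only, not the library's implementation: merge_selected_models is a hypothetical helper, and apart from replacement_generator the role names and aliases used in the usage note are invented for the example.

```python
from typing import Any, Dict, Mapping, Optional, Sequence

RoleMap = Dict[str, Any]  # role name -> alias (or list of aliases for detection)


def merge_selected_models(
    defaults: Mapping[str, RoleMap],
    overrides: Optional[Mapping[str, Optional[RoleMap]]],
    model_configs: Optional[Sequence[Any]],
) -> Dict[str, RoleMap]:
    """Layer partial per-workflow overrides on top of a default selection."""
    if overrides and not model_configs:
        # Mirrors the documented rule: overrides need a model pool to refer to.
        raise ValueError("selected_models requires model_configs")

    # Copy the defaults so the caller's mapping is never mutated.
    merged = {workflow: dict(roles) for workflow, roles in defaults.items()}
    for workflow, roles in (overrides or {}).items():
        if roles:  # a section left as None keeps its defaults
            merged.setdefault(workflow, {}).update(roles)
    return merged
```

With defaults such as {"replace": {"replacement_generator": "default-generator"}} and overrides {"replace": {"replacement_generator": "my-llm"}}, only that one role changes; the detection and rewrite sections keep their default aliases.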