dp_utils
DP training utilities for Hugging Face Trainer and data collation.
Provides OpacusDPTrainer (a DP-aware Trainer with entity-level sampling and
an Opacus optimizer), DPCallback for Trainer hooks, data collators that
expose position_ids for per-sample gradients, and a GradSampleModule
wrapper with no_sync support.
Classes:

| Name | Description |
|---|---|
| `DPCallback` | Trainer callback that integrates Opacus DP-SGD with `transformers.Trainer`. |
| `DataCollatorForPrivateCausalLanguageModeling` | Adds `position_ids` for Opacus per-sample gradients. |
| `DataCollatorForPrivateTokenClassification` | Collator for token classification that adds `position_ids` for Opacus. |
| `GradSampleModule` | Opacus `GradSampleModule` with `no_sync` for Hugging Face Trainer. |
| `OpacusDPTrainer` | DP-aware Trainer for PEFT/LoRA fine-tuning with Opacus. |

Functions:

| Name | Description |
|---|---|
| `create_entity_mapping` | Build a mapping from each entity to its dataset indices. |
DPCallback(noise_multiplier, sampling_probability, accountant, max_epsilon=float('inf'))
Bases: TrainerCallback
Trainer callback that integrates Opacus DP-SGD with transformers.Trainer.
Handles per-step optimizer behavior (skip signal, step, zero_grad), optional
RDP step accounting, and early stopping when max_epsilon is exceeded.
Used with OpacusDPTrainer; the trainer injects this callback when
privacy arguments are enabled.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `noise_multiplier` | `float` | Gaussian noise scale for gradients. | required |
| `sampling_probability` | `float` | Probability of a record being in a batch. | required |
| `accountant` | `SafeSynthesizerAccountant` | Privacy accountant for epsilon computation and (if RDP) step tracking. | required |
| `max_epsilon` | `float` | Stop training when computed epsilon exceeds this value. | `float('inf')` |
Methods:

| Name | Description |
|---|---|
| `on_substep_end` | Run DP optimizer step at the end of each gradient-accumulation substep. |
| `on_step_end` | Clear gradients and update RDP accountant at the end of each optimizer step. |
| `on_save` | Stop training before a checkpoint save if the privacy budget would be exceeded. |
| `on_evaluate` | Check epsilon budget and stop training if `max_epsilon` is exceeded. |
Source code in src/nemo_safe_synthesizer/privacy/dp_transformers/dp_utils.py
on_substep_end(args, state, control, optimizer=None, **kwargs)
Run DP optimizer step at the end of each gradient-accumulation substep.
Signals the Opacus optimizer to skip the step, calls step() and
zero_grad() on the underlying DP optimizer (or the optimizer itself
if not wrapped by Accelerate). Required when using gradient accumulation
so that the optimizer step runs once per micro-batch.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `args` | `TrainingArguments` | HF Trainer arguments. | required |
| `state` | `TrainerState` | Current trainer state. | required |
| `control` | `TrainerControl` | Trainer control object (not modified). | required |
| `optimizer` | | The Trainer's optimizer (Opacus DP optimizer or `AcceleratedOptimizer` wrapping it). | `None` |
| `**kwargs` | | Additional callback keyword arguments. | `{}` |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If `optimizer` is `None` (callback cannot access optimizer). |
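The per-substep sequence described above can be sketched with a stand-in optimizer (the real callback receives an Opacus DPOptimizer, possibly wrapped by accelerate's `AcceleratedOptimizer`; `FakeDPOptimizer` here is purely illustrative):

```python
class FakeDPOptimizer:
    """Minimal stand-in recording the calls the callback is expected to make."""
    def __init__(self):
        self.calls = []

    def signal_skip_step(self, do_skip=True):
        # Opacus: mark this step as a skippable micro-batch step.
        self.calls.append("signal_skip_step")

    def step(self):
        self.calls.append("step")

    def zero_grad(self):
        self.calls.append("zero_grad")


def on_substep_end(optimizer):
    if optimizer is None:
        raise RuntimeError("Callback cannot access the optimizer")
    # Unwrap an accelerate AcceleratedOptimizer if present (it exposes .optimizer).
    inner = getattr(optimizer, "optimizer", optimizer)
    inner.signal_skip_step(do_skip=True)
    inner.step()
    inner.zero_grad()


opt = FakeDPOptimizer()
on_substep_end(opt)
print(opt.calls)  # ['signal_skip_step', 'step', 'zero_grad']
```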
on_step_end(args, state, control, optimizer=None, **kwargs)
Clear gradients and update RDP accountant at the end of each optimizer step.
Calls zero_grad() on the optimizer (Opacus expects this; Trainer does not
call it by default). When using the RDP accountant (not PRV), increments the
accountant step for accurate epsilon calculation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `args` | `TrainingArguments` | Trainer training arguments (used to check `gradient_accumulation_steps`). | required |
| `state` | `TrainerState` | Current trainer state. | required |
| `control` | `TrainerControl` | Trainer control object (not modified). | required |
| `optimizer` | | The Trainer's optimizer (required for `zero_grad()`). | `None` |
| `**kwargs` | | Additional callback keyword arguments. | `{}` |
Raises:

| Type | Description |
|---|---|
| `RuntimeError` | If gradient accumulation is used but `optimizer` is `None`. |
on_save(args, state, control, **kwargs)
Called when the Trainer is about to save a checkpoint. Ensures training stops before saving if the privacy budget would be exceeded.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `args` | `TrainingArguments` | HF Trainer arguments. | required |
| `state` | `TrainerState` | Current trainer state (used for `global_step`). | required |
| `control` | `TrainerControl` | Trainer control object; `should_training_stop` may be set. | required |
| `**kwargs` | | Additional callback keyword arguments. | `{}` |

Returns:

| Type | Description |
|---|---|
| `TrainerControl` | Control object with `should_training_stop` set if the computed epsilon exceeds `max_epsilon`. |
on_evaluate(args, state, control, **kwargs)
Check epsilon budget and stop training if max_epsilon is exceeded.
Called when the Trainer runs evaluation. Ensures training stops before further steps if the privacy budget would be exceeded.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `args` | `TrainingArguments` | HF Trainer arguments. | required |
| `state` | `TrainerState` | Current trainer state (used for `global_step`). | required |
| `control` | `TrainerControl` | Trainer control object; `should_training_stop` may be set. | required |
| `**kwargs` | | Additional callback keyword arguments. | `{}` |

Returns:

| Type | Description |
|---|---|
| `TrainerControl` | Control object with `should_training_stop` set if the computed epsilon exceeds `max_epsilon`. |
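The budget check performed by `on_save`/`on_evaluate` amounts to the following sketch, with simple stand-ins for `SafeSynthesizerAccountant` and `transformers.TrainerControl` (all names here are illustrative):

```python
from dataclasses import dataclass


@dataclass
class FakeControl:
    """Stand-in for transformers.TrainerControl."""
    should_training_stop: bool = False


def check_epsilon_budget(get_epsilon, max_epsilon, control):
    """Stop training when the epsilon consumed so far exceeds the budget."""
    if get_epsilon() > max_epsilon:
        control.should_training_stop = True
    return control


# Over budget: 9.2 > 8.0, so training is signalled to stop.
control = check_epsilon_budget(lambda: 9.2, max_epsilon=8.0, control=FakeControl())
print(control.should_training_stop)  # True
```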
DataCollatorForPrivateCausalLanguageModeling(tokenizer)
Bases: DataCollatorForLanguageModeling
Adds position_ids for Opacus per-sample gradients.
Trainer and model code often create position_ids inside the model
forward pass, which Opacus cannot see. This collator builds position_ids
during batching so they are present in the batch and available for
per-sample gradient computation. See https://github.com/huggingface/transformers/blob/5c1c72be5f864d10d0efe8ece0768d9ed6ee4fdd/src/transformers/models/mistral/modeling_mistral.py#L379
for an example.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer for padding and encoding. | required |
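An illustrative sketch (not the library implementation) of the idea: positions are simply `0..seq_len-1` for each row, materialized at batching time so they appear in the batch instead of being created inside the model's forward pass. The real collator builds a tensor (e.g. via `torch.arange` expanded over the batch); this plain-Python version only shows the shape of the result:

```python
def add_position_ids(batch):
    """Attach position_ids to a batch dict of equal-length input_id rows."""
    seq_len = len(batch["input_ids"][0])
    batch["position_ids"] = [list(range(seq_len)) for _ in batch["input_ids"]]
    return batch


batch = {"input_ids": [[5, 6, 7], [8, 9, 0]]}
print(add_position_ids(batch)["position_ids"])  # [[0, 1, 2], [0, 1, 2]]
```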
DataCollatorForPrivateTokenClassification(tokenizer)
Bases: DataCollatorForTokenClassification
Collator for token classification that adds position_ids for Opacus.
Same rationale as DataCollatorForPrivateCausalLanguageModeling: ensures
position_ids are in the batch for per-sample gradient computation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer for padding and encoding. | required |
GradSampleModule
Bases: GradSampleModule
Opacus GradSampleModule with no_sync for Hugging Face Trainer.
Trainer expects a no_sync context manager to defer gradient sync in
distributed settings. This wrapper provides a no-op no_sync so the
Trainer API is satisfied.
Methods:

| Name | Description |
|---|---|
| `no_sync` | Context manager that does nothing; required by Trainer's expected API. |
OpacusDPTrainer(train_dataset, model, args=None, privacy_args=None, data_fraction=None, true_dataset_size=None, entity_column_values=None, callbacks=None, secure_mode=True, **kwargs)
Bases: Trainer
DP-aware Trainer for PEFT/LoRA fine-tuning with Opacus.
Adapts Hugging Face Trainer for differential privacy: uses entity-level
(or record-level) sampling, wraps the model in GradSampleModule and
the optimizer in Opacus DPOptimizer, and avoids double-scaling of
loss by gradient accumulation. Saves only the PEFT/LoRA adapter weights.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `train_dataset` | `Dataset` | Dataset for training. | required |
| `model` | `PreTrainedModel \| Module` | Base model (will be wrapped with `GradSampleModule`). | required |
| `args` | | Training arguments for the underlying `Trainer`. | `None` |
| `privacy_args` | `PrivacyArguments \| None` | DP parameters (epsilon, delta, noise, clipping). Required. | `None` |
| `data_fraction` | `float \| None` | If set, scales the effective number of epochs for privacy math. | `None` |
| `true_dataset_size` | `int \| None` | Override number of entities/records for privacy accounting. | `None` |
| `entity_column_values` | `list \| None` | If set, entity-level DP; each value is the entity ID for the corresponding dataset row. If None, record-level DP (one entity per row). | `None` |
| `callbacks` | `list[TrainerCallback] \| None` | Additional Trainer callbacks. | `None` |
| `secure_mode` | `bool \| None` | If True, use secure RNG for noise (recommended). | `True` |
| `**kwargs` | `dict` | Passed to the base `Trainer` constructor. | `{}` |
Attributes:

| Name | Type | Description |
|---|---|---|
| `accountant` | | Privacy accountant used for epsilon computation. |
| `entity_mapping` | | For entity `i`, list of dataset indices in that entity. |
Methods:

| Name | Description |
|---|---|
| `get_epsilon` | Return the epsilon consumed so far, using the privacy accountant and the current step count. |
| `create_optimizer` | Create the base optimizer then wrap it with Opacus DPOptimizer. |
| `training_step` | Run one training step and return the loss scaled for logging. |
| `get_train_dataloader` | DataLoader with entity-level sampler and DP data collator. |
sampling_probability
property

Probability that an entity is included in a batch (capped at 1.0).

For record-level DP (one entity per row), this is \(\min\!\left(1, \frac{\text{per\_device\_batch\_size} \times \text{gradient\_accumulation\_steps}}{n_{\text{entities}}}\right)\). For entity-level DP, \(n_{\text{entities}}\) can be small, so the ratio may exceed 1; the result is capped at 1.0. Used as the sampling probability in the privacy accountant for ε computation.
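A worked example of the formula above, with hypothetical numbers:

```python
def sampling_probability(per_device_batch_size, grad_accum_steps, n_entities):
    """min(1, effective batch size / number of entities)."""
    return min(1.0, per_device_batch_size * grad_accum_steps / n_entities)


# Record-level DP: 4 * 8 = 32 records drawn per step from 1000 rows.
print(sampling_probability(4, 8, 1000))  # 0.032
# Entity-level DP with few entities: the ratio exceeds 1 and is capped.
print(sampling_probability(4, 8, 20))    # 1.0
```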
num_steps
property

The number of optimizer steps used for privacy accounting.

Either user-supplied (via max_steps when true_num_epochs == -1)
or derived from num_train_epochs. When the user specifies
num_train_epochs, num_steps is computed from
sampling_probability so that each entity is visited roughly once per
epoch, analogous to passing over each record once per epoch in
record-level training.

Always at least 1, because we add 1 to 1 / sampling_probability;
sampling_probability can reach 1 when there are fewer entities than
batch_size * gradient_accumulation_steps (e.g. 4 * 8 = 32).
Used to determine the privacy budget (noise multiplier and epsilon)
during training.
get_epsilon()

Return the epsilon consumed so far, using the trainer's privacy accountant and the current number of optimizer steps.
create_optimizer()

Create the base optimizer then wrap it with Opacus DPOptimizer.
training_step(model, inputs, num_items_in_batch=None)
Run one training step and return the loss scaled for logging.
Forward pass and backward are performed as usual. Loss is not scaled by
batch size or per-sample factors here: Opacus handles per-sample gradient
scaling. The returned value is the raw loss divided by
gradient_accumulation_steps so that the logged loss matches the
effective per-step loss (averaged over accumulation steps).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Module` | The model to train (wrapped in `GradSampleModule`). | required |
| `inputs` | `dict[str, Tensor \| Any]` | Batch of inputs from the data collator. | required |
| `num_items_in_batch` | | Unused; accepted for API compatibility (Opacus handles scaling). | `None` |
Returns:

| Type | Description |
|---|---|
| `Tensor` | Detached loss scaled by `1 / gradient_accumulation_steps`, for logging only (the optimizer step is driven by the callback). |
get_train_dataloader()

DataLoader with entity-level sampler and DP data collator.
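A hedged sketch of the entity-level sampling idea behind this dataloader: each entity is included in a batch independently with probability q (Poisson sampling), and an included entity contributes all of its dataset rows. Names and structure are illustrative; the real dataloader wires this logic into a torch sampler:

```python
import random


def sample_entity_batch(entity_mapping, q, rng):
    """Poisson-sample entities; an included entity contributes all its rows."""
    indices = []
    for rows in entity_mapping:   # one entry per entity
        if rng.random() < q:      # include the whole entity or not at all
            indices.extend(rows)
    return indices


entity_mapping = [[0, 1], [2], [3, 4, 5]]   # 3 entities covering 6 rows
batch = sample_entity_batch(entity_mapping, q=0.5, rng=random.Random(0))
# Every sampled index belongs to a complete entity, never a partial one.
assert all(i in {0, 1, 2, 3, 4, 5} for i in batch)
```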
create_entity_mapping(entity_column_values)
Build a mapping from each entity to its dataset indices.
Groups rows by the entity column; each group's indices are the dataset positions for that entity. Entity order follows groupby sort; order within a group is preserved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_column_values` | `list` | Entity IDs aligned with dataset rows (one value per row, in row order). | required |
Returns:

| Type | Description |
|---|---|
| `Sequence[Sequence[int]]` | For entity `i`, `result[i]` is the list of dataset indices belonging to that entity. |
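A minimal sketch matching the described behavior, assuming pandas-groupby-like semantics (sorted order across entities, original row order preserved within each entity); this is an illustration, not the library implementation:

```python
def create_entity_mapping(entity_column_values):
    """Group row indices by entity ID."""
    groups = {}
    for idx, entity in enumerate(entity_column_values):
        groups.setdefault(entity, []).append(idx)
    # groupby-style sorted entity order; within-group order is preserved.
    return [groups[entity] for entity in sorted(groups)]


print(create_entity_mapping(["b", "a", "b", "a", "c"]))
# [[1, 3], [0, 2], [4]]
```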