Synthetic Data Quality¶
Reference for diagnosing and improving synthetic data quality and privacy. Covers differential privacy errors, PII replacement issues, evaluation metric behavior, and score interpretation for operational use. For conceptual explanations of what SQS and DPS measure and how to read the HTML report, see Product Overview -- Evaluation. For runtime errors, OOM issues, and configuration problems, see Program Runtime. For environment variables and model caching, see Environment Variables.
There is always an inherent trade-off between privacy and quality. Generating synthetic data at all already provides a high level of privacy protection, which is often a sufficient balance between privacy and utility. Parameter tuning, however, can often improve one without sacrificing the other.
Row count¶
Your dataset should have at least 1,000 records (10,000 records if enabling differential privacy). If you have fewer, both quality and privacy are likely to suffer considerably.
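As a quick pre-flight, the row-count guidance above can be encoded in a few lines. This is a sketch: the thresholds are the recommendations from this page, not limits enforced by any API.

```python
def has_enough_rows(num_rows: int, use_dp: bool = False) -> bool:
    """Check the recommended minimum row count before training.

    1,000 rows without differential privacy, 10,000 with it
    (recommendations from this page, not hard limits).
    """
    minimum = 10_000 if use_dp else 1_000
    return num_rows >= minimum
```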
Differential Privacy¶
Differentially private (DP) training has strict requirements. Violating them produces errors that may not immediately point to the root cause.
Requirements¶
For the full list of DP compatibility constraints (use_unsloth,
max_sequences_per_example, gradient checkpointing), see
Configuration -- Differential Privacy.
Note
data_fraction and true_dataset_size must be available at runtime --
these are set automatically when running the full pipeline.
Common DP Errors¶
The privacy budget (epsilon) is too low for your dataset size. Either increase
privacy.epsilon or add more training records. Generally, you need a minimum of ~10,000 records to achieve reasonable quality when DP is enabled.
The PRV accountant failed and the system is falling back to the Opacus RDP accountant. This is handled automatically -- there is no user-facing config to select the accountant. The fallback may produce slightly different privacy guarantees, since the two accountants use different composition methods: PRV uses privacy loss random variables (Gopi et al. 2021), while RDP uses Rényi divergence (Mironov 2017).
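To make the difference concrete, here is a minimal sketch of RDP-style accounting for the Gaussian mechanism (Mironov 2017). This is an illustration only, not the Opacus implementation, and it ignores subsampling amplification:

```python
import math

def rdp_gaussian(alpha: float, sigma: float, steps: int) -> float:
    # RDP of order alpha for one Gaussian mechanism is alpha / (2 * sigma^2);
    # RDP composes additively across steps.
    return steps * alpha / (2 * sigma ** 2)

def rdp_to_epsilon(sigma: float, steps: int, delta: float) -> float:
    # Convert to (epsilon, delta)-DP by minimizing over Renyi orders.
    return min(
        rdp_gaussian(alpha, sigma, steps) + math.log(1 / delta) / (alpha - 1)
        for alpha in range(2, 64)
    )
```

A PRV accountant instead numerically composes privacy loss distributions, which typically yields a somewhat tighter epsilon for the same noise; that is why the fallback can report slightly different guarantees.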
Small datasets cause poor privacy budget utilization. Consider lowering
training.batch_size or adding more records.
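The utilization issue comes down to the per-step sampling rate: with Poisson subsampling, each DP-SGD step draws roughly batch_size / dataset_size of the data, and a smaller rate gives stronger privacy amplification per step. A small sketch (illustrative arithmetic, not the trainer's actual sampler):

```python
def per_step_sampling_rate(dataset_size: int, batch_size: int) -> float:
    # Fraction of the dataset "touched" per DP-SGD step. Smaller
    # fractions amplify privacy, so the same epsilon budget stretches
    # over more steps.
    return batch_size / dataset_size

# The same batch size is a 100x larger fraction of a 1k-row dataset
# than of a 100k-row dataset -- hence poor budget utilization on
# small data, and why lowering batch_size (or adding rows) helps.
small = per_step_sampling_rate(1_000, 64)
large = per_step_sampling_rate(100_000, 64)
```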
PII Replacement¶
Entity detection and classification issues during PII replacement.
PII Uses Unexpected Entity Types¶
If PII replacement is not detecting the entity types you expect, the column classifier may have failed silently. When the classifier fails to initialize or classify, it falls back to default entity types.
If PII replacement seems to use unexpected entity types, check the logs for classifier initialization or classification failure messages. When NSS_INFERENCE_KEY is not set, the failure log line is followed by guidance to set it (and a note that NSS_INFERENCE_ENDPOINT is optional with the default API). When the key is set, a traceback may be included to show the underlying API error.
Fix: set entity types explicitly in your config, or, when using the CLI, ensure NSS_INFERENCE_KEY is set (and NSS_INFERENCE_ENDPOINT if not using the default). The PII classify config is deeply nested, so use YAML or the SDK:
```python
from nemo_safe_synthesizer.config.replace_pii import PiiReplacerConfig

# Enable the column classifier and pin the entity types explicitly.
pii_config = PiiReplacerConfig.get_default_config()
pii_config.globals.classify.enable_classify = True
pii_config.globals.classify.entities = ["name", "email", "phone_number"]

synthesizer = (
    SafeSynthesizer(config)  # `SafeSynthesizer` and `config` as set up elsewhere in your script
    .with_data_source("data.csv")
    .with_replace_pii(config=pii_config)
)
```
Evaluation¶
For out-of-memory errors during evaluation, see Troubleshooting > OOM During Evaluation.
Minimum Data Requirements¶
Several evaluation metrics have minimum data requirements:
| Metric | Minimum | Behavior if Unmet |
|---|---|---|
| Holdout split | 200 records | Raises ValueError (pipeline stops) |
| Text semantic similarity | 200 records | Skipped; score marked UNAVAILABLE |
| Attribute Inference Protection | FAISS installed + `evaluation.quasi_identifier_count` columns (default 3; auto-reduced for smaller datasets) | Skipped if FAISS missing; UNAVAILABLE if too few columns |
| Deep Structure Stability | 2 columns, 2 rows | Skipped with warning; score marked UNAVAILABLE |
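The minimums above can be checked up front with a small helper. This is a sketch based on the table; the product's own validation logic may differ:

```python
def evaluation_preflight(num_rows: int, num_cols: int, holdout_rows: int) -> list[str]:
    """Return the metrics that will fail or be skipped, per the minimums table."""
    issues = []
    if holdout_rows < 200:
        issues.append("holdout split < 200 records: ValueError, pipeline stops")
    if num_rows < 200:
        issues.append("text semantic similarity: skipped, marked UNAVAILABLE")
    if num_cols < 2 or num_rows < 2:
        issues.append("Deep Structure Stability: skipped with warning, UNAVAILABLE")
    return issues
```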
UNAVAILABLE Metrics¶
UNAVAILABLE is the literal string that appears in the evaluation report when
a metric could not be computed. Many evaluation components catch errors and
return this grade instead of failing the pipeline.
Common reasons a metric shows UNAVAILABLE:
- Column type mismatch -- `ColumnDistribution`, `DeepStructure` (PCA), and `Correlation` apply only to numeric and categorical columns; `TextSemanticSimilarity` and `TextStructureSimilarity` apply only to text columns. A dataset with no text columns will show UNAVAILABLE for text metrics, and vice versa. This is by design.
- No holdout split -- `TextSemanticSimilarity` and `MembershipInferenceProtection` both require a held-out test set. If `data.holdout` is `0` (no holdout), these metrics are skipped and marked UNAVAILABLE.
- Too few records or columns -- see the minimums table above.
- Model download failure -- the SentenceTransformer model must be present in your Hugging Face cache (`$HF_HOME`, default `~/.cache/huggingface`). Run once with internet access before switching to offline mode.
If the reason is not obvious, check the logs for warnings and exceptions logged during the evaluation stage.
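Before switching to offline mode, you can verify the model files are actually in the cache. The helper below resolves the cache directory as described above and applies the Hub's models--{org}--{name} folder naming; treat it as a heuristic sketch rather than an official API:

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    # $HF_HOME overrides the default ~/.cache/huggingface location.
    return Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))

def model_is_cached(repo_id: str) -> bool:
    # The Hub cache stores each repo under hub/models--{org}--{name};
    # an existing directory suggests the model was downloaded.
    return (hf_cache_dir() / "hub" / ("models--" + repo_id.replace("/", "--"))).is_dir()
```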
Report Truncation¶
SQS reports are limited to sqs_report_columns=250 columns and
sqs_report_rows=5000 rows by default. Larger datasets are silently
truncated in the HTML report. Adjust these in evaluation config if needed.
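A quick way to see whether your dataset will be truncated in the report. The defaults mirror the values above; this is a sketch, not a product API:

```python
def report_truncation(num_rows: int, num_cols: int,
                      max_rows: int = 5_000, max_cols: int = 250) -> tuple[int, int]:
    # Returns (rows dropped, columns dropped) from the HTML report.
    return max(0, num_rows - max_rows), max(0, num_cols - max_cols)
```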
See Configuration -- Evaluation for the full list of evaluation fields.
Low SQS Scores¶
If the SQS (Synthetic Quality Score) report shows low quality scores:
- Review column distributions in the HTML report -- large divergences indicate the model did not learn the data patterns well
- Check that training data is representative and not too small
- Consider increasing `generation.num_records` for a larger sample
- Modify `training.num_input_records_to_sample` -- this controls how much data the model sees during training (analogous to training duration) and affects generation quality. Increasing it is usually the first thing to try, but note that very small input datasets can lead to over-training, so try both increasing and decreasing it if quality remains poor
Interpreting Results¶
For a conceptual overview of evaluation metrics -- what SQS and DPS measure, how to read the HTML report, and what score ranges indicate -- see Product Overview -- Evaluation.