Dev Notes

Updates, design decisions, and technical notes from the NeMo Safe Synthesizer team.

Speeding Up NeMo Safe Synthesizer Generation with Prompt-Aware Token Budgets

NeMo Safe Synthesizer has expanded to more models and larger context windows. That flexibility helps with longer rows, richer text columns, and time-series groups, but it also exposed a generation bottleneck: small jobs and short-row datasets could still pay for decode budgets sized for the full context window.

The updated generation path removes that waste in two places. It starts with a small prompt probe before scaling up the batch size, and it replaces full-context decode caps with prompt-aware caps empirically derived from token lengths of the actual data.
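To make the idea of a prompt-aware cap concrete, here is a minimal sketch of how a decode budget could be derived from observed token lengths instead of the full context window. It is an illustration, not the NeMo Safe Synthesizer implementation: the function names, the 0.99 quantile, and the 1.25 headroom factor are all assumptions chosen for the example.

```python
import math


def quantile_nearest_rank(values, q):
    """Return the q-quantile of values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def prompt_aware_decode_cap(row_token_lengths, prompt_tokens, context_window,
                            quantile=0.99, headroom=1.25):
    """Hypothetical helper: choose a max-new-tokens cap from real data.

    row_token_lengths: token counts of tokenized training rows (assumed input).
    prompt_tokens: length of the generation prompt in tokens.
    context_window: the model's maximum context length in tokens.
    quantile, headroom: illustrative tuning knobs, not documented defaults.
    """
    if not row_token_lengths:
        raise ValueError("need at least one observed row length")

    # The old-style budget: everything left in the context after the prompt.
    full_context_cap = context_window - prompt_tokens
    if full_context_cap <= 0:
        raise ValueError("prompt does not fit in the context window")

    # The data-driven budget: a high quantile of real row lengths plus headroom.
    observed = quantile_nearest_rank(row_token_lengths, quantile)
    data_cap = math.ceil(observed * headroom)

    # Never exceed what the context window can actually hold.
    return min(data_cap, full_context_cap)


# Short rows in a large context window get a small budget.
short_row_cap = prompt_aware_decode_cap([40, 42, 45, 50, 60],
                                        prompt_tokens=100, context_window=8192)
```

With these example inputs, the cap is 75 tokens rather than the 8,092 tokens a full-context budget would allow, which is the kind of waste the paragraph above describes for short-row datasets.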

Our experiments show up to 10.0x generation speedup, a 1.62x median speedup across 39 before/after comparisons, and more consistent generation times overall.

This improvement shipped in NeMo Safe Synthesizer v0.0.6.

Does NeMo Safe Synthesizer Actually Work? A Case Study with Financial Transactions Data

NeMo Safe Synthesizer creates private, safe versions of sensitive tabular datasets: entirely synthetic data with no one-to-one mapping to the original records that still preserves the statistical structure needed to remain useful for downstream AI and analytics.

That promise sounds simple, but it raises the question every synthetic data system eventually has to answer: does it actually work?

Private by Design: Introducing NeMo Safe Synthesizer

Every organization working on AI faces the same challenge: the data that would make their models most useful is also the proprietary data with the highest barriers to access. That data is right there: patient records, financial transactions, customer support logs, and datasets full of names, account numbers, and personal details. It is rich and perfectly suited to the task, but legal and compliance teams have marked it off-limits for good reason.

We built NeMo Safe Synthesizer to break that deadlock by helping organizations create synthetic versions of sensitive tabular data.