PII Replacement
PII (Personally Identifiable Information) replacement is a critical privacy protection step that detects and replaces sensitive information in your datasets before synthesis. This ensures that the model never has the opportunity to learn the most sensitive information (e.g. names, addresses, identifiers) from the training data.
How It Works
The PII replacement pipeline operates in two stages:
- Detection: Identifies PII entities within free-text values and classifies entire columns that hold a single entity type.
- Replacement: Substitutes the detected PII using configurable rules.
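The two stages can be sketched in a few lines of Python. This is a minimal illustration, not the NeMo Safe Synthesizer implementation: the regex-based `detect` function is a stand-in for the model-based detectors, and the `Span` type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first character of the entity
    end: int    # index one past the last character
    label: str  # entity type, e.g. "email"


# Stand-in detector (assumption): a simple regex instead of an NER model.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect(text: str) -> list[Span]:
    """Stage 1: locate PII entities in free text."""
    return [Span(m.start(), m.end(), "email") for m in EMAIL_RE.finditer(text)]


def replace(text: str, spans: list[Span]) -> str:
    """Stage 2: replace each detected span with a tag naming its entity type."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.label}>" + text[span.end:]
    return text


record = "Contact alice@example.com or bob@example.org for access."
print(replace(record, detect(record)))
# Contact <email> or <email> for access.
```

Keeping detection and replacement separate means the same detected spans can feed any replacement rule without re-scanning the text.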
Detection Methods
NeMo Safe Synthesizer supports the two PII detection methods described in the table below:
| Method | Scope | Description | Key Features |
|---|---|---|---|
| LLM Classification | Entire columns | Uses a language model to classify a column when the entire column holds a single entity type | Contextual understanding of entities; handles complex PII patterns; flexible entity definitions; configurable prompts and models |
| GLiNER PII | Free text | Uses the GLiNER PII model for entity recognition within free-text columns | Zero-shot entity detection; supports custom entity types; high accuracy for standard PII categories; configurable confidence thresholds |
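To illustrate what a configurable confidence threshold does, the sketch below filters a set of candidate detections by score. The `Detection` type, the scores, and the `filter_by_threshold` helper are hypothetical; they are not part of the GLiNER or NeMo Safe Synthesizer APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    text: str     # matched substring
    label: str    # predicted entity type
    score: float  # model confidence in [0, 1]


def filter_by_threshold(detections: list[Detection], threshold: float) -> list[Detection]:
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up candidate detections, for illustration only.
candidates = [
    Detection("Alice", "first_name", 0.97),
    Detection("Paris", "city", 0.62),
    Detection("May", "first_name", 0.31),
]

strict = filter_by_threshold(candidates, 0.9)
lenient = filter_by_threshold(candidates, 0.5)
print([d.text for d in strict])   # ['Alice']
print([d.text for d in lenient])  # ['Alice', 'Paris']
```

A higher threshold reduces false positives at the cost of missing some PII; for privacy-sensitive data, a lower threshold that over-replaces is usually the safer error.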
Replacement Methods
After detection, PII can be handled with any of the following strategies:
| Strategy | Description | Example |
|---|---|---|
| Annotate | Tag the original value with its detected entity type, keeping the value itself | Alice → <entity type="first_name" value="Alice"> |
| Redact | Replace PII with a generic tag | Alice → <first_name> |
| Hash | Replace PII with a hashed value | Alice → 3bf676c57 |
| Substitute | Replace PII with a context-relevant alternative | Alice → Erica |
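The four strategies can be sketched as small Python functions. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, SHA-256 is an assumed hash (so the output will not match the example value in the table), and the name pool stands in for a locale-aware fake data generator.

```python
import hashlib

# Hypothetical pool of replacement values (assumption for this sketch).
FIRST_NAMES = ["Erica", "Jordan", "Priya", "Mateo"]


def _digest(value: str) -> str:
    """Hex SHA-256 digest of the value (assumed hash function)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def annotate(value: str, entity: str) -> str:
    """Keep the original value but tag it with its entity type."""
    return f'<entity type="{entity}" value="{value}">'


def redact(value: str, entity: str) -> str:
    """Drop the value and leave only a generic entity tag."""
    return f"<{entity}>"


def hash_value(value: str, entity: str, length: int = 9) -> str:
    """Replace the value with a truncated hex digest."""
    return _digest(value)[:length]


def substitute(value: str, entity: str) -> str:
    """Swap in an alternative name, chosen deterministically per input."""
    index = int(_digest(value), 16) % len(FIRST_NAMES)
    return FIRST_NAMES[index]


for strategy in (annotate, redact, hash_value, substitute):
    print(f"{strategy.__name__:>10}: {strategy('Alice', 'first_name')}")
```

Annotate keeps the sensitive value in the output, so it suits inspection rather than protection. Hash and the deterministic substitute map equal inputs to equal outputs, which preserves joins across rows; a plain digest of a low-entropy value such as a first name can be reversed by dictionary lookup, so a keyed hash is the safer choice when that matters.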
Supported Entity Types
GLiNER PII will attempt to identify any custom entity type you provide. However, it has been fine-tuned specifically to detect the following entities, organized by category:
Default entity set
The default configuration detects and replaces a focused subset of these entities: first_name, last_name, name, street_address, city, state, postcode, address, phone_number, fax_number, email, ssn, national_id, tax_id, and credit_debit_card. To detect additional entity types, add them to both replace_pii.globals.classify.entities and replace_pii.globals.ner.ner_entities in your configuration, and add matching replacement steps to replace_pii.steps.
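Because a new entity type has to appear in three places, a quick consistency check can catch omissions. The sketch below works on a plain Python dictionary that mirrors the configuration keys named above. The `missing_locations` helper is hypothetical, and the assumed layout of each step (rows, update, entity) follows the example in the Configuration section rather than the full schema.

```python
def missing_locations(config: dict, entity: str) -> list[str]:
    """Return the config paths where `entity` still needs to be added."""
    replace_pii = config.get("replace_pii", {})
    globals_cfg = replace_pii.get("globals", {})
    missing = []

    if entity not in globals_cfg.get("classify", {}).get("entities", []):
        missing.append("replace_pii.globals.classify.entities")
    if entity not in globals_cfg.get("ner", {}).get("ner_entities", []):
        missing.append("replace_pii.globals.ner.ner_entities")

    # Assumed step layout: steps -> rows -> update -> entity (list of names).
    covered = {
        name
        for step in replace_pii.get("steps", [])
        for rule in step.get("rows", {}).get("update", [])
        for name in rule.get("entity", [])
    }
    if entity not in covered:
        missing.append("replace_pii.steps")
    return missing


config = {
    "replace_pii": {
        "globals": {
            "classify": {"entities": ["email", "iban"]},
            "ner": {"ner_entities": ["email"]},
        },
        "steps": [
            {"rows": {"update": [{"entity": ["email"], "value": "column.entity | fake"}]}}
        ],
    }
}

print(missing_locations(config, "email"))  # []
print(missing_locations(config, "iban"))
# ['replace_pii.globals.ner.ner_entities', 'replace_pii.steps']
```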
Personal Information
- first_name: Given names
- last_name: Surnames and family names
- name: Full names
- age: Ages
- email: Email addresses
- phone_number: Phone numbers in various formats
- fax_number: Fax numbers in various formats
Addresses
- address: Complete physical addresses (for example, 123 Main Street, Anytown, CA 90210)
- street_address: Street addresses (for example, 123 Main Street)
- city: City names
- county: County names
- state: State/province names
- postcode: Postal/ZIP codes
- country: Country names
Personal Identifiers
- ssn: Social Security Numbers
- national_id: National ID numbers
- tax_id: Tax ID numbers
- certificate_license_number: Driver’s license numbers
- unique_identifier: Generic unique IDs
- customer_id: Customer identifiers
- employee_id: Employee identifiers
Financial Information
- credit_debit_card: Credit and debit card numbers
- cvv: Card verification codes
- pin: Personal identification numbers
- account_number: Bank account numbers
- bank_routing_number: Bank routing numbers
- swift_bic: SWIFT/BIC codes
- iban: International bank account numbers
Medical Information
- medical_record_number: Medical record numbers
- health_plan_beneficiary_number: Insurance IDs
- biometric_identifier: Biometric data references
Technical Identifiers
- url: Web URLs
- ipv4: IPv4 addresses
- ipv6: IPv6 addresses
- mac_address: Hardware MAC addresses
- api_key: API keys and tokens
- user_name: Usernames
- password: Passwords
- http_cookie: HTTP cookies
- device_identifier: Device IDs
Vehicle Identifiers
- vehicle_identifier: Vehicle identification numbers (VINs)
- license_plate: License plates
Geographic Information
- latitude: Latitude coordinates
- longitude: Longitude coordinates
- coordinate: Coordinate pairs
Quasi Identifiers
- date: Date values
- date_time: Date and time values
- blood_type: Blood type information
- gender: Gender information
- sexuality: Sexual orientation
- political_view: Political affiliations
- race: Race
- ethnicity: Ethnicity information
- religious_belief: Religious affiliations
- language: Language preferences
- education: Education level
- job_title: Professional titles
- employment_status: Employment information
- company_name: Organization names
Custom Entity Types
Beyond these built-in types, you can define custom entities using:

    globals:
      classify:
        enable_classify: true
        entities:
          - first_name
          - last_name
          - email
          - employee_id
          - project_code

Configuration
PII replacement is configured through the replace_pii section. For the full schema, refer to Configuration Reference -- Replacing PII.

    replace_pii:
      globals:
        locales:
          - en_US
      steps:
        - rows:
            update:
              - entity:
                  - email
                  - phone_number
                value: "column.entity | fake"

When to Use PII Replacement
Consider using PII replacement when:
- Your data contains names, addresses, or other direct identifiers
- Compliance requires PII removal before processing
- You want to ensure the model cannot memorize sensitive values
- You need to share synthetic data with external parties
PII replacement is on by default as a pre-processing step before synthesis.