🕵️ Rewriting Legal Documents
This notebook rewrites legal text from the TAB dataset using a domain-specific privacy goal and custom entity labels tailored to legal proceedings.
📚 What you'll learn
- Define domain-specific entity labels for legal text (case numbers, court names, etc.)
- Configure rewrite mode with legal-specific privacy goals
- Preview and run on court decision documents
- Triage flagged records with `needs_human_review`
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - The default `build.nvidia.com` (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on `build.nvidia.com` vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with `preview()` on a small sample, then move to your own endpoint for production data and usage.
- Import `Detect` (for custom entity labels), `Rewrite`, and its config classes.
- `Anonymizer()` initializes with the default model provider -- no extra config needed.
- `configure_logging(LoggingConfig.default())` keeps logs at INFO. Switch to `LoggingConfig.debug()` when troubleshooting.
In [1]:
import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [ ]:
from anonymizer import (
    Anonymizer,
    AnonymizerConfig,
    AnonymizerInput,
    Detect,
    LoggingConfig,
    PrivacyGoal,
    Rewrite,
    configure_logging,
)

configure_logging(LoggingConfig.default())
In [3]:
anonymizer = Anonymizer()
[16:41:39] [INFO] 🔧 Anonymizer initialized with 3 model configs
[16:41:39] [INFO] |-- 🔎 detector: gliner-pii-detector
[16:41:39] [INFO] |-- ✅ validator: gpt-oss-120b
[16:41:39] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Input data
- TAB (Text Anonymization Benchmark) legal documents -- court decisions containing names, dates, case numbers, and other legal identifiers.
- `LEGAL_ENTITY_LABELS` defines the domain-specific entity types to detect. This replaces the default label set with one tailored to legal text.
In [4]:
LEGAL_ENTITY_LABELS = [
    "first_name",
    "last_name",
    "court_name",
    "organization_name",
    "company_name",
    "prison_detention_facility",
    "street_address",
    "city",
    "state",
    "country",
    "date",
    "date_time",
    "time",
    "date_of_birth",
    "age",
    "email",
    "phone_number",
    "ssn",
    "unique_id",
    "legal_role",
    "case_number",
    "application_number",
    "monetary_amount",
    "sentence_duration",
    "nationality",
]

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/TAB_legal_sample25.csv",
    text_column="text",
    data_summary="Legal court decisions containing personal identifiers, case numbers, and institutional references",
)
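Before handing a custom label list to `Detect`, it can help to sanity-check it for accidental duplicates and inconsistent naming. The sketch below is plain Python and does not call Anonymizer; `check_labels` is a hypothetical helper, and the snake_case rule is an assumption based on the naming style of the labels above, not a documented requirement of the library.

```python
import re
from collections import Counter

# Hypothetical helper, not part of the Anonymizer API.
# Assumption: labels follow the snake_case style used in LEGAL_ENTITY_LABELS.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_labels(labels):
    """Return a list of human-readable problems found in a label list."""
    problems = []
    for label, count in Counter(labels).items():
        if count > 1:
            problems.append(f"duplicate label: {label!r} appears {count} times")
        if not SNAKE_CASE.match(label):
            problems.append(f"not snake_case: {label!r}")
    return problems


# A few labels from the list above, plus two deliberate mistakes.
sample_labels = ["first_name", "court_name", "case_number", "case_number", "Court Name"]
label_problems = check_labels(sample_labels)
for problem in label_problems:
    print(problem)
```

Running the same check on the full `LEGAL_ENTITY_LABELS` list should report no problems, since every label there is unique and snake_case.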
🎛️ Configure
- `Detect(entity_labels=...)` overrides the default entity set with legal-specific labels.
- `PrivacyGoal` tells the rewriter what to protect (identifiers, case numbers, institutional references) and what to preserve (legal reasoning, statutory references, ruling structure).
In [5]:
config = AnonymizerConfig(
    detect=Detect(
        entity_labels=LEGAL_ENTITY_LABELS,
    ),
    rewrite=Rewrite(
        privacy_goal=PrivacyGoal(
            protect="All personal identifiers, case numbers, court names, and institutional references that could identify parties",
            preserve="Legal reasoning, procedural facts, statutory references, and the structure of the ruling",
        ),
        risk_tolerance="minimal",
        max_repair_iterations=3,
    ),
)
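The run logs in this notebook report an "Evaluate-repair loop" with numbered iterations, which suggests that `max_repair_iterations` caps how many repair passes a record can receive. The sketch below illustrates that general pattern with toy functions. It is not Anonymizer's implementation: `evaluate_repair`, `passes`, and `repair_once` are hypothetical names, and the "SECRET" marker stands in for whatever the real evaluator checks.

```python
def evaluate_repair(rows, evaluate, repair, max_repair_iterations=3):
    """Generic bounded evaluate-repair loop (illustrative only).

    evaluate(text) returns True when a text passes.
    repair(text) returns an improved text.
    Returns the final rows and the indices that still fail.
    """
    rows = list(rows)
    failing = [i for i, text in enumerate(rows) if not evaluate(text)]
    for iteration in range(max_repair_iterations):
        if not failing:
            break
        print(f"iteration {iteration}: {len(failing)}/{len(rows)} rows need repair")
        for i in failing:
            rows[i] = repair(rows[i])
        # Only re-check the rows that were just repaired.
        failing = [i for i in failing if not evaluate(rows[i])]
    return rows, failing


def passes(text):
    # Toy check: a row passes once no marker remains.
    return "SECRET" not in text


def repair_once(text):
    # Toy repair: fix one marker per pass, so some rows need several passes.
    return text.replace("SECRET", "[REDACTED]", 1)


toy_rows = ["no issues", "SECRET", "SECRET SECRET", "SECRET " * 5]
repaired, still_failing = evaluate_repair(toy_rows, passes, repair_once)
print(still_failing)  # rows that would be left for a human reviewer
```

In this toy run the last row still fails after three passes, which mirrors how a bounded loop can leave some records for human review instead of retrying indefinitely.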
👁️ Preview
- Preview on a few records to check that legal entities are detected and the rewrite preserves the ruling's structure.
In [6]:
preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)
preview.display_record(0)
[16:41:39] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/TAB_legal_sample25.csv (column: 'text')
[16:41:39] [INFO] detection labels in scope: ['age', 'application_number', 'case_number', 'city', 'company_name', 'country', 'court_name', 'date', 'date_of_birth', 'date_time', 'email', 'first_name', 'last_name', 'legal_role', 'monetary_amount', 'nationality', 'organization_name', 'phone_number', 'prison_detention_facility', 'sentence_duration', 'ssn', 'state', 'street_address', 'time', 'unique_id']
[16:41:39] [INFO] 🔍 Running entity detection on 3 records
[16:42:17] [INFO] |-- 📋 Detection complete — 141 entities found across 3 records (0 failed) [37.8s]
[16:42:17] [INFO] |-- labels: date=51, court_name=35, legal_role=10, nationality=7, last_name=7, organization_name=6, country=5, first_name=5, city=5, application_number=3, date_of_birth=3, monetary_amount=2, case_number=1, sentence_duration=1
[16:42:17] [INFO] ✏️ Running rewrite pipeline
[16:45:15] [INFO] Evaluate-repair loop iteration 0: 2/3 rows need repair
[16:46:10] [INFO] Evaluate-repair loop iteration 1: 1/3 rows need repair
[16:46:56] [INFO] Evaluate-repair loop: all rows pass at iteration 2
[16:47:13] [INFO] |-- 📋 Rewrite complete (0 failed) [296.2s]
[16:47:13] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
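The per-label counts in the detection log are a quick way to confirm that the custom labels are actually firing. As a minimal sketch, the snippet below parses the `labels:` line copied from the preview log above into a dictionary and checks it against the reported total. `parse_label_counts` is a hypothetical helper, and it assumes the `name=count` log format shown here stays stable.

```python
import re

# Label-count line copied from the preview log above.
log_line = (
    "labels: date=51, court_name=35, legal_role=10, nationality=7, last_name=7, "
    "organization_name=6, country=5, first_name=5, city=5, application_number=3, "
    "date_of_birth=3, monetary_amount=2, case_number=1, sentence_duration=1"
)


def parse_label_counts(line):
    """Turn 'labels: a=1, b=2' into {'a': 1, 'b': 2}."""
    return {name: int(count) for name, count in re.findall(r"(\w+)=(\d+)", line)}


label_counts = parse_label_counts(log_line)
total = sum(label_counts.values())
top_three = sorted(label_counts.items(), key=lambda item: item[1], reverse=True)[:3]

print(total)      # 141, matching "141 entities found" in the log
print(top_three)  # [('date', 51), ('court_name', 35), ('legal_role', 10)]
```

Here `date` and `court_name` together account for 86 of the 141 detected entities, so those two labels are worth a close look when reviewing the preview records.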
In [7]:
preview.display_record(1)
🚀 Full run
- `result.dataframe` has user-facing columns: rewritten text, scores, and the review flag.
In [8]:
result = anonymizer.run(config=config, data=input_data)
result.dataframe.head()
[16:47:13] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/TAB_legal_sample25.csv (column: 'text')
[16:47:13] [INFO] detection labels in scope: ['age', 'application_number', 'case_number', 'city', 'company_name', 'country', 'court_name', 'date', 'date_of_birth', 'date_time', 'email', 'first_name', 'last_name', 'legal_role', 'monetary_amount', 'nationality', 'organization_name', 'phone_number', 'prison_detention_facility', 'sentence_duration', 'ssn', 'state', 'street_address', 'time', 'unique_id']
[16:47:13] [INFO] 🔍 Running entity detection on 25 records
[16:51:28] [INFO] |-- 📋 Detection complete — 1285 entities found across 25 records (0 failed) [254.7s]
[16:51:28] [INFO] |-- labels: date=418, court_name=241, legal_role=167, last_name=84, organization_name=76, first_name=62, city=47, nationality=46, country=43, application_number=26, date_of_birth=25, prison_detention_facility=17, sentence_duration=13, monetary_amount=10, state=4, unique_id=2, case_number=1, age=1, time=1, company_name=1
[16:51:28] [INFO] ✏️ Running rewrite pipeline
[17:05:12] [INFO] Evaluate-repair loop iteration 0: 16/25 rows need repair
[17:08:02] [INFO] Evaluate-repair loop iteration 1: 9/25 rows need repair
[17:09:34] [INFO] Evaluate-repair loop iteration 2: 7/25 rows need repair
[17:11:39] [INFO] |-- 📋 Rewrite complete (0 failed) [1211.3s]
[17:11:39] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures
Out[8]:
| | text | text_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
| 0 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.86 | 0.9 | 0.056962 | True | True |
| 1 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.841667 | 0.54 | 0.020769 | False | False |
| 2 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.857143 | 0.0 | 0.0 | False | False |
| 3 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.982353 | 0.0 | 0.0 | False | False |
| 4 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.984615 | 0.57 | 0.033529 | False | False |
In [9]:
result.dataframe[["text_rewritten", "utility_score", "leakage_mass", "needs_human_review"]].head()
Out[9]:
| | text_rewritten | utility_score | leakage_mass | needs_human_review |
|---|---|---|---|---|
| 0 | PROCEDURE The case originated in an applicati... | 0.86 | 0.9 | True |
| 1 | PROCEDURE The case originated in an applicati... | 0.841667 | 0.54 | False |
| 2 | PROCEDURE The case originated in an applicati... | 0.857143 | 0.0 | False |
| 3 | PROCEDURE The case originated in an applicati... | 0.982353 | 0.0 | False |
| 4 | PROCEDURE The case originated in an applicati... | 0.984615 | 0.57 | False |
🚩 Filter by review flag
- Records where automated metrics exceed thresholds are flagged for manual review.
- Use this to prioritize human attention on the records that need it most.
- See Working with flagged records for guidance on diagnosing and resolving flagged records.
In [10]:
df = result.dataframe
flagged = df[df["needs_human_review"] == True] # noqa: E712
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()
7 of 25 records flagged for human review
Out[10]:
| | text | text_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
| 0 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.86 | 0.9 | 0.056962 | True | True |
| 6 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.966667 | 1.6 | 0.070796 | True | True |
| 10 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.873333 | 1.85 | 0.064685 | True | True |
| 12 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.9 | 4.3 | 0.119444 | True | True |
| 18 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 1.0 | 4.38 | 0.183264 | True | True |
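One way to prioritize human attention is to order the flagged records so reviewers see the leakiest rewrites first. The sketch below uses the metric values copied from the five flagged rows shown above, held in plain Python dictionaries so it runs without the notebook's `result` object. Sorting by `weighted_leakage_rate` is an assumption made for illustration; `leakage_mass` would be another reasonable key.

```python
# Metric values copied from the flagged rows shown above.
flagged_rows = [
    {"index": 0, "utility_score": 0.86, "leakage_mass": 0.9, "weighted_leakage_rate": 0.056962},
    {"index": 6, "utility_score": 0.966667, "leakage_mass": 1.6, "weighted_leakage_rate": 0.070796},
    {"index": 10, "utility_score": 0.873333, "leakage_mass": 1.85, "weighted_leakage_rate": 0.064685},
    {"index": 12, "utility_score": 0.9, "leakage_mass": 4.3, "weighted_leakage_rate": 0.119444},
    {"index": 18, "utility_score": 1.0, "leakage_mass": 4.38, "weighted_leakage_rate": 0.183264},
]

# Highest leakage rate first, so the riskiest records are reviewed first.
review_queue = sorted(
    flagged_rows,
    key=lambda row: row["weighted_leakage_rate"],
    reverse=True,
)
review_order = [row["index"] for row in review_queue]
print(review_order)  # [18, 12, 6, 10, 0]
```

On the actual DataFrame, the equivalent pandas call is `flagged.sort_values("weighted_leakage_rate", ascending=False)`.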
⏭️ Next steps
- 🔍 Inspecting Detected Entities -- debug what the detection pipeline found before rewriting.
- Try it on your own data! Swap in your CSV, define entity labels for your domain, and set a `PrivacyGoal` that fits -- you've got all the building blocks.
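When swapping in your own CSV, a quick pre-flight check can catch a missing or mostly empty text column before any model calls are made. The sketch below uses only the standard library; `check_text_column` is a hypothetical helper, and the demo file is a throwaway created just for illustration.

```python
import csv
import os
import tempfile


def check_text_column(path, text_column="text"):
    """Return (row_count, empty_count) for text_column, or raise if it is missing."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if text_column not in (reader.fieldnames or []):
            raise ValueError(f"column {text_column!r} not found in {reader.fieldnames}")
        row_count = 0
        empty_count = 0
        for record in reader:
            row_count += 1
            if not (record[text_column] or "").strip():
                empty_count += 1
    return row_count, empty_count


# Demo on a throwaway file; replace demo_path with the path to your own CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "text"])
        writer.writerow([1, "The applicant lodged an appeal."])
        writer.writerow([2, ""])
    summary = check_text_column(demo_path)

print(summary)  # (2, 1): two rows, one of them with empty text
```

The column name passed here should match the `text_column` value given to `AnonymizerInput`.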