🕵️ Rewriting Legal Documents
Rewrite legal text from the TAB dataset using a domain-specific privacy goal and custom entity labels tailored to legal proceedings.
📚 What you'll learn
- Define domain-specific entity labels for legal text (case numbers, court names, etc.)
- Configure rewrite mode with legal-specific privacy goals
- Preview and run on court decision documents
- Triage flagged records with needs_human_review
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
  - Treat the default build.nvidia.com setup as a convenient experimentation path. For privacy-sensitive or production data, switch to a secure endpoint you trust and to which you are comfortable sending data.
  - Request/token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full runs. Start with preview() on a small sample.
- Import Detect (for custom entity labels), Rewrite, and its config classes.
- Anonymizer() initializes with the default model provider -- no extra config needed.
- configure_logging(enabled=False) suppresses pipeline logs for cleaner output. Switch to configure_logging(LoggingConfig.debug()) when troubleshooting.
In [1]:
import getpass
import os

# Prompt for the key only when it is not already set in the environment.
if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [2]:
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Detect, Rewrite, configure_logging, LoggingConfig
from anonymizer.config.rewrite import PrivacyGoal
configure_logging(LoggingConfig.default())
In [3]:
anonymizer = Anonymizer()
[16:41:39] [INFO] 🔧 Anonymizer initialized with 3 model configs
[16:41:39] [INFO] |-- 🔎 detector: gliner-pii-detector
[16:41:39] [INFO] |-- ✅ validator: gpt-oss-120b
[16:41:39] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Input data
- TAB (Text Anonymization Benchmark) legal documents -- court decisions containing names, dates, case numbers, and other legal identifiers.
- LEGAL_ENTITY_LABELS defines the domain-specific entity types to detect. This replaces the default label set with one tailored to legal text.
In [4]:
LEGAL_ENTITY_LABELS = [
    "first_name",
    "last_name",
    "court_name",
    "organization_name",
    "company_name",
    "prison_detention_facility",
    "street_address",
    "city",
    "state",
    "country",
    "date",
    "date_time",
    "time",
    "date_of_birth",
    "age",
    "email",
    "phone_number",
    "ssn",
    "unique_id",
    "legal_role",
    "case_number",
    "application_number",
    "monetary_amount",
    "sentence_duration",
    "nationality",
]

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/TAB_legal_sample25.csv",
    text_column="text",
    data_summary="Legal court decisions containing personal identifiers, case numbers, and institutional references",
)
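Before passing a hand-written label list to Detect, it can help to check it for accidental duplicates or inconsistent formatting. The sketch below is a minimal, library-independent check. The helper name check_labels is hypothetical, and the snake_case rule is only an assumption based on the naming convention used in the list above, not a requirement stated by Anonymizer.

```python
import re
from collections import Counter

# Shortened copy of the label list for illustration; in the notebook,
# pass the full LEGAL_ENTITY_LABELS defined above instead.
labels = ["first_name", "last_name", "court_name", "case_number", "application_number"]

# Assumption: labels follow the lowercase snake_case convention seen above.
SNAKE_CASE = re.compile(r"^[a-z]+(_[a-z]+)*$")


def check_labels(labels):
    """Return (duplicates, malformed) for a list of label strings."""
    counts = Counter(labels)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    malformed = sorted(name for name in counts if not SNAKE_CASE.match(name))
    return duplicates, malformed


duplicates, malformed = check_labels(labels)
print("duplicates:", duplicates)
print("malformed:", malformed)
```

Both lists being empty means the label list is at least internally consistent; it says nothing about whether the detector recognizes each label.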
🎛️ Configure
- Detect(entity_labels=...) overrides the default entity set with legal-specific labels.
- PrivacyGoal tells the rewriter what to protect (identifiers, case numbers, institutional references) and what to preserve (legal reasoning, statutory references, ruling structure).
In [5]:
config = AnonymizerConfig(
    detect=Detect(
        entity_labels=LEGAL_ENTITY_LABELS,
    ),
    rewrite=Rewrite(
        privacy_goal=PrivacyGoal(
            protect="All personal identifiers, case numbers, court names, and institutional references that could identify parties",
            preserve="Legal reasoning, procedural facts, statutory references, and the structure of the ruling",
        ),
        risk_tolerance="minimal",
        max_repair_iterations=3,
    ),
)
👁️ Preview
- Preview on a few records to check that legal entities are detected and the rewrite preserves the ruling's structure.
In [6]:
preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)
preview.display_record(0)
[16:41:39] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/TAB_legal_sample25.csv (column: 'text')
[16:41:39] [INFO] detection labels in scope: ['age', 'application_number', 'case_number', 'city', 'company_name', 'country', 'court_name', 'date', 'date_of_birth', 'date_time', 'email', 'first_name', 'last_name', 'legal_role', 'monetary_amount', 'nationality', 'organization_name', 'phone_number', 'prison_detention_facility', 'sentence_duration', 'ssn', 'state', 'street_address', 'time', 'unique_id']
[16:41:39] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[16:41:39] [INFO] 🔍 Running entity detection on 3 records
[16:42:17] [INFO] |-- 📋 Detection complete — 141 entities found across 3 records (0 failed) [37.8s]
[16:42:17] [INFO] |-- labels: date=51, court_name=35, legal_role=10, nationality=7, last_name=7, organization_name=6, country=5, first_name=5, city=5, application_number=3, date_of_birth=3, monetary_amount=2, case_number=1, sentence_duration=1
[16:42:17] [INFO] ✏️ Running rewrite pipeline
[16:45:15] [INFO] Evaluate-repair loop iteration 0: 2/3 rows need repair
[16:46:10] [INFO] Evaluate-repair loop iteration 1: 1/3 rows need repair
[16:46:56] [INFO] Evaluate-repair loop: all rows pass at iteration 2
[16:47:13] [INFO] |-- 📋 Rewrite complete (0 failed) [296.2s]
[16:47:13] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
In [7]:
preview.display_record(1)
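One quick way to judge whether the custom labels are doing useful work is to compare the labels requested with the labels that actually fired in the preview. The sketch below parses the "labels:" summary copied from the preview log and lists the requested labels with no detections. It is plain string handling, not an Anonymizer API, and the helper name parse_label_counts is hypothetical.

```python
# The "labels:" summary copied from the preview log.
summary = (
    "date=51, court_name=35, legal_role=10, nationality=7, last_name=7, "
    "organization_name=6, country=5, first_name=5, city=5, application_number=3, "
    "date_of_birth=3, monetary_amount=2, case_number=1, sentence_duration=1"
)

# Labels requested via Detect(entity_labels=...), as echoed in the log's
# "detection labels in scope" line.
requested = [
    "age", "application_number", "case_number", "city", "company_name",
    "country", "court_name", "date", "date_of_birth", "date_time", "email",
    "first_name", "last_name", "legal_role", "monetary_amount", "nationality",
    "organization_name", "phone_number", "prison_detention_facility",
    "sentence_duration", "ssn", "state", "street_address", "time", "unique_id",
]


def parse_label_counts(summary):
    """Turn 'a=1, b=2' into {'a': 1, 'b': 2}."""
    counts = {}
    for item in summary.split(","):
        name, _, value = item.strip().partition("=")
        counts[name] = int(value)
    return counts


counts = parse_label_counts(summary)
missing = sorted(set(requested) - set(counts))
print(f"{sum(counts.values())} entities across {len(counts)} labels")
print("no detections for:", missing)
```

A label with no detections in a three-record preview is not necessarily a problem: the sample may simply contain no such entity. It is a prompt to check a few source documents by hand before the full run.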
🚀 Full run
- result.dataframe has user-facing columns: rewritten text, scores, and the review flag.
In [8]:
result = anonymizer.run(config=config, data=input_data)
result.dataframe.head()
[16:47:13] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/TAB_legal_sample25.csv (column: 'text')
[16:47:13] [INFO] detection labels in scope: ['age', 'application_number', 'case_number', 'city', 'company_name', 'country', 'court_name', 'date', 'date_of_birth', 'date_time', 'email', 'first_name', 'last_name', 'legal_role', 'monetary_amount', 'nationality', 'organization_name', 'phone_number', 'prison_detention_facility', 'sentence_duration', 'ssn', 'state', 'street_address', 'time', 'unique_id']
[16:47:13] [INFO] 🔍 Running entity detection on 25 records
[16:51:28] [INFO] |-- 📋 Detection complete — 1285 entities found across 25 records (0 failed) [254.7s]
[16:51:28] [INFO] |-- labels: date=418, court_name=241, legal_role=167, last_name=84, organization_name=76, first_name=62, city=47, nationality=46, country=43, application_number=26, date_of_birth=25, prison_detention_facility=17, sentence_duration=13, monetary_amount=10, state=4, unique_id=2, case_number=1, age=1, time=1, company_name=1
[16:51:28] [INFO] ✏️ Running rewrite pipeline
[17:01:53] [WARNING] Row count mismatch: target=25, source=24; dropping 1 failed row(s).
[17:05:12] [INFO] Evaluate-repair loop iteration 0: 16/24 rows need repair
[17:08:02] [INFO] Evaluate-repair loop iteration 1: 9/24 rows need repair
[17:09:34] [INFO] Evaluate-repair loop iteration 2: 7/24 rows need repair
[17:11:39] [INFO] |-- 📋 Rewrite complete (1 failed) [1211.3s]
[17:11:39] [WARNING] 1 record(s) failed during pipeline processing.
[17:11:39] [INFO] 🎉 Pipeline complete — 25 records processed, 1 total failures
Out[8]:
| | text | text_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
| 0 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.86 | 0.9 | 0.056962 | True | True |
| 1 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.841667 | 0.54 | 0.020769 | False | False |
| 2 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.857143 | 0.0 | 0.0 | False | False |
| 3 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.982353 | 0.0 | 0.0 | False | False |
| 4 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.984615 | 0.57 | 0.033529 | False | False |
In [9]:
result.dataframe[["text_rewritten", "utility_score", "leakage_mass", "needs_human_review"]].head()
Out[9]:
| | text_rewritten | utility_score | leakage_mass | needs_human_review |
|---|---|---|---|---|
| 0 | PROCEDURE The case originated in an applicati... | 0.86 | 0.9 | True |
| 1 | PROCEDURE The case originated in an applicati... | 0.841667 | 0.54 | False |
| 2 | PROCEDURE The case originated in an applicati... | 0.857143 | 0.0 | False |
| 3 | PROCEDURE The case originated in an applicati... | 0.982353 | 0.0 | False |
| 4 | PROCEDURE The case originated in an applicati... | 0.984615 | 0.57 | False |
🚩 Filter by review flag
- Records where automated metrics exceed thresholds are flagged for manual review.
- Use this to prioritize human attention on the records that need it most.
- See Working with flagged records for guidance on diagnosing and resolving flagged records.
In [10]:
df = result.dataframe
flagged = df[df["needs_human_review"] == True] # noqa: E712
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()
7 of 24 records flagged for human review
Out[10]:
| | text | text_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
| 0 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.86 | 0.9 | 0.056962 | True | True |
| 6 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.966667 | 1.6 | 0.070796 | True | True |
| 10 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.873333 | 1.85 | 0.064685 | True | True |
| 12 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 0.9 | 4.3 | 0.119444 | True | True |
| 18 | PROCEDURE The case originated in an applicati... | PROCEDURE The case originated in an applicati... | 1.0 | 4.38 | 0.183264 | True | True |
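Once the flagged subset is isolated, a reviewer may want to work through it in order of severity. The sketch below uses a stand-in DataFrame whose values are copied from the flagged rows shown above, and sorts it so the highest-leakage records come first. Sorting by weighted_leakage_rate is an assumption about what "most urgent" means here; leakage_mass or any_high_leaked may be a better key for your data.

```python
import pandas as pd

# Stand-in for the flagged slice of result.dataframe; values are copied
# from the output table, with the text columns omitted for brevity.
flagged = pd.DataFrame(
    {
        "utility_score": [0.86, 0.966667, 0.873333, 0.9, 1.0],
        "leakage_mass": [0.9, 1.6, 1.85, 4.3, 4.38],
        "weighted_leakage_rate": [0.056962, 0.070796, 0.064685, 0.119444, 0.183264],
        "needs_human_review": [True] * 5,
    },
    index=[0, 6, 10, 12, 18],
)

# Put the records with the highest leakage rate at the top of the queue.
review_queue = flagged.sort_values("weighted_leakage_rate", ascending=False)
print(review_queue[["leakage_mass", "weighted_leakage_rate", "utility_score"]])
```

In the notebook itself, the same sort_values call can be applied directly to the flagged DataFrame built in the cell above.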
⏭️ Next steps
- 🔍 Inspecting Detected Entities -- debug what the detection pipeline found before rewriting.
- Try it on your own data! Swap in your CSV, define entity labels for your domain, and set a PrivacyGoal that fits -- you've got all the building blocks.