Troubleshooting¶
Symptom-first guide to common problems and how to fix them. Each entry says how to diagnose, what knob to turn, and what to verify after.
When something looks wrong, first confirm the run completed cleanly (no failed_records — these are rows that didn't make it through the pipeline at all, usually a rate-limit / infra issue). Once you know the pipeline ran, run preview and inspect rows with quality issues (needs_human_review=True) — the trace columns it produces are where to start.
This guide is written against the Python API
Diagnostic objects like result.failed_records and result.trace_dataframe only exist in Python. The CLI is for production batch runs and only emits a per-stage summary line on stderr (📋 Detection complete — N entities … (K failed) [Xs]); if K > 0 you have a problem to investigate, but the CLI can't tell you which rows or why. Drop into Python (or import the same config from the agent's config file) for everything below.
Diagnostics: what to read first¶
Most fixes start with one of these.
Did the run actually complete cleanly?¶
Before debugging quality, confirm the run wasn't degraded by infra problems. Anonymizer never silently drops rows — every dropped record appears on result.failed_records with full provenance:
```python
# List every dropped record with the stage it failed at and the reason.
for fr in result.failed_records:
    print(fr.record_id, fr.step, fr.reason)
```
Common patterns and what they mean:
| `reason` substring | Likely cause | Action |
|---|---|---|
| Record missing from workflow output at `step="detection"` | Validator pool exhausted on at least one chunk for this row; usually rate limiting (429s) burning through the AIMD throttler's retry budget | Model/provider config issue, not strategy. See Rate limits / dropped rows below |
| Record missing from workflow output at `step="replace-map-generation"` or `step="rewrite-*"` | Same as above but for the replace/rewrite stage | Same fix |
| Output is missing required tracking column | Internal; file an issue | n/a |
If failed_records is non-empty, fix that first. Strategy/prompt knobs (risk_tolerance, protect, gliner_threshold) won't help — those rows didn't fail because of bad output, they failed because the model never returned an output at all.
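To see at a glance which stage is dropping rows, it can help to tally failures by step and reason. This is a minimal sketch: `FailedRecord` below is a hypothetical stand-in carrying only the three attributes shown above (`record_id`, `step`, `reason`), and in a real session you would pass `result.failed_records` instead of the sample list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailedRecord:
    """Hypothetical stand-in exposing the attributes used in this guide."""
    record_id: str
    step: str
    reason: str


def summarize_failures(failed_records):
    """Count failures per (step, reason) pair, most frequent first."""
    counts = Counter((fr.step, fr.reason) for fr in failed_records)
    return counts.most_common()


# Illustrative data only; real values come from result.failed_records.
sample = [
    FailedRecord("r1", "detection", "Record missing from workflow output"),
    FailedRecord("r2", "detection", "Record missing from workflow output"),
    FailedRecord("r3", "replace-map-generation", "Record missing from workflow output"),
]
summary = summarize_failures(sample)
for (step, reason), n in summary:
    print(f"{n:>3}  {step}: {reason}")
```

If one step dominates the tally, start with the model alias serving that step.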
Rate limits / dropped rows¶
If you have rows failing with Record missing from workflow output, the chain is:
GLiNER candidate set → chunked into validator calls → each call dispatched to a model in the validator pool → ThrottledModelClient does AIMD on 429s → if every alias in the pool fails on at least one chunk, the row drops.
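For intuition about the throttling step, here is a generic AIMD (additive increase, multiplicative decrease) sketch. It is not the `ThrottledModelClient` implementation; the class name, starting limit, and step sizes are assumptions chosen for illustration.

```python
class AimdLimiter:
    """Generic AIMD concurrency limit: grow slowly on success, shrink fast on 429."""

    def __init__(self, limit=8.0, min_limit=1.0, max_limit=64.0,
                 increase=1.0, decrease=0.5):
        self.limit = limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.increase = increase  # added to the limit after each success
        self.decrease = decrease  # multiplied into the limit after each 429

    def on_success(self):
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_rate_limited(self):
        self.limit = max(self.min_limit, self.limit * self.decrease)


limiter = AimdLimiter()
# Two successes, two rate-limit responses, then one more success.
for status in [200, 200, 429, 429, 200]:
    if status == 429:
        limiter.on_rate_limited()
    else:
        limiter.on_success()
print(limiter.limit)
```

In this model a short burst of 429s shrinks the limit much faster than successes rebuild it, which is one way sustained rate limiting can use up a retry budget before a chunk gets through.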
Fix in this order:
- If failures are at `step="detection"`, add aliases to the validator pool in `models.yaml`. The validator is the only role that supports a pool: set `entity_validator` to a list of aliases and chunked validation will round-robin across them, giving you failover when one provider rate-limits. Every other role (detector, augmenter, rewriter, evaluator, etc.) is a single alias.
- Lower `validation_max_entities_per_call` (default `100`) on `Detect` so each call sends fewer tokens, which is easier on tight per-minute token budgets. Helps any stage that's hitting per-call token limits, but most useful for validation.
- Switch the heavy alias to a different `provider` in `providers.yaml`. If you're hammering one tenant's quota, moving to a second deployment of the same model helps more than tuning batch sizes. This is the only lever for non-validator stages (rewrite, evaluate, etc.) since they don't have pools.
- Re-run on just the failed records: filter the input dataframe to those `record_id`s and call `anonymizer.run` again. Failures are usually transient.
See Models and Validator pools for the config shape.
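Selecting just the failed rows for a retry might look like the sketch below. It uses plain dictionaries to stay self-contained and assumes the input rows carry a `record_id` field that matches `fr.record_id`; confirm how record IDs map onto your input before relying on it.

```python
def select_failed_rows(rows, failed_ids):
    """Return only the input rows whose record_id appears in failed_ids."""
    wanted = set(failed_ids)
    return [row for row in rows if row["record_id"] in wanted]


# Illustrative input rows.
rows = [
    {"record_id": "r1", "text": "first note"},
    {"record_id": "r2", "text": "second note"},
    {"record_id": "r3", "text": "third note"},
]
# In a real session: [fr.record_id for fr in result.failed_records]
failed_ids = ["r2"]

retry_rows = select_failed_rows(rows, failed_ids)
print(retry_rows)
```

With a pandas dataframe the equivalent filter is `df[df["record_id"].isin(failed_ids)]`, again assuming such a column exists.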
Read the preview trace¶
Rewrite mode preview returns intermediate columns alongside the rewritten text. Inspect them via result.trace_dataframe:
```python
result = anonymizer.preview(config=config, data=data, num_records=5)
result.trace_dataframe[[
    "_domain",
    "_sensitivity_disposition",
    "leakage_mass",
    "utility_score",
    "needs_human_review",
]]
```
Key columns:
| Column | What it tells you |
|---|---|
| `_domain` | Which domain the classifier picked. Wrong domain → wrong supplement → poor rewrite |
| `_sensitivity_disposition` | Per-entity sensitivity assignments (high/medium/low) |
| `leakage_mass` | Confidence-weighted sum of leaked entities |
| `utility_score` | 0–1 quality preservation score |
| `weighted_leakage_rate` | Leakage normalized by maximum possible leakage |
| `any_high_leaked` | Whether any high-sensitivity entity leaked through |
| `needs_human_review` | Crossed the configured threshold |
| `_judge_evaluation` | Final-judge qualitative comments |
Re-run with Annotate to see detection output¶
When you suspect a detection problem (missed entities, weird labels), run a tiny preview with replace=Annotate() against the same Detect config. The output text shows <original, label> for every detected entity in place — easier to eyeball than the trace columns.
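```python
from anonymizer import Annotate, AnonymizerConfig

# `detect` is the same Detect config you are debugging; only the replace step changes.
debug_config = AnonymizerConfig(detect=detect, replace=Annotate())
preview = anonymizer.preview(config=debug_config, data=data, num_records=5)

# Print the annotated text for the first previewed row.
print(preview.dataframe.iloc[0][f"{data.text_column}_with_spans"])
```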
Detection problems¶
Detection missed an entity I expected¶
Try in order:
1. Lower `gliner_threshold` from `0.3` to `0.2` (or `0.15`). False positives get caught downstream by validation.
2. Extend the default list with the entity's label if it's not in `DEFAULT_ENTITY_LABELS`. Setting `Detect.entity_labels` to a custom list switches detection to strict mode (only listed labels detected, augmenter can't invent), so to keep the defaults plus one extra label use:

   ```python
   from anonymizer import DEFAULT_ENTITY_LABELS, Detect

   detect = Detect(entity_labels=[*DEFAULT_ENTITY_LABELS, "clinical_facility"])
   ```

   Domain-specific labels (`clinical_facility`, `case_number`, `internal_project_codename`) won't be detected reliably without being listed this way.
3. Set AnonymizerInput.data_summary so the augmenter LLM has domain context. A line like "De-identified pediatric oncology progress notes" materially improves coverage.
4. For rewrite mode, latent entities are detected separately. If a piece of inferable information (e.g. "during her third round of chemo" → cancer treatment) is being preserved verbatim, the latent detector likely missed it — refine Rewrite.privacy_goal.protect to call out the inference category explicitly.
Verify by re-running preview with Annotate and confirming the entity now appears tagged.
Too many false-positive entities¶
Symptoms: detected entities include obvious common words, dates that aren't dates, etc.
- Raise `gliner_threshold` to `0.5`. The augmenter will pick up real misses, so this rarely costs recall.
- Lower `validation_excerpt_window_chars` (default `500`) if context-driven validation is being misled by far-away sentences. Smaller per-chunk prompts trade context for precision.
- Sanity-check the validator with an `Annotate` preview. A flaky validator (or a misconfigured alias) returns "keep" on almost everything, which presents as recall going way up. It is easiest to spot by eyeballing the entity list on a handful of rows.
A new domain isn't being detected well¶
Symptom: rewrite output is generic-sounding even though the input is clearly in a specialized domain.
- Inspect `_domain` in `result.trace_dataframe`. If it shows `general` or an unrelated domain, the classifier is missing the cue.
- Set `AnonymizerInput.data_summary` to name the domain explicitly.
- If your domain isn't represented in `DOMAIN_SUPPLEMENT_MAP`, the engine falls back to generic supplements and rewrite quality suffers. This is a code-level extension: file an issue, or add the domain to `src/anonymizer/engine/rewrite/domain_classification.py`.
Rewrite quality¶
leakage_mass is too high¶
leakage_mass is a confidence-weighted sum of leaked entities (high=1.0, medium=0.6, low=0.3). Targets vary by risk_tolerance:
| Tolerance | Repair triggers above | Flagged for review above |
|---|---|---|
| `minimal` | 0.6 | 1.0 |
| `low` | 1.0 | 2.0 |
| `moderate` | 1.5 | 2.5 |
| `high` | 2.0 | 3.0 |
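To make these thresholds concrete, the sketch below scores a set of leaked entities and looks up the result under each tolerance. It assumes each leaked entity contributes its sensitivity weight multiplied by a confidence value; that formula is a guess at what "confidence-weighted" means here, so treat the numbers as illustrative and not as the engine's exact calculation.

```python
SENSITIVITY_WEIGHT = {"high": 1.0, "medium": 0.6, "low": 0.3}

# risk_tolerance -> (repair triggers above, flagged for review above)
THRESHOLDS = {
    "minimal": (0.6, 1.0),
    "low": (1.0, 2.0),
    "moderate": (1.5, 2.5),
    "high": (2.0, 3.0),
}


def leakage_mass(leaked):
    """Sum weight * confidence over (sensitivity, confidence) pairs (assumed formula)."""
    return sum(SENSITIVITY_WEIGHT[level] * confidence for level, confidence in leaked)


def check(mass, risk_tolerance):
    """Return (triggers_repair, flagged_for_review) for a given leakage mass."""
    repair_above, flag_above = THRESHOLDS[risk_tolerance]
    return mass > repair_above, mass > flag_above


# One high and one medium entity leaked, both at full confidence.
leaked = [("high", 1.0), ("medium", 1.0)]
mass = leakage_mass(leaked)
for tolerance in THRESHOLDS:
    print(tolerance, check(mass, tolerance))
```

Under these assumptions the same two leaked entities trigger both repair and review at `minimal`, repair only at `low` and `moderate`, and neither at `high`.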
If you're consistently above your threshold:
- Tighten `risk_tolerance` one step (e.g. `low` → `minimal`). Cheapest knob.
- Refine `privacy_goal.protect` to name the categories that are leaking. Inspect `_sensitivity_disposition` to see which entities the engine deemed protected; anything classified `low` may be slipping through.
- Set `strict_entity_protection=True` to force every detected entity into a protective disposition.
- Increase `max_repair_iterations` from `3` to `5` if the trace shows leakage shrinking across iterations but not finishing.
- Check detection coverage: leakage can't be fixed if the entity wasn't detected. Walk back through "Detection missed an entity I expected" before giving up.
utility_score is too low¶
utility_score measures how well meaning was preserved (0–1). Below ~0.5 is usually unusable; the human-review threshold depends on risk_tolerance (0.3–0.6).
Most common causes:
- `protect` is too aggressive: it's removing things downstream tasks need. Move the over-suppressed content into `preserve`.
- `preserve` is too vague: generic phrasing like "preserve meaning" gives the rewriter no signal. Name the specific facets that matter (clinical findings, argument structure, timeline, etc.).
- `risk_tolerance="minimal"` plus `strict_entity_protection=True` is the most aggressive combination and can over-modify. Loosen one of the two if downstream task quality matters more than blanket coverage.
- Repair loop is over-correcting: inspect repair iterations in the trace. If utility falls each iteration, lower `max_repair_iterations`.
Most rows have needs_human_review=True¶
Three failure modes look the same in the column:
| Cause | Diagnosis |
|---|---|
| Leakage too high | weighted_leakage_rate near 1, or any_high_leaked=True |
| Utility too low | utility_score below flag_utility_below for your tolerance |
| Both | Almost always means protect and preserve are pulling in opposite directions |
For the third case, the fix is rewriting the privacy goal so it draws a cleaner line between "remove this" and "keep this." See Choosing a strategy > Privacy goal.
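The three-way diagnosis can be sketched as a small function over trace rows. The column names come from the trace; the `leakage_rate_cutoff` of 0.8 standing in for "near 1" and the example `flag_utility_below` value are assumptions to adjust for your configuration.

```python
def review_cause(row, flag_utility_below, leakage_rate_cutoff=0.8):
    """Label why a trace row was flagged: leakage, utility, both, or unknown."""
    leaky = row["any_high_leaked"] or row["weighted_leakage_rate"] >= leakage_rate_cutoff
    low_utility = row["utility_score"] < flag_utility_below
    if leaky and low_utility:
        return "both"
    if leaky:
        return "leakage"
    if low_utility:
        return "utility"
    return "unknown"


# Illustrative rows shaped like trace records.
trace_rows = [
    {"any_high_leaked": True, "weighted_leakage_rate": 0.4, "utility_score": 0.8},
    {"any_high_leaked": False, "weighted_leakage_rate": 0.1, "utility_score": 0.2},
    {"any_high_leaked": False, "weighted_leakage_rate": 0.9, "utility_score": 0.3},
]
causes = [review_cause(row, flag_utility_below=0.5) for row in trace_rows]
print(causes)
```

If `result.trace_dataframe` is a pandas dataframe, `result.trace_dataframe.to_dict("records")` yields rows in this shape.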
Repair runs every iteration but never converges¶
Symptom: every record uses all max_repair_iterations and still ends up flagged.
- The leakage threshold for your `risk_tolerance` may be unreachable for your data. Look at the floor of `leakage_mass` across repair iterations. If it plateaus above the threshold, the data has more sensitive content than the threshold permits given current detection coverage.
- Loosen `risk_tolerance` one step, only after confirming detection has caught everything you can see. Loosening before detection is solid just hides the leak.
- Set `max_repair_iterations=0` for an audit pass. You'll get the metrics without paying for repair attempts that won't succeed, which makes it easy to see how far off you are.
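The plateau check can be sketched as follows, given the `leakage_mass` values recorded after each repair iteration. How per-iteration values are exposed in the trace is not covered here, so the input list is an assumption, as is the `min_improvement` tolerance.

```python
def diagnose_repair(masses, threshold, min_improvement=0.05):
    """Classify a repair history as converged, improving, or plateaued."""
    floor = min(masses)
    if floor <= threshold:
        return "converged"
    # Compare the last two iterations to see whether repair is still helping.
    if len(masses) >= 2 and masses[-2] - masses[-1] >= min_improvement:
        return "improving"
    return "plateaued"


stuck = diagnose_repair([2.4, 1.9, 1.9, 1.9], threshold=1.0)
shrinking = diagnose_repair([2.4, 1.8, 1.3], threshold=1.0)
done = diagnose_repair([2.4, 1.2, 0.9], threshold=1.0)
print(stuck, shrinking, done)
```

An "improving" history suggests more repair iterations may finish the job, while "plateaued" points back at detection coverage or the tolerance itself.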
Pipeline / output issues¶
Output rows are missing (FailedRecords)¶
See Did the run actually complete cleanly? and Rate limits / dropped rows. The short version: this is almost always a model/provider issue, not a strategy issue, and the fix is in models.yaml / providers.yaml.
Replacement looks repetitive or unrealistic (Substitute)¶
Symptoms: every "Alice" becomes "Maya," every city becomes "Springfield," every email becomes john@example.com.
- Add domain hints to `Substitute.instructions`; see Choosing a strategy > Writing Substitute.instructions.
- Check the `replacement_generator` model: small models default-collapse to high-frequency names. Try a stronger model for this role.
- If you need stable cross-row mappings, use `Hash` or post-process `result.trace_dataframe["_replacement_map"]`. `Substitute` is consistent within a row but not across rows: repeated mentions of the same value in one row collapse to one replacement (so a person's name stays consistent in a single document), but across rows the LLM has no shared state, so "Alice" in different rows usually gets different replacements.
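One way to post-process for cross-row stability is to build a single global mapping and let the first replacement seen for each original value win. This sketch assumes each row's `_replacement_map` can be read as a dictionary from original value to replacement; the actual shape of that column may differ, and both helper names are hypothetical.

```python
def build_global_map(row_maps):
    """Merge per-row maps; the first replacement seen for a value wins."""
    global_map = {}
    for row_map in row_maps:
        for original, replacement in row_map.items():
            global_map.setdefault(original, replacement)
    return global_map


def harmonize(text, row_map, global_map):
    """Swap a row's local replacements for the globally chosen ones."""
    for original, local in row_map.items():
        text = text.replace(local, global_map[original])
    return text


# Illustrative per-row maps and the substituted text they produced.
row_maps = [{"Alice": "Maya"}, {"Alice": "Priya", "Boston": "Denver"}]
texts = ["Maya called.", "Priya flew to Denver."]

global_map = build_global_map(row_maps)
fixed = [harmonize(t, m, global_map) for t, m in zip(texts, row_maps)]
print(fixed)
```

Plain string replacement can misfire when a replacement value also occurs naturally in the text, so spot-check the output before using it downstream.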
Hash output isn't stable across runs¶
The digest is deterministic given the same algorithm, digest_length, and input text. The label is not part of the digest — but it is templated into the output wrapper (default format_template="<HASH_{label}_{digest}>"). So:
- Digest differs across runs: the detected entity text changed (whitespace, casing, surrounding context). Check the `<your_text_column>_with_spans` column to confirm.
- Only the wrapper differs (digest is the same, but e.g. `<HASH_FIRST_NAME_abc>` becomes `<HASH_NAME_abc>`): the label changed between runs. The digest is still stable; only the wrapper text moved. To avoid label drift in the output, set `format_template="<HASH_{digest}>"` to drop the label entirely.
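The distinction between digest and wrapper can be reproduced with `hashlib`. This is an illustration only: it assumes SHA-256 over the UTF-8 bytes, truncated to `digest_length` hex characters with an arbitrary default of 12, which may not match how `Hash` computes digests internally.

```python
import hashlib


def hash_token(text, label, digest_length=12,
               format_template="<HASH_{label}_{digest}>"):
    """Build a wrapper around a truncated SHA-256 digest (assumed scheme)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:digest_length]
    return format_template.format(label=label, digest=digest)


a = hash_token("Alice", "FIRST_NAME")
b = hash_token("Alice", "NAME")         # same text, different label
c = hash_token("Alice ", "FIRST_NAME")  # trailing space changes the digest
d = hash_token("Alice", "NAME", format_template="<HASH_{digest}>")
e = hash_token("Alice", "FIRST_NAME", format_template="<HASH_{digest}>")

print(a, b, c, d, e, sep="\n")
```

Here `a` and `b` share a digest but differ in the wrapper, `c` has a different digest, and `d` equals `e` because the label no longer appears in the output.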
Validation passed but preview errors at LLM call¶
Configuration is structurally valid but a runtime model call failed. Check:
- The provider for the model alias has an API key set in your environment.
- The base URL is reachable (corporate VPN / proxy).
- The model alias actually exists at the provider: `anonymizer validate` checks the alias is in your config; it doesn't dial out to confirm the model is live.
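A quick preflight for the first two checks can be written with the standard library alone. The environment variable name and URL below are placeholders, not names Anonymizer defines; substitute whatever your provider configuration actually references.

```python
import os
import socket
from urllib.parse import urlparse


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the environment."""
    return [name for name in names if not environ.get(name)]


def can_connect(base_url, timeout=3.0):
    """Try a TCP connection to the URL's host and port; True if it succeeds."""
    parsed = urlparse(base_url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder variable name, checked against a fake environment for the demo.
missing = missing_env_vars(["EXAMPLE_PROVIDER_API_KEY"], environ={})
print(missing)

# Against a real endpoint (placeholder URL):
# can_connect("https://api.example.com/v1")
```

A successful TCP connection only shows the host is reachable from your machine; it does not prove that a proxy will pass the request or that the key is valid.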