# Self-hosting GLiNER
By default, Anonymizer's entity detection stage calls the hosted nvidia/gliner-pii model on build.nvidia.com. For PHI-sensitive workloads whose data cannot leave the host, or for latency-critical setups, you can serve GLiNER locally instead.
The model is small (~500 MB) and runs comfortably on CPU — making it a good fit to run alongside a local LLM without competing for GPU memory. It also runs on GPU if one is available, which cuts detection latency on long documents.
The reference server script (tools/serve_gliner.py) is not installed with pip install nemo-anonymizer — get it from a source checkout of this repository (see Running it below).
## How it works
Anonymizer's detection workflow calls the entity_detector role via an OpenAI-compatible POST /v1/chat/completions endpoint, passing extra parameters through extra_body:
```json
{
  "model": "nvidia/gliner-pii",
  "messages": [{"role": "user", "content": "<the input text>"}],
  "labels": ["first_name", "last_name", "email", ...],
  "threshold": 0.3,
  "chunk_length": 384,
  "overlap": 128,
  "flat_ner": false
}
```
The server must respond with the chat-completion JSON shape, where message.content is a JSON string of the form {"entities": [...]}:
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "{\"entities\": [{\"text\": \"Alice\", \"label\": \"first_name\", \"start\": 0, \"end\": 5, \"score\": 0.94}, ...]}"
    },
    "finish_reason": "stop"
  }]
}
```
Each entity has text, label, start, end, score. The request fields above come from anonymizer.engine.detection.detection_workflow._inject_detector_params (labels, threshold, chunk_length, overlap, flat_ner); the response is parsed by anonymizer.engine.detection.postprocess.parse_raw_entities.
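The request and response shapes above can be sketched from the client side. This is a minimal illustration, not Anonymizer's own code: `build_detection_request` and `decode_entities` are hypothetical helper names, and the default values simply mirror the example request shown earlier.

```python
import json


def build_detection_request(text, labels, threshold=0.3,
                            chunk_length=384, overlap=128, flat_ner=False):
    """Build the JSON body for POST /v1/chat/completions (hypothetical helper)."""
    return {
        "model": "nvidia/gliner-pii",
        "messages": [{"role": "user", "content": text}],
        "labels": list(labels),
        "threshold": threshold,
        "chunk_length": chunk_length,
        "overlap": overlap,
        "flat_ner": flat_ner,
    }


def decode_entities(response):
    """Return the entity list from a chat-completion response dict."""
    content = response["choices"][0]["message"]["content"]
    payload = json.loads(content)  # content is a JSON string, not a nested object
    return payload.get("entities", [])
```

The point to notice is the double decoding: the HTTP body is JSON, and `message.content` inside it is itself a JSON-encoded string that has to be parsed a second time.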
Long inputs are split into overlapping chunks before inference. A self-hosted server should honor chunk_length and overlap so detection matches the hosted build.nvidia.com path, while keeping the chat-completion adapter expected by Anonymizer.
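As a rough sketch of overlapping chunking, the function below slides a window of `chunk_length` items forward by `chunk_length - overlap` each step. It is an assumption-laden illustration: this page does not state whether the two parameters count words or model tokens, so the sketch takes an already-split list, and `split_into_chunks` is a hypothetical name rather than a function from the reference server.

```python
def split_into_chunks(tokens, chunk_length=384, overlap=128):
    """Return (start_index, window) pairs that cover tokens with overlapping windows."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be >= 0 and smaller than chunk_length")
    if not tokens:
        return []
    stride = chunk_length - overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append((start, tokens[start:start + chunk_length]))
        if start + chunk_length >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return chunks
```

Keeping the start index alongside each window is what lets a server map chunk-local entity offsets back to positions in the full input.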
## Reference implementation
A minimal FastAPI reference server at tools/serve_gliner.py in the Anonymizer GitHub repository implements the contract above. It loads nvidia/gliner-pii, exposes POST /v1/chat/completions (and GET /v1/models), and uses two levels of batching:
- Chunk batching: long text is split into overlapping windows; all chunks are passed to one `model.inference(...)` call.
- Request coalescing (optional, on by default): concurrent HTTP requests from DataDesigner are grouped briefly, then all their chunks are inferred together.
```python title="tools/serve_gliner.py (excerpt)"
@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    text = _extract_text(body.get("messages", []))
    params = DetectParams(
        labels=tuple(body.get("labels") or []),
        threshold=float(body.get("threshold", 0.3)),
        chunk_length=int(body.get("chunk_length", 384)),
        overlap=int(body.get("overlap", 128)),
        flat_ner=bool(body.get("flat_ner", False)),
        inference_batch_size=int(body.get("batch_size", 8)),
    )
    entities = await detector.detect(text, params)
    ...
```
When `flat_ner` is `false` (Anonymizer's default), the server removes nested subset spans before score-based deduplication across chunk overlaps.
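One possible reading of that post-processing order is sketched below. It is not the reference server's code: the function names are hypothetical, entities are assumed to already carry document-level offsets, "nested" is taken to mean a span fully contained in a longer one, and duplicates are assumed to be keyed by `(start, end, label)`.

```python
def remove_nested(entities):
    """Drop any span that lies inside a strictly longer span."""
    kept = []
    for e in entities:
        nested = any(
            o["start"] <= e["start"] and e["end"] <= o["end"]
            and (o["end"] - o["start"]) > (e["end"] - e["start"])
            for o in entities
        )
        if not nested:
            kept.append(e)
    return kept


def dedupe_by_score(entities):
    """Keep the highest-scoring entity for each (start, end, label) key."""
    best = {}
    for e in entities:
        key = (e["start"], e["end"], e["label"])
        if key not in best or e["score"] > best[key]["score"]:
            best[key] = e
    return sorted(best.values(), key=lambda e: (e["start"], e["end"]))


def merge_chunk_entities(entities, flat_ner=False):
    """Apply nested-span removal (only when flat_ner is false), then deduplicate."""
    if not flat_ner:
        entities = remove_nested(entities)
    return dedupe_by_score(entities)
```

Deduplication is needed because a span that falls inside the overlap between two chunks can be reported once per chunk, possibly with different scores.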
| Environment variable | Default | Purpose |
|---|---|---|
| `DEVICE` | `auto` | `auto`, `cuda`, `cpu`, or `mps` (Apple Silicon GPU) |
| `GLINER_BATCH_MODE` | `true` | Coalesce concurrent HTTP requests before inference |
| `GLINER_MAX_BATCH_REQUESTS` | `32` | Max requests per coalesced batch |
| `GLINER_BATCH_WAIT_MS` | `10` | Max wait time to fill a batch (milliseconds) |
Set `GLINER_BATCH_MODE=false` to disable request coalescing; chunk batching still runs per request.
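The coalescing behaviour controlled by these variables can be approximated with a small asyncio worker. This is a sketch under stated assumptions, not the implementation in `tools/serve_gliner.py`: `Coalescer` and `infer_batch` are hypothetical names, and `max_requests` and `wait_ms` stand in for `GLINER_MAX_BATCH_REQUESTS` and `GLINER_BATCH_WAIT_MS`.

```python
import asyncio


class Coalescer:
    """Group concurrent requests into one batch before calling infer_batch."""

    def __init__(self, infer_batch, max_requests=32, wait_ms=10):
        self._infer_batch = infer_batch  # callable: list of texts -> list of results
        self._max_requests = max_requests
        self._wait_s = wait_ms / 1000
        self._queue = asyncio.Queue()

    async def submit(self, text):
        """Enqueue one request and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def run(self):
        """Worker loop: collect a batch, run inference once, fan results out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request arrives
            deadline = loop.time() + self._wait_s
            while len(batch) < self._max_requests:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Called synchronously here for brevity; a real server would likely
            # run the model in a worker thread to keep the event loop responsive.
            results = self._infer_batch([text for text, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)
```

The wait window trades a few milliseconds of added latency on a lone request for larger batches when many requests arrive together.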
---
## Running it
!!! note "Source checkout only"
    `tools/serve_gliner.py` ships in the [Anonymizer GitHub repository](https://github.com/NVIDIA-NeMo/Anonymizer), not in the `nemo-anonymizer` wheel. Clone the repo or download the file, then run it from that tree:

    ```bash
    git clone https://github.com/NVIDIA-NeMo/Anonymizer.git
    cd Anonymizer
    pip install fastapi uvicorn gliner
    python tools/serve_gliner.py
    ```
### Dependencies
```bash
pip install fastapi uvicorn gliner
# or with uv
uv pip install fastapi uvicorn gliner
```
On first launch the gliner package will download nvidia/gliner-pii from HuggingFace and cache it under ~/.cache/huggingface/. No HuggingFace token is required (public model).
### Start the server
```bash
python tools/serve_gliner.py
# INFO Uvicorn running on http://127.0.0.1:8001

# Optional: override port (default: 8001)
python tools/serve_gliner.py --port 9000

# Optional: listen on all interfaces (no auth; use only on trusted networks)
python tools/serve_gliner.py --host 0.0.0.0

# Optional: pick device explicitly (auto prefers mps, then cuda, then cpu)
DEVICE=cuda python tools/serve_gliner.py
```
The reference server has no authentication. The default bind address is 127.0.0.1 so detection traffic stays on localhost. Use --host 0.0.0.0 only when Anonymizer runs on another host in a trusted environment.
Verify the server is reachable:
```bash
curl -sf http://localhost:8001/v1/models | python -m json.tool
# {
#   "object": "list",
#   "data": [{"id": "nvidia/gliner-pii", "object": "model"}]
# }
```
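Run a real detection call. It has the same shape as the request Anonymizer sends for the entity_detector role; chunk_length, overlap, and flat_ner are omitted here, so the server's defaults apply: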
```bash
curl -s http://localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/gliner-pii",
    "messages": [{"role": "user", "content": "Hi support, I can'\''t log in! My account username is '\''johndoe88'\''. Every time I try, it says '\''invalid credentials'\''. Please reset my password. You can reach me at (555) 123-4567 or johnd@example.com"}],
    "labels": ["user_name", "phone_number", "email"],
    "threshold": 0.3
  }' | jq -r '.choices[0].message.content' | jq
```
The first jq unwraps choices[0].message.content (an escaped JSON string); the second pretty-prints the decoded payload. Expected output:
```json
{
  "entities": [
    { "text": "johndoe88", "label": "user_name", "start": 52, "end": 61, "score": 0.95 },
    { "text": "(555) 123-4567", "label": "phone_number", "start": 159, "end": 173, "score": 1.00 },
    { "text": "johnd@example.com", "label": "email", "start": 177, "end": 194, "score": 1.00 }
  ]
}
```
An empty "entities": [] means either no labels in the request matched real PII in the text, or the threshold is too high.
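When wiring up a custom server, it can help to confirm that the returned offsets actually index into the input text. The check below is a small sketch using the example request and response from this section; it assumes `start` and `end` are 0-based character offsets with an exclusive end, which is consistent with the values shown above. `check_offsets` is a hypothetical helper name.

```python
def check_offsets(text, entities):
    """Return the entities whose [start, end) slice does not equal their text."""
    return [e for e in entities if text[e["start"]:e["end"]] != e["text"]]


text = (
    "Hi support, I can't log in! My account username is 'johndoe88'. "
    "Every time I try, it says 'invalid credentials'. Please reset my password. "
    "You can reach me at (555) 123-4567 or johnd@example.com"
)
entities = [
    {"text": "johndoe88", "label": "user_name", "start": 52, "end": 61},
    {"text": "(555) 123-4567", "label": "phone_number", "start": 159, "end": 173},
    {"text": "johnd@example.com", "label": "email", "start": 177, "end": 194},
]
mismatches = check_offsets(text, entities)
print(mismatches)  # an empty list means every offset matches its text
```

A non-empty result usually points at chunk-local offsets that were not shifted back to document positions.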
## Pointing Anonymizer at the local server
Pass separate model_providers and model_configs files to Anonymizer. model_configs replaces the entire model pool — it is not merged with defaults. Copy the bundled models.yaml, change only the gliner-pii-detector entry's provider, and keep the other default aliases (gpt-oss-120b, nemotron-30b-thinking). Default role→alias mappings still apply unless you override selected_models (see Custom models).
Custom model_providers also replaces the provider list, so include both your local GLiNER endpoint and the nvidia provider used by the LLM roles:
```yaml title="providers.yaml"
providers:
  - name: local-gliner
    endpoint: http://localhost:8001/v1
    provider_type: openai
    api_key: EMPTY  # ignored; the reference server does not check auth

  - name: nvidia
    endpoint: https://integrate.api.nvidia.com/v1
    provider_type: openai
    api_key: NVIDIA_API_KEY
```

```bash
export NVIDIA_API_KEY="your-nvidia-api-key"
```

```yaml title="models.yaml"
model_configs:
  - alias: gliner-pii-detector
    model: nvidia/gliner-pii
    provider: local-gliner
    skip_health_check: true  # the default health check sends no `labels`, which GLiNER can't handle
    inference_parameters:
      max_parallel_requests: 8  # send concurrent rows; the reference server batches them
      timeout: 120

  - alias: gpt-oss-120b
    model: openai/gpt-oss-120b
    provider: nvidia
    inference_parameters:
      max_parallel_requests: 16
      max_tokens: 16384
      temperature: 0.3
      top_p: 0.95
      timeout: 300

  - alias: nemotron-30b-thinking
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    inference_parameters:
      max_parallel_requests: 16
      max_tokens: 8192
      temperature: 0.4
      top_p: 1.0
      timeout: 300
```

```python
from anonymizer import Anonymizer

anonymizer = Anonymizer(
    model_providers="providers.yaml",
    model_configs="models.yaml",
)
```
Set skip_health_check: true on the detector alias: Anonymizer's default probe sends prompt="Hello!" with no labels field, which is not a valid GLiNER request.
## Performance notes
- Batch mode: The reference server coalesces concurrent detector requests by default. Pair it with a higher `max_parallel_requests` on the `gliner-pii-detector` alias (see YAML above) so DataDesigner sends multiple rows at once and the server fills GPU batches efficiently.
- On CPU, detection of a ~1000-character note with ~30 candidate labels takes 5–20 ms per request on a modern x86 core. For typical Anonymizer workflows this is a rounding error compared to the LLM roles that follow, and keeping GLiNER on CPU frees GPU memory for the LLM.
- On GPU the same request drops to roughly 1–3 ms, which is worth it when you're processing tens of thousands of documents in a batch workflow, or when the host has spare GPU memory next to the LLM.
- Choose device with the `DEVICE` environment variable (`auto`, `cuda`, `mps`, `cpu`). `auto` prefers Apple Silicon GPU (MPS), then NVIDIA CUDA, then CPU.
- The default GLiNER threshold is `0.3`. Lower values detect more spans (higher recall, more false positives); higher values improve precision but miss edge cases. Tune via `Detect(gliner_threshold=...)`.
- Each request loads the FULL list of candidate labels passed from `Detect.entity_labels`. If you only need a subset (e.g. a clinical-only deployment), narrowing that list materially speeds up detection.