Skip to content

Self-hosting GLiNER

By default, Anonymizer's entity detection stage calls the hosted nvidia/gliner-pii model on build.nvidia.com. For PHI-sensitive workloads that cannot leave the host, or latency-critical setups, you can serve GLiNER locally instead.

The model is small (~500 MB) and runs comfortably on CPU — making it a good fit to run alongside a local LLM without competing for GPU memory. It also runs on GPU if one is available, which cuts detection latency on long documents.

The reference server script (tools/serve_gliner.py) is not installed with pip install nemo-anonymizer — get it from a source checkout of this repository (see Running it below).


How it works

Anonymizer's detection workflow calls the entity_detector role via an OpenAI-compatible POST /v1/chat/completions endpoint, passing extra parameters through extra_body:

{
    "model": "nvidia/gliner-pii",
    "messages": [{"role": "user", "content": "<the input text>"}],
    "labels": ["first_name", "last_name", "email", ...],
    "threshold": 0.3,
    "chunk_length": 384,
    "overlap": 128,
    "flat_ner": false
}

The server must respond with the chat-completion JSON shape, where message.content is a JSON string of the form {"entities": [...]}:

{
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "{\"entities\": [{\"text\": \"Alice\", \"label\": \"first_name\", \"start\": 0, \"end\": 5, \"score\": 0.94}, ...]}"
        },
        "finish_reason": "stop"
    }]
}

Each entity has text, label, start, end, score. The request fields above come from anonymizer.engine.detection.detection_workflow._inject_detector_params (labels, threshold, chunk_length, overlap, flat_ner); the response is parsed by anonymizer.engine.detection.postprocess.parse_raw_entities.

Long inputs are split into overlapping chunks before inference. A self-hosted server should honor chunk_length and overlap so detection matches the hosted build.nvidia.com path, while keeping the chat-completion adapter expected by Anonymizer.


Reference implementation

A minimal FastAPI reference server at tools/serve_gliner.py in the Anonymizer GitHub repository implements the contract above. It loads nvidia/gliner-pii, exposes POST /v1/chat/completions (and GET /v1/models), and uses two levels of batching:

  1. Chunk batching — long text is split into overlapping windows; all chunks are passed to one model.inference(...) call.
  2. Request coalescing (optional, on by default) — concurrent HTTP requests from DataDesigner are grouped briefly, then all their chunks are inferred together.

```python title="tools/serve_gliner.py (excerpt)" @app.post("/v1/chat/completions") async def chat_completions(request: Request): body = await request.json() text = _extract_text(body.get("messages", [])) params = DetectParams( labels=tuple(body.get("labels") or []), threshold=float(body.get("threshold", 0.3)), chunk_length=int(body.get("chunk_length", 384)), overlap=int(body.get("overlap", 128)), flat_ner=bool(body.get("flat_ner", False)), inference_batch_size=int(body.get("batch_size", 8)), ) entities = await detector.detect(text, params) ...


When `flat_ner` is `false` (Anonymizer's default), the server removes nested subset spans before score-based deduplication across chunk overlaps.

| Environment variable | Default | Purpose |
|---|---|---|
| `DEVICE` | `auto` | `auto`, `cuda`, `cpu`, or `mps` (Apple Silicon GPU) |
| `GLINER_BATCH_MODE` | `true` | Coalesce concurrent HTTP requests before inference |
| `GLINER_MAX_BATCH_REQUESTS` | `32` | Max requests per coalesced batch |
| `GLINER_BATCH_WAIT_MS` | `10` | Max wait time to fill a batch (milliseconds) |

Set `GLINER_BATCH_MODE=false` to disable request coalescing; chunk batching still runs per request.

---

## Running it

!!! note "Source checkout only"

    `tools/serve_gliner.py` ships in the [Anonymizer GitHub repository](https://github.com/NVIDIA-NeMo/Anonymizer), not in the `nemo-anonymizer` wheel. Clone the repo or download the file, then run it from that tree:

    ```bash
    git clone https://github.com/NVIDIA-NeMo/Anonymizer.git
    cd Anonymizer
    pip install fastapi uvicorn gliner
    python tools/serve_gliner.py
    ```

### Dependencies

```bash
pip install fastapi uvicorn gliner
# or with uv
uv pip install fastapi uvicorn gliner

On first launch the gliner package will download nvidia/gliner-pii from HuggingFace and cache it under ~/.cache/huggingface/. No HuggingFace token is required (public model).

Start the server

python tools/serve_gliner.py
# INFO     Uvicorn running on http://127.0.0.1:8001

# Optional: override port (default: 8001)
python tools/serve_gliner.py --port 9000

# Optional: listen on all interfaces — no auth; use only on trusted networks
python tools/serve_gliner.py --host 0.0.0.0

# Optional: pick device explicitly (auto prefers mps, then cuda, then cpu)
DEVICE=cuda python tools/serve_gliner.py

The reference server has no authentication. The default bind address is 127.0.0.1 so detection traffic stays on localhost. Use --host 0.0.0.0 only when Anonymizer runs on another host in a trusted environment.

Verify the server is reachable:

curl -sf http://localhost:8001/v1/models | python -m json.tool
# {
#     "object": "list",
#     "data": [{"id": "nvidia/gliner-pii", "object": "model"}]
# }

Run a real detection call — this is exactly what Anonymizer sends at the entity_detector role:

curl -s http://localhost:8001/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "nvidia/gliner-pii",
        "messages": [{"role": "user", "content": "Hi support, I can'\''t log in! My account username is '\''johndoe88'\''. Every time I try, it says '\''invalid credentials'\''. Please reset my password. You can reach me at (555) 123-4567 or johnd@example.com"}],
        "labels": ["user_name", "phone_number", "email"],
        "threshold": 0.3
    }' | jq -r '.choices[0].message.content' | jq

The first jq unwraps choices[0].message.content (an escaped JSON string); the second pretty-prints the decoded payload. Expected output:

{
  "entities": [
    { "text": "johndoe88",          "label": "user_name",    "start":  52, "end":  61, "score": 0.95 },
    { "text": "(555) 123-4567",     "label": "phone_number", "start": 159, "end": 173, "score": 1.00 },
    { "text": "johnd@example.com",  "label": "email",        "start": 177, "end": 194, "score": 1.00 }
  ]
}

An empty "entities": [] means either no labels in the request matched real PII in the text, or the threshold is too high.


Pointing Anonymizer at the local server

Pass separate model_providers and model_configs files to Anonymizer. model_configs replaces the entire model pool — it is not merged with defaults. Copy the bundled models.yaml, change only the gliner-pii-detector entry's provider, and keep the other default aliases (gpt-oss-120b, nemotron-30b-thinking). Default role→alias mappings still apply unless you override selected_models (see Custom models).

Custom model_providers also replaces the provider list, so include both your local GLiNER endpoint and the nvidia provider used by the LLM roles:

```yaml title="providers.yaml" providers: - name: local-gliner endpoint: http://localhost:8001/v1 provider_type: openai api_key: EMPTY # ignored; the reference server does not check auth

  • name: nvidia endpoint: https://integrate.api.nvidia.com/v1 provider_type: openai api_key: NVIDIA_API_KEY
    
    ```bash
    export NVIDIA_API_KEY="your-nvidia-api-key"

``yaml title="models.yaml" model_configs: - alias: gliner-pii-detector model: nvidia/gliner-pii provider: local-gliner skip_health_check: true # the default health check sends nolabels`, which GLiNER can't handle inference_parameters: max_parallel_requests: 8 # send concurrent rows; the reference server batches them timeout: 120

  • alias: gpt-oss-120b model: openai/gpt-oss-120b provider: nvidia inference_parameters: max_parallel_requests: 16 max_tokens: 16384 temperature: 0.3 top_p: 0.95 timeout: 300

  • alias: nemotron-30b-thinking model: nvidia/nemotron-3-nano-30b-a3b provider: nvidia inference_parameters: max_parallel_requests: 16 max_tokens: 8192 temperature: 0.4 top_p: 1.0 timeout: 300

    
    ```python
    from anonymizer import Anonymizer
    
    anonymizer = Anonymizer(
        model_providers="providers.yaml",
        model_configs="models.yaml",
    )

Set skip_health_check: true on the detector alias: Anonymizer's default probe sends prompt="Hello!" with no labels field, which is not a valid GLiNER request.


Performance notes

  • Batch mode: The reference server coalesces concurrent detector requests by default. Pair it with a higher max_parallel_requests on the gliner-pii-detector alias (see YAML above) so DataDesigner sends multiple rows at once and the server fills GPU batches efficiently.
  • On CPU, detection of a ~1000-character note with ~30 candidate labels takes 5–20 ms per request on a modern x86 core. For typical Anonymizer workflows this is a rounding error compared to the LLM roles that follow, and keeping GLiNER on CPU frees GPU memory for the LLM.
  • On GPU the same request drops to roughly 1–3 ms — worth it when you're processing tens of thousands of documents in a batch workflow, or when the host has spare GPU memory next to the LLM.
  • Choose device with the DEVICE environment variable (auto, cuda, mps, cpu). auto prefers Apple Silicon GPU (MPS), then NVIDIA CUDA, then CPU.
  • The default GLiNER threshold is 0.3. Lower values detect more spans (higher recall, more false positives); higher values improve precision but miss edge cases. Tune via Detect(gliner_threshold=...).
  • Each request loads the FULL list of candidate labels passed from Detect.entity_labels. If you only need a subset (e.g. a clinical-only deployment), narrowing that list materially speeds up detection.