Self-hosting Llama Guard using vLLM
The steps below show how to self-host Llama Guard using vLLM and Hugging Face. Alternatively, you can use your own custom inference code with the downloaded model weights.
- Request access to the Llama Guard model from Meta on Hugging Face. See this page for more details.
- Log in to Hugging Face with your account token:
huggingface-cli login
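If you are setting up a remote machine non-interactively, you can pass the token on the command line instead of entering it at the prompt. This is a minimal sketch that assumes your token is exported as `HF_TOKEN` (the variable name is just an example) and that your version of `huggingface_hub` supports the `--token` option:

```sh
# Non-interactive login, e.g. from a provisioning script.
# HF_TOKEN is assumed to hold your Hugging Face access token.
huggingface-cli login --token "$HF_TOKEN"
```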
- Use vLLM to host a Llama Guard inference endpoint in OpenAI-compatible mode:
pip install vllm
python -m vllm.entrypoints.openai.api_server --port 5123 --model meta-llama/LlamaGuard-7b
This starts the vLLM inference server at http://localhost:5123/.
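As a quick sanity check that the server is up, you can query the OpenAI-compatible API directly. The snippet below is only an illustration: it lists the served models and sends a trivial completion request; real moderation calls use the Llama Guard task template rather than this placeholder prompt.

```sh
# List the models served by the endpoint (should include meta-llama/LlamaGuard-7b)
curl http://localhost:5123/v1/models

# Minimal completion request to confirm the endpoint responds
curl http://localhost:5123/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/LlamaGuard-7b", "prompt": "[INST] Hello [/INST]", "max_tokens": 16}'
```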
- Set the host and port in your bot's YAML configuration files (example config). If you're running the `nemoguardrails` app on another server, remember to replace `localhost` with your vLLM server's public IP address.
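As a rough sketch of what that looks like, the Llama Guard entry in the bot's `config.yml` would point at the vLLM endpoint. The `vllm_openai` engine and the parameter names below follow the linked example config and are assumptions to verify against your version of NeMo Guardrails:

```yaml
models:
  # ... your main LLM entry goes here ...

  # Llama Guard served by the self-hosted vLLM OpenAI-compatible endpoint.
  # Replace localhost with the vLLM server's public IP if it runs elsewhere.
  - type: llama_guard
    engine: vllm_openai
    parameters:
      openai_api_base: "http://localhost:5123/v1"
      model_name: "meta-llama/LlamaGuard-7b"
```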