Streaming#
NeMo Guardrails supports streaming LLM responses via the stream_async() method. No configuration is required to enable streaming—simply use stream_async() instead of generate_async().
Basic Usage#
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Hello!"}]

async for chunk in rails.stream_async(messages=messages):
    print(chunk, end="", flush=True)
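The snippet above assumes it is already running inside an event loop. As a minimal sketch, a standalone script (using the same hypothetical ./config directory) can wrap the loop in asyncio.run():
import asyncio

from nemoguardrails import LLMRails, RailsConfig


async def main():
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    messages = [{"role": "user", "content": "Hello!"}]

    # Print tokens as they arrive from the LLM.
    async for chunk in rails.stream_async(messages=messages):
        print(chunk, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())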
Streaming With Output Rails#
When output rails are configured, you must also explicitly enable streaming for them in config.yml:
rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True
If output rails are configured but rails.output.streaming.enabled is not set to True, calling stream_async() raises a StreamingNotSupportedError.
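Once streaming is enabled for the output rails, the consuming code is the same as in the basic example; only the configuration changes. A minimal sketch, assuming the configuration above is stored in ./config:
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Hello!"}]

# With output-rail streaming enabled, chunks pass through the output rails
# before they are yielded, so the loop body is unchanged.
async for chunk in rails.stream_async(messages=messages):
    print(chunk, end="", flush=True)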
Streaming With Handler (Deprecated)#
Warning: Using StreamingHandler directly is deprecated and will be removed in a future release. Use stream_async() instead.
For advanced use cases requiring more control over token processing, you can use a StreamingHandler with generate_async():
from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.streaming import StreamingHandler
import asyncio

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

streaming_handler = StreamingHandler()


async def process_tokens():
    async for chunk in streaming_handler:
        print(chunk, end="", flush=True)


# Keep a reference to the task so it is not garbage-collected before it
# finishes draining the handler.
task = asyncio.create_task(process_tokens())

result = await rails.generate_async(
    messages=[{"role": "user", "content": "Hello!"}],
    streaming_handler=streaming_handler,
)
Server API#
Enable streaming in the request body by setting stream to true:
{
  "config_id": "my_config",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}
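As a sketch, a client can consume the streamed response with the requests library. This assumes the guardrails server is running locally on its default port (8000) and exposes the /v1/chat/completions endpoint:
import requests

payload = {
    "config_id": "my_config",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

# Stream the response body and print chunks as they arrive.
with requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, stream=True
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)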
CLI Usage#
Use the --streaming flag with the chat command:
nemoguardrails chat path/to/config --streaming
Token Usage Tracking#
When using stream_async(), NeMo Guardrails automatically enables token usage tracking by setting stream_usage = True on the underlying LLM.
Access token usage through the log generation option:
response = rails.generate(messages=messages, options={
    "log": {
        "llm_calls": True
    }
})

for llm_call in response.log.llm_calls:
    print(f"Total tokens: {llm_call.total_tokens}")
    print(f"Prompt tokens: {llm_call.prompt_tokens}")
    print(f"Completion tokens: {llm_call.completion_tokens}")
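When streaming, the same per-call statistics can be inspected after the stream completes. A sketch using the explain() helper, which records the LLM calls from the most recent generation (whether usage is populated depends on the provider reporting it for streamed responses):
async for chunk in rails.stream_async(messages=messages):
    print(chunk, end="", flush=True)

# Inspect the LLM calls from the generation that just finished.
info = rails.explain()
for llm_call in info.llm_calls:
    print(f"Total tokens: {llm_call.total_tokens}")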
HuggingFace Pipeline Streaming#
For LLMs deployed using HuggingFacePipeline, additional configuration is required:
from transformers import pipeline

from nemoguardrails.llm.providers import HuggingFacePipelineCompatible
from nemoguardrails.llm.providers.huggingface import AsyncTextIteratorStreamer

# Create the streamer with the model's tokenizer.
streamer = AsyncTextIteratorStreamer(tokenizer, skip_prompt=True)
params = {"temperature": 0.01, "max_new_tokens": 100, "streamer": streamer}

pipe = pipeline(
    # other parameters (task, model, tokenizer, ...)
    **params,
)
llm = HuggingFacePipelineCompatible(pipeline=pipe, model_kwargs=params)
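The wrapped llm can then be passed to the rails instance, for example via LLMRails(config, llm=llm), so that streamed tokens flow through the streamer configured above.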