# Global Streaming

Enable streaming mode for the main LLM generation at the top level of `config.yml`.

## Configuration

```yaml
streaming: True
```

## What It Does

When enabled, global streaming:

- Sets `streaming = True` on the underlying LLM model
- Enables `stream_usage = True` for token usage tracking
- Allows using the `stream_async()` method on `LLMRails`
- Makes the LLM produce tokens incrementally instead of all at once

## Default

`False`

## When to Use

### Streaming Without Output Rails

If you do not have output rails configured, only global streaming is needed:

```yaml
streaming: True
```

### Streaming With Output Rails

When using output rails with streaming, you must also configure streaming for the output rails:

```yaml
streaming: True

rails:
  output:
    flows:
      - self check output
    streaming:
      enabled: True
```
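
The `streaming` block under the output rails can also carry chunking options that control how much generated text each output rail evaluates at a time. The fields below follow the streaming output rails configuration of recent NeMo Guardrails releases; treat the exact names and values as version-dependent and check your release's documentation:

```yaml
rails:
  output:
    streaming:
      enabled: True
      chunk_size: 200     # how much generated text each output rail evaluates at a time
      context_size: 50    # overlap carried over from the previous chunk
      stream_first: True  # stream chunks to the user before the output rails finish checking them
```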

## Python API Usage

### Simple Streaming

```python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

messages = [{"role": "user", "content": "Hello!"}]

async for chunk in rails.stream_async(messages=messages):
    print(chunk, end="", flush=True)
```
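
The `async for` loop must run inside a coroutine. A minimal, self-contained way to run the snippet as a script (assuming the same `./config` directory) is:

```python
import asyncio

from nemoguardrails import LLMRails, RailsConfig


async def main():
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    messages = [{"role": "user", "content": "Hello!"}]

    # Consume the stream and print tokens as they arrive.
    async for chunk in rails.stream_async(messages=messages):
        print(chunk, end="", flush=True)


asyncio.run(main())
```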

### Streaming With Handler

For more control, use a `StreamingHandler`:

```python
import asyncio

from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.streaming import StreamingHandler

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

streaming_handler = StreamingHandler()


async def process_tokens():
    # Consume chunks pushed to the handler as they are generated.
    async for chunk in streaming_handler:
        print(chunk, end="", flush=True)


asyncio.create_task(process_tokens())

result = await rails.generate_async(
    messages=[{"role": "user", "content": "Hello!"}],
    streaming_handler=streaming_handler,
)
```
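
With this pattern, `process_tokens()` consumes chunks as the LLM produces them, while `generate_async()` still returns the complete response once generation finishes.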

## Server API

Enable streaming in the request body by setting `stream` to `true`:

```json
{
  "config_id": "my_config",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}
```
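
As an illustration, a minimal Python client could consume the streamed response like this (a sketch assuming the Guardrails server runs locally on the default port and exposes the `/v1/chat/completions` endpoint, and that a configuration named `my_config` exists; the endpoint path and chunk framing may differ across server versions):

```python
import requests

payload = {
    "config_id": "my_config",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

with requests.post(
    "http://localhost:8000/v1/chat/completions", json=payload, stream=True
) as response:
    response.raise_for_status()
    # Print pieces of the completion as they arrive.
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```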

## Token Usage Tracking

When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` for the underlying LLM model.

Access token usage through the `log` generation option:

```python
response = rails.generate(messages=messages, options={
    "log": {
        "llm_calls": True
    }
})

for llm_call in response.log.llm_calls:
    print(f"Total tokens: {llm_call.total_tokens}")
    print(f"Prompt tokens: {llm_call.prompt_tokens}")
    print(f"Completion tokens: {llm_call.completion_tokens}")
```

## HuggingFace Pipeline Streaming

For LLMs deployed using `HuggingFacePipeline`, additional configuration is required:

```python
from transformers import pipeline

from nemoguardrails.llm.providers import HuggingFacePipelineCompatible
from nemoguardrails.llm.providers.huggingface import AsyncTextIteratorStreamer

# Create the streamer with the model's tokenizer (loaded as part of the usual
# transformers setup, elided here); skip_prompt=True excludes the prompt tokens
# from the streamed output.
streamer = AsyncTextIteratorStreamer(tokenizer, skip_prompt=True)
params = {"temperature": 0.01, "max_new_tokens": 100, "streamer": streamer}

# Build the transformers pipeline and pass the streamer through its parameters.
pipe = pipeline(
    # other parameters
    **params,
)

llm = HuggingFacePipelineCompatible(pipeline=pipe, model_kwargs=params)
```
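
The wrapped pipeline can then be passed to `LLMRails` as the application LLM, for example (a sketch, assuming a `./config` directory as in the earlier examples):

```python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config, llm=llm)
```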