Output Rail Streaming#
Configure how output rails process streamed tokens under rails.output.streaming.
Configuration#
rails:
output:
flows:
- self check output
streaming:
enabled: True
chunk_size: 200
context_size: 50
stream_first: True
Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
bool |
|
Must be |
|
int |
|
Number of tokens per chunk that output rails process |
|
int |
|
Tokens carried over between chunks for continuity |
|
bool |
|
If |
Tips for Setting Parameters#
enabled#
When you configure output rails and want to use stream_async(), set this to True.
If not enabled, you receive an error:
stream_async() cannot be used when output rails are configured but
rails.output.streaming.enabled is False. Either set
rails.output.streaming.enabled to True in your configuration, or use
generate_async() instead of stream_async().
chunk_size#
The number of tokens buffered before output rails run.
Larger values: Fewer rail executions, but higher latency to first output
Smaller values: More rail executions, but faster time-to-first-token
Default: 200 tokens
context_size#
The number of tokens from the previous chunk carried over to provide context for the next chunk.
This helps output rails make consistent decisions across chunk boundaries. For example, if a sentence spans two chunks, the context ensures the rail can evaluate the complete sentence.
Default: 50 tokens
stream_first#
Controls when tokens are streamed relative to output rail processing:
True(default): The client receives each chunk of tokens before output rails process that chunk. This provides faster time-to-first-token, but if a rail blocks the content, the user has already received the tokens. The stream terminates with a JSON error on violation.False: Output rails process each chunk before the client receives tokens. The user never sees blocked content, but time-to-first-token increases by the rail execution time per chunk.
Requirements#
Output rail streaming requires using the stream_async() method:
rails:
output:
flows:
- self check output
streaming:
enabled: True
Note
The top-level streaming: True field is deprecated and no longer required. Use stream_async() directly instead.
Usage Examples#
Basic Output Rail Streaming#
rails:
output:
flows:
- self check output
streaming:
enabled: True
chunk_size: 200
context_size: 50
Parallel Output Rails With Streaming#
For parallel execution of multiple output rails during streaming:
rails:
output:
parallel: True
flows:
- content_safety_check
- pii_detection
- hallucination_check
streaming:
enabled: True
chunk_size: 200
context_size: 50
stream_first: True
Low-Latency Configuration#
For faster time-to-first-token with smaller chunks:
rails:
output:
flows:
- self check output
streaming:
enabled: True
chunk_size: 50
context_size: 20
stream_first: True
Warning
With stream_first: True, the client receives tokens before output rails run. If a rail blocks the content, the user has already received the tokens up to that point. The stream terminates with a JSON error object when it detects a violation.
Safety-First Configuration#
For maximum safety with rails applied before streaming:
rails:
output:
flows:
- content_safety_check
streaming:
enabled: True
chunk_size: 300
context_size: 75
stream_first: False
How It Works#
Token Buffering: The system buffers tokens from the LLM until
chunk_sizetokens accumulate.Streaming or Rail Execution (depends on
stream_first):stream_first: True(default): The client receives the new tokens immediately, then output rails run on the chunk (including context). If the rails block the content, the stream terminates with a JSON error, while the client receives the tokens up to that point.stream_first: False: Output rails run on the chunk first. The client receives the new tokens only if rails pass. If the rails block the content, the client never receives the tokens.
Context Overlap: The system retains the last
context_sizetokens from the current chunk and prepends them to the next chunk’s processing context. This gives rails visibility across chunk boundaries.Blocking: If any rail blocks the content, the stream yields a JSON error object (
{"error": {...}}) and terminates immediately.
stream_first: True (default)#
Buffer fills to chunk_size
↓
Yield new tokens to client (user sees them immediately)
↓
Run output rails on [context + new tokens]
↓
Pass → continue to next chunk
Block → yield JSON error, terminate stream
stream_first: False#
Buffer fills to chunk_size
↓
Run output rails on [context + new tokens]
↓
Pass → yield new tokens to client
Block → yield JSON error, terminate stream (user never sees blocked content)
Buffer Overlap#
The client receives only new tokens. Output rails use the context_size tokens solely for processing context:
Chunk 1: rails process [token1 ... token200]
user receives [token1 ... token200]
Chunk 2: rails process [token151 ... token200, token201 ... token400]
└── context_size ──┘ └── new tokens ───────┘
user receives [token201 ... token400]
Python API#
from nemoguardrails import LLMRails, RailsConfig
config = RailsConfig.from_path("./config")
rails = LLMRails(config)
messages = [{"role": "user", "content": "Tell me a story"}]
# stream_async() automatically uses output rail streaming when configured
async for chunk in rails.stream_async(messages=messages):
print(chunk, end="", flush=True)