Parallel Execution of Input and Output Rails#

You can configure input and output rails to run in parallel. This can improve latency and throughput.

When to Use Parallel Rails Execution#

Use parallel execution for I/O-bound rails such as external API calls to LLMs or third-party integrations.
Enable parallel execution if you have two or more independent input or output rails without shared state dependencies.
Use parallel execution in production environments where response latency affects user experience and business metrics.

When Not to Use Parallel Rails Execution#

Avoid parallel execution for CPU-bound rails; it might not improve performance and can introduce overhead.
Use sequential mode during development and testing for debugging and simpler workflows.

Configuration Example#

To enable parallel execution, set parallel: True in the rails.input and rails.output sections in the config.yml file. The following configuration example is tested by NVIDIA and shows how to enable parallel execution for input and output rails.

Note

Input rail mutations can lead to erroneous results during parallel execution because of race conditions arising from the execution order and timing of parallel operations. This can result in output divergence compared to sequential execution. For such cases, use sequential mode.

The following is an example configuration for parallel rails using models from NVIDIA Cloud Functions (NVCF). When you use NVCF models, make sure that you export NVIDIA_API_KEY to access those models.

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-70b-instruct
  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
  - type: topic_control
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-topic-control

rails:
  input:
    parallel: True
    flows:
      - content safety check input $model=content_safety
      - topic safety check input $model=topic_control
  output:
    parallel: True
    flows:
      - content safety check output $model=content_safety
      - self check output
    streaming:
      enabled: True
      chunk_size: 200
      context_size: 50
      stream_first: True
streaming: True