Skip to content

Sticky Routing

By default, routing is recomputed on every request, so a single task can hop between models mid-conversation. This throws away the upstream prompt/KV cache and, on the classifier router, pays for an LLM classifier call every turn.

Session affinity, also called sticky routing, is an opt-in feature configured directly on the deterministic LLM-classifier router. It pins the routing decision for a conversation and reuses it for later turns, so a task stays on one model tier. The router can optionally delay that pin for a fixed number of warmup turns before committing to a tier.

How a conversation is identified

No client session ID is required. A conversation is keyed by hashing the stable prefix made from the system prompt + the first user message. Agent harnesses do not rewrite that prefix, so every turn of one task maps to the same key while distinct tasks differ. The key is derived per request and the pin store is a bounded (least-recently-used) in-process map.

Configuration

Opt in with session_affinity: true on the route. affinity_max_sessions (default 10000) caps the number of pinned conversations.

These fields are part of the deterministic router configuration, not a standalone route type. In CLI YAML, configure them under a deterministic entry in a routes: bundle loaded with --routing-profiles. The Rust profiles: schema loaded by switchyard serve --config does not yet expose them.

For deterministic routes, affinity_warmup_turns controls how many initial turns remain non-sticky. The default is 0, which preserves the historical "pin the first confident verdict" behavior.

Deterministic (LLM-classifier) router

routes:
  llm-classifier:
    type: deterministic
    profile: coding_agent
    session_affinity: true          # pin a confident tier per conversation
    affinity_max_sessions: 10000
    affinity_warmup_turns: 2        # turns 1-2 classify normally; turn 3 can pin
    fallback_target_on_evict: strong
    classifier:
      model: openai/gpt-4o-mini
      api_key: ${OPENROUTER_API_KEY}
      base_url: https://openrouter.ai/api/v1
    strong:
      model: openai/gpt-4o
      api_key: ${OPENROUTER_API_KEY}
      base_url: https://openrouter.ai/api/v1
    weak:
      model: openai/gpt-4o-mini
      api_key: ${OPENROUTER_API_KEY}
      base_url: https://openrouter.ai/api/v1

The first confident classifier verdict after the warmup period is pinned and reused. With affinity_warmup_turns: 2, turns 1 and 2 route normally and cannot read or write a pin; turn 3 is the first turn that can commit a confident tier. Once pinned, the classifier skips its LLM call entirely. It runs only until the route has a committed tier for the task. Fallback decisions (abstain / low confidence) are never pinned, so a transient classifier failure can't lock the task to the default tier.

Behavior notes

  • Off by default. With session_affinity: false, routing is recomputed per turn as before.
  • Warmup. affinity_warmup_turns: N means turns 1..N stay non-sticky; the first eligible commit is turn N + 1.
  • affinity_max_sessions: 0 with affinity on is rejected at config load. A zero-capacity store retains nothing and would silently disable stickiness.
  • In-process, per-worker. Pins live in memory and are not shared across workers or restarts.
  • Degrades safe. Affinity assumes the system + first-user prefix is stable across a task's turns. If a harness injects per-turn churn there, the key changes each turn and routing simply reverts to per-turn behavior. No lock, no error.