Token Triage
Policy-driven model routing evaluated per request before invocation.
Problem
Teams default to expensive models (GPT-4o, Claude Opus) for every request because manual routing is error-prone. Baseline cost with an always-on GPT-4 deployment averages $0.03 per request. Simple queries waste 80% of the token budget. Complex queries hit rate limits with no fallback. There is no deterministic cost ceiling.
Neusnap Behavior
Routing happens before model invocation. Neusnap evaluates routing policy constraints (cost ceiling, latency target, quality threshold) and selects the cheapest model that satisfies all constraints. If the primary model fails or violates constraints, fallback occurs within the same transaction. Cache hits bypass routing entirely—no model is invoked.
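The constraint-based selection described above can be sketched as a filter-then-minimize step. This is a minimal illustration, not Neusnap's implementation; the per-model cost, latency, and quality figures below are illustrative assumptions, and fallback ordering on provider failure is handled separately from this selection step.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_request: float   # estimated USD per request (assumed figure)
    latency_ms: int           # typical latency (assumed figure)
    quality_score: float      # 0.0-1.0 benchmark score (assumed figure)

def route(models, max_cost, max_latency_ms, min_quality):
    """Return the cheapest model satisfying every constraint, or None."""
    candidates = [
        m for m in models
        if m.cost_per_request <= max_cost
        and m.latency_ms <= max_latency_ms
        and m.quality_score >= min_quality
    ]
    # Cheapest model that passes all constraints wins
    return min(candidates, key=lambda m: m.cost_per_request, default=None)

models = [
    Model("openai/gpt-4o-mini", 0.002, 800, 0.88),
    Model("anthropic/claude-3-haiku", 0.001, 600, 0.86),
    Model("local/llama-3-8b", 0.0001, 1500, 0.80),
]
chosen = route(models, max_cost=0.01, max_latency_ms=2000, min_quality=0.85)
```

With these illustrative numbers, `local/llama-3-8b` is cheapest but fails the quality threshold, so the selector falls through to `anthropic/claude-3-haiku`.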
Guaranteed
- → Cost ceiling enforced per request (hard limit, not advisory)
- → Fallback respects quality threshold—no silent downgrades
- → Cached results prevent duplicate spend (cache key = semantic hash of prompt + params)
- → Routing decision logged for audit trail
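The "semantic hash of prompt + params" cache key can be sketched as follows. A true semantic key might compare embeddings; this simplified stand-in approximates the idea with lexical normalization, so trivially different phrasings (case, extra whitespace) map to the same key. The function name and normalization rules are assumptions for illustration.

```python
import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    """Deterministic cache key from a normalized prompt plus request params.

    Sketch only: real semantic keying may use embedding similarity; here we
    lowercase and collapse whitespace so near-identical prompts collide.
    """
    normalized = " ".join(prompt.lower().split())
    payload = json.dumps({"prompt": normalized, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Equivalent-after-normalization prompts share a key
k1 = cache_key("Summarize  this Document", {"temperature": 0})
k2 = cache_key("summarize this document", {"temperature": 0})
```

Sorting the JSON keys keeps the key stable regardless of the order params were supplied in, which is what makes duplicate-spend prevention deterministic.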
NOT Guaranteed
- × Semantic quality equivalence across models (GPT-4 ≠ Claude Opus output)
- × Identical reasoning depth across models
- × Sub-100ms routing overhead under extreme load
- × Fallback if all models in the policy chain are unavailable (the request fails)
Example Policy
{
  "routingPolicy": {
    "primary": "openai/gpt-4o-mini",    // evaluated first
    "fallback": [
      "anthropic/claude-3-haiku",       // if primary fails or violates cost
      "local/llama-3-8b"                // final fallback
    ],
    "constraints": {
      "maxCostPerRequest": 0.01,        // enforced before invocation
      "maxLatencyMs": 2000,             // provider timeout
      "minQualityScore": 0.85           // blocks downgrade if score < threshold
    },
    "cache": {
      "enabled": true,
      "ttl": 3600,                      // entries expire after 1 hour
      "keyStrategy": "semantic"         // hash prompt + params, not raw string
    }
  }
}
Measured Results
Measured against an always-on GPT-4 baseline under a mixed workload (10k requests: 70% simple queries, 20% medium, 10% complex).
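The hard cost ceiling makes the worst case computable up front. Using only figures stated above (the $0.03/request baseline from the Problem section and the $0.01 `maxCostPerRequest` from the example policy), the bound for this workload is:

```python
requests = 10_000
baseline_cost_per_request = 0.03   # always-on GPT-4 (from Problem section)
ceiling_per_request = 0.01         # maxCostPerRequest from the example policy

baseline_total = requests * baseline_cost_per_request   # $300 baseline spend
worst_case_total = requests * ceiling_per_request       # $100 hard upper bound
```

Because the ceiling is enforced per request before invocation, $100 is a guaranteed upper bound for the 10k-request run, independent of which models the router actually selects; cache hits only lower it further.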