Benchmarks

Production-like synthetic workloads—not theoretical max throughput.

Workload Definition

Shape

10,000 transactions. Each transaction: 1 tool call + 1 LLM invocation + 2 state writes. Query complexity distribution: 70% simple (10–50 tokens), 20% medium (50–200 tokens), 10% complex (200–500 tokens). Cache state: 70% cold, 30% warm.
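The shape above can be sketched as a per-transaction sampler. The helper below is purely illustrative (names like `sample_transaction` and `COMPLEXITY_BANDS` are not from the actual benchmark harness), assuming independent draws per transaction:

```python
import random

# Illustrative sampler for the stated workload shape; all identifiers here
# are assumptions, not the real benchmark harness.
COMPLEXITY_BANDS = {
    "simple":  (0.70, (10, 50)),    # 70% of queries, 10-50 tokens
    "medium":  (0.20, (50, 200)),   # 20% of queries, 50-200 tokens
    "complex": (0.10, (200, 500)),  # 10% of queries, 200-500 tokens
}
WARM_CACHE_RATE = 0.30  # 30% warm, 70% cold

def sample_transaction(rng: random.Random) -> dict:
    """Draw one synthetic transaction matching the stated distributions."""
    labels = list(COMPLEXITY_BANDS)
    weights = [COMPLEXITY_BANDS[label][0] for label in labels]
    label = rng.choices(labels, weights=weights)[0]
    lo, hi = COMPLEXITY_BANDS[label][1]
    return {
        "complexity": label,
        "query_tokens": rng.randint(lo, hi),
        "cache": "warm" if rng.random() < WARM_CACHE_RATE else "cold",
        # Fixed shape: 1 tool call + 1 LLM invocation + 2 state writes.
        "tool_calls": 1,
        "llm_calls": 1,
        "state_writes": 2,
    }
```

Sampling 10,000 of these reproduces the 70/20/10 complexity split and the 70/30 cold/warm split to within sampling noise.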

Concurrency

100 concurrent clients. Ramp-up: 0→100 over 60 seconds. Sustained load: 10 minutes at 100 concurrent.
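A linear 0→100 ramp over 60 seconds works out to one new client every 0.6 s. A minimal sketch of the start-time schedule (the function name is illustrative, not from the harness):

```python
def ramp_schedule(clients: int = 100, ramp_seconds: float = 60.0) -> list[float]:
    """Start offsets (seconds) for a linear ramp: client i starts at i * ramp/clients."""
    return [i * ramp_seconds / clients for i in range(clients)]
```

After the last client starts (at 59.4 s), all 100 clients issue transactions continuously for the 10-minute sustained phase.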

Failure Injection

Simulated failures: 5% tool timeout rate, 2% LLM provider timeout rate, 1% process crash rate.

Infrastructure

AWS us-east-1, c6i.2xlarge instances (8 vCPU, 16GB RAM). Postgres 15 with read replicas. LLM routing: OpenAI GPT-4o-mini (primary), GPT-4o (fallback).

Measured Results

Metric            Value     Notes
Token savings     38.2%     vs. an always-on GPT-4 baseline; excludes network transfer cost
Rollback success  100%      no partial state corruption detected across 127 injected failures
p95 latency       1.8 s     end-to-end transaction time; the LLM call accounts for ~80% of it
Cache hit rate    22.4%     hits bypass routing and model invocation; 30% warm-cache workload
Throughput        247 tx/s  sustained under 100 concurrent clients
Availability      99.8%     request success rate with the fallback chain enabled

Latency Breakdown (p95)

Component             p95 time   Share
Tool execution        120 ms     6.7%
LLM invocation        1450 ms    80.6%
State write + commit  180 ms     10.0%
Router + overhead     50 ms      2.7%
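Note that per-component p95s do not in general sum to an end-to-end p95; here the breakdown is reported additively, and the components do total the 1.8 s headline figure, as a quick arithmetic check confirms:

```python
# Component p95 times (ms) from the latency breakdown above.
components_ms = {
    "tool_execution": 120,
    "llm_invocation": 1450,
    "state_write_commit": 180,
    "router_overhead": 50,
}

total_ms = sum(components_ms.values())                  # 1800 ms = 1.8 s headline p95
llm_share = components_ms["llm_invocation"] / total_ms  # ~0.806, i.e. the 80.6% share
```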

What These Metrics Exclude

  • Network transfer cost (egress charges from AWS to OpenAI)
  • Cold start latency (assumes warm instances)
  • Provider-side queueing delays (OpenAI internal queue time not measured)
  • Schema migration overhead (benchmark uses fixed schema)
  • Cross-region latency (single-region deployment)

Disclaimer

Results will vary with provider SLAs, regional network conditions, query complexity distribution, and cache hit rates. These benchmarks represent typical production workloads, not best-case scenarios. Run your own benchmarks before deploying to production.