Benchmarks
Production-like synthetic workloads, not theoretical maximum throughput.
Workload Definition
10,000 transactions. Each transaction: 1 tool call + 1 LLM invocation + 2 state writes. Query complexity distribution: 70% simple (10–50 tokens), 20% medium (50–200 tokens), 10% complex (200–500 tokens). Cache state: 70% cold, 30% warm.
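The workload mix above can be sketched as a sampler. This is a minimal illustration: the category names, token ranges, and rates come from the text; everything else (field names, the `sample_transaction` helper) is assumed.

```python
import random

# Complexity mix from the workload definition: (label, weight, token range).
COMPLEXITY_MIX = [
    ("simple", 0.70, (10, 50)),     # 70% simple, 10-50 tokens
    ("medium", 0.20, (50, 200)),    # 20% medium, 50-200 tokens
    ("complex", 0.10, (200, 500)),  # 10% complex, 200-500 tokens
]
COLD_CACHE_RATE = 0.70              # 70% cold, 30% warm

def sample_transaction(rng: random.Random) -> dict:
    """Pick a query complexity and cache state for one synthetic transaction."""
    label, _, (lo, hi) = rng.choices(
        COMPLEXITY_MIX, weights=[w for _, w, _ in COMPLEXITY_MIX]
    )[0]
    return {
        "complexity": label,
        "query_tokens": rng.randint(lo, hi),
        "cache": "cold" if rng.random() < COLD_CACHE_RATE else "warm",
        # Each transaction: 1 tool call + 1 LLM invocation + 2 state writes.
        "ops": {"tool_calls": 1, "llm_calls": 1, "state_writes": 2},
    }

# Seeding per transaction keeps the 10,000-transaction run reproducible.
workload = [sample_transaction(random.Random(i)) for i in range(10_000)]
```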
100 concurrent clients. Ramp-up: 0→100 over 60 seconds. Sustained load: 10 minutes at 100 concurrent.
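The ramp-and-sustain profile can be reproduced with a simple `asyncio` load generator: stagger client start times linearly over the ramp window, then hold all clients until the sustain window ends. This is a sketch, not the harness actually used for these numbers; the transaction body is a stand-in.

```python
import asyncio

async def run_load(clients: int = 100, ramp_s: float = 60.0,
                   sustain_s: float = 600.0) -> int:
    """Linear 0->clients ramp over ramp_s, then hold until ramp_s + sustain_s.

    Returns the total number of transactions completed.
    """
    stop = asyncio.Event()
    done = 0

    async def client_loop(start_delay: float) -> None:
        nonlocal done
        await asyncio.sleep(start_delay)   # staggered starts = linear ramp
        while not stop.is_set():
            # Stand-in for one transaction (tool call + LLM call + 2 writes).
            await asyncio.sleep(0)
            done += 1

    tasks = [asyncio.create_task(client_loop(i * ramp_s / clients))
             for i in range(clients)]
    await asyncio.sleep(ramp_s + sustain_s)
    stop.set()
    await asyncio.gather(*tasks)
    return done
```

The defaults match the profile above (100 clients, 60 s ramp, 10 min sustain); scale the parameters down for a smoke test.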
Simulated failures: 5% tool timeout rate, 2% LLM provider timeout rate, 1% process crash rate.
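Failure injection at these rates is straightforward to sketch. The rates are from the text; the failure names and the `inject_failure` helper are illustrative.

```python
import random
from typing import Optional

# Per-transaction failure rates from the benchmark definition.
FAILURE_RATES = {
    "tool_timeout": 0.05,    # 5% of tool calls time out
    "llm_timeout": 0.02,     # 2% of LLM provider calls time out
    "process_crash": 0.01,   # 1% of transactions hit a process crash
}

def inject_failure(rng: random.Random) -> Optional[str]:
    """Return the first simulated failure triggered for a transaction, or None."""
    for kind, rate in FAILURE_RATES.items():
        if rng.random() < rate:
            return kind
    return None
```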
AWS us-east-1, c6i.2xlarge instances (8 vCPU, 16GB RAM). Postgres 15 with read replicas. LLM routing: OpenAI GPT-4o-mini (primary), GPT-4o (fallback).
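Primary/fallback routing can be sketched as try-the-primary, retry-on-timeout. The model names are from the text; the `ProviderTimeout` exception and the injectable `call` function are assumptions standing in for a real provider client.

```python
class ProviderTimeout(Exception):
    """Raised when a provider call exceeds its deadline (assumed error type)."""

def route(prompt: str, call,
          primary: str = "gpt-4o-mini", fallback: str = "gpt-4o") -> str:
    """Try the primary model; on timeout, retry once against the fallback.

    `call(model, prompt)` is an injected provider client (a stand-in here).
    """
    try:
        return call(primary, prompt)
    except ProviderTimeout:
        # Matches the ~2% provider-timeout rate in the failure model.
        return call(fallback, prompt)
```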
Measured Results
Latency Breakdown (p95)
What These Metrics Exclude
- Network transfer cost (egress charges from AWS to OpenAI)
- Cold start latency (assumes warm instances)
- Provider-side queueing delays (OpenAI internal queue time not measured)
- Schema migration overhead (benchmark uses fixed schema)
- Cross-region latency (single-region deployment)
Disclaimer
These benchmarks represent typical production workloads, not best-case scenarios. Results will vary with provider SLAs, regional network conditions, query complexity distribution, and cache hit rates. Run your own benchmarks before deploying to production.