Benchmarks
Production-like synthetic workloads, not theoretical maximum throughput.
Workload Definition
10,000 transactions. Each transaction: 1 tool call + 1 LLM invocation + 2 state writes. Query complexity distribution: 70% simple (10–50 tokens), 20% medium (50–200 tokens), 10% complex (200–500 tokens). Cache state: 70% cold, 30% warm.
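The workload mix above can be sketched as a sampler. This is a minimal illustration: the category names, token ranges, and rates come from the text; everything else (field names, the `sample_transaction` helper) is assumed.

```python
import random

# Complexity mix from the workload definition: (label, weight, token range).
COMPLEXITY_MIX = [
    ("simple", 0.70, (10, 50)),     # 70% simple, 10-50 tokens
    ("medium", 0.20, (50, 200)),    # 20% medium, 50-200 tokens
    ("complex", 0.10, (200, 500)),  # 10% complex, 200-500 tokens
]
COLD_CACHE_RATE = 0.70              # 70% cold, 30% warm

def sample_transaction(rng: random.Random) -> dict:
    """Pick a query complexity and cache state for one synthetic transaction."""
    label, _, (lo, hi) = rng.choices(
        COMPLEXITY_MIX, weights=[w for _, w, _ in COMPLEXITY_MIX]
    )[0]
    return {
        "complexity": label,
        "query_tokens": rng.randint(lo, hi),
        "cache": "cold" if rng.random() < COLD_CACHE_RATE else "warm",
        # Each transaction: 1 tool call + 1 LLM invocation + 2 state writes.
        "ops": {"tool_calls": 1, "llm_calls": 1, "state_writes": 2},
    }

# Seeding per transaction keeps the 10,000-transaction run reproducible.
workload = [sample_transaction(random.Random(i)) for i in range(10_000)]
```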
100 concurrent clients. Ramp-up: 0→100 over 60 seconds. Sustained load: 10 minutes at 100 concurrent.
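The ramp-and-sustain profile can be reproduced with a simple `asyncio` load generator: stagger client start times linearly over the ramp window, then hold all clients until the sustain window ends. This is a sketch, not the harness actually used for these numbers; the transaction body is a stand-in.

```python
import asyncio

async def run_load(clients: int = 100, ramp_s: float = 60.0,
                   sustain_s: float = 600.0) -> int:
    """Linear 0->clients ramp over ramp_s, then hold until ramp_s + sustain_s.

    Returns the total number of transactions completed.
    """
    stop = asyncio.Event()
    done = 0

    async def client_loop(start_delay: float) -> None:
        nonlocal done
        await asyncio.sleep(start_delay)   # staggered starts = linear ramp
        while not stop.is_set():
            # Stand-in for one transaction (tool call + LLM call + 2 writes).
            await asyncio.sleep(0)
            done += 1

    tasks = [asyncio.create_task(client_loop(i * ramp_s / clients))
             for i in range(clients)]
    await asyncio.sleep(ramp_s + sustain_s)
    stop.set()
    await asyncio.gather(*tasks)
    return done
```

The defaults match the profile above (100 clients, 60 s ramp, 10 min sustain); scale the parameters down for a smoke test.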
Simulated failures: 5% tool timeout rate, 2% LLM provider timeout rate, 1% process crash rate.
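Failure injection at these rates is straightforward to sketch. The rates are from the text; the failure names and the `inject_failure` helper are illustrative.

```python
import random
from typing import Optional

# Per-transaction failure rates from the benchmark definition.
FAILURE_RATES = {
    "tool_timeout": 0.05,    # 5% of tool calls time out
    "llm_timeout": 0.02,     # 2% of LLM provider calls time out
    "process_crash": 0.01,   # 1% of transactions hit a process crash
}

def inject_failure(rng: random.Random) -> Optional[str]:
    """Return the first simulated failure triggered for a transaction, or None."""
    for kind, rate in FAILURE_RATES.items():
        if rng.random() < rate:
            return kind
    return None
```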
AWS us-east-1, c6i.2xlarge instances (8 vCPU, 16GB RAM). Postgres 15 with read replicas. LLM routing: OpenAI GPT-4o-mini (primary), GPT-4o (fallback).
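Primary/fallback routing can be sketched as try-the-primary, retry-on-timeout. The model names are from the text; the `ProviderTimeout` exception and the injectable `call` function are assumptions standing in for a real provider client.

```python
class ProviderTimeout(Exception):
    """Raised when a provider call exceeds its deadline (assumed error type)."""

def route(prompt: str, call,
          primary: str = "gpt-4o-mini", fallback: str = "gpt-4o") -> str:
    """Try the primary model; on timeout, retry once against the fallback.

    `call(model, prompt)` is an injected provider client (a stand-in here).
    """
    try:
        return call(primary, prompt)
    except ProviderTimeout:
        # Matches the ~2% provider-timeout rate in the failure model.
        return call(fallback, prompt)
```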
Measured Results
Latency Breakdown (p95)
What These Metrics Exclude
- Network transfer cost (egress charges from AWS to OpenAI)
- Cold start latency (assumes warm instances)
- Provider-side queueing delays (OpenAI internal queue time not measured)
- Schema migration overhead (benchmark uses fixed schema)
- Cross-region latency (single-region deployment)
Disclaimer
These benchmarks represent typical production workloads, not best-case scenarios. Results will vary with provider SLAs, regional network conditions, query complexity distribution, and cache hit rates. Run your own benchmarks before deploying to production.