Failure Semantics
Explicit delivery guarantees and retry behavior for state writes, tool calls, and LLM invocations.
Problem
Production AI systems fail non-deterministically: network timeouts (3–5% of requests in distributed environments), provider rate limits (429 errors), and process crashes mid-transaction. Without explicit failure semantics, these failures cause duplicate charges, partial state corruption, and repeated non-idempotent side effects. Debugging costs 4–8 hours per incident.
Neusnap Behavior
State writes: exactly-once commit via transactional boundary. Tool calls: at-most-once per transaction (cached after first execution). LLM calls: at-least-once (retried on timeout). Retries use exponential backoff with jitter (100ms → 5s max delay). Non-transient errors (401, 400, schema validation failure) are not retried. Uncommitted state is discarded on process crash.
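
The backoff schedule described above can be sketched as follows. This is a minimal illustration, not Neusnap's implementation: the function name `backoff_delay_ms` is hypothetical, the multiplier of 4 is inferred from the 100ms, 400ms, 1600ms progression in the retry policy example, and "full jitter" (a uniform draw between 0 and the capped delay) is an assumption, since the jitter strategy is not specified.

```python
import random


def backoff_delay_ms(retry_index, initial_ms=100, max_ms=5000, multiplier=4,
                     jitter=True, rng=random):
    """Return the delay in milliseconds before retry number `retry_index` (0-based).

    The un-jittered delay grows geometrically and is capped at `max_ms`.
    With jitter enabled, the delay is drawn uniformly from [0, capped delay]
    (assumed strategy) so that concurrent clients do not retry in lockstep.
    """
    ceiling = min(max_ms, initial_ms * multiplier ** retry_index)
    if not jitter:
        return float(ceiling)
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    # Un-jittered schedule for the first four retries.
    print([backoff_delay_ms(i, jitter=False) for i in range(4)])
    # prints [100.0, 400.0, 1600.0, 5000.0]
```

The fourth value shows the cap: 100 × 4³ = 6400ms exceeds the 5s maximum, so the delay is clamped to 5000ms.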
Execution Guarantees
| Operation | Execution Guarantee | Retry Behavior | Developer Responsibility |
|---|---|---|---|
| State writes | Exactly-once | Cached, no re-execution | Schema validation |
| Tool calls | At-most-once | Cached, no duplicate execution | External API idempotency |
| LLM calls | At-least-once | Retry on transient failure (timeout, 429, 503; 3 attempts max) | Use temperature=0 or a seed for determinism |
| Transaction rollback | Full state revert | Automatic on failure | None |
| Process crash | Uncommitted state discarded | None (restart transaction) | Restart from last commit |
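
The rollback and crash rows in the table can be pictured with a staging buffer: writes accumulate in memory and reach committed state only when the transaction finishes without error. The sketch below is an in-memory illustration under that assumption. The `Transaction` class and its `write` method are hypothetical names, not Neusnap's API, and a real system would need a durable, atomic commit step rather than a dictionary update.

```python
class Transaction:
    """Hypothetical staging transaction: writes are buffered, then applied together."""

    def __init__(self, store):
        self._store = store   # committed state (a plain dict in this sketch)
        self._staged = {}     # uncommitted writes

    def write(self, key, value):
        # Nothing touches committed state until the transaction exits cleanly.
        self._staged[key] = value

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._store.update(self._staged)  # commit all staged writes together
        self._staged = {}                     # on failure, staged writes are dropped
        return False                          # let any exception propagate


store = {"balance": 10}
try:
    with Transaction(store) as tx:
        tx.write("balance", 0)
        raise TimeoutError("provider timed out")  # simulated mid-transaction failure
except TimeoutError:
    pass
# The failed transaction left committed state untouched: store["balance"] is still 10.
```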
Guaranteed
- → State writes commit exactly once (no partial writes, no duplicate commits)
- → Tool results cached after first execution (prevents duplicate API calls)
- → Transient failures retried automatically (timeout, 429, 503)
- → Full rollback on failure (uncommitted state never persists)
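
To illustrate the tool-result caching guarantee, here is a minimal sketch of a per-transaction cache. The `ToolCache` class is hypothetical and does not reflect Neusnap's internals. One assumption is worth flagging: the attempt is recorded before the tool runs, so a call that fails partway is not re-executed. That is one way to keep the at-most-once property when the outcome of a failed call is unknown.

```python
import json

_PENDING = object()  # marks a call that started but has no recorded result


class ToolCache:
    """Hypothetical per-transaction cache: each (tool, args) pair executes at most once."""

    def __init__(self):
        self._results = {}

    def call(self, name, fn, **args):
        # Arguments must be JSON-serializable so they can form a stable cache key.
        key = (name, json.dumps(args, sort_keys=True))
        if key in self._results:
            result = self._results[key]
            if result is _PENDING:
                # The earlier attempt failed midway; its side effects are unknown.
                raise RuntimeError(f"outcome of {name} is unknown; not re-executing")
            return result
        self._results[key] = _PENDING  # record the attempt before executing
        result = fn(**args)
        self._results[key] = result
        return result
```

A second `cache.call("charge", charge, amount=5)` with the same arguments returns the stored result without invoking `charge` again.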
NOT Guaranteed
- × Idempotency of external APIs (Stripe, Twilio, etc. must implement idempotency keys)
- × Deterministic LLM outputs without temperature=0 or seed
- × Zero network latency or infinite uptime
- × Protection against schema changes in external systems
- × Recovery from catastrophic provider outages (all fallbacks down)
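
Because idempotency of external APIs is the developer's responsibility, one common approach is to derive a stable key per logical operation and send it with every attempt, for example in Stripe's `Idempotency-Key` request header. The sketch below shows only the key derivation. The function name and the choice of inputs (transaction ID, tool name, arguments) are assumptions for illustration.

```python
import hashlib
import json


def idempotency_key(transaction_id, tool_name, args):
    """Derive a deterministic key so every retry of one operation sends the same value.

    `args` must be JSON-serializable; sorting keys makes the result independent
    of dictionary ordering.
    """
    payload = json.dumps([transaction_id, tool_name, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_first_attempt = idempotency_key("txn-123", "charge", {"amount": 500, "currency": "usd"})
key_retry = idempotency_key("txn-123", "charge", {"currency": "usd", "amount": 500})
# Both attempts produce the same key, so the provider can deduplicate the charge.
```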
Retry Policy Example
{
  "retryPolicy": {
    "maxAttempts": 3,              // hard limit (not advisory)
    "backoff": "exponential",      // 100ms, 400ms, 1600ms
    "initialDelayMs": 100,
    "maxDelayMs": 5000,
    "jitter": true,                // prevents thundering herd
    "retryableErrors": [
      "timeout",                   // network timeout
      "rate_limit",                // 429 from provider
      "service_unavailable"        // 503 from provider
    ],
    "nonRetryableErrors": [
      "authentication_failed",     // 401 (fix API key)
      "invalid_request",           // 400 (fix request shape)
      "schema_validation_failed"   // fix data schema
    ]
  }
}
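
A policy like this could be applied by a retry loop along the following lines. This is a sketch under stated assumptions, not Neusnap's implementation: `ProviderError` and `call_with_retry` are hypothetical names, `maxAttempts` is assumed to count the initial call, the multiplier of 4 is inferred from the 100ms, 400ms, 1600ms comment, and jitter is left out so the delays stay deterministic.

```python
import time

POLICY = {
    "maxAttempts": 3,
    "initialDelayMs": 100,
    "maxDelayMs": 5000,
    "retryableErrors": {"timeout", "rate_limit", "service_unavailable"},
}
MULTIPLIER = 4  # assumption, inferred from the 100ms, 400ms, 1600ms progression


class ProviderError(Exception):
    """Hypothetical error type carrying one of the policy's error codes."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def call_with_retry(fn, policy=POLICY, sleep=time.sleep):
    """Call `fn`, retrying only errors listed as retryable, up to maxAttempts calls."""
    for attempt in range(1, policy["maxAttempts"] + 1):
        try:
            return fn()
        except ProviderError as err:
            retryable = err.code in policy["retryableErrors"]
            if not retryable or attempt == policy["maxAttempts"]:
                raise  # non-retryable error, or the hard attempt limit was reached
            delay_ms = min(policy["maxDelayMs"],
                           policy["initialDelayMs"] * MULTIPLIER ** (attempt - 1))
            sleep(delay_ms / 1000)  # time.sleep takes seconds
```

Under these assumptions, three attempts leave room for only two waits (100ms and 400ms). An `authentication_failed` error is raised on the first attempt without any wait.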