Failure Semantics

Explicit delivery guarantees and retry behavior for state writes, tool calls, and LLM invocations.

Problem

Production AI systems fail non-deterministically: network timeouts (3–5% of requests in distributed environments), provider rate limits (429 errors), and process crashes mid-transaction. Without explicit failure semantics, these lead to duplicate charges, partial state corruption, and repeated non-idempotent side effects. Cost of debugging: 4–8 hours per incident.

Neusnap Behavior

State writes: exactly-once commit via transactional boundary. Tool calls: at-most-once per transaction (results cached after first execution). LLM calls: at-least-once (retried on timeouts and other transient failures). Retries use exponential backoff with jitter (100ms initial delay, 5s maximum delay). Non-transient errors (401, 400, schema validation failure) are not retried. Uncommitted state is discarded on process crash.
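The commit and discard behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `StateStore` with a `transaction()` context manager; these names are illustrative and are not Neusnap's actual API. Writes go to a staging copy and become visible only if the block finishes without an exception.

```python
from contextlib import contextmanager
from copy import deepcopy


class StateStore:
    """Hypothetical in-memory store used only for illustration."""

    def __init__(self):
        self.committed = {}

    @contextmanager
    def transaction(self):
        # Writes go to a private staging copy, never to committed state.
        staged = deepcopy(self.committed)
        try:
            yield staged
        except Exception:
            # Failure: the staging copy is dropped, committed state is untouched.
            raise
        else:
            # Success: the staged state replaces committed state in one assignment.
            self.committed = staged


store = StateStore()

with store.transaction() as tx:
    tx["order_status"] = "paid"

try:
    with store.transaction() as tx:
        tx["order_status"] = "refunded"
        raise TimeoutError("tool call timed out")
except TimeoutError:
    pass

print(store.committed)  # {'order_status': 'paid'}
```

The second transaction fails after its write, so the committed value stays `paid`. A real implementation would need durable storage for the same property to hold across a process crash.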

Execution Guarantees

Operation	Execution Guarantee	Retry Behavior	Developer Responsibility
State writes	Exactly-once	Cached, no re-execution	Schema validation
Tool calls	At-most-once	Cached, no duplicate execution	External API idempotency
LLM calls	At-least-once	Retry on transient failure: timeout, 429, 503 (3 attempts max)	Use temperature=0 or seed for determinism
Transaction rollback	Full state revert	Automatic on failure	None
Process crash	Uncommitted state discarded	None (restart transaction)	Restart from last commit

Guaranteed

→ State writes commit exactly once (no partial writes, no duplicate commits)
→ Tool results cached after first execution (prevents duplicate API calls)
→ Transient failures retried automatically (timeout, 429, 503)
→ Full rollback on failure (uncommitted state never persists)
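One way to picture the tool-result caching guarantee is a per-transaction cache keyed by tool name and arguments. The sketch below is an assumption about how such a cache could work, not a description of Neusnap internals; `ToolCache`, `call`, and `send_email` are hypothetical names. It covers only the cached-result path: a call that raises is not cached here.

```python
import hashlib
import json


class ToolCache:
    """Hypothetical per-transaction cache keyed by tool name and arguments."""

    def __init__(self):
        self._results = {}

    def call(self, name, fn, **kwargs):
        # Canonical JSON so that argument order does not change the key.
        payload = json.dumps({"tool": name, "args": kwargs}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._results:
            # First call with these arguments: execute and remember the result.
            self._results[key] = fn(**kwargs)
        return self._results[key]


calls = []


def send_email(to, subject):
    # Stand-in for a side-effecting tool; records each real execution.
    calls.append((to, subject))
    return {"status": "sent", "to": to}


cache = ToolCache()
first = cache.call("send_email", send_email, to="a@example.com", subject="Hi")
second = cache.call("send_email", send_email, subject="Hi", to="a@example.com")
print(len(calls))  # 1
```

The second call returns the stored result without running `send_email` again, which is what prevents the duplicate API call.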

NOT Guaranteed

× Idempotency of external APIs (Stripe, Twilio, etc. must implement idempotency keys)
× Deterministic LLM outputs without temperature=0 or seed
× Zero network latency or infinite uptime
× Protection against schema changes in external systems
× Recovery from catastrophic provider outages (all fallbacks down)
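Because idempotency of external APIs is the developer's responsibility, a common approach is to send the same idempotency key on every attempt of the same logical operation. The sketch below derives a deterministic key from a transaction ID and a step name. `idempotency_key`, `build_headers`, the namespace value, and the ID format are all hypothetical. Stripe accepts such a key in the `Idempotency-Key` request header; other providers may use a different mechanism.

```python
import uuid

# Arbitrary fixed namespace for this sketch; any constant UUID works.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def idempotency_key(transaction_id: str, step: str) -> str:
    """Map the same transaction and step to the same key on every attempt."""
    return str(uuid.uuid5(NAMESPACE, f"{transaction_id}:{step}"))


def build_headers(transaction_id: str, step: str) -> dict:
    # Header name follows Stripe's convention; adjust per provider.
    return {"Idempotency-Key": idempotency_key(transaction_id, step)}


first_attempt = build_headers("tx-1001", "charge_card")
retry_attempt = build_headers("tx-1001", "charge_card")
other_step = build_headers("tx-1001", "send_receipt")
print(first_attempt == retry_attempt)  # True
```

A retried request then carries the same key as the original, so a provider that honors idempotency keys can recognize the duplicate instead of charging twice.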

Retry Policy Example

{
  "retryPolicy": {
    "maxAttempts": 3,                // hard limit (not advisory)
    "backoff": "exponential",        // 100ms, 400ms, 1600ms
    "initialDelayMs": 100,
    "maxDelayMs": 5000,
    "jitter": true,                  // prevents thundering herd
    "retryableErrors": [
      "timeout",                     // network timeout
      "rate_limit",                  // 429 from provider
      "service_unavailable"          // 503 from provider
    ],
    "nonRetryableErrors": [
      "authentication_failed",       // 401 (fix API key)
      "invalid_request",             // 400 (fix request shape)
      "schema_validation_failed"     // fix data schema
    ]
  }
}
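The policy above could be interpreted by a retry loop along the following lines. This is a sketch under stated assumptions, not Neusnap's implementation: the `multiplier` of 4 is inferred from the `100ms, 400ms, 1600ms` comment, the jitter strategy (a random fraction of the capped delay) is one common choice, and `CallError`, `base_delay_ms`, and `call_with_retry` are hypothetical names.

```python
import random
import time

POLICY = {
    "maxAttempts": 3,
    "initialDelayMs": 100,
    "maxDelayMs": 5000,
    "multiplier": 4,  # assumption, inferred from the 100/400/1600 comment
    "retryableErrors": {"timeout", "rate_limit", "service_unavailable"},
}


class CallError(Exception):
    """Hypothetical error type carrying one of the policy's error codes."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def base_delay_ms(retry_index, policy=POLICY):
    """Delay before the given retry (0-based), before jitter, capped."""
    raw = policy["initialDelayMs"] * policy["multiplier"] ** retry_index
    return min(raw, policy["maxDelayMs"])


def call_with_retry(fn, policy=POLICY, sleep=time.sleep, rng=random.random):
    for attempt in range(1, policy["maxAttempts"] + 1):
        try:
            return fn()
        except CallError as err:
            last = attempt == policy["maxAttempts"]
            # Non-retryable codes and the final attempt propagate immediately.
            if err.code not in policy["retryableErrors"] or last:
                raise
            # Jitter: wait a random fraction of the capped delay, in seconds.
            sleep(base_delay_ms(attempt - 1, policy) * rng() / 1000)


print([base_delay_ms(i) for i in range(4)])  # [100, 400, 1600, 5000]
```

With `maxAttempts` counted as total attempts, at most two delays are actually slept between the three tries; the fourth value only shows the `maxDelayMs` cap taking effect.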