Failure Semantics

Explicit delivery guarantees and retry behavior for state writes, tool calls, and LLM invocations.

Problem

Production AI systems fail non-deterministically: network timeouts (3–5% of requests in distributed environments), provider rate limits (429 errors), and process crashes mid-transaction. Without explicit failure semantics, these failures produce duplicate charges, partial state corruption, and repeated non-idempotent side effects. Debugging costs 4–8 hours per incident.

Neusnap Behavior

State writes commit exactly once, inside a transactional boundary. Tool calls execute at most once per transaction; the result is cached after the first execution. LLM calls are at-least-once: they are retried on timeout. Retries use exponential backoff with jitter, starting at 100ms and capped at a 5s maximum delay. Non-transient errors (401, 400, schema validation failure) are not retried. Uncommitted state is discarded on process crash.
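The retry rules above can be sketched as a small loop. This is a minimal illustration, not Neusnap's implementation: `TransientError`, `PermanentError`, and `call_with_retry` are hypothetical names, the growth factor of 4 is inferred from the 100ms, 400ms, 1600ms sequence in the retry policy example, and "full jitter" (a random fraction of the capped delay) is one common jitter strategy chosen here as an assumption.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeout, 429, 503)."""


class PermanentError(Exception):
    """Hypothetical marker for non-retryable failures (401, 400, schema validation)."""


def call_with_retry(fn, max_attempts=3, initial_delay=0.1, max_delay=5.0,
                    multiplier=4, sleep=time.sleep, rng=random.random):
    """Call fn() until it succeeds, hits a permanent error, or runs out of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            # Retrying cannot fix a bad API key or a malformed request.
            raise
        except TransientError:
            if attempt == max_attempts:
                # Attempt budget exhausted: surface the last failure.
                raise
            # Exponential growth, capped at max_delay (multiplier is an assumption).
            ceiling = min(max_delay, initial_delay * multiplier ** (attempt - 1))
            # Full jitter: wait a random fraction of the capped delay.
            sleep(ceiling * rng())
```

Injecting `sleep` and `rng` keeps the loop testable without real waiting. Because the operation may run more than once, this pattern only suits at-least-once operations such as LLM calls.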

Execution Guarantees

| Operation            | Execution Guarantee         | Retry Behavior                     | Developer Responsibility            |
|----------------------|-----------------------------|------------------------------------|-------------------------------------|
| State writes         | Exactly-once                | Cached, no re-execution            | Schema validation                   |
| Tool calls           | At-most-once                | Cached, no duplicate execution     | External API idempotency            |
| LLM calls            | At-least-once               | Retry on timeout (3 attempts max)  | Use temp=0 or seed for determinism  |
| Transaction rollback | Full state revert           | Automatic on failure               | None                                |
| Process crash        | Uncommitted state discarded | None (restart transaction)         | Restart from last commit            |
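To make the at-most-once row for tool calls concrete, here is a rough sketch of per-transaction result caching. The `Transaction` class and `call_tool` method are hypothetical and are not Neusnap's API. The sketch records the outcome of the first execution, including a failure, so a replay inside the same transaction never runs the tool a second time.

```python
import json


class Transaction:
    """Hypothetical transaction that remembers the outcome of each tool call."""

    def __init__(self):
        self._outcomes = {}

    def call_tool(self, name, fn, **kwargs):
        # Key on the tool name plus canonicalized (sorted) arguments.
        key = (name, json.dumps(kwargs, sort_keys=True))
        if key not in self._outcomes:
            # First execution in this transaction: run once and record the outcome.
            try:
                self._outcomes[key] = ("ok", fn(**kwargs))
            except Exception as exc:
                self._outcomes[key] = ("error", exc)
        status, value = self._outcomes[key]
        if status == "error":
            # Replays see the original failure instead of re-running the tool.
            raise value
        return value
```

The cache is scoped to one transaction object, so it does not protect against a duplicate call issued from a new transaction; that case is what external idempotency keys are for.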

Guaranteed

  • State writes commit exactly once (no partial writes, no duplicate commits)
  • Tool results cached after first execution (prevents duplicate API calls)
  • Transient failures retried automatically (timeout, 429, 503)
  • Full rollback on failure (uncommitted state never persists)
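The commit and rollback guarantees can be pictured with an in-memory sketch. `StateStore` and its `transaction` method are hypothetical names, and this version only models the behavior: a real crash-safe commit would need durable storage, which is outside what this example shows.

```python
import copy
from contextlib import contextmanager


class StateStore:
    """Hypothetical store: writes go to a staged copy and commit all at once."""

    def __init__(self, initial=None):
        self.committed = dict(initial or {})

    @contextmanager
    def transaction(self):
        # Work on a private copy so partial writes are never visible.
        staged = copy.deepcopy(self.committed)
        yield staged
        # Reached only if the with-block finished without raising:
        # swap in the staged copy as the new committed state.
        self.committed = staged


store = StateStore({"balance": 10})

with store.transaction() as state:
    state["balance"] = 7          # committed when the block exits cleanly

try:
    with store.transaction() as state:
        state["balance"] = 0
        raise RuntimeError("tool call failed")   # staged copy is dropped
except RuntimeError:
    pass

print(store.committed)  # {'balance': 7}
```

If the block raises, the exception propagates through the `yield` and the assignment to `self.committed` never runs, so the uncommitted copy is simply discarded.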

NOT Guaranteed

  • Idempotency of external APIs (calls to services such as Stripe or Twilio must carry idempotency keys)
  • Deterministic LLM outputs without temperature=0 or seed
  • Zero network latency or infinite uptime
  • Protection against schema changes in external systems
  • Recovery from catastrophic provider outages (all fallbacks down)
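Since idempotency of external calls is left to the developer, one possible approach is to derive a stable key from the transaction and the call arguments, so a replayed call presents the same key to the provider. The function name `idempotency_key` and the choice of inputs are assumptions made for illustration; how a key is passed (header name, parameter, retention window) varies by provider and should be checked in that provider's documentation.

```python
import hashlib
import json


def idempotency_key(transaction_id, tool_name, args):
    """Derive a deterministic key for one logical external call (hypothetical helper)."""
    # Canonical JSON: sorted keys and fixed separators, so argument order
    # does not change the resulting key.
    payload = json.dumps(
        {"tx": transaction_id, "tool": tool_name, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = idempotency_key("tx-123", "create_charge", {"amount": 500, "currency": "usd"})
```

Including the transaction ID means a deliberate second charge in a new transaction gets a different key, while a retry of the same transaction reuses the original one.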

Retry Policy Example

{
  "retryPolicy": {
    "maxAttempts": 3,                // hard limit (not advisory)
    "backoff": "exponential",        // 100ms, 400ms, 1600ms
    "initialDelayMs": 100,
    "maxDelayMs": 5000,
    "jitter": true,                  // prevents thundering herd
    "retryableErrors": [
      "timeout",                     // network timeout
      "rate_limit",                  // 429 from provider
      "service_unavailable"          // 503 from provider
    ],
    "nonRetryableErrors": [
      "authentication_failed",       // 401 (fix API key)
      "invalid_request",             // 400 (fix request shape)
      "schema_validation_failed"     // fix data schema
    ]
  }
}
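As a worked check of the delay values in the policy above, the sketch below computes the pre-jitter delays. The policy does not state a growth factor; a multiplier of 4 is assumed here because it reproduces the 100ms, 400ms, 1600ms sequence in the `backoff` comment. The function name `backoff_schedule` is hypothetical.

```python
def backoff_schedule(policy, retries, multiplier=4):
    """Return the pre-jitter delay in ms before each retry, capped at maxDelayMs."""
    delays = []
    for n in range(retries):
        raw = policy["initialDelayMs"] * multiplier ** n
        delays.append(min(raw, policy["maxDelayMs"]))
    return delays


policy = {"initialDelayMs": 100, "maxDelayMs": 5000}
print(backoff_schedule(policy, 4))  # [100, 400, 1600, 5000]
```

The fourth value would be 6400ms uncapped, so `maxDelayMs` clamps it to 5000ms. With `maxAttempts` set to 3, at most two of these waits would occur if the limit counts the initial attempt.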