
Chaos Engineering

Inject failures into agent systems to test resilience before production. Pisama's chaos engineering system provides 6 experiment types with safety controls.

Enable

Chaos engineering is feature-flagged. Enable it with:

FEATURE_CHAOS_ENGINEERING=true

Experiment Types

| Experiment | What It Does |
|---|---|
| Latency | Injects random or fixed delays (100-5000ms) on tool calls |
| Error | Returns HTTP error codes (500, 502, 503, 504) |
| Tool Unavailable | Makes specific tools fail with configurable failure modes |
| Uncooperative Agent | Agents refuse tasks or return partial output |
| Context Truncation | Simulates context window exhaustion |
| Malformed Output | Truncates or corrupts agent responses |

Each experiment supports a probability field (0.0-1.0) to control how often it triggers.
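To make the probability field concrete, here is a minimal sketch of a per-call trigger check. It assumes each call is an independent random draw; the function name should_trigger is hypothetical and not part of Pisama's API.

```python
import random


def should_trigger(probability: float, rng: random.Random) -> bool:
    """Return True with the given probability (one independent draw per call)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # rng.random() is uniform on [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
    return rng.random() < probability


rng = random.Random(42)  # seeded so a run is repeatable
trials = 10_000
hits = sum(should_trigger(0.3, rng) for _ in range(trials))
observed_rate = hits / trials  # approaches 0.3 as the number of trials grows
```

Under this reading, a probability of 0.3 affects roughly three calls in ten over a long run, while any short window can see more or fewer.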

API

Create a Session

curl -X POST https://api.pisama.ai/api/v1/chaos/sessions \
  -H "Authorization: Bearer $PISAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "experiments": [
      {"type": "latency", "config": {"min_ms": 500, "max_ms": 2000}, "probability": 0.3},
      {"type": "error", "config": {"codes": [500, 503]}, "probability": 0.1}
    ],
    "targeting": {"agent_ids": ["my-agent"]},
    "safety": {"max_affected_requests": 100, "auto_abort_on_cascade": true}
  }'
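The same request can be prepared from Python. The sketch below uses only the standard library and mirrors the URL, headers, and body of the curl example; the helper name build_create_session_request is hypothetical, and the response schema is not shown in this document, so the sketch stops short of parsing one.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
SESSIONS_URL = "https://api.pisama.ai/api/v1/chaos/sessions"


def build_create_session_request(api_key: str, session: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a chaos session."""
    return urllib.request.Request(
        SESSIONS_URL,
        data=json.dumps(session).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


session = {
    "experiments": [
        {"type": "latency", "config": {"min_ms": 500, "max_ms": 2000}, "probability": 0.3},
        {"type": "error", "config": {"codes": [500, 503]}, "probability": 0.1},
    ],
    "targeting": {"agent_ids": ["my-agent"]},
    "safety": {"max_affected_requests": 100, "auto_abort_on_cascade": True},
}

request = build_create_session_request(os.environ.get("PISAMA_API_KEY", ""), session)
# To send it: urllib.request.urlopen(request). Left out here so the sketch has no side effects.
```

Separating request construction from sending keeps the payload easy to inspect or log before any chaos session actually exists.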

Start / Stop / Abort

# Start
curl -X POST .../chaos/sessions/{id}/start

# Stop gracefully
curl -X POST .../chaos/sessions/{id}/stop

# Emergency abort
curl -X POST .../chaos/sessions/{id}/abort

List Experiment Types

curl https://api.pisama.ai/api/v1/chaos/experiment-types

Safety System

Every chaos session has safety controls:

  • Blast radius levels: SINGLE_REQUEST, SINGLE_AGENT, SINGLE_TRACE, SINGLE_TENANT, MULTI_TENANT
  • Max affected requests: The session auto-aborts once this count is exceeded
  • Cascade detection: Counts cascading errors and aborts when the threshold is reached
  • Auto-abort: Stops the experiment automatically when a safety limit is breached

Configure safety per session:

{
  "safety": {
    "max_blast_radius": "SINGLE_TENANT",
    "max_affected_requests": 50,
    "cascade_threshold": 5,
    "auto_abort_on_cascade": true
  }
}
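As an illustration of how these limits could interact, here is a minimal sketch of a guard that tracks both counters. It is not Pisama's implementation: the SafetyGuard class is hypothetical, and it assumes a "cascading error" is simply whatever the caller reports as one.

```python
from dataclasses import dataclass


@dataclass
class SafetyGuard:
    """Hypothetical guard mirroring the fields of the "safety" config above."""
    max_affected_requests: int = 50
    cascade_threshold: int = 5
    auto_abort_on_cascade: bool = True
    affected: int = 0
    cascading_errors: int = 0
    aborted: bool = False
    abort_reason: str = ""

    def record_affected(self) -> None:
        """Count one request touched by an experiment; abort once the limit is exceeded."""
        self.affected += 1
        if self.affected > self.max_affected_requests:
            self._abort("max_affected_requests exceeded")

    def record_cascading_error(self) -> None:
        """Count one cascading error; abort when the threshold is reached."""
        self.cascading_errors += 1
        if self.auto_abort_on_cascade and self.cascading_errors >= self.cascade_threshold:
            self._abort("cascade_threshold hit")

    def _abort(self, reason: str) -> None:
        # Keep the first reason so later breaches do not overwrite the root cause.
        if not self.aborted:
            self.aborted = True
            self.abort_reason = reason


guard = SafetyGuard(max_affected_requests=50, cascade_threshold=5)
for _ in range(5):
    guard.record_cascading_error()
# guard.aborted is now True because five cascading errors reached the threshold.
```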

Targeting

Control which agents and tools are affected:

{
  "targeting": {
    "agent_ids": ["planner", "researcher"],
    "tool_names": ["search", "database_query"],
    "tenant_ids": ["test-tenant-123"],
    "percentage": 0.5
  }
}
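One plausible reading of these fields is sketched below. The matching rules are assumptions, since this document does not spell them out: an empty or missing list matches everything, all listed filters must match together, and percentage is the fraction of matching requests that get sampled. The function is_targeted is hypothetical.

```python
import random


def is_targeted(targeting: dict, *, agent_id: str, tool_name: str,
                tenant_id: str, rng: random.Random) -> bool:
    """Decide whether one request falls inside the targeting block (assumed semantics)."""

    def matches(key: str, value: str) -> bool:
        allowed = targeting.get(key) or []
        # Assumption: an empty or missing list means "no restriction".
        return not allowed or value in allowed

    if not (matches("agent_ids", agent_id)
            and matches("tool_names", tool_name)
            and matches("tenant_ids", tenant_id)):
        return False
    # Assumption: percentage samples a fraction of the requests that matched.
    return rng.random() < targeting.get("percentage", 1.0)


targeting = {
    "agent_ids": ["planner", "researcher"],
    "tool_names": ["search", "database_query"],
    "tenant_ids": ["test-tenant-123"],
    "percentage": 0.5,
}

rng = random.Random(0)
sampled = is_targeted(targeting, agent_id="planner", tool_name="search",
                      tenant_id="test-tenant-123", rng=rng)
```

With percentage set to 0.5, about half of the requests that pass all three filters would be affected under these assumptions.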

SDK-Level Chaos Hooks

For injecting failures directly into agent execution (not just managing sessions via API), use the SDK chaos hooks. These intercept pre_tool_use and post_tool_use hooks in the Claude Agent SDK.

from pisama_agent_sdk import configure_bridge
from pisama_agent_sdk.chaos import (
    ChaosConfig,
    ToolFailure,
    LatencyInjection,
    ErrorInjection,
    OutputCorruption,
    ContextTruncation,
)

configure_bridge(
    chaos=ChaosConfig(
        experiments=[
            ToolFailure(tools=["database_query"], probability=0.3),
            LatencyInjection(min_ms=500, max_ms=3000, probability=0.2),
            OutputCorruption(tools=["search"], corruption="truncate", probability=0.1),
        ],
        safety_max_affected=50,  # auto-disable after 50 affected calls
    )
)

SDK Experiment Types

| Experiment | Phase | Effect |
|---|---|---|
| ToolFailure | pre | Blocks tool call entirely |
| LatencyInjection | pre | Adds delay before tool executes |
| ErrorInjection | pre | Returns error response |
| OutputCorruption | post | Truncates, empties, or breaks tool output |
| ContextTruncation | pre | Truncates string values in tool input |

Each experiment supports:

  • tools: List of tool names to target (empty = all tools)
  • agents: List of agent names to target (empty = all agents)
  • probability: 0.0-1.0 trigger probability
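The difference between the pre and post phases can be pictured with a small stand-alone wrapper. This is a conceptual sketch only: run_tool_with_chaos and its parameters are hypothetical, it does not use the Pisama SDK or the Claude Agent SDK, and it leaves out probability and targeting for clarity.

```python
import time
from typing import Callable, Optional


def run_tool_with_chaos(
    tool: Callable[[str], str],
    tool_input: str,
    *,
    block: bool = False,
    delay_s: float = 0.0,
    truncate_output_to: Optional[int] = None,
) -> dict:
    """Run a tool with illustrative pre-phase and post-phase interventions."""
    # Pre phase: runs before the tool, so it can block or delay the call.
    if block:
        return {"ok": False, "error": "blocked by chaos experiment"}
    if delay_s > 0:
        time.sleep(delay_s)

    output = tool(tool_input)

    # Post phase: runs after the tool, so it can only alter the output.
    if truncate_output_to is not None:
        output = output[:truncate_output_to]
    return {"ok": True, "output": output}


calls = []


def search(query: str) -> str:
    calls.append(query)  # record real invocations so blocking is observable
    return f"10 results for {query}"


blocked = run_tool_with_chaos(search, "agents", block=True)
corrupted = run_tool_with_chaos(search, "agents", truncate_output_to=5)
```

A blocked call never reaches the tool, whereas a post-phase corruption still pays the full cost of the tool call and only damages what the agent sees afterwards.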

Safety

ChaosConfig.safety_max_affected auto-disables all experiments once the limit is reached. Call reset() on the ChaosConfig instance to re-enable them.
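The disable-then-reset behaviour can be sketched with a plain counter. The AffectedCallLimiter class below is hypothetical and only models the behaviour described here; it assumes injection stops once the limit has been reached and that a reset clears the count.

```python
class AffectedCallLimiter:
    """Hypothetical model of a max-affected limit with a manual reset."""

    def __init__(self, max_affected: int) -> None:
        self.max_affected = max_affected
        self.affected = 0

    @property
    def enabled(self) -> bool:
        # Experiments stay active only while the budget is not used up.
        return self.affected < self.max_affected

    def try_inject(self) -> bool:
        """Consume one unit of budget; return False once experiments are disabled."""
        if not self.enabled:
            return False
        self.affected += 1
        return True

    def reset(self) -> None:
        """Clear the counter so experiments can fire again."""
        self.affected = 0


limiter = AffectedCallLimiter(max_affected=3)
results = [limiter.try_inject() for _ in range(5)]  # three injections, then disabled
limiter.reset()
after_reset = limiter.try_inject()
```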

Combining with Detection

Run chaos experiments while Pisama monitors for failures. This validates that your agents handle degraded conditions gracefully:

  1. Configure chaos hooks on the SDK bridge
  2. Run your agent workflow; targeted tools will randomly fail or be delayed
  3. Pisama detects how the agent handles the degraded conditions
  4. Check detections for resilience patterns (does it retry? fall back? crash?)
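As a reference point for step 4, the sketch below shows what a retry-then-fallback pattern looks like when a tool fails at random. Everything in it is hypothetical (flaky_tool, call_with_retry, ToolError); it simulates the failure locally instead of using Pisama's hooks.

```python
import random
from typing import Callable, Optional, Tuple


class ToolError(Exception):
    """Raised by the simulated tool when a failure is injected."""


def flaky_tool(query: str, rng: random.Random, failure_probability: float) -> str:
    """Simulated tool that fails with the given probability."""
    if rng.random() < failure_probability:
        raise ToolError(f"injected failure for {query!r}")
    return f"results for {query}"


def call_with_retry(tool_call: Callable[[], str], *, attempts: int = 3,
                    fallback: Optional[str] = None) -> Tuple[Optional[str], int]:
    """Retry a failing call, then fall back. Returns (result, attempts used)."""
    for attempt in range(1, attempts + 1):
        try:
            return tool_call(), attempt
        except ToolError:
            continue  # resilience pattern 1: retry
    return fallback, attempts  # resilience pattern 2: fall back instead of crashing


rng = random.Random(7)
result, attempts_used = call_with_retry(
    lambda: flaky_tool("status", rng, failure_probability=0.3),
    attempts=3,
    fallback="cached answer",
)
```

An agent built this way should show up in detections as retrying and degrading gracefully, while one without the wrapper would surface the injected ToolError directly.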