Chaos Engineering¶

Inject failures into agent systems to test their resilience before they reach production. Pisama's chaos engineering system provides six experiment types, each run under per-session safety controls.

Enable¶

Chaos engineering is feature-flagged. Enable it with:

FEATURE_CHAOS_ENGINEERING=true

Experiment Types¶

Experiment	What It Does
Latency	Injects random or fixed delays (100-5000ms) on tool calls
Error	Returns HTTP error codes (500, 502, 503, 504)
Tool Unavailable	Makes specific tools fail with configurable failure modes
Uncooperative Agent	Agents refuse tasks or return partial output
Context Truncation	Simulates context window exhaustion
Malformed Output	Truncates or corrupts agent responses

Each experiment supports a probability field (0.0-1.0) to control how often it triggers.
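To make the probability field concrete, the following is a minimal, standalone sketch of how a latency experiment might decide whether to fire and how long to delay. It is not Pisama's implementation; the function name `injected_delay_ms` and the choice of a uniform distribution between `min_ms` and `max_ms` are assumptions made for illustration.

```python
import random


def injected_delay_ms(probability: float, min_ms: float, max_ms: float,
                      rng: random.Random) -> float:
    """Return the delay to inject for one tool call, or 0.0 if the experiment does not fire."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # rng.random() is in [0.0, 1.0), so probability=0.0 never fires and 1.0 always fires.
    if rng.random() >= probability:
        return 0.0
    # Assumption: delays are drawn uniformly from the configured range.
    return rng.uniform(min_ms, max_ms)


rng = random.Random(42)  # seeded so a test run is repeatable
delays = [injected_delay_ms(0.3, 500, 2000, rng) for _ in range(1000)]
affected = sum(1 for d in delays if d > 0)
print(f"{affected} of {len(delays)} calls were delayed")
```

With a probability of 0.3, roughly three calls in ten receive a delay; the rest pass through untouched.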

API¶

Create a Session¶

curl -X POST https://api.pisama.ai/api/v1/chaos/sessions \
  -H "Authorization: Bearer $PISAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "experiments": [
      {"type": "latency", "config": {"min_ms": 500, "max_ms": 2000}, "probability": 0.3},
      {"type": "error", "config": {"codes": [500, 503]}, "probability": 0.1}
    ],
    "targeting": {"agent_ids": ["my-agent"]},
    "safety": {"max_affected_requests": 100, "auto_abort_on_cascade": true}
  }'
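The same request can be built from Python. The sketch below uses only the standard library and mirrors the curl call above: same URL, headers, and JSON body. It builds the request without sending it, because the shape of the response is not documented here. The helper name `build_create_session_request` and the fallback key `"test-key"` are illustrative assumptions.

```python
import json
import os
import urllib.request

SESSIONS_URL = "https://api.pisama.ai/api/v1/chaos/sessions"


def build_create_session_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a chaos session."""
    payload = {
        "experiments": [
            {"type": "latency", "config": {"min_ms": 500, "max_ms": 2000}, "probability": 0.3},
            {"type": "error", "config": {"codes": [500, 503]}, "probability": 0.1},
        ],
        "targeting": {"agent_ids": ["my-agent"]},
        "safety": {"max_affected_requests": 100, "auto_abort_on_cascade": True},
    }
    return urllib.request.Request(
        SESSIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "test-key" is a placeholder used only when PISAMA_API_KEY is not set.
request = build_create_session_request(os.environ.get("PISAMA_API_KEY", "test-key"))
# To send it: urllib.request.urlopen(request)
```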

Start / Stop / Abort¶

# Start
curl -X POST .../chaos/sessions/{id}/start

# Stop gracefully
curl -X POST .../chaos/sessions/{id}/stop

# Emergency abort
curl -X POST .../chaos/sessions/{id}/abort

List Experiment Types¶

curl https://api.pisama.ai/api/v1/chaos/experiment-types

Safety System¶

Every chaos session has safety controls:

Blast radius levels: SINGLE_REQUEST, SINGLE_AGENT, SINGLE_TRACE, SINGLE_TENANT, MULTI_TENANT
Max affected requests: Auto-abort if exceeded
Cascade detection: Counts cascading errors, aborts if threshold hit
Auto-abort: Stops experiment automatically when safety limits are breached

Configure safety per session:

{
  "safety": {
    "max_blast_radius": "SINGLE_TENANT",
    "max_affected_requests": 50,
    "cascade_threshold": 5,
    "auto_abort_on_cascade": true
  }
}
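The interaction between these limits can be sketched as a small counter object. This is an illustration of the rules described above, not Pisama's code. The class name `SafetyMonitor` is hypothetical, and two interpretations are assumed: "exceeded" means the affected count goes strictly above `max_affected_requests`, and "threshold hit" means the cascading error count reaches `cascade_threshold`.

```python
from dataclasses import dataclass


@dataclass
class SafetyMonitor:
    """Hypothetical model of the per-session safety limits."""
    max_affected_requests: int = 50
    cascade_threshold: int = 5
    auto_abort_on_cascade: bool = True
    affected: int = 0
    cascading_errors: int = 0
    aborted: bool = False

    def record_affected(self) -> None:
        """Count one request touched by an experiment."""
        self.affected += 1
        # Assumption: abort once the count goes above the limit.
        if self.affected > self.max_affected_requests:
            self.aborted = True

    def record_cascading_error(self) -> None:
        """Count one error attributed to a cascade."""
        self.cascading_errors += 1
        # Assumption: abort as soon as the count reaches the threshold.
        if self.auto_abort_on_cascade and self.cascading_errors >= self.cascade_threshold:
            self.aborted = True


monitor = SafetyMonitor(max_affected_requests=50, cascade_threshold=5)
for _ in range(5):
    monitor.record_cascading_error()
print("aborted:", monitor.aborted)
```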

Targeting¶

Control which agents and tools are affected:

{
  "targeting": {
    "agent_ids": ["planner", "researcher"],
    "tool_names": ["search", "database_query"],
    "tenant_ids": ["test-tenant-123"],
    "percentage": 0.5
  }
}
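One plausible reading of how these filters combine is sketched below. Several points are assumptions rather than documented behavior: that every listed filter must match (logical AND), that a missing or empty list matches everything, and that `percentage` is a fraction between 0.0 and 1.0 of matching calls to sample. The function name `is_targeted` and the call fields `agent_id`, `tool_name`, and `tenant_id` are hypothetical.

```python
import random


def is_targeted(call: dict, targeting: dict, rng: random.Random) -> bool:
    """Decide whether one tool call falls inside the targeting rules."""
    filters = (
        ("agent_ids", "agent_id"),
        ("tool_names", "tool_name"),
        ("tenant_ids", "tenant_id"),
    )
    for list_field, call_field in filters:
        allowed = targeting.get(list_field)
        # Assumption: an absent or empty list places no restriction.
        if allowed and call.get(call_field) not in allowed:
            return False
    # Assumption: percentage is a 0.0-1.0 fraction of matching calls.
    return rng.random() < targeting.get("percentage", 1.0)


targeting = {
    "agent_ids": ["planner", "researcher"],
    "tool_names": ["search", "database_query"],
    "tenant_ids": ["test-tenant-123"],
    "percentage": 0.5,
}
call = {"agent_id": "planner", "tool_name": "search", "tenant_id": "test-tenant-123"}
rng = random.Random(7)
hits = sum(is_targeted(call, targeting, rng) for _ in range(1000))
print(f"{hits} of 1000 matching calls were selected")
```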

SDK-Level Chaos Hooks¶

To inject failures directly into agent execution, rather than only managing sessions through the API, use the SDK chaos hooks. They run in the pre_tool_use and post_tool_use hooks of the Claude Agent SDK.

from pisama_agent_sdk import configure_bridge
from pisama_agent_sdk.chaos import (
    ChaosConfig,
    ToolFailure,
    LatencyInjection,
    ErrorInjection,
    OutputCorruption,
    ContextTruncation,
)

configure_bridge(
    chaos=ChaosConfig(
        experiments=[
            ToolFailure(tools=["database_query"], probability=0.3),
            LatencyInjection(min_ms=500, max_ms=3000, probability=0.2),
            OutputCorruption(tools=["search"], corruption="truncate", probability=0.1),
        ],
        safety_max_affected=50,  # auto-disable after 50 affected calls
    )
)

SDK Experiment Types¶

Experiment	Phase	Effect
`ToolFailure`	pre	Blocks tool call entirely
`LatencyInjection`	pre	Adds delay before tool executes
`ErrorInjection`	pre	Returns error response
`OutputCorruption`	post	Truncates, empties, or breaks tool output
`ContextTruncation`	pre	Truncates string values in tool input

Each experiment supports:

tools: List of tool names to target (empty = all tools)
agents: List of agent names to target (empty = all agents)
probability: 0.0-1.0 trigger probability
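The difference between the pre and post phases can be illustrated without the SDK. In the sketch below, a pre-phase function rewrites the tool input before the tool runs (similar in spirit to ContextTruncation), and a post-phase function rewrites the output afterwards (similar in spirit to OutputCorruption with truncation). All names here, including `call_with_chaos` and `echo_tool`, are hypothetical stand-ins and not part of pisama_agent_sdk.

```python
from typing import Any, Callable, Optional


def truncate_input_strings(tool_input: dict, max_chars: int) -> dict:
    """Pre phase: shorten every string value in the tool input."""
    return {
        key: (value[:max_chars] if isinstance(value, str) else value)
        for key, value in tool_input.items()
    }


def truncate_output(output: str, max_chars: int) -> str:
    """Post phase: cut the tool output short."""
    return output[:max_chars]


def call_with_chaos(
    tool: Callable[[dict], Any],
    tool_input: dict,
    pre: Optional[Callable[[dict], dict]] = None,
    post: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Run a tool with an optional pre-phase and post-phase mutation."""
    if pre is not None:
        tool_input = pre(tool_input)   # runs before the tool sees its input
    output = tool(tool_input)
    if post is not None:
        output = post(output)          # runs after the tool has produced output
    return output


def echo_tool(tool_input: dict) -> str:
    """Stand-in tool that reports the query it received."""
    return f"searched for {tool_input['query']}"


result = call_with_chaos(
    echo_tool,
    {"query": "resilience patterns", "limit": 5},
    pre=lambda i: truncate_input_strings(i, 10),
    post=lambda o: truncate_output(o, 20),
)
print(result)
```

Non-string values such as `limit` pass through the pre phase unchanged, which matches the description of ContextTruncation as affecting string values only.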

Safety¶

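ChaosConfig.safety_max_affected sets a cap on affected calls: once that many calls have been affected, all experiments are disabled automatically. To re-enable them, keep a reference to the ChaosConfig instance (for example, config) and call config.reset().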

Combining with Detection¶

Run chaos experiments while Pisama monitors for failures. This validates that your agents handle degraded conditions gracefully:

1. Configure chaos hooks on the SDK bridge.
2. Run your agent workflow. Tools will randomly fail or respond slowly.
3. Pisama detects how the agent handles the degraded conditions.
4. Check the detections for resilience patterns: does the agent retry, fall back, or crash?
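The retry and fallback patterns mentioned above can be explored locally with a small simulation before wiring in the real SDK. The sketch below is self-contained and does not call Pisama. `flaky_tool`, `resilient_step`, and `summarize` are hypothetical names, and the outcome labels are chosen only for this example.

```python
import random
from collections import Counter


class ToolError(Exception):
    """Raised when the simulated tool fails."""


def flaky_tool(query: str, rng: random.Random, failure_probability: float) -> str:
    """Simulated tool that fails with the given probability."""
    if rng.random() < failure_probability:
        raise ToolError("injected failure")
    return f"result for {query}"


def resilient_step(query: str, rng: random.Random, failure_probability: float,
                   max_attempts: int = 3) -> str:
    """Call the tool with retries and report how the step ended."""
    for attempt in range(1, max_attempts + 1):
        try:
            flaky_tool(query, rng, failure_probability)
            return "ok" if attempt == 1 else "retried"
        except ToolError:
            continue  # try again until attempts run out
    return "fallback"  # all attempts failed, so use a degraded answer


def summarize(runs: int, failure_probability: float, seed: int = 0) -> Counter:
    """Run many steps and count each outcome."""
    rng = random.Random(seed)
    return Counter(
        resilient_step(f"q{i}", rng, failure_probability) for i in range(runs)
    )


print(summarize(runs=100, failure_probability=0.3))
```

Raising `failure_probability` shifts outcomes from "ok" toward "retried" and "fallback", which is the kind of behavior change the detections are meant to surface.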