Chaos Engineering¶
Inject failures into agent systems to test resilience before production. Pisama's chaos engineering system provides 6 experiment types with safety controls.
Enable¶
Chaos engineering is feature-flagged. Enable it with:
Experiment Types¶
| Experiment | What It Does |
|---|---|
| Latency | Injects random or fixed delays (100-5000ms) on tool calls |
| Error | Returns HTTP error codes (500, 502, 503, 504) |
| Tool Unavailable | Makes specific tools fail with configurable failure modes |
| Uncooperative Agent | Agents refuse tasks or return partial output |
| Context Truncation | Simulates context window exhaustion |
| Malformed Output | Truncates or corrupts agent responses |
Each experiment supports a probability field (0.0-1.0) to control how often it triggers.
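As a rough illustration of what a per-experiment probability means, the sketch below gates an injection on a uniform random draw. This is not Pisama's implementation; `should_trigger` is a hypothetical helper, and the only assumption is that the field behaves like an independent per-call probability.

```python
import random


def should_trigger(probability: float, rng: random.Random) -> bool:
    """Return True with the given probability (0.0 = never, 1.0 = always)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # rng.random() is uniform on [0.0, 1.0), so 0.0 never fires and 1.0 always does.
    return rng.random() < probability


rng = random.Random(42)  # seeded so repeated runs behave the same way
fired = sum(should_trigger(0.3, rng) for _ in range(10_000))
print(f"triggered on {fired / 10_000:.1%} of calls")
```

With a probability of 0.3, roughly three in ten calls are affected over a long run; short runs can deviate noticeably from that ratio.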
API¶
Create a Session¶
curl -X POST https://api.pisama.ai/api/v1/chaos/sessions \
  -H "Authorization: Bearer $PISAMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "experiments": [
      {"type": "latency", "config": {"min_ms": 500, "max_ms": 2000}, "probability": 0.3},
      {"type": "error", "config": {"codes": [500, 503]}, "probability": 0.1}
    ],
    "targeting": {"agent_ids": ["my-agent"]},
    "safety": {"max_affected_requests": 100, "auto_abort_on_cascade": true}
  }'
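The same request can be prepared from Python. The sketch below uses only the standard library and mirrors the curl call above; `build_create_session_request` is a hypothetical helper name, and the shape of the response is not documented here, so the send step is left commented out.

```python
import json
import os
import urllib.request

API_URL = "https://api.pisama.ai/api/v1/chaos/sessions"


def build_create_session_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a chaos session."""
    payload = {
        "experiments": [
            {"type": "latency", "config": {"min_ms": 500, "max_ms": 2000}, "probability": 0.3},
            {"type": "error", "config": {"codes": [500, 503]}, "probability": 0.1},
        ],
        "targeting": {"agent_ids": ["my-agent"]},
        "safety": {"max_affected_requests": 100, "auto_abort_on_cascade": True},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_session_request(os.environ.get("PISAMA_API_KEY", "test-key"))
# Sending is commented out so the sketch runs without network access.
# The response format is an assumption (JSON) and is not confirmed by this page:
# with urllib.request.urlopen(request) as response:
#     session = json.load(response)
```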
Start / Stop / Abort¶
# Start
curl -X POST .../chaos/sessions/{id}/start
# Stop gracefully
curl -X POST .../chaos/sessions/{id}/stop
# Emergency abort
curl -X POST .../chaos/sessions/{id}/abort
List Experiment Types¶
Safety System¶
Every chaos session has safety controls:
- Blast radius levels: SINGLE_REQUEST, SINGLE_AGENT, SINGLE_TRACE, SINGLE_TENANT, MULTI_TENANT
- Max affected requests: Auto-abort if exceeded
- Cascade detection: Counts cascading errors, aborts if threshold hit
- Auto-abort: Stops experiment automatically when safety limits are breached
Configure safety per session:
{
  "safety": {
    "max_blast_radius": "SINGLE_TENANT",
    "max_affected_requests": 50,
    "cascade_threshold": 5,
    "auto_abort_on_cascade": true
  }
}
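To make the interaction between these limits concrete, here is a minimal sketch of a monitor that tracks both counters and aborts when either limit is breached. It is an illustration under assumptions, not Pisama's code: `SafetyMonitor` and its methods are hypothetical, and the exact comparison rules (abort once the request count exceeds the maximum, abort once the cascade count reaches the threshold) are inferred from the wording above.

```python
from dataclasses import dataclass


@dataclass
class SafetyMonitor:
    """Hypothetical stand-in for per-session safety tracking."""

    max_affected_requests: int = 50
    cascade_threshold: int = 5
    auto_abort_on_cascade: bool = True
    affected: int = 0
    cascading_errors: int = 0
    aborted: bool = False
    abort_reason: str = ""

    def record_injection(self) -> None:
        """Count one request that had a failure injected into it."""
        self.affected += 1
        if self.affected > self.max_affected_requests:
            self._abort("max_affected_requests exceeded")

    def record_cascading_error(self) -> None:
        """Count one downstream error attributed to an injected failure."""
        self.cascading_errors += 1
        if self.auto_abort_on_cascade and self.cascading_errors >= self.cascade_threshold:
            self._abort("cascade_threshold reached")

    def _abort(self, reason: str) -> None:
        # Keep the first reason so later breaches do not overwrite it.
        if not self.aborted:
            self.aborted = True
            self.abort_reason = reason


monitor = SafetyMonitor(max_affected_requests=50, cascade_threshold=5)
for _ in range(5):
    monitor.record_cascading_error()
print(monitor.aborted, monitor.abort_reason)
```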
Targeting¶
Control which agents and tools are affected:
{
  "targeting": {
    "agent_ids": ["planner", "researcher"],
    "tool_names": ["search", "database_query"],
    "tenant_ids": ["test-tenant-123"],
    "percentage": 0.5
  }
}
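One plausible reading of this configuration is sketched below. The matching semantics are assumptions, since this page does not spell them out: every listed filter must match, a missing or empty list matches everything, and `percentage` samples that fraction of the matching requests. `is_targeted` is a hypothetical helper.

```python
import random
from typing import Optional


def is_targeted(
    targeting: dict,
    agent_id: str,
    tool_name: str,
    tenant_id: str,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide whether one request falls inside the targeting rules (assumed semantics)."""
    rng = rng or random.Random()
    filters = (
        ("agent_ids", agent_id),
        ("tool_names", tool_name),
        ("tenant_ids", tenant_id),
    )
    for key, value in filters:
        allowed = targeting.get(key) or []
        # Assumption: an empty or missing list places no restriction on that field.
        if allowed and value not in allowed:
            return False
    # Assumption: percentage is the fraction of matching requests to affect.
    return rng.random() < targeting.get("percentage", 1.0)


targeting = {
    "agent_ids": ["planner", "researcher"],
    "tool_names": ["search", "database_query"],
    "tenant_ids": ["test-tenant-123"],
    "percentage": 0.5,
}
print(is_targeted(targeting, "planner", "search", "test-tenant-123", random.Random(1)))
print(is_targeted(targeting, "writer", "search", "test-tenant-123", random.Random(1)))
```

The second call is always rejected because `writer` is not in `agent_ids`; the first depends on the random draw.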
SDK-Level Chaos Hooks¶
To inject failures directly into agent execution, rather than only managing sessions through the API, use the SDK chaos hooks. They intercept the pre_tool_use and post_tool_use hooks in the Claude Agent SDK.
from pisama_agent_sdk import configure_bridge
from pisama_agent_sdk.chaos import (
    ChaosConfig,
    ToolFailure,
    LatencyInjection,
    ErrorInjection,
    OutputCorruption,
    ContextTruncation,
)

configure_bridge(
    chaos=ChaosConfig(
        experiments=[
            ToolFailure(tools=["database_query"], probability=0.3),
            LatencyInjection(min_ms=500, max_ms=3000, probability=0.2),
            OutputCorruption(tools=["search"], corruption="truncate", probability=0.1),
        ],
        safety_max_affected=50,  # auto-disable after 50 affected calls
    )
)
SDK Experiment Types¶
| Experiment | Phase | Effect |
|---|---|---|
| ToolFailure | pre | Blocks tool call entirely |
| LatencyInjection | pre | Adds delay before tool executes |
| ErrorInjection | pre | Returns error response |
| OutputCorruption | post | Truncates, empties, or breaks tool output |
| ContextTruncation | pre | Truncates string values in tool input |
Each experiment supports:
- tools: List of tool names to target (empty = all tools)
- agents: List of agent names to target (empty = all agents)
- probability: 0.0-1.0 trigger probability
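The difference between the pre and post phases can be sketched without the SDK. The code below is a standalone illustration, not the pisama_agent_sdk implementation: `BlockCall`, `TruncateOutput`, `selected`, and `call_tool` are hypothetical names, and the only behavior taken from this page is that a pre-phase experiment acts before the tool runs, a post-phase experiment acts on its output, and an empty tools list targets all tools.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BlockCall:
    """Hypothetical pre-phase experiment: stops the tool from running."""

    tools: List[str] = field(default_factory=list)
    probability: float = 1.0


@dataclass
class TruncateOutput:
    """Hypothetical post-phase experiment: shortens the tool's output."""

    tools: List[str] = field(default_factory=list)
    probability: float = 1.0
    keep_chars: int = 10


def selected(experiment, tool_name: str, rng: random.Random) -> bool:
    """An empty tools list matches every tool; then apply the probability."""
    matches = not experiment.tools or tool_name in experiment.tools
    return matches and rng.random() < experiment.probability


def call_tool(
    tool_name: str,
    tool: Callable[[str], str],
    tool_input: str,
    pre: BlockCall,
    post: TruncateOutput,
    rng: random.Random,
) -> str:
    # Pre phase: runs before the tool, so the tool is never invoked if it fires.
    if selected(pre, tool_name, rng):
        return "error: tool call blocked by chaos experiment"
    output = tool(tool_input)
    # Post phase: the tool has already run; only its output is altered.
    if selected(post, tool_name, rng):
        output = output[: post.keep_chars]
    return output


def echo(text: str) -> str:
    return f"result for {text}"


rng = random.Random(0)
pre = BlockCall(tools=["database_query"])
post = TruncateOutput(tools=["search"], keep_chars=5)
for name in ("database_query", "search", "calculator"):
    print(name, "->", call_tool(name, echo, "input", pre, post, rng))
```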
Safety¶
ChaosConfig.safety_max_affected auto-disables all experiments after the limit is reached. Call config.reset() to re-enable.
Combining with Detection¶
Run chaos experiments while Pisama monitors for failures. This validates that your agents handle degraded conditions gracefully:
- Configure chaos hooks on the SDK bridge
- Run your agent workflow; tool calls will randomly fail or be delayed
- Pisama detects how the agent handles the degraded conditions
- Check detections for resilience patterns (does it retry? fall back? crash?)
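The steps above exercise whatever recovery logic the agent already has. As a minimal sketch of two of the resilience patterns mentioned (retry and fall back), the code below wraps a tool call so that repeated failures produce a fallback value instead of a crash. All names here (`ToolError`, `flaky_tool`, `call_with_retry`) are hypothetical, and `flaky_tool` merely simulates a tool under failure injection.

```python
import random
from typing import Callable


class ToolError(Exception):
    """Raised by the simulated tool when a failure is injected."""


def flaky_tool(failure_probability: float, rng: random.Random) -> Callable[[str], str]:
    """Return a tool that fails with the given probability on each call."""

    def tool(query: str) -> str:
        if rng.random() < failure_probability:
            raise ToolError(f"injected failure for {query!r}")
        return f"results for {query}"

    return tool


def call_with_retry(
    tool: Callable[[str], str],
    query: str,
    retries: int = 2,
    fallback: str = "no results available",
) -> str:
    """Try the tool up to retries + 1 times, then return the fallback value."""
    for _ in range(retries + 1):
        try:
            return tool(query)
        except ToolError:
            continue  # try again on the next iteration
    return fallback


rng = random.Random(7)
tool = flaky_tool(0.3, rng)
for i in range(5):
    print(call_with_retry(tool, f"query {i}"))
```

An agent built this way should show retries and fallbacks in its traces rather than unhandled errors when chaos experiments are active.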