Self-Healing¶
Pisama's self-healing pipeline automatically generates, validates, and applies fixes when it detects failure modes in agent workflows. The design is safety-first: every fix is preceded by a checkpoint, and every applied fix can be rolled back.
Pipeline Overview¶
Detection Result
         │
         ▼
┌──────────────────┐
│ 1. Analyze       │  Identify failure root cause
│                  │  Determine fix category
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 2. Generate Fix  │  AI-powered fix suggestion
│                  │  Code-level or config-level
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 3. Approval      │  Manual or automatic
│    Policy        │  Based on risk level
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 4. Apply &       │  Execute with checkpoint
│    Validate      │  Run validation checks
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 5. Rollback      │  If validation fails,
│    (if needed)   │  restore from checkpoint
└──────────────────┘
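The five stages above can be sketched as a single control flow. This is a minimal illustration under stated assumptions, not Pisama's implementation: `Fix`, `run_healing`, and the callables passed in are hypothetical names, and workflow state is modeled as a plain dict.

```python
import copy
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fix:
    """Hypothetical fix record: a category, a risk level, and a state mutator."""
    category: str
    risk: str
    apply: Callable[[dict], None]


def run_healing(state: dict,
                generate_fix: Callable[[dict], Fix],
                is_approved: Callable[[Fix], bool],
                validate: Callable[[dict], bool]) -> str:
    """Walk the five stages and return the outcome as a string."""
    fix = generate_fix(state)            # stages 1-2: analyze and generate
    if not is_approved(fix):             # stage 3: approval policy
        return "pending_approval"
    checkpoint = copy.deepcopy(state)    # stage 4: checkpoint, then apply
    fix.apply(state)
    if validate(state):
        return "applied"
    state.clear()                        # stage 5: restore from checkpoint
    state.update(checkpoint)
    return "rolled_back"


# Usage: a config-level fix that sets an iteration cap, validated against a bound.
state = {"max_iterations": None}
fix = Fix("config", "low", lambda s: s.update(max_iterations=50))
outcome = run_healing(state,
                      generate_fix=lambda s: fix,
                      is_approved=lambda f: f.risk == "low",
                      validate=lambda s: 0 < s["max_iterations"] <= 100)
print(outcome, state)
```

Taking the checkpoint as a deep copy before `apply` is what makes the rollback branch trivial: restoring is just replacing the state with the saved copy.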
Fix Categories¶
Basic Fix Suggestions (All Plans)¶
Text-based suggestions that describe what to change:
- Loop breaking: "Consider adding a maximum iteration count or convergence check"
- Injection defense: "Add input sanitization before passing to LLM"
- Overflow prevention: "Implement conversation summarization when context exceeds 70%"
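As an illustration of the overflow-prevention suggestion, the sketch below compacts a conversation once usage passes 70% of a budget. Everything here is an assumption made for the example: the budget `CONTEXT_LIMIT`, the word-count stand-in for a tokenizer, and the placeholder `summarize` (a real one would likely call a model).

```python
CONTEXT_LIMIT = 1000   # assumed context budget, in tokens
THRESHOLD = 0.70       # compact once usage passes 70% of the budget


def count_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def summarize(messages: list[str]) -> str:
    # Placeholder summary: keep the first five words of each message.
    return "Summary: " + " | ".join(" ".join(m.split()[:5]) for m in messages)


def compact(history: list[str], keep_recent: int = 2) -> list[str]:
    """Replace older messages with one summary when usage exceeds the threshold."""
    used = sum(count_tokens(m) for m in history)
    if used <= CONTEXT_LIMIT * THRESHOLD or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = ["word " * 300, "word " * 300, "word " * 200, "latest question"]
compacted = compact(history)
print(len(history), "->", len(compacted))
```

Keeping the most recent messages verbatim and summarizing only the older ones preserves the turns the agent is most likely to need.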
Code-Level Fixes (Startup+)¶
Specific code changes with copy-paste solutions:
# Example: loop detection fix suggestion
# Add to your agent's step function (MAX_ITERATIONS is a cap you define):
if iteration_count > MAX_ITERATIONS:
    return {"status": "terminated", "reason": "max_iterations_exceeded"}
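For context, here is a self-contained sketch of how such a guard might sit inside an agent loop, combined with the convergence check mentioned under the basic suggestions. The `step` and `run_agent` functions and the cap of 10 are hypothetical and chosen only to make the example runnable.

```python
MAX_ITERATIONS = 10  # illustrative cap; tune for your workflow


def step(value: float) -> float:
    # Toy stand-in for one agent iteration: halves the value.
    return value / 2


def run_agent(value: float, tolerance: float) -> dict:
    iteration_count = 0
    while True:
        iteration_count += 1
        if iteration_count > MAX_ITERATIONS:
            # Hard stop: the loop never converged within the cap.
            return {"status": "terminated", "reason": "max_iterations_exceeded"}
        new_value = step(value)
        if abs(new_value - value) < tolerance:
            # Convergence check: successive results are close enough.
            return {"status": "done", "iterations": iteration_count}
        value = new_value


print(run_agent(1.0, tolerance=0.1))    # converges before the cap
print(run_agent(1.0, tolerance=1e-6))   # hits the iteration cap
```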
AI-Generated Runbooks (Growth+)¶
Operational documentation generated from detection patterns:
- Step-by-step remediation procedures
- Monitoring queries to verify the fix
- Prevention guidelines for the team
Self-Healing Playbooks (Growth+)¶
Pre-configured automated fix sequences:
| Playbook | Trigger | Action |
|---|---|---|
| Loop breaker | Loop detected with confidence > 0.85 | Inject termination condition |
| Context compressor | Overflow at WARNING level | Summarize older context |
| Persona reset | Persona drift > threshold | Re-inject system prompt |
| Cost circuit breaker | Budget exceeded | Pause workflow, notify |
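One way to picture the table is as a list of trigger predicates paired with actions. The sketch below is hypothetical throughout: the `Detection` fields, the action labels, and the persona drift threshold of 0.5 (the table does not give a value) are assumptions, not Pisama's schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    """Hypothetical detection record; field names are illustrative."""
    kind: str
    confidence: float = 0.0
    level: str = ""
    drift: float = 0.0
    budget_exceeded: bool = False


PERSONA_DRIFT_THRESHOLD = 0.5  # assumed value; not specified in the table

# Each entry: (playbook name, trigger predicate, action label)
PLAYBOOKS: list[tuple[str, Callable[[Detection], bool], str]] = [
    ("Loop breaker",
     lambda d: d.kind == "loop" and d.confidence > 0.85,
     "inject_termination_condition"),
    ("Context compressor",
     lambda d: d.kind == "overflow" and d.level == "WARNING",
     "summarize_older_context"),
    ("Persona reset",
     lambda d: d.kind == "persona_drift" and d.drift > PERSONA_DRIFT_THRESHOLD,
     "reinject_system_prompt"),
    ("Cost circuit breaker",
     lambda d: d.kind == "budget" and d.budget_exceeded,
     "pause_workflow_and_notify"),
]


def select_playbook(d: Detection) -> Optional[tuple[str, str]]:
    """Return (playbook name, action) for the first matching trigger, else None."""
    for name, trigger, action in PLAYBOOKS:
        if trigger(d):
            return name, action
    return None


print(select_playbook(Detection("loop", confidence=0.9)))
print(select_playbook(Detection("loop", confidence=0.5)))
```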
AI-Generated Fixes (Enterprise)¶
Fully automated fix generation using Claude:
- Analyzes the trace, detection, and codebase context
- Generates specific code patches
- Includes test cases for the fix
- Provides rollback instructions
Approval Policies¶
Fixes are categorized by risk level, and each level has a different approval requirement:
| Risk Level | Examples | Policy |
|---|---|---|
| Low | Config changes, threshold adjustments | Auto-apply |
| Medium | Prompt modifications, retry logic | Require team lead approval |
| High | Code changes, workflow modifications | Require admin approval |
| Critical | Data pipeline changes, auth modifications | Require manual review + staging test |
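The policy table can be read as a mapping from risk level to the sign-offs required before a fix may be applied. A minimal sketch, assuming hypothetical sign-off labels (`team_lead`, `admin`, `manual_review`, `staging_test`) and treating each level's requirement independently, since the table does not say whether a higher role satisfies a lower one:

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Sign-offs required per risk level; labels are illustrative.
REQUIRED = {
    Risk.LOW: set(),                                  # auto-apply
    Risk.MEDIUM: {"team_lead"},
    Risk.HIGH: {"admin"},
    Risk.CRITICAL: {"manual_review", "staging_test"},
}


def can_apply(risk: Risk, signoffs: set[str]) -> bool:
    """True when every sign-off required for this risk level is present."""
    return REQUIRED[risk] <= signoffs


print(can_apply(Risk.LOW, set()))
print(can_apply(Risk.CRITICAL, {"manual_review"}))
```

Because low-risk fixes require an empty set of sign-offs, the subset check passes immediately, which models the auto-apply row.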
API Endpoints¶
List healing operations¶
curl "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations" \
  -H "Authorization: Bearer $TOKEN"
Get operation details¶
curl "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID" \
  -H "Authorization: Bearer $TOKEN"
Approve an operation¶
curl -X POST "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID/approve" \
  -H "Authorization: Bearer $TOKEN"
Execute a healing operation¶
curl -X POST "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID/execute" \
  -H "Authorization: Bearer $TOKEN"
Roll back a fix¶
curl -X POST "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID/rollback" \
  -H "Authorization: Bearer $TOKEN"
View healing history¶
curl "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/history" \
  -H "Authorization: Bearer $TOKEN"
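The same endpoints can be called from Python. The sketch below only builds the requests, using the standard library; the helper name `healing_request` and the placeholder values `tenant-123`, `op-42`, and `TOKEN` are assumptions for illustration, and the paths simply mirror the curl commands above.

```python
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"
ACTIONS = {"approve", "execute", "rollback"}  # the POST endpoints listed above


def healing_request(tenant_id: str, token: str, *suffix: str) -> urllib.request.Request:
    """Build (but do not send) a request for a healing endpoint.

    `suffix` is the path after /healing/, for example
    ("operations",), ("operations", op_id, "approve"), or ("history",).
    """
    url = "/".join([BASE_URL, "tenants", tenant_id, "healing", *suffix])
    # Only operations/<id>/<action> paths are POST; everything else is GET.
    is_action = len(suffix) == 3 and suffix[-1] in ACTIONS
    return urllib.request.Request(
        url,
        method="POST" if is_action else "GET",
        headers={"Authorization": f"Bearer {token}"},
    )


req = healing_request("tenant-123", "TOKEN", "operations", "op-42", "approve")
print(req.get_method(), req.full_url)
# To send it against a running backend: urllib.request.urlopen(req)
```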
Safety Guarantees¶
- Checkpoint before apply: Every fix creates a state checkpoint before modifying anything
- Rollback capability: Any applied fix can be rolled back to its checkpoint
- Validation after apply: Fixes are validated immediately after application; if validation fails, a rollback is triggered automatically
- Audit trail: Every healing operation is logged with timestamps, approvers, and outcomes
- Canary deployment: The Enterprise tier supports canary-style fix application: apply to a subset, verify, then roll out to the rest
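The canary guarantee can be sketched as follows. This is an illustration under assumptions, not the Enterprise implementation: `canary_rollout` is a hypothetical name, targets are modeled as dicts, and the 25% canary fraction is arbitrary.

```python
import copy
import math
from typing import Callable


def canary_rollout(targets: list[dict],
                   apply_fix: Callable[[dict], None],
                   verify: Callable[[dict], bool],
                   fraction: float = 0.25) -> str:
    """Apply to a canary subset first; roll out to the rest only if it verifies."""
    n_canary = max(1, math.ceil(len(targets) * fraction))
    canary, rest = targets[:n_canary], targets[n_canary:]
    checkpoints = [copy.deepcopy(t) for t in canary]  # checkpoint before apply
    for t in canary:
        apply_fix(t)
    if not all(verify(t) for t in canary):
        for t, saved in zip(canary, checkpoints):     # restore the canaries
            t.clear()
            t.update(saved)
        return "rolled_back"
    for t in rest:
        apply_fix(t)
    return "rolled_out"


workflows = [{"id": i, "retries": 0} for i in range(8)]
result = canary_rollout(workflows,
                        apply_fix=lambda w: w.update(retries=3),
                        verify=lambda w: w["retries"] <= 5)
print(result, workflows[-1])
```

The sketch checkpoints only the canary subset to stay short; given the checkpoint-before-apply guarantee, a fuller version would checkpoint every target before touching it.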
Availability by Plan¶
| Capability | Free | Startup | Growth | Enterprise |
|---|---|---|---|---|
| Basic fix suggestions | Yes | Yes | Yes | Yes |
| Code-level fixes | -- | Yes | Yes | Yes |
| Fix confidence scores | -- | Yes | Yes | Yes |
| AI-generated runbooks | -- | -- | Yes | Yes |
| Playbook fixes | -- | -- | Yes | Yes |
| AI-generated fixes | -- | -- | -- | Yes |
| Auto-apply (canary) | -- | -- | -- | Yes |