Self-Healing¶

Pisama's self-healing pipeline automatically generates, validates, and applies fixes when it detects failure modes in agent workflows. The system follows safety-first principles: every fix requires a checkpoint, and every applied fix can be rolled back.

Pipeline Overview¶

Detection Result
       │
       ▼
┌──────────────────┐
│  1. Analyze      │  Identify failure root cause
│                  │  Determine fix category
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  2. Generate Fix │  AI-powered fix suggestion
│                  │  Code-level or config-level
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  3. Approval     │  Manual or automatic
│     Policy       │  Based on risk level
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  4. Apply &      │  Execute with checkpoint
│     Validate     │  Run validation checks
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  5. Rollback     │  If validation fails,
│     (if needed)  │  restore from checkpoint
└──────────────────┘
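
Stages 3 to 5 of the diagram can be sketched in Python. This is a minimal illustration, not Pisama's implementation: the `Fix` dataclass, the `heal` function, and the returned status strings are hypothetical names, the workflow state is modeled as a plain dict, and stages 1 and 2 are assumed to have already produced the fix.

```python
import copy
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fix:
    """Hypothetical fix: a mutation plus a check that it worked."""
    description: str
    risk: str                          # "low", "medium", "high", or "critical"
    apply: Callable[[dict], None]      # mutates the workflow state in place
    validate: Callable[[dict], bool]   # True when the fixed state is healthy


def heal(state: dict, fix: Fix, approved: bool) -> str:
    """Run approval, apply-and-validate, and rollback for one fix."""
    # Stage 3: low-risk fixes auto-apply; everything else needs sign-off.
    if fix.risk != "low" and not approved:
        return "pending_approval"

    # Stage 4: take a checkpoint, then apply the fix and validate it.
    checkpoint = copy.deepcopy(state)
    fix.apply(state)
    if fix.validate(state):
        return "applied"

    # Stage 5: validation failed, so restore the checkpoint in place.
    state.clear()
    state.update(checkpoint)
    return "rolled_back"


good = Fix("cap iterations", "low",
           apply=lambda s: s.update(max_iterations=10),
           validate=lambda s: s["max_iterations"] <= 50)
bad = Fix("cap iterations too loosely", "low",
          apply=lambda s: s.update(max_iterations=10_000),
          validate=lambda s: s["max_iterations"] <= 50)

workflow = {"max_iterations": None}
print(heal(workflow, good, approved=False))  # applied
print(heal(workflow, bad, approved=False))   # rolled_back
print(workflow)                              # {'max_iterations': 10}
```

The second call shows the safety property: the failed fix leaves the state exactly as it was at the checkpoint.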

Fix Categories¶

Basic Fix Suggestions (All Plans)¶

Text-based suggestions that describe what to change:

Loop breaking: "Consider adding a maximum iteration count or convergence check"
Injection defense: "Add input sanitization before passing to LLM"
Overflow prevention: "Implement conversation summarization when context usage exceeds 70% of the context window"

Code-Level Fixes (Startup+)¶

Specific code changes with copy-paste solutions:

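# Example: loop detection fix suggestion
MAX_ITERATIONS = 25  # illustrative limit; tune for your workflow

# Add to your agent's step function (iteration_count is the number of
# steps taken so far):
if iteration_count > MAX_ITERATIONS:
    return {"status": "terminated", "reason": "max_iterations_exceeded"}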

AI-Generated Runbooks (Growth+)¶

Operational documentation generated from detection patterns:

Step-by-step remediation procedures
Monitoring queries to verify the fix
Prevention guidelines for the team

Self-Healing Playbooks (Growth+)¶

Pre-configured automated fix sequences:

Playbook	Trigger	Action
Loop breaker	Loop detected with confidence > 0.85	Inject termination condition
Context compressor	Overflow at WARNING level	Summarize older context
Persona reset	Persona drift > threshold	Re-inject system prompt
Cost circuit breaker	Budget exceeded	Pause workflow, notify
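
The trigger column can be read as a dispatch rule from a detection to a playbook. The sketch below is an assumption-heavy illustration: the detection dict keys (`type`, `confidence`, `level`, `drift`, `exceeded`), the playbook identifiers, and the `PERSONA_DRIFT_THRESHOLD` value are all hypothetical. Only the 0.85 loop confidence and the WARNING overflow level come from the table.

```python
from typing import Optional

# Hypothetical value; the table only says "> threshold".
PERSONA_DRIFT_THRESHOLD = 0.5


def select_playbook(detection: dict) -> Optional[str]:
    """Return the playbook name for a detection, or None if none applies."""
    kind = detection.get("type")
    if kind == "loop" and detection.get("confidence", 0.0) > 0.85:
        return "loop_breaker"
    if kind == "overflow" and detection.get("level") == "WARNING":
        return "context_compressor"
    if (kind == "persona_drift"
            and detection.get("drift", 0.0) > PERSONA_DRIFT_THRESHOLD):
        return "persona_reset"
    if kind == "budget" and detection.get("exceeded", False):
        return "cost_circuit_breaker"
    return None


print(select_playbook({"type": "loop", "confidence": 0.91}))
print(select_playbook({"type": "loop", "confidence": 0.60}))
```

A loop detected below the confidence cutoff returns `None` here, which would leave the detection for manual handling.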

AI-Generated Fixes (Enterprise)¶

Fully automated fix generation using Claude:

Analyzes the trace, detection, and codebase context
Generates specific code patches
Includes test cases for the fix
Provides rollback instructions

Approval Policies¶

Fixes are categorized by risk level, and each level has a different approval requirement:

Risk Level	Examples	Policy
Low	Config changes, threshold adjustments	Auto-apply
Medium	Prompt modifications, retry logic	Require team lead approval
High	Code changes, workflow modifications	Require admin approval
Critical	Data pipeline changes, auth modifications	Require manual review + staging test
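
One way to picture the policy table is as a gate that decides whether an operation may execute. This is a sketch only: the `Risk` enum, the role strings (`team_lead`, `admin`, `reviewer`), and the `can_execute` function are hypothetical, and it assumes each level requires exactly the approver named in the table rather than accepting a more senior role.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def can_execute(risk: Risk, approved_by: Optional[str] = None,
                staging_passed: bool = False) -> bool:
    """Return True when the approval requirement for this risk level is met."""
    if risk is Risk.LOW:
        return True                          # auto-apply
    if risk is Risk.MEDIUM:
        return approved_by == "team_lead"
    if risk is Risk.HIGH:
        return approved_by == "admin"
    # CRITICAL: manual review and a passing staging test are both required.
    return approved_by == "reviewer" and staging_passed


print(can_execute(Risk.LOW))
print(can_execute(Risk.HIGH, approved_by="team_lead"))
print(can_execute(Risk.CRITICAL, approved_by="reviewer", staging_passed=True))
```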

API Endpoints¶

List healing operations¶

curl "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations" \
  -H "Authorization: Bearer $TOKEN"

Get operation details¶

curl "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID" \
  -H "Authorization: Bearer $TOKEN"

Approve an operation¶

curl -X POST "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID/approve" \
  -H "Authorization: Bearer $TOKEN"

Execute a healing operation¶

curl -X POST "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID/execute" \
  -H "Authorization: Bearer $TOKEN"

Roll back a fix¶

curl -X POST "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/operations/$OPERATION_ID/rollback" \
  -H "Authorization: Bearer $TOKEN"

View healing history¶

curl "http://localhost:8000/api/v1/tenants/$TENANT_ID/healing/history" \
  -H "Authorization: Bearer $TOKEN"
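
The same calls can be built from Python with the standard library. The sketch below only constructs the requests and does not send them. The helper names `operation_request` and `history_request` are hypothetical; the paths, HTTP methods, and bearer header mirror the curl examples above.

```python
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000/api/v1"
ACTIONS = {"approve", "execute", "rollback"}


def _request(url: str, token: str, method: str) -> urllib.request.Request:
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {token}"})


def operation_request(tenant_id: str, token: str,
                      operation_id: Optional[str] = None,
                      action: Optional[str] = None) -> urllib.request.Request:
    """Build a list, detail, or action request for healing operations."""
    url = f"{BASE_URL}/tenants/{tenant_id}/healing/operations"
    if action is not None and (action not in ACTIONS or operation_id is None):
        raise ValueError("action needs an operation_id and must be one of "
                         + ", ".join(sorted(ACTIONS)))
    if operation_id is not None:
        url += f"/{operation_id}"
    if action is not None:
        url += f"/{action}"
    # Actions are POSTs; listing and fetching details are GETs.
    return _request(url, token, "POST" if action else "GET")


def history_request(tenant_id: str, token: str) -> urllib.request.Request:
    """Build the request for the healing history endpoint."""
    return _request(f"{BASE_URL}/tenants/{tenant_id}/healing/history",
                    token, "GET")


req = operation_request("acme", "secret-token", "op-123", "approve")
print(req.get_method(), req.full_url)
```

To send a request, pass it to `urllib.request.urlopen(req)` once the API is reachable.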

Safety Guarantees¶

Checkpoint before apply: Every fix creates a state checkpoint before modifying anything
Rollback capability: Any applied fix can be rolled back to its checkpoint
Validation after apply: Fixes are validated immediately after application. If validation fails, rollback is triggered automatically
Audit trail: Every healing operation is logged with timestamps, approvers, and outcomes
Canary deployment: Enterprise tier supports canary-style fix application: apply to a subset, verify, then roll out
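
The canary guarantee (apply to a subset, verify, then roll out) can be sketched as follows. This is an illustration under stated assumptions, not Pisama's mechanism: `canary_apply`, the `canary_fraction` parameter, the status strings, and the dict-per-instance state model are all hypothetical.

```python
import copy
from typing import Callable, Dict


def canary_apply(instances: Dict[str, dict],
                 apply_fix: Callable[[dict], None],
                 validate: Callable[[dict], bool],
                 canary_fraction: float = 0.25) -> str:
    """Apply a fix to a canary subset first, then to the remaining instances."""
    names = sorted(instances)
    canary_count = max(1, int(len(names) * canary_fraction))
    canaries, rest = names[:canary_count], names[canary_count:]

    # Checkpoint every instance before anything is modified.
    checkpoints = {name: copy.deepcopy(instances[name]) for name in names}

    def rollback(touched):
        for name in touched:
            instances[name] = copy.deepcopy(checkpoints[name])

    for name in canaries:
        apply_fix(instances[name])
    if not all(validate(instances[name]) for name in canaries):
        rollback(canaries)
        return "rolled_back_canary"

    for name in rest:
        apply_fix(instances[name])
    if not all(validate(instances[name]) for name in rest):
        rollback(names)
        return "rolled_back_all"
    return "rolled_out"


fleet = {f"agent-{i}": {"temperature": 1.0} for i in range(4)}
status = canary_apply(fleet,
                      apply_fix=lambda s: s.update(temperature=0.2),
                      validate=lambda s: s["temperature"] <= 0.5)
print(status, fleet["agent-3"])
```

If the canary subset fails validation, the remaining instances are never touched, which is what limits the impact of a bad fix.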

Availability by Plan¶

Capability	Free	Startup	Growth	Enterprise
Basic fix suggestions	Yes	Yes	Yes	Yes
Code-level fixes	--	Yes	Yes	Yes
Fix confidence scores	--	Yes	Yes	Yes
AI-generated runbooks	--	--	Yes	Yes
Playbook fixes	--	--	Yes	Yes
AI-generated fixes	--	--	--	Yes
Auto-apply (canary)	--	--	--	Yes
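
Because each capability in the table is available from some plan upward, the matrix reduces to a minimum-plan lookup. A minimal sketch, with hypothetical identifiers (`PLANS`, `MIN_PLAN`, `is_available`, and the snake_case capability keys):

```python
# Plans ordered from lowest to highest tier, as in the table columns.
PLANS = ["free", "startup", "growth", "enterprise"]

# Lowest plan on which each capability shows "Yes" in the table.
MIN_PLAN = {
    "basic_fix_suggestions": "free",
    "code_level_fixes": "startup",
    "fix_confidence_scores": "startup",
    "ai_generated_runbooks": "growth",
    "playbook_fixes": "growth",
    "ai_generated_fixes": "enterprise",
    "auto_apply_canary": "enterprise",
}


def is_available(capability: str, plan: str) -> bool:
    """Return True if the plan is at or above the capability's minimum tier."""
    return PLANS.index(plan) >= PLANS.index(MIN_PLAN[capability])


print(is_available("playbook_fixes", "startup"))
print(is_available("playbook_fixes", "growth"))
```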