Detection Overview¶
Pisama detects 22 failure modes in multi-agent LLM systems, organized by the MAST taxonomy with extensions for enterprise use cases.
Failure Mode Summary¶
| MAST ID | Name | Key | Category | Tier | F1 | Status |
|---|---|---|---|---|---|---|
| F1 | Specification Mismatch | specification | Planning | ICP | 0.857 | Production |
| F2 | Poor Task Decomposition | decomposition | Planning | ICP | 1.000 | Production |
| F3 | Resource Misallocation | resource_misallocation | Planning | Enterprise | -- | Dev |
| F4 | Inadequate Tool Provision | tool_provision | Planning | Enterprise | -- | Dev |
| F5 | Flawed Workflow Design | workflow | Planning | ICP | 0.667 | Emerging |
| F6 | Task Derailment | derailment | Execution | ICP | 0.667 | Emerging |
| F7 | Context Neglect | context | Execution | ICP | 0.865 | Production |
| F8 | Information Withholding | withholding | Execution | ICP | 0.800 | Production |
| F9 | Role Usurpation | role_usurpation | Execution | Enterprise | -- | Dev |
| F10 | Communication Breakdown | communication | Execution | ICP | 0.667 | Emerging |
| F11 | Coordination Failure | coordination | Execution | ICP | 0.914 | Production |
| F12 | Output Validation Failure | output_validation | Verification | Enterprise | -- | Dev |
| F13 | Quality Gate Bypass | quality_gate | Verification | Enterprise | -- | Dev |
| F14 | Completion Misjudgment | completion | Verification | ICP | 0.703 | Beta |
| -- | Loop Detection | loop | Extended | ICP | 0.652 | Emerging |
| -- | Context Overflow | overflow | Extended | ICP | 0.706 | Beta |
| -- | Prompt Injection | injection | Extended | ICP | 0.667 | Emerging |
| -- | Hallucination | hallucination | Extended | ICP | 0.857 | Production |
| -- | Grounding Failure | grounding | Extended | ICP | 0.850 | Production |
| -- | Retrieval Quality | retrieval_quality | Extended | ICP | 0.698 | Emerging |
| -- | Persona Drift | persona_drift | Extended | ICP | 0.828 | Production |
| -- | State Corruption | corruption | Extended | ICP | 0.909 | Production |
| -- | Convergence | convergence | Extended | ICP | 0.652 | Emerging |
| -- | Cost Tracking | cost | Extended | ICP | N/A | Production |
Status Definitions¶
| Status | F1 Threshold | Meaning |
|---|---|---|
| Production | >= 0.80 | Reliable for production use |
| Beta | 0.70 - 0.79 | Usable but may have false positives/negatives |
| Emerging | < 0.70 | Under active improvement |
| Dev | Not yet calibrated | Enterprise-only, benchmarking in progress |
Detection by Category¶
Planning Failures (FC1)¶
Problems in how tasks are specified, decomposed, and organized:
- F1 Specification Mismatch: Output doesn't match user's original requirements
- F2 Poor Decomposition: Subtasks are circular, vague, or wrongly granular
- F3 Resource Misallocation: Agents compete for shared resources (Enterprise)
- F4 Tool Provision: Required tools are missing or misconfigured (Enterprise)
- F5 Workflow Design: Unreachable nodes, dead ends, missing error handling
Execution Failures (FC2)¶
Problems during agent runtime:
- F6 Task Derailment: Agent goes off-topic (20% prevalence in MAST-Data)
- F7 Context Neglect: Agent ignores upstream context
- F8 Information Withholding: Agent omits critical information
- F9 Role Usurpation: Agent exceeds role boundaries (Enterprise)
- F10 Communication Breakdown: Inter-agent messages misunderstood
- F11 Coordination Failure: Handoff failures, circular delegation
Verification Failures (FC3)¶
Problems in output validation and completion:
- F12 Output Validation: Validation steps skipped or bypassed (Enterprise)
- F13 Quality Gate Bypass: Quality thresholds ignored (Enterprise)
- F14 Completion Misjudgment: Premature completion claims (40% prevalence for F1.5 in MAST-Data)
Extended Detectors¶
Cross-cutting concerns not in the core MAST taxonomy:
- Loop Detection: Agents stuck repeating actions
- Context Overflow: Context window exhaustion
- Prompt Injection: Attack detection
- Hallucination: Fabricated information
- Grounding Failure: Claims unsupported by source documents
- Retrieval Quality: Wrong or irrelevant documents retrieved
- Persona Drift: Role/personality deviation
- State Corruption: Memory/state anomalies
- Convergence: Metric plateau, regression, thrashing, divergence detection
- Cost Tracking: Token/cost budget monitoring
Platform-Specific Detectors¶
In addition to the general-purpose detectors above, Pisama includes 24 platform-specific detectors (6 per platform) that catch issues unique to each framework's architecture:
- n8n (6): Schema mismatch, workflow cycles, complexity, error handling, resource exhaustion, timeouts
- LangGraph (6): Recursion limits, state corruption, edge misrouting, tool failures, parallel sync, checkpoint corruption
- Dify (6): RAG poisoning, iteration escape, silent model fallback, variable leakage, classifier drift, tool schema mismatch
- OpenClaw (6): Session loops, tool abuse, elevated privilege risk, spawn chain depth, channel mismatch, sandbox escape
These run automatically when traces from the corresponding platform are ingested.
Detection Pipeline¶
Each trace is analyzed by the DetectionOrchestrator, which runs applicable detectors using a cheapest-first strategy:
- Tier 1: Rule-based (hash, pattern, structural) -- $0.00
- Tier 2: State delta analysis -- $0.00
- Tier 3: Embedding similarity -- ~$0.001
- Tier 4: LLM Judge (Claude) -- ~$0.005-0.05
- Tier 5: Human review -- variable
Target: $0.05/trace average. Most traces resolve at Tier 1-2.
See Detection Tiers for the full escalation architecture.