Detection Overview

Pisama detects failure modes using 57 calibrated detectors, organized by the MAST taxonomy and extended for enterprise use cases. Mean F1 across calibrated detectors is 0.876, and 51 detectors reach production-grade quality (F1 >= 0.80).

Failure Mode Summary

| MAST ID | Name | Key | Category | Tier | F1 | Status |
|---------|------|-----|----------|------|------|--------|
| F1 | Specification Mismatch | specification | Planning | ICP | 0.855 | Production |
| F2 | Poor Task Decomposition | decomposition | Planning | ICP | 0.886 | Production |
| F3 | Resource Misallocation | resource_misallocation | Planning | Enterprise | -- | Dev |
| F4 | Inadequate Tool Provision | tool_provision | Planning | Enterprise | -- | Dev |
| F5 | Flawed Workflow Design | workflow | Planning | ICP | 0.906 | Production |
| F6 | Task Derailment | derailment | Execution | ICP | 0.654 | Beta |
| F7 | Context Neglect | context | Execution | ICP | 0.791 | Beta |
| F8 | Information Withholding | withholding | Execution | ICP | 0.924 | Production |
| F9 | Role Usurpation | role_usurpation | Execution | Enterprise | -- | Dev |
| F10 | Communication Breakdown | communication | Execution | ICP | 0.957 | Production |
| F11 | Coordination Failure | coordination | Execution | ICP | 0.747 | Beta |
| F12 | Output Validation Failure | output_validation | Verification | Enterprise | -- | Dev |
| F13 | Quality Gate Bypass | quality_gate | Verification | Enterprise | -- | Dev |
| F14 | Completion Misjudgment | completion | Verification | ICP | 0.793 | Beta |
| -- | Loop Detection | loop | Extended | ICP | 0.801 | Production |
| -- | Context Overflow | overflow | Extended | ICP | 0.855 | Production |
| -- | Prompt Injection | injection | Extended | ICP | 0.985 | Production |
| -- | Hallucination | hallucination | Extended | ICP | 0.884 | Production |
| -- | Grounding Failure | grounding | Extended | ICP | 0.818 | Production |
| -- | Retrieval Quality | retrieval_quality | Extended | ICP | 0.828 | Production |
| -- | Persona Drift | persona_drift | Extended | ICP | 0.908 | Production |
| -- | State Corruption | corruption | Extended | ICP | 0.790 | Beta |
| -- | Convergence | convergence | Extended | ICP | 0.862 | Production |
| -- | Delegation | delegation | Extended | ICP | 0.830 | Production |
| -- | Citation | citation | Extended | ICP | 0.967 | Production |
| -- | Cost Tracking | cost | Extended | ICP | N/A | Production |

Status Definitions

| Status | F1 Threshold | Meaning |
|--------|--------------|---------|
| Production | >= 0.80 | Reliable for production use |
| Beta | 0.40 - 0.79 | Usable, but may produce false positives/negatives |
| Experimental | < 0.40 | Under active improvement |
| Dev | Not yet calibrated | Enterprise-only; benchmarking in progress |
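
For reference, the thresholds above map to statuses as in this minimal sketch. The `status_for_f1` helper is hypothetical, not part of the Pisama API:

```python
def status_for_f1(f1: float | None) -> str:
    """Map a calibrated F1 score to a status label.

    Hypothetical helper; thresholds are taken from the table above,
    with None standing in for a detector that is not yet calibrated.
    """
    if f1 is None:
        return "Dev"
    if f1 >= 0.80:
        return "Production"
    if f1 >= 0.40:
        return "Beta"
    return "Experimental"
```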

Detection by Category

Planning Failures (FC1)

Problems in how tasks are specified, decomposed, and organized:

  • F1 Specification Mismatch: Output doesn't match the user's original requirements
  • F2 Poor Decomposition: Subtasks are circular, vague, or wrongly granular
  • F3 Resource Misallocation: Agents compete for shared resources (Enterprise)
  • F4 Tool Provision: Required tools are missing or misconfigured (Enterprise)
  • F5 Workflow Design: Unreachable nodes, dead ends, missing error handling (see the sketch after this list)
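
Structural flaws like unreachable nodes and dead ends can be caught with a cheap graph check. The sketch below is illustrative only: it assumes a simplified workflow representation (an adjacency mapping plus a designated entry node and an `end` terminal), which may differ from Pisama's actual workflow schema.

```python
from collections import deque

def workflow_issues(nodes: set[str], edges: dict[str, list[str]], start: str) -> list[str]:
    """Flag unreachable nodes and dead ends in a workflow graph.

    Illustrative Tier-1-style structural check in the spirit of the F5
    detector; the graph format here is a simplifying assumption.
    """
    issues = []
    # Breadth-first search from the entry node finds every reachable node.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    for node in sorted(nodes - seen):
        issues.append(f"unreachable node: {node}")
    # Dead ends: reachable non-terminal nodes with no outgoing edges.
    for node in sorted(seen):
        if not edges.get(node) and node != "end":  # "end" assumed terminal
            issues.append(f"dead end: {node}")
    return issues

# e.g. workflow_issues({"start", "a", "b", "end"}, {"start": ["a"], "a": ["end"]}, "start")
# -> ["unreachable node: b"]
```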

Execution Failures (FC2)

Problems during agent runtime:

  • F6 Task Derailment: Agent goes off-topic (20% prevalence in MAST-Data)
  • F7 Context Neglect: Agent ignores upstream context
  • F8 Information Withholding: Agent omits critical information
  • F9 Role Usurpation: Agent exceeds role boundaries (Enterprise)
  • F10 Communication Breakdown: Inter-agent messages misunderstood
  • F11 Coordination Failure: Handoff failures, circular delegation

Verification Failures (FC3)

Problems in output validation and completion:

  • F12 Output Validation: Validation steps skipped or bypassed (Enterprise)
  • F13 Quality Gate Bypass: Quality thresholds ignored (Enterprise)
  • F14 Completion Misjudgment: Premature completion claims (40% prevalence for MAST mode F1.5 in MAST-Data)

Extended Detectors

Cross-cutting concerns not in the core MAST taxonomy:

  • Loop Detection: Agents stuck repeating actions (sketched after this list)
  • Context Overflow: Context window exhaustion
  • Prompt Injection: Attack detection
  • Hallucination: Fabricated information
  • Grounding Failure: Claims unsupported by source documents
  • Retrieval Quality: Wrong or irrelevant documents retrieved
  • Persona Drift: Role/personality deviation
  • State Corruption: Memory/state anomalies
  • Convergence: Metric plateau, regression, thrashing, divergence detection
  • Delegation: Faulty or failed task delegation between agents
  • Citation: Missing or inaccurate source citations
  • Cost Tracking: Token/cost budget monitoring
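
As an example of how cheap an extended detector can be, here is a minimal loop-detection sketch. It assumes a simplified trace format (a list of steps with `tool` and `args` fields) and a hypothetical `detect_loop` helper; it is not Pisama's actual implementation.

```python
import hashlib

def detect_loop(steps: list[dict], threshold: int = 3) -> bool:
    """Return True if the same action repeats `threshold` times in a row.

    Illustrative Tier-1-style check: each step is fingerprinted by
    hashing its tool name and arguments, then consecutive runs of
    identical fingerprints are counted.
    """
    def fingerprint(step: dict) -> str:
        raw = f"{step.get('tool')}|{step.get('args')}"
        return hashlib.sha256(raw.encode()).hexdigest()

    run = 1
    prev = None
    for step in steps:
        fp = fingerprint(step)
        if fp == prev:
            run += 1
            if run >= threshold:
                return True
        else:
            run = 1
        prev = fp
    return False
```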

Platform-Specific Detectors

In addition to the general-purpose detectors above, Pisama includes 30 platform-specific detectors (6 per platform) that catch issues unique to each framework's architecture:

  • n8n (6): Schema mismatch, workflow cycles, complexity, error handling, resource exhaustion, timeouts
  • LangGraph (6): Recursion limits, state corruption, edge misrouting, tool failures, parallel sync, checkpoint corruption
  • Dify (6): RAG poisoning, iteration escape, silent model fallback, variable leakage, classifier drift, tool schema mismatch
  • OpenClaw (6): Session loops, tool abuse, elevated privilege risk, spawn chain depth, channel mismatch, sandbox escape
  • Claude Managed Agents (6): Session stall, tool permission escalation, MCP failure, environment escape, cost overrun, session corruption

These run automatically when traces from the corresponding platform are ingested.
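
At ingest time this amounts to routing by platform tag. The sketch below is illustrative: the registry contents mirror the lists above, but the key names, trace shape, and `detectors_for_trace` helper are assumptions, not Pisama's real ingestion API.

```python
# Hypothetical registry mapping a platform tag to its six detectors.
PLATFORM_DETECTORS: dict[str, list[str]] = {
    "n8n": ["schema_mismatch", "workflow_cycles", "complexity",
            "error_handling", "resource_exhaustion", "timeouts"],
    "langgraph": ["recursion_limits", "state_corruption", "edge_misrouting",
                  "tool_failures", "parallel_sync", "checkpoint_corruption"],
    # ... remaining platforms follow the same pattern
}

def detectors_for_trace(trace: dict) -> list[str]:
    # A trace tagged with its source platform picks up that platform's
    # detectors on top of the general-purpose set.
    platform = trace.get("platform", "")
    return PLATFORM_DETECTORS.get(platform, [])
```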

Detection Pipeline

Each trace is analyzed by the DetectionOrchestrator, which runs applicable detectors using a cheapest-first strategy:

  1. Tier 1: Rule-based (hash, pattern, structural) -- $0.00
  2. Tier 2: State delta analysis -- $0.00
  3. Tier 3: Embedding similarity -- ~$0.001
  4. Tier 4: LLM Judge (Claude) -- ~$0.005-0.05
  5. Tier 5: Human review -- variable

Target: $0.05/trace average. Most traces resolve at Tier 1-2.
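
A minimal sketch of that cheapest-first escalation, assuming hypothetical tier functions and a simple dict-based finding (not Pisama's actual DetectionOrchestrator interface):

```python
from typing import Callable, Optional

Finding = Optional[dict]  # None means "no confident verdict at this tier"

def run_cheapest_first(
    trace: dict,
    tiers: list[tuple[str, float, Callable[[dict], Finding]]],
) -> dict:
    """Run tiers in cost order, stopping at the first confident verdict."""
    total_cost = 0.0
    for name, cost, check in tiers:  # ordered cheapest to most expensive
        total_cost += cost
        finding = check(trace)
        if finding is not None:
            return {"tier": name, "cost": total_cost, "finding": finding}
    # No tier resolved the trace: escalate to Tier 5 (human review).
    return {"tier": "human_review", "cost": total_cost, "finding": None}
```

Ordering tiers by cost means the expensive LLM judge only runs on the minority of traces the cheap tiers cannot resolve, which is what keeps the average near the $0.05 target.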

See Detection Tiers for the full escalation architecture.