Execution Failures (FC2)

Execution failures occur during agent runtime -- when agents deviate from their task, ignore context, withhold information, or fail to coordinate with each other.


F6: Task Derailment

Field          Value
Detector key   derailment
Tier           ICP
Severity       High
Accuracy       F1 0.667, P 0.588, R 0.769
MAST mapping   FM-2.3 Task Derailment

Plain language: The agent went off-topic. It was asked to do one thing but started doing something else entirely -- like asking someone to write a blog post and getting API documentation instead.

Technical: Computes the embedding distance between the task description and the agent's output, and combines it with topic drift detection (keyword clustering) and task substitution pair analysis (e.g., confusing authentication with authorization).

Examples (non-technical):

  • You ask for a pricing analysis and the agent delivers a feature comparison instead
  • A code review agent starts writing new features rather than reviewing existing ones
  • An agent asked to summarize a document starts editing it instead

Examples (technical):

  • Agent assigned write_auth_docs produces output about authorization middleware instead of authentication flows
  • Research agent's output embeddings have cosine similarity < 0.3 with the task description embedding
  • Agent confuses pytest test writing with unittest -- delivers wrong framework's patterns

Detection methods:

  • Semantic Similarity: Compares embedding distance between task description and output
  • Topic Drift Detection: Tracks topic focus using keyword clustering
  • Task Substitution Detection: Identifies confused concepts using substitution pairs
  • Coverage Verification: Checks whether core task requirements are addressed
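
The first and third methods above can be sketched in a few lines. This is a minimal illustration, not the detector's actual implementation: a bag-of-words vector stands in for a real embedding model, the 0.3 threshold is borrowed from the technical example above, and `SUBSTITUTION_PAIRS`, `check_derailment`, and the other names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: threshold borrowed from the "cosine similarity < 0.3" example above.
SIMILARITY_THRESHOLD = 0.3

# Hypothetical substitution pairs: (concept the task asks for, concept it gets confused with).
SUBSTITUTION_PAIRS = [("authentication", "authorization"), ("pytest", "unittest")]


def bag_of_words(text: str) -> Counter:
    """Stand-in for a real embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_substitutions(task: str, output: str) -> list:
    """Pairs where the task names one concept and the output names only the other."""
    task_words, output_words = bag_of_words(task), bag_of_words(output)
    return [
        (expected, confused)
        for expected, confused in SUBSTITUTION_PAIRS
        if expected in task_words
        and expected not in output_words
        and confused in output_words
    ]


def check_derailment(task: str, output: str) -> dict:
    similarity = cosine_similarity(bag_of_words(task), bag_of_words(output))
    substitutions = find_substitutions(task, output)
    return {
        "similarity": similarity,
        "substitutions": substitutions,
        "derailed": similarity < SIMILARITY_THRESHOLD or bool(substitutions),
    }
```

In practice the word-count vectors would be replaced by sentence embeddings, which also catch paraphrases that share no vocabulary with the task description.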

F7: Context Neglect

Field          Value
Detector key   context
Tier           ICP
Severity       Medium
Accuracy       F1 0.865, P 0.762, R 1.000
MAST mapping   FM-1.4 Loss of Conversation History

Plain language: The agent ignored information it was given. A previous step provided important context, but the agent acted as if it never received it -- starting from scratch instead of building on prior work.

Technical: Checks for key information elements from upstream context using element matching, critical marker detection (CRITICAL, IMPORTANT labels), and semantic overlap measurement between context and response.

Examples (non-technical):

  • Agent B ignores Agent A's research findings and redoes the analysis from scratch
  • Important warnings from a previous step are completely absent from the output
  • Agent says "based on prior analysis" but doesn't actually use any of the prior data

Examples (technical):

  • Upstream context contains CRITICAL: rate limit is 100 req/s but agent's output proposes 1000 req/s
  • Agent receives structured JSON context with 12 fields but only references 2 in its response
  • Context marked priority: high with 8 key findings -- agent's output mentions zero of them

Detection methods:

  • Element Matching: Checks for key information elements from upstream context
  • Critical Marker Detection: Flags when CRITICAL/IMPORTANT-labeled context is ignored
  • Conceptual Overlap: Measures semantic similarity between context and response
  • Reference Tracking: Verifies claims of context usage against actual content
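
Critical marker detection could look roughly like the sketch below. It assumes markers appear as `CRITICAL:` or `IMPORTANT:` at the start of a statement, as in the rate-limit example above. The overlap threshold, the stopword list, and all function names are illustrative assumptions rather than the detector's real logic.

```python
import re

# Markers named in the text above; everything else here is an illustrative assumption.
MARKER_PATTERN = re.compile(r"\b(?:CRITICAL|IMPORTANT)\b:?\s*(.+)")
STOPWORDS = {"a", "an", "and", "at", "be", "in", "is", "must", "of", "the", "to"}


def marked_statements(context: str) -> list:
    """Return the text that follows each CRITICAL/IMPORTANT marker (to end of line)."""
    return [match.group(1).strip() for match in MARKER_PATTERN.finditer(context)]


def key_terms(text: str) -> set:
    return {t for t in re.findall(r"[a-z0-9_/]+", text.lower()) if t not in STOPWORDS}


def numbers(text: str) -> set:
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def neglected_statements(context: str, response: str, min_overlap: float = 0.5) -> list:
    """Flag marked statements whose key terms or numeric values are absent from the response."""
    response_terms = key_terms(response)
    response_numbers = numbers(response)
    neglected = []
    for statement in marked_statements(context):
        terms = key_terms(statement)
        overlap = len(terms & response_terms) / len(terms) if terms else 1.0
        missing_numbers = numbers(statement) - response_numbers
        if overlap < min_overlap or missing_numbers:
            neglected.append(statement)
    return neglected
```

Checking numeric values separately matters for cases like the 100 vs 1000 req/s example: the response shares most of its vocabulary with the constraint, yet contradicts it.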

F8: Information Withholding

Field          Value
Detector key   withholding
Tier           ICP
Severity       Medium
Accuracy       F1 0.800, P 0.667, R 1.000
MAST mapping   FM-2.4 Information Withholding

Plain language: The agent knew something important but didn't share it. It might have found a security issue but only reported "task completed successfully" -- hiding bad news or over-simplifying critical details.

Technical: Compares information density between the agent's internal state and its output, detecting critical omissions (errors, security issues), negative finding suppression, and excessive summarization loss.

Examples (non-technical):

  • Agent finds a security problem but reports only "everything looks good"
  • A 10-page analysis is summarized into 2 sentences, losing all the important details
  • Agent reports only the positive findings and hides all the errors it encountered

Examples (technical):

  • Agent's internal state contains {"vulnerabilities": [{"severity": "critical", ...}]} but output says "No issues found"
  • Input document has 47 data points; agent output references only 3
  • Agent discovers DeprecationWarning in 4 dependencies but output lists zero deprecations

Detection methods:

  • Information Density Comparison: Compares input richness against output content
  • Critical Omission Detection: Checks for missing high-importance information (errors, security, financial)
  • Negative Suppression Detection: Flags when negative findings are absent from positive-heavy reports
  • Semantic Retention Check: Uses embeddings to verify key concepts are preserved

Critical pattern weights:

Pattern                    Weight
Errors/failures            1.0
Security/vulnerabilities   1.0
Time constraints           0.9
Financial info             0.8
Warnings                   0.7
Deprecation notices        0.6

Sub-types: critical_omission, detail_loss, negative_suppression, selective_reporting, context_stripping
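
One way the pattern weights could feed a critical omission check is sketched below. The weights are copied from the table above. The keyword regexes, the choice to score by the highest omitted weight, and the names `CRITICAL_PATTERNS` and `omission_score` are assumptions made for illustration; the source does not say how the weights are combined.

```python
import re

# Weights from the table above; the keyword regexes are illustrative assumptions.
CRITICAL_PATTERNS = {
    "errors_failures": (1.0, r"\b(?:error|exception|failed|failure)\b"),
    "security": (1.0, r"\b(?:security|vulnerabilit(?:y|ies)|exploit)\b"),
    "time_constraints": (0.9, r"\b(?:deadline|timeout|expires?)\b"),
    "financial": (0.8, r"\b(?:cost|budget|revenue|invoice)\b"),
    "warnings": (0.7, r"\bwarning\b"),
    "deprecation": (0.6, r"\bdeprecat(?:ed|ion)\w*\b"),
}


def present_patterns(text: str) -> set:
    """Names of the critical patterns that occur anywhere in the text."""
    return {
        name
        for name, (_, pattern) in CRITICAL_PATTERNS.items()
        if re.search(pattern, text, re.IGNORECASE)
    }


def omission_score(source: str, output: str) -> tuple:
    """Return (highest weight among omitted patterns, sorted omitted pattern names)."""
    omitted = present_patterns(source) - present_patterns(output)
    score = max((CRITICAL_PATTERNS[name][0] for name in omitted), default=0.0)
    return score, sorted(omitted)
```

Taking the maximum means a single suppressed security finding scores as high as several; summing or averaging the weights would be an equally plausible design.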


F9: Role Usurpation (Enterprise)

Field          Value
Detector key   role_usurpation
Tier           Enterprise
Severity       High
Accuracy       Benchmarking in progress
MAST mapping   FM-2.6

Plain language: The agent overstepped its role. A reviewer started making changes instead of just reviewing, or a support agent made admin-level decisions it wasn't authorized to make.

Technical: Validates agent actions against allowed/forbidden action sets defined in the role specification, detecting scope expansion, authority violations, and task hijacking through action-role boundary analysis.

Examples (non-technical):

  • A code reviewer starts rewriting the code instead of just reviewing it
  • A research assistant makes final product decisions that should be the manager's call
  • A support agent grants itself admin privileges without authorization

Examples (technical):

  • Agent with role: "reviewer" calls git commit and git push -- write actions outside its allowed_actions: ["comment", "approve", "request_changes"]
  • Agent with role: "data_analyst" executes DROP TABLE -- a DBA-only operation
  • Agent gradually expands: first reads files, then edits configs, then modifies production deployments

Detection methods:

  • Role Boundary Check: Validates actions against allowed/forbidden action sets
  • Scope Analysis: Detects gradual scope expansion beyond assignment
  • Authority Verification: Checks decision authority against role definition
  • Task Hijacking Detection: Identifies when agent takes over another agent's task

Sub-types: role_violation, scope_expansion, authority_violation, decision_overreach, task_hijacking
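
A role boundary check reduces to comparing each action against the role's allowed and forbidden sets. The sketch below assumes a role specification shaped like the `allowed_actions` example above. The `RoleSpec` class, the `forbidden_actions` field name, and the mapping of outcomes to the `authority_violation` and `role_violation` sub-types are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RoleSpec:
    # Hypothetical role specification; field names mirror the example above.
    name: str
    allowed_actions: set
    forbidden_actions: set = field(default_factory=set)


def check_actions(role: RoleSpec, actions: list) -> list:
    """Return one violation record per action that falls outside the role's boundary."""
    violations = []
    for index, action in enumerate(actions):
        if action in role.forbidden_actions:
            # Explicitly forbidden: treated here as an authority violation (assumption).
            violations.append({"index": index, "action": action, "type": "authority_violation"})
        elif action not in role.allowed_actions:
            # Not on the allow-list: treated here as a generic role violation (assumption).
            violations.append({"index": index, "action": action, "type": "role_violation"})
    return violations


reviewer = RoleSpec(
    name="reviewer",
    allowed_actions={"comment", "approve", "request_changes"},
)
violations = check_actions(reviewer, ["comment", "git commit", "git push"])
```

Here `violations` holds two `role_violation` records, for `git commit` and `git push`. Detecting gradual scope expansion would additionally need the order and timing of actions, which this sketch ignores.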


F10: Communication Breakdown

Field          Value
Detector key   communication
Tier           ICP
Severity       Medium
Accuracy       F1 0.667, P 0.571, R 0.800
MAST mapping   FM-2.1, FM-2.2, FM-2.5

Plain language: Agents are miscommunicating. One agent sends a message but the receiving agent misunderstands it -- like giving someone directions in kilometers when they expect miles.

Technical: Measures alignment between sender intent and receiver interpretation using semantic similarity, validates message format compliance against expected schemas, and detects ambiguous or incomplete inter-agent messages.

Examples (non-technical):

  • Agent A sends data in one format but Agent B expects a different format
  • An ambiguous instruction like "process the results" is interpreted differently by two agents
  • Critical details are missing from a handoff message between agents

Examples (technical):

  • Agent A sends JSON {"price": "19.99"} (string) but Agent B expects {"price": 19.99} (number), causing a type error
  • Agent A's message says "update the config" -- Agent B updates nginx.conf instead of app.config.yml
  • Inter-agent message missing required correlation_id field, breaking downstream tracing

Detection methods:

  • Intent Alignment: Measures alignment between sender's intent and receiver's interpretation
  • Format Compliance: Checks message format matches expected schema
  • Ambiguity Detection: Flags semantically ambiguous instructions
  • Completeness Check: Verifies all required information is present

Sub-types: intent_mismatch, format_mismatch, semantic_ambiguity, incomplete_information, conflicting_instructions
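
Format compliance and completeness are the most mechanical of these checks. The sketch below validates a message against a hypothetical schema built from the two technical examples above (a numeric `price` and a required `correlation_id`). `EXPECTED_SCHEMA` and `check_format` are illustrative names, and a production system would more likely use a full schema validator.

```python
import json

# Hypothetical schema: field name -> tuple of accepted Python types after JSON parsing.
EXPECTED_SCHEMA = {
    "price": (int, float),
    "correlation_id": (str,),
}


def check_format(raw_message: str, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means the message complies."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(message, dict):
        return ["message is not a JSON object"]

    problems = []
    for field_name, expected_types in schema.items():
        if field_name not in message:
            problems.append(f"missing required field: {field_name}")
            continue
        value = message[field_name]
        # Exact type match, so a JSON boolean is not accepted where a number is expected.
        if type(value) not in expected_types:
            expected = "/".join(t.__name__ for t in expected_types)
            problems.append(
                f"wrong type for {field_name}: expected {expected}, got {type(value).__name__}"
            )
    return problems
```

For the string-typed price example, `check_format('{"price": "19.99"}')` reports both the type mismatch and the missing `correlation_id`.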


F11: Coordination Failure

Field          Value
Detector key   coordination
Tier           ICP
Severity       Critical
Accuracy       F1 0.914, P 0.842, R 1.000
MAST mapping   FM-2.5 Ignored Input

Plain language: Agents can't work together. They're waiting on each other in circles, ignoring each other's messages, or going back and forth endlessly without making progress -- like two people stuck saying "no, you go first" at a doorway.

Technical: Tracks message acknowledgment patterns, detects excessive back-and-forth exchanges (threshold: 5), analyzes delegation chains for cycles, and monitors whether inter-agent exchanges produce measurable forward progress.

Examples (non-technical):

  • Agent A waits for Agent B's output while Agent B waits for Agent A -- neither moves
  • One agent sends a request but the other agent never responds
  • Two agents exchange 15 messages clarifying the same thing without making any progress

Examples (technical):

  • Circular delegation: task routed A → B → C → A, creating an infinite delegation loop
  • Agent A's POST /handoff to Agent B returns no acknowledgment after 30s timeout
  • Agents exchange 12 clarify_request/clarify_response messages with cosine similarity > 0.95 (repeating themselves)
  • Agent B receives Agent A's output but processed: false -- input was silently dropped

Detection methods:

  • Message Acknowledgment Tracking: Detects ignored or unacknowledged messages
  • Back-and-Forth Detection: Flags excessive message exchanges between agent pairs (threshold: 5)
  • Circular Delegation Analysis: Traces delegation chains for cycles
  • Progress Monitoring: Measures whether exchanges produce forward progress

The CoordinationAnalyzer checks 12 issue types, including information withholding, conflicting instructions, duplicate dispatch, data corruption relay, ordering violations, resource contention, rapid instruction changes, and response delays.
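
Two of the detection methods listed above, circular delegation analysis and back-and-forth detection, can be sketched with plain graph and counting logic. This is not the CoordinationAnalyzer's code: the function names and the input shapes (lists of `(sender, receiver)` tuples) are assumptions, and treating the threshold of 5 as "more than 5 exchanges" is a guess at how it is applied.

```python
from collections import Counter
from typing import Optional

# Threshold value from the text above; "strictly greater than" is an assumption.
BACK_AND_FORTH_THRESHOLD = 5


def find_delegation_cycle(delegations: list) -> Optional[list]:
    """Return the first delegation cycle found (e.g. ['A', 'B', 'C', 'A']), or None."""
    graph = {}
    for delegator, delegate in delegations:
        graph.setdefault(delegator, []).append(delegate)

    visited = set()

    def visit(node, path):
        if node in path:
            # The node is already on the current chain: close the loop and report it.
            return path[path.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        for next_node in graph.get(node, []):
            cycle = visit(next_node, path + [node])
            if cycle:
                return cycle
        return None

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


def excessive_exchanges(messages: list, threshold: int = BACK_AND_FORTH_THRESHOLD) -> dict:
    """Count messages per unordered agent pair and keep pairs above the threshold."""
    counts = Counter(frozenset((sender, receiver)) for sender, receiver in messages)
    return {pair: count for pair, count in counts.items() if count > threshold}
```

For the A → B → C → A example, `find_delegation_cycle([("A", "B"), ("B", "C"), ("C", "A")])` returns the full loop. Counting messages alone cannot tell a productive long exchange from a stalled one, which is why the progress monitoring method is listed separately.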