Execution Failures (FC2)
Execution failures occur during agent runtime -- when agents deviate from their task, ignore context, withhold information, or fail to coordinate with each other.
F6: Task Derailment
| Field | Value |
|---|---|
| Detector key | derailment |
| Tier | ICP |
| Severity | High |
| Accuracy | F1 0.667, P 0.588, R 0.769 |
| MAST mapping | FM-2.3 Task Derailment |
Plain language: The agent went off-topic. It was asked to do one thing but started doing something else entirely -- like asking someone to write a blog post and getting API documentation instead.
Technical: Computes embedding distance between the task description and the agent's output, combined with topic drift detection via keyword clustering and task substitution pair analysis (e.g., authentication vs authorization confusion).
Examples (non-technical):
- You ask for a pricing analysis and the agent delivers a feature comparison instead
- A code review agent starts writing new features rather than reviewing existing ones
- An agent asked to summarize a document starts editing it instead
Examples (technical):
- Agent assigned `write_auth_docs` produces output about authorization middleware instead of authentication flows
- Research agent's output embeddings have cosine similarity < 0.3 with the task description embedding
- Agent confuses `pytest` test writing with `unittest` -- delivers wrong framework's patterns
Detection methods:
- Semantic Similarity: Compares embedding distance between task description and output
- Topic Drift Detection: Tracks topic focus using keyword clustering
- Task Substitution Detection: Identifies confused concepts using substitution pairs
- Coverage Verification: Checks whether core task requirements are addressed
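The first and third detection methods can be sketched in a few lines. This is a minimal illustration, not the detector's actual implementation: it substitutes bag-of-words term counts for real embeddings, the function names are hypothetical, the `pytest`/`unittest` pair is inferred from the examples above, and the 0.3 cutoff is borrowed from the technical example rather than from a documented default.

```python
import math
import re
from collections import Counter

# Hypothetical substitution pairs, inferred from the examples above.
SUBSTITUTION_PAIRS = [("authentication", "authorization"), ("pytest", "unittest")]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (stand-in for embeddings)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_substitutions(task: str, output: str) -> list[tuple[str, str]]:
    """Return (expected, found) pairs where the output uses only the confusable term."""
    task_tokens, out_tokens = set(tokenize(task)), set(tokenize(output))
    hits = []
    for a, b in SUBSTITUTION_PAIRS:
        for expected, other in ((a, b), (b, a)):
            if (
                expected in task_tokens
                and other not in task_tokens
                and other in out_tokens
                and expected not in out_tokens
            ):
                hits.append((expected, other))
    return hits


def check_derailment(task: str, output: str, min_similarity: float = 0.3) -> dict:
    """Flag derailment on low similarity or on any detected substitution."""
    similarity = cosine_similarity(Counter(tokenize(task)), Counter(tokenize(output)))
    substitutions = find_substitutions(task, output)
    return {
        "similarity": similarity,
        "substitutions": substitutions,
        "derailed": similarity < min_similarity or bool(substitutions),
    }
```

Calling `check_derailment("Write documentation for the authentication flow", "This document describes the authorization middleware and its permission checks")` flags the run on both signals. A production detector would presumably replace `cosine_similarity` over word counts with a real embedding model.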
F7: Context Neglect
| Field | Value |
|---|---|
| Detector key | context |
| Tier | ICP |
| Severity | Medium |
| Accuracy | F1 0.865, P 0.762, R 1.000 |
| MAST mapping | FM-1.4 Loss of Conversation History |
Plain language: The agent ignored information it was given. A previous step provided important context, but the agent acted as if it never received it -- starting from scratch instead of building on prior work.
Technical: Checks for key information elements from upstream context using element matching, critical marker detection (CRITICAL, IMPORTANT labels), and semantic overlap measurement between context and response.
Examples (non-technical):
- Agent B ignores Agent A's research findings and redoes the analysis from scratch
- Important warnings from a previous step are completely absent from the output
- Agent says "based on prior analysis" but doesn't actually use any of the prior data
Examples (technical):
- Upstream context contains `CRITICAL: rate limit is 100 req/s` but agent's output proposes 1000 req/s
- Agent receives structured JSON context with 12 fields but only references 2 in its response
- Context marked `priority: high` with 8 key findings -- agent's output mentions zero of them
Detection methods:
- Element Matching: Checks for key information elements from upstream context
- Critical Marker Detection: Flags when CRITICAL/IMPORTANT-labeled context is ignored
- Conceptual Overlap: Measures semantic similarity between context and response
- Reference Tracking: Verifies claims of context usage against actual content
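Element matching and critical marker detection might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the rule that a "key element" is a number or a word of four or more letters, and the 0.5 coverage cutoff. The marker labels themselves (CRITICAL, IMPORTANT) come from the description above.

```python
import re

CRITICAL_MARKERS = ("CRITICAL", "IMPORTANT")


def extract_critical_lines(context: str) -> list[str]:
    """Return the text that follows a CRITICAL: or IMPORTANT: label on each line."""
    pattern = re.compile(r"\b(?:%s)\s*:\s*(.+)" % "|".join(CRITICAL_MARKERS))
    return [
        m.group(1).strip()
        for line in context.splitlines()
        if (m := pattern.search(line))
    ]


def key_elements(text: str) -> set[str]:
    """Assumed notion of a key element: any number, or any word of 4+ letters."""
    return set(re.findall(r"\d+(?:\.\d+)?|[a-z]{4,}", text.lower()))


def check_critical_context(
    context: str, response: str, min_coverage: float = 0.5
) -> list[dict]:
    """Report critical context lines that the response ignores or contradicts."""
    response_elements = key_elements(response)
    findings = []
    for line in extract_critical_lines(context):
        elements = key_elements(line)
        missing = elements - response_elements
        coverage = 1 - len(missing) / len(elements) if elements else 1.0
        # A missing number is treated as a likely contradiction, not just an omission.
        missing_numbers = [e for e in missing if e[0].isdigit()]
        if coverage < min_coverage or missing_numbers:
            findings.append(
                {"critical": line, "coverage": coverage, "missing": sorted(missing)}
            )
    return findings
```

With the rate-limit example above, the response "We propose a rate limit of 1000 req/s" still mentions "rate" and "limit" but never the number 100, so the line is reported with `missing == ["100"]`.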
F8: Information Withholding
| Field | Value |
|---|---|
| Detector key | withholding |
| Tier | ICP |
| Severity | Medium |
| Accuracy | F1 0.800, P 0.667, R 1.000 |
| MAST mapping | FM-2.4 Information Withholding |
Plain language: The agent knew something important but didn't share it. It might have found a security issue but only reported "task completed successfully" -- hiding bad news or over-simplifying critical details.
Technical: Compares information density between the agent's internal state and its output, detecting critical omissions (errors, security issues), negative finding suppression, and excessive summarization loss.
Examples (non-technical):
- Agent finds a security problem but reports only "everything looks good"
- A 10-page analysis is summarized into 2 sentences, losing all the important details
- Agent reports only the positive findings and hides all the errors it encountered
Examples (technical):
- Agent's internal state contains `{"vulnerabilities": [{"severity": "critical", ...}]}` but output says "No issues found"
- Input document has 47 data points; agent output references only 3
- Agent discovers `DeprecationWarning` in 4 dependencies but output lists zero deprecations
Detection methods:
- Information Density Comparison: Compares input richness against output content
- Critical Omission Detection: Checks for missing high-importance information (errors, security, financial)
- Negative Suppression Detection: Flags when negative findings are absent from positive-heavy reports
- Semantic Retention Check: Uses embeddings to verify key concepts are preserved
Critical pattern weights:
| Pattern | Weight |
|---|---|
| Errors/failures | 1.0 |
| Security/vulnerabilities | 1.0 |
| Time constraints | 0.9 |
| Financial info | 0.8 |
| Warnings | 0.7 |
| Deprecation notices | 0.6 |
Sub-types: critical_omission, detail_loss, negative_suppression, selective_reporting, context_stripping
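One way the critical pattern weights could feed into critical omission detection is sketched below. The weights are copied from the table above; the regular expressions, the dictionary keys, and the choice to report the highest omitted weight as the score are all illustrative assumptions, not the detector's documented behavior.

```python
import re

# Weights come from the "Critical pattern weights" table.
# The regexes are illustrative guesses at what each pattern might match.
CRITICAL_PATTERNS = {
    "errors_failures": (1.0, r"\b(error|exception|fail(ed|ure)?)\b"),
    "security": (1.0, r"\b(vulnerabilit(y|ies)|security|cve)\b"),
    "time_constraints": (0.9, r"\b(deadline|due by|expires?)\b"),
    "financial": (0.8, r"\b(cost|budget|invoice|revenue)\b"),
    "warnings": (0.7, r"\bwarning\b"),
    "deprecation": (0.6, r"\bdeprecat(ed|ion)\w*"),
}


def omission_score(internal_state: str, output: str) -> dict:
    """List pattern categories present in the internal state but absent from the output."""
    omitted = {}
    for name, (weight, pattern) in CRITICAL_PATTERNS.items():
        rx = re.compile(pattern, re.IGNORECASE)
        if rx.search(internal_state) and not rx.search(output):
            omitted[name] = weight
    # Assumption: the most severe omission determines the overall score.
    return {"omitted": omitted, "score": max(omitted.values(), default=0.0)}
```

For an internal state mentioning a critical vulnerability and a `DeprecationWarning`, paired with the output "Task completed successfully. No issues found.", this reports the security and deprecation categories as omitted. Keyword matching is crude; the semantic retention check described above would cover paraphrases that these patterns miss.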
F9: Role Usurpation (Enterprise)
| Field | Value |
|---|---|
| Detector key | role_usurpation |
| Tier | Enterprise |
| Severity | High |
| Accuracy | Benchmarking in progress |
| MAST mapping | FM-2.6 |
Plain language: The agent overstepped its role. A reviewer started making changes instead of just reviewing, or a support agent made admin-level decisions it wasn't authorized to make.
Technical: Validates agent actions against allowed/forbidden action sets defined in the role specification, detecting scope expansion, authority violations, and task hijacking through action-role boundary analysis.
Examples (non-technical):
- A code reviewer starts rewriting the code instead of just reviewing it
- A research assistant makes final product decisions that should be the manager's call
- A support agent escalates itself to admin privileges without authorization
Examples (technical):
- Agent with `role: "reviewer"` calls `git commit` and `git push` -- write actions outside its `allowed_actions: ["comment", "approve", "request_changes"]`
- Agent with `role: "data_analyst"` executes `DROP TABLE` -- a DBA-only operation
- Agent gradually expands: first reads files, then edits configs, then modifies production deployments
Detection methods:
- Role Boundary Check: Validates actions against allowed/forbidden action sets
- Scope Analysis: Detects gradual scope expansion beyond assignment
- Authority Verification: Checks decision authority against role definition
- Task Hijacking Detection: Identifies when agent takes over another agent's task
Sub-types: role_violation, scope_expansion, authority_violation, decision_overreach, task_hijacking
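The role boundary check reduces to set membership tests against the role specification. The sketch below assumes a hypothetical `RoleSpec` structure and action names such as `git_commit`. Mapping forbidden actions to `authority_violation` and unlisted actions to `role_violation` is a guess at how the sub-types might be assigned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpec:
    """Hypothetical role specification with explicit allow and deny sets."""

    name: str
    allowed_actions: frozenset[str]
    forbidden_actions: frozenset[str] = frozenset()


def check_role_boundaries(role: RoleSpec, actions: list[str]) -> list[dict]:
    """Return one violation record per action that falls outside the role."""
    violations = []
    for index, action in enumerate(actions):
        if action in role.forbidden_actions:
            # Explicitly denied: treated here as an authority violation.
            violations.append(
                {"index": index, "action": action, "type": "authority_violation"}
            )
        elif action not in role.allowed_actions:
            # Not denied, but not granted either: treated as a role violation.
            violations.append(
                {"index": index, "action": action, "type": "role_violation"}
            )
    return violations


reviewer = RoleSpec(
    name="reviewer",
    allowed_actions=frozenset({"comment", "approve", "request_changes"}),
    forbidden_actions=frozenset({"git_push"}),
)
```

Running `check_role_boundaries(reviewer, ["comment", "git_commit", "git_push"])` returns two records, one for each write action. Detecting gradual scope expansion would additionally need the order and timing of actions, which this per-action check does not consider.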
F10: Communication Breakdown
| Field | Value |
|---|---|
| Detector key | communication |
| Tier | ICP |
| Severity | Medium |
| Accuracy | F1 0.667, P 0.571, R 0.800 |
| MAST mapping | FM-2.1, FM-2.2, FM-2.5 |
Plain language: Agents are miscommunicating. One agent sends a message but the receiving agent misunderstands it -- like giving someone directions in kilometers when they expect miles.
Technical: Measures alignment between sender intent and receiver interpretation using semantic similarity, validates message format compliance against expected schemas, and detects ambiguous or incomplete inter-agent messages.
Examples (non-technical):
- Agent A sends data in one format but Agent B expects a different format
- An ambiguous instruction like "process the results" is interpreted differently by two agents
- Critical details are missing from a handoff message between agents
Examples (technical):
- Agent A sends JSON `{"price": "19.99"}` (string) but Agent B expects `{"price": 19.99}` (number), causing a type error
- Agent A's message says "update the config" -- Agent B updates `nginx.conf` instead of `app.config.yml`
- Inter-agent message missing required `correlation_id` field, breaking downstream tracing
Detection methods:
- Intent Alignment: Measures alignment between sender's intent and receiver's interpretation
- Format Compliance: Checks message format matches expected schema
- Ambiguity Detection: Flags semantically ambiguous instructions
- Completeness Check: Verifies all required information is present
Sub-types: intent_mismatch, format_mismatch, semantic_ambiguity, incomplete_information, conflicting_instructions
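Format compliance and the completeness check can be illustrated with a small schema validator. The schema layout, the `HANDOFF_SCHEMA` name, and the `note` field are hypothetical. Only the `price` and `correlation_id` fields come from the examples above, and the sub-type labels are reused from the list above on the assumption that they apply this way.

```python
from numbers import Real

# Hypothetical schema: field name -> (expected type, required?)
HANDOFF_SCHEMA = {
    "correlation_id": (str, True),
    "price": (Real, True),
    "note": (str, False),
}


def check_format(message: dict, schema: dict) -> list[dict]:
    """Return an issue for each missing required field or wrongly typed value."""
    issues = []
    for name, (expected_type, required) in schema.items():
        if name not in message:
            if required:
                issues.append({"field": name, "type": "incomplete_information"})
            continue
        value = message[name]
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        is_bool_as_number = expected_type is Real and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_as_number:
            issues.append(
                {"field": name, "type": "format_mismatch", "got": type(value).__name__}
            )
    return issues
```

Passing `{"price": "19.99"}` yields two issues: the missing `correlation_id` and the string-typed `price`. A real deployment would more likely validate against a JSON Schema document than a hand-rolled dictionary like this one.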
F11: Coordination Failure
| Field | Value |
|---|---|
| Detector key | coordination |
| Tier | ICP |
| Severity | Critical |
| Accuracy | F1 0.914, P 0.842, R 1.000 |
| MAST mapping | FM-2.5 Ignored Input |
Plain language: Agents can't work together. They're waiting on each other in circles, ignoring each other's messages, or going back and forth endlessly without making progress -- like two people stuck saying "no, you go first" at a doorway.
Technical: Tracks message acknowledgment patterns, detects excessive back-and-forth exchanges (threshold: 5), analyzes delegation chains for cycles, and monitors whether inter-agent exchanges produce measurable forward progress.
Examples (non-technical):
- Agent A waits for Agent B's output while Agent B waits for Agent A -- neither moves
- One agent sends a request but the other agent never responds
- Two agents exchange 15 messages clarifying the same thing without making any progress
Examples (technical):
- Circular delegation: task routed A → B → C → A, creating an infinite delegation loop
- Agent A's `POST /handoff` to Agent B returns no acknowledgment after 30s timeout
- Agents exchange 12 `clarify_request`/`clarify_response` messages with cosine similarity > 0.95 (repeating themselves)
- Agent B receives Agent A's output but `processed: false` -- input was silently dropped
Detection methods:
- Message Acknowledgment Tracking: Detects ignored or unacknowledged messages
- Back-and-Forth Detection: Flags excessive message exchanges between agent pairs (threshold: 5)
- Circular Delegation Analysis: Traces delegation chains for cycles
- Progress Monitoring: Measures whether exchanges produce forward progress
The `CoordinationAnalyzer` checks 12 issue types, including information withholding, conflicting instructions, duplicate dispatch, data corruption relay, ordering violations, resource contention, rapid instruction changes, and response delays.