LangGraph Failure Modes¶

Platform-specific detectors for LangGraph state-graph agent execution. These catch issues unique to LangGraph's superstep execution model, conditional routing, parallel branches, and checkpoint persistence.

Recursion Limit¶

Field	Value
Detector key	`langgraph_recursion`
Severity	Critical

Plain language: Your graph agent hit the recursion limit and was forcibly stopped. It ran too many steps -- either because it's stuck in a cycle or because the task genuinely requires more steps than the limit allows.

Technical: Checks execution status for recursion_limit (definitive hit), monitors the ratio of completed supersteps to configured recursion limit (flags at >90%), and detects unbounded node repetition (same node appearing in 3+ distinct supersteps).

Examples (non-technical):

Your agent was set to a maximum of 25 steps but the task needed 30 -- it stopped mid-work
The agent kept going back to the same planning step over and over until it hit the limit
The agent used 23 of its 25 allowed steps -- it's about to hit the wall

Examples (technical):

Graph status: "recursion_limit" -- hard stop after exhausting configured limit
Superstep ratio: 23/25 = 0.92 (threshold: 0.90 -- approaching limit warning)
Node "planner" appears in supersteps [1, 4, 7, 10, 13] -- repeating every 3 steps (unbounded cycle)
Recursion limit set to 10 but task requires tool-call → result → tool-call chains of 15+ steps

Detection methods:

Status Check: Definitive detection when status == "recursion_limit"
Ratio Monitoring: Warns when supersteps / recursion_limit > 0.90
Node Repetition Analysis: Flags nodes appearing in 3+ distinct supersteps (configurable threshold)

Sub-types: recursion_limit_hit, approaching_limit, node_repetition

State Corruption¶

Field	Value
Detector key	`langgraph_state_corruption`
Severity	High

Plain language: Your agent's state got corrupted between steps. Values changed type unexpectedly, important fields disappeared, identity fields were modified, or counters went backwards -- things that should never happen.

Technical: Performs 10 integrity checks across consecutive state snapshots: type changes, null injection, field deletion, value explosion (>10x container growth), list shrinkage (append-only violation), identity field mutation (user_id, session_id, etc.), counter decreases, value jumps (>100x), suspicious field injection, and node error signals containing corruption keywords.

Examples (non-technical):

The user ID changed mid-conversation -- the agent confused who it was talking to
A counter that should only go up suddenly went backwards
A list of messages that should only grow got shorter -- messages were lost

Examples (technical):

Type change: state["price"] transitions from float to str between supersteps 3 and 4
Identity mutation: state["user_id"] changes from "usr_abc" to "usr_xyz" (immutable field violated)
Counter decrease: state["step_count"] goes from 8 to 5 (monotonic violation)
Value explosion: state["messages"] grows from 3 items to 45 items in one superstep (>10x)
Null injection: state["context"] was {"key": "value"} but becomes None
Node error: agent node output contains "state_error: schema violation"

Detection methods:

Type Drift Detection: Flags same key changing Python type between snapshots
Identity Field Protection: Monitors immutable fields (user_id, session_id, thread_id, etc.)
Monotonic Counter Validation: Ensures counters like step_count never decrease
Container Growth Tracking: Detects value explosion (>10x) and list shrinkage
Null Injection Detection: Catches non-None values becoming None
Error Signal Analysis: Scans node errors for corruption-related keywords

Sub-types: type_change, null_injection, field_deletion, value_explosion, list_shrinkage, identity_mutation, counter_decrease, value_jump, field_injection, node_error

Edge Misrouting¶

Field	Value
Detector key	`langgraph_edge_misroute`
Severity	High

Plain language: Your graph routed to the wrong node. A conditional edge sent the agent down the wrong path -- like a GPS giving you a turn that leads to a dead end. The routing decision contradicted the agent's state or output.

Technical: Validates conditional edge routing by checking target node existence, detecting dead-end and unreachable nodes, and performing semantic analysis between condition text and target node types. Cross-references routing decisions against actual state values and node outputs to detect contradictions.

Examples (non-technical):

The agent's condition said "task complete" but it routed to a processing node instead of the end node
A branch in the graph leads to a node that doesn't exist anymore
A node has no connections going out -- the workflow gets stuck there

Examples (technical):

Edge condition "finish" routes to node process_data instead of __end__ (condition-target mismatch)
Conditional edge targets node_id: "validator" but no node with that ID exists in the graph
State has {"should_continue": false} but conditional edge evaluates to continue path (state-condition contradiction)
Node transform output is {"decision": "reject"} but edge routes to approve node (output-condition contradiction)
Dead-end: node analyze has incoming edges but no outgoing edges and is not __end__

Detection methods:

Target Existence Check: Validates edge targets exist in graph definition
Dead-End Detection: Finds non-terminal nodes with no outgoing edges
Condition-Target Semantic Analysis: Compares condition text against target node type
State-Condition Cross-Reference: Checks if routing decisions match actual state values
Output-Condition Validation: Verifies node outputs align with taken edge condition

Sub-types: missing_target, dead_end, unreachable, condition_mismatch, condition_title_mismatch, state_condition_contradiction, output_condition_contradiction, condition_value_target_mismatch, skipped_conditional

Tool Failures¶

Field	Value
Detector key	`langgraph_tool_failure`
Severity	High

Plain language: A tool node in your graph failed, and the graph either couldn't recover or had to fall back to an alternative. Unrecovered tool failures can crash the entire graph execution.

Technical: Filters for node_type == "tool" nodes with status == "failed", then classifies recovery pattern by checking the next superstep: retry (same node reappears), fallback (different node handles it), or uncaught (no recovery and graph fails).

Examples (non-technical):

The agent tried to search the web but the search tool crashed -- the entire graph stopped
A database query tool failed, and the agent retried it but it failed again
A tool failed but the agent switched to an alternative tool and continued successfully

Examples (technical):

Uncaught: tool web_search fails at superstep 3, no nodes in superstep 4, graph status: "failed"
Retried failure: tool query_db fails at superstep 5, reappears at superstep 6 with status: "failed" again
Fallback: tool api_call fails at superstep 4, node fallback_handler appears at superstep 5 with status: "succeeded"
Tool error: {"error": "ConnectionTimeout: API endpoint unreachable after 30s"}

Detection methods:

Failure Detection: Identifies tool nodes with status == "failed"
Retry Pattern Analysis: Checks if same node_id appears in next superstep
Fallback Detection: Identifies different nodes handling recovery in next superstep
Uncaught Failure Classification: Flags failures with no recovery when graph status is failed/error

Sub-types: uncaught_failure, retried_failure, fallback_handled, tool_failure

Parallel Sync Issues¶

Field	Value
Detector key	`langgraph_parallel_sync`
Severity	High

Plain language: Nodes running in parallel stepped on each other. Two nodes tried to write to the same piece of state simultaneously, or parallel branches didn't properly merge back together -- causing lost data or inconsistent results.

Technical: Detects parallel execution (multiple nodes in same superstep), then checks for write conflicts (multiple nodes writing same state key without a join), race conditions (overlapping read/write sets), missing join nodes, mixed success/failure in parallel branches, and state error keywords after parallel supersteps.

Examples (non-technical):

Two parallel agents both tried to update the same field -- one overwrote the other's work
Three branches ran in parallel but only two finished successfully -- the third failed silently
Parallel branches completed but there was no step to merge their results together

Examples (technical):

Write conflict: nodes researcher and analyst both write to state["summary"] in superstep 4 (no reducer defined)
Race condition: node A reads state["data"], node B writes state["data"] in same superstep
Missing join: superstep 4 has 3 parallel nodes, superstep 5 has 2 nodes (expected 1 join node)
Failed parallel: superstep 3 has nodes [fetch_a: succeeded, fetch_b: failed] -- mixed results
State error: state["sync_status"] contains "partial_failure" after parallel superstep

Detection methods:

Write Conflict Detection: Checks if multiple parallel nodes write the same state key
Race Condition Analysis: Finds overlapping read/write sets between parallel nodes
Join Validation: Verifies a single join node follows parallel execution
Mixed Result Detection: Flags supersteps with both succeeded and failed nodes
Post-Parallel State Check: Scans state for error keywords after parallel execution

Sub-types: write_conflict, race_condition, missing_join, failed_parallel, downstream_failure, state_error_after_parallel

Checkpoint Corruption¶

Field	Value
Detector key	`langgraph_checkpoint_corruption`
Severity	High

Plain language: Your graph's saved checkpoints are corrupted. Checkpoints are snapshots that let you resume or replay a graph run -- if they're out of order, have gaps, or don't match the actual state, replaying from them will produce wrong results.

Technical: Validates checkpoint integrity by checking superstep monotonicity (ordering), sequence completeness (no gaps), state consistency (checkpoint state matches corresponding state snapshot), and schema completeness (all required keys from state_schema present in checkpoint state).

Examples (non-technical):

A saved checkpoint says the agent was at step 5, but the next checkpoint says step 3 -- the order is wrong
Checkpoints jump from step 2 to step 5 -- steps 3 and 4 are missing
The checkpoint's saved state doesn't match what the agent actually had at that step

Examples (technical):

Non-monotonic: checkpoint sequence has supersteps [1, 2, 5, 3, 4] -- step 3 appears after step 5
Superstep gap: checkpoints at supersteps [0, 1, 2, 5, 6] -- gap at steps 3-4
State inconsistency: checkpoint at superstep 3 has {"messages": 5} but state snapshot shows {"messages": 8} (value mismatch)
Missing schema keys: state_schema requires ["messages", "context", "plan"] but checkpoint only has ["messages"]
Extra keys in checkpoint not present in state snapshot indicate data integrity issue

Detection methods:

Monotonicity Validation: Ensures checkpoint supersteps are non-decreasing
Sequence Completeness: Detects gaps in superstep sequence (allows duplicates)
State Cross-Reference: Compares checkpoint state against state snapshot at same superstep
Schema Completeness: Validates all required keys from state_schema exist in checkpoint

Sub-types: non_monotonic, superstep_gap, state_inconsistency, missing_schema_keys