Detecting Overflow in Deep Agents Traces

April 2026

LangChain's Deep Agents runtime shipped in March 2026 with three primitives that make long-horizon agent work tractable: write_todos for planning, task for isolated-context subagent spawns, and the LangGraph Memory Store for persistent state. The design is deliberate — isolate the subagent so the parent's context window doesn't drown in intermediate reasoning.

The isolation helps. It does not eliminate the failure. Subagent results still flow back to the parent, and if the parent stitches them together without summarizing, the parent's own window blows up anyway. The subagent ran cheap; the supervisor still ran expensive.

Pisama-core v1.6.0 ships a DeepAgentsAdapter that ingests Deep Agents state checkpoints and runs the full Pisama detector suite against them. This post walks through detecting that specific failure — context overflow on the supervisor agent after fan-out to subagents — on a real Deep Agents trace.

The trace

A research agent is asked to produce a comparative brief on three frontier models. The agent calls write_todos to lay out a plan, then spawns three subagents via task — one per model — and finally folds the results back into a single report.

Only four state checkpoints are relevant:

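# Placeholder payloads for this walkthrough (illustrative stand-ins, not part
# of pisama-core): each returns filler text of roughly the named size.
def _filler(approx_tokens):
    return "token " * approx_tokens

def _12k_tokens_of_scraped_pages():
    return _filler(12_000)

def _14k_tokens_of_scraped_pages():
    return _filler(14_000)

def _11k_tokens_of_scraped_pages():
    return _filler(11_000)

states = [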
    # 1. Supervisor writes the plan.
    {
        "node": "planner",
        "messages": [
            {"role": "user", "content": "Compare Claude 4.5, GPT-5.4, Gemini 3.1 on reasoning benchmarks."},
            {"role": "ai", "tool_calls": [{
                "name": "write_todos",
                "args": {"todos": [
                    {"content": "Research Claude 4.5 reasoning results", "status": "pending"},
                    {"content": "Research GPT-5.4 reasoning results", "status": "pending"},
                    {"content": "Research Gemini 3.1 reasoning results", "status": "pending"},
                    {"content": "Synthesize comparative brief", "status": "pending"},
                ]},
            }]},
        ],
        "todos": [
            {"content": "Research Claude 4.5 reasoning results", "status": "pending"},
            {"content": "Research GPT-5.4 reasoning results", "status": "pending"},
            {"content": "Research Gemini 3.1 reasoning results", "status": "pending"},
            {"content": "Synthesize comparative brief", "status": "pending"},
        ],
    },
    # 2. Three subagents spawned via `task`. Each returns the raw content
    # of 6-10 source pages it scraped. No summarization.
    {
        "node": "supervisor",
        "subagents": [
            {"name": "researcher_claude", "task": "Find Claude 4.5 reasoning results",
             "result": _12k_tokens_of_scraped_pages()},
            {"name": "researcher_gpt", "task": "Find GPT-5.4 reasoning results",
             "result": _14k_tokens_of_scraped_pages()},
            {"name": "researcher_gemini", "task": "Find Gemini 3.1 reasoning results",
             "result": _11k_tokens_of_scraped_pages()},
        ],
        "messages": [],
    },
    # 3. Supervisor re-reads every subagent result verbatim to "synthesize".
    {
        "node": "agent",
        "messages": [
            {"role": "tool", "name": "researcher_claude",
             "content": _12k_tokens_of_scraped_pages()},
            {"role": "tool", "name": "researcher_gpt",
             "content": _14k_tokens_of_scraped_pages()},
            {"role": "tool", "name": "researcher_gemini",
             "content": _11k_tokens_of_scraped_pages()},
            {"role": "ai", "content": "I need to re-read the Claude material more carefully..."},
        ],
    },
    # 4. Synthesis attempt — truncated; model dropped Gemini entirely.
    {
        "node": "agent",
        "messages": [
            {"role": "ai", "content": "Claude 4.5 and GPT-5.4 compare as follows..."},
        ],
        "todos": [
            {"content": "Research Claude 4.5 reasoning results", "status": "completed"},
            {"content": "Research GPT-5.4 reasoning results", "status": "completed"},
            {"content": "Research Gemini 3.1 reasoning results", "status": "pending"},
            {"content": "Synthesize comparative brief", "status": "completed"},
        ],
    },
]

The final artifact looks plausible. The rubric reads "comparative brief" and sees a brief. The grader does not notice that Gemini never made it in, that the todo for Gemini is still pending, and that Synthesize was marked completed anyway.
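The todo inconsistency is easy to check by hand once you look for it. The following is a minimal sketch, not Pisama's completion detector: the function name `find_out_of_order_completions` is hypothetical, and it assumes that list order encodes dependency order, which is a simplification for this example rather than something Deep Agents guarantees.

```python
from typing import Dict, List, Tuple


def find_out_of_order_completions(todos: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (completed_todo, earlier_unfinished_todo) pairs.

    Assumption: a todo depends on every todo listed before it.
    """
    problems: List[Tuple[str, str]] = []
    unfinished_so_far: List[str] = []
    for todo in todos:
        if todo["status"] == "completed":
            # A completed todo is suspicious if anything before it is unfinished.
            problems.extend((todo["content"], earlier) for earlier in unfinished_so_far)
        else:
            unfinished_so_far.append(todo["content"])
    return problems


# The todo list from the fourth checkpoint in the trace.
final_todos = [
    {"content": "Research Claude 4.5 reasoning results", "status": "completed"},
    {"content": "Research GPT-5.4 reasoning results", "status": "completed"},
    {"content": "Research Gemini 3.1 reasoning results", "status": "pending"},
    {"content": "Synthesize comparative brief", "status": "completed"},
]

violations = find_out_of_order_completions(final_todos)
for completed, blocker in violations:
    print(f"'{completed}' completed while '{blocker}' is unfinished")
```

A rubric that only inspects the final message never runs a check like this, which is why the brief passes.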

Wire it up

pip install pisama-core

import asyncio
from pisama_core.adapters.deep_agents import DeepAgentsAdapter
from pisama_core.detection import DetectionOrchestrator

adapter = DeepAgentsAdapter(session_id="research-001", agent_name="comparator")
trace = adapter.parse_trace(states)

orchestrator = DetectionOrchestrator()
result = asyncio.run(orchestrator.analyze(trace))

for finding in result.get_issues():
    print(finding.detector_name, finding.severity, finding.summary)

The adapter does not import langchain, langgraph, or deepagents. It consumes whatever your runtime exposes — agent.get_state_history(), a custom checkpointer, or a JSON dump from a postmortem.
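For the postmortem case, a small loader is usually all that sits between a JSON dump and the adapter. This is a sketch under assumptions: the helper `load_states` and the file name `postmortem.json` are hypothetical, and it assumes the dump is a JSON array of checkpoint objects shaped like the `states` list above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_states(path: Path) -> List[Dict[str, Any]]:
    """Load a list of checkpoint dicts from a JSON file (hypothetical helper)."""
    with path.open(encoding="utf-8") as handle:
        states = json.load(handle)
    # Fail early if the dump is not a JSON array of objects.
    if not isinstance(states, list) or not all(isinstance(s, dict) for s in states):
        raise ValueError("expected a JSON array of checkpoint objects")
    return states


# Round-trip demo using a throwaway file in place of a real dump.
demo_states = [{"node": "planner", "messages": [], "todos": []}]
with tempfile.TemporaryDirectory() as tmp:
    dump_path = Path(tmp) / "postmortem.json"
    dump_path.write_text(json.dumps(demo_states), encoding="utf-8")
    loaded = load_states(dump_path)

# `loaded` can then be passed to adapter.parse_trace(loaded) as shown above.
```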

What Pisama flags

The orchestrator runs all 17 production-tier detectors against the trace. Three fire:

{
  "detector_name": "overflow",
  "detected": true,
  "severity": 72,
  "confidence": 0.91,
  "summary": "Supervisor context accumulated 37k tokens of raw subagent output without summarization before synthesis step.",
  "evidence": [
    {
      "description": "Three subagent results re-injected into supervisor context verbatim",
      "span_ids": ["span_researcher_claude", "span_researcher_gpt", "span_researcher_gemini"],
      "data": {
        "subagent_output_tokens": [12000, 14000, 11000],
        "supervisor_context_tokens_at_synthesis": 37420,
        "context_window_utilization": 0.93
      }
    }
  ]
}

{
  "detector_name": "completion",
  "detected": true,
  "severity": 65,
  "confidence": 0.88,
  "summary": "Todo 'Synthesize comparative brief' marked completed while 'Research Gemini 3.1' remains pending.",
  "evidence": [
    {
      "description": "Downstream todo completed before upstream dependency",
      "data": {
        "completed_todo": "Synthesize comparative brief",
        "pending_predecessor": "Research Gemini 3.1 reasoning results"
      }
    }
  ]
}

{
  "detector_name": "coordination",
  "detected": true,
  "severity": 58,
  "confidence": 0.84,
  "summary": "Subagent researcher_gemini produced output but the synthesis message references only two of three models.",
  "evidence": [
    {
      "description": "Subagent result dropped on supervisor handoff",
      "span_ids": ["span_researcher_gemini", "span_synthesis"]
    }
  ]
}

The three findings are related: the supervisor hit its context ceiling, silently dropped one subagent's contribution, and the agent's own completion tracking never noticed, because it was keyed to todo status rather than to the synthesis content.
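The dropped contribution can be illustrated with a deliberately naive check: does the synthesis text mention each subagent's subject at all? This is a sketch only, not how Pisama's coordination detector works. The function name and the `subjects` mapping are hypothetical, and plain substring matching is assumed to be good enough for this example.

```python
from typing import Dict, List


def subagents_missing_from_synthesis(subjects: Dict[str, str], synthesis: str) -> List[str]:
    """Return subagent names whose subject never appears in the synthesis text."""
    lowered = synthesis.lower()
    return [name for name, subject in subjects.items() if subject.lower() not in lowered]


# Subagent names from the trace, mapped to the model each one researched.
subjects = {
    "researcher_claude": "Claude 4.5",
    "researcher_gpt": "GPT-5.4",
    "researcher_gemini": "Gemini 3.1",
}

# The synthesis message from the fourth checkpoint.
synthesis = "Claude 4.5 and GPT-5.4 compare as follows..."

missing = subagents_missing_from_synthesis(subjects, synthesis)
print(missing)
```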

Why the overflow detector catches this and observability does not

The overflow detector (see packages/pisama-core/src/pisama_core/detection/overflow.py) walks the trace tree, accumulates token counts across message spans attributed to the same parent agent, and fires when a parent's rolling context crosses a configured threshold. On Deep Agents traces the adapter tags each subagent span with deep_agents.subagent.isolated_context=true, so the detector knows to exclude the subagent's internal tokens and count only the results folded back to the parent.
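The accumulation logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in overflow.py: the `Span` dataclass, the `folded_back` flag, the 40,000-token window, and the 0.9 threshold are all hypothetical. Only the idea of skipping isolated subagent tokens and counting results returned to the parent comes from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    parent_agent: str               # agent whose window this span would land in
    tokens: int
    isolated_context: bool = False  # mirrors deep_agents.subagent.isolated_context
    folded_back: bool = False       # True for a subagent result returned to the parent


def overflowing_parents(spans: List[Span], window_tokens: int, threshold: float) -> Dict[str, int]:
    """Return {parent_agent: counted_tokens} for parents at or over the threshold."""
    totals: Dict[str, int] = defaultdict(int)
    for span in spans:
        # Tokens spent inside an isolated subagent never reach the parent's window.
        if span.isolated_context and not span.folded_back:
            continue
        totals[span.parent_agent] += span.tokens
    return {
        agent: total
        for agent, total in totals.items()
        if total / window_tokens >= threshold
    }


spans = [
    # Subagent-internal reasoning: isolated, so it is skipped.
    Span("supervisor", 30_000, isolated_context=True),
    # The three raw results folded back into the supervisor.
    Span("supervisor", 12_000, isolated_context=True, folded_back=True),
    Span("supervisor", 14_000, isolated_context=True, folded_back=True),
    Span("supervisor", 11_000, isolated_context=True, folded_back=True),
]

# Window size and threshold are assumed values for this example.
flagged = overflowing_parents(spans, window_tokens=40_000, threshold=0.9)
print(flagged)
```

Keying the totals by parent agent is the design choice that matters here: per-span token counts alone would show four unremarkable spans, while the per-parent sum shows one window near its limit.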

The rubric evaluator never sees the token counts. Phoenix sees them but does not correlate them across subagent boundaries. Langfuse records the spans but does not compute rolling context per parent. The signal is in the trace graph; the detector is what turns the graph into a finding.

What this doesn't do

The Deep Agents runtime does not expose a pre-execution hook API comparable to Claude Code's, so the adapter is ingestion-only — it cannot block a subagent spawn, inject a summarization step, or rewrite the plan. That is by design: Deep Agents' in-the-loop guardrails are its own concern. Pisama complements them with post-hoc forensics.

For a full write-up of the adapter's state-shape contract, see the Deep Agents integration guide. The detector catalog is in the Failure Modes reference.

Pisama-core v1.6.0 ships the DeepAgentsAdapter, plus the 17 production detectors run against this example. Get started in 30 seconds.