
Convergence and Orchestration: The Failures Rubric Graders Cannot See

April 2026

Note

Dated launch post. Detector counts, F1 figures, and code samples reflect Pisama at the time of writing; see the current docs for up-to-date numbers and runnable examples.


Anthropic's Managed Agents research preview introduced outcomes — a markdown rubric, a grader loop, and a verdict per criterion. The pattern is the cleanest artifact evaluator shipped by a foundation model vendor. It is also structurally blind to two classes of failure that sink long-running agent systems: convergence failures and orchestration quality failures.

Both are process-level. Both require numerical reasoning over a trajectory, not a verdict on a final document. Both are what Pisama was built for.

The shape of the blindness

A rubric grader reads the final artifact and an isolated context window. It does not see:

  • Whether the optimization metric is still improving or has plateaued
  • Whether the agent is oscillating between near-duplicates
  • Whether three workers are idle while one carries the whole load
  • Whether delegation dropped half the context across a handoff
  • Whether a supervisor reassigned the same subtask four times

These do not appear in the artifact. They appear in the trace.

Convergence: the metric the grader never sees

The convergence detector at backend/app/detection/convergence.py ingests the metric series an iterative agent produces — validation loss, rubric score, benchmark accuracy, ARC-AGI eval, anything a caller tracks across iterations — and classifies the trajectory into one of four failure types:

| Type | Definition | Typical cause |
|---|---|---|
| plateau | Metric improvement stalls below the improvement threshold for N windows | Agent exhausted its strategy space |
| regression | Metric worsens past the previous best | Agent pursuing a worse branch |
| thrashing | Metric oscillates without clear trend | Alternating between near-duplicate strategies |
| divergence | Metric consistently trends the wrong direction | Optimization loop destabilized |

All four are rubric-invisible. The rubric sees iteration N's artifact and either accepts or rejects it. It does not see that iteration N is worse than iteration N-3 and that the loop has been producing worse artifacts for five rounds.

from pisama_core.detection import ConvergenceDetector

metrics = [0.42, 0.48, 0.51, 0.52, 0.52, 0.52, 0.52]
result = ConvergenceDetector().detect(
    metrics=metrics,
    direction="higher_is_better",
    window_size=3,
)
# ConvergenceFailureType.PLATEAU — severity MODERATE

Cost: zero. No LLM call. The detector operates on a list of floats. It runs in microseconds. It is not replicable by any grader reading the final document because the signal is not in the final document.
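To make the four types concrete, here is a minimal sketch of how a trajectory could be classified from a plain list of floats. It is not Pisama's implementation: the `classify` function, the `Failure` enum, the `min_improvement` threshold, and the order in which the checks run are all assumptions made for illustration, and it looks at a single trailing window rather than N windows.

```python
from enum import Enum
from typing import Optional, Sequence


class Failure(Enum):
    PLATEAU = "plateau"
    REGRESSION = "regression"
    THRASHING = "thrashing"
    DIVERGENCE = "divergence"


def classify(
    metrics: Sequence[float],
    higher_is_better: bool = True,
    window_size: int = 3,
    min_improvement: float = 0.01,  # assumed threshold, not Pisama's default
) -> Optional[Failure]:
    """Classify the tail of a metric series, or return None if it looks healthy."""
    if len(metrics) < window_size + 1:
        return None  # not enough history to judge
    # Normalise so that "bigger is better" for the rest of the function.
    series = list(metrics) if higher_is_better else [-m for m in metrics]
    window = series[-(window_size + 1):]
    deltas = [b - a for a, b in zip(window, window[1:])]

    # Divergence: every step in the window moves the wrong way.
    if all(d < -min_improvement for d in deltas):
        return Failure.DIVERGENCE
    # Thrashing: every consecutive pair of steps flips sign, with meaningful size.
    flips = sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)
    if flips > 0 and flips == len(deltas) - 1 and all(
        abs(d) > min_improvement for d in deltas
    ):
        return Failure.THRASHING
    # Regression: the latest value sits clearly below the best seen so far.
    if series[-1] < max(series) - min_improvement:
        return Failure.REGRESSION
    # Plateau: every step in the window changes by less than the threshold.
    if all(abs(d) < min_improvement for d in deltas):
        return Failure.PLATEAU
    return None


result = classify([0.42, 0.48, 0.51, 0.52, 0.52, 0.52, 0.52])  # Failure.PLATEAU
```

The point of the sketch is the input type: nothing here needs the artifact, only the numbers logged across iterations.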

Orchestration quality: the seven dimensions a rubric flattens

Managed Agents' multi-agent research preview ships hub-and-spoke, one level deep. That topology is a deliberate simplification — it makes the coordination tractable, and it makes the rubric tractable. A hub-and-spoke system has one coordinator and N workers; the rubric can focus on the coordinator's final artifact.

Richer topologies — DAGs, mesh, pipelines, hierarchical supervisors — fail in ways no artifact rubric can describe.

The orchestration quality scorer at backend/app/detection/orchestration_quality.py scores each trace across seven dimensions on a 0.0–1.0 scale:

| Dimension | Question it answers | Tier |
|---|---|---|
| efficiency | Makespan ratio: optimal parallel vs. actual elapsed | Metric |
| utilization | Agent load distribution (Gini coefficient) | Metric |
| parallelization | Did the workflow miss exploitable parallelism? | Structural |
| delegation_quality | Was context preserved across handoffs? | Semantic |
| communication_efficiency | Message-to-work ratio | Metric |
| robustness | Did the system recover from errors? | Structural |
| topology_alignment | Is the topology suited to the task? | Semantic |

The scorer auto-detects the topology (pipeline, fan-out, fan-in, parallel, hierarchical, mixed) and weights the dimensions accordingly. A fan-out with a high Gini coefficient (for example, one worker doing 90% of the work) scores low on utilization regardless of whether the final artifact passes.
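For readers who have not met the Gini coefficient outside economics, the sketch below computes it over per-agent load using the standard rank-weighted formula. The `utilization_score` mapping onto a 0.0–1.0 scale is a hypothetical choice for illustration; how Pisama turns the coefficient into a score is not specified here.

```python
from typing import Sequence


def gini(loads: Sequence[float]) -> float:
    """Gini coefficient of non-negative per-agent loads: 0.0 is perfectly even."""
    n = len(loads)
    total = sum(loads)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(loads)
    # Rank-weighted form: G = 2 * sum(rank_i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def utilization_score(loads: Sequence[float]) -> float:
    """Hypothetical mapping onto 0.0-1.0, where 1.0 means evenly loaded workers."""
    n = len(loads)
    if n < 2:
        return 1.0
    # The maximum Gini for n agents is (n - 1) / n, so rescale before inverting.
    return 1.0 - gini(loads) / ((n - 1) / n)
```

With four workers whose share of the work is `[90, 4, 3, 3]`, `gini` comes out at about 0.655 and the rescaled score at about 0.13. An even split `[25, 25, 25, 25]` gives a Gini of 0.0 and a score of 1.0.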

Three of the seven dimensions — efficiency, utilization, communication efficiency — are pure metrics. Four require structural or semantic reasoning over the trace graph. None is expressible as a rubric criterion because none is about the artifact.
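As an illustration of what a pure metric looks like, the efficiency dimension can be approximated as the critical-path length of the task graph divided by the wall-clock time the run actually took. The sketch below assumes unlimited workers for the optimal case and that every dependency names a task present in `durations`. The function names and input shapes are hypothetical, not Pisama's trace schema.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable


def critical_path(durations: Dict[str, float], deps: Dict[str, Iterable[str]]) -> float:
    """Longest dependency chain: the makespan if every ready task ran at once."""
    graph = {task: set(deps.get(task, ())) for task in durations}
    finish: Dict[str, float] = {}
    # static_order() yields each task only after all of its dependencies.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[task]), default=0.0)
        finish[task] = start + durations[task]
    return max(finish.values(), default=0.0)


def efficiency(
    durations: Dict[str, float],
    deps: Dict[str, Iterable[str]],
    actual_elapsed: float,
) -> float:
    """Ratio of optimal parallel makespan to actual elapsed time, capped at 1.0."""
    if actual_elapsed <= 0:
        return 0.0
    return min(1.0, critical_path(durations, deps) / actual_elapsed)


# Hypothetical trace: two searches that could have run in parallel but ran serially.
durations = {"plan": 2.0, "search_a": 5.0, "search_b": 4.0, "write": 3.0}
deps = {"search_a": ["plan"], "search_b": ["plan"], "write": ["search_a", "search_b"]}
score = efficiency(durations, deps, actual_elapsed=14.0)
```

Here the critical path is 10 seconds (plan, search_a, write) against 14 seconds elapsed, so the score is roughly 0.71. The final report would be identical either way.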

What this means for an evaluation stack

If your stack is a Managed Agents session + outcomes, your evaluation is artifact-complete and process-blind. If your stack is a LangGraph DAG with a custom grader, you have the same gap. The fix is not to replace the rubric — the rubric is doing its job — but to add a process-level layer alongside it.

Pisama's POST /api/v1/evaluate/rubric endpoint routes rubric criteria to detectors where they match semantically (hallucination, completion, persona drift, derailment) and falls back to an LLM judge for criteria with no detector equivalent. The process-level detectors — convergence, orchestration quality, coordination, communication, corruption, withholding, persona drift, loop — run independently on the trace and produce findings the rubric has no slot for.
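The routing idea can be sketched in a few lines. This is an illustration of the dispatch pattern only: the keyword table is a crude stand-in for semantic matching, and `DETECTOR_KEYWORDS`, `route`, and `plan` are hypothetical names, not the endpoint's request schema or Pisama's matching logic.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword table standing in for semantic matching.
DETECTOR_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "hallucination": ("hallucinat", "unsupported claim", "fabricat"),
    "completion": ("complete", "all sections", "finished"),
    "persona_drift": ("persona", "tone", "voice"),
    "derailment": ("off-topic", "derail", "stays on task"),
}


def route(criterion: str) -> str:
    """Return the detector for a rubric criterion, or 'llm_judge' as the fallback."""
    text = criterion.lower()
    for detector, keywords in DETECTOR_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return detector
    return "llm_judge"


def plan(criteria: List[str]) -> Dict[str, List[str]]:
    """Group rubric criteria by the evaluator that would handle them."""
    routed: Dict[str, List[str]] = {}
    for criterion in criteria:
        routed.setdefault(route(criterion), []).append(criterion)
    return routed
```

A criterion such as "The report must not contain fabricated citations" lands on the hallucination detector in this sketch, while "Prose is elegant and concise" falls through to the LLM judge.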

The two layers coexist:

  • Rubric layer answers: did this artifact meet the criteria?
  • Process layer answers: did the execution that produced this artifact go off the rails in ways the artifact doesn't show?

A Managed Agents session that produces a beautiful final report after four iterations of worsening metrics still passes the rubric. The metrics say the process was broken. Only the process layer can surface that.
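One way to keep the two layers separate in a report, sketched under the assumption that rubric verdicts arrive as a criterion-to-boolean mapping and process findings as a list of strings. The `EvaluationReport` class and its summary labels are hypothetical, not a Pisama type.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvaluationReport:
    rubric_verdicts: Dict[str, bool]  # criterion -> did the artifact pass it?
    process_findings: List[str] = field(default_factory=list)

    @property
    def artifact_ok(self) -> bool:
        return all(self.rubric_verdicts.values())

    @property
    def process_ok(self) -> bool:
        return not self.process_findings

    def summary(self) -> str:
        """Keep the artifact verdict and the process verdict distinguishable."""
        if self.artifact_ok and self.process_ok:
            return "pass"
        if self.artifact_ok:
            return "pass_with_process_findings"
        return "fail"


report = EvaluationReport(
    rubric_verdicts={"cites sources": True, "covers all sections": True},
    process_findings=["regression: metric worsened over the last 4 iterations"],
)
```

In this example `report.summary()` returns `"pass_with_process_findings"`: the rubric verdict is preserved, and the process finding is not allowed to disappear behind it.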

Try it

The convergence detector and orchestration quality scorer are in the open-source pisama-core package:

pip install pisama

from pisama_core.detection import ConvergenceDetector
from app.detection.orchestration_quality import OrchestrationQualityScorer

The rubric engine is at POST /api/v1/evaluate/rubric — see the SDK quickstart.

Artifact evaluation is a solved problem. Process evaluation is not.