Convergence and Orchestration: The Failures Rubric Graders Cannot See
April 2026
Note
Dated launch post. Detector counts, F1 figures, and code samples reflect Pisama at the time of writing; see the current docs for up-to-date numbers and runnable examples.
Anthropic's Managed Agents research preview introduced outcomes — a markdown rubric, a grader loop, and a verdict per criterion. The pattern is the cleanest artifact evaluator shipped by a foundation model vendor. It is also structurally blind to two classes of failure that sink long-running agent systems: convergence failures and orchestration quality failures.
Both are process-level. Both require numerical reasoning over a trajectory, not a verdict on a final document. Both are what Pisama was built for.
The shape of the blindness
A rubric grader reads the final artifact and an isolated context window. It does not see:
- Whether the optimization metric is still improving or has plateaued
- Whether the agent is oscillating between near-duplicates
- Whether three workers are idle while one carries the whole load
- Whether delegation dropped half the context across a handoff
- Whether a supervisor reassigned the same subtask four times
These do not appear in the artifact. They appear in the trace.
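To make one of these concrete, here is a minimal sketch of how the "reassigned the same subtask four times" signal could be read off a trace. The event schema (`type`, `subtask`, `agent`) and the helper name `reassigned_subtasks` are hypothetical, chosen for illustration; they are not Pisama's trace format or API.

```python
from collections import Counter

# Hypothetical trace: one dict per event. Field names are illustrative only.
trace = [
    {"type": "assign", "subtask": "summarize", "agent": "worker-1"},
    {"type": "assign", "subtask": "summarize", "agent": "worker-2"},
    {"type": "assign", "subtask": "extract", "agent": "worker-3"},
    {"type": "assign", "subtask": "summarize", "agent": "worker-1"},
    {"type": "assign", "subtask": "summarize", "agent": "worker-2"},
]


def reassigned_subtasks(events, max_assignments=3):
    """Return subtasks that were assigned more than `max_assignments` times."""
    counts = Counter(e["subtask"] for e in events if e["type"] == "assign")
    return {task: n for task, n in counts.items() if n > max_assignments}


flagged = reassigned_subtasks(trace)
print(flagged)
```

The final artifact for `summarize` may look fine; the four assignments only exist in the event stream.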
Convergence: the metric the grader never sees
The convergence detector at backend/app/detection/convergence.py ingests the metric series an iterative agent produces — validation loss, rubric score, benchmark accuracy, ARC-AGI eval, anything a caller tracks across iterations — and classifies the trajectory into one of four failure types:
| Type | Definition | Typical cause |
|---|---|---|
| plateau | Metric improvement stalls below the improvement threshold for N windows | Agent exhausted its strategy space |
| regression | Metric worsens past the previous best | Agent pursuing a worse branch |
| thrashing | Metric oscillates without clear trend | Alternating between near-duplicate strategies |
| divergence | Metric consistently trends the wrong direction | Optimization loop destabilized |
All four are rubric-invisible. The rubric sees iteration N's artifact and either accepts or rejects it. It does not see that iteration N is worse than iteration N-3 and that the loop has been producing worse artifacts for five rounds.
from pisama_core.detection import ConvergenceDetector

metrics = [0.42, 0.48, 0.51, 0.52, 0.52, 0.52, 0.52]
result = ConvergenceDetector().detect(
    metrics=metrics,
    direction="higher_is_better",
    window_size=3,
)
# ConvergenceFailureType.PLATEAU, severity MODERATE
Cost: zero. No LLM call. The detector operates on a list of floats. It runs in microseconds. It is not replicable by any grader reading the final document because the signal is not in the final document.
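To show why this needs nothing more than arithmetic, here is a toy classifier for the four failure types in the table. It is a sketch under stated assumptions, not the logic in `convergence.py`: the rule order, the `min_improvement` threshold, and the names `Trajectory` and `classify_trajectory` are all invented for illustration.

```python
from enum import Enum


class Trajectory(Enum):
    HEALTHY = "healthy"
    PLATEAU = "plateau"
    REGRESSION = "regression"
    THRASHING = "thrashing"
    DIVERGENCE = "divergence"


def classify_trajectory(metrics, higher_is_better=True, window_size=3,
                        min_improvement=0.005):
    """Toy classifier over the last `window_size` steps of a metric series."""
    # Flip the sign when lower is better so "up" always means "improving".
    series = [m if higher_is_better else -m for m in metrics]
    if window_size < 1 or len(series) <= window_size:
        return Trajectory.HEALTHY  # not enough history to judge

    recent = series[-(window_size + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    best_before = max(series[:-window_size])

    # Divergence: every step in the window made things meaningfully worse.
    if all(d < -min_improvement for d in deltas):
        return Trajectory.DIVERGENCE
    # Thrashing: large moves whose direction flips at every step.
    flips = sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)
    if (len(deltas) >= 2 and flips == len(deltas) - 1
            and all(abs(d) > min_improvement for d in deltas)):
        return Trajectory.THRASHING
    # Regression: the latest value sits below the best seen before the window.
    if series[-1] < best_before - min_improvement:
        return Trajectory.REGRESSION
    # Plateau: no step in the window moved by more than the threshold.
    if all(abs(d) <= min_improvement for d in deltas):
        return Trajectory.PLATEAU
    return Trajectory.HEALTHY


print(classify_trajectory([0.42, 0.48, 0.51, 0.52, 0.52, 0.52, 0.52]))
```

On the series from the example above, the last three steps are all zero, so the sketch lands on `Trajectory.PLATEAU`. A production detector would also need severity grading and noise handling, which this omits.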
Orchestration quality: the seven dimensions a rubric flattens
Managed Agents' multi-agent research preview ships hub-and-spoke, one level deep. That topology is a deliberate simplification — it makes the coordination tractable, and it makes the rubric tractable. A hub-and-spoke system has one coordinator and N workers; the rubric can focus on the coordinator's final artifact.
Richer topologies — DAGs, mesh, pipelines, hierarchical supervisors — fail in ways no artifact rubric can describe.
The orchestration quality scorer at backend/app/detection/orchestration_quality.py scores each trace across seven dimensions on a 0.0–1.0 scale:
| Dimension | Question it answers | Tier |
|---|---|---|
| efficiency | Makespan ratio — optimal parallel vs actual elapsed | Metric |
| utilization | Agent load distribution (Gini coefficient) | Metric |
| parallelization | Did the workflow miss exploitable parallelism? | Structural |
| delegation_quality | Was context preserved across handoffs? | Semantic |
| communication_efficiency | Message-to-work ratio | Metric |
| robustness | Did the system recover from errors? | Structural |
| topology_alignment | Is the topology suited to the task? | Semantic |
The scorer auto-detects the topology (pipeline, fan-out, fan-in, parallel, hierarchical, mixed) and weights the dimensions accordingly. A fan-out with a high Gini coefficient (one worker did 90% of the work) scores low on utilization regardless of whether the final artifact passes.
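The Gini coefficient itself is a standard inequality measure, and the 90% case is easy to work through by hand. The sketch below computes it over per-agent work units; the mapping from Gini to a 0.0–1.0 `utilization_score` is an assumption made for illustration, not necessarily the normalization the scorer uses.

```python
def gini(loads):
    """Gini coefficient of non-negative loads; 0.0 means perfectly even."""
    n = len(loads)
    total = sum(loads)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(loads)
    # Standard formula on ascending values x_1..x_n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def utilization_score(loads):
    """Assumed mapping: 1.0 for an even split, 0.0 when one agent does it all."""
    n = len(loads)
    if n < 2:
        return 1.0
    max_gini = (n - 1) / n  # Gini when a single agent carries the whole load
    return 1.0 - gini(loads) / max_gini


lopsided = [90, 4, 3, 3]   # one worker did 90% of the work
even = [25, 25, 25, 25]
print(gini(lopsided), utilization_score(lopsided))
print(gini(even), utilization_score(even))
```

For the lopsided fan-out the Gini comes out at about 0.655 against a maximum of 0.75 for four workers, so the assumed score is roughly 0.13. The even split scores 1.0.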
Three of the seven dimensions — efficiency, utilization, communication efficiency — are pure metrics. Four require structural or semantic reasoning over the trace graph. None is expressible as a rubric criterion because none is about the artifact.
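As an illustration of a pure-metric dimension, the makespan ratio behind efficiency can be sketched as the critical-path length of the task graph divided by the wall-clock time the run actually took. The task names, durations, and the `critical_path` and `efficiency` helpers below are hypothetical, and the sketch assumes an acyclic dependency graph with unlimited parallel workers as the "optimal" baseline.

```python
from functools import lru_cache


def critical_path(durations, deps):
    """Longest dependency chain: the best makespan with unlimited workers."""
    @lru_cache(maxsize=None)
    def finish(task):
        # A task can start once all of its dependencies have finished.
        start = max((finish(d) for d in deps.get(task, ())), default=0.0)
        return start + durations[task]

    return max(finish(t) for t in durations)


def efficiency(durations, deps, actual_elapsed):
    """Makespan ratio in [0.0, 1.0]: optimal parallel time over actual time."""
    if actual_elapsed <= 0:
        return 0.0
    return min(1.0, critical_path(durations, deps) / actual_elapsed)


# Hypothetical fan-out/fan-in workflow, durations in seconds.
durations = {"plan": 2.0, "search_a": 5.0, "search_b": 4.0, "write": 3.0}
deps = {
    "search_a": ["plan"],
    "search_b": ["plan"],
    "write": ["search_a", "search_b"],
}

# Critical path is plan -> search_a -> write = 10.0 seconds.
# If the run took 14.0 seconds (everything serialized), the ratio is 10 / 14.
print(efficiency(durations, deps, actual_elapsed=14.0))
```

The same report comes out either way; only the ratio shows that the two searches ran back to back instead of side by side.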
What this means for an evaluation stack
If your stack is a Managed Agents session + outcomes, your evaluation is artifact-complete and process-blind. If your stack is a LangGraph DAG with a custom grader, you have the same gap. The fix is not to replace the rubric — the rubric is doing its job — but to add a process-level layer alongside it.
Pisama's POST /api/v1/evaluate/rubric endpoint routes rubric criteria to detectors where they match semantically (hallucination, completion, persona drift, derailment) and falls back to an LLM judge for criteria with no detector equivalent. The process-level detectors — convergence, orchestration quality, coordination, communication, corruption, withholding, persona drift, loop — run independently on the trace and produce findings the rubric has no slot for.
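The routing idea can be sketched as a dispatch table with a fallback. This is not the endpoint's implementation: the post describes semantic matching, and the sketch swaps in plain keyword matching to stay self-contained. The keyword lists and the `route` helper are invented for illustration.

```python
# Keyword matching is a stand-in for the semantic matching described above.
# The detector names come from the post; the keywords are made up.
DETECTOR_KEYWORDS = {
    "hallucination": ("hallucinat", "unsupported claim", "fabricat"),
    "completion": ("complete", "all sections", "finished"),
    "persona_drift": ("persona", "tone", "voice"),
    "derailment": ("off-topic", "derail", "stay on task"),
}


def route(criterion: str) -> str:
    """Pick a detector for a rubric criterion, or fall back to an LLM judge."""
    text = criterion.lower()
    for detector, keywords in DETECTOR_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return detector
    return "llm_judge"


rubric = [
    "The report must not contain fabricated citations",
    "Uses a friendly tone throughout",
    "Charts use a colorblind-safe palette",
]
for criterion in rubric:
    print(f"{route(criterion):>14}  <-  {criterion}")
```

The third criterion has no detector equivalent, so it falls through to the judge. The process-level detectors never enter this table at all, since no rubric criterion asks for them.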
The two layers coexist:
- Rubric layer answers: did this artifact meet the criteria?
- Process layer answers: did the execution that produced this artifact go off the rails in ways the artifact doesn't show?
A Managed Agents session that produces a beautiful final report after four iterations of worsening metrics still passes the rubric. The metrics say the process was broken. Only the process layer can surface that.
Try it
The convergence detector and orchestration quality scorer are in the open-source pisama-core package:
from pisama_core.detection import ConvergenceDetector
from app.detection.orchestration_quality import OrchestrationQualityScorer
The rubric engine is at POST /api/v1/evaluate/rubric — see the SDK quickstart.
Artifact evaluation is a solved problem. Process evaluation is not.