
Framework-Agnostic Convergence Detection

April 2026


Autonomous research agents — Karpathy-style autoresearch loops, iterative code generation, hyperparameter sweeps — produce one thing that ordinary agent traces do not: a numerical performance signal over many iterations. val_loss over 100 training runs. accuracy across 40 proposed prompt edits. latency_ms across every tool call the agent makes.

That signal is cheap to collect and expensive to ignore. An agent that plateaus at iteration 40 of a 100-iteration sweep has already wasted 60% of the budget. An agent that diverges on a bad learning rate burns tokens producing nothing. A rubric evaluator reading the final artifact will not notice; the loss curve is not in the rubric.

Pisama's ConvergenceDetector consumes a metric time series and flags four failure modes: plateau, regression, thrashing, and divergence. Earlier builds of Pisama had an AutoresearchAdapter that converted one specific framework's spans into metric series. That adapter was removed. The metric extractor replaced it — it pulls series from the generic fields every ingested trace already carries, so the detector fires on every framework by default.

What it catches

The detector returns a ConvergenceResult with a failure_type string, a severity (minor / moderate / severe / critical), a confidence score, and an evidence dict. One example per failure type:

  • plateau — val_loss stops improving for the last ten steps of a sweep. Evidence carries avg_improvement, stall_ratio, and the window size. Detector output: "Metric plateaued: avg improvement 0.000142 per step over last 10 steps (threshold: 0.02). 9/9 steps showed no meaningful progress."
  • regression — the agent moves past its best checkpoint and keeps going. Evidence carries regression_frac and steps_since_best. Detector output: "Metric regressed by 12.4% from best value (0.4120) to current (0.4630), 8 steps after best."
  • thrashing — the metric oscillates without a trend; the agent is undoing its own work. Evidence carries reversals and reversal_ratio. Detector output: "Metric thrashing: 7 direction reversals in 10 steps (reversal ratio: 88%). No consistent trend detected."
  • divergence — the metric consistently moves the wrong way. Evidence carries wrong_ratio, total_change, and normalized_change. Detector output: "Metric diverging: 8/9 steps moving in wrong direction over last 10 steps. Total change: 1.8400 (42.7% of scale)."
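
As a rough illustration of the arithmetic behind the thrashing check (an approximation of the evidence fields, not Pisama's actual implementation), a reversal is a sign flip between consecutive deltas, and the ratio divides reversals by the number of delta pairs:

```python
def reversal_stats(values):
    """Count direction reversals in a metric series.

    A reversal is a sign flip between consecutive deltas; the
    ratio is reversals divided by the number of delta pairs.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    reversals = sum(
        1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0
    )
    ratio = reversals / max(len(deltas) - 1, 1)
    return reversals, ratio

# An oscillating series: the agent keeps undoing its own work.
vals = [0.50, 0.42, 0.48, 0.41, 0.47, 0.40, 0.46]
revs, ratio = reversal_stats(vals)  # 5 reversals, ratio 1.0
```

A monotone series scores zero reversals, so noisy-but-trending runs fall well below any sensible reversal threshold.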

A concrete trace: plateaued hyperparameter sweep

An agent is tasked with finding a learning rate + batch size combination that minimizes val_loss over 100 training runs. After step 40 every subsequent run returns within 1% of the running best. The rubric evaluator will happily grade the final report ("tried 100 configurations"); the convergence detector catches that the last 60 runs produced nothing.

from pisama_core.detection.convergence import ConvergenceDetector

detector = ConvergenceDetector()

metrics = [
    {"step": 1,  "value": 2.41},
    {"step": 5,  "value": 1.82},
    {"step": 10, "value": 1.31},
    {"step": 20, "value": 0.88},
    {"step": 30, "value": 0.54},
    {"step": 40, "value": 0.42},
    {"step": 45, "value": 0.421},
    {"step": 50, "value": 0.419},
    {"step": 55, "value": 0.420},
    {"step": 60, "value": 0.421},
    {"step": 65, "value": 0.418},
    {"step": 70, "value": 0.420},
    {"step": 75, "value": 0.419},
]

result = detector.detect_convergence_issues(metrics, direction="minimize")

The result, serialized:

{
  "detected": true,
  "confidence": 1.0,
  "failure_type": "plateau",
  "severity": "severe",
  "best_value": 0.418,
  "current_value": 0.419,
  "improvement_rate": 0.0000341,
  "steps_since_best": 2,
  "evidence": {
    "primary_failure": "plateau",
    "issue_count": 1,
    "direction": "minimize",
    "num_steps": 13,
    "window_size": 10
  }
}

Interpretation: the agent found its best configuration around step 40 and has been re-sampling near that point ever since. Early-stop at step 40 would save the remaining 60% of the compute budget.
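
A library-free sketch of what early-stopping on that signal could look like. The stall rule here is an approximation of the detector's plateau logic, not its actual code; the window is shrunk to 6 to fit this short 13-point series:

```python
def plateaued(values, window=6, threshold=0.02, stall_frac=0.8):
    """True when most steps in the trailing window improve by less
    than `threshold` as a fraction of the series' overall scale."""
    if len(values) < window:
        return False
    tail = values[-window:]
    scale = max(values) - min(values) or 1.0
    # A step "stalls" when its normalized improvement is sub-threshold.
    stalls = sum(
        1 for prev, cur in zip(tail, tail[1:])
        if (prev - cur) / scale < threshold
    )
    return stalls / (window - 1) >= stall_frac

losses = [2.41, 1.82, 1.31, 0.88, 0.54, 0.42,
          0.421, 0.419, 0.420, 0.421, 0.418, 0.420, 0.419]

# Walk the sweep; stop at the first index where the tail goes flat.
stop_at = next(
    (i for i in range(len(losses)) if plateaued(losses[: i + 1])),
    None,
)  # fires at index 9, once the trailing window is mostly stalls
```

Everything after that index is the re-sampling the detector flags as waste.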

Why it's framework-agnostic

Every trace that flows through Pisama's ingestion produces a list of State rows, each carrying token_count, latency_ms, tool_calls, and sequence_num. That is enough raw signal to build generic metric series without a framework-specific adapter.

app/detection/metric_extractor.py currently registers three extractors:

  • tokens_per_state — verbosity creep / context explosion (minimize)
  • latency_per_state — performance regression / stuck agent (minimize)
  • tool_call_count_per_state — thrashing between tools (minimize)

from app.detection.metric_extractor import extract_metric_series

series = extract_metric_series(trace.states)
# {
#   "latency_per_state": {"metrics": [...], "direction": "minimize"},
#   "tokens_per_state":  {"metrics": [...], "direction": "minimize"},
#   ...
# }

Adding a custom metric means registering another function in _EXTRACTORS. No subclassing, no adapter, no framework import. The older AutoresearchAdapter was deleted because it solved a narrower problem than the generic fields already present on every State row can cover.
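
The registry pattern is just a name-to-function mapping. A standalone sketch (the real _EXTRACTORS lives in app/detection/metric_extractor.py and operates on State rows; plain dicts stand in for them here, and EXTRACTORS is a hypothetical local name):

```python
# Stand-ins for State rows: dicts carrying the generic fields
# every ingested trace already has.
states = [
    {"sequence_num": 1, "token_count": 120, "latency_ms": 310, "tool_calls": 1},
    {"sequence_num": 2, "token_count": 480, "latency_ms": 290, "tool_calls": 3},
    {"sequence_num": 3, "token_count": 950, "latency_ms": 305, "tool_calls": 2},
]

# Registry: metric name -> (per-state value function, direction).
EXTRACTORS = {
    "tokens_per_state": (lambda s: s["token_count"], "minimize"),
    "latency_per_state": (lambda s: s["latency_ms"], "minimize"),
}

def extract_metric_series(states):
    """Turn State rows into one metric series per registered extractor."""
    return {
        name: {
            "metrics": [
                {"step": s["sequence_num"], "value": fn(s)} for s in states
            ],
            "direction": direction,
        }
        for name, (fn, direction) in EXTRACTORS.items()
    }

# Adding a custom metric is one registry entry, no subclassing:
EXTRACTORS["tool_calls_per_state"] = (lambda s: s["tool_calls"], "minimize")
```

Any registered series then feeds the detector unchanged, since the detector only sees step/value pairs and a direction.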

How to wire it

Standalone on your own metric series:

from pisama_core.detection.convergence import ConvergenceDetector

detector = ConvergenceDetector(
    plateau_threshold=0.02,   # min per-step improvement (fraction of scale)
    plateau_window=10,         # steps to evaluate for plateau
    regression_tolerance=0.02, # acceptable regression from best
    thrashing_min_reversals=3,
    min_steps=3,
)

result = detector.detect_convergence_issues(
    metrics=[{"step": i, "value": v} for i, v in enumerate(val_loss_series)],
    direction="minimize",
)

if result.detected:
    print(result.failure_type, result.severity, result.current_value)
    for issue in result.issues:
        print("-", issue.description)

Inside the orchestrator, via the generic extractor:

from app.detection.metric_extractor import extract_metric_series
from app.detection.convergence import convergence_detector

for name, payload in extract_metric_series(trace.states).items():
    result = convergence_detector.detect_convergence_issues(
        metrics=payload["metrics"],
        direction=payload["direction"],
    )
    if result.detected:
        record_finding(series=name, result=result)

The detector has no I/O, no model call, and no framework dependency. It is a function from a list of floats to a ConvergenceResult.
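
That pure-function shape is easy to see in the divergence arithmetic. A hedged approximation of the wrong_ratio and normalized_change evidence fields (not the library's code):

```python
def divergence_stats(values, direction="minimize", window=10):
    """Fraction of recent steps moving the wrong way, plus the
    window's total change normalized by the series' scale."""
    tail = values[-window:]
    deltas = [b - a for a, b in zip(tail, tail[1:])]
    # For a minimized metric, any positive delta is the wrong way.
    wrong = sum(1 for d in deltas if (d > 0) == (direction == "minimize"))
    scale = max(values) - min(values) or 1.0
    total_change = tail[-1] - tail[0]
    return wrong / len(deltas), total_change / scale

# A loss that climbs almost every step is diverging.
ratio, norm = divergence_stats(
    [0.5, 0.7, 0.9, 1.4, 1.3, 1.9, 2.6, 3.1, 3.8, 4.5]
)  # 8 of 9 steps wrong; total change equals the full scale
```

No state, no I/O: the same inputs always produce the same verdict, which makes the check trivial to unit-test.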

Limitations

The detector is narrow on purpose.

  • Discrete failures are invisible. An agent that crashes halfway through a sweep produces no metric points after the crash — the detector sees a clean converging curve and returns detected=False. Pair it with the completion and workflow detectors for terminal-state checks.
  • Non-monotonic signals break assumptions. The detector expects minimize or maximize — a metric that is supposed to oscillate (e.g., exploration rate) will trip the thrashing check.
  • Sub-sampled traces lose signal. Series shorter than min_steps (default 3) return detected=False with evidence={"reason": "insufficient_data"}. If your trace captures every tenth step, plateau detection will false-negative.
  • Overall-trend guard suppresses minor issues. If the series improves by more than 30% end-to-end and is smooth, minor regression and thrashing findings are dropped; if it improves by more than 50%, minor and moderate plateau findings are dropped. This is intentional — noisy SGD is not a failure — but it means the detector will not flag mid-run wobble in an otherwise-converging series.
  • No root cause. The detector says the metric plateaued; it does not say the learning rate is too low. Pair with a fix generator or human review for remediation.
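
The short-series behavior described above can be sketched as a guard clause (short_series_guard is a hypothetical helper mirroring the documented return shape, not a Pisama API):

```python
def short_series_guard(metrics, min_steps=3):
    """Mirror the documented short-series behavior: fewer than
    min_steps points means no verdict, with an explicit reason."""
    if len(metrics) < min_steps:
        return {"detected": False,
                "evidence": {"reason": "insufficient_data"}}
    return None  # enough data; run the full detector

# Two points is not a trend.
verdict = short_series_guard([{"step": 1, "value": 0.9},
                              {"step": 2, "value": 0.8}])
```

If your trace is heavily sub-sampled, raising capture frequency matters more than tuning thresholds.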

The ConvergenceDetector ships in every Pisama build. No feature flag, no adapter install. Point it at a list of numbers and read the result.