Framework-Agnostic Convergence Detection¶
April 2026
Autonomous research agents — Karpathy-style autoresearch loops, iterative code generation, hyperparameter sweeps — produce one thing that ordinary agent traces do not: a numerical performance signal over many iterations. val_loss over 100 training runs. accuracy across 40 proposed prompt edits. latency_ms across every tool call the agent makes.
That signal is cheap to collect and expensive to ignore. An agent that plateaus at iteration 40 of a 100-iteration sweep has already wasted 60% of the budget. An agent that diverges on a bad learning rate burns tokens producing nothing. A rubric evaluator reading the final artifact will not notice; the loss curve is not in the rubric.
Pisama's ConvergenceDetector consumes a metric time series and flags four failure modes: plateau, regression, thrashing, and divergence. Earlier builds of Pisama had an AutoresearchAdapter that converted one specific framework's spans into metric series. That adapter was removed. The metric extractor replaced it — it pulls series from the generic fields every ingested trace already carries, so the detector fires on every framework by default.
What it catches¶
The detector returns a ConvergenceResult with a failure_type string, a severity (minor / moderate / severe / critical), a confidence score, and an evidence dict. One example per failure type:
- plateau — val_loss stops improving for the last ten steps of a sweep. Evidence carries avg_improvement, stall_ratio, and the window size. Detector output: "Metric plateaued: avg improvement 0.000142 per step over last 10 steps (threshold: 0.02). 9/9 steps showed no meaningful progress."
- regression — the agent moves past its best checkpoint and keeps going. Evidence carries regression_frac and steps_since_best. Detector output: "Metric regressed by 12.4% from best value (0.4120) to current (0.4630), 8 steps after best."
- thrashing — the metric oscillates without a trend; the agent is undoing its own work. Evidence carries reversals and reversal_ratio. Detector output: "Metric thrashing: 7 direction reversals in 10 steps (reversal ratio: 88%). No consistent trend detected."
- divergence — the metric consistently moves the wrong way. Evidence carries wrong_ratio, total_change, and normalized_change. Detector output: "Metric diverging: 8/9 steps moving in wrong direction over last 10 steps. Total change: 1.8400 (42.7% of scale)."
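The evidence fields are simple enough to reproduce by hand. As a minimal illustration, here is a sketch of reversal counting for the thrashing check; this is illustrative logic only, not Pisama's actual implementation:

```python
def reversal_stats(values: list[float]) -> tuple[int, float]:
    """Count direction reversals in a metric series.

    A reversal is a pair of adjacent deltas with opposite signs.
    Illustrative sketch only, not Pisama's implementation.
    """
    deltas = [b - a for a, b in zip(values, values[1:]) if b != a]
    reversals = sum(
        1 for d1, d2 in zip(deltas, deltas[1:]) if (d1 > 0) != (d2 > 0)
    )
    ratio = reversals / max(len(deltas) - 1, 1)
    return reversals, ratio

# An oscillating series: every step undoes the previous one,
# so all 8 adjacent delta pairs reverse direction.
vals = [0.50, 0.62, 0.48, 0.61, 0.47, 0.60, 0.46, 0.59, 0.45, 0.58]
reversals, ratio = reversal_stats(vals)  # 8 reversals, ratio 1.0
```

A high ratio on a trendless series is exactly the "undoing its own work" signature described above.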
A concrete trace: plateaued hyperparameter sweep¶
An agent is tasked with finding a learning rate + batch size combination that minimizes val_loss over 100 training runs. After step 40 every subsequent run returns within 1% of the running best. The rubric evaluator will happily grade the final report ("tried 100 configurations"); the convergence detector catches that the last 60 runs produced nothing.
from pisama_core.detection.convergence import ConvergenceDetector
detector = ConvergenceDetector()
metrics = [
{"step": 1, "value": 2.41},
{"step": 5, "value": 1.82},
{"step": 10, "value": 1.31},
{"step": 20, "value": 0.88},
{"step": 30, "value": 0.54},
{"step": 40, "value": 0.42},
{"step": 45, "value": 0.421},
{"step": 50, "value": 0.419},
{"step": 55, "value": 0.420},
{"step": 60, "value": 0.421},
{"step": 65, "value": 0.418},
{"step": 70, "value": 0.420},
{"step": 75, "value": 0.419},
]
result = detector.detect_convergence_issues(metrics, direction="minimize")
{
"detected": true,
"confidence": 1.0,
"failure_type": "plateau",
"severity": "severe",
"best_value": 0.418,
"current_value": 0.419,
"improvement_rate": 0.0000341,
"steps_since_best": 2,
"evidence": {
"primary_failure": "plateau",
"issue_count": 1,
"direction": "minimize",
"num_steps": 13,
"window_size": 10
}
}
Interpretation: the agent found its best configuration around step 40 and has been re-sampling near that point ever since. Early-stop at step 40 would save the remaining 60% of the compute budget.
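The plateau evidence can be approximated with simple trailing-window arithmetic. A hedged sketch (illustrative only; the real detector's windowing, thresholds, and guards may differ), applied to the sweep values above:

```python
def plateau_stats(values, window=10, threshold=0.02, direction="minimize"):
    """Average per-step improvement and stall ratio over the last
    `window` points, normalized by the series' scale. Sketch of the
    plateau arithmetic, not the ConvergenceDetector's implementation.
    """
    recent = values[-window:]
    scale = (max(values) - min(values)) or 1.0
    sign = -1.0 if direction == "minimize" else 1.0
    steps = [sign * (b - a) / scale for a, b in zip(recent, recent[1:])]
    avg_improvement = sum(steps) / len(steps)
    stall_ratio = sum(1 for s in steps if s < threshold) / len(steps)
    return avg_improvement, stall_ratio

# val_loss from the plateaued sweep above (steps 1..75).
vals = [2.41, 1.82, 1.31, 0.88, 0.54, 0.42, 0.421,
        0.419, 0.420, 0.421, 0.418, 0.420, 0.419]
avg, stall = plateau_stats(vals)  # 7 of the last 9 steps are stalled
```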
Why it's framework-agnostic¶
Every trace that flows through Pisama's ingestion produces a list of State rows, each carrying token_count, latency_ms, tool_calls, and sequence_num. That is enough raw signal to build generic metric series without a framework-specific adapter.
app/detection/metric_extractor.py currently registers three extractors:
- tokens_per_state — verbosity creep / context explosion (minimize)
- latency_per_state — performance regression / stuck agent (minimize)
- tool_call_count_per_state — thrashing between tools (minimize)
from app.detection.metric_extractor import extract_metric_series
series = extract_metric_series(trace.states)
# {
# "latency_per_state": {"metrics": [...], "direction": "minimize"},
# "tokens_per_state": {"metrics": [...], "direction": "minimize"},
# ...
# }
Adding a custom metric means registering another function in _EXTRACTORS. No subclassing, no adapter, no framework import. The older AutoresearchAdapter was deleted because it solved a narrower problem than the fields already on every State row.
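The registry pattern is a plain dict from metric name to a function over State rows. A self-contained sketch (the State stand-in is hypothetical; field names and the _EXTRACTORS name follow the description above):

```python
from dataclasses import dataclass

@dataclass
class State:  # stand-in for Pisama's State row
    sequence_num: int
    token_count: int
    tool_calls: int

def tokens_per_state(states):
    return {
        "metrics": [{"step": s.sequence_num, "value": float(s.token_count)}
                    for s in states],
        "direction": "minimize",
    }

_EXTRACTORS = {"tokens_per_state": tokens_per_state}

# Adding a custom metric is one more entry: no subclassing, no adapter.
def tool_calls_per_state(states):
    return {
        "metrics": [{"step": s.sequence_num, "value": float(s.tool_calls)}
                    for s in states],
        "direction": "minimize",
    }

_EXTRACTORS["tool_calls_per_state"] = tool_calls_per_state

def extract_metric_series(states):
    return {name: fn(states) for name, fn in _EXTRACTORS.items()}
```

Each registered function returns the same {"metrics": [...], "direction": ...} payload the detector consumes, so new series flow through the orchestrator loop unchanged.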
How to wire it¶
Standalone on your own metric series:
from pisama_core.detection.convergence import ConvergenceDetector
detector = ConvergenceDetector(
plateau_threshold=0.02, # min per-step improvement (fraction of scale)
plateau_window=10, # steps to evaluate for plateau
regression_tolerance=0.02, # acceptable regression from best
thrashing_min_reversals=3,
min_steps=3,
)
result = detector.detect_convergence_issues(
metrics=[{"step": i, "value": v} for i, v in enumerate(val_loss_series)],
direction="minimize",
)
if result.detected:
print(result.failure_type, result.severity, result.current_value)
for issue in result.issues:
print("-", issue.description)
Inside the orchestrator, via the generic extractor:
from app.detection.metric_extractor import extract_metric_series
from app.detection.convergence import convergence_detector
for name, payload in extract_metric_series(trace.states).items():
result = convergence_detector.detect_convergence_issues(
metrics=payload["metrics"],
direction=payload["direction"],
)
if result.detected:
record_finding(series=name, result=result)
The detector has no I/O, no model call, and no framework dependency. It is a function from a list of floats to a ConvergenceResult.
Limitations¶
The detector is narrow on purpose.
- Discrete failures are invisible. An agent that crashes halfway through a sweep produces no metric points after the crash — the detector sees a clean converging curve and returns detected=False. Pair it with the completion and workflow detectors for terminal-state checks.
- Non-monotonic signals break assumptions. The detector expects minimize or maximize — a metric that is supposed to oscillate (e.g., exploration rate) will trip the thrashing check.
- Sub-sampled traces lose signal. Series shorter than min_steps (default 3) return detected=False with evidence={"reason": "insufficient_data"}. If your trace captures every tenth step, plateau detection will false-negative.
- Overall-trend guard suppresses minor issues. If the series improves by more than 30% end-to-end and is smooth, minor regression and thrashing findings are dropped; if it improves by more than 50%, minor and moderate plateau findings are dropped. This is intentional — noisy SGD is not a failure — but it means the detector will not flag mid-run wobble in an otherwise-converging series.
- No root cause. The detector says the metric plateaued; it does not say the learning rate is too low. Pair with a fix generator or human review for remediation.
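To make the trend guard's arithmetic concrete, here is a hedged sketch of an end-to-end improvement check. The percentages follow the thresholds stated above; the smoothness test the real guard also applies is omitted:

```python
def trend_guard(values, direction="minimize"):
    """End-to-end improvement as a fraction of the series' scale.
    Illustrative sketch of the overall-trend guard's arithmetic;
    the real guard also requires the series to be smooth.
    """
    scale = (max(values) - min(values)) or 1.0
    sign = -1.0 if direction == "minimize" else 1.0
    improvement = sign * (values[-1] - values[0]) / scale
    return {
        "suppress_minor_findings": improvement > 0.30,
        "suppress_moderate_plateau": improvement > 0.50,
    }

# A strongly converging series suppresses both tiers;
# a flat, wobbling series suppresses neither.
converging = trend_guard([2.0, 1.4, 0.9, 0.5, 0.2])
wobbling = trend_guard([1.0, 1.02, 0.98, 1.01, 0.99])
```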
The ConvergenceDetector ships in every Pisama build. No feature flag, no adapter install. Point it at a list of numbers and read the result.