trajectory_score¶
DiagnoseResult.trajectory_score is a scalar in [0.0, 1.0] that summarizes how cleanly an agent run executed. It measures process quality, not task outcome. This guide explains what that means, when the score is informative, and how to combine it with task-success signals if you need both.
What it measures¶
The trajectory_score is computed in backend/app/detection_enterprise/orchestrator.py:_compute_trajectory_score as a multiplicative composite:
trajectory_score = orchestration_quality_overall × (1 − max_severity_penalty)
- orchestration_quality_overall is the 7-dimensional multi-agent orchestration quality score: efficiency, utilization, parallelization, delegation, communication, robustness, topology. Defaults to 1.0 for single-agent / chat traces where the dimensions don't apply.
- max_severity_penalty is the largest severity-weighted contribution from any fired detector: max(SEVERITY_WEIGHT[severity] × confidence).
A clean run scores 1.0. A run with one CRITICAL detector firing at confidence 1.0 scores ≤ 0.2.
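The two figures above can be reproduced with a small sketch. It assumes the composite is orchestration_quality_overall × (1 − max_severity_penalty), which is what the single-agent reduction described later in this guide implies. The weight values are assumptions (the guide only implies a CRITICAL weight of at least 0.8), and `Detection` and `trajectory_score` are hypothetical names, not the orchestrator's code.

```python
from dataclasses import dataclass

# Assumed weights for illustration only; the real values live in the
# severity-weight constant in orchestrator.py and may differ.
SEVERITY_WEIGHT = {"LOW": 0.1, "MEDIUM": 0.3, "HIGH": 0.5, "CRITICAL": 0.8}


@dataclass
class Detection:
    """Hypothetical stand-in for one fired detector."""
    severity: str
    confidence: float


def trajectory_score(detections, orchestration_quality_overall=1.0):
    """Multiply orchestration quality by (1 - largest weighted penalty)."""
    max_severity_penalty = max(
        (SEVERITY_WEIGHT[d.severity] * d.confidence for d in detections),
        default=0.0,  # no detector fired, so no penalty
    )
    return orchestration_quality_overall * (1.0 - max_severity_penalty)


clean = trajectory_score([])
critical = trajectory_score([Detection("CRITICAL", 1.0), Detection("LOW", 0.9)])
```

Because only the largest penalty counts, the additional LOW detection in the second call does not lower the score further.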
The score_breakdown field of DiagnoseResult itemises:
- which orchestration_quality dimensions contributed
- the top-5 detector contributors with their per-detector penalty
- the active severity weights (which can be customised per source_format)
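A breakdown can be turned into a short drill-down summary along these lines. Only the `top_contributors` key appears in this guide. The per-entry keys `detector` and `penalty`, the function name, and the sample payload are hypothetical, so adjust them to the shape your backend version returns.

```python
def summarize_contributors(score_breakdown, limit=5):
    """Return one display line per top contributor, highest penalty first."""
    contributors = score_breakdown.get("top_contributors", [])
    ranked = sorted(contributors, key=lambda c: c["penalty"], reverse=True)
    return [f"{c['detector']}: -{c['penalty']:.2f}" for c in ranked[:limit]]


# Hypothetical payload for illustration.
sample_breakdown = {
    "top_contributors": [
        {"detector": "loop", "penalty": 0.25},
        {"detector": "workflow", "penalty": 0.40},
    ]
}
summary_lines = summarize_contributors(sample_breakdown)
```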
What it does NOT measure¶
trajectory_score is not a "did the agent solve the user's task?" signal. Pisama Bench v1 (May 2026, n=270 across 5 corpora) made the distinction empirical:
| Corpus | Trace type | Pearson r vs task-outcome label |
|---|---|---|
| m500 | multi-agent math reasoning (4 agents) | +0.45 |
| Sotopia | role-play dialogue (2 agents) | −0.44 |
| AgentRewardBench | single-agent web tasks | −0.03 |
The same scalar correlates positively with task success when coordination quality drives the outcome (m500), inversely when the task itself is verbose conversation (Sotopia rewards the back-and-forth that our loop and workflow detectors flag), and essentially not at all when the work is orthogonal to coordination (single-agent web tasks, r ≈ 0).
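To see which of these regimes your own workload falls into, you can compute the same statistic on a labelled sample. This is a minimal sketch, assuming you already have paired lists of scores and binary outcome labels. The function name and the sample numbers are illustrative and are not Bench v1 data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample: trajectory scores and 1.0/0.0 task-success labels.
scores = [0.9, 0.8, 0.4, 0.3]
outcomes = [1.0, 1.0, 0.0, 0.0]
r = pearson_r(scores, outcomes)
```

A value near zero on your workload suggests the score is orthogonal to outcome there, as with the single-agent web tasks in the table.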
If you need a task-success signal, do not use trajectory_score in isolation. Today, combine it with an external evaluator (test pass, reward, completion check). The Tier 2 roadmap adds a separate task_completion_score driven by outcome-oriented detectors (task_failure, silent_failure, objective_unmet).
See: docs/research/bench-v1-multi-corpus-production-scoring-2026-05.md and the public summary at pisama.ai/research/bench-v1-multi-corpus.
When trajectory_score is informative¶
Useful when:
- Multi-agent coordination matters for outcome: collaborative reasoning, multi-step planning, hierarchical agent teams. Positive correlation expected.
- You're observing process quality drift: when comparing two runs of the same workload, the score's relative movement is informative regardless of corpus type.
- You need an aggregated signal for dashboards: instead of "8 detectors fired" you can show "process quality 0.42" with the breakdown drill-down on click.
Less useful when:
- Single-agent tool-using agents: the multi-agent OQ dimensions fall back to 1.0, so the score reduces to (1 − max_severity_penalty). On many real workloads this stays near 1.0 even when the agent fails at the task.
- Adversarial / roleplay corpora: process noise can correlate inversely with the outcome the corpus rewards.
- Cross-corpus comparisons: a 0.6 on m500 and a 0.6 on Sotopia mean very different things. Use source_format-aware profiles or the corpus-relative baseline_adjusted_score field once Tier 3 ships.
Suggested usage patterns¶
As a quality gate in CI:
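```python
# Run inside an async function; ProcessQualityRegression is an exception you define.
result = await analyze_atif(trajectory)
if result.trajectory_score < 0.7:
    # The contributor list may be empty if no detector fired.
    contributors = result.score_breakdown.get("top_contributors") or ["n/a"]
    raise ProcessQualityRegression(
        f"trace {result.trace_id} scored {result.trajectory_score:.2f}; "
        f"top contributor: {contributors[0]}"
    )
```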
As a comparator across runs of the same workload:
```python
# Both lists must be in the same order; zip() stops at the shorter one.
deltas = [a.trajectory_score - b.trajectory_score
          for a, b in zip(runs_after_change, runs_before_change)]
print(f"mean process-quality delta: {sum(deltas)/len(deltas):+.3f}")
```
Paired with an outcome signal you compute yourself:
```python
# Run inside an async function; my_evaluator and log_metrics are your own code.
result = await analyze_atif(trajectory)
task_succeeded = my_evaluator.check(trajectory)
log_metrics({
    "process_quality": result.trajectory_score,
    "task_completion": 1.0 if task_succeeded else 0.0,
})
```
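Once both signals are logged, one simple way to read them together is to bucket runs into four quadrants. This is a sketch only: the 0.7 threshold is borrowed from the CI gate example, and the quadrant labels, function name, and sample runs are illustrative.

```python
from collections import Counter


def quadrant(process_quality, task_succeeded, threshold=0.7):
    """Label a run by process cleanliness and task outcome."""
    process = "clean" if process_quality >= threshold else "noisy"
    outcome = "success" if task_succeeded else "failure"
    return f"{process}/{outcome}"


# Made-up runs: (trajectory_score, task_succeeded).
runs = [(0.95, True), (0.92, False), (0.40, True), (0.35, False)]
counts = Counter(quadrant(score, ok) for score, ok in runs)
```

The clean/failure bucket is the case the single-agent discussion warns about: the process looked fine but the task was not solved.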
Backwards compatibility¶
trajectory_score and score_breakdown are additive fields. Older backend versions (pre-Phase 2) won't include them; SDKs default both to their "clean" values (1.0 and {} respectively) on parse so downstream code doesn't break.
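The defaulting behaviour can be pictured as follows. This is a sketch of the idea rather than the SDK's parser: `parse_diagnose_payload` and the plain-dict payloads are hypothetical.

```python
def parse_diagnose_payload(payload):
    """Read the two additive fields, falling back to their clean defaults."""
    score = payload.get("trajectory_score")
    if score is None:
        score = 1.0  # field absent on older backends: default to "clean"
    breakdown = payload.get("score_breakdown") or {}
    return {"trajectory_score": float(score), "score_breakdown": dict(breakdown)}


# An older backend omits both fields; a newer one includes them.
old_backend = parse_diagnose_payload({"trace_id": "t-1"})
new_backend = parse_diagnose_payload(
    {
        "trace_id": "t-2",
        "trajectory_score": 0.42,
        "score_breakdown": {"top_contributors": []},
    }
)
```

Note that a defaulted 1.0 looks the same as a genuinely clean run.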
Reference¶
- Formula source: backend/app/detection_enterprise/orchestrator.py:_compute_trajectory_score
- Severity weights: _SEVERITY_WEIGHT constant in the same file
- OQ scorer: backend/app/detection/orchestration_quality.py:OrchestrationQualityScorer
- Bench v1 research note: docs/research/bench-v1-multi-corpus-production-scoring-2026-05.md
- Public summary: pisama.ai/research/bench-v1-multi-corpus