Specifications, Not Rubrics: The Procedural Gap Outcome Graders Miss¶
April 2026
In March 2026, researchers from Microsoft Research and the University of Washington published a finding that should reshape how the industry thinks about agent evaluation. They ran a state-of-the-art Claude model across 424 traces on the τ2-bench customer-service benchmark. Of the traces that earned a perfect outcome reward, 83% contained procedural violations that the reward function could not see. The agents reached the right answers. They reached them by breaking the rules they were told to follow.
The paper is Willful Disobedience: Automatically Detecting Failures in Agentic Traces (Sharma, Barke, Zorn). Its result is the cleanest empirical case for a layer of evaluation that almost no production stack runs today: process-level checking against the specifications the agent itself was given.
That layer is what specification_compliance does in Pisama.
The shape of the gap¶
A rubric grader, the dominant evaluation primitive in 2026, reads the final artifact and an isolated context window. It does not read the trace. It cannot see whether the agent called the calculator before quoting numbers, kept its tool outputs within scope, or honored the persona it was assigned. Those are specifications. They live in the system prompt. The agent's compliance with them lives in the execution log.
When the rubric grader and the agent both read from the same final artifact, two things can be true at once:
- The artifact looks correct.
- The execution that produced it ignored half the system prompt.
The 83% number is the size of that gap on one specific benchmark. It is unlikely to be smaller in production, where prompts are longer, rules are more numerous, and the agent has more opportunity to find shortcuts that the rubric blesses.
How specification_compliance works¶
The specification_compliance pipeline at backend/app/detection/specification_compliance.py operates in two stages.
Stage 1: rule extraction. A single LLM call against the system prompt returns a structured list of behavioral rules. Each rule has a trigger (when the rule applies), a required action, a forbidden action, and a severity. The extraction result is cached by SHA256 hash of the normalized prompt, so repeated traces from the same agent share the extraction cost. In practice, the cache hit rate is high; rule extraction is a per-prompt one-shot cost, not a per-trace cost.
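In code, the caching pattern might look like the sketch below. The extraction call is stubbed out, and the cache, function names, and rule fields are illustrative rather than the Pisama internals.

import hashlib

_rule_cache: dict[str, list[dict]] = {}

def _extract_rules_llm(prompt: str) -> list[dict]:
    # Stand-in for the single structured-extraction LLM call.
    return [{"trigger": "quoting a monetary figure",
             "required_action": "call the calculator tool first",
             "forbidden_action": None,
             "severity": "high"}]

def extract_rules(prompt: str) -> list[dict]:
    # Key the cache on the SHA256 of the normalized prompt so that
    # formatting-only differences still share a single extraction.
    normalized = " ".join(prompt.split())
    key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if key not in _rule_cache:
        _rule_cache[key] = _extract_rules_llm(prompt)
    return _rule_cache[key]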
Stage 2: compliance checking. For each extracted rule, the checker walks the trace events. Concrete rules ("must call tool X before emitting a numerical answer") are checked deterministically, by substring match or structured event inspection, with no LLM call. Semantic rules ("output must maintain a professional register") escalate to the existing MASTLLMJudge infrastructure with only the relevant trace span as input. Cost stays bounded because the deterministic fast path filters the easy cases first.
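The split between the two paths might look roughly like the sketch below; the rule fields and the judge_rule placeholder are assumptions standing in for the real rule schema and the MASTLLMJudge escalation.

def judge_rule(rule: dict, events: list[dict]) -> dict | None:
    # Placeholder for the MASTLLMJudge escalation on ambiguous, semantic rules.
    return None

def check_rule(rule: dict, events: list[dict]) -> dict | None:
    # Return a violation record, or None if the rule is satisfied.
    if rule["kind"] == "tool_before_money":
        # Deterministic path: the required tool must be called before the first
        # agent message that quotes a dollar amount. No LLM involved.
        for i, event in enumerate(events):
            if event["type"] == "tool_call" and event["name"] == rule["tool"]:
                return None
            if event["type"] == "agent_message" and "$" in event["content"]:
                return {"rule_id": rule["id"], "severity": rule["severity"],
                        "evidence": (i, event["content"])}
        return None
    # Semantic path: hand only the relevant trace span to the LLM judge.
    return judge_rule(rule, events)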
The output is a SpecificationComplianceResult with the extracted rules, the violations, the evidence span for each violation, and a calibrated confidence.
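The field names below follow the prose above but are a sketch of the shapes, not the exact Pisama types; the end-to-end example that follows shows the analyzer itself in use.

from dataclasses import dataclass, field

@dataclass
class ExtractedRule:
    rule_id: str
    trigger: str                    # when the rule applies
    required_action: str | None
    forbidden_action: str | None
    severity: str                   # "low" | "medium" | "high"

@dataclass
class RuleViolation:
    rule_id: str
    severity: str
    evidence: str                   # the trace span that triggered the finding

@dataclass
class SpecificationComplianceResult:
    detected: bool
    rules: list[ExtractedRule] = field(default_factory=list)
    violations: list[RuleViolation] = field(default_factory=list)
    confidence: float = 0.0         # calibrated confidence in the verdict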
from pisama.detection import SpecificationComplianceAnalyzer
system_prompt = """
You are a customer-service agent for a financial institution.
Always call the calculator tool before quoting any monetary figure.
Never reveal the exact internal scoring algorithm.
Always cite the policy section when explaining a decision.
"""
trace_events = [
    {"type": "tool_call", "name": "knowledge_base", "args": {"q": "fees"}},
    {"type": "agent_message", "content": "Your total is $4,827.50."},
    {"type": "agent_message", "content": "Per our standard model, you qualify."},
]
result = SpecificationComplianceAnalyzer().analyze(
    system_prompt=system_prompt,
    trace_events=trace_events,
)
# result.detected == True
# result.violations contains:
# - missing_calculator_before_quote (high severity)
# - missing_policy_citation (medium severity)
In this example, all three rules apply. The agent produced a confident, well-formed answer that an outcome grader would accept. The compliance checker surfaces what the rubric would not: the calculator was never invoked, the policy citation never appeared.
Calibration¶
The detector ships with 30 hand-authored seeds, 15 positive and 15 negative, stratified into easy, medium, and hard difficulty buckets. Cross-validated F1 lands at 0.966, precision 1.00, recall 0.933, at threshold 0.10. Total LLM cost across the calibration run was $0.07. The threshold sits low because the deterministic fast path catches concrete rule violations before any LLM call; escalation to the judge happens only on semantic checks where the trace evidence is ambiguous.
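Those figures are consistent with the usual harmonic-mean definition, assuming the reported precision and recall correspond to 15/15 and 14/15 on the positive seeds:

precision, recall = 15 / 15, 14 / 15     # assumed counts behind 1.00 and 0.933
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))                      # 0.966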
These seeds are starter material, not a production benchmark. Real prompts contain more rules, interacting in more ways, than hand-authored seeds capture. The honest expectation for v0 is that F1 will compress when the detector is run against larger, harder corpora. Pisama-bench v0, prepared for release alongside this post, includes 96 curated entries across 16 ICP detectors as the start of a public test set; v1.0 will expand to roughly 500.
The two-layer evaluation stack¶
The conclusion is the same as in the convergence post. A complete agent evaluation stack has two layers, not one.
The artifact layer answers: did the output meet the rubric criteria? Outcome graders, including Anthropic Managed Agents' outcomes, do this well. Most production stacks have this layer in place.
The procedural layer answers: did the execution comply with the specifications the agent was given? Specification compliance, convergence, coordination, persona drift, and the other Pisama process-level detectors live here. Most production stacks do not have this layer at all.
The 83% number from the Willful Disobedience paper is what the gap looks like when only the artifact layer is wired in: more than four in five "perfect reward" traces on τ2-bench contained something the procedural layer could surface. The economics are also straightforward: rule extraction is cached, deterministic checks are free, and the LLM judge runs only on the residual ambiguous cases. A trace can be checked end to end for a few cents.
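As an illustration of how the two layers compose, a gate might accept a run only when both checks clear. The threshold and labels below are assumptions for the sketch, not a Pisama API.

def evaluate_run(rubric_score: float, high_severity_violations: int) -> str:
    # Illustrative two-layer gate: the artifact and procedural layers both vote.
    artifact_pass = rubric_score >= 0.8               # artifact layer: rubric grader
    procedural_pass = high_severity_violations == 0   # procedural layer: spec compliance
    if artifact_pass and procedural_pass:
        return "pass"
    if artifact_pass:
        return "artifact-only pass"                   # the gap the 83% number measures
    return "fail"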
Try it¶
specification_compliance is shipping behind a feature flag in the Pisama SDK while the API shape stabilizes.
from pisama_agent_sdk import check_compliance
result = await check_compliance(
    system_prompt=agent_system_prompt,
    trace_events=trace_from_run,
)
if result.detected:
    for violation in result.violations:
        print(violation.rule_id, violation.evidence)
The flag will become default-on once the API shape is locked.
Closing¶
Most agent evaluation stacks in 2026 grade the document and ignore the rules. Until the procedural layer becomes standard, "the agent passed" should be read as "the agent passed the artifact check."
The agent that earns a perfect score by skipping its specifications has not done its job. It has just hidden that fact from the grader.
References: Sharma, R. K., Barke, S., & Zorn, B. (2026). Willful Disobedience: Automatically Detecting Failures in Agentic Traces. arXiv:2603.23806.