
Calibration FAQ

Three questions we get constantly about detector thresholds and accuracy.

Why are OSS defaults uncalibrated?

The detector code in pisama-detectors ships with conservative default thresholds. That is deliberate: defaults are tuned to minimize false positives on a broad, generic trace mix, not to maximize F1 on a specific application.

Tuned thresholds require a labelled dataset: traces marked with ground truth such as "this was a real loop" or "this was a real injection". Without that data, any default threshold is a guess. Shipping guesses as "calibrated" would be worse than shipping conservative defaults, because users would trust the number and be wrong.

So pisama-detectors gives you the detector code and reference thresholds that work out-of-the-box with low FP rates. What it does not give you is the curve of threshold → F1 for your application's traces. That curve is what Pisama Cloud produces.
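
To make the "threshold → F1" curve concrete, here is a minimal sketch in plain Python. It assumes a detector that emits a score per trace and flags the trace when the score reaches the threshold. The scores, labels, and the prf1 helper are made up for illustration and are not part of pisama-detectors.

```python
def prf1(scores, labels, threshold):
    """Precision, recall, and F1 when a trace is flagged iff score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing flagged: no false positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Made-up detector scores and ground-truth labels (True = real failure).
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.40, 0.30, 0.20]
labels = [True, True, False, True, True, False, False, False]

# The curve: one (precision, recall, F1) point per candidate threshold.
curve = {t: prf1(scores, labels, t) for t in (0.9, 0.75, 0.6, 0.35)}
for t, (p, r, f1) in curve.items():
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data the most conservative threshold (0.9) produces no false positives but misses half the real failures, while 0.6 gives the best F1. Where that peak sits depends entirely on the traces you score, which is why it cannot be shipped as a generic default.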

How do Cloud thresholds differ?

Pisama Cloud runs each detector against a labelled golden dataset (15,000+ traces across easy/medium/hard difficulty) and picks thresholds per-detector to maximize F1. Current results:

  • 49 of 84 detectors measured; 6 externally validated at production grade (real-trace F1 >= 0.80 on at least 30 real traces, mean 0.86)
  • Examples on real-trace data: withholding 0.952, output_validation 0.933, decomposition 0.927, routing 0.909, grounding 0.839, hallucination 0.831, completion 0.818
  • Framework-specific detectors (OpenClaw): mean F1 0.97 on thin real coverage (about 4 real traces each)
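
The production-grade bar quoted in the list above has two parts, a score floor and a minimum amount of real evidence. A small sketch of that rule, assuming it is applied exactly as stated; the function name and the second example's numbers are hypothetical:

```python
def is_production_grade(real_f1, n_real_traces, min_f1=0.80, min_traces=30):
    """Apply the stated bar: real-trace F1 >= 0.80 on at least 30 real traces."""
    return real_f1 >= min_f1 and n_real_traces >= min_traces

# F1 of 0.97 measured on about 4 real traces: high score, too little evidence.
thin_coverage = is_production_grade(0.97, 4)

# Hypothetical detector with F1 0.86 on 45 real traces (both numbers are made up).
well_covered = is_production_grade(0.86, 45)

print(thin_coverage, well_covered)
```

Read this way, a very high F1 on a handful of traces would not meet the bar as stated, which is why the framework-specific numbers are reported with their coverage caveat.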

The same detector code, with Cloud's tuned thresholds, recovers ~15–25 F1 points over OSS defaults on real production traces. That delta is the product.

Cloud thresholds also refresh: each quality-gate run re-picks thresholds against the latest golden data, so drift (new model versions, new failure patterns) gets corrected without a code change.

Can I tune locally?

Yes. pisama-detectors exposes the threshold arguments on every detector, so if you have your own labelled data you can sweep thresholds and pick winners:

from pisama_detectors import detect_loop

# Default thresholds
r = detect_loop(states=[...])

# Custom thresholds
r = detect_loop(states=[...], window_size=7, similarity_threshold=0.9)
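
A sweep over those two arguments might look like the sketch below. This FAQ does not show how to read a verdict out of the detect_loop result, so the sketch takes a hypothetical run_detector callable that returns True or False. The toy_loop_detector here is a stand-in so the example runs on its own; it is not the library's logic, and it only borrows the parameter names from the call above.

```python
from itertools import product

def f1_score(predictions, labels):
    """F1 from boolean predictions and boolean ground-truth labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def sweep(run_detector, traces, labels, grid):
    """Try every parameter combination in grid; return (best_params, best_f1)."""
    names = list(grid)
    best_params, best_f1 = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        predictions = [run_detector(trace, **params) for trace in traces]
        score = f1_score(predictions, labels)
        if score > best_f1:  # strict '>' keeps the first winner on ties
            best_params, best_f1 = params, score
    return best_params, best_f1

def toy_loop_detector(states, window_size, similarity_threshold):
    """Stand-in detector: flag when enough of the recent window is repeated states."""
    window = states[-window_size:]
    if not window:
        return False
    repeat_ratio = 1 - len(set(window)) / len(window)
    return repeat_ratio >= similarity_threshold

# Tiny made-up labelled set (True = real loop).
traces = [
    ["a", "b", "a", "b", "a", "b"],
    ["a", "b", "c", "d", "e", "f"],
    ["a", "b", "c", "c", "c", "c"],
    ["a", "b", "c", "d", "e", "e"],
]
labels = [True, False, True, False]

grid = {"window_size": [3, 6], "similarity_threshold": [0.3, 0.5, 0.7]}
best_params, best_f1 = sweep(toy_loop_detector, traces, labels, grid)
print(best_params, best_f1)
```

With four traces a perfect score means very little. On real data, pick thresholds on one split of your labelled traces and report F1 on a held-out split, so the winner is not just fitted to the traces it was tuned on.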

The detectors are deterministic on a given input, so a standard precision/recall sweep works. What Cloud gives you on top of that:

  • The golden dataset to sweep against (15,000+ labelled traces spanning the MAST failure modes and framework-specific cases).
  • The eval harness (calibrate.py) that runs the sweep, handles per-difficulty breakdown, and flags saturation.
  • Tiered escalation (Tier 1 hash → Tier 5 LLM judge) so expensive detectors only run when cheap ones are uncertain. The cost target is $0.05 per trace.
  • Continuous re-calibration as you ship new agent versions.
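
The tiered escalation item in the list above follows a common pattern: run the cheapest check first and only pay for the next tier when the current one cannot decide. The sketch below shows the pattern with two hypothetical tiers. It is not Pisama Cloud's implementation; the intermediate tiers are not described here and are left out, the per-call costs are invented, and judge_stub is a placeholder rather than a real LLM call.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Tier:
    name: str
    cost_usd: float                               # illustrative cost per call
    check: Callable[[List[str]], Optional[bool]]  # None means "uncertain, escalate"

def hash_tier(states):
    """Cheap check: an exactly repeated state counts as a confident loop signal."""
    digests = [hashlib.sha256(s.encode()).hexdigest() for s in states]
    return True if len(set(digests)) < len(digests) else None

def judge_stub(states):
    """Placeholder for an expensive final tier; always returns a verdict."""
    return False

def escalate(states, tiers):
    """Run tiers cheapest-first and stop at the first confident verdict."""
    spent = 0.0
    for tier in tiers:
        spent += tier.cost_usd
        verdict = tier.check(states)
        if verdict is not None:
            return verdict, tier.name, spent
    return False, "unresolved", spent

tiers = [Tier("hash", 0.0, hash_tier), Tier("judge_stub", 0.04, judge_stub)]

cheap_hit = escalate(["plan", "search", "search"], tiers)
escalated = escalate(["plan", "search", "answer"], tiers)
print(cheap_hit)
print(escalated)
```

The first trace is settled by the hash tier at no cost. The second has no exact repeat, so it escalates and pays for the expensive tier. Average cost per trace then depends on how often the cheap tiers are confident.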

If you have the labelled data and the time, local tuning works. If you don't, Pisama Cloud is the shortcut.