Calibration FAQ
Three questions we get constantly about detector thresholds and accuracy.
Why are OSS defaults uncalibrated?
The detector code in pisama-detectors ships with conservative default thresholds. That is deliberate: defaults are tuned to minimize false positives on a broad, generic trace mix, not to maximize F1 on a specific application.
Tuned thresholds require a labelled dataset — traces marked with ground-truth "this was a real loop", "this was a real injection". Without that data, any threshold shipped by default is a guess. Shipping guesses as "calibrated" would be worse than shipping conservative defaults: users would trust the number and be wrong.
So pisama-detectors gives you the detector code and reference thresholds that work out-of-the-box with low FP rates. What it does not give you is the curve of threshold → F1 for your application's traces. That curve is what Pisama Cloud produces.
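For reference, the threshold → F1 curve can be written with the standard definitions. This sketch assumes, purely for illustration, a detector that flags a trace when its score is at least some threshold τ; TP, FP and FN are the true-positive, false-positive and false-negative counts on the labelled traces at that threshold.

```latex
P(\tau) = \frac{TP(\tau)}{TP(\tau) + FP(\tau)}, \qquad
R(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)}, \qquad
F_1(\tau) = \frac{2\,P(\tau)\,R(\tau)}{P(\tau) + R(\tau)}
```

Under that assumption, a conservative threshold keeps FP(τ) small at the cost of a larger FN(τ), which is why a low-false-positive default is not necessarily the F1-maximizing choice for a given application.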
How do Cloud thresholds differ?
Pisama Cloud runs each detector against a labelled golden dataset (15,000+ traces across easy/medium/hard difficulty) and picks thresholds per-detector to maximize F1. Current results:
- 49 of 84 detectors measured; 6 externally validated at production grade (real-trace F1 >= 0.80 on at least 30 real traces, mean 0.86)
- Examples on real-trace data: withholding 0.952, output_validation 0.933, decomposition 0.927, routing 0.909, grounding 0.839, hallucination 0.831, completion 0.818
- Framework-specific detectors (OpenClaw): mean F1 0.97 on thin real coverage (about 4 real traces each)
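The production-grade criterion above (real-trace F1 >= 0.80 on at least 30 real traces) can be read as a simple two-part gate. The sketch below is illustrative only: the function name is hypothetical, and the trace count of 40 is made up for the example, since the source does not list per-detector trace counts.

```python
# Both cut-offs come from the criterion stated in the list above.
MIN_REAL_F1 = 0.80
MIN_REAL_TRACES = 30


def is_production_grade(real_f1: float, n_real_traces: int) -> bool:
    """Hypothetical helper: both the F1 bar and the coverage bar must be met."""
    return real_f1 >= MIN_REAL_F1 and n_real_traces >= MIN_REAL_TRACES


# High F1 on thin coverage (about 4 real traces) does not pass the gate.
thin_coverage = is_production_grade(0.97, 4)

# The same kind of score with enough real traces does.
# The count of 40 is an invented value for illustration.
enough_coverage = is_production_grade(0.952, 40)
```

This is why a mean F1 of 0.97 on about 4 real traces is reported separately rather than counted as production grade.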
The same detector code, with Cloud's tuned thresholds, recovers ~15–25 F1 points over OSS defaults on real production traces. That delta is the product.
Cloud thresholds also refresh: each quality-gate run re-picks thresholds against the latest golden data, so drift (new model versions, new failure patterns) gets corrected without a code change.
Can I tune locally?
Yes. pisama-detectors exposes the threshold arguments on every detector entry point. If you have your own labelled data, you can sweep thresholds and pick winners:
from pisama_detectors import detect_loop
# Default thresholds
r = detect_loop(states=[...])
# Custom thresholds
r = detect_loop(states=[...], window_size=7, similarity_threshold=0.9)
The detectors are deterministic on a given input, so a standard precision/recall sweep works. What Cloud gives you on top of that:
- The golden dataset to sweep against (15,000+ labelled traces spanning the MAST failure modes and framework-specific cases).
- The eval harness (calibrate.py) that runs the sweep, handles per-difficulty breakdown, and flags saturation.
- Tiered escalation (Tier 1 hash → Tier 5 LLM judge) so expensive detectors only run when cheap ones are uncertain. Targets $0.05/trace.
- Continuous re-calibration as you ship new agent versions.
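The tiered-escalation idea in the list above can be sketched as a loop that runs checks cheapest-first and stops at the first confident verdict. Everything here is an assumption for illustration: the tier functions, the confidence cut-off and the per-call costs are hypothetical stand-ins, not Pisama Cloud's implementation, and only two of the five tiers are shown.

```python
from typing import Callable, List, Tuple

# A tier is (name, cost per call in dollars, check function).
# Each check returns (is_failure, confidence in [0, 1]).
Check = Callable[[str], Tuple[bool, float]]
Tier = Tuple[str, float, Check]


def hash_tier(trace: str) -> Tuple[bool, float]:
    """Cheap stand-in: exact duplicate lines are a confident loop signal."""
    lines = trace.splitlines()
    if len(set(lines)) < len(lines):
        return True, 0.95
    return False, 0.5  # no duplicates found: uncertain, so escalate


def judge_tier(trace: str) -> Tuple[bool, float]:
    """Stand-in for an expensive judge; a real one would call a model."""
    return "error" in trace, 0.9


def run_tiers(trace: str, tiers: List[Tier],
              min_confidence: float = 0.8) -> Tuple[bool, str, float]:
    """Return (verdict, name of deciding tier, total cost spent).

    If no tier is confident, the last tier's verdict is returned.
    """
    verdict, name, spent = False, "", 0.0
    for name, cost, check in tiers:
        spent += cost
        verdict, confidence = check(trace)
        if confidence >= min_confidence:
            break  # confident enough: skip the more expensive tiers
    return verdict, name, spent


# Illustrative costs only.
TIERS: List[Tier] = [("hash", 0.0, hash_tier), ("llm_judge", 0.04, judge_tier)]

cheap_result = run_tiers("step A\nstep A\nstep A", TIERS)
escalated_result = run_tiers("step A\nstep B error", TIERS)
```

In this toy run the first trace is settled by the hash check at no cost, while the second falls through to the judge stand-in, so the expensive check is paid for only when the cheap one is uncertain.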
If you have the labelled data and the time, local tuning works. If you don't, Pisama Cloud is the shortcut.
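A local precision/recall sweep could look roughly like the sketch below. It deliberately uses a toy detector (`toy_loop_detector`, a hypothetical stand-in) rather than the real `detect_loop`, because the shape of the real detector's return value is not described here; swap in a wrapper around your own detector call. The labelled traces are invented toy data.

```python
from typing import Callable, List, Sequence, Tuple

Labelled = List[Tuple[Sequence[str], bool]]         # (states, is_real_loop)
Detector = Callable[[Sequence[str], float], bool]   # (states, threshold) -> flagged


def toy_loop_detector(states: Sequence[str], similarity_threshold: float) -> bool:
    """Hypothetical stand-in, NOT the pisama-detectors API.

    Flags a loop when the share of identical consecutive states
    reaches the threshold.
    """
    if len(states) < 2:
        return False
    repeats = sum(1 for a, b in zip(states, states[1:]) if a == b)
    return repeats / (len(states) - 1) >= similarity_threshold


def precision_recall_f1(threshold: float, labelled: Labelled,
                        detector: Detector) -> Tuple[float, float, float]:
    tp = fp = fn = 0
    for states, is_loop in labelled:
        flagged = detector(states, threshold)
        if flagged and is_loop:
            tp += 1
        elif flagged and not is_loop:
            fp += 1
        elif not flagged and is_loop:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def sweep(labelled: Labelled, thresholds: Sequence[float],
          detector: Detector = toy_loop_detector):
    """Return (best_row, all_rows); each row is (threshold, P, R, F1)."""
    rows = [(t, *precision_recall_f1(t, labelled, detector)) for t in thresholds]
    # Highest F1 wins; ties go to the higher threshold, which flags fewer traces.
    best = max(rows, key=lambda row: (row[3], row[0]))
    return best, rows


# Invented toy data: (states, ground-truth label).
LABELLED: Labelled = [
    (["a", "a", "a", "a"], True),
    (["a", "a", "b", "b", "b"], True),
    (["x", "x", "x", "y"], True),
    (["a", "b", "a", "b"], False),
    (["a", "a", "b", "c"], False),
]

best, rows = sweep(LABELLED, [0.3, 0.5, 0.7, 0.9])
```

On this toy data the lowest threshold lets in a false positive and the higher ones miss real loops, so the sweep settles on the middle value. With real traces, hold out part of the labelled set to check that the chosen threshold generalizes.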
Related
- OSS vs Cloud — full feature comparison
- Detection Tiers — how escalation works
- Detection Overview — per-detector F1 numbers