# Calibration FAQ
Three questions we get constantly about detector thresholds and accuracy.
## Why are OSS defaults uncalibrated?
The detector code in pisama-detectors ships with conservative default thresholds. That is deliberate: defaults are tuned to minimise false positives on a broad, generic trace mix, not to maximise F1 on a specific application.
Tuned thresholds require a labelled dataset — traces marked with ground truth: "this was a real loop", "this was a real injection". Without that data, any threshold shipped by default is a guess. Shipping guesses as "calibrated" would be worse than shipping conservative defaults: users would trust the number and be wrong.
So pisama-detectors gives you the detector code and reference thresholds that work out-of-the-box with low FP rates. What it does not give you is the curve of threshold → F1 for your application's traces. That curve is what Pisama Cloud produces.
## How do Cloud thresholds differ?
Pisama Cloud runs each detector against a labelled golden dataset (the Sprint 11 set is 1,200+ traces across easy/medium/hard difficulty) and picks thresholds per-detector to maximise F1. Current results:
- Mean F1 across 51 production detectors: 0.876
- Top three on real-trace data: injection 0.969, grounding 0.876, hallucination 0.854
- Framework-specific detectors (OpenClaw): mean F1 0.94
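For reference, the F1 quoted throughout this page is the standard harmonic mean of precision and recall, computed against the labelled traces. A minimal helper (the example counts are made up):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw detection counts against labelled traces."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 95 true hits, 3 false alarms, 6 misses -> F1 ~ 0.955
print(f1_score(tp=95, fp=3, fn=6))
```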
The same detector code, with Cloud's tuned thresholds, recovers ~15–25 F1 points over OSS defaults on real production traces. That delta is the product.
Cloud thresholds also refresh: each quality-gate run re-picks thresholds against the latest golden data, so drift (new model versions, new failure patterns) gets corrected without a code change.
## Can I tune locally?
Yes — pisama-detectors exposes the threshold arguments on every detector constructor. If you have your own labelled data, you can sweep thresholds and pick winners:
```python
from pisama_detectors import LoopDetector

# Default thresholds: conservative, low-FP out of the box.
d = LoopDetector()

# Custom thresholds, tuned on your own labelled data.
d = LoopDetector(
    semantic_similarity_threshold=0.85,
    min_loop_length=3,
    max_unique_states=10,
)
```
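The sweep itself is a short script. A minimal sketch, assuming your labelled data is a list of (trace, is_loop) pairs; load_labelled_traces is a hypothetical helper, and the detect() method is an assumed interface, so check your installed version for the actual call signature:

```python
from pisama_detectors import LoopDetector

# Hypothetical helper: returns a list of (trace, is_loop) pairs
# from your own labelled data.
labelled = load_labelled_traces("traces.jsonl")

best_threshold, best_f1 = None, 0.0
for threshold in (0.70, 0.75, 0.80, 0.85, 0.90, 0.95):
    d = LoopDetector(semantic_similarity_threshold=threshold)
    tp = fp = fn = 0
    for trace, is_loop in labelled:
        predicted = d.detect(trace)  # assumed method name, not a confirmed API
        tp += int(predicted and is_loop)
        fp += int(predicted and not is_loop)
        fn += int(not predicted and is_loop)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(f"best semantic_similarity_threshold={best_threshold} (F1 {best_f1:.3f})")
```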
The detectors are deterministic on a given input, so a standard precision/recall sweep like the one above gives repeatable numbers. What Cloud gives you on top of that:
- The golden dataset to sweep against (1,200+ labelled traces across 14 MAST failure modes).
- The eval harness (calibrate.py) that runs the sweep, handles per-difficulty breakdown, and flags saturation.
- Tiered escalation (Tier 1 hash → Tier 5 LLM judge) so expensive detectors only run when cheap ones are uncertain, targeting $0.05/trace; see the sketch after this list.
- Continuous re-calibration as you ship new agent versions.
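That tiered-escalation pattern is straightforward to prototype locally. A minimal sketch of the idea only, not Pisama Cloud's implementation: the tier functions, the verdict/cost interface, and the duplicate-line heuristic below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TierResult:
    verdict: Optional[bool]  # None means "uncertain, escalate to the next tier"
    cost_usd: float

def run_tiers(trace: str, tiers: list[Callable[[str], TierResult]]) -> tuple[bool, float]:
    """Run detectors cheapest-first; stop at the first confident verdict."""
    total_cost = 0.0
    for tier in tiers:
        result = tier(trace)
        total_cost += result.cost_usd
        if result.verdict is not None:
            return result.verdict, total_cost
    return False, total_cost  # no tier was confident; default to no flag

# Hypothetical Tier 1: a near-free duplicate-line check as a stand-in
# for a hash-based loop heuristic.
def hash_tier(trace: str) -> TierResult:
    lines = trace.splitlines()
    if len(lines) != len(set(lines)):
        return TierResult(verdict=True, cost_usd=0.0001)
    return TierResult(verdict=None, cost_usd=0.0001)  # uncertain -> escalate

# Hypothetical Tier 5: placeholder for an expensive LLM-judge call.
def llm_judge_tier(trace: str) -> TierResult:
    return TierResult(verdict=False, cost_usd=0.04)

verdict, cost = run_tiers("step A\nstep B\nstep A", [hash_tier, llm_judge_tier])
print(verdict, cost)  # True 0.0001 -- the cheap tier was confident, judge never ran
```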
If you have the labelled data and the time, local tuning works. If you don't, Pisama Cloud is the shortcut.
## Related
- OSS vs Cloud — full feature comparison
- Detection Tiers — how escalation works
- Detection Overview — per-detector F1 numbers