Calibration FAQ

The three questions we hear most often about detector thresholds and accuracy.

Why are OSS defaults uncalibrated?

The detector code in pisama-detectors ships with conservative default thresholds. That is deliberate: defaults are tuned to minimise false positives on a broad, generic trace mix, not to maximise F1 on a specific application.

Tuned thresholds require a labelled dataset — traces annotated with ground truth: "this was a real loop", "this was a real injection". Without that data, any threshold shipped by default is a guess. Shipping guesses as "calibrated" would be worse than shipping conservative defaults: users would trust the number and be wrong.

So pisama-detectors gives you the detector code and reference thresholds that work out-of-the-box with low FP rates. What it does not give you is the curve of threshold → F1 for your application's traces. That curve is what Pisama Cloud produces.

How do Cloud thresholds differ?

Pisama Cloud runs each detector against a labelled golden dataset (the Sprint 11 set is 1,200+ traces across easy/medium/hard difficulty) and picks per-detector thresholds to maximise F1. Current results:

  • Mean F1 across 51 production detectors: 0.876
  • Top three on real-trace data: injection 0.969, grounding 0.876, hallucination 0.854
  • Framework-specific detectors (OpenClaw): mean F1 0.94

The same detector code, with Cloud's tuned thresholds, recovers ~15–25 F1 points over OSS defaults on real production traces. That delta is the product.

Cloud thresholds also refresh: each quality-gate run re-picks thresholds against the latest golden data, so drift (new model versions, new failure patterns) gets corrected without a code change.

Can I tune locally?

Yes — pisama-detectors exposes the threshold arguments on every detector constructor. If you have your own labelled data, you can sweep thresholds and pick winners:

from pisama_detectors import LoopDetector

# Default
d = LoopDetector()

# Custom thresholds
d = LoopDetector(
    semantic_similarity_threshold=0.85,
    min_loop_length=3,
    max_unique_states=10,
)

The detectors are deterministic on a given input, so a standard precision/recall sweep works. What Cloud gives you on top of that:

  • The golden dataset to sweep against (1,200+ labelled traces across 14 MAST failure modes).
  • The eval harness (calibrate.py) that runs the sweep, handles per-difficulty breakdown, and flags saturation.
  • Tiered escalation (Tier 1 hash → Tier 5 LLM judge) so expensive detectors only run when cheap ones are uncertain. Targets $0.05/trace.
  • Continuous re-calibration as you ship new agent versions.
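The local sweep described above can be sketched as follows. This is a minimal, self-contained illustration, not the Cloud harness: the `scores` stand in for per-trace detector confidences you would collect by running a detector over your traces, and `labels` are your ground-truth annotations.

```python
# Hypothetical threshold sweep: for each candidate threshold, binarise the
# detector scores, compute F1 against the labels, and keep the best.

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def sweep(scores, labels, thresholds):
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum(l and not p for p, l in zip(preds, labels))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Toy labelled data: six traces, three of them true positives.
scores = [0.91, 0.40, 0.88, 0.72, 0.30, 0.95]
labels = [True, False, True, False, False, True]

best_t, best_f1 = sweep(scores, labels, [0.5, 0.7, 0.8, 0.9])
print(best_t, best_f1)  # 0.8 separates the classes cleanly here
```

The same loop extends to per-difficulty breakdowns by partitioning the traces before sweeping; with real data you would also hold out a split so the picked threshold is not overfit to the sweep set.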

If you have the labelled data and the time, local tuning works. If you don't, Pisama Cloud is the shortcut.
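The tiered-escalation idea mentioned above (cheap checks first, expensive ones only on uncertainty) can be sketched like this. Everything here is an assumption for illustration — the tier functions, confidence bands, and thresholds are hypothetical stand-ins, not the Cloud implementation.

```python
# Hypothetical two-tier escalation: a cheap hash-based check runs first;
# only scores in the uncertain band escalate to the expensive judge.

def tier1_hash_check(trace):
    # Cheap heuristic: flag exact step repeats by hashing each step.
    seen = set()
    for step in trace:
        if step in seen:
            return 0.9  # confident: looks like a loop
        seen.add(step)
    return 0.1          # confident: no exact repeat

def tier5_judge(trace):
    # Stand-in for a costly semantic / LLM-judge check.
    return 0.5

def detect(trace, low=0.2, high=0.8):
    score = tier1_hash_check(trace)
    if low < score < high:      # uncertain band: pay for the expensive tier
        score = tier5_judge(trace)
    return score >= 0.5

print(detect(["plan", "act", "plan"]))  # exact repeat: cheap tier decides
print(detect(["plan", "act", "done"]))  # no repeat: cheap tier decides
```

The cost control comes from the bands: the wider the uncertain band, the more traces escalate and the higher the per-trace cost, which is the knob behind a target like $0.05/trace.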