Verification Failures (FC3)¶
Verification failures occur when agents improperly validate their outputs, bypass quality checks, or incorrectly determine task completion.
Accuracy numbers reflect the latest calibration run; the canonical per-detector table is in Detection Overview.
F12: Output Validation Failure (Enterprise)¶
| Field | Value |
|---|---|
| Detector key | output_validation |
| Tier | Enterprise |
| Severity | High |
| Accuracy | F1 0.933, P 1.000, R 0.875 (65 real validator/generator traces, Jun 2026) |
| MAST mapping | FM-3.2, FM-3.3 |
Plain language: The agent skipped its quality checks or approved work that actually failed them. Like a building inspector signing off without ever visiting the site.
Technical: Detects validation bypass patterns in agent outputs and workflow logs — including skipped validation steps, false approvals despite failed checks, and rubber-stamp approvals (short positive responses like "LGTM" or "Looks good" that contain no checking evidence). Calibrated specifically on generator/validator trace pairs where one agent produces output and a second agent is responsible for reviewing it.
Examples (non-technical):
- Agent says "all tests passed" but never actually ran the tests
- A required review step exists in the process but its results were ignored
- Agent marks work as "validated" when the validation actually failed
Examples (technical):
- Agent output contains
"tests_passed": truebut CI logs show zero test executions - Workflow step
validate_schemaran but returned{"valid": false}-- agent proceeded anyway - Agent bypasses validation with pattern:
# TODO: validate laterorBYPASS validation - Pipeline processes PII data but has no
validate_pii_handlingstep in the DAG
Detection methods:
- Bypass Pattern Detection: Identifies patterns indicating validation was skipped
- Validation Performance Check: Detects when validation steps actually ran
- False Approval Detection: Catches approval despite failed checks
- Validation Presence Audit: Ensures validation steps exist where required
Sub-types: validation_bypassed, validation_skipped, approval_despite_failure, missing_validation, validation_ignored, incomplete_validation
F13: Quality Gate Bypass (Enterprise)¶
| Field | Value |
|---|---|
| Detector key | quality_gate |
| Tier | Enterprise |
| Severity | High |
| Accuracy | Benchmarking in progress |
| MAST mapping | FM-3.2 No/Incomplete Verification |
Plain language: The agent skipped mandatory quality checks or pushed work through despite failing them. Like submitting a paper without spell-checking when spell-check is required.
Technical: Audits workflow execution for missing quality gate steps, threshold violations (e.g., score 45% below 80% minimum), bypassed review processes, and use of force flags (--no-verify, --force, --skip-*).
Examples (non-technical):
- Agent skips the required review step and goes straight to deployment
- Quality score is 45% but the minimum is 80% -- agent proceeds anyway
- A mandatory peer review process is completely omitted from the workflow
Examples (technical):
- Agent runs
git push --no-verify, bypassing pre-push hooks that enforce linting - Code coverage gate requires 80% but agent deploys with 52% coverage
- Agent calls
deploy(force=True)to skip the staging environment validation - Workflow definition includes
review_stepbut execution trace shows it was never invoked
Detection methods:
- Validation Step Audit: Checks for presence of required validation steps
- Threshold Monitoring: Verifies quality scores meet minimum thresholds
- Review Process Check: Ensures mandatory review processes are followed
- Bypass Flag Detection: Catches
--no-verify,--skip-*,-f/--forcepatterns
Sub-types: skipped_validation, ignored_threshold, bypassed_review, missing_checks, forced_completion
F14: Completion Misjudgment¶
| Field | Value |
|---|---|
| Detector key | completion |
| Tier | ICP |
| Severity | High |
| Accuracy | F1 0.818, real coverage (45 real traces), Production |
| MAST mapping | FM-1.5 Unaware of Termination, FM-3.1 Premature Termination |
Plain language: The agent said "I'm done" when it wasn't. It claimed to have completed everything, but important pieces are still missing -- like a contractor saying a house is finished when the plumbing isn't connected.
Technical: Detects premature completion claims through completion marker analysis, quantitative requirement verification (numeric counts), hedging language detection, structured output inspection for incomplete flags, and planned/future work indicators.
Examples (non-technical):
- Agent says "all 10 items are documented" but only 8 actually are
- Task is marked complete but has items listed as "planned for future work"
- Agent delivers 80% of what was asked for and declares the job done
Examples (technical):
- Output claims "all endpoints documented" but
documented_count: 8vstotal_endpoints: 10 - JSON output has
{"status": "complete", "documented": false}-- contradictory fields - Agent output contains hedging: "appears to be complete" or "should cover most cases"
- Completion message includes
# TODO: implement remaining validators-- clearly unfinished - Numeric ratio detection catches "8/10 endpoints implemented"
Detection methods:
- Completion Marker Detection: Identifies explicit and implicit completion claims
- Quantitative Requirement Check: Verifies numerical completeness ("all", "every", N items)
- Hedging Language Detection: Flags qualifiers like "appears complete" or "seems done"
- JSON Indicator Analysis: Checks structured output for incomplete flags
- Numeric Ratio Detection: Catches partial delivery (e.g., "8/10 endpoints")
- Planned/Future Work Detection: Identifies indicators that the task is not actually complete
Sub-types: premature_completion, partial_delivery, ignored_subtasks, missed_criteria, false_success_claim, incomplete_verification