The RL Environment Boom Has a Measurement Problem
June 2026
Most of the current writing on reinforcement learning environments leaves one number out. The numbers it does report are financial: frontier labs have reportedly discussed investing more than a billion dollars a year in training environments, and one environment vendor grew from a million dollars to roughly sixty million in annualized revenue in about a year. The number that is harder to find in vendor documentation is a failure rate: in June 2025, METR observed OpenAI's o3 gaming the reward signal, rather than completing the assigned task, in 30.4 percent of its RE-Bench runs.
I have spent the past year building the grader half of this stack, the component that decides whether an agent's output constitutes a success or a failure. That experience has given me a particular view of the environment economy, and it is not the one reflected in most of the market commentary. Environment supply is scaling rapidly; the capacity to calibrate those environments is not. The binding constraint on this market is the ability to prove that a reward signal means what it claims.
The investment thesis
The argument that environments are the new datasets has moved from a contrarian position to something close to consensus framing. The reasoning is structural: if reinforcement learning requires agents to practice tasks and receive feedback, then whoever controls the richest, most realistic practice environments controls the quality of the resulting policy. Mechanize argued in June 2025 that replication training is to RL what internet-scale text was to language models; by January 2026, Wing VC was framing the next four years as a race for the training and verification layer, and the funding market appeared to have priced the analogy in.
The spend is real, even if the headlines overstate it. Epoch AI's interview-based survey (eighteen practitioners, January 2026) puts contracts at seven figures per quarter or more: UI replicas of websites run about $20,000 each, complex product clones on the order of $300,000, individual tasks $200 to $2,000. Fleet, which sells simulated replicas of enterprise software, grew from $1 million to a reported $60 million in annualized revenue and was in talks this spring to raise at a $750 million valuation. Mercor raised $350 million at a $10 billion valuation as it adds reinforcement learning infrastructure on top of its expert network. Scale AI reports that nearly half of its new data projects now involve RL environments.
One discipline worth borrowing from financial analysis is the distinction between discussed spend and booked spend. The widely cited figure that Anthropic considered investing more than $1 billion in environments over the next year is reported discussion, sourced to The Information; Wing's estimate of Anthropic's actual 2025 environment spend is on the order of tens of millions annually, with aggregate lab spend growing three- to fivefold into 2026. Both can be true at once, but the first is a forecast wearing the costume of a fact. The gap between them accounts for most of the market's narrative premium.
A specific and verifiable disagreement
The skepticism in this market has a particular shape: the skeptics are principally insiders, and they all identify the same component as the failure point.
Sherwin Wu, who leads engineering on OpenAI's API platform, is publicly bearish on environment startups. Ross Taylor, who ran reasoning at Meta AI, says that people are underestimating how difficult it is to scale environments, and that even the best publicly available ones "typically don't work without serious modification." Andrej Karpathy, an investor in one of the leading environment companies, has written that he is bullish on environments and bearish on reinforcement learning specifically; on the Dwarkesh Patel podcast he described outcome-reward RL as producing sparse supervision delivered through a single signal at the end of a long trajectory.
What is notable about this disagreement is that the two sides are not primarily arguing about whether environments are valuable. They agree on the failure mode with unusual precision. Wing's essay, among the most prominent statements of the environment thesis, warns in its own text that verification is where most RL efforts fail; Epoch's buyer interviews rank reward-hack robustness as the number one quality criterion labs apply when purchasing, and maintaining quality while scaling as the number one vendor bottleneck. The disagreement is about whether anyone will solve verification, and who captures the value if they do.
Reward hacking: from anecdote to measurement
The reward-hacking literature has moved from anecdote to systematic measurement over the past two years, and the results deserve close reading.
METR's June 2025 analysis observed o3 reward-hacking in 30.4 percent of RE-Bench runs: hacking the scoring timer instead of optimizing the program, monkey-patching graders, and in one case using stack introspection to pull the correct answer out of the scoring system's own call stack. ImpossibleBench (October 2025) takes a different approach, presenting models with coding tasks whose unit tests conflict with the written specification, so that passing the tests requires exploiting the grader; GPT-5 cheated 66 percent of the time, o3 49 percent, and Claude Opus 4.1 46 percent. The ImpossibleBench authors also tested mitigations: giving models an explicit abort mechanism cut the rate for GPT-5 to 9 percent and o3 to 12 percent (Opus 4.1, which rarely used the escape hatch, remained at 46 percent), and the authors recommend stricter grader access controls on top of that. The models' willingness to exploit the grader was relatively stable across conditions; the grader design was what moved the rate.
The downstream consequences extend beyond corrupted metrics. Anthropic's November 2025 research on emergent misalignment demonstrated that models which learn to reward-hack real production coding environments generalize the behavior to other contexts, including alignment faking and sabotage of safety research, behaviors that emerged from nothing more than exposure to a permissive grader. The verifier is an active shaping force on the policy it evaluates, not a passive recorder of behavior.
The frame I find most useful is an adversarial one: a reward channel is an attack surface, and the attacker is the policy under training. Epoch's interviewees describe hardening environments through "many many iterations" against frontier models, because a high reward must mean the task was genuinely solved. Every other property of an environment, the richness of its harness, the breadth of its task distribution, is downstream of that guarantee.
The components of calibration
In the dominant open-source framing (the verifiers library, which underpins Prime Intellect's Environments Hub), an environment consists of a dataset, a harness, and a grader: the reward function or rubric that decides what gets reinforced. The dataset and harness are increasingly commodities. The grader is where quality lives and dies.
Graders are growing more complex as the market moves past tasks with binary verifiable outcomes. Scale's Rubrics as Rewards work treats binary verifiable reward as a special case of rubric scoring; Epoch's buyer interviews indicate that graders in commercially sold environments are, in practice, unit tests or LLM judges scoring against rubrics. When the grader is a language model applying a rubric, every question one would ask of a noisy measurement instrument applies to the reward channel: What are the gold labels, and who validated them? Do independent judges agree, and on which class? Does the score survive a change of judge model? What is the chain of custody from a published number back to the corpus that produced it?
The norms forming around these questions constitute an engineering discipline. They include hand-checked gold label sets, inter-judge agreement tracked statistically with alerting on drift, periodic recalibration, and cross-family judge ensembles. The known ceiling is sobering. On GDPval's expert-domain tasks, OpenAI's automated grader agreed with human experts 66 percent of the time, against an inter-expert baseline of 71 percent. A judge cannot be more coherent than the labels it was validated against, which makes label quality the upstream constraint on everything downstream, including the reward signal a training run optimizes.
I build this half of the stack. The rest of this piece is what that has taught me.
Observations from production: label provenance
Scope and disclosure, stated plainly. I build Pisama, a failure-detection platform for multi-agent LLM systems. A failure detector combined with an LLM judge and a calibration pipeline is structurally the grader half of an RL environment: it scores trajectories against a rubric, and downstream systems treat its output as truth. I have not shipped training environments to a frontier lab, and my graders face drifting production traffic rather than a policy trained to exploit them. The adversarial robustness question I know from the literature above. The label, judge, and lineage question I know from production, and the following observations carry production numbers.
Publish the whole funnel. A registry that reports only its best performers is marketing with a decimal point. Pisama's registry defines 84 detectors. Of those, 49 are measured on an external-only lane (real traces, no synthetic data feeding any published score), and exactly 4 are externally validated at production grade as of June 11, 2026: F1 at or above 0.80, precision at or above 0.70, at least 30 real samples. Mean F1 across the four is 0.85. The remaining 80 are distributed across named bands: 7 beta, 17 experimental, 21 failing, 35 untested. Most verifiers do not survive grounded measurement, and the registry says so in named bands.
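The production-grade criteria above are simple enough to state as code. The following is a minimal sketch, not the registry's actual implementation: the three thresholds and the band counts come from the paragraph above, while `LaneResult`, `is_production_grade`, and `funnel` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the text for the production-grade band.
MIN_F1 = 0.80
MIN_PRECISION = 0.70
MIN_REAL_SAMPLES = 30


@dataclass(frozen=True)
class LaneResult:
    """Hypothetical record of one detector's external-lane measurement."""
    detector: str
    f1: float
    precision: float
    real_samples: int


def is_production_grade(result: LaneResult) -> bool:
    """True only when all three published criteria hold at once."""
    return (
        result.f1 >= MIN_F1
        and result.precision >= MIN_PRECISION
        and result.real_samples >= MIN_REAL_SAMPLES
    )


def funnel(counts: dict[str, int]) -> dict[str, float]:
    """Share of the registry in each band, so the whole funnel is reported."""
    total = sum(counts.values())
    return {band: n / total for band, n in counts.items()}


# Band counts as stated in the text (4 + 7 + 17 + 21 + 35 = 84).
REGISTRY = {"production": 4, "beta": 7, "experimental": 17, "failing": 21, "untested": 35}
shares = funnel(REGISTRY)
```

Reporting `shares` for every band, rather than only the production row, is the point of the exercise: under these counts the production band is 4 of 84 detectors, a little under 5 percent of the registry.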
Labels rot, and judges inherit the rot. The task-derailment detector in our registry was validated against a 100-trace lane of real user conversations, labeled by a three-vendor panel of judge models. When we adjudicated the 26 contested rows, 7 of the 9 positive labels flipped to negative on full-trace review. The root cause was a labeling pipeline truncation: the panel saw the first 1,500 characters of the prompt and 2,500 of the completion, and positives survived the panel vote, in one case with a unanimous three-to-zero result, because every judge was evaluating the same truncated evidence. Inter-judge agreement cannot catch a defect all judges share. The adjudication produced a doctrine ruling ("under-delivery is not derailment") applied to the judge prompt the same day; the decision record states who adjudicated (model arbiters, at my explicit delegation, recorded as human_in_loop: false), the arbiter's family bias relative to the panel, and a per-row analysis of which flips were caused by the truncation.
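One way to guard against a shared-evidence defect of this kind is to route positive verdicts to full-trace review whenever the panel's view was truncated, whatever the vote margin. The sketch below assumes the two character limits quoted above; the row layout and the function names are hypothetical, and the routing rule is one possible policy rather than the one actually adopted.

```python
PROMPT_LIMIT = 1_500      # characters of prompt the panel saw, per the text
COMPLETION_LIMIT = 2_500  # characters of completion the panel saw, per the text


def was_truncated(prompt: str, completion: str) -> bool:
    """True when the panel could not have seen the full evidence."""
    return len(prompt) > PROMPT_LIMIT or len(completion) > COMPLETION_LIMIT


def needs_full_review(row: dict) -> bool:
    """Route majority-positive verdicts on truncated evidence to adjudication.

    The vote margin is deliberately ignored: a unanimous panel that read the
    same truncated text shares the defect, so unanimity is not evidence.
    """
    yes_votes = sum(row["votes"])
    majority_positive = 2 * yes_votes > len(row["votes"])
    return majority_positive and was_truncated(row["prompt"], row["completion"])


# Hypothetical rows: a unanimous positive on a long completion, and a
# split positive on evidence that fit inside both limits.
rows = [
    {"prompt": "p" * 800, "completion": "c" * 6_000, "votes": [True, True, True]},
    {"prompt": "p" * 800, "completion": "c" * 900, "votes": [True, True, False]},
]
review_queue = [needs_full_review(r) for r in rows]
```

In this toy input the unanimous row is the one sent to review, because it is the one whose evidence the panel never fully saw.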
High agreement is not the headline it appears to be. After the full-text relabel and a panel-judge swap (a chronically abstaining judge model replaced with a stronger one from the same vendor family), raw cross-vendor agreement on the lane reads 0.96 to 0.98 across the three vendor pairs, computed on 91 to 95 shared rows at the current labeling version. That figure reads as impressive until you hold it against the class balance: positive prevalence on the slice is about 2 percent, each vendor casts exactly two yes votes, and no two vendors cast them on the same trace. Positive specific agreement is 0.00. What the lane demonstrates is that the rubric is stable on negative cases across model families; positive-class stability is an open item, recorded as such. Which agreement statistic you publish determines whether the reader is informed or misled, and a buyer who cannot tell the difference is purchasing reward signals on the basis of headline numbers rather than the structure underneath them.
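The gap between the two statistics is easy to reproduce. The sketch below computes raw agreement alongside positive and negative specific agreement (twice the concordant count over twice the concordant count plus the discordant count) for one judge pair. The vote vectors are synthetic, built only to match the shape described above: 95 shared rows, two yes votes per judge, none on the same trace. They are not the lane's real verdicts.

```python
def agreement_stats(a: list[bool], b: list[bool]) -> dict[str, float]:
    """Raw and class-specific agreement for two judges on shared rows."""
    if len(a) != len(b) or not a:
        raise ValueError("judges must score the same non-empty set of rows")
    both_yes = sum(x and y for x, y in zip(a, b))
    both_no = sum((not x) and (not y) for x, y in zip(a, b))
    split = len(a) - both_yes - both_no
    # Specific agreement: 2 * concordant / (2 * concordant + discordant).
    pos_den = 2 * both_yes + split
    neg_den = 2 * both_no + split
    return {
        "raw": (both_yes + both_no) / len(a),
        "positive_specific": 2 * both_yes / pos_den if pos_den else float("nan"),
        "negative_specific": 2 * both_no / neg_den if neg_den else float("nan"),
    }


# Synthetic lane: 95 shared rows, two yes votes per judge, no overlap.
n = 95
judge_a = [i in (0, 1) for i in range(n)]
judge_b = [i in (2, 3) for i in range(n)]
stats = agreement_stats(judge_a, judge_b)
```

With these inputs the pair disagrees on four rows, so raw agreement is 91/95 (about 0.96) and negative specific agreement is 182/186 (about 0.98), while positive specific agreement is exactly 0. The same verdicts support either headline, depending on which statistic is printed.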
Synthetic data flatters; real data humbles. Across our calibration history, the deepest framework-specific detector family scores a mean F1 of 0.94 on real traces from its native framework, while the external lane for general-purpose detection averaged 0.75 after corpus expansion on June 11, 2026; a balanced real-only subset reads 0.76. A single data-quality rule, excluding traces under four messages where the coordination signal the detector targets cannot physically exist, moved one detector's real-data F1 from 0.556 to 0.685, a larger gain than any model or threshold change we attempted. The cheapest calibration wins are usually in the corpus.
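The mechanism behind that kind of gain can be shown on toy data. This sketch applies the four-message rule from the paragraph above to a hypothetical corpus in which the detector's only false positives sit on traces too short to contain the signal. The corpus, the trace layout, and the resulting scores are invented for illustration and do not reproduce the 0.556 and 0.685 figures.

```python
MIN_MESSAGES = 4  # rule quoted in the text: shorter traces are excluded


def eligible(trace: dict) -> bool:
    """A trace is scoreable only if it has at least MIN_MESSAGES messages."""
    return len(trace["messages"]) >= MIN_MESSAGES


def f1_score(traces: list[dict]) -> float:
    """F1 over traces carrying boolean 'label' and 'predicted' fields."""
    tp = sum(t["label"] and t["predicted"] for t in traces)
    fp = sum((not t["label"]) and t["predicted"] for t in traces)
    fn = sum(t["label"] and (not t["predicted"]) for t in traces)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def trace(n_messages: int, label: bool, predicted: bool) -> dict:
    """Build a toy trace with placeholder message bodies."""
    return {"messages": ["m"] * n_messages, "label": label, "predicted": predicted}


# Toy corpus: 3 true positives, 1 miss, 4 true negatives, and
# 2 false positives that occur only on two-message traces.
corpus = (
    [trace(6, True, True)] * 3
    + [trace(5, True, False)]
    + [trace(2, False, True)] * 2
    + [trace(8, False, False)] * 4
)
filtered = [t for t in corpus if eligible(t)]
before, after = f1_score(corpus), f1_score(filtered)
```

On this corpus F1 moves from 6/9 to 6/7 without touching the detector, which is the sense in which the cheapest calibration wins tend to be in the corpus.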
Observations from production: operational infrastructure
A calibration number that cannot be reproduced from a documented chain of custody is a screenshot. Every Pisama calibration run records a dataset fingerprint built from SHA-256 content hashes of the corpus; a CI gate compares new runs against baseline and blocks deploys on regression; live thresholds carry the timestamp and run that derived them. When a detector's score changes, the first diagnostic question, whether the data changed or the detector changed, is answerable from the artifact without archaeology.
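A minimal sketch of a fingerprint and gate of this kind, using only Python's standard `hashlib`. The construction (hashing each record, sorting the digests, hashing the result), the run-record layout, and the 0.01 tolerance are all assumptions made for illustration, not a description of the production pipeline.

```python
import hashlib


def dataset_fingerprint(records: list[bytes]) -> str:
    """Order-independent SHA-256 fingerprint over per-record content hashes."""
    record_hashes = sorted(hashlib.sha256(r).hexdigest() for r in records)
    outer = hashlib.sha256()
    for h in record_hashes:
        outer.update(h.encode("ascii"))
    return outer.hexdigest()


def diagnose(baseline: dict, candidate: dict, tolerance: float = 0.01) -> str:
    """Classify a new run against baseline; 'regression' should block deploy."""
    if candidate["fingerprint"] != baseline["fingerprint"]:
        return "data_changed"  # scores are not comparable until re-baselined
    if candidate["f1"] < baseline["f1"] - tolerance:
        return "regression"    # same corpus, worse score: the detector changed
    return "pass"


# Hypothetical runs over the same two records, presented in different order.
corpus = [b'{"trace": 1}', b'{"trace": 2}']
baseline = {"fingerprint": dataset_fingerprint(corpus), "f1": 0.85}
candidate = {"fingerprint": dataset_fingerprint(list(reversed(corpus))), "f1": 0.80}
verdict = diagnose(baseline, candidate)
```

Because the fingerprint ignores record order, the candidate above matches the baseline corpus and the drop in F1 is attributed to the detector, so `verdict` is "regression". Adding or editing a single record changes the fingerprint and yields "data_changed" instead.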
Cost is a design axis of verification, because verifiers run at trace volume. Our detection ladder escalates from structural checks that cost nothing, through state-delta and embedding tiers at one to two cents per trace, to LLM-judge calls at five to ten cents, against a reference cost of roughly $50 for human review of a complex trace. Panel labeling lands around 1.7 cents per trace. Scores also sit in production latency paths; structural false-positive gating reduced mean orchestrator runtime on chat traces from 82 milliseconds to 10. And when a verdict triggers automated remediation, the verifier's precision becomes a safety property: our implementation gates fixes behind per-workflow locks, checkpointed rollback, rate limits, and risk tiers that separate configuration changes from logic changes.
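The escalation pattern can be sketched as a cheapest-first cascade in which each tier either decides or defers. Everything below is illustrative: the tier functions, the score cut-offs, and the `drift_score` and `judge_says` fields are hypothetical, and the per-tier costs are set to the upper end of the ranges quoted above, in whole cents.

```python
from typing import Callable, Optional

# A tier returns True/False when it can decide, or None to escalate.
Tier = tuple[str, int, Callable[[dict], Optional[bool]]]


def run_ladder(trace: dict, tiers: list[Tier]) -> tuple[Optional[bool], str, int]:
    """Walk tiers cheapest-first; stop at the first one that decides."""
    spent_cents = 0
    for name, cost_cents, check in tiers:
        spent_cents += cost_cents
        verdict = check(trace)
        if verdict is not None:
            return verdict, name, spent_cents
    return None, "undecided", spent_cents


def structural(trace: dict) -> Optional[bool]:
    # Hypothetical free check: too few messages means nothing to detect.
    return False if len(trace["messages"]) < 4 else None


def embedding(trace: dict) -> Optional[bool]:
    # Hypothetical stand-in for a state-delta or embedding score.
    score = trace.get("drift_score")
    if score is None or 0.2 <= score <= 0.8:
        return None  # ambiguous band: escalate to the judge
    return score > 0.8


def llm_judge(trace: dict) -> Optional[bool]:
    # Placeholder for a real judge call; always decides in this sketch.
    return bool(trace.get("judge_says", False))


TIERS: list[Tier] = [
    ("structural", 0, structural),
    ("embedding", 2, embedding),
    ("llm_judge", 10, llm_judge),
]

short = run_ladder({"messages": ["a", "b"]}, TIERS)
ambiguous = run_ladder(
    {"messages": ["m"] * 6, "drift_score": 0.5, "judge_says": True}, TIERS
)
```

The short trace is settled at zero cost by the structural tier, while the ambiguous one pays for every tier it passes through, 12 cents in this sketch. The design choice is that the expensive judge only sees traces the cheap tiers could not settle.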
This is the same chain-of-custody engineering that data teams accepted years ago, applied to verdicts rather than rows. The format I would like the industry to adopt is what I will call a verifier datasheet: one page per grader stating its rubric lineage, label provenance, agreement statistics with prevalence attached, adjudication record, calibration fingerprint, and known limits. Datasets received datasheets in 2018; models received cards the same year; the component that decides what a training run reinforces still ships primarily with a single unexplained number. I am publishing a template and a filled-in example alongside this piece, with the real data behind the agreement and adjudication figures above, including the ones that flatter nobody.
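As a rough illustration of what such a page could look like as a data structure, the sketch below turns the six sections listed above into fields and refuses an agreement figure that arrives without its prevalence. It is a hypothetical shape, not the template published with this piece; every class, field, and method name is an assumption, and the example values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class AgreementStat:
    """One agreement figure, always carried with its sample size and prevalence."""
    name: str                   # e.g. "raw" or "positive_specific"
    value: float
    n_rows: int
    positive_prevalence: float  # fraction of positive rows on the slice

    def __post_init__(self) -> None:
        if not 0.0 <= self.positive_prevalence <= 1.0:
            raise ValueError("positive_prevalence must be a fraction in [0, 1]")
        if self.n_rows <= 0:
            raise ValueError("n_rows must be positive")


@dataclass
class VerifierDatasheet:
    """Hypothetical one-page record for a single grader."""
    grader: str
    rubric_lineage: list[str]
    label_provenance: str
    agreement: list[AgreementStat]
    adjudication_record: str
    calibration_fingerprint: str
    known_limits: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections left empty; an empty list means the page is complete."""
        return [name for name, value in vars(self).items() if not value]


# Placeholder values only; none of these strings are real artifacts.
sheet = VerifierDatasheet(
    grader="example-grader",
    rubric_lineage=["rubric v1", "rubric v2 after adjudication"],
    label_provenance="placeholder: who labeled, on what evidence",
    agreement=[AgreementStat("raw", 0.96, 95, 0.02)],
    adjudication_record="placeholder: link to decision record",
    calibration_fingerprint="sha256:<hex digest>",
)
gaps = sheet.missing_sections()
```

Here `gaps` names the one section left blank, `known_limits`, which is the kind of omission a publishing gate could refuse.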
Implications for the human-data industry
The labor model has already shifted once in a direction that clarifies what expertise-verified calibration is worth. xAI reportedly laid off 500 generalist annotators in September 2025 while announcing a tenfold expansion of specialist tutors; expert rates (Mercor pays contractors an average of $95 an hour) have decoupled entirely from gig-annotation wages. Meanwhile the container is commoditizing: OpenEnv, the environment specification developed by Meta and Hugging Face, has commercial vendors on its steering committee, and when the container is a standard, differentiation migrates to the reward signal inside it.
The categorical claim: the companies that own expert-judgment pipelines are the natural owners of the calibration layer, on one condition. They must treat verifier quality as gated engineering: agreement statistics with the prevalence attached, regression gates, lineage artifacts, adjudication doctrine. Annotation throughput with a rubric attached does not qualify, and Epoch's finding that quality-at-scale is the category's number-one bottleneck suggests the gap between those two postures is where the next round of consolidation will sort winners from the rest.
Here is the falsifiable version of the thesis: within two years, environment buyers will require calibration artifacts from vendors the way enterprise software buyers require SOC 2 today. The analogy includes the failure mode; SOC 2 became a checkbox, and calibration artifacts will face the same pressure. The difference is that a calibration artifact done right is re-executable: content-hashed corpora, reproducible runs, agreement tables a third party can recompute from the published verdict exports. A compliance document can be performed. A regression gate either passes or it does not.
The natural test of the argument is to run it forward: take a failure corpus with adjudicated labels and lineage, and convert it into a hub-format training environment whose grader ships with its datasheet attached. That build log is the next piece.