Benchmarks and evals¶
In the ancestral environment, ML engineers thought a lot about benchmarks: question-answer pairs that served to elicit capabilities. Back then, we would use these to train (increase capabilities) provided that we cleanly separated the train set (for taking losses and doing gradient updates), the validation set (for checking how the gradient updates did), and the test set (after all is said and done, the “hold-out” set).
Lately, machine learning has pivoted a bit toward evals, which are like benchmarks but where the ground truth is a process. In evals, we allow this process to be more expensive than checking integer equality. For example, an eval might use an LLM-as-a-judge to grade the quality of another model’s outputs, run unit tests to verify that generated code behaves correctly, or send the output off to Mechanical Turk and wait for a human verdict to come back.
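To make that concrete, here is a minimal sketch (the function names are mine, not any real harness’s API): the benchmark grader compares against a shipped answer, while the eval grader runs a process, in this case executing the model’s code against a unit-test suite.

```python
# Hypothetical sketch: benchmark grading vs. eval grading.

def grade_benchmark(model_answer: str, ground_truth: str) -> bool:
    # Benchmark: the ground truth is a value; grading is (roughly) equality.
    return model_answer.strip() == ground_truth.strip()

def grade_eval(generated_code: str, unit_tests: str) -> bool:
    # Eval: the ground truth is a process; here, running the model's code
    # against assertions. (exec is fine for a sketch, not for untrusted code.)
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the candidate function(s)
        exec(unit_tests, namespace)      # assertions raise if behavior is wrong
        return True
    except Exception:
        return False
```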
Implicitly, there’s a rough analogy/correspondence where benchmark is to one-shot inference call as eval is to agent. If you don’t know what I mean by this, just keep reading. It will become clear!
Thesis: you need not ship a ground truth when the grader is very high quality¶
The challenge with eval graders is reliability: an LLM-as-a-judge can be inconsistent or biased, unit tests may have incomplete coverage, and human evaluation is expensive and subjective. Traditional benchmarks sidestep this by shipping explicit ground-truth answers, but that limits the complexity of tasks we can evaluate, since constructing ground truth becomes increasingly difficult for open-ended or creative problems.
However, what if the grader were so reliable that we could trust its verdicts without pre-computed answers? Formal verification tools such as Dafny and Lean provide exactly this property. When a proof checker verifies that a program satisfies its specification, or when a theorem prover confirms a mathematical claim, the verdict is not a heuristic judgment but a mathematical certainty. A formal verification toolchain acts as a perfect grader: it’s deterministic, it checks every proof obligation, and its error messages explain why a failing solution fails. This transforms the eval paradigm: instead of shipping question-answer pairs, we can ship question-specification pairs and let the verifier grade arbitrary solutions.
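As a toy illustration (my own, not from any particular dataset), here is what a question-specification pair can look like in Lean 4: the theorem statement is the specification we ship, and whatever proof the model produces is graded by the kernel itself.

```lean
-- The statement is the shipped "question"; the proof term is the model's answer.
-- The kernel either accepts it or rejects it, so no reference proof ships.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```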
Reinforcement learning¶
Reinforcement learning (RL) trains agents by rewarding desired behaviors and penalizing undesired ones through interaction with an environment. Since RLHF, we’ve been using RL in posttraining, as distinct from the supervised next-token-prediction phase (known now as pretraining). In RLHF, humans grade model outputs, and these grades become rewards that shape the model’s behavior through policy gradient methods like PPO.
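As a cartoon of the mechanism (REINFORCE rather than PPO, and a single-output “policy” rather than a language model, so every name here is illustrative), a grade on a sampled output simply weights the log-probability of that output:

```python
# REINFORCE sketch: a grade on a sampled output weights its log-probability.
import torch

vocab_size = 8
policy_logits = torch.zeros(vocab_size, requires_grad=True)  # stand-in "policy"
optimizer = torch.optim.SGD([policy_logits], lr=0.1)

def human_grade(token: int) -> float:
    # Hypothetical grader: pretend the humans prefer output 3.
    return 1.0 if token == 3 else 0.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=policy_logits)
    token = dist.sample()
    reward = human_grade(int(token))
    loss = -reward * dist.log_prob(token)  # reward-weighted log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(policy_logits.softmax(dim=0))  # probability mass concentrates on output 3
```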
With a grader as reliable as a proof checker in formal methods, the path from eval to RL environment becomes remarkably straightforward. In most domains, constructing an RL environment requires careful reward shaping, handling edge cases, and managing noisy feedback signals. In formal methods, however, we get a sparse yet deterministic, high-quality reward signal. In the obvious case, you get what DeepSeek once called “reinforcement learning from proof assistant feedback” (RLPAF): reward of 1 when the proof checker is happy and reward of 0 when the proof checker is sad.
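A sketch of that binary reward, assuming a Lean 4 toolchain on PATH and candidate files that only need core Lean (the helper name and the choice of checker are mine; Dafny or any other verifier would slot in the same way):

```python
# RLPAF-style reward: 1.0 if the proof checker accepts the file, 0.0 otherwise.
import subprocess
import tempfile

def rlpaf_reward(candidate_lean_file: str, timeout_s: int = 60) -> float:
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".lean", delete=False
    ) as f:
        f.write(candidate_lean_file)
        path = f.name
    try:
        result = subprocess.run(
            ["lean", path], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat timeouts as failures
    # Exit code 0 means the file elaborated and type-checked: proof checker happy.
    return 1.0 if result.returncode == 0 else 0.0
```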
In RL environment land, we say that RLPAF is a “high quality and computationally cheap grader”. It’s fantastic that we can score a proposed solution in O(laptop) compute. This gives formal verification a pretty unique and attractive position in the posttraining ecosystem.
Since this reward signal is so sparse, in practice you need to precede the RL run with some supervised finetuning (SFT) on problems easy enough that the model can already solve some of them, so that nonzero rewards show up at all.
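One way to bootstrap that SFT set, as a sketch under the same assumptions as above (`sample_candidate` stands in for whatever base model you query, and `reward` could be the rlpaf_reward sketch from earlier; both are stand-ins, not real APIs):

```python
# Build an SFT dataset by keeping only attempts the verifier accepts.
from typing import Callable

def build_sft_dataset(
    easy_problems: list[str],
    sample_candidate: Callable[[str], str],  # prompt -> candidate proof/program
    reward: Callable[[str], float],          # e.g. the rlpaf_reward sketch above
    attempts_per_problem: int = 8,
) -> list[dict[str, str]]:
    dataset: list[dict[str, str]] = []
    for problem in easy_problems:
        for _ in range(attempts_per_problem):
            candidate = sample_candidate(problem)
            if reward(candidate) == 1.0:  # keep only verified solutions
                dataset.append({"prompt": problem, "completion": candidate})
                break  # one verified solution per problem is plenty for SFT
    return dataset
```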