
Defining terms

Benchmarks and evals

In the ancestral environment, ML engineers thought a lot about benchmarks, which were question-answer pairs that served to elicit capabilities. Back then, we would use these to train (increase capabilities), provided that we cleanly separated train (for taking losses and doing gradient updates), validation (for checking how the gradient updates did), and test sets (after all is said and done, the “hold-out” set).
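
To make the distinction concrete, here is a minimal sketch of a benchmark in this sense; the questions, answers, and split contents are purely illustrative.

```python
# A toy benchmark: question-answer pairs with an explicit ground truth,
# graded by nothing fancier than string equality.
benchmark = {
    "train": [                      # for taking losses and doing gradient updates
        {"question": "2 + 2", "answer": "4"},
        {"question": "capital of France", "answer": "Paris"},
    ],
    "validation": [                 # for checking how the gradient updates did
        {"question": "3 * 7", "answer": "21"},
    ],
    "test": [                       # the hold-out set, touched once at the end
        {"question": "largest planet", "answer": "Jupiter"},
    ],
}

def grade(prediction: str, example: dict) -> bool:
    """Benchmark-style grading: a cheap exact match against shipped ground truth."""
    return prediction.strip() == example["answer"]

accuracy = sum(grade("Jupiter", ex) for ex in benchmark["test"]) / len(benchmark["test"])
```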

Lately, machine learning has pivoted a bit toward evals, which are like benchmarks but where the ground truth is a process. In evals, we allow this process to be more expensive than checking integer equality. For example, an eval might use an LLM-as-a-judge to grade the quality of another model’s outputs, run unit tests to verify that generated code behaves correctly, or send outputs off to MTurk and wait for human judgments to come back.
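
As a sketch of what “ground truth as a process” means, here is a unit-test grader: it never consults a reference answer, it just runs the model’s code against asserts. The helper name and the toy `fib` task are illustrative, not from any particular eval suite.

```python
import subprocess
import sys
import tempfile
import textwrap

def run_unit_tests(generated_code: str, test_code: str) -> float:
    """Eval-style grading: the 'ground truth' is a process (running asserts
    against the model's code), not a stored answer string."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0

# The model is asked for a function; the grader never sees a reference solution.
generated = textwrap.dedent("""
    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a
""")
tests = "assert fib(0) == 0\nassert fib(10) == 55"
score = run_unit_tests(generated, tests)  # 1.0 if the process is satisfied
```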

Implicitly, there’s a rough analogy/correspondence where benchmark is to one-shot inference call as eval is to agent. If you don’t know what I mean by this, just keep reading. It will become clear!

Thesis: you need not ship a ground truth when the grader is very high quality

The challenge with eval graders is reliability, and to a lesser extent cost: LLM-as-a-judge can be inconsistent or biased, unit tests may have incomplete coverage, and human evaluation is expensive and subjective. Traditional benchmarks sidestep this by shipping explicit ground truth answers. But this limits the complexity of tasks we can evaluate, since constructing ground truth becomes increasingly difficult for open-ended or creative problems.

Formal verification techniques, from applied type theories and interactive theorem provers to static analysis and SMT solvers, are the highest-quality compile-time signals we have access to in this year of our lord. While many other QA techniques rely on runtime knowledge, formal methods let us know right away how good our code is. So we get a grader (in O(laptop) compute!) that comes up with deductive judgments on the fly, which is both cheaper and more objective than LLM-as-a-judge and, as we’ll see, spares us from constructing a ground truth at task creation time.
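
As a sketch of what such a grader might look like in practice, assuming a Lean-style proof checker installed and on PATH (the exact CLI invocation differs by toolchain):

```python
import subprocess
import tempfile

def formal_grade(proof_source: str, timeout_s: int = 60) -> bool:
    """Compile-time grading: ask the proof checker whether the file typechecks.
    Deterministic and objective, and it runs in O(laptop) compute: no LLM
    judge, no runtime test harness, no ground truth shipped with the task."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_source)
        path = f.name
    try:
        # Assumes a `lean` executable on PATH; a nonzero exit code means the
        # checker rejected the proof. Swap in your prover's CLI as needed.
        result = subprocess.run(["lean", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```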

Reinforcement learning

Reinforcement learning (RL) trains agents by rewarding desired behaviors and penalizing undesired ones through interaction with an environment. Since RLHF, we’ve been using RL in posttraining, which is distinguished from the supervised next-token-prediction phase (now known as pretraining). In RLHF, humans grade model outputs, and these grades become rewards that shape the model’s behavior through policy gradient methods like PPO.
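
Schematically, the “grades become rewards” step looks something like the REINFORCE-style update below (PPO layers ratio clipping and a value baseline on top of the same idea); the tiny linear “policy” is a stand-in for the language model, not anyone’s actual training code.

```python
import torch

vocab_size, hidden = 100, 32
policy = torch.nn.Linear(hidden, vocab_size)   # stand-in for the LLM's output head
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.randn(1, hidden)                 # stand-in for the encoded prompt
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                         # the sampled "output"

human_grade = 0.8                              # the reward, straight from the grader
# Policy gradient: scale the log-probability of the sampled output by its reward,
# so well-graded outputs become more likely.
loss = -(human_grade * dist.log_prob(action)).mean()
loss.backward()
optimizer.step()
```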

With a grader as reliable as a proof checker in formal methods, the path from eval to RL environment becomes remarkably straightforward. In most domains, constructing an RL environment requires careful reward shaping, handling edge cases, and managing noisy feedback signals. In formal methods, however, we get a sparse yet deterministic and high-quality reward signal. In the obvious case, you get what DeepSeek once called “reinforcement learning from proof assistant feedback” (RLPAF): reward of 1 when the proof checker is happy and reward of 0 when the proof checker is sad.
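
A minimal sketch of such an environment, assuming a `check_proof` callable along the lines of the `formal_grade` sketch above (the class and method names are illustrative, not any framework’s API):

```python
class RLPAFEnv:
    """RLPAF-style environment: the proof checker is the entire reward function."""

    def __init__(self, theorem_statement: str, check_proof):
        self.theorem_statement = theorem_statement
        self.check_proof = check_proof         # e.g. a wrapper around a proof checker

    def reset(self) -> str:
        # The observation is just the goal to prove.
        return self.theorem_statement

    def step(self, proposed_proof: str):
        # Sparse, deterministic, binary reward: 1 when the checker is happy, 0 when sad.
        # No reward shaping, no partial credit, no noisy judge.
        reward = 1.0 if self.check_proof(proposed_proof) else 0.0
        done = True                            # single-shot episode
        return None, reward, done, {}
```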

In RL environment land, we say that RLPAF is a “high quality and computationally cheap grader”. It’s fantastic that we can score a proposed solution in O(laptop) compute. This gives formal verification a pretty unique and attractive position in the posttraining ecosystem.

Since this reward signal is so sparse, in practice you need to prefix your process with some supervised finetuning (SFT) on problems easy enough that the model can already solve them at time t.
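
One hedged sketch of that curriculum step: sample attempts from the current model, keep only the verifier-approved transcripts as SFT data, and leave the rest for the RL stage. All of the helper names here are hypothetical.

```python
def build_sft_set(problems, sample_attempt, check_proof, attempts_per_problem=8):
    """Hypothetical helper: keep only (problem, proof) pairs the current model can
    already solve at time t, as verified by the proof checker. The unsolved
    problems are left for the sparse-reward RL stage that follows SFT."""
    sft_examples, rl_problems = [], []
    for problem in problems:
        for _ in range(attempts_per_problem):
            candidate = sample_attempt(problem)      # sample a proof attempt
            if check_proof(candidate):               # cheap, deterministic grading
                sft_examples.append({"prompt": problem, "completion": candidate})
                break
        else:
            rl_problems.append(problem)              # too hard today; save it for RL
    return sft_examples, rl_problems
```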