DafnyBench is the largest benchmark to date for training and evaluating machine learning systems on formal software verification tasks. Published in June 2024, it provides a standardized way to measure how well LLMs can assist with the tedious but critical task of writing verification annotations.
Dataset Composition
The benchmark consists of 782 ground truth Dafny programs totaling approximately 53,000 lines of code. These programs come from three sources:
GitHub scrape: 556 programs (de-duplicated from ~15,000 files down to ~5,000 unique programs)
Clover benchmark: 62 textbook-style programs
Dafny-synthesis: 164 programs translated from the MBPP (Mostly Basic Python Problems) dataset
The programs are realistic in complexity, with an average of 2.12 methods, 1.03 functions, and 1.40 lemmas per program. Program size varies widely, averaging 1,916 characters and ranging up to 28,736 characters for the largest programs.
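As a rough illustration, statistics like these can be recomputed from a local copy of the benchmark's ground-truth files. The directory name and the keyword-counting heuristic below are illustrative assumptions, not the paper's methodology:

```python
from pathlib import Path

# Hypothetical local checkout of the ground-truth programs (path is an assumption).
programs = sorted(Path("dafny_programs").glob("*.dfy"))
texts = [p.read_text() for p in programs]

def count_keyword(text: str, keyword: str) -> int:
    # Crude whitespace-token count; a real analysis would use Dafny's parser.
    return sum(1 for token in text.split() if token == keyword)

sizes = [len(t) for t in texts]
print(f"{len(texts)} programs, avg {sum(sizes) / len(texts):.0f} chars, max {max(sizes)} chars")
for kw in ("method", "function", "lemma"):
    avg = sum(count_keyword(t, kw) for t in texts) / len(texts)
    print(f"avg {kw}s per program: {avg:.2f}")
```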
The Fill-Hints Task
DafnyBench’s core evaluation task is called fill_hints. Here’s how it works:
Start with a verified Dafny program that compiles without errors
Remove all assert and invariant statements (the “hints” that help the verifier)
Give the LLM the stripped program and ask it to fill the hints back in
Check if the resulting program verifies successfully with Dafny
Critically, DafnyBench does not mark where the hints were removed. There are no /* TODO */ comments or placeholder lines. This makes the task more realistic—in real-world verification, you don’t know in advance exactly where you’ll need hints.
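To make the setup concrete, here is a minimal sketch of what “stripping hints” amounts to. This is not DafnyBench’s actual preprocessing code; a naive line filter like this one misses multi-line and expression-level cases a real pipeline would have to handle:

```python
DAFNY_SOURCE = """\
method Abs(x: int) returns (y: int)
  ensures y >= 0
{
  if x < 0 { y := -x; } else { y := x; }
  assert y >= 0;  // a "hint" the verifier may or may not need
}
"""

def strip_hints(source: str) -> str:
    """Drop lines that carry assert/invariant hints (naive line-level filter)."""
    kept = []
    for line in source.splitlines():
        if line.lstrip().startswith(("assert ", "invariant ")):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"

print(strip_hints(DAFNY_SOURCE))  # The LLM sees only this stripped version.
```

The model’s job is to reconstruct annotations that make the stripped program verify again, without being told which lines were removed.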
A program counts as successfully solved only if:
It verifies with Dafny (no compiler errors)
All original preconditions and postconditions are preserved
The solution doesn’t cheat using {:verify false} or assume false constructs
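A deterministic checker for these criteria is easy to sketch. The snippet below assumes the dafny CLI is on PATH (recent releases support dafny verify) and uses a crude substring test for spec preservation, which is simpler than what the DafnyBench harness actually does:

```python
import re
import subprocess
from pathlib import Path

CHEAT_PATTERNS = [r"\{:verify\s+false\}", r"assume\s+false"]

def preserves_spec(original: str, candidate: str) -> bool:
    """Naive check: every requires/ensures line of the original must reappear verbatim."""
    clauses = re.findall(r"(?:requires|ensures)[^\n]*", original)
    return all(clause.strip() in candidate for clause in clauses)

def is_solved(original_path: str, candidate_path: str) -> bool:
    original = Path(original_path).read_text()
    candidate = Path(candidate_path).read_text()
    if any(re.search(p, candidate) for p in CHEAT_PATTERNS):
        return False                      # cheating constructs are disallowed
    if not preserves_spec(original, candidate):
        return False                      # original pre/postconditions must survive
    result = subprocess.run(
        ["dafny", "verify", candidate_path],
        capture_output=True, text=True, timeout=300,
    )
    return result.returncode == 0         # the Dafny verifier is the final grader
```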
Prompting Strategies
The DafnyBench paper evaluated several state-of-the-art models. The prompting strategy was kept intentionally minimal, so results reflect the models’ inherent capabilities rather than heavy prompt engineering:
For GPT models, the prompt was direct:
“You are an expert in Dafny. Return a complete program with strongest possible annotations filled back in. Do not explain.”
For Claude 3 Opus, they needed to adjust the prompt because Claude tended to add explanatory text that interfered with compilation:
[Emphasized code-only output with stricter formatting directives]
All models were run with temperature 0.3 and a 4,096-token output limit to keep generations focused and consistent across attempts.
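Concretely, a single attempt boils down to a small request like the following. The OpenAI client is used only as an example, the model name is a placeholder, and the system prompt paraphrases the wording reported in the paper:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an expert in Dafny. Return a complete program with the strongest "
    "possible annotations filled back in. Do not explain."
)

def fill_hints(stripped_program: str, model: str = "gpt-4-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0.3,        # low temperature for focused, low-variance output
        max_tokens=4096,        # output limit used in the DafnyBench evaluation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": stripped_program},
        ],
    )
    return response.choices[0].message.content
```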
Scaffolding with Iterative Feedback
The key scaffolding strategy in DafnyBench is iterative refinement with error feedback. Models get up to 10 attempts per program:
Model submits its first attempt at filling in hints
If it fails to verify, the model receives the actual Dafny compiler error message
Model tries again, taking the error message into account
This continues for up to 10 attempts, with early stopping upon success
This scaffolding mimics how a human developer would work with Dafny—you write annotations, run the compiler, read the error messages, and refine your approach.
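A minimal sketch of that loop is shown below, with the LLM call injected as a callable (for instance, the fill_hints function from the previous snippet). The exact feedback format used by DafnyBench’s harness may differ:

```python
import subprocess
import tempfile
from typing import Callable, Optional

MAX_ATTEMPTS = 10

def run_dafny(program_text: str) -> tuple[bool, str]:
    """Write the candidate to a temp file and verify it, returning (success, output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(program_text)
        path = f.name
    result = subprocess.run(
        ["dafny", "verify", path], capture_output=True, text=True, timeout=300
    )
    return result.returncode == 0, result.stdout + result.stderr

def solve(stripped_program: str, ask_llm: Callable[[str], str]) -> Optional[str]:
    """Iterative refinement: up to 10 attempts, feeding Dafny's errors back each time."""
    prompt = stripped_program
    for _ in range(MAX_ATTEMPTS):
        candidate = ask_llm(prompt)               # e.g. the fill_hints sketch above
        ok, output = run_dafny(candidate)
        if ok:
            return candidate                      # early stop on success
        # Re-prompt with the failed attempt and the verifier's actual error output.
        prompt = (
            f"{stripped_program}\n\n"
            f"Your previous attempt:\n{candidate}\n\n"
            f"Dafny reported:\n{output}\n"
            "Fix the annotations and return the complete program. Do not explain."
        )
    return None                                   # unsolved after 10 attempts
```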
Key Results
The best-performing model was Claude 3 Opus with a 67.8% success rate, followed by GPT-4 Turbo (59.8%) and GPT-4o (59.3%). For comparison, GPT-3.5 Turbo achieved 44.0%, and CodeLlama-7b managed only 28.0%, barely above the 26.9% “no LLM” baseline (which returns the stripped program unchanged, succeeding only on programs that verify without any hints).
Several important patterns emerged:
Diminishing returns on attempts: Success rate improved rapidly in the first few attempts but plateaued around 65% after ~5 attempts. This suggests error messages have limited utility—models that don’t solve the problem quickly tend not to solve it at all.
Complexity matters: Success rates declined with both program length and “hint quantity” (the number of characters in the removed annotations). This makes intuitive sense—longer programs with more required hints are genuinely harder problems.
Still significant headroom: Even the best model fails on roughly one-third of programs, indicating substantial room for improvement in both models and verification agents.
Implications for Verification Agents
DafnyBench demonstrates that modern LLMs can handle a significant fraction of verification annotation tasks, but they’re far from replacing human experts. The benchmark’s design—realistic programs, no hint location markers, strict verification requirements—makes it an honest test of whether LLMs can reduce the “verification burden” that makes formal methods impractical for many projects.
For our purposes in this chapter, DafnyBench is a strong foundation for building verification agents: a large dataset of real programs, a clear task definition, and a deterministic grader (the Dafny compiler itself). We don’t need LLM-as-judge or human evaluation; the verifier tells us definitively whether we succeeded.