Now we’ll walk through the complete implementation of the DafnyBench evaluation using Inspect AI. If you’d rather not read this chapter, cloning the repo and chatting with it in Claude Code about evals.dafnybench.inspect_ai is a fine substitute; if you go that route, skip ahead to the next chapter.
## Dataset Loading
DafnyBench can be consumed either as a directory full of .dfy files or from Hugging Face. Hugging Face is the GitHub of ML: it hosts datasets and weights (trained models), and it has become the standard-setter in a few topic areas. Since we’re working with blackbox/API models from Anthropic, we’re not downloading any weights, just the dataset.
All we do here is convert HuggingFace’s data structure to Inspect’s Sample data structure.
```python
"""Dataset loading for DafnyBench."""

from datasets import load_dataset
from inspect_ai.dataset import Sample


def load_dafnybench_dataset() -> list[Sample]:
    """Load the DafnyBench dataset from Hugging Face.

    Returns:
        List of Sample objects with input (hints_removed code) and
        target (ground_truth code).
    """
    hf_dataset = load_dataset("wendy-sun/DafnyBench", split="test")

    samples = [
        Sample(
            input=f"Add verification hints to this Dafny code:\n\n{row['hints_removed']}",
            metadata={
                "test_id": row["test_ID"],
                "test_file": row["test_file"],
                "hints_removed": row["hints_removed"],
                "ground_truth": row["ground_truth"],
            },
        )
        for row in hf_dataset
    ]

    return samples
```
evals/src/evals/dafnybench/inspect_ai/dataset.py
Each Sample has:
- `input`: the prompt with the hints-removed Dafny code
- `metadata`: test ID, file path, and both versions of the code
Notice that Inspect’s Sample abstraction has a target field, but we’re not using it, for the reason I gave earlier: a formal verification benchmark need not ship ground truth.
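For contrast, here is a minimal sketch of what a Sample could look like for a benchmark that does ship ground truth. The field values are made up; only the field names come from Inspect.

```python
from inspect_ai.dataset import Sample

# Hypothetical sample for a ground-truth benchmark (values are illustrative).
# DafnyBench skips `target` because the Dafny verifier itself decides correctness.
sample_with_target = Sample(
    input="What is 2 + 2?",
    target="4",  # a string-matching scorer would compare completions against this
    metadata={"source": "toy-arithmetic"},
)
```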
## The verify_dafny Tool
In the parlance, a non-LLM process that forms a sensor-actuator pair for an LLM is called a tool. It can be a read from or write to a database, an API call querying the weather at some zip code, the reasoning trace of a different LLM, or, in our case, the dafny executable.
sandbox().exec is Inspect’s subprocess.run with extra steps; it’s how you run shell commands from Python.
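Roughly, the correspondence looks like this. This is a sketch, with "program.dfy" as a placeholder path; sandbox().exec must be awaited inside a running Inspect task, so it appears here as a comment.

```python
import subprocess

# Plain Python: run a shell command and inspect exit code / output streams.
proc = subprocess.run(
    ["dafny", "verify", "program.dfy"],
    capture_output=True,
    text=True,
    timeout=30,
)
print(proc.returncode, proc.stdout, proc.stderr)

# The Inspect version is the same idea, but async and routed through whatever
# sandbox the task configured (local process, Docker container, ...):
#
#   result = await sandbox().exec(["dafny", "verify", "program.dfy"], timeout=30)
#   result.returncode, result.stdout, result.stderr
```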
```python
"""Tools for DafnyBench evaluation."""

import tempfile

from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox


@tool
def verify_dafny():
    """Tool that verifies Dafny code and returns verification results.

    The agent calls this tool with generated code to check if verification
    succeeds. If verification fails, detailed error messages are returned
    for the agent to analyze and retry with improved hints.
    """

    async def execute(code: str) -> str:
        """Verify Dafny code and return verification results.

        Args:
            code: Complete Dafny program with verification hints added.

        Returns:
            Success message if verification passes, or raises ToolError
            with diagnostics.
        """
        # Check for verification bypass attempts (following DafnyBench methodology)
        if "{:verify false}" in code.lower():
            raise ToolError(
                "Invalid code: contains {:verify false} which bypasses verification. "
                "You must properly verify the code with correct annotations."
            )

        # Use context manager for automatic cleanup
        with tempfile.NamedTemporaryFile(mode="w", suffix=".dfy", delete=True) as tmp:
            temp_path = tmp.name

            # Write code to temporary file
            await sandbox().write_file(temp_path, code)

            try:
                # Run Dafny verification
                result = await sandbox().exec(
                    ["dafny", "verify", temp_path],
                    timeout=30,
                )

                # Check for successful verification
                if result.returncode == 0 and "0 errors" in result.stdout:
                    return "✓ Verification succeeded! All checks passed."

                # Return detailed error information for the agent to learn from
                error_output = result.stderr if result.stderr else result.stdout
                raise ToolError(f"Verification failed:\n\n{error_output}")

            except TimeoutError:
                raise ToolError(
                    "Verification timed out after 30 seconds. "
                    "The program may be too complex or contain infinite loops."
                )

    return execute
```
evals/src/evals/dafnybench/inspect_ai/tools.py
One important thing to notice, which we’ll discuss more in the next chapter, is the use of the raise keyword. Here, raising an exception is part of the plan: we don’t expect the agent to win on the first try, so a failed verification is the normal case rather than an exceptional one, which may be offensive to some error-handling purists. Inspect’s tool-call abstraction simply expects a tool to raise ToolError whenever the call should be reported back to the model as a failure.
{:verify false} is an escape hatch in Dafny, so we’d consider it cheating if the agent used it; hence the check at the top of execute. We can run dafny on a single file, and with tempfile.NamedTemporaryFile is the way to do it so the file gets cleaned up afterwards. Finally, we read off the exit code to see whether the LLM was successful. By convention, exit code 0 is happiness, no error message, and a nonzero exit code is unhappiness (though if you’re the kind of person who’s happy when the world gives you helpful pointers about how to fix your mistakes, you might not view this as unhappiness)[1].
If verification succeeds, the tool returns a success message. If it fails, the tool raises a ToolError containing the error message. Inspect handles the loop, feeding the error message back into the model. In the next part, we will do that manually.
Recall when I wrote:
Implicitly, there’s a rough analogy/correspondence where benchmark is to one-shot inference call as eval is to agent. If you don’t know what I mean by this, just keep reading. It will become clear!
An agent is just looping LLM API calls with custom sensors and actuators, registered as “tool calls”. In the program-synthesis case, the power of your agent grows monotonically with the expressiveness of your error messages, and more broadly with the width and depth of your compile-time knowledge. This is because the language model at time t+1 reads the error message produced by what it did at time t[2].
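Concretely, by the time the model makes its second attempt, the verifier’s complaint about the first attempt is already sitting in its context. A sketch of the shape of that transcript follows; the message contents are made up, and this is not Inspect’s internal representation.

```python
# Illustrative transcript shape only.
conversation = [
    {"role": "system", "content": "You are an expert in formal verification using Dafny. ..."},
    {"role": "user", "content": "Add verification hints to this Dafny code:\n\n..."},
    # attempt at time t: the model proposes a program via a tool call
    {"role": "assistant", "tool_calls": [{"name": "verify_dafny", "arguments": {"code": "..."}}]},
    # the verifier's error message becomes the observation for time t+1
    {"role": "tool", "content": "Verification failed:\n\n...: this invariant could not be proved ..."},
    # the model's next message conditions on everything above
]
```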
## System Prompt
The system prompt teaches the agent about Dafny verification. It lives in its own module for clarity:
````python
"""System prompt for DafnyBench verification task."""

DAFNY_SYSTEM_PROMPT = """You are an expert in formal verification using Dafny.
Your task is to add verification hints to Dafny programs so they can be verified
by the Dafny compiler.

## Verification Hints You Need to Add

1. **Loop Invariants** (`invariant`): Properties that hold before and after each loop iteration
2. **Assertions** (`assert`): Claims about the program state at specific points
3. **Preconditions** (`requires`): Conditions that must hold when a function is called
4. **Postconditions** (`ensures`): Conditions guaranteed to hold when a function returns
5. **Termination Measures** (`decreases`): Expressions that decrease on each recursive call or loop iteration

## Using the verify_dafny Tool

Once you've added verification hints, use the `verify_dafny` tool to check your work.
Pass your complete Dafny program to the tool.

If verification fails, analyze the error messages carefully and adjust your hints
accordingly. Continue refining until verification succeeds.

**CRITICAL**: If the verification succeeds, STOP IMMEDIATELY. Do not add any
commentary, celebration, explanation, or additional text after receiving a success
message from the tool. The conversation should end silently upon success.

## Format

You may discuss your reasoning, but ensure somewhere in your output is a triple
backtick code block.

### Example Workflow

**Step 1**: Add hints and call tool

```dafny
method Example(n: nat) returns (r: nat)
    ensures r >= n
{{
    r := n + 1;
}}
```

**Step 2**: If tool returns "✓ Verification succeeded! All checks passed."
→ STOP. Do not respond further.

### Example Dafny syntax

```dafny
function factorial(n: nat): nat
    requires n >= 0
    decreases n
{{
    if n == 0 then 1 else n * factorial(n - 1)
}}

method FactorialIter(n: nat) returns (r: nat)
    requires n >= 0
    ensures r == factorial(n)
{{
    r := 1;
    var i := 1;
    while i <= n
        invariant 1 <= i <= n + 1
        invariant r == factorial(i - 1)
    {{
        r := r * i;
        i := i + 1;
    }}
}}
```
"""
````
evals/src/evals/dafnybench/inspect_ai/prompt.py
The prompt explains five types of verification hints the agent needs to add (invariants, assertions, pre/postconditions, termination measures), emphasizes the iterative workflow with the verify_dafny tool, and provides a concrete example[3].
## Utilities
The utilities module provides helper functions for code extraction and error analysis:
```python
CODE_BLOCK_PATTERN = r"```(?:dafny)?\s*\n(.*?)```"


def extract_code_v1(completion: str) -> str:
    """Extract Dafny code from model completion (v1 - buggy version).

    This version extracts from the final completion text, taking the last
    code block.

    Problem: If the model celebrates after success ("Perfect! It worked!"),
    this may extract the wrong code or fail to find any code.

    Args:
        completion: Raw model output potentially with markdown, explanations, etc.

    Returns:
        Cleaned Dafny code.
    """
    matches = re.findall(CODE_BLOCK_PATTERN, completion, re.DOTALL)
    if matches:
        # Use the last code block (model might explain then provide code)
        return matches[-1].strip()

    # If no markdown blocks, return the whole completion
    return completion.strip()
```
evals/src/evals/dafnybench/inspect_ai/utils.py (excerpt)
The extract_code() function handles the agent’s completion, which might include markdown formatting or explanations. It finds code blocks marked with ```dafny or just ```, extracts the code, and returns the last match (since the model might explain its reasoning before providing the final code)[4].
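To make the failure mode in the docstring concrete, here is a quick illustration of how extract_code_v1 behaves with and without a trailing code block; the completions are made up.

````python
# Hypothetical completions, just to exercise the behavior described above.
good = """Here's my attempt:

```dafny
method Double(n: nat) returns (r: nat)
    ensures r == 2 * n
{ r := 2 * n; }
```
"""

celebration_only = "Perfect! The verification succeeded, we're done here."

print(extract_code_v1(good))
# -> just the Dafny method, as intended

print(extract_code_v1(celebration_only))
# -> the celebration text itself (no code block to find), which will never verify
````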
## Task Definition and Scorer
The main module brings everything together, importing from the specialized modules we just covered.
### The Scorer
A scorer, for us, is a simple map from the terms of the subprocess result (exit code and error message) to the terms of the metrics registered with Inspect: the definition of success and failure for the overall task. By convention, Inspect uses a “C” value for success and an “I” value for failure, which I take to stand for correct and incorrect.
```python
@scorer(metrics=[accuracy(), stderr()])
def dafny_verifier() -> Scorer:
    """Score by running Dafny verification on the reconstructed program.

    Executes Dafny locally and scores based on verification success.
    """

    async def score(state: TaskState) -> Score:
        """Score the completion by verifying with Dafny."""
        # Get extraction strategy from metadata (default: v1)
        strategy = state.metadata.get("extraction_strategy", "v1")

        # Extract code using the specified strategy
        code = extract_code(state, strategy=strategy)

        # Use context manager for automatic cleanup
        with tempfile.NamedTemporaryFile(mode="w", suffix=".dfy", delete=True) as tmp:
            temp_path = tmp.name

            try:
                # Write code to temporary file
                await sandbox().write_file(temp_path, code)

                # Run Dafny verification
                result = await sandbox().exec(
                    ["dafny", "verify", temp_path],
                    timeout=30,
                )

                # Check for successful verification
                success = result.returncode == 0 and "0 errors" in result.stdout

                # Prepare explanation
                if success:
                    explanation = "Verification succeeded"
                else:
                    error_output = result.stderr if result.stderr else result.stdout
                    error_type = categorize_error(error_output)
                    explanation = (
                        f"Verification failed ({error_type}):\n{error_output[:500]}"
                    )

                return Score(
                    value="C" if success else "I",
                    answer=code,
                    explanation=explanation,
                )

            except TimeoutError:
```
Dafny verification scorer
The score() function itself:
1. Extracts code from the agent’s completion
2. Writes it to a temporary file
3. Runs the Dafny compiler with a 30-second timeout
4. Checks for success: return code 0 and “0 errors” in the output
5. Returns a Score with:
   - `value`: “C” for correct, “I” for incorrect
   - `answer`: the extracted code
   - `explanation`: a success message, or the error output with a categorized error type
The scorer also handles timeouts and unexpected exceptions, ensuring we always return a valid score even when things go wrong.
The thing to notice is that, in this simple case, we’re repeating the tool-call logic. Think of it as the distinction between dafny as the LLM’s sensor and actuator, and dafny as the watcher’s/supervisor’s (i.e., yours and my) certificate or evidence. So it’s not redundant[5]. Let me know if this isn’t clear to you; it’s an important point and I’m happy to find more ways to reiterate it.
### The Task
Finally, we define the task itself:
```python
@task
def dafnybench(
    model: str | Model | None = None,
    sandbox: str = "local",
    limit: int | None = None,
    extraction_strategy: ExtractionStrategy = ExtractionStrategy.V1,
) -> Task:
    dataset = load_dafnybench_dataset()

    if limit is not None:
        dataset = dataset[:limit]

    return Task(
        dataset=dataset,
        solver=[
            system_message(DAFNY_SYSTEM_PROMPT),
            use_tools(verify_dafny()),
            generate(),  # Handles tool loop automatically
        ],
        scorer=dafny_verifier(),
        sandbox=sandbox,
    )
```
DafnyBench task definition
The task configuration is elegant:
- Dataset: load from Hugging Face, optionally limiting samples for testing
- Solver chain:
  - `system_message()`: add our Dafny verification prompt
  - `use_tools()`: register the `verify_dafny` tool
  - `generate()`: call the model; this handles the entire tool-calling loop automatically
- Scorer: our custom `dafny_verifier()`, which runs the compiler
- Sandbox: set to “local”, since we expect Dafny to be installed
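To actually run it, something like the following should work; the import path, model name, and sample limit are all illustrative, and the `inspect eval` CLI is the other common entry point.

```python
from inspect_ai import eval

# Import path is a guess based on the repo layout shown above; adjust as needed.
from evals.dafnybench.inspect_ai.task import dafnybench

# Smoke test: 5 samples against an Anthropic model (model name is illustrative).
logs = eval(
    dafnybench(limit=5),
    model="anthropic/claude-3-5-sonnet-latest",
)
```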
The magic is in the last solver: generate() with tools enabled. This single solver handles the entire agentic loop:
1. Model generates a response (possibly with tool calls)
2. If tool calls are present, execute them
3. Add the tool results to the conversation
4. Call the model again
5. Repeat until the model provides a final answer or hits limits
No manual loop management. No checking for tool calls. No counting iterations. Inspect handles all of this.
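Schematically, the loop that generate() runs on our behalf looks something like this. This is pseudocode: the function name and the message plumbing are illustrative, not Inspect’s actual implementation.

```python
from inspect_ai.tool import ToolError

# Pseudocode sketch of the tool-calling loop (illustrative only).
async def tool_loop(model, messages, tools):
    while True:
        output = await model.generate(messages, tools=tools)   # model may request tool calls
        messages.append(output.message)
        if not output.message.tool_calls:                      # no tool calls: final answer
            return messages
        for call in output.message.tool_calls:                 # e.g. verify_dafny(code="...")
            try:
                result = await tools[call.name](**call.arguments)
            except ToolError as err:                           # failure is fed back, not fatal
                result = str(err)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```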
[1] Notice that we don’t necessarily know whether the error message will go to stdout or stderr. Lots of tools aren’t intuitive about this: they’ll put error messages in stdout and not use stderr for anything. It’s nice to make your subprocess handler agnostic and flexible here, even if any one tool doesn’t need it; it may help with code reuse later.

[2] There’s an informal sense in which the error message or return code forms a “nonnumeric reward”, which doesn’t literally cash out in the MDP formalism but has a loose vibe kinda like reinforcement learning.

[3] Note the double braces {{}} in the example: these escape the braces so they’re not interpreted as format-string placeholders.

[4] Lots of people like XML markup as well, like <dafny> ... </dafny>. The parsing question is just as trivial in either case; I find that you have to try harder in the system prompt to make the XML version work.

[5] Though you’re free to implement it in a DRYer way.