We’re going to be working with Dafny and Lean. You don’t need prior knowledge of either language to read this: short sections introducing them appear in the appropriate places, and they’re easy to skip[1].
DafnyBench with Inspect AI¶
We’ll solve DafnyBench, a benchmark for repairing Dafny programs so that they verify, using the inspect_ai evaluation framework. Along the way, this chapter introduces Dafny, a verification-aware programming language, and the basics of inspect_ai.
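To give a taste of what “verification-aware” means before the full introduction later: in Dafny, a method can carry a postcondition (`ensures`), and the compiler statically proves it holds on every path, no tests required. A minimal sketch:

```dafny
// Dafny proves at compile time that the result is a non-negative
// value equal to either x or -x; if the proof fails, the program
// does not compile.
method Abs(x: int) returns (y: int)
  ensures y >= 0
  ensures y == x || y == -x
{
  if x < 0 {
    y := -x;
  } else {
    y := x;
  }
}
```

Benchmarks like DafnyBench exercise exactly this mechanism: a program that verifies is one the prover has accepted end to end.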
DafnyBench without a framework¶
To show that you don’t need a fancy framework, we’ll also solve a problem from DafnyBench using plain Python and the Anthropic SDK. This chapter covers practical considerations like data persistence as well.
FVAPPS with Pydantic AI¶
We’ll explore the FVAPPS (Formally Verified APPS) benchmark. This chapter introduces pydantic-ai, a library for building AI agents, and shows how to apply it to formal verification tasks.
CompcertGitHistoryEval¶
This chapter covers CompcertGitHistoryEval, a benchmark built from the git history of CompCert, a formally verified C compiler.
RL posttraining¶
We’ll discuss how the line between evaluation and reinforcement learning is blurring. This chapter explores how feedback from formal verifiers can serve as a reward signal for training agents.
Outlook¶
This chapter discusses the future of formal verification and AI, focusing on how to measure the verification burden.
[1] Honestly, I had a language model write them, but I’ll style-transfer them into some fun literary voices, just to spice things up.