Skip to article frontmatterSkip to article content

Structure of the book

We’re going to be working with Dafny and Lean. You don’t need to know much about Dafny or Lean in particular to read this. There will be short sections introducing the languages in the appropriate places, which are easy to skip[1].

DafnyBench with Inspect AI

We’ll solve DafnyBench, a benchmark for repairing Dafny programs, using the inspect_ai framework. This will introduce Dafny, a verification-aware programming language, and the inspect_ai evaluation framework.

DafnyBench without a framework

To show that you don’t need a fancy framework, we’ll also solve a problem from DafnyBench using plain Python and the Anthropic SDK. This chapter will also cover practical considerations like data persistence.

FVAPPS with Pydantic AI

We’ll explore the Formally Verified Application Programming Process Standard (FVAPPS) benchmark. This chapter introduces pydantic-ai, a library for building AI agents, and shows how it can be used to solve formal verification tasks.

CompcertGitHistoryEval

This section will cover the CompcertGitHistoryEval benchmark, which involves analyzing the git history of the CompCert C compiler.

RL posttraining

We’ll discuss how the line between evaluation and reinforcement learning is blurring. This chapter will explore how to use the feedback from formal verifiers to train agents.

Outlook

This chapter will discuss the future of formal verification and AI, with a focus on measuring the verification burden.

Footnotes
  1. I honestly had the language model write them, but I’ll styletransfer them to some fun literary voices, just to spice things up.