
FVAPPS

The Formally Verified APPS Benchmark

FVAPPS (Formally Verified Automated Programming Progress Standards) is a benchmark designed to test the ability of AI systems to generate formally verified code. It was introduced in the paper “Proving the Coding Interview: A Benchmark for Formally Verified Code Generation” by Quinn Dougherty and Ronak Mehta (2025).

The benchmark extends the popular APPS dataset of programming problems, which were originally solved in Python and checked against unit tests. FVAPPS generalizes those unit tests into theorem statements in the Lean 4 proof assistant. To solve an FVAPPS problem, an AI system must therefore do more than write code that passes a handful of tests: it must write the code and supply a formal, machine-checked proof of that code's correctness.
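To make the setup concrete, here is a hypothetical FVAPPS-style task, written for illustration only: the function name, statement, and theorem below are invented, not taken from the benchmark. The solver receives the theorem statement with placeholder proofs and must supply both the implementation and the proof:

```lean
-- Hypothetical FVAPPS-style task (illustrative; not an actual benchmark sample).
-- The solver must replace both `sorry`s: first implement the function,
-- then prove the specification that generalizes the original unit tests.

def maxOfList (xs : List Nat) : Nat :=
  sorry

-- A unit test such as `assert max_of_list([1, 3, 2]) == 3` generalizes
-- into a universally quantified correctness theorem:
theorem maxOfList_is_upper_bound (xs : List Nat) (x : Nat)
    (h : x ∈ xs) : x ≤ maxOfList xs := by
  sorry
```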

The original paper found that even strong contemporary models struggle with this task: its baseline models, Sonnet and Gemini, proved roughly 30% and 18% of the attempted theorems, respectively. Because FVAPPS supplies the formal specifications, the main difficulty is not writing the code itself but completing the Lean proofs of its correctness. This highlights how hard formal verification remains for current AI systems and gives the field a clear, measurable challenge.

The FVAPPS benchmark is available on Hugging Face.
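A minimal loading sketch, assuming the Hugging Face `datasets` library and the dataset identifier `quinn-dougherty/fvapps` (check the Hub page for the exact id, split names, and field names):

```python
# Minimal sketch: load FVAPPS from the Hugging Face Hub.
# Assumes the dataset id "quinn-dougherty/fvapps" and a "train" split;
# verify both against the Hub page before use.
from datasets import load_dataset

ds = load_dataset("quinn-dougherty/fvapps", split="train")
print(len(ds))  # number of samples
print(ds[0])    # one sample: problem statement plus Lean 4 theorem stubs
```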