
Of the Framework Inspect, and the Scaffolding of Evaluations Most Marvellous

In the fifth month of the year two thousand and four-and-twenty, there did issue forth from the chambers of the United Kingdom’s Institute for the Security of Artificial Intelligence a most wondrous framework, which men call Inspect AI. This engine of evaluation was wrought as a gift to all who would measure the capabilities of those great language models which in our age do proliferate like stars in the firmament.

Now, whilst a knight-errant of code might fashion an evaluation from first principles, wielding the Anthropic SDK as a bare sword (and indeed, we shall undertake such a quest in the next chapter), yet Inspect provideth batteries included, as the saying goes—handling all manner of tedious particulars that the developer might focus upon the essential logic of evaluation itself.

The Three Pillars of Inspect’s Architecture

The framework Inspect doth rest upon three great pillars, after the manner of those temples of antiquity:

The Task, which combineth a dataset of samples, a solver to reason upon them, and a scorer to render judgment. For our DafnyBench trials, a task thus appeareth:

from inspect_ai import Task, task
from inspect_ai.solver import generate

@task
def dafnybench():
    return Task(
        dataset=load_dataset(),   # our loader for the DafnyBench samples
        solver=generate(),        # built-in solver: call the model
        scorer=dafny_scorer(),    # our custom scorer: run the Dafny verifier
    )
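
Once a task is thus defined, it may be run from Python with Inspect's eval() function, or from the command line with the inspect eval command. A minimal sketch (the model name is but an illustration; choose whichever thou wilt):

from inspect_ai import eval

# Run the task against a model of thy choosing (the name here is illustrative).
eval(dafnybench(), model="anthropic/claude-3-5-sonnet-latest")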

The Solver, which is the very brain of the operation—transforming the state of each trial by calling upon the model, engineering prompts, or orchestrating interactions most subtle. The humblest solver is generate(), which merely invoketh the language model and returneth its response. Yet solvers may be chained together as links in a coat of mail, by passing a list such as [chain_of_thought(), generate(), self_critique()]. For agentic quests such as ours, the generate() solver doth handle the entire tool-calling circuit automatically, sparing thee the tedium of manual iteration.
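
By way of illustration, a chained variant of our task might be sketched thus (the name dafnybench_cot is mine own invention; load_dataset() and dafny_scorer() are the same helpers as before):

from inspect_ai import Task, task
from inspect_ai.solver import chain_of_thought, generate, self_critique

@task
def dafnybench_cot():
    # Chain of solvers: reason step by step, answer, then critique the answer.
    return Task(
        dataset=load_dataset(),
        solver=[chain_of_thought(), generate(), self_critique()],
        scorer=dafny_scorer(),
    )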

The Scorer, which evaluateth the solver’s output and assigneth grades. For our purposes, the scorer is simplicity itself: invoke the Dafny compiler. If the program verifieth without error, the score is unity; if it faileth, the score is naught. Behold—no LLM-as-judge is needed, for the verifier itself is our perfect grader, as infallible as the north star.
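
A sketch of such a scorer, wrought with Inspect's scorer decorator (the run_dafny() helper is hypothetical, standing in for whatever command invoketh the verifier on the model's final program):

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def dafny_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # run_dafny() is a hypothetical helper that invokes the Dafny
        # verifier on the model's output and reports whether it verified.
        verified = await run_dafny(state.output.completion)
        return Score(
            value=CORRECT if verified else INCORRECT,
            answer=state.output.completion,
        )
    return score

The CORRECT and INCORRECT marks are reckoned as one and naught by the accuracy() metric, just as the verifier's verdict demandeth.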

Of Tools, and Their Employment in the Great Work

Tools extend the model’s powers beyond mere text generation. When thou registerest a tool with Inspect, the framework doth:

  1. Convert thy Python function to a schema of tool-description

  2. Send available tools to the model in each invocation

  3. Execute tool calls made by the model

  4. Feed tool results back unto the model

  5. Continue this dance until the model ceaseth calling tools

For DafnyBench, we require but a single tool:

from inspect_ai.tool import tool

@tool
def verify_dafny():
    async def execute(program: str):
        """Run the Dafny verifier on a program.

        Args:
            program: The Dafny source code to verify.
        """
        # Execute compiler, return its pronouncements
        ...
    return execute

The model calleth this tool with modified Dafny code, receiveth compiler output, and iterateth—whilst Inspect handleth the entire loop, leaving thee merely to define the tool itself.
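
To grant the model this tool, one addeth use_tools() before generate() in the solver chain. A minimal sketch, assuming a task named dafnybench_agent (mine own invention) and the same helpers as before:

from inspect_ai import Task, task
from inspect_ai.solver import generate, use_tools

@task
def dafnybench_agent():
    return Task(
        dataset=load_dataset(),
        # use_tools() placeth verify_dafny at the model's disposal; generate()
        # then runneth the call-tool / feed-result loop until the model stops.
        solver=[use_tools(verify_dafny()), generate()],
        scorer=dafny_scorer(),
    )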

Why This Framework Deserveth Thy Attention

Before Inspect, every guild and scriptorium built evaluations after their own fashion—custom logging here, bespoke retry logic there, hand-rolled tool-calling loops scattered like leaves in autumn. Inspect standardizeth this chaos: thou receivest a complete evaluation platform with logging, visualization, scoring, and agent scaffolding built in.

The value proposition, as merchants say, is thus: for verification agents in particular, Inspect is well-suited. We require tool-calling (to invoke the verifier), we desire automatic logging (to understand agent behavior), and we have no wish to reimplement agentic loops from scratch.

Of the Framework’s Lineage and Dominion

Inspect launched in May 2024 as an open-source offering, and swiftly became a standard for LLM evaluations, particularly in the realm of AI safety research where reproducibility and rigor are held sacred. The companion inspect_evals repository (wrought in collaboration with the UK AISI, Arcadia Impact, and the Vector Institute) now containeth dozens of community-contributed benchmarks—agent trials such as GAIA and SWE-Bench, coding challenges like HumanEval and BigCodeBench, cybersecurity contests, and knowledge assessments most rigorous.

DafnyBench is not yet numbered among these, for formal verification evaluations are an emerging art. Yet Inspect’s architecture fitteth our purpose as a gauntlet fits a knight’s hand: verification is but another tool-calling task with a deterministic scorer.

When to Employ This Framework, and When to Quest Alone

Inspect shineth brightest when thou desirest the scaffolding described above furnished for thee: tool-calling, logging, visualization, and scoring, with the agentic loop handled by the framework.

Yet it serveth less well when thou requirest full command over every particular of that loop, with naught but the raw SDK betwixt thee and the model.

For DafnyBench, Inspect is perfect—we obtain tool-calling infrastructure, automatic logging, and a clean API, all the things we must needs build ourselves with the raw Anthropic SDK. We shall implement both versions in this chapter, that thou mayest see the difference with thine own eyes.