Inspect AI

Inspect AI is an open-source framework created by the UK AI Security Institute in May 2024 for building and running large language model evaluations. While you could build an evaluation from scratch using the Anthropic SDK directly (we’ll do exactly this in the next chapter), Inspect provides batteries-included abstractions that handle the tedious parts so you can focus on the evaluation logic itself.

Why Inspect?

Before Inspect, every organization built evals slightly differently. Custom logging, bespoke retry logic, hand-rolled tool-calling loops, scattered metric tracking—the same infrastructure code rewritten everywhere. Inspect standardizes this: you get a complete evaluation platform with logging, visualization, scoring, and agent scaffolding built in.

The value proposition is simple: you write the evaluation logic, and Inspect supplies the infrastructure around it.

For verification agents specifically, that is exactly what we need: tool calling (to run the verifier), automatic logging (to understand agent behavior), and an agentic loop we don’t have to reimplement ourselves.

Core Abstractions

Inspect evaluations are built from four primary components: tasks, solvers, scorers, and tools.

Tasks

A task combines a dataset, one or more solvers, and a scorer. Think of it as the complete specification of your evaluation:

from inspect_ai import Task, task
from inspect_ai.solver import generate

@task
def dafnybench():
    return Task(
        dataset=load_dataset(),   # defined below
        solver=generate(),
        scorer=dafny_scorer(),    # defined below
    )

The dataset provides samples with input (the prompt) and target (expected output or grading guidance). For DafnyBench, each sample contains a Dafny program with hints removed.
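
As a sketch of what that dataset construction might look like in code (the local directory layout here is an assumption for illustration, not part of DafnyBench itself):

from pathlib import Path
from inspect_ai.dataset import MemoryDataset, Sample

def load_dataset() -> MemoryDataset:
    # Assumes the hint-stripped programs are available locally as .dfy files.
    samples = [
        Sample(
            input=path.read_text(),        # the Dafny program with hints removed
            target="verifies",             # largely unused: the Dafny verifier is the judge
            metadata={"file": path.name},
        )
        for path in sorted(Path("dafnybench/programs").glob("*.dfy"))
    ]
    return MemoryDataset(samples)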

Solvers

Solvers are the evaluation logic—they transform the task state by calling models, engineering prompts, or orchestrating multi-turn interactions. The most fundamental solver is generate(), which calls the LLM with the current prompt and returns its response.

But solvers compose. You can run several in sequence by passing a list of solvers:

solver = [chain_of_thought(), generate(), self_critique()]

Inspect includes built-in solvers for other common patterns as well, such as system_message() and prompt_template() for prompt engineering, multiple_choice() for formatting choice questions, and use_tools() for making tools available to the model.

For agentic evaluations, the generate() solver handles the entire tool-calling loop automatically. You don’t manually check for tool calls, execute tools, and loop—Inspect does this for you until the model provides a final answer or hits a limit.
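
To make “transforming the task state” concrete, here is a minimal sketch of a custom solver using Inspect’s standard solver pattern; the solver name and prompt text are made up for illustration:

from inspect_ai.solver import Generate, TaskState, solver

@solver
def prepend_instructions():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Edit the task state (here: prefix the user prompt), then hand off
        # to generate(), which calls the model and runs any tool-calling loop.
        state.user_prompt.text = (
            "Restore the missing verification hints.\n\n" + state.user_prompt.text
        )
        return await generate(state)
    return solve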

Scorers

Scorers evaluate solver outputs and assign grades. Inspect supports several scoring approaches: simple text matching (match(), includes()), model-graded scoring (model_graded_qa(), model_graded_fact()), and fully custom scorers like the one we need here.

For DafnyBench, our scorer is simple: run the Dafny compiler. If it verifies without errors, score = 1.0. If it fails, score = 0.0. No LLM-as-judge needed—the verifier is our perfect grader.
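
As a sketch of what that scorer might look like (run_dafny_verifier is a hypothetical helper that shells out to the Dafny CLI and reports success along with the compiler output):

from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def dafny_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # Hypothetical helper: runs the Dafny verifier on the model's
        # final program and returns (verified, compiler_output).
        verified, output = await run_dafny_verifier(state.output.completion)
        return Score(value=1.0 if verified else 0.0, explanation=output)
    return score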

Tools

Tools extend the model’s capabilities beyond text generation. When you register a tool, Inspect:

  1. Converts your Python function to a tool schema (name, description, parameters)

  2. Sends available tools to the model in each generate() call

  3. Executes tool calls made by the model

  4. Feeds tool results back to the model

  5. Continues until the model stops calling tools

For DafnyBench, we need exactly one tool:

@tool
def verify_dafny():
    async def execute(code: str) -> str:
        """Run the Dafny verifier on a program.

        Args:
            code: The Dafny program to verify.
        """
        ...  # Execute the Dafny compiler, return stdout/stderr
    return execute

The model calls this tool with modified Dafny code, gets back compiler output, and iterates. Inspect handles the entire loop—we just define the tool.
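
Putting the pieces together, the agentic version of the task might look roughly like the sketch below; the system prompt wording and the message_limit value are illustrative choices, not requirements of DafnyBench:

from inspect_ai import Task, task
from inspect_ai.solver import generate, system_message, use_tools

@task
def dafnybench_agent():
    return Task(
        dataset=load_dataset(),
        solver=[
            system_message("Fill in the missing Dafny verification hints."),
            use_tools(verify_dafny()),   # expose the verifier as a tool
            generate(),                  # runs the full tool-calling loop
        ],
        scorer=dafny_scorer(),
        message_limit=30,                # bound the agent loop
    )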

Inspect also ships standard built-in tools, including bash() and python() for sandboxed code execution and web_search(), along with sandbox environments (such as Docker) for tools that need isolation.

Metrics and Logging

Every Inspect evaluation automatically logs the full conversation for each sample (including every tool call and its result), the model configuration, token usage, scores, and any errors.

This logging feeds into inspect view, a web-based log viewer that lets you browse samples, step through each message and tool call in a transcript, and see exactly how every score was assigned.

You don’t write any of this infrastructure. You define your task, run it, and Inspect handles observability.
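
Concretely, running the task is a single call. This sketch uses Inspect’s Python eval() entry point (the inspect eval CLI is the shell equivalent); the model choice is just an example:

from inspect_ai import eval

# Runs the evaluation; logs are written under ./logs by default and can be
# browsed afterwards with `inspect view`.
logs = eval(dafnybench(), model="anthropic/claude-3-5-sonnet-20241022")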

History and Ecosystem

Inspect launched in May 2024 from the UK AI Security Institute as an open-source project. It quickly became a standard for LLM evaluations, particularly in AI safety research where reproducibility and rigor matter.

The companion inspect_evals repository (created in collaboration with the UK AISI, Arcadia Impact, and the Vector Institute) now includes dozens of community-contributed benchmarks:

Agent benchmarks: GAIA (general assistant questions requiring reasoning and tool use), SWE-Bench (software engineering tasks), AssistantBench (realistic web agent tasks), Mind2Web (web navigation), OSWorld (multimodal computer interaction)

Coding benchmarks: HumanEval, MBPP, APPS, BigCodeBench (1,140 Python problems), DS-1000 (data science across 7 libraries), SciCode (research coding), USACO (olympiad problems)

Cybersecurity: Cybench (capture-the-flag), CVEBench (exploit real vulnerabilities), CyberSecEval 3, 3CB (offensive capabilities)

Knowledge & reasoning: GPQA (graduate-level science questions), MMMU (multimodal understanding), ARC (grade school science), DROP (reading comprehension), and many more

DafnyBench isn’t in this repository yet—formal verification evaluations are an emerging use case. But Inspect’s architecture is a natural fit: verification is just another tool-calling task with a deterministic scorer.

When to Use Inspect

Inspect shines when you want standard evaluation infrastructure out of the box: a tool-calling loop, logging and a log viewer, reusable solvers and scorers, and runs that others can reproduce.

It’s less useful when you need fine-grained control over every API call and loop decision, or when you want to see exactly what the scaffolding is doing under the hood.

For DafnyBench, Inspect is perfect. We get tool-calling infrastructure, automatic logging, and a clean API—all the things we’d have to build ourselves with the raw Anthropic SDK. We’ll implement both versions in this chapter so you can see the difference.