In the fifth month of the year two thousand and four-and-twenty, there did issue forth from the chambers of the United Kingdom’s Institute for the Security of Artificial Intelligence a most wondrous framework, which men call Inspect AI. This engine of evaluation was wrought as a gift to all who would measure the capabilities of those great language models which in our age do proliferate like stars in the firmament.
Now, whilst a knight-errant of code might fashion an evaluation from first principles, wielding the Anthropic SDK as a bare sword (and indeed, we shall undertake such a quest in the next chapter), yet Inspect provideth batteries included, as the saying goes—handling all manner of tedious particulars that the developer might focus upon the essential logic of evaluation itself.
The Three Pillars of Inspect’s Architecture
The framework Inspect doth rest upon three great pillars, after the manner of those temples of antiquity:
The Task, which combineth a dataset of samples, a solver to reason upon them, and a scorer to render judgment. For our DafnyBench trials, a task thus appeareth:
from inspect_ai import Task, task
from inspect_ai.solver import generate

@task
def dafnybench():
    return Task(
        dataset=load_dataset(),
        solver=generate(),
        scorer=dafny_scorer(),
    )

The Solver, which is the very brain of the operation—transforming the state of each trial by calling upon the model, engineering prompts, or orchestrating interactions most subtle. The humblest solver is generate(), which merely invoketh the language model and returneth its response. Yet solvers may be chained together as links in a coat of mail: chain_of_thought() + generate() + self_critique(). For agentic quests such as ours, the generate() solver doth handle the entire tool-calling circuit automatically, sparing thee the tedium of manual iteration.
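By way of illustration, a sketch of such a chaining. The solvers named are real components of inspect_ai.solver, but the task name and the reuse of load_dataset() and dafny_scorer() are mine own, offered only as an example:

from inspect_ai import Task, task
from inspect_ai.solver import chain_of_thought, generate, self_critique

@task
def dafnybench_chained():
    # The solver argument accepts either a single solver or a list;
    # each solver in the list transforms the task state in turn.
    return Task(
        dataset=load_dataset(),      # as in the task above
        solver=[chain_of_thought(), generate(), self_critique()],
        scorer=dafny_scorer(),       # as in the task above
    )

Each solver receiveth the task state from its predecessor and passeth it onward, so the chain readeth as a small pipeline.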
The Scorer, which evaluateth the solver’s output and assigneth grades. For our purposes, the scorer is simplicity itself: invoke the Dafny compiler. If the program verifieth without error, the score is unity; if it faileth, the score is naught. Behold—no LLM-as-judge is needed, for the verifier itself is our perfect grader, as infallible as the north star.
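To render the point concrete, here is a sketch of such a scorer, assuming the dafny command-line verifier is installed and that the solver hath left the completed program in the model’s final output; the invocation details and the blocking subprocess call are illustrative, not gospel:

import subprocess
import tempfile

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def dafny_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # Assume the solver left the full Dafny program in the final completion.
        program = state.output.completion
        with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
            f.write(program)
            path = f.name
        # `dafny verify` exits 0 only when every proof obligation is discharged.
        result = subprocess.run(
            ["dafny", "verify", path],
            capture_output=True, text=True, timeout=300,
        )
        return Score(
            value=CORRECT if result.returncode == 0 else INCORRECT,
            explanation=(result.stdout + result.stderr)[-2000:],
        )

    return score

A production scorer would extract the program more carefully (the completion may carry prose around the code) and guard against timeouts, yet the essence is as above.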
Of Tools, and Their Employment in the Great Work
Tools extend the model’s powers beyond mere text generation. When thou registerest a tool with Inspect, the framework doth:
Convert thy Python function into a tool-description schema
Send available tools to the model in each invocation
Execute tool calls made by the model
Feed tool results back unto the model
Continue this dance until the model ceaseth calling tools
For DafnyBench, we require but a single tool:
from inspect_ai.tool import tool

@tool
def verify_dafny():
    async def execute(code: str):
        """Run the Dafny verifier on a program."""
        # Execute compiler, return its pronouncements
        ...
    return execute

The model calleth this tool with modified Dafny code, receiveth compiler output, and iterateth—whilst Inspect handleth the entire loop, leaving thee merely to define the tool itself.
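For completeness, a fuller sketch of the tool and its wiring. Inspect’s tool decorator expecteth a factory that returneth an async execute function, whose signature and docstring become the schema sent unto the model. The body below, which writeth the code to a temporary file and calleth the dafny executable directly, is an assumption of mine for illustration (a real agent might prefer Inspect’s sandboxed execution), and the task merely repeateth the earlier sketch with the tool now wired in:

import subprocess
import tempfile

from inspect_ai import Task, task
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool

@tool
def verify_dafny():
    async def execute(code: str):
        """Run the Dafny verifier on a program.

        Args:
            code: Dafny source code to verify.

        Returns:
            The verifier's output, for the model to read and act upon.
        """
        with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            ["dafny", "verify", path],
            capture_output=True, text=True, timeout=300,
        )
        return result.stdout + result.stderr

    return execute

@task
def dafnybench():
    return Task(
        dataset=load_dataset(),      # as in the task sketched above
        solver=[use_tools(verify_dafny()), generate()],
        scorer=dafny_scorer(),       # as in the task sketched above
    )

The use_tools() solver placeth the tool into the task state, whereupon generate() conducteth the calling loop described in the list above.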
Why This Framework Deserveth Thy Attention
Before Inspect, every guild and scriptorium built evaluations after their own fashion—custom logging here, bespoke retry logic there, hand-rolled tool-calling loops scattered like leaves in autumn. Inspect standardizeth this chaos: thou receivest a complete evaluation platform with logging, visualization, scoring, and agent scaffolding built in.
The value proposition, as merchants say, is thus:
Less boilerplate: Define thy task, provide tools, write a scorer. Inspect handleth all else.
Free observability: The inspect view web interface and VS Code extension permit thee to replay evaluation runs and divine the causes of failures without writing any interface code thyself.
Reusable components: Share solvers, scorers, and tools across evaluations and organizations, as monasteries once shared illuminated manuscripts.
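By way of example, running an evaluation from Python and coming by these logs might look thus; the model identifier is merely illustrative:

from inspect_ai import eval

# Run the task; by default Inspect writes one log file per run under ./logs.
# The model identifier below is only an example.
logs = eval(dafnybench(), model="anthropic/claude-3-5-sonnet-latest")

# Afterwards, `inspect view` at the command line opens the log viewer
# over those runs for replay and debugging.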
For verification agents in particular, Inspect is well-suited: we require tool-calling (to invoke the verifier), we desire automatic logging (to understand agent behavior), and we have no wish to reimplement agentic loops from scratch.
Of the Framework’s Lineage and Dominion
Inspect launched in May 2024 as an open-source offering, and swiftly became a standard for LLM evaluations, particularly in the realm of AI safety research where reproducibility and rigor are held sacred. The companion inspect_evals repository (wrought in collaboration with the UK AISI, Arcadia Impact, and the Vector Institute) now containeth dozens of community-contributed benchmarks—agent trials such as GAIA and SWE-Bench, coding challenges like HumanEval and BigCodeBench, cybersecurity contests, and knowledge assessments most rigorous.
DafnyBench is not yet numbered among these, for formal verification evaluations are an emerging art. Yet Inspect’s architecture fitteth our purpose as a gauntlet fits a knight’s hand: verification is but another tool-calling task with a deterministic scorer.
When to Employ This Framework, and When to Quest Alone
Inspect shineth brightest when thou desirest:
Agent evaluations with tool use and multi-turn interactions
Reproducible benchmarks with standardized logging and metrics
Rapid iteration with built-in observability and debugging
Yet it serveth less well when thou requirest:
Full control over every aspect of the agent loop (use the raw SDK instead)
Ultra-minimal dependencies (Inspect bringeth ~50 packages; sometimes thou wantest naught but anthropic)
For DafnyBench, Inspect is perfect—we obtain tool-calling infrastructure, automatic logging, and a clean API, all the things we would needs build ourselves with the raw Anthropic SDK. We shall implement both versions in this chapter, that thou mayest see the difference with thine own eyes.