
Of the Framework Inspect, and the Scaffolding of Evaluations Most Marvellous

In the fifth month of the year two thousand and four-and-twenty, there did issue forth from the chambers of the United Kingdom’s Institute for the Security of Artificial Intelligence a most wondrous framework, which men call Inspect AI. This engine of evaluation was wrought as a gift to all who would measure the capabilities of those great language models which in our age do proliferate like stars in the firmament.

Now, whilst a knight-errant of code might fashion an evaluation from first principles, wielding the Anthropic SDK as a bare sword (and indeed, we shall undertake such a quest in the next chapter), yet Inspect provideth batteries included, as the saying goes—handling all manner of tedious particulars that the developer might focus upon the essential logic of evaluation itself.

The Three Pillars of Inspect’s Architecture

The framework Inspect doth rest upon three great pillars, after the manner of those temples of antiquity:

The Task, which combineth a dataset of samples, a solver to reason upon them, and a scorer to render judgment. For our DafnyBench trials, a task thus appeareth:

from inspect_ai import Task, task
from inspect_ai.solver import generate

@task
def dafnybench():
    return Task(
        dataset=load_dataset(),   # our loader for the DafnyBench samples
        solver=generate(),        # built-in solver: call the model
        scorer=dafny_scorer(),    # our custom scorer: run the Dafny verifier
    )
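
Once a task is thus defined, it may be run from Python with Inspect's eval() function, or from the command line with the inspect eval command. A minimal sketch (the model name is but an illustration; choose whichever thou wilt):

from inspect_ai import eval

# Run the task against a model of thy choosing (the name here is illustrative).
eval(dafnybench(), model="anthropic/claude-3-5-sonnet-latest")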

The Solver, which is the very brain of the operation—transforming the state of each trial by calling upon the model, engineering prompts, or orchestrating interactions most subtle. The humblest solver is generate(), which merely invoketh the language model and returneth its response. Yet solvers may be chained together as links in a coat of mail, by passing a list such as [chain_of_thought(), generate(), self_critique()]. For agentic quests such as ours, the generate() solver doth handle the entire tool-calling circuit automatically, sparing thee the tedium of manual iteration.
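
By way of illustration, a chained variant of our task might be sketched thus (the name dafnybench_cot is mine own invention; load_dataset() and dafny_scorer() are the same helpers as before):

from inspect_ai import Task, task
from inspect_ai.solver import chain_of_thought, generate, self_critique

@task
def dafnybench_cot():
    # Chain of solvers: reason step by step, answer, then critique the answer.
    return Task(
        dataset=load_dataset(),
        solver=[chain_of_thought(), generate(), self_critique()],
        scorer=dafny_scorer(),
    )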

The Scorer, which evaluateth the solver’s output and assigneth grades. For our purposes, the scorer is simplicity itself: invoke the Dafny compiler. If the program verifieth without error, the score is unity; if it faileth, the score is naught. Behold—no LLM-as-judge is needed, for the verifier itself is our perfect grader, as infallible as the north star.
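
A sketch of such a scorer, wrought with Inspect's scorer decorator (the run_dafny() helper is hypothetical, standing in for whatever command invoketh the verifier on the model's final program):

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def dafny_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        # run_dafny() is a hypothetical helper that invokes the Dafny
        # verifier on the model's output and reports whether it verified.
        verified = await run_dafny(state.output.completion)
        return Score(
            value=CORRECT if verified else INCORRECT,
            answer=state.output.completion,
        )
    return score

The CORRECT and INCORRECT marks are reckoned as one and naught by the accuracy() metric, just as the verifier's verdict demandeth.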

Of Tools, and Their Employment in the Great Work

Tools extend the model’s powers beyond mere text generation. When thou registerest a tool with Inspect, the framework doth:

  1. Convert thy Python function to a schema of tool-description

  2. Send available tools to the model in each invocation

  3. Execute tool calls made by the model

  4. Feed tool results back unto the model

  5. Continue this dance until the model ceaseth calling tools

For DafnyBench, we require but a single tool:

from inspect_ai.tool import tool

@tool
def verify_dafny():
    async def execute(program: str):
        """Run the Dafny verifier on a program.

        Args:
            program: The Dafny source code to verify.
        """
        # Execute compiler, return its pronouncements
        ...
    return execute

The model calleth this tool with modified Dafny code, receiveth compiler output, and iterateth—whilst Inspect handleth the entire loop, leaving thee merely to define the tool itself.
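
To grant the model this tool, one addeth use_tools() before generate() in the solver chain. A minimal sketch, assuming a task named dafnybench_agent (mine own invention) and the same helpers as before:

from inspect_ai import Task, task
from inspect_ai.solver import generate, use_tools

@task
def dafnybench_agent():
    return Task(
        dataset=load_dataset(),
        # use_tools() placeth verify_dafny at the model's disposal; generate()
        # then runneth the call-tool / feed-result loop until the model stops.
        solver=[use_tools(verify_dafny()), generate()],
        scorer=dafny_scorer(),
    )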

Why This Framework Deserveth Thy Attention

Before Inspect, every guild and scriptorium built evaluations after their own fashion—custom logging here, bespoke retry logic there, hand-rolled tool-calling loops scattered like leaves in autumn. Inspect standardizeth this chaos: thou receivest a complete evaluation platform with logging, visualization, scoring, and agent scaffolding built in.

The value proposition, as merchants say, is thus: for verification agents in particular, Inspect is well-suited. We require tool-calling (to invoke the verifier), we desire automatic logging (to understand agent behavior), and we have no wish to reimplement agentic loops from scratch.

Of the Framework’s Lineage and Dominion

Inspect launched in May 2024 as an open-source offering, and swiftly became a standard for LLM evaluations, particularly in the realm of AI safety research where reproducibility and rigor are held sacred. The companion inspect_evals repository (wrought in collaboration with the UK AISI, Arcadia Impact, and the Vector Institute) now containeth dozens of community-contributed benchmarks—agent trials such as GAIA and SWE-Bench, coding challenges like HumanEval and BigCodeBench, cybersecurity contests, and knowledge assessments most rigorous.

DafnyBench is not yet numbered among these, for formal verification evaluations are an emerging art. Yet Inspect’s architecture fitteth our purpose as a gauntlet fits a knight’s hand: verification is but another tool-calling task with a deterministic scorer.

When to Employ This Framework, and When to Quest Alone

Inspect shineth brightest when thou desirest the scaffolding described above furnished for thee: tool-calling, logging, visualization, and scoring, with the agentic loop handled by the framework.

Yet it serveth less well when thou requirest full command over every particular of that loop, with naught but the raw SDK betwixt thee and the model.

For DafnyBench, Inspect is perfect—we obtain tool-calling infrastructure, automatic logging, and a clean API, all the things we must needs build ourselves with the raw Anthropic SDK. We shall implement both versions in this chapter, that thou mayest see the difference with thine own eyes.