We’ve seen the individual components of our plain Python agent. Now let’s take a step back and look at the complete architecture.
The codebase is organized into seven modules, each with a clear responsibility:
- `__init__.py` (81 lines): The entry point for the evaluation.
- `agent.py` (446 lines): The core of the agent, including the manual tool-calling loop.
- `tools.py` (437 lines): The specialized insertion tools and the `verify_dafny` tool.
- `config.py` (128 lines): The TOML-based configuration system.
- `types.py` (81 lines): The Pydantic models for the data and results.
- `metrics.py` (40 lines): The logic for aggregating the evaluation results.
- `prompt.py` (21 lines): Backwards compatibility for the prompt.
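Laid out as a package, that looks roughly like this (the package name is illustrative; the modules are the ones listed above):

```text
plain_agent/            # illustrative package name
├── __init__.py         # evaluation entry point
├── agent.py            # manual tool-calling loop
├── tools.py            # insertion tools + verify_dafny
├── config.py           # TOML configuration
├── types.py            # Pydantic data/result models
├── metrics.py          # result aggregation
└── prompt.py           # backwards-compat prompt
```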
The data flows through the system as follows:
1. The `__init__.py` module loads the dataset and iterates through the samples.
2. For each sample, it calls the `run_agent_on_sample` function in `agent.py`.
3. The agent loop in `agent.py` calls the model and the tools in `tools.py`.
4. The `verify_dafny` tool in `tools.py` runs the Dafny compiler and returns the results.
5. The `agent.py` module records the results and returns them to `__init__.py`.
6. The `__init__.py` module aggregates the results using the `metrics.py` module.
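To make the flow concrete, here is a minimal sketch of what the driver in `__init__.py` might look like. Only `run_agent_on_sample` comes from the actual code; the helper names, the dataset format, and the return shape are assumptions for illustration.

```python
import json
from pathlib import Path

from .agent import run_agent_on_sample
from .config import load_config          # hypothetical TOML loader
from .metrics import aggregate_results   # hypothetical aggregation helper


def evaluate(dataset_path: Path) -> dict:
    """Run the agent on every sample and aggregate the results."""
    config = load_config()
    # Hypothetical dataset format: one JSON object per line.
    samples = [json.loads(line) for line in dataset_path.read_text().splitlines() if line]

    results = []
    for sample in samples:
        # agent.py runs the tool-calling loop; tools.py supplies verify_dafny
        results.append(run_agent_on_sample(sample, config))

    # metrics.py turns per-sample results into aggregate numbers
    return aggregate_results(results)
```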
One interesting detail is that our plain Python implementation reuses the `categorize_error` function from the inspect-ai implementation. This is a good example of how you can share code between different agent implementations, even if they use different frameworks (or no framework at all).
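In sketch form, the sharing is just a plain import. The module path below is an assumption for illustration; only the `categorize_error` name comes from the actual code.

```python
# Reuse the same error categorization as the inspect-ai evaluation.
from inspect_agent.errors import categorize_error  # hypothetical import path

# Example verifier output; the returned category feeds the shared metrics.
verifier_output = "example.dfy(12,3): Error: assertion might not hold"
error_category = categorize_error(verifier_output)
```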
The plain Python implementation also handles failure modes such as timeouts, verification bypass attempts, and parsing errors. And the artifact naming convention, which writes a `_final.dfy` file only for successful verifications, is a small detail that makes the agent’s output much easier to work with.
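A rough sketch of that convention, assuming a per-sample working directory; the helper and the fallback suffix are hypothetical, and only the `_final.dfy` suffix comes from the implementation.

```python
from pathlib import Path


def save_artifact(work_dir: Path, sample_id: str, source: str, verified: bool) -> Path:
    """Hypothetical helper: write the agent's Dafny output, using the
    _final.dfy suffix only when verification succeeded."""
    suffix = "_final.dfy" if verified else "_attempt.dfy"  # fallback suffix is assumed
    path = work_dir / f"{sample_id}{suffix}"
    path.write_text(source)
    return path
```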