We’ve seen the individual components of our plain Python agent. Now let’s take a step back and look at the complete architecture.
The codebase is organized into seven modules, each with a clear responsibility:
- `__init__.py` (81 lines): The entry point for the evaluation.
- `agent.py` (446 lines): The core of the agent, including the manual tool-calling loop.
- `tools.py` (437 lines): The specialized insertion tools and the `verify_dafny` tool.
- `config.py` (128 lines): The TOML-based configuration system.
- `types.py` (81 lines): The Pydantic models for the data and results.
- `metrics.py` (40 lines): The logic for aggregating the evaluation results.
- `prompt.py` (21 lines): Backwards compatibility for the prompt.
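Laid out as a package, that looks roughly like this (the package name is illustrative; the modules are the ones listed above):

```text
plain_agent/            # illustrative package name
├── __init__.py         # evaluation entry point
├── agent.py            # manual tool-calling loop
├── tools.py            # insertion tools + verify_dafny
├── config.py           # TOML configuration
├── types.py            # Pydantic data/result models
├── metrics.py          # result aggregation
└── prompt.py           # backwards-compat prompt
```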
The data flows through the system as follows:
1. The `__init__.py` module loads the dataset and iterates through the samples.
2. For each sample, it calls the `run_agent_on_sample` function in `agent.py`.
3. The agent loop in `agent.py` calls the model and the tools in `tools.py`.
4. The `verify_dafny` tool in `tools.py` runs the Dafny compiler and returns the results.
5. The `agent.py` module records the results and returns them to `__init__.py`.
6. The `__init__.py` module aggregates the results using the `metrics.py` module.
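To make the flow concrete, here is a minimal sketch of what the driver in `__init__.py` might look like. Only `run_agent_on_sample` comes from the actual code; the helper names, the dataset format, and the return shape are assumptions for illustration.

```python
import json
from pathlib import Path

from .agent import run_agent_on_sample
from .config import load_config          # hypothetical TOML loader
from .metrics import aggregate_results   # hypothetical aggregation helper


def evaluate(dataset_path: Path) -> dict:
    """Run the agent on every sample and aggregate the results."""
    config = load_config()
    # Hypothetical dataset format: one JSON object per line.
    samples = [json.loads(line) for line in dataset_path.read_text().splitlines() if line]

    results = []
    for sample in samples:
        # agent.py runs the tool-calling loop; tools.py supplies verify_dafny
        results.append(run_agent_on_sample(sample, config))

    # metrics.py turns per-sample results into aggregate numbers
    return aggregate_results(results)
```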
One interesting detail is that our plain Python implementation reuses the `categorize_error` function from the inspect-ai implementation. This is a good example of how you can share code between different agent implementations, even if they use different frameworks (or no framework at all).
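In sketch form, the sharing is just a plain import. The module path below is an assumption for illustration; only the `categorize_error` name comes from the actual code.

```python
# Reuse the same error categorization as the inspect-ai evaluation.
from inspect_agent.errors import categorize_error  # hypothetical import path

# Example verifier output; the returned category feeds the shared metrics.
verifier_output = "example.dfy(12,3): Error: assertion might not hold"
error_category = categorize_error(verifier_output)
```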
The plain Python implementation also handles failure modes such as timeouts, verification bypass attempts, and parsing errors. And the artifact naming convention, which writes a `_final.dfy` file only for successful verifications, is a small detail that makes the agent’s output much easier to work with.
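A rough sketch of that convention, assuming a per-sample working directory; the helper and the fallback suffix are hypothetical, and only the `_final.dfy` suffix comes from the implementation.

```python
from pathlib import Path


def save_artifact(work_dir: Path, sample_id: str, source: str, verified: bool) -> Path:
    """Hypothetical helper: write the agent's Dafny output, using the
    _final.dfy suffix only when verification succeeded."""
    suffix = "_final.dfy" if verified else "_attempt.dfy"  # fallback suffix is assumed
    path = work_dir / f"{sample_id}{suffix}"
    path.write_text(source)
    return path
```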