
Code Walkthrough: Tying It All Together

We’ve seen the individual components of our plain Python agent. Now let’s take a step back and look at the complete architecture.

The codebase is organized into seven modules, each with a clear responsibility:

The data flows through the system as follows:

  1. The __init__.py module loads the dataset and iterates through the samples.

  2. For each sample, it calls the run_agent_on_sample function in agent.py.

  3. The agent loop in agent.py calls the model and the tools in tools.py.

  4. The verify_dafny tool in tools.py runs the Dafny compiler and returns the results.

  5. The agent.py module records the results and returns them to __init__.py.

  6. The __init__.py module aggregates the results using the metrics.py module.
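The flow above can be sketched in a few dozen lines. This is only an illustration under stated assumptions, not the actual code: run_agent_on_sample and verify_dafny are the names used in this walkthrough, but their signatures here are guesses, and SampleResult, aggregate, evaluate, and fake_model are hypothetical names. The model and the Dafny compiler are replaced by stand-ins so the sketch runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SampleResult:
    """Hypothetical record of one sample's outcome."""
    sample_id: str
    verified: bool
    attempts: int


def verify_dafny(code: str) -> Tuple[bool, str]:
    """Stand-in for the tool in tools.py.

    The real tool runs the Dafny compiler; this stub only checks for an
    `ensures` clause so the example needs no Dafny installation.
    """
    ok = "ensures" in code
    return ok, "verified" if ok else "error: no postcondition found"


def run_agent_on_sample(
    sample: Dict[str, str],
    model: Callable[[str, str], str],
    max_attempts: int = 3,
) -> SampleResult:
    """Agent loop: ask the model for code, verify it, feed errors back."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = model(sample["prompt"], feedback)
        ok, feedback = verify_dafny(code)
        if ok:
            return SampleResult(sample["id"], True, attempt)
    return SampleResult(sample["id"], False, max_attempts)


def aggregate(results: List[SampleResult]) -> Dict[str, float]:
    """Stand-in for metrics.py: summarize per-sample results."""
    total = len(results)
    verified = sum(r.verified for r in results)
    return {
        "total": total,
        "verified": verified,
        "accuracy": verified / total if total else 0.0,
    }


def evaluate(dataset, model) -> Dict[str, float]:
    """Stand-in for __init__.py: iterate over samples, then aggregate."""
    results = [run_agent_on_sample(sample, model) for sample in dataset]
    return aggregate(results)


def fake_model(prompt: str, feedback: str) -> str:
    """Toy model that only adds a postcondition after seeing feedback."""
    if feedback:
        return "method M() ensures true {}"
    return "method M() {}"


dataset = [{"id": "sample-1", "prompt": "Write a verified method"}]
metrics = evaluate(dataset, fake_model)
```

The toy model fails its first attempt and succeeds on the second, which exercises the feedback path through the loop.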

One interesting detail is that our plain Python implementation reuses the categorize_error function from the inspect-ai implementation. This is a good example of how you can share code between different agent implementations, even if they use different frameworks (or no framework at all).
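To see why this kind of sharing works, consider what an error categorizer might look like. The sketch below is hypothetical: the real categorize_error may take different arguments and return different categories, and the keywords matched here are assumptions rather than a list of Dafny's exact messages. The point is that the function depends only on the verifier's text output, so any agent implementation can import it.

```python
def categorize_error(output: str) -> str:
    """Map raw verifier output to a coarse category (hypothetical labels).

    The function has no dependency on any agent framework: it takes a
    string and returns a string, which is what makes it easy to reuse.
    """
    text = output.lower()
    if "timed out" in text or "timeout" in text:
        return "timeout"
    if "invariant" in text:
        return "invariant"
    if "postcondition" in text:
        return "postcondition"
    if "precondition" in text:
        return "precondition"
    if "parse" in text or "syntax" in text:
        return "syntax"
    return "other"


# Both a framework-based agent and a plain Python agent could call it
# the same way on whatever text their verification step produced.
category = categorize_error("Error: a postcondition could not be proved")
```

Because the function is pure, it is also easy to unit test independently of either agent.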

The plain Python implementation also includes error handling for timeouts, verification bypass attempts, and parsing failures. And the artifact naming convention, which marks each successful verification with a _final.dfy file, makes the agent’s output much easier to work with.
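As a rough sketch of how these pieces might fit together, the helpers below cover a timeout wrapper, a bypass check, and the artifact naming. Only the _final.dfy suffix comes from the walkthrough. The function names, the per-attempt file name, and the bypass patterns (Dafny's `assume` statement and `{:verify false}` attribute) are assumptions about what such checks could look for, not a description of the actual implementation.

```python
import re
import subprocess
from pathlib import Path
from typing import List, Optional, Tuple

# Assumed patterns: constructs that let Dafny code skip verification.
BYPASS_PATTERNS = [
    r"\bassume\b",                   # assume statements
    r"\{:\s*verify\s+false\s*\}",    # {:verify false} attribute
]


def find_bypass(code: str) -> Optional[str]:
    """Return the first bypass pattern found in the code, or None."""
    for pattern in BYPASS_PATTERNS:
        if re.search(pattern, code):
            return pattern
    return None


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[bool, str]:
    """Run a command, turning a timeout into an ordinary failed result."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    return proc.returncode == 0, proc.stdout + proc.stderr


def save_attempt(
    out_dir: Path, sample_id: str, attempt: int, code: str, verified: bool
) -> Path:
    """Write one attempt to disk; verified code also gets a _final.dfy copy."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{sample_id}_attempt_{attempt}.dfy"  # hypothetical name
    path.write_text(code)
    if verified:
        final_path = out_dir / f"{sample_id}_final.dfy"
        final_path.write_text(code)
        return final_path
    return path
```

With a layout like this, finding every verified solution is a matter of globbing for `*_final.dfy` rather than parsing logs.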