Now that we’ve seen how the plain Python agent works, let’s run it. The agent is exposed as a command-line interface, which you can run with uv run agent dafnybench plain. ## TODO: audit the prose to make sure its not too LLMy
The CLI accepts two arguments: --model and --limit. The --model argument lets you specify which Anthropic model to use, and the --limit argument lets you limit the number of samples to evaluate.
# Test with 1 sample
uv run agent dafnybench plain --limit 1
# Full evaluation
uv run agent dafnybench plain
# Different model
uv run agent dafnybench plain -m anthropic/claude-3-opus-20240229The agent will create a log file in the logs directory, with a name like plain_YYYYMMDD_HHMMSS.log. This file contains a detailed record of the agent’s execution, including all the messages exchanged with the model.
The agent also creates artifacts in the artifacts directory. For each attempt, it saves the Dafny code to a file named sample_<id>_attempt_<n>.dfy. When the agent successfully verifies a sample, it saves the final code to sample_<id>_final.dfy. These artifacts are invaluable for debugging the agent and understanding its behavior.
By inspecting the logs and artifacts, you can get a very clear picture of how the agent is thinking and how it’s using the tools to solve the DafnyBench problems.