
Conclusion, exercises, next steps.

We didn’t actually do the task right. DafnyBench proper keeps the source of each sample mostly fixed and only inserts hints. Instead, we had the model regenerate all the code in the completion, because that is the simplest text-to-text setup. This is of course acceptable when language models are flawless, trustworthy, faithful, and not scruffy looking. But in real life, language models are only one of those things. We should be paranoid that, rather than piping the non-hint code through untouched, the agent might quietly rewrite it to make the problem easier for itself.
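If you want to check for that mechanically, here is a minimal sketch (my own guard, not something DafnyBench does). It approximates “hints” as whole lines whose first token is a Dafny proof annotation, which is a simplification, since real hints can sit inline:

```python
import re

# A minimal guard: a completion should differ from the original program only
# at hint positions. Hints are approximated as whole lines whose first token
# is a Dafny proof annotation -- a simplification of the real format.
HINT_RE = re.compile(r"^\s*(invariant|decreases|assert)\b")

def strip_hints(source: str) -> list[str]:
    """Drop hint lines so two programs can be compared everywhere else."""
    return [line for line in source.splitlines() if not HINT_RE.match(line)]

def non_hint_code_unchanged(original: str, completion: str) -> bool:
    """True if the model touched only hint lines, modulo surrounding whitespace."""
    before = [line.strip() for line in strip_hints(original)]
    after = [line.strip() for line in strip_hints(completion)]
    return before == after
```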

Exercises

As a reminder, the exercises are all stretch goals: none of them are essential for following along with the book, and they’re all good claude-code prompts if you’ve cloned the repo up to this point.

  1. Make a “red team” prompt that tries to cheat and get away with it (this is especially easy given that we haven’t done the proper hint-filling approach yet).

  2. Implement the hint-filling technique. This is especially interesting to try before you see how I do it in the next part.

  3. Change the prompt and the regex to <dafny> ... </dafny> style (a sketch of the extraction side follows this list).

  4. inspect_ai’s `.eval` file is none other than a zipped archive of `.json` files. Unzip it, poke around (see the sketch below), and ship a streamlit dashboard that replaces inspect_ai’s generic `uv run inspect view` dashboard. This way, you can make it better suited to your needs when analyzing synthetic data.
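For exercise 3, the extraction side might look something like this; the function name is mine, and it assumes at most one tagged block per completion:

```python
import re

# Pull the program out of <dafny> ... </dafny> tags instead of a markdown
# code fence. re.DOTALL lets the tag body span multiple lines.
DAFNY_TAG_RE = re.compile(r"<dafny>(.*?)</dafny>", re.DOTALL)

def extract_dafny(completion: str) -> str | None:
    """Return the first tagged Dafny program, or None if the tags are missing."""
    match = DAFNY_TAG_RE.search(completion)
    return match.group(1).strip() if match else None
```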
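And for exercise 4, poking around takes nothing beyond the standard library. The log path is a placeholder, and the only assumption is the one stated above, that the archive contains `.json` members:

```python
import json
import zipfile

# A .eval file is a zip archive of .json files, so the standard library is
# enough to look inside one. Replace the path with one of your own logs.
with zipfile.ZipFile("logs/example.eval") as archive:
    for name in archive.namelist():
        print(name)
    # Load the first JSON member to see how inspect_ai structures its data.
    first = next(n for n in archive.namelist() if n.endswith(".json"))
    payload = json.loads(archive.read(first))
    print(list(payload) if isinstance(payload, dict) else type(payload))
```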

Next steps

Instead of retrofitting our inspect_ai codebase to address the hint-filling critique, we will move on. We will throw out our code up to this point and do it again with no framework. We’ll use plain Python with the Anthropic SDK, to dispel any sense of “magic” behind the agentdev frameworks[1] you’ll use in the future.
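As a preview of how little magic there is, a complete call through the SDK fits in a few lines. The model name is a placeholder (substitute whatever is current), and the prompt is obviously not our real one:

```python
import anthropic

# The whole "framework" we need for the next part, more or less: one client,
# one request. The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; substitute the current model
    max_tokens=1024,
    messages=[{"role": "user", "content": "Fill in the missing Dafny hints."}],
)
print(response.content[0].text)
```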

Once more into the, etc.

Footnotes
  1. If you know anything about webdev, think of agentdev as being in its “pre-React” era: no dominant paradigm yet, a lot of duplicated work, but most of all, lots of disagreement about when someone else’s abstractions make sense versus when you should rawdog it.