Skip to article frontmatterSkip to article content

Outline for Chapter 03: DafnyBench with Plain Python

Context from Exploration

Existing Chapter 2 Structure (Dafny and Inspect):

Plain Python Implementation Key Features:

Existing Chapter 3 Files:


Proposed Outline (8 Sections)

00. Introduction: Dispelling Framework Superstition

File: 00-ch0.md (already exists, needs expansion)

Content:

Teaching goal: Set expectations and motivate the chapter


01. The Manual Tool-Calling Loop ⭐ HIGH EMPHASIS

File: 01-ch1.md (new)

Content:

Key code sections:

for iteration in range(max_iterations):
    response = client.messages.create(...)
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "tool_use":
        # Process tools...
        messages.append({"role": "user", "content": tool_results})

Teaching goal: Demystify the tool-calling loop that frameworks hide, with deep mechanical understanding


02. Specialized Insertion Tools ⭐ HIGH EMPHASIS

File: 02-ch2.md (new)

Content:

Tool inventory:

Teaching goal: Show the design trade-offs in tool granularity with deep philosophical treatment


03. Dual Location Specification

File: 03-ch3.md (new)

Content:

Code examples:

# Precise
insert_invariant(invariant="0 <= i <= n", line_number=5)

# Flexible
insert_invariant(
    invariant="0 <= i <= n",
    context_before="while (i < n)"
)

# Disambiguated
insert_invariant(
    invariant="0 <= i <= n",
    context_before="while (i < n)",
    context_after="{"
)

Teaching goal: Demonstrate thoughtful tool API design for LLMs


04. State Management via Message History ⭐ HIGH EMPHASIS

File: 04-ch4.md (new)

Content:

Key insight: Multiple insertions in one turn aren’t cumulative because state updates come after tool_results. Agent must verify or wait for next iteration.

Teaching goal: Show state management patterns without framework magic, with deep understanding of message structure and timing


05. Configuration as Code: TOML and Templates

File: 05-ch5.md (new)

Content:

185-line system prompt walkthrough:

Teaching goal: Show how configuration enables reusability and testing


06. Running the Plain Implementation

File: 06-ch6.md (new)

Content:

Practical walkthrough:

# Test with 1 sample
uv run agent dafnybench plain --limit 1

# Full evaluation
uv run agent dafnybench plain --limit -1

# Different model
uv run agent dafnybench plain -m anthropic/claude-opus-4

Teaching goal: Make the implementation immediately usable


07. Code Walkthrough: Tying It All Together

File: 07-ch7.md (new)

Content:

Teaching goal: Provide navigable reference for the implementation


08. When to “Rawdog” vs. Use Frameworks

File: 08-ch8.md (new)

Content:

AspectPlain PythonFrameworks (Inspect)
Setup timeHigher (implement loop)Lower (batteries included)
FlexibilityComplete controlConstrained by abstractions
MaintenanceDirect, no surprisesFramework updates may break
Learning curveSteep (must understand APIs)Gentle (guided patterns)
DebuggingExplicit, transparentAbstract, harder to trace
CustomizationTrivial (just code)May hit framework limits
InfrastructureDIY (logging, dashboards)Provided (inspect view)

Teaching goal: Help readers make informed architectural decisions


09. Conclusion & Exercises

File: 09-ch9.md (already exists, may need minor edits)

Content:

Exercises:

  1. Add a new insertion tool: insert_lemma() for helper lemmas

  2. Implement retry logic with exponential backoff for API calls

  3. Add streaming support to show incremental verification output

  4. Create a visualization tool that animates the agent’s insertion decisions

  5. Port the inspect implementation to use specialized insertion tools (from ch0 notes)

  6. Add conversation persistence: save/resume agent sessions

  7. Implement a “dry run” mode that shows insertions without verification

Teaching goal: Encourage hands-on exploration and extension


Narrative Arc

Chapter 2 (Inspect): “Here’s a powerful framework that makes verification agents easy”

Chapter 3 (Plain): “Here’s what that framework was doing for you, and why you might want to do it yourself”

Flow:

  1. Motivate with philosophy (dispel framework superstition)

  2. Show the core loop (demystify generate())

  3. Explore tool design (specialized vs. monolithic)

  4. Dig into implementation details (location specs, state, config)

  5. Make it practical (running, reading artifacts)

  6. Provide reference (architecture overview)

  7. Guide architectural decisions (when to use what)

  8. Encourage exploration (exercises)


Style Notes

Emphasis Areas (per user feedback)

High priority - These should get deeper treatment:

  1. Manual loop mechanics - iteration control, message pairing discipline, stop reasons

  2. Tool design philosophy - specialized vs monolithic, composable APIs, agent-friendliness

  3. State management - message history as truth, update timing, cumulative vs atomic

Standard coverage - Important but don’t over-elaborate:


Files to Create

  1. 01-ch1.md - Manual tool-calling loop

  2. 02-ch2.md - Specialized insertion tools

  3. 03-ch3.md - Dual location specification

  4. 04-ch4.md - State management

  5. 05-ch5.md - Configuration system

  6. 06-ch6.md - Running and using

  7. 07-ch7.md - Code walkthrough

  8. 08-ch8.md - When to use plain vs frameworks

Total: 8 new files + 2 existing (00, 09) = 10 sections


Key Teaching Principles

  1. Transparency over abstraction: Show exactly what frameworks hide

  2. Design decisions: Explain why choices were made (trade-offs)

  3. Practical patterns: Message pairing, state tracking, config loading

  4. Tool API design: Agent-friendly interfaces (dual location specs)

  5. Architectural guidance: When to framework, when to plain

  6. Reusable patterns: These ideas apply beyond DafnyBench