## Context from Exploration
Existing Chapter 2 Structure (Dafny and Inspect):
8 sections following progressive elaboration: context → background → dataset → framework → practice → implementation → refinement → conclusion
Literary/engaging style with Cervantes-like prose
Detailed code walkthroughs with line numbers
Focus on Inspect AI’s abstractions (Task, Solver, Scorer)
Addresses the “touchdown dance” problem with extraction strategies
Plain Python Implementation Key Features:
Manual tool-calling loop (shows what `generate()` abstracts)
6 specialized tools (5 insertion tools + `verify_dafny`)
Dual location specification (line numbers + context matching)
TOML configuration system for prompts and settings
State management via message history markers
Artifact persistence with `_final.dfy` naming
~1,234 lines across 7 modules
Existing Chapter 3 Files:
`00-ch0.md` - Placeholder intro with notes
`09-ch9.md` - Conclusion about when to use plain Python (3+ month codebases)
## Proposed Outline (10 Sections)
### 00. Introduction: Dispelling Framework Superstition
File: 00-ch0.md (already exists, needs expansion)
Content:
Opening hook: The “plain Python” philosophy
Why this chapter exists: demystify frameworks by showing raw implementation
What we’re building: same DafnyBench solver, no frameworks
Preview of key innovations: specialized tools, TOML config, manual loops
Transition: “Let’s see what Inspect was doing for us...”
Teaching goal: Set expectations and motivate the chapter
### 01. The Manual Tool-Calling Loop ⭐ HIGH EMPHASIS
File: 01-ch1.md (new)
Content:
The core loop structure (`agent.py:225-434`)
What `generate()` abstracts away
Deep dive: Message history management (assistant → tool_use → user → tool_result)
Deep dive: Stop reasons (`end_turn`, `tool_use`, unexpected) and iteration control
Deep dive: Message pairing discipline - why tool_results must follow the assistant turn
Code walkthrough with detailed annotations
The “one more call” pattern when verification succeeds (letting the model respond to success)
Edge cases: hitting max_iterations, API errors, timeouts
Key code sections:
```python
for iteration in range(max_iterations):
    response = client.messages.create(...)
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "tool_use":
        # Process tools...
        messages.append({"role": "user", "content": tool_results})
```

Teaching goal: Demystify the tool-calling loop that frameworks hide, with deep mechanical understanding
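For deeper grounding, here is a fuller sketch of what the skeleton above elides, using the documented shapes of Anthropic's Messages API (`tool_use` blocks carrying `id`/`name`/`input`, `tool_result` blocks keyed by `tool_use_id`). `MAX_ITERATIONS`, `TOOLS`, `INITIAL_PROMPT`, and `run_tool()` are hypothetical placeholders, not names from the implementation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
messages = [{"role": "user", "content": INITIAL_PROMPT}]  # hypothetical prompt

for iteration in range(MAX_ITERATIONS):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # any tool-capable model works
        max_tokens=4096,
        tools=TOOLS,  # the six tool schemas from section 02
        messages=messages,
    )
    # Append the assistant turn verbatim, tool_use blocks included.
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # "end_turn" (or an unexpected stop reason) ends the loop

    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            output = run_tool(block.name, block.input)  # hypothetical dispatcher
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,  # each result must cite its tool_use
                "content": output,
            })
    # Pairing discipline: the tool_results user turn must immediately
    # follow the assistant turn that issued the tool_use blocks.
    messages.append({"role": "user", "content": tool_results})
    # The "one more call" pattern falls out naturally: after a successful
    # verify_dafny result is appended, the next iteration gives the model
    # one more chance to respond to the success before the loop exits.
```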
### 02. Specialized Insertion Tools ⭐ HIGH EMPHASIS
File: 02-ch2.md (new)
Content:
Deep dive: The problem with monolithic tools (regenerating entire files)
Five specialized insertion tools vs. one verify tool
Deep dive: Tool design philosophy
Focused single responsibility
Composable operations
Agent-friendly APIs (what vs how)
Reducing cognitive load
Deep dive: Trade-offs in tool granularity
Monolithic: simple API, complex calls
Specialized: complex API, simple calls
When to choose each approach
API surface comparison: `verify_dafny(full_code)` vs. `insert_invariant(hint, location)`
The three-layer architecture (generic → type-specific → utilities)
Indentation handling and proper Dafny syntax formatting
Tool inventory:
`insert_invariant()` - Loop invariants
`insert_assertion()` - Intermediate assertions
`insert_precondition()` - Function requires clauses
`insert_postcondition()` - Function ensures clauses
`insert_measure()` - Termination decreases clauses
`verify_dafny()` - Run verification on accumulated state
Teaching goal: Show the design trade-offs in tool granularity with deep philosophical treatment
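To make the granularity comparison concrete, here is a sketch of how one specialized tool might be declared in Anthropic's tool-definition format. The description text and property details are assumptions extrapolated from the inventory above, not copied from `tools.py`:

```python
# A sketch, not the repository's actual schema: the property names follow
# the dual location specification described in section 03.
insert_invariant_tool = {
    "name": "insert_invariant",
    "description": (
        "Insert a loop invariant into the Dafny program. Give the location "
        "either as a 1-based line_number or as context_before text; add "
        "context_after if context_before is ambiguous."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "invariant": {"type": "string", "description": "e.g. '0 <= i <= n'"},
            "line_number": {"type": "integer"},
            "context_before": {"type": "string"},
            "context_after": {"type": "string"},
        },
        "required": ["invariant"],
    },
}
```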
### 03. Dual Location Specification
File: 03-ch3.md (new)
Content:
The challenge: How should agents specify WHERE to insert hints?
Strategy 1: Line numbers (precise but brittle)
Strategy 2: Context matching (flexible but ambiguous)
Hybrid approach: both options available
Implementation walkthrough (`find_insertion_point()`; a sketch follows this section's outline)
Ambiguity detection and error messages
`context_after` for disambiguation
Agent-friendly API design principles
Code examples:

```python
# Precise
insert_invariant(invariant="0 <= i <= n", line_number=5)

# Flexible
insert_invariant(
    invariant="0 <= i <= n",
    context_before="while (i < n)"
)

# Disambiguated
insert_invariant(
    invariant="0 <= i <= n",
    context_before="while (i < n)",
    context_after="{"
)
```

Teaching goal: Demonstrate thoughtful tool API design for LLMs
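A minimal sketch of the hybrid resolution logic described above, assuming the dual parameters; this is illustrative, not the repository's actual `find_insertion_point()`:

```python
def find_insertion_point(lines, line_number=None,
                         context_before=None, context_after=None):
    """Resolve an insertion index from either location strategy (a sketch)."""
    if line_number is not None:
        return line_number - 1  # precise: trust the 1-based line number
    if context_before is None:
        raise ValueError("provide line_number or context_before")

    # Flexible: collect every line containing context_before.
    matches = [i for i, line in enumerate(lines) if context_before in line]

    # Disambiguate with context_after when several lines match.
    if len(matches) > 1 and context_after is not None:
        matches = [i for i in matches
                   if i + 1 < len(lines) and context_after in lines[i + 1]]

    if not matches:
        raise ValueError(f"no line matches context_before={context_before!r}")
    if len(matches) > 1:
        # Agent-friendly error: name the ambiguity so the model can retry
        # with context_after or an explicit line number.
        raise ValueError(f"{len(matches)} lines match {context_before!r}; "
                         "add context_after or use line_number")
    return matches[0]
```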
### 04. State Management via Message History ⭐ HIGH EMPHASIS
File: 04-ch4.md (new)
Content:
The state tracking problem: code evolves through insertions
Deep dive: Message history as single source of truth (vs. separate state)
State markers: `=== CURRENT_CODE_STATE ===`
`get_code_state()` - Backtracking through messages (implementation walkthrough; a sketch follows this section's outline)
`update_code_state()` - Appending new state (implementation walkthrough)
Deep dive: Maintaining Anthropic API message pairing semantics
Why tool_use must be paired with tool_result
Where state updates fit in the sequence
Consequences of breaking pairing
Deep dive: Why updates happen AFTER tool_results (pairing discipline)
Deep dive: Trade-off: cumulative vs. atomic tool calls in single turn
What happens with multiple insertions in one turn
Why agent must verify between modifications to see cumulative effects
Design decision implications
Key insight: Multiple insertions in one turn aren’t cumulative because state updates come after tool_results. Agent must verify or wait for next iteration.
Teaching goal: Show state management patterns without framework magic, with deep understanding of message structure and timing
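A minimal sketch of the marker pattern, assuming the `=== CURRENT_CODE_STATE ===` marker named above. The helper bodies are assumptions; in particular, this sketch appends the state text to the same user turn as the tool_results so that roles still alternate, which may differ from the real `agent.py`:

```python
STATE_MARKER = "=== CURRENT_CODE_STATE ==="

def _text_of(message):
    """Flatten a message's content to text, whether it is a plain string
    or a list of content blocks."""
    content = message["content"]
    if isinstance(content, str):
        return content
    return "\n".join(b.get("text", "") for b in content if isinstance(b, dict))

def get_code_state(messages):
    """Backtrack through the history to the most recent state marker."""
    for message in reversed(messages):
        text = _text_of(message)
        if STATE_MARKER in text:
            return text.split(STATE_MARKER, 1)[1].strip()
    return None

def update_code_state(messages, code):
    """Record the new code state AFTER the tool_results, inside the same
    user turn, so the assistant(tool_use) -> user(tool_result) pairing
    is never split."""
    last = messages[-1]
    assert last["role"] == "user", "state updates belong after tool_results"
    last["content"].append({"type": "text", "text": f"{STATE_MARKER}\n{code}"})
```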
### 05. Configuration as Code: TOML and Templates
File: 05-ch5.md (new)
Content:
The problem with hardcoded prompts and magic numbers
Configuration-driven design philosophy
`config.toml` structure (5 sections):
`[evaluation]` - Loop limits, timeouts, model
`[logging]` - Artifacts, log levels
`[prompt]` - System prompt + templates with interpolation
`[dataset]` - HuggingFace config
Template interpolation: `{code}` variables
Configuration loading mechanism (`PlainConfig` dataclass)
Global singleton pattern with `get_config()`
Python 3.11+ `tomllib` with fallback to `tomli`
185-line system prompt walkthrough:
Tool descriptions
Workflow (analyze → identify → insert → verify → iterate)
Both location strategies with examples
Important rules and Dafny syntax reminders
Teaching goal: Show how configuration enables reusability and testing
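A minimal sketch of the loading mechanism, assuming the `tomllib`/`tomli` fallback and the section names above; the specific key names (`max_iterations`, `system`) and `PlainConfig` fields are illustrative guesses, not the real `config.py`:

```python
from dataclasses import dataclass
from pathlib import Path

try:
    import tomllib            # Python 3.11+ standard library
except ModuleNotFoundError:
    import tomli as tomllib   # fallback for older interpreters

@dataclass
class PlainConfig:
    model: str
    max_iterations: int
    system_prompt: str

_CONFIG = None  # module-level cache backing the singleton

def get_config(path="config.toml"):
    """Global singleton: parse config.toml once, reuse thereafter."""
    global _CONFIG
    if _CONFIG is None:
        data = tomllib.loads(Path(path).read_text())
        _CONFIG = PlainConfig(
            model=data["evaluation"]["model"],
            max_iterations=data["evaluation"]["max_iterations"],
            system_prompt=data["prompt"]["system"],
        )
    return _CONFIG

# Template interpolation: prompts carry str.format-style {code} variables,
# e.g. get_config().system_prompt.format(code=dafny_source)
```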
### 06. Running the Plain Implementation
File: 06-ch6.md (new)
Content:
CLI command: `uv run agent dafnybench plain`
Arguments: `--model`, `--limit`
Comparison with inspect: `plain` vs. `inspect` subcommands
Log structure: `logs/plain_YYYYMMDD_HHMMSS.log`
Artifact structure: `artifacts/sample_<id>_attempt_<n>.dfy` and `_final.dfy`
Reading artifacts to understand agent behavior
Interpreting success/failure output
Debugging with message history inspection
Practical walkthrough:
```bash
# Test with 1 sample
uv run agent dafnybench plain --limit 1

# Full evaluation
uv run agent dafnybench plain --limit -1

# Different model
uv run agent dafnybench plain -m anthropic/claude-opus-4
```

Teaching goal: Make the implementation immediately usable
### 07. Code Walkthrough: Tying It All Together
File: 07-ch7.md (new)
Content:
Complete architecture overview
File-by-file walkthrough with line counts
`__init__.py` (81 lines) - Orchestration
`agent.py` (446 lines) - Manual loop
`tools.py` (437 lines) - Insertion tools + verify
`config.py` (128 lines) - TOML loading
`types.py` (81 lines) - Dataclasses + artifacts
`metrics.py` (40 lines) - Result aggregation
`prompt.py` (21 lines) - Backwards compat
Data flow: dataset → agent loop → tools → verification → metrics
Shared code: reusing `categorize_error()` from inspect_ai
Error handling: timeouts, bypass attempts, parsing failures
The artifact naming innovation: `_final.dfy` for successful verifications
Teaching goal: Provide navigable reference for the implementation
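A minimal sketch of the naming convention, with a hypothetical `save_artifact()` helper standing in for whatever the implementation actually calls it:

```python
from pathlib import Path

def save_artifact(sample_id, attempt, code, verified, outdir="artifacts"):
    """Persist every attempt; successful verifications take the _final.dfy
    suffix so passing solutions are findable at a glance (a sketch of the
    convention, not the repository's actual helper)."""
    Path(outdir).mkdir(exist_ok=True)
    name = (f"sample_{sample_id}_final.dfy" if verified
            else f"sample_{sample_id}_attempt_{attempt}.dfy")
    Path(outdir, name).write_text(code)
```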
### 08. When to “Rawdog” vs. Use Frameworks
File: 08-ch8.md (new)
Content:
The 3-month rule: frameworks for exploration, plain for serious work
Trade-offs table:
| Aspect | Plain Python | Frameworks (Inspect) |
|---|---|---|
| Setup time | Higher (implement loop) | Lower (batteries included) |
| Flexibility | Complete control | Constrained by abstractions |
| Maintenance | Direct, no surprises | Framework updates may break |
| Learning curve | Steep (must understand APIs) | Gentle (guided patterns) |
| Debugging | Explicit, transparent | Abstract, harder to trace |
| Customization | Trivial (just code) | May hit framework limits |
| Infrastructure | DIY (logging, dashboards) | Provided (inspect view) |
When frameworks shine: research, rapid prototyping, standard use cases
When plain Python shines: production systems, custom requirements, long-term codebases
The “framework-to-plain” migration path
Modern advantages: Anthropic structured outputs, better APIs
Teaching goal: Help readers make informed architectural decisions
### 09. Conclusion & Exercises
File: 09-ch9.md (already exists, may need minor edits)
Content:
Summary: We’ve seen what frameworks abstract away
Key takeaways: manual loops, specialized tools, configuration-driven design
The broader pattern: this applies to all agent frameworks (not just Inspect)
Transition to next chapter: Lean and pydantic-ai
Exercises:
Add a new insertion tool: `insert_lemma()` for helper lemmas
Implement retry logic with exponential backoff for API calls
Add streaming support to show incremental verification output
Create a visualization tool that animates the agent’s insertion decisions
Port the inspect implementation to use specialized insertion tools (from ch0 notes)
Add conversation persistence: save/resume agent sessions
Implement a “dry run” mode that shows insertions without verification
Teaching goal: Encourage hands-on exploration and extension
## Narrative Arc
Chapter 2 (Inspect): “Here’s a powerful framework that makes verification agents easy”
Chapter 3 (Plain): “Here’s what that framework was doing for you, and why you might want to do it yourself”
Flow:
Motivate with philosophy (dispel framework superstition)
Show the core loop (demystify `generate()`)
Explore tool design (specialized vs. monolithic)
Dig into implementation details (location specs, state, config)
Make it practical (running, reading artifacts)
Provide reference (architecture overview)
Guide architectural decisions (when to use what)
Encourage exploration (exercises)
## Style Notes
Maintain literary/engaging tone from Chapter 2
Use code annotations and line references
Consolidate comparisons: Save Inspect AI comparisons for section 08, not throughout
Embed literal code includes from the implementation
Keep focus on why decisions were made, not just what they are
Reference figures: agent-helix.png can be reused
Create new diagrams: message history structure, tool routing flow
## Emphasis Areas (per user feedback)
High priority - These should get deeper treatment:
Manual loop mechanics - iteration control, message pairing discipline, stop reasons
Tool design philosophy - specialized vs monolithic, composable APIs, agent-friendliness
State management - message history as truth, update timing, cumulative vs atomic
Standard coverage - Important but don’t over-elaborate:
TOML configuration system
Dual location specification
Artifact persistence
## Files to Create
`01-ch1.md` - Manual tool-calling loop
`02-ch2.md` - Specialized insertion tools
`03-ch3.md` - Dual location specification
`04-ch4.md` - State management
`05-ch5.md` - Configuration system
`06-ch6.md` - Running and using
`07-ch7.md` - Code walkthrough
`08-ch8.md` - When to use plain vs. frameworks
Total: 8 new files + 2 existing (00, 09) = 10 sections
## Key Teaching Principles
Transparency over abstraction: Show exactly what frameworks hide
Design decisions: Explain why choices were made (trade-offs)
Practical patterns: Message pairing, state tracking, config loading
Tool API design: Agent-friendly interfaces (dual location specs)
Architectural guidance: When to framework, when to plain
Reusable patterns: These ideas apply beyond DafnyBench