Skip to article frontmatterSkip to article content

The touchdown dance

When you’re plumbing, you sometimes run into silly things

celebrate good times

Figure 1:A language model is so excited that it solved the task that it passes its celebratory message into the proof checker. There’s an (american) football metaphor about dancing in the touchdown zone somewhere.

Extraction Strategy v2

After the agent successfully verifies code, it sometimes generates a celebratory message like “Perfect! The code now verifies successfully!” without a code block. Our initial extraction logic (extract_code_v1) simply grabbed the last code block from the final completion, which worked great until there was no code block in that final message.

The naive approach looked like this:

utils.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
def extract_code_v1(completion: str) -> str:
    """Extract Dafny code from model completion (v1 - buggy version).

    This version extracts from the final completion text, taking the last code block.
    Problem: If the model celebrates after success ("Perfect! It worked!"), this may
    extract the wrong code or fail to find any code.

    Args:
        completion: Raw model output potentially with markdown, explanations, etc.

    Returns:
        Cleaned Dafny code.
    """
    matches = re.findall(CODE_BLOCK_PATTERN, completion, re.DOTALL)

    if matches:
        # Use the last code block (model might explain then provide code)
        return matches[-1].strip()

    # If no markdown blocks, return the whole completion
    return completion.strip()

Buggy extraction (v1) - only looks at final completion

When the agent outputs “Great! It worked!” without code, extract_code_v1 fails to find any code blocks. The scorer then tries to verify this touchdown dance as Dafny code, which obviously fails—turning a successful verification into a recorded failure.

The fix is backtracking through message history. Instead of only looking at the final completion, we walk backwards through all assistant messages until we find one containing code:

utils.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
def extract_code_v2(state: TaskState) -> str:
    """Extract Dafny code from task state (v2 - fixed version with backtracking).

    This version walks backwards through the message history. If the current message
    has no Dafny code (e.g., celebratory message), it backtracks to find the most
    recent message that does contain code.

    Args:
        state: Task state containing message history.

    Returns:
        Cleaned Dafny code.
    """
    # Walk backwards through assistant messages
    for message in reversed(state.messages):
        # Only look at assistant messages
        if message.role != "assistant":
            continue

        # Get the text content
        content = message.text if hasattr(message, "text") else str(message.content)

        # Try to extract code from this message
        matches = re.findall(CODE_BLOCK_PATTERN, content, re.DOTALL)

        if matches:
            # Found code! Return the last code block from this message
            return matches[-1].strip()

    # Fallback: use v1 extraction on the final completion
    return extract_code_v1(state.output.completion)

Fixed extraction (v2) - backtracks through message history

This handles the celebration problem: if the current message has no code, we skip it and check the previous message. We eventually find the most recent code the agent actually generated.

You can switch between strategies using the extraction_strategy parameter in the task definition:

uv run agent dafnybench inspect --extraction-strategy v2

This pattern—searching backwards through conversation history—is useful whenever you have tool-calling loops where the final message might not contain the information you need. The verifier acts as a sensor providing feedback, and the agent’s response to good news (“it worked!”) can omit the actual artifact you’re evaluating.