claude-3-5-haiku-20241022

20250118_claude_3_5_haiku_20241022_0_0_n_20_fmt_tool_call_verified_mini

Setup

Model claude-3-5-haiku-20241022
Temperature N/A
Max Iterations 20
Format tool_call
Max Cost $1.00
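Taken together, these settings describe a budget-capped tool-calling agent loop: at most 20 model turns and at most $1.00 of spend per instance, with actions expressed as tool calls. The sketch below shows one way such a loop could be wired up with the Anthropic Python SDK; it is an illustration under stated assumptions, not the actual Moatless EvalTools implementation. The run_shell tool, its handling, and the per-token prices are assumptions for the sketch.

```python
# Rough sketch of a budget-capped tool-calling loop with the same limits.
import anthropic

MODEL = "claude-3-5-haiku-20241022"
MAX_ITERATIONS = 20            # "Max Iterations" above
MAX_COST_USD = 1.00            # "Max Cost" budget per instance
INPUT_PRICE = 0.80 / 1e6       # assumed $/token for prompt tokens (not taken from this report)
OUTPUT_PRICE = 4.00 / 1e6      # assumed $/token for completion tokens

client = anthropic.Anthropic()

tools = [{
    "name": "run_shell",       # hypothetical tool, for illustration only
    "description": "Run a shell command in the repository under test.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

messages = [{"role": "user", "content": "Fix the failing issue in this repository."}]
cost = 0.0

for _ in range(MAX_ITERATIONS):
    response = client.messages.create(
        model=MODEL, max_tokens=4096, tools=tools, messages=messages
    )
    cost += (response.usage.input_tokens * INPUT_PRICE
             + response.usage.output_tokens * OUTPUT_PRICE)
    if response.stop_reason != "tool_use" or cost >= MAX_COST_USD:
        break                  # model stopped calling tools, or budget exhausted
    messages.append({"role": "assistant", "content": response.content})
    tool_results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": "<output of executing the requested command goes here>"}  # placeholder
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
```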

% Resolved

28.0%
Resolved 14
Total 50

Cost

$6.04
$/Instance $0.12
Resolved/$ 2.32
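The derived figures follow from the headline numbers: 14 of 50 resolved is 28.0%, $6.04 spread over 50 instances is about $0.12 each, and 14 resolved for $6.04 is about 2.32 resolved per dollar. A quick check:

```python
resolved, total, total_cost = 14, 50, 6.04

pct_resolved = 100 * resolved / total        # 28.0 (%)
cost_per_instance = total_cost / total       # ~0.12 ($/instance)
resolved_per_dollar = resolved / total_cost  # ~2.32 (resolved/$)

print(f"{pct_resolved:.1f}% resolved, ${cost_per_instance:.2f}/instance, "
      f"{resolved_per_dollar:.2f} resolved/$")
```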

Token Usage

12.4M
Prompt 12.1M (9.1M)
Completion 356.3K

Status

Final status of instances evaluated with the Moatless EvalTools SWE-Bench Harness. Indicates whether the instance was resolved successfully, failed to complete, encountered an error, or did not generate a patch.
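These outcomes form a small fixed set, and the same names appear per instance in the table below. A minimal sketch of how they could be modeled; the interpretations in the comments follow the description above, and the completed status also occurs in the table with its meaning assumed:

```python
from enum import Enum

class InstanceStatus(Enum):
    RESOLVED = "resolved"      # resolved successfully
    FAILED = "failed"          # failed to complete
    ERROR = "error"            # encountered an error
    NO_PATCH = "no patch"      # did not generate a patch
    COMPLETED = "completed"    # assumed: run finished but not judged resolved
```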

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes such as hallucinations, getting stuck in a loop, or missing test verifications.
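The flag names on this page are failed_actions, duplicated_actions, no_test_patch, and failed_tests. Assuming each instance record carries its flags as a list of strings, tallying them across a run is straightforward; the records below are illustrative, not the full data set:

```python
from collections import Counter

instances = [
    {"status": "failed", "flags": ["failed_actions"]},
    {"status": "failed", "flags": ["duplicated_actions", "no_test_patch", "failed_actions"]},
    {"status": "resolved", "flags": []},
]

flag_counts = Counter(flag for inst in instances for flag in inst["flags"])
print(flag_counts.most_common())
# [('failed_actions', 2), ('duplicated_actions', 1), ('no_test_patch', 1)]
```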

Instances

50 instances
Status     %     Iterations  Cost (¢)  Tokens   Flags
failed     29%   19          13        387.6K   failed_actions
no patch   43%   7           5.35      92.4K    no_test_patch
failed     27%   17          14        339.3K   failed_actions
resolved   79%   6           2.83      58.1K
failed     0%    8           11        133.2K   duplicated_actions, no_test_patch, failed_actions
resolved   53%   5           1.46      35.2K
resolved   19%   9           5.78      105.0K
failed     56%   6           5.93      81.5K    no_test_patch
resolved   31%   10          3.96      95.1K    failed_actions
resolved   65%   9           4.78      100.2K   failed_actions
resolved   83%   9           4.81      122.6K
resolved   88%   7           3.93      79.7K
resolved   56%   7           2.59      59.9K
failed     53%   16          17        332.0K   failed_tests, duplicated_actions, failed_actions
failed     25%   8           7.95      126.9K
no patch   4%    15          11        249.9K   failed_tests, failed_actions
resolved   71%   9           3.08      84.9K    failed_actions
failed     32%   9           6.44      117.3K
error      10%   9           4.68      85.1K
failed     8%    20          28        583.1K   duplicated_actions, failed_actions
completed  0%    11          13        202.4K   no_test_patch, failed_actions
failed     40%   19          17        387.3K   failed_actions
failed     59%   10          7.37      144.0K   duplicated_actions, no_test_patch, failed_actions
failed     13%   8           7.08      112.1K   failed_actions
resolved   85%   5           2.72      44.6K

Page 1 of 2 (25 of 50 instances shown)
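The per-instance rows can be rolled up into page-level totals (status counts, cost, tokens). A sketch of that aggregation, using a few rows transcribed from the table above rather than the full set:

```python
from collections import Counter

# Three rows from the table above (cost in cents, tokens in thousands).
rows = [
    {"status": "failed",   "cost_cents": 13.0, "tokens_k": 387.6},
    {"status": "no patch", "cost_cents": 5.35, "tokens_k": 92.4},
    {"status": "resolved", "cost_cents": 2.83, "tokens_k": 58.1},
]

by_status = Counter(r["status"] for r in rows)
total_cost_usd = sum(r["cost_cents"] for r in rows) / 100
total_tokens = sum(r["tokens_k"] for r in rows) * 1_000

print(by_status, f"${total_cost_usd:.2f}", f"{total_tokens:,.0f} tokens")
# Counter({'failed': 1, 'no patch': 1, 'resolved': 1}) $0.21 538,100 tokens
```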