claude-3-5-haiku-20241022
20250118_claude_3_5_haiku_20241022_0_0_n_20_fmt_tool_call_verified_mini
Setup
Model claude-3-5-haiku-20241022
Temperature N/A
Max Iterations 20
Format tool_call
Max Cost $1.00
% Resolved 28.0%
Resolved 14
Total 50
Cost $6.04
$/Instance $0.12
Resolved/$ 2.32
Token Usage 12.4M
Prompt 12.1M (9.1M)
Completion 356.3K
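The derived figures in the summary follow from the three base numbers (resolved count, total instances, total cost). A minimal sketch of that arithmetic, using values copied from the summary; the variable names are illustrative and not part of the Moatless tooling:

```python
# Base figures copied from the summary above.
resolved = 14
total = 50
total_cost_usd = 6.04

# Derived metrics as reported in the summary.
pct_resolved = 100 * resolved / total           # share of instances resolved
cost_per_instance = total_cost_usd / total      # average spend per instance
resolved_per_dollar = resolved / total_cost_usd # resolved instances per dollar

print(f"% Resolved  {pct_resolved:.1f}%")
print(f"$/Instance  ${cost_per_instance:.2f}")
print(f"Resolved/$  {resolved_per_dollar:.2f}")
```

Rounded to the precision shown in the report, these reproduce 28.0%, $0.12 and 2.32.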
Status
Final status of each instance evaluated with the Moatless EvalTools SWE-Bench Harness: whether the instance was resolved successfully, failed to complete, encountered an error, or did not generate a patch.
Flags
Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.
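To see which failure modes dominate, the flag cells of the instance listing can be tallied. The sketch below assumes the four flag names that appear in this report (`duplicated_actions`, `no_test_patch`, `failed_actions`, `failed_tests`) and matches them by name, so it works whether the flags in a cell are separated or run together. The helper `split_flags` is hypothetical, not a Moatless function.

```python
import re
from collections import Counter

# Flag names as they appear in this report's instance listing (assumed complete).
KNOWN_FLAGS = ("duplicated_actions", "no_test_patch", "failed_actions", "failed_tests")
FLAG_RE = re.compile("|".join(KNOWN_FLAGS))

def split_flags(cell: str) -> list[str]:
    """Return every known flag name found in a flags cell, in order."""
    return FLAG_RE.findall(cell)

# A few flag cells taken from the listing; the empty string is an unflagged row.
cells = [
    "failed_actions",
    "duplicated_actionsno_test_patchfailed_actions",
    "failed_testsduplicated_actionsfailed_actions",
    "",
]

# Count how often each flag occurs across the sample cells.
counts = Counter(flag for cell in cells for flag in split_flags(cell))
print(counts.most_common())
```

Matching against a fixed list of names avoids guessing at a delimiter, at the price of silently ignoring any flag not in `KNOWN_FLAGS`.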
Instances
50 instances
Status | % | Iterations | Cost | Flags | Tokens
failed | 29% | 19 | $0.13 | failed_actions | 387.6K
no patch | 43% | 7 | 5.35¢ | no_test_patch | 92.4K
failed | 27% | 17 | $0.14 | failed_actions | 339.3K
resolved | 79% | 6 | 2.83¢ | | 58.1K
failed | 0% | 8 | $0.11 | duplicated_actions, no_test_patch, failed_actions | 133.2K
resolved | 53% | 5 | 1.46¢ | | 35.2K
resolved | 19% | 9 | 5.78¢ | | 105.0K
failed | 56% | 6 | 5.93¢ | no_test_patch | 81.5K
resolved | 31% | 10 | 3.96¢ | failed_actions | 95.1K
resolved | 65% | 9 | 4.78¢ | failed_actions | 100.2K
resolved | 83% | 9 | 4.81¢ | | 122.6K
resolved | 88% | 7 | 3.93¢ | | 79.7K
resolved | 56% | 7 | 2.59¢ | | 59.9K
failed | 53% | 16 | $0.17 | failed_tests, duplicated_actions, failed_actions | 332.0K
failed | 25% | 8 | 7.95¢ | | 126.9K
no patch | 4% | 15 | $0.11 | failed_tests, failed_actions | 249.9K
resolved | 71% | 9 | 3.08¢ | failed_actions | 84.9K
failed | 32% | 9 | 6.44¢ | | 117.3K
error | 10% | 9 | 4.68¢ | | 85.1K
failed | 8% | 20 | $0.28 | duplicated_actions, failed_actions | 583.1K
completed | 0% | 11 | $0.13 | no_test_patch, failed_actions | 202.4K
failed | 40% | 19 | $0.17 | failed_actions | 387.3K
failed | 59% | 10 | 7.37¢ | duplicated_actions, no_test_patch, failed_actions | 144.0K
failed | 13% | 8 | 7.08¢ | failed_actions | 112.1K
resolved | 85% | 5 | 2.72¢ | | 44.6K
Page 1 of 2
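Per-instance costs in the listing mix two units: dollars (`$0.13`) and cents (`5.35¢`). Any aggregation has to normalize them first. The sketch below converts both forms to dollars and sums cost per status for a few rows copied from the listing (trailing empty cells omitted). It assumes the status is the first cell and the cost the fourth; `parse_cost_usd` is a hypothetical helper, not part of the Moatless tooling.

```python
from collections import defaultdict

def parse_cost_usd(text: str) -> float:
    """Convert a cost cell such as '$0.13' or '5.35¢' to dollars."""
    text = text.strip()
    if text.endswith("¢"):
        return float(text[:-1]) / 100
    if text.startswith("$"):
        return float(text[1:])
    raise ValueError(f"unrecognized cost format: {text!r}")

# Sample rows from the instance listing.
rows = [
    "failed | 29% | 19 | $0.13 | failed_actions | 387.6K",
    "no patch | 43% | 7 | 5.35¢ | no_test_patch | 92.4K",
    "resolved | 31% | 10 | 3.96¢ | failed_actions | 95.1K",
    "resolved | 71% | 9 | 3.08¢ | failed_actions | 84.9K",
]

# Sum the normalized cost for each status.
cost_by_status = defaultdict(float)
for row in rows:
    cells = [cell.strip() for cell in row.split("|")]
    status, cost = cells[0], cells[3]
    cost_by_status[status] += parse_cost_usd(cost)

for status, cost in cost_by_status.items():
    print(f"{status}: ${cost:.4f}")
```

Because the displayed costs are rounded, totals computed this way will only approximate the report's overall cost figure.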