claude-3-5-sonnet-20241022

20250119_claude_3_5_sonnet_20241022_0_0_n_20_fmt_tool_call_verified_mini

Setup

Model claude-3-5-sonnet-20241022
Temperature N/A
Max Iterations 20
Format tool_call
Max Cost $1.00

% Resolved

46.0%
Resolved 23
Total 50

Cost

$18.27
$/Instance $0.37
Resolved/$ 1.26
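
The derived figures above follow from the resolved count, the instance total, and the total cost. A minimal sketch of that arithmetic, assuming the dashboard rounds to the precision shown (the variable names are illustrative and not taken from the evaluation tooling):

```python
# Figures copied from the summary above.
resolved = 23
total = 50
total_cost_usd = 18.27

# Share of instances resolved, as a percentage.
pct_resolved = 100 * resolved / total

# Average spend per evaluated instance.
cost_per_instance = total_cost_usd / total

# Resolved instances obtained per dollar spent.
resolved_per_dollar = resolved / total_cost_usd

print(f"% Resolved  {pct_resolved:.1f}%")
print(f"$/Instance  ${cost_per_instance:.2f}")
print(f"Resolved/$  {resolved_per_dollar:.2f}")
```

Rounded this way, the three values agree with the reported 46.0%, $0.37, and 1.26.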

Token Usage

12.8M
Prompt 12.5M (9.1M)
Completion 304.8K

Status

Final status of instances evaluated with the Moatless EvalTools SWE-Bench Harness. Indicates if the instance was resolved successfully, failed to complete, encountered an error, or didn't generate any patches.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.

Instances

50 instances

Status    %    Iterations  Cost   Tokens  Flags
failed    29%  19          $0.32  257.2K
failed    43%  8           $0.14  76.5K
failed    27%  6           $0.11  50.5K
resolved  79%  8           $0.18  102.8K
failed    0%   20          $0.56  503.5K  no_changes, no_test_patch, failed_actions
resolved  53%  8           $0.18  91.9K
failed    19%  9           $0.21  112.3K
resolved  56%  10          $0.25  143.6K
failed    31%  8           $0.24  98.5K
resolved  65%  7           $0.13  58.6K
resolved  83%  7           $0.19  89.6K
resolved  88%  8           $0.13  67.3K   retries
resolved  56%  8           $0.14  71.4K
resolved  53%  8           $0.21  109.5K
resolved  25%  17          $0.53  319.5K  no_changes, failed_tests, failed_actions
no patch  4%   20          $0.54  403.5K  duplicated_actions, failed_actions, retries
resolved  71%  8           $0.12  65.8K
failed    32%  20          $0.45  386.4K
failed    10%  8           $0.16  80.2K
failed    8%   9           $0.40  197.6K  failed_tests
failed    0%   9           $0.36  182.2K
resolved  40%  13          $0.40  292.1K  failed_actions
resolved  59%  12          $0.32  189.2K  string_not_found, duplicated_actions, failed_actions
failed    13%  13          $0.27  193.9K  string_not_found, failed_actions
resolved  85%  7           $0.16  87.0K

Page 1 of 2