gpt-4o-mini-2024-07-18

20250118_gpt_4o_mini_2024_07_18_0_0_n_20_fmt_tool_call_thoughts-in-action_6_verified_mini

Setup

Model gpt-4o-mini-2024-07-18
Temperature N/A
Max Iterations 20
Format tool_call
Max Cost $1.00

% Resolved

16.0%
Resolved 8
Total 50

Cost

$0.70
$/Instance 1.41¢
Resolved/$ 11.38
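
The headline figures above follow from simple ratios. The following is a minimal Python sketch, assuming the totals displayed on this page (8 resolved, 50 instances, $0.70) are the inputs; the variable names are illustrative and not part of the Moatless tooling. Because the displayed total cost is rounded to the cent, the derived values (1.40¢ per instance, 11.43 resolved per dollar) differ slightly from the page's 1.41¢ and 11.38, which is consistent with an unrounded total slightly above $0.70.

```python
# Totals as displayed on this page (assumed inputs for this sketch).
resolved = 8
total = 50
total_cost_usd = 0.70  # rounded to the cent on the page

# Share of instances resolved, as a percentage.
pct_resolved = 100 * resolved / total

# Average cost per instance, in cents.
cents_per_instance = 100 * total_cost_usd / total

# Resolved instances per dollar spent.
resolved_per_dollar = resolved / total_cost_usd

print(f"% Resolved:  {pct_resolved:.1f}%")
print(f"$/Instance:  {cents_per_instance:.2f}¢")
print(f"Resolved/$:  {resolved_per_dollar:.2f}")
```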

Token Usage

7.4M
Prompt 7.3M (6.1M)
Completion 99.4K

Status

The final status of each instance, as evaluated with the Moatless EvalTools SWE-Bench harness: whether it was resolved successfully, failed to complete, encountered an error, or did not generate a patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes such as hallucinations, getting stuck in a loop, or missing test verification.
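
One way to see which failure modes dominate is to tally how often each flag appears. The following is a minimal sketch, assuming each instance is represented as a (status, flags) pair; the `rows` sample holds the first five instances listed under Instances on this page, and the `tally_flags` helper is hypothetical, not part of the Moatless tooling.

```python
from collections import Counter

# Sample data: the first five instances listed on this page,
# as (status, flags) pairs. The structure is an assumption of this sketch.
rows = [
    ("failed", ["no_test_patch"]),
    ("failed", ["duplicated_actions", "no_test_patch", "failed_actions"]),
    ("failed", ["duplicated_actions", "failed_actions", "retries"]),
    ("resolved", ["wrong_action_for_imports", "duplicated_actions", "failed_actions"]),
    ("failed", ["wrong_action_for_imports", "duplicated_actions", "failed_actions"]),
]


def tally_flags(rows):
    """Count how many instances carry each flag (hypothetical helper)."""
    counts = Counter()
    for _status, flags in rows:
        counts.update(flags)
    return counts


flag_counts = tally_flags(rows)

# Print flags from most to least frequent.
for flag, n in flag_counts.most_common():
    print(f"{flag}: {n}")
```

The same tally restricted to rows whose status is not "resolved" would show which flags accompany failures specifically.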

Instances

50 instances
Status    %    Iterations  Cost   Tokens  Flags
failed    29%  6           0.31¢  32.6K   no_test_patch
failed    43%  20          1.98¢  215.7K  duplicated_actions, no_test_patch, failed_actions
failed    27%  20          3.20¢  308.7K  duplicated_actions, failed_actions, retries
resolved  79%  11          0.83¢  86.1K   wrong_action_for_imports, duplicated_actions, failed_actions
failed    0%   20          3.47¢  335.6K  wrong_action_for_imports, duplicated_actions, failed_actions
failed    53%  20          1.76¢  198.8K  duplicated_actions, no_test_patch, failed_actions, retries
failed    19%  16          1.51¢  153.2K  no_test_patch, failed_actions
failed    56%  20          2.08¢  234.1K  duplicated_actions, no_test_patch
failed    31%  6           0.31¢  31.6K   no_test_patch
resolved  65%  6           0.48¢  42.0K   no_test_patch
resolved  83%  6           0.33¢  33.6K   duplicated_actions, no_test_patch, failed_actions
resolved  88%  12          1.59¢  165.3K  duplicated_actions, no_test_patch, failed_actions, retries
failed    56%  5           0.32¢  32.0K   no_test_patch
failed    53%  20          1.95¢  234.6K  duplicated_actions, no_test_patch
failed    25%  7           0.42¢  42.3K   no_test_patch
no patch  4%   10          0.99¢  102.1K  failed_actions, retries
failed    71%  5           0.27¢  25.7K   no_test_patch
failed    32%  18          1.53¢  165.5K  duplicated_actions, failed_actions
failed    10%  6           0.35¢  34.4K   no_test_patch
no patch  8%   12          1.26¢  119.0K  duplicated_actions, no_test_patch, failed_actions
failed    0%   6           0.39¢  37.9K   no_test_patch
failed    40%  5           0.31¢  30.0K   no_test_patch
failed    59%  13          1.11¢  122.6K  failed_tests, no_test_patch, failed_actions
failed    13%  17          2.53¢  261.2K  failed_tests, failed_actions
resolved  85%  5           0.26¢  26.8K   no_test_patch

Page 1 of 2