gpt-4o-mini-2024-07-18

20250118_gpt_4o_mini_2024_07_18_0_0_n_20_fmt_tool_call_thoughts-in-action_6_verified_mini

Setup

Model gpt-4o-mini-2024-07-18

Temperature N/A

Max Iterations 20

Format tool_call

Max Cost $1.00

% Resolved

16.0%

Resolved 8

Total 50

Cost

$0.70

$/Instance 1.41¢

Resolved/$ 11.38
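
The derived figures above follow from the resolved count, the instance total, and the total cost. The sketch below recomputes them; the function name `summarize` and its dictionary keys are illustrative and not part of Moatless. It uses the displayed, rounded cost of $0.70, so its results differ slightly from the reported 1.41¢ and 11.38, which presumably come from the unrounded cost.

```python
def summarize(resolved: int, total: int, cost_usd: float) -> dict:
    """Recompute the summary metrics from the three raw figures."""
    return {
        # Share of instances that were resolved, in percent.
        "pct_resolved": 100.0 * resolved / total,
        # Average cost per instance, in cents.
        "cents_per_instance": 100.0 * cost_usd / total,
        # Resolved instances per dollar spent.
        "resolved_per_dollar": resolved / cost_usd,
    }


# Figures taken from the summary above; the cost is the rounded display value.
stats = summarize(resolved=8, total=50, cost_usd=0.70)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```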

Token Usage

7.4M

Prompt

7.3M (6.1M)

Completion 99.4K

Status

The final status of each instance, as evaluated with the Moatless EvalTools SWE-Bench harness. It indicates whether the instance was resolved, failed, encountered an error, or did not generate a patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.

Instances

50 instances


failed	29%	6	0.31¢	no_test_patch	32.6K
failed	43%	20	1.98¢	duplicated_actions, no_test_patch, failed_actions	215.7K
failed	27%	20	3.20¢	duplicated_actions, failed_actions, retries	308.7K
resolved	79%	11	0.83¢	wrong_action_for_imports, duplicated_actions, failed_actions	86.1K
failed	0%	20	3.47¢	wrong_action_for_imports, duplicated_actions, failed_actions	335.6K
failed	53%	20	1.76¢	duplicated_actions, no_test_patch, failed_actions, retries	198.8K
failed	19%	16	1.51¢	no_test_patch, failed_actions	153.2K
failed	56%	20	2.08¢	duplicated_actions, no_test_patch	234.1K
failed	31%	6	0.31¢	no_test_patch	31.6K
resolved	65%	6	0.48¢	no_test_patch	42.0K
resolved	83%	6	0.33¢	duplicated_actions, no_test_patch, failed_actions	33.6K
resolved	88%	12	1.59¢	duplicated_actions, no_test_patch, failed_actions, retries	165.3K
failed	56%	5	0.32¢	no_test_patch	32.0K
failed	53%	20	1.95¢	duplicated_actions, no_test_patch	234.6K
failed	25%	7	0.42¢	no_test_patch	42.3K
no patch	4%	10	0.99¢	failed_actions, retries	102.1K
failed	71%	5	0.27¢	no_test_patch	25.7K
failed	32%	18	1.53¢	duplicated_actions, failed_actions	165.5K
failed	10%	6	0.35¢	no_test_patch	34.4K
no patch	8%	12	1.26¢	duplicated_actions, no_test_patch, failed_actions	119.0K
failed	0%	6	0.39¢	no_test_patch	37.9K
failed	40%	5	0.31¢	no_test_patch	30.0K
failed	59%	13	1.11¢	failed_tests, no_test_patch, failed_actions	122.6K
failed	13%	17	2.53¢	failed_tests, failed_actions	261.2K
resolved	85%	5	0.26¢	no_test_patch	26.8K
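
A minimal sketch of parsing one row of the instance table, assuming the columns are tab-separated in the order status, percentage, iterations, cost, flags, tokens. The percentage column is not labeled on the page, so the field name `percent` is only a placeholder, and `Row`, `parse_row`, and `parse_tokens` are hypothetical helpers rather than part of Moatless. Flags are matched against the six flag names that appear in the table, so the parser works whether or not the names are separated.

```python
import re
from typing import NamedTuple

# Flag names observed in the table above; the real tool may define more.
KNOWN_FLAGS = [
    "wrong_action_for_imports",
    "duplicated_actions",
    "failed_actions",
    "failed_tests",
    "no_test_patch",
    "retries",
]
# Longest names first so a longer flag is never shadowed by a shorter one.
FLAG_RE = re.compile("|".join(sorted(KNOWN_FLAGS, key=len, reverse=True)))


class Row(NamedTuple):
    status: str
    percent: int        # unlabeled percentage column (meaning not stated)
    iterations: int
    cost_cents: float
    flags: list
    tokens: float


def parse_tokens(text: str) -> float:
    """Convert strings such as '215.7K' or '7.4M' to a token count."""
    multipliers = {"K": 1e3, "M": 1e6}
    if text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


def parse_row(line: str) -> Row:
    """Split one tab-separated table row into typed fields."""
    status, pct, iters, cost, flags, tokens = line.rstrip("\n").split("\t")
    return Row(
        status=status,
        percent=int(pct.rstrip("%")),
        iterations=int(iters),
        cost_cents=float(cost.rstrip("¢")),
        flags=FLAG_RE.findall(flags),
        tokens=parse_tokens(tokens),
    )


row = parse_row(
    "failed\t43%\t20\t1.98¢\tduplicated_actionsno_test_patchfailed_actions\t215.7K"
)
print(row.status, row.iterations, row.flags)
```

Matching against a fixed vocabulary is a deliberate choice here: any flag name outside `KNOWN_FLAGS` is silently skipped, so the list should be extended if other flags appear on later pages.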

Page 1 of 2
