gpt-4o-2024-11-20

20250119_azure_gpt_4o_0_0_n_20_fmt_tool_call_thoughts-in-action_1_verified_mini

Setup

Model gpt-4o-2024-11-20

Temperature N/A

Max Iterations 20

Format tool_call

Max Cost $1.00

% Resolved

32.0%

Resolved 16

Total 50

Cost

$34.49

$/Instance $0.69

Resolved/$ 0.46
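
The derived figures above follow from the three raw numbers on the page (16 resolved, 50 total, $34.49 total cost). A minimal sketch of that arithmetic; the variable names are illustrative and not taken from the evaluation tool:

```python
# Raw numbers as reported on this page.
resolved = 16
total = 50
total_cost_usd = 34.49

# Derived metrics (variable names are hypothetical, chosen for this sketch).
resolved_pct = 100 * resolved / total              # share of instances resolved
cost_per_instance = total_cost_usd / total         # average spend per instance
resolved_per_dollar = resolved / total_cost_usd    # resolved instances per dollar spent

print(f"% Resolved: {resolved_pct:.1f}%")          # % Resolved: 32.0%
print(f"$/Instance: ${cost_per_instance:.2f}")     # $/Instance: $0.69
print(f"Resolved/$: {resolved_per_dollar:.2f}")    # Resolved/$: 0.46
```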

Token Usage

8.4M

Prompt

8.2M (1.8M)

Completion 169.0K

Status

Final status of each instance evaluated with the Moatless EvalTools SWE-Bench harness. It indicates whether the instance was resolved successfully, failed to complete, encountered an error, or did not generate a patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes such as hallucinations, getting stuck in a loop, or missing test verification.

Instances

50 instances


failed	29%	8	$0.22		61.9K
failed	43%	20	$0.86	duplicated_actions, no_test_patch, failed_actions	210.2K
failed	27%	10	$0.27		72.3K
resolved	79%	10	$0.30		77.6K
failed	0%	13	$0.91		189.6K
resolved	53%	9	$0.27	duplicated_actions	69.2K
resolved	19%	10	$0.40		82.2K
failed	56%	17	$1.06		258.4K
resolved	31%	9	$0.29		55.5K
resolved	65%	12	$0.59		150.7K
resolved	83%	18	$1.06	duplicated_actions, failed_actions	245.1K
resolved	88%	10	$0.31		65.5K
failed	56%	8	$0.29		73.9K
resolved	53%	13	$0.82	wrong_action_for_imports, failed_actions	182.9K
failed	25%	20	$0.85	duplicated_actions	215.7K
no patch	4%	20	$0.89	duplicated_actions, failed_actions	199.1K
resolved	71%	20	$0.82	duplicated_actions, failed_actions	210.5K
failed	32%	20	$0.80	duplicated_actions, no_test_patch, failed_actions	214.7K
failed	10%	11	$0.32	failed_actions	94.5K
failed	8%	17	$1.03		246.6K
no patch	0%	20	$0.90	no_test_patch	235.8K
resolved	40%	14	$1.00	failed_actions	228.7K
resolved	59%	9	$0.33		78.9K
failed	13%	8	$0.20		60.8K
resolved	85%	6	$0.23		55.0K
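
The rows above are tab-separated, and a row can carry several flags in one field. A hedged sketch of how such a row might be parsed for further analysis follows. The field names are assumptions (the page gives no column headers, and the meaning of the percentage column is not stated), and the flag list contains only the names that appear in these rows:

```python
import re

# Flag names seen in the rows above; not claimed to be an exhaustive list.
KNOWN_FLAGS = (
    "duplicated_actions",
    "no_test_patch",
    "failed_actions",
    "wrong_action_for_imports",
)
# Matching known names works whether flags are comma-separated or run together.
FLAG_PATTERN = re.compile("|".join(sorted(KNOWN_FLAGS, key=len, reverse=True)))


def parse_row(line):
    """Parse one tab-separated instance row into a dict (field names are hypothetical)."""
    status, percent, iterations, cost, flags, tokens = line.split("\t")
    return {
        "status": status,
        "percent": float(percent.rstrip("%")),   # meaning not stated in the source
        "iterations": int(iterations),
        "cost_usd": float(cost.lstrip("$")),
        "flags": FLAG_PATTERN.findall(flags),
        "tokens_k": float(tokens.rstrip("K")),   # token count in thousands
    }


rows = [
    "failed\t29%\t8\t$0.22\t\t61.9K",
    "resolved\t83%\t18\t$1.06\tduplicated_actionsfailed_actions\t245.1K",
]
parsed = [parse_row(r) for r in rows]
print(parsed[1]["flags"])  # ['duplicated_actions', 'failed_actions']
```

Matching against a fixed list of names keeps the parser from guessing at unknown flags; a name outside the list is silently skipped, so the list should be extended if other flags show up on later pages.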

Page 1 of 2
