gpt-4o-2024-11-20

20250119_azure_gpt_4o_0_0_n_20_fmt_tool_call_thoughts-in-action_1_verified_mini

Setup

Model gpt-4o-2024-11-20
Temperature N/A
Max Iterations 20
Format tool_call
Max Cost $1.00

% Resolved

32.0%
Resolved 16
Total 50

Cost

Total $34.49
$/Instance $0.69
Resolved/$ 0.46
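
The derived figures in the % Resolved and Cost sections follow from three reported numbers: 16 resolved, 50 total, and $34.49 total cost. The short sketch below recomputes them as a sanity check; the variable names are illustrative and not taken from the evaluation tool.

```python
# Figures copied from the summary above.
resolved = 16
total = 50
total_cost = 34.49  # USD, total cost of the run

# Derived metrics (variable names are illustrative, not from the tool).
resolve_rate = resolved / total * 100        # percent of instances resolved
cost_per_instance = total_cost / total       # USD spent per instance
resolved_per_dollar = resolved / total_cost  # resolved instances per USD

print(f"% Resolved  {resolve_rate:.1f}%")
print(f"$/Instance  ${cost_per_instance:.2f}")
print(f"Resolved/$  {resolved_per_dollar:.2f}")
```

The rounded results agree with the reported 32.0%, $0.69, and 0.46.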

Token Usage

Total 8.4M
Prompt 8.2M (1.8M)
Completion 169.0K

Status

Final status of each instance evaluated with the Moatless EvalTools SWE-Bench Harness. It indicates whether the instance was resolved successfully, failed to complete, encountered an error, or did not generate a patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes such as hallucinations, getting stuck in a loop, or missing test verification.
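
To make the "stuck in a loop" failure mode concrete, here is a minimal sketch of how a flag like duplicated_actions could be raised. It assumes the flag fires when the agent issues the same action with the same arguments several times in a row; the function name, the action dictionaries, and the threshold are all hypothetical, and this is not the evaluation tool's actual detection logic.

```python
from typing import Any


def has_duplicated_actions(actions: list[dict[str, Any]], min_repeats: int = 2) -> bool:
    """Return True if an identical action occurs min_repeats times consecutively.

    Hypothetical heuristic: two actions count as identical when their
    name and arguments compare equal.
    """
    run = 1  # length of the current run of identical actions
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False


# Hypothetical trajectory: the agent repeats the same search twice in a row.
looping = [
    {"name": "search", "args": {"query": "def parse"}},
    {"name": "search", "args": {"query": "def parse"}},
    {"name": "view_file", "args": {"path": "parser.py"}},
]
# Hypothetical trajectory with no consecutive repeats.
progressing = [
    {"name": "search", "args": {"query": "def parse"}},
    {"name": "view_file", "args": {"path": "parser.py"}},
    {"name": "search", "args": {"query": "def parse"}},
]

print(has_duplicated_actions(looping))
print(has_duplicated_actions(progressing))
```

Comparing only consecutive actions keeps the check simple. A real detector might also need to catch repeats separated by other steps.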

Instances

50 instances
failed   | 29% | 8  | $0.22 | 61.9K
failed   | 43% | 20 | $0.86 | 210.2K | duplicated_actions, no_test_patch, failed_actions
failed   | 27% | 10 | $0.27 | 72.3K
resolved | 79% | 10 | $0.30 | 77.6K
failed   | 0%  | 13 | $0.91 | 189.6K
resolved | 53% | 9  | $0.27 | 69.2K | duplicated_actions
resolved | 19% | 10 | $0.40 | 82.2K
failed   | 56% | 17 | $1.06 | 258.4K
resolved | 31% | 9  | $0.29 | 55.5K
resolved | 65% | 12 | $0.59 | 150.7K
resolved | 83% | 18 | $1.06 | 245.1K | duplicated_actions, failed_actions
resolved | 88% | 10 | $0.31 | 65.5K
failed   | 56% | 8  | $0.29 | 73.9K
resolved | 53% | 13 | $0.82 | 182.9K | wrong_action_for_imports, failed_actions
failed   | 25% | 20 | $0.85 | 215.7K | duplicated_actions
no patch | 4%  | 20 | $0.89 | 199.1K | duplicated_actions, failed_actions
resolved | 71% | 20 | $0.82 | 210.5K | duplicated_actions, failed_actions
failed   | 32% | 20 | $0.80 | 214.7K | duplicated_actions, no_test_patch, failed_actions
failed   | 10% | 11 | $0.32 | 94.5K | failed_actions
failed   | 8%  | 17 | $1.03 | 246.6K
no patch | 0%  | 20 | $0.90 | 235.8K | no_test_patch
resolved | 40% | 14 | $1.00 | 228.7K | failed_actions
resolved | 59% | 9  | $0.33 | 78.9K
failed   | 13% | 8  | $0.20 | 60.8K
resolved | 85% | 6  | $0.23 | 55.0K

Page 1 of 2