o1-mini-2024-09-12

20250114_o1_mini_2024_09_12_0_0_n_20_fmt_react_hist_react_verified_mini

Setup

Model o1-mini-2024-09-12
Temperature N/A
Max Iterations 20
Format react
Max Cost $1.00

% Resolved

28.0%
Resolved 14
Total 50

Cost

$24.55
$/Instance $0.49
Resolved/$ 0.57
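
The derived figures above follow from the reported totals. A minimal sketch of the arithmetic, assuming that $/Instance is total cost divided by all 50 instances and that Resolved/$ is the resolved count divided by total cost (the variable names are illustrative and not part of the harness):

```python
# Reported totals, copied from the summary above.
resolved = 14
total_instances = 50
total_cost_usd = 24.55

# Assumed definitions of the derived metrics.
pct_resolved = 100 * resolved / total_instances
cost_per_instance = total_cost_usd / total_instances
resolved_per_dollar = resolved / total_cost_usd

print(f"% Resolved  {pct_resolved:.1f}%")        # 28.0%
print(f"$/Instance  ${cost_per_instance:.2f}")   # $0.49
print(f"Resolved/$  {resolved_per_dollar:.2f}")  # 0.57
```

Under these assumed definitions the results match the reported 28.0%, $0.49, and 0.57.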

Token Usage

7.1M
Prompt 6.1M (3.2M)
Completion 1.0M

Status

Final status of instances evaluated with the Moatless EvalTools SWE-Bench Harness. Indicates if the instance was resolved successfully, failed to complete, encountered an error, or didn't generate any patches.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.
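
To see which failure modes dominate a run, flags can be tallied across instances. A minimal sketch, assuming each instance's flags are available as a list of strings. The `rows` sample is hand-copied from four instances listed on this page, and the tuple layout is hypothetical, not the Moatless data format:

```python
from collections import Counter

# Hypothetical in-memory form: (status, flags) per instance.
rows = [
    ("resolved", ["no_test_patch"]),
    ("failed", ["string_not_found", "duplicated_actions", "failed_actions", "retries"]),
    ("failed", ["duplicated_actions", "retries"]),
    ("resolved", ["duplicated_actions", "no_test_patch"]),
]

# Count how often each flag appears across all instances.
flag_counts = Counter(flag for _, flags in rows for flag in flags)

for flag, count in flag_counts.most_common():
    print(f"{flag}: {count}")
```

Grouping the same counts by status would show which flags co-occur with failed runs rather than resolved ones.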

Instances

50 instances
Status    %    Iterations  Cost    Tokens   Flags
resolved  29%  4           9.81¢   22.8K    no_test_patch
failed    43%  20          $0.97   291.9K   string_not_found, duplicated_actions, failed_actions, retries
resolved  27%  4           7.54¢   21.8K    no_test_patch
resolved  79%  4           9.94¢   27.1K    no_test_patch
failed    0%   18          $1.04   290.5K   duplicated_actions, retries
failed    53%  20          $0.74   209.5K   no_changes, string_already_exists, failed_tests, duplicated_actions, no_test_patch, failed_actions, retries
failed    19%  20          $0.84   255.5K   string_not_found, duplicated_actions, no_test_patch, failed_actions, retries
resolved  56%  8           $0.48   134.7K   duplicated_actions, no_test_patch
failed    31%  10          $0.26   73.9K    duplicated_actions, no_test_patch
resolved  65%  4           7.99¢   23.6K    no_test_patch
resolved  83%  5           9.86¢   29.1K    no_test_patch, failed_actions
resolved  88%  5           9.21¢   28.6K    no_test_patch
failed    56%  5           $0.12   34.2K    no_test_patch
failed    53%  7           $0.31   82.4K    no_test_patch
failed    25%  5           $0.11   31.9K    no_test_patch
failed    4%   6           $0.25   61.8K    no_test_patch
failed    71%  20          $0.72   198.7K   string_already_exists, failed_tests, duplicated_actions, failed_actions, retries
failed    32%  20          $0.97   253.8K   string_not_found, duplicated_actions, failed_actions
failed    10%  4           6.77¢   20.7K    no_test_patch
no patch  8%   7           $0.23   66.7K    no_test_patch
no patch  0%   20          $0.96   236.8K   string_not_found, duplicated_actions, no_test_patch, failed_actions, retries
failed    40%  5           9.79¢   31.4K    no_test_patch
resolved  59%  5           $0.13   35.8K    no_test_patch
failed    13%  4           8.21¢   23.2K    no_test_patch
resolved  85%  20          $0.70   201.5K   duplicated_actions, retries

Page 1 of 2