deepseek-chat-v3
20250111_deepseek_chat_v3_temp_0_0_iter_20_fmt_react_hist_react
Setup
Model deepseek-chat-v3
Temperature N/A
Max Iterations 20
Format react
Max Cost $1.00
% Resolved
30.7%
Resolved 92
Total 300
Cost
$3.82
$/Instance 1.27¢
Resolved/$ 24.07
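The summary figures above follow from three reported numbers: 92 resolved, 300 total, and $3.82 total cost. The sketch below recomputes them as a sanity check. The function and variable names are illustrative assumptions and are not part of the Moatless tooling.

```python
def summary_metrics(resolved: int, total: int, total_cost_usd: float):
    """Recompute the headline metrics from the reported counts and cost."""
    pct_resolved = 100 * resolved / total            # percent of instances resolved
    cents_per_instance = 100 * total_cost_usd / total  # average cost per instance, in cents
    resolved_per_dollar = resolved / total_cost_usd  # resolved instances per dollar spent
    return pct_resolved, cents_per_instance, resolved_per_dollar


# Values taken from the Setup, % Resolved, and Cost figures on this page.
pct, cents, per_dollar = summary_metrics(resolved=92, total=300, total_cost_usd=3.82)
print(f"{pct:.1f}% resolved, {cents:.2f} cents/instance, {per_dollar:.2f} resolved/$")
```

Using the rounded $3.82, Resolved/$ comes out slightly above the reported 24.07. The gap is presumably because the page computes it from the unrounded total cost, though the source does not say so.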
Token Usage
57.2M
Prompt 55.7M (34.9M)
Completion 1.5M
Status
Final status of each instance, as evaluated with the Moatless EvalTools SWE-Bench harness. It indicates whether the instance was resolved successfully, failed to complete, encountered an error, or did not generate a patch.
Flags
Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.
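One way to use the flags is to count how often each appears per final status, which shows which failure modes cluster among failed runs. The sketch below assumes each instance is available as a (status, flags) pair. That row shape and the function name are hypothetical, and the sample rows are the first five instances listed on this page.

```python
from collections import Counter, defaultdict


def tally_flags(rows):
    """Count flag occurrences separately for each final status."""
    counts = defaultdict(Counter)
    for status, flags in rows:
        counts[status].update(flags)  # an empty flag list still registers the status
    return counts


# Hypothetical in-memory form of the first five instance rows on this page.
sample_rows = [
    ("failed", ["duplicated_actions", "failed_actions"]),
    ("failed", ["duplicated_actions"]),
    ("failed", []),
    ("resolved", []),
    ("failed", ["failed_actions"]),
]

flag_counts = tally_flags(sample_rows)
for status, counter in flag_counts.items():
    print(status, dict(counter))
```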
Instances
300 instances
failed | 37% | 20 | 1.52¢ | duplicated_actions, failed_actions | 249.9K
failed | 1% | 20 | 2.05¢ | duplicated_actions | 299.3K
failed | 1% | 10 | 0.96¢ | | 109.2K
resolved | 73% | 8 | 0.95¢ | | 118.8K
failed | 30% | 7 | 0.37¢ | failed_actions | 58.7K
failed | 0% | 8 | 1.18¢ | | 121.4K
resolved | 75% | 20 | 0.67¢ | duplicated_actions, no_test_patch, failed_actions | 208.1K
failed | 21% | 12 | 0.66¢ | failed_actions | 111.3K
failed | 32% | 15 | 0.93¢ | failed_actions | 164.7K
failed | 0% | 20 | 1.49¢ | duplicated_actions, failed_actions | 251.7K
resolved | 78% | 9 | 0.45¢ | | 73.1K
resolved | 51% | 4 | 0.12¢ | no_test_patch | 22.6K
resolved | 89% | 9 | 0.23¢ | | 60.5K
resolved | 83% | 9 | 0.26¢ | duplicated_actions, failed_actions | 69.7K
resolved | 70% | 7 | 0.44¢ | failed_actions | 60.2K
failed | 3% | 20 | 1.51¢ | duplicated_actions, failed_actions | 248.5K
failed | 19% | 20 | 1.56¢ | duplicated_actions, failed_actions | 256.0K
failed | 0% | 10 | 0.65¢ | failed_actions | 90.8K
failed | 62% | 5 | 0.21¢ | no_test_patch | 33.0K
no patch | 25% | 20 | 0.62¢ | duplicated_actions, no_test_patch | 170.6K
no patch | 0% | 20 | 1.36¢ | duplicated_actions, no_test_patch | 248.9K
failed | 1% | 20 | 1.77¢ | duplicated_actions, failed_actions, retries | 291.9K
failed | 1% | 20 | 1.60¢ | failed_actions | 256.2K
failed | 43% | 10 | 0.72¢ | failed_actions | 101.3K
resolved | 27% | 20 | 0.88¢ | duplicated_actions | 205.5K
Showing 25 of 300 instances (page 1 of 12).