deepseek-reasoner
20250120_deepseek_deepseek_reasoner_None_n_20_fmt_react_verified_mini
Setup
Model deepseek-reasoner
Temperature N/A
Max Iterations 20
Format react
Max Cost $1.00
% Resolved 50.0%
Resolved 25
Total 50
Cost $3.82
$/Instance 7.64¢
Resolved/$ 6.54
Token Usage 7.0M
Prompt 5.9M (4.5M)
Completion 1.1M
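The derived figures in the summary can be cross-checked from the raw totals. The sketch below assumes the dashboard computes them as plain ratios; that assumption matches the reported values but is not confirmed by the page itself. The variable names are illustrative.

```python
# Raw totals copied from the summary above.
total_cost_usd = 3.82
total_instances = 50
resolved = 25

# Assumed definitions (simple ratios), not confirmed by the dashboard.
pct_resolved = 100 * resolved / total_instances
cents_per_instance = 100 * total_cost_usd / total_instances
resolved_per_dollar = resolved / total_cost_usd

print(f"% Resolved: {pct_resolved:.1f}%")
print(f"$/Instance: {cents_per_instance:.2f}¢")
print(f"Resolved/$: {resolved_per_dollar:.2f}")
```

Under these assumptions the three values round to the ones shown in the summary.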
Status
Final status of instances evaluated with the Moatless EvalTools SWE-Bench Harness. Indicates if the instance was resolved successfully, failed to complete, encountered an error, or didn't generate any patches.
Flags
Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.
Instances
50 instances
resolved | 29% | 10 | 2.85¢ | failed_actions | 73.2K
failed | 43% | 14 | 8.12¢ | failed_actions | 132.9K
resolved | 27% | 7 | 7.37¢ | string_not_found, failed_actions | 73.5K
resolved | 79% | 7 | 3.21¢ | failed_actions | 56.6K
no patch | 0% | 3 | 3.78¢ | no_test_patch | 33.2K
failed | 53% | 13 | 8.97¢ | failed_tests, failed_actions | 166.7K
failed | 19% | 11 | 7.12¢ | string_not_found, duplicated_actions, no_test_patch, failed_actions | 112.0K
resolved | 56% | 10 | 4.59¢ | | 99.6K
failed | 31% | 20 | $0.13 | string_not_found, duplicated_actions, no_test_patch, failed_actions | 179.5K
resolved | 65% | 7 | 1.78¢ | failed_actions | 49.2K
resolved | 83% | 13 | 5.39¢ | failed_actions | 139.7K
resolved | 88% | 7 | 2.93¢ | | 50.5K
failed | 56% | 16 | 7.14¢ | string_not_found, duplicated_actions, failed_actions | 189.5K
resolved | 53% | 17 | $0.12 | string_not_found, failed_actions | 193.5K
resolved | 25% | 15 | 6.55¢ | string_not_found, failed_actions | 169.6K
resolved | 4% | 20 | $0.11 | string_not_found, failed_actions, retries | 239.2K
resolved | 71% | 20 | $0.14 | failed_tests, duplicated_actions, failed_actions | 212.2K
failed | 32% | 20 | $0.11 | string_not_found, duplicated_actions, no_test_patch, failed_actions | 179.4K
failed | 10% | 13 | 5.38¢ | failed_tests, duplicated_actions, failed_actions | 115.9K
failed | 8% | 5 | 4.64¢ | no_test_patch | 51.3K
failed | 0% | 15 | $0.12 | string_not_found, failed_actions | 171.9K
failed | 40% | 20 | $0.16 | duplicated_actions, no_test_patch, retries | 231.2K
resolved | 59% | 11 | 4.43¢ | failed_actions | 105.5K
failed | 13% | 9 | 3.61¢ | failed_actions | 77.5K
resolved | 85% | 8 | 4.06¢ | | 75.6K
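A flag cell can hold several flag names, and in exported text they may run together with no separator. The sketch below is one way to recover them and tally statuses and flags. It is not part of Moatless; `split_flags` and `parse_row` are hypothetical helpers. The flag vocabulary is limited to the six names that appear in the listing above, and the sample rows are abridged copies of listing rows.

```python
import re
from collections import Counter

# Flag names observed in the listing above (assumed to be the full set).
KNOWN_FLAGS = [
    "string_not_found",
    "duplicated_actions",
    "no_test_patch",
    "failed_actions",
    "failed_tests",
    "retries",
]
FLAG_RE = re.compile("|".join(KNOWN_FLAGS))


def split_flags(cell: str) -> list[str]:
    """Return every known flag found in a cell, with or without separators."""
    return FLAG_RE.findall(cell)


def parse_row(line: str) -> tuple[str, list[str]]:
    """Return (status, flags) for one pipe-delimited row.

    Only the first cell (status) is interpreted; every other cell is
    scanned for flag names, so a missing or shifted flag cell is harmless.
    """
    cells = [c.strip() for c in line.split("|")]
    flags = [f for cell in cells[1:] for f in split_flags(cell)]
    return cells[0], flags


rows = [
    "resolved | 29% | 10 | 2.85¢ | failed_actions | 73.2K",
    "failed | 43% | 14 | 8.12¢ | failed_actions | 132.9K",
    "no patch | 0% | 3 | 3.78¢ | no_test_patch | 33.2K",
]

parsed = [parse_row(r) for r in rows]
status_counts = Counter(status for status, _ in parsed)
flag_counts = Counter(flag for _, flags in parsed for flag in flags)

print(status_counts)
print(flag_counts)
```

Matching against a fixed vocabulary avoids guessing where one flag ends and the next begins, at the cost of silently ignoring any flag name not in `KNOWN_FLAGS`.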
Page 1 of 2