deepseek-chat-v3

20250111_deepseek_chat_v3_temp_0_0_iter_20_fmt_react_hist_react

Setup

Model deepseek-chat-v3
Temperature N/A
Max Iterations 20
Format react
Max Cost $1.00

% Resolved

30.7%
Resolved 92
Total 300

Cost

$3.82
$/Instance 1.27¢
Resolved/$ 24.07
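
The three ratios above follow directly from the resolved count, the instance total, and the total cost. The sketch below recomputes them; the function name `run_metrics` and its dictionary keys are illustrative assumptions, not part of the evaluation tooling.

```python
def run_metrics(resolved: int, total: int, cost_usd: float) -> dict:
    """Derive the headline ratios from raw counts and the total cost in USD."""
    return {
        "resolved_pct": 100 * resolved / total,        # share of instances resolved
        "cents_per_instance": 100 * cost_usd / total,  # average cost per instance, in cents
        "resolved_per_usd": resolved / cost_usd,       # resolved instances per dollar spent
    }


# Figures taken from the summary on this page.
m = run_metrics(resolved=92, total=300, cost_usd=3.82)
print(f"{m['resolved_pct']:.1f}%")         # 30.7%
print(f"{m['cents_per_instance']:.2f}¢")   # 1.27¢
print(f"{m['resolved_per_usd']:.2f}")      # 24.08
```

Dividing by the rounded $3.82 gives about 24.08 rather than the 24.07 shown, presumably because the page divides by the unrounded cost.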

Token Usage

57.2M
Prompt 55.7M (34.9M)
Completion 1.5M
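
As a quick consistency check, the prompt and completion counts should add up to the 57.2M total, and the same figures give rough per-instance and per-token averages. This is a minimal sketch: `token_summary` is a hypothetical helper, the 300 instances and $3.82 come from the summary on this page, and the parenthesized 34.9M is left out because the page does not label it.

```python
def token_summary(prompt_m: float, completion_m: float,
                  instances: int, cost_usd: float) -> dict:
    """Combine token counts (in millions) with instance count and total cost."""
    total_m = prompt_m + completion_m
    return {
        "total_m": total_m,                                   # total tokens, millions
        "avg_k_per_instance": total_m * 1000 / instances,     # average tokens per instance, thousands
        "usd_per_m_tokens": cost_usd / total_m,               # blended cost per million tokens
        "completion_share_pct": 100 * completion_m / total_m, # completion tokens as a share of the total
    }


t = token_summary(prompt_m=55.7, completion_m=1.5, instances=300, cost_usd=3.82)
print(f"{t['total_m']:.1f}M total")                        # 57.2M total
print(f"{t['avg_k_per_instance']:.1f}K per instance")      # 190.7K per instance
print(f"${t['usd_per_m_tokens']:.4f} per million tokens")  # $0.0668 per million tokens
print(f"{t['completion_share_pct']:.2f}% completion")      # 2.62% completion
```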

Status

Final status of instances evaluated with the Moatless EvalTools SWE-Bench Harness. Indicates if the instance was resolved successfully, failed to complete, encountered an error, or didn't generate any patches.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.

Instances

300 instances
Status     %     Iterations  Cost    Tokens   Flags
failed     37%   20          1.52¢   249.9K   duplicated_actions, failed_actions
failed     1%    20          2.05¢   299.3K   duplicated_actions
failed     1%    10          0.96¢   109.2K   -
resolved   73%   8           0.95¢   118.8K   -
failed     30%   7           0.37¢   58.7K    failed_actions
failed     0%    8           1.18¢   121.4K   -
resolved   75%   20          0.67¢   208.1K   duplicated_actions, no_test_patch, failed_actions
failed     21%   12          0.66¢   111.3K   failed_actions
failed     32%   15          0.93¢   164.7K   failed_actions
failed     0%    20          1.49¢   251.7K   duplicated_actions, failed_actions
resolved   78%   9           0.45¢   73.1K    -
resolved   51%   4           0.12¢   22.6K    no_test_patch
resolved   89%   9           0.23¢   60.5K    -
resolved   83%   9           0.26¢   69.7K    duplicated_actions, failed_actions
resolved   70%   7           0.44¢   60.2K    failed_actions
failed     3%    20          1.51¢   248.5K   duplicated_actions, failed_actions
failed     19%   20          1.56¢   256.0K   duplicated_actions, failed_actions
failed     0%    10          0.65¢   90.8K    failed_actions
failed     62%   5           0.21¢   33.0K    no_test_patch
no patch   25%   20          0.62¢   170.6K   duplicated_actions, no_test_patch
no patch   0%    20          1.36¢   248.9K   duplicated_actions, no_test_patch
failed     1%    20          1.77¢   291.9K   duplicated_actions, failed_actions, retries
failed     1%    20          1.60¢   256.2K   failed_actions
failed     43%   10          0.72¢   101.3K   failed_actions
resolved   27%   20          0.88¢   205.5K   duplicated_actions

Page 1 of 12