deepseek-chat-v3

20250118_deepseek_deepseek_chat_0_0_n_20_fmt_react_verified_mini

Setup

Model deepseek-chat-v3
Temperature N/A
Max Iterations 20
Format react
Max Cost $1.00

% Resolved

36.0%
Resolved 18
Total 50

Cost

$0.44
$/Instance 0.88¢
Resolved/$ 40.80
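
The derived figures above can be cross-checked from the raw counts. The sketch below assumes that % Resolved is resolved/total, $/Instance is total cost divided by the instance count, and Resolved/$ is resolved instances divided by total cost. The page does not state these formulas, and the variable names are illustrative only.

```python
# Figures copied from the summary above. The formulas are assumptions,
# since the page does not say how the derived values are computed.
resolved = 18
total = 50
total_cost_usd = 0.44  # as displayed, rounded to the cent

pct_resolved = 100 * resolved / total
cents_per_instance = 100 * total_cost_usd / total
resolved_per_usd = resolved / total_cost_usd

print(f"% Resolved: {pct_resolved:.1f}%")
print(f"$/Instance: {cents_per_instance:.2f}¢")
print(f"Resolved/$: {resolved_per_usd:.2f}")
```

With the rounded $0.44 the last value comes out slightly above the reported 40.80. One explanation, though not confirmed by the page, is that the dashboard computes it from the unrounded total cost.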

Token Usage

8.6M
Prompt 8.4M (6.3M)
Completion 234.7K

Status

Final status of each instance evaluated with the Moatless EvalTools SWE-Bench Harness: resolved successfully, failed to complete, encountered an error, or generated no patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.

Instances

50 instances
Status    %    Iterations  Cost    Tokens   Flags
resolved  29%  6           0.14¢   35.3K    -
resolved  43%  20          1.23¢   280.6K   duplicated_actions
resolved  27%  8           0.30¢   58.9K    -
resolved  79%  5           0.19¢   44.2K    retries
failed    0%   20          1.38¢   270.2K   duplicated_actions, failed_actions
failed    53%  5           0.15¢   29.0K    no_test_patch
failed    19%  6           0.21¢   43.4K    -
no patch  56%  20          1.27¢   259.8K   duplicated_actions, failed_actions
failed    31%  20          1.20¢   239.9K   failed_actions
resolved  65%  6           0.21¢   41.4K    -
resolved  83%  13          0.46¢   116.4K   no_test_patch, failed_actions
resolved  88%  8           0.20¢   49.3K    -
failed    56%  8           0.30¢   60.6K    -
resolved  53%  20          1.62¢   261.3K   duplicated_actions, failed_actions
no patch  25%  20          0.60¢   210.9K   duplicated_actions, no_test_patch, failed_actions
failed    4%   20          1.97¢   266.8K   -
resolved  71%  9           0.20¢   59.1K    failed_actions
failed    32%  20          0.97¢   202.6K   duplicated_actions, no_test_patch
failed    10%  8           0.51¢   78.1K    failed_actions, retries
failed    8%   5           0.32¢   41.4K    no_test_patch
no patch  0%   14          0.84¢   184.7K   failed_actions
resolved  40%  15          1.19¢   208.1K   failed_actions
resolved  59%  9           0.49¢   92.1K    failed_actions
failed    13%  20          0.52¢   198.1K   duplicated_actions, no_test_patch, failed_actions
resolved  85%  6           0.28¢   47.4K    -
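
To see which flags are most common among the listed instances, the statuses and flags can be tallied. The sketch below transcribes the 25 instances listed on this page by hand. The `ROWS` structure and variable names are hypothetical and are not part of Moatless EvalTools.

```python
from collections import Counter

# (status, flags) for the 25 instances listed on this page, in page order.
# Transcribed by hand from the listing; not an export from the harness.
ROWS = [
    ("resolved", []),
    ("resolved", ["duplicated_actions"]),
    ("resolved", []),
    ("resolved", ["retries"]),
    ("failed", ["duplicated_actions", "failed_actions"]),
    ("failed", ["no_test_patch"]),
    ("failed", []),
    ("no patch", ["duplicated_actions", "failed_actions"]),
    ("failed", ["failed_actions"]),
    ("resolved", []),
    ("resolved", ["no_test_patch", "failed_actions"]),
    ("resolved", []),
    ("failed", []),
    ("resolved", ["duplicated_actions", "failed_actions"]),
    ("no patch", ["duplicated_actions", "no_test_patch", "failed_actions"]),
    ("failed", []),
    ("resolved", ["failed_actions"]),
    ("failed", ["duplicated_actions", "no_test_patch"]),
    ("failed", ["failed_actions", "retries"]),
    ("failed", ["no_test_patch"]),
    ("no patch", ["failed_actions"]),
    ("resolved", ["failed_actions"]),
    ("resolved", ["failed_actions"]),
    ("failed", ["duplicated_actions", "no_test_patch", "failed_actions"]),
    ("resolved", []),
]

# Count how often each status and each flag occurs.
status_counts = Counter(status for status, _ in ROWS)
flag_counts = Counter(flag for _, flags in ROWS for flag in flags)

for status, n in status_counts.most_common():
    print(f"{status:<10} {n}")
for flag, n in flag_counts.most_common():
    print(f"{flag:<20} {n}")
```

These counts cover only the instances shown on this page, so they are not expected to match the run-level summary of 18 resolved out of 50.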

Page 1 of 2 (25 of the 50 instances are listed)