deepseek-chat-v3

20250118_deepseek_deepseek_chat_0_0_n_20_fmt_react_verified_mini

Setup

Model deepseek-chat-v3

Temperature N/A

Max Iterations 20

Format react

Max Cost $1.00

% Resolved 36.0%

Resolved 18

Total 50

Cost $0.44

$/Instance 0.88¢

Resolved/$ 40.80
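The derived figures above follow from the resolved count, the instance total, and the total cost. A minimal sketch of that arithmetic, assuming the displayed total cost of $0.44 is a rounded value (the variable names are illustrative, not taken from the harness):

```python
# Values copied from the summary above; total_cost_usd is the rounded
# figure shown on the page, not the exact cost.
resolved = 18
total = 50
total_cost_usd = 0.44

pct_resolved = 100 * resolved / total                   # share of instances resolved
cost_per_instance_cents = 100 * total_cost_usd / total  # average cost in cents
resolved_per_dollar = resolved / total_cost_usd         # resolved instances per dollar

print(f"% Resolved: {pct_resolved:.1f}%")
print(f"$/Instance: {cost_per_instance_cents:.2f}¢")
print(f"Resolved/$: {resolved_per_dollar:.2f}")
```

With the rounded cost this gives about 40.91 resolved instances per dollar rather than the reported 40.80; the small gap is presumably because the page computes the ratio from the unrounded cost.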

Token Usage 8.6M

Prompt 8.4M (6.3M)

Completion 234.7K

Status

Final status of each instance, as evaluated with the Moatless EvalTools SWE-Bench harness: resolved successfully, failed to complete, encountered an error, or generated no patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes such as hallucinations, getting stuck in a loop, or missing test verification.
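An instance can carry several flags at once, and the flags cell of a row then lists all of their names together. A minimal sketch for turning such a cell into a list, assuming the only flag names are the four that appear in the instance rows on this page (the list may be incomplete, and `split_flags` is a hypothetical helper, not part of Moatless):

```python
# Flag names observed in the instance rows; an assumption, not an
# authoritative list from the harness.
KNOWN_FLAGS = ("duplicated_actions", "failed_actions", "no_test_patch", "retries")


def split_flags(cell: str) -> list[str]:
    """Split a flags cell into individual flag names.

    Accepts names that are run together or separated by commas/spaces.
    """
    flags = []
    rest = cell.strip()
    while rest:
        rest = rest.lstrip(", ")  # skip any separators between names
        if not rest:
            break
        for name in KNOWN_FLAGS:
            if rest.startswith(name):
                flags.append(name)
                rest = rest[len(name):]
                break
        else:
            # No known flag matches the remaining text.
            raise ValueError(f"unrecognized flag text: {rest!r}")
    return flags


print(split_flags("duplicated_actionsno_test_patchfailed_actions"))
```

None of the four names is a prefix of another, so matching greedily from the left is unambiguous here; that would need revisiting if further flag names exist.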

Instances

50 instances


resolved	29%	6	0.14¢		35.3K
resolved	43%	20	1.23¢	duplicated_actions	280.6K
resolved	27%	8	0.30¢		58.9K
resolved	79%	5	0.19¢	retries	44.2K
failed	0%	20	1.38¢	duplicated_actions, failed_actions	270.2K
failed	53%	5	0.15¢	no_test_patch	29.0K
failed	19%	6	0.21¢		43.4K
no patch	56%	20	1.27¢	duplicated_actions, failed_actions	259.8K
failed	31%	20	1.20¢	failed_actions	239.9K
resolved	65%	6	0.21¢		41.4K
resolved	83%	13	0.46¢	no_test_patch, failed_actions	116.4K
resolved	88%	8	0.20¢		49.3K
failed	56%	8	0.30¢		60.6K
resolved	53%	20	1.62¢	duplicated_actions, failed_actions	261.3K
no patch	25%	20	0.60¢	duplicated_actions, no_test_patch, failed_actions	210.9K
failed	4%	20	1.97¢		266.8K
resolved	71%	9	0.20¢	failed_actions	59.1K
failed	32%	20	0.97¢	duplicated_actions, no_test_patch	202.6K
failed	10%	8	0.51¢	failed_actions, retries	78.1K
failed	8%	5	0.32¢	no_test_patch	41.4K
no patch	0%	14	0.84¢	failed_actions	184.7K
resolved	40%	15	1.19¢	failed_actions	208.1K
resolved	59%	9	0.49¢	failed_actions	92.1K
failed	13%	20	0.52¢	duplicated_actions, no_test_patch, failed_actions	198.1K
resolved	85%	6	0.28¢		47.4K
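The rows above are tab-separated, so they can be tallied with a few lines of code. The listing carries no column headers, so the field names in this sketch are assumptions: status, an unlabeled percentage, iterations (capped at the Max Iterations setting of 20), cost in cents, flags, and tokens. Four rows are copied from the listing as sample input.

```python
from collections import Counter

# Four rows copied from the listing above (tab-separated).
ROWS = [
    "resolved\t29%\t6\t0.14¢\t\t35.3K",
    "resolved\t79%\t5\t0.19¢\tretries\t44.2K",
    "failed\t31%\t20\t1.20¢\tfailed_actions\t239.9K",
    "no patch\t0%\t14\t0.84¢\tfailed_actions\t184.7K",
]

MAX_ITERATIONS = 20  # from the Setup section


def parse_tokens(text: str) -> int:
    """Convert a token count such as '35.3K' or '8.6M' to an integer."""
    scale = {"K": 1_000, "M": 1_000_000}
    if text[-1] in scale:
        return round(float(text[:-1]) * scale[text[-1]])
    return int(text)


def parse_row(line: str) -> dict:
    # Field names are assumed; the percentage column is unlabeled in the listing.
    status, pct, iterations, cost, flags, tokens = line.split("\t")
    return {
        "status": status,
        "pct": int(pct.rstrip("%")),
        "iterations": int(iterations),
        "cost_cents": float(cost.rstrip("¢")),
        "flags": flags,
        "tokens": parse_tokens(tokens),
    }


instances = [parse_row(line) for line in ROWS]
status_counts = Counter(row["status"] for row in instances)
total_cost_cents = sum(row["cost_cents"] for row in instances)
hit_cap = sum(row["iterations"] == MAX_ITERATIONS for row in instances)

print(status_counts)
print(f"total cost: {total_cost_cents:.2f}¢, hit iteration cap: {hit_cap}")
```

Feeding in all rows from both pages would let the status counts be checked against the Resolved and Total figures in the summary.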

Page 1 of 2
