claude-3-5-haiku-20241022

20250118_claude_3_5_haiku_20241022_0_0_n_20_fmt_tool_call_verified_mini

Setup

Model claude-3-5-haiku-20241022

Temperature N/A

Max Iterations 20

Format tool_call

Max Cost $1.00

% Resolved

28.0%

Resolved 14

Total 50

Cost

$6.04

$/Instance $0.12

Resolved/$ 2.32
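
The three summary figures above follow from the resolved count, the instance total, and the total cost. A minimal sketch of that arithmetic, using only the values reported on this page (the variable names are illustrative, not taken from the harness):

```python
# Values copied from the summary above.
resolved = 14
total = 50
total_cost_usd = 6.04

# Derived metrics (names are hypothetical, chosen for this sketch).
resolve_rate = resolved / total               # fraction of instances resolved
cost_per_instance = total_cost_usd / total    # average dollars spent per instance
resolved_per_dollar = resolved / total_cost_usd

print(f"% Resolved  {resolve_rate:.1%}")
print(f"$/Instance  ${cost_per_instance:.2f}")
print(f"Resolved/$  {resolved_per_dollar:.2f}")
```

Rounded to the precision shown on the page, this reproduces 28.0%, $0.12, and 2.32.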

Token Usage

12.4M

Prompt 12.1M (9.1M)

Completion 356.3K

Status

The final status of each instance, as evaluated with the Moatless EvalTools SWE-Bench harness: resolved successfully, failed to complete, encountered an error, or generated no patch.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes such as hallucinations, getting stuck in a loop, or missing test verification.

Instances

50 instances


failed	29%	19	$0.13	failed_actions	387.6K
no patch	43%	7	5.35¢	no_test_patch	92.4K
failed	27%	17	$0.14	failed_actions	339.3K
resolved	79%	6	2.83¢		58.1K
failed	0%	8	$0.11	duplicated_actions, no_test_patch, failed_actions	133.2K
resolved	53%	5	1.46¢		35.2K
resolved	19%	9	5.78¢		105.0K
failed	56%	6	5.93¢	no_test_patch	81.5K
resolved	31%	10	3.96¢	failed_actions	95.1K
resolved	65%	9	4.78¢	failed_actions	100.2K
resolved	83%	9	4.81¢		122.6K
resolved	88%	7	3.93¢		79.7K
resolved	56%	7	2.59¢		59.9K
failed	53%	16	$0.17	failed_tests, duplicated_actions, failed_actions	332.0K
failed	25%	8	7.95¢		126.9K
no patch	4%	15	$0.11	failed_tests, failed_actions	249.9K
resolved	71%	9	3.08¢	failed_actions	84.9K
failed	32%	9	6.44¢		117.3K
error	10%	9	4.68¢		85.1K
failed	8%	20	$0.28	duplicated_actions, failed_actions	583.1K
completed	0%	11	$0.13	no_test_patch, failed_actions	202.4K
failed	40%	19	$0.17	failed_actions	387.3K
failed	59%	10	7.37¢	duplicated_actions, no_test_patch, failed_actions	144.0K
failed	13%	8	7.08¢	failed_actions	112.1K
resolved	85%	5	2.72¢		44.6K
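
To tally rows like the ones above, the listing can be parsed as tab-separated text. The sketch below is an assumption-laden illustration, not part of the Moatless tooling. It assumes six tab-separated fields per row, and it names the two unlabeled columns `percent` and `count` because the page does not say what they measure. It recognizes only the four flag names that appear in this listing. Flags are matched by name, so they are found whether they are separated or run together.

```python
import re
from collections import Counter

# Flag names observed in this listing; any other flag would be ignored.
KNOWN_FLAGS = ("duplicated_actions", "no_test_patch", "failed_actions", "failed_tests")
FLAG_RE = re.compile("|".join(KNOWN_FLAGS))

def parse_cost(text):
    """Convert a cost such as '$0.13' or '5.35¢' to dollars."""
    if text.endswith("¢"):
        return float(text[:-1]) / 100
    return float(text.lstrip("$"))

def parse_tokens(text):
    """Convert a token count such as '387.6K' or '12.4M' to an integer."""
    scale = {"K": 1_000, "M": 1_000_000}
    if text[-1] in scale:
        return round(float(text[:-1]) * scale[text[-1]])
    return round(float(text))

def parse_row(line):
    """Split one listing row into a dict (column names are hypothetical)."""
    status, percent, count, cost, flags, tokens = line.split("\t")
    return {
        "status": status,
        "percent": float(percent.rstrip("%")),
        "count": int(count),
        "cost_usd": parse_cost(cost),
        "flags": FLAG_RE.findall(flags),
        "tokens": parse_tokens(tokens),
    }

# Three rows copied from the listing above.
rows = [
    "failed\t29%\t19\t$0.13\tfailed_actions\t387.6K",
    "no patch\t43%\t7\t5.35¢\tno_test_patch\t92.4K",
    "resolved\t79%\t6\t2.83¢\t\t58.1K",
]
parsed = [parse_row(r) for r in rows]

status_counts = Counter(p["status"] for p in parsed)
flag_counts = Counter(f for p in parsed for f in p["flags"])
print(status_counts)
print(flag_counts)
```

Matching flags by name rather than splitting on a delimiter is a deliberate choice here, because it still works if the flag cell arrives with no separators between names.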

Page 1 of 2
