claude-3-5-sonnet-20241022

20250119_claude_3_5_sonnet_20241022_0_0_n_20_fmt_tool_call_verified_mini

Setup

Model claude-3-5-sonnet-20241022

Temperature N/A

Max Iterations 20

Format tool_call

Max Cost $1.00

% Resolved

46.0%

Resolved 23

Total 50

Cost

$18.27

$/Instance $0.37

Resolved/$ 1.26
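
The derived figures above follow directly from the reported counts and total cost. As a minimal sketch (the variable names are illustrative and not taken from the evaluation tooling), they can be recomputed like this:

```python
# Values copied from the summary above.
resolved = 23
total = 50
total_cost_usd = 18.27

# Derived metrics; names are illustrative only.
resolve_rate = resolved / total                  # fraction of instances resolved
cost_per_instance = total_cost_usd / total       # average spend per instance
resolved_per_dollar = resolved / total_cost_usd  # resolved instances per dollar

print(f"% Resolved: {resolve_rate:.1%}")         # 46.0%
print(f"$/Instance: ${cost_per_instance:.2f}")   # $0.37
print(f"Resolved/$: {resolved_per_dollar:.2f}")  # 1.26
```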

Token Usage

12.8M

Prompt 12.5M (9.1M)

Completion 304.8K

Status

Final status of instances evaluated with the Moatless EvalTools SWE-Bench Harness. Indicates if the instance was resolved successfully, failed to complete, encountered an error, or didn't generate any patches.

Flags

Flags indicate potential issues in how the LLM follows the agentic workflow. They help identify common failure modes like hallucinations, "stuck in a loop", or missing test verifications.

Instances

50 instances


failed	29%	19	$0.32		257.2K
failed	43%	8	$0.14		76.5K
failed	27%	6	$0.11		50.5K
resolved	79%	8	$0.18		102.8K
failed	0%	20	$0.56	no_changes, no_test_patch, failed_actions	503.5K
resolved	53%	8	$0.18		91.9K
failed	19%	9	$0.21		112.3K
resolved	56%	10	$0.25		143.6K
failed	31%	8	$0.24		98.5K
resolved	65%	7	$0.13		58.6K
resolved	83%	7	$0.19		89.6K
resolved	88%	8	$0.13	retries	67.3K
resolved	56%	8	$0.14		71.4K
resolved	53%	8	$0.21		109.5K
resolved	25%	17	$0.53	no_changes, failed_tests, failed_actions	319.5K
no patch	4%	20	$0.54	duplicated_actions, failed_actions, retries	403.5K
resolved	71%	8	$0.12		65.8K
failed	32%	20	$0.45		386.4K
failed	10%	8	$0.16		80.2K
failed	8%	9	$0.40	failed_tests	197.6K
failed	0%	9	$0.36		182.2K
resolved	40%	13	$0.40	failed_actions	292.1K
resolved	59%	12	$0.32	string_not_found, duplicated_actions, failed_actions	189.2K
failed	13%	13	$0.27	string_not_found, failed_actions	193.9K
resolved	85%	7	$0.16		87.0K
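
One way to read the flags is to tally how often each appears. The sketch below covers only the eight flagged rows on this page, not the full run; the tuple layout (status, cost, flags) is an assumption made for illustration and is not the harness's data model.

```python
from collections import Counter

# Flagged rows copied from the instance list on this page.
# Tuple layout (status, cost_usd, flags) is an illustrative assumption.
rows = [
    ("failed", 0.56, ["no_changes", "no_test_patch", "failed_actions"]),
    ("resolved", 0.13, ["retries"]),
    ("resolved", 0.53, ["no_changes", "failed_tests", "failed_actions"]),
    ("no patch", 0.54, ["duplicated_actions", "failed_actions", "retries"]),
    ("failed", 0.40, ["failed_tests"]),
    ("resolved", 0.40, ["failed_actions"]),
    ("resolved", 0.32, ["string_not_found", "duplicated_actions", "failed_actions"]),
    ("failed", 0.27, ["string_not_found", "failed_actions"]),
]

# Count how many rows carry each flag.
flag_counts = Counter()
for _status, _cost, flags in rows:
    flag_counts.update(flags)

# Count flagged rows per final status.
status_counts = Counter(status for status, _cost, _flags in rows)

for flag, count in flag_counts.most_common():
    print(f"{flag}: {count}")
```

On these rows `failed_actions` is the most frequent flag, and half of the flagged instances were still resolved, so a flag on its own does not imply failure.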

Page 1 of 2
