Moatless Experiments - AI Coding Agent Evaluations

SWE-Bench Lite Evaluations

Evaluations run on the SWE-Bench Lite dataset split.

Date	Model
1/13/2025	claude-3-5-sonnet-20241022
1/11/2025	deepseek-chat-v3
1/14/2025	deepseek-chat-v3

SWE-Bench Verified Mini Evaluations

Evaluations run on the SWE-Bench Verified Mini dataset, a curated subset of 50 datapoints from the full SWE-Bench Verified dataset, selected to maintain similar distributions of performance, test pass rates, and difficulty.

Date	Model
1/20/2025	deepseek-reasoner
1/19/2025	claude-3-5-sonnet-20241022
1/19/2025	gemini-2.0-flash-exp
1/19/2025	deepseek-chat-v3
1/19/2025	qwen2.5-coder-32b-instruct
1/19/2025	gpt-4o-2024-11-20
1/19/2025	claude-3-5-haiku-20241022
1/19/2025	o1-mini-2024-09-12
1/19/2025	meta-llama-3.1-405b-instruct-fp8
1/19/2025	claude-3-5-haiku-20241022
1/19/2025	deepseek-chat-v3
1/19/2025	gpt-4o-mini-2024-07-18
1/19/2025	gpt-4o-mini-2024-07-18

About the Project

I'm building Moatless Tools to experiment with ideas around agentic workflows, primarily focused on how LLMs can be used to edit code in large existing codebases. This site shares the evaluation results from these experiments, providing insight into which approaches work and which don't.

Through Moatless Tools, I explore different strategies for prompt engineering and LLM response handling. By making these results public, I hope to contribute to our collective understanding of how to build effective AI agents.

Whether you're interested in agentic workflows, want to discuss the results, or are curious about collaborating on research, feel free to join the discussion on our Discord server or reach out to me directly.