Moatless Experiments

Evaluation results from Moatless Tools, sharing insights into what works and what doesn't when building AI coding agents in this rapidly evolving field.

SWE-Bench Lite Evaluations

Evaluations run on the SWE-Bench Lite dataset split.
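
For reference, SWE-Bench Lite is distributed on Hugging Face. The sketch below is a minimal loading example, assuming the datasets library and the public princeton-nlp/SWE-bench_Lite dataset ID; it pulls the 300-instance test split and prints a couple of fields from the first instance.

    from datasets import load_dataset

    # Load the SWE-Bench Lite test split (300 issue/patch instances).
    lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

    print(len(lite))                           # 300
    print(lite[0]["instance_id"])              # e.g. "astropy__astropy-12907"
    print(lite[0]["problem_statement"][:200])  # issue text the agent is asked to resolve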

[Results table: evaluation setups dated 1/11/2025, 1/13/2025, and 1/14/2025, including a deepseek-chat-v3 run]

SWE-Bench Verified Mini Evaluations

Evaluations run on the SWE-Bench Verified Mini dataset, a curated subset of 50 datapoints from the full SWE-Bench Verified dataset, selected to preserve similar distributions of performance, test pass rates, and difficulty.
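
The exact curation procedure behind Verified Mini is not reproduced here; purely as an illustration of what sampling while preserving distributions can look like, the sketch below draws a naive 50-instance stratified sample from SWE-Bench Verified (published on Hugging Face as princeton-nlp/SWE-bench_Verified), using the repository as a stand-in stratum rather than the performance, pass-rate, and difficulty statistics used for the real subset.

    import random
    from collections import defaultdict

    from datasets import load_dataset

    # Illustrative only: this is NOT how SWE-Bench Verified Mini was built.
    # It stratifies by repository as a crude proxy for difficulty.
    verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

    by_repo = defaultdict(list)
    for row in verified:
        by_repo[row["repo"]].append(row["instance_id"])

    random.seed(0)
    target = 50
    sample = []
    for repo, ids in by_repo.items():
        # Proportional allocation with at least one instance per repository.
        k = max(1, round(target * len(ids) / len(verified)))
        sample.extend(random.sample(ids, min(k, len(ids))))

    sample = sample[:target]  # trim any rounding overshoot
    print(len(sample), "instance ids sampled")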

[Results table: evaluation setups dated 1/19/2025 and 1/20/2025, including gpt-4o-2024-11-20 and deepseek-chat-v3 runs]

About the Project

I'm building Moatless Tools to experiment with ideas around agentic workflows, primarily focused on how LLMs can be used to edit code in large existing codebases. This site shares the evaluation results from these experiments, providing insights into which approaches work and which don't.

Through Moatless Tools, I explore different strategies for prompt engineering and LLM response handling. By making these results public, I hope to contribute to our collective understanding of building effective AI agents.

Whether you're interested in agentic workflows, want to discuss the results, or are curious about collaborating on research, feel free to join the discussion on our Discord server or reach out to me directly.