Moatless Experiments
Evaluation results from Moatless Tools, sharing insights into what works and what doesn't when building AI coding agents in this rapidly evolving field.
SWE-Bench Lite Evaluations
Evaluations run on the SWE-Bench Lite dataset split.
Setup | ||||
---|---|---|---|---|
1/13/2025 | ||||
1/11/2025 | ||||
1/14/2025 | ||||
SWE-Bench Verified Mini Evaluations
Evaluations run on the SWE-Bench Verified Mini dataset, a curated subset of 50 datapoints from the full SWE-Bench Verified dataset, optimized to maintain similar distributions of performance, test pass rates, and difficulty.
Setup | ||||
---|---|---|---|---|
1/20/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
1/19/2025 | ||||
About the Project
I'm building Moatless Tools to experiment with ideas around agentic workflows, primarily focused on how LLMs can be used to edit code in large existing codebases. This site shares the evaluation results from these experiments, providing insights into which approaches work and which don't.
Through Moatless Tools, I explore different strategies for prompt engineering and LLM response handling. By making these results public, I hope to contribute to our collective understanding of building effective AI agents.
Whether you're interested in agentic workflows, want to discuss the results, or are curious about collaborating on research, feel free to join the discussion on our Discord server or reach out to me directly.