<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[CodeSweep Blog]]></title><description><![CDATA[CodeSweep Blog]]></description><link>https://blog.codesweep.ai</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 18:16:57 GMT</lastBuildDate><atom:link href="https://blog.codesweep.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Mixture of Open-Weight Models with Iterative Patch Generation Improves Performance on SWE-bench]]></title><description><![CDATA[Goal
Our goal in this study was to explore whether a mixture of open-weight models, combined through an iterative process, can outperform any single model on the SWE-bench Verified benchmark. Specifically, we wanted to evaluate if patches generated b...]]></description><link>https://blog.codesweep.ai/mixture-of-open-weight-models-with-iterative-patch-generation-improves-performance-on-swe-bench</link><guid isPermaLink="true">https://blog.codesweep.ai/mixture-of-open-weight-models-with-iterative-patch-generation-improves-performance-on-swe-bench</guid><category><![CDATA[AI]]></category><category><![CDATA[SWE-bench]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[devtools]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Rishi Vaish]]></dc:creator><pubDate>Tue, 09 Dec 2025 19:32:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250481490/846df239-5044-4eb1-b624-b8abe20609ea.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-goal">Goal</h3>
<p>Our goal in this study was to explore whether a mixture of open-weight models, combined through an iterative process, can outperform any single model on the SWE-bench Verified benchmark. Specifically, we wanted to evaluate if patches generated by multiple models can provide a useful signal that improves subsequent rounds of patch generation.</p>
<h3 id="heading-models">Models</h3>
<p>We selected three open-weight models for this experiment:</p>
<ul>
<li>Qwen3 Coder 480B A35B Instruct</li>
<li>Kimi K2 Thinking</li>
<li>Kimi K2 Instruct 0905</li>
</ul>
<p>Each model had access to the same tool suite and was run under identical constraints to ensure fair comparison. We used <a target="_blank" href="https://fireworks.ai/">Fireworks AI</a> as the provider for these models.</p>
<h3 id="heading-tools">Tools</h3>
<p>All runs were conducted using our tool-calling scaffolding, which is inspired by the <a target="_blank" href="https://github.com/SWE-agent/SWE-agent">SWE-agent</a> tool-calling scaffolding.</p>
<p>Our scaffolding supports custom prompts for each iteration and circuit breakers in the agentic loop to account for factors like:</p>
<ul>
<li>Trajectory length</li>
<li>Cost</li>
<li>Consecutive timeouts</li>
<li>Un-parsable model responses</li>
</ul>
<p>The models were guided entirely by:</p>
<ul>
<li>Their “thinking”</li>
<li>The tool-calls they generated</li>
<li>The tool-call responses</li>
</ul>
<p>The available tools were:</p>
<ul>
<li>str_replace_editor - a tool for viewing and editing files</li>
<li>bash - a tool that gives the model access to a bash shell</li>
<li>submit - a simple tool the model calls when it’s done with the task</li>
</ul>
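<p>For illustration, the three tools could be declared in an OpenAI-style function-calling schema along these lines. The exact schemas we used are not shown in this post, so treat every field below as an assumption:</p>

```python
# Illustrative tool declarations in an OpenAI-style function-calling
# format; parameter names and descriptions are assumptions, not the
# actual schemas used in the experiment.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "str_replace_editor",
            "description": "View files or replace an exact string in a file.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string",
                                "enum": ["view", "str_replace", "create"]},
                    "path": {"type": "string"},
                    "old_str": {"type": "string"},
                    "new_str": {"type": "string"},
                },
                "required": ["command", "path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "bash",
            "description": "Run a shell command inside the task container.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "submit",
            "description": "Signal that the task is complete.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]
```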
<p>The tool-calls were executed in Docker containers based on the SWE-bench Docker images.</p>
<p>Models did not receive information or hints outside of what these tools provided.</p>
<h3 id="heading-methodology">Methodology</h3>
<p>Our process was carried out in three iterations.</p>
<h4 id="heading-iteration-1-independent-model-runs">Iteration 1 - Independent Model Runs</h4>
<p>We first ran our tool-calling scaffolding with the three models <em>independently</em>. This gave us the first pool of candidate patches.</p>
<h4 id="heading-iteration-2-single-model-with-patches-from-multiple-models">Iteration 2 - Single Model With Patches From Multiple Models</h4>
<p>For the second iteration, we selected <strong>Kimi K2 Instruct 0905</strong> as our single model. Its prompt was augmented with the patches produced in iteration 1.</p>
<p>The intuition is simple: if each model acts like a junior engineer exploring different directions, then collecting their patches provides additional structure that may give another model a better starting point.</p>
<h4 id="heading-iteration-3-single-model-refinement">Iteration 3 - Single Model Refinement</h4>
<p>In the third iteration, we again used <strong>Kimi K2 Instruct 0905</strong>. Its prompt was augmented with the patch generated in iteration 2.</p>
<p>The intuition mirrors how an engineer refines their own previous attempt: with additional context and a clearer view of the search space, the model may be able to resolve more issues.</p>
<h4 id="heading-handling-limits">Handling Limits</h4>
<p>When a circuit breaker was hit, we handled two scenarios:</p>
<ul>
<li>No files were edited - the run produced no patch, so we continued without it</li>
<li>Files were edited but the model never called the submit tool - we created a patch from the edits that were made</li>
</ul>
<h4 id="heading-handling-patches-from-one-iteration-to-the-next">Handling Patches From One Iteration To The Next</h4>
<p>All iterations were performed sequentially with no intermediate ranking or categorization of patches. All patches generated in one iteration were passed into the subsequent iteration as an input.</p>
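<p>A minimal sketch of how one iteration's patches might be folded into the next iteration's prompt. The template wording below is illustrative, not our actual prompt:</p>

```python
# Hypothetical prompt-augmentation helper: wrap each candidate patch from
# the prior iteration in a tagged block and append it to the issue text.
def augment_prompt(issue_text: str, candidate_patches: list[str]) -> str:
    blocks = "\n\n".join(
        f'<candidate_patch source="model_{i}">\n{p}\n</candidate_patch>'
        for i, p in enumerate(candidate_patches, start=1)
    )
    return (
        f"{issue_text}\n\n"
        "Below are candidate patches produced by other agents for this issue. "
        "They may be partially or fully incorrect; use them only as hints.\n\n"
        f"{blocks}"
    )
```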
<h3 id="heading-results">Results</h3>
<p>Using this mixture-of-models iterative approach, we observed consistent improvement over any individual model’s performance. With our scaffolding and methodology, we reached a pass@1 score (per the SWE-bench definition) of <strong>70.4%</strong> through the above agentic loop, which, at the time of writing (Dec 09, 2025), placed us <strong>second among all open-weight models</strong> on the SWE-bench Verified leaderboard.</p>
<p>A more detailed breakdown is included below.</p>
<h3 id="heading-detailed-analysis">Detailed Analysis</h3>
<p>We wanted to see whether there was relative improvement (or regression) across the iterations. To do this, we used the SWE-bench harness to evaluate the intermediate patches:</p>
<ul>
<li>Did they resolve the issue?</li>
<li>Did the resolution rate improve as we progressed through the iterations?</li>
</ul>
<p>Note that this analysis was done after our final submission to the SWE-bench benchmark. Intermediate resolution evaluation was not used during the iteration runs themselves.</p>
<h4 id="heading-iteration-1-mixture-of-open-weight-model-runs">Iteration 1 - Mixture of Open-Weight Model Runs</h4>
<p>Here we were curious about the relative performance of each open-weight model. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250014982/65d351c3-f79d-4728-8bfe-ebf05828b54a.png" alt class="image--center mx-auto" /></p>
<p>The model labels in the Venn diagram are as follows:</p>
<ul>
<li>Qwen3 Coder 480B A35B Instruct: q480b</li>
<li>Kimi K2 Thinking: kimithink</li>
<li>Kimi K2 Instruct 0905: kimi0905</li>
</ul>
<p>You can see that more than half of the issues (56.8%) were resolved by all three models, exactly two models resolved another 10.8%, and exactly one model resolved a further 8.6%, leaving 23.8% unresolved in iteration one.</p>
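<p>The bucket percentages above come from counting, per issue, how many models resolved it. A sketch with toy sets standing in for the real resolved-issue data:</p>

```python
# Count issues by how many models resolved them; the toy sets below are
# placeholders, not the actual per-model results.
def coverage_buckets(resolved: dict[str, set[str]], all_issues: set[str]) -> dict[int, int]:
    """Map "number of models that resolved an issue" -> issue count."""
    counts = {n: 0 for n in range(len(resolved) + 1)}
    for issue in all_issues:
        counts[sum(issue in s for s in resolved.values())] += 1
    return counts

# Toy data using the same model labels as the Venn diagram.
toy = {
    "q480b": {"a", "b", "c"},
    "kimithink": {"a", "b"},
    "kimi0905": {"a", "d"},
}
```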
<h4 id="heading-consecutive-iterations">Consecutive Iterations</h4>
<p>Next, we wanted to compare relative improvement or regressions across Iterations 1, 2, and 3.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250066258/17abf74e-4a1f-4780-ba32-a4e0b112dba0.png" alt class="image--center mx-auto" /></p>
<p>As a baseline for Iteration 1 we used the 284 issues that all three models resolved. (Note: this is the strict intersection of issues resolved by all three models, not the total number of issues resolved across any model). For Iterations 2 and 3 we simply chose the number of issues resolved in that iteration, since these iterations only produced one patch per issue.</p>
<p>You can see from the <strong>Total Issues Resolved by Iteration</strong> bar chart that there is clear improvement across the iterations: we went from 284 issues resolved, to 345 (plus 61), to the final submission of 354 (plus 9).</p>
<p>And you can see from the <strong>Regression vs New Fixes Between Iterations</strong> chart that while there are a small number of regressions (issues resolved in the prior iteration but not in the current one), the net gain is still much greater: Iteration 2 had 5 regressions but 66 new fixes, and Iteration 3 had 2 regressions but 11 new fixes, leading to a net gain in both cases.</p>
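<p>The regression and new-fix counts are simple set arithmetic over the per-iteration resolved sets. A sketch with placeholder issue IDs:</p>

```python
# Compare two consecutive iterations' resolved-issue sets; the IDs used
# in the test are toy stand-ins for real SWE-bench instance IDs.
def compare_iterations(prev: set[str], curr: set[str]) -> dict:
    return {
        "regressions": prev - curr,   # resolved before, lost now
        "new_fixes": curr - prev,     # newly resolved in this iteration
        "net_gain": len(curr) - len(prev),
    }
```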
<h4 id="heading-potential-for-further-boosting-performance">Potential for Further Boosting Performance</h4>
<p>One final question we wanted to answer: if we had an oracle that could select the correct patch across all the iterations, how many issues would be resolved?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250077131/0de0a79c-54f6-4f09-8915-4b54ecd9a3a4.png" alt class="image--center mx-auto" /></p>
<p>The answer is 390. This is 78% of the 500 SWE-bench Verified issues. At the time of writing (Dec 09, 2025) that would put this oracle <strong>second on the overall leaderboard (across open-weight and closed-weight models)</strong>. This suggests that additional gains may be achievable through better patch selection, consensus methods, or reranking.</p>
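<p>One of the consensus ideas mentioned could be as simple as majority voting over normalized patch texts. A sketch only; real reranking would likely need test execution or a judge model:</p>

```python
# Toy consensus baseline: pick the most common non-empty candidate patch.
# This is an illustration of "consensus methods", not our actual selector.
from collections import Counter

def majority_patch(candidates: list[str]) -> str:
    normalized = [c.strip() for c in candidates if c.strip()]
    patch, _count = Counter(normalized).most_common(1)[0]
    return patch
```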
<h3 id="heading-conclusions-and-future-work">Conclusions and Future Work</h3>
<p>We conclude that combining a mixture of open-weight models and running iterations consistently improves overall system performance over any single-model run on a benchmark like SWE-bench.</p>
<p>Each successive iteration improves performance over the previous iteration with a very small number of regressions.</p>
<p>The oracle would resolve 36 more of the 500 issues (7.2 percentage points) than our final iteration did, which suggests an opportunity to further boost performance by finding better ways to pick correct candidate patches out of all the generated patches.</p>
]]></content:encoded></item><item><title><![CDATA[Analysis of Reasoning Trajectories - Comparing Closed Weight Models vs Open Weight Models - Claude Sonnet 4 vs Kimi K2 Instruct]]></title><description><![CDATA[Abstract
This study presents a comprehensive analysis of SWE-agent trajectories comparing Kimi K2 Instruct and Claude Sonnet 4 performance on software engineering tasks from the SWE-bench dataset. Through detailed examination of action category distr...]]></description><link>https://blog.codesweep.ai/analysis-of-reasoning-trajectories-comparing-closed-weight-models-vs-open-weight-models-claude-sonnet-4-vs-kimi-k2-instruct</link><guid isPermaLink="true">https://blog.codesweep.ai/analysis-of-reasoning-trajectories-comparing-closed-weight-models-vs-open-weight-models-claude-sonnet-4-vs-kimi-k2-instruct</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[LLM's ]]></category><dc:creator><![CDATA[Rishi Vaish]]></dc:creator><pubDate>Tue, 05 Aug 2025 04:53:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754420382475/9ddd6aae-70b3-48ec-bb8a-1cd26fbee400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-abstract">Abstract</h2>
<p>This study presents a comprehensive analysis of <a target="_blank" href="https://github.com/SWE-agent/SWE-agent">SWE-agent</a> trajectories comparing <a target="_blank" href="https://huggingface.co/moonshotai/Kimi-K2-Instruct">Kimi K2 Instruct</a> and <a target="_blank" href="https://www.anthropic.com/news/claude-4">Claude Sonnet 4</a> performance on software engineering tasks from the SWE-bench dataset. Through detailed examination of action category distributions, Sankey diagrams, Markov transition patterns, and step count distributions, our goal was to identify distinct behavioral patterns that reveal differences in problem-solving approaches between these two large language models when applied to automated software engineering. For the purpose of comparison, we ran SWE-agent with the Kimi K2 Instruct model and collected trajectories that we <a target="_blank" href="https://github.com/SWE-bench/experiments/pull/304">have submitted</a> to the <a target="_blank" href="https://www.swebench.com/">SWE-bench</a> team for publishing. For comparison with Claude Sonnet 4 we used the <a target="_blank" href="https://github.com/SWE-bench/experiments/tree/main/evaluation/verified/20250522_sweagent_claude-4-sonnet-20250514">previously published</a> run from the SWE-agent team.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>The SWE-bench benchmark is a key framework for assessing the capabilities of large language models in real-world software engineering tasks. Understanding how different models approach these complex, multi-step problems provides valuable insights for both model development and practical deployment considerations.</p>
<h2 id="heading-overall-performance-comparison">Overall Performance Comparison</h2>
<p>Before examining behavioral differences, it's important to establish the baseline performance metrics. With the SWE-agent scaffolding, <strong>Claude Sonnet 4 significantly outperformed Kimi K2 Instruct</strong> across the SWE-bench evaluation, achieving a <strong>69.0% overall success rate compared to Kimi K2 Instruct's 53.4%</strong>, a <strong>29%</strong> relative improvement in issue resolution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754345313316/240bed33-b65b-42b8-ab05-25dd32963c47.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754345349938/0e784199-f6b1-4b45-8845-4b05c73e892f.png" /></p>
<p>This performance advantage is consistent across repository types, with Claude demonstrating superior or equivalent performance on virtually all evaluated repositories. Notable performance gaps included:</p>
<ul>
<li><p><strong>Sphinx</strong>: Claude 82% vs Kimi 34% (2.4x improvement)</p>
</li>
<li><p><strong>Astropy</strong>: Claude 51% vs Kimi 27% (1.9x improvement)</p>
</li>
<li><p><strong>Django</strong>: Claude 74% vs Kimi 57% (30% improvement)</p>
</li>
<li><p><strong>SymPy</strong>: Claude 63% vs Kimi 48% (31% improvement)</p>
</li>
</ul>
<p>Only on <strong>Pylint</strong> did Kimi show superior performance (41% vs Claude's 9%) - this is worth a future deep dive.</p>
<p>Our subsequent analysis focuses on the subset of issues that <strong>both models successfully resolved</strong>. This controlled comparison allowed us to examine the fundamental differences in problem-solving approaches when both models achieved the same outcome.</p>
<h2 id="heading-methodology">Methodology</h2>
<p>The SWE-agent trajectories contain a thought-action-observation pattern, where "thought" represents the model's reasoning, "action" is the action that the model wants to take and "observation" is the output returned after taking the action. Each trajectory has several of these thought-action-observation steps that the model takes before an issue is resolved. In order to analyze the difference in model behavior we did the following:</p>
<ol>
<li><p><strong>Created Action Categories</strong>: First, we seeded an LLM prompt with action categories that we observed by manually inspecting a few trajectories. We then provided the LLM with 100 trajectories, 50 each from Claude Sonnet 4 and Kimi K2 Instruct, and asked it to add, update, or delete the seeded categories based on its analysis. For this purpose we used the Claude Sonnet 4 model API provided by Anthropic. Once we had a refined set of action categories, we inspected the list again and manually refined it to ensure it represented what we intended.</p>
</li>
<li><p><strong>Classified Trajectory Steps</strong>: Next, based on the final Action Categories, we asked an LLM to categorize each step in 207 trajectory pairs, each pair containing the Claude Sonnet 4 and Kimi K2 Instruct trajectory for the same successfully resolved issue.</p>
</li>
<li><p><strong>Analyzed Differences</strong>: After that, we analyzed the differences between the two models at an aggregate level, and then looked at pair-wise differences for the two models across the same issue. Here are the techniques we used:</p>
<ul>
<li><strong>Action Category Distribution</strong>: percentage allocation of steps across different software engineering activities.</li>
<li><strong>Trajectory Length Summarization</strong>: illustrating the difference between the two models in the number of steps needed to resolve issues.</li>
<li><strong>Sankey Diagrams</strong>: visualizing aggregate "from" to "to" action transitions for the two models.</li>
<li><strong>Markov Transition Analysis</strong>: quantifying aggregate "from" to "to" action transition probabilities for the two models.</li>
<li><strong>Pairwise Trajectory Visualization</strong>: making it easy to compare the difference between the two models' approaches for an individual issue.</li>
</ul>
</li>
</ol>
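<p>The step record and the per-step classification from steps 1 and 2 above can be sketched as follows. The field names, category list, and prompt wording are illustrative assumptions rather than the exact ones from the study:</p>

```python
# Sketch of a thought-action-observation step and a classification prompt
# for labeling it; categories and wording are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str       # the model's reasoning before acting
    action: str        # the tool call / command the model issued
    observation: str   # the output returned after executing the action

CATEGORIES = [
    "Code Exploration", "Code Modification", "Test Script Creation",
    "Test Script Execution", "Submission",
]

def classification_prompt(step: Step) -> str:
    """Build a prompt asking an LLM to label one trajectory step."""
    options = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Classify the following agent step into exactly one category.\n\n"
        f"Categories:\n{options}\n\n"
        f"Thought: {step.thought}\nAction: {step.action}\n\n"
        "Answer with the category name only."
    )
```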
<h2 id="heading-results">Results</h2>
<h3 id="heading-action-categories">Action Categories</h3>
<p>Here are the different action categories we used for the final analysis and how often the two models used them in their trajectories.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349672196/bc91d7ca-e21a-4021-8fa4-d9a2684a936e.png" alt class="image--center mx-auto" /></p>
<p><strong>Claude Sonnet 4</strong> does around 30% more Test Script Creation and 80% more Test Script Execution than <strong>Kimi K2 Instruct</strong>, which spends around 66% more of its time modifying code.</p>
<h3 id="heading-trajectory-length">Trajectory Length</h3>
<p>Below are the differences in trajectory length (number of steps) between the two models:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349869794/4fa84536-87c0-4ad1-b989-ffe3090ff531.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349864883/820b7e29-365a-473c-a806-6e6a552330a2.png" /></p>
<p>The step distribution analysis reveals significant differences in trajectory length requirements between the models.</p>
<p><strong>Claude Sonnet 4</strong> requires approximately 2.7x more steps to reach resolution. The box plot analysis above shows that its step counts are higher but have lower variance, suggesting a more consistent yet lengthier methodology.</p>
<p><strong>Kimi K2 Instruct</strong> showed a strong concentration of solutions in the 15-25 step range with some outliers going up to around 100 steps.</p>
<h3 id="heading-sankey-workflow-diagram-comparison">Sankey Workflow Diagram Comparison</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349794599/29082bc6-fd64-4b01-b18d-9ea6225fae8a.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349801977/03506673-e00f-4c34-b89f-cd56bae5a6f0.png" /></p>
<p>The Sankey diagrams above provide visual confirmation of distinct workflow patterns employed by each model, revealing the complexity and routing patterns of their problem-solving approaches.</p>
<p><strong>Claude Sonnet 4</strong> exhibits significantly higher workflow complexity with <strong>70 distinct transition types</strong> (≥5 occurrences) compared to <strong>Kimi K2 Instruct's 35 transitions</strong>. This 2:1 ratio indicates that Claude employs a more intricate approach to problem resolution, trying different paths with greater interconnection between activity categories.</p>
<h3 id="heading-markov-transition-analysis">Markov Transition Analysis</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349732552/3c5a5423-d474-4ecf-a38b-472ec9285571.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349741370/51325e3f-6553-47e1-a3ec-630e4276ce1f.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349765995/0da52aac-07c5-479e-8fd2-eeff9ead2813.png" /></p>
<p>Building on the Sankey workflow visualizations, the transition probability matrices (filtered for categories with ≥1.5% involvement) show behavioral patterns characteristic of each model's problem-solving approach.</p>
<p><strong>Claude Sonnet 4</strong> transitions from Test Script Execution back to Test Script Execution with 37% probability, as opposed to <strong>Kimi K2 Instruct's</strong> 3%, indicating multiple iterations of running tests. Kimi K2 Instruct transitions from Test Script Execution directly to Code Modification 46% of the time against Claude Sonnet 4's 10%, which helps explain its shorter trajectories.</p>
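<p>The transition probabilities in these matrices come from counting consecutive category pairs and normalizing per source category. A minimal sketch with toy sequences in place of the real categorized trajectories:</p>

```python
# Build a row-normalized Markov transition matrix from category sequences.
from collections import Counter, defaultdict

def transition_matrix(sequences: list[list[str]]) -> dict[str, dict[str, float]]:
    """Map each "from" category to its "to" category probabilities."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive (from, to) pairs
            counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: n / total for b, n in row.items()}
    return probs
```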
<h3 id="heading-visualizing-pairwise-trajectories-for-individual-issues">Visualizing Pairwise Trajectories for Individual Issues</h3>
<table>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361778710/ea166f4b-f410-4972-b4f2-00bd1b33bed4.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361797217/091d2a17-c43f-446e-a069-592b3f96d80d.png" /></td>
      </tr>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361802331/7f51db6a-6cbd-417f-bbdf-71a69ad85575.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361808264/b97aa58c-13e7-40d3-9911-fee5d5d47934.png" /></td>
      </tr>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361815219/a3b13d29-a7a0-445e-bfff-fdf92fce0dc9.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361792562/07228f90-ea64-4d9e-82f0-61789ed62ef4.png" /></td>
      </tr>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361769827/7729c3f9-850a-4459-985c-07df48e676d2.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361764188/a45a326d-3d56-4a60-b075-c38c2e51bf18.png" /></td>
      </tr>
</table>

<p>To illustrate what the difference in steps means for the same issue on a head-to-head basis, the figures above color-code the trajectory steps and plot the two models' trajectories one below the other in pairs. The color coding for the different steps is available in the appendix. This further illustrates the difference in trajectory length as well as Claude Sonnet 4's emphasis on Test Script Creation and Test Script Execution.</p>
<h2 id="heading-limitations-and-future-work">Limitations and Future Work</h2>
<p>This analysis focused exclusively on successfully resolved issues, which may introduce bias. The behavioral patterns observed might not generalize to failed attempts or different problem domains within software engineering.</p>
<p>Potential topics for future research include:</p>
<ul>
<li>Analysis of failed trajectory patterns to understand model limitations</li>
<li>Investigation of problem complexity correlation with step requirements</li>
<li>Examination of solution quality metrics beyond binary resolution success</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Our analysis revealed different approaches to automated software engineering between Kimi K2 Instruct and Claude Sonnet 4. While both models achieve successful resolution, with Claude Sonnet 4 leading substantially, they employ distinct strategies. Claude prioritizes iterative verification through testing, while Kimi emphasizes efficient exploration and direct problem-solving paths.</p>
<hr />
<p><em>This analysis is based on SWE-agent trajectory data comparing Kimi K2 Instruct runs with previously collected Claude Sonnet 4 trajectories on mutually resolved SWE-bench issues.</em></p>
<h2 id="heading-appendix">Appendix</h2>
<p><strong>Action Category Labels and Color Codes</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754351974848/f55d31d6-1ac4-470f-a595-c92dc07ecb5b.png" /></p>
]]></content:encoded></item></channel></rss>