Mixture of Open-Weight Models with Iterative Patch Generation Improves Performance on SWE-bench

Goal

Our goal in this study was to explore whether a mixture of open-weight models, combined through an iterative process, can outperform any single model on the SWE-bench Verified benchmark. Specifically, we wanted to evaluate if patches generated by multiple models can provide a useful signal that improves subsequent rounds of patch generation.

Models

We selected three open-weight models for this experiment:

Qwen3 Coder 480B A35B Instruct
Kimi K2 Thinking
Kimi K2 Instruct 0905

Each model had access to the same tool suite and was run under identical constraints to ensure fair comparison. We used Fireworks AI as the provider for these models.

Tools

All runs were conducted using our tool-calling scaffolding. This scaffolding is inspired by the SWE-agent tool-calling scaffolding.

Our scaffolding supports custom prompts for each iteration and circuit breakers in the agentic loop to account for factors like:

Trajectory length
Cost
Consecutive timeouts
Un-parsable model responses
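
As a rough illustration of the circuit breakers listed above, the check run between agent steps might look like the sketch below. The class names, field names, and threshold values are all hypothetical, and treating un-parsable responses as a consecutive count is an assumption; none of this is the scaffolding's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    # Illustrative thresholds only; the study does not state the real values.
    max_steps: int = 100
    max_cost_usd: float = 2.0
    max_consecutive_timeouts: int = 3
    max_consecutive_parse_errors: int = 3


@dataclass
class RunState:
    steps: int = 0
    cost_usd: float = 0.0
    consecutive_timeouts: int = 0
    consecutive_parse_errors: int = 0


def tripped_breaker(state: RunState, limits: Limits) -> Optional[str]:
    """Return the name of the first tripped breaker, or None to keep looping."""
    if state.steps >= limits.max_steps:
        return "trajectory_length"
    if state.cost_usd >= limits.max_cost_usd:
        return "cost"
    if state.consecutive_timeouts >= limits.max_consecutive_timeouts:
        return "consecutive_timeouts"
    if state.consecutive_parse_errors >= limits.max_consecutive_parse_errors:
        return "unparsable_responses"
    return None
```

An agent loop would call `tripped_breaker` after every step and stop the run as soon as it returns a name.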

The models were guided entirely by:

Their “thinking”
The tool-calls they generated
The tool-call responses

The available tools were:

str_replace_editor - a tool for viewing and editing files
bash - a tool to provide the model access to a bash shell
submit - a simple tool the model can call when it’s done with the task

The tool-calls were executed in docker containers based on the SWE-bench docker images.

Models did not receive information or hints outside of what these tools provided.

Methodology

Our process was carried out in three iterations.

Iteration 1 - Independent Model Runs

We first ran our tool-calling scaffolding with the three models independently. This gave us the first pool of candidate patches.

Iteration 2 - Single Model With Patches From Multiple Models

For the second iteration, we selected Kimi K2 Instruct 0905 as our single model. Its prompt was augmented with the patches produced in iteration 1.

The intuition is simple: if each model acts like a junior engineer exploring different directions, then collecting their patches provides additional structure that may give another model a better starting point.
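
The prompt augmentation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the section wording, and the choice to label each patch are hypothetical, and the actual prompt format used in the study is not given here.

```python
def build_augmented_prompt(issue_text: str, patches: dict[str, str]) -> str:
    """Append candidate patches from the previous iteration to the issue text.

    `patches` maps a hypothetical label (e.g. a model name) to a unified diff.
    """
    sections = [issue_text.strip()]
    if patches:
        sections.append(
            "Candidate patches from earlier attempts (they may be wrong or incomplete):"
        )
        # Sort by label only to make the prompt deterministic; no ranking is implied.
        for label, diff in sorted(patches.items()):
            sections.append(f"--- candidate: {label} ---\n{diff.strip()}")
    return "\n\n".join(sections)
```

If no patches are available for an issue, the function returns the issue text unchanged, which matches the case where every iteration 1 run ended without edits.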

Iteration 3 - Single Model Refinement

In the third iteration, we again used Kimi K2 Instruct 0905. Its prompt was augmented with the patch generated in iteration 2.

The intuition mirrors how an engineer might refine their own previous attempt: with additional context and a clearer search space, the model may be able to resolve more issues.

Handling Limits

When a circuit breaker was hit we needed to handle two scenarios:

No files were edited - in this case the run did not produce a patch, and we continued without it
Files were edited, but the model didn't call the submit tool - in this case we proceeded by creating a patch with the available edits
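
The two scenarios above can be sketched as a small decision function. How the patch was actually created is not stated; using `git diff HEAD` inside the container is an assumption, and both function names are hypothetical.

```python
import subprocess
from typing import Optional


def working_tree_diff(repo_dir: str) -> str:
    """Diff of tracked files against HEAD; untracked new files are not included."""
    result = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def patch_to_carry_forward(diff_text: str) -> Optional[str]:
    """Apply the two rules above to whatever edits exist when a breaker trips."""
    if not diff_text.strip():
        return None  # no files were edited: continue without a patch
    return diff_text  # edits exist but submit was never called: keep them
```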

Handling Patches From One Iteration To The Next

All iterations were performed sequentially with no intermediate ranking or categorization of patches. All patches generated in one iteration were passed into the subsequent iteration as an input.

Results

Using this mixture-of-models iterative approach, we observed consistent improvement over any individual model’s performance. With our scaffolding and methodology, we reached 70.4% pass@1 (per the SWE-bench definition) through the above agentic loop, which, at the time of writing (Dec 09, 2025), placed us second among all open-weight models on the SWE-bench Verified leaderboard.

A more detailed breakdown is included below.

Detailed Analysis

We wanted to see whether there was relative improvement (or regression) across the iterations. In order to do this, we used the SWE-bench harness to evaluate whether the intermediate patches were correct:

Did they resolve the issue, and
As we progressed through the iterations, did the resolution rate improve.

Note that this analysis was done after our final submission to the SWE-bench benchmark. Intermediate resolution evaluation was not used during the iteration runs themselves.

Iteration 1 - Mixture of Open-Weight Model Runs

Here we were curious about the relative performance of each open-weight model.

The model labels in the Venn diagram are as follows:

Qwen3 Coder 480B A35B Instruct: q480b
Kimi K2 Thinking: kimithink
Kimi K2 Instruct 0905: kimi0905

More than half of the issues (56.8%) were resolved by all three models. Another 10.8% were resolved by exactly two models and 8.6% by a single model, leaving 23.8% unresolved in iteration one.
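
A breakdown like this can be computed from the per-model sets of resolved issue IDs. The sketch below is illustrative only: the function name and the toy issue IDs are hypothetical, and only the model labels come from the diagram.

```python
from collections import Counter


def resolved_by_count(resolved: dict[str, set[str]], all_issues: set[str]) -> dict[int, int]:
    """Map k to the number of issues resolved by exactly k models."""
    per_issue: Counter = Counter()
    for ids in resolved.values():
        per_issue.update(ids & all_issues)  # each model counts once per issue
    buckets = {k: 0 for k in range(len(resolved) + 1)}
    for issue in all_issues:
        buckets[per_issue[issue]] += 1
    return buckets


# Toy data, not the study's results.
toy_issues = {"i1", "i2", "i3", "i4", "i5"}
toy_resolved = {
    "q480b": {"i1", "i2", "i3"},
    "kimithink": {"i1", "i2"},
    "kimi0905": {"i1", "i4"},
}
toy_buckets = resolved_by_count(toy_resolved, toy_issues)
```

Applied to the 500 SWE-bench Verified issues, the percentages above correspond to 284 issues resolved by all three models, 54 by exactly two, 43 by exactly one, and 119 by none.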

Consecutive Iterations

Next, we wanted to compare relative improvements or regressions across Iterations 1, 2 and 3.

As a baseline for Iteration 1 we used the 284 issues that all three models resolved. (Note: this is the strict intersection of issues resolved by all three models, not the total number of issues resolved across any model). For Iterations 2 and 3 we simply chose the number of issues resolved in that iteration, since these iterations only produced one patch per issue.

You can see from the Total Issues Resolved by Iteration bar chart that the number of resolved issues increases with each iteration. Across the three iterations we went from 284 issues resolved, to 345 issues resolved (plus 61), to the final submission of 354 issues (plus 9).

And you can see from the Regression vs New Fixes Between Iterations chart that while there are a small number of regressions (issues that were resolved in the prior iteration, but were not resolved in the current iteration), the net gain is still much greater. Iteration 2 had 5 regressions but 66 new fixes (net plus 61), and Iteration 3 had 2 regressions but 11 new fixes (net plus 9).
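
Regressions and new fixes are set differences between the resolved issue IDs of consecutive iterations. The helper below is a hypothetical sketch of that comparison, not the analysis code used for the charts.

```python
def compare_iterations(prev: set[str], curr: set[str]) -> dict[str, int]:
    """Count regressions and new fixes going from one iteration to the next."""
    regressions = prev - curr  # resolved before, not resolved now
    new_fixes = curr - prev    # not resolved before, resolved now
    return {
        "regressions": len(regressions),
        "new_fixes": len(new_fixes),
        "net": len(new_fixes) - len(regressions),
    }


# Toy example with hypothetical issue IDs.
toy_delta = compare_iterations({"a", "b", "c"}, {"b", "c", "d", "e"})
```

The reported numbers are consistent with this definition: 284 + 66 - 5 = 345 and 345 + 11 - 2 = 354.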

Potential for Further Boosting Performance

One final question we wanted to answer: if we had an oracle that allowed us to select correct patches across all the iterations, how many issues would be resolved?

The answer is 390. This is 78% of the 500 SWE-bench Verified issues. At the time of writing (Dec 09, 2025) that would put this oracle second on the overall leaderboard (across open-weight and closed-weight models). This suggests that additional gains may be achievable through better patch selection, consensus methods, or reranking.
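
The oracle count is the size of the union of resolved issues over every run in every iteration. The sketch below uses hypothetical function names and toy IDs; only the 390, 354, and 500 figures come from the analysis in this post.

```python
def oracle_resolved(resolved_per_run: list[set[str]]) -> set[str]:
    """Issues resolved by at least one run: the ceiling for any patch selector."""
    return set().union(*resolved_per_run)


def headroom(oracle_count: int, final_count: int, total: int) -> float:
    """Fraction of all issues the oracle resolves beyond the final submission."""
    return (oracle_count - final_count) / total


# Toy example: three runs over hypothetical issue IDs.
toy_oracle = oracle_resolved([{"a", "b"}, {"b", "c"}, {"d"}])

# Reported figures: oracle 390, final iteration 354, 500 issues in total.
oracle_rate = 390 / 500
extra = headroom(390, 354, 500)
```

With the reported figures, `oracle_rate` is 0.78 and `extra` is 0.072, that is, 36 additional issues.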

Conclusions and Future Work

We conclude that combining a mixture of open-weight models and running iterations improves overall system performance over any single model run on a benchmark like SWE-bench.

Each successive iteration improves performance over the previous iteration with a very small number of regressions.

The oracle would resolve 36 more issues than our final iteration (390 vs. 354, or 7.2% of the 500 issues). This suggests that there is an opportunity to further boost performance by exploring ways to pick correct candidate patches out of all the generated patches.
