<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[CodeSweep Blog]]></title><description><![CDATA[CodeSweep Blog]]></description><link>https://blog.codesweep.ai</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 18:16:57 GMT</lastBuildDate><atom:link href="https://blog.codesweep.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Mixture of Open-Weight Models with Iterative Patch Generation Improves Performance on SWE-bench]]></title><description><![CDATA[Goal
Our goal in this study was to explore whether a mixture of open-weight models, combined through an iterative process, can outperform any single model on the SWE-bench Verified benchmark. Specifically, we wanted to evaluate if patches generated b...]]></description><link>https://blog.codesweep.ai/mixture-of-open-weight-models-with-iterative-patch-generation-improves-performance-on-swe-bench</link><guid isPermaLink="true">https://blog.codesweep.ai/mixture-of-open-weight-models-with-iterative-patch-generation-improves-performance-on-swe-bench</guid><category><![CDATA[AI]]></category><category><![CDATA[SWE-bench]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[devtools]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Rishi Vaish]]></dc:creator><pubDate>Tue, 09 Dec 2025 19:32:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250481490/846df239-5044-4eb1-b624-b8abe20609ea.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-goal">Goal</h3>
<p>Our goal in this study was to explore whether a mixture of open-weight models, combined through an iterative process, can outperform any single model on the SWE-bench Verified benchmark. Specifically, we wanted to evaluate if patches generated by multiple models can provide a useful signal that improves subsequent rounds of patch generation.</p>
<h3 id="heading-models">Models</h3>
<p>We selected three open-weight models for this experiment:</p>
<ul>
<li>Qwen3 Coder 480B A35B Instruct</li>
<li>Kimi K2 Thinking</li>
<li>Kimi K2 Instruct 0905</li>
</ul>
<p>Each model had access to the same tool suite and was run under identical constraints to ensure fair comparison. We used <a target="_blank" href="https://fireworks.ai/">Fireworks AI</a> as the provider for these models.</p>
<h3 id="heading-tools">Tools</h3>
<p>All runs were conducted using our tool-calling scaffolding, which is inspired by the <a target="_blank" href="https://github.com/SWE-agent/SWE-agent">SWE-agent</a> tool-calling scaffolding.</p>
<p>Our scaffolding supports custom prompts for each iteration and circuit breakers in the agentic loop to account for factors like:</p>
<ul>
<li>Trajectory length</li>
<li>Cost</li>
<li>Consecutive timeouts</li>
<li>Un-parsable model responses</li>
</ul>
<p>The models were guided entirely by:</p>
<ul>
<li>Their “thinking”</li>
<li>The tool-calls they generated</li>
<li>The tool-call responses</li>
</ul>
<p>The available tools were:</p>
<ul>
<li>str_replace_editor - a tool for viewing and editing files</li>
<li>bash - a tool that gives the model access to a bash shell</li>
<li>submit - a simple tool the model calls when it’s done with the task</li>
</ul>
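<p>For illustration, the three tools could be declared in an OpenAI-style function-calling schema along these lines. The exact schemas we used are not shown in this post, so treat every field below as an assumption:</p>

```python
# Illustrative tool declarations in an OpenAI-style function-calling
# format; parameter names and descriptions are assumptions, not the
# actual schemas used in the experiment.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "str_replace_editor",
            "description": "View files or replace an exact string in a file.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string",
                                "enum": ["view", "str_replace", "create"]},
                    "path": {"type": "string"},
                    "old_str": {"type": "string"},
                    "new_str": {"type": "string"},
                },
                "required": ["command", "path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "bash",
            "description": "Run a shell command inside the task container.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "submit",
            "description": "Signal that the task is complete.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]
```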
<p>The tool-calls were executed in Docker containers based on the SWE-bench Docker images.</p>
<p>Models did not receive information or hints outside of what these tools provided.</p>
<h3 id="heading-methodology">Methodology</h3>
<p>Our process was carried out in three iterations.</p>
<h4 id="heading-iteration-1-independent-model-runs">Iteration 1 - Independent Model Runs</h4>
<p>We first ran our tool-calling scaffolding with the three models <em>independently</em>. This gave us the first pool of candidate patches.</p>
<h4 id="heading-iteration-2-single-model-with-patches-from-multiple-models">Iteration 2 - Single Model With Patches From Multiple Models</h4>
<p>For the second iteration, we selected <strong>Kimi K2 Instruct 0905</strong> as our single model. Its prompt was augmented with the patches produced in iteration 1.</p>
<p>The intuition is simple: if each model acts like a junior engineer exploring different directions, then collecting their patches provides additional structure that may give another model a better starting point.</p>
<h4 id="heading-iteration-3-single-model-refinement">Iteration 3 - Single Model Refinement</h4>
<p>In the third iteration, we again used <strong>Kimi K2 Instruct 0905</strong>. Its prompt was augmented with the patch generated in iteration 2.</p>
<p>The intuition mirrors how an engineer refines their own previous attempt: with additional context and a clearer view of the search space, the model may be able to resolve more issues.</p>
<h4 id="heading-handling-limits">Handling Limits</h4>
<p>When a circuit breaker was hit, we handled two scenarios:</p>
<ul>
<li>No files were edited - the run produced no patch, so we continued without it</li>
<li>Files were edited but the model never called the submit tool - we created a patch from the edits that were made</li>
</ul>
<h4 id="heading-handling-patches-from-one-iteration-to-the-next">Handling Patches From One Iteration To The Next</h4>
<p>All iterations were performed sequentially with no intermediate ranking or categorization of patches. All patches generated in one iteration were passed into the subsequent iteration as an input.</p>
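<p>A minimal sketch of how one iteration's patches might be folded into the next iteration's prompt. The template wording below is illustrative, not our actual prompt:</p>

```python
# Hypothetical prompt-augmentation helper: wrap each candidate patch from
# the prior iteration in a tagged block and append it to the issue text.
def augment_prompt(issue_text: str, candidate_patches: list[str]) -> str:
    blocks = "\n\n".join(
        f'<candidate_patch source="model_{i}">\n{p}\n</candidate_patch>'
        for i, p in enumerate(candidate_patches, start=1)
    )
    return (
        f"{issue_text}\n\n"
        "Below are candidate patches produced by other agents for this issue. "
        "They may be partially or fully incorrect; use them only as hints.\n\n"
        f"{blocks}"
    )
```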
<h3 id="heading-results">Results</h3>
<p>Using this mixture-of-models iterative approach, we observed consistent improvement over any individual model’s performance. With our scaffolding and methodology, we reached a pass@1 score (per the SWE-bench definition) of <strong>70.4%</strong> through the above agentic loop, which, at the time of writing (Dec 09, 2025), placed us <strong>second among all open-weight models</strong> on the SWE-bench Verified leaderboard.</p>
<p>A more detailed breakdown is included below.</p>
<h3 id="heading-detailed-analysis">Detailed Analysis</h3>
<p>We wanted to see whether there was relative improvement (or regression) across the iterations. To do this, we used the SWE-bench harness to evaluate the intermediate patches:</p>
<ul>
<li>Did they resolve the issue?</li>
<li>Did the resolution rate improve as we progressed through the iterations?</li>
</ul>
<p>Note that this analysis was done after our final submission to the SWE-bench benchmark. Intermediate resolution evaluation was not used during the iteration runs themselves.</p>
<h4 id="heading-iteration-1-mixture-of-open-weight-model-runs">Iteration 1 - Mixture of Open-Weight Model Runs</h4>
<p>Here we were curious about the relative performance of each open-weight model. </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250014982/65d351c3-f79d-4728-8bfe-ebf05828b54a.png" alt class="image--center mx-auto" /></p>
<p>The model labels in the Venn diagram are as follows:</p>
<ul>
<li>Qwen3 Coder 480B A35B Instruct: q480b</li>
<li>Kimi K2 Thinking: kimithink</li>
<li>Kimi K2 Instruct 0905: kimi0905</li>
</ul>
<p>You can see that more than half of the issues (56.8%) were resolved by all three models, exactly two models resolved another 10.8%, and exactly one model resolved a further 8.6%, leaving 23.8% unresolved in iteration one.</p>
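<p>The bucket percentages above come from counting, per issue, how many models resolved it. A sketch with toy sets standing in for the real resolved-issue data:</p>

```python
# Count issues by how many models resolved them; the toy sets below are
# placeholders, not the actual per-model results.
def coverage_buckets(resolved: dict[str, set[str]], all_issues: set[str]) -> dict[int, int]:
    """Map "number of models that resolved an issue" -> issue count."""
    counts = {n: 0 for n in range(len(resolved) + 1)}
    for issue in all_issues:
        counts[sum(issue in s for s in resolved.values())] += 1
    return counts

# Toy data using the same model labels as the Venn diagram.
toy = {
    "q480b": {"a", "b", "c"},
    "kimithink": {"a", "b"},
    "kimi0905": {"a", "d"},
}
```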
<h4 id="heading-consecutive-iterations">Consecutive Iterations</h4>
<p>Next, we wanted to compare relative improvement or regressions across Iterations 1, 2, and 3.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250066258/17abf74e-4a1f-4780-ba32-a4e0b112dba0.png" alt class="image--center mx-auto" /></p>
<p>As a baseline for Iteration 1 we used the 284 issues that all three models resolved. (Note: this is the strict intersection of issues resolved by all three models, not the total number of issues resolved across any model). For Iterations 2 and 3 we simply chose the number of issues resolved in that iteration, since these iterations only produced one patch per issue.</p>
<p>You can see from the <strong>Total Issues Resolved by Iteration</strong> bar chart that there is clear improvement across the iterations: we went from 284 issues resolved, to 345 (plus 61), to the final submission of 354 (plus 9).</p>
<p>And you can see from the <strong>Regression vs New Fixes Between Iterations</strong> chart that while there are a small number of regressions (issues resolved in the prior iteration but not in the current one), the net gain is still much greater: Iteration 2 had 5 regressions but 66 new fixes, and Iteration 3 had 2 regressions but 11 new fixes, leading to a net gain in both cases.</p>
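<p>The regression and new-fix counts are simple set arithmetic over the per-iteration resolved sets. A sketch with placeholder issue IDs:</p>

```python
# Compare two consecutive iterations' resolved-issue sets; the IDs used
# in the test are toy stand-ins for real SWE-bench instance IDs.
def compare_iterations(prev: set[str], curr: set[str]) -> dict:
    return {
        "regressions": prev - curr,   # resolved before, lost now
        "new_fixes": curr - prev,     # newly resolved in this iteration
        "net_gain": len(curr) - len(prev),
    }
```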
<h4 id="heading-potential-for-further-boosting-performance">Potential for Further Boosting Performance</h4>
<p>One final question we wanted to answer: if we had an oracle that could select the correct patch across all the iterations, how many issues would be resolved?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765250077131/0de0a79c-54f6-4f09-8915-4b54ecd9a3a4.png" alt class="image--center mx-auto" /></p>
<p>The answer is 390. This is 78% of the 500 SWE-bench Verified issues. At the time of writing (Dec 09, 2025) that would put this oracle <strong>second on the overall leaderboard (across open-weight and closed-weight models)</strong>. This suggests that additional gains may be achievable through better patch selection, consensus methods, or reranking.</p>
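<p>One of the consensus ideas mentioned could be as simple as majority voting over normalized patch texts. A sketch only; real reranking would likely need test execution or a judge model:</p>

```python
# Toy consensus baseline: pick the most common non-empty candidate patch.
# This is an illustration of "consensus methods", not our actual selector.
from collections import Counter

def majority_patch(candidates: list[str]) -> str:
    normalized = [c.strip() for c in candidates if c.strip()]
    patch, _count = Counter(normalized).most_common(1)[0]
    return patch
```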
<h3 id="heading-conclusions-and-future-work">Conclusions and Future Work</h3>
<p>We conclude that combining a mixture of open-weight models and running iterations consistently improves overall system performance over any single-model run on a benchmark like SWE-bench.</p>
<p>Each successive iteration improves performance over the previous iteration with a very small number of regressions.</p>
<p>The oracle would resolve 36 more of the 500 issues (7.2 percentage points) than our final iteration did, which suggests an opportunity to further boost performance by finding better ways to pick correct candidate patches out of all the generated patches.</p>
]]></content:encoded></item><item><title><![CDATA[Analysis of Reasoning Trajectories - Comparing Closed Weight Models vs Open Weight Models - Claude Sonnet 4 vs Kimi K2 Instruct]]></title><description><![CDATA[Abstract
This study presents a comprehensive analysis of SWE-agent trajectories comparing Kimi K2 Instruct and Claude Sonnet 4 performance on software engineering tasks from the SWE-bench dataset. Through detailed examination of action category distr...]]></description><link>https://blog.codesweep.ai/analysis-of-reasoning-trajectories-comparing-closed-weight-models-vs-open-weight-models-claude-sonnet-4-vs-kimi-k2-instruct</link><guid isPermaLink="true">https://blog.codesweep.ai/analysis-of-reasoning-trajectories-comparing-closed-weight-models-vs-open-weight-models-claude-sonnet-4-vs-kimi-k2-instruct</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[LLM's ]]></category><dc:creator><![CDATA[Rishi Vaish]]></dc:creator><pubDate>Tue, 05 Aug 2025 04:53:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754420382475/9ddd6aae-70b3-48ec-bb8a-1cd26fbee400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-abstract">Abstract</h2>
<p>This study presents a comprehensive analysis of <a target="_blank" href="https://github.com/SWE-agent/SWE-agent">SWE-agent</a> trajectories comparing <a target="_blank" href="https://huggingface.co/moonshotai/Kimi-K2-Instruct">Kimi K2 Instruct</a> and <a target="_blank" href="https://www.anthropic.com/news/claude-4">Claude Sonnet 4</a> performance on software engineering tasks from the SWE-bench dataset. Through detailed examination of action category distributions, Sankey diagrams, Markov transition patterns, and step count distributions, our goal was to identify distinct behavioral patterns that reveal differences in problem-solving approaches between these two large language models when applied to automated software engineering. For the purpose of comparison, we ran SWE-agent with the Kimi K2 Instruct model and collected trajectories that we <a target="_blank" href="https://github.com/SWE-bench/experiments/pull/304">have submitted</a> to the <a target="_blank" href="https://www.swebench.com/">SWE-bench</a> team for publishing. For comparison with Claude Sonnet 4 we used the <a target="_blank" href="https://github.com/SWE-bench/experiments/tree/main/evaluation/verified/20250522_sweagent_claude-4-sonnet-20250514">previously published</a> run from the SWE-agent team.</p>
<h2 id="heading-introduction">Introduction</h2>
<p>The SWE-bench benchmark is a key framework for assessing the capabilities of large language models in real-world software engineering tasks. Understanding how different models approach these complex, multi-step problems provides valuable insights for both model development and practical deployment considerations.</p>
<h2 id="heading-overall-performance-comparison">Overall Performance Comparison</h2>
<p>Before examining behavioral differences, it's important to establish the baseline performance metrics. With the SWE-agent scaffolding, <strong>Claude Sonnet 4 significantly outperformed Kimi K2 Instruct</strong> across the SWE-bench evaluation, achieving a <strong>69.0% overall success rate compared to Kimi K2 Instruct's 53.4%</strong>, a <strong>29%</strong> relative improvement in issue resolution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754345313316/240bed33-b65b-42b8-ab05-25dd32963c47.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754345349938/0e784199-f6b1-4b45-8845-4b05c73e892f.png" /></p>
<p>This performance advantage is consistent across repository types, with Claude demonstrating superior or equivalent performance on virtually all evaluated repositories. Notable performance gaps included:</p>
<ul>
<li><p><strong>Sphinx</strong>: Claude 82% vs Kimi 34% (2.4x improvement)</p>
</li>
<li><p><strong>Astropy</strong>: Claude 51% vs Kimi 27% (1.9x improvement)</p>
</li>
<li><p><strong>Django</strong>: Claude 74% vs Kimi 57% (30% improvement)</p>
</li>
<li><p><strong>SymPy</strong>: Claude 63% vs Kimi 48% (31% improvement)</p>
</li>
</ul>
<p>Only on <strong>Pylint</strong> did Kimi show superior performance (41% vs Claude's 9%) - this is worth a future deep dive.</p>
<p>Our subsequent analysis focuses on the subset of issues that <strong>both models successfully resolved</strong>. This controlled comparison allowed us to examine the fundamental differences in problem-solving approaches when both models achieved the same outcome.</p>
<h2 id="heading-methodology">Methodology</h2>
<p>The SWE-agent trajectories contain a thought-action-observation pattern, where "thought" represents the model's reasoning, "action" is the action that the model wants to take and "observation" is the output returned after taking the action. Each trajectory has several of these thought-action-observation steps that the model takes before an issue is resolved. In order to analyze the difference in model behavior we did the following:</p>
<ol>
<li><p><strong>Created Action Categories</strong>: First, we seeded an LLM prompt with action categories that we observed by manually inspecting a few trajectories. We then provided the LLM with 100 trajectories, 50 each from Claude Sonnet 4 and Kimi K2 Instruct, and asked it to add, update, or delete the seeded categories based on its analysis. For this purpose we used the Claude Sonnet 4 model API provided by Anthropic. Once we had a refined set of action categories, we inspected the list again and manually refined it to ensure it represented what we intended.</p>
</li>
<li><p><strong>Classified Trajectory Steps</strong>: Next, based on the final Action Categories, we asked an LLM to categorize each step in 207 trajectory pairs, each pair containing the Claude Sonnet 4 and Kimi K2 Instruct trajectory for the same successfully resolved issue.</p>
</li>
<li><p><strong>Analyzed Differences</strong>: After that, we analyzed the differences between the two models at an aggregate level, and then looked at pair-wise differences for the two models across the same issue. Here are the techniques we used:</p>
<ul>
<li><strong>Action Category Distribution</strong>: percentage allocation of steps across different software engineering activities.</li>
<li><strong>Trajectory Length Summarization</strong>: illustrating the difference between the two models in the number of steps needed to resolve issues.</li>
<li><strong>Sankey Diagrams</strong>: visualizing aggregate "from" to "to" action transitions for the two models.</li>
<li><strong>Markov Transition Analysis</strong>: quantifying aggregate "from" to "to" action transition probabilities for the two models.</li>
<li><strong>Pairwise Trajectory Visualization</strong>: making it easy to compare the difference between the two models' approaches for an individual issue.</li>
</ul>
</li>
</ol>
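<p>The step record and the per-step classification from steps 1 and 2 above can be sketched as follows. The field names, category list, and prompt wording are illustrative assumptions rather than the exact ones from the study:</p>

```python
# Sketch of a thought-action-observation step and a classification prompt
# for labeling it; categories and wording are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str       # the model's reasoning before acting
    action: str        # the tool call / command the model issued
    observation: str   # the output returned after executing the action

CATEGORIES = [
    "Code Exploration", "Code Modification", "Test Script Creation",
    "Test Script Execution", "Submission",
]

def classification_prompt(step: Step) -> str:
    """Build a prompt asking an LLM to label one trajectory step."""
    options = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Classify the following agent step into exactly one category.\n\n"
        f"Categories:\n{options}\n\n"
        f"Thought: {step.thought}\nAction: {step.action}\n\n"
        "Answer with the category name only."
    )
```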
<h2 id="heading-results">Results</h2>
<h3 id="heading-action-categories">Action Categories</h3>
<p>Here are the different action categories we used for the final analysis and how often the two models used them in their trajectories.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349672196/bc91d7ca-e21a-4021-8fa4-d9a2684a936e.png" alt class="image--center mx-auto" /></p>
<p><strong>Claude Sonnet 4</strong> does around 30% more Test Script Creation and 80% more Test Script Execution than <strong>Kimi K2 Instruct</strong>, which spends around 66% more of its time modifying code.</p>
<h3 id="heading-trajectory-length">Trajectory Length</h3>
<p>Below are the differences in trajectory length (number of steps) between the two models:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349869794/4fa84536-87c0-4ad1-b989-ffe3090ff531.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349864883/820b7e29-365a-473c-a806-6e6a552330a2.png" /></p>
<p>The step distribution analysis reveals significant differences in trajectory length requirements between the models.</p>
<p><strong>Claude Sonnet 4</strong> requires approximately 2.7x more steps to reach resolution. The box plot analysis above shows that its step counts are higher but have lower variance, suggesting a more consistent yet lengthier methodology.</p>
<p><strong>Kimi K2 Instruct</strong> showed a strong concentration of solutions in the 15-25 step range with some outliers going up to around 100 steps.</p>
<h3 id="heading-sankey-workflow-diagram-comparison">Sankey Workflow Diagram Comparison</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349794599/29082bc6-fd64-4b01-b18d-9ea6225fae8a.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349801977/03506673-e00f-4c34-b89f-cd56bae5a6f0.png" /></p>
<p>The Sankey diagrams above provide visual confirmation of distinct workflow patterns employed by each model, revealing the complexity and routing patterns of their problem-solving approaches.</p>
<p><strong>Claude Sonnet 4</strong> exhibits significantly higher workflow complexity with <strong>70 distinct transition types</strong> (≥5 occurrences) compared to <strong>Kimi K2 Instruct's 35 transitions</strong>. This 2:1 ratio indicates that Claude employs a more intricate approach to problem resolution, trying different paths with greater interconnection between activity categories.</p>
<h3 id="heading-markov-transition-analysis">Markov Transition Analysis</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349732552/3c5a5423-d474-4ecf-a38b-472ec9285571.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349741370/51325e3f-6553-47e1-a3ec-630e4276ce1f.png" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754349765995/0da52aac-07c5-479e-8fd2-eeff9ead2813.png" /></p>
<p>Building on the Sankey workflow visualizations, the transition probability matrices (filtered for categories with ≥1.5% involvement) show behavioral patterns characteristic of each model's problem-solving approach.</p>
<p><strong>Claude Sonnet 4</strong> transitions from Test Script Execution back to Test Script Execution with 37% probability, as opposed to <strong>Kimi K2 Instruct's</strong> 3%, indicating multiple iterations of running tests. Kimi K2 Instruct transitions from Test Script Execution directly to Code Modification 46% of the time against Claude Sonnet 4's 10%, which helps explain its shorter trajectories.</p>
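<p>The transition probabilities in these matrices come from counting consecutive category pairs and normalizing per source category. A minimal sketch with toy sequences in place of the real categorized trajectories:</p>

```python
# Build a row-normalized Markov transition matrix from category sequences.
from collections import Counter, defaultdict

def transition_matrix(sequences: list[list[str]]) -> dict[str, dict[str, float]]:
    """Map each "from" category to its "to" category probabilities."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive (from, to) pairs
            counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: n / total for b, n in row.items()}
    return probs
```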
<h3 id="heading-visualizing-pairwise-trajectories-for-individual-issues">Visualizing Pairwise Trajectories for Individual Issues</h3>
<table>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361778710/ea166f4b-f410-4972-b4f2-00bd1b33bed4.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361797217/091d2a17-c43f-446e-a069-592b3f96d80d.png" /></td>
      </tr>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361802331/7f51db6a-6cbd-417f-bbdf-71a69ad85575.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361808264/b97aa58c-13e7-40d3-9911-fee5d5d47934.png" /></td>
      </tr>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361815219/a3b13d29-a7a0-445e-bfff-fdf92fce0dc9.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361792562/07228f90-ea64-4d9e-82f0-61789ed62ef4.png" /></td>
      </tr>
    <tr>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361769827/7729c3f9-850a-4459-985c-07df48e676d2.png" /></td>
        <td>
            <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754361764188/a45a326d-3d56-4a60-b075-c38c2e51bf18.png" /></td>
      </tr>
</table>

<p>To illustrate what the difference in steps means for the same issue on a head-to-head basis, the figures above color-code the trajectory steps and plot the two models' trajectories one below the other in pairs. The color coding for the different steps is available in the appendix. This further illustrates the difference in trajectory length as well as Claude Sonnet 4's emphasis on Test Script Creation and Test Script Execution.</p>
<h2 id="heading-limitations-and-future-work">Limitations and Future Work</h2>
<p>This analysis focused exclusively on successfully resolved issues, which may introduce bias. The behavioral patterns observed might not generalize to failed attempts or different problem domains within software engineering.</p>
<p>Potential topics for future research include:</p>
<ul>
<li>Analysis of failed trajectory patterns to understand model limitations</li>
<li>Investigation of problem complexity correlation with step requirements</li>
<li>Examination of solution quality metrics beyond binary resolution success</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Our analysis revealed different approaches to automated software engineering between Kimi K2 Instruct and Claude Sonnet 4. While both models achieve successful resolution, with Claude Sonnet 4 leading substantially, they employ distinct strategies. Claude prioritizes iterative verification through testing, while Kimi emphasizes efficient exploration and direct problem-solving paths.</p>
<hr />
<p><em>This analysis is based on SWE-agent trajectory data comparing Kimi K2 Instruct runs with previously collected Claude Sonnet 4 trajectories on mutually resolved SWE-bench issues.</em></p>
<h2 id="heading-appendix">Appendix</h2>
<p><strong>Action Category Labels and Color Codes</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754351974848/f55d31d6-1ac4-470f-a595-c92dc07ecb5b.png" /></p>
]]></content:encoded></item></channel></rss>