Qwen 3.6 35B A3B did the agentic spec workflow 13 models couldn't — 174 reqs on the OpenAI SDK, no scaffolding
TL;DR. I pointed an MCP-driven agentic spec-writing pipeline at the excalidraw codebase (excalidraw/excalidraw on GitHub) and ran the same brief through 13 models. The headline result for this audience: @Qwen/Qwen3.6-35B-A3B-Instruct, an MoE running locally on LM Studio at 50k-token context, produced 174 requirements across 23 top-level features, within 12% of Claude Opus 4.7’s 197. Every other OpenAI-SDK model on the table (cloud or local) sat at 13–60. Qwen 3.6 is the only open-weights model that closed the gap. Side-by-side, where you can pick any feature node and read what each model actually wrote: AI Specification Generation - Model Comparison | SPECLAN
Why this is news (specifically for the local-LLM-on-agentic-flows folks here). Two weeks ago I wrote up 7 local models running an MCP tool-calling spec workflow, and every MoE variant I tested failed — @google/gemma-4-26b-a4b looped on the same tool 16x, @openai/gpt-oss-20b hallucinated completion (claimed to write a file, didn’t), @google/gemma-4-e4b never produced a final response. The conclusion at the time was “dense beats MoE for agentic tool-calling, regardless of parameter count.” Qwen 3.6 35B A3B (3B active, 35B total — sparse MoE) just walked through the same kind of workflow, called dozens of tools across multiple turns, and self-terminated cleanly. Full disclosure: I’m building SPECLAN, the VS Code extension running this pipeline; that’s where the trees live.
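The three failure modes described above (looping on one tool, claiming a file write that never happened, never producing a final response) can be checked mechanically on a recorded trace. What follows is only a rough sketch under assumptions: the `ToolCall` shape, the `write_file` tool name, the `claims_file_written` flag and the loop threshold of 5 are all hypothetical and are not taken from SPECLAN or from any SDK.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation from an agent trace (hypothetical shape)."""
    name: str
    arguments: str  # raw JSON string of the arguments the model sent


def longest_repeat_run(calls: List[ToolCall]) -> int:
    """Length of the longest run of consecutive identical calls."""
    longest = 0
    run = 0
    previous = None
    for call in calls:
        run = run + 1 if call == previous else 1
        longest = max(longest, run)
        previous = call
    return longest


def classify_run(
    calls: List[ToolCall],
    final_text: Optional[str],
    claims_file_written: bool = False,
    write_tool: str = "write_file",  # assumed tool name
    loop_threshold: int = 5,         # assumed cut-off for "looping"
) -> str:
    """Label a finished run with one of the failure modes, or 'ok'."""
    if longest_repeat_run(calls) >= loop_threshold:
        return "tool_loop"
    if final_text is None:
        return "no_final_response"
    wrote_something = any(call.name == write_tool for call in calls)
    if claims_file_written and not wrote_something:
        return "hallucinated_completion"
    return "ok"


if __name__ == "__main__":
    looping = [ToolCall("list_features", "{}")] * 16
    print(classify_run(looping, final_text=None))
```

Checking for the loop first is a deliberate choice in this sketch: a run that loops 16 times usually also lacks a final response, and the loop is the more specific diagnosis.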
The non-obvious finding (the SDK matters more than the model). The 196–203 requirement band for the three Claude models (Opus, Sonnet, Haiku, regardless of size) is not “Anthropic models are better.” It’s a scaffolding artefact. The Anthropic SDK ships with a built-in persistent Todo-List, a planner primitive, and scratchpad memory as first-class agent tools. Spec authoring is a list-management problem (enumerate features, write one, cross off, next); the SDK solves half of it. When Haiku writes 16 top-level features, those become 16 TODO items in the SDK’s list, and subsequent turns just enumerate. The OpenAI SDK exposes our MCP tools only: no built-in planner, no TODO list, no scratchpad. Every other model on the table (GPT 5.4, Gemini previews, all local) starts with no list. The clustering at 196–203 is the SDK floor, not the model ceiling. The clustering at 13–60 is the absence-of-scaffolding floor.
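To make the "list-management problem" concrete, here is a minimal sketch of what a persistent TODO list does for the loop (enumerate features, write one, cross off, next). It is an illustration only: `TodoList`, `author_spec` and the feature names are hypothetical, and this is not the Anthropic SDK's actual tool interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class TodoList:
    """Hypothetical stand-in for a persistent TODO-list tool."""
    items: List[str] = field(default_factory=list)
    done: Set[str] = field(default_factory=set)

    def add(self, title: str) -> None:
        # Ignore duplicates so re-enumerating features is harmless.
        if title not in self.items:
            self.items.append(title)

    def next_pending(self) -> Optional[str]:
        for title in self.items:
            if title not in self.done:
                return title
        return None

    def complete(self, title: str) -> None:
        if title not in self.items:
            raise KeyError(title)
        self.done.add(title)


def author_spec(
    features: List[str],
    write_requirements: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Enumerate, write one, cross off, next, until nothing is pending."""
    todo = TodoList()
    for feature in features:
        todo.add(feature)
    spec: Dict[str, List[str]] = {}
    while (current := todo.next_pending()) is not None:
        spec[current] = write_requirements(current)  # the model's actual work
        todo.complete(current)
    return spec  # an empty pending list is the termination signal


if __name__ == "__main__":
    # Made-up feature names; the lambda stands in for a model call.
    result = author_spec(["Canvas", "Export"], lambda f: [f"{f}: requirement 1"])
    print(result)
```

With the list held outside the model, coverage and termination come for free. Without it, the model has to keep the enumeration and the "what is left" state in its own context across turns.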
What makes Qwen 3.6 the data point that matters: it ran on the OpenAI-SDK path with zero scaffolding and produced 174 requirements. Same harness as @google/gemma-4-31b dense (60 reqs) and GPT 5.4 (43 reqs). My working hypothesis: Qwen 3.6’s training mix has heavy agentic-tool-call trajectories that internalized the bookkeeping the Anthropic SDK externalizes as tools. I can’t prove that from the output alone, but it’s the only explanation I’ve found that fits the data: same harness, roughly 3x the output of the next-best model (60) and about 13x the bottom of the 13–60 band.
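For anyone who wants to check the ratios, the arithmetic on the counts reported in this post is short. The numbers are the ones quoted above; the helper names are just for illustration, and the 13 is the bottom of the 13–60 band, which this post does not attribute to a named model.

```python
# Requirement counts as reported in this post.
REPORTED = {
    "Claude Opus 4.7 (Anthropic SDK)": 197,
    "Qwen 3.6 35B A3B (OpenAI SDK)": 174,
    "gemma-4-31b dense (OpenAI SDK)": 60,
    "GPT 5.4 (OpenAI SDK)": 43,
}
LOWEST_OPENAI_SDK = 13  # bottom of the 13-60 band


def shortfall_pct(value: int, reference: int) -> float:
    """How far `value` falls below `reference`, in percent of `reference`."""
    return 100.0 * (reference - value) / reference


def multiple(value: int, baseline: int) -> float:
    """How many times larger `value` is than `baseline`."""
    return value / baseline


if __name__ == "__main__":
    qwen = REPORTED["Qwen 3.6 35B A3B (OpenAI SDK)"]
    opus = REPORTED["Claude Opus 4.7 (Anthropic SDK)"]
    print(f"Gap to Opus: {shortfall_pct(qwen, opus):.1f}%")
    for label in ("gemma-4-31b dense (OpenAI SDK)", "GPT 5.4 (OpenAI SDK)"):
        print(f"Qwen vs {label}: {multiple(qwen, REPORTED[label]):.1f}x")
    print(f"Qwen vs band minimum: {multiple(qwen, LOWEST_OPENAI_SDK):.1f}x")
```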
For an HF builder picking a local daily-driver. If you have ~64 GB+ of unified memory and you want a local model for multi-step agentic workflows where the model has to plan, enumerate, and self-terminate (spec writing, multi-step refactors, agentic test generation), @Qwen/Qwen3.6-35B-A3B-Instruct via LM Studio with 50k context is the working pick. Nothing else on the open-weights side is close on this workload as of 2026-04-29.
Open question. Has anyone reproduced this pattern with other agentic-tuned MoEs? If you’ve found another MoE that breaks the “MoE-fails-multi-step-tool-calls” rule, drop the model + harness combo — I’d rather learn from your trace than rerun mine.
Full 13-model comparison + the SDK disclaimer in detail + the Gemini-preview anomaly (frontier cloud model authored an Account and Billing Management feature for a no-billing product): AI Specification Generation - Model Comparison | SPECLAN