Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig4i5tqjxcsoiwdw4ptw6t2elqli6ehwc4dlasdnzxpoosq2qkhue",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkfw7f6xwkb2"
  },
  "path": "/t/qwen-3-6-35b-a3b-did-the-agentic-spec-workflow-13-models-couldnt-174-reqs-on-the-openai-sdk-no-scaffolding/175573#post_1",
  "publishedAt": "2026-04-26T15:22:56.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub - excalidraw/excalidraw: Virtual whiteboard for sketching hand-drawn like diagrams · GitHub",
    "AI Specification Generation - Model Comparison | SPECLAN",
    "https://speclan.net",
    "@Qwen",
    "@google",
    "@openai"
  ],
  "textContent": "****TL;DR.**** Pointed an MCP-driven agentic spec-writing pipeline at the [excalidraw]( GitHub - excalidraw/excalidraw: Virtual whiteboard for sketching hand-drawn like diagrams · GitHub ) codebase, ran the same brief through 13 models. The headline result for this audience: ****` @Qwen/Qwen3.6-35B-A3B-Instruct`**** — an MoE running locally on LM Studio at 50k-token context — produced ****174 requirements across 23 top-level features**** , within 12% of Claude Opus 4.7’s 197. Every other OpenAI-SDK model on the table (cloud or local) sat at 13–60. Qwen 3.6 is the only open-weights model that closed the gap. Side-by-side — pick any feature node and read what each model actually wrote: AI Specification Generation - Model Comparison | SPECLAN\n\n****Why this is news (specifically for the local-LLM-on-agentic-flows folks here).**** Two weeks ago I wrote up 7 local models running an MCP tool-calling spec workflow, and ****every MoE variant I tested failed**** — `@google/gemma-4-26b-a4b` looped on the same tool 16x, `@openai/gpt-oss-20b` hallucinated completion (claimed to write a file, didn’t), `@google/gemma-4-e4b` never produced a final response. The conclusion at the time was _*“dense beats MoE for agentic tool-calling, regardless of parameter count.”*_ Qwen 3.6 35B A3B (3B active, 35B total — sparse MoE) just walked through the same kind of workflow, called dozens of tools across multiple turns, and self-terminated cleanly. Full disclosure: I’m building [SPECLAN](https://speclan.net), the VS Code extension running this pipeline; that’s where the trees live.\n\n****The non-obvious finding (the SDK matters more than the model).**** The 196–203 requirement band for the three Claude models — Opus, Sonnet, Haiku, regardless of size — is not “Anthropic models are better.” It’s a scaffolding artefact. The Anthropic SDK ships with a built-in ****persistent Todo-List**** , a planner primitive, and scratchpad memory as first-class agent tools. Spec authoring is a list-management problem (enumerate features, write one, cross off, next); the SDK solves half of it. When Haiku writes 16 top-level features, those become 16 TODO items in the SDK’s list, and subsequent turns just enumerate. The OpenAI SDK exposes our MCP tools only — no built-in planner, no TODO list, no scratchpad. Every other model on the table (GPT 5.4, Gemini previews, all local) starts with no list. ****The clustering at 196–203 is the SDK floor, not the model ceiling. The clustering at 13–60 is the absence-of-scaffolding floor.****\n\nWhat makes Qwen 3.6 the data point that matters: it ran on the OpenAI-SDK path with ****zero scaffolding**** and produced 174 requirements. Same harness as `@google/gemma-4-31b` dense (60 reqs) and GPT 5.4 (43 reqs). My working hypothesis: Qwen 3.6’s training mix has heavy agentic-tool-call trajectories that ****internalized**** the bookkeeping the Anthropic SDK externalizes as tools. I can’t prove that from the output alone, but it’s the only explanation consistent with the data — same harness, order-of-magnitude higher output.\n\n****For an HF builder picking a local daily-driver.**** If you have ~64 GB+ of unified memory and you want a local model for multi-step agentic workflows where the model has to plan, enumerate, and self-terminate (spec writing, multi-step refactors, agentic test generation), `@Qwen/Qwen3.6-35B-A3B-Instruct` via LM Studio with 50k context is the working pick. Nothing else on the open-weights side is close on this workload as of 2026-04-29.\n\n****Open question.**** Has anyone reproduced this pattern with other agentic-tuned MoEs? If you’ve found another MoE that breaks the “MoE-fails-multi-step-tool-calls” rule, drop the model + harness combo — I’d rather learn from your trace than rerun mine.\n\nFull 13-model comparison + the SDK disclaimer in detail + the Gemini-preview anomaly (frontier cloud model authored an `Account and Billing Management` feature for a no-billing product): AI Specification Generation - Model Comparison | SPECLAN",
  "title": "Qwen 3.6 35B A3B did the agentic spec workflow 13 models couldn't — 174 reqs on the OpenAI SDK, no scaffolding"
}