Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicfhmhtt6fpfecfhji7cqmogz5fnkwwrxc7xjyr4hgjm25j4c4yei",
    "uri": "at://did:plc:ws6dhxzqnqxu5aqxt4kd27oc/app.bsky.feed.post/3mjyaiafpk262"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreienfysehfymzoun24zbg3qyr2u7e5zv5j6vxenxod5exwthqaep4a"
    },
    "mimeType": "image/webp",
    "size": 26350
  },
  "description": "Moonshot AI's trillion-parameter MoE model targets long-horizon coding, agent swarms, and native INT4 inference.",
  "path": "/kimi-k2-6-what-moonshot-ais-new-open-model-actually-does/",
  "publishedAt": "2026-04-21T05:23:06.000Z",
  "site": "https://allthings.how",
  "tags": [
    "Hugging Face",
    "platform.moonshot.ai"
  ],
  "textContent": "Kimi K2.6 is Moonshot AI's latest open-source model, a Mixture-of-Experts system with 1 trillion total parameters and 32 billion active per token. It ships with open weights on Hugging Face under a Modified MIT license, native INT4 quantization, and a 256K context window, and it's aimed squarely at long-horizon coding, agentic workflows, and coding-driven design.\n\n⚡\n\nQuick answer: K2.6 is a 1T-parameter MoE model (32B active) with native INT4 weights, a 256K context window, and multimodal input. It runs on vLLM, SGLang, and KTransformers, and is accessible through Moonshot's API plus the Kimi Code CLI.\n\n* * *\n\n### What Kimi K2.6 is\n\nK2.6 is the successor to K2.5 and shares the same underlying architecture, which means existing K2.5 deployments can swap in the new weights without reconfiguring their inference stack. Moonshot describes it as a native multimodal agentic model with a focus on four practical capabilities: long-horizon coding across languages like Rust, Go, and Python; coding-driven design that turns prompts and images into working interfaces; an elevated agent swarm that can coordinate up to 300 sub-agents over 4,000 steps; and proactive orchestration for persistent background agents.\n\nThe model is available on Hugging Face with full weights, and Moonshot also runs a hosted API at platform.moonshot.ai that's compatible with both OpenAI and Anthropic client SDKs.\n\n* * *\n\n### Architecture and specs\n\nSpec| Value\n---|---\nArchitecture| Mixture-of-Experts (MoE)\nTotal parameters| 1T\nActive parameters per token| 32B\nLayers (incl. dense)| 61\nAttention heads| 64\nExperts| 384 (8 selected + 1 shared per token)\nAttention mechanism| MLA (Multi-head Latent Attention)\nActivation| SwiGLU\nVocabulary| 160K\nContext length| 256K tokens\nVision encoder| MoonViT (400M params)\n\nThe sparse expert routing is the key efficiency lever. 
Only 32 billion of the 1 trillion parameters fire for any given token, which keeps per-token compute cost closer to a mid-size dense model while giving the system a much larger knowledge base to draw from.\n\n* * *\n\n### Benchmark performance\n\nMoonshot reports K2.6 with thinking mode enabled and compares it to GPT-5.4 at xhigh reasoning, Claude Opus 4.6 at max effort, and Gemini 3.1 Pro at high thinking. The headline numbers place it competitively at the frontier on agentic and coding tasks, while trailing slightly on some pure-reasoning benchmarks.\n\nBenchmark| K2.6| GPT-5.4| Opus 4.6| Gemini 3.1 Pro\n---|---|---|---|---\nHLE-Full (w/ tools)| 54.0| 52.1| 53.0| 51.4\nBrowseComp| 83.2| 82.7| 83.7| 85.9\nBrowseComp (Agent Swarm)| 86.3| —| —| —\nDeepSearchQA (accuracy)| 83.0| 63.7| 80.6| 60.2\nSWE-Bench Verified| 80.2| —| 80.8| 80.6\nSWE-Bench Pro| 58.6| 57.7| 53.4| 54.2\nTerminal-Bench 2.0| 66.7| 65.4| 65.4| 68.5\nLiveCodeBench v6| 89.6| —| 88.8| 91.7\nAIME 2026| 96.4| 99.2| 96.7| 98.3\nGPQA-Diamond| 90.5| 92.8| 91.3| 94.3\nMMMU-Pro| 79.4| 81.2| 73.9| 83.0\n\nThe agent swarm configuration on BrowseComp, where K2.6 jumps to 86.3, is a specific capability Moonshot highlights. The model can fan out to hundreds of sub-agents to parallelize information gathering, which is difficult to replicate with closed models that restrict parallel tool use.\n\n* * *\n\n### Native INT4 quantization\n\nOne of the more interesting technical choices is Quantization-Aware Training for the INT4 variant. Rather than compressing weights after training (post-training quantization), K2.6's INT4 model is trained with the quantization constraints in the loop. The practical effect is roughly 2x faster inference compared to FP16, about 50% less GPU memory, and benchmark scores that stay within 1–2% of the full-precision baseline.\n\nThe INT4 weights are around 594GB on Hugging Face, versus roughly 2TB for FP16. 
That changes the hardware math significantly.\n\nPrecision| Model size| Min GPU memory| Typical config\n---|---|---|---\nFP16 / BF16| ~2TB| ~640GB+ VRAM| 8× H100 80GB\nFP8| ~1TB| ~320GB+ VRAM| 8× A100 80GB\nINT4 (QAT)| ~594GB| ~320GB+ VRAM| 4× H100 80GB\n\n* * *\n\n### Self-hosting options\n\nThree inference engines officially support K2.6: vLLM, SGLang, and KTransformers. All three require `transformers>=4.57.1,<5.0.0` and expose an OpenAI-compatible chat completions endpoint.\n\n**vLLM** is the most general-purpose choice, with PagedAttention and continuous batching for high-throughput serving. A typical INT4 launch looks like this:\n\n\n    python -m vllm.entrypoints.openai.api_server \\\n      --model moonshotai/Kimi-K2.6-INT4 \\\n      --tensor-parallel-size 4 \\\n      --max-model-len 131072 \\\n      --trust-remote-code \\\n      --port 8000\n\n\n**SGLang** is built for structured generation, constrained decoding, and multi-turn workloads. Its RadixAttention caches KV state across conversation turns, which tends to help agentic loops where the same system prompt and tool definitions repeat.\n\n**KTransformers** is Moonshot's first-party engine, tuned specifically for K2's MoE routing pattern and MLA attention. It also supports CPU offloading of inactive experts, which can lower the total GPU VRAM requirement for teams that don't have a full 4× or 8× H100 node available.\n\n* * *\n\n### Thinking vs Instant mode\n\nK2.6 exposes two generation modes. Thinking mode produces a visible reasoning trace before the final answer and is tuned for complex reasoning, multi-step coding, and agentic tasks. Instant mode skips the reasoning trace for faster, lower-overhead responses on straightforward queries.\n\nParameter| Thinking| Instant\n---|---|---\nTemperature| 1.0| 0.6\ntop_p| 0.95| 0.95\nthinking flag| True (default)| False\n\nOn vLLM or SGLang, you switch to Instant mode by passing `chat_template_kwargs: {\"thinking\": False}` in the request body. 
On Moonshot's official API, the equivalent is `thinking: {\"type\": \"disabled\"}`.\n\n* * *\n\n### How it's accessed today\n\nAt launch, K2.6 is labeled as a code preview in Moonshot's developer console and is primarily reached through the Kimi Code CLI. The standard Kimi web chat at kimi.com still routes the general agent to K2.5, which has caused some confusion for users who expect to pick K2.6 from a model dropdown. Inside the Kimi Code console, opting into the beta program exposes the flagship as `k2.6-code-preview`.\n\nThere's also a quirk around authentication: the K2.6 preview has been available to OAuth users of Kimi Code, while API-key auth paths have sometimes been limited to K2.5. This behavior may change as the preview graduates, but it's worth testing both auth flows if K2.6 doesn't appear where expected.\n\nFor users who want a hosted agent setup without running their own CLI, Moonshot's Kimi Claw feature provides a one-click deployment that wires the K2.6 coding plan into a cloud-hosted OpenClaw environment, including messaging-app connectors. K2.6's subscription plans are priced significantly lower than equivalent per-token API usage on Claude or GPT-class models, which is the main draw for developers running high-volume coding agents.\n\n* * *\n\n### Cost tradeoffs for self-hosting\n\nThe break-even point between Moonshot's API and self-hosted infrastructure depends almost entirely on monthly token volume. Self-hosting on a 4× H100 INT4 node runs roughly $8,000–$12,000 per month in cloud GPU costs, versus API pricing that scales linearly with usage.\n\nMonthly volume| API cost (est.)| 4× H100 INT4\n---|---|---\n10M tokens| ~$15–$30| ~$8,000–$12,000\n500M tokens| ~$750–$1,500| ~$8,000–$12,000\n5B tokens| ~$7,500–$15,000| ~$8,000–$12,000\n20B+ tokens| ~$30,000–$60,000| ~$8,000–$12,000\n\nBelow roughly 5 billion tokens per month, the API is cheaper. 
Above that, self-hosting on INT4 pulls ahead, and at 20 billion or more tokens per month the table's estimates imply savings of roughly 60–80%, while also giving teams data sovereignty, custom batching, and no rate limits.\n\n* * *\n\n### Where K2.6 fits\n\nK2.6 is best understood as an open-weights alternative to Claude Opus and GPT-5-class models for coding and agent workloads, with two specific advantages: the weights are freely redistributable under a Modified MIT license, and the model plugs into third-party agent frameworks like OpenClaw and Hermes that closed-API providers have been restricting. The tradeoffs are a smaller context window than Claude's 1M-token ceiling, no polished desktop app at the level of Claude Code, and a coding speed that trails Opus 4.7 in side-by-side tests.\n\nFor teams building agent swarms, running high-volume coding pipelines, or needing on-prem deployment, the combination of native INT4, MoE efficiency, and open weights makes K2.6 one of the more practical frontier-class models to actually deploy right now.",
  "title": "Kimi K2.6: What Moonshot AI's new open model actually does",
  "updatedAt": "2026-04-21T05:23:08.228Z"
}