Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiaycr6xcastajdw3hdajbxqpchnxsfqc2yuk5j72nkps63p55md44",
    "uri": "at://did:plc:ls6rbbwjyqeakittcsi3k6x3/app.bsky.feed.post/3mgzg3zswv352"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreier2xphfwikgwofn54llt5mhuq7dhjivfjhcdkiinnmibekwnn2m4"
    },
    "mimeType": "image/png",
    "size": 6399092
  },
  "description": "Chinese AI models are the cheapest and arguably the most cost-efficient in the world — but is their \"small and precise\" strategy a deliberate choice, or a ceiling imposed by hardware they cannot access?",
  "path": "/small-by-design-or-by-default-what-a-memory-formula-reveals-about-chinas-ai-ceiling/",
  "publishedAt": "2026-03-14T12:20:24.000Z",
  "site": "https://www.jasonandjarvis.org",
  "tags": [
    "Can a GPU Really Be Profitable for Six Years?",
    "Model Size and Critical Frequency Determine Expert-Level Knowledge Threshold",
    "Subscribe now"
  ],
  "textContent": "Note: Parameter counts for closed-source models are only rough estimates.\n\n## The End of Cost Efficiency: When China's \"Small and Precise\" AI Models Hit a Hardware Ceiling\n\n### A Coincidence Worth Questioning\n\nIf you follow the competitive landscape of AI models, one pattern is hard to ignore: China's leading large language models — DeepSeek V3.2 (685B), Kimi K2.5 (1,000B), GLM-5 (744B), Qwen3.5 (397B) — **cluster almost entirely between 230B and 1,000B total parameters, with active ratios compressed to 3%-5%**. API pricing runs as low as $0.06-$0.5/M tokens, one to two orders of magnitude cheaper than Western frontier models.\n\nTotal parameters and active ratio distribution of leading Chinese AI models\n\nsource: Artificial Analysis\n\nOn the surface, this looks like a deliberate technical strategy: use MoE architecture to decouple \"storage cost\" from \"compute cost,\" and drive active parameters down to 10B-40B for maximum throughput and minimal per-token cost. The results speak for themselves — on Artificial Analysis's Intelligence Index, DeepSeek V3.2 scores roughly 43, Kimi K2.5 about 47, GLM-5 around 50, delivering respectable intelligence at 1/10th to 1/100th of Western frontier pricing.\n\nAPI pricing distribution across models\n\nsource: Artificial Analysis\n\nBut there's a coincidence worth questioning: **why do Chinese models all stop just below 1T parameters?**\n\nMoonshot AI (the company behind Kimi) stated explicitly in a recent investor survey that its next-generation model will not pursue aggressive parameter scaling, shifting R&D focus toward context processing, multimodality, and agent capabilities. All three major Chinese AI startups are aligned on training budgets: **cautious growth**. 
This looks like industry consensus — scaling laws are plateauing, bigger isn't necessarily better.\n\nBut you can also flip the framing: **when the hardware you have access to can only economically run ~0.6T-1T models, \"bigger doesn't help\" is a convenient conclusion.**\n\n### The Hardware Ceiling: What a Memory Constraint Formula Reveals\n\nChinese AI companies, constrained by export controls, are limited to 8×H200 single-server configurations as their most powerful inference hardware — 8 GPUs, each with 141 GB HBM3e, totaling 1.128 TB of HBM. NVL72-class rack-scale systems (GB200/GB300) have never been approved for export to China.\n\nThis generational hardware gap, combined with the memory constraint logic of MoE inference, draws a clear boundary. To understand where that boundary comes from, you first need to understand how MoE models physically fit inside GPUs during inference.\n\n**The core design of MoE is division of labor.** A 685B-parameter model doesn't activate all 685B on every inference pass — it organizes most of its parameters into hundreds of \"experts,\" and a router selects only a handful (say, 8) to participate in each computation. That's why active ratios can be as low as 3%-5%: out of 685B total parameters, only 20B-35B are working at any given time. Inference compute requirements drop dramatically. This is the core mechanism behind Chinese models' low per-token cost.\n\nBut the full model still has to reside in GPU memory. Even though only a fraction activates on each pass, all 685B parameters must be on standby — because you don't know which experts the router will pick next. A single GPU with 141 GB of HBM obviously can't hold 685B (at FP8 precision, 1B parameters ≈ 1 GB), so the model must be split across multiple cards.\n\nThis introduces Expert Parallelism (EP) — distributing different experts across different GPUs. In theory, 8 cards can split the expert portion into 8 equal slices. 
But the problem lies in the other part of the model.\n\nMoE model parameters actually fall into two categories: **one is the \"experts,\"** which can be evenly distributed across cards; **the other is the \"shared trunk\"** — embedding layers (converting input text to vectors), attention layers (processing contextual relationships), the router (deciding which experts to activate), and the output head. This portion's share of total parameters roughly equals the active ratio, denoted as  _d_ (about 3%-5%).\n\nExperts can be split. The shared trunk cannot. The intuition is straightforward: **regardless of which expert gets activated on a given pass, attention and embedding must participate in the computation**. So the shared trunk must be fully replicated on every GPU.\n\nThis produces an elegant but unforgiving constraint. Each GPU's memory must simultaneously hold two things: a complete copy of the shared trunk (_d_ ×  _P_ , where P is total parameter count) plus its assigned expert slice ((1− _d_) ×  _P_ /  _N_ , where N is GPU count). Their sum cannot exceed per-card memory capacity M_gpu. From this we derive the maximum model size the system can load:\n\n$$P_{max} = \\frac{M_{gpu}}{d + \\frac{1-d}{N}}$$\n\nThe denominator represents the \"per-unit-parameter memory cost\" borne by each GPU:  _d_ is the shared trunk's fixed tax (not amortized across GPU count), (1− _d_)/_N_ is the expert slice share (more GPUs mean thinner slices per card). Larger  _N_ , smaller denominator, bigger model capacity.\n\nNow let's plug in the numbers from both sides and see where this formula draws its line.\n\n**8×H200** (the strongest configuration available to China):  _N_ = 8, M_gpu = 141 GB. With  _d_ = 5%, the denominator = 0.05 + 0.95/8 ≈ 0.169, yielding P_max ≈ 836B. That's the theoretical loading ceiling for pure parameter weights. 
But actual inference requires memory for other purposes — KV cache (storing attention states for each user's conversational context; more users and longer contexts mean larger footprints) and runtime activations (intermediate computation results during forward passes). These overheads typically require reserving 40%-50% of memory. After deduction, the economically viable parameter ceiling for 8×H200 is roughly **0.6T**.\n\nNotice how limited EP's upside is at  _N_ = 8: without EP (_N_ = 1), a single card can only hold 141B; 8-card EP raises the ceiling to ~836B, roughly a 6× improvement. But the marginal returns of adding more cards diminish fast: scaling to  _N_ = 16 across nodes only grows P_max to about 1,290B (141 / (0.05 + 0.95/16) ≈ 141 / 0.109), far from a linear doubling. The root cause is the shared trunk's fixed \"tax,\" which does not shrink as cards are added: with 8 cards, each GPU must still hold 16.9% of the full model (the whole trunk plus its expert slice); doubling to 16 cards only reduces this to 10.9%. The container is simply too small.\n\n**NVL72** (the standard configuration available overseas): an entirely different picture.  _N_ = 72, GB200 at 192 GB HBM per card. Denominator = 0.05 + 0.95/72 ≈ 0.063, yielding P_max ≈ 3,040B. After KV cache reservation, the economically viable ceiling is roughly **2.13T**. With GB300 at 288 GB per card, it reaches **3.19T**; at FP4 precision, GB300 can accommodate up to **6.4T**.\n\nThe gap compounds across two dimensions: larger per-card capacity (192-288 GB vs 141 GB) and  _N_ jumping from 8 to 72. The latter is the more decisive factor: at  _N_ = 72, the denominator (0.063) converges toward  _d_ itself (0.05), meaning the expert slice has been amortized across 72 cards down to about 0.013 per GPU, and only the shared trunk's \"fixed tax\" remains as the binding cost. In other words, NVL72 unlocks nearly all of EP's theoretical potential. 
8×H200 unlocks only a fraction.\n\nPlatform| Economically Viable Ceiling (FP8, d=5%)\n---|---\n8×H200 Single Server| ~0.6T\nGB200 NVL72| ~2.1T\nGB300 NVL72| ~3.2T\nGB300 NVL72 FP4| ~6.4T\n\n**0.6T vs 2.1T-6.4T. Chinese frontier models clustering between 230B and 1,000B is not a coincidence — that is the feasible zone of 8×H200.**\n\n### PD Disaggregation: The Ceiling Gets Higher, but the Floor Gets More Expensive\n\nThe analysis above makes a simplifying assumption: Prefill and Decode share the same GPU pool. In practice, DeepSeek, MiniMax, ByteDance, and others have widely adopted Prefill-Decode (PD) disaggregation — splitting compute-bound Prefill and memory-bandwidth-bound Decode onto separate GPU clusters, allowing the Decode side to fan out EP across multiple nodes.\n\nThis genuinely works. PD disaggregation raises the 8×H200 parameter ceiling from ~0.6T to ~1.3T (9 servers at EP72), and potentially ~1.6T (18 servers at EP144 + MLA). In a like-for-like 72-GPU comparison, the capacity gap narrows from 3.2× to 1.4×.\n\nBut here's a brutal reality: **\"fits in memory\" ≠ \"runs fast.\"**\n\nNVL72's full-domain NVLink aggregate bandwidth: 130 TB/s. Nine H200 servers connected via InfiniBand: roughly 4.5 TB/s. **A 29× gap.**\n\nUsing DeepSeek V3 as an example, at batch=64, EP72 decode per-layer all-to-all communication latency comes to about 6.9 μs on NVL72 versus 1,791 μs across 9-node H200 — **260×.** Every MoE layer's dispatch/combine cycle pays this penalty. For the same model, H200 multi-node per-user TPS could be 1/2 to 1/5 that of NVL72.\n\nDeepSeek's production system provides a concrete data point: Decode-side EP144 requires 18 servers; the inference cluster peaks at 278 nodes with daily operating costs of approximately \\$87,000. NVIDIA's data is more direct — DeepSeek-R1 inference cost on GB200 NVL72 runs $0.10/M tokens versus $1.56/M tokens on H200. 
**Same model, 15× cost difference driven by hardware generation.**\n\nPD disaggregation is a genuinely impressive piece of systems-level engineering — Chinese companies have pushed constrained hardware to its absolute limits. But no amount of software optimization can leap across the physical chasm between hardware generations. As NVIDIA's blog put it plainly: \"Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of this communication pattern would make large-scale EP impractical.\"\n\n### What's Happening on the Other Side: Overseas Models Are Migrating to 2T+\n\nMeanwhile, the picture on the other side of the fence is coming into focus.\n\nxAI's CEO publicly stated in November 2025 that Grok-3 and Grok-4 are based on approximately 3T-parameter MoE architectures, with Grok-5 (expected 2026) targeting 6T — trained on the Colossus cluster's 200,000+ GPUs. Meta's Llama 4 Behemoth weighs in at roughly 2T total / 288B active, currently the only 2T-class model with officially confirmed parameter counts, surpassing GPT-4.5 and Claude Sonnet 3.7 on MATH-500 and GPQA Diamond. Anthropic's Claude Opus series is estimated by independent research organizations at approximately 2T, priced at $180/M output tokens — 6× that of GPT-5.4 — strongly implying it sits at the largest scale among frontier models.\n\n**3T falls squarely within the economically viable zone of GB300 NVL72. 2T falls within GB200 NVL72's zone. This is also not a coincidence.**\n\nThe relationship between Intelligence and total parameter count offers another dimension of observation:\n\nIntelligence vs. Total Parameters scatter plot\n\nsource: Artificial Analysis\n\nAcross the 32B to 1,000B range, Intelligence Index rises from roughly 17 to about 50. 
Chinese frontier models cluster in the 42-50 range; Western leaders (GPT-5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6) sit at 52-58 — a 10-15 point intelligence gap.\n\nMost data points on this chart fall below 1T — precisely the H200 feasible zone. If overseas models leverage NVL72 to push parameter counts to 2T-5T and the positive correlation holds, the intelligence gap could widen further.\n\n### The Inference Ceiling Forces Training-Side Self-Restraint\n\nThere's a causal chain here that's easy to overlook: **it's not that \"China doesn't want to train bigger models\" — it's that \"even if they train them, they can't deploy them.\"**\n\nSemiAnalysis founder Dylan Patel revealed in a March 10, 2026 interview that DeepSeek V4 was trained on NVIDIA Blackwell clusters leased in Southeast Asia, reaching 1T parameters. This shows that leading Chinese companies can break through training-side hardware limitations via overseas leasing. But 1T parameters happen to push right against the economic deployment ceiling of PD disaggregation's multi-node architecture — the choice not to go to 2T or 3T wasn't because they couldn't train it, but because **the trained model must be deployed for inference on domestic hardware**.\n\nA 2T+ model that cannot be economically served on domestic 8×H200 (or the weaker Huawei Ascend) clusters has no commercial value for Chinese vendors. Even with PD disaggregation stretching the ceiling to 1.3T, 9 H200 servers and a 29× communication bandwidth disadvantage mean inference costs deteriorate sharply.\n\n**The domestic inference hardware ceiling actively constrains training-side model scale decisions.** This causal loop explains why DeepSeek V4 chose 1T rather than larger — and why Moonshot AI says it's \"no longer aggressively scaling parameters.\"\n\n### Huawei Ascend: The Domestic Alternative Has an Even Lower Ceiling\n\nThe analysis above uses NVIDIA H200 as its baseline — already an optimistic estimate of China's hardware capabilities. 
If Chinese vendors primarily rely on domestically produced Huawei Ascend chips, the actual ceiling drops further.\n\nThe Ascend 910C delivers about 79% of H200's BF16 compute (780 vs 990 TFLOPS), 91% of its HBM capacity (128 GB vs 141 GB), and 67% of its HBM bandwidth (3.2 vs 4.8 TB/s). More critically, there's the scale-up interconnect gap — Huawei's CloudMatrix 384 uses 384 Ascend 910C chips (559 kW) to counter the GB200 NVL72's 72 GPUs (145 kW), achieving 300 vs 180 PFLOPS system compute (1.7×), but at 3.9× the power consumption.\n\nGlobal AI chip performance, A100-equivalent comparison\n\nsource: Bernstein\n\nBernstein's A100-equivalent comparison chart makes the gap visually intuitive: Ascend 910C sits at roughly 12,800 A100 equivalents while B300 reaches 60,000 — a gap of approximately 4.7×. The next-generation Ascend 950DT (expected Q4 2026) targets FP8 performance of about 1 PFLOPS, roughly corresponding to NVIDIA's B200 level. But by then, NVIDIA will have moved to Vera Rubin (VR200 at approximately 16,700 TFLOPS FP8), and the per-chip gap is expected to widen from the current ~1.3× to ~17×.\n\n### Export Controls: The Window Keeps Narrowing\n\nChina vs. West AI chip FP16 performance frontier comparison\n\nsource: Epoch AI\n\nEpoch AI's chip timeline chart shows clearly: since 2020, Chinese chips have tracked the Western FP16 performance frontier at a roughly 2-3× gap — neither significantly narrowing nor dramatically widening. But the logic of export controls keeps evolving: from the October 2022 compute red line, to the October 2023 closure of down-clocked workaround loopholes, to the sudden H20 ban in April 2025 (NVIDIA took a $5.5B write-down), to the 25% tariff slapped on H200 — the performance of NVIDIA chips legally available to China has been stepping down rung by rung.\n\nNVL72-class rack-scale systems have never been approved for export to China, and the foreseeable policy trajectory only tightens, never loosens. 
This means the 8×H200 vs NVL72 generational hardware gap will not close naturally over time — it will widen as NVIDIA iterates toward Rubin → Rubin Ultra.\n\nEach hardware generation upgrade unlocks larger parameter headroom for overseas models:\n\nHardware Generation| Economically Viable Ceiling (FP8, d=5%)| Timeline\n---|---|---\n8×H200 Single Server| ~0.6T| 2024-2025\n8×H200 PD Disaggregation, 9 Nodes| ~1.3T (29× bandwidth penalty)| 2024-2025\nGB200 NVL72| ~2.1T| 2025-2026\nGB300 NVL72| ~3.2T| 2026-2027\nRubin NVL72+| >3T| 2027+\n\n**If China remains locked in the H-generation era, its model scale stays on the first two rows.**\n\n### More Expensive Hardware, Cheaper Inference — The Trend May Reverse\n\nIntuitively, \"Chinese models are cheap, Western models are expensive\" seems like a durable pattern. But a trend already underway deserves attention.\n\nNVIDIA's data shows DeepSeek-R1 inference cost at $0.10/M tokens on GB200 NVL72 versus $1.56/M tokens on H200 — a 15× gap. This implies a potential trend reversal: **Western vendors deploying larger, smarter models on more efficient hardware could end up with lower per-token costs than Chinese vendors running smaller models on H200.** I explored the economic lifecycle of GPUs in \"Can a GPU Really Be Profitable for Six Years?\": H200 carries an 86% contribution margin, but high contribution margin does not equal low per-token cost.\n\nIf this reversal materializes, China's greatest current competitive advantage — price — gets eroded too.\n\n### An Assumption That Deserves Honest Scrutiny — and the Math Behind It\n\nThe key assumption underlying this entire argument is that **bigger models = smarter models**. 
This isn't merely an empirical claim from scaling laws — it has a more rigorous theoretical foundation than most people realize.\n\nIn \"Model Size and Critical Frequency Determine Expert-Level Knowledge Threshold\" I discussed a research paper,  _\"Data Mixing Can Induce Phase Transitions in Knowledge Acquisition\"_ (arXiv:2505.18091), that uncovered a striking finding: **knowledge acquisition in language models is not gradual but exhibits phase transitions** — accuracy on a given knowledge domain hovers near zero until model scale crosses a critical threshold, then suddenly surges.\n\nThis isn't a matter of \"the bigger model beats the smaller one by 5%.\" It's that **the smaller model is mathematically incapable of learning certain knowledge at all**.\n\nThe paper establishes a core relationship: every category of knowledge has a \"critical frequency\" (f_c) — the minimum number of times that knowledge must appear in training data for the model to acquire it. f_c follows an **inverse power-law relationship** with the model's total parameter count (_P_ in the memory formula above, not the GPU count  _N_). Intuitively: larger models have more \"capacity,\" enabling them to extract low-frequency knowledge from fewer data samples. Conversely, **smaller models face steeply rising critical frequencies for rare knowledge**, and this can't be compensated by simply feeding more data, because much real-world expert knowledge (medical diagnosis, legal reasoning, scientific discovery) is inherently low-frequency.\n\nWhat does this mean for the competitive landscape? If the H200 ceiling caps Chinese model parameters at ~1T while overseas models push to 2T-5T, the gap isn't simply a linear shortfall on benchmark scores — overseas models will cross a series of knowledge acquisition thresholds that Chinese models are mathematically unable to reach. The leap from 1T to 3T may correspond to entire categories of expert knowledge suddenly \"emerging\" that were previously beyond any model's grasp. 
This is not a gap that better training data mixing or more ingenious algorithmic optimization can close.\n\nOf course, if scaling laws genuinely plateau hard in the 1T+ range, the competitive implications of the hardware ceiling weaken substantially. But there's a subtle epistemological point worth noting: **when your hardware only lets you run sub-1T models, \"bigger doesn't help\" is a judgment that is both unfalsifiable and awfully convenient.** I heard on a Google expert call that Gemini 3 shares the same parameter count as 2.5 (both roughly 1T) yet achieved \"the largest performance leap ever\" — suggesting the improvement pathway is expanding from \"add parameters\" to \"add algorithms + add compute.\" But these two pathways are not mutually exclusive. If 2T+ models stacked with better algorithms pull out even larger intelligence gaps, Chinese models would face the dual pressure of scale disadvantage and algorithmic catch-up simultaneously.\n\nThree signals are worth tracking: whether overseas 2T+ models demonstrate meaningful intelligence jumps over 1T models; whether U.S. export controls expand further to cover rack-scale systems and overseas leasing channels; and whether China's domestically developed chips can achieve breakthroughs in scale-up interconnect bandwidth.\n\nDylan Patel noted in the same interview that OpenAI's compute scale now exceeds 2 GW, with Anthropic at roughly 1.5 GW — Chinese companies' compute investments (e.g., Moonshot AI's $400M/year ≈ tens of MW) trail by one to two orders of magnitude. More troublingly, Anthropic has alleged that MiniMax, DeepSeek, and Kimi engaged in model distillation. 
If true, some portion of Chinese models' capability improvements derives not from independent training but from knowledge transfer from overseas models — **as overseas models grow smarter through hardware advantages, distillation's value actually increases, but distillation itself is not an independent source of competitive strength.**\n\n### Closing Thoughts\n\nChina's AI models enjoy a real cost-efficiency advantage today, and the systems-level engineering ingenuity Chinese engineers have demonstrated on constrained hardware — PD disaggregation, MLA, DeepEP — is genuinely impressive. But the foundation of that advantage — **optimizing smaller models on older hardware to the absolute limit** — is fundamentally a product of a hardware-generational window, not a sustainable structural moat.\n\nA memory constraint formula, six layers of compounding bottlenecks, a 29× communication bandwidth gap — beneath the technical details lies a simple truth: **whoever has the larger NVLink domain can deploy the larger brain.** China's \"small and precise\" strategy is less a deliberate technical choice than an optimal adaptation to hardware constraints.\n\nAnd the boundaries of those constraints, under ever-tightening export controls, are only closing in.\n\n* * *\n\nA more detailed report and the Excel model behind the calculations above are available to paid subscribers.",
  "title": "Small by Design or by Default? What a Memory Formula Reveals About China's AI Ceiling",
  "updatedAt": "2026-03-14T12:20:25.205Z"
}