Nvidia claims 10x cost savings with open-source inference models

Network World [Unofficial] February 13, 2026

Nvidia has released an analysis showing a 4x to 10x reduction in cost per token for AI inference when customers switch to open-source models running on its Blackwell platform.

The cost reductions were achieved by pairing Nvidia’s Blackwell GPU platform with open-source models served by inference providers Baseten, DeepInfra, Fireworks AI, and Together AI. The deployments showed significant cost improvements across healthcare, gaming, agentic chat, and customer service.

The cost reductions required combining Blackwell hardware with two other elements: optimized software stacks (in this case Nvidia’s TensorRT-LLM library and Dynamo inference framework) and a switch from proprietary to open-source models of comparable intelligence.

Nvidia noted that cost per million tokens fell from 20 cents on the older Hopper platform to 10 cents on Blackwell. Moving to Blackwell’s native low-precision NVFP4 format cut the cost further, to just 5 cents, so the upgrade delivered a 4x improvement in token cost while maintaining the accuracy that customers expect.
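The arithmetic behind the 4x figure can be checked with a short sketch. The 20, 10 and 5 cent prices are the ones quoted above; the dictionary and the helper name `improvement_factor` are illustrative assumptions, not anything Nvidia published. Because only ratios are computed, the result is the same whatever unit of tokens the prices refer to.

```python
# Costs in cents, as quoted in the article. The unit cancels in the ratios.
costs_cents = {
    "Hopper": 20.0,
    "Blackwell": 10.0,
    "Blackwell + NVFP4": 5.0,
}


def improvement_factor(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost."""
    return old_cost / new_cost


baseline = costs_cents["Hopper"]
for platform, cost in costs_cents.items():
    factor = improvement_factor(baseline, cost)
    print(f"{platform}: {cost:g} cents, {factor:g}x vs. Hopper")
```

The hardware move alone accounts for 2x, and the NVFP4 format for another 2x, which multiply to the quoted 4x.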

Nvidia outlined four industry deployments in a blog post showing how this combination of Blackwell infrastructure, NVFP4, optimized software stacks and open-source models delivers significant cost reductions. They break down like this:

  • Healthcare: Tedious, time-consuming tasks like medical coding, documentation and managing insurance forms cut into the time doctors can spend with patients. Sully.ai tackles this problem with AI agents that handle those routine tasks.

The problem is that Sully.ai’s proprietary, closed-source models didn’t scale well. So Sully.ai used Baseten’s Model API to run open-source models on Blackwell GPUs with the NVFP4 data format, the TensorRT-LLM library and the Dynamo inference framework. The result was a 90% drop in inference costs, representing a 10x reduction compared with the prior closed-source implementation, while response times improved by 65% for critical workflows like generating medical notes.

  • Gaming: Developer Latitude is building an AI-native adventure-story game, Voyage, where players can create or play worlds with the freedom to choose any action and make their own story. The problem is that the large language models used to respond to players’ actions didn’t scale.

By running Voyage on large open-source models hosted on DeepInfra’s inference platform, along with Blackwell GPUs and TensorRT-LLM, Latitude can deliver fast, reliable responses cost-effectively while handling traffic spikes and deploying more capable models without compromising the player experience.

  • Agentic Chat: Sentient Labs builds open-source reasoning AI systems aimed at solving harder reasoning problems, drawing on research in secure autonomy, agentic architecture and continual learning.

Like the other two examples, Sentient Labs faces significant scaling problems, which can be triggered by a single complex query. To handle that scale and complexity, Sentient uses Fireworks AI’s inference platform running on Blackwell and achieved 25% to 50% better cost efficiency compared with its previous Hopper-based deployment.

  • Customer Service: Many customers find voice AI an unpleasant experience. Decagon builds AI agents for enterprise customer support, and AI-powered voice is its most demanding channel. Decagon needed infrastructure that could deliver sub-second responses under unpredictable traffic loads, with token economics that supported 24/7 voice deployments.

Decagon worked with Nvidia on optimizations to its system and saw response times under 400 milliseconds even when processing thousands of tokens per query. Cost per query, which is the total cost to complete one voice interaction, fell 6x compared with using closed-source proprietary models.
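The deployments above quote savings in two forms, as a percentage drop (Sully.ai’s 90%) and as a multiple (Decagon’s 6x). A minimal sketch of how the two relate follows; the function names are hypothetical, and it assumes that an "Nx reduction" means the new cost is 1/N of the old cost.

```python
def percent_saved(factor: float) -> float:
    """Percentage cost drop implied by an N-times cost reduction."""
    return (1.0 - 1.0 / factor) * 100.0


def factor_from_percent(percent: float) -> float:
    """N-times cost reduction implied by a percentage cost drop."""
    return 1.0 / (1.0 - percent / 100.0)


# Sully.ai: a 90% drop is the same claim as a 10x reduction.
print(f"90% drop -> {factor_from_percent(90):.1f}x")    # 10.0x
# Decagon: a 6x reduction means each query costs about 83.3% less.
print(f"6x reduction -> {percent_saved(6):.1f}% less")  # 83.3% less
```

Sentient’s "25% to 50% better cost efficiency" is left out of the sketch because the article does not say whether it measures lower cost per token or more tokens per dollar, and the two readings convert differently.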
