obscurNotes

Poolside AI Has Entered The Chat

Joshua White May 3, 2026

Laguna M.1 vs MiniMax M2.7

American open-weight/open-source LLM companies are rare. Not rare like "there are a few," rare like "you can basically name them." Outside of Google and OpenAI, who open-source only a small number of their models, you're left with a short list that basically comes down to Arcee, and now Poolside (and yes, there are a very small number of others). And listen, I think that's worth supporting. Not out of patriotism... out of pragmatism. The market gets healthier when there are real alternatives, and the open-weight ecosystem specifically gets stronger when companies with actual resources decide it's worth their time. Poolside has been quietly building toward this, from what I've been able to discover, for a couple of years now, and their Laguna M.1 model is a serious signal that they might be onto something.

The Roots Run Deep

Here's what you may not know about Poolside: this wasn't a consumer startup that pivoted to government work. It was a government shop that decided to share what it had been building. Jason Warner, the CEO, was GitHub's CTO during the early Copilot days. Eiso Kant, the CTO, founded source{d}, which is widely recognized as the first company whose entire purpose was applying AI to source code, and later Athenian, an AI-powered code review platform. These aren't general-purpose AI people who decided to do coding on the side. These are coding-first people who spent years thinking about what it actually means to ship a useful model to developers.

They started with government and public sector clients. This might sound like a boring origin story, yet it matters enormously for what they built. When your clients can't put data on the internet, when they're literally running models in air-gapped facilities with no external network access, you learn to build differently. You learn to build for high-security environments as a first-class requirement.
You build for correctness in a way that consumer-facing API companies simply don't have to.

Compare that to MiniMax, which is a well-funded Shanghai-based AI company with serious commercial momentum. Their M2.7 model is genuinely impressive! What they've managed to get out of such a (relatively) small model is incredible. The self-evolution of M2.7 alone is one of the more interesting things I've seen in model development in the past year. But it's worth knowing what you're looking at: a commercially sharp, publicly traded Chinese AI company backed by Alibaba, Tencent, and Shanghai state-linked investment funds. I'm not saying that disqualifies them... I'm saying it's worth knowing, and it's worth asking questions about what that means for your data, and your comfort level.

Where the Benchmarks Break Down

MiniMax M2.7 beats Poolside Laguna M.1 on most of the things you'd look at. On paper, this is not a close race. (Also worth noting, labs juice for these benchmarks all.the.time.)

And yet. I spent a weekend building a custom homelab observability dashboard, the kind of messy, real-world task that doesn't have a clean benchmark representation. M.1 handled things in one shot that MiniMax consistently struggled with. Not always. MiniMax is no slouch. But M.1 had a reliability to it that the numbers don't capture. It felt... consistent in a way that made the work actually enjoyable rather than a constant debugging exercise in my prompting.

This is where I think the benchmarks are doing us a disservice. SWE-bench Pro measures whether a model can resolve a GitHub issue. That's genuinely useful. But it doesn't measure whether a model can sit with you through an ambiguous, evolving, multi-hour coding session where the goal isn't clearly defined and you don't know what you don't know until you're halfway there. That's the difference between passing a test and doing a job.

I'd also note that the evaluation frameworks differ substantially between the two models.
Poolside uses the Laude Institute's Harbor Framework with their own agent harness, run to 500 steps in sandboxed execution. MiniMax uses internal evaluation harnesses that aren't publicly comparable. These numbers aren't lying, but they're not exactly shaking hands either. Treat them as directional.

The Max Output Problem

I tested Laguna M.1 across four different agent harnesses: OpenClaw, Hermes, Pi, and Pool (Poolside's own harness), using direct API inference from Poolside. Across OpenClaw, Hermes, and Pi, the 8K maximum output length was a consistent ceiling, resulting in nearly instant truncation errors on tool calls. It was fine in Pool's own harness, where (I assume) the work is purpose-built around the model's capabilities. In the other three, it was a genuine constraint. OpenClaw and Hermes specifically are the harnesses I use most (not a coder, most of the time!), and M.1 struggled to deliver complete outputs in those environments.

This isn't a knock on the model itself. The model might be excellent. But if your workflow depends on long, multi-turn agentic interactions with large context windows, the 8K ceiling is a real compatibility issue that needs to be on the table when you're evaluating. I'm mainly curious whether this is a model limitation or an inference-delivery limitation.

What I'd Like to See - A Letter to Poolside

First, price Laguna M.1 close to MiniMax. Not as a floor, but as a signal that you're serious about the indie and small-shop developer market. MiniMax is at $0.30 per million input tokens and $1.20 per million output tokens. Match it or beat it. The developer community notices.

Second, at least consider releasing M.1 as an open-weight model. I know the government work might make that complicated. Open-source would be ideal, but I'll take open-weight. The gesture alone would be meaningful. You've already done it with XS.2, you know how.

Third, a coding or token subscription plan.
Even at MiniMax's prices, the kind of work that agentic harnesses require, with long multi-turn interactions, large context windows, and repeated tool calls, burns through tokens fast. A flat-rate or tiered subscription would make this accessible to individuals and small teams who can't justify per-token billing at scale. The OpenClaw and Hermes ecosystems are real. They need a pricing model that fits real usage patterns, not edge cases.

Fourth... and I say this because your roots make you uniquely suited to it, embrace the Confidential AI space properly. Not as a marketing line. As a real offering. A premium tier, on both pay-as-you-go and the coding/token plan, that is fully TEE-enabled with hardware encryption. The government clients taught you how to build this. The developer market is waking up to the same need. Even as an independent subscriber, I have TEE-enabled inference. There's a premium tier in the Confidential AI space that no one is properly serving right now.

An American Bet Worth Making

Poolside's M.1 is not the benchmark winner. MiniMax M2.7 is more impressive on paper, and if you're choosing purely on raw numbers, you might pick MiniMax (but again, MiniMax = internally tested). However, M.1 is a model that has made me feel like I've been working with a tool that understands me. The latency was right. The reliability was right.

We need more American companies willing to take the open-weight ecosystem seriously. Not as a marketing angle, but as a genuine commitment. Poolside's Laguna XS.2 is already Apache 2.0, weights on Hugging Face, TensorRT-LLM support on day one. That's real. And if M.1 eventually goes the same direction, which I think it will (maybe not full open-source, but at least open-weight), then this is a company worth getting to know early.

And more broadly... this is yet another sign that the thesis is right. You do not need to spend enterprise-Claude money to get enterprise-grade work done.
Not for coding, not for agents, not for the kind of daily-driver work that most of us are actually doing. The frontier is widening, the middle is getting stronger, and the price-performance curve is bending in exactly the direction it needs to bend.

---

*Tested with direct API inference from Poolside across OpenClaw, Hermes, Pi, and Pool. MiniMax M2.7 tested via MiniMax Standard Speed coding/token plan API. I've used MiniMax as a daily driver since October 2025, and have been using Poolside's Laguna M.1 exclusively for coding use for about 5 days.
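
For anyone who wants to sanity-check the pricing ask in the letter above, here is a rough back-of-the-envelope sketch. The two prices are the MiniMax figures quoted in the post. Everything else is an assumption made up purely for illustration: the session size (4M input tokens, 200K output tokens), the $20 flat monthly fee, and the function names.

```python
# Back-of-the-envelope cost of agentic sessions under per-token billing.
# The two prices are the MiniMax figures quoted in the post; the session
# size and the flat fee used below are hypothetical, not published numbers.

INPUT_PRICE_PER_M = 0.30   # USD per million input tokens (as quoted)
OUTPUT_PRICE_PER_M = 1.20  # USD per million output tokens (as quoted)


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session under per-token billing."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


def breakeven_sessions(flat_monthly_fee: float, cost_per_session: float) -> float:
    """Sessions per month at which a flat fee costs the same as per-token billing."""
    return flat_monthly_fee / cost_per_session


if __name__ == "__main__":
    # Hypothetical long agentic session: context is re-sent on every turn,
    # so input tokens dominate (4M in, 200K out).
    cost = session_cost(4_000_000, 200_000)
    print(f"per session: ${cost:.2f}")

    # Hypothetical $20/month flat plan.
    print(f"break-even: {breakeven_sessions(20.0, cost):.1f} sessions/month")
```

Under those made-up numbers a session costs about $1.44, so a $20 flat plan would pay for itself at roughly 14 sessions a month. Swap in your own token counts; the point is only that input tokens dominate in multi-turn harness work, which is why a flat or tiered plan matters.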
