Strap in (and harness up)
Once upon a time I was a serious climber. I knew the names, I'd worked my way through the routes in Classic Rock , I'd been brought up on it. There's a photo of me on a rock face somewhere around the age of ten.
Me, about ten, somewhere on a rock face.
And I loved the history of it as much as the climbing - who'd made the first ascents, who'd got there first. The Sheffield dirtbags, Jerry Moffatt pushing Uk climbing to the limits. Further back, Joe Brown and Don Whillans, pushing the grit to places it had never been. Further back still, the early ones scratching up cracks in hobnail boots, setting routes that are still hard today.
In the 80s and 90s the grades really took off. Sport climbing and bouldering played their part, and so did sheer skill and bloody-mindedness. But it was also the kit. New shoes made an enormous difference. And so, I think, did the humble harness. As things got safer, people pushed harder.
Trace the harness back and at first there's nothing, you just climb, and falling is not an option. Then comes the rope tied round your waist. It works, mostly. But there's a fair chance that in saving your life it also breaks your back, crushes your ribs, wrecks your spine, flips you upside down. The thing keeping you alive could also be the thing that ruins you.
So it got better. Webbing round the waist instead of rope, the swami belt. Then somebody added leg loops, so the force went through your hips instead of your gut. In 1970 a British company sewed the first proper one-piece sit harness for an expedition up Annapurna. It was designed by Don Whillans, yes the same Whillans who'd been pushing the grit twenty years earlier. The climber who made the routes harder ended up lending his name to the thing that made them safer. It was stiff, uncomfortable, and had a strap in a deeply unfortunate place. But it was a real harness. Then came adjustable fits, padding, and proper testing standards, and that's the bit that made everyone trust them enough to stop thinking about them.
No harness, to a rope that might break your back, to a tested bit of kit you forget you're wearing. As it got safer, people climbed harder. I've been thinking about that a lot lately, because I reckon AI is somewhere around the swami-belt stage of its own.
Pushing at the edges
The thing about climbing harder is that often you only get to do it because the kit underneath you got more trustworthy. The grades went up because the gear let people stop thinking about whether they'd survive the fall.
AI is in that phase now, pushing at the edges, new models and capabilities arriving faster than anyone can keep up with, everyone reaching further and harder than they were even six months ago. And when things move that fast, the worst thing you can do is bet on the model. The model you pick today is a snapshot of a moving thing. Tie your work tightly to it and you've tied yourself to a number that'll be stale by the time you've finished building.
I've argued for a long time, through TechFreedom and elsewhere, that we should be provider and model agnostic, building flexibility into our stacks, our builds, our thinking, so we're never captured by one company's pricing or one model's quirks. That's the freedom that actually matters: the ability to move.
But you can't be model agnostic if the model is where all your value lives. If swapping it out means rebuilding everything, you're not agnostic, you're stuck. The thing that makes you free to swap is the scaffolding around the model, the harness. Get that right and the model becomes a part you can change at will. Get it wrong and you're welded to whoever you started with.
So the question isn't just which model. It's what are you building around it.
I built Bearing to help with some of this agnosticism, to show that you can do many of the things you want to do with the frontier models with small, more open, more sustainable models. You give it a real task and it weighs it against a registry of models, both frontier and open, with sustainability and transparency folded in and it tells you what actually fits. It's showing to be genuinely useful: it cuts through the leaderboard noise and lets you choose on the merits, which is the whole point of staying agnostic. I lean on it a lot and others are starting to also.
But using it well taught me something I didn't expect. To compare models fairly, you have to hold the conditions around them steady, the same harness, or each in the harness that suits it. Otherwise you're not comparing models at all, you're comparing scaffolds and crediting the model. Most of the "model A versus model B" takes flying around online are quietly measuring the harness and giving the model the medal.
What a harness is for
When you climb, the harness isn't really about not dying, day to day. It's the thing that lets you commit. You'll try a move you'd never go for on a clean fall, because you trust what's holding you. The system around you; the rope, the gear, the belayer, is what turns a body that can climb into someone who actually gets up the route.
It's the same with these models. The model is the climber. But the harness, everything around it, the tools it can reach for, the way it checks its own work, the instructions it's been given, the loop it runs in is doing far more of the work than the leaderboard lets on. And unlike the model, the harness is the bit I actually get to build.
Most people are still tying the rope round their waist. Running a raw model in a chat box, hoping. It holds. But it's not the kit it could be.
What I actually do
So what does this mean in practice? There are a few ways it show's up for me on a day to day basis.
I've experienced many of the AI assisted coding tools. Codex, Cursor, Github co pilot, VS Code with Github co-pilot, Claude Code, Open Code, on and on these go. At point Claude Code was the one, but even then, I did't run Claude Code raw. I modified it. I created a template to drop into projects that gives it state tracking, a mistakes log it learns from, slash commands, and subagents that review the work. None of that touches the model. It changes everything around it. Verification loops, on there own, maybe improve the quality of the output two or three times. Same model. The only difference is whether it can check itself before it hands you something.
But I can go beyond Claude, use different models, point things at Ollama Cloud for open models, or lighter models. But I don't just swap the model in and hope. I write the harness for it. A specific set of .md files giving it the context and guardrails it needs, because an open model dropped in cold is noticeably worse than the same model with a proper harness around it.
For longer agentic jobs I'll reach for a dedicated agent rather than a chat window. For code review I'll use a harness built for reviewing code, not a general one. Different jobs, different rigs. I've stopped thinking of it as choosing a model and started thinking of it as choosing a rig.
The numbers, briefly
Take my word for it. Or maybe actually don't.
If you watch the almost-daily leaderboards, a new leader emerges all the time shifting by a point here, a point there. But on the harder coding benchmarks, changing only the scaffolding around a fixed model moves the score by twenty-odd points. Swapping the model itself, at the top end, moves it by about one. And six of the frontier models now sit within a single point of each other anyway, so the thing everyone's anxious about choosing between is mostly a wash. There's even a published result where a smaller, cheaper model in a custom-built agent edged out the flagship running on its maker's own setup. Smaller climber, better rig, higher up the route.
I'm not saying the model never matters. On the genuinely hard, tangled, novel problems the frontier models do pull ahead, and that's real.
But....
I pulled the routing data out of Bearing, currently 222 real tasks, about two dozen candidate models weighed for each. I went in assuming most of it would be easy stuff where obviously you don't need the big model. It wasn't. Nearly half the tasks were genuinely complex.
And yet a top-of-the-board flagship came out as the best fit only about one time in six. The other five-sixths of the time the right tool was something else and more than half the time it was an open or self-hostable model.
And the difficulty or complexity barely changed this. Complex tasks went to the flagship about as rarely as moderate ones. Eighteen percent against seventeen. Difficulty isn't what sends you to the expensive model. Something more specific does, some particular mix of what the task actually needs, and most of the time that mix points somewhere cheaper, lighter, closer to home.
So the hesitation people feel, the I'd better stick with the big one to be safe, the data just doesn't back it. The safe choice is mostly a more expensive habit.
Why I keep saying open
There's more to this than just saving people money though.
If much of the capability lives in the harness, then maybe the harness is the thing worth keeping open. Open weights matter, of course they do, they're the only reason the cheaper, local, sovereign options exist at all. But a closed harness wrapped around an open model still leaves you renting the part that does most of the work. You don't really own your tools. You're still in someone else's rig.
Alongside open models we need open harnesses. Scaffolds you can read, fork, change, and pass on. My Claude Code template is a small one and it's just sitting on GitHub for anyone who wants it.
We spent a few years treating the model as the prize and the scaffolding as plumbing. I'm increasingly sure it's the other way round.
Learn the craft
Learn to build harnesses. Spend the time on it like you'd spend it learning any craft. When you've built a few rigs, shaped them, broken them, fixed them, you start to feel where the capability actually sits. You can tell when something's genuinely changed and when it's just a new number. You can pick up a smaller or open model, build the right harness around it, and show people what it can do, which is far more convincing than any benchmark, and the thing most likely to break that hesitation.
And if all you know is one harness, tied to one model, from one company, you're doubly exposed. You're dependent on that model staying good and staying affordable. And you're dependent on that company's direction, their roadmap, their pricing, their priorities, none of which are yours. That's not a position of strength. It's the rope round the waist: it holds, until the day it doesn't, and then it hurts.
This matters most, I think, for the sectors I work in. We have deep specialist knowledge about communities, about need, about how this work actually gets done and almost none of it is encoded anywhere a machine can use it. If we spent our effort building harnesses around that, the specialisms we know, the contexts we understand rather than waiting for a model vendor to serve us, we'd get more flexibility and better outcomes, and we'd own more of the parts that matter.
Discussion in the ATmosphere