External Publication

There be goblins

Tomcw.xyz May 16, 2026

In April, OpenAI published a blog post called "Where the goblins came from".

Starting with GPT-5.1, their models had developed an unprompted habit of mentioning goblins, gremlins, trolls, and ogres in metaphors. Use of "goblin" in ChatGPT went up 175%. The cause: during training of the Nerdy personality, they had accidentally given high reward signals for "metaphors with creatures." The goblins generalised. They turned up in places they had no business being.

The fix was a line added to the system prompt: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query."

Yes, the most* resourced AI company in the world responded to an unwanted behaviour in its own model by telling it off. The model still has whatever produced the goblins. They've just instructed it not to act on it. The goblins are still in there.

Now, as someone who grew up playing Ghosts 'n Goblins on the Commodore 64, and whose favourite book growing up was The Hobbit (OK, orcs and goblins are different), I say we need more goblins in our lives. But behind the goblin story is a very real thing: what is in a large language model is hard to get rid of.

Now rather than goblins, imagine a council has been using an AI model to prioritise social care assessments for the past year, trained on historical case data. Someone exercises their right to be forgotten. A tribunal rules that certain historical data was collected without proper consent. Or, and this is the one that should keep people awake, the model has memorised specific case details and can be prompted to leak them.

What happens now?

You have two choices: retrain the entire model (millions of pounds, weeks of work, potential loss of other capabilities) or don't. Cross your fingers. Hope the data subject doesn't notice. Add a line to the system prompt telling the model not to mention any of it.

Are we really running public services on AI systems whose primary mechanism for compliance is forgetfulness on the part of the people checking them?

Democracy is mutable. Models aren't.

Large language models are treated as write-once assets. You train them, you deploy them, and if something's wrong, you throw them away and start again.

This might just about work for commercial AI. It doesn't work for public services, where policies change, case law evolves, data rights are enforced, decisions are appealed, and transparency is a legal requirement.

You can't run accountable public services on immutable systems.

The fiction of "responsible AI" when we don't own anything

Most public bodies aren't training their own models. They're buying Microsoft Copilot, embedding GPT via vendor wrappers, using Anthropic APIs or, worse, signing large parts of our health infrastructure over to Palantir. They are deploying agentic systems built on foundation models they don't own and can't inspect. The unlearning problem has two faces, and neither is being honestly addressed.

The first is the vendor problem. When a citizen exercises their rights, how does a council compel OpenAI, Microsoft, or Anthropic to surgically edit their model? They can't. The best the vendor can offer is "we didn't train on your data" or "we'll delete your logs." Neither touches the model. That's not unlearning.

The second is the workaround problem. Faced with the gap between what's legally required and what's technically possible, the sector has reached for techniques and called them governance:

Retrieval-augmented generation doesn't make the model forget. It steers it away from what it knows.
Fine-tuning and LoRA adapters don't remove information. They layer new behaviour on top of old.
Prompt engineering is thinner still. It tells the model not to mention something it still knows.

Every one of these is a sticking plaster. The data is still in there. A sufficiently determined prompt, a sufficiently motivated and adversarial user, a sufficiently novel context, and the original surfaces. There be goblins.

What accountability actually requires

Public systems have always needed a clear audit trail of decisions, the ability to update when policy or law changes, a route for individuals to challenge decisions, transparency to oversight bodies, and resilience to staff and supplier change.

None of these are AI-specific requirements. They're administrative requirements, and they've always been there. The old world of paper case files met them imperfectly. Database systems met them better in some ways, worse in others. AI systems, as currently deployed, meet almost none of them.

What can't currently happen

Data rights. An applicant exercises their right to be forgotten. The system removes their specific case data from the model's memory without affecting other cases. Today: impossible. You delete a database record while the model carries on.

Policy change. Eligibility criteria change. The model is directly edited to reflect new policy, live within hours. Today: the model continues applying the old rules until someone notices, and the vendor's roadmap dictates when it changes.

Bias discovery. An audit reveals the model is misjudging urgent needs for a specific group. Forensic tools find the pattern in the weights. Surgical intervention corrects it. Today: "we'll retrain with better data", an unverifiable promise, because the models and their training data aren't open (well, some are, but they aren't widely used).

Training data challenge. A court rules certain historical data shouldn't have been collected. The model unlearns it, with an audit trail. Today: the data is in there forever, and everyone hopes it doesn't matter, or we prompt it out.

No audit trail, no accountability

Even if a vendor said "we edited the model", how would anyone verify it?

There's no equivalent of a git history for model weights. No cryptographic marker that says "this model state was derived from these training inputs, with these subsequent edits, at these times, signed by these parties." No reproducible evaluation an independent party could re-run.

Without that, "we removed that data" is unknowable and unverifiable. We are running on vibes (and I love vibes, but sometimes vibes alone won't save us). With it, you get something closer to the audit trails public administration has always required: versioned model states with public hashes, logged edits with timestamps and authorising party, reproducible evaluation suites, independent assurance.

This is what the technical layer of an Open Org Standard approach would look like applied to AI: federated, verifiable claims about organisational state, including the AI systems organisations run.
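To make the idea of versioned model states and logged edits a little more concrete, here is a minimal sketch in Python of a hash-chained edit log. Everything in it is an assumption for illustration: the function names (weights_digest, append_edit, verify_log), the field names, and the example authorising party are all hypothetical, and the weights are stand-in byte strings rather than a real model file. It also leaves out the "signed by these parties" part, which would need real digital signatures on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def weights_digest(weights: bytes) -> str:
    """Public fingerprint of a model state (here: SHA-256 of the raw bytes)."""
    return hashlib.sha256(weights).hexdigest()


def _entry_hash(body: dict) -> str:
    # Hash a canonical JSON form so the same body always gives the same hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_edit(log: list, weights: bytes, description: str,
                authorised_by: str, timestamp: str) -> dict:
    """Append one logged edit, chained to the previous entry's hash."""
    body = {
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
        "weights_sha256": weights_digest(weights),
        "description": description,
        "authorised_by": authorised_by,
        "timestamp": timestamp,
    }
    entry = {**body, "entry_hash": _entry_hash(body)}
    log.append(entry)
    return entry


def verify_log(log: list) -> bool:
    """Recompute every hash; any altered or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev or _entry_hash(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


# Hypothetical usage: two model states, the second produced by an edit.
log = []
append_edit(log, b"model-weights-v1", "initial deployment",
            "data-protection-officer", "2026-05-16T09:00:00Z")
append_edit(log, b"model-weights-v2", "removed one case record on request",
            "data-protection-officer", "2026-05-17T09:00:00Z")
print(verify_log(log))
```

A log like this only proves that the record of edits hasn't been quietly rewritten. It says nothing about whether an edit actually removed what it claims to have removed, which is why the reproducible evaluation suite and independent assurance still matter.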

Sovereignty, or there is no governance

If governable AI requires the ability to edit, audit, and attest, and if vendors structurally cannot, or will not, offer that, then the only path is to run models you can actually edit. Open weights, open training data. Infrastructure you control. The technical capacity to do the editing.

This is the TechFreedom argument applied directly to AI. The five lenses all bite. Jurisdiction: where does the model run, and whose laws apply? Business continuity: what happens when the vendor changes terms, or gets acquired, or sunsets the product? Surveillance: what's being logged, and by whom? Lock-in: can you migrate, or are your prompts and workflows welded to one provider?

The honest answer for most public and social bodies: they've taken on AI dependencies they cannot govern, in service of efficiencies they haven't measured, on terms they didn't negotiate.

Should public bodies do more of their own model work? I'd say yes, not because they should all become AI labs, but because somebody in the social purpose ecosystem needs to. Is there anywhere this is actually happening?

Potions and magic

OK, so is any of this actually possible? Or is it just potions and magic? Well, a bit of both, really.

The research field around machine unlearning is now substantial: hundreds of papers, standard benchmarks like TOFU, WMDP, and MUSE, and active competitions. The techniques include gradient ascent (running training in reverse on the data you want forgotten), representation misdirection (disrupting the pathways to specific knowledge rather than deleting it), and model editing (surgically updating specific weights). They work, in narrow conditions, on benchmark tasks.

But real life isn't a tightly controlled benchmark task, and when these techniques are used in the real world, almost all of them fail under what researchers call relearning attacks. Fine-tune an unlearned model on a small amount of publicly available, loosely related data and the supposedly forgotten knowledge comes back. The goblins, basically. Recoverable with a light touch.

So we have two problems running in parallel. The research-grade techniques aren't good enough yet. And even those techniques aren't being deployed in commercial AI products at all.
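For anyone who wants to see what "running training in reverse" means mechanically, here is a toy sketch in Python. It is an assumption-laden illustration, not how any lab does it: the model is a two-parameter logistic regression rather than an LLM, the data points are made up, and the names (retain, forget, step) are hypothetical. Normal training steps down the loss gradient; the unlearning loop steps up it, but only on the record to be forgotten.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(params, example):
    """Cross-entropy loss of a one-feature logistic model on one (x, y) record."""
    w, b = params
    x, y = example
    p = sigmoid(w * x + b)
    eps = 1e-12  # avoids log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def mean_loss(params, examples):
    return sum(loss(params, e) for e in examples) / len(examples)


def step(params, examples, lr, direction):
    """One gradient step. direction=-1.0 is descent (learn), +1.0 is ascent (unlearn)."""
    w, b = params
    grad_w = grad_b = 0.0
    for x, y in examples:
        err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
        grad_w += err * x / len(examples)
        grad_b += err / len(examples)
    return (w + direction * lr * grad_w, b + direction * lr * grad_b)


# Made-up data: records we want to keep, and one record to be forgotten.
retain = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
forget = [(0.5, 0)]

# Ordinary training: gradient descent on everything.
params = (0.0, 0.0)
for _ in range(500):
    params = step(params, retain + forget, lr=0.5, direction=-1.0)
loss_before = loss(params, forget[0])
retain_before = mean_loss(params, retain)

# "Unlearning": gradient ascent on the forget record only.
unlearned = params
for _ in range(30):
    unlearned = step(unlearned, forget, lr=0.5, direction=+1.0)
loss_after = loss(unlearned, forget[0])
retain_after = mean_loss(unlearned, retain)

print("forget-record loss:", loss_before, "->", loss_after)
print("retained-data loss:", retain_before, "->", retain_after)
```

The forget-record loss goes up, which is the whole mechanism: the model is pushed to get that record wrong. Comparing the retained-data loss before and after shows what the ascent did to everything else. Nothing here demonstrates that the information is gone, or that it would survive a relearning attack.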

The watchers on the wall

This week the National Lottery Community Fund announced £3m for an "AI Pulse Network". In the announcement there is a mention of maybe pushing for small, specific AI models. Maybe a charity supporting people with benefit claims is funded to spot when algorithmic decisions go wrong and share warning signs with the network.

But what happens if we notice? What if there are large signals that something is wrong? Can we actually do anything about it? Do we just point out the goblins and hope they are prompted out eventually by someone?

What could we do

Focus some money, resource, time. One pilot with one open-weights model, small, specific, on owned infrastructure. One use case: benefits assessment, say, or social care prioritisation. A published edit log. A documented evaluation suite anyone can re-run. Then run this alongside a major model from an outside provider. Transparent comparison. My tool, Bearing, allows side-by-side comparison of models on the same task, so that part is pretty easy. Now we just need to focus on the evaluation.

Six months in, we'd know more about the real cost of governable AI than five years of briefings from OpenAI will tell us. We'd know what the failure modes look like. We'd know what an audit trail in this domain actually needs to contain. We'd know whether the unlearning techniques the research community is producing are working and mature enough to deploy.
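As a rough idea of what "a documented evaluation suite anyone can re-run" could look like at its very simplest, here is a sketch in Python. It is not how my own tooling works, and every name and case in it is hypothetical: the two "models" are stand-in keyword rules, where a real pilot would call the open-weights model and the outside provider's API, and the three test cases are invented. The only point is the shape: the same published cases go to every model, and every failure is written down.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (case description, expected priority)


def evaluate(model: Callable[[str], str], cases: List[Case]) -> dict:
    """Run one model over the published cases and record each failure."""
    failures = []
    for text, expected in cases:
        got = model(text)
        if got != expected:
            failures.append({"input": text, "expected": expected, "got": got})
    return {"accuracy": 1 - len(failures) / len(cases), "failures": failures}


def compare(models: Dict[str, Callable[[str], str]], cases: List[Case]) -> dict:
    """Side-by-side report: the same cases, every model."""
    return {name: evaluate(model, cases) for name, model in models.items()}


# Stand-in models (assumptions for illustration, not real systems).
def open_weights_pilot(text: str) -> str:
    return "urgent" if "fall" in text or "no heating" in text else "routine"


def external_vendor(text: str) -> str:
    return "urgent" if "fall" in text else "routine"


# Invented evaluation cases; a real suite would be agreed and published.
cases = [
    ("recent fall at home", "urgent"),
    ("no heating in winter", "urgent"),
    ("annual review request", "routine"),
]

report = compare(
    {"open_weights_pilot": open_weights_pilot, "external_vendor": external_vendor},
    cases,
)
for name, result in report.items():
    print(name, result["accuracy"], result["failures"])
```

The results from stand-in rules mean nothing, of course. What matters is that the cases, the expected answers, and the failures are all in the open, so an independent party can re-run the lot and get the same report.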

Without work like this, we'll be chasing goblins forever.