Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihv47yguf6uw7ijh7xrc7axx7vqucinyrnontuhxzmywwzxtakwfe",
    "uri": "at://did:plc:mi64j333om5ptdeosrgdsopz/app.bsky.feed.post/3mly6mcqbvo62"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreiew7rvhbkrvykpoqcbvqbwdsgccyimjhxrhnxpwf3anyei2ho7pi4"
    },
    "mimeType": "image/jpeg",
    "size": 137515
  },
  "description": "In April, OpenAI published a blog post called Where the goblins came from.\n\nStarting with GPT-5.1, their models had developed an unprompted habit of mentioning goblins, gremlins, trolls, and ogres in metaphors. Use of \"goblin\" in ChatGPT went up 175%. The cause: during training of the Nerdy personality, they had accidentally given high reward signals for \"metaphors with creatures.\" The goblins generalised. They turned up in places they had no business being.\n\nThe fix was a line added to the syst",
  "path": "/there-be-goblins/",
  "publishedAt": "2026-05-16T15:39:54.000Z",
  "site": "https://tomcw.xyz",
  "tags": [
    "Where the goblins came from",
    "Ghosts n Goblins on the commodore 64",
    "Open Org Standard",
    "TechFreedom",
    "TOFU",
    "WMDP",
    "MUSE",
    "gradient ascent",
    "representation misdirection",
    "model editing",
    "relearning attacks",
    "bearing"
  ],
  "textContent": "In April, OpenAI published a blog post called Where the goblins came from.\n\nStarting with GPT-5.1, their models had developed an unprompted habit of mentioning goblins, gremlins, trolls, and ogres in metaphors. Use of \"goblin\" in ChatGPT went up 175%. The cause: during training of the Nerdy personality, they had accidentally given high reward signals for \"metaphors with creatures.\" The goblins generalised. They turned up in places they had no business being.\n\nThe fix was a line added to the system prompt: _\"Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.\"_\n\nYes, the most* resourced AI company in the world responded to an unwanted behaviour in its own model by telling it off. The model still _has_ whatever produced the goblins. They've just instructed it not to act on it. The goblins are still in there.\n\nNow as someone who grew up playing Ghosts n Goblins on the commodore 64, who's favourite book growing up was the hobbit (ok orcs/goblins are different) I say we need more goblins in our lives. But behind the goblins story is a very real thing, what is in a large language model is hard to get rid of.\n\nNow rather than goblins, imagine a council has been using an AI model to prioritise social care assessments for the past year, trained on historical case data. Someone exercises their right to be forgotten. A tribunal rules that certain historical data was collected without proper consent. Or, and this is the one that should keep people awake, the model has memorised specific case details and can be prompted to leak them.\n\nWhat happens now?\n\nYou have two choices: retrain the entire model (millions of pounds, weeks of work, potential loss of other capabilities) or don't. Cross your fingers. Hope the data subject doesn't notice. 
Add a line to the system prompt telling the model not to mention any of it.\n\nAre we really running public services on AI systems whose primary mechanism for compliance is forgetfulness on the part of the people checking them?\n\n## Democracy is mutable. Models aren't.\n\nLarge language models are treated as write-once assets. You train them, you deploy them, and if something's wrong, you throw them away and start again.\n\nThis might just about work for commercial AI. It doesn't work for public services, where policies change, case law evolves, data rights are enforced, decisions are appealed, and transparency is a legal requirement.\n\nYou can't run accountable public services on immutable systems.\n\n## The fiction of \"responsible AI\" when we don't own anything\n\nMost public bodies aren't training their own models. They're buying Microsoft Copilot, embedding GPT via vendor wrappers, using Anthropic's APIs or, worse, signing large parts of our health infrastructure over to Palantir. They are deploying agentic systems built on foundation models they don't own and can't inspect. The unlearning problem has two faces, and neither is being honestly addressed.\n\nThe first is the vendor problem. When a citizen exercises their rights, how does a council compel OpenAI, Microsoft, or Anthropic to surgically edit their model? They can't. The best the vendor can offer is \"we didn't train on your data\" or \"we'll delete your logs.\" Neither touches the model. That's not unlearning.\n\nThe second is the workaround problem. Faced with the gap between what's legally required and what's technically possible, the sector has reached for techniques and called them governance:\n\n  * **Retrieval-augmented generation** doesn't make the model forget. It steers it away from what it knows.\n  * **Fine-tuning and LoRA adapters** don't remove information. They layer new behaviour on top of old.\n  * **Prompt engineering** is thinner still. 
It tells the model not to mention something it still knows.\n\nEvery one of these is a sticking plaster. The data is still in there. A sufficiently determined prompt, a sufficiently motivated and adversarial user, a sufficiently novel context, and the original surfaces. There be goblins.\n\n## What accountability actually requires\n\nPublic systems have always needed a clear audit trail of decisions, the ability to update when policy or law changes, a route for individuals to challenge decisions, transparency to oversight bodies, and resilience to staff and supplier change.\n\nNone of these are AI-specific requirements. They're administrative requirements, and they've always been there. The old world of paper case files met them imperfectly. Database systems met them better in some ways, worse in others. AI systems, as currently deployed, meet almost none of them.\n\n## What can't currently happen\n\n**Data rights.** An applicant exercises their right to be forgotten. The system removes their specific case data from the model's memory without affecting other cases. Today: impossible. You delete a database record while the model carries on.\n\n**Policy change.** Eligibility criteria change. The model is directly edited to reflect new policy, live within hours. Today: the model continues applying the old rules until someone notices, and the vendor's roadmap dictates when it changes.\n\n**Bias discovery.** An audit reveals the model misjudges urgent needs for a specific group. Forensic tools find the pattern in the weights. Surgical intervention corrects it. Today: \"we'll retrain with better data\", an unverifiable promise, because the models and their training data aren't open (well, some are, but they are not widely used).\n\n**Training data challenge.** A court rules certain historical data shouldn't have been collected. The model unlearns it, with an audit trail. 
Today: the data is in there forever, and everyone hopes it doesn't matter, or we prompt it out.\n\n## No audit trail, no accountability\n\nEven if a vendor said \"we edited the model\", how would anyone verify it?\n\nThere's no equivalent of a git history for model weights. No cryptographic marker that says \"this model state was derived from these training inputs, with these subsequent edits, at these times, signed by these parties.\" No reproducible evaluation an independent party could re-run.\n\nWithout that, \"we removed that data\" is unknowable, unverifiable. We are running on vibes (and I love vibes, but sometimes vibes alone won't save us). With it, you get something closer to the audit trails public administration has always required: versioned model states with public hashes, logged edits with timestamps and authorising party, reproducible evaluation suites, independent assurance.\n\nThis is what the technical layer of an Open Org Standard approach would look like applied to AI: federated, verifiable claims about organisational state, including the AI systems organisations run.\n\n## Sovereignty, or there is no governance\n\nIf governable AI requires the ability to edit, audit, and attest, and if vendors structurally cannot, or will not, offer that, then the only path is to run models you can actually edit. Open weights, open training data. Infrastructure you control. The technical capacity to do the editing.\n\nThis is the TechFreedom argument applied directly to AI. The five lenses all bite. Jurisdiction: where does the model run, and whose laws apply? Business continuity: what happens when the vendor changes terms, or gets acquired, or sunsets the product? Surveillance: what's being logged, and by whom? 
Lock-in: can you migrate, or are your prompts and workflows welded to one provider?\n\nThe honest answer for most public and social bodies: they've taken on AI dependencies they cannot govern, in service of efficiencies they haven't measured, on terms they didn't negotiate.\n\nShould public bodies do more of their own model work? I'd say yes, not because they should all become AI labs, but because _somebody_ in the social purpose ecosystem needs to. Is there anywhere this is actually happening?\n\n### Potions and magic\n\nOK, so is any of this actually possible? Or is it just potions and magic? Well, a bit of both, really.\n\nThe research field around machine unlearning is now substantial: hundreds of papers, standard benchmarks like TOFU, WMDP, and MUSE, and active competitions. The techniques include gradient ascent (running training in reverse on the data you want forgotten), representation misdirection (disrupting the pathways to specific knowledge rather than deleting it), and model editing (surgically updating specific weights). They work, in narrow conditions, on benchmark tasks.\n\nBut real life isn't a tightly controlled benchmark task, and when these techniques are used in the real world, almost all of them fail under what researchers call relearning attacks. Fine-tune an unlearned model on a small amount of _publicly available, loosely related_ data and the supposedly forgotten knowledge comes back. The goblins, basically. Recoverable with a light touch.\n\nSo we have two problems running in parallel. The research-grade techniques aren't good enough yet. And even those techniques aren't being deployed in commercial AI products _at all_.\n\n## The watchers on the wall\n\nThis week the National Lottery Community Fund announced £3m for an \"AI Pulse Network\". In the announcement there is a mention of maybe pushing for small, specific AI models. 
Maybe a charity supporting people with benefit claims is funded to spot when algorithmic decisions go wrong and share warning signs with the network.\n\nBut what happens if we notice? What if there are large signals that something is wrong? Can we actually do anything about it? Do we just point out the goblins and hope they are prompted out eventually by someone?\n\n## What could we do\n\nFocus some money, resource, time. One pilot with one open-weights model, small, specific, on owned infrastructure. One use case: benefits assessment, say, or social care prioritisation. A published edit log. A documented evaluation suite anyone can re-run. Then run this alongside a major model from an outside provider. Transparent comparison. My tool, bearing, allows side-by-side comparison of models for the same task, so that part is pretty easy; now we just need to focus on the evaluation.\n\nSix months in, we'd know more about the real cost of governable AI than five years of briefings from OpenAI will tell us. We'd know what the failure modes look like. We'd know what an audit trail in this domain actually needs to contain. We'd know whether the unlearning techniques the research community is producing are working and mature enough to deploy.\n\nWithout work like this, we'll be chasing goblins forever.",
  "title": "There be goblins",
  "updatedAt": "2026-05-17T07:31:05.385Z"
}