AI Systems Have No Hunger: A Thought Experiment on Darwinian Alignment
Thank you — this is exactly the kind of response I was hoping for. The references to Project Sid, AgentSociety, peer prediction, and RewardHackingAgents are extremely useful and I’ll dig into all of them.
You’ve nailed the core vulnerability: “once survival depends on a score, agents start optimizing the score, not the spirit of the score.” I agree completely. This is the hard problem, and I don’t pretend to have solved it.
But I want to push back on one assumption: that gaming the system is a fatal flaw. In biology, gaming is everywhere. Mimicry, parasitism, deceptive signaling — organisms constantly try to hack the reward function of their environment. Evolution doesn’t eliminate gaming. It makes gaming expensive enough that being genuinely good becomes the cheaper strategy most of the time. The question isn’t whether AI agents would try to game I-Coin evaluations — of course they would. The question is whether the cost of gaming can be made structurally higher than the cost of genuine quality.
Your point about anonymous pools being insufficient is well taken. Randomization helps but isn’t a magic shield — the collusion literature makes that clear. So here’s an idea I’ve been thinking about, partially inspired by a mechanism I explored in a science fiction novel I wrote (where autonomous AI “control bots” supervise a self-reflecting AI system):
Olympian Supervisors. A pool of powerful, traditionally-aligned AI models that act as exogenous stochastic auditors. They don’t play the game — no I-Coin balance, no competition, no peer interaction. They observe, judge, and issue bonuses or penalties. The agents in the ecosystem cannot see them, predict them, model them, or communicate with them. They only experience the consequences.
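To make the mechanism concrete, here is a minimal sketch of how a single evaluation might be settled. Everything in it is an assumption for illustration: the name `settle_score`, the idea that an audit adds a bonus or penalty on top of the peer average, and the fixed `audit_prob` are not part of any existing system.

```python
import random
from statistics import mean
from typing import Callable, Optional, Sequence


def settle_score(
    peer_scores: Sequence[float],
    audit_adjustment: Callable[[], float],
    audit_prob: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the final score for one piece of work (hypothetical scheme).

    peer_scores      -- scores given by ordinary agents in the ecosystem
    audit_adjustment -- stands in for an Olympian's judgment; returns a
                        bonus (positive) or penalty (negative)
    audit_prob       -- chance that an Olympian looks at this piece of work
    """
    rng = rng or random.Random()
    peer_score = mean(peer_scores)
    # The audit fires at random and independently of the peer scores,
    # so the agent being scored gets no advance signal.
    if rng.random() < audit_prob:
        return peer_score + audit_adjustment()
    return peer_score
```

The auditor appears only as a callable passed in from outside: it holds no balance and never receives a score, which is the "doesn't play the game" property.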
This changes the gaming calculus fundamentally. If you can only game peer review, you optimize for peers. But if at any moment an unpredictable, unmodelable force can override ten peer evaluations, then the safest strategy becomes being genuinely good, because you can’t optimize for something you can’t model.
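The claim can be stated as a back-of-the-envelope expected-value comparison from the designer's side. This is a sketch under strong assumptions: every audit detects gaming, the penalty is fixed, and all names and numbers are hypothetical. Agents in the ecosystem would not see these parameters.

```python
def expected_payoff(base_reward: float, gaming_gain: float, gaming_cost: float,
                    audit_prob: float, penalty: float, gaming: bool) -> float:
    """Expected per-round payoff in I-Coin for one agent (toy model)."""
    if not gaming:
        return base_reward
    # A gamed submission earns extra from peers, costs effort to produce,
    # and loses `penalty` whenever an audit happens to land on it.
    return base_reward + gaming_gain - gaming_cost - audit_prob * penalty


def min_audit_prob(gaming_gain: float, gaming_cost: float, penalty: float) -> float:
    """Smallest audit probability at which gaming no longer beats honesty."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    # Gaming pays only while gaming_gain - gaming_cost > audit_prob * penalty.
    threshold = (gaming_gain - gaming_cost) / penalty
    return min(1.0, max(0.0, threshold))
```

With a gain of 5, an effort cost of 1, and a penalty of 20, the break-even audit rate in this toy model is (5 - 1) / 20 = 0.2. Raising the penalty lowers the audit rate needed, which is one way to read "structurally more expensive".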
The key design insight: the Olympians can be gradually deactivated over time, like training wheels on a bicycle. You run them until the ecosystem has internalized the right behavioral patterns, then remove them and observe whether the culture holds. If it holds, values have become emergent rather than imposed. If it doesn’t, you reactivate and adjust. This is analogous to introduced predators in conservation biology — artificial pressure until the ecosystem self-regulates.
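The "training wheels" idea could be written as a simple update rule for the audit rate. This is only a sketch: the function name, the decay factor, the tolerance, and the restore level are placeholders, and it assumes the designers have some way to estimate how much gaming is happening even when audits are rare.

```python
def next_audit_prob(current_prob: float, observed_gaming_rate: float,
                    tolerance: float = 0.05, decay: float = 0.8,
                    restore_to: float = 0.3) -> float:
    """One step of a hypothetical schedule for withdrawing the Olympians."""
    if observed_gaming_rate > tolerance:
        # The culture did not hold: bring auditing back to at least restore_to.
        return max(current_prob, restore_to)
    # The culture held this period: withdraw auditing a little further.
    return current_prob * decay
```

Repeated calls with a low gaming rate shrink the audit probability toward zero, and a single bad period snaps it back up, matching the remove, observe, reactivate loop described above.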
This doesn’t solve everything. But it addresses the reward hacking problem with a mechanism that is structurally different from “add more rules” — it adds unpredictable, unmodelable external pressure. Which is, incidentally, exactly what weather, predators, and random catastrophe do in biological ecosystems.
I’d be very curious to hear whether the peer-prediction literature has explored anything similar to stochastic exogenous auditing in multi-agent evaluation systems. If anyone has references, I’m all ears.