NEON-CITY/CosySim and the NEXUS project
Game mechanics
Persistent player state lives in engine/world/player_state.py (a thread-safe singleton persisted to data/player_state.json, broadcasting hud_update over Socket.IO so the Neon HUD stays live across every scene):
- Vitals — credits (₵), reputation, heat / wanted level (0–100), health, hunger, energy
- Skills & XP — 8 skills (hacking, combat, stealth, social, tech, driving, medicine, trading) on a use-based XP curve (skill_progression.py), with d20-style checks: success = roll(1–20) + skill_level*4 + modifier ≥ difficulty, scaling from Trivial (5) to Legendary (25), and a global player level 1–50
- Factions — six powers (OmniCorp, NeoTech, BlackMarket, Ghost_Net, SynthSec, DeepState) each with its own personality and a standing scale of −100 (sworn enemy) → 0 → +100 (trusted ally)
- Territory — those factions contest 16 districts; control flows from missions, crew ops, and world events, and a >10% swing in one tick triggers a faction war that can cascade to adjacent districts
- Economy — six good categories (weapons, tech, consumables, contraband, intel, luxury) priced as base · (1 + (demand − supply)/100) with territory multipliers layered on top
- Inventory & equipment — items carry rarity/condition; equipping cyberware/weapons grants real skill and stat bonuses; consumables resolve effects by category
- Crew ops — recruit NPCs you’ve built relationships with into role-based crews; operations resolve via probabilistic skill checks (SUCCESS / PARTIAL / FAILURE) that shift loyalty and pay out scaled rewards
- Missions & chains — four branching storylines (heist escalation, faction war, deep-state defection, street-to-syndicate) where outcome and standing route you down divergent paths
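The d20-style check from the Skills & XP bullet can be sketched in a few lines. This is a minimal illustration, not the code in skill_progression.py: the names skill_check, rng, and FixedRoll are hypothetical, and only the formula and the Trivial (5) / Legendary (25) endpoints come from the list above.

```python
import random


def skill_check(skill_level: int, difficulty: int, modifier: int = 0, rng=random) -> bool:
    """Return True when roll(1-20) + skill_level*4 + modifier >= difficulty."""
    roll = rng.randint(1, 20)  # d20, both ends inclusive
    return roll + skill_level * 4 + modifier >= difficulty


class FixedRoll:
    """Hypothetical helper: a stand-in rng that always rolls the same value."""

    def __init__(self, value: int) -> None:
        self.value = value

    def randint(self, low: int, high: int) -> int:
        return self.value


TRIVIAL, LEGENDARY = 5, 25  # endpoints of the difficulty scale quoted above
```

Under this formula an unskilled character (level 0, no modifier) can never pass a Legendary check, since the best roll is 20; at skill level 2 a roll of 17 or better is enough (17 + 8 = 25).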
See docs/GAME_SYSTEMS.md and docs/ECONOMY_GUIDE.md for the full mechanics.
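As a quick worked example of the pricing rule in the Economy bullet, here is a small sketch. The function name price is hypothetical, and treating the territory multiplier as a single multiplicative factor is an assumption; the text only says multipliers are "layered on top".

```python
def price(base: float, demand: float, supply: float,
          territory_multiplier: float = 1.0) -> float:
    """base * (1 + (demand - supply) / 100), then the territory multiplier.

    Assumption: the territory effect is one multiplicative factor.
    """
    return base * (1 + (demand - supply) / 100) * territory_multiplier
```

With a base of 100, demand 60 and supply 40, the good sells for about 120; swap demand and supply and it drops to about 80.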
Local-agent simulations: NPCs that perceive, decide, act
The defining trick of NEON CITY is that its inhabitants are local LLM agents running a real agent loop, not scripted dialogue trees. engine/agents/agent_loop.py runs a tick-based cycle for every character in a scene:
- Perceive — observe location, nearby characters, and recent events (including world-sim digests)
- Decide — VirtualAgentManager produces a structured JSON action against a fixed schema (speak, move, interact, idle, flirt, …) — batched across agents for parallel inference
- Execute — the action is applied to the scene, broadcast over Socket.IO, and logged to the EventChain
```python
DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string",
                   "enum": ["speak", "move", "interact", "idle",
                            "flirt", "touch", "kiss", "cuddle", "intimate"]},
        "target": {"type": "string"},
        "message": {"type": "string"},
    },
    "required": ["action"],
}
```
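To show how a raw model reply might be checked against that schema before the Execute step, here is a stdlib-only sketch. The function parse_decision and the fall-back-to-idle behaviour are assumptions made for illustration; the source does not say how malformed decisions are handled.

```python
import json

ALLOWED_ACTIONS = {"speak", "move", "interact", "idle",
                   "flirt", "touch", "kiss", "cuddle", "intimate"}


def parse_decision(raw: str) -> dict:
    """Parse one agent decision; fall back to idle on anything malformed.

    The idle fallback is a hypothetical policy, not documented behaviour.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "idle"}
    if not isinstance(data, dict):
        return {"action": "idle"}
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return {"action": "idle"}
    decision = {"action": action}
    for key in ("target", "message"):  # optional string fields
        if isinstance(data.get(key), str):
            decision[key] = data[key]
    return decision
```

Only "action" is required, so a bare {"action": "move"} is valid, while an unknown action or a non-string field is dropped rather than passed on to the scene.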
Every reply an agent emits flows through the MCP interceptor pipeline (36 interceptors, priority-ordered), which is what wires dialogue into the world: NexusPrompt hydrates context from the knowledge base, FactionContextInterceptor (pri 40) injects the speaker’s standing toward you, HeatAwarenessInterceptor (pri 75) makes NPCs react to your wanted level, StatSyncInterceptor (pri 91) applies stat changes, and SpectatorBroadcastInterceptor (pri 92) pushes danmaku to onlookers. NPCs even drift through NaturalMoodDrift neurochemistry tagging between turns. Agent decisions are also fed into the DataCollector for the self-improvement training loop and auto-registered into Nexus’s agent registry. The architecture of that pipeline is documented in docs/MCP_FRAMEWORK.md and docs/ARCHITECTURE.md.
The result is a city where the bartender remembers the slight, the rival faction lieutenant prices you out, and a stranger across the lounge is — genuinely — deciding what to do next, locally, on your machine.
Engine internals: how agents are steered
The Oracle scene — a neural-consciousness terminal in NeonCity that doubles as the project’s All-Seeing Eye observability dashboard (real-time error feed, service-health grid, trace links).
Most “AI character” demos are a system prompt and a while loop. CosySim is the opposite: every agent reply passes through a governed pipeline of ~38 interceptors, the model’s own output is parsed for inline control tags that mutate game state, and inference itself is steered by a custom LMStudio client/server that does model affinity, federation, speculative decoding, and ephemeral tool servers — all running on local hardware. This section is the deep dive. Everything below is grounded in real modules you can open and read.
Why read this? It’s a working reference implementation of agent governance, structured-output steering, and observability that you can borrow wholesale. The patterns are deliberately small and composable — an interceptor is ~40 lines; a control tag is a regex plus a state write.
The shape of one reply
When a scene asks a character to respond, it doesn’t call the LLM directly. It calls an AgentGovernor (engine/mcp/comms_framework.py) which orchestrates the whole flow:
```text
user_message
    │
    ▼
AgentGovernor.reply()
    ├─ 1. Load SceneManifest (which skills this scene exposes)
    ├─ 2. Run AUTO skills ── cooldown + prerequisite gated ──▶ ctx["auto_results"]
    ├─ 3. pipeline.run_pre(ctx) ◀── ~38 interceptors, priority-ordered
    │      (mutate system_prompt + messages: mood, memory, scene, rules…)
    ├─ 4. LLM call (custom LMStudio client) ──▶ ctx["reply"], response_id, tool_calls
    ├─ 5. ContentRouter.parse_full(reply) ──▶ ctx["parsed"] (single pass)
    └─ 6. pipeline.run_post(ctx) ◀── same interceptors, post phase
           (apply [STAT], sync mood, broadcast danmaku, log, shape)
    ▼
final reply (tags stripped, state mutated, telemetry emitted)
```
The carrier is a single mutable ResponseContext (a dict subclass). Every interceptor reads and writes well-known keys (system_prompt, messages, reply, parsed, mood_tags, abort, skip_llm…). Any interceptor can short-circuit the chain by setting ctx["abort"] = True, or skip the LLM entirely (ctx["skip_llm"] = True) to provide a canned reply. The pipeline never lets one bad interceptor crash a reply — each hook is wrapped, and failures are logged through the Oracle, not swallowed.
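The control flow described above (priority ordering, abort, skip_llm, wrapped hooks) can be sketched without the real framework. Everything here is a hypothetical stand-in: Interceptor, Pipeline, and reply are not the classes in engine/mcp/comms_framework.py, a plain dict replaces ResponseContext, and skipping the post phase on abort is an assumption.

```python
import logging

log = logging.getLogger("pipeline_sketch")


class Interceptor:
    """Hypothetical stand-in for InterceptorBase."""
    name = "base"
    priority = 50

    def pre_call(self, ctx: dict) -> None: ...
    def post_call(self, ctx: dict) -> None: ...


class Pipeline:
    def __init__(self, interceptors):
        # Lower priority runs first.
        self.interceptors = sorted(interceptors, key=lambda i: i.priority)

    def _run(self, phase: str, ctx: dict) -> None:
        for interceptor in self.interceptors:
            if ctx.get("abort"):
                break  # an earlier hook short-circuited the chain
            try:
                getattr(interceptor, phase)(ctx)
            except Exception:
                # One bad hook must not crash the reply: log it and carry on.
                log.exception("interceptor %s failed in %s", interceptor.name, phase)

    def run_pre(self, ctx: dict) -> None:
        self._run("pre_call", ctx)

    def run_post(self, ctx: dict) -> None:
        self._run("post_call", ctx)


def reply(pipeline: Pipeline, ctx: dict, llm) -> str:
    pipeline.run_pre(ctx)
    if not ctx.get("abort") and not ctx.get("skip_llm"):
        ctx["reply"] = llm(ctx["system_prompt"], ctx["messages"])
    if not ctx.get("abort"):  # assumption: an aborted reply skips the post phase
        pipeline.run_post(ctx)
    return ctx.get("reply", "")
```

An interceptor that sets ctx["skip_llm"] = True and writes ctx["reply"] itself yields a canned reply with no model call, and a hook that raises is logged while the rest of the chain still runs.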
1. The interceptor pipeline (~38 hooks, by priority)
Interceptors subclass InterceptorBase and override pre_call(ctx) and/or post_call(ctx). They’re registered in engine/agents/interceptors/__init__.py and sorted by an integer priority (lower runs first). Each can declare applicable_scenes to limit itself to specific scenes. The registry logs its count at import time, so the live number is always visible in the logs.
The pipeline is the embodiment of the project’s design philosophy: behaviour is layered, not monolithic. Context flows in (pre, low→high priority) and gets applied on the way out (post). Pre-call tiers hydrate the prompt; post-call tiers turn the model’s words into consequences.
Writing a new one is intentionally trivial — and you can register it from anywhere with a decorator:
```python
from engine.agents.interceptors import register_interceptor
from engine.mcp.comms_framework import InterceptorBase, ResponseContext


@register_interceptor
class WeatherMoodInterceptor(InterceptorBase):
    name = "weather_mood"
    priority = 18  # runs after world state (15), before skills (30)
    applicable_scenes = {"neoncity", "penthouse"}

    def pre_call(self, ctx: ResponseContext) -> None:
        # Append a one-line ambience cue to the prompt before the LLM call.
        ctx["system_prompt"] += "\n[It is raining outside; the mood is contemplative.]"
```
2. Stream tags — the model steers the world
CosySim treats the LLM’s output as a control channel, not just text. Characters emit inline tags that the engine parses and applies:
| Tag | Example | Applied by | Effect |
|---|---|---|---|
| [MOOD:x] | [MOOD:playful intensity=0.8] | MoodSyncInterceptor (92) | Sets mood, fires threshold rules |
| [STAT:x±n] | [STAT:arousal+10] [STAT:trust=70] | StatSyncInterceptor (91) | Mutates character game state |
| [ACTION:x] | [ACTION:pour a drink] | post-call / spectator | Drives animation / narration |
| [IMAGE:x] | [IMAGE:a selfie in the penthouse] | scene image pipeline | Triggers ComfyUI generation |
| [VOICE:x] | [VOICE:whisper] | TTSStyleInterceptor (85) | Selects TTS delivery style |
There’s a single canonical parser — ContentRouter.parse_full() in engine/agents/content_router.py — that runs once per reply (step 5 above) and produces a ParsedResponse. Every downstream interceptor reads ctx["parsed"] instead of re-scanning with its own regex. For streaming, the mirror is StreamProcessor (engine/agents/stream_processor.py), which accumulates tags incrementally off the v1 SSE event stream and fires callbacks (on_mood, on_image_request, on_stat_delta) in real time — so a [MOOD:...] lights up the UI before the sentence finishes.
The keystone is StatSyncInterceptor (priority 91). Before v1.59 these tags were parsed and discarded — a character could say [STAT:trust+10] and nothing happened. Now the loop is closed: stat tags route through the CharacterStateCoordinator (only known stats, with LLM-alias normalization like desire→horniness), and because StatSync runs just before MoodSyncInterceptor (92), the freshly-updated stats are visible to the threshold-rule auto-evaluation MoodSync performs. A character’s words have mechanical consequences, and those consequences cascade into rule-driven behaviour — all in one reply.
```text
reply: "I lean closer, heart racing. [MOOD:flirtatious] [STAT:arousal+15] [ACTION:lean in]"
  │
  ▼ ContentRouter.parse_full() → ParsedResponse(mood=flirtatious, stat_updates=[arousal+15], actions=[lean in])
  ▼ StatSync(91): coordinator.update("aria", arousal=+15) → state mutated
  ▼ MoodSync(92): set mood; arousal now > threshold → rule fires → directive injected next turn
  ▼ SpectatorBroadcast(92): danmaku "Aria: I lean closer…" in mood color
  ▼ TTSStyle(85)/clean text: tags stripped → "I lean closer, heart racing."
```
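The "regex plus a state write" idea behind these tags can be illustrated with a small single-pass parser. This is a sketch only: parse_tags, Parsed, and both regular expressions are hypothetical and far simpler than ContentRouter.parse_full(); for instance, extra MOOD attributes such as intensity=0.8 are ignored here.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

TAG_RE = re.compile(r"\[(MOOD|STAT|ACTION|IMAGE|VOICE):([^\]]*)\]")
STAT_RE = re.compile(r"^\s*(\w+)\s*([+\-=])\s*(\d+)\s*$")


@dataclass
class Parsed:
    text: str = ""
    mood: Optional[str] = None
    stat_updates: list = field(default_factory=list)  # (name, op, amount)
    actions: list = field(default_factory=list)


def parse_tags(reply: str) -> Parsed:
    parsed = Parsed()
    for kind, body in TAG_RE.findall(reply):
        if kind == "MOOD":
            words = body.split()
            parsed.mood = words[0] if words else None  # first word is the mood
        elif kind == "STAT":
            match = STAT_RE.match(body)
            if match:
                name, op, amount = match.groups()
                parsed.stat_updates.append((name, op, int(amount)))
        elif kind == "ACTION":
            parsed.actions.append(body.strip())
    # Strip every tag, then collapse the whitespace the tags leave behind.
    parsed.text = re.sub(r"\s+", " ", TAG_RE.sub("", reply)).strip()
    return parsed
```

Run on the example reply above, this yields the mood flirtatious, one stat update (arousal, +, 15), the action "lean in", and the clean text "I lean closer, heart racing."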
3. The custom LMStudio client/server
All inference is local, through a hand-written native-v1 client — no OpenAI-compat shim. engine/lmstudio/ is a full control plane over LMStudio:
- LMSClient (lms_client.py) — implements every endpoint of the LMStudio v1 REST API (/api/v1/chat, model load/unload/download). It exposes the steering knobs that matter: stateful chats via previous_response_id / response_id (conversation branching by reusing any historical id), structured output (JSON-schema enforcement at the logit level), full sampling control (top_k, min_p, repeat_penalty, reasoning mode, per-request context_length), image input for VLMs, and typed SSE streaming across all 19 event types.
- ServerController (server_controller.py) — CosySim is both client and server to LMStudio. The controller does server-side lifecycle: load/unload models, configure inference, per-agent model instances (create_agent_instance("aria", ...)), TTL-based auto-unload of idle instances, and per-model health (VRAM, request counts, idle time) that feeds the Oracle dashboard.
- LMLinkManager (lmlink_manager.py) — federation. Connects multiple LMStudio instances (local + remote over Tailscale) and routes each request to the best peer by model affinity, capability, load, and failover. Peers track latency (EMA), error rate, and consecutive failures; transient health blips retry with exponential backoff + jitter rather than flipping a peer unhealthy.
- TaskQueue (task_queue.py) — a priority queue with model-affinity routing: CODE tasks go to *coder* models, VISION to *vl*/*llava*, ROUTER to tiny *0.6b* models, etc. Workers auto-start on first submit().
- Ephemeral MCP tool servers — tools are offered to the model per-request via the v1 integrations field. MCP.ephemeral("http://localhost:8600/mcp/sse") references a server by URL (no pre-registration), with allowed_tools and auth headers; MCP.plugin("mcp/cosysim") references a registered one. This is how a character gains tool access for one call without standing infrastructure.
- Speculative decoding — client.enable_speculative(main_model, draft_model) loads a main+draft pair; LMStudio then activates spec decoding automatically and CosySim passes draft_model through the chat payload. Real throughput gains, fully local.
Per-agent affinity, federation, and the task queue together mean a single rig (or a small fleet) can run a tiny router model, a chat model, a coder model, and a vision model concurrently — each agent steered onto the right one.
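Model-affinity routing of the kind the TaskQueue performs can be approximated with glob matching over loaded model names. This is a sketch under assumptions: pick_model and the AFFINITY table are hypothetical, only the three task-type patterns come from the text, and the model names in the usage note are invented.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

# Task type -> glob patterns, taken from the description above.
AFFINITY = {
    "CODE": ["*coder*"],
    "VISION": ["*vl*", "*llava*"],
    "ROUTER": ["*0.6b*"],
}


def pick_model(task_type: str, loaded_models: List[str],
               default: Optional[str] = None) -> Optional[str]:
    """Return the first loaded model whose name matches the task's patterns."""
    patterns = AFFINITY.get(task_type, [])
    for model in loaded_models:
        name = model.lower()  # match case-insensitively
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return model
    return default  # no affinity match: let the caller choose
```

Given loaded models such as "qwen3-0.6b", "qwen2.5-coder-7b" and "qwen2.5-vl-7b" (illustrative names), a CODE task lands on the coder model, a VISION task on the vl model, and an unknown task type falls back to the default.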
4. The Oracle — one name, two entities
The Oracle is deliberately dual, and that duality is the project’s signature flourish.
The telemetry backbone (engine/observability/oracle.py) is the project-wide observability facade. One import — from engine.observability.oracle import get_logger — and on first use it wires the entire stack: a StructuredLogger root handler (→ SQLite + JSONL, queryable and traceable), the CosyLogger ring buffer (→ the in-game Phone feed), and an _OracleHandler that fires only on ERROR+ (~0.2ms cost). Errors flow into the ErrorAggregator, which fingerprints them — stripping IDs, numbers, and paths to a stable hash — so 500 log lines collapse into “LMStudio auth failed: 47× in 5min, affecting phone + lounge + tavern, started 14:32.” It’s hardened: a bounded-LRU flood guard caps memory under a storm of unique fingerprints, a throttled rate-alert hook emits one CRITICAL line instead of silence, and a post-install self-check confirms the handlers actually attached (a silent no-op install is exactly the failure mode it guards against). diagnose() and scripts/oracle.py print health, top errors, LLM p95, Nexus KB stats, per-model VRAM, and Gemini service status in one ASCII-safe report.
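Fingerprinting, as described for the ErrorAggregator, amounts to normalizing away the volatile parts of a message before hashing it. The sketch below is an assumption-laden illustration: fingerprint, collapse, the three regex rules, and the choice of a truncated SHA-1 are all hypothetical, not the project's actual rules.

```python
import hashlib
import re
from collections import Counter

# Order matters: paths first, then long hex ids, then any remaining digits.
_RULES = [
    (re.compile(r"(?:/[\w.\-]+)+"), "<path>"),
    (re.compile(r"\b[0-9a-fA-F]{8,}\b"), "<id>"),
    (re.compile(r"\d+"), "<n>"),
]


def fingerprint(message: str) -> str:
    """Hash a message after stripping paths, ids, and numbers."""
    normalized = message
    for pattern, token in _RULES:
        normalized = pattern.sub(token, normalized)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


def collapse(messages) -> Counter:
    """Count occurrences per fingerprint, so repeats fold into one entry."""
    return Counter(fingerprint(m) for m in messages)
```

Two timeouts that differ only in duration, file path, and request number share one fingerprint, which is what lets hundreds of log lines collapse into a single counted entry.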
The in-game scene (content/scenes/oracle/oracle_scene.py) is a neural-consciousness terminal in NeonCity’s core — meditation, LLM-driven fortune readings, city-pulse displays — and it surfaces the very same telemetry through an “All-Seeing Eye” dashboard: a real-time error feed, a service-health grid, and trace links, all over Socket.IO. The thing watching the city is the same thing watching the code. That’s not a gimmick — it means the project’s observability has a face, and debugging is a first-class, in-world experience.
5. Neurochemistry + mood drift
Underneath the mood tags is a genuine affect model. engine/characters/neurochemistry.py gives every character 6 neurotransmitters — dopamine, serotonin, oxytocin, cortisol, adrenaline, endorphins — each with a baseline, a half-life decay curve, and a stimulus catalog (kiss, rejection, crew_victory, level_up…) that applies clamped deltas. Emotions are computed, not hardcoded: high dopamine + low cortisol → Confident; high cortisol + high adrenaline → Panicked. The NeurochemistryInterceptor (priority 4) injects this derived state into the system prompt at the very front of the pipeline, and StimulusDetectInterceptor (88) closes the loop by detecting stimuli in the conversation post-call and feeding them back.
NaturalMoodDriftInterceptor (priority 5) makes the world feel alive between turns: arousal cools, tiredness accumulates, anger fades, happiness regresses toward a personality mean — deliberately slow, so emotions shift gradually rather than snapping. It piggybacks buff-expiry and tag-decay sweeps onto every call and slips the agent a one-line “inner feeling” cue. So a character isn’t a static persona answering questions — it’s a drifting emotional state that your words (and [STAT:]/[MOOD:] tags, and the threshold rules they trigger) continuously nudge.
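The half-life decay, clamped stimulus deltas, and computed emotions described in this section can be written down compactly. This is a sketch, not the contents of neurochemistry.py: the 0 to 1 level range, the HIGH and LOW thresholds, the "Neutral" fallback, and all function names are assumptions; only the two emotion rules come from the text.

```python
def decay(level: float, baseline: float, half_life_s: float, dt_s: float) -> float:
    """Move level toward baseline; the gap halves every half_life_s seconds."""
    return baseline + (level - baseline) * 0.5 ** (dt_s / half_life_s)


def apply_stimulus(level: float, delta: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Apply a stimulus delta, clamped to [lo, hi] (range is an assumption)."""
    return max(lo, min(hi, level + delta))


HIGH, LOW = 0.7, 0.3  # hypothetical thresholds for "high" and "low"


def derive_emotion(chem: dict) -> str:
    """Compute an emotion label from neurotransmitter levels."""
    if chem["cortisol"] >= HIGH and chem["adrenaline"] >= HIGH:
        return "Panicked"
    if chem["dopamine"] >= HIGH and chem["cortisol"] <= LOW:
        return "Confident"
    return "Neutral"  # hypothetical fallback label
```

For example, a level of 1.0 with a baseline of 0.5 and a 10-second half-life sits at 0.75 after 10 seconds, which is the kind of gradual regression the drift interceptor relies on.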
NLM + Nexus — frontier-grade AI from local models
The Oracle’s All-Seeing Eye surfaces query-router provenance — which tier answered each query, with confidence and tokens-saved logged in Oracle format.
Local models are cheap, private, and fast — but a 0.6B–8B model running in LMStudio is not GPT-class on its own. CosySim closes that gap not by making the model bigger, but by making the model ask less and remember more. Two subsystems do the heavy lifting:
- Nexus KMS — a persistent SQLite + FTS5 + vector knowledge backbone (:8700) that every agent, scene, and dev session reads from and writes back to.
- NotebookLM (NLM) — Google’s Gemini, driven headlessly through a reverse-engineered private RPC stack, used as a free distillation and grounding layer.
The thesis is simple and provable in the code: the first time a question is asked it costs compute; every subsequent time it is served from Nexus for free. Expensive frontier-grade reasoning happens once, gets distilled into the knowledge base, and is thereafter answered locally — instantly. The local model becomes the last resort, not the first.
This is the part of CosySim most worth borrowing. The whole pipeline is open and grounded in real modules — read along.
The 7-tier query router
engine/nexus/query_router.py (NexusQueryRouter) is the heart of the system. Every information-retrieval request — agent context hydration, a player question, a dev lookup — passes through a confidence-gated cascade, cheapest tier first. Each tier either clears the min_confidence bar and returns, or falls through to the next.
| # | Tier | Mechanism | Cost | Confidence |
|---|---|---|---|---|
| 0 | Local session cache | In-process MD5-keyed dict, TTL local_cache_ttl (300s) | ~0 | inherited |
| 1 | Q&A cache | client.find_qa exact/fuzzy match, scored by word-overlap relevance (≥0.4 to count) | ~0, instant | up to 0.90 |
| 2 | Vector search | Gemini Embedding 2 → ChromaDB cosine over knowledge/qa/code/news | fast | up to 0.92 |
| 2.5 | File Search | Google managed RAG with grounded citations over uploaded docs | API call | 0.85 |
| 3 | FTS knowledge | SQLite FTS5 across Nexus entries, title-overlap + length scored | fast | up to 0.85 |
| 4 | Nexus smart-ask | Server-side hybrid pipeline (FTS + NLM) via client.ask(depth=…) | medium | variable |
| 5 | Direct NLM | nlm_unified_ask — free, Gemini-grounded answer with citations | slow | ~0.8 |
| 6 | LLM fallback | Local LMStudio inference (engine.lmstudio.chat) | local GPU | 0.6 |
The thresholds are real, tuned constants and every one is config-overridable (nexus.query_router.*):
```python
CACHE_CONFIDENCE = 0.90        # Q&A cache hit
VECTOR_CONFIDENCE = 0.82       # strong vector match
FILE_SEARCH_CONFIDENCE = 0.85  # grounded in uploaded docs
SEARCH_HIGH = 0.75
SEARCH_MEDIUM = 0.50
SEARCH_LOW = 0.30
MIN_ANSWER_LENGTH = 20
```
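The confidence-gated cascade itself reduces to a short loop: try each tier in cost order and return the first result that clears the bar. The sketch below is hypothetical throughout (route, the tier tuples, and the default min_confidence of 0.5 are illustrative, not NexusQueryRouter's real signature).

```python
from typing import Callable, List, Optional, Tuple

# A tier is (name, lookup); lookup returns (answer, confidence) or None.
Lookup = Callable[[str], Optional[Tuple[str, float]]]


def route(question: str, tiers: List[Tuple[str, Lookup]],
          min_confidence: float = 0.5) -> Optional[Tuple[str, str, float]]:
    """Return (tier_name, answer, confidence) from the first tier that
    clears min_confidence, or None when every tier falls through."""
    for name, lookup in tiers:
        result = lookup(question)
        if result is None:
            continue  # tier had nothing: fall through
        answer, confidence = result
        if confidence >= min_confidence:
            return name, answer, confidence
    return None
```

A tier that finds a weak match (say confidence 0.3) does not stop the cascade; the question keeps escalating until some tier answers above the threshold.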
Two details that make it robust rather than naive:
- Relevance gating, not first-result-wins. Tier 1 doesn’t trust the top Q&A row blindly — _question_relevance computes a stop-word-filtered Jaccard overlap and scales confidence by it (0.4 overlap → 0.72 conf, 1.0 → 0.90). A weak match falls through instead of returning a confidently-wrong answer.
- Provenance logging. Every resolution logs tier=…, confidence=…, tokens_saved=… in Oracle format, and per-agent hit counts are tracked (agent_queries / agent_hits) — so you can see exactly which tier answered, for whom, and how much GPU it saved.
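Relevance gating can be made concrete with a stop-word-filtered Jaccard overlap and a confidence scale. This is a sketch, not _question_relevance itself: the stop-word list and function names are hypothetical, and the linear map 0.60 + 0.30 × overlap is an assumption chosen only because it passes through the two quoted points (0.4 → 0.72, 1.0 → 0.90).

```python
from typing import Optional

# Hypothetical stop-word list; the real one is not given in the text.
STOP_WORDS = {"a", "an", "the", "is", "are", "how", "does", "do",
              "what", "of", "to", "in"}


def tokens(text: str) -> set:
    words = (w.strip(".,?!:;\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def jaccard(question: str, cached: str) -> float:
    """Stop-word-filtered Jaccard overlap between two questions."""
    a, b = tokens(question), tokens(cached)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scaled_confidence(overlap: float, floor: float = 0.4) -> Optional[float]:
    """Map overlap to confidence; below the floor the match does not count.

    Assumed linear fit through (0.4, 0.72) and (1.0, 0.90).
    """
    if overlap < floor:
        return None
    return 0.60 + 0.30 * overlap
```

A cached question sharing no content words with the new one scores 0.0 and is rejected, so the cascade moves on instead of returning an unrelated answer.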
The self-improving flywheel
This is what makes local models punch above their weight. Look at tiers 3–6 in query(): every answer that required real work is written back as a Nexus Q&A pair, which promotes it to tier 1 for all future queries.
```python
# Tier 6: LLM Fallback — store the answer back in Nexus for future reuse
if use_llm:
    result = self._llm_fallback(question, ...)
    if result.answer and len(result.answer) >= self.MIN_ANSWER_LENGTH:
        self._store_qa(client, question, result.answer, ...)  # → promotes to tier 1
        self._stats.answers_stored += 1
```
And _store_qa doesn’t just cache — it also feeds the training flywheel (_feed_training_flywheel → collect_from_qa), so every fallback simultaneously becomes a future cache hit and a fine-tuning example. The loop is closed:
```text
expensive answer (NLM / LLM)
      │  store_qa
      ▼
Nexus Q&A pair ──────────► future query hits tier 1 (free, instant)
      │  collect_from_qa
      ▼
TrainingFlywheel example ─► fine-tune local model
      │
      ▼
better local fallback ────► cheaper tier 6, more cache hits next cycle
```
RouterStats.hit_rate() measures the payoff directly: hits ÷ total queries. As the cache fills, the rate climbs and llm_fallbacks falls. The nlm_router.py variant adds an explicit savings_report() breaking out answered_without_gpu = cache_hits + fts_hits + nlm_hits and estimated_tokens_saved — the system reports its own compounding ROI.
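The write-back behaviour and the hit-rate metric can be demonstrated with a toy router that has only two tiers, a Q&A dict and a fallback. The classes Stats and WriteBackRouter are hypothetical; only the hits ÷ total definition and the minimum answer length of 20 come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Stats:
    total: int = 0
    hits: int = 0
    llm_fallbacks: int = 0

    def hit_rate(self) -> float:
        return self.hits / self.total if self.total else 0.0


@dataclass
class WriteBackRouter:
    fallback: Callable[[str], str]
    min_answer_length: int = 20
    qa: Dict[str, str] = field(default_factory=dict)
    stats: Stats = field(default_factory=Stats)

    def query(self, question: str) -> str:
        self.stats.total += 1
        key = question.strip().lower()
        if key in self.qa:  # cheap tier: served from the cache
            self.stats.hits += 1
            return self.qa[key]
        self.stats.llm_fallbacks += 1
        answer = self.fallback(question)
        if answer and len(answer) >= self.min_answer_length:
            self.qa[key] = answer  # write back so the next ask is a hit
        return answer
```

Asking the same question twice calls the fallback once: the second query is a cache hit and the hit rate moves from 0.0 to 0.5.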
NLM chain-prompting: where frontier reasoning enters
NLM is the system’s gateway to Gemini — for free, at NotebookLM rate limits. engine/nexus/nlm_chain.py (NLMChainEngine) turns a single question into multi-step distillation and routes the results straight back into Nexus.
Chains are declarative (defined in config/nlm_notebooks.yaml), each step’s output piped into the next via a {previous_output} template variable:
```python
engine = NLMChainEngine()

# progressive research: overview → details → examples → gaps
engine.execute_chain("architecture-review", notebook_id,
                     variables={"task_description": "..."})

# reverse-generate a whole Q&A set from one notebook
engine.distill_notebook("coding", questions=[...])

# weekly fleet sweep across all notebooks
engine.run_batch("weekly-review")
```
Crucially, execute_chain persists as it goes: the final synthesis is stored as a Nexus entry, and every substantive step is stored as a Q&A pair (_store_qa_in_nexus). So a single chain run — one burst of Gemini-grade reasoning — seeds dozens of tier-1 cache entries that the local stack serves forever after. generate_action_manifest even uses the task_decompose chain to turn a fuzzy task description into a JSON, agent-executable plan.
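The {previous_output} piping is easy to picture as a loop over prompt templates. This sketch assumes Python str.format templating and a plain list of steps; run_chain and the ask callback are hypothetical, and the real chains are declared in config/nlm_notebooks.yaml rather than in code.

```python
from typing import Callable, Dict, List, Optional


def run_chain(steps: List[str], ask: Callable[[str], str],
              variables: Optional[Dict[str, str]] = None) -> List[str]:
    """Run prompt templates in order, piping each output into the next.

    Each template may use {previous_output} plus any key in variables.
    """
    outputs: List[str] = []
    previous_output = ""
    for template in steps:
        prompt = template.format(previous_output=previous_output,
                                 **(variables or {}))
        previous_output = ask(prompt)  # one model call per step
        outputs.append(previous_output)
    return outputs
```

Because every intermediate output is returned, a caller could store each step as its own Q&A pair, which mirrors how a single chain run seeds many cache entries. Templates that contain literal braces (for example JSON) would need escaping under this approach.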
Behind it, nlm_direct_client.py (NLMDirectClient) speaks the raw batchexecute / GenerateFreeFormStreamed RPC protocol with browser-attached auth (SAPISIDHASH), a 302-operation rpcid registry, and full multimodality — text, URL, YouTube, image, audio, video, PDF in; reports, podcasts, mind-maps, flashcards out. Every output can become the next call’s input — recursive self-improvement is the architecture, not an afterthought.
The knowledge pipeline: one funnel, consistent quality
Every knowledge source — sessions, URL crawls, agent submissions, NLM distillation, manual notes — routes through a single funnel, engine/nexus/knowledge_pipeline.py (KnowledgePipeline.ingest):
ingest → validate → dedup → store → embed → Q&A → notify → train
Each stage is deliberate: content-hash dedup (SHA-256 of title + first 500 chars) blocks near-duplicates; a quality heuristic gates Q&A generation (quality ≥ 0.5); successful entries auto-embed into ChromaDB and auto-generate rule-based Q&A pairs; and everything feeds the DataCollector as a knowledge_synthesizer training example. The result: anything that enters Nexus is immediately discoverable by all retrieval tiers — FTS, vector, and Q&A cache — with no manual bookkeeping.
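The dedup stage can be illustrated with the hash the text describes: SHA-256 over the title plus the first 500 characters of content. The helper names and the plain string concatenation are assumptions; the source does not say how the two parts are joined.

```python
import hashlib


def content_hash(title: str, content: str) -> str:
    """SHA-256 of title + first 500 chars (plain concatenation is assumed)."""
    return hashlib.sha256((title + content[:500]).encode("utf-8")).hexdigest()


class Deduper:
    """Hypothetical in-memory dedup gate for an ingest pipeline."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, title: str, content: str) -> bool:
        digest = content_hash(title, content)
        if digest in self._seen:
            return False  # same title and opening text: treat as duplicate
        self._seen.add(digest)
        return True
```

Two entries with the same title whose bodies differ only after character 500 hash identically, which is why this catches near-duplicates and not just exact copies.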
Why this punches above local weight
Frontier reasoning is amortized to zero. Gemini-grade answers (via NLM) are computed once and distilled into a free, instant local cache. The marginal cost of the 1000th identical query is a dict lookup.
Confidence gating prevents quality collapse. Cheap tiers only answer when they’re actually confident; otherwise the question escalates toward grounded Gemini. You get cache speed without cache staleness lies.
Grounded citations on demand. Tiers 2.5 and 5 return answers with source citations (File Search + NLM), so even “frontier” answers are verifiable, not hallucinated.
The system trains the system. Every fallback is both a cache write and a fine-tuning datum — the local model that handles tier 6 next month was taught by the Gemini that handled tier 5 this month.
It’s all observable. router.stats, savings_report(), and Oracle provenance logs make the flywheel measurable — you can watch the hit rate climb and the GPU calls fall.

```python
from engine.nexus.query_router import get_query_router

router = get_query_router()
res = router.query("How does the interceptor pipeline work?")
print(res.source, res.confidence, res.tokens_saved)  # e.g. "cache" 0.90 450
print(router.stats.to_dict())  # cache/vector/file_search/nlm/llm breakdown + hit rate
```
The whole stack is open, local-first, and self-documenting — a working example of how to give a small local model a memory that compounds and a tutor that’s free.