External Publication

NEON-CITY/CosySim and the NEXUS project

Hugging Face Forums [Unofficial] June 16, 2026

Game mechanics

Persistent player state lives in engine/world/player_state.py (a thread-safe singleton persisted to data/player_state.json, broadcasting hud_update over Socket.IO so the Neon HUD stays live across every scene):

Vitals — credits (₵), reputation, heat / wanted level (0–100), health, hunger, energy
Skills & XP — 8 skills (hacking, combat, stealth, social, tech, driving, medicine, trading) on a use-based XP curve (skill_progression.py), with d20-style checks: success = roll(1–20) + skill_level*4 + modifier ≥ difficulty, scaling from Trivial(5) to Legendary(25), and a global player level 1–50
Factions — six powers (OmniCorp, NeoTech, BlackMarket, Ghost_Net, SynthSec, DeepState) each with its own personality and a standing scale of −100 (sworn enemy) → 0 → +100 (trusted ally)
Territory — those factions contest 16 districts; control flows from missions, crew ops, and world events, and a >10% swing in one tick triggers a faction war that can cascade to adjacent districts
Economy — six good categories (weapons, tech, consumables, contraband, intel, luxury) priced as base · (1 + (demand − supply)/100) with territory multipliers layered on top
Inventory & equipment — items carry rarity/condition; equipping cyberware/weapons grants real skill and stat bonuses; consumables resolve effects by category
Crew ops — recruit NPCs you’ve built relationships with into role-based crews; operations resolve via probabilistic skill checks (SUCCESS / PARTIAL / FAILURE) that shift loyalty and pay out scaled rewards
Missions & chains — four branching storylines (heist escalation, faction war, deep-state defection, street-to-syndicate) where outcome and standing route you down divergent paths
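The skill-check and pricing formulas above can be sketched in a few lines. This is a minimal illustration, not the code in skill_progression.py or the economy module: the function names, the `roll` injection parameter, and the assumption that territory multipliers combine multiplicatively are all hypothetical.

```python
import random

# Endpoints of the difficulty ladder quoted in the post; the steps in
# between are not listed there, so they are left out here.
TRIVIAL, LEGENDARY = 5, 25


def skill_check(skill_level, difficulty, modifier=0, roll=None):
    """Return True when roll + skill_level*4 + modifier >= difficulty.

    `roll` can be injected for deterministic tests; otherwise a d20 is rolled.
    """
    if roll is None:
        roll = random.randint(1, 20)  # inclusive on both ends
    return roll + skill_level * 4 + modifier >= difficulty


def market_price(base, demand, supply, territory_multiplier=1.0):
    """base * (1 + (demand - supply) / 100), then the territory multiplier.

    Multiplying by the territory factor is an assumption; the post only
    says the multipliers are "layered on top".
    """
    return base * (1 + (demand - supply) / 100) * territory_multiplier
```

With these definitions, a level-2 character needs a roll of 17 or better to pass a Legendary check unmodified, and a good with demand 70 and supply 50 sells at 1.2 times its base price.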

See docs/GAME_SYSTEMS.md and docs/ECONOMY_GUIDE.md for the full mechanics.

Local-agent simulations: NPCs that perceive, decide, act

The defining trick of NEON CITY is that its inhabitants are local LLM agents running a real agent loop, not scripted dialogue trees. engine/agents/agent_loop.py runs a tick-based cycle for every character in a scene:

Perceive — observe location, nearby characters, and recent events (including world-sim digests)
Decide — VirtualAgentManager produces a structured JSON action against a fixed schema (speak, move, interact, idle, flirt, …) — batched across agents for parallel inference
Execute — the action is applied to the scene, broadcast over Socket.IO, and logged to the EventChain

DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string",
                   "enum": ["speak", "move", "interact", "idle",
                            "flirt", "touch", "kiss", "cuddle", "intimate"]},
        "target":  {"type": "string"},
        "message": {"type": "string"},
    },
    "required": ["action"],
}
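A structured decision is only useful if malformed output is handled gracefully. The sketch below shows one way a raw model reply could be checked against the action list above before it is executed. `parse_decision` and the fall-back-to-idle policy are hypothetical, not taken from agent_loop.py.

```python
import json

# Same action list as the enum in DECISION_SCHEMA above.
ALLOWED_ACTIONS = {"speak", "move", "interact", "idle",
                   "flirt", "touch", "kiss", "cuddle", "intimate"}


def parse_decision(raw):
    """Parse a raw JSON reply into a decision dict, falling back to idle.

    Unknown keys are dropped; "target" and "message" are kept only when
    they are strings, matching the schema's declared types.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"action": "idle"}
    if not isinstance(data, dict):
        return {"action": "idle"}
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return {"action": "idle"}
    decision = {"action": action}
    for key in ("target", "message"):
        if isinstance(data.get(key), str):
            decision[key] = data[key]
    return decision
```

Falling back to idle keeps a tick from failing when one agent in a batch returns garbage; a real loop might retry instead.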

Every reply an agent emits flows through the MCP interceptor pipeline (~38 interceptors, priority-ordered), which is what wires dialogue into the world: NexusPrompt hydrates context from the knowledge base, FactionContextInterceptor (pri 40) injects the speaker’s standing toward you, HeatAwarenessInterceptor (pri 75) makes NPCs react to your wanted level, StatSyncInterceptor (pri 91) applies stat changes, and SpectatorBroadcastInterceptor (pri 92) pushes danmaku to onlookers. Between turns, NPC moods drift via NaturalMoodDrift and neurochemistry tagging. Agent decisions are also fed into the DataCollector for the self-improvement training loop and auto-registered into Nexus’s agent registry. The architecture of that pipeline is documented in docs/MCP_FRAMEWORK.md and docs/ARCHITECTURE.md.

The result is a city where the bartender remembers the slight, the rival faction lieutenant prices you out, and a stranger across the lounge is — genuinely — deciding what to do next, locally, on your machine.

Engine internals: how agents are steered

The Oracle scene — a neural-consciousness terminal in NeonCity that doubles as the project’s All-Seeing Eye observability dashboard (real-time error feed, service-health grid, trace links).

Most “AI character” demos are a system prompt and a while loop. CosySim is the opposite: every agent reply passes through a governed pipeline of ~38 interceptors, the model’s own output is parsed for inline control tags that mutate game state, and inference itself is steered by a custom LMStudio client/server that does model affinity, federation, speculative decoding, and ephemeral tool servers — all running on local hardware. This section is the deep dive. Everything below is grounded in real modules you can open and read.

Why read this? It’s a working reference implementation of agent governance, structured-output steering, and observability that you can borrow wholesale. The patterns are deliberately small and composable — an interceptor is ~40 lines; a control tag is a regex plus a state write.

The shape of one reply

When a scene asks a character to respond, it doesn’t call the LLM directly. It calls an AgentGovernor (engine/mcp/comms_framework.py) which orchestrates the whole flow:

user_message
   │
   ▼
AgentGovernor.reply()
   ├─ 1. Load SceneManifest (which skills this scene exposes)
   ├─ 2. Run AUTO skills  ── cooldown + prerequisite gated ──▶ ctx["auto_results"]
   ├─ 3. pipeline.run_pre(ctx)    ◀── ~38 interceptors, priority-ordered
   │        (mutate system_prompt + messages: mood, memory, scene, rules…)
   ├─ 4. LLM call (custom LMStudio client)  ──▶ ctx["reply"], response_id, tool_calls
   ├─ 5. ContentRouter.parse_full(reply)     ──▶ ctx["parsed"]  (single pass)
   └─ 6. pipeline.run_post(ctx)   ◀── same interceptors, post phase
            (apply [STAT], sync mood, broadcast danmaku, log, shape)
   ▼
final reply (tags stripped, state mutated, telemetry emitted)

The carrier is a single mutable ResponseContext (a dict subclass). Every interceptor reads and writes well-known keys (system_prompt, messages, reply, parsed, mood_tags, abort, skip_llm…). Any interceptor can short-circuit the chain by setting ctx["abort"] = True, or skip the LLM entirely (ctx["skip_llm"] = True) to provide a canned reply. The pipeline never lets one bad interceptor crash a reply — each hook is wrapped, and failures are logged through the Oracle, not swallowed.
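The control flow described above (priority ordering, abort, skip_llm, and per-hook failure isolation) can be sketched as follows. This is a simplified stand-in, not the code in comms_framework.py: the `Interceptor` base class, `run_phase`, `reply`, and the `llm` callable are illustrative names, and logging goes to the standard `logging` module rather than the Oracle.

```python
import logging

log = logging.getLogger("pipeline_sketch")


class ResponseContext(dict):
    """Mutable carrier passed through every hook (a dict subclass)."""


class Interceptor:
    name = "base"
    priority = 50

    def pre_call(self, ctx):
        pass

    def post_call(self, ctx):
        pass


def run_phase(interceptors, ctx, phase):
    """Run one phase in ascending priority order.

    A failing hook is logged and skipped so it cannot crash the reply;
    ctx["abort"] stops the remaining hooks.
    """
    for icpt in sorted(interceptors, key=lambda i: i.priority):
        if ctx.get("abort"):
            break
        try:
            getattr(icpt, phase)(ctx)
        except Exception:
            log.exception("interceptor %s failed in %s", icpt.name, phase)
    return ctx


def reply(interceptors, ctx, llm):
    """pre phase -> optional LLM call -> post phase."""
    run_phase(interceptors, ctx, "pre_call")
    if not ctx.get("abort") and not ctx.get("skip_llm"):
        ctx["reply"] = llm(ctx["system_prompt"], ctx["messages"])
    if not ctx.get("abort"):
        run_phase(interceptors, ctx, "post_call")
    return ctx.get("reply", "")
```

An interceptor that sets `ctx["skip_llm"] = True` and writes `ctx["reply"]` in `pre_call` yields a canned reply without any inference, while one that raises is logged and the rest of the chain still runs.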

1. The interceptor pipeline (~38 hooks, by priority)

Interceptors subclass InterceptorBase and override pre_call(ctx) and/or post_call(ctx). They’re registered in engine/agents/interceptors/__init__.py and sorted by an integer priority (lower runs first). Each can declare applicable_scenes to limit itself to specific scenes. The registry logs its count at import time, so the live number is always visible in the logs.

The pipeline is the embodiment of the project’s design philosophy: behaviour is layered, not monolithic. Context flows in (pre, low→high priority) and gets applied on the way out (post). Pre-call tiers hydrate the prompt; post-call tiers turn the model’s words into consequences.


Writing a new one is intentionally trivial — and you can register it from anywhere with a decorator:

from engine.agents.interceptors import register_interceptor
from engine.mcp.comms_framework import InterceptorBase, ResponseContext

@register_interceptor
class WeatherMoodInterceptor(InterceptorBase):
    name = "weather_mood"
    priority = 18          # runs after world state (15), before skills (30)
    applicable_scenes = {"neoncity", "penthouse"}

    def pre_call(self, ctx: ResponseContext) -> None:
        ctx["system_prompt"] += "\n[It is raining outside; the mood is contemplative.]"

2. Stream tags — the model steers the world

CosySim treats the LLM’s output as a control channel, not just text. Characters emit inline tags that the engine parses and applies:

| Tag | Example | Applied by | Effect |
| --- | --- | --- | --- |
| `[MOOD:x]` | `[MOOD:playful intensity=0.8]` | `MoodSyncInterceptor` (92) | Sets mood, fires threshold rules |
| `[STAT:x±n]` | `[STAT:arousal+10]` `[STAT:trust=70]` | `StatSyncInterceptor` (91) | Mutates character game state |
| `[ACTION:x]` | `[ACTION:pour a drink]` | post-call / spectator | Drives animation / narration |
| `[IMAGE:x]` | `[IMAGE:a selfie in the penthouse]` | scene image pipeline | Triggers ComfyUI generation |
| `[VOICE:x]` | `[VOICE:whisper]` | `TTSStyleInterceptor` (85) | Selects TTS delivery style |

There’s a single canonical parser — ContentRouter.parse_full() in engine/agents/content_router.py — that runs once per reply (step 5 above) and produces a ParsedResponse. Every downstream interceptor reads ctx["parsed"] instead of re-scanning with its own regex. For streaming, the mirror is StreamProcessor (engine/agents/stream_processor.py), which accumulates tags incrementally off the v1 SSE event stream and fires callbacks (on_mood, on_image_request, on_stat_delta) in real time — so a [MOOD:...] lights up the UI before the sentence finishes.

The keystone is StatSyncInterceptor (priority 91). Before v1.59 these tags were parsed and discarded — a character could say [STAT:trust+10] and nothing happened. Now the loop is closed: stat tags route through the CharacterStateCoordinator (only known stats, with LLM-alias normalization like desire→horniness), and because StatSync runs just before MoodSyncInterceptor (92), the freshly-updated stats are visible to the threshold-rule auto-evaluation MoodSync performs. A character’s words have mechanical consequences, and those consequences cascade into rule-driven behaviour — all in one reply.

reply: "I lean closer, heart racing. [MOOD:flirtatious] [STAT:arousal+15] [ACTION:lean in]"
   │
   ▼ ContentRouter.parse_full()  → ParsedResponse(mood=flirtatious, stat_updates=[arousal+15], actions=[lean in])
   ▼ StatSync(91): coordinator.update("aria", arousal=+15)        → state mutated
   ▼ MoodSync(92): set mood; arousal now > threshold → rule fires → directive injected next turn
   ▼ SpectatorBroadcast(92): danmaku "Aria: I lean closer…" in mood color
   ▼ TTSStyle(85)/clean text: tags stripped → "I lean closer, heart racing."
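The trace above can be reproduced with a small single-pass parser. This is a sketch under assumptions: the regexes, the `Parsed` dataclass, and `apply_stats` are hypothetical stand-ins for ContentRouter.parse_full(), ParsedResponse, and the CharacterStateCoordinator, whose real signatures are not shown in this post.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

TAG_RE = re.compile(r"\[(MOOD|STAT|ACTION|IMAGE|VOICE):([^\]]*)\]")
STAT_RE = re.compile(r"^(\w+)\s*([+\-=])\s*(\d+)$")


@dataclass
class Parsed:
    clean_text: str = ""
    mood: Optional[str] = None
    stat_updates: list = field(default_factory=list)
    actions: list = field(default_factory=list)


def parse_full(reply):
    """Scan the reply once, collect tags, and return the tag-free text."""
    parsed = Parsed()
    for kind, body in TAG_RE.findall(reply):
        body = body.strip()
        if kind == "MOOD":
            # "[MOOD:playful intensity=0.8]" -> "playful"
            parsed.mood = body.split()[0] if body else None
        elif kind == "STAT":
            match = STAT_RE.match(body)
            if match:
                name, op, amount = match.groups()
                parsed.stat_updates.append((name, op, int(amount)))
        elif kind == "ACTION":
            parsed.actions.append(body)
    stripped = TAG_RE.sub("", reply)
    parsed.clean_text = re.sub(r"\s{2,}", " ", stripped).strip()
    return parsed


def apply_stats(state, updates):
    """Apply (name, op, amount) tuples; stats not already in state are ignored."""
    for name, op, amount in updates:
        if name not in state:
            continue
        if op == "=":
            state[name] = amount
        else:
            state[name] += amount if op == "+" else -amount
    return state
```

Parsing once and passing the result along is what lets every downstream hook read the same structure instead of re-scanning the text with its own regex.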

3. The custom LMStudio client/server

All inference is local, through a hand-written native-v1 client — no OpenAI-compat shim. engine/lmstudio/ is a full control plane over LMStudio:

LMSClient (lms_client.py) — implements every endpoint of the LMStudio v1 REST API (/api/v1/chat, model load/unload/download). It exposes the steering knobs that matter: stateful chats via previous_response_id/response_id (conversation branching by reusing any historical id), structured output (JSON-schema enforcement at the logit level), full sampling control (top_k, min_p, repeat_penalty, reasoning mode, per-request context_length), image input for VLMs, and typed SSE streaming across all 19 event types.
ServerController (server_controller.py) — CosySim is both client and server to LMStudio. The controller does server-side lifecycle: load/unload models, configure inference, per-agent model instances (create_agent_instance("aria", ...)), TTL-based auto-unload of idle instances, and per-model health (VRAM, request counts, idle time) that feeds the Oracle dashboard.
LMLinkManager (lmlink_manager.py) — federation. Connects multiple LMStudio instances (local + remote over Tailscale) and routes each request to the best peer by model affinity, capability, load, and failover. Peers track latency (EMA), error rate, and consecutive failures; transient health blips retry with exponential backoff + jitter rather than flipping a peer unhealthy.
TaskQueue (task_queue.py) — a priority queue with model-affinity routing: CODE tasks go to *coder* models, VISION to *vl*/*llava*, ROUTER to tiny *0.6b* models, etc. Workers auto-start on first submit().
Ephemeral MCP tool servers — tools are offered to the model per-request via the v1 integrations field. MCP.ephemeral("http://localhost:8600/mcp/sse") references a server by URL (no pre-registration), with allowed_tools and auth headers; MCP.plugin("mcp/cosysim") references a registered one. This is how a character gains tool access for one call without standing infrastructure.
Speculative decoding — client.enable_speculative(main_model, draft_model) loads a main+draft pair; LMStudio then activates spec decoding automatically and CosySim passes draft_model through the chat payload. Real throughput gains, fully local.

Per-agent affinity, federation, and the task queue together mean a single rig (or a small fleet) can run a tiny router model, a chat model, a coder model, and a vision model concurrently — each agent steered onto the right one.
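Model-affinity routing of the kind TaskQueue performs can be approximated with glob patterns over the loaded model names. The sketch below is illustrative only: `AFFINITY`, `pick_model`, and any model names passed to it are hypothetical, and the real queue also weighs priority and load.

```python
from fnmatch import fnmatchcase

# Task type -> glob patterns, following the mapping quoted in the post.
AFFINITY = {
    "CODE": ["*coder*"],
    "VISION": ["*vl*", "*llava*"],
    "ROUTER": ["*0.6b*"],
}


def pick_model(task_type, loaded_models, default=None):
    """Return the first loaded model whose name matches the task's patterns.

    Patterns are tried in order, so earlier patterns win. Names are
    lower-cased before matching; unknown task types return `default`.
    """
    for pattern in AFFINITY.get(task_type, []):
        for model in loaded_models:
            if fnmatchcase(model.lower(), pattern):
                return model
    return default
```

Returning a default rather than raising lets a general chat model absorb task types that have no dedicated model loaded.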

4. The Oracle — one name, two entities

The Oracle is deliberately dual, and that duality is the project’s signature flourish.

The telemetry backbone (engine/observability/oracle.py) is the project-wide observability facade. One import — from engine.observability.oracle import get_logger — and on first use it wires the entire stack: a StructuredLogger root handler (→ SQLite + JSONL, queryable and traceable), the CosyLogger ring buffer (→ the in-game Phone feed), and an _OracleHandler that fires only on ERROR+ (~0.2ms cost). Errors flow into the ErrorAggregator, which fingerprints them — stripping IDs, numbers, and paths to a stable hash — so 500 log lines collapse into “LMStudio auth failed: 47× in 5min, affecting phone + lounge + tavern, started 14:32.” It’s hardened: a bounded-LRU flood guard caps memory under a storm of unique fingerprints, a throttled rate-alert hook emits one CRITICAL line instead of silence, and a post-install self-check confirms the handlers actually attached (a silent no-op install is exactly the failure mode it guards against). diagnose() and scripts/oracle.py print health, top errors, LLM p95, Nexus KB stats, per-model VRAM, and Gemini service status in one ASCII-safe report.
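Fingerprinting is the step that turns hundreds of log lines into one aggregated error. A minimal sketch of the idea follows; the regexes, placeholder tokens, and the choice of a truncated SHA-1 are assumptions, not the ErrorAggregator's actual rules.

```python
import hashlib
import re
from collections import Counter

_PATH = re.compile(r"(?:[A-Za-z]:)?(?:[\\/][\w.\-]+)+")   # /srv/app/run.py, C:\x\y.py
_HEX_ID = re.compile(r"\b[0-9a-fA-F]{8,}\b")               # long hex identifiers
_NUM = re.compile(r"\d+")                                  # any remaining digits


def fingerprint(message):
    """Strip paths, IDs, and numbers, then hash what is left."""
    text = _PATH.sub("<path>", message)
    text = _HEX_ID.sub("<id>", text)
    text = _NUM.sub("<n>", text)
    text = " ".join(text.lower().split())   # normalise case and whitespace
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


def aggregate(messages):
    """Count occurrences per fingerprint."""
    return Counter(fingerprint(m) for m in messages)
```

Two messages that differ only in a request number or file path hash to the same fingerprint, so they increment one counter instead of producing two alerts.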

The in-game scene (content/scenes/oracle/oracle_scene.py) is a neural-consciousness terminal in NeonCity’s core — meditation, LLM-driven fortune readings, city-pulse displays — and it surfaces the very same telemetry through an “All-Seeing Eye” dashboard: a real-time error feed, a service-health grid, and trace links, all over Socket.IO. The thing watching the city is the same thing watching the code. That’s not a gimmick — it means the project’s observability has a face, and debugging is a first-class, in-world experience.

5. Neurochemistry + mood drift

Underneath the mood tags is a genuine affect model. engine/characters/neurochemistry.py gives every character 6 neurotransmitters — dopamine, serotonin, oxytocin, cortisol, adrenaline, endorphins — each with a baseline, a half-life decay curve, and a stimulus catalog (kiss, rejection, crew_victory, level_up…) that applies clamped deltas. Emotions are computed, not hardcoded: high dopamine + low cortisol → Confident; high cortisol + high adrenaline → Panicked. The NeurochemistryInterceptor (priority 4) injects this derived state into the system prompt at the very front of the pipeline, and StimulusDetectInterceptor (88) closes the loop by detecting stimuli in the conversation post-call and feeding them back.
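The baseline, half-life, and clamped-delta mechanics can be made concrete with a short sketch. Everything numeric here is an assumption for illustration: the 0 to 1 level range, the 0.7 and 0.3 emotion thresholds, and the names `Transmitter` and `derive_emotion` do not come from neurochemistry.py.

```python
from dataclasses import dataclass


@dataclass
class Transmitter:
    baseline: float      # resting level the transmitter decays toward
    half_life_s: float   # seconds for the gap to baseline to halve
    level: float         # current level, assumed to live in [0, 1]

    def decay(self, dt_s):
        """Exponential decay of the gap between level and baseline."""
        gap = self.level - self.baseline
        self.level = self.baseline + gap * 0.5 ** (dt_s / self.half_life_s)

    def stimulate(self, delta):
        """Apply a stimulus delta, clamped to the assumed [0, 1] range."""
        self.level = min(1.0, max(0.0, self.level + delta))


def derive_emotion(levels):
    """Map transmitter levels to a label using the two rules quoted above."""
    if levels["cortisol"] > 0.7 and levels["adrenaline"] > 0.7:
        return "Panicked"
    if levels["dopamine"] > 0.7 and levels["cortisol"] < 0.3:
        return "Confident"
    return "Neutral"
```

Decaying the gap to baseline rather than the raw level means a depleted transmitter recovers upward at the same rate an elevated one cools down.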

NaturalMoodDriftInterceptor (priority 5) makes the world feel alive between turns: arousal cools, tiredness accumulates, anger fades, happiness regresses toward a personality mean — deliberately slow, so emotions shift gradually rather than snapping. It piggybacks buff-expiry and tag-decay sweeps onto every call and slips the agent a one-line “inner feeling” cue. So a character isn’t a static persona answering questions — it’s a drifting emotional state that your words (and [STAT:]/[MOOD:] tags, and the threshold rules they trigger) continuously nudge.

NLM + Nexus — frontier-grade AI from local models

The Oracle’s All-Seeing Eye surfaces query-router provenance — which tier answered each query, with confidence and tokens-saved logged in Oracle format.

Local models are cheap, private, and fast — but a 0.6B–8B model running in LMStudio is not GPT-class on its own. CosySim closes that gap not by making the model bigger, but by making the model ask less and remember more. Two subsystems do the heavy lifting:

Nexus KMS — a persistent SQLite + FTS5 + vector knowledge backbone (:8700) that every agent, scene, and dev session reads from and writes back to.
NotebookLM (NLM) — Google’s Gemini, driven headlessly through a reverse-engineered private RPC stack, used as a free distillation and grounding layer.

The thesis is simple and provable in the code: the first time a question is asked it costs compute; every subsequent time it is served from Nexus for free. Expensive frontier-grade reasoning happens once, gets distilled into the knowledge base, and is thereafter answered locally — instantly. The local model becomes the last resort, not the first.

This is the part of CosySim most worth borrowing. The whole pipeline is open and grounded in real modules — read along.

The 7-tier query router

engine/nexus/query_router.py (NexusQueryRouter) is the heart of the system. Every information-retrieval request — agent context hydration, a player question, a dev lookup — passes through a confidence-gated cascade, cheapest tier first. Each tier either clears the min_confidence bar and returns, or falls through to the next.

| # | Tier | Mechanism | Cost | Confidence |
| --- | --- | --- | --- | --- |
| 0 | Local session cache | In-process MD5-keyed dict, TTL `local_cache_ttl` (300s) | ~0 | inherited |
| 1 | Q&A cache | `client.find_qa` exact/fuzzy match, scored by word-overlap relevance (≥0.4 to count) | ~0, instant | up to 0.90 |
| 2 | Vector search | Gemini Embedding 2 → ChromaDB cosine over `knowledge/qa/code/news` | fast | up to 0.92 |
| 2.5 | File Search | Google managed RAG with grounded citations over uploaded docs | API call | 0.85 |
| 3 | FTS knowledge | SQLite FTS5 across Nexus entries, title-overlap + length scored | fast | up to 0.85 |
| 4 | Nexus smart-ask | Server-side hybrid pipeline (FTS + NLM) via `client.ask(depth=…)` | medium | variable |
| 5 | Direct NLM | `nlm_unified_ask` — free, Gemini-grounded answer with citations | slow | ~0.8 |
| 6 | LLM fallback | Local LMStudio inference (`engine.lmstudio.chat`) | local GPU | 0.6 |
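The cascade in the table reduces to a loop over tiers with an early return. The sketch below shows that shape only; `Answer`, `cascade`, the default `min_confidence`, and the choice to return the best sub-threshold answer when nothing clears the bar are assumptions rather than NexusQueryRouter's actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float
    source: str


def cascade(question, tiers, min_confidence=0.5):
    """Try tiers cheapest-first; return the first answer that clears the bar.

    `tiers` is an ordered list of (name, fn) pairs where fn(question)
    returns an Answer or None. If no tier clears min_confidence, the
    highest-confidence answer seen (possibly None) is returned.
    """
    best = None
    for _name, tier in tiers:
        answer = tier(question)
        if answer is None:
            continue
        if answer.confidence >= min_confidence:
            return answer
        if best is None or answer.confidence > best.confidence:
            best = answer
    return best
```

Because expensive tiers sit at the end of the list, they only run when every cheaper tier has either missed or answered below the threshold.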

The thresholds are real, tuned constants and every one is config-overridable (nexus.query_router.*):

CACHE_CONFIDENCE   = 0.90   # Q&A cache hit
VECTOR_CONFIDENCE  = 0.82   # strong vector match
FILE_SEARCH_CONFIDENCE = 0.85  # grounded in uploaded docs
SEARCH_HIGH   = 0.75
SEARCH_MEDIUM = 0.50
SEARCH_LOW    = 0.30
MIN_ANSWER_LENGTH = 20

Two details that make it robust rather than naive:

Relevance gating, not first-result-wins. Tier 1 doesn’t trust the top Q&A row blindly — _question_relevance computes a stop-word-filtered Jaccard overlap and scales confidence by it (0.4 overlap → 0.72 conf, 1.0 → 0.90). A weak match falls through instead of returning a confidently-wrong answer.
Provenance logging. Every resolution logs tier=…, confidence=…, tokens_saved=… in Oracle format, and per-agent hit counts are tracked (agent_queries / agent_hits) — so you can see exactly which tier answered, for whom, and how much GPU it saved.
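Relevance gating can be illustrated with a stop-word-filtered Jaccard score and a confidence mapping. The linear mapping below is an assumption fitted to the two quoted points (0.4 overlap gives 0.72, 1.0 gives 0.90); the stop-word list and function names are likewise hypothetical and may differ from _question_relevance.

```python
import re

# Small illustrative stop-word list; the real one is not shown in the post.
STOP_WORDS = {"a", "an", "the", "is", "are", "does", "do", "how",
              "what", "of", "to", "in", "and", "or"}


def _tokens(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def question_relevance(asked, cached):
    """Jaccard overlap of the two questions' non-stop-word tokens."""
    a, b = _tokens(asked), _tokens(cached)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cache_confidence(overlap, floor=0.4):
    """Return None below the floor, else a confidence that rises with overlap.

    The line 0.60 + 0.30 * overlap passes through both quoted points.
    """
    if overlap < floor:
        return None
    return 0.60 + 0.30 * overlap
```

Returning None below the floor is what lets a weak cache match fall through to the next tier instead of being served.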

The self-improving flywheel

This is what makes local models punch above their weight. Look at tiers 3–6 in query(): every answer that required real work is written back as a Nexus Q&A pair, which promotes it to tier 1 for all future queries.

# Tier 6: LLM Fallback — store the answer back in Nexus for future reuse
if use_llm:
    result = self._llm_fallback(question, ...)
    if result.answer and len(result.answer) >= self.MIN_ANSWER_LENGTH:
        self._store_qa(client, question, result.answer, ...)   # → promotes to tier 1
        self._stats.answers_stored += 1

And _store_qa doesn’t just cache — it also feeds the training flywheel (_feed_training_flywheel → collect_from_qa), so every fallback simultaneously becomes a future cache hit and a fine-tuning example. The loop is closed:

expensive answer (NLM / LLM)
        │  store_qa
        ▼
Nexus Q&A pair  ──────────►  future query hits tier 1 (free, instant)
        │  collect_from_qa
        ▼
TrainingFlywheel example  ─►  fine-tune local model
        │
        ▼
better local fallback  ────►  cheaper tier 6, more cache hits next cycle

RouterStats.hit_rate() measures the payoff directly: hits ÷ total queries. As the cache fills, the rate climbs and llm_fallbacks falls. The nlm_router.py variant adds an explicit savings_report() breaking out answered_without_gpu = cache_hits + fts_hits + nlm_hits and estimated_tokens_saved — the system reports its own compounding ROI.

NLM chain-prompting: where frontier reasoning enters

NLM is the system’s gateway to Gemini — for free, at NotebookLM rate limits. engine/nexus/nlm_chain.py (NLMChainEngine) turns a single question into multi-step distillation and routes the results straight back into Nexus.

Chains are declarative (defined in config/nlm_notebooks.yaml), each step’s output piped into the next via a {previous_output} template variable:

engine = NLMChainEngine()

# progressive research: overview → details → examples → gaps
engine.execute_chain("architecture-review", notebook_id,
                     variables={"task_description": "..."})

# reverse-generate a whole Q&A set from one notebook
engine.distill_notebook("coding", questions=[...])

# weekly fleet sweep across all notebooks
engine.run_batch("weekly-review")

Crucially, execute_chain persists as it goes: the final synthesis is stored as a Nexus entry, and every substantive step is stored as a Q&A pair (_store_qa_in_nexus). So a single chain run — one burst of Gemini-grade reasoning — seeds dozens of tier-1 cache entries that the local stack serves forever after. generate_action_manifest even uses the task_decompose chain to turn a fuzzy task description into a JSON, agent-executable plan.
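The step-to-step piping can be sketched without any NotebookLM access. In the illustration below, `run_chain` and the `ask` callable are hypothetical; only the {previous_output} template variable comes from the post, and the returned (prompt, output) pairs stand in for the Q&A pairs a real run would persist.

```python
def run_chain(steps, ask, variables=None):
    """Run prompt templates in order, feeding each output into the next.

    `steps` is a list of str.format templates; `ask` is any callable that
    maps a prompt string to an answer string. Returns the list of
    (prompt, output) pairs, one per step.
    """
    values = dict(variables or {})
    values.setdefault("previous_output", "")
    pairs = []
    for template in steps:
        prompt = template.format(**values)
        output = ask(prompt)
        pairs.append((prompt, output))
        values["previous_output"] = output   # visible to the next template
    return pairs
```

Passing `ask` in as a parameter keeps the chain logic testable with a stub before it is pointed at a rate-limited backend.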

Behind it, nlm_direct_client.py (NLMDirectClient) speaks the raw batchexecute / GenerateFreeFormStreamed RPC protocol with browser-attached auth (SAPISIDHASH), a 302-operation rpcid registry, and full multimodality — text, URL, YouTube, image, audio, video, PDF in; reports, podcasts, mind-maps, flashcards out. Every output can become the next call’s input — recursive self-improvement is the architecture, not an afterthought.


The knowledge pipeline: one funnel, consistent quality

Every knowledge source — sessions, URL crawls, agent submissions, NLM distillation, manual notes — routes through a single funnel, engine/nexus/knowledge_pipeline.py (KnowledgePipeline.ingest):

ingest → validate → dedup → store → embed → Q&A → notify → train

Each stage is deliberate: content-hash dedup (SHA-256 of title + first 500 chars) blocks near-duplicates; a quality heuristic gates Q&A generation (quality ≥ 0.5); successful entries auto-embed into ChromaDB and auto-generate rule-based Q&A pairs; and everything feeds the DataCollector as a knowledge_synthesizer training example. The result: anything that enters Nexus is immediately discoverable by all retrieval tiers — FTS, vector, and Q&A cache — with no manual bookkeeping.
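The dedup stage is easy to picture in code. The following sketch assumes the hash is taken over the title concatenated with the first 500 characters of the body, with no normalisation; `content_hash` and `Deduper` are hypothetical names, and the real pipeline presumably persists its hashes rather than holding them in memory.

```python
import hashlib


def content_hash(title, body):
    """SHA-256 over the title plus the first 500 characters of the body."""
    return hashlib.sha256((title + body[:500]).encode("utf-8")).hexdigest()


class Deduper:
    """In-memory set of seen hashes (a stand-in for persistent storage)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, title, body):
        digest = content_hash(title, body)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Hashing only the first 500 characters means two entries that share a title and opening but diverge later are treated as duplicates, which is the trade-off this scheme accepts.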

Why this punches above local weight

Frontier reasoning is amortized to zero. Gemini-grade answers (via NLM) are computed once and distilled into a free, instant local cache. The marginal cost of the 1000th identical query is a dict lookup.
Confidence gating prevents quality collapse. Cheap tiers only answer when they’re actually confident; otherwise the question escalates toward grounded Gemini. You get cache speed without the cache confidently serving a stale or mismatched answer.
Grounded citations on demand. Tiers 2.5 and 5 return answers with source citations (File Search + NLM), so even “frontier” answers are verifiable, not hallucinated.
The system trains the system. Every fallback is both a cache write and a fine-tuning datum — the local model that handles tier 6 next month was taught by the Gemini that handled tier 5 this month.
It’s all observable. router.stats, savings_report(), and Oracle provenance logs make the flywheel measurable — you can watch the hit rate climb and the GPU calls fall.

from engine.nexus.query_router import get_query_router

router = get_query_router()
res = router.query("How does the interceptor pipeline work?")
print(res.source, res.confidence, res.tokens_saved)  # e.g. "cache" 0.90 450
print(router.stats.to_dict())  # cache/vector/file_search/nlm/llm breakdown + hit rate

The whole stack is open, local-first, and self-documenting — a working example of how to give a small local model a memory that compounds and a tutor that’s free.