Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihcco2v6xoy2b6eaf4medxxxewycu7n3owb37vyuxgripw3jlblge",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mofktpsomfk2"
  },
  "path": "/t/neon-city-cosysim-and-the-nexus-project/176853#post_2",
  "publishedAt": "2026-06-16T08:40:36.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Socket.IO",
    "docs/GAME_SYSTEMS.md",
    "docs/ECONOMY_GUIDE.md",
    "docs/MCP_FRAMEWORK.md",
    "docs/ARCHITECTURE.md",
    "engine/mcp/comms_framework.py",
    "engine/agents/interceptors/__init__.py",
    "(click for more details)",
    "engine/agents/content_router.py",
    "engine/lmstudio/",
    "engine/observability/oracle.py",
    "engine/characters/neurochemistry.py",
    "@register_interceptor"
  ],
  "textContent": "### **Game mechanics**\n\nPersistent player state lives in `engine/world/player_state.py` (a thread-safe singleton persisted to `data/player_state.json`, broadcasting `hud_update` over Socket.IO so the Neon HUD stays live across every scene):\n\n  * **Vitals** — credits (₵), reputation, **heat / wanted level** (0–100), health, hunger, energy\n  * **Skills & XP** — 8 skills (hacking, combat, stealth, social, tech, driving, medicine, trading) on a use-based XP curve (`skill_progression.py`), with d20-style checks: `success = roll(1–20) + skill_level*4 + modifier ≥ difficulty`, scaling from Trivial(5) to Legendary(25), and a global player level 1–50\n  * **Factions** — six powers (OmniCorp, NeoTech, BlackMarket, Ghost_Net, SynthSec, DeepState) each with its own personality and a standing scale of −100 (sworn enemy) → 0 → +100 (trusted ally)\n  * **Territory** — those factions contest 16 districts; control flows from missions, crew ops, and world events, and a >10% swing in one tick triggers a **faction war** that can cascade to adjacent districts\n  * **Economy** — six good categories (weapons, tech, consumables, contraband, intel, luxury) priced as `base · (1 + (demand − supply)/100)` with territory multipliers layered on top\n  * **Inventory & equipment** — items carry rarity/condition; equipping cyberware/weapons grants real skill and stat bonuses; consumables resolve effects by category\n  * **Crew ops** — recruit NPCs you’ve built relationships with into role-based crews; operations resolve via probabilistic skill checks (SUCCESS / PARTIAL / FAILURE) that shift loyalty and pay out scaled rewards\n  * **Missions & chains** — four branching storylines (heist escalation, faction war, deep-state defection, street-to-syndicate) where outcome and standing route you down divergent paths\n\n\n\nSee docs/GAME_SYSTEMS.md and docs/ECONOMY_GUIDE.md for the full mechanics.\n\n### **Local-agent simulations: NPCs that perceive, decide, act**\n\nThe defining 
trick of NEON CITY is that its inhabitants are **local LLM agents running a real agent loop**, not scripted dialogue trees. `engine/agents/agent_loop.py` runs a tick-based cycle for every character in a scene:\n\n  1. **Perceive** — observe location, nearby characters, and recent events (including world-sim digests)\n  2. **Decide** — `VirtualAgentManager` produces a _structured_ JSON action against a fixed schema (`speak`, `move`, `interact`, `idle`, `flirt`, …) — batched across agents for parallel inference\n  3. **Execute** — the action is applied to the scene, broadcast over Socket.IO, and logged to the `EventChain`\n\n\n\n\n    DECISION_SCHEMA = {\n        \"type\": \"object\",\n        \"properties\": {\n            \"action\": {\"type\": \"string\",\n                       \"enum\": [\"speak\", \"move\", \"interact\", \"idle\",\n                                \"flirt\", \"touch\", \"kiss\", \"cuddle\", \"intimate\"]},\n            \"target\":  {\"type\": \"string\"},\n            \"message\": {\"type\": \"string\"},\n        },\n        \"required\": [\"action\"],\n    }\n\n\nEvery reply an agent emits flows through the **MCP interceptor pipeline** (~38 interceptors, priority-ordered), which is what wires dialogue into the world: `NexusPrompt` hydrates context from the knowledge base, `FactionContextInterceptor` (pri 40) injects the speaker’s standing toward you, `HeatAwarenessInterceptor` (pri 75) makes NPCs react to your wanted level, `StatSyncInterceptor` (pri 91) applies stat changes, and `SpectatorBroadcastInterceptor` (pri 92) pushes danmaku to onlookers. NPCs even drift through `NaturalMoodDrift` neurochemistry tagging between turns. Agent decisions are also fed into the `DataCollector` for the self-improvement training loop and auto-registered into Nexus’s agent registry. 
The architecture of that pipeline is documented in docs/MCP_FRAMEWORK.md and docs/ARCHITECTURE.md.\n\nThe result is a city where the bartender remembers the slight, the rival faction lieutenant prices you out, and a stranger across the lounge is — genuinely — _deciding_ what to do next, locally, on your machine.\n\n* * *\n\n## **Engine internals: how agents are steered**\n\n_The Oracle scene — a neural-consciousness terminal in NeonCity that doubles as the project’s All-Seeing Eye observability dashboard (real-time error feed, service-health grid, trace links)._\n\nMost “AI character” demos are a system prompt and a `while` loop. CosySim is the opposite: every agent reply passes through a **governed pipeline** of ~38 interceptors, the model’s own output is parsed for **inline control tags** that mutate game state, and inference itself is steered by a **custom LMStudio client/server** that does model affinity, federation, speculative decoding, and ephemeral tool servers — all running on local hardware. This section is the deep dive. Everything below is grounded in real modules you can open and read.\n\n> **Why read this?** It’s a working reference implementation of agent governance, structured-output steering, and observability that you can borrow wholesale. The patterns are deliberately small and composable — an interceptor is ~40 lines; a control tag is a regex plus a state write.\n\n### **The shape of one reply**\n\nWhen a scene asks a character to respond, it doesn’t call the LLM directly. It calls an `AgentGovernor` (engine/mcp/comms_framework.py) which orchestrates the whole flow:\n\n\n    user_message\n       │\n       ▼\n    AgentGovernor.reply()\n       ├─ 1. Load SceneManifest (which skills this scene exposes)\n       ├─ 2. Run AUTO skills  ── cooldown + prerequisite gated ──▶ ctx[\"auto_results\"]\n       ├─ 3. 
pipeline.run_pre(ctx)    ◀── ~38 interceptors, priority-ordered\n       │        (mutate system_prompt + messages: mood, memory, scene, rules…)\n       ├─ 4. LLM call (custom LMStudio client)  ──▶ ctx[\"reply\"], response_id, tool_calls\n       ├─ 5. ContentRouter.parse_full(reply)     ──▶ ctx[\"parsed\"]  (single pass)\n       └─ 6. pipeline.run_post(ctx)   ◀── same interceptors, post phase\n                (apply [STAT], sync mood, broadcast danmaku, log, shape)\n       ▼\n    final reply (tags stripped, state mutated, telemetry emitted)\n\n\n\nThe carrier is a single mutable `ResponseContext` (a `dict` subclass). Every interceptor reads and writes well-known keys (`system_prompt`, `messages`, `reply`, `parsed`, `mood_tags`, `abort`, `skip_llm`…). Any interceptor can short-circuit the chain by setting `ctx[\"abort\"] = True`, or skip the LLM entirely (`ctx[\"skip_llm\"] = True`) to provide a canned reply. The pipeline never lets one bad interceptor crash a reply — each hook is wrapped, and failures are logged through the Oracle, not swallowed.\n\n### **1. The interceptor pipeline (~38 hooks, by priority)**\n\nInterceptors subclass `InterceptorBase` and override `pre_call(ctx)` and/or `post_call(ctx)`. They’re registered in engine/agents/interceptors/__init__.py and sorted by an integer `priority` (lower runs first). Each can declare `applicable_scenes` to limit itself to specific scenes. The registry logs its count at import time, so the live number is always visible in the logs.\n\nThe pipeline is the embodiment of the project’s design philosophy: **behaviour is layered, not monolithic.** Context flows _in_ (pre, low→high priority) and gets _applied_ on the way _out_ (post). 
Pre-call tiers hydrate the prompt; post-call tiers turn the model’s words into consequences.\n\nThe full pipeline by priority (pre-call hydration → LLM → post-call application)\n\nWriting a new one is intentionally trivial — and you can register it from anywhere with a decorator:\n\n\n    from engine.agents.interceptors import register_interceptor\n    from engine.mcp.comms_framework import InterceptorBase, ResponseContext\n\n    @register_interceptor\n    class WeatherMoodInterceptor(InterceptorBase):\n        name = \"weather_mood\"\n        priority = 18          # runs after world state (15), before skills (30)\n        applicable_scenes = {\"neoncity\", \"penthouse\"}\n\n        def pre_call(self, ctx: ResponseContext) -> None:\n            ctx[\"system_prompt\"] += \"\\n[It is raining outside; the mood is contemplative.]\"\n\n\n### **2. Stream tags — the model steers the world**\n\nCosySim treats the LLM’s output as a **control channel**, not just text. Characters emit inline tags that the engine parses and _applies_:\n\n**Tag** | **Example** | **Applied by** | **Effect**\n---|---|---|---\n`[MOOD:x]` | `[MOOD:playful intensity=0.8]` | `MoodSyncInterceptor` (92) | Sets mood, fires threshold rules\n`[STAT:x±n]` | `[STAT:arousal+10]` `[STAT:trust=70]` | `StatSyncInterceptor` (91) | Mutates character game state\n`[ACTION:x]` | `[ACTION:pour a drink]` | post-call / spectator | Drives animation / narration\n`[IMAGE:x]` | `[IMAGE:a selfie in the penthouse]` | scene image pipeline | Triggers ComfyUI generation\n`[VOICE:x]` | `[VOICE:whisper]` | `TTSStyleInterceptor` (85) | Selects TTS delivery style\n\nThere’s a single canonical parser — `ContentRouter.parse_full()` in engine/agents/content_router.py — that runs **once** per reply (step 5 above) and produces a `ParsedResponse`. Every downstream interceptor reads `ctx[\"parsed\"]` instead of re-scanning with its own regex. 
For streaming, the mirror is `StreamProcessor` (`engine/agents/stream_processor.py`), which accumulates tags _incrementally_ off the v1 SSE event stream and fires callbacks (`on_mood`, `on_image_request`, `on_stat_delta`) in real time — so a `[MOOD:...]` lights up the UI before the sentence finishes.\n\nThe keystone is `StatSyncInterceptor` (priority 91). Before v1.59 these tags were parsed and **discarded** — a character could say `[STAT:trust+10]` and nothing happened. Now the loop is closed: stat tags route through the `CharacterStateCoordinator` (only known stats, with LLM-alias normalization like `desire→horniness`), and because StatSync runs _just before_ `MoodSyncInterceptor` (92), the freshly-updated stats are visible to the threshold-rule auto-evaluation MoodSync performs. A character’s words have mechanical consequences, and those consequences cascade into rule-driven behaviour — all in one reply.\n\n\n    reply: \"I lean closer, heart racing. [MOOD:flirtatious] [STAT:arousal+15] [ACTION:lean in]\"\n       │\n       ▼ ContentRouter.parse_full()  → ParsedResponse(mood=flirtatious, stat_updates=[arousal+15], actions=[lean in])\n       ▼ StatSync(91): coordinator.update(\"aria\", arousal=+15)        → state mutated\n       ▼ MoodSync(92): set mood; arousal now > threshold → rule fires → directive injected next turn\n       ▼ SpectatorBroadcast(92): danmaku \"Aria: I lean closer…\" in mood color\n       ▼ TTSStyle(85)/clean text: tags stripped → \"I lean closer, heart racing.\"\n\n\n\n### **3. The custom LMStudio client/server**\n\nAll inference is **local** , through a hand-written native-v1 client — no OpenAI-compat shim. engine/lmstudio/ is a full control plane over LMStudio:\n\n  * **`LMSClient`** (`lms_client.py`) — implements every endpoint of the LMStudio v1 REST API (`/api/v1/chat`, model load/unload/download). 
It exposes the steering knobs that matter: **stateful chats** via `previous_response_id`/`response_id` (conversation branching by reusing any historical id), **structured output** (JSON-schema enforcement at the logit level), full sampling control (`top_k`, `min_p`, `repeat_penalty`, reasoning mode, per-request `context_length`), image input for VLMs, and typed SSE streaming across all 19 event types.\n  * **`ServerController`** (`server_controller.py`) — CosySim is **both client and server** to LMStudio. The controller does server-side lifecycle: load/unload models, configure inference, **per-agent model instances** (`create_agent_instance(\"aria\", ...)`), TTL-based auto-unload of idle instances, and per-model health (VRAM, request counts, idle time) that feeds the Oracle dashboard.\n  * **`LMLinkManager`** (`lmlink_manager.py`) — **federation**. Connects multiple LMStudio instances (local + remote over Tailscale) and routes each request to the best peer by **model affinity** , capability, load, and failover. Peers track latency (EMA), error rate, and consecutive failures; transient health blips retry with exponential backoff + jitter rather than flipping a peer unhealthy.\n  * **`TaskQueue`** (`task_queue.py`) — a priority queue with **model-affinity routing** : `CODE` tasks go to `*coder*` models, `VISION` to `*vl*`/`*llava*`, `ROUTER` to tiny `*0.6b*` models, etc. Workers auto-start on first `submit()`.\n  * **Ephemeral MCP tool servers** — tools are offered to the model per-request via the v1 `integrations` field. `MCP.ephemeral(\"http://localhost:8600/mcp/sse\")` references a server by URL (no pre-registration), with `allowed_tools` and auth headers; `MCP.plugin(\"mcp/cosysim\")` references a registered one. 
This is how a character gains tool access _for one call_ without standing infrastructure.\n  * **Speculative decoding** — `client.enable_speculative(main_model, draft_model)` loads a main+draft pair; LMStudio then activates spec decoding automatically and CosySim passes `draft_model` through the chat payload. Real throughput gains, fully local.\n\n\n\nPer-agent affinity, federation, and the task queue together mean a single rig (or a small fleet) can run a tiny router model, a chat model, a coder model, and a vision model concurrently — each agent steered onto the right one.\n\n### **4. The Oracle — one name, two entities**\n\nThe Oracle is deliberately dual, and that duality is the project’s signature flourish.\n\n**The telemetry backbone** (engine/observability/oracle.py) is the project-wide observability facade. One import — `from engine.observability.oracle import get_logger` — and on first use it wires the entire stack: a `StructuredLogger` root handler (→ SQLite + JSONL, queryable and traceable), the `CosyLogger` ring buffer (→ the in-game Phone feed), and an `_OracleHandler` that fires only on `ERROR+` (~0.2ms cost). Errors flow into the `ErrorAggregator`, which **fingerprints** them — stripping IDs, numbers, and paths to a stable hash — so 500 log lines collapse into _“LMStudio auth failed: 47× in 5min, affecting phone + lounge + tavern, started 14:32.”_ It’s hardened: a bounded-LRU flood guard caps memory under a storm of unique fingerprints, a throttled rate-alert hook emits one CRITICAL line instead of silence, and a post-install **self-check** confirms the handlers actually attached (a silent no-op install is exactly the failure mode it guards against). 
`diagnose()` and `scripts/oracle.py` print health, top errors, LLM p95, Nexus KB stats, per-model VRAM, and Gemini service status in one ASCII-safe report.\n\n**The in-game scene** (`content/scenes/oracle/oracle_scene.py`) is a neural-consciousness terminal in NeonCity’s core — meditation, LLM-driven fortune readings, city-pulse displays — _and_ it surfaces the very same telemetry through an “All-Seeing Eye” dashboard: a real-time error feed, a service-health grid, and trace links, all over Socket.IO. The thing watching the city is the same thing watching the code. That’s not a gimmick — it means the project’s observability has a _face_ , and debugging is a first-class, in-world experience.\n\n### **5. Neurochemistry + mood drift**\n\nUnderneath the mood tags is a genuine affect model. engine/characters/neurochemistry.py gives every character **6 neurotransmitters** — dopamine, serotonin, oxytocin, cortisol, adrenaline, endorphins — each with a baseline, a half-life decay curve, and a stimulus catalog (`kiss`, `rejection`, `crew_victory`, `level_up`…) that applies clamped deltas. Emotions are **computed** , not hardcoded: high dopamine + low cortisol → _Confident_ ; high cortisol + high adrenaline → _Panicked_. The `NeurochemistryInterceptor` (priority 4) injects this derived state into the system prompt at the very front of the pipeline, and `StimulusDetectInterceptor` (88) closes the loop by detecting stimuli in the conversation post-call and feeding them back.\n\n`NaturalMoodDriftInterceptor` (priority 5) makes the world feel _alive between turns_ : arousal cools, tiredness accumulates, anger fades, happiness regresses toward a personality mean — deliberately slow, so emotions shift gradually rather than snapping. It piggybacks buff-expiry and tag-decay sweeps onto every call and slips the agent a one-line “inner feeling” cue. 
So a character isn’t a static persona answering questions — it’s a drifting emotional state that your words (and `[STAT:]`/`[MOOD:]` tags, and the threshold rules they trigger) continuously nudge.\n\n* * *\n\n## **NLM + Nexus — frontier-grade AI from local models**\n\n_The Oracle’s All-Seeing Eye surfaces query-router provenance — which tier answered each query, with confidence and tokens-saved logged in Oracle format._\n\nLocal models are cheap, private, and fast — but a 0.6B–8B model running in LMStudio is not GPT-class on its own. CosySim closes that gap not by making the model bigger, but by making the model _ask less and remember more_. Two subsystems do the heavy lifting:\n\n  * **Nexus KMS** — a persistent SQLite + FTS5 + vector knowledge backbone (`:8700`) that every agent, scene, and dev session reads from and writes back to.\n  * **NotebookLM (NLM)** — Google’s Gemini, driven headlessly through a reverse-engineered private RPC stack, used as a _free_ distillation and grounding layer.\n\n\n\nThe thesis is simple and provable in the code: **the first time a question is asked it costs compute; every subsequent time it is served from Nexus for free.** Expensive frontier-grade reasoning happens once, gets distilled into the knowledge base, and is thereafter answered locally — instantly. The local model becomes the _last_ resort, not the first.\n\n> This is the part of CosySim most worth borrowing. The whole pipeline is open and grounded in real modules — read along.\n\n### **The 7-tier query router**\n\n`engine/nexus/query_router.py` (`NexusQueryRouter`) is the heart of the system. Every information-retrieval request — agent context hydration, a player question, a dev lookup — passes through a **confidence-gated cascade, cheapest tier first**. 
Each tier either clears the `min_confidence` bar and returns, or falls through to the next.\n\n**#** | **Tier** | **Mechanism** | **Cost** | **Confidence**\n---|---|---|---|---\n0 | Local session cache | In-process MD5-keyed dict, TTL `local_cache_ttl` (300s) | ~0 | inherited\n1 | **Q&A cache** | `client.find_qa` exact/fuzzy match, scored by word-overlap relevance (≥0.4 to count) | ~0, instant | up to 0.90\n2 | **Vector search** | Gemini Embedding 2 → ChromaDB cosine over `knowledge/qa/code/news` | fast | up to 0.92\n2.5 | **File Search** | Google managed RAG with **grounded citations** over uploaded docs | API call | 0.85\n3 | **FTS knowledge** | SQLite FTS5 across Nexus entries, title-overlap + length scored | fast | up to 0.85\n4 | **Nexus smart-ask** | Server-side hybrid pipeline (FTS + NLM) via `client.ask(depth=…)` | medium | variable\n5 | **Direct NLM** | `nlm_unified_ask` — free, Gemini-grounded answer with citations | slow | ~0.8\n6 | **LLM fallback** | Local LMStudio inference (`engine.lmstudio.chat`) | local GPU | 0.6\n\nThe thresholds are real, tuned constants and every one is config-overridable (`nexus.query_router.*`):\n\n\n    CACHE_CONFIDENCE   = 0.90   # Q&A cache hit\n    VECTOR_CONFIDENCE  = 0.82   # strong vector match\n    FILE_SEARCH_CONFIDENCE = 0.85  # grounded in uploaded docs\n    SEARCH_HIGH   = 0.75\n    SEARCH_MEDIUM = 0.50\n    SEARCH_LOW    = 0.30\n    MIN_ANSWER_LENGTH = 20\n\n\nTwo details that make it robust rather than naive:\n\n  * **Relevance gating, not first-result-wins.** Tier 1 doesn’t trust the top Q&A row blindly — `_question_relevance` computes a stop-word-filtered Jaccard overlap and _scales confidence by it_ (0.4 overlap → 0.72 conf, 1.0 → 0.90). 
A weak match falls through instead of returning a confidently-wrong answer.\n  * **Provenance logging.** Every resolution logs `tier=…, confidence=…, tokens_saved=…` in Oracle format, and per-agent hit counts are tracked (`agent_queries` / `agent_hits`) — so you can see exactly which tier answered, for whom, and how much GPU it saved.\n\n\n\n### **The self-improving flywheel**\n\nThis is what makes local models punch above their weight. Look at tiers 3–6 in `query()`: **every answer that required real work is written back as a Nexus Q&A pair**, which promotes it to tier 1 for all future queries.\n\n\n    # Tier 6: LLM Fallback — store the answer back in Nexus for future reuse\n    if use_llm:\n        result = self._llm_fallback(question, ...)\n        if result.answer and len(result.answer) >= self.MIN_ANSWER_LENGTH:\n            self._store_qa(client, question, result.answer, ...)   # → promotes to tier 1\n            self._stats.answers_stored += 1\n\n\nAnd `_store_qa` doesn’t just cache — it **also feeds the training flywheel** (`_feed_training_flywheel` → `collect_from_qa`), so every fallback simultaneously becomes a future cache hit _and_ a fine-tuning example. The loop is closed:\n\n\n    expensive answer (NLM / LLM)\n            │  store_qa\n            ▼\n    Nexus Q&A pair  ──────────►  future query hits tier 1 (free, instant)\n            │  collect_from_qa\n            ▼\n    TrainingFlywheel example  ─►  fine-tune local model\n            │\n            ▼\n    better local fallback  ────►  cheaper tier 6, more cache hits next cycle\n\n\n\n`RouterStats.hit_rate()` measures the payoff directly: hits ÷ total queries. As the cache fills, the rate climbs and `llm_fallbacks` falls. 
The `nlm_router.py` variant adds an explicit `savings_report()` breaking out `answered_without_gpu = cache_hits + fts_hits + nlm_hits` and `estimated_tokens_saved` — the system reports its own compounding ROI.\n\n### **NLM chain-prompting: where frontier reasoning enters**\n\nNLM is the system’s gateway to Gemini — for free, at NotebookLM rate limits. `engine/nexus/nlm_chain.py` (`NLMChainEngine`) turns a single question into **multi-step distillation** and routes the results straight back into Nexus.\n\nChains are declarative (defined in `config/nlm_notebooks.yaml`), each step’s output piped into the next via a `{previous_output}` template variable:\n\n\n    engine = NLMChainEngine()\n\n    # progressive research: overview → details → examples → gaps\n    engine.execute_chain(\"architecture-review\", notebook_id,\n                         variables={\"task_description\": \"...\"})\n\n    # reverse-generate a whole Q&A set from one notebook\n    engine.distill_notebook(\"coding\", questions=[...])\n\n    # weekly fleet sweep across all notebooks\n    engine.run_batch(\"weekly-review\")\n\n\nCrucially, `execute_chain` **persists as it goes** : the final synthesis is stored as a Nexus entry, and _every_ substantive step is stored as a Q&A pair (`_store_qa_in_nexus`). So a single chain run — one burst of Gemini-grade reasoning — seeds dozens of tier-1 cache entries that the local stack serves forever after. `generate_action_manifest` even uses the `task_decompose` chain to turn a fuzzy task description into a JSON, agent-executable plan.\n\nBehind it, `nlm_direct_client.py` (`NLMDirectClient`) speaks the raw `batchexecute` / `GenerateFreeFormStreamed` RPC protocol with browser-attached auth (SAPISIDHASH), a 302-operation rpcid registry, and full multimodality — text, URL, YouTube, image, audio, video, PDF in; reports, podcasts, mind-maps, flashcards out. 
**Every output can become the next call’s input** — recursive self-improvement is the architecture, not an afterthought.\n\nThe cache pipeline — Gemini as both generator and evaluator\n\n### **The knowledge pipeline: one funnel, consistent quality**\n\nEvery knowledge source — sessions, URL crawls, agent submissions, NLM distillation, manual notes — routes through a single funnel, `engine/nexus/knowledge_pipeline.py` (`KnowledgePipeline.ingest`):\n\n\n    ingest → validate → dedup → store → embed → Q&A → notify → train\n\n\n\nEach stage is deliberate: content-hash dedup (SHA-256 of title + first 500 chars) blocks near-duplicates; a quality heuristic gates Q&A generation (`quality ≥ 0.5`); successful entries auto-embed into ChromaDB and auto-generate rule-based Q&A pairs; and everything feeds the `DataCollector` as a `knowledge_synthesizer` training example. The result: anything that enters Nexus is immediately discoverable by **all** retrieval tiers — FTS, vector, _and_ Q&A cache — with no manual bookkeeping.\n\n### **Why this punches above local weight**\n\n  * **Frontier reasoning is amortized to zero.** Gemini-grade answers (via NLM) are computed once and distilled into a free, instant local cache. The marginal cost of the 1000th identical query is a dict lookup.\n  * **Confidence gating prevents quality collapse.** Cheap tiers only answer when they’re actually confident; otherwise the question escalates toward grounded Gemini. 
You get cache speed _without_ cache staleness lies.\n  * **Grounded citations on demand.** Tiers 2.5 and 5 return answers with source citations (File Search + NLM), so even “frontier” answers are verifiable, not hallucinated.\n  * **The system trains the system.** Every fallback is both a cache write and a fine-tuning datum — the local model that handles tier 6 next month was taught by the Gemini that handled tier 5 this month.\n  * **It’s all observable.** `router.stats`, `savings_report()`, and Oracle provenance logs make the flywheel measurable — you can watch the hit rate climb and the GPU calls fall.\n\n\n\n\n    from engine.nexus.query_router import get_query_router\n\n    router = get_query_router()\n    res = router.query(\"How does the interceptor pipeline work?\")\n    print(res.source, res.confidence, res.tokens_saved)   # e.g. \"cache\" 0.90 450\n    print(router.stats.to_dict())   # cache/vector/file_search/nlm/llm breakdown + hit rate\n\n\nThe whole stack is open, local-first, and self-documenting — a working example of how to give a small local model a memory that compounds and a tutor that’s free.",
  "title": "NEON-CITY/CosySim and the NEXUS project"
}