NEON-CITY/CosySim and the NEXUS project
CONTROL — How CosySim Trains and Governs Itself
The Oracle dashboard surfaces scheduler health, auto-loop cycles, and per-task timeout/error counts in real time.
Most AI demos are read-only: a model answers, you move on. CosySim’s CONTROL plane is the opposite. Every conversation, tool call, routing decision, and code edit becomes a training signal. A scheduler daemon wakes up on a cron-like cadence, checks whether enough new signal has accumulated, fine-tunes small local models on it, benchmarks the result against the incumbent, and promotes the winner — all on your own GPU, with no human in the loop and no data leaving the machine.
This is the part of the project most worth borrowing. It’s a working, end-to-end example of a local self-improvement loop: a data flywheel, a fine-tune orchestrator, an evaluation gate, an autonomous cycle controller, and an agent governor, all wired together through a single scheduler.
The flywheel in one sentence: more interactions → richer datasets → better local models → better runtime behaviour → more interactions. See docs/TRAINING.md for the full pipeline walkthrough.
The five moving parts
| Layer | Module | Role |
|---|---|---|
| Flywheel | training/data_collector.py, engine/nexus/training_flywheel.py | Capture every runtime event as a typed training example |
| Zoo | training/model_zoo.py | Single source of truth: 16 ModelSpec entries, each with its own dataset key, train threshold, and base model |
| Trainer | training/finetune_orchestrator.py, training/auto_train.py | QLoRA / Unsloth fine-tune jobs with queue, progress, checkpoint, auto-merge |
| Gate | training/evaluation_gate.py, training/model_registry.py | Benchmark before/after; promote only if quality holds or improves |
| Controller | engine/nexus/auto_loop.py, engine/nexus/scheduler_daemon.py | Closed-loop orchestration on a schedule; the AgentGovernor caps live agents |
1. The DataCollector flywheel — learning from your own interactions
DataCollector (training/data_collector.py) is a thread-safe, non-blocking JSONL appender that runtime components call as they work. It writes per-type live files to training/datasets/collected/{model_type}_live.jsonl. Every typed signal has a dedicated capture method:
collector.collect_tool_call(user_input, tool_name, params, success=True) # → tool_dispatch
collector.collect_grammar_error(bad_text, fixed_text, error_type="json") # → grammar_scanner
collector.collect_output_rating(output, rating=4, source="feed") # → output_evaluator
collector.collect_conversation(system_prompt, history, response, rating) # → conversational
collector.collect_code(prompt, code, language="python") # → coder
collector.collect_agent_decision(...) / collect_agent_outcome(...) # self-improvement loop
Failures here never crash the caller — each method is wrapped and logged through the Oracle, so the act of collecting training data can’t break the act of serving the user.
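The "never crash the caller" guarantee can be sketched in a few lines. This is a minimal illustration, not the project's DataCollector: the class name `JsonlCollector` and its single `collect` method are hypothetical, and the sketch writes synchronously under a lock rather than reproducing the non-blocking behaviour described above.

```python
import json
import logging
import tempfile
import threading
from pathlib import Path

log = logging.getLogger("collector_sketch")


class JsonlCollector:
    """Append typed training examples to per-type JSONL files (hypothetical sketch)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self._lock = threading.Lock()  # serialises writers across threads

    def collect(self, model_type: str, record: dict) -> bool:
        """Append one record; return False instead of raising on any failure."""
        try:
            line = json.dumps(record, ensure_ascii=False)
            path = self.root / f"{model_type}_live.jsonl"
            with self._lock:
                self.root.mkdir(parents=True, exist_ok=True)
                with path.open("a", encoding="utf-8") as fh:
                    fh.write(line + "\n")
            return True
        except Exception:
            # Collection must never break the caller: log and move on.
            log.exception("failed to collect %s example", model_type)
            return False


demo_dir = Path(tempfile.mkdtemp())
collector = JsonlCollector(demo_dir)
ok = collector.collect("tool_dispatch", {"input": "open the door", "tool": "door.open"})
bad = collector.collect("tool_dispatch", {"input": object()})  # not JSON-serialisable
```

The second call fails inside `json.dumps`, but the caller only sees a `False` return value, and the file still contains exactly the one valid record.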
In parallel, TrainingFlywheel (engine/nexus/training_flywheel.py) harvests higher-level signal from the knowledge system — collect_from_qa, collect_from_nlm, collect_from_routing, collect_preference — into a SQLite-backed store with content-hash dedup, then exports in Alpaca, ShareGPT, or DPO formats (export_jsonl, export_sharegpt, export_dpo). The training-sync scheduler task drains Nexus Q&A into this store daily and auto-exports once 50+ unexported, quality-filtered examples accumulate.
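Content-hash dedup and the 50-example export trigger can be illustrated with the standard library alone. The table layout and the function names `open_store`, `add_example`, and `ready_to_export` are assumptions made for this sketch; the real TrainingFlywheel schema is not shown in this article.

```python
import hashlib
import json
import sqlite3

EXPORT_THRESHOLD = 50  # from the text: export once 50+ unexported examples exist


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) example table, keyed by content hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS examples ("
        " content_hash TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " exported INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_example(conn: sqlite3.Connection, instruction: str, output: str) -> bool:
    """Insert one example; return False when an identical one already exists."""
    payload = json.dumps({"instruction": instruction, "output": output}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO examples (content_hash, payload) VALUES (?, ?)",
        (digest, payload),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the hash was already present


def ready_to_export(conn: sqlite3.Connection) -> bool:
    """True once enough unexported examples have accumulated."""
    (pending,) = conn.execute(
        "SELECT COUNT(*) FROM examples WHERE exported = 0"
    ).fetchone()
    return pending >= EXPORT_THRESHOLD
```

Making the hash the primary key lets the database enforce dedup, so repeated harvesting of the same Q&A pair is harmless.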
2. The Model Zoo — one registry, many tiny specialists
MODEL_ZOO (training/model_zoo.py) is the declarative heart of the system: 16 ModelSpec entries, each declaring everything needed to train and evaluate one small specialist model.
"router_v3": ModelSpec(
    id="router_v3",
    base_model_alias="qwen-270m", # Qwen2.5-0.5B-Instruct
    task_type="classification",
    dataset_key="router_v3",
    train_threshold=500, # auto-train fires at 500 collected examples
    collect_from=["agent_routing_events", "intent_labels"],
    auto_promote=True,
    priority=2,
)
The fleet spans evaluators (qa_evaluator, output_evaluator), classifiers (router_v2/v3, conversation_analyzer), structured-output models (tool_dispatch), detectors (grammar_scanner), and generators (syntax_fixer, knowledge_synthesizer, coder, conversational) — plus voice backends. The philosophy: don’t fine-tune one big model; train a swarm of cheap 270M–3B specialists that each do one job well and run locally in LMStudio. Base models are resolved through aliases (qwen-270m → Qwen/Qwen2.5-0.5B-Instruct, llama-3b → meta-llama/Llama-3.2-3B-Instruct).
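To make the registry idea concrete, here is a minimal sketch of what a spec and its alias lookup might look like. The field names come from the router_v3 example above; the default values, the `base_model()` and `due_for_training()` helpers, and the two-entry alias table are assumptions for illustration, not the contents of training/model_zoo.py.

```python
from dataclasses import dataclass, field

# Alias table copied from the text; the real registry may hold more entries.
BASE_MODEL_ALIASES = {
    "qwen-270m": "Qwen/Qwen2.5-0.5B-Instruct",
    "llama-3b": "meta-llama/Llama-3.2-3B-Instruct",
}


@dataclass(frozen=True)
class ModelSpec:
    id: str
    base_model_alias: str
    task_type: str
    dataset_key: str
    train_threshold: int
    collect_from: list = field(default_factory=list)
    auto_promote: bool = False  # assumed default
    priority: int = 5           # assumed default

    def base_model(self) -> str:
        """Resolve the alias to a full model identifier (hypothetical helper)."""
        return BASE_MODEL_ALIASES[self.base_model_alias]

    def due_for_training(self, collected: int) -> bool:
        """True once the collected example count reaches the threshold."""
        return collected >= self.train_threshold


router_v3 = ModelSpec(
    id="router_v3",
    base_model_alias="qwen-270m",
    task_type="classification",
    dataset_key="router_v3",
    train_threshold=500,
    collect_from=["agent_routing_events", "intent_labels"],
    auto_promote=True,
    priority=2,
)
```

Keeping the threshold on the spec itself means a periodic training check only needs the spec and a count of collected examples to decide whether to enqueue a job.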
3. The FinetuneOrchestrator — QLoRA jobs as first-class objects
FinetuneOrchestrator (training/finetune_orchestrator.py) manages the full job lifecycle as persisted FinetuneJob records (training/jobs.jsonl): PENDING → RUNNING → DONE/FAILED/CANCELLED, with live progress, step/loss parsing, best-loss tracking, and auto-merge of the LoRA adapter on success.
Rather than depend on a heavyweight training harness in-process, it generates a standalone, cross-platform Unsloth training script per job and runs it as a subprocess (configurable via COSYSIM_TRAIN_PYTHON or training.python_executable, honouring the project’s venv rule). Hyperparameters scale with model size via FinetuneConfig — a 270M model gets lora_r=8, batch_size=8; a 3B model gets lora_r=32, batch_size=2, seq_len=2048. On completion it notifies the ModelRegistry.
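The size-based scaling can be sketched as a small lookup. Only the two tiers quoted above are from the source; the 1B-parameter cut-off between them, the class name `FinetuneConfigSketch`, and the function `config_for` are assumptions, and the sequence length for the small tier is left unset because the text does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FinetuneConfigSketch:
    lora_r: int
    batch_size: int
    seq_len: Optional[int] = None  # None = not stated in the text for this tier


def config_for(param_count: int) -> FinetuneConfigSketch:
    """Pick hyperparameters from model size.

    Assumption: a single 1B-parameter cut-off separates the two tiers the
    article mentions; the real orchestrator may use more tiers.
    """
    if param_count < 1_000_000_000:
        return FinetuneConfigSketch(lora_r=8, batch_size=8)
    return FinetuneConfigSketch(lora_r=32, batch_size=2, seq_len=2048)


small = config_for(270_000_000)
large = config_for(3_000_000_000)
```

The pattern is the usual trade: larger models get a higher LoRA rank but a smaller batch so the job still fits on one local GPU.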
Router v3 retrain: the canonical full cycle
4. The evaluation gate — no degraded model ever gets promoted
A self-improving system that can’t tell better from worse will happily train itself into the ground. evaluation_gate.py is the safety valve. It benchmarks the candidate against the incumbent and applies an explicit GatePolicy:
| Policy | Rule |
|---|---|
| NO_REGRESSION | candidate must score ≥ threshold × baseline |
| MUST_IMPROVE | a named metric must increase |
| PARETO_DOMINANT | candidate may not be dominated on any metric |
| CUSTOM | caller-supplied evaluation function |
Per-type benchmark prompt suites (router, tag-extraction, response-validate, general) score accuracy, latency, and consistency over multiple runs. Only models that clear the gate reach ModelRegistry, which supports single-score auto_promote and multi-criteria Pareto promotion — and that registry is what LMStudio loads as the active model.
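The three built-in policies reduce to simple comparisons over metric dictionaries. The sketch below is one reading of the table, not the code in evaluation_gate.py: it assumes every metric is "higher is better" (latency would need to be negated first), treats PARETO_DOMINANT as "no metric may be worse than the baseline", and uses an illustrative threshold of 0.98. The function name `passes_gate` is hypothetical.

```python
from enum import Enum


class GatePolicy(Enum):
    NO_REGRESSION = "no_regression"
    MUST_IMPROVE = "must_improve"
    PARETO_DOMINANT = "pareto_dominant"


def passes_gate(
    policy: GatePolicy,
    baseline: dict,
    candidate: dict,
    *,
    metric: str = "accuracy",
    threshold: float = 0.98,  # illustrative value, not taken from the project
) -> bool:
    """Return True if the candidate may be promoted under the given policy."""
    if policy is GatePolicy.NO_REGRESSION:
        return candidate[metric] >= threshold * baseline[metric]
    if policy is GatePolicy.MUST_IMPROVE:
        return candidate[metric] > baseline[metric]
    if policy is GatePolicy.PARETO_DOMINANT:
        # Assumed reading: the candidate must not lose on any tracked metric.
        return all(candidate[name] >= baseline[name] for name in baseline)
    raise ValueError(f"unsupported policy: {policy}")


baseline = {"accuracy": 0.90, "consistency": 0.80}
candidate = {"accuracy": 0.89, "consistency": 0.85}
```

With these numbers the candidate clears NO_REGRESSION (0.89 is above 0.98 × 0.90) but fails both MUST_IMPROVE and the Pareto check, which shows how much stricter the latter two policies are.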
5. The AutoLoop — closing the loop without a human
AutoLoop (engine/nexus/auto_loop.py) is the controller that turns the parts above into an autonomous cycle. It registers five scheduler callbacks and records every run in a SQLite cycle ledger (data/auto_loop.db):
| Cycle | Cadence | What it does |
|---|---|---|
| Experiment execution | every_2h | Runs the oldest PENDING experiment; one per cycle to keep load predictable |
| Eval sweep | every_30m | OnlineEvaluator.auto_check(): promote/rollback models past their thresholds |
| Training check | every_4h | check_and_train_all_zoo(): fine-tune any zoo model past its train_threshold |
| Impact assessment | every_6h | Finalize before/after impact snapshots, compute deltas |
| Full daily cycle | daily | All four in sequence → a Markdown Daily Improvement Report stored in Nexus |
Each promotion, rollback, and training run is logged to the ImpactTracker, so the system keeps an auditable trail of what it changed about itself and what happened next. get_loop_status() exposes a health label (healthy / degraded / stalled) for the Oracle dashboard.
6. The scheduler — 90+ tasks, now with per-task timeouts
scheduler_daemon.py is a lightweight, cron-like daemon (not the agent task scheduler) that drives all of the above plus dozens of maintenance, knowledge, and content tasks — Nexus health, dedup, QA generation, news distillation, world-sim ticks, governance audits, model benchmarks, and the training tasks already described.
The v1.60.0 hardening pass is itself a good example of the project’s “fix the real problem” ethos. The original symptom: a hung external news fetch could block the entire scheduler loop for tens of seconds. The fix was structural, not a patch:
- Per-task hard timeouts: every callback runs in a worker thread joined with a timeout; a hung task is abandoned (its daemon thread is detached, never blocking the loop) and recorded with a timeout_count. The default is configurable via scheduler.default_timeout_seconds; network-bound tasks like news-fetch get tighter caps.
- Honest “not implemented” stubs: register_stub() / make_not_implemented() log one clear warning and return a sentinel that status records as not_implemented, instead of silently faking success and hiding missing functionality.
- Non-blocking Nexus logging: task results are posted to Nexus on a fire-and-forget daemon thread that gives up immediately if Nexus is unreachable, so a down knowledge service can’t stall the loop it’s supposed to observe.
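The per-task timeout pattern described above (worker thread, join with a timeout, abandon on expiry) looks roughly like this. It is a sketch under stated assumptions: `TaskStats` and `run_with_timeout` are hypothetical names, and the status strings are illustrative rather than the daemon's actual values.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class TaskStats:
    run_count: int = 0
    error_count: int = 0
    timeout_count: int = 0


def run_with_timeout(callback, timeout_seconds: float, stats: TaskStats) -> str:
    """Run callback in a daemon thread; abandon it if it outlives the timeout."""
    outcome = {}

    def worker() -> None:
        try:
            outcome["result"] = callback()
        except Exception as exc:  # record the failure instead of propagating it
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)  # returns after the timeout even if still running
    stats.run_count += 1

    if thread.is_alive():
        # The thread keeps running in the background but no longer blocks the loop.
        stats.timeout_count += 1
        return "timeout"
    if "error" in outcome:
        stats.error_count += 1
        return "error"
    return "ok"


stats = TaskStats()
fast = run_with_timeout(lambda: 42, 1.0, stats)
hung = run_with_timeout(lambda: time.sleep(2), 0.05, stats)
broken = run_with_timeout(lambda: 1 / 0, 1.0, stats)
```

Python cannot forcibly kill a thread, so "abandoned" here means exactly what the text says: the scheduler stops waiting, and the daemon flag ensures the leftover thread never keeps the process alive.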
python -m engine.nexus.scheduler_daemon status # full task grid: next-due, run/error/timeout counts
python -m engine.nexus.scheduler_daemon run # run one task now
python -m training.auto_train --status # candidate counts vs thresholds
python -m training.auto_train --dry-run # see what would train, train nothing
Governing the live agents — budgets, cooldowns, prerequisites
Self-improvement also means keeping the runtime agents in line. Every character reply flows through the AgentGovernor (engine/mcp/comms_framework.py), which wraps a CharacterAgent and enforces the full governance pipeline: build a ResponseContext, run auto-skills, run the 36-interceptor pre-call chain, call the LLM, parse tags, run the post-call chain.
Two governance mechanisms matter most for control:
- InteractionPolicy caps each agent per scene: max_reply_tokens, tool_call_limit (rounds of tool calls per reply), tone/topic constraints, and in-character enforcement. Unset fields impose no constraint, so policies are additive.
- Cooldowns + prerequisites (v1.59.0): the auto-skill path previously bypassed the registry’s throttling, so an auto skill could fire every single turn regardless of its declared cooldown. The governor now consults COOLDOWN_TRACKER.can_use() and checks that each skill’s prerequisites were actually used before invoking it, and marks usage only after a successful call.
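The cooldown and prerequisite checks can be sketched as a small guard around each auto-skill call. Everything here is illustrative: the `CooldownTracker` class, the `can_use(skill, cooldown_seconds)` signature, and `try_auto_skill` are assumptions, since the article names COOLDOWN_TRACKER.can_use() but does not show its arguments.

```python
import time


class CooldownTracker:
    """Remember when each skill last succeeded (hypothetical sketch)."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._last_used = {}

    def can_use(self, skill: str, cooldown_seconds: float) -> bool:
        last = self._last_used.get(skill)
        return last is None or self._clock() - last >= cooldown_seconds

    def mark_used(self, skill: str) -> None:
        self._last_used[skill] = self._clock()


def try_auto_skill(tracker, skill, cooldown_seconds, prerequisites, used_skills, call):
    """Run `call` only if the cooldown has elapsed and all prerequisites were used."""
    if not tracker.can_use(skill, cooldown_seconds):
        return "cooldown"
    if not set(prerequisites) <= used_skills:
        return "missing_prerequisite"
    if not call():
        return "failed"  # a failed call does not start the cooldown
    tracker.mark_used(skill)  # usage is recorded only after success
    used_skills.add(skill)
    return "ok"
```

Marking usage after the call, rather than before, is the detail the fix hinges on: a skill that errors out stays available for the next turn instead of being locked out by a cooldown it never earned.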
The result is a system where the agents are budgeted and rate-limited turn by turn, the scheduler is timeout-bounded task by task, and the models themselves are gated promotion by promotion — three layers of control over a system designed to keep changing itself.
Deeper dives: docs/TRAINING.md (flywheel + fine-tuning), docs/MCP_FRAMEWORK.md (governor + interceptor pipeline), docs/OPERATIONS.md (running the daemons), docs/NEXUS.md (knowledge flywheel inputs).
Integrations, Apps & CLI
NEONOS — the CosySim system surface where engine integrations, apps, and CLI converge
CosySim runs on local inference, but it does not run in a vacuum. The same engine that powers 35 scenes also exposes a deep integration layer (engine/integrations/), a fleet of standalone apps (apps/*.py), and a single unified CLI (cli.py). Everything reuses the same engine singletons, the same account pool, and the same secure config, so a HAR you capture in the browser, a Colab GPU you rent for free, and a NotebookLM notebook you distill all become first-class inputs to your local agents.
This is the part of the project most worth borrowing from: it is a worked example of how to wire cloud frontier models and local models into one coherent system without leaking a single secret into the repo.
Deep dives live in docs/INTEGRATIONS_SDK.md and docs/APPS.md. Per-service protocol specs are in the *_API_REFERENCE.md files.
The Integration Suite (engine/integrations/)
Each integration is a typed Python client that authenticates with session cookies from a shared account pool (or an env-supplied API key) and speaks the service’s real wire protocol — batchexecute, gRPC-web, or REST — reverse-engineered from HAR captures and V8 heap snapshots with ARGUS. No vendor SDK lock-in, no browser automation in the hot path.
| Domain | Module(s) | What it enables |
|---|---|---|
| GitHub Copilot | github_copilot_client.py | Chat + model listing against the Copilot Individual API (38 frontier models: Claude, GPT, Gemini) via a GitHub browser session → short-lived Bearer token. Powers cli.py ask and the proxy. |
| NotebookLM | nlm_direct_client.py, notebooklm_sdk.py, nlm_rpc_registry.py | Multi-turn grounded notebook chat, source ingest (text/URL/YouTube/image/audio/video/PDF), audio overviews, flashcards, mind maps, export-to-Sheets. The SDK wraps 37 rpcids + 24 gRPC methods with full docstrings, built for agents. |
| Gemini (consumer + Labs) | gemini_direct_client.py, gemini_extended_client.py, aistudio_client.py, appcatalyst_client.py, opal_client.py | Direct Gemini chat (batchexecute), AI Studio MakerSuite (136 methods, structured JSON output), AppCatalyst REST access to Gemini 3 Flash Preview, and Opal creative workspace. |
| Managed RAG & caching | file_search_client.py, context_cache_client.py | Google AI File Search: persistent doc/code stores with grounded citations, distilled back to local Nexus (“Google is the teacher, NEXUS is the student”). Context Cache reuses 50K±token prefixes (CLAUDE.md + context) across calls. |
| Workspace | google_drive_client.py, gsheets_client.py, google_docs_client.py, appscript_client.py, gas_client.py, workspace_gemini_client.py | Drive upload/download/permissions, Sheets v4 CRUD, Docs create/export + Gemini content gen, Apps Script project/code/execution control, and the Gemini features embedded inside Workspace apps. |
| Colab (free GPU) | colab_client.py, colab_gpu_manager.py, colab_venv_manager.py, colab_notebook_builder.py, colab_tunnel_server.py | Drive a Colab runtime as a remote compute backend: AI Agent tasks, kernel exec over WebSocket, venv/notebook provisioning, and an ngrok tunnel server exposing the GPU as an inference endpoint. |
| Compute routing | compute_router.py | Unifies Colab tunnels, the Colab AI agent, and LMStudio behind one inference interface; tracks per-account quotas and tiers, falls back gracefully. |
| Account & auth plumbing | google_account_pool.py, github_account_importer.py, har_parser.py, har_extractor.py, rpcid_updater.py, rpc_proxy.py | Round-robin multi-account cookie pool, HAR → pool import, and a live rpcid updater so rotated Google RPC IDs self-heal from the YAML registry. |
| Other | google_aim_client.py, homeassistant.py, anythingllm.py, artifact_bus.py | Google AI Mode (udm=50) search threads, Home Assistant control, AnythingLLM bridge, and a cross-service artifact bus. |
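The "falls back gracefully" behaviour attributed to compute_router.py can be illustrated as an ordered fallback chain. This is a sketch only: the `route` function and the backend names are hypothetical, and quota or tier tracking is left out entirely.

```python
from typing import Callable, Sequence, Tuple

# A backend is a (name, inference function) pair; both are placeholders here.
Backend = Tuple[str, Callable[[str], str]]


def route(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order and return (backend_name, reply) from the first that works."""
    errors = []
    for name, infer in backends:
        try:
            return name, infer(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # remember why this backend was skipped
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def colab_tunnel(prompt: str) -> str:
    raise ConnectionError("tunnel closed")  # simulate an unreachable remote GPU


def local_model(prompt: str) -> str:
    return prompt.upper()  # stand-in for a local inference call


used_backend, reply = route("hi", [("colab_tunnel", colab_tunnel), ("lmstudio", local_model)])
```

Ordering the list from most capable to most available means the local model acts as the floor: a request only fails when every backend, including the one on the same machine, is down.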
Secure by construction
Secrets never touch the repo. Clients read keys from os.environ (e.g. appcatalyst_client.py resolves APPCATALYST_API_KEY / GOOGLE_API_KEY, aistudio_client.py loads a rotating key list from GOOGLE_AISTUDIO_KEYS) and cookies from a gitignored pool. The repo ships only structure:
.env.example # committed — shows the shape, no real values
.env / .env.local # gitignored
config/secrets.yaml # gitignored; *.example.* committed
data/accounts/pool.json, data/credentials/, **/client_secret*.json # gitignored
# .gitignore — v1.61.0: "never commit real values"
.env*
config/secrets.yaml
data/credentials/
**/*credentials*.json
**/client_secret*.json
See docs/CONFIGURATION.md for the full secret layout.
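Reading secrets from the environment with a fallback name and a key list might look like the sketch below. The environment variable names come from the text above; the helper names `resolve_key` and `load_key_list`, and the assumption that the key list is comma-separated, are illustrative and not confirmed by the source.

```python
import os
from typing import List, Mapping, Optional, Sequence


def resolve_key(names: Sequence[str], env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty value among the given variable names, else None."""
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def load_key_list(name: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Split a multi-key variable into a list (assumes comma-separated values)."""
    raw = env.get(name, "")
    return [key.strip() for key in raw.split(",") if key.strip()]


# Fake environment for demonstration; real values would live in a gitignored .env.
fake_env = {"GOOGLE_API_KEY": "demo-key", "GOOGLE_AISTUDIO_KEYS": "k1, k2,,k3"}
api_key = resolve_key(["APPCATALYST_API_KEY", "GOOGLE_API_KEY"], fake_env)
studio_keys = load_key_list("GOOGLE_AISTUDIO_KEYS", fake_env)
```

Passing the environment as a parameter keeps the helpers testable without touching real credentials, which fits the "ships only structure" rule.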
Standalone Apps (apps/*.py)
Every major subsystem has a thin, self-contained CLI entry point. They share apps/_bootstrap.py, which auto-re-execs into .venv/Scripts/python.exe (no manual activation), puts the project root on sys.path, and sets the CWD, then forwards to the engine. The apps are facades: the real logic lives in engine/, so an app and its in-process callers always behave identically.
| App | Purpose |
|---|---|
| apps/nexus.py | Nexus KMS: search, ask, add knowledge, sessions, NLM (docs/NEXUS.md) |
| apps/argus.py | Web-app recon: HAR/heap mining, bundle decompile, CDP scripting (docs/ARGUS.md) |
| apps/lmstudio.py | Local LLM status, model list, quick inference, benchmark |
| apps/oracle.py | System diagnostics: health, error aggregation, traces, perf |
| apps/ask.py | Unified query router → Copilot (38 models) / NotebookLM / LMStudio |
| apps/filestore.py | Gemini File Search managed RAG: store CRUD, upload, query |
| apps/training.py | Dataset + fine-tuning pipeline, benchmarks, live-traffic curation |
| apps/cdp.py, apps/har.py, apps/heap.py | Chrome DevTools, HAR, and V8 heap toolkits |
| apps/account.py, apps/launch.py, apps/cleanup.py, apps/test.py | Account pool, scene launcher, disk cleanup, smart test runner |
Multi-protocol AI gateway
Two proxy servers turn the whole stack into an OpenAI/Anthropic/Gemini-compatible endpoint — point any existing tool at it and get frontier models:
- apps/multi_proxy.py → scripts/model_proxy_direct.py on :5801, zero-conversion: each protocol serializes straight to/from the Copilot backend with no intermediate format (≈7× faster). OpenAI, Anthropic, and Gemini request shapes are all served natively, including tool-call parsing.
- apps/proxy.py → on :5800: the original normalized gateway.

python apps/multi_proxy.py --default opus --list-models # serve all 3 protocols on :5801
The Unified CLI (cli.py)
cli.py is the front door — 16 commands in four groups, each routing to a script, module, or app via the venv. Run it from anywhere; it handles the environment for you.
AI & Models: ask nlm nexus filestore proxy
Analysis: argus har heap cdp
Operations: oracle test scene launch cleanup
Accounts: account
python cli.py ask "Explain the interceptor pipeline" # → Copilot / NLM / local
python cli.py nexus search "economy ticks" # local knowledge base
python cli.py filestore bootstrap-all # Gemini managed RAG over the codebase
python cli.py account import github.har # HAR cookies → account pool
python cli.py argus har capture.har --report # deep API recon
python cli.py oracle --errors # what's broken, ranked
How a command reaches the engine
The throughline across all three layers: one engine, many faces. A cookie captured by cli.py account, a notebook seeded by apps/nexus.py, and a GPU tunnel opened by compute_router are equally available to a Flask scene, a skill, or your own script — which is exactly what makes this a useful reference implementation for agentic, local-first systems.