{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreif5sv4tptboc2vec2lsyzsga7oqmv43vpxnwqrvgp7mhbfefofmqi",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3ml4zsqvxjlv2"
},
"path": "/t/managing-memory-when-trying-to-process-multiple-files/175768#post_2",
"publishedAt": "2026-05-05T19:32:07.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Hugging Face chat templates",
"Hugging Face KV-cache docs",
"Gemma 4 Transformers docs",
"Aider repo map",
"Continue custom code RAG guide",
"Sourcegraph: how Cody understands your codebase",
"Tree-sitter",
"SQLite FTS5",
"Cache strategies / offloading",
"Transformers quantization",
"bitsandbytes quantization",
"KV-cache docs",
"How Cody understands your codebase",
"Lessons from building AI coding assistants",
"AI-assisted coding with Cody paper",
"Aider FAQ",
"Sourcegraph Cody context",
"Sentence Transformers retrieve and rerank",
"LanceDB codebase RAG",
"RepoBench",
"SWE-bench",
"Hugging Face RAG evaluation cookbook"
],
"textContent": "Yeah. Seems something is wrong:\n\n* * *\n\n# Managing memory when processing many source files with local Hugging Face models\n\nYes: passing source code as the `assistant` message is the wrong approach for this use case.\n\nThere are two problems:\n\n 1. **Chat-role problem:** `assistant` means “this is something the model previously said.” Your source code is user-provided evidence, not prior model output.\n 2. **Architecture problem:** putting many source files into one huge prompt is not a scalable local code-inspection strategy. You want scanning, chunking, search, retrieval, ranking, summaries, and token budgeting before the model sees the prompt.\n\n\n\nBetter mental model:\n\n\n Do not ask Gemma to hold the whole repository in the prompt.\n\n Use normal code-intelligence tools to find relevant evidence.\n Then ask Gemma to reason over that selected evidence.\n\n\nGemma 4’s large context window helps, but long context is still expensive. More tokens mean more tokenization, prefill, KV-cache memory, latency, and noise. 
Large context is useful for **selected evidence**, not for dumping a whole repo into every request.\n\nUseful references:\n\n * Hugging Face chat templates\n * Hugging Face KV-cache docs\n * Gemma 4 Transformers docs\n * Aider repo map\n * Continue custom code RAG guide\n * Sourcegraph: how Cody understands your codebase\n * Tree-sitter\n * SQLite FTS5\n\n\n\n* * *\n\n## What is wrong with the current function?\n\nYour function:\n\n\n def query_llm(system_message, user_message, assistant_message):\n     messages = [\n         {\"role\": \"system\", \"content\": system_message},\n         {\"role\": \"user\", \"content\": user_message},\n         {\"role\": \"assistant\", \"content\": assistant_message},\n     ]\n\n     text = processor.apply_chat_template(\n         messages,\n         tokenize=False,\n         add_generation_prompt=True,\n         enable_thinking=False\n     )\n\n     inputs = processor(text=text, return_tensors=\"pt\").to(model.device)\n     input_len = inputs[\"input_ids\"].shape[-1]\n     outputs = model.generate(**inputs, max_new_tokens=1024)\n     response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)\n     return response\n\n\n### 1. `assistant_message` is being misused\n\nIn chat templates, roles matter:\n\n * `system`: instructions about model behavior\n * `user`: user-provided request/context\n * `assistant`: prior model output\n\n\n\nSo this:\n\n\n {\"role\": \"assistant\", \"content\": assistant_message}\n\n\nmeans:\n\n\n The assistant previously said this.\n\n\nIf `assistant_message` contains source code, the effective conversation becomes:\n\n\n System: You are a code inspector.\n User: Please inspect this.\n Assistant: <giant pile of source code>\n Assistant: continue here...\n\n\nThat is not the right structure. You want:\n\n\n System: You are a code inspector.\n User: Here is selected source code. Please inspect it.\n Assistant: <model answer>\n\n\nSource code should usually be inside the **user** message as evidence.\n\n* * *\n\n### 2. 
`add_generation_prompt=True` makes this worse\n\n`add_generation_prompt=True` adds the model’s “now answer as assistant” marker.\n\nWith your current structure, the model sees a huge assistant turn and is asked to continue. That can make it treat code as prior assistant output rather than inspection context.\n\n* * *\n\n### 3. `max_new_tokens` does not limit input size\n\nThis:\n\n\n outputs = model.generate(**inputs, max_new_tokens=1024)\n\n\nlimits only the number of **new output tokens**.\n\nIt does not limit input tokens.\n\nSo this is still possible:\n\n\n input prompt: 150,000 tokens\n max_new_tokens: 1,024\n result: still OOM\n\n\nYou need to count input tokens before generation.\n\n* * *\n\n### 4. The ~1.7TB allocation is probably tensor/context blow-up\n\nA “tried to allocate ~1.7TB” error usually does not mean your source text literally needs 1.7TB.\n\nIt often means an internal tensor became enormous because of:\n\n * very long sequence length\n * prefill/attention memory\n * KV-cache memory\n * inefficient attention path\n * batching or cache implementation\n * model/library-specific shape issue\n * adapter/PEFT/structured-generation interaction\n\n\n\nLong context is not merely stored; it is processed. 
The model must prefill over the prompt before producing the first token, then keep key/value cache state during generation.\n\nRelevant docs:\n\n * Hugging Face KV-cache docs\n * Cache strategies / offloading\n\n\n\n* * *\n\n## First corrected version\n\nFor a simple one-shot inspection, use only `system` and `user`:\n\n\n def query_llm(system_message: str, user_message: str, max_new_tokens: int = 1024) -> str:\n     messages = [\n         {\"role\": \"system\", \"content\": system_message.strip()},\n         {\"role\": \"user\", \"content\": user_message.strip()},\n     ]\n\n     text = processor.apply_chat_template(\n         messages,\n         tokenize=False,\n         add_generation_prompt=True,\n         enable_thinking=False,\n     )\n\n     inputs = processor(text=text, return_tensors=\"pt\").to(model.device)\n     input_len = inputs[\"input_ids\"].shape[-1]\n\n     outputs = model.generate(\n         **inputs,\n         max_new_tokens=max_new_tokens,\n         do_sample=False,\n     )\n\n     return processor.decode(\n         outputs[0][input_len:],\n         skip_special_tokens=True,\n     ).strip()\n\n\nPut code into the user message:\n\n\n system_message = \"\"\"\n You are a local code inspection assistant.\n\n Rules:\n - Analyze only the provided code context.\n - Do not invent behavior from files that are not shown.\n - If context is insufficient, say exactly what file or symbol is needed.\n - Return findings with file paths, line ranges, severity, evidence, and suggested fixes.\n \"\"\"\n\n user_message = \"\"\"\n Task:\n Inspect the following code for correctness, security, and maintainability issues.\n\n Code context:\n <file path=\"src/auth/session.py\" lines=\"1-120\">\n ... source code here ...\n </file>\n\n <file path=\"src/auth/routes.py\" lines=\"1-180\">\n ... 
source code here ...\n </file>\n \"\"\"\n\n\nThat fixes the role problem, but not the repo-scale memory problem.\n\n* * *\n\n## Better input construction\n\nYou can often let `apply_chat_template()` tokenize directly:\n\n\n def query_llm(system_message: str, user_message: str, max_new_tokens: int = 1024) -> str:\n     messages = [\n         {\"role\": \"system\", \"content\": system_message.strip()},\n         {\"role\": \"user\", \"content\": user_message.strip()},\n     ]\n\n     inputs = processor.apply_chat_template(\n         messages,\n         tokenize=True,\n         return_dict=True,\n         return_tensors=\"pt\",\n         add_generation_prompt=True,\n         enable_thinking=False,\n     ).to(model.device)\n\n     input_len = inputs[\"input_ids\"].shape[-1]\n\n     outputs = model.generate(\n         **inputs,\n         max_new_tokens=max_new_tokens,\n         do_sample=False,\n     )\n\n     return processor.decode(\n         outputs[0][input_len:],\n         skip_special_tokens=True,\n     ).strip()\n\n\nStill add a hard input-token limit.\n\n* * *\n\n## Add token counting and hard prompt limits\n\nThis is the most important practical fix after correcting roles.\n\n\n def count_tokens_from_messages(messages: list[dict]) -> int:\n     inputs = processor.apply_chat_template(\n         messages,\n         tokenize=True,\n         return_dict=True,\n         return_tensors=\"pt\",\n         add_generation_prompt=True,\n         enable_thinking=False,\n     )\n     return inputs[\"input_ids\"].shape[-1]\n\n\nThen enforce a ceiling:\n\n\n def query_llm(\n     system_message: str,\n     user_message: str,\n     *,\n     max_prompt_tokens: int = 12_000,\n     max_new_tokens: int = 1024,\n ) -> str:\n     messages = [\n         {\"role\": \"system\", \"content\": system_message.strip()},\n         {\"role\": \"user\", \"content\": user_message.strip()},\n     ]\n\n     prompt_tokens = count_tokens_from_messages(messages)\n\n     if prompt_tokens > max_prompt_tokens:\n         raise ValueError(\n             f\"Prompt too large: {prompt_tokens:,} tokens. \"\n             f\"Limit is {max_prompt_tokens:,}. 
\"\n             \"Retrieve fewer files, use smaller chunks, or summarize first.\"\n         )\n\n     inputs = processor.apply_chat_template(\n         messages,\n         tokenize=True,\n         return_dict=True,\n         return_tensors=\"pt\",\n         add_generation_prompt=True,\n         enable_thinking=False,\n     ).to(model.device)\n\n     input_len = inputs[\"input_ids\"].shape[-1]\n\n     outputs = model.generate(\n         **inputs,\n         max_new_tokens=max_new_tokens,\n         do_sample=False,\n     )\n\n     return processor.decode(\n         outputs[0][input_len:],\n         skip_special_tokens=True,\n     ).strip()\n\n\nStart around:\n\n\n max_prompt_tokens = 12_000\n\n\nEven if the model supports much more, start small. Smaller prompts are easier to debug, faster, cheaper, and often better because the evidence is less diluted.\n\n* * *\n\n## Why whole-repo prompting fails\n\n### 1. Memory failure\n\nMemory is not just “can I load the model?”\n\nDuring generation, memory includes:\n\nBucket | Meaning\n---|---\nModel weights | Memory needed to load Gemma\nInput tokens | Tokenized prompt, including source\nPrefill | Processing the whole prompt before first output token\nKV cache | Stored key/value tensors for prior tokens\nOutput tokens | New generated tokens\nRuntime overhead | PyTorch/CUDA/processor/intermediate buffers\n\nQuantization helps model-weight memory. It does not make unlimited context safe.\n\nReferences:\n\n * Transformers quantization\n * bitsandbytes quantization\n * KV-cache docs\n\n\n\n* * *\n\n### 2. Context-quality failure\n\nEven if the prompt fits, “lots of code” is not the same as “useful evidence.”\n\nA huge prompt with 40 files can be worse than a small prompt with the right 5-10 chunks. The model has to find the relevant lines among irrelevant code.\n\nUseful links:\n\n * How Cody understands your codebase\n * Lessons from building AI coding assistants\n * AI-assisted coding with Cody paper\n\n\n\n* * *\n\n### 3. 
Debuggability failure\n\nIf the answer is bad after a giant prompt, you cannot tell what failed:\n\n\n Did the model miss the evidence?\n Was the relevant file absent?\n Was the relevant file truncated?\n Was the prompt too noisy?\n Was the answer hallucinated?\n Was the code split badly?\n Was the important symbol hidden in generated/vendor code?\n\n\nA retrieval pipeline gives inspectable logs:\n\n\n query\n retrieved files\n retrieved symbols\n chunk scores\n token counts\n prompt contents\n model answer\n\n\nThat makes the system improvable.\n\n* * *\n\n## Better architecture\n\nA local code inspector should look like this:\n\n\n Repository\n ↓\n File scanner\n ↓\n Ignore rules\n ↓\n Structural chunker\n ↓\n Exact index\n ↓\n Semantic index\n ↓\n Hybrid retriever\n ↓\n Reranker\n ↓\n Prompt builder with token budget\n ↓\n Gemma\n ↓\n Grounded findings\n\n\nNot this:\n\n\n Repository\n ↓\n Concatenate source files\n ↓\n Gemma\n ↓\n OOM or vague answer\n\n\nReferences:\n\n * Aider repo map\n * Aider FAQ\n * Continue custom code RAG guide\n * Sourcegraph Cody context\n\n\n\n* * *\n\n## What to build first\n\n### Phase 1: one-file inspector\n\nDo not start with multiple files. Start with one file.\n\n\n mythos inspect-file src/auth/session.py\n\n\nExpected behavior:\n\n\n 1. Read one file.\n 2. Wrap it in a <file> block.\n 3. Count tokens.\n 4. Refuse if too large.\n 5. Ask Gemma for structured findings.\n 6. Return path, line range, severity, evidence, and suggested fix.\n\n\nPrompt shape:\n\n\n Task:\n Inspect this file for correctness, security, and maintainability issues.\n\n Code context:\n <file path=\"src/auth/session.py\" lines=\"1-120\">\n ...\n </file>\n\n Return:\n 1. Findings\n 2. Evidence\n 3. Risk\n 4. Suggested fixes\n 5. 
Missing context, if any\n\n\nSuccess criteria:\n\n\n - no memory surprises\n - token count printed\n - stable output format\n - line-numbered findings\n - model says when it needs more context\n\n\n* * *\n\n### Phase 2: file scanning and ignore rules\n\nAggressively ignore junk before indexing.\n\nSkip:\n\n\n .git/\n node_modules/\n vendor/\n dist/\n build/\n target/\n .venv/\n venv/\n __pycache__/\n coverage/\n .next/\n .nuxt/\n .cache/\n *.min.js\n *.map\n generated files\n binary files\n large media files\n\n\nExample scanner:\n\n\n from pathlib import Path\n\n IGNORE_DIRS = {\n     \".git\", \"node_modules\", \"vendor\", \"dist\", \"build\", \"target\",\n     \".venv\", \"venv\", \"__pycache__\", \"coverage\", \".next\", \".nuxt\", \".cache\",\n }\n\n CODE_EXTENSIONS = {\n     \".py\", \".js\", \".ts\", \".tsx\", \".jsx\",\n     \".go\", \".rs\", \".java\", \".kt\", \".cs\",\n     \".cpp\", \".c\", \".h\", \".hpp\",\n     \".rb\", \".php\", \".swift\", \".scala\",\n     \".sql\", \".yaml\", \".yml\", \".toml\", \".json\",\n }\n\n def should_skip(path: Path, root: Path) -> bool:\n     # Check only the parts below the repo root, so a parent directory\n     # named e.g. \"build\" does not exclude the whole repository.\n     if any(part in IGNORE_DIRS for part in path.relative_to(root).parts):\n         return True\n\n     # Compare extensions case-insensitively.\n     if path.suffix.lower() not in CODE_EXTENSIONS:\n         return True\n\n     name = path.name.lower()\n\n     return (\n         name.endswith(\".min.js\")\n         or name.endswith(\".map\")\n         or \"generated\" in name\n     )\n\n def collect_files(repo_root: str) -> list[Path]:\n     root = Path(repo_root)\n     return [\n         path\n         for path in root.rglob(\"*\")\n         if path.is_file() and not should_skip(path, root)\n     ]\n\n\nLater, add `.gitignore` support.\n\n* * *\n\n### Phase 3: exact search before embeddings\n\nFor code, exact search is not optional.\n\nCode has exact identifiers:\n\n\n create_session\n JWT_SECRET\n verify_password\n dangerouslySetInnerHTML\n deserialize\n POST /api/login\n process.env\n subprocess\n eval\n\n\nSemantic search can miss these. 
Exact search will not.\n\nUse SQLite FTS5 first.\n\nSchema:\n\n\n CREATE TABLE chunks (\n     id INTEGER PRIMARY KEY,\n     path TEXT NOT NULL,\n     language TEXT,\n     kind TEXT,\n     symbol TEXT,\n     start_line INTEGER,\n     end_line INTEGER,\n     text TEXT NOT NULL\n );\n\n CREATE VIRTUAL TABLE chunks_fts USING fts5(\n     path,\n     symbol,\n     text,\n     content='chunks',\n     content_rowid='id'\n );\n\n\nBecause `chunks_fts` is declared with `content='chunks'`, it is an external-content table: SQLite does not fill it automatically, so insert matching rows yourself or keep it in sync with triggers.\n\nCommands:\n\n\n mythos search \"JWT_SECRET\"\n mythos search \"password reset\"\n mythos search \"raw SQL\"\n mythos search \"dangerouslySetInnerHTML\"\n\n\n* * *\n\n### Phase 4: structural chunking\n\nDo not permanently chunk code by arbitrary character count.\n\nUseful code chunks are usually:\n\n\n function\n method\n class\n route handler\n test case\n config block\n module-level API\n\n\nUse Tree-sitter where possible.\n\nChunk record:\n\n\n {\n     \"path\": \"src/auth/session.py\",\n     \"language\": \"python\",\n     \"kind\": \"function\",\n     \"symbol\": \"create_session\",\n     \"start_line\": 42,\n     \"end_line\": 89,\n     \"text\": \"def create_session(...): ...\"\n }\n\n\nThis metadata is what lets the model produce useful findings with file paths and lines.\n\n* * *\n\n### Phase 5: semantic search after exact search\n\nAfter exact search and chunks work, add embeddings.\n\nSemantic search helps with natural-language questions:\n\n\n Where is authorization enforced?\n What handles password reset?\n Where do we parse untrusted input?\n How does session refresh work?\n What code handles checkout?\n\n\nDo not replace exact search with embeddings. 
Use both.\n\nPipeline:\n\n\n query\n → exact search top 50\n → vector search top 50\n → merge and deduplicate\n → boost symbols/paths/tests\n → rerank\n → fit prompt budget\n\n\nReferences:\n\n * Continue custom code RAG guide\n * Sentence Transformers retrieve and rerank\n * LanceDB codebase RAG\n\n\n\n* * *\n\n### Phase 6: hybrid retrieval\n\nCombine:\n\n\n exact search\n + semantic search\n + symbol matching\n + path matching\n + test-file heuristics\n + recently-changed-file boosts\n + generated-file penalties\n\n\nSimple scoring idea:\n\n\n def score_candidate(candidate: dict, query: str) -> float:\n     score = 0.0\n\n     query_lower = query.lower()\n     symbol = (candidate.get(\"symbol\") or \"\").lower()\n     path = candidate[\"path\"].lower()\n     text = candidate[\"text\"].lower()\n\n     if query_lower == symbol:\n         score += 2.0\n\n     if query_lower in text:\n         score += 1.5\n\n     if any(part in path for part in query_lower.split()):\n         score += 0.7\n\n     if \"test\" in path or \"spec\" in path:\n         score += 0.5\n\n     if \"generated\" in path:\n         score -= 1.0\n\n     score += candidate.get(\"semantic_score\", 0.0)\n\n     return score\n\n\nCrude and observable is better than sophisticated and opaque.\n\n* * *\n\n### Phase 7: reranking\n\nUse reranking after first-stage retrieval works.\n\n\n 1. Exact search top 50.\n 2. Vector search top 50.\n 3. Merge and deduplicate.\n 4. Rerank top 30.\n 5. Include top 5-12 chunks.\n 6. 
Fit token budget.\n\n\nThis keeps prompts small and relevant.\n\n* * *\n\n## Add a repo map\n\nA repo map bridges single-file inspection and whole-repo awareness.\n\nAider’s repo map is the key reference: summarize important identifiers and relationships, then select the most relevant parts that fit the token budget.\n\nSimple repo map:\n\n\n src/auth/\n     routes.py\n         login(request)\n         logout(request)\n         refresh_token(request)\n\n     session.py\n         create_session(user_id)\n         verify_session(token)\n         revoke_session(token)\n\n     passwords.py\n         hash_password(password)\n         verify_password(password, hash)\n\n src/users/\n     models.py\n         User\n     repository.py\n         find_user_by_email(email)\n\n\nFor each file, store:\n\n\n path\n language\n short summary\n public symbols\n imports\n exports\n risk tags\n test files\n\n\nRisk tags:\n\n\n auth\n crypto\n sql\n filesystem\n network\n subprocess\n deserialization\n user-input\n secrets\n permissions\n\n\nThe repo map gives Gemma broad orientation without raw source bloat.\n\n* * *\n\n## Add a missing-context protocol\n\nTell the model:\n\n\n If the provided context is insufficient, do not guess.\n Return NEED_MORE_CONTEXT with specific files, symbols, tests, or configs needed.\n\n\nExample:\n\n\n NEED_MORE_CONTEXT:\n - src/auth/passwords.py::verify_password\n - tests/test_password_reset.py\n - src/config.py::JWT_SECRET\n\n\nThen retrieve those files/symbols and ask again.\n\nControlled loop:\n\n\n initial retrieval\n → Gemma analyzes\n → Gemma asks for missing context\n → retrieve more\n → Gemma produces final answer\n\n\nThis is a good minimal “agent” before using a full agent framework.\n\n* * *\n\n## Better `query_llm` for this project\n\n\n import torch\n\n def format_code_context(chunks: list[dict]) -> str:\n     blocks = []\n\n     for chunk in chunks:\n         path = chunk[\"path\"]\n         start = chunk.get(\"start_line\", \"?\")\n         end = chunk.get(\"end_line\", \"?\")\n         symbol = chunk.get(\"symbol\") or \"\"\n\n         blocks.append(\n             f'<file path=\"{path}\" 
lines=\"{start}-{end}\" symbol=\"{symbol}\">\\n'\n             f'{chunk[\"text\"]}\\n'\n             f'</file>'\n         )\n\n     return \"\\n\\n\".join(blocks)\n\n def build_messages(\n     system_message: str,\n     task: str,\n     chunks: list[dict],\n     *,\n     repo_summary: str = \"\",\n ) -> list[dict]:\n     code_context = format_code_context(chunks)\n\n     user_message = f\"\"\"\n Task:\n {task}\n\n Repository summary:\n {repo_summary}\n\n Code context:\n {code_context}\n\n Instructions:\n Return findings with file paths and line ranges.\n Classify each finding as HIGH, MEDIUM, LOW, or INFO.\n Distinguish confirmed issues from possible risks.\n If the provided context is insufficient, return NEED_MORE_CONTEXT with specific files or symbols needed.\n \"\"\".strip()\n\n     return [\n         {\"role\": \"system\", \"content\": system_message.strip()},\n         {\"role\": \"user\", \"content\": user_message},\n     ]\n\n def query_llm(\n     system_message: str,\n     task: str,\n     chunks: list[dict],\n     *,\n     repo_summary: str = \"\",\n     max_prompt_tokens: int = 12_000,\n     max_new_tokens: int = 1024,\n     use_cache_offload: bool = False,\n ) -> str:\n     messages = build_messages(\n         system_message=system_message,\n         task=task,\n         chunks=chunks,\n         repo_summary=repo_summary,\n     )\n\n     count_inputs = processor.apply_chat_template(\n         messages,\n         tokenize=True,\n         return_dict=True,\n         return_tensors=\"pt\",\n         add_generation_prompt=True,\n         enable_thinking=False,\n     )\n\n     prompt_tokens = count_inputs[\"input_ids\"].shape[-1]\n\n     if prompt_tokens > max_prompt_tokens:\n         raise ValueError(\n             f\"Prompt too large: {prompt_tokens:,} tokens. \"\n             f\"Limit: {max_prompt_tokens:,}. 
\"\n             \"Retrieve fewer chunks, use smaller chunks, or summarize first.\"\n         )\n\n     print(f\"prompt_tokens={prompt_tokens:,}\")\n     print(f\"max_new_tokens={max_new_tokens:,}\")\n     print(f\"chunk_count={len(chunks):,}\")\n\n     inputs = {\n         key: value.to(model.device)\n         for key, value in count_inputs.items()\n         if hasattr(value, \"to\")\n     }\n\n     input_len = inputs[\"input_ids\"].shape[-1]\n\n     generation_kwargs = {\n         \"max_new_tokens\": max_new_tokens,\n         \"do_sample\": False,\n     }\n\n     if use_cache_offload:\n         generation_kwargs[\"cache_implementation\"] = \"offloaded\"\n\n     with torch.inference_mode():\n         outputs = model.generate(\n             **inputs,\n             **generation_kwargs,\n         )\n\n     return processor.decode(\n         outputs[0][input_len:],\n         skip_special_tokens=True,\n     ).strip()\n\n\nKey changes:\n\n\n - no code in assistant role\n - code is user-supplied context\n - prompt tokens are counted before generation\n - hard prompt limit prevents surprise OOM\n - chunks include paths and line ranges\n - deterministic generation by default\n - optional KV-cache offload\n - model can request missing context\n\n\n* * *\n\n## Suggested system prompt\n\n\n You are a local code inspection assistant.\n\n Rules:\n - Analyze only the provided context.\n - Do not assume behavior from files that are not shown.\n - If context is insufficient, return NEED_MORE_CONTEXT with specific files or symbols.\n - Prefer concrete findings over generic advice.\n - Include file paths and line ranges.\n - Classify severity as HIGH, MEDIUM, LOW, or INFO.\n - Distinguish confirmed issues from possible risks.\n - Do not claim something is vulnerable unless the provided code supports it.\n\n\n* * *\n\n## Suggested output format\n\n\n Finding 1\n Severity: HIGH\n Status: Confirmed / Likely / Speculative\n File: src/auth/session.py:42-61\n\n Issue:\n ...\n\n Evidence:\n ...\n\n Why it matters:\n ...\n\n Suggested fix:\n ...\n\n Missing context:\n ...\n\n\n* * *\n\n## Suggested commands\n\nStart with narrow, inspectable CLI 
commands.\n\n\n mythos inspect-file src/auth/session.py\n mythos search \"JWT_SECRET\"\n mythos inspect-symbol create_session\n mythos inspect-topic \"password reset security\"\n mythos inspect-diff\n\n\n`inspect-diff` will probably become the most useful mode because diffs naturally limit context size.\n\n* * *\n\n## Token budget recommendations\n\nStart conservatively:\n\nPrompt part | Initial budget\n---|---\nSystem prompt | 500-1,000 tokens\nUser task | 100-500 tokens\nRepo summary / repo map | 1,000-4,000 tokens\nRetrieved code | 6,000-16,000 tokens\nAnswer room | 1,000-2,000 tokens\nTotal initial target | 8K-24K tokens\n\nDo not start at the theoretical context maximum.\n\n* * *\n\n## Memory mitigations after architecture fixes\n\n### 1. Quantization\n\nUse quantization to reduce model-weight memory:\n\n\n import torch\n from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig\n\n model_id = \"google/gemma-4-E2B-it\"\n\n quantization_config = BitsAndBytesConfig(\n     load_in_4bit=True,\n     bnb_4bit_compute_dtype=torch.bfloat16,\n )\n\n processor = AutoProcessor.from_pretrained(model_id)\n\n model = AutoModelForCausalLM.from_pretrained(\n     model_id,\n     device_map=\"auto\",\n     torch_dtype=torch.bfloat16,\n     quantization_config=quantization_config,\n     attn_implementation=\"sdpa\",\n )\n\n\nQuantization helps load the model. It does not make unlimited prompts safe.\n\nReferences:\n\n * Transformers quantization\n * bitsandbytes quantization\n\n\n\n* * *\n\n### 2. Cache offloading\n\nIf a reasonable prompt still OOMs:\n\n\n outputs = model.generate(\n     **inputs,\n     max_new_tokens=1024,\n     do_sample=False,\n     cache_implementation=\"offloaded\",\n )\n\n\nThis trades speed for lower GPU memory pressure. Use it after retrieval and token budgeting, not instead of them.\n\n* * *\n\n### 3. 
Smaller output cap\n\nFor inspection, try:\n\n\n max_new_tokens = 512\n\n\nOr split:\n\n\n pass 1: brief diagnosis\n pass 2: expand only if needed\n\n\n* * *\n\n## Evaluation plan\n\nEvaluate retrieval separately from model reasoning.\n\nSmall local eval set:\n\n\n - question: \"Where is session creation implemented?\"\n   expected_files:\n     - src/auth/session.py\n     - src/auth/routes.py\n\n - question: \"What validates password reset tokens?\"\n   expected_files:\n     - src/auth/password_reset.py\n     - tests/test_password_reset.py\n\n - question: \"Is project deletion authorization enforced?\"\n   expected_files:\n     - src/projects/routes.py\n     - src/auth/middleware.py\n     - tests/test_project_permissions.py\n\n - question: \"Find places where raw SQL is constructed.\"\n   expected_files:\n     - src/db/search.py\n\n\nScore:\n\nMetric | Meaning\n---|---\nRetrieval recall | Did the right files appear?\nRetrieval precision | Were included chunks relevant?\nPrompt efficiency | Tokens per useful answer\nGrounding | Did answer cite included code?\nMissing-context quality | Did model ask for the right missing file?\nDiagnosis quality | Was the inspection correct?\n\nReferences:\n\n * RepoBench\n * SWE-bench\n * Hugging Face RAG evaluation cookbook\n\n\n\n* * *\n\n## What to avoid for now\n\n### Avoid fine-tuning\n\nYour immediate issue is not that Gemma lacks code-inspection behavior. It is that the right context is not being selected, packed, and grounded.\n\nFix this first:\n\n\n context selection\n prompt structure\n token budget\n retrieval evaluation\n\n\nbefore this:\n\n\n fine-tuning\n\n\n### Avoid a full agent framework at first\n\nStart with simple tools:\n\n\n search_text\n search_symbol\n open_file\n inspect_file\n inspect_diff\n\n\nThen add a controlled missing-context loop.\n\n### Avoid embeddings-only retrieval\n\nCode has exact names. 
Use exact search first, embeddings second.\n\n### Avoid UI work early\n\nA CLI with good logs is more useful than a polished interface.\n\nYou need to see:\n\n\n retrieved files\n retrieved symbols\n chunk scores\n token counts\n prompt size\n model answer\n\n\n* * *\n\n## Realistic build order\n\n### Milestone 1: stable one-file review\n\n\n mythos inspect-file src/auth/session.py\n\n\nMust have:\n\n\n - correct chat roles\n - token counting\n - hard prompt limit\n - structured findings\n\n\n### Milestone 2: exact search\n\n\n mythos search \"JWT_SECRET\"\n\n\nMust have:\n\n\n - scanner\n - ignore rules\n - SQLite FTS5 or ripgrep-style search\n - file path and line numbers\n\n\n### Milestone 3: inspect symbol\n\n\n mythos inspect-symbol create_session\n\n\nMust have:\n\n\n - definition lookup\n - caller lookup\n - test lookup if possible\n\n\n### Milestone 4: inspect topic\n\n\n mythos inspect-topic \"password reset security\"\n\n\nMust have:\n\n\n - exact search\n - semantic search\n - merged ranking\n - token-budgeted prompt\n\n\n### Milestone 5: repo map\n\nMust have:\n\n\n - directory summary\n - file summary\n - public symbols\n - imports/exports\n - risk tags\n\n\n### Milestone 6: missing-context loop\n\nMust have:\n\n\n - model can request specific files/symbols\n - system retrieves them\n - second-pass answer improves\n\n\n### Milestone 7: diff inspection\n\n\n mythos inspect-diff\n\n\nMust have:\n\n\n - changed functions\n - surrounding code\n - tests\n - likely regression/security risks\n\n\n* * *\n\n## Final recommendation\n\nFor this case:\n\n 1. Stop passing code as an `assistant` message.\n 2. Use `system` for behavior and `user` for task + selected code context.\n 3. Count prompt tokens before every generation.\n 4. Set an initial prompt cap around 12K tokens.\n 5. Build a one-file inspector first.\n 6. Add exact search before embeddings.\n 7. Chunk code by functions/classes/methods where possible.\n 8. 
Add semantic search after exact search works.\n 9. Use hybrid retrieval and reranking.\n 10. Add an Aider-style repo map.\n 11. Use `NEED_MORE_CONTEXT` instead of letting the model guess.\n 12. Evaluate retrieval separately from final answers.\n\n\n\n## Short summary\n\n * The current approach is wrong for source-code context.\n * The `assistant` role is not a context bucket.\n * The 1.7TB allocation is probably internal tensor/context blow-up, not literal text size.\n * `max_new_tokens` does not limit input size.\n * Quantization helps model-weight memory, not unlimited prompt size.\n * Long-context Gemma is useful, but retrieval is still required.\n * Build a code evidence engine: scanner → chunks → exact search → semantic search → rerank → prompt → Gemma.\n * The model should inspect selected evidence, not swallow the whole repository.\n\n",
"title": "Managing memory when trying to process multiple files"
}