Managing memory when trying to process multiple files
Yeah. Seems something is wrong:
Managing memory when processing many source files with local Hugging Face models
Yes: passing source code as the assistant message is the wrong approach for this use case.
There are two problems:
- Chat-role problem: the assistant role means “this is something the model previously said.” Your source code is user-provided evidence, not prior model output.
- Architecture problem: putting many source files into one huge prompt is not a scalable local code-inspection strategy. You want scanning, chunking, search, retrieval, ranking, summaries, and token budgeting before the model sees the prompt.
Better mental model:
Do not ask Gemma to hold the whole repository in the prompt.
Use normal code-intelligence tools to find relevant evidence.
Then ask Gemma to reason over that selected evidence.
Gemma 4’s large context window helps, but long context is still expensive. More tokens mean more tokenization, prefill, KV-cache memory, latency, and noise. Large context is useful for selected evidence, not for dumping a whole repo into every request.
Useful references:
- Hugging Face chat templates
- Hugging Face KV-cache docs
- Gemma 4 Transformers docs
- Aider repo map
- Continue custom code RAG guide
- Sourcegraph: how Cody understands your codebase
- Tree-sitter
- SQLite FTS5
What is wrong with the current function?
Your function:
def query_llm(system_message, user_message, assistant_message):
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message},
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )
    inputs = processor(text=text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(**inputs, max_new_tokens=1024)
    response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
    return response
1. assistant_message is being misused
In chat templates, roles matter:
system: instructions about model behavior
user: user-provided request/context
assistant: prior model output
So this:
{"role": "assistant", "content": assistant_message}
means:
The assistant previously said this.
If assistant_message contains source code, the effective conversation becomes:
System: You are a code inspector.
User: Please inspect this.
Assistant: <giant pile of source code>
Assistant: continue here...
That is not the right structure. You want:
System: You are a code inspector.
User: Here is selected source code. Please inspect it.
Assistant: <model answer>
Source code should usually be inside the user message as evidence.
2. add_generation_prompt=True makes this worse
add_generation_prompt=True adds the model’s “now answer as assistant” marker.
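With your current structure, the rendered prompt typically ends with a completed assistant turn full of source code, followed by a fresh assistant header. The model is then asked to respond to what looks like its own previous output, so it may treat the code as something it already said rather than as evidence to inspect.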
3. max_new_tokens does not limit input size
This:
outputs = model.generate(**inputs, max_new_tokens=1024)
limits only the number of new output tokens.
It does not limit input tokens.
So this is still possible:
input prompt: 150,000 tokens
max_new_tokens: 1,024
result: still OOM
You need to count input tokens before generation.
4. The ~1.7TB allocation is probably tensor/context blow-up
A “tried to allocate ~1.7TB” error usually does not mean your source text literally needs 1.7TB.
It often means an internal tensor became enormous because of:
- very long sequence length
- prefill/attention memory
- KV-cache memory
- inefficient attention path
- batching or cache implementation
- model/library-specific shape issue
- adapter/PEFT/structured-generation interaction
Long context is not merely stored; it is processed. The model must prefill over the prompt before producing the first token, then keep key/value cache state during generation.
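To see how sequence length alone can produce terabyte-scale allocation requests, here is a back-of-the-envelope sketch. The model shape below (32 layers, 8 KV heads, 32 attention heads, head dimension 128, 2-byte values) is a made-up illustration, not Gemma's actual configuration, and the score-matrix estimate only applies when an attention path materializes the full matrix.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Keys plus values: 2 tensors per layer, each
    [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * bytes_per_value


def full_attention_scores_bytes(seq_len, n_heads, bytes_per_value=2, batch=1):
    """One layer's full score matrix, [batch, n_heads, seq_len, seq_len],
    as materialized by a non-fused attention implementation."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
SHAPE = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for tokens in (12_000, 150_000):
        kv = kv_cache_bytes(tokens, **SHAPE)
        scores = full_attention_scores_bytes(tokens, n_heads=32)
        print(f"{tokens:>7,} tokens: "
              f"KV cache {kv / 1e9:.1f} GB, "
              f"score matrix {scores / 1e12:.2f} TB")
```

The KV cache grows linearly with prompt length, but a materialized score matrix grows quadratically. Under these assumed numbers, a 150,000-token prompt asks for about 1.44 TB for a single layer's scores, which is the same order of magnitude as the error you saw.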
Relevant docs:
- Hugging Face KV-cache docs
- Cache strategies / offloading
First corrected version
For a simple one-shot inspection, use only system and user:
def query_llm(system_message: str, user_message: str, max_new_tokens: int = 1024) -> str:
    messages = [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message.strip()},
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    inputs = processor(text=text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()
Put code into the user message:
system_message = """
You are a local code inspection assistant.
Rules:
- Analyze only the provided code context.
- Do not invent behavior from files that are not shown.
- If context is insufficient, say exactly what file or symbol is needed.
- Return findings with file paths, line ranges, severity, evidence, and suggested fixes.
"""
user_message = """
Task:
Inspect the following code for correctness, security, and maintainability issues.
Code context:
<file path="src/auth/session.py" lines="1-120">
... source code here ...
</file>
<file path="src/auth/routes.py" lines="1-180">
... source code here ...
</file>
"""
That fixes the role problem, but not the repo-scale memory problem.
Better input construction
You can often let apply_chat_template() tokenize directly:
def query_llm(system_message: str, user_message: str, max_new_tokens: int = 1024) -> str:
    messages = [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message.strip()},
    ]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    ).to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()
Still add a hard input-token limit.
Add token counting and hard prompt limits
This is the most important practical fix after correcting roles.
def count_tokens_from_messages(messages: list[dict]) -> int:
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    )
    return inputs["input_ids"].shape[-1]
Then enforce a ceiling:
def query_llm(
    system_message: str,
    user_message: str,
    *,
    max_prompt_tokens: int = 12_000,
    max_new_tokens: int = 1024,
) -> str:
    messages = [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message.strip()},
    ]
    prompt_tokens = count_tokens_from_messages(messages)
    if prompt_tokens > max_prompt_tokens:
        raise ValueError(
            f"Prompt too large: {prompt_tokens:,} tokens. "
            f"Limit is {max_prompt_tokens:,}. "
            "Retrieve fewer files, use smaller chunks, or summarize first."
        )
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    ).to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()
Start around:
max_prompt_tokens = 12_000
Even if the model supports much more, start small. Smaller prompts are easier to debug, faster, cheaper, and often better because the evidence is less diluted.
Why whole-repo prompting fails
1. Memory failure
Memory is not just “can I load the model?”
During generation, memory includes:
| Bucket | Meaning |
|---|---|
| Model weights | Memory needed to load Gemma |
| Input tokens | Tokenized prompt, including source |
| Prefill | Processing the whole prompt before first output token |
| KV cache | Stored key/value tensors for prior tokens |
| Output tokens | New generated tokens |
| Runtime overhead | PyTorch/CUDA/processor/intermediate buffers |
Quantization helps model-weight memory. It does not make unlimited context safe.
References:
- Transformers quantization
- bitsandbytes quantization
- KV-cache docs
2. Context-quality failure
Even if the prompt fits, “lots of code” is not the same as “useful evidence.”
A huge prompt with 40 files can be worse than a small prompt with the right 5-10 chunks. The model has to find the relevant lines among irrelevant code.
Useful links:
- How Cody understands your codebase
- Lessons from building AI coding assistants
- AI-assisted coding with Cody paper
3. Debuggability failure
If the answer is bad after a giant prompt, you cannot tell what failed:
Did the model miss the evidence?
Was the relevant file absent?
Was the relevant file truncated?
Was the prompt too noisy?
Was the answer hallucinated?
Was the code split badly?
Was the important symbol hidden in generated/vendor code?
A retrieval pipeline gives inspectable logs:
query
retrieved files
retrieved symbols
chunk scores
token counts
prompt contents
model answer
That makes the system improvable.
Better architecture
A local code inspector should look like this:
Repository
↓
File scanner
↓
Ignore rules
↓
Structural chunker
↓
Exact index
↓
Semantic index
↓
Hybrid retriever
↓
Reranker
↓
Prompt builder with token budget
↓
Gemma
↓
Grounded findings
Not this:
Repository
↓
Concatenate source files
↓
Gemma
↓
OOM or vague answer
References:
- Aider repo map
- Aider FAQ
- Continue custom code RAG guide
- Sourcegraph Cody context
What to build first
Phase 1: one-file inspector
Do not start with multiple files. Start with one file.
mythos inspect-file src/auth/session.py
Expected behavior:
1. Read one file.
2. Wrap it in a <file> block.
3. Count tokens.
4. Refuse if too large.
5. Ask Gemma for structured findings.
6. Return path, line range, severity, evidence, and suggested fix.
Prompt shape:
Task:
Inspect this file for correctness, security, and maintainability issues.
Code context:
<file path="src/auth/session.py" lines="1-120">
...
</file>
Return:
1. Findings
2. Evidence
3. Risk
4. Suggested fixes
5. Missing context, if any
Success criteria:
- no memory surprises
- token count printed
- stable output format
- line-numbered findings
- model says when it needs more context
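Steps 1 and 2 of the expected behavior can be sketched as below. The helper names (format_file_block, wrap_file) and the line-number gutter format are my own choices for illustration, not part of any library. Prefixing each line with its number gives the model something concrete to cite in line-numbered findings.

```python
from pathlib import Path


def format_file_block(path: str, text: str, start_line: int = 1) -> str:
    """Wrap source text in a <file> block with a line-number gutter."""
    lines = text.splitlines()
    end_line = start_line + len(lines) - 1
    numbered = "\n".join(
        f"{number:>4} | {line}"
        for number, line in enumerate(lines, start=start_line)
    )
    return f'<file path="{path}" lines="{start_line}-{end_line}">\n{numbered}\n</file>'


def wrap_file(path: str) -> str:
    """Read one file from disk and wrap it; undecodable bytes are replaced."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return format_file_block(Path(path).as_posix(), text)
```

The result goes into the user message. Count its tokens with the real tokenizer before calling the model.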
Phase 2: file scanning and ignore rules
Aggressively ignore junk before indexing.
Skip:
.git/
node_modules/
vendor/
dist/
build/
target/
.venv/
venv/
__pycache__/
coverage/
.next/
.nuxt/
.cache/
*.min.js
*.map
generated files
binary files
large media files
Example scanner:
from pathlib import Path

IGNORE_DIRS = {
    ".git", "node_modules", "vendor", "dist", "build", "target",
    ".venv", "venv", "__pycache__", "coverage", ".next", ".nuxt", ".cache",
}

CODE_EXTENSIONS = {
    ".py", ".js", ".ts", ".tsx", ".jsx",
    ".go", ".rs", ".java", ".kt", ".cs",
    ".cpp", ".c", ".h", ".hpp",
    ".rb", ".php", ".swift", ".scala",
    ".sql", ".yaml", ".yml", ".toml", ".json",
}


def should_skip(path: Path) -> bool:
    if any(part in IGNORE_DIRS for part in path.parts):
        return True
    if path.suffix not in CODE_EXTENSIONS:
        return True
    name = path.name.lower()
    return (
        name.endswith(".min.js")
        or name.endswith(".map")
        or "generated" in name
    )


def collect_files(repo_root: str) -> list[Path]:
    root = Path(repo_root)
    return [
        path
        for path in root.rglob("*")
        if path.is_file() and not should_skip(path)
    ]
Later, add .gitignore support.
Phase 3: exact search before embeddings
For code, exact search is not optional.
Code has exact identifiers:
create_session
JWT_SECRET
verify_password
dangerouslySetInnerHTML
deserialize
POST /api/login
process.env
subprocess
eval
Semantic search can miss these. Exact search will not.
Use SQLite FTS5 first.
Schema:
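CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    language TEXT,
    kind TEXT,
    symbol TEXT,
    start_line INTEGER,
    end_line INTEGER,
    text TEXT NOT NULL
);

-- External-content FTS5 table: it indexes chunks but is not updated automatically.
-- Keep it in sync with triggers or by running the FTS5 'rebuild' command.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    path,
    symbol,
    text,
    content='chunks',
    content_rowid='id'
);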
Commands:
mythos search "JWT_SECRET"
mythos search "password reset"
mythos search "raw SQL"
mythos search "dangerouslySetInnerHTML"
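A search command like these could sit on top of that schema roughly as follows. This is a sketch that assumes your Python's sqlite3 module was built with FTS5 (most builds are). The function names build_index and search are made up for illustration. Queries are wrapped as a quoted FTS5 phrase so identifiers containing punctuation are not parsed as query syntax.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY, path TEXT NOT NULL, language TEXT, kind TEXT,
    symbol TEXT, start_line INTEGER, end_line INTEGER, text TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    path, symbol, text, content='chunks', content_rowid='id'
);
"""


def build_index(chunks: list[dict], db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO chunks (path, language, kind, symbol, start_line, end_line, text) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (c["path"], c.get("language"), c.get("kind"), c.get("symbol"),
             c.get("start_line"), c.get("end_line"), c["text"])
            for c in chunks
        ],
    )
    # External-content tables are not filled automatically: rebuild the index.
    conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[tuple]:
    # Quote the query as a single phrase; double any embedded quotes.
    phrase = '"' + query.replace('"', '""') + '"'
    rowids = [
        row[0]
        for row in conn.execute(
            "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? "
            "ORDER BY rank LIMIT ?",
            (phrase, limit),
        )
    ]
    results = []
    for rowid in rowids:
        results.append(
            conn.execute(
                "SELECT path, symbol, start_line, end_line FROM chunks WHERE id = ?",
                (rowid,),
            ).fetchone()
        )
    return results
```

Note that the default tokenizer splits on underscores, so JWT_SECRET is matched as the adjacent tokens jwt and secret. That is usually what you want for identifiers, but it is worth knowing when a search returns more than expected.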
Phase 4: structural chunking
Do not permanently chunk code by arbitrary character count.
Useful code chunks are usually:
function
method
class
route handler
test case
config block
module-level API
Use Tree-sitter where possible.
Chunk record:
{
    "path": "src/auth/session.py",
    "language": "python",
    "kind": "function",
    "symbol": "create_session",
    "start_line": 42,
    "end_line": 89,
    "text": "def create_session(...): ..."
}
This metadata is what lets the model produce useful findings with file paths and lines.
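To show what structural chunking produces, here is a sketch that uses Python's standard-library ast module as a stand-in for Tree-sitter. It only handles Python source, whereas Tree-sitter covers many languages, and the function name chunk_python_source is my own. The records follow the chunk shape shown above.

```python
import ast


def chunk_python_source(path: str, source: str) -> list[dict]:
    """Return one chunk record per function, method, and class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # Start at the first decorator, if any, so it stays with its definition.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        end = node.end_lineno
        chunks.append({
            "path": path,
            "language": "python",
            "kind": "class" if isinstance(node, ast.ClassDef) else "function",
            "symbol": node.name,
            "start_line": start,
            "end_line": end,
            "text": "\n".join(lines[start - 1:end]),
        })
    chunks.sort(key=lambda chunk: chunk["start_line"])
    return chunks
```

Methods come out as their own chunks and also appear inside their class chunk. Whether to keep both or only the smaller unit is a design choice; keeping both lets retrieval pick the granularity that fits the token budget.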
Phase 5: semantic search after exact search
After exact search and chunks work, add embeddings.
Semantic search helps with natural-language questions:
Where is authorization enforced?
What handles password reset?
Where do we parse untrusted input?
How does session refresh work?
What code handles checkout?
Do not replace exact search with embeddings. Use both.
Pipeline:
query
→ exact search top 50
→ vector search top 50
→ merge and deduplicate
→ boost symbols/paths/tests
→ rerank
→ fit prompt budget
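The merge-and-deduplicate step needs some way to combine two rankings whose scores are not comparable. One common option is reciprocal rank fusion, which I am suggesting here as a simple default rather than something the pipeline above requires. The sketch assumes each retriever returns chunk ids in best-first order.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, limit=30):
    """Merge several best-first id lists into one deduplicated ranking.

    Each list contributes 1 / (k + rank) per id, so ids that appear
    near the top of several lists rise to the top of the result.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:limit]
```

Because it uses ranks only, it does not care that BM25 scores and cosine similarities live on different scales. The symbol, path, and test boosts can be applied afterwards as a separate, logged step.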
References:
- Continue custom code RAG guide
- Sentence Transformers retrieve and rerank
- LanceDB codebase RAG
Phase 6: hybrid retrieval
Combine:
exact search
+ semantic search
+ symbol matching
+ path matching
+ test-file heuristics
+ recently-changed-file boosts
+ generated-file penalties
Simple scoring idea:
def score_candidate(candidate: dict, query: str) -> float:
    score = 0.0
    query_lower = query.lower()
    symbol = (candidate.get("symbol") or "").lower()
    path = candidate["path"].lower()
    text = candidate["text"].lower()
    if query_lower == symbol:
        score += 2.0
    if query_lower in text:
        score += 1.5
    if any(part in path for part in query_lower.split()):
        score += 0.7
    if "test" in path or "spec" in path:
        score += 0.5
    if "generated" in path:
        score -= 1.0
    score += candidate.get("semantic_score", 0.0)
    return score
Crude and observable is better than sophisticated and opaque.
Phase 7: reranking
Use reranking after first-stage retrieval works.
1. Exact search top 50.
2. Vector search top 50.
3. Merge and deduplicate.
4. Rerank top 30.
5. Include top 5-12 chunks.
6. Fit token budget.
This keeps prompts small and relevant.
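Steps 5 and 6 can be sketched as a greedy packer. The names pack_chunks and rough_token_count are hypothetical, and the four-characters-per-token estimate is only a placeholder. In practice, pass a function that counts with the model's real tokenizer.

```python
def rough_token_count(text: str) -> int:
    # Crude placeholder: assume about 4 characters per token.
    return max(1, len(text) // 4)


def pack_chunks(ranked_chunks, budget_tokens, count_tokens=rough_token_count,
                max_chunks=12):
    """Take chunks in ranked order until the token budget is used up.

    A chunk that does not fit is skipped rather than ending the loop,
    so a smaller, lower-ranked chunk can still use the remaining room.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        if len(selected) >= max_chunks:
            break
        cost = count_tokens(chunk["text"])
        if used + cost > budget_tokens:
            continue
        selected.append(chunk)
        used += cost
    return selected, used
```

Log the returned token total alongside the selected paths. The final check against max_prompt_tokens should still run on the fully rendered chat template, since the template and the <file> wrappers add tokens of their own.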
Add a repo map
A repo map bridges single-file inspection and whole-repo awareness.
Aider’s repo map is the key reference: summarize important identifiers and relationships, then select the most relevant parts that fit the token budget.
Simple repo map:
src/auth/
routes.py
login(request)
logout(request)
refresh_token(request)
session.py
create_session(user_id)
verify_session(token)
revoke_session(token)
passwords.py
hash_password(password)
verify_password(password, hash)
src/users/
models.py
User
repository.py
find_user_by_email(email)
For each file, store:
path
language
short summary
public symbols
imports
exports
risk tags
test files
Risk tags:
auth
crypto
sql
filesystem
network
subprocess
deserialization
user-input
secrets
permissions
The repo map gives Gemma broad orientation without raw source bloat.
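If you already have chunk records with path and symbol fields, a first repo map can be rendered straight from them. This sketch is far simpler than what Aider does (no ranking, no signatures, no relationships) and, unlike the example above, groups by file rather than by directory. The function name render_repo_map is my own.

```python
from collections import defaultdict


def render_repo_map(chunks: list[dict]) -> str:
    """Render 'path' lines followed by indented symbol names."""
    symbols_by_path = defaultdict(list)
    for chunk in chunks:
        symbol = chunk.get("symbol")
        # Skip chunks without a symbol and avoid listing a symbol twice.
        if symbol and symbol not in symbols_by_path[chunk["path"]]:
            symbols_by_path[chunk["path"]].append(symbol)
    lines = []
    for path in sorted(symbols_by_path):
        lines.append(path)
        lines.extend(f"  {symbol}" for symbol in symbols_by_path[path])
    return "\n".join(lines)
```

Count the tokens of the rendered map like any other prompt part, and trim it to the directories relevant to the task when it grows past its budget.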
Add a missing-context protocol
Tell the model:
If the provided context is insufficient, do not guess.
Return NEED_MORE_CONTEXT with specific files, symbols, tests, or configs needed.
Example:
NEED_MORE_CONTEXT:
- src/auth/passwords.py::verify_password
- tests/test_password_reset.py
- src/config.py::JWT_SECRET
Then retrieve those files/symbols and ask again.
Controlled loop:
initial retrieval
→ Gemma analyzes
→ Gemma asks for missing context
→ retrieve more
→ Gemma produces final answer
This is a good minimal “agent” before using a full agent framework.
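The controlled loop above can be sketched in a few lines. Here ask_model and retrieve are hypothetical callables you would supply (for example, a wrapper around query_llm and a lookup against your index), and the parser assumes the model follows the path::symbol format from the example.

```python
def parse_missing_context(response: str) -> list[tuple]:
    """Extract (path, symbol-or-None) pairs from a NEED_MORE_CONTEXT block."""
    requests = []
    in_block = False
    for raw_line in response.splitlines():
        line = raw_line.strip()
        if line.upper().startswith("NEED_MORE_CONTEXT"):
            in_block = True
        elif in_block and line.startswith("- "):
            path, _, symbol = line[2:].strip().partition("::")
            requests.append((path, symbol or None))
        elif in_block and line:
            break  # first non-bullet line ends the block
    return requests


def inspect_with_followup(ask_model, retrieve, task, chunks, max_rounds=2):
    """Ask once, then re-ask with extra context at most max_rounds times."""
    answer = ask_model(task, chunks)
    for _ in range(max_rounds):
        missing = parse_missing_context(answer)
        if not missing:
            break
        extra = retrieve(missing)
        if not extra:
            break  # nothing new to show the model; stop instead of looping
        chunks = chunks + extra
        answer = ask_model(task, chunks)
    return answer
```

The round limit matters: without it, a model that keeps asking for more context would loop until the prompt cap rejects the request.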
Better query_llm for this project
import torch

# Assumes `processor` and `model` are already loaded at module level.


def format_code_context(chunks: list[dict]) -> str:
    blocks = []
    for chunk in chunks:
        path = chunk["path"]
        start = chunk.get("start_line", "?")
        end = chunk.get("end_line", "?")
        # Fall back to "" so a None symbol is not rendered as the string "None".
        symbol = chunk.get("symbol") or ""
        blocks.append(
            f'<file path="{path}" lines="{start}-{end}" symbol="{symbol}">\n'
            f'{chunk["text"]}\n'
            f'</file>'
        )
    return "\n\n".join(blocks)


def build_messages(
    system_message: str,
    task: str,
    chunks: list[dict],
    *,
    repo_summary: str = "",
) -> list[dict]:
    code_context = format_code_context(chunks)
    user_message = f"""
Task:
{task}
Repository summary:
{repo_summary}
Code context:
{code_context}
Instructions:
Return findings with file paths and line ranges.
Classify each finding as HIGH, MEDIUM, LOW, or INFO.
Distinguish confirmed issues from possible risks.
If the provided context is insufficient, return NEED_MORE_CONTEXT with specific files or symbols needed.
""".strip()
    return [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message},
    ]


def query_llm(
    system_message: str,
    task: str,
    chunks: list[dict],
    *,
    repo_summary: str = "",
    max_prompt_tokens: int = 12_000,
    max_new_tokens: int = 1024,
    use_cache_offload: bool = False,
) -> str:
    messages = build_messages(
        system_message=system_message,
        task=task,
        chunks=chunks,
        repo_summary=repo_summary,
    )
    # Tokenize once on CPU so the prompt can be measured before touching the GPU.
    count_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    )
    prompt_tokens = count_inputs["input_ids"].shape[-1]
    if prompt_tokens > max_prompt_tokens:
        raise ValueError(
            f"Prompt too large: {prompt_tokens:,} tokens. "
            f"Limit: {max_prompt_tokens:,}. "
            "Retrieve fewer chunks, use smaller chunks, or summarize first."
        )
    print(f"prompt_tokens={prompt_tokens:,}")
    print(f"max_new_tokens={max_new_tokens:,}")
    print(f"chunk_count={len(chunks):,}")
    inputs = {
        key: value.to(model.device)
        for key, value in count_inputs.items()
        if hasattr(value, "to")
    }
    input_len = inputs["input_ids"].shape[-1]
    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
    }
    if use_cache_offload:
        generation_kwargs["cache_implementation"] = "offloaded"
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            **generation_kwargs,
        )
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()
Key changes:
- no code in assistant role
- code is user-supplied context
- prompt tokens are counted before generation
- hard prompt limit prevents surprise OOM
- chunks include paths and line ranges
- deterministic generation by default
- optional KV-cache offload
- model can request missing context
Suggested system prompt
You are a local code inspection assistant.
Rules:
- Analyze only the provided context.
- Do not assume behavior from files that are not shown.
- If context is insufficient, return NEED_MORE_CONTEXT with specific files or symbols.
- Prefer concrete findings over generic advice.
- Include file paths and line ranges.
- Classify severity as HIGH, MEDIUM, LOW, or INFO.
- Distinguish confirmed issues from possible risks.
- Do not claim something is vulnerable unless the provided code supports it.
Suggested output format
Finding 1
Severity: HIGH
Status: Confirmed / Likely / Speculative
File: src/auth/session.py:42-61
Issue:
...
Evidence:
...
Why it matters:
...
Suggested fix:
...
Missing context:
...
Suggested commands
Start with narrow, inspectable CLI commands.
mythos inspect-file src/auth/session.py
mythos search "JWT_SECRET"
mythos inspect-symbol create_session
mythos inspect-topic "password reset security"
mythos inspect-diff
inspect-diff will probably become the most useful mode because diffs naturally limit context size.
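A possible starting point for inspect-diff is to read the hunk headers of a unified diff and retrieve only the chunks that overlap them. The sketch below assumes git-style output (for example from git diff), and hunk_ranges is a hypothetical helper name. The ranges it returns cover each hunk in the new file, including unchanged context lines.

```python
import re

# Matches hunk headers such as "@@ -10,3 +10,4 @@ def f():".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def hunk_ranges(diff_text: str) -> dict:
    """Map each changed file to a list of (start_line, end_line) hunks."""
    ranges = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            target = line[4:].strip()
            if target == "/dev/null":
                current = None  # file was deleted; nothing to inspect
            else:
                current = target[2:] if target.startswith("b/") else target
            continue
        match = HUNK_RE.match(line)
        if current and match:
            start = int(match.group(1))
            length = int(match.group(2)) if match.group(2) is not None else 1
            if length > 0:
                ranges.setdefault(current, []).append((start, start + length - 1))
    return ranges
```

Feed the ranges into the chunk index to select the functions that overlap each hunk. One known limitation of this sketch: an added line whose content itself begins with "++ " would be mistaken for a file header.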
Token budget recommendations
Start conservatively:
| Prompt part | Initial budget |
|---|---|
| System prompt | 500-1,000 tokens |
| User task | 100-500 tokens |
| Repo summary / repo map | 1,000-4,000 tokens |
| Retrieved code | 6,000-16,000 tokens |
| Answer room | 1,000-2,000 tokens |
| Total initial target | 8K-24K tokens |
Do not start at the theoretical context maximum.
Memory mitigations after architecture fixes
1. Quantization
Use quantization to reduce model-weight memory:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E2B-it"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    attn_implementation="sdpa",
)
Quantization helps load the model. It does not make unlimited prompts safe.
References:
- Transformers quantization
- bitsandbytes quantization
2. Cache offloading
If a reasonable prompt still OOMs:
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
    cache_implementation="offloaded",
)
This trades speed for lower GPU memory pressure. Use it after retrieval and token budgeting, not instead of them.
3. Smaller output cap
For inspection, try:
max_new_tokens = 512
Or split:
pass 1: brief diagnosis
pass 2: expand only if needed
Evaluation plan
Evaluate retrieval separately from model reasoning.
Small local eval set:
- question: "Where is session creation implemented?"
  expected_files:
    - src/auth/session.py
    - src/auth/routes.py
- question: "What validates password reset tokens?"
  expected_files:
    - src/auth/password_reset.py
    - tests/test_password_reset.py
- question: "Is project deletion authorization enforced?"
  expected_files:
    - src/projects/routes.py
    - src/auth/middleware.py
    - tests/test_project_permissions.py
- question: "Find places where raw SQL is constructed."
  expected_files:
    - src/db/search.py
Score:
| Metric | Meaning |
|---|---|
| Retrieval recall | Did the right files appear? |
| Retrieval precision | Were included chunks relevant? |
| Prompt efficiency | Tokens per useful answer |
| Grounding | Did answer cite included code? |
| Missing-context quality | Did model ask for the right missing file? |
| Diagnosis quality | Was the inspection correct? |
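The first two metrics can be computed at file level with a few lines. This is a minimal sketch with a hypothetical helper name; it treats a file as retrieved if any of its chunks made it into the prompt.

```python
def retrieval_metrics(expected_files, retrieved_files) -> dict:
    """File-level recall and precision for one eval question."""
    expected = set(expected_files)
    retrieved = set(retrieved_files)
    hits = expected & retrieved
    return {
        # Share of the expected files that were actually retrieved.
        "recall": len(hits) / len(expected) if expected else 1.0,
        # Share of the retrieved files that were expected.
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
    }
```

Run it per question and average across the eval set. Low recall points at the scanner, chunker, or search; low precision points at ranking and the prompt budget.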
References:
- RepoBench
- SWE-bench
- Hugging Face RAG evaluation cookbook
What to avoid for now
Avoid fine-tuning
Your immediate issue is not that Gemma lacks code-inspection behavior. It is that the right context is not being selected, packed, and grounded.
Fix this first:
context selection
prompt structure
token budget
retrieval evaluation
before this:
fine-tuning
Avoid a full agent framework at first
Start with simple tools:
search_text
search_symbol
open_file
inspect_file
inspect_diff
Then add a controlled missing-context loop.
Avoid embeddings-only retrieval
Code has exact names. Use exact search first, embeddings second.
Avoid UI work early
A CLI with good logs is more useful than a polished interface.
You need to see:
retrieved files
retrieved symbols
chunk scores
token counts
prompt size
model answer
Realistic build order
Milestone 1: stable one-file review
mythos inspect-file src/auth/session.py
Must have:
- correct chat roles
- token counting
- hard prompt limit
- structured findings
Milestone 2: exact search
mythos search "JWT_SECRET"
Must have:
- scanner
- ignore rules
- SQLite FTS5 or ripgrep-style search
- file path and line numbers
Milestone 3: inspect symbol
mythos inspect-symbol create_session
Must have:
- definition lookup
- caller lookup
- test lookup if possible
Milestone 4: inspect topic
mythos inspect-topic "password reset security"
Must have:
- exact search
- semantic search
- merged ranking
- token-budgeted prompt
Milestone 5: repo map
Must have:
- directory summary
- file summary
- public symbols
- imports/exports
- risk tags
Milestone 6: missing-context loop
Must have:
- model can request specific files/symbols
- system retrieves them
- second-pass answer improves
Milestone 7: diff inspection
mythos inspect-diff
Must have:
- changed functions
- surrounding code
- tests
- likely regression/security risks
Final recommendation
For this case:
- Stop passing code as an assistant message.
- Use system for behavior and user for task + selected code context.
- Count prompt tokens before every generation.
- Set an initial prompt cap around 12K tokens.
- Build a one-file inspector first.
- Add exact search before embeddings.
- Chunk code by functions/classes/methods where possible.
- Add semantic search after exact search works.
- Use hybrid retrieval and reranking.
- Add an Aider-style repo map.
- Use NEED_MORE_CONTEXT instead of letting the model guess.
- Evaluate retrieval separately from final answers.
Short summary
- The current approach is wrong for source-code context.
- The assistant role is not a context bucket.
- The 1.7TB allocation is probably internal tensor/context blow-up, not literal text size.
- max_new_tokens does not limit input size.
- Quantization helps model-weight memory, not unlimited prompt size.
- Long-context Gemma is useful, but retrieval is still required.
- Build a code evidence engine: scanner → chunks → exact search → semantic search → rerank → prompt → Gemma.
- The model should inspect selected evidence, not swallow the whole repository.