Managing memory when trying to process multiple files

Hugging Face Forums [Unofficial] May 5, 2026

Yeah. Seems something is wrong:

Managing memory when processing many source files with local Hugging Face models

Yes: passing source code as the assistant message is the wrong approach for this use case.

There are two problems:

Chat-role problem: assistant means “this is something the model previously said.” Your source code is user-provided evidence, not prior model output.
Architecture problem: putting many source files into one huge prompt is not a scalable local code-inspection strategy. You want scanning, chunking, search, retrieval, ranking, summaries, and token budgeting before the model sees the prompt.

Better mental model:

Do not ask Gemma to hold the whole repository in the prompt.

Use normal code-intelligence tools to find relevant evidence.
Then ask Gemma to reason over that selected evidence.

Gemma 4’s large context window helps, but long context is still expensive. More tokens mean more tokenization, prefill, KV-cache memory, latency, and noise. Large context is useful for selected evidence, not for dumping a whole repo into every request.

Useful references:

Hugging Face chat templates
Hugging Face KV-cache docs
Gemma 4 Transformers docs
Aider repo map
Continue custom code RAG guide
Sourcegraph: how Cody understands your codebase
Tree-sitter
SQLite FTS5

What is wrong with the current function?

Your function:

def query_llm(system_message, user_message, assistant_message):
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message},
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    inputs = processor(text=text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    outputs = model.generate(**inputs, max_new_tokens=1024)
    response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
    return response

1. `assistant_message` is being misused

In chat templates, roles matter:

system: instructions about model behavior
user: user-provided request/context
assistant: prior model output

So this:

{"role": "assistant", "content": assistant_message}

means:

The assistant previously said this.

If assistant_message contains source code, the effective conversation becomes:

System: You are a code inspector.
User: Please inspect this.
Assistant: <giant pile of source code>
Assistant: continue here...

That is not the right structure. You want:

System: You are a code inspector.
User: Here is selected source code. Please inspect it.
Assistant: <model answer>

Source code should usually be inside the user message as evidence.

2. `add_generation_prompt=True` makes this worse

add_generation_prompt=True adds the model’s “now answer as assistant” marker.

With your current structure, the model sees a huge assistant turn and is asked to continue. That can make it treat code as prior assistant output rather than inspection context.

3. `max_new_tokens` does not limit input size

This:

outputs = model.generate(**inputs, max_new_tokens=1024)

limits only the number of new output tokens.

It does not limit input tokens.

So this is still possible:

input prompt: 150,000 tokens
max_new_tokens: 1,024
result: still OOM

You need to count input tokens before generation.

4. The ~1.7TB allocation is probably tensor/context blow-up

A “tried to allocate ~1.7TB” error usually does not mean your source text literally needs 1.7TB.

It often means an internal tensor became enormous because of:

very long sequence length
prefill/attention memory
KV-cache memory
inefficient attention path
batching or cache implementation
model/library-specific shape issue
adapter/PEFT/structured-generation interaction

Long context is not merely stored; it is processed. The model must prefill over the prompt before producing the first token, then keep key/value cache state during generation.
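
To get a feel for how a long prompt can turn into a terabyte-scale allocation, a rough estimate helps. The sketch below uses the usual size formulas for a KV cache and for a fully materialized attention-score matrix. The layer, head, and dimension values are made-up placeholders, not Gemma's real configuration, so treat the results as orders of magnitude only.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys + values stored for every layer and every token (batch size 1)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def attention_scores_bytes(seq_len: int, heads: int,
                           bytes_per_elem: int = 2) -> int:
    """One layer's full (seq_len x seq_len) score matrix across all heads.

    Only relevant when the attention path materializes the whole matrix
    (for example a naive "eager" implementation); fused kernels avoid it.
    """
    return heads * seq_len * seq_len * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
for tokens in (12_000, 150_000):
    kv = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)
    scores = attention_scores_bytes(tokens, heads=16)
    print(f"{tokens:>7,} tokens: KV cache ~{kv / 1e9:.1f} GB, "
          f"score matrix ~{scores / 1e9:.1f} GB")
```

With these placeholder dimensions the quadratic score matrix dominates long prompts: at 150,000 tokens it is about 720 GB for a single layer, while the KV cache grows only linearly with prompt length.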

Relevant docs:

Hugging Face KV-cache docs
Cache strategies / offloading

First corrected version

For a simple one-shot inspection, use only system and user:

def query_llm(system_message: str, user_message: str, max_new_tokens: int = 1024) -> str:
    messages = [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message.strip()},
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    inputs = processor(text=text, return_tensors="pt").to(model.device)
    input_len = inputs["input_ids"].shape[-1]

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )

    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()

Put code into the user message:

system_message = """
You are a local code inspection assistant.

Rules:
- Analyze only the provided code context.
- Do not invent behavior from files that are not shown.
- If context is insufficient, say exactly what file or symbol is needed.
- Return findings with file paths, line ranges, severity, evidence, and suggested fixes.
"""

user_message = """
Task:
Inspect the following code for correctness, security, and maintainability issues.

Code context:
<file path="src/auth/session.py" lines="1-120">
... source code here ...
</file>

<file path="src/auth/routes.py" lines="1-180">
... source code here ...
</file>
"""

That fixes the role problem, but not the repo-scale memory problem.

Better input construction

You can often let apply_chat_template() tokenize directly:

def query_llm(system_message: str, user_message: str, max_new_tokens: int = 1024) -> str:
    messages = [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message.strip()},
    ]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    ).to(model.device)

    input_len = inputs["input_ids"].shape[-1]

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )

    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()

Still add a hard input-token limit.

Add token counting and hard prompt limits

This is the most important practical fix after correcting roles.

def count_tokens_from_messages(messages: list[dict]) -> int:
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    )
    return inputs["input_ids"].shape[-1]

Then enforce a ceiling:

def query_llm(
    system_message: str,
    user_message: str,
    *,
    max_prompt_tokens: int = 12_000,
    max_new_tokens: int = 1024,
) -> str:
    messages = [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message.strip()},
    ]

    prompt_tokens = count_tokens_from_messages(messages)

    if prompt_tokens > max_prompt_tokens:
        raise ValueError(
            f"Prompt too large: {prompt_tokens:,} tokens. "
            f"Limit is {max_prompt_tokens:,}. "
            "Retrieve fewer files, use smaller chunks, or summarize first."
        )

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    ).to(model.device)

    input_len = inputs["input_ids"].shape[-1]

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )

    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()

Start around:

max_prompt_tokens = 12_000

Even if the model supports much more, start small. Smaller prompts are easier to debug, faster, cheaper, and often better because the evidence is less diluted.

Why whole-repo prompting fails

1. Memory failure

Memory is not just “can I load the model?”

During generation, memory includes:

Bucket	Meaning
Model weights	Memory needed to load Gemma
Input tokens	Tokenized prompt, including source
Prefill	Processing the whole prompt before first output token
KV cache	Stored key/value tensors for prior tokens
Output tokens	New generated tokens
Runtime overhead	PyTorch/CUDA/processor/intermediate buffers

Quantization reduces model-weight memory. By default it leaves the KV cache and activations at their usual precision, so it does not make unlimited context safe.

References:

Transformers quantization
bitsandbytes quantization
KV-cache docs

2. Context-quality failure

Even if the prompt fits, “lots of code” is not the same as “useful evidence.”

A huge prompt with 40 files can be worse than a small prompt with the right 5-10 chunks. The model has to find the relevant lines among irrelevant code.

Useful links:

How Cody understands your codebase
Lessons from building AI coding assistants
AI-assisted coding with Cody paper

3. Debuggability failure

If the answer is bad after a giant prompt, you cannot tell what failed:

Did the model miss the evidence?
Was the relevant file absent?
Was the relevant file truncated?
Was the prompt too noisy?
Was the answer hallucinated?
Was the code split badly?
Was the important symbol hidden in generated/vendor code?

A retrieval pipeline gives inspectable logs:

query
retrieved files
retrieved symbols
chunk scores
token counts
prompt contents
model answer

That makes the system improvable.
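
One lightweight way to get those logs is to write one JSON line per model call. This is only a sketch: the `RetrievalLog` class, its field names, and the `inspect_log.jsonl` file name are illustrative assumptions, not part of any library.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievalLog:
    """One inspectable record per model call (field names are illustrative)."""
    query: str
    retrieved_files: list[str]
    retrieved_symbols: list[str]
    chunk_scores: dict[str, float]
    prompt_tokens: int
    prompt: str
    answer: str
    timestamp: float = field(default_factory=time.time)


def append_log(record: RetrievalLog, log_path: str = "inspect_log.jsonl") -> None:
    """Append the record as one JSON line so runs can be grepped or diffed later."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
```

A JSON Lines file is easy to inspect with ordinary tools, which fits the "CLI with good logs" approach better than a database at this stage.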

Better architecture

A local code inspector should look like this:

Repository
  ↓
File scanner
  ↓
Ignore rules
  ↓
Structural chunker
  ↓
Exact index
  ↓
Semantic index
  ↓
Hybrid retriever
  ↓
Reranker
  ↓
Prompt builder with token budget
  ↓
Gemma
  ↓
Grounded findings

Not this:

Repository
  ↓
Concatenate source files
  ↓
Gemma
  ↓
OOM or vague answer

References:

Aider repo map
Aider FAQ
Continue custom code RAG guide
Sourcegraph Cody context

What to build first

Phase 1: one-file inspector

Do not start with multiple files. Start with one file.

mythos inspect-file src/auth/session.py

Expected behavior:

1. Read one file.
2. Wrap it in a <file> block.
3. Count tokens.
4. Refuse if too large.
5. Ask Gemma for structured findings.
6. Return path, line range, severity, evidence, and suggested fix.

Prompt shape:

Task:
Inspect this file for correctness, security, and maintainability issues.

Code context:
<file path="src/auth/session.py" lines="1-120">
...
</file>

Return:
1. Findings
2. Evidence
3. Risk
4. Suggested fixes
5. Missing context, if any

Success criteria:

- no memory surprises
- token count printed
- stable output format
- line-numbered findings
- model says when it needs more context
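
The first four steps of the one-file inspector could look roughly like this. The function name `build_file_prompt` and the injected `count_tokens` callable are assumptions made for the sketch; in the real tool the counter would wrap the processor-based token count described earlier, and it should count the full chat-templated prompt rather than only this user text.

```python
from pathlib import Path
from typing import Callable


def build_file_prompt(
    path: str,
    count_tokens: Callable[[str], int],
    max_prompt_tokens: int = 12_000,
) -> str:
    """Read one file, wrap it in a <file> block, and refuse oversized prompts."""
    source = Path(path).read_text(encoding="utf-8", errors="replace")
    line_count = max(1, len(source.splitlines()))

    prompt = (
        "Task:\n"
        "Inspect this file for correctness, security, and maintainability issues.\n\n"
        "Code context:\n"
        f'<file path="{path}" lines="1-{line_count}">\n'
        f"{source.rstrip(chr(10))}\n"
        "</file>"
    )

    tokens = count_tokens(prompt)
    print(f"prompt_tokens={tokens:,}")  # always show the size before generating
    if tokens > max_prompt_tokens:
        raise ValueError(
            f"{path}: {tokens:,} prompt tokens exceeds limit {max_prompt_tokens:,}"
        )
    return prompt
```

For a dry run without the model loaded, a crude counter such as `lambda s: len(s.split())` is enough to exercise the refusal path.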

Phase 2: file scanning and ignore rules

Aggressively ignore junk before indexing.

Skip:

.git/
node_modules/
vendor/
dist/
build/
target/
.venv/
venv/
__pycache__/
coverage/
.next/
.nuxt/
.cache/
*.min.js
*.map
generated files
binary files
large media files

Example scanner:

from pathlib import Path

IGNORE_DIRS = {
    ".git", "node_modules", "vendor", "dist", "build", "target",
    ".venv", "venv", "__pycache__", "coverage", ".next", ".nuxt", ".cache",
}

CODE_EXTENSIONS = {
    ".py", ".js", ".ts", ".tsx", ".jsx",
    ".go", ".rs", ".java", ".kt", ".cs",
    ".cpp", ".c", ".h", ".hpp",
    ".rb", ".php", ".swift", ".scala",
    ".sql", ".yaml", ".yml", ".toml", ".json",
}

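def should_skip(path: Path, root: Path) -> bool:
    # Only look at path components below the repo root, so a checkout that
    # lives under e.g. ~/build/ is not skipped wholesale.
    if any(part in IGNORE_DIRS for part in path.relative_to(root).parts):
        return True

    # Compare suffixes case-insensitively (".PY", ".Json", ...).
    if path.suffix.lower() not in CODE_EXTENSIONS:
        return True

    name = path.name.lower()

    return (
        name.endswith(".min.js")
        or name.endswith(".map")
        or "generated" in name
    )

def collect_files(repo_root: str) -> list[Path]:
    root = Path(repo_root)
    return [
        path
        for path in root.rglob("*")
        if path.is_file() and not should_skip(path, root)
    ]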

Later, add .gitignore support.

Phase 3: exact search before embeddings

For code, exact search is not optional.

Code has exact identifiers:

create_session
JWT_SECRET
verify_password
dangerouslySetInnerHTML
deserialize
POST /api/login
process.env
subprocess
eval

Semantic search can miss these. Exact search will not.

Use SQLite FTS5 first.

Schema:

CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    language TEXT,
    kind TEXT,
    symbol TEXT,
    start_line INTEGER,
    end_line INTEGER,
    text TEXT NOT NULL
);

CREATE VIRTUAL TABLE chunks_fts USING fts5(
    path,
    symbol,
    text,
    content='chunks',
    content_rowid='id'
);

Commands:

mythos search "JWT_SECRET"
mythos search "password reset"
mythos search "raw SQL"
mythos search "dangerouslySetInnerHTML"
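
A minimal sketch of indexing and searching with Python's built-in `sqlite3` module follows, assuming the local SQLite build includes FTS5 (most CPython builds do). The helper names `index_chunks` and `search` are hypothetical. One detail worth knowing: an external-content FTS5 table (`content='chunks'`) is not kept in sync automatically, so the sketch rebuilds the index after inserting rows.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    language TEXT,
    kind TEXT,
    symbol TEXT,
    start_line INTEGER,
    end_line INTEGER,
    text TEXT NOT NULL
);
CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(
    path, symbol, text, content='chunks', content_rowid='id'
);
"""


def index_chunks(conn: sqlite3.Connection, chunks: list[dict]) -> None:
    """Insert chunk records, then refresh the full-text index."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO chunks (path, language, kind, symbol, start_line, end_line, text) "
        "VALUES (:path, :language, :kind, :symbol, :start_line, :end_line, :text)",
        chunks,
    )
    # 'rebuild' re-reads every row from the external content table.
    conn.execute("INSERT INTO chunks_fts(chunks_fts) VALUES ('rebuild')")
    conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 50) -> list[tuple]:
    """Return (path, symbol, start_line, end_line) rows, best match first."""
    # Quote the query as one phrase so characters such as '/' or '.'
    # are not parsed as FTS5 query syntax.
    phrase = '"' + query.replace('"', '""') + '"'
    return conn.execute(
        "SELECT c.path, c.symbol, c.start_line, c.end_line "
        "FROM chunks_fts JOIN chunks AS c ON c.id = chunks_fts.rowid "
        "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?",
        (phrase, limit),
    ).fetchall()
```

With the default tokenizer, an identifier such as `JWT_SECRET` is split at the underscore and matched as the phrase "jwt secret", which still finds the identifier. For a large index, incremental updates via triggers would be cheaper than a full rebuild.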

Phase 4: structural chunking

Do not permanently chunk code by arbitrary character count.

Useful code chunks are usually:

function
method
class
route handler
test case
config block
module-level API

Use Tree-sitter where possible.

Chunk record:

{
  "path": "src/auth/session.py",
  "language": "python",
  "kind": "function",
  "symbol": "create_session",
  "start_line": 42,
  "end_line": 89,
  "text": "def create_session(...): ..."
}

This metadata is what lets the model produce useful findings with file paths and lines.
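
As a stand-in for Tree-sitter, the sketch below produces the same kind of chunk record for Python files only, using the standard-library `ast` module. This is a swapped-in technique for illustration; a multi-language tool would still want Tree-sitter. The function name `chunk_python_source` is hypothetical.

```python
import ast


def chunk_python_source(path: str, source: str) -> list[dict]:
    """Emit one chunk record per function, method, or class in a Python file."""
    tree = ast.parse(source)
    kinds = {
        ast.FunctionDef: "function",
        ast.AsyncFunctionDef: "function",
        ast.ClassDef: "class",
    }
    chunks = []
    for node in ast.walk(tree):
        kind = kinds.get(type(node))
        if kind is None:
            continue
        chunks.append({
            "path": path,
            "language": "python",
            "kind": kind,
            "symbol": node.name,
            "start_line": node.lineno,        # 1-based, excludes decorators
            "end_line": node.end_lineno,
            "text": ast.get_source_segment(source, node),
        })
    # ast.walk does not promise source order, so sort by position.
    return sorted(chunks, key=lambda c: (c["start_line"], c["end_line"]))
```

Note that a class chunk contains the text of its methods, which are also emitted as separate chunks, so a later deduplication step may want to drop overlapping ranges.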

Phase 5: semantic search after exact search

After exact search and chunks work, add embeddings.

Semantic search helps with natural-language questions:

Where is authorization enforced?
What handles password reset?
Where do we parse untrusted input?
How does session refresh work?
What code handles checkout?

Do not replace exact search with embeddings. Use both.

Pipeline:

query
  → exact search top 50
  → vector search top 50
  → merge and deduplicate
  → boost symbols/paths/tests
  → rerank
  → fit prompt budget
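
For the "merge and deduplicate" step, one simple option is reciprocal rank fusion, which combines ranked lists without needing comparable scores. The post does not prescribe this method; it is offered here as one reasonable choice. The function name `merge_ranked` and the constant `k=60` (a commonly used default) are assumptions, and each input list is assumed to contain every chunk id at most once.

```python
def merge_ranked(
    exact_hits: list[str],
    vector_hits: list[str],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse two ranked lists of chunk ids; ids found by both searches rise."""
    scores: dict[str, float] = {}
    for hits in (exact_hits, vector_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            # Each list contributes 1 / (k + rank) for the ids it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because each id appears once in the output, the merge also handles deduplication across the two result sets.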

References:

Continue custom code RAG guide
Sentence Transformers retrieve and rerank
LanceDB codebase RAG

Phase 6: hybrid retrieval

Combine:

exact search
+ semantic search
+ symbol matching
+ path matching
+ test-file heuristics
+ recently-changed-file boosts
+ generated-file penalties

Simple scoring idea:

def score_candidate(candidate: dict, query: str) -> float:
    score = 0.0

    query_lower = query.lower()
    symbol = (candidate.get("symbol") or "").lower()
    path = candidate["path"].lower()
    text = candidate["text"].lower()

    if query_lower == symbol:
        score += 2.0

    if query_lower in text:
        score += 1.5

    if any(part in path for part in query_lower.split()):
        score += 0.7

    if "test" in path or "spec" in path:
        score += 0.5

    if "generated" in path:
        score -= 1.0

    score += candidate.get("semantic_score", 0.0)

    return score

Crude and observable is better than sophisticated and opaque.

Phase 7: reranking

Use reranking after first-stage retrieval works.

1. Exact search top 50.
2. Vector search top 50.
3. Merge and deduplicate.
4. Rerank top 30.
5. Include top 5-12 chunks.
6. Fit token budget.

This keeps prompts small and relevant.
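
The last two steps (cap the chunk count, fit the token budget) can be a short greedy loop. This is a sketch under assumptions: `fit_token_budget` is a hypothetical helper, `count_tokens` is whatever counter the project uses, and the default budget is only a starting point.

```python
from typing import Callable


def fit_token_budget(
    ranked_chunks: list[dict],
    count_tokens: Callable[[str], int],
    budget: int = 8_000,
    max_chunks: int = 12,
) -> list[dict]:
    """Keep the best-ranked chunks that fit; skip any chunk that would overflow."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        if len(selected) >= max_chunks:
            break
        cost = count_tokens(chunk["text"])
        if used + cost > budget:
            continue  # a smaller, lower-ranked chunk may still fit
        selected.append(chunk)
        used += cost
    return selected
```

Skipping an oversized chunk instead of stopping lets smaller chunks further down the ranking use the remaining budget; stopping at the first overflow is the stricter alternative if rank order matters more than fill rate.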

Add a repo map

A repo map bridges single-file inspection and whole-repo awareness.

Aider’s repo map is the key reference: summarize important identifiers and relationships, then select the most relevant parts that fit the token budget.

Simple repo map:

src/auth/
  routes.py
    login(request)
    logout(request)
    refresh_token(request)

  session.py
    create_session(user_id)
    verify_session(token)
    revoke_session(token)

  passwords.py
    hash_password(password)
    verify_password(password, hash)

src/users/
  models.py
    User
  repository.py
    find_user_by_email(email)

For each file, store:

path
language
short summary
public symbols
imports
exports
risk tags
test files

Risk tags:

auth
crypto
sql
filesystem
network
subprocess
deserialization
user-input
secrets
permissions

The repo map gives Gemma broad orientation without raw source bloat.
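
A very small repo map can be generated from the chunk records alone. The sketch below is far simpler than Aider's approach (it does no ranking and no relationship analysis) and only groups symbols by file; `build_repo_map` is a hypothetical helper name.

```python
from collections import defaultdict


def build_repo_map(chunks: list[dict]) -> str:
    """Render path -> symbols as an indented outline, without any source text."""
    symbols_by_path: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        symbol = chunk.get("symbol")
        # Keep first-seen order and drop repeated symbols within a file.
        if symbol and symbol not in symbols_by_path[chunk["path"]]:
            symbols_by_path[chunk["path"]].append(symbol)

    lines = []
    for path in sorted(symbols_by_path):
        lines.append(path)
        lines.extend(f"  {symbol}" for symbol in symbols_by_path[path])
    return "\n".join(lines)
```

The resulting text can be passed as the repository summary in the prompt, subject to the same token budget as everything else.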

Add a missing-context protocol

Tell the model:

If the provided context is insufficient, do not guess.
Return NEED_MORE_CONTEXT with specific files, symbols, tests, or configs needed.

Example:

NEED_MORE_CONTEXT:
- src/auth/passwords.py::verify_password
- tests/test_password_reset.py
- src/config.py::JWT_SECRET

Then retrieve those files/symbols and ask again.

Controlled loop:

initial retrieval
  → Gemma analyzes
  → Gemma asks for missing context
  → retrieve more
  → Gemma produces final answer

This is a good minimal “agent” before using a full agent framework.
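
The controlled loop can be sketched in a few lines. Everything here is an assumption layered on the protocol above: `parse_missing_context` expects the marker on its own line followed by `- item` bullets, and `ask` and `retrieve` are placeholder callables standing in for the model call and the retriever.

```python
import re
from typing import Callable

NEED_MORE = re.compile(r"^NEED_MORE_CONTEXT:?[ \t]*$", re.MULTILINE)


def parse_missing_context(answer: str) -> list[str]:
    """Return the '- item' lines that follow a NEED_MORE_CONTEXT marker."""
    match = NEED_MORE.search(answer)
    if match is None:
        return []
    items = []
    for line in answer[match.end():].splitlines():
        line = line.strip()
        if line.startswith("- "):
            items.append(line[2:].strip())
        elif line:
            break  # first non-bullet, non-blank line ends the request list
    return items


def inspect_with_followups(
    ask: Callable[[list[dict]], str],
    retrieve: Callable[[list[str]], list[dict]],
    initial_chunks: list[dict],
    max_rounds: int = 2,
) -> str:
    """Re-ask with extra context at most max_rounds times, then return the answer."""
    chunks = list(initial_chunks)
    answer = ask(chunks)
    for _ in range(max_rounds):
        wanted = parse_missing_context(answer)
        if not wanted:
            break
        extra = retrieve(wanted)
        if not extra:
            break  # nothing new to show the model; stop instead of looping
        chunks.extend(extra)
        answer = ask(chunks)
    return answer
```

The hard cap on rounds is the important design choice: it keeps the loop bounded even if the model keeps asking for more. The token budget still has to be enforced inside `ask`.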

Better `query_llm` for this project

import torch

def format_code_context(chunks: list[dict]) -> str:
    blocks = []

    for chunk in chunks:
        path = chunk["path"]
        start = chunk.get("start_line", "?")
        end = chunk.get("end_line", "?")
        symbol = chunk.get("symbol", "")

        blocks.append(
            f'<file path="{path}" lines="{start}-{end}" symbol="{symbol}">\n'
            f'{chunk["text"]}\n'
            f'</file>'
        )

    return "\n\n".join(blocks)

def build_messages(
    system_message: str,
    task: str,
    chunks: list[dict],
    *,
    repo_summary: str = "",
) -> list[dict]:
    code_context = format_code_context(chunks)

    user_message = f"""
Task:
{task}

Repository summary:
{repo_summary}

Code context:
{code_context}

Instructions:
Return findings with file paths and line ranges.
Classify each finding as HIGH, MEDIUM, LOW, or INFO.
Distinguish confirmed issues from possible risks.
If the provided context is insufficient, return NEED_MORE_CONTEXT with specific files or symbols needed.
""".strip()

    return [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_message},
    ]

def query_llm(
    system_message: str,
    task: str,
    chunks: list[dict],
    *,
    repo_summary: str = "",
    max_prompt_tokens: int = 12_000,
    max_new_tokens: int = 1024,
    use_cache_offload: bool = False,
) -> str:
    messages = build_messages(
        system_message=system_message,
        task=task,
        chunks=chunks,
        repo_summary=repo_summary,
    )

    count_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
    )

    prompt_tokens = count_inputs["input_ids"].shape[-1]

    if prompt_tokens > max_prompt_tokens:
        raise ValueError(
            f"Prompt too large: {prompt_tokens:,} tokens. "
            f"Limit: {max_prompt_tokens:,}. "
            "Retrieve fewer chunks, use smaller chunks, or summarize first."
        )

    print(f"prompt_tokens={prompt_tokens:,}")
    print(f"max_new_tokens={max_new_tokens:,}")
    print(f"chunk_count={len(chunks):,}")

    inputs = {
        key: value.to(model.device)
        for key, value in count_inputs.items()
        if hasattr(value, "to")
    }

    input_len = inputs["input_ids"].shape[-1]

    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
    }

    if use_cache_offload:
        generation_kwargs["cache_implementation"] = "offloaded"

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            **generation_kwargs,
        )

    return processor.decode(
        outputs[0][input_len:],
        skip_special_tokens=True,
    ).strip()

Key changes:

- no code in assistant role
- code is user-supplied context
- prompt tokens are counted before generation
- hard prompt limit prevents surprise OOM
- chunks include paths and line ranges
- deterministic generation by default
- optional KV-cache offload
- model can request missing context

Suggested system prompt

You are a local code inspection assistant.

Rules:
- Analyze only the provided context.
- Do not assume behavior from files that are not shown.
- If context is insufficient, return NEED_MORE_CONTEXT with specific files or symbols.
- Prefer concrete findings over generic advice.
- Include file paths and line ranges.
- Classify severity as HIGH, MEDIUM, LOW, or INFO.
- Distinguish confirmed issues from possible risks.
- Do not claim something is vulnerable unless the provided code supports it.

Suggested output format

Finding 1
Severity: HIGH
Status: Confirmed / Likely / Speculative
File: src/auth/session.py:42-61

Issue:
...

Evidence:
...

Why it matters:
...

Suggested fix:
...

Missing context:
...

Suggested commands

Start with narrow, inspectable CLI commands.

mythos inspect-file src/auth/session.py
mythos search "JWT_SECRET"
mythos inspect-symbol create_session
mythos inspect-topic "password reset security"
mythos inspect-diff

inspect-diff will probably become the most useful mode because diffs naturally limit context size.
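
As a starting point for a diff mode, the hunk headers of a unified diff already say which line ranges of each file to retrieve. The sketch below parses them; `changed_line_ranges` is a hypothetical helper, and it assumes git-style output where new paths are prefixed with `b/`. An added line whose own content begins with `++ ` would be misread as a file header, which a production parser would need to handle.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_line_ranges(diff_text: str) -> dict[str, list[tuple[int, int]]]:
    """Map each changed file to the (start, end) line span of each hunk
    in the new version of the file (spans include context lines)."""
    ranges: dict[str, list[tuple[int, int]]] = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            target = line[4:].strip()
            # Deleted files point at /dev/null; git prefixes paths with "b/".
            current = None if target == "/dev/null" else target.removeprefix("b/")
        elif current is not None:
            match = HUNK_HEADER.match(line)
            if match:
                start = int(match.group(1))
                length = int(match.group(2)) if match.group(2) is not None else 1
                if length > 0:  # length 0 means the hunk only deletes lines
                    ranges.setdefault(current, []).append((start, start + length - 1))
    return ranges
```

These ranges can then be intersected with the chunk records' `start_line` and `end_line` to pick the changed functions.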

Token budget recommendations

Start conservatively:

Prompt part	Initial budget
System prompt	500-1,000 tokens
User task	100-500 tokens
Repo summary / repo map	1,000-4,000 tokens
Retrieved code	6,000-16,000 tokens
Answer room	1,000-2,000 tokens
Total initial target	8K-24K tokens

Do not start at the theoretical context maximum.

Memory mitigations after architecture fixes

1. Quantization

Use quantization to reduce model-weight memory:

import torch
from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-4-E2B-it"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    attn_implementation="sdpa",
)

Quantization helps load the model. It does not make unlimited prompts safe.

References:

Transformers quantization
bitsandbytes quantization

2. Cache offloading

If a reasonable prompt still OOMs:

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
    cache_implementation="offloaded",
)

This trades speed for lower GPU memory pressure. Use it after retrieval and token budgeting, not instead of them.

3. Smaller output cap

For inspection, try:

max_new_tokens = 512

Or split:

pass 1: brief diagnosis
pass 2: expand only if needed

Evaluation plan

Evaluate retrieval separately from model reasoning.

Small local eval set:

- question: "Where is session creation implemented?"
  expected_files:
    - src/auth/session.py
    - src/auth/routes.py

- question: "What validates password reset tokens?"
  expected_files:
    - src/auth/password_reset.py
    - tests/test_password_reset.py

- question: "Is project deletion authorization enforced?"
  expected_files:
    - src/projects/routes.py
    - src/auth/middleware.py
    - tests/test_project_permissions.py

- question: "Find places where raw SQL is constructed."
  expected_files:
    - src/db/search.py

Score:

Metric	Meaning
Retrieval recall	Did the right files appear?
Retrieval precision	Were included chunks relevant?
Prompt efficiency	Tokens per useful answer
Grounding	Did answer cite included code?
Missing-context quality	Did model ask for the right missing file?
Diagnosis quality	Was the inspection correct?
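
The two retrieval metrics are easy to compute per question from the eval set. This sketch scores at file level, which is only a proxy for chunk-level relevance; `retrieval_scores` is a hypothetical helper name.

```python
def retrieval_scores(
    expected_files: list[str],
    retrieved_files: list[str],
) -> dict[str, float]:
    """File-level recall and precision for one eval question."""
    expected = set(expected_files)
    retrieved = set(retrieved_files)
    hits = expected & retrieved
    return {
        # Share of expected files that were retrieved.
        "recall": len(hits) / len(expected) if expected else 1.0,
        # Share of retrieved files that were expected.
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
    }
```

Averaging these over the eval set before looking at any model answers makes it clear whether a bad answer is a retrieval problem or a reasoning problem.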

References:

RepoBench
SWE-bench
Hugging Face RAG evaluation cookbook

What to avoid for now

Avoid fine-tuning

Your immediate issue is not that Gemma lacks code-inspection behavior. It is that the right context is not being selected, packed, and grounded.

Fix this first:

context selection
prompt structure
token budget
retrieval evaluation

before this:

fine-tuning

Avoid a full agent framework at first

Start with simple tools:

search_text
search_symbol
open_file
inspect_file
inspect_diff

Then add a controlled missing-context loop.

Avoid embeddings-only retrieval

Code has exact names. Use exact search first, embeddings second.

Avoid UI work early

A CLI with good logs is more useful than a polished interface.

You need to see:

retrieved files
retrieved symbols
chunk scores
token counts
prompt size
model answer

Realistic build order

Milestone 1: stable one-file review

mythos inspect-file src/auth/session.py

Must have:

- correct chat roles
- token counting
- hard prompt limit
- structured findings

Milestone 2: exact search

mythos search "JWT_SECRET"

Must have:

- scanner
- ignore rules
- SQLite FTS5 or ripgrep-style search
- file path and line numbers

Milestone 3: inspect symbol

mythos inspect-symbol create_session

Must have:

- definition lookup
- caller lookup
- test lookup if possible

Milestone 4: inspect topic

mythos inspect-topic "password reset security"

Must have:

- exact search
- semantic search
- merged ranking
- token-budgeted prompt

Milestone 5: repo map

Must have:

- directory summary
- file summary
- public symbols
- imports/exports
- risk tags

Milestone 6: missing-context loop

Must have:

- model can request specific files/symbols
- system retrieves them
- second-pass answer improves

Milestone 7: diff inspection

mythos inspect-diff

Must have:

- changed functions
- surrounding code
- tests
- likely regression/security risks

Final recommendation

For this case:

Stop passing code as an assistant message.
Use system for behavior and user for task + selected code context.
Count prompt tokens before every generation.
Set an initial prompt cap around 12K tokens.
Build a one-file inspector first.
Add exact search before embeddings.
Chunk code by functions/classes/methods where possible.
Add semantic search after exact search works.
Use hybrid retrieval and reranking.
Add an Aider-style repo map.
Use NEED_MORE_CONTEXT instead of letting the model guess.
Evaluate retrieval separately from final answers.

Short summary

The current approach is wrong for source-code context.
The assistant role is not a context bucket.
The 1.7TB allocation is probably internal tensor/context blow-up, not literal text size.
max_new_tokens does not limit input size.
Quantization helps model-weight memory, not unlimited prompt size.
Long-context Gemma is useful, but retrieval is still required.
Build a code evidence engine: scanner → chunks → exact search → semantic search → rerank → prompt → Gemma.
The model should inspect selected evidence, not swallow the whole repository.
