{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreialm5dk6toj67nfl2kthn5p2l4ixvuq63h4nktcwatd47bmhr4vsa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3ml4t4rgy4pm2"
  },
  "path": "/t/managing-memory-when-trying-to-process-multiple-files/175768#post_1",
  "publishedAt": "2026-05-05T17:07:42.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hey all,\n\nI’m exploring building a code inspector (Mythos at home) with Hugging Face models. I’m currently working with gemma4, and while I can load the smaller versions just fine, when I add a large amount of source code to a prompt I get errors saying I don’t have enough memory. One of them was trying to allocate ~1.7 TB.\n\nI’ve written this function:\n\n\n    def query_llm(system_message, user_message, assistant_message):\n        # Build the conversation: system, user, and assistant turns\n        messages = [\n            {\"role\": \"system\", \"content\": system_message},\n            {\"role\": \"user\", \"content\": user_message},\n            {\"role\": \"assistant\", \"content\": assistant_message},\n        ]\n\n        # Render the conversation into a single prompt string (thinking disabled)\n        text = processor.apply_chat_template(\n            messages,\n            tokenize=False,\n            add_generation_prompt=True,\n            enable_thinking=False\n        )\n\n        # Tokenize the prompt and move the tensors to the model's device\n        inputs = processor(text=text, return_tensors=\"pt\").to(model.device)\n        input_len = inputs[\"input_ids\"].shape[-1]\n        # Generate up to 1024 new tokens\n        outputs = model.generate(**inputs, max_new_tokens=1024)\n        # Decode only the newly generated tokens, not the prompt\n        response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)\n        return response\n\n\nI’m passing the source code in as the assistant message. Is this just the wrong approach? Is there any wisdom or guidance on how to go about doing local code analysis?",
  "title": "Managing memory when trying to process multiple files"
}