{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreialm5dk6toj67nfl2kthn5p2l4ixvuq63h4nktcwatd47bmhr4vsa",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3ml4t4rgy4pm2"
},
"path": "/t/managing-memory-when-trying-to-process-multiple-files/175768#post_1",
"publishedAt": "2026-05-05T17:07:42.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hey all,\n\nI’m trying to explore making a code inspector (Mythos at home) with huggingface models. I’m currently working with gemma4 and and while I can load the smaller versions just fine, when I try to add a bunch of source code to a prompt I get errors saying I don’t have enough memory. One was trying to allocate ~1.7TB\n\nI’ve made a function\n\n\n def query_llm(system_message, user_message, assistant_message):\n messages = [\n {\"role\": \"system\", \"content\": system_message},\n {\"role\": \"user\", \"content\": user_message},\n {\"role\": \"assistant\", \"content\": assistant_message},\n ]\n\n text = processor.apply_chat_template(\n messages,\n tokenize=False,\n add_generation_prompt=True,\n enable_thinking=False\n )\n\n inputs = processor(text=text, return_tensors=\"pt\").to(model.device)\n input_len = inputs[\"input_ids\"].shape[-1]\n # Generate output\n outputs = model.generate(**inputs, max_new_tokens=1024)\n response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)\n return response\n\n\nAnd I’m passing the code in as the assistant messages. Is this just the wrong approach? Is there any wisdom/guidance on how to go about doing local code analysis?",
"title": "Managing memory when trying to process multiple files"
}