{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreid32hvy4ldfphzdl4uxkhn2uk4s5unreicpu7lh5vrf7hlf4tlysy",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mijrwuf2iqp2"
},
"path": "/t/avoid-re-encoding-reference-images-in-vision-llm-when-comparison-criteria-are-user-defined/174897#post_1",
"publishedAt": "2026-04-02T13:19:06.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hi everyone,\n\nI’m working with a Vision-LLM (like Qwen-VL / LLaVA / llama.cpp-based multimodal models) where I need to compare new images against reference images. The key part of my use case is that **users define the comparison criteria** (e.g., fur length, ear shape, color patterns), and I’m using **image-to-text models** to evaluate how well a new image matches a reference according to these criteria.\n\nCurrently, every time I send a prompt including the reference images, the model **re-encodes them from scratch**. From the logs, I can see:\nllama-server\n\n\n encoding image slice...\n image slice encoded in 3800–4800 ms\n decoding image batch ...\n\n\nEven for the same reference images, this happens **every single request** , which makes inference slow.\n\nQuestions:\n\n * Has anyone dealt with **user-defined comparison criteria** in Vision-LLM pipelines?\n\n * Are there ways to **cache or pre-load reference images** in llama.cpp / Hugging Face pipelines to avoid repeated encoding?\n\n * What are recommended strategies to efficiently compare new images against a set of references **using image-to-text models** without reprocessing the reference images each time?\n\n\n\n\nThanks in advance for any advice or examples!",
"title": "Avoid Re-encoding Reference Images in Vision-LLM When Comparison Criteria Are User-Defined"
}