{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid32hvy4ldfphzdl4uxkhn2uk4s5unreicpu7lh5vrf7hlf4tlysy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mik7f6y7zf32"
  },
  "path": "/t/avoid-re-encoding-reference-images-in-vision-llm-when-comparison-criteria-are-user-defined/174897#post_1",
  "publishedAt": "2026-04-02T13:19:06.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi everyone,\n\nI’m working with a Vision-LLM (e.g., Qwen-VL, LLaVA, or llama.cpp-based multimodal models) where I need to compare new images against reference images. The key part of my use case is that **users define the comparison criteria** (e.g., fur length, ear shape, color patterns), and I’m using **image-to-text models** to evaluate how well a new image matches a reference according to these criteria.\n\nCurrently, every time I send a prompt that includes the reference images, the model **re-encodes them from scratch**. In the llama-server logs, I can see:\n\n    encoding image slice...\n    image slice encoded in 3800–4800 ms\n    decoding image batch ...\n\nThis happens on **every single request**, even when the reference images are unchanged, which makes inference slow.\n\nQuestions:\n\n  * Has anyone dealt with **user-defined comparison criteria** in Vision-LLM pipelines?\n\n  * Are there ways to **cache or pre-load reference images** in llama.cpp / Hugging Face pipelines to avoid repeated encoding?\n\n  * What strategies are recommended for efficiently comparing new images against a set of references **using image-to-text models** without reprocessing the reference images each time?\n\nThanks in advance for any advice or examples!",
  "title": "Avoid Re-encoding Reference Images in Vision-LLM When Comparison Criteria Are User-Defined"
}