
{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibvkliqczdompp4p2ge2lgdf3uwunlarrcpydmd5qxxajo3rhnhsq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3milodocdusc2"
  },
  "path": "/t/avoid-re-encoding-reference-images-in-vision-llm-when-comparison-criteria-are-user-defined/174897#post_2",
  "publishedAt": "2026-04-03T05:37:18.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "vLLM",
    "GitHub",
    "Tiger AI Lab",
    "Hugging Face",
    "NVIDIA Docs"
  ],
  "textContent": "> Are there ways to **cache or pre-load reference images** in llama.cpp / Hugging Face pipelines to avoid repeated encoding?\n\nThis is an area where things can change significantly based on a single factor, such as backend or model repository metadata (for example, even when using the same GGUF from Llama.cpp and Ollama, the behaviour of the vision component differs…), so I think it will be quite time-consuming…\nFor now, with VLMs, you’re more likely to be at the mercy of differences in implementation between models than with LLMs, so implementing your own cache can be a hassle…\n\n* * *\n\nThe clean way to think about your problem is:\n\n  * the **reference image is stable** ,\n  * the **comparison rule is not**.\n\n\n\nSo the reusable thing is **not** the final answer and **not** the raw image. It is the reference image’s **post-vision representation** : the output of the vision encoder and projector, plus the metadata the model needs to place those features back into the LLM. Current docs and issue threads across vLLM, Qwen, and NVIDIA’s multimodal serving stack all point to that same boundary. (vLLM)\n\n## What is happening in your current setup\n\nYour `llama-server` log is showing the expensive vision path running again on every request. In a typical VLM stack, the path is:\n\n  1. image preprocessing,\n  2. vision encoder,\n  3. projector or bridge into the LLM space,\n  4. language-model prefill and decode.\n\n\n\nIf you send the same reference image again as raw image input, the server usually repeats steps 1 to 3 unless the runtime explicitly supports caching or injecting precomputed vision features. That is exactly the kind of repeated work vLLM’s multimodal RFC calls out: identical media being re-encoded across requests wastes encoder and projector compute, bandwidth, and scheduling capacity. (GitHub)\n\n## Direct answers\n\n### 1. 
Has anyone dealt with user-defined comparison criteria?\n\nYes.\n\nNot always under that exact phrase, but the closest public work treats the problem as **instruction-aware multimodal matching** rather than fixed similarity. VLM2Vec explicitly supports instruction-following multimodal embeddings for combinations of image and text, and Qwen3-VL-Embedding plus Qwen3-VL-Reranker are built for multimodal retrieval and reranking, with the embedding model handling recall and the reranker handling precise relevance scoring. That is very close to “the user defines what counts as similar this time.” (Tiger AI Lab)\n\n### 2. Can you cache or preload reference images in llama.cpp or Hugging Face?\n\n**llama.cpp:** partially in theory, poorly in practice today for multimodal. The server README documents slot save and restore with `--slot-save-path` and `/slots/{id}?action=save|restore`, but current open issues show that vision-enabled and `--mmproj` setups still have serious limits around slot persistence and cache reuse. There is an open PR adding `/vision/embedding` and `image_embedding` inputs, which is exactly the direction you want, but it is still an open PR, not stable baseline functionality. (GitHub)\n\n**Hugging Face Transformers:** yes, but usually through a **model-specific wrapper**, not a polished universal API for every VLM. Qwen2.5-VL exposes `get_image_features(pixel_values, image_grid_thw)`, which is the right seam for extracting reusable visual features. But the low-level path many people try next, passing those features back through `inputs_embeds`, has shown regressions and shape mismatches in the issue tracker. So it is possible, but brittle if you build directly on the lowest-level generation plumbing. (Hugging Face)\n\n**vLLM:** yes, much more directly. 
Its docs show content-hash-based cached multimodal inputs, stable `multi_modal_uuids`, the ability to skip resending cached media, and support for `image_embeds` input, including Qwen-family cases that require `image_grid_thw`. vLLM also has an `EncoderCacheManager` specifically for multimodal encoder outputs. (vLLM)\n\n### 3. What is the recommended strategy?\n\nFor your case, the best strategy is:\n\n  * **encode each reference image once**,\n  * **store its visual package**,\n  * **let the user define criteria later**,\n  * **reuse the cached reference features**,\n  * **only re-encode the new candidate image**.\n\n\n\nIf you have many references, add a first retrieval stage so you do not run the full generative comparison against every reference. That retrieval stage can use an instruction-aware embedding model, and the final stage can use a generative VLM or reranker for detailed, criterion-by-criterion judgment. (NVIDIA Docs)\n\n* * *\n\n## The architecture I would use\n\n## A. Split the pipeline into fixed work and variable work\n\n### Fixed work\n\nFor each reference image, do this once:\n\n  * preprocess with the exact model processor,\n  * run the vision encoder and projector,\n  * store the resulting features.\n\n\n\n### Variable work\n\nFor each user request:\n\n  * parse the user’s rubric,\n  * encode the new candidate image,\n  * fetch the cached reference features,\n  * run the final comparison.\n\n\n\nThis works because your rubric changes, but the reference image does not. Encoder cache is the right optimization for that pattern. NVIDIA’s Dynamo docs describe this explicitly: the embedding cache stores vision encoder outputs and reuses them when the same image appears again, and they also state that this is separate from KV cache. (NVIDIA Docs)\n\n## B. Cache a **visual package**, not just a tensor\n\nFor Qwen-family models, the reusable object is not just `image_embeds`. 
You also need the metadata that tells the model how those features map into the LLM side. vLLM’s Qwen examples show that `image_embeds` for Qwen2-VL must be paired with `image_grid_thw`, and the Qwen2.5-VL docs describe `image_grid_thw` as the temporal, height, and width feature shape of each image in the LLM. (vLLM)\n\nSo I would store, per reference image:\n\n  * `image_embeds`\n  * `image_grid_thw` or equivalent model-specific metadata\n  * model ID and revision\n  * processor ID and revision\n  * preprocessing settings\n  * source image hash\n\n\n\nThe last three are an engineering recommendation, not a literal API requirement. But Qwen’s docs show that preprocessing settings such as `min_pixels` and `max_pixels` change resolution and therefore compute and feature layout, so they belong in your cache key if you want correctness. (Hugging Face)\n\n## C. Add a retrieval stage if you have many references\n\nIf you have more than a small handful of references, do **not** ask a generative VLM to compare the candidate image against every reference. That is the expensive path.\n\nUse:\n\n  1. an **embedding model** for coarse recall,\n  2. a **reranker or generative VLM** for final scoring.\n\n\n\nQwen3-VL-Embedding is designed exactly as an embedding plus reranking pair, where the embedding model handles the initial recall stage and the reranker does precise scoring. VLM2Vec is also relevant because it supports instruction-guided multimodal embeddings, which matches your “user-defined criteria” requirement better than plain task-agnostic similarity. (GitHub)\n\nThat means a user query like:\n\n> short fur, triangular ears, dark forehead stripes\n\ncan first be used to retrieve the top few candidate references, and then only those finalists go through the expensive detailed comparison stage. (Tiger AI Lab)\n\n* * *\n\n## What I would do in each stack\n\n## 1. 
If you must stay on llama.cpp\n\nThis is the hardest path today.\n\nThe good news is that `llama-server` already documents slot save and restore for prompt cache. The bad news is that current open issues show those capabilities are still blocked or incomplete for multimodal contexts and even for some text-only conversations when `--mmproj` is loaded. One open issue says slot save for vision-enabled models does not work; another requests slot save or restore for hybrid Qwen multimodal use; another says that loading `--mmproj` can block slot persistence, context shift, and prompt cache reuse because the server treats “multimodal capability exists” as if “this slot contains images.” (GitHub)\n\nSo for llama.cpp, my advice is:\n\n  * do **not** count on multimodal slot persistence as your main solution today,\n  * use a long-lived in-memory session only as a temporary optimization,\n  * watch the `/vision/embedding` work closely,\n  * or move the multimodal serving layer elsewhere.\n\n\n\nThe open PR is important because it adds `/vision/embedding` and `image_embedding` inputs to `llama-server`, explicitly to decouple image understanding from loading and running the visual projector every time. That is the right direction, but it is not merged baseline functionality yet. (GitHub)\n\n## 2. If you stay in raw Hugging Face\n\nUse a **custom wrapper** around the visual path.\n\nFor Qwen2.5-VL, you have a documented image-feature seam: `get_image_features(pixel_values, image_grid_thw)`. That is where I would capture the reference image features and store them. Then I would write a model-specific path to feed those features back in later. (Hugging Face)\n\nWhat I would **not** do is make raw `inputs_embeds` your public application boundary. The Qwen2-VL issue shows that `inputs_embeds`-based generation can break with tensor-shape mismatches and image-token alignment problems. It is still useful plumbing, but it is not a stable long-term API abstraction. (GitHub)\n\n## 3. 
If you can move to vLLM\n\nThis is the cleanest open-source fit for your problem.\n\nvLLM already documents:\n\n  * content-hash caching for multimodal items,\n  * stable `multi_modal_uuids`,\n  * skipping the actual media payload on a cache hit,\n  * direct `image_embeds` inputs,\n  * Qwen-specific support for `image_grid_thw`,\n  * an encoder cache manager for multimodal encoder outputs. (vLLM)\n\n\n\nSo if your question is “what stack today is closest to the architecture I want,” the answer is vLLM.\n\nThat said, the public issue history shows this area is still evolving. vLLM’s own RFC acknowledges repeated media re-encoding as a real problem, which is why the encoder-cache direction exists at all. (GitHub)\n\n## 4. If you want the clearest production pattern\n\nNVIDIA Dynamo’s multimodal docs are the most explicit statement of the architecture you want:\n\n  * a **CPU-side LRU embedding cache** stores vision encoder outputs,\n  * repeated images reuse cached embeddings,\n  * on a cache hit, the encode worker is skipped entirely,\n  * embedding cache is separate from KV cache. (NVIDIA Docs)\n\n\n\nThat is the exact systems pattern your workload wants.\n\n* * *\n\n## What I would recommend for your exact use case\n\n## Option 1. Best overall design\n\nUse:\n\n  * a **multimodal embedding model** for retrieval or shortlist generation,\n  * a **generative VLM or reranker** for final criterion-aware scoring,\n  * a **reference feature store** that caches post-encoder visual packages.\n\n\n\nThis gives you flexibility for user-defined criteria without re-encoding fixed references. It also scales better than pairwise generative comparison against every reference. (GitHub)\n\n## Option 2. 
Smallest change from your current setup\n\nIf you want minimal changes:\n\n  * keep your current VLM,\n  * create a separate offline job that pre-encodes reference images,\n  * store those features,\n  * patch or wrap the serving layer so requests use cached reference features.\n\n\n\nIf you stay on llama.cpp, this likely means maintaining a custom branch or waiting for the `/vision/embedding` work to mature. If you move to vLLM, this is much closer to supported behavior already. (GitHub)\n\n## Option 3. Add a structured attribute cache\n\nThis is not enough by itself, but it is useful.\n\nFor each reference image, generate a structured sidecar once:\n\n\n    {\n      \"fur_length\": \"short\",\n      \"ear_shape\": \"upright triangular\",\n      \"color_pattern\": \"tabby with white chest\",\n      \"facial_markings\": \"dark forehead stripes\"\n    }\n\n\nThen many user criteria can be answered or prefiltered cheaply from text and attributes, with the full VLM only used for ambiguous or fine-grained checks.\n\nThis is an engineering recommendation rather than something one source states directly. It follows from the documented separation between reusable visual features, retrieval models, and reranking or generation stages. (GitHub)\n\n* * *\n\n## Pitfalls to avoid\n\n### 1. Do not confuse encoder cache with KV cache\n\nThey solve different problems. NVIDIA’s docs say this explicitly: embedding or encoder cache stores vision encoder outputs, while KV cache reuses attention state after prefill. Your slowdown is showing up during image encoding, so encoder cache matters more than KV cache. (NVIDIA Docs)\n\n### 2. Do not assume “same image” means “same safe cached embedding”\n\nFor Qwen-family models, preprocessing settings affect the feature layout. The docs show `min_pixels` and `max_pixels` change resolution, and `image_grid_thw` is part of the feature contract. 
So your cache key should include not just the image hash, but also the model revision, processor revision, and preprocessing settings. That is an engineering inference grounded in the documented feature contract. (Hugging Face)\n\n### 3. Do not build your product around brittle low-level generation hooks\n\nThe `inputs_embeds` route can work, but the Transformers issue history shows it is a sharp edge for multimodal generation. Use it as internal plumbing, not as the main public abstraction of your application if you can avoid it. (GitHub)\n\n### 4. Do not run full generative comparison against every reference\n\nUse embedding recall first if the reference set is large. That is what Qwen3-VL-Embedding plus its reranker, and instruction-aware embedding work like VLM2Vec, are built for. (GitHub)\n\n* * *\n\n## Bottom line\n\nFor your case, the right answer is:\n\n  * **Yes**, this is a known problem.\n  * **Yes**, people handle it by splitting **fixed visual encoding** from **variable text-side criteria**.\n  * **llama.cpp** has partial building blocks but does **not** yet look like the cleanest production answer for multimodal reference reuse.\n  * **Hugging Face** can do this with model-specific code, but the low-level path is brittle.\n  * **vLLM** is the best open-source fit today if you want cached media inputs and direct precomputed `image_embeds`.\n  * **Dynamo** is the clearest reference architecture for “skip vision encoding on cache hit.” (GitHub)\n\n\n\nThe most useful mental model is:\n\n**cache reference-image features once, retrieve first if needed, and let the final VLM judge only the few candidates that matter under the user’s current rubric.** (NVIDIA Docs)",
  "title": "Avoid Re-encoding Reference Images in Vision-LLM When Comparison Criteria Are User-Defined"
}