{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihttayrg6hi4ohbs226drymadpjlfc76xqgbmsxyat3tneuaqtvdq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mpiuihwj3w52"
  },
  "path": "/t/we-all-start-somewhere/177233#post_6",
  "publishedAt": "2026-06-30T10:15:14.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "RAG Evaluation cookbook",
    "faithfulness, answer relevancy, context precision, and context recall",
    "(click for more details)",
    "Gemma 4 tool-call boundary failures",
    "Qwen3.5 27B tool calling",
    "huggingface_hub environment variable docs",
    "installation/offline docs",
    "Forensic Implications of Localized AI",
    "Qwen3-4B-Instruct-2507",
    "default chat template no longer being allowed as of Transformers v4.44",
    "FLUX.2 dev",
    "FLUX.2 Klein 4B",
    "ComfyUI-GGUF"
  ],
  "textContent": "Oh. If I know the hardware specs, I can narrow the candidates down quite a bit. As for unstable networks while traveling… hmm… what to do about that​:\n\n* * *\n\n## Direct answer\n\nGiven that workstation, I would treat this as two separate problems:\n\n  1. **On the workstation:** choose realistic local baselines for a 12GB VRAM GPU, then test them with a small eval set.\n  2. **While traveling:** build a known-good offline pack, and test it with the network disconnected before leaving.\n\n\n\nFor the RAG question: poor RAG results are **not enough evidence for retraining**. I would first separate retrieval failure, context construction failure, instruction/guardrail conflict, tool/function-calling failure, model capability failure, and repeated behavior-pattern failure.\n\nFor the hardware: with RTX 5070 12GB + 32GB RAM, I would start with:\n\n  * **LLMs:** 3B–9B instruct/coder models as the first practical range.\n  * **RAG:** small LLM + local embedding model + local vector/search index.\n  * **Image generation:** ComfyUI as the backend, with SDXL first, then FLUX.1 GGUF, then FLUX.2 Klein 4B/9B GGUF, and full FLUX.2 dev only as a stretch experiment.\n  * **Travel:** one small known-good LLM, one embedding model, one local index, one image workflow if needed, all pre-downloaded.\n\n\n\n* * *\n\n## 1. 
RAG failure is not automatically a retraining problem\n\nIf you already have a vector DB, moderate chunk size, detailed instructions, and guardrails, but the answers are still poor, I would not jump straight to retraining.\n\nI would first split the failure like this:\n\nFailure class | What it means | First test\n---|---|---\nRetrieval failure | The right chunks are not retrieved | Show top-k chunks before generation\nContext construction failure | Right chunks exist, but noisy/wrong context is passed | Inspect the final context sent to the model\nInstruction / guardrail failure | Instructions fight the evidence or each other | Temporarily simplify the instruction stack\nTool / function-calling failure | Tool protocol, template, parser, or proxy is broken | Test no-tools and backend-direct paths\nModel capability failure | Model sees the right evidence but cannot reason over it | Try a stronger model on the same context\nRepeated behavior-pattern failure | Same format/style/process failure repeats | Build eval examples, then consider LoRA/PEFT\n\nThe important distinction:\n\n  * If the **right evidence is not retrieved**, fix retrieval/chunking/reranking.\n  * If the **right evidence is retrieved but ignored**, fix prompt/context formatting or try another model.\n  * If the **model can answer but the output pattern is unstable**, use examples or consider fine-tuning.\n  * If the **same behavior failure repeats across many examples**, then LoRA/PEFT becomes more interesting.\n\n\n\nHugging Face has a useful RAG Evaluation cookbook. Ragas also has practical RAG metrics such as faithfulness, answer relevancy, context precision, and context recall. I would not necessarily adopt a full framework immediately, but the categories are useful.\n\n* * *\n\n## 2. 
If tools are involved, “bad RAG” may actually be a tool protocol failure\n\nOne extra caveat: if your RAG system uses function calling or tools, do not assume every failure is retrieval or fine-tuning related.\n\nTool-call failures can look like bad RAG.\n\nA model may retrieve the right information, but the runtime, proxy, client, or chat template may be serializing or parsing tool calls incorrectly. This is especially easy to miss when using OpenAI-compatible adapters, local runners, streaming, or agent clients.\n\nRecent Gemma 4 and Qwen/Ollama discussions are useful examples of this class of issue. The Gemma 4 tool-call discussion frames many failures as multi-layer protocol-boundary problems around native tool-call syntax, chat templates, GGUF artifacts, runtime parsers, streaming parsers, OpenAI-compatible proxy layers, and agent loops rather than a single simple “model is bad” issue. See the HF Forum analysis on Gemma 4 tool-call boundary failures. Ollama also has Qwen tool-call issues such as Qwen3.5 27B tool calling and related `/api/chat tools` prompt-construction problems.\n\nSo, if tools are involved, I would test the tool layer separately.\n\n* * *\n\n## 3. 
Travel problem: package, do not improvise\n\nSince your workstation can run local models, I would treat the travel issue mostly as an **offline packaging problem**, not a training problem.\n\nThe target is a known-good pack that does not need the network at startup.\n\nMinimum idea:\n\n  * one local runner\n  * one small LLM\n  * one embedding model\n  * optional reranker\n  * one local vector/search index\n  * prompts and small eval set\n  * one image workflow if image generation matters\n  * no remote embedding API\n  * no fallback LLM API\n  * no login required during startup\n  * no model download required during startup\n  * network-off test before leaving\n\n\n\nHugging Face cache/offline behavior can be controlled with environment variables such as `HF_HOME`, `HF_HUB_CACHE`, and `HF_HUB_OFFLINE`; see the huggingface_hub environment variable docs and the Transformers installation/offline docs.\n\nAlso, local runners can still leave artifacts. A recent paper, Forensic Implications of Localized AI, analyzes caches, configs, prompt histories, logs, and network activity traces for tools such as Ollama, LM Studio, and llama.cpp. The practical takeaway is simple: local is not automatically private; you still need to know where files and histories go.\n\n* * *\n\n## 4. Hardware-aware LLM starting points\n\nWith RTX 5070 12GB + 32GB RAM, I would choose models by **headroom**, not just maximum quality.\n\nFor a first baseline, I would start in the **3B–9B instruct/coder range**.\n\nTier | Candidate size | Role\n---|---|---\nTravel-safe | 3B–4B | Fast local/offline baseline\nWorkstation baseline | 7B–9B | Better quality while still practical\nStretch | 12B-ish quant | Possible, but watch context length, speed, and VRAM\nLater experiment | 14B+, large MoE, 30B-A3B | Try after workflow is stable\n\nA good 4B model can be a useful travel baseline. 
For example, Qwen3-4B-Instruct-2507 is a 4B instruct model with local-app and quantization paths. The card also shows why version/runtime notes matter: it recommends recent Transformers and notes errors with older versions.\n\nOther compact baselines could include Gemma 3 4B class models, Phi-mini class models, or similar 3B–4B instruct models. For workstation use, 7B–9B models are a natural next bucket. For coding, I would prefer a coder-specialized 7B–9B model over a huge barely-fitting general model as the first baseline.\n\n* * *\n\n## 5. Newest is not always the best first baseline\n\nI would separate “newest” from “most stable.”\n\nNewer models often have better capability, but older-but-not-ancient models may have more mature backend support:\n\n  * tested GGUFs\n  * known templates\n  * working Ollama/llama.cpp/LM Studio paths\n  * documented ComfyUI workflows\n  * known failure modes\n  * more community troubleshooting\n\n\n\nToo old can become painful in the opposite direction:\n\n  * missing `chat_template`\n  * old pipeline assumptions\n  * stale custom nodes\n  * old dependency expectations\n  * hard-to-run Diffusers/ComfyUI workflows\n  * unsupported tool-call formats\n\n\n\nSo for a first baseline, I would usually pick **recent enough and boring**, not largest/newest possible.\n\nTransformers and vLLM have become stricter about explicit chat templates. There is a useful HF Forum note on default chat template no longer being allowed as of Transformers v4.44, and vLLM’s OpenAI-compatible server docs also emphasize chat template handling.\n\n* * *\n\n## 6. 
Image generation on 12GB VRAM: ComfyUI as backend\n\nFor image/video, I would not make the LLM itself do that directly.\n\nI would keep the component boundary:\n\n\n    chat / RAG / agent layer\n    → small local tool interface\n    → ComfyUI or another image/video backend\n    → output file path\n\n\nOn your workstation, I would probably use **ComfyUI first**. Not because it is the simplest UI, but because it is flexible, workflow-based, local, and widely used for SDXL / FLUX / I2V / T2V experimentation.\n\nFor 12GB VRAM, I would test in this order:\n\n  1. **SDXL** as the boring stable baseline.\n  2. **FLUX.1 GGUF** as a stronger modern image path.\n  3. **FLUX.2 Klein 4B/9B GGUF** as a more realistic FLUX.2 path.\n  4. **Full FLUX.2 dev GGUF** only as a stretch experiment.\n\n\n\nThe full FLUX.2 dev model is a 32B image model, so I would not make it the first 12GB-VRAM baseline. FLUX.2 Klein 4B is much more plausible as a smaller FLUX.2 path, though exact quant/workflow/offload still matters. ComfyUI-GGUF is the relevant ComfyUI path for GGUF image models.\n\n* * *\n\n## 7. 
Low-VRAM I2V/T2V: workflow matters as much as model name\n\nFor low-VRAM image/video, I would evaluate workflow behavior, not just model name.\n\nMetrics I would track:\n\n  * VRAM headroom\n  * generation time\n  * source fidelity\n  * identity preservation\n  * prompt obedience\n  * artifact rate\n  * seed sensitivity\n  * resolution\n  * number of frames\n  * sampler/scheduler\n  * CFG\n  * steps\n  * whether the subject gets repainted\n\n\n\nFor I2V/T2V especially, I would keep prompts simple at first:\n\n  * one clip\n  * one small action\n  * no complex camera move\n  * no speaking + big motion + camera movement all at once\n  * test blink / breathing / slight smile / small head motion first\n\n\n\nThis is basically the same principle as RAG eval: change one thing at a time.\n\n* * *\n\n## 8. Suggested first setup\n\nGiven everything above, I would build two profiles.\n\n### Workstation profile\n\nUse this for heavier testing:\n\n  * 7B–9B local instruct/coder LLM\n  * optional 12B quant experiment\n  * local RAG index\n  * ComfyUI\n  * SDXL baseline\n  * FLUX.1 GGUF test\n  * FLUX.2 Klein test\n  * full FLUX.2 dev only later\n  * small eval set for LLM/RAG/image\n\n\n\n### Travel profile\n\nUse this for reliability:\n\n  * 3B–4B instruct LLM\n  * one embedding model\n  * local vector/search index\n  * no remote embedding API\n  * no fallback LLM API\n  * one tested ComfyUI workflow if images are needed\n  * one stable image model, probably SDXL or a small FLUX GGUF\n  * all files pre-downloaded\n  * network-off test before leaving\n\n\n\nThe workstation can be where you test heavier and newer things. The travel pack should be small, boring, and already proven.\n\n* * *\n\n## 9. 
What would help next\n\nFor more concrete recommendations, the next useful details would be:\n\n  * Is the travel machine the same workstation or a laptop?\n  * Can you carry an external SSD?\n  * Do you want the travel pack to run only on Windows, or also Linux?\n  * Is the priority chat, coding, RAG, image generation, or video?\n  * Do you need tool/function calling?\n  * Does “offline” mean convenience, or sensitive/private data?\n  * What runner have you tried: Ollama, LM Studio, llama.cpp, vLLM, something else?\n  * What RAG stack are you using now?\n  * Are tools/function calls part of the RAG flow?\n  * What models have already failed?\n\n\n\nThat would narrow the suggestions more than a generic model list.",
  "title": "We all start somewhere"
}