
Isn't there a simpler way to run LLMs / models locally?

Hugging Face Forums [Unofficial] June 30, 2026

There are a few fully multimodal / omni-style large models, but if the more general goal is “I want my OSS local chat/RAG setup to call T2I/T2V”, I would usually run the image/video models as separate local server processes and connect them as a pipeline. The execution cost, debugging cost, and replacement cost are usually lower that way, and existing frameworks already cover a lot of this.


Short answer

If you mean one offline Claude/Sonnet-like model or one magic USB app that also does high-quality T2I/T2V, then yes, I think that is probably pushing it.

If you mean a local pipeline, it becomes much more realistic:

chat / RAG / agent / workflow UI
→ tool call, HTTP request, CLI call, plugin, skill, or MCP
→ local image/video generation backend
→ output image/video file

So I would not start by trying to make the chat model itself render images or videos. I would start by making the media generator a small callable local backend.

The chat/RAG side could be LM Studio, Open WebUI, LibreChat, AnythingLLM, Dify, n8n, LangChain, LlamaIndex, Haystack, smolagents, a Python CLI, or something custom. The important part is not the UI. The important part is that the image/video generator has a narrow local interface.

The common pattern

A lot of existing tools already point in the same direction:

| Existing shape | What it shows |
| --- | --- |
| Open WebUI + ComfyUI / Open WebUI + AUTOMATIC1111 | A local chat/RAG UI can call a separate image-generation backend. |
| LibreChat Stable Diffusion tool | An agent tool can call a local AUTOMATIC1111 Stable Diffusion WebUI API. |
| Dify ComfyUI plugin | A workflow/RAG app builder can call exported ComfyUI API workflows. |
| AnythingLLM custom agent skills | An agent layer can call APIs, local Python scripts, or OS-level actions depending on deployment. |
| Open WebUI tools / MCP / OpenAPI tool servers | The tool layer can live as a separate process and be called over HTTP/MCP/OpenAPI-style boundaries. |
| n8n + ComfyUI media workflows | The same generation backend can be called by automation workflows, not only chat UIs. |

These are not the same stack, and I would not treat any one of them as the universal answer. But they show the general architecture:

LLM / RAG / agent layer
→ small tool interface
→ media-generation backend
→ file path, URL, or artifact

That is why I would frame this as a component-boundary problem rather than as a “find one bigger model” problem.
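
To make the component boundary concrete, here is a minimal sketch of what that small tool interface could look like in Python. The names `MediaBackend`, `FakeBackend`, and `handle_tool_call` are my own illustrations, not part of any of the tools listed above; the point is only that the chat/agent layer sees "prompt in, path out" and nothing else.

```python
from pathlib import Path
from typing import Protocol


class MediaBackend(Protocol):
    """Hypothetical narrow interface: prompt in, artifact path out."""

    def generate(self, prompt: str, seed: int = 0) -> Path: ...


class FakeBackend:
    """Stand-in backend for wiring tests; writes a placeholder text file."""

    def __init__(self, out_dir: Path) -> None:
        self.out_dir = out_dir

    def generate(self, prompt: str, seed: int = 0) -> Path:
        self.out_dir.mkdir(parents=True, exist_ok=True)
        path = self.out_dir / f"fake_{seed}.txt"
        path.write_text(prompt, encoding="utf-8")
        return path


def handle_tool_call(backend: MediaBackend, prompt: str, seed: int = 0) -> dict:
    """What the chat/agent layer receives: a small JSON-friendly result."""
    path = backend.generate(prompt, seed=seed)
    return {"status": "ok", "path": str(path)}
```

Because the caller only depends on `generate`, a ComfyUI, Diffusers, or AUTOMATIC1111 implementation could be swapped in later without touching the chat side.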

I would separate three questions

Concretely:

  1. How do I learn T2I/T2V itself?
  2. How do I run T2I/T2V locally as a callable process?
  3. How do I let a chat/RAG/agent layer call that process?

Those are related, but mixing them too early makes debugging much harder.

First practical target

Before involving RAG, agents, or MCP, I would first make this boring test pass:

prompt in
→ local process
→ output image/video file path out

That one test tells you whether the generation backend is real, local, reproducible, and scriptable.
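
A minimal sketch of that boring test, using only the standard library. The contract here is an assumption of mine: the generator command takes the prompt as its last argument and prints the output file path as the last line of stdout. Adapt it to whatever your backend script actually does.

```python
import subprocess
from pathlib import Path


def run_generator(command: list[str], prompt: str, timeout_s: float = 600.0) -> Path:
    """Run a local generator command and return the output file it reports.

    Assumed (hypothetical) contract: the prompt is appended as the last
    argument, and the last line of stdout is the output file path.
    """
    result = subprocess.run(
        [*command, prompt],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError("generator printed no output path")
    path = Path(lines[-1].strip())
    if not path.is_file() or path.stat().st_size == 0:
        raise RuntimeError(f"generator reported a missing or empty file: {path}")
    return path
```

If this passes with the network disabled and from a fresh shell, the backend is scriptable enough to put a chat or agent layer in front of it.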

ComfyUI path

In this context, I would not treat ComfyUI as “just a GUI”. I would treat it as a workflow-backed local inference server:

build workflow visually
→ export/use API form
→ call workflow from another process
→ return generated file

A practical first milestone:

  1. Build one known-good workflow manually in ComfyUI.
  2. Confirm it works locally.
  3. Export/use the API format.
  4. Expose only a few stable inputs.
  5. Call the workflow from a small Python script.
  6. Return an output file path.
  7. Only then connect it to chat/RAG/agent tooling.

The useful starting docs are:

  • ComfyUI repository
  • ComfyUI server routes, especially the /prompt queue flow
  • ComfyUI API examples
  • ComfyUI workflow API format
  • ComfyUI basic API example script

The hard part is often not “can I send an HTTP request?” The harder part is “which inputs of this workflow are stable and safe to expose?”
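
One way to handle that is an explicit whitelist from public parameter names to workflow inputs. The sketch below assumes an API-format workflow (a dict of node id to `class_type` and `inputs`) and a ComfyUI server on its usual local address; the node ids `"6"` and `"3"` and the `EXPOSED_INPUTS` mapping are placeholders that depend entirely on your own exported workflow.

```python
import copy
import json
import urllib.request

# Hypothetical mapping: public parameter -> (node id, input name) in YOUR
# exported API-format workflow. The node ids "6" and "3" are placeholders.
EXPOSED_INPUTS = {
    "prompt": ("6", "text"),
    "seed": ("3", "seed"),
}


def patch_workflow(workflow: dict, params: dict) -> dict:
    """Return a copy of the workflow with only whitelisted inputs replaced."""
    patched = copy.deepcopy(workflow)
    for name, value in params.items():
        if name not in EXPOSED_INPUTS:
            raise ValueError(f"parameter not exposed: {name}")
        node_id, input_name = EXPOSED_INPUTS[name]
        patched[node_id]["inputs"][input_name] = value
    return patched


def queue_prompt(workflow: dict, base_url: str = "http://127.0.0.1:8188") -> dict:
    """POST the workflow to ComfyUI's /prompt route and return its JSON reply."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())
```

The reply to `queue_prompt` should include a prompt id that you can use to look up the finished outputs later; check the server routes docs for the exact follow-up calls in your ComfyUI version.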

Diffusers path

If you are more comfortable in Python, Diffusers is the more natural route.

The basic shape is:

Python Diffusers pipeline
→ local FastAPI endpoint or CLI wrapper
→ output file path

Diffusers has an official guide for using a pipeline as a server inference engine: Create a server.
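
For the CLI-wrapper variant, a rough sketch could look like this. It assumes a text-to-image pipeline whose result exposes `.images`, and it deliberately has no default model: `--model` is whatever local directory or Hub id you have already downloaded. The flag names are my own choice.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate one image locally.")
    parser.add_argument("--model", required=True, help="local model directory or Hub id")
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--out", type=Path, default=Path("out.png"))
    return parser


def generate(model: str, prompt: str, seed: int, out: Path) -> Path:
    # Imported lazily so --help and argument errors do not load torch.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model)
    pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    # A seeded CPU generator keeps results reproducible across devices.
    generator = torch.Generator().manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    out.parent.mkdir(parents=True, exist_ok=True)
    image.save(out)
    return out.resolve()


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Print the absolute output path so a calling process can pick it up.
    print(generate(args.model, args.prompt, args.seed, args.out))
```

Call `main()` from the usual `if __name__ == "__main__":` guard to use it as a script. Reloading the pipeline on every call is slow, which is the main reason to move to a long-running server once this works.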

For images, this can be fairly straightforward:

/generate-image
input: prompt, seed, size, model/preset
output: image path
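
A sketch of what that narrow input could look like as a validated request object. The preset names and allowed sizes are made-up examples; the idea is just that anything outside the small exposed schema is rejected before it reaches the model.

```python
from dataclasses import dataclass

# Hypothetical presets and sizes; pick values that match your own models.
ALLOWED_PRESETS = {"draft", "quality"}
ALLOWED_SIZES = {(512, 512), (768, 768), (1024, 1024)}


@dataclass(frozen=True)
class ImageRequest:
    prompt: str
    seed: int = 0
    width: int = 512
    height: int = 512
    preset: str = "draft"

    def __post_init__(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if (self.width, self.height) not in ALLOWED_SIZES:
            raise ValueError(f"unsupported size: {self.width}x{self.height}")
        if self.preset not in ALLOWED_PRESETS:
            raise ValueError(f"unknown preset: {self.preset}")
        if self.seed < 0:
            raise ValueError("seed must be non-negative")


def parse_request(payload: dict) -> ImageRequest:
    """Build a validated request, rejecting any field that is not exposed."""
    unknown = set(payload) - set(ImageRequest.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return ImageRequest(**payload)
```

This does not check field types, so a real endpoint would want that too (a web framework's request models can do both in one place).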

For video, I would treat it as a longer job:

/generate-video
input: prompt or input_image, seed, size, frames, preset
output: job_id first
then: progress/logs
finally: video path

That job-style design matters because video generation is usually slower, larger, and more failure-prone than single-image generation.
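
Here is a framework-agnostic sketch of the job-style part: submit returns a job id immediately, and the caller polls for state and the final path. `JobStore` and its status fields are hypothetical names, the store is in-memory only, and the HTTP layer is left out on purpose.

```python
import threading
import uuid
from typing import Callable


class JobStore:
    """In-memory job registry: submit returns a job id, status is polled."""

    def __init__(self) -> None:
        self._jobs: dict = {}
        self._lock = threading.Lock()

    def submit(self, render: Callable[[], str]) -> str:
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"state": "queued", "path": None, "error": None}
        thread = threading.Thread(target=self._run, args=(job_id, render), daemon=True)
        thread.start()
        return job_id

    def _run(self, job_id: str, render: Callable[[], str]) -> None:
        self._update(job_id, state="running")
        try:
            path = render()  # the slow call into the video backend
            self._update(job_id, state="done", path=path)
        except Exception as exc:  # record the failure instead of losing it
            self._update(job_id, state="failed", error=str(exc))

    def _update(self, job_id: str, **fields) -> None:
        with self._lock:
            self._jobs[job_id].update(fields)

    def status(self, job_id: str) -> dict:
        with self._lock:
            return dict(self._jobs[job_id])
```

This starts one thread per job, which is fine for a sketch. On a single GPU I would probably replace that with one worker thread reading from a queue, so that two video jobs never compete for VRAM.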

AUTOMATIC1111 / Forge path

If someone already uses the Stable Diffusion WebUI ecosystem, then AUTOMATIC1111 or Forge-style APIs are another practical path.

Useful links:

  • AUTOMATIC1111 Stable Diffusion WebUI API wiki
  • LibreChat Stable Diffusion tool
  • LibreChat image generation overview

This path is useful when you want something like:

chat/agent tool
→ AUTOMATIC1111 API
→ txt2img/img2img
→ output image

It may be less ideal if your main target is complex video workflows, but it is a useful example of the same general architecture.
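
As a rough sketch of that flow: with the WebUI started with its `--api` flag, the `/sdapi/v1/txt2img` route accepts a JSON payload and replies with base64-encoded images. The address below is the usual local default and the file naming is my own; I am also assuming the default PNG output format.

```python
import base64
import json
import urllib.request
from pathlib import Path


def txt2img(prompt: str, base_url: str = "http://127.0.0.1:7860", steps: int = 20) -> dict:
    """Call the WebUI's /sdapi/v1/txt2img route (WebUI started with --api)."""
    body = json.dumps({"prompt": prompt, "steps": steps}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/sdapi/v1/txt2img",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())


def save_images(reply: dict, out_dir: Path) -> list[Path]:
    """Decode the base64 strings in reply["images"] and write them to disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, encoded in enumerate(reply.get("images", [])):
        path = out_dir / f"image_{index}.png"  # assumes PNG output
        path.write_bytes(base64.b64decode(encoded))
        paths.append(path)
    return paths
```

A tool wrapper would then call `save_images(txt2img(prompt), out_dir)` and hand the first path back to the chat layer.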

Where MCP fits

MCP is useful, but I would put it one layer above the renderer.

In other words:

MCP is not the image/video model.
MCP is a tool boundary that can call the image/video backend.

A reasonable path would be:

1. Make ComfyUI / Diffusers / A1111 callable locally.
2. Wrap it in a narrow HTTP or CLI interface.
3. Test it without an agent.
4. Then expose that interface as an MCP tool if you want multiple MCP-compatible clients to call it.

If a simple local HTTP API is enough, start there. MCP is useful once you need a standard tool boundary, tool discovery, or integration across several clients.
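
If you do get to the MCP step, the wrapper can stay very thin. This sketch assumes the official MCP Python SDK and its `FastMCP` helper; the import is guarded so the plain function still works without the SDK installed. `render_image` is a placeholder for whatever backend call you already tested, and the server name is arbitrary.

```python
from pathlib import Path

try:
    from mcp.server.fastmcp import FastMCP  # official MCP Python SDK, if installed
except ImportError:
    FastMCP = None


def render_image(prompt: str, seed: int) -> Path:
    """Placeholder for the already-tested backend call (HTTP or CLI)."""
    raise NotImplementedError("wire this to your local backend")


def generate_image(prompt: str, seed: int = 0) -> str:
    """Generate one image locally and return the output file path."""
    return str(render_image(prompt, seed))


if FastMCP is not None:
    mcp = FastMCP("local-media")  # server name is an arbitrary example
    mcp.tool()(generate_image)  # register the plain function as an MCP tool
    # Call mcp.run() from your entry point to start serving.
```

The tool body contains no rendering logic at all, which is the point: MCP only describes and exposes the call, and the backend stays replaceable underneath it.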

Add video after image generation works

T2V/I2V uses the same architecture, but I would be more conservative with it.

Ready-made chat integrations are usually more mature for image generation than for video generation. For video, I would first prove the workflow locally, then wrap it as a job-style API with queue/progress/output-path handling.

A realistic order:

1. T2I: one local image from one prompt
2. img2img / inpaint / control workflow
3. callable local API
4. chat/RAG/tool integration
5. I2V/T2V after the image path is stable

Why video later?

  • more VRAM,
  • longer jobs,
  • more output data,
  • queue/progress handling matters,
  • timeouts matter,
  • FFmpeg or video encoding may matter,
  • failure recovery matters,
  • output file management matters.
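
The timeout and failure points in that list mostly show up on the calling side. A minimal polling sketch, assuming a hypothetical status dict with `state`, `path`, and `error` keys returned by whatever job API you built:

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], dict],
    timeout_s: float = 1800.0,
    poll_s: float = 2.0,
) -> str:
    """Poll a job until it finishes, fails, or exceeds the timeout.

    Assumes a hypothetical status shape: {"state": ..., "path": ..., "error": ...}.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status["state"] == "done":
            return status["path"]
        if status["state"] == "failed":
            raise RuntimeError(f"video job failed: {status.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video job still {status['state']} after {timeout_s}s")
        time.sleep(poll_s)
```

Most chat UIs and tool runners have their own request timeouts that are far shorter than a video render, so the tool call usually has to return the job id quickly and leave this waiting to a separate step.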

Offline / travel checklist

For the travel/offline part, I would not trust “it worked once while online” as proof. I would do a real network-disabled test before depending on it.

Checklist:

  • model weights are already downloaded,
  • gated model access is already resolved,
  • Python packages are installed,
  • ComfyUI custom nodes/extensions are installed,
  • FFmpeg/video tools are installed if needed,
  • no hidden dependency on HF Inference API, Replicate, fal, Comfy Cloud, OpenAI, etc.,
  • cache directories are on the disk you carry,
  • output paths are predictable,
  • logs do not leak sensitive prompts/images,
  • the target machine has the required GPU driver / CUDA / ROCm / PyTorch stack,
  • the app still runs after disconnecting the network completely.

USB/SSD can carry models, caches, environments, workflows, and scripts. It cannot magically carry the target machine’s GPU driver, VRAM, CUDA/ROCm compatibility, or OS-level dependency state.
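
Part of that checklist can be scripted as a preflight check. This is only a sketch: the file and tool lists are whatever your own setup needs, `HF_HUB_OFFLINE` is the Hugging Face Hub offline switch (checked here only for the value `1`, which is a simplification), and it cannot verify drivers or VRAM for you.

```python
import os
import shutil
import socket
from pathlib import Path


def network_reachable(host: str = "huggingface.co", port: int = 443, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection succeeds, i.e. the machine is NOT offline."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def preflight(required_files: list[Path], required_tools: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for path in required_files:
        if not path.is_file():
            problems.append(f"missing file: {path}")
    for tool in required_tools:
        if shutil.which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    if os.environ.get("HF_HUB_OFFLINE") != "1":
        problems.append("HF_HUB_OFFLINE is not set to 1")
    return problems
```

I would run `preflight(...)` with the network physically disabled, confirm `network_reachable()` returns `False`, and then still do one full generation, since only a real run exercises the GPU stack.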

My suggested first build

If I were trying to make this practical, I would use this order:

Step 1:
Pick one backend:
ComfyUI, Diffusers, or AUTOMATIC1111/Forge.

Step 2:
Generate one image locally.
No RAG, no agent, no MCP yet.

Step 3:
Call the generator from a script.
prompt → backend → output path.

Step 4:
Freeze one known-good workflow/preset.
Expose only a narrow input schema.

Step 5:
Connect that script/API to the chat/RAG layer.

Step 6:
Only then add MCP/OpenAPI/tool metadata if useful.

Step 7:
Only then add video.
Treat video as a longer job with queue/progress/output handling.

That is less glamorous than “one fully multimodal offline assistant”, but it is easier to debug, easier to replace piece by piece, and closer to what existing OSS tools already support.
