Raw Record Source

{
  "path": "/3m724p6jhos2s",
  "site": "at://did:plc:57od6g2ic3e3b3kauctjmo3k/site.standard.publication/3lwagtcm36s2d",
  "$type": "site.standard.document",
  "title": "Self-Hosting an LLM: A Scatter Pack",
  "content": {
    "$type": "pub.leaflet.content",
    "pages": [
      {
        "id": "019adb0d-96b8-7ff3-97fc-8d5e3083f1d8",
        "$type": "pub.leaflet.pages.canvas",
        "blocks": [
          {
            "x": 679,
            "y": 3145,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://huggingface.co/models",
              "$type": "pub.leaflet.blocks.website",
              "title": "Models – Hugging Face",
              "description": "Explore machine learning models."
            },
            "width": 360
          },
          {
            "x": 688,
            "y": 3277,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Once you've got a clue, you can take your search over here to find a specific model that meets size/quantization constraints."
            },
            "width": 360
          },
          {
            "x": 591,
            "y": 4149,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.image",
              "image": {
                "$type": "blob",
                "ref": {
                  "$link": "bafkreib2q5x6rjvxhxph7mn5olyu3vm32fwwi7jshx7ybnds5prjztetjq"
                },
                "mimeType": "image/png",
                "size": 75855
              },
              "aspectRatio": {
                "width": 776,
                "height": 348
              }
            },
            "width": 598,
            "rotation": 12
          },
          {
            "x": 270,
            "y": 1617,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "If you choose vLLM or parallelized solutions, you have to select a smaller model than you would otherwise, because the key-value cache needs to reside in VRAM."
            },
            "width": 710
          },
          {
            "x": 270,
            "y": 1713,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "If you have Apple Silicon, you probably want to use MLX, which is its own framework, so models need to be explicitly converted before they can run on it."
            },
            "width": 699
          },
          {
            "x": 270,
            "y": 1785,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Ollama has its own catalog of models that you can deploy from, which can be easier to use than Hugging Face."
            },
            "width": 699
          },
          {
            "x": 274,
            "y": 1880,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 50,
                    "byteStart": 44
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#italic"
                    }
                  ]
                }
              ],
              "plaintext": "These are all conditions that I wish I knew before I started to shop for models. I got excited about running DeepSeek R1 7B, but I had to run a smaller model."
            },
            "width": 716
          },
          {
            "x": 8,
            "y": 75,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I got nerd-sniped (I think that's the term?) into running a text-generation LLM at home on a GPU I had lying around. In lieu of a well-formed blog post, here's a collection of things that helped me along the way."
            },
            "width": 1057
          },
          {
            "x": 290,
            "y": 711,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 2,
              "facets": [],
              "plaintext": "Step 2: Choose your server"
            },
            "width": 393
          },
          {
            "x": 274,
            "y": 2048,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Does it need to be available over the network?"
            },
            "width": 591
          },
          {
            "x": 274,
            "y": 2096,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "How many people/applications would like to access your server at once?"
            },
            "width": 751
          },
          {
            "x": 285,
            "y": 285,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "What do you want to do with a model? Is this just to say you can do it, or do you want to accomplish something?"
            },
            "width": 605
          },
          {
            "x": 289,
            "y": 242,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 2,
              "facets": [],
              "plaintext": "Step 1: Do a little dreaming"
            },
            "width": 360
          },
          {
            "x": 285,
            "y": 357,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Here are some things that inspired me to host my own model:"
            },
            "width": 570
          },
          {
            "x": 840,
            "y": 487,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://www.home-assistant.io/voice_control/",
              "$type": "pub.leaflet.blocks.website",
              "title": "Talking with Home Assistant - get your system up & running",
              "description": "Open source home automation that puts local control and privacy first. Powered by a worldwide community of tinkerers and DIY enthusiasts. Perfect to run on a Raspberry Pi or a local server."
            },
            "width": 360,
            "rotation": -26
          },
          {
            "x": 567,
            "y": 483,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://bsky.app/profile/zzstoatzzdevlog.bsky.social",
              "$type": "pub.leaflet.blocks.website",
              "title": "zzstoatzzdevlog.bsky.social",
              "description": "(maybe) interesting stuff i (@zzstoatzz.io) do - narrated by claude!  see https://github.com/jlowin/fastmcp/pull/916"
            },
            "width": 360,
            "rotation": -27
          },
          {
            "x": 295,
            "y": 479,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://github.com/letta-ai/letta-code",
              "$type": "pub.leaflet.blocks.website",
              "title": "GitHub - letta-ai/letta-code: A self-improving, stateful coding agent that can learn from experience and improve with use.",
              "description": "A self-improving, stateful coding agent that can learn from experience and improve with use. - letta-ai/letta-code"
            },
            "width": 360,
            "rotation": -26
          },
          {
            "x": 290,
            "y": 762,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I'm assuming that you want to run an off-the-shelf model, and boy-oh-boy are there a lot of them. Deciding what you'll use to run your model will help you refine your search. For me, it came down to parallelism:"
            },
            "width": 608
          },
          {
            "x": 8,
            "y": 147,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I'm calling this a scatter pack because it's kind of like a starter pack, but it's mostly a scattered mix of advice, tutorial, curation, and post-mortem blogging. I'm not getting very far into implementation detail, and I probably won't cover your use case."
            },
            "width": 1116
          },
          {
            "x": 840,
            "y": 902,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 16,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "No parallelism ="
            },
            "width": 178
          },
          {
            "x": 740,
            "y": 944,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Each request has to wait for the one before it to complete"
            },
            "width": 360
          },
          {
            "x": 338,
            "y": 1048,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I started with Ollama, but I quickly found out I needed to switch to vLLM. Here are some options, in addition to those two:"
            },
            "width": 556
          },
          {
            "x": 61,
            "y": 1164,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://ollama.com/",
              "$type": "pub.leaflet.blocks.website",
              "title": "Ollama",
              "description": "Get up and running with large language models."
            },
            "width": 360
          },
          {
            "x": 442,
            "y": 1164,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://docs.vllm.ai/en/latest/",
              "$type": "pub.leaflet.blocks.website",
              "title": "vLLM",
              "description": "You are viewing the latest developer preview docs. Click here to view docs for the latest stable release."
            },
            "width": 360
          },
          {
            "x": 828,
            "y": 1162,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://github.com/turboderp-org/exllamav3",
              "$type": "pub.leaflet.blocks.website",
              "title": "GitHub - turboderp-org/exllamav3: An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs",
              "description": "An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs  - GitHub - turboderp-org/exllamav3: An optimized quantization and inference library for runni..."
            },
            "width": 360
          },
          {
            "x": 262,
            "y": 1296,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://github.com/ggml-org/llama.cpp",
              "$type": "pub.leaflet.blocks.website",
              "title": "GitHub - ggml-org/llama.cpp: LLM inference in C/C++",
              "description": "LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub."
            },
            "width": 360
          },
          {
            "x": 700,
            "y": 1295,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://ml-explore.github.io/mlx/build/html/index.html",
              "$type": "pub.leaflet.blocks.website",
              "title": "MLX — MLX 0.30.0 documentation",
              "description": "MLX is a NumPy-like array framework designed for efficient and flexible machine\nlearning on Apple silicon, brought to you by Apple machine learning research."
            },
            "width": 360
          },
          {
            "x": 272,
            "y": 902,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 11,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "Parallelism ="
            },
            "width": 154
          },
          {
            "x": 163,
            "y": 943,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "You can have many completions running at the same time"
            },
            "width": 360
          },
          {
            "x": 383,
            "y": 1479,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 49,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#italic"
                    }
                  ]
                }
              ],
              "plaintext": "Why are we selecting the server before the model?"
            },
            "width": 532
          },
          {
            "x": 29,
            "y": 478,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://bsky.app/profile/penelope.hailey.at",
              "$type": "pub.leaflet.blocks.website",
              "title": "Penelope (@penelope.hailey.at)",
              "description": "@hailey.at takes care of me. if there are problems, let her know."
            },
            "width": 360,
            "rotation": -27
          },
          {
            "x": 267,
            "y": 1549,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Again, I'm focusing on picking a model off-the-shelf, so your choice of server can introduce extra constraints on your selection. For example:"
            },
            "width": 734
          },
          {
            "x": 274,
            "y": 1952,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Here are some questions you should ask yourself during this process:"
            },
            "width": 653
          },
          {
            "x": 274,
            "y": 2000,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Where will it run (e.g. desktop environment or container)?"
            },
            "width": 671
          },
          {
            "x": 252,
            "y": 2246,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 2,
              "facets": [],
              "plaintext": "Step 3: Choose a model"
            },
            "width": 360
          },
          {
            "x": 254,
            "y": 2293,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Okay, hopefully you know what you're gonna use to run your model. If you're lost or confused, I'd recommend Ollama as the place to start."
            },
            "width": 721
          },
          {
            "x": 254,
            "y": 2365,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The GPU I'm working with is an RTX 3060 Ti with 8GB of VRAM. To run our model, it needs to fit in VRAM--and I already told you about my parallelism requirement, so we're going to have to do a bit of code golfing."
            },
            "width": 699
          },
          {
            "x": 254,
            "y": 2461,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "There are three model characteristics that I needed to consider in selecting my model. I'll briefly describe them and link out to IBM, who can explain each in more detail:"
            },
            "width": 713
          },
          {
            "x": 28,
            "y": 2603,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 16,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "Model Parameters"
            },
            "width": 200
          },
          {
            "x": 31,
            "y": 2641,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "This is the most important factor in how much \"intelligence\" the model has. More parameters = bigger model. Many models are available with different numbers of parameters."
            },
            "width": 360
          },
          {
            "x": 422,
            "y": 2603,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 12,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "Quantization"
            },
            "width": 147
          },
          {
            "x": 837,
            "y": 2603,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 14,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "Context Window"
            },
            "width": 178
          },
          {
            "x": 426,
            "y": 2641,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Quantization is part compression, part optimization. It shrinks your model by reducing the precision of certain weights, which also increases the speed of execution. It can also make your model less stable, so try to strike a balance."
            },
            "width": 360
          },
          {
            "x": 837,
            "y": 2639,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I'm not sure how much this affects non-parallel deployments, but this is a factor in computing the KV cache size. Fortunately, you can adjust this in vLLM, but it's good to know what size the model supports. Bigger context = bigger cache."
            },
            "width": 360
          },
          {
            "x": 429,
            "y": 2831,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://www.ibm.com/think/topics/quantization",
              "$type": "pub.leaflet.blocks.website",
              "title": "What is Quantization? | IBM",
              "description": "Quantization is the process of reducing the precision of a digital signal, typically from a higher-precision format to a lower-precision format."
            },
            "width": 360
          },
          {
            "x": 842,
            "y": 2830,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://www.ibm.com/think/topics/context-window",
              "$type": "pub.leaflet.blocks.website",
              "title": "What is a context window? | IBM",
              "description": "The context window (or “context length”) of a large language model (LLM) is the amount of text, in tokens, that the model can consider or “remember” at once."
            },
            "width": 360
          },
          {
            "x": 30,
            "y": 2831,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://www.ibm.com/think/topics/model-parameters",
              "$type": "pub.leaflet.blocks.website",
              "title": "What are Model Parameters? | IBM",
              "description": "Model parameters are the internal configuration variables of a machine learning model which control how it processes data and makes predictions."
            },
            "width": 360
          },
          {
            "x": 435,
            "y": 3026,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "Here are a couple of places where you can look for models:"
            },
            "width": 360
          },
          {
            "x": 222,
            "y": 3144,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://ollama.com/search",
              "$type": "pub.leaflet.blocks.website",
              "title": "Ollama Search",
              "description": "Search for models on Ollama."
            },
            "width": 360
          },
          {
            "x": 228,
            "y": 3275,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 26,
                    "byteStart": 0
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "I recommend starting here, even if you're not using Ollama. It's a good place to see what's new and popular, and it concisely displays the available parameter versions."
            },
            "width": 360
          },
          {
            "x": 606,
            "y": 3787,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://huggingface.co/JunHowie/Qwen3-4B-Thinking-2507-GPTQ-Int4",
              "$type": "pub.leaflet.blocks.website",
              "title": "JunHowie/Qwen3-4B-Thinking-2507-GPTQ-Int4 · Hugging Face",
              "description": "We’re on a journey to advance and democratize artificial intelligence through open source and open science."
            },
            "width": 360
          },
          {
            "x": 604,
            "y": 3617,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507",
              "$type": "pub.leaflet.blocks.website",
              "title": "Qwen/Qwen3-4B-Thinking-2507 · Hugging Face",
              "description": "We’re on a journey to advance and democratize artificial intelligence through open source and open science."
            },
            "width": 360
          },
          {
            "x": 268,
            "y": 3626,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The base model on Hugging Face:"
            },
            "width": 325
          },
          {
            "x": 265,
            "y": 3787,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The quantized variant I'm actually running:"
            },
            "width": 360
          },
          {
            "x": 339,
            "y": 3463,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 58,
                    "byteStart": 36
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#bold"
                    }
                  ]
                }
              ],
              "plaintext": "The model that I decided to run was Qwen3-4B-Thinking-2507, because it's small, recent, and scores well on benchmarks that don't mean all that much to me because I'm not a professional in this field."
            },
            "width": 569
          },
          {
            "x": 271,
            "y": 3983,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "The Hugging Face interface had a bit of a learning curve, so here are some scribbleshots that might save you some headache:"
            },
            "width": 656
          },
          {
            "x": 255,
            "y": 5402,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "You probably want to use your LLM somehow, so here are some services that you might consider standing up next to your LLM. Open WebUI is a bit of a no-brainer--I found it a super easy way to test my model out once deployed."
            },
            "width": 714
          },
          {
            "x": 69,
            "y": 4164,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.image",
              "image": {
                "$type": "blob",
                "ref": {
                  "$link": "bafkreigifrfj6v3yjmroxer7ycdu4xw463zjvjasaijzbroox3g4mip5wa"
                },
                "mimeType": "image/png",
                "size": 71617
              },
              "aspectRatio": {
                "width": 874,
                "height": 349
              }
            },
            "width": 545,
            "rotation": -7
          },
          {
            "x": 153,
            "y": 4398,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.image",
              "image": {
                "$type": "blob",
                "ref": {
                  "$link": "bafkreibvtwf4dhw2yzz5cqhucdgcqounfvt7ilxsgcpssdtbon5fnetrwy"
                },
                "mimeType": "image/png",
                "size": 279084
              },
              "aspectRatio": {
                "width": 1362,
                "height": 1124
              }
            },
            "width": 951,
            "rotation": 4
          },
          {
            "x": 255,
            "y": 5351,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 2,
              "facets": [],
              "plaintext": "Step 4: Pick your extras"
            },
            "width": 360
          },
          {
            "x": 261,
            "y": 6198,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I can only talk about my experience, as this isn't a real guide--but here are links that helped me."
            },
            "width": 360
          },
          {
            "x": 690,
            "y": 6509,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://docs.unsloth.ai/basics/inference-and-deployment/vllm-guide/vllm-engine-arguments",
              "$type": "pub.leaflet.blocks.website",
              "title": "vLLM Engine Arguments | Unsloth Documentation",
              "description": "vLLM engine arguments, flags, options for serving models on vLLM."
            },
            "width": 360
          },
          {
            "x": 799,
            "y": 6362,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://docs.vllm.ai/en/v0.11.0/deployment/docker.html",
              "$type": "pub.leaflet.blocks.website",
              "title": "Using Docker - vLLM",
              "description": "vLLM offers an official Docker image for deployment. The image can be used to run OpenAI compatible server and is available on Docker Hub as vllm/vllm-openai."
            },
            "width": 360
          },
          {
            "x": 75,
            "y": 6819,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.code",
              "language": "ini",
              "plaintext": "[Unit]\nDescription=vLLM OpenAI-compatible server\n\n[Container]\nPod=ai.pod\nContainerName=vllm\nImage=docker.io/vllm/vllm-openai:latest\n# Expose all NVIDIA GPUs to the container via CDI\nAddDevice=nvidia.com/gpu=all\nSecurityLabelDisable=true\nPodmanArgs=--ipc=host\n\n# Arguments for the vLLM server (model first, then flags); keep this on one line\nExec=JunHowie/Qwen3-4B-Thinking-2507-GPTQ-Int4 --kv-cache-dtype fp8 --gpu-memory-utilization 0.9 --max-model-len 32768 --max-num-seqs 16 --swap-space 6 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --served-model-name qwen/qwen3-4b-thinking-2507\nEnvironment=\"PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\"\n\n# Persist the Hugging Face cache so the model is not re-downloaded on restart\nVolume=/media/dregheap/ai/vllm:/root/.cache/huggingface:z",
              "syntaxHighlightingTheme": "rose-pine"
            },
            "width": 1105
          },
          {
            "x": 411,
            "y": 5684,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://www.letta.com/",
              "$type": "pub.leaflet.blocks.website",
              "title": "Letta",
              "description": "The platform for stateful agents. Build AI agents with long-term memory, advanced reasoning, and custom tools using the Letta API and Agent Development Environment (ADE)."
            },
            "width": 360
          },
          {
            "x": 134,
            "y": 5565,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://openwebui.com/",
              "$type": "pub.leaflet.blocks.website",
              "title": "Open WebUI",
              "description": "Open WebUI is an extensible, self-hosted interface for AI that adapts to your workflow, all while operating entirely offline; Supported LLM runners include Ollama and OpenAI-compatible APIs."
            },
            "width": 360,
            "rotation": -11
          },
          {
            "x": 725,
            "y": 5563,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://www.litellm.ai/",
              "$type": "pub.leaflet.blocks.website",
              "title": "LiteLLM",
              "description": "LLM Gateway (OpenAI Proxy) to manage authentication, loadbalancing, and spend tracking across 100+ LLMs. All in the OpenAI format."
            },
            "width": 360,
            "rotation": 14
          },
          {
            "x": 693,
            "y": 5857,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://github.com/jekalmin/extended_openai_conversation",
              "$type": "pub.leaflet.blocks.website",
              "title": "GitHub - jekalmin/extended_openai_conversation: Home Assistant custom component of conversation agent. It uses OpenAI to control your devices.",
              "description": "Home Assistant custom component of conversation agent. It uses OpenAI to control your devices. - jekalmin/extended_openai_conversation"
            },
            "width": 360
          },
          {
            "x": 174,
            "y": 5844,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://github.com/charmbracelet/crush",
              "$type": "pub.leaflet.blocks.website",
              "title": "GitHub - charmbracelet/crush: The glamourous AI coding agent for your favourite terminal 💘",
              "description": "The glamourous AI coding agent for your favourite terminal 💘 - charmbracelet/crush"
            },
            "width": 360
          },
          {
            "x": 261,
            "y": 6147,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 2,
              "facets": [],
              "plaintext": "Step 5: Make it work!"
            },
            "width": 360
          },
          {
            "x": 261,
            "y": 6294,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "My homeserver is fully managed via Podman Quadlets and Kubernetes manifests. If you're in a similar boat, you'll need to do some finagling to get your GPU into your container."
            },
            "width": 360
          },
          {
            "x": 681,
            "y": 6223,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html",
              "$type": "pub.leaflet.blocks.website",
              "title": "Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit",
              "description": "Install the NVIDIA GPU driver for your Linux distribution.\nNVIDIA recommends installing the driver by using the package manager for your distribution.\nFor information about installing the driver with a package manager, refer to\nthe NVIDIA Driver Installation Quickstart Guide.\nAlternatively, you can install the driver by downloading a .run installer."
            },
            "width": 360
          },
          {
            "x": 261,
            "y": 6584,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 50,
                    "byteStart": 25
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#code"
                    }
                  ]
                }
              ],
              "plaintext": "I also needed to include SecurityLabelDisable=true in my Quadlet, because of SELinux. Otherwise, my container was unable to access the GPU."
            },
            "width": 360
          },
          {
            "x": 261,
            "y": 6462,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [
                {
                  "index": {
                    "byteEnd": 94,
                    "byteStart": 78
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#code"
                    }
                  ]
                },
                {
                  "index": {
                    "byteEnd": 123,
                    "byteStart": 113
                  },
                  "features": [
                    {
                      "$type": "pub.leaflet.richtext.facet#code"
                    }
                  ]
                }
              ],
              "plaintext": "My only advice on this front is that I couldn't get GPU mounting to work with podman kube play. I had to write a .container Quadlet in the end."
            },
            "width": 360
          },
          {
            "x": 261,
            "y": 6729,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "If it's useful, here's the entire Quadlet I used to get vLLM running:"
            },
            "width": 360
          },
          {
            "x": 6,
            "y": 14,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 1,
              "facets": [],
              "plaintext": "Self-Hosting an LLM: A Scatter Pack"
            },
            "width": 612
          },
          {
            "x": 806,
            "y": 6651,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "src": "https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html",
              "$type": "pub.leaflet.blocks.website",
              "title": "podman-systemd.unit — Podman  documentation",
              "description": "name.container, name.volume, name.network, name.kube name.image, name.build name.pod, name.artifact"
            },
            "width": 360
          },
          {
            "x": 439,
            "y": 7459,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.header",
              "level": 2,
              "facets": [],
              "plaintext": "Step 6: Enjoy!"
            },
            "width": 360
          },
          {
            "x": 439,
            "y": 7510,
            "$type": "pub.leaflet.pages.canvas#block",
            "block": {
              "$type": "pub.leaflet.blocks.text",
              "facets": [],
              "plaintext": "I hope you found something useful or interesting in this mess. Best of luck!"
            },
            "width": 360
          }
        ]
      }
    ]
  },
  "description": "",
  "publishedAt": "2025-12-02T23:43:39.407Z"
}