{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreievd2nf64dbf4qnuv2x5loyiaaoof6a63v7pelzxps7vsxjf5cmdm",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moini5femak2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreiaf75wxjmtooowpzxqghflyrzxggvursg6v32xjh5ac5b2akt7anq"
},
"mimeType": "image/webp",
"size": 68542
},
"path": "/nolanvale/self-hosting-your-first-llm-for-enterprise-what-nobody-tells-you-before-you-start-1d6f",
"publishedAt": "2026-06-17T15:21:05.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"infrastructure",
"llm",
"tutorial"
],
"textContent": "I have done this setup process more times than I want to count. Every time I find something that the documentation skipped or assumed. This is the version I wish I had read first.\n\nThis covers deploying a production-ready self-hosted LLM inference server for an enterprise RAG use case. I am using Llama 3 8B with vLLM on a single A100 instance. Adjust for your hardware.\n\n**What you actually need before you touch a single command**\n\nGPU memory math first. Llama 3 8B in fp16 needs roughly 16GB VRAM just for model weights. Add KV cache for your expected concurrent sessions and you are pushing 35-40GB. One A100 80GB handles this comfortably. One A100 40GB will work but you are tight. Two A10Gs in tensor parallel will work. Know your numbers before provisioning.\n\nYour network topology matters. The inference server needs to reach your vector database and your application layer. If those are in a private VPC, your inference server needs to be in the same VPC or peered. Setting this up after the fact while production is waiting is miserable.\n\n**The actual setup**\n\n\n\n # Create dedicated inference user, do not run this as root\n useradd -m -s /bin/bash inference\n su - inference\n\n # CUDA needs to be installed on the host, check first\n nvidia-smi\n nvcc --version\n\n # Install vLLM (this takes a while, get coffee)\n pip install vllm\n\n # Test that your GPU is visible to Python\n python3 -c \"import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))\"\n\n\nIf that last line fails, your CUDA setup is wrong and nothing else matters until you fix it.\n\n\n\n # Pull the model (you need a HuggingFace account and token for Llama 3)\n huggingface-cli login\n huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \\\n --local-dir /opt/models/llama3-8b-instruct\n\n # Start the server\n python -m vllm.entrypoints.openai.api_server \\\n --model /opt/models/llama3-8b-instruct \\\n --host 0.0.0.0 \\\n --port 8000 \\\n 
--max-model-len 8192 \\\n --gpu-memory-utilization 0.85 \\\n --served-model-name llama3\n\n\nThe `--gpu-memory-utilization 0.85` setting is important. It is the fraction of VRAM that vLLM is allowed to claim for weights, activations, and KV cache. Leave headroom. I have seen deployments set this to 0.95 and then crash under load when KV cache allocation spills over.\n\n**The thing that will break in production that did not break in testing**\n\nConcurrent requests. During testing, you send one request, it works, and you move on. Under production load with ten concurrent users, the KV cache fills and latency spikes.\n\nAdd this to your startup command:\n\n\n\n --max-num-seqs 32 \\\n --max-num-batched-tokens 16384\n\n\n`--max-num-seqs` caps how many sequences are batched in one scheduler step, and `--max-num-batched-tokens` caps the total tokens in that step. Tune these numbers based on your actual concurrency. Too high and you run out of memory. Too low and you are leaving throughput on the table. Run a load test with realistic concurrency before going live.\n\n**Health check and process management**\n\nDo not run this as a foreground process. Use systemd:\n\n\n\n # /etc/systemd/system/llm-inference.service\n [Unit]\n Description=vLLM Inference Server\n After=network.target\n\n [Service]\n Type=simple\n User=inference\n # Use the interpreter vLLM was installed with; confirm the path with: which python3\n ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \\\n --model /opt/models/llama3-8b-instruct \\\n --host 0.0.0.0 \\\n --port 8000 \\\n --max-model-len 8192 \\\n --gpu-memory-utilization 0.85 \\\n --served-model-name llama3\n Restart=on-failure\n RestartSec=10\n\n [Install]\n WantedBy=multi-user.target\n\n\n\n # Run these as root (or with sudo)\n systemctl enable llm-inference\n systemctl start llm-inference\n systemctl status llm-inference\n\n\nThe health check endpoint is at `http://your-server:8000/health`. 
Point your load balancer's health check at it.\n\n**Connecting your application**\n\nvLLM serves an OpenAI-compatible API, so your existing OpenAI SDK calls work with a base URL change:\n\n\n\n from openai import OpenAI\n\n client = OpenAI(\n base_url=\"http://your-inference-server:8000/v1\",\n api_key=\"not-needed-but-required-by-sdk\"\n )\n\n # model must match the --served-model-name passed to vLLM\n response = client.chat.completions.create(\n model=\"llama3\",\n messages=[{\"role\": \"user\", \"content\": \"test\"}]\n )\n\n\nIf your existing RAG code uses the OpenAI SDK, the switch is a one-line change to the base URL. That is the point.\n\nTwo things I want to flag before you sign off on this as production-ready. First, add authentication in front of the inference server. vLLM does not enforce authentication by default. Put nginx with API key validation in front of it before anything touches it from outside your private network. Second, set up GPU monitoring. vLLM exposes Prometheus metrics at `/metrics` on the same port. Watch VRAM utilization, KV cache usage, and request queue depth. These three metrics tell you whether your deployment is healthy or about to fall over.\n\nThe rest is just tuning. But get those two things in place before you call it production.",
"title": "Self-Hosting Your First LLM for Enterprise: What Nobody Tells You Before You Start"
}