Self-Hosting Your First LLM for Enterprise: What Nobody Tells You Before You Start
I have done this setup process more times than I want to count. Every time I find something that the documentation skipped or assumed. This is the version I wish I had read first.
This covers deploying a production-ready self-hosted LLM inference server for an enterprise RAG use case. I am using Llama 3 8B with vLLM on a single A100 instance. Adjust for your hardware.
What you actually need before you touch a single command
GPU memory math first. Llama 3 8B in fp16 needs roughly 16GB VRAM just for model weights. Add KV cache for your expected concurrent sessions and you are pushing 35-40GB. One A100 80GB handles this comfortably. One A100 40GB will work but you are tight. Two A10Gs in tensor parallel will work. Know your numbers before provisioning.
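The arithmetic behind those numbers can be sketched roughly as follows. The architecture constants (32 layers, 8 KV heads, head dimension 128, about 8.03B parameters) are Llama 3 8B's published configuration. The workload of 20 sessions each filling an 8192-token context is an assumption picked for illustration, and the function names are hypothetical. Activations, CUDA context, and framework overhead are ignored.

```python
# Back-of-envelope VRAM estimate for Llama 3 8B in fp16.
# Ignores activations, CUDA context, and framework overhead.

GIB = 1024 ** 3

# Llama 3 8B architecture (from the published model config)
N_PARAMS = 8.03e9
N_LAYERS = 32
N_KV_HEADS = 8        # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # fp16


def weight_bytes() -> float:
    """Bytes needed to hold the model weights."""
    return N_PARAMS * BYTES_PER_VALUE


def kv_bytes_per_token() -> int:
    """One key and one value vector per KV head, per layer."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def estimate_gib(sessions: int, tokens_per_session: int) -> float:
    """Weights plus KV cache for the given workload, in GiB."""
    kv = sessions * tokens_per_session * kv_bytes_per_token()
    return (weight_bytes() + kv) / GIB


if __name__ == "__main__":
    print(f"weights: {weight_bytes() / GIB:.1f} GiB")
    print(f"KV cache per token: {kv_bytes_per_token() // 1024} KiB")
    # Assumed workload: 20 concurrent sessions, each with a full 8192-token context
    print(f"20 sessions x 8192 tokens: {estimate_gib(20, 8192):.1f} GiB")
```

Each token costs 128 KiB of KV cache under these constants, so one session that fills an 8192-token context takes 1 GiB. Real sessions rarely fill the whole context, which is why this is an upper bound rather than a prediction.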
Your network topology matters. The inference server needs to reach your vector database and your application layer. If those are in a private VPC, your inference server needs to be in the same VPC or peered. Setting this up after the fact while production is waiting is miserable.
The actual setup
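# Create a dedicated inference user (as root); do not run the server itself as root
useradd -m -s /bin/bash inference
# Give that user a writable model directory before switching to it
mkdir -p /opt/models && chown inference: /opt/models
su - inference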
# CUDA needs to be installed on the host, check first
nvidia-smi
nvcc --version
# Install vLLM (this takes a while, get coffee)
pip install vllm
# Test that your GPU is visible to Python
python3 -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
If that last line fails, your CUDA setup is wrong and nothing else matters until you fix it.
# Pull the model (you need a HuggingFace account, a token, and approved access to the gated Llama 3 repo)
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir /opt/models/llama3-8b-instruct
# Start the server
python3 -m vllm.entrypoints.openai.api_server \
    --model /opt/models/llama3-8b-instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --served-model-name llama3
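The --gpu-memory-utilization 0.85 setting is important. It caps the fraction of GPU memory vLLM claims for itself, and everything else on the card has to fit in the remainder. Leave headroom. I have seen deployments set this to 0.95 and then crash under load when KV cache allocation spills over.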
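To put numbers on that headroom, here is a rough sketch. It assumes the flag is a simple fraction of total GPU memory and ignores allocator overhead. The function name is hypothetical.

```python
def headroom_gib(total_gib: float, utilization: float) -> float:
    """GiB left outside vLLM's memory budget for everything else on the GPU."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gib * (1.0 - utilization)


if __name__ == "__main__":
    for util in (0.85, 0.90, 0.95):
        print(f"A100 80GB at {util:.2f}: {headroom_gib(80, util):.1f} GiB headroom")
```

On an 80GB card the difference between 0.85 and 0.95 is roughly 12 GiB of slack versus 4 GiB.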
The thing that will break in production that did not break in testing
Concurrent requests. During testing you send one request, it works, you move on. Under production load with ten concurrent users, the KV cache fills, requests queue or get preempted, and latency spikes.
Add this to your startup command:
--max-num-seqs 32 \
--max-num-batched-tokens 16384
Tune these numbers based on your actual concurrency. Too high and you run out of memory. Too low and you are leaving throughput on the table. Run a load test with realistic concurrency before going live.
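A load test does not need special tooling to be useful. The sketch below is one minimal way to do it with the standard library, assuming the server from this guide is reachable at localhost:8000 and serves the model as llama3. The URL, the max_tokens value, and all function names are illustrative assumptions, and the p95 figure is a simple rank-based approximation.

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Assumed endpoint and model name; match them to your own deployment.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama3"


def send_chat_request(prompt: str) -> None:
    """POST one chat completion and wait for the full response."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()


def timed(send: Callable[[str], None], prompt: str) -> float:
    """Return the wall-clock seconds one request took."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start


def load_test(send: Callable[[str], None], prompts: list[str], concurrency: int) -> dict:
    """Send all prompts with up to `concurrency` requests in flight; return latency stats."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda p: timed(send, p), prompts))
    return {
        "requests": len(latencies),
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],  # approximate p95
        "max_s": latencies[-1],
    }
```

Calling load_test(send_chat_request, prompts, concurrency) at a few concurrency levels, for example 1, 10, and 32, with prompts that resemble your real RAG traffic shows where latency starts to climb. That knee is a reasonable starting point for --max-num-seqs.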
Health check and process management
Do not run this as a foreground process. Use systemd:
# /etc/systemd/system/llm-inference.service
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
Type=simple
User=inference
# Assumes the system interpreter with vllm pip-installed for the inference user; use your venv's python if you have one
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model /opt/models/llama3-8b-instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --served-model-name llama3
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
# Run these as root, not as the inference user
systemctl daemon-reload
systemctl enable llm-inference
systemctl start llm-inference
systemctl status llm-inference
The health check endpoint is at http://your-server:8000/health. Point your load balancer's health check at it.
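The same endpoint is handy in deploy scripts, since the server can take a while to load weights before it answers. Below is a minimal polling sketch, assuming the server is on localhost:8000. The URL, the timeouts, and the function names are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumed address; replace with your server's host and port.
HEALTH_URL = "http://localhost:8000/health"


def probe_health(url: str = HEALTH_URL, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = probe_health,
    deadline_s: float = 300.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll `probe` until it succeeds or `deadline_s` elapses."""
    stop_at = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() + interval_s > stop_at:
            return False
        time.sleep(interval_s)
```

Running wait_until_healthy() right after systemctl start gives a deploy script a clear yes or no before traffic is routed to the box.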
Connecting your application
vLLM serves an OpenAI-compatible API, so your existing OpenAI SDK calls work with a base URL change:
from openai import OpenAI

client = OpenAI(
    base_url="http://your-inference-server:8000/v1",
    api_key="not-needed-but-required-by-sdk",
)
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "test"}],
)
If your existing RAG code uses the OpenAI SDK, the migration is a base URL change plus setting the model name to whatever you passed to --served-model-name. That is the point.
Two things I want to flag before you sign off on this as production-ready. First, add authentication in front of the inference server. vLLM has no auth by default. Put nginx with API key validation in front of it before anything touches it from outside your private network. Second, set up GPU monitoring. Watch VRAM utilization, KV cache hit rate, and request queue depth. These three metrics will tell you everything about whether your deployment is healthy or about to fall over.
The rest is just tuning. But get those two things in place before you call it production.
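For the monitoring point, vLLM's OpenAI-compatible server exposes Prometheus-format metrics at /metrics. The sketch below reads them without a full Prometheus stack. It covers the vLLM side only, so VRAM would still come from nvidia-smi or your GPU exporter. The metric names in WATCHED are assumptions that vary between vLLM versions, so check your own /metrics output. The URL and function names are hypothetical too.

```python
import urllib.request

# Assumed address; replace with your server's host and port.
METRICS_URL = "http://localhost:8000/metrics"

# Assumed metric names; verify against your vLLM version's /metrics output.
WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text format into {metric name: value}.

    Labels are dropped, so series that differ only by label overwrite
    each other. Fine for a single-model server, too crude beyond that.
    """
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:
            name, _, value_part = line.partition(" ")
        fields = value_part.split()
        if not fields:
            continue
        try:
            metrics[name] = float(fields[0])
        except ValueError:
            continue  # not a numeric sample line
    return metrics


def fetch_watched(url: str = METRICS_URL) -> dict[str, float]:
    """Fetch /metrics and keep only the watched series that are present."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        parsed = parse_prometheus_text(resp.read().decode("utf-8"))
    return {name: parsed[name] for name in WATCHED if name in parsed}
```

A waiting-request count that keeps growing while cache usage sits near 1.0 is the pattern to alert on.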