Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifzzboqwyhjcje7fipccyjxuy6rwnbk7hueb34liuoophayxde5au",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp3wyw3gpps2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreid2w4bqzy7asys5pgux72jsujpfnpi5pkyclctcbup3qhljog3tti"
    },
    "mimeType": "image/webp",
    "size": 340962
  },
  "path": "/the-persistent-engineer/your-first-llm-api-on-kubernetes-from-model-to-curl-request-4l1j",
  "publishedAt": "2026-06-25T07:44:50.000Z",
  "site": "https://dev.to",
  "tags": [
    "kubernetes",
    "llm",
    "ai",
    "devops",
    "Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM",
    "Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes",
    "Part 3: How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?",
    "Part 4: Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes",
    "Part 5: OpenAI Already Told Us the Kubernetes Scaling Story, Most People Just Did Not Read It Closely",
    "Part 4",
    "https://huggingface.co/docs/hub/security-tokens",
    "vLLM Kubernetes docs",
    "vLLM OpenAI-compatible server docs"
  ],
  "textContent": "> **Series links**\n>\n>   * Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM\n>   * Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes\n>   * Part 3: How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?\n>   * Part 4: Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes\n>   * Part 5: OpenAI Already Told Us the Kubernetes Scaling Story, Most People Just Did Not Read It Closely\n>\n\n\nSo far in this series, we have covered the mental model, tokens, model size, GPU node readiness, and OpenAI's Kubernetes scaling lessons.\n\nNow we should run something.\n\nIn this part, we will deploy an actual model on a Kubernetes GPU node, expose it as an OpenAI-compatible API, and call it with `curl`. The model is:\n\n\n\n    Qwen/Qwen2.5-1.5B-Instruct\n\n\nThat model is small enough for a first single-GPU walkthrough, but still behaves like a real chat model. If your GPU is very small, try `Qwen/Qwen2.5-0.5B-Instruct`. If you have more memory and want a bigger test, try `Qwen/Qwen2.5-7B-Instruct`.\n\nDo not start with the biggest model you can name. Start with a model your node can actually load. The goal here is not benchmark glory. The goal is to get from Kubernetes GPU capacity to a working LLM API request.\n\n##  What vLLM is doing in this setup\n\nKubernetes is not serving the model by itself. Kubernetes schedules the pod, gives it networking, mounts the Secret, and asks the NVIDIA device plugin for a GPU. After that, the model server inside the container has to do the LLM-specific work.\n\nvLLM is that model server in this walkthrough. It downloads the model weights, loads them into GPU memory, starts an HTTP server, accepts OpenAI-compatible requests, batches work internally, runs the model, and streams or returns generated tokens.\n\nThat distinction matters. The Kubernetes Deployment does not magically become an LLM API because it has `nvidia.com/gpu: 1`. 
It becomes an LLM API because the container starts a serving engine that knows how to load a Hugging Face model and expose routes like `/v1/chat/completions`.\n\nvLLM is a good first serving engine because it hides a lot of ugly details without hiding the shape from you. You still see the model name, GPU request, port, token Secret, logs, Service, and curl request. But you do not have to write your own batching loop, tokenizer path, HTTP server, or OpenAI-compatible API wrapper just to prove the deployment works.\n\nvLLM is the engine. The thing we care about is the model API it serves.\n\n##  Prerequisites\n\nI am assuming you already completed the GPU node setup from Part 4. That means the NVIDIA driver stack, container runtime, GPU Operator or NVIDIA device plugin, labels, and basic GPU checks are already working.\n\nWe are not reinstalling the GPU Operator here. Before deploying the model, confirm Kubernetes can see GPU capacity:\n\n\n\n    kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu\n\n\nA useful output looks like this:\n\n\n\n    NAME            GPU\n    gpu-worker-01   1\n\n\nIf the GPU column is empty, `<none>`, or missing, stop here. Kubernetes cannot schedule this workload until the node advertises `nvidia.com/gpu`.\n\n##  Create a Hugging Face token first\n\nEven though `Qwen/Qwen2.5-1.5B-Instruct` is public, we will still use a Hugging Face token. That is intentional.\n\nReal teams often start with a public model and later swap to a gated model, private model, licensed model, or organization repository. If the token path is already part of the Deployment, that swap is much less annoying.\n\nCreate a token first:\n\n  1. Open the official Hugging Face token docs: https://huggingface.co/docs/hub/security-tokens\n  2. Create a token with read access.\n  3. Copy the token value and keep it ready.\n\n\n\nFrom this point onward, I will assume you have the token value. Do not paste it into Git. 
Do not put it directly in a Deployment manifest. Put it in a Kubernetes Secret.\n\n##  Create the namespace and Secret\n\nKeep the first LLM workload out of the default namespace:\n\n\n\n    kubectl create namespace llm-demo\n\n\nSet the token in your shell:\n\n\n\n    export HF_TOKEN=\"hf_your_token_here\"\n\n\nCreate the Secret:\n\n\n\n    kubectl create secret generic hf-token \\\n      -n llm-demo \\\n      --from-literal=HF_TOKEN=\"${HF_TOKEN}\"\n\n\nCheck that it exists:\n\n\n\n    kubectl get secret hf-token -n llm-demo\n\n\nExpected shape:\n\n\n\n    NAME       TYPE     DATA   AGE\n    hf-token   Opaque   1      10s\n\n\nExistence is enough. Do not print the token back unless you have a specific reason.\n\n##  Deploy the model API\n\nvLLM gives us the model server and the OpenAI-compatible HTTP API. The Kubernetes pattern is documented in the vLLM Kubernetes docs, and the API shape is documented in the vLLM OpenAI-compatible server docs.\n\nCreate `qwen-vllm.yaml`:\n\n\n\n    apiVersion: apps/v1\n    kind: Deployment\n    metadata:\n      name: qwen-vllm\n      namespace: llm-demo\n    spec:\n      replicas: 1\n      selector:\n        matchLabels:\n          app: qwen-vllm\n      template:\n        metadata:\n          labels:\n            app: qwen-vllm\n        spec:\n          containers:\n            - name: vllm\n              image: vllm/vllm-openai:latest\n              imagePullPolicy: IfNotPresent\n              command:\n                - vllm\n                - serve\n                - Qwen/Qwen2.5-1.5B-Instruct\n              args:\n                - --host\n                - 0.0.0.0\n                - --port\n                - \"8000\"\n              ports:\n                - containerPort: 8000\n                  name: http\n              env:\n                - name: HF_TOKEN\n                  valueFrom:\n                    secretKeyRef:\n                      name: hf-token\n                      key: HF_TOKEN\n                - name: 
HUGGING_FACE_HUB_TOKEN\n                  valueFrom:\n                    secretKeyRef:\n                      name: hf-token\n                      key: HF_TOKEN\n              resources:\n                limits:\n                  nvidia.com/gpu: 1\n              volumeMounts:\n                - name: shm\n                  mountPath: /dev/shm\n          volumes:\n            - name: shm\n              emptyDir:\n                medium: Memory\n                sizeLimit: 2Gi\n    ---\n    apiVersion: v1\n    kind: Service\n    metadata:\n      name: qwen-vllm\n      namespace: llm-demo\n    spec:\n      selector:\n        app: qwen-vllm\n      ports:\n        - name: http\n          port: 8000\n          targetPort: 8000\n\n\nA few details matter.\n\nThe pod requests one GPU with `nvidia.com/gpu: 1`. That is what makes this schedulable as a GPU workload. The token appears as both `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN` because different libraries and examples use different names. Both point to the same Secret value.\n\nThe `/dev/shm` mount is there because model servers often use shared memory heavily. Tiny default shared memory limits inside containers can create strange failures. A memory-backed `emptyDir` keeps the first deployment boring.\n\nWhen this pod starts, vLLM does roughly five things. It reads the model name from the command, uses the Hugging Face token to access the repository, downloads or reuses the model files, initializes the tokenizer and model runtime, then starts the API server on port `8000`. Only after that finishes is the API useful.\n\nFor production, pin the `vllm/vllm-openai` image version instead of using `latest`. 
For this walkthrough, `latest` keeps the example readable.\n\nApply it:\n\n\n\n    kubectl apply -f qwen-vllm.yaml\n\n\nExpected output:\n\n\n\n    deployment.apps/qwen-vllm created\n    service/qwen-vllm created\n\n\n##  Watch startup properly\n\nWatch the pod:\n\n\n\n    kubectl get pods -n llm-demo -w\n\n\nYou may see:\n\n\n\n    NAME                         READY   STATUS              RESTARTS   AGE\n    qwen-vllm-6c9f7d8c9d-x9v2m   0/1     Pending             0          3s\n    qwen-vllm-6c9f7d8c9d-x9v2m   0/1     ContainerCreating   0          15s\n    qwen-vllm-6c9f7d8c9d-x9v2m   1/1     Running             0          2m\n\n\nDo not celebrate too early.\n\n`Running` is not the same as ready to serve. Because this Deployment defines no readiness probe, Kubernetes reports `1/1` as soon as the container process starts. At that point the model may still be downloading, CUDA may still be initializing, weights may still be loading, or vLLM may still be preparing the serving engine. The first start is usually slower because the model weights have to be downloaded.\n\nFollow the logs:\n\n\n\n    kubectl logs -n llm-demo -f deployment/qwen-vllm\n\n\nYou are looking for the server to finish loading the model and listen on port `8000`. The exact log lines vary by vLLM version. If logs are still busy, wait. If they show a clear error, jump to the troubleshooting table below.\n\n##  Port-forward the Service\n\nFor the first test, do not create public ingress. Do not add DNS. Do not put it behind an internet-facing load balancer.\n\nUse port-forward:\n\n\n\n    kubectl port-forward -n llm-demo svc/qwen-vllm 8000:8000\n\n\nKeep that command running. 
You should see:\n\n\n\n    Forwarding from 127.0.0.1:8000 -> 8000\n    Forwarding from [::1]:8000 -> 8000\n\n\nNow local port `8000` forwards to the Kubernetes Service, which forwards to the vLLM pod.\n\n##  Send the first curl request\n\nIn another terminal, call the OpenAI-compatible chat endpoint:\n\n\n\n    curl http://127.0.0.1:8000/v1/chat/completions \\\n      -H \"Content-Type: application/json\" \\\n      -d '{\n        \"model\": \"Qwen/Qwen2.5-1.5B-Instruct\",\n        \"messages\": [\n          {\n            \"role\": \"system\",\n            \"content\": \"You are a concise Kubernetes assistant.\"\n          },\n          {\n            \"role\": \"user\",\n            \"content\": \"Explain what a Kubernetes Service does in two sentences.\"\n          }\n        ],\n        \"max_tokens\": 120,\n        \"temperature\": 0.2\n      }'\n\n\n###  Why does the curl request include the model name again?\n\nThis part looks redundant at first:\n\n\n\n    \"model\": \"Qwen/Qwen2.5-1.5B-Instruct\"\n\n\nWe already gave the model name to `vllm serve` in the Deployment. That tells the server which model to load into memory. The `model` field in the curl request is part of the OpenAI-compatible API contract. Clients send it so the server knows which served model the request is targeting.\n\nIn this article, the server has only one model, so the value feels repetitive. In real systems, the same API style may sit behind routers, gateways, aliases, multiple deployments, or clients that can switch between models. Keeping the field means curl, OpenAI SDK code, and later gateway setup all follow the same shape.\n\nFor the first run, keep the value identical to the model passed to `vllm serve`. Later, vLLM can expose a different client-facing name with a served model name alias, but that is extra complexity we do not need yet.\n\nA successful response will be JSON. 
The exact wording will differ, but the shape should look familiar:\n\n\n\n    {\n      \"object\": \"chat.completion\",\n      \"model\": \"Qwen/Qwen2.5-1.5B-Instruct\",\n      \"choices\": [\n        {\n          \"message\": {\n            \"role\": \"assistant\",\n            \"content\": \"A Kubernetes Service provides a stable network endpoint for a set of Pods, even as those Pods are created, deleted, or replaced. It selects Pods using labels and forwards traffic to the matching backends.\"\n          }\n        }\n      ]\n    }\n\n\nThat is the moment the deployment becomes real. The request reached your model server, vLLM handled the OpenAI-compatible route, the model generated text, and the response came back through Kubernetes. Not a diagram, not a promise. A model answered through an API running inside the cluster.\n\n##  Swapping the model\n\nTo try the smaller model, change the served model:\n\n\n\n    command:\n      - vllm\n      - serve\n      - Qwen/Qwen2.5-0.5B-Instruct\n\n\nThen change the curl body too:\n\n\n\n    \"model\": \"Qwen/Qwen2.5-0.5B-Instruct\"\n\n\nFor a larger test, use `Qwen/Qwen2.5-7B-Instruct` in both places.\n\nFor a first run, keep the model name in the request identical to the model name served by vLLM. You can configure aliases later. Today, remove avoidable debugging.\n\n##  What happened\n\nKubernetes scheduled a pod onto a node that advertises `nvidia.com/gpu`. The NVIDIA device plugin made the GPU available to the container. The Hugging Face token let the container pull the model. vLLM loaded the model onto the GPU and started an HTTP server on port `8000`. The Service gave the pod a stable in-cluster endpoint. Port-forward gave us a safe local path. Curl proved the API could answer through `/v1/chat/completions`.\n\nThat is the basic loop every LLM platform needs before it becomes fancy:\n\n  1. Can Kubernetes schedule the workload onto a GPU?\n  2. Can the container see the GPU?\n  3. 
Can the model server download and load the model?\n  4. Can the API route accept a request?\n  5. Can the model generate a response?\n  6. Can you observe failures when any of those steps break?\n\n\n\nIf this loop is unreliable, autoscaling and gateways will not save you. They will only hide the problem for a while.\n\n##  Troubleshooting\n\nSymptom | What it usually means | What to check\n---|---|---\nPod stuck in `Pending` | Kubernetes cannot find a matching node | Run `kubectl describe pod -n llm-demo <pod-name>` and read scheduler events. Confirm GPU capacity exists.\n`nvidia.com/gpu` missing | GPU Operator or device plugin path is broken | Re-run the GPU visibility command and go back to Part 4 before continuing.\nHugging Face download fails | Token is missing, wrong, expired, or lacks model access | Recreate the token, update the Secret, then run `kubectl rollout restart deployment/qwen-vllm -n llm-demo`.\nCUDA initialization error | Driver, runtime, image, or node stack mismatch | Check pod logs, GPU Operator status, driver version, and a simple CUDA test pod.\nPod crashes with OOM | Model or runtime needs more memory | Try `Qwen/Qwen2.5-0.5B-Instruct`, use a larger GPU, or tune model/runtime settings later.\n`curl: connection refused` | Server is not ready or port-forward is not running | Check logs, keep port-forward running, and verify `kubectl get svc -n llm-demo`.\nModel name mismatch | Request model differs from served model | Make the curl `model` value match the `vllm serve` model.\n\nThe most common mistake is treating `Running` as the finish line. It is not. For model serving, readiness is tied to download, GPU initialization, model loading, and server startup. Watch logs, not just pod phase.\n\n##  Clean up\n\nIf this was only a test, delete the namespace:\n\n\n\n    kubectl delete namespace llm-demo\n\n\nThat removes the Deployment, Service, and Secret. 
If you keep experimenting, remember that a GPU pod can hold expensive capacity even when nobody is sending requests.\n\n##  What we are not covering yet\n\nThis article stops at the first working API call. We are not covering public ingress, authentication, autoscaling, multi-GPU serving, quantization, production monitoring, or cost optimization yet.\n\nThose are not tiny details. Public ingress brings TLS, routing, limits, and abuse controls. Authentication decides who can call the model. Autoscaling needs LLM-specific signals, not only CPU. Multi-GPU serving changes scheduling and failure behavior. Quantization changes memory and quality tradeoffs. Monitoring needs token, latency, GPU, queue, and model-server metrics.\n\nBut all of that comes after this basic path works.\n\nA Kubernetes LLM platform starts becoming real when a model can load, serve, and answer through an API that other systems can call. Today we got there with one Deployment, one Service, one Secret, and one curl request.\n\nIn the next parts, we can make this less like a demo and more like a platform: readiness, observability, routing, auth, scaling, and the failure paths that show up once real users start sending prompts.\n\nIf you are following the series, subscribe and keep the manifest from this article handy. It is a good checklist for the first LLM-on-Kubernetes question: can we actually serve a model and call it?",
  "title": "Your First LLM API on Kubernetes: From Model to Curl Request"
}