Raw Record Source

{
  "$type": "site.standard.document",
  "content": {
    "$type": "site.standard.content.markdown",
    "text": "This might only work for a few months (or even days), but after spending a few hours trying to get open source LLMs to work on AMDGPUs inside Docker, I thought I'd share my findings. My GPU is an AMD 7900 XTX, and I was only able to make it work with [the `llama-cpp` Python bindings](https://llama-cpp-python.readthedocs.io/en/latest/). This should work for any [ROCm-supported AMDGPUs](https://rocm.docs.amd.com/en/latest/).\n\nThe first thing is to build and set up our Docker image. This is what I ended up with:\n\n```dockerfile\nFROM rocm/dev-ubuntu-22.04:5.7-complete\n\n# Environment variables\nENV GPU_TARGETS=gfx1100\nENV LLAMA_HIPBLAS=1\nENV CC=/opt/rocm/llvm/bin/clang\nENV CXX=/opt/rocm/llvm/bin/clang++\n\n# Install PyTorch and llama-cpp-python\nRUN pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm5.7\nRUN CMAKE_ARGS=\"-DLLAMA_HIPBLAS=1 -DAMDGPU_TARGETS=gfx1100\" pip3 install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n```\n\nYou might need to change `gfx1100` to your GPU's family/target.\n\nNext, we need to build the image:\n\n```bash\ndocker build --no-cache -t amd-llm .\n```\n\nNow we can run the image with this ~complex~ precise command:\n\n```bash\ndocker run -it --network=host --device=/dev/kfd \\\n    --device=/dev/dri --group-add=video --ipc=host \\\n    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \\\n    --entrypoint=bash \\\n    -v $(pwd):/models \\\n    amd-llm\n```\n\nThis will mount the current directory to `/models` inside the container and get you into a bash shell. Now it is time to check if the PyTorch installation is working and able to detect the GPU. 
These commands should work:\n\n```python\nimport torch\nprint(torch.cuda.is_available())\nprint(torch.cuda.get_device_name(torch.cuda.current_device()))\n\nprint(f\"CUDA available: {torch.cuda.is_available()}\")\nprint(f\"CUDA version: {torch.version.cuda}\")\nprint(f\"CUDA arch list: {torch.cuda.get_arch_list()}\")\nprint(f\"CUDNN available: {torch.backends.cudnn.is_available()}\")\nprint(f\"CUDNN version: {torch.backends.cudnn.version()}\")\n\n# Move a small tensor to GPU 0; this raises an error if the device is unusable\ntensor = torch.randn(2, 2)\nres = tensor.to(0)\n```\n\nIf everything is working, you should see something like this:\n\n```bash\nTrue\nRadeon RX 7900 XTX\nCUDA available: True\nCUDA version: None\nCUDA arch list: ['gfx900', 'gfx906', 'gfx908', 'gfx90a', 'gfx1030', 'gfx1100']\nCUDNN available: True\nCUDNN version: 2020000\n```\n\nNow, let's do some LLMing and put those graphics processing units to work with one of the latest models, Mistral!\n\nDownload the model:\n\n```bash\nwget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf\n```\n\nAnd with that, we should be ready to run the model with `llama-cpp-python`:\n\n```python\nfrom llama_cpp import Llama\n\nllm = Llama(\n    model_path=\"mistral-7b-instruct-v0.1.Q4_K_M.gguf\",\n    n_gpu_layers=-1,\n    main_gpu=1\n)\n\noutput = llm(\n    \"Q: Name the planets in the solar system. A: \",\n    max_tokens=2048,\n    stop=[\"Q:\", \"\\n\"],\n    echo=True,\n)\n\nprint(output)\n```\n\nFor me, it printed the following:\n\n```json\n{\n   \"id\":\"cmpl-f3887631-d106-43e8-97c0-5deee07dcd2f\",\n   \"object\":\"text_completion\",\n   \"created\":1698995742,\n   \"model\":\"mistral-7b-instruct-v0.1.Q4_K_M.gguf\",\n   \"choices\":[\n      {\n         \"text\":\"Q: Name the planets in the solar system. A: 1. Mercury, 2. Venus, 3. Earth, 4. Mars, 5. Jupiter, 6. Saturn, 7. Uranus, 8. 
Neptune\",\n         \"index\":0,\n         \"logprobs\":\"None\",\n         \"finish_reason\":\"stop\"\n      }\n   ],\n   \"usage\":{\n      \"prompt_tokens\":14,\n      \"completion_tokens\":46,\n      \"total_tokens\":60\n   }\n}\n```\n\n🎉 🎉 🎉\n\nIf you, like me, are wondering if the GPU was actually being used, you can install [nvtop](https://github.com/Syllo/nvtop) and execute it.\n\n![GPU usage](https://user-images.githubusercontent.com/1682202/280206444-6cbc9942-eb44-460f-a279-f80181847be0.png)\n\nFinally, after a few hours and a bunch of tweaks, the GPU was being used and Mistral 7B worked on my machine!",
    "version": "1.0"
  },
  "description": "This might only work for a few months (or even days), but after spending a few hours trying to get open source LLMs to work on AMDGPUs inside Docker, I thought I'd share my findings. My GPU is an AMD 7900 XTX, and I was only able to make it work with the llama-cpp Python bi...",
  "path": "/llms-with-amdgpu",
  "publishedAt": "2023-11-02T00:00:00.000Z",
  "site": "at://did:plc:4z5i7njrld66ew36htufcwry/site.standard.publication/3mo43d2tmt2ov",
  "textContent": "This might only work for a few months (or even days), but after spending a few hours trying to get open source LLMs to work on AMDGPUs inside Docker, I thought I'd share my findings. My GPU is an AMD 7900 XTX, and I was only able to make it work with the llama-cpp Python bindings. This should work for any ROCm-supported AMDGPUs.\n\nThe first thing is to build and set up our Docker image. This is what I ended up with:\n\nYou might need to change gfx1100 to your GPU's family/target.\n\nNext, we need to build the image:\n\nNow we can run the image with this ~complex~ precise command:\n\nThis will mount the current directory to /models inside the container and get you into a bash shell. Now it is time to check if the PyTorch installation is working and able to detect the GPU. These commands should work:\n\nIf everything is working, you should see something like this:\n\nNow, let's do some LLMing and put those graphics processing units to work with one of the latest models, Mistral!\n\nDownload the model:\n\nAnd with that, we should be ready to run the model with llama-cpp-python:\n\nFor me, it printed the following:\n\n🎉 🎉 🎉\n\nIf you, like me, are wondering if the GPU was actually being used, you can install nvtop and execute it.\n\nFinally, after a few hours and a bunch of tweaks, the GPU was being used and Mistral 7B worked on my machine!",
  "title": "Working with LLMs on AMDGPUs"
}