Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiairocfm5bqlumxnvy5c77xagbkdc3t4x6vnz7pjojcun4fvaosje",
    "commit": {
      "cid": "bafyreibqacfxbadd6x45jgmoxnvwoymlbp47phc7rnhj2l5veeerrlhe4i",
      "rev": "3mnqeuegi2p23"
    },
    "uri": "at://did:plc:5g2hkj2od4zr3tlpb2su364e/app.bsky.feed.post/3mnqeuegb7x23",
    "validationStatus": "valid"
  },
  "content": {
    "$type": "pub.lemma.blog.entry",
    "content": "# NVIDIA CUDA on Ubuntu 26.04\r\n\r\n*based on [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)*\r\n\r\nIf you have an NVIDIA GPU you can use CUDA, a parallel computing platform for writing high-performance applications using thousands of parallel threads on GPUs. Everything in this article also works on Ubuntu flavors like Kubuntu.\r\n\r\n## Prerequisites\r\n\r\nThe first step is to install the proprietary **NVIDIA driver**. Check which driver is recommended with:\r\n```sh\r\nubuntu-drivers devices\r\n```\r\nYou can automatically install it with:\r\n```sh\r\nsudo ubuntu-drivers install\r\n```\r\nOptionally, list the installed NVIDIA packages to confirm that the recommended driver was installed:\r\n```sh\r\ndpkg -l | grep nvidia\r\n```\r\n\r\nYou'll also need the **CUDA Toolkit**. Its website offers downloads for older Ubuntu releases, but on 26.04 you can install it directly with:\r\n```sh\r\nsudo apt install nvidia-cuda-toolkit\r\n```\r\n\r\nFor profiling our application, we'll need [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems/get-started). Download and install the package for your system, which is probably the Linux x86_64 `.deb` installer.\r\n\r\nNow download [this wrapper script](https://github.com/harrism/nsys_easy/tree/main) for the `nsys` command that makes profiling your CUDA app easier. Put it somewhere on your `$PATH` or in the directory where you'll be working.\r\n\r\n## Key Concepts\r\n- A *kernel* in CUDA is a function that the GPU can run. You declare it with the `__global__` specifier.\r\n- _Unified Memory_ in CUDA provides a single memory space accessible by all CPUs and GPUs in your system. An allocation in Unified Memory returns a pointer that can be accessed by **_host_** (CPU) code or **_device_** (GPU) code. If we know what memory is needed by the kernel, we can prefetch it to make sure the data is on the GPU before the kernel needs it. 
This avoids page faults during the kernel run, where the hardware would otherwise have to migrate each page to GPU memory as the fault occurs.\r\n- CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors (SMs). Each SM can run multiple concurrent thread blocks, but each block runs on a single SM. Together, the **_blocks_** of parallel **_threads_** make up what is known as the **_grid_**.\r\n- When launching a kernel you specify the number of thread blocks and the block size with the `<<<A, B>>>` notation, meaning:\r\n  - $$ A $$ thread blocks\r\n  - $$ B $$ threads in a block (should be a multiple of $$ 32 $$)\r\n- CUDA C++ provides built-in variables that let kernels get the indices of the running threads.\r\n  - `gridDim.x` is the number of blocks in the grid.\r\n  - `blockIdx.x` is the index of the current thread block in the grid.\r\n  - `blockDim.x` is the number of threads in the block.\r\n  - `threadIdx.x` is the index of the current thread within its block.\r\n- So to process an array of $$ 2^{20} $$ elements with a block size of $$ 256 $$ threads, we need $$ 2^{20}/256 = 4096 $$ blocks.\r\n\r\n![img](https://developer-blogs.nvidia.com/wp-content/uploads/2017/01/Even-easier-intro-to-CUDA-image.png)\r\n_Figure: Grid, Block and Thread indexing in CUDA kernels (one-dimensional) from the NVIDIA blog_ \r\n\r\n## Example\r\nLet's use parallel processing on the GPU to add the elements of two arrays. Save this program in a file called **`add_grid.cu`**:\r\n```cpp\r\n#include <iostream>\r\n#include <math.h>\r\n\r\n// CUDA kernel function to add the elements of two arrays. 
\r\n__global__\r\nvoid add(int n, float *x, float *y)\r\n{\r\n  int index = blockIdx.x * blockDim.x + threadIdx.x;\r\n  int stride = blockDim.x * gridDim.x; // total number of threads in the grid\r\n  for (int i = index; i < n; i += stride)\r\n    y[i] = x[i] + y[i];\r\n}\r\n\r\nint main(void)\r\n{\r\n  int N = 1<<20; // ~1M elements (2^20 or 1,048,576 to be exact)\r\n  float *x, *y;\r\n\r\n  // Allocate Unified Memory – accessible from CPU or GPU\r\n  cudaMallocManaged(&x, N*sizeof(float));\r\n  cudaMallocManaged(&y, N*sizeof(float));\r\n\r\n  // Initialize x and y arrays on the host\r\n  for (int i = 0; i < N; i++) {\r\n    x[i] = 1.0f;\r\n    y[i] = 2.0f;\r\n  }\r\n\r\n  // Prefetch the x and y arrays to the GPU (device 0, default stream)\r\n  cudaMemPrefetchAsync(x, N*sizeof(float), 0, 0);\r\n  cudaMemPrefetchAsync(y, N*sizeof(float), 0, 0);\r\n\r\n  // Run kernel on the ~1M elements on the GPU.\r\n  // Number of blocks needed to get at least N threads: divide N by the\r\n  // block size, rounding up in case N is not a multiple of blockSize.\r\n  int blockSize = 256;\r\n  int numBlocks = (N + blockSize - 1) / blockSize;\r\n  add<<<numBlocks, blockSize>>>(N, x, y);\r\n\r\n  // Wait for GPU to finish before accessing on host\r\n  cudaDeviceSynchronize();\r\n\r\n  // Check for errors (all values should be 3.0f)\r\n  float maxError = 0.0f;\r\n  for (int i = 0; i < N; i++) {\r\n    maxError = fmax(maxError, fabs(y[i]-3.0f));\r\n  }\r\n  std::cout << \"Max error: \" << maxError << std::endl;\r\n\r\n  // Free memory\r\n  cudaFree(x);\r\n  cudaFree(y);\r\n  return 0;\r\n}\r\n```\r\n\r\nCompile it:\r\n```sh\r\nnvcc add_grid.cu -o add_cuda\r\n```\r\nProfile it:\r\n```sh\r\n./nsys_easy ./add_cuda\r\n```\r\n\r\nThat's it! On my NVIDIA RTX 5070 Ti the `add` operation for ~1M elements takes only 21 microseconds (µs) thanks to the parallelism on the GPU and the memory prefetching. 
By comparison, a version of this code that uses only 1 thread on the GPU (no parallelism) and no memory prefetching takes ~85,000 µs, so the parallel version is roughly 4,000 times faster.\r\n\r\nWhat results do you get on your machine?",
    "license": "CC BY"
  },
  "path": "/3mkimumjuc22o",
  "publishedAt": "2026-04-27T17:47:25.984Z",
  "site": "at://did:plc:5g2hkj2od4zr3tlpb2su364e/site.standard.publication/3mmhg6ulmlszy",
  "textContent": "NVIDIA CUDA on Ubuntu 26.04 \n \nbased on An Even Easier Introduction to CUDA \n \nIf you have an NVIDIA GPU you can use CUDA, a parallel computing platform for writing high-performance applications using thousands of parallel threads on GPUs. Everything in this article also works on Ubuntu flavors like Kubuntu. \n \nPrerequisites \n \nThe first step is to install the proprietary NVIDIA driver. Check which driver is recommended with: \n \nYou can automatically install it with: \n \nOptionally, list the installed NVIDIA packages to confirm that the recommended driver was installed: \n \n \nYou'll also need the CUDA Toolkit. Its website offers downloads for older Ubuntu releases, but on 26.04 you can install it directly with: \n \n \nFor profiling our application, we'll need NVIDIA Nsight Systems. Download and install the package for your system, which is probably the Linux x86_64 .deb installer. \n \nNow download this wrapper script for the nsys command that makes profiling your CUDA app easier. Put it somewhere on your $PATH or in the directory where you'll be working. \n \nKey Concepts A kernel in CUDA is a function that the GPU can run. You declare it with the __global__ specifier. Unified Memory in CUDA provides a single memory space accessible by all CPUs and GPUs in your system. An allocation in Unified Memory returns a pointer that can be accessed by host (CPU) code or device (GPU) code. If we know what memory is needed by the kernel, we can prefetch it to make sure the data is on the GPU before the kernel needs it. This avoids page faults during the kernel run, where the hardware would otherwise have to migrate each page to GPU memory as the fault occurs. CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors (SMs). Each SM can run multiple concurrent thread blocks, but each block runs on a single SM. Together, the blocks of parallel threads make up what is known as the grid. 
When launching a kernel you specify the number of thread blocks and the block size with the <<<A, B>>> notation, meaning: $$ A $$ thread blocks $$ B $$ threads in a block (should be a multiple of $$ 32 $$) CUDA C++ provides built-in variables that let kernels get the indices of the running threads. gridDim.x is the number of blocks in the grid. blockIdx.x is the index of the current thread block in the grid. blockDim.x is the number of threads in the block. threadIdx.x is the index of the current thread within its block. So to process an array of $$ 2^{20} $$ elements with a block size of $$ 256 $$ threads, we need $$ 2^{20}/256 = 4096 $$ blocks. \n \n \nFigure: Grid, Block and Thread indexing in CUDA kernels (one-dimensional) from the NVIDIA blog \n \nExample \nLet's use parallel processing on the GPU to add the elements of two arrays. Save this program in a file called add_grid.cu: \n \n \nCompile it: \n \nProfile it: \n \n \nThat's it! On my NVIDIA RTX 5070 Ti the add operation for ~1M elements takes only 21 microseconds (µs) thanks to the parallelism on the GPU and the memory prefetching. By comparison, a version of this code that uses only 1 thread on the GPU (no parallelism) and no memory prefetching takes ~85,000 µs, so the parallel version is roughly 4,000 times faster. \n \nWhat results do you get on your machine?",
  "title": "NVIDIA CUDA on Ubuntu 26.04",
  "updatedAt": "2026-06-08T00:01:00.480Z"
}