{
"$type": "site.standard.document",
"content": {
"$type": "pub.lemma.blog.entry",
"content": "# NVIDIA CUDA on Ubuntu 26.04\r\n\r\n*based on [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)*\r\n\r\nIf you have an NVIDIA GPU you can use CUDA, a parallel computing platform for writing high-performance applications using thousands of parallel threads on GPUs. Everything in this article also works on Ubuntu flavors like Kubuntu.\r\n\r\n## Prerequisites\r\n\r\nThe first step is to install the proprietary **NVIDIA driver**. See which driver is recommended with:\r\n```sh\r\nubuntu-drivers devices\r\n```\r\nYou can automatically install it with:\r\n```sh\r\nsudo ubuntu-drivers install\r\n```\r\nOptionally, you can list the installed driver packages and verify that the recommended one is among them:\r\n```sh\r\ndpkg -l | grep nvidia\r\n```\r\n\r\nYou'll also need the **CUDA Toolkit**. Its website lists versions for older Ubuntu releases, but on 26.04 you can install it directly with:\r\n```sh\r\nsudo apt install nvidia-cuda-toolkit\r\n```\r\n\r\nFor profiling our application, we'll need [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems/get-started). Download and install the package for your system, which is probably the Linux x86_64 `.deb` installer.\r\n\r\nNow download [this wrapper script](https://github.com/harrism/nsys_easy/tree/main) for the `nsys` command that makes profiling your CUDA app easier. Put it somewhere on your `$PATH` or in the directory where you'll be working.\r\n\r\n## Key Concepts\r\n- A *kernel* in CUDA is a function that the GPU can run. You mark it with the `__global__` specifier.\r\n- _Unified Memory_ in CUDA provides a single memory space accessible by all CPUs and GPUs in your system. Allocating Unified Memory returns a pointer that can be used from **_host_** (CPU) code or **_device_** (GPU) code. If we know which memory the kernel needs, we can prefetch it so the data is already on the GPU before the kernel needs it. 
This avoids repeated page faults, where the hardware migrates pages to GPU memory only as each fault occurs.\r\n- CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors (SMs). Each SM can run multiple concurrent thread blocks, but each block runs on a single SM. Together, the **_blocks_** of parallel **_threads_** make up what is known as the **_grid_**.\r\n- When launching a kernel you specify the number of thread blocks and the block size with the `<<<A, B>>>` notation, meaning: \r\n - $$ A $$ thread blocks\r\n - $$ B $$ threads per block (typically a multiple of $$ 32 $$, the warp size)\r\n- CUDA C++ provides built-in variables that let kernels get the indices of the running threads.\r\n - `gridDim.x` is the number of blocks in the grid.\r\n - `blockIdx.x` is the index of the current thread block in the grid.\r\n - `blockDim.x` is the number of threads in the block.\r\n - `threadIdx.x` is the index of the current thread within its block.\r\n- So to process an array of $$ 2^{20} $$ elements with a block size of $$ 256 $$ threads, we need $$ 2^{20}/256 = 4096 $$ blocks.\r\n\r\n\r\n_Figure: Grid, Block and Thread indexing in CUDA kernels (one-dimensional) from the NVIDIA blog_ \r\n\r\n## Example\r\nLet's use parallel processing on the GPU to add the elements of two arrays. Save this program in a file called **`add_grid.cu`**:\r\n```cpp\r\n#include <iostream>\r\n#include <math.h>\r\n\r\n// CUDA kernel function to add the elements of two arrays. 
\r\n__global__\r\nvoid add(int n, float *x, float *y)\r\n{\r\n int index = blockIdx.x * blockDim.x + threadIdx.x;\r\n int stride = blockDim.x * gridDim.x; // total number of threads in the grid\r\n for (int i = index; i < n; i += stride)\r\n y[i] = x[i] + y[i];\r\n}\r\n \r\nint main(void)\r\n{\r\n int N = 1<<20; // ~1M elements (2^20 or 1,048,576 to be exact)\r\n float *x, *y;\r\n \r\n // Allocate Unified Memory – accessible from CPU or GPU\r\n cudaMallocManaged(&x, N*sizeof(float));\r\n cudaMallocManaged(&y, N*sizeof(float));\r\n \r\n // initialize x and y arrays on the host\r\n for (int i = 0; i < N; i++) {\r\n x[i] = 1.0f;\r\n y[i] = 2.0f;\r\n }\r\n \r\n // Prefetch the x and y arrays to the GPU\r\n cudaMemPrefetchAsync(x, N*sizeof(float), 0, 0);\r\n cudaMemPrefetchAsync(y, N*sizeof(float), 0, 0);\r\n\r\n // Run kernel on the ~1M elements on the GPU. \r\n int blockSize = 256;\r\n int numBlocks = (N + blockSize - 1) / blockSize; // number of blocks to get at least N threads. Divide N by the block size (being careful to round up in case N is not a multiple of blockSize)\r\n add<<<numBlocks, blockSize>>>(N, x, y);\r\n \r\n // Wait for GPU to finish before accessing on host\r\n cudaDeviceSynchronize();\r\n \r\n // Check for errors (all values should be 3.0f)\r\n float maxError = 0.0f;\r\n for (int i = 0; i < N; i++) {\r\n maxError = fmax(maxError, fabs(y[i]-3.0f));\r\n }\r\n std::cout << \"Max error: \" << maxError << std::endl;\r\n \r\n // Free memory\r\n cudaFree(x);\r\n cudaFree(y);\r\n return 0;\r\n}\r\n```\r\n\r\nCompile it:\r\n```sh\r\nnvcc add_grid.cu -o add_cuda\r\n```\r\nProfile it:\r\n```sh\r\n./nsys_easy add_cuda\r\n```\r\n\r\nThat's it! On my NVIDIA RTX 5070 Ti the `add` operation for ~1M elements takes only 21 microseconds (µs) thanks to the parallelism on the GPU and the memory prefetching. This is compared to ~85,000 microseconds (µs) in a version of this code that uses only 1 thread on the GPU (without parallelism) and no memory prefetching. 
That is roughly a 4,000x speedup!\r\n\r\nWhat results do you get on your machine?"
},
"path": "/3mkimumjuc22o",
"publishedAt": "2026-04-27T17:47:25.984Z",
"site": "at://did:plc:5g2hkj2od4zr3tlpb2su364e/site.standard.publication/3mmhg6ulmlszy",
"title": "NVIDIA CUDA on Ubuntu 26.04",
"updatedAt": "2026-04-27T23:26:25.349Z"
}