4.3 Mapping Threads to Data Elements
In the last section we mentioned that the key new idea in CUDA programming is that the programmer is responsible for:
1. setting up the grid of blocks of threads and
2. determining a mapping of those threads to elements in 1D, 2D, or 3D arrays.
We briefly saw task 1 (setting up grids with blocks) in the previous section, through the use of the dim3 data structure. Now we will examine more examples using dim3, then combine that with task 2, which is to map the threads within the blocks within the grid to data elements in arrays.
1D grid of 1D blocks of threads
Filename: 1-basics/1.2-dim3/dim3Demo1D1D.cu
The following example creates a 1-dimensional grid of 2 blocks that are also 1-dimensional, containing 8 threads:
If we simply change main to create a 1D grid with 2 blocks of 8 threads, we still maintain the same thread number values, which can be used as indexes into an array of 16 data values. Here is the code; look for the change in main:
Filename: 1-basics/1.3-1DBlockPrint/print2Blocks.cu
Note
Some other new ideas from this code are the following:
CUDA kernel functions that run on the device can have parameters that get passed from the host code calling them.
A kernel function called from host code, which we learned was designated by the keyword __global__, can call other functions that will immediately run on the device. These functions are designated with the keyword __device__, such as the function find1DThreadNumber() given above.
The situation from the above code is depicted in Figure 4-6, where the thread numbers computed by the function find1DThreadNumber and printed in the output above as t0, t1, t2, etc. are mapped to indices of an array containing 16 elements. Compare the function, repeated here, to Figure 4-6.
// Given a 1 dimensional grid of blocks of threads,
// determine my thread number.
// This is run on the GPU on each thread.
__device__ int find1DThreadNumber() {
    // illustrative variable names
    int threadsPerBlock_horizontal = blockDim.x;  // threads in each block
    int gridBlockNumber = blockIdx.x;             // which block this thread is in
    int threadNumber;
    // skip over the threads in all earlier blocks, then add my position in my block
    threadNumber = (gridBlockNumber * threadsPerBlock_horizontal) + threadIdx.x;
    return threadNumber;
}
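To make the mapping concrete, here is a minimal sketch (not the book's print2Blocks.cu) in which each thread writes its block number into the array element that its thread number selects. The kernel recordOwnerBlock and the host helper ownersFor are hypothetical names introduced only for this illustration, and the compact one-line index calculation is assumed to be equivalent to the function above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same index calculation as find1DThreadNumber above, written compactly.
__device__ int find1DThreadNumber() {
    return blockIdx.x * blockDim.x + threadIdx.x;
}

// Hypothetical kernel (an assumption for this sketch): each thread records
// its block number in the array element that its thread number maps to.
__global__ void recordOwnerBlock(int *ownerBlock) {
    int i = find1DThreadNumber();
    ownerBlock[i] = blockIdx.x;
}

// Hypothetical host helper: launch a 1D grid of 1D blocks with exactly one
// thread per element, then return the block number stored in each element.
std::vector<int> ownersFor(int numBlocks, int threadsPerBlock) {
    int n = numBlocks * threadsPerBlock;
    std::vector<int> host(n, -1);

    int *dev = nullptr;
    cudaMalloc((void **)&dev, n * sizeof(int));

    dim3 grid(numBlocks);         // 1D grid of blocks
    dim3 block(threadsPerBlock);  // 1D block of threads
    recordOwnerBlock<<<grid, block>>>(dev);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main() {
    // 2 blocks of 8 threads cover an array of 16 elements.
    std::vector<int> owner = ownersFor(2, 8);
    for (size_t i = 0; i < owner.size(); i++) {
        printf("array[%zu] was written by a thread in block %d\n", i, owner[i]);
    }
    return 0;
}
```

With 2 blocks of 8 threads, the first eight elements should be written by block 0 and the last eight by block 1, which is the layout shown in Figure 4-6. Error checking of the CUDA calls is left out to keep the sketch short.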
Warning
The function called find1DThreadNumber is sufficient to calculate an index into any length 1-dimensional array when using a 1D grid of 1D blocks. As a programmer, you must determine the grid and block sizes from the length of the array and ensure that you don’t go out of the bounds of the array. You will see how this is done next when we look at an example of vector addition from linear algebra.
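As a small preview of that bounds check, the sketch below shows one common pattern: round the number of blocks up so that every element is covered, and have each thread test its thread number against the array length before touching the array. The names blocksNeeded, fillWithIndex, and fillOnDevice are hypothetical, chosen for this illustration, and the array length of 20 is an arbitrary value that is deliberately not a multiple of the block size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: the smallest number of blocks whose threads cover
// n elements (integer division rounded up).
int blocksNeeded(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel: the array length is passed as a parameter so that
// threads whose number falls past the end of the array do nothing.
__global__ void fillWithIndex(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // guard against going out of bounds
        data[i] = i;
    }
}

// Hypothetical host helper: fill an n-element device array and copy it back.
std::vector<int> fillOnDevice(int n, int threadsPerBlock) {
    std::vector<int> host(n, -1);

    int *dev = nullptr;
    cudaMalloc((void **)&dev, n * sizeof(int));

    dim3 grid(blocksNeeded(n, threadsPerBlock));
    dim3 block(threadsPerBlock);
    fillWithIndex<<<grid, block>>>(dev, n);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}

int main() {
    const int n = 20;               // not a multiple of the block size
    const int threadsPerBlock = 8;
    printf("%d elements need %d blocks of %d threads\n",
           n, blocksNeeded(n, threadsPerBlock), threadsPerBlock);

    std::vector<int> data = fillOnDevice(n, threadsPerBlock);
    printf("data[%d] = %d\n", n - 1, data[n - 1]);
    return 0;
}
```

Here 20 elements with 8 threads per block require 3 blocks, so 24 threads are launched and the last 4 fail the `i < n` test and leave the array alone. Without that test, those 4 threads would write past the end of the array.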
Exercises
- 4.3-1: Try a few more blocks
Try changing the code for print2Blocks.cu to use more than 2 blocks, such as 4 (don’t try too large because of all the printing that will happen, some of which may not get returned). What do you observe about the numbering for each thread?
- 4.3-2: True or false: Functions annotated with the __device__ qualifier must be called from a kernel function or another device function.
- True: Yes! The qualifier signifies code that gets called from a running section of device code, which can be a starting kernel function or another device function.
- False: The qualifier signifies code that gets called from a running section of device code.
2D grid of 2D blocks of threads
2D grids of 2D blocks of threads are useful for applications that use 2-dimensional arrays, or matrices. We will look at them in the next chapter, which covers applications.