5.4 Vector Addition Example: exploring thread block mappings¶
Here is a full example that tries running the code under several conditions:
Case 3: using one block of 256 threads (or you can change this default block size)
Case 4: using 16 blocks of 256 threads (or you can change this default block size)
Case 5: using as many blocks of 256 threads as we need for the entire array (or you can change the block size)
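As a rough sketch of how these mappings differ (the kernel and variable names below are illustrative, not necessarily those used in vectorAdd.cu), cases 3 and 4 use a fixed number of blocks with a grid-stride loop, while case 5 launches enough blocks to give every array element its own thread:

__global__ void vecAddStride(const float *x, const float *y, float *z, int n) {
    // grid-stride loop: each thread handles every element whose index is
    // offset by the total number of threads launched (cases 3 and 4)
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        z[i] = x[i] + y[i];
}

__global__ void vecAddOnePerThread(const float *x, const float *y, float *z, int n) {
    // one element per thread (case 5); the guard is needed because the
    // last block may contain threads past the end of the array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = x[i] + y[i];
}

// launch configurations for the three cases (blockSize defaults to 256):
//   case 3: vecAddStride<<<1, blockSize>>>(x, y, z, N);
//   case 4: vecAddStride<<<16, blockSize>>>(x, y, z, N);
//   case 5: int numBlocks = (N + blockSize - 1) / blockSize;
//           vecAddOnePerThread<<<numBlocks, blockSize>>>(x, y, z, N);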
The first command-line argument is the blockSize, whose default is 256; the second argument is the array size, N, whose default is 33554432 (a multiple of 256 and also of 1024, which is sometimes the largest recommended number of threads per block). Note at the end of the code that the getArguments() function now handles two optional arguments for blockSize and N.
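A minimal sketch of what such an argument-handling function might look like (the actual getArguments() in vectorAdd.cu may be written differently):

#include <stdlib.h>

// Hypothetical sketch: fill in blockSize and N from optional command-line arguments.
void getArguments(int argc, char **argv, int *blockSize, int *N) {
    *blockSize = 256;       // default threads per block
    *N = 33554432;          // default array size

    if (argc > 1) *blockSize = atoi(argv[1]);   // optional first argument
    if (argc > 2) *N = atoi(argv[2]);           // optional second argument
}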
Each case takes a bit of time, so be patient while each one executes and reports its results.
5.4-1: What explains the different run times?
Choose all that are correct. After observing this code running, using 1 block of threads instead of multiple blocks of threads is slower because:
- The 1 block case is scheduled on multiple SMs, making it slower. (Feedback: 1 block maps to 1 SM, so it does not use all the SMs available on the GPU.)
- The multiple blocks cases use multiple SMs, making better use of the hardware. (Feedback: Using more SMs often increases the speed of your code.)
- The 1 block case is scheduled on 1 SM, so it does not use all the SMs available on the GPU. (Feedback: Using more SMs often increases the speed of your code.)
An Exercise¶
Observe the time that each case takes with the default array size of 33554432 and the default blockSize of 256. What time do you observe when using half of the default array size, 16777216, like this:
./vectorAdd 256 16777216
Try doubling the size of the original array in the command line arguments, like this:
./vectorAdd 256 67108864
How long does each array size take to run for case 4 (stride) and case 5 (no stride)?
Go even higher:
./vectorAdd 256 134217728
For a sequential version of this algorithm, the time should double as we double the array size N; we call this an O(N) algorithm.
The reason you see better results than an O(N) single-core solution is that as the array size increases, the GPU can use more cores in parallel across all of its SMs and, more importantly, it can schedule the computations on those cores effectively.
Note
Though there is a difference between the case 4 and case 5 times, it is fairly small (the times are reported in milliseconds, or thousandths of a second) and may vary from one GPU card to another. This means either method works well for this particular problem.
Note
An important point about the design of NVIDIA cards is that the block size should be a multiple of 32; for today's cards, experiments show that block sizes of 128, 256, or 512 are preferred choices for the hardware.
Build and run on your machine¶
File: 4-UMVectorAdd-timing/vectorAdd.cu
Just as for previous examples, you can use the make command on your own machine or compile the code like this:
nvcc -arch=native -o vectorAdd vectorAdd.cu
Remember that you will need to use a different -arch flag if native does not work for you. (See note at end of section 4.1.)
You can execute this code like this:
./vectorAdd
You can also experiment with trying a smaller or larger block size, by running like this:
./vectorAdd 128
./vectorAdd 512
Also try changing the array sizes as we did earlier.
Test your understanding¶
5.4-2: Using one block of threads is a reasonable choice for this example.
- True. (Feedback: Look carefully at the case of one block.)
- False. (Feedback: Yes! The single block case ran much slower than when using multiple blocks of threads.)

5.4-3: When we use multiple blocks and double the size of our problem, the code takes slightly less than twice the time to run on the GPU for each case.
- True. (Feedback: Yes! In this case, the GPU cores are being scheduled very efficiently.)
- False. (Feedback: Try running each case a few more times to determine if it really is true.)
Summary¶
This example shows that a host CPU core is faster than a single GPU core by quite some margin. To use GPUs effectively, you therefore need to use as many cores as possible in parallel to complete the computation. From this example, you can see that this is possible when the mapping of cores to data elements is straightforward and the computation on each data element is simple (though this approach still works well for more sophisticated mathematical calculations involving single elements of each array).
In Araujo et al. (2023), the authors performed an extensive study to determine the effect of the block size on a variety of different benchmark codes. They concluded that under some circumstances the block size had very small effects on the runtime, but in other cases keeping it small or large made a considerable difference in how fast the codes ran. For vector addition you will likely see only minor effects, but for other code the block size may matter more, so it is best to design your code so that you can change the block size and run experiments. It still holds from their results that block sizes of 128, 256, or 512 are preferred choices.
In this same study, the authors ran each experiment 10 times under the same conditions and reported the average time. This is a practice you will want to get into the habit of when testing your own code. You should have seen variation in your timing results as you ran the same condition multiple times.
Exercises for Further Exploration¶
There is overhead in creating the Unified Memory arrays and copying them to the GPU. The vector addition computation itself, as we use more blocks of threads on the GPU, does not take exactly twice as long when we double the array size, but a complete solution with timing should include that memory overhead. You could try creating a version of just case 5 that times all parts of the code.
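As a sketch of what that might look like (a fragment, assuming N, blockSize, and the one-element-per-thread kernel from the earlier sketch are already defined; the names are illustrative):

#include <time.h>
#include <stdio.h>

// Sketch: time allocation, kernel launch, and synchronization together,
// so that the unified memory overhead is included in the measurement.
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);

float *x, *y, *z;
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
cudaMallocManaged(&z, N * sizeof(float));
// ... initialize x and y on the host ...

int numBlocks = (N + blockSize - 1) / blockSize;
vecAddOnePerThread<<<numBlocks, blockSize>>>(x, y, z, N);
cudaDeviceSynchronize();    // wait for the kernel to finish before stopping the timer

clock_gettime(CLOCK_MONOTONIC, &end);
double ms = (end.tv_sec - start.tv_sec) * 1.0e3 +
            (end.tv_nsec - start.tv_nsec) / 1.0e6;
printf("total time including memory overhead: %f ms\n", ms);

cudaFree(x); cudaFree(y); cudaFree(z);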
Given this example of how the code can be timed on the host, go back and add timing to the version of the code that did not use unified memory. The results will let you determine whether our assertion that unified memory is the preferred method holds for this example.
Another exercise is to consider when the fifth case, which uses one thread id per array index and calculates the number of blocks, could fail. Though likely a rare case, it is worth thinking about. To work it out, go back to the information about your device and determine the maximum number of blocks allowed in a grid.
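A sketch of how you might query that limit, using cudaGetDeviceProperties (device 0 is assumed here):

#include <stdio.h>

int main() {
    // query the properties of device 0 and print the grid and block limits
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max grid size in x: %d blocks\n", prop.maxGridSize[0]);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}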
For case 4, experiment with changing the fixed number of blocks. Is there any case where the time is consistently better or worse than the case where we calculate the number of blocks in the grid based on N and the block size?
Our GitHub repository includes an example that uses CUDA library functions for timing instead of C timing functions on the host. If you want to explore it, you can see how CUDA provides its own mechanisms for timing code, which can be useful for timing sophisticated kernel or device functions.
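The general pattern for CUDA event timing looks like the following sketch (a fragment; the kernel, numBlocks, blockSize, and arrays are assumed from the earlier sketches, and the repository example may differ in detail):

// Sketch: timing a kernel with CUDA events instead of host timers.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAddOnePerThread<<<numBlocks, blockSize>>>(x, y, z, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);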
References¶
Araujo, G., Griebler, D., Rockenbach, D. A., Danelutto, M., & Fernandes, L. G. (2023). NAS Parallel Benchmarks with CUDA and beyond. Software: Practice and Experience, 53(1), 53-80.