5.1 Vector Addition Example: exploring CUDA by timing and experimenting
In this example we will look at the following new aspects of CUDA coding, using the same Vector Addition example as the previous chapter:
Use code timing to find out how fast the code runs (a minimal timing sketch follows this list).
Experiment to find out whether changes to the code affect the time and to compare running on the host CPU to running on the GPU device.
Add a check for the case where a user enters a threads-per-block value larger than the device allows (see the second sketch after this list).
Demonstrate that you can run kernel code without using dim3 variables when you are only concerned with the x dimension and are using 1D grids and thread blocks.
Show the standard practice of using an additional command-line argument for the size of our array.
Show that we can run several device kernel functions, one after the other, from one main program (re-initializing the data arrays between kernel launches).
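The sections below contain the chapter's actual code and its timings. As a preview, a minimal sketch of timing a kernel with CUDA events might look like the following; the kernel name vecAdd, the use of unified memory, and the hard-coded array and block sizes are assumptions for illustration, not the chapter's code:

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Hypothetical 1D kernel: no dim3 variables, only the x dimension is used.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        int n = 1 << 20;                // placeholder array size
        int threadsPerBlock = 256;      // placeholder block size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        size_t bytes = n * sizeof(float);

        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);   // unified memory keeps the sketch short
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        // Time the kernel with CUDA events.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);  // plain ints, no dim3
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }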
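The threads-per-block check and the extra command-line argument for the array size could be handled roughly as in the next fragment; the argument order, variable names, and error message are assumptions rather than the chapter's exact code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        // Default values, optionally overridden from the command line,
        // e.g. ./vectorAdd <threadsPerBlock> <arraySize> (argument order assumed).
        int threadsPerBlock = 256;
        int n = 1 << 20;
        if (argc > 1) threadsPerBlock = atoi(argv[1]);
        if (argc > 2) n = atoi(argv[2]);

        // Ask the device for its actual limit rather than hard-coding a number.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        if (threadsPerBlock <= 0 || threadsPerBlock > prop.maxThreadsPerBlock) {
            fprintf(stderr, "threads per block must be between 1 and %d\n",
                    prop.maxThreadsPerBlock);
            return EXIT_FAILURE;
        }

        printf("using %d threads per block on arrays of %d elements\n",
               threadsPerBlock, n);
        // ... allocate the arrays, then run and time the kernels ...
        return 0;
    }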
The code examples contain experimental timings for the following cases, covered in the indicated sections of this chapter (a sketch of the corresponding kernel variants follows this list):
Section 5.2. Case 1. Running on a single thread on the host CPU.
Section 5.2. Case 2. Running on a single thread on the GPU device (not something we would normally do, but given to show the difference between CPU and GPU cores).
Sections 5.3, 5.4. Case 3. Running on a single block of threads (grid size 1).
Sections 5.3, 5.4. Case 4. Running on a somewhat small number of blocks, using a slightly different version of the loop to perform the addition.
Sections 5.3, 5.4. Case 5. Running on a large number of blocks, as shown in the previous example.
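Case 1 is an ordinary serial loop on the host, so it needs no kernel. For the device cases, the kernel variants might look roughly like the sketches below; the function names and exact loop forms are assumptions, and the chapter's own versions appear in the sections listed above:

    // Case 2: one GPU thread does all of the work (launched as <<<1, 1>>>).
    __global__ void vecAddSingleThread(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Case 3: a single block of threads (launched as <<<1, threadsPerBlock>>>);
    // each thread strides through the array by the block size.
    __global__ void vecAddSingleBlock(const float *a, const float *b, float *c, int n) {
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            c[i] = a[i] + b[i];
    }

    // Case 4: a modest number of blocks with a grid-stride loop, so each
    // thread may handle more than one element.
    __global__ void vecAddGridStride(const float *a, const float *b, float *c, int n) {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            c[i] = a[i] + b[i];
    }

    // Case 5: enough blocks that each thread handles exactly one element,
    // as in the previous chapter's version.
    __global__ void vecAddOnePerThread(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

Running each of these kernels in turn from one main program, re-initializing the data arrays between launches, is one way to collect all of the timings in a single run.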
Note
Taking timings and running experiments as shown in the next few sections is central to the work process in PDC computing. Just as we need to make sure our code is still correct, we also want to determine how best to run the code to get the best performance, and which factors do not really affect the performance.