5.2 Vector Addition Example: see hardware in action¶
This example shows how you can run an experiment to measure the difference in speed between a CPU core and a single GPU core. To do this we will run the vector addition on the host, where it uses one CPU core, and then run it as a kernel function that uses only one thread on one core of the GPU (never a good idea for real code, but fine for this kind of experiment).
Parts that remain the same from the previous chapter¶
We use the same macro for detecting and reporting errors in code run on the device.
We use the same function for initializing the arrays with values and the same function for verifying that the result of the vector addition is correct.
We use managed memory as in the final example from the previous chapter.
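As a reminder of that setup, the arrays are allocated with cudaMallocManaged so that both the host and the device can access them. Here is a minimal sketch of that allocation (the variable names and the cudaCheckErrors macro are assumed to match this example):

// Sketch only: unified (managed) memory setup as in the previous chapter
float *x, *y;
cudaMallocManaged(&x, N * sizeof(float));   // visible to both host and device
cudaCheckErrors("cudaMallocManaged for x");
cudaMallocManaged(&y, N * sizeof(float));
cudaCheckErrors("cudaMallocManaged for y");

// ... use x and y on host and device ...

cudaFree(x);    // managed memory is freed with cudaFree
cudaFree(y);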
Differences for this example¶
The command line argument is now the array size so that we can change it each time we run it. The getArguments() function now looks like this:
// simple argument gather for this simple 1D example program
//
// Design: the argument is optional:
//     number of data elements in the 1D vector arrays
void getArguments(int argc, char **argv, int *numElements) {
    if (argc == 2) {
        *numElements = atoi(argv[1]);
    }
}
We use it in main by setting a default value for N and then overwriting that value if one is given on the command line, like this:
int N = 32*1048576;
// get optional argument: change array size
getArguments(argc, argv, &N);
printf("size (N) of 1D array is: %d\n\n", N);
Case 1: We have a host function to add the two arrays on the host CPU:
// To run code on host for comparison
void HostAdd(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
Note that this function does not use the __global__ keyword and therefore acts as a regular function that runs on the host CPU.
Case 2: We have a kernel function that runs on only one thread on one core of the GPU:
// Kernel function to add the elements of two arrays.
// This one is sequential on one GPU core.
__global__ void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
Note that we are not making any use of multiple threads here: we never determine a thread number or compute an array index from it. Note below how we call this function in main.
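For contrast, a typical data-parallel kernel (a sketch of the standard CUDA indexing pattern, not code used in this example) would compute a global index for each thread and guard against running past the end of the arrays:

// Sketch only: each thread adds one element (not used in this example)
__global__ void addParallel(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard in case the grid has extra threads
        y[i] = x[i] + y[i];
}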
We also introduce how to time our code. In this case we use the standard C clock() function. The CUDA libraries provide their own timing functions, but this method works just as well in many cases. In main, we time the host function like this:
// case 1: run on the host on one core
t_start = clock();
// sequentially on the host
HostAdd(N, x, y);
t_end = clock();
tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
tot_time_milliseconds = tot_time_secs*1000;
printf("\nSequential time on host: %f seconds (%f milliseconds)\n",
tot_time_secs, tot_time_milliseconds);
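The timing variables used above are declared earlier in main and are not shown in this excerpt; a minimal sketch of those declarations (with the include at the top of the file), assuming the names used here, is:

#include <time.h>                  // clock() and CLOCKS_PER_SEC

clock_t t_start, t_end;            // raw tick counts from clock()
double tot_time_secs;              // elapsed time in seconds
double tot_time_milliseconds;      // elapsed time in milliseconds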
We do similar timing around the kernel function call in main:
// case 2:
// Purely an illustration of something you would not ordinarily do:
// run the kernel on all elements on the GPU sequentially on one thread

// re-initialize
initialize(x, y, N);

t_start = clock();
add<<<1, 1>>>(N, x, y);    // the kernel call
cudaCheckErrors("add kernel call");

// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
cudaCheckErrors("Failure to synchronize device");
t_end = clock();

tot_time_secs = ((double)(t_end - t_start)) / CLOCKS_PER_SEC;
tot_time_milliseconds = tot_time_secs * 1000;
printf("\nSequential time on one device thread: %f seconds (%f milliseconds)\n",
       tot_time_secs, tot_time_milliseconds);
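As mentioned above, the CUDA runtime also provides its own timing mechanism based on events. A minimal sketch of timing the same kernel launch with CUDA events (not part of this example's code) looks like this:

// Sketch only: timing the kernel call with CUDA events instead of clock()
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

cudaEventRecord(ev_start);
add<<<1, 1>>>(N, x, y);
cudaEventRecord(ev_stop);

cudaEventSynchronize(ev_stop);        // wait until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, ev_start, ev_stop);   // elapsed time in milliseconds
printf("Kernel time measured with events: %f milliseconds\n", ms);

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);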
Note
The simple trick to get the code to run on one GPU device thread is the kernel call itself, shown above:
add<<<1, 1>>>(N, x, y); // the kernel call
This also illustrates that instead of using dim3 variables between the <<< and >>> symbols, we can use plain integers for just the x values: the number of blocks in the grid and the number of threads per block. In this case, <<<1, 1>>> indicates one block containing one thread (an equivalent launch written with dim3 variables is sketched just after this note).
We use this kernel solely to compare the time taken by a CPU core running HostAdd to the time taken to run the same loop on one GPU thread.
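For comparison, here is a minimal sketch of the same launch written with dim3 variables; it sets up the identical one-block, one-thread configuration:

// Sketch only: the same 1-block, 1-thread launch using dim3 variables
dim3 grid(1);     // 1 block in the grid (y and z default to 1)
dim3 block(1);    // 1 thread per block (y and z default to 1)
add<<<grid, block>>>(N, x, y);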
The complete code for CPU to GPU comparison¶
Try using ‘16777216’ to use half the default array size.
Let’s consider what you see from this by answering the following questions.
5.2-1: What can we infer about the speed of the CPU and GPU cores?

- GPU cores are faster than CPU cores.
  Feedback: Look carefully at the case of one device thread.
- CPU cores are faster than GPU cores.
  Feedback: Yes! The single GPU thread case ran much slower than the single CPU thread case.
- The speeds of CPU and GPU cores are comparable.
  Feedback: Look carefully at the case of one device thread.
This case brings up an interesting observation that you can make for this particular example code. Figure it out by answering this question:
5.2-2: When we double the size of our problem, the code takes roughly twice the time to run for each case.

- True.
  Feedback: Yes! In this case, the simple algorithm is O(N), so this makes sense for the sequential version on the CPU or one GPU core.
- False.
  Feedback: Try running each case a few more times to determine if it really is true.
Build and run on your machine¶
File: 4-UMVectorAdd-timing/vectorAdd-1.cu
Just as for previous examples, you can use the make command on your own machine or compile the code like this:
nvcc -arch=native -o vectorAdd-1 vectorAdd-1.cu
Remember that you will need to use a different -arch flag if native does not work for you. (See note at end of section 4.1.)
You can execute this code like this:
./vectorAdd-1
./vectorAdd-1 16777216