5.2 Vector Addition Example: see hardware in action¶
This example shows how you can run an experiment to measure the difference in speed between a CPU core and a single GPU core. To do this we will run the vector addition on the host, where it uses one CPU core, and then run it as a kernel function that uses only one thread on one core of the GPU (never a good idea for real code, but fine for this kind of experiment).
Parts that remain the same from the previous chapter¶
We use the same macro for detecting and reporting errors in code run on the device.
We use the same function for initializing the arrays with values and the same function for verifying that the result of the vector addition is correct.
We use managed memory as in the final example from the previous chapter.
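As a reminder of that setup, the arrays are allocated with cudaMallocManaged so that both the host and the device can access them. Here is a minimal sketch of that allocation (the variable names and the cudaCheckErrors macro are assumed to match this example):

// Sketch only: unified (managed) memory setup as in the previous chapter
float *x, *y;
cudaMallocManaged(&x, N * sizeof(float));   // visible to both host and device
cudaCheckErrors("cudaMallocManaged for x");
cudaMallocManaged(&y, N * sizeof(float));
cudaCheckErrors("cudaMallocManaged for y");

// ... use x and y on host and device ...

cudaFree(x);    // managed memory is freed with cudaFree
cudaFree(y);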
Differences for this example¶
The command line argument is now the array size so that we can change it each time we run it. The getArguments() function now looks like this:
// simple argument gather for this simple 1D example program
//
// Design: the argument is optional:
//     number of data elements in the 1D vector arrays
void getArguments(int argc, char **argv, int *numElements) {
    if (argc == 2) {
        *numElements = atoi(argv[1]);
    }
}
We use it in main by setting a default value for N and then overwriting that value if one is given on the command line, like this:
int N = 32*1048576;
// get optional argument: change array size
getArguments(argc, argv, &N);
printf("size (N) of 1D array is: %d\n\n", N);
Case 1: We have a host function to add the two arrays on the host CPU:
// To run code on host for comparison
void HostAdd(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
Note that this function does not use the __global__ keyword and therefore acts as a regular function that runs on the host CPU.
Case 2: We have a kernel function that runs on only one thread on one core of the GPU:
// Kernel function to add the elements of two arrays.
// This one is sequential on one GPU core.
__global__ void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
Note that we are not making any use of multiple threads here: we never determine a thread number or compute an array index from it. Note below how we call this function in main.
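For contrast, a typical data-parallel kernel (a sketch of the standard CUDA indexing pattern, not code used in this example) would compute a global index for each thread and guard against running past the end of the arrays:

// Sketch only: each thread adds one element (not used in this example)
__global__ void addParallel(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard in case the grid has extra threads
        y[i] = x[i] + y[i];
}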
We also introduce how to time our code. In this case we use the standard C clock() function. The CUDA libraries provide their own timing functions, but this method works just as well in many cases. In main, we time the host function like this:
// case 1: run on the host on one core
t_start = clock();
// sequentially on the host
HostAdd(N, x, y);
t_end = clock();
tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
tot_time_milliseconds = tot_time_secs*1000;
printf("\nSequential time on host: %f seconds (%f milliseconds)\n",
tot_time_secs, tot_time_milliseconds);
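The timing variables used above are declared earlier in main and are not shown in this excerpt; a minimal sketch of those declarations (with the include at the top of the file), assuming the names used here, is:

#include <time.h>                  // clock() and CLOCKS_PER_SEC

clock_t t_start, t_end;            // raw tick counts from clock()
double tot_time_secs;              // elapsed time in seconds
double tot_time_milliseconds;      // elapsed time in milliseconds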
We do similar timing around the kernel function call in main:
// case 2:
// Purely an illustration of something you would not ordinarily do:
// run the kernel on all elements on the GPU sequentially on one thread

// re-initialize
initialize(x, y, N);

t_start = clock();
add<<<1, 1>>>(N, x, y);    // the kernel call
cudaCheckErrors("add kernel call");

// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
cudaCheckErrors("Failure to synchronize device");
t_end = clock();

tot_time_secs = ((double)(t_end - t_start)) / CLOCKS_PER_SEC;
tot_time_milliseconds = tot_time_secs * 1000;
printf("\nSequential time on one device thread: %f seconds (%f milliseconds)\n",
       tot_time_secs, tot_time_milliseconds);
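As mentioned above, the CUDA runtime also provides its own timing mechanism based on events. A minimal sketch of timing the same kernel launch with CUDA events (not part of this example's code) looks like this:

// Sketch only: timing the kernel call with CUDA events instead of clock()
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

cudaEventRecord(ev_start);
add<<<1, 1>>>(N, x, y);
cudaEventRecord(ev_stop);

cudaEventSynchronize(ev_stop);        // wait until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, ev_start, ev_stop);   // elapsed time in milliseconds
printf("Kernel time measured with events: %f milliseconds\n", ms);

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);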
Note
The simple trick to get the code to run on one GPU device thread is the kernel call itself, shown above:
add<<<1, 1>>>(N, x, y); // the kernel call
This also illustrates that instead of using dim3 variables between the <<< and >>> symbols, we can use plain integers for just the x values: the number of blocks in the grid and the number of threads per block. In this case, <<<1, 1>>> indicates one block containing one thread (an equivalent launch written with dim3 variables is sketched just after this note).
We use this kernel solely to compare the time taken by a CPU core running HostAdd to the time taken to run the same loop on one GPU thread.
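For comparison, here is a minimal sketch of the same launch written with dim3 variables; it sets up the identical one-block, one-thread configuration:

// Sketch only: the same 1-block, 1-thread launch using dim3 variables
dim3 grid(1);     // 1 block in the grid (y and z default to 1)
dim3 block(1);    // 1 thread per block (y and z default to 1)
add<<<grid, block>>>(N, x, y);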
The complete code for CPU to GPU comparison¶
Try using ‘16777216’ to use half the default array size.
Let’s consider what you see from this by answering the following questions.
5.2-1: What can we infer about the speed of the CPU and GPU cores?

- GPU cores are faster than CPU cores.
  Feedback: Look carefully at the case of one device thread.
- CPU cores are faster than GPU cores.
  Feedback: Yes! The single GPU thread case ran much slower than the single CPU thread case.
- The speeds of CPU and GPU cores are comparable.
  Feedback: Look carefully at the case of one device thread.
This case brings up an interesting observation that you can make for this particular example code. Figure it out by answering this question:
5.2-2: When we double the size of our problem, the code takes roughly twice the time to run for each case.

- True.
  Feedback: Yes! In this case, the simple algorithm is O(N), so this makes sense for the sequential version on the CPU or one GPU core.
- False.
  Feedback: Try running each case a few more times to determine if it really is true.
Build and run on your machine¶
File: 4-UMVectorAdd-timing/vectorAdd-1.cu
Just as for previous examples, you can use the make command on your own machine or compile the code like this:
nvcc -arch=native -o vectorAdd-1 vectorAdd-1.cu
Remember that you will need to use a different -arch flag if native does not work for you. (See note at end of section 4.1.)
You can execute this code like this:
./vectorAdd-1
./vectorAdd-1 16777216