Timing your Kernel - CPU Timer & nvprof

Knowing how long a kernel takes to execute helps during the performance tuning of kernels. There are several ways to measure kernel performance. The simplest methods for measuring kernel execution from the host side are to use either a CPU timer or the NVIDIA nvprof command-line profiler.


Timing with CPU timer

A CPU timer can be created by using the gettimeofday() system call, which returns the system's wall-clock time as the number of seconds (and microseconds) since the epoch. You need to include the sys/time.h header file.

#include <sys/time.h>

// Return the current wall-clock time in seconds as a double
double cpuSecond() {
   struct timeval tp;
   gettimeofday(&tp, NULL);
   return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

You can measure your kernel with cpuSecond() as follows:

double iStart = cpuSecond();
kernel_name<<<grid, block>>>(argument list);
cudaDeviceSynchronize();
double iElaps = cpuSecond() - iStart;

Because a kernel launch is asynchronous with respect to the host, you need to call cudaDeviceSynchronize() to wait for all GPU threads to complete before stopping the timer. The variable iElaps then reports the elapsed time measured from the host.
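
Putting these pieces together, the following is a minimal, self-contained sketch of host-side timing. The vector-add kernel sumArraysOnGPU, the vector size, and the launch configuration are hypothetical stand-ins for your own code, and host-side data initialization and transfers are omitted for brevity.

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

// Return the current wall-clock time in seconds as a double
double cpuSecond() {
   struct timeval tp;
   gettimeofday(&tp, NULL);
   return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

// Hypothetical kernel: element-wise vector addition
__global__ void sumArraysOnGPU(float *A, float *B, float *C, int n) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) C[i] = A[i] + B[i];
}

int main(void) {
   int n = 1 << 24;                       // illustrative vector size
   size_t nBytes = n * sizeof(float);

   float *d_A, *d_B, *d_C;
   cudaMalloc((void **)&d_A, nBytes);
   cudaMalloc((void **)&d_B, nBytes);
   cudaMalloc((void **)&d_C, nBytes);

   dim3 block(1024);
   dim3 grid((n + block.x - 1) / block.x);

   // Start the timer, launch the kernel, and wait for the device
   // to finish before reading the timer again
   double iStart = cpuSecond();
   sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, n);
   cudaDeviceSynchronize();
   double iElaps = cpuSecond() - iStart;

   printf("sumArraysOnGPU <<<%d, %d>>> Time elapsed %f sec\n",
          grid.x, block.x, iElaps);

   cudaFree(d_A);
   cudaFree(d_B);
   cudaFree(d_C);
   return 0;
}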


Timing with nvprof

Since CUDA 5.0, a command-line profiling tool called nvprof has been available to help you collect timeline information from your application's CPU and GPU activity, including kernel execution, memory transfers, and CUDA API calls. Its usage is shown here:

$ nvprof [nvprof_args] ./application [application_args]

More information about nvprof options can be found by using the following command:

$ nvprof --help
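
For example, to go beyond the summary view and print each GPU activity in timeline order, you can pass the --print-gpu-trace option:

$ nvprof --print-gpu-trace ./mycode-timer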

You can use nvprof to measure your kernel as follows:

$ nvprof ./mycode-timer

The output reported by nvprof varies based on the type of GPU you are using. This is a sample report collected on a Tesla GPU:

./mycode-timer Starting...
Using Device 0: Tesla M2070
==17770== NVPROF is profiling process 17770, command: ./mycode-timer
Vector size 16777216
mycodeOnGPU <<<16384, 1024>>>  Time elapsed 0.003266 sec
Arrays match.
==17770== Profiling application: ./mycode-timer
==17770== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 70.35%  52.667ms         3  17.556ms  17.415ms  17.800ms  [CUDA memcpy HtoD]
 25.77%  19.291ms         1  19.291ms  19.291ms  19.291ms  [CUDA memcpy DtoH]
  3.88%  2.9024ms         1  2.9024ms  2.9024ms  2.9024ms  mycodeOnGPU(float*, float*, int) 

The first half of the message contains output from the program, and the second half contains output from nvprof. Note that the CPU timer reported the elapsed kernel time as 3.26 milliseconds, while nvprof reported it as 2.90 milliseconds. For this case, the nvprof result is more accurate than the host-side timing result because the CPU timer measures the full span from launch to synchronization and, in this run, also included overhead added by nvprof itself.

Note that in this example, data transfer between the host and device takes more time than the kernel execution.

For HPC workloads, it is essential to understand the compute-to-communication ratio of a program. If your application spends more time computing than transferring data, it may be possible to overlap these operations and completely hide the latency associated with the transfers. If it spends less time computing than transferring data, it is crucial to minimize the traffic between the host and device. A sketch of the overlap idea follows.
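
As an illustration, here is a minimal sketch of overlapping transfers with computation using CUDA streams and cudaMemcpyAsync. The kernel processChunk, the stream count, and the chunk size are hypothetical, and h_in must be pinned host memory (allocated with cudaMallocHost) for the copies to be truly asynchronous.

#include <cuda_runtime.h>

// Hypothetical kernel that processes one chunk of data in place
__global__ void processChunk(float *data, int n);

void overlapTransfersAndCompute(float *h_in, float *d_in, int n) {
   const int nStreams = 4;            // illustrative stream count
   int chunk = n / nStreams;          // assume n divides evenly

   cudaStream_t streams[nStreams];
   for (int i = 0; i < nStreams; i++)
      cudaStreamCreate(&streams[i]);

   for (int i = 0; i < nStreams; i++) {
      int offset = i * chunk;
      // Issue the copy and the kernel for chunk i into the same
      // stream; copies for later chunks overlap with computation
      // on earlier chunks
      cudaMemcpyAsync(&d_in[offset], &h_in[offset],
                      chunk * sizeof(float),
                      cudaMemcpyHostToDevice, streams[i]);
      processChunk<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(
          &d_in[offset], chunk);
   }

   cudaDeviceSynchronize();           // wait for all streams to finish

   for (int i = 0; i < nStreams; i++)
      cudaStreamDestroy(streams[i]);
}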