Timing your Kernel - CPU Timer & nvprof
Knowing how long a kernel takes to execute helps during performance tuning. There are several ways to measure kernel performance. The simplest methods for measuring kernel execution time from the host side are a CPU timer or a GPU-based measurement using the NVIDIA nvprof profiler.
Timing with CPU timer
A CPU timer can be created by using the gettimeofday() system call, which returns the system's wall-clock time as seconds and microseconds since the epoch. You need to include the sys/time.h header file.
double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    // Combine seconds and microseconds into a single value in seconds
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
You can measure your kernel with the function cpuSecond() in the following way:
double iStart = cpuSecond();
kernel_name<<<grid, block>>>(argument list);
cudaDeviceSynchronize();
double iElaps = cpuSecond() - iStart;
Because a kernel call is asynchronous with respect to the host, you need to call cudaDeviceSynchronize() to wait for all GPU threads to complete before stopping the timer. The variable iElaps then holds the elapsed time in seconds.
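Putting the pieces above together, a minimal self-contained sketch of host code that times a kernel with cpuSecond() might look like the following. The kernel sumArraysOnGPU, the vector size, and the launch configuration are illustrative assumptions chosen to mirror the example output shown later, not code from the original program.

// Sketch: timing a kernel with the CPU timer described above.
// The kernel sumArraysOnGPU and the problem size are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}

__global__ void sumArraysOnGPU(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main(void) {
    int n = 1 << 24;                       // 16777216 elements
    size_t nBytes = n * sizeof(float);

    // Device allocations (host-side initialization omitted for brevity;
    // the values do not affect the timing measurement)
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMalloc((void **)&d_C, nBytes);

    dim3 block(1024);
    dim3 grid((n + block.x - 1) / block.x);

    double iStart = cpuSecond();
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, n);
    cudaDeviceSynchronize();               // wait for the kernel to finish
    double iElaps = cpuSecond() - iStart;

    printf("sumArraysOnGPU <<<%d, %d>>> Time elapsed %f sec\n",
           grid.x, block.x, iElaps);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

Note that the time measured this way includes the kernel launch and synchronization overhead on the host, not just the kernel execution itself.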
Timing with NVPROF
Since CUDA 5.0, a command-line profiling tool called nvprof has been available to help you collect timeline information about your application's CPU and GPU activity, including kernel execution, memory transfers, and CUDA API calls. Its usage is shown here:
$ nvprof [nvprof_args] ./application [application_args]
More information about nvprof options can be found by using the following command:
$ nvprof --help
You can use nvprof to measure your kernel as follows:
$ nvprof ./mycode-timer
The output reported by nvprof varies based on the type of GPU you are using. This is a sample report collected on a Tesla GPU:
$ ./mycode-timer Starting...
Using Device 0: Tesla M2070
==17770== NVPROF is profiling process 17770, command: ./mycode-timer
Vector size 16777216
mycodeOnGPU <<<16384, 1024>>> Time elapsed 0.003266 sec
Arrays match.

==17770== Profiling application: ./mycode-timer
==17770== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 70.35%  52.667ms      3  17.556ms  17.415ms  17.800ms  [CUDA memcpy HtoD]
 25.77%  19.291ms      1  19.291ms  19.291ms  19.291ms  [CUDA memcpy DtoH]
  3.88%  2.9024ms      1  2.9024ms  2.9024ms  2.9024ms  mycodeOnGPU(float*, float*, int)
The first half of the output comes from the program itself, and the second half comes from nvprof. Note that the CPU timer reported the elapsed kernel time as 3.26 milliseconds, while nvprof reported it as 2.90 milliseconds. For this case, the nvprof result is more accurate than the host-side timing result, because the time measured with the CPU timer also includes overhead added by nvprof.
Note that in this example, data transfer between the host and device takes more time than the kernel execution.
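If you want to see each memory copy and kernel launch as an individual entry with its own timestamp, rather than the summary above, nvprof also offers a trace mode (assuming your nvprof version supports this option):

$ nvprof --print-gpu-trace ./mycode-timer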
For HPC workloads, it is essential to understand the compute-to-communication ratio of a program. If your application spends more time computing than transferring data, it may be possible to overlap these operations and completely hide the latency of the transfers. If it spends more time transferring data than computing, it is crucial to minimize the transfers between the host and device.
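As a rough illustration of the overlap idea, the sketch below splits the data into chunks and issues the host-to-device copy, kernel launch, and device-to-host copy for each chunk in its own CUDA stream, so copies for one chunk can overlap with computation on another. The kernel scaleKernel, the number of streams, and the sizes are all illustrative assumptions; asynchronous copies also require pinned host memory, which is why cudaMallocHost() is used here.

// Sketch: overlapping transfers and computation with CUDA streams.
// The kernel scaleKernel and all sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scaleKernel(float *d, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

int main(void) {
    const int N = 1 << 24;
    const int nStreams = 4;
    const int chunk = N / nStreams;
    size_t chunkBytes = chunk * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float));  // pinned host memory
    cudaMalloc((void **)&d_data, N * sizeof(float));
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    dim3 block(256);
    dim3 grid((chunk + block.x - 1) / block.x);

    // Each stream copies its chunk in, runs the kernel on it, and copies it back.
    // Work issued to different streams can overlap, hiding part of the transfer time.
    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        cudaMemcpyAsync(&d_data[offset], &h_data[offset], chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scaleKernel<<<grid, block, 0, streams[s]>>>(&d_data[offset], chunk, 2.0f);
        cudaMemcpyAsync(&h_data[offset], &d_data[offset], chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait until all streams have finished

    printf("h_data[0] = %f\n", h_data[0]);

    for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}

Whether the overlap actually hides the transfer time depends on the device's copy engines and on the relative durations of the copies and the kernel, which is exactly what the nvprof output above helps you judge.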