User's Guide
Measure and Improve GPU Performance
repeating the timed operation to get better resolution, executing the function before
measurement to avoid initialization overhead, and subtracting out the overhead of
the timing function. Also, gputimeit ensures that all operations on the GPU have
completed before the final timing.
For example, consider measuring the time taken to compute the LU factorization of a
random N-by-N matrix A. To do this, define a function that performs the LU
factorization and pass its function handle to gputimeit:
N = 1000; % example size (an assumption; choose any N)
A = rand(N,'gpuArray');
fh = @() lu(A);
t = gputimeit(fh,2); % 2nd arg indicates number of outputs; t is the time in seconds
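To judge whether the GPU version is worthwhile, the same measurement can be set beside a CPU timing taken with timeit. The following is a minimal sketch, not a definitive benchmark: the matrix size N = 1000 and the variable names Acpu, Agpu, tCPU, and tGPU are assumptions chosen for illustration, and the result depends heavily on your hardware.

```matlab
N = 1000;                            % assumed matrix size, for illustration only
Acpu = rand(N);                      % matrix in host (CPU) memory
Agpu = gpuArray(Acpu);               % same data copied to the GPU

tCPU = timeit(@() lu(Acpu), 2);      % CPU timing, requesting 2 outputs from lu
tGPU = gputimeit(@() lu(Agpu), 2);   % GPU timing, requesting 2 outputs from lu

fprintf('CPU: %.4f s, GPU: %.4f s, ratio: %.2f\n', tCPU, tGPU, tCPU/tGPU);
```

Both functions return a time in seconds, so the ratio tCPU/tGPU gives a rough speedup figure for this one operation at this one size.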
You can also measure performance with tic and toc. However, to get accurate timing
on the GPU, you must wait for operations to complete before calling toc. There are two
ways to do this. You can call gather on the final GPU output before calling toc, which
forces all computations to complete before the time measurement is taken. Alternatively,
you can call the wait function with a GPUDevice object as its input. For example, to
measure the time taken to compute the LU factorization of matrix A using tic, toc, and
wait, proceed as follows:
gd = gpuDevice();
tic();
[l,u] = lu(A);
wait(gd);
tLU = toc();
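The first approach described above, calling gather before toc, can be sketched in the same way. This is an illustrative sketch only: the matrix size N = 1000 and the variable name uHost are assumptions.

```matlab
N = 1000;                  % assumed matrix size, for illustration only
A = rand(N,'gpuArray');

tic();
[l,u] = lu(A);
uHost = gather(u);         % returns only after the GPU has finished computing u,
                           % then copies the result to host memory
tLU = toc();
```

Because gather also transfers the data to host memory, a time measured this way includes the transfer. If you want to time the computation alone, the wait approach is the better fit.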
You can also use the MATLAB profiler to show how computation time is distributed in
your GPU code. Note that, to take timing measurements, the profiler runs each
line of code independently, so it cannot account for the overlapping (asynchronous)
execution that might occur during normal operation. For timing whole algorithms, you
should use tic and toc, or gputimeit, as described above. Also, the profiler might not
yield correct results for user-defined MEX functions if they run asynchronously.
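A minimal profiling session might look like the following sketch. The matrix size and the choice of operations (an LU factorization followed by a matrix product) are assumptions made for illustration; substitute your own GPU code between the profile on and profile off calls.

```matlab
N = 1000;                  % assumed matrix size, for illustration only
A = rand(N,'gpuArray');

profile on                 % start collecting profiling data
[l,u] = lu(A);             % GPU code to be profiled
B = l*u;
profile off                % stop collecting

p = profile('info');       % structure holding the collected statistics
```

Calling profile viewer afterward opens the report interactively. The per-line times in the report are best read as a guide to where time is spent, not as a measure of total run time, for the reasons given above.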
Vectorize for Improved GPU Performance
This example shows you how to improve performance by running a function on the GPU
instead of the CPU, and by vectorizing the calculations.
Consider a function that performs fast convolution on the columns of a matrix. Fast
convolution, which is a common operation in signal processing applications, transforms
each column of data from the time domain to the frequency domain, multiplies it by the