CUDA Libraries

To augment the abilities of CUDA developers, NVIDIA and other institutions provide domain-specific CUDA libraries that can be used as building blocks for more complex applications. These libraries have been optimized by CUDA experts and designed with high-level, highly usable APIs and standardized data formats so that they plug easily into existing applications.

List of CUDA Libraries

LIBRARY NAME                        DOMAIN
NVIDIA cuFFT                        Fast Fourier Transforms
NVIDIA cuBLAS                       Linear Algebra (BLAS Library)
CULA Tools                          Linear Algebra
MAGMA                               Next generation Linear Algebra
IMSL Fortran Numerical Library      Mathematics and Statistics
NVIDIA cuSPARSE                     Sparse Linear Algebra
NVIDIA CUSP                         Sparse Linear Algebra and Graph Computations
AccelerEyes ArrayFire               Mathematics, Signal and Image Processing, and Statistics
NVIDIA cuRAND                       Random Number Generation
NVIDIA NPP                          Image and Signal Processing
NVIDIA CUDA Math Library            Mathematics
Thrust                              Parallel Algorithms and Data Structures
HiPLAR                              Linear Algebra in R
Geometry Performance Primitives     Computational Geometry
Paralution                          Sparse Iterative Methods
AmgX                                Core Solvers

CUDA Library Workflow

CUDA libraries share concepts, features, and a common workflow when they are called from a host application. A common workflow for using the NVIDIA libraries is the following:

Stage 1: Creating a Library Handle

CUDA libraries share the concept of a handle, which contains contextual library information such as the format of data structures used, the devices used for computation, and other environmental data. For those library methods that accept a handle, you must allocate and initialize the handle before making any library calls. In general, we can think of the handle as an opaque object stored on the host that contains information which each library function may access. For example, you might want all library operations to run in a particular CUDA stream. Although different libraries use different function names, many offer a function that forces all library operations to occur in a certain stream (for example, cuSPARSE uses cusparseSetStream, cuBLAS uses cublasSetStream, and cuFFT uses cufftSetStream). This stream information would be stored in the library handle.
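As a minimal sketch, creating a cuBLAS handle and binding all of its operations to a particular stream might look like the following (error checking is omitted, and the stream itself is created only for this example):

cublasHandle_t handle;
cudaStream_t stream;

// Allocate and initialize the opaque library handle
cublasCreate(&handle);

// Create a CUDA stream and force all cuBLAS calls made
// through this handle to execute in that stream
cudaStreamCreate(&stream);
cublasSetStream(handle, stream);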

Stage 2: Allocating Device Memory

Device memory is either allocated explicitly by the programmer using cudaMalloc, or allocated internally by the library, which calls cudaMalloc on your behalf.
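For the explicit case, allocating space for an M x N single-precision matrix might look like this minimal sketch (M and N are assumed to be dimensions defined by the application):

float *dA;

// Explicitly allocate device memory for an M x N matrix of floats
cudaMalloc((void **)&dA, sizeof(float) * M * N);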

Stage 3: Converting Inputs to a Library-Supported Format

If the format of your application’s data is not supported by the CUDA library you are using, it must be re-formatted. For instance, if an application is storing 2D arrays in row-major order, but the CUDA library only accepts arrays in column-major order, you will need to perform some transformations.
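As a sketch of such a transformation, the loop below copies a row-major M x N host matrix into a newly allocated column-major buffer (the array names rowMajor and colMajor are assumptions for this example):

// Convert a row-major M x N host matrix to column-major order
float *colMajor = (float *)malloc(sizeof(float) * M * N);

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        // Row-major element (i, j) lives at offset i * N + j;
        // column-major element (i, j) lives at offset j * M + i
        colMajor[j * M + i] = rowMajor[i * N + j];
    }
}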

Stage 4: Populating Device Memory with Inputs

Stage 4 simply makes that data available to the library function on the CUDA device
by transferring it to the device memory. You can think of this as analogous to cudaMemcpy, though in many cases a library-specific function is used. For instance, when transferring a vector from the host to the device in a cuBLAS-based application, cublasSetVector should be used. Internally, it uses properly ordered and strided calls to cudaMemcpy (or some equivalent function) to transfer the input data to the device memory.

Stage 5: Configuring the Library

Often, the library function being called must be aware of data formatting, dimensionality, or other configurable parameters. In Stage 5, you manage this configuration process.

Stage 6: Executing

We call the desired library function with the properly configured objects from Stages 1 through 5.

Stage 7: Retrieving Results from Device Memory

We retrieve the output from device memory into host memory in a pre-determined format.

Stage 8: Converting Back to Native Format

In the case that an application’s native data format is different from the formats supported by the CUDA library in use, you will need to convert back to the application-specific format.

Stage 9: Releasing CUDA Resources

If the resources allocated by this workflow are no longer necessary, you can release them for use in future computation. Note that there is some overhead in allocating and releasing resources, so it is better to reuse resources across multiple invocations of CUDA library calls when possible.
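Continuing the cuBLAS-based sketch from Stage 1, releasing the resources created in this workflow might look like the following:

// Free the device memory allocated in Stage 2
cudaFree(dA);

// Destroy the stream and the library handle created in Stage 1
cudaStreamDestroy(stream);
cublasDestroy(handle);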


The cuBLAS Library

cuBLAS is a collection of linear algebra routines. Unlike cuSPARSE, cuBLAS is a port of a legacy linear algebra library, the Basic Linear Algebra Subprograms (BLAS) library.

Like BLAS, cuBLAS subroutines are split into three levels based on the types of operations they perform:

  • cuBLAS Level 1 contains vector-only operations like vector addition
  • cuBLAS Level 2 contains matrix-vector operations like matrix-vector multiplication
  • cuBLAS Level 3 contains matrix-matrix operations like matrix-matrix multiplication
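To make the levels concrete, the sketch below shows one representative single-precision routine from each level; the handle, device pointers, scalars, and dimensions are assumed to have been set up already, but cublasSaxpy, cublasSgemv, and cublasSgemm are all existing cuBLAS entry points:

// Level 1 (vector-vector): y = alpha * x + y, with vectors of length n
cublasSaxpy(handle, n, &alpha, dX, 1, dY, 1);

// Level 2 (matrix-vector): y = alpha * A * x + beta * y, with A an m x n matrix
cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dX, 1, &beta, dY, 1);

// Level 3 (matrix-matrix): C = alpha * A * B + beta * C,
// with A m x k, B k x n, and C m x n
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, dA, m, dB, k, &beta, dC, m);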

cuBLAS does not support sparse data formats; it supports and is optimized only for dense vector and dense matrix manipulation.

Because the original BLAS library was written in FORTRAN, it historically uses column-major array storage and one-based indexing. Column-major refers to how a multi-dimensional matrix is stored in a one-dimensional address space.

In column-major flattening, the elements of a column are iterated through and stored in consecutive memory locations before processing the next column. As a result, elements in the same column are located adjacent in memory, whereas elements in the same row are strided. This contrasts with the semantics of C/C++ from which cuBLAS is called, which is row-major, meaning that elements in the same row are stored adjacent to each other.

For compatibility reasons, the cuBLAS library also chooses to use column-major storage. This can be confusing if you are used to the row-major array layout in C/C++.
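As a concrete illustration of the difference, consider how the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6) would be flattened under each convention:

// Row-major: rows are stored consecutively
float rowMajor[6]    = {1, 2, 3, 4, 5, 6};

// Column-major: columns are stored consecutively
float columnMajor[6] = {1, 4, 2, 5, 3, 6};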

On the other hand, one-based indexing simply means that the very first element in an array is referenced using the value one rather than the value zero, as it is in C and many other programming languages. As a result, the final element in an N-element array is referenced with the index N rather than N-1, as in zero-based indexing.

However, the cuBLAS library has no control over the semantics of the C/C++ programming language in which it is built, so it must use zero-based indexing. This leads to an odd hybrid situation where the column-major rule of the original FORTRAN BLAS library applies, but not the one-based indexing.
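A common way to cope with this hybrid is a small indexing helper. The macro below (modeled on the IDX2C macro that appears in the cuBLAS documentation) maps a zero-based (row, column) pair to the zero-based offset of that element in a column-major array:

// Zero-based (row i, column j) -> offset in a column-major array whose
// leading dimension ld is the number of rows in the full matrix
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

// For example, the element in row 2, column 5 of an M x N column-major matrix A:
float value = A[IDX2C(2, 5, M)];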

All operations are done on dense cuBLAS vectors or matrices. These vectors and matrices are allocated as contiguous chunks of device memory through cudaMalloc, but data is transferred between the host and device using custom cuBLAS routines such as cublasSetVector/cublasGetVector and cublasSetMatrix/cublasGetMatrix. Although you can think of these specialized functions as wrappers around cudaMemcpy, they are well-optimized to transfer both strided and unstrided data. For example, a call to cublasSetMatrix takes the following arguments:

cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);

The first four arguments are self-explanatory: they define the dimensions of the matrix to transfer, the size of each element, and the location of the column-major source matrix A in host memory. The sixth argument B defines the location of the destination matrix in device memory. The use of the fifth and seventh arguments might be less clear: lda and ldb specify the leading dimensions of the source matrix A and the destination matrix B, that is, the total number of rows in each full matrix as it is stored in memory. This matters when only a submatrix of a matrix in host memory is being transferred to the GPU. In other words, if the full matrices stored at A and B are being transferred, lda and ldb should both equal rows; if only a submatrix is being transferred, they should be set to the number of rows of the full matrices. In all cases, lda and ldb must be greater than or equal to rows.
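For instance, to transfer only a P x Q submatrix whose top-left element sits at zero-based row r and column c of a column-major M-row host matrix A, lda stays at M while the tightly packed device copy uses ldb equal to P (P, Q, r, c, and dSub are hypothetical names used only for this sketch):

// Copy the P x Q submatrix of A whose top-left corner is at (r, c) into a
// tightly packed P x Q device matrix dSub; A has M rows, so lda is M
cublasSetMatrix(P, Q, sizeof(float), A + c * M + r, M, dSub, P);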

Given a dense two-dimensional column-major matrix A of single-precision floating-point values with M rows and N columns in host memory, you could use cublasSetMatrix to transfer the entire matrix to a device matrix dA with:

cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

We could also use cublasSetVector to transfer a single column of matrix A to a vector dV on the device. cublasSetVector takes the following arguments:

cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy);

where x is the source location on the host, y is the destination location on the device, n is the number of elements to transfer, elemSize is the size of each element in bytes, and incx and incy are the strides between the elements to be transferred in the source and destination, respectively. Transferring the first column, of length M, of a column-major matrix A to a device vector dV could be done with:

cublasSetVector(M, sizeof(float), A, 1, dV, 1);

You could also use cublasSetVector to transfer a single row of that matrix A to a vector dV on the device:

cublasSetVector(N, sizeof(float), A, M, dV, 1);

This call copies N elements from A to dV, reading every M-th element of A (that is, with a stride of M between consecutive elements read). Because A is column-major, this command copies the first row of A to the device. Copying row i would be implemented as:

cublasSetVector(N, sizeof(float), A + i, M, dV, 1);

A Simple cuBLAS Example

This example will perform matrix-vector multiplication on the GPU, a cuBLAS Level 2 operation. You can follow along by downloading the sample code in cublas.cu from Wrox.com or using the following code snippet:

// Create the cuBLAS handle
cublasCreate(&handle);
// Allocate device memory
cudaMalloc((void **)&dA, sizeof(float) * M * N);
cudaMalloc((void **)&dX, sizeof(float) * N);
cudaMalloc((void **)&dY, sizeof(float) * M);

// Transfer inputs to the device
cublasSetVector(N, sizeof(float), X, 1, dX, 1);
cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);

// Execute the matrix-vector multiplication
cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);

// Retrieve the output vector from the device
cublasGetVector(M, sizeof(float), dY, 1, Y, 1);
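If you are not following along with cublas.cu, the self-contained sketch below shows one way the snippet above might be embedded in a complete program; the matrix dimensions, the initialization values, and the final printf are illustrative assumptions rather than part of the original sample, and error checking is omitted for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    // Problem size and GEMV scalars (chosen arbitrarily for illustration)
    int M = 1024, N = 1024;
    float alpha = 1.0f, beta = 0.0f;

    // Allocate and initialize host data; A is stored column-major
    float *A = (float *)malloc(sizeof(float) * M * N);
    float *X = (float *)malloc(sizeof(float) * N);
    float *Y = (float *)malloc(sizeof(float) * M);
    for (int i = 0; i < M * N; i++) A[i] = (float)(i % 10);
    for (int i = 0; i < N; i++)     X[i] = 1.0f;
    for (int i = 0; i < M; i++)     Y[i] = 0.0f;

    // Create the cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allocate device memory
    float *dA, *dX, *dY;
    cudaMalloc((void **)&dA, sizeof(float) * M * N);
    cudaMalloc((void **)&dX, sizeof(float) * N);
    cudaMalloc((void **)&dY, sizeof(float) * M);

    // Transfer inputs, execute y = alpha * A * x + beta * y, retrieve the result
    cublasSetVector(N, sizeof(float), X, 1, dX, 1);
    cublasSetVector(M, sizeof(float), Y, 1, dY, 1);
    cublasSetMatrix(M, N, sizeof(float), A, M, dA, M);
    cublasSgemv(handle, CUBLAS_OP_N, M, N, &alpha, dA, M, dX, 1, &beta, dY, 1);
    cublasGetVector(M, sizeof(float), dY, 1, Y, 1);

    printf("Y[0] = %f\n", Y[0]);

    // Release cuBLAS and CUDA resources
    cudaFree(dA);
    cudaFree(dX);
    cudaFree(dY);
    cublasDestroy(handle);
    free(A);
    free(X);
    free(Y);
    return 0;
}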