2D Shared Memory

We can use shared memory to cache global data with square dimensions in a straightforward fashion.

Rectangular shared memory is the more general 2D case, in which the number of rows and columns in the array are not equal:

__shared__ int tile[Row][Col];

Without loss of generality, you are going to examine a rectangular shared memory array with 32 elements per row and 16 elements per column. The dimensions are defined in the following macros:

#define BDIMX 32
#define BDIMY 16

The rectangular shared memory tile is allocated as follows:

__shared__ int tile[BDIMY][BDIMX];

For simplicity, the kernel will be launched with a single grid containing one 2D block whose dimensions match the rectangular shared memory array, as follows:

dim3 block (BDIMX,BDIMY);
dim3 grid (1,1);

In this example, we are going to create kernels with two simple operations:

Write global thread indices to a 2D shared memory array.

Read those values from shared memory and store them to global memory.

Accessing Row-Major versus Accessing Column-Major

The first two kernels we investigate are:

__global__ void setRowReadRow(int *out);
__global__ void setColReadCol(int *out);

In the kernel setRowReadRow, the length of the innermost dimension of the shared memory array tile is set to the same value as the innermost dimension of the 2D thread block:

__shared__ int tile[BDIMY][BDIMX];

__global__ void setRowReadRow(int *out)
{
    // static shared memory
    __shared__ int tile[BDIMY][BDIMX];

    // mapping from thread index to global memory index
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;

    // wait for all threads to complete
    __syncthreads();

    // shared memory load operation
    out[idx] = tile[threadIdx.y][threadIdx.x];
}

In the kernel setColReadCol, the length of the innermost dimension of the shared memory array tile is set to the same value as the outermost dimension of the 2D thread block:

__shared__ int tile[BDIMX][BDIMY];

__global__ void setColReadCol(int *out)
{
    // static shared memory
    __shared__ int tile[BDIMX][BDIMY];

    // mapping from thread index to global memory index
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    // shared memory store operation
    tile[threadIdx.x][threadIdx.y] = idx;

    // wait for all threads to complete
    __syncthreads();

    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}

Writing Row-Major and Reading Column-Major

In this section, we implement a kernel that writes to a 2D shared memory array in row-major order and reads from it in column-major order. This pattern is applicable in real-world applications: it performs a matrix transpose, using shared memory to improve performance by keeping the strided accesses in low-latency on-chip memory.

The 2D shared memory tile is declared as follows:

__shared__ int tile[BDIMY][BDIMX];

The procedure for calculating the proper shared and global memory accesses is as follows. First, the 2D thread index of the current thread is converted to a 1D global thread ID:

unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

Because data elements in the output global memory are transposed, you then need to calculate the new coordinates in the transposed matrix, as follows:

unsigned int irow = idx / blockDim.y;
unsigned int icol = idx % blockDim.y;

You initialize the shared memory tile by storing the global thread IDs to the 2D shared memory tile as follows:

tile[threadIdx.y][threadIdx.x] = idx;

At this point, the data in shared memory is stored linearly from 0 to BDIMX×BDIMY-1. 

Now, you can access the shared memory data to be transposed with the coordinates calculated earlier. By swapping irow and icol in the shared memory access, you write the transposed data to global memory using the 1D thread IDs. As shown in the following line, a warp reads data elements from one column of shared memory:

out[idx] = tile[icol][irow];

The complete kernel code is as follows:

__global__ void setRowReadCol(int *out)
{
    // static shared memory
    __shared__ int tile[BDIMY][BDIMX];

    // mapping from 2D thread index to linear memory
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    // convert idx to transposed coordinate (row, col)
    unsigned int irow = idx / blockDim.y;
    unsigned int icol = idx % blockDim.y;

    // shared memory store operation 
    tile[threadIdx.y][threadIdx.x] = idx;

    // wait for all threads to complete
    __syncthreads();

    // shared memory load operation
    out[idx] = tile[icol][irow];
}

Dynamically Declared Shared Memory

Because dynamic shared memory can only be declared as a 1D array, a new index is required to convert from 2D thread coordinates to 1D shared memory indices when writing by rows and reading by columns:

unsigned int col_idx = icol * blockDim.x + irow;

Because icol corresponds to the innermost dimension of the thread block, this conversion yields column-major access to shared memory, which results in bank conflicts. The kernel code is as follows:

__global__ void setRowReadColDyn(int *out)
{
    // dynamic shared memory
    extern __shared__ int tile[];

    // mapping from thread index to global memory index
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    // convert idx to transposed (row, col)
    unsigned int irow = idx / blockDim.y;
    unsigned int icol = idx % blockDim.y;

    // convert back to smem idx to access the transposed element
    unsigned int col_idx = icol * blockDim.x + irow;

    // shared memory store operation
    tile[idx] = idx;

    // wait for all threads to complete
    __syncthreads();

    // shared memory load operation
    out[idx] = tile[col_idx]; 
}

Shared memory size must be specified as part of the kernel launch:

setRowReadColDyn<<<grid, block, BDIMX * BDIMY * sizeof(int)>>>(d_C);