DD2360 HT19 (50340)

Assignment III - Advanced CUDA

  • Due 25 Nov 2019 at 23:59
  • Points 1
  • Submitting a file upload
  • File types pdf
  • Available 11 Nov 2019 at 9:00 to 14 Jan 2020 at 23:59
This assignment was locked 14 Jan 2020 at 23:59.

In this assignment, we will go through the concepts of the GPU memory hierarchy, optimization techniques and CUDA libraries provided by NVIDIA.

To submit your assignment, prepare a small report that answers the questions in the exercises. Submit the report as a PDF with the following filename:

appgpu19_HW3_GroupNumberFromCanvas.pdf

Submit your code in a Git repository, and make it public so we can access it. Use the following folder structure and include the link in your report:

Assignment_3/ex_ExerciseNumber/your_source_code_files

The assignment is solved and submitted in groups of two, according to the Canvas signup.


Exercise 1 - CUDA Edge Detector using shared memory

In this exercise, we will implement an edge detector for bitmap images. We also provide a reference image and reference output so you can check your answer.

  • Comment out the calls to the CPU versions during development; otherwise you may still get a correct output file even if your kernel does not execute.
  • Run the program from the root of the exercise folder, and do not modify the folder names or folder structure!

Obtain the skeleton code, then compile and run it as follows, assuming that you are on Tegner with all modules loaded and have already been allocated a K420 GPU. Modify the architecture flag if you are using another GPU (e.g., -arch=sm_50 for the workstations in the lab rooms and -arch=sm_37 for the K80).

$ module load git
$ git clone https://github.com/steven-chien/DD2360-HT19.git
$ cd DD2360-HT19/Assignment_3/ex_1
$ nvcc -O3 -arch=sm_30 hw3_ex1.cu -o hw3_ex1.out
$ srun -n 1 ./hw3_ex1.out images/hw3.bmp

If you are running the exercise on Tegner, we also recommend that you add the -Y flag when you ssh in, to enable X11 forwarding so that you can display the images directly through the terminal.

Do not change the folder structure or rename anything.

Mapping of input data

The Bitmap (BMP) image format is an uncompressed format. Each BMP file contains an encoded header that specifies the {width, height} of the image, the number of bits per color plane, and more. After the header, a string of interleaved color values follows (e.g., in BGR). Here is a simplified example of how a 3x3 image looks inside the file:

[Figure 1: interleaved BGR pixel values of a 3x3 image as laid out in the file]

Each BGR triplet, from Blue / Green / Red, holds an 8-bit value per channel that encodes the intensity of that channel. The values span from 0 to 255 in the case of BMP 24bpp, where 0 means no contribution from that color and 255 full intensity. The decoding of the image is already done for you. The resulting data follows the exact mapping above, flattened as a 1D array.

Grayscale conversion

The first step of our edge detector is to discard the color information and work directly in black & white. Since a Bitmap image uses a BGR color space, where the combination of the individual intensities of each color value represent the final intensity of the specific pixels, we combine these pixels to generate a BMP 8bpp image in grayscale. In other words, we want only 8 bits per pixel.

For the conversion to grayscale, we are going to use the colorimetric (luminance-preserving) method, applying the following weighted sum of the three color values:

Y = 0.2126 × R + 0.7152 × G + 0.0722 × B

Note that since the color information is discarded, each pixel can now be represented by one value instead of three, meaning the output array will have size width*height. If the program executes successfully, an image will be created:

images/hw3_result_1.bmp

The image can be viewed with

$ display -resize 1280x720 images/hw3_result_1.bmp

You will get a new window that displays the converted image in black & white, such as this:

[Figure 2: the reference image converted to black & white]

Do not display the image without resizing since it is very large.

TODO: Find the declaration of gpu_greyscale() in hw3_ex1.cu and implement the GPU version of the black & white color conversion filter. The source code is already set up to call the kernel and generate the output, but you will need to uncomment the code inside main(). A sketch follows the hints below.

  • Hint #1: The kernel is launched with a 2D grid of 2D blocks. Consider calculating the ID of the thread in the Y direction to select the specific row, and the ID of the thread in the X direction to select the specific column.
  • Hint #2: The boundaries of the image must not be exceeded. Include an if-statement that guards against out-of-bounds accesses, based on the width and height parameters.
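As a hedged sketch following these hints (the parameter names and the float-based image representation are assumptions; check the skeleton's actual declaration), gpu_greyscale() might look like this:

__global__ void gpu_greyscale(int width, int height,
                              float *image, float *image_out)
{
    // Hypothetical sketch; assumes the BGR-interleaved input flattened
    // into a 1D array, as described in "Mapping of input data".
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column (Hint #1)
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row

    if (x < width && y < height) {                  // boundary guard (Hint #2)
        int offset = y * width + x;
        float b = image[offset * 3 + 0];
        float g = image[offset * 3 + 1];
        float r = image[offset * 3 + 2];
        // Colorimetric weighted sum from the formula above
        image_out[offset] = 0.0722f * b + 0.7152f * g + 0.2126f * r;
    }
}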

Convolution Filtering

The second step is to implement a Gaussian filter to smooth the grayscale image generated by the previous kernel. This filter reduces noise and improves the quality of the input for the next step, the Sobel filter, which is very sensitive to noise. For this exercise, we are going to apply a Gaussian filter using a 3×3 convolution matrix on all the pixels of the image. The convolution adds each pixel to its local neighbors, weighted by the matrix values:

output = input * K,   where K = (1/16) [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

The * operator represents convolution, not matrix multiplication. Map each pixel as the center of the 3×3 convolution matrix and apply the weights to the surrounding pixels. Since we use symmetric filters, the traversal order (top-to-bottom or bottom-to-top) does not matter.

Once the kernel is implemented and executed, a file images/hw3_result_2.bmp will be generated. The differences are very fine; to see them, do:

$ montage -tile 2x1 -crop 320x180+512+512 -geometry 640x360   \
        images/hw3_result_1.bmp images/hw3_result_2.bmp \
        images/hw3_result_2_comp.jpg
$ display images/hw3_result_2_comp.jpg

The new window will display a cropped area of the original black & white image (left), and a cropped area of the new blurred image (right). The differences are very subtle, but you should be able to notice some differences:

[Figure 3: cropped black & white image (left) next to the same crop of the blurred image (right)]

TODO: Find the implementation of cpu_applyFilter() inside the hw3_ex1.cu file and try to understand how a given convolution matrix is applied to a certain pixel (a sketch follows the hints below).

  • Hint #1: The input block of the image is given by the top-left corner, not the center of the block (the target pixel).
  • Hint #2: This is not a matrix-matrix multiplication, keep this in mind while reviewing the source code.
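As a hedged sketch of the idea (the signature is an assumption; compare with the actual cpu_applyFilter() in the skeleton), the helper might look like this:

__host__ __device__ float applyFilter(const float *image, int stride,
                                      const float *matrix, int filter_dim)
{
    // 'image' points to the TOP-LEFT corner of the block (Hint #1) and
    // 'stride' is the row pitch of the surrounding image.
    float pixel = 0.0f;
    for (int h = 0; h < filter_dim; h++) {
        for (int w = 0; w < filter_dim; w++) {
            // Element-wise multiply-accumulate, not matrix multiplication
            // (Hint #2).
            pixel += matrix[h * filter_dim + w] * image[h * stride + w];
        }
    }
    return pixel;
}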

Detecting Edges in the Image

Finally, we complete the edge detector by applying the Sobel filter. With this filter, we compute an approximation of the gradient of the image intensity function. This allows us to create a new image where the edges are emphasized, which constitutes the basis for full edge detection algorithms such as Canny.

The filter uses two 3×3 kernels which are convolved with the original image to calculate approximations of the derivatives in the horizontal and vertical directions. In other words, if we define A as the source image, and Gx and Gy as the two convolution results that approximate the horizontal and vertical derivatives, the computations are as follows:

Gx = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]] * A
Gy = [[-1, -2, -1], [0, 0, 0], [+1, +2, +1]] * A

The resulting gradient magnitude of each pixel is obtained by taking the square root of the sum of their squares:

G = sqrt(Gx² + Gy²)

For the last step, we want you to implement the GPU version of cpu_sobel(), which is already declared in hw3_ex1.cu under the name gpu_sobel(). The implementation of this function is very similar to gpu_gaussian(), except that we apply two different convolution filters to the same pixel and combine the result, as sketched below.
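As a hedged sketch, the core of gpu_sobel() might combine the two convolutions like this (sobel_v, sobel_h, offset_t, and offset are illustrative names, not necessarily the skeleton's):

// Apply both 3x3 matrices to the same block of pixels, starting from the
// top-left corner offset_t, and combine the results into the magnitude.
float gx = applyFilter(&image[offset_t], width, sobel_v, 3);
float gy = applyFilter(&image[offset_t], width, sobel_h, 3);
image_out[offset] = sqrtf(gx * gx + gy * gy);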

Once the implementation is complete, run the program and open the result with the following two commands:

$ montage -border 0 -geometry 640x360 -tile 3x1      \
        images/hw3.bmp images/hw3_result_1.bmp \
        images/hw3_result_3.bmp images/hw3_result_3_comp.jpg
$ display images/hw3_result_3_comp.jpg

A new window will open that displays the original image (left), the black & white image (center), and finally the result of applying the Gaussian and Sobel filters (right):

[Figure 4: original image (left), black & white version (center), Gaussian + Sobel result (right)]

You can also view the resulting image at a larger resolution. Use the display command in combination with the resize flag:

$ display -resize 1280x720 images/hw3_result_3.bmp

Alternatively, you can remove the "-resize 1280x720" option to visualize the image at full resolution. This might take some time to load, but it may be worth it to see all the small details. Whether you resize the image or not, you should observe something like the following:

[Figure 5: final edge-detected image]

Optimizing Memory Accesses

In this section, we are going to optimize the GPU versions of the Gaussian and Sobel filters by using Shared Memory instead of Global Memory. The idea is to bring the content of the image from Global Memory into Shared Memory in blocks of size BLOCK_SIZE_SH. This constant equals the dimension of each thread block in the grid, plus some additional elements in X and Y.

We first ask you to declare the BLOCK_SIZE_SH constant at the top of the file, which defines the dimension of the Shared Memory block. Use the following:

#define BLOCK_SIZE_SH 18

We will explain below why we use 18 here and not 16, the number of threads per block in each dimension.

We will use this constant for the declaration of the memory space inside gpu_gaussian() and gpu_sobel(). Place the declaration in one of the first lines of each kernel:

__shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

This declares a 2D shared block in Shared Memory, using the 1D array representation that we have already discussed in the previous exercises. The __shared__ attribute tells the compiler that we want this variable to be located in Shared Memory and not in Local or Global Memory.

Hence, the first step is to declare the shared block inside gpu_gaussian() and gpu_sobel(). Then, make each thread copy a pixel from the input image into the shared memory block. You have to call __syncthreads() to guarantee that each thread has finished retrieving its part of the block before the data is used. Thereafter, change the input of the applyFilter() function to use the shared block instead.

TODO: In hw3_ex1.cu, declare a Shared Memory block within gpu_gaussian() and another one within gpu_sobel(). Thereafter, introduce the necessary changes to make each thread bring one pixel value into the shared block. Change the input parameter of applyFilter() to use the shared block (i.e., instead of a reference to the input image directly). A sketch follows the hint below.

  • Hint #1: Use __syncthreads() to guarantee that all the threads have copied their pixels to the Shared Memory.
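As a hedged sketch, this first (deliberately still incomplete) version of gpu_gaussian() might start like this:

__global__ void gpu_gaussian(int width, int height,
                             float *image, float *image_out)
{
    __shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        // Each thread brings exactly one pixel of its block.
        sh_block[threadIdx.y * BLOCK_SIZE_SH + threadIdx.x] =
            image[y * width + x];
    }
    __syncthreads();  // all pixels copied before the filter reads them (Hint #1)

    // ... applyFilter() on sh_block instead of image ...
}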

If you have implemented it "correctly", you will observe that the output is not exactly what you expected it to be. In the case of the Gaussian filter, the side-by-side comparison with the original image should now look something like this:

[Figure 6: grid-like artifacts in the Gaussian output when only the block-interior pixels are copied to Shared Memory]

The reason is that the exercise is a little more complex than one might initially expect. With the change that you just introduced, we are not yet bringing in the extra rows and columns that surround each block. Without them, some of the threads access uninitialized data.

This is the main reason why we declared the constant BLOCK_SIZE_SH with two additional elements per dimension: it ensures that all the threads within the block access data that is available inside the Shared Memory space. The final task for you is therefore to handle the boundaries of each thread block. We already gave you a hint in the declaration of BLOCK_SIZE_SH (i.e., two extra rows and columns are needed).

TODO: Extend the Shared Memory version of gpu_gaussian() and gpu_sobel() to transfer part of the surrounding pixels of the thread block to Shared Memory. Make sure that you do not exceed the boundaries of the image. A sketch follows the hint below.

  • Hint #1: Once again, use __syncthreads() to guarantee that all the threads have copied their pixels to the Shared Memory. You will need more than one call to this function.
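As a hedged sketch of the extended copy, each interior pixel can be shifted by (+1, +1) so that one halo row/column fits on every side; only the left edge is shown here (the right, top, and bottom edges and the corners follow the same pattern, and some image-boundary guards are elided):

// Interior pixel lands at (+1, +1) inside the 18x18 shared block.
sh_block[(threadIdx.y + 1) * BLOCK_SIZE_SH + (threadIdx.x + 1)] =
    image[y * width + x];
__syncthreads();

// Left halo column, fetched by the block's leftmost threads only; the
// x > 0 check keeps us inside the image.
if (threadIdx.x == 0 && x > 0) {
    sh_block[(threadIdx.y + 1) * BLOCK_SIZE_SH] = image[y * width + (x - 1)];
}
__syncthreads();  // hence more than one call, as Hint #1 says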

Questions to answer in report

  1. Explain how the mapping of GPU threads and thread blocks (already implemented for you in the code) works.
  2. Explain why shared memory can (theoretically) improve performance.
  3. Explain why the resulting image looks like a "grid" when the kernel simply copies pixels into the shared block. Explain how this is solved and what the corner cases are.
  4. There are several images of different sizes in the images folder. Try running the program on them and report how the execution times relate to the file sizes.

[Result images: rome_edge.jpg, hk_edge.jpg, nyc_edge.jpg]


Exercise 2 - Pinned and Managed Memory

In this exercise, we study the use of pinned memory and managed memory. We will reuse your solution to Assignment II - Exercise 3 (the particle mover) for this exercise.

Programming exercise on Pinned Memory

Put the file for this sub-exercise in:

DD2360-HT19/Assignment_3/ex_2

and call it exercise_2a.cu.

In our particle simulator, once the initial particle data is copied to the GPU after initialization, the simulation is completely offloaded and time-stepped on the GPU. In reality, an application often includes steps that require processing on the host. This dependency implies that data has to be transferred back and forth every time step. Implement the following:

  1. Modify the program, such that
    1. All particles are copied to the GPU at the beginning of a time step.
    2. All particles are copied back to the host after the kernel completes, before proceeding to the next time step.
  2. Use nvprof to study the time spent on data movement and actual computation, with a large number of particles that can fill the GPU memory.
  3. Change the appropriate memory allocator to use cudaMallocHost() (see the sketch after this list).
  4. Use nvprof to study the time spent on data movement and actual computation, with a large number of particles that can fill the GPU memory. Also note the time spent on allocation.
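As a hedged sketch of steps 1 and 3 with a toy buffer instead of the actual particle arrays (names are illustrative):

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = (1 << 24) * sizeof(float);
    float *h_data, *d_data;

    // Step 3: pinned host allocation; the pageable version would be
    // h_data = (float *) malloc(bytes);
    cudaMallocHost((void **)&h_data, bytes);
    cudaMalloc((void **)&d_data, bytes);

    for (int step = 0; step < 10; step++) {
        // Step 1.1: copy the data to the GPU at the start of the time step...
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
        // ...the mover kernel would be launched here...
        // Step 1.2: ...and copied back before the next time step.
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_data);
    cudaFreeHost(h_data);  // pinned memory is freed with cudaFreeHost()
    return 0;
}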

Programming exercise on Managed Memory

Put the file for this sub-exercise in:

Assignment_3/ex_2

and call it exercise_2b.cu.

Make the necessary changes to the program so that it uses managed memory (a minimal sketch follows the list below).

  1. Change the GPU memory allocators to use cudaMallocManaged().
  2. Eliminate explicit data copy and device pointers.
  3. Study the breakdown of timing using nvprof.
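As a hedged sketch of the managed-memory pattern with a toy kernel (not the particle mover itself):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;

    // One managed allocation replaces the separate host and device arrays;
    // no explicit cudaMemcpy or device pointer is needed.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; i++) data[i] = 1.0f;   // host writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                      // synchronize before host access

    printf("data[0] = %f\n", data[0]);            // host reads directly
    cudaFree(data);
    return 0;
}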

Questions to answer in the report

  1. What are the differences between pageable memory and pinned memory? What are the tradeoffs?
  2. Do you see any difference in the breakdown of execution time after changing from pageable memory to pinned memory?
  3. What is managed memory? What are the implications of using managed memory?
  4. If you are using Tegner or lab computers, the use of managed memory will result in an implicit memory copy before CUDA kernel launch. Why is that?

Exercise 3 - cuBLAS

In this exercise, we study one of the most widely used CUDA libraries, cuBLAS. The library is used to accelerate BLAS (Basic Linear Algebra Subprograms) operations. It has a host interface and is highly optimized. We use SGEMM (Single-precision General Matrix Multiply) of square matrices as the case study in this exercise.

Put the file of this sub exercise in:

Assignment_3/ex_3

We provide you with a skeleton code in:

DD2360-HT19/Assignment_3/ex_3/exercise_3.cu

For simplicity, we only consider matrices whose width is a multiple of 16. The number can be changed by defining TILE_SIZE. We have also implemented the timing for you. Select the appropriate architecture and compile the code as follows:

$ nvcc -O3 -arch=sm_30 exercise_3.cu -o exercise_3.out -lcurand -lcublas

While developing the sub-exercises, you can comment out parts of the code in main() for testing. The code is executed with -s [matrix size] and, optionally, -v to perform CPU verification. For example, to test a 1024x1024 matrix with CPU verification:

$ srun -n 1 ./exercise_3.out -s 1024 -v
Matrix size: 1024x1024
Matrix size: 1024x1024
Grid size: 64x64
Tile size: 16x16
Run CPU sgemm: 1

CPU matmul:			3358.208000 ms
GPU cuBLAS matmul:		1.233000 ms
GPU matmul (global memory):	78.396000 ms
GPU matmul (shared memory):	7.610000 ms

GEMM

GEMM is defined as C = αAB + βC, where A and B are the matrices to be multiplied and C is where the result is accumulated. α is a scalar that controls the scaling of the multiplication result, and β is another scalar that controls how the result is accumulated into C. In this exercise, we only consider matrix multiplication; therefore we set α = 1.0 and β = 0.0.

Naive matrix multiplication

Study the function cpu_matmul() and the GPU kernel naive_sgemm_kernel(). Understand how the CPU version is translated into the GPU version. Also, understand how the threads and thread blocks are organized.

Just for fun: for those who took the course DD2356, do you remember why the two inner loops in cpu_matmul() are reordered?

Matrix multiplication with shared memory

Study the code in the GPU kernel shared_sgemm_kernel(). The code performs tiled matrix multiplication. The algorithm of a tiled matrix multiplication is the same as the ordinary matrix multiplication algorithm, except that the lowest unit of multiplication is a sub-matrix (tile) instead of a scalar. More information can be found here.

Fill in the parts of the function that are marked with a TODO comment (a sketch of the overall structure follows below). Ensure that the CPU verification passes.
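As a hedged sketch of the overall tile loop (variable names are illustrative and the skeleton's structure may differ; it assumes the width n is a multiple of TILE_SIZE):

#define TILE_SIZE 16  // as in the skeleton

__global__ void shared_sgemm(int n, const float *A, const float *B, float *C)
{
    __shared__ float sA[TILE_SIZE][TILE_SIZE];
    __shared__ float sB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE_SIZE; t++) {
        // Each thread loads one element of the A tile and one of the B tile.
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE_SIZE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * n + col];
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE_SIZE; k++)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // everyone done reading before tiles are overwritten
    }
    C[row * n + col] = acc;
}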

Matrix multiplication with cuBLAS

Study the code in the function cublas_sgemm(). The function call to cuBLAS SGEMM is already coded for you; fill in the blanks (a sketch of the call shape follows below). Hint: since all the matrices are square, all the widths and strides are the same. Ensure that the CPU verification passes.
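As a hedged sketch of the call shape for square n×n matrices (handle, d_A, d_B, and d_C are illustrative names; requires #include <cublas_v2.h> and linking with -lcublas):

cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.0f, beta = 0.0f;  // multiplication only: C = 1.0*BA + 0.0*C
// Note the B-before-A argument order, matching the requested C = BA.
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, n, n,        // m, n, k are all equal for square matrices
            &alpha,
            d_B, n,         // the leading dimensions are all n
            d_A, n,
            &beta,
            d_C, n);

cublasDestroy(handle);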

Questions to answer in the report

  1. Why does the matrix size have to be a multiple of 16?
  2. Refer to shared_sgemm_kernel(). There are two __syncthreads() in the loop. What are they used for, in the context of this code?
    1. What is the directive that can potentially improve performance in the actual multiplication? What does it do?
    2. Compared to the Edge Detector in Exercise 1, there is a much larger speedup after switching from global memory to shared memory. What might be the reason?
  3. Refer to cublas_sgemm(). We asked you to compute C = BA instead of C = AB. This has to do with an important property of cuBLAS. What is that property, and why do we compute C = BA?
  4. Run the program with different input sizes, for example from 64, 128, ..., up to 4096. Make a grouped bar plot of the execution times of the different versions (CPU, GPU Global, GPU Shared, GPU cuBLAS). You can plot the CPU results in a separate figure if their execution time goes out of scale compared to the rest.
  5. The execution time benchmark implemented in the code is good enough for this exercise, but in general it is not a good way to benchmark. Why?