Assignment I - GPU Architecture

Due 3 Nov 2019 by 23:59
Points 1
Submitting a file upload
File types pdf
Available 29 Oct 2019 at 0:00 to 14 Jan 2020 at 23:59

This assignment was locked 14 Jan 2020 at 23:59.

In the first assignment of the course, we would like to challenge your knowledge of GPU architecture. The goal is to make sure that you understand the main concepts that motivate the use of graphics cards today, including why these devices are massively parallel and how their memory hierarchy differs from that of a CPU.

To submit your assignment, prepare and upload a PDF document that answers all the questions asked below. You must name the file following this format:

appgpu20_HW1_GroupNumberFromCanvas.pdf

There is no code submission for this assignment.

Exercise 1 - Questions about GPU Architecture

Write a report by answering the following questions.

Why did GPUs emerge as suitable hardware for computer graphics (e.g. games)?
Why do we talk about throughput-oriented architecture when we talk about GPUs?
List the main differences between GPUs and CPUs in terms of architecture.
Use the Internet to find out and list the number of SMs, the number of cores per SM, the clock frequency and the memory size of the NVIDIA GPU that you plan to use during the course. It might be the GPU of your laptop/workstation, or one of the GPUs on Tegner (NVIDIA Quadro K420 or NVIDIA Tesla K80). Please make sure that you mention the specific model as well.
Use Google Scholar to find a scientific paper reporting on work that uses GPUs in your main domain area (HPC, image processing, machine learning, ...). Report the title, authors, conference name/journal, the GPU type that was used, and which programming approach was employed.

Exercise 2 - Bandwidth Test

We have learned that one of the current weaknesses of GPU programming is the link between the host and device (GPU) memories. Measure the host-to-device, device-to-host and device-to-device bandwidth using the bandwidthTest utility. Follow the instructions below to run the bandwidth test, and answer the questions at the end of the instructions in your report.

You need to decide whether to measure the bandwidth on the lab computers (2.a) or on the Tegner GPUs (2.b).

If you are using your own laptop, locate your own CUDA SDK folder and modify the include path in the compile command accordingly.

2.a - Instructions for measuring bandwidth on lab computers (5V4 Magenta/4V6 Brun)

Setting up the environment

To use the nvcc compiler on lab computers, you will need to add the "bin" folder of CUDA to your PATH environment variable. From that point, you can directly use it like any other command:

$ export PATH=/usr/local/cuda/bin:$PATH

When compiling, the -arch flag is necessary to specify the CUDA architecture that we would like to target. The GTX 745 GPUs in the lab computers use the GM107 processor, which belongs to the first generation of the Maxwell architecture. It supports compute capability 5.0, so we specify -arch=sm_50.
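The mapping from compute capability to the -arch value can be sketched in a few lines of Python. This is only an illustration: the helper names arch_flag and nvcc_command are hypothetical and not part of the CUDA toolkit, and the include directory is the lab path used in these instructions.

```python
def arch_flag(compute_capability):
    """Turn a compute capability such as "5.0" into an nvcc -arch flag."""
    major, minor = compute_capability.split(".")
    return f"-arch=sm_{int(major)}{int(minor)}"


def nvcc_command(compute_capability, include_dir):
    """Assemble the compile command for bandwidthTest.cu as a list of arguments."""
    return [
        "nvcc",
        arch_flag(compute_capability),
        f"-I{include_dir}",  # location of the samples' common headers
        "bandwidthTest.cu",
        "-o",
        "bandwidthTest",
    ]


if __name__ == "__main__":
    # Compute capability 5.0 (GTX 745) yields -arch=sm_50.
    print(" ".join(nvcc_command("5.0", "/usr/local/cuda/samples/common/inc")))
```

If you use a different GPU, look up its compute capability first and pass that value instead of "5.0".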

Detailed information can be found at Tutorial - Using CUDA in the laboratory workstations.

Getting and Running bandwidthTest

Copy the bandwidthTest utility from the CUDA SDK examples and compile it:

$ cp -rf /usr/local/cuda/samples/1_Utilities/bandwidthTest ./bandwidthTest
$ cd bandwidthTest
$ nvcc -arch=sm_50 -I/usr/local/cuda/samples/common/inc bandwidthTest.cu -o bandwidthTest

The bandwidthTest tool can then be executed directly:

$ ./bandwidthTest

Studying the bandwidth and answering questions

When you execute the program without any arguments you should be able to see something like this:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 745
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			12.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			25.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The bandwidth test gives you three results: Host to Device, Device to Host and Device to Device. The transfers use pinned memory, which the tool reports as PINNED Memory Transfers. We will simply use this mode for now and discuss pinned memory in more detail later in the course. To test other transfer sizes, you can run the tool in "shmoo" mode:

$ ./bandwidthTest --mode=shmoo

And you will get something like this for the three kinds of transfer:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 745
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   1000				0.7
   2000				1.3
   3000				2.0
...

Looking at the results, describe your observations in the report and explain why the bandwidth behaves this way. You can optionally provide a line plot to support your explanation.
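If you decide to make a line plot, the tool's text output first has to be turned into numbers. The following is a minimal Python sketch of one way to do that, assuming the output has the layout shown above. The function name parse_bandwidth and the embedded SAMPLE text are illustrative only and not part of the CUDA samples.

```python
import re

# A few lines copied from the shmoo output shown above (illustrative input).
SAMPLE = """\
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)\tBandwidth(GB/s)
   1000\t\t\t\t0.7
   2000\t\t\t\t1.3
   3000\t\t\t\t2.0
"""


def parse_bandwidth(text):
    """Return {transfer direction: [(size_in_bytes, bandwidth), ...]}."""
    results = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # Section headers look like "Host to Device Bandwidth, 1 Device(s)".
        header = re.match(r"(.+ to .+) Bandwidth, \d+ Device\(s\)", stripped)
        if header:
            current = header.group(1)
            results[current] = []
            continue
        # Data rows hold an integer size followed by a bandwidth value.
        row = re.fullmatch(r"(\d+)\s+(\d+(?:\.\d+)?)", stripped)
        if row and current is not None:
            results[current].append((int(row.group(1)), float(row.group(2))))
    return results


if __name__ == "__main__":
    for direction, rows in parse_bandwidth(SAMPLE).items():
        print(direction, rows)
```

To use it on a real run, redirect the tool's output to a file, read the file into a string, and pass that string to parse_bandwidth. The resulting (size, bandwidth) pairs can be given to any plotting tool. The bandwidth values keep whatever unit the tool printed in the column header.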

2.b - Instructions for measuring bandwidth on Tegner

Setting up the environment

To connect to Tegner, use SSH with the username that has been assigned to you by the PDC Supercomputing Center. You must first obtain a Kerberos ticket:

$ kinit --forwardable your_username@NADA.KTH.SE
$ ssh your_username@tegner.pdc.kth.se

Important note: If you are using one of the computers in the laboratory room, remember to use pdc-kinit and pdc-ssh instead.

More information can be found at Tutorial - Connecting to PDC Supercomputer and Tutorial - Using GPUs on Tegner.

Getting and Running bandwidthTest

Change the current directory to your Klemming folder and copy the bandwidthTest utility from the CUDA SDK examples:

$ cd /cfs/klemming/nobackup/u/your_username
$ cp -rf /pdc/vol/cuda/cuda-10.0/samples/1_Utilities/bandwidthTest ./bandwidthTest

To compile the bandwidth test, you have to load the GNU Compiler and CUDA modules, and compile the "bandwidthTest.cu" file using nvcc (do not use the Makefile that is provided inside the folder!):

$ module load gcc/7.2.0 cuda/10.0
$ cd bandwidthTest
$ nvcc -arch=sm_30 -I/pdc/vol/cuda/cuda-10.0/samples/common/inc bandwidthTest.cu -o bandwidthTest

The last step is to allocate a node with GPU on Tegner and use the srun command to execute the bandwidth test:

$ salloc --nodes=1 --gres=gpu:K420:1 -C Haswell -t 00:05:00 -A edu17.dd2360
$ srun -n 1 ./bandwidthTest

If there is an active reservation, you can add it to the salloc command as well. Note that we are asking for 5 minutes of computation time on one single node and that we are specifying that we want to get access to the GPU resource of the node with the --gres=gpu:K420:1 option. Check the Canvas pages "Introduction to PDC environment" and "Reserved allocation time on Tegner" for additional details on how to connect and run jobs on Tegner.

Please request a node with salloc only once your code compiles without errors and you are ready to run your program on Tegner. When you have finished and do not plan to run anything else for some time, type exit to relinquish your allocation so that other students can get quick access to the cluster. This way, we share the resources efficiently and everyone can run their jobs without waiting.

Studying the bandwidth and answering questions

When you execute the program without any arguments you should be able to see something like this:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro K420
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6026.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6549.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			21285.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

To test other transfer sizes, you can run the tool in "shmoo" mode:

$ srun -n 1 ./bandwidthTest --mode=shmoo

And you will get something like this for the three kinds of transfer:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro K420
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   1024				380.2
   2048				750.8
   3072				1144.0
....

Looking at the results, describe your observations in the report and explain why the bandwidth behaves this way. You can optionally provide a line plot to support your explanation.
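Note that the sample output for the lab computers reports GB/s, while the Tegner sample output reports MB/s. If you want both on the same scale in a plot or table, a small conversion helper is enough. The Python sketch below is hypothetical (mb_per_s_to_gb_per_s is not part of any tool) and it assumes decimal prefixes by default. Whether your version of bandwidthTest counts a megabyte as 10^6 or 2^20 bytes is something to verify in its source, which is why the size of a megabyte is left as a parameter.

```python
def mb_per_s_to_gb_per_s(mb_per_s, bytes_per_mb=1_000_000):
    """Convert a bandwidth reading from MB/s to GB/s (1 GB = 10**9 bytes).

    bytes_per_mb is an assumption about how the tool defines a megabyte:
    pass 2**20 if your version divides by 1 << 20 instead of 10**6.
    """
    bytes_per_s = mb_per_s * bytes_per_mb
    return bytes_per_s / 1e9


if __name__ == "__main__":
    reading = 6026.6  # Host to Device value from the sample Tegner output
    print(round(mb_per_s_to_gb_per_s(reading), 4))
    print(round(mb_per_s_to_gb_per_s(reading, bytes_per_mb=2**20), 4))
```

Whichever definition applies, state the unit you used next to the numbers in your report.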
