DD2360 HT19 (50340)
Assignment II - CUDA Basics

  • Due 10 Nov 2019 by 23:59
  • Points 1
  • Submitting a file upload
  • File types pdf
  • Available 15 Oct 2019 at 0:00 – 14 Jan 2020 at 23:59
This assignment was locked 14 Jan 2020 at 23:59.

For the second assignment of the course, you will deepen your understanding of the main CUDA concepts. This includes learning how to compile a CUDA program, how to transfer data to/from the GPU, and how to handle data structures inside CUDA kernels. We will ask you to implement a mock-up simulation that updates the positions of thousands of particles based on their individual velocities.

To submit your assignment, prepare a small report that answers the questions in the exercises. Submit the report as a PDF with the following filename:

appgpu19_HW2_GroupNumberFromCanvas.pdf

Submit your code in a Git repository, and make it public so we can access it. Use the following folder structure and include the link in your report:

Assignment_2/ex_ExerciseNumber/your_source_code_files

A simple Git tutorial can be found here.

The assignment is solved and submitted in groups of two, according to the Canvas signup.

Important note: Check Tutorial - Using CUDA in the laboratory workstations for instructions on how to use workstations available in the laboratory room.


Exercise 1 - Hello World!

For the first exercise, we ask you to create a very basic “Hello World!” program in CUDA. This will help you understand the dependencies you need for running CUDA-based applications, such as the CUDA SDK on your laptop/desktop.

In addition, you will consolidate some base CUDA skills required throughout the course (e.g., how to compile a CUDA program).
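To give an idea of the workflow, compiling and running with nvcc might look like the following (the -arch value is an assumption; match it to the compute capability of the GPU you actually use):

```shell
# Compile exercise_1.cu for a specific GPU architecture and run it.
# Adjust -arch (e.g., sm_30 on Tegner, sm_50 on the lab machines)
# to the compute capability of your own GPU.
nvcc -arch=sm_50 exercise_1.cu -o exercise_1
./exercise_1
```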

Programming Exercise

Write a CUDA program according to the following:

  1. Create an exercise_1.cu file that contains a CUDA kernel that prints a “Hello World!” message and the thread identifier.
    • Use printf to output the message.
    • Do not forget to specify the target architecture when compiling (e.g., "sm_30" on Tegner or "sm_50" on the lab machines); otherwise, I/O support inside the CUDA kernel will be limited.
  2. Define the main() function inside exercise_1.cu to call the kernel and generate all the messages from the GPU.
    • Set the kernel to run with 256 threads in a single thread block, following a 1D distribution.
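One possible sketch of such a program, under the launch configuration described above (not necessarily the only valid structure):

```cuda
#include <cstdio>

// Kernel: every thread prints the message together with its own identifier.
__global__ void hello() {
    printf("Hello World! My threadId is %d\n", threadIdx.x);
}

int main() {
    // One 1D thread block of 256 threads.
    hello<<<1, 256>>>();
    // Kernel launches are asynchronous: without this synchronization,
    // the host may exit before the GPU output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```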

After you implement your exercise and run it, your application is expected to output something similar to:

Hello World! My threadId is 0
Hello World! My threadId is 1
Hello World! My threadId is 2
···
Hello World! My threadId is 253
Hello World! My threadId is 254
Hello World! My threadId is 255

If you are struggling to see the messages, do not worry. Just think about how the interaction between the CPU and the GPU really works. Review the video lectures for more information.

Questions to answer in report

Explain how the program is compiled and the environment that you used. Explain what CUDA threads and thread blocks are, and how they are related to GPU execution.


Exercise 2 - Performing SAXPY on the GPU

Now that you have a clear understanding of how to compile and run CUDA-based programs, in this exercise you will practice thread distribution and memory management between the host memory (i.e., CPU space) and the device memory (i.e., GPU space).

For this purpose, we ask you to implement a simple SAXPY program. SAXPY is very suitable to make you understand how to index 1D arrays inside a GPU kernel. The term stands for “Single-Precision A*X Plus Y”, where A is a constant, and X and Y are arrays.

Programming Exercise

Write a program according to the following:

  1. Create an exercise_2.cu file that contains a CUDA kernel that computes SAXPY.
    • The kernel must receive as input two arrays “x” and “y”, and a constant “a”. The base type is float.
    • The output is expected to be stored in the array “y”.
  2. Inside exercise_2.cu, create an equivalent CPU implementation of SAXPY to compare the result with the GPU version afterward.
  3. Define the main() function inside exercise_2.cu to call the kernel and compute SAXPY on the CPU and the GPU.
    • Set an ARRAY_SIZE constant to define the size per array. The number of elements per array is irrelevant, but use something reasonable (e.g., 10000).
    • Set the kernel to run with a thread block size of 256 threads. The grid size (i.e., number of thread blocks) depends on the constant ARRAY_SIZE.
    • Make sure that you create the correspondent GPU memory for each array, including transferring the data to/from the GPU.
    • Hint: The ARRAY_SIZE might not be a multiple of the block size!
  4. Compare the outputs of both implementations inside the main() function, to guarantee that the CPU and GPU implementations are equivalent.
    • The precision of the floating-point operations might differ between the two versions, which can translate into rounding differences. Hence, use an error margin when comparing the results.
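The steps above can be sketched as follows. This is a minimal version under the suggested constants; error checking of the CUDA calls is omitted for brevity, and the tolerance value is an assumption:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>

#define ARRAY_SIZE 10000
#define BLOCK_SIZE 256

// SAXPY on the GPU: y = a*x + y. The bounds guard handles the case
// where ARRAY_SIZE is not a multiple of the block size.
__global__ void saxpy_gpu(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Equivalent CPU implementation for comparison.
void saxpy_cpu(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

int main() {
    const float a = 2.0f;
    size_t bytes = ARRAY_SIZE * sizeof(float);
    float *x = (float *)malloc(bytes);
    float *y_cpu = (float *)malloc(bytes);
    float *y_gpu = (float *)malloc(bytes);
    for (int i = 0; i < ARRAY_SIZE; i++) { x[i] = (float)i; y_cpu[i] = y_gpu[i] = 1.0f; }

    printf("Computing SAXPY on the CPU...");
    saxpy_cpu(a, x, y_cpu, ARRAY_SIZE);
    printf(" Done!\n");

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y_gpu, bytes, cudaMemcpyHostToDevice);

    printf("Computing SAXPY on the GPU...");
    // Round the grid size up so the last partial block is covered.
    int gridSize = (ARRAY_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE;
    saxpy_gpu<<<gridSize, BLOCK_SIZE>>>(a, d_x, d_y, ARRAY_SIZE);
    cudaMemcpy(y_gpu, d_y, bytes, cudaMemcpyDeviceToHost);
    printf(" Done!\n");

    // Compare with an error margin to absorb floating-point rounding
    // differences between the two versions.
    int correct = 1;
    for (int i = 0; i < ARRAY_SIZE; i++)
        if (fabsf(y_cpu[i] - y_gpu[i]) > 1e-4f) correct = 0;
    printf("Comparing the output for each implementation... %s\n",
           correct ? "Correct!" : "Incorrect!");

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y_cpu); free(y_gpu);
    return 0;
}
```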

When everything is ready, your program is expected to output something similar to the following:

Computing SAXPY on the CPU… Done!
Computing SAXPY on the GPU… Done!
Comparing the output for each implementation… Correct!

You might optionally consider introducing time measurements in your code, so that you can get a feeling for how fast or slow each version is. You can use the gettimeofday() function to get timestamps in order to calculate the execution time. Take care of CUDA kernel synchronization.

Questions to answer in Report

Explain how you solved the issue when the ARRAY_SIZE is not a multiple of the block size. If you implemented timing in the code, vary ARRAY_SIZE from small to large, and explain how the execution time changes between GPU and CPU.

(Just for fun and optional: for those who took the course DD2356, how would you parallelize this with OpenMP? Does it bear any similarity to the GPU approach?)


Exercise 3 - CUDA simulation and GPU Profiling

The implementation of SAXPY has allowed you to get a deeper understanding of how to index a thread inside a CUDA kernel, as well as how to manage the memory between the CPU and the GPU. For this exercise, we create a mock-up simulation that uses CUDA to update the position of thousands of particles and evaluate its performance.

Programming Exercise

Write a program according to the following:

  1. Create an exercise_3.cu file that contains the declaration of a Particle data structure. We will use an Array of Structures (AoS) for the rest of the exercise.
    • Each particle instance contains a position in (X, Y, Z), as well as velocity on each axis.
    • You can define your own vector data structure to represent the position and velocity, or directly use CUDA vector types (e.g., float3).
  2. Define a CUDA kernel inside exercise_3.cu that performs one timestep.
    • The kernel receives an array of particles as input, with base type Particle.
    • The updates on each particle are to be stored on the same input array (i.e. in-place update).
  3. Implement particle velocity update and particle position update in the kernel
    • The velocity update can be random, or based on some input parameter (e.g., reference time).
    • After updating the velocity, update the particle position with the following update rule: p.x = p.x + v.x · dt, where p.x is the position of a particle and v.x is the updated velocity of a particle.
    • dt represents the time derivative factor. For simplicity, use dt = 1.
  4. Inside “exercise_3.cu”, create an equivalent CPU implementation to compare the result with the GPU version.
    • If you are using random values, make sure a seed is established to obtain the same output on both versions.
    • Hint: The random value implementation might differ between the CPU and GPU. You can provide the values as input to the kernel or avoid the issue with an alternative (e.g., use of a time parameter).
  5. Define the main() function inside “exercise_3.cu” to conduct our simulation on the CPU and the GPU.
    • Set a NUM_PARTICLES constant to define the number of particles. Once again, use reasonable values (e.g., 10000).
    • Set a NUM_ITERATIONS constant to define the number of iterations in your simulation. Use something reasonable to get a feeling of the performance.
    • Set the kernel to run with a variable thread BLOCK_SIZE. Define the grid size based on this value and the number of particles.
    • Hint: The number of particles, iterations and other settings can be easily set as input on the command line! This will allow you to get the results quickly with an external script.
  6. Compare the output of each implementation inside the main() function, to guarantee that both the CPU and GPU implementations are equivalent.
  7. Implement timing for each implementation. You can use gettimeofday() and nvprof to measure the execution time. When using gettimeofday(), pay attention to issues related to asynchronous kernel launches.
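A sketch of the data structure and the timestep kernel could look as follows. The velocity-update rule below is an arbitrary deterministic placeholder based on the iteration number, chosen to sidestep the CPU/GPU random-number mismatch mentioned in the hint:

```cuda
// Array-of-Structures particle layout using CUDA vector types.
struct Particle {
    float3 position;
    float3 velocity;
};

// One timestep: update the velocity, then the position, in place.
__global__ void timestep(Particle *p, int n, int iter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const float dt = 1.0f;
        // Placeholder velocity update; any deterministic rule that can
        // be reproduced exactly on the CPU works here.
        p[i].velocity.x += 0.001f * iter;
        p[i].position.x += p[i].velocity.x * dt;
        p[i].position.y += p[i].velocity.y * dt;
        p[i].position.z += p[i].velocity.z * dt;
    }
}
```

An equivalent CPU loop over the same Particle array, with the same update rule, then provides the reference result for the comparison in step 6.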

Make sure that everything is correct and working as you would expect!

Questions to answer in report

  1. Measure the execution time of the CPU version, varying the number of particles.
  2. Measure the execution time of the GPU version, varying the number of particles, like in 1).
    1. Include the time of the data copies to and from the GPU in the measurement.
    2. For each GPU particle configuration, vary the block size in the GPU version from 16, 32, …, up to 256 threads per block.
  3. Generate one or more performance figures based on your measurements in 1 and 2. Include them in the report with a description of the experimental setup (e.g., GPU used) and the observations obtained from the results. Explain how the execution time of the two versions changes when the number of particles increases. Which block size configuration is optimal?
  4. Currently, the particle mover is completely offloaded to the GPU, with data transfers only at the beginning and end of the simulation. If the simulation involved CPU-dependent functions (i.e., particles need to be copied back and forth every timestep), would your observations in 3) still hold? How would you expect the performance to change in terms of GPU execution? Make an educated guess using the GPU activities timing provided by nvprof.

Important note: If the execution times between the CPU and the GPU are very similar, try to increase the number of particles and the number of iterations per simulation even further.
