Assignment IV - OpenACC
- Due Dec 2, 2019 by 23:59
- Points 1
- Submitting a file upload
- File types pdf
- Available Nov 18, 2019 at 9:00 until Jan 14, 2020 at 23:59
In this assignment, we are going to accelerate a mini application that resembles a weather simulation application, using OpenACC directives.
To submit your assignment, prepare a small report that answers the questions in the exercises. Submit the report as a PDF with the following filename:
appgpu19_HW4_GroupNumberFromCanvas.pdf
Submit your code in a Git repository, and make it public, so we can access it. Use the following folder structure and include the link in your report:
Assignment_4/ex_ExerciseNumber/your_source_code_files
The assignment is solved and submitted in groups of two, according to the Canvas signup.
Exercise 1 - Accelerating a mini weather application
This assignment uses the PGI compiler, which is only available on Tegner. The exercise focuses on parallelization and on optimizing data movement. No knowledge of meteorology is needed for this exercise.
Credit to Matthew R. Norman from ORNL who developed the code.
Getting and running the code
Login to Tegner and prepare all the modules:
$ module load git gcc/6.2.0 openmpi/3.0-gcc-6.2 cuda/10.0 pgi
Obtain the code and compile it as follows:
$ git clone https://github.com/steven-chien/DD2360-HT19.git
$ cd DD2360-HT19/Assignment_4/ex_1
$ make
If you already have the repository, do a git pull instead of clone.
The code can be executed with:
$ mpirun -n 1 ./bin/miniWeather
The code can be profiled with nvprof using:
$ mpirun -n 1 nvprof ./bin/miniWeather
Do not change the folder structure or rename anything.
The program creates a NetCDF output file called output.nc, which can be visualized with the ncview tool. The program depends on Parallel-NetCDF, which we provide as a static library compiled on Tegner. The other dependency is MPI; linking with the PGI compiler is handled by our Makefile. MPI parallelization in the X direction is already implemented in the code, but it is out of scope for this exercise: we only require that you use one process.
Study the code
Study the application code in miniWeather_mpi_openacc.cpp. The code has several parameters that you can play with:
- nx_glob and nz_glob control the size of the simulation.
- sim_time controls the maximum time steps.
- data_spec_int selects the experiment to perform:
  - DATA_SPEC_INJECTION: a narrow jet of fast and slightly cold wind is injected into a balanced, neutral atmosphere at rest from the left domain near the model top. This has nothing to do with atmospheric flows.
  - DATA_SPEC_DENSITY_CURRENT: creates a neutrally stratified atmosphere with a strong cold bubble in the middle of the domain that crashes into the ground to give the feel of a weather front.
  - DATA_SPEC_MOUNTAIN: passes a horizontal wind over a faked mountain at the model bottom in a stable atmosphere to generate a train of stationary gravity waves across the model domain.
The simulation has a default size of (400, 200) with 1500 steps and simulates DATA_SPEC_INJECTION. Look through the code for loops that are candidates for parallelization and identify the data dependencies that must be handled when parallelizing them.
Visualizing the results
We have prepared the ncview tool for you on Tegner. To view files directly from Tegner, add -Y when you SSH to the login node to enable X11 forwarding, like in the previous assignment. The tool is installed in:
/cfs/klemming/scratch/w/wdchien/DD2360-HT19/bin/ncview
To use the tool, add the folder to your PATH and the library folder to your LD_LIBRARY_PATH:
$ export PATH=$PATH:/cfs/klemming/scratch/w/wdchien/DD2360-HT19/bin
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cfs/klemming/scratch/w/wdchien/DD2360-HT19/lib
$ ncview output.nc
It is possible to compare two NetCDF files using the nccmp utility, which is installed in the same directory as ncview.
$ nccmp --tolerance=1e-8 --data --force --metadata --statistics cpu_output.nc gpu_output.nc
where --tolerance sets the maximum allowed difference between values. Since the program always writes to output.nc, rename the file if you want to preserve it.
Naive offloading
Implement offloading of loops using the loop and collapse directives in semi_discrete_step(), compute_tendencies_x(), compute_tendencies_z(), set_halo_values_x() and set_halo_values_z(). Take care of copying data with the copy directive, and of variables that should be private with the private directive. For example:
#pragma acc parallel loop collapse(...) private(...) copy(...)
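As a minimal, self-contained sketch of this pattern (the array names, sizes, and loop body below are illustrative placeholders, not the actual miniWeather variables), a doubly nested loop could be offloaded like this:

#include <cstdio>

int main() {
  const int NX = 400, NZ = 200;
  static double state[NZ][NX], tend[NZ][NX];  // zero-initialized placeholders
  double tmp;

  // collapse(2) merges both loop levels into one parallel iteration space;
  // private(tmp) gives every iteration its own scratch copy of tmp;
  // copy(...) moves the arrays to the device before the loop and back after it.
  #pragma acc parallel loop collapse(2) private(tmp) copy(state, tend)
  for (int k = 0; k < NZ; k++) {
    for (int i = 0; i < NX; i++) {
      tmp = 0.5 * state[k][i];
      tend[k][i] = tmp;
    }
  }

  printf("tend[0][0] = %f\n", tend[0][0]);
  return 0;
}

With the PGI compiler, a sketch like this can be built with something like pgc++ -acc -Minfo=accel, where -Minfo=accel prints how each loop was parallelized; the actual project is built through the provided Makefile.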
Execute the program and ensure that it still runs correctly (hint: also use cuda-memcheck to verify that the data copies are done correctly). Use nvprof to study the time spent on data movement and computation.
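For instance, assuming the same invocation style as the nvprof command above, a cuda-memcheck run could look like:
$ mpirun -n 1 cuda-memcheck ./bin/miniWeather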
(Optional) Optimized data movement
Reduce unnecessary data transfers by changing some data clauses from copy to copyin or copyout where possible. Check that the program still runs correctly, and use nvprof to study the time spent on data movement.
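A sketch of the idea, using the same illustrative names as above rather than the real miniWeather arrays: if a loop only reads one array and only writes another, the single copy clause can be split so that each array is transferred in one direction only.

// Illustrative sketch only: 'state' is read-only and 'tend' is write-only here,
// so copyin/copyout each avoid one direction of transfer compared to copy.
void update_tendencies(const double *state, double *tend, int nx, int nz) {
  #pragma acc parallel loop collapse(2) copyin(state[0:nx*nz]) copyout(tend[0:nx*nz])
  for (int k = 0; k < nz; k++) {
    for (int i = 0; i < nx; i++) {
      tend[k*nx + i] = 0.5 * state[k*nx + i];
    }
  }
}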
(Optional) Establishing data lifetime
Use the data directive to establish data lifetimes and avoid unnecessary data movement during the run. Take care of updating the data copies on the host and on the device when necessary, such as before the MPI halo exchange. Implement the changes in main() and set_halo_values_x(), using #pragma acc update device(...) and #pragma acc update host(...), depending on the direction of the update. Again, check that the program works correctly and use nvprof to study the time spent on the different kernels and on data movement.
Questions to answer in the report
- Explain what you have offloaded to the GPU.
- The last version of the code should be the most efficient one. Why is that? In certain places, the arrays need to be updated on the host or on the device. Why?
- Elaborate on the nvprof breakdown of the runtime and on how the optimization techniques improve execution. Analyze all three stages (naive offloading, optimized data movement, and established data lifetime) using the profiling results.