Assignment II: Programming with OpenMP
- Due Apr 24, 2021 by 11:59pm
- Points 4
- Submitting a file upload
- File Types pdf
- Available until Jun 30, 2021 at 11:59pm
For each exercise:
- Compile a short report according to the instructions, and answer the questions marked Questions.
- Submit your code to a GitHub repo (have the link either in the report or as a comment on the submission page).
- Presentation booking will open around the deadline of the assignment. More info here.
- All the code used in this assignment can be found here: https://github.com/KTH-HPC/DD2356/tree/master/Assignment-II
- The assignments should preferably be done on Beskow, but your laptop is also acceptable.
Exercise 1 - OpenMP Hello World, get familiar with OpenMP Environment
Here we’re going to implement the first OpenMP program. Expected knowledge includes a basic understanding of the OpenMP environment, how to compile an OpenMP program, how to set the number of OpenMP threads, and retrieve the thread ID number at runtime.
Instructions. Write a C code to make each OpenMP thread print Hello World from Thread X! with X = thread ID. Learn how to launch OpenMP code on Beskow or your laptop. Your code using 4 threads should behave similarly to:
Hello World from Thread 3!
Hello World from Thread 0!
Hello World from Thread 2!
Hello World from Thread 1!
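A minimal sketch of such a program (assuming a compiler with OpenMP support; this is only one possible way to write it) is:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Every thread in the parallel region prints its own thread ID. */
    #pragma omp parallel
    {
        printf("Hello World from Thread %d!\n", omp_get_thread_num());
    }
    return 0;
}

Note that the order of the printed lines will vary between runs, since the threads execute independently.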
Questions:
- Write an OpenMP C code with each thread printing Hello World from Thread X! where X is the thread ID.
- How do you compile the code in question 1? Which compiler and flags did you use?
- How do you run the OpenMP code on Beskow, and what flags did you set?
- How many different ways are there to change the number of threads in OpenMP? What are they?
Exercise 2 - STREAM benchmark and the importance of threads
Here we’re going to measure the sustained memory bandwidth on a node of the Beskow supercomputer and study the impact of using different numbers of threads and schedules. Expected knowledge includes an understanding of parallel for constructs and different OpenMP schedules.
Instructions. Download the STREAM benchmark, measure the bandwidth while varying the number of threads, and vary the schedules for the four STREAM kernels. The program should give an output similar to the following:
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
…
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 11839.5 0.013682 0.013514 0.013894
…
Questions:
- Run the STREAM benchmark five times and record the average values of bandwidth and its standard deviation for the copy kernel. Prepare a plot (with error bars) comparing the bandwidth using 1, 2, 4, 8, 12, 16, 20, 24, 28, and 32 threads.
- How does the measured bandwidth with the copy kernel depend on the number of threads?
- Prepare another plot comparing the bandwidth measured with the copy kernel using static, dynamic, and guided schedules with 32 threads.
- How do you set the schedule in the STREAM code? What is the fastest schedule and why do you think it is so?
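As a hint for the schedule questions, an OpenMP worksharing loop with an explicit schedule clause looks roughly like the sketch below; the function and array names are illustrative and not the actual STREAM code:

#include <stddef.h>

/* Copy-style kernel; replacing schedule(static) with schedule(dynamic)
   or schedule(guided) changes how iterations are assigned to threads. */
void copy_kernel(double *a, const double *b, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i];
    }
}

Alternatively, schedule(runtime) lets you select the schedule at run time through the OMP_SCHEDULE environment variable.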
Exercise 3 - Parallel Sum
Here, you are going to implement and parallelize a simple C code to find the sum of an array with at least 10^7 elements. Expected knowledge includes an understanding of parallel for constructs, race conditions, and false sharing. For this exercise, it is useful to check Lecture: OpenMP and maxloc.
Instructions. Implement a serial version and a number of OpenMP parallel versions of serial_sum. You can use the function omp_get_wtime() for time measurement (a minimal timing sketch is given after the code below).
The simple skeleton version of the code is the following:
double serial_sum(double *x, size_t size)
{
    double sum_val = 0.0;
    for (size_t i = 0; i < size; i++) {
        sum_val += x[i];
    }
    return sum_val;
}
We can initialize the input array as follows:
#include <stdlib.h>
void generate_random(double *input, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        input[i] = rand() / (double)(RAND_MAX);
    }
}
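A minimal timing sketch using omp_get_wtime(), assuming serial_sum and generate_random above are defined earlier in the same file (the array size here is only an example), could be:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t size = 10000000;                 /* at least 10^7 elements */
    double *x = malloc(size * sizeof(double));
    generate_random(x, size);               /* fill with values in [0, 1] */

    double start = omp_get_wtime();         /* wall-clock time in seconds */
    double sum = serial_sum(x, size);
    double elapsed = omp_get_wtime() - start;

    printf("sum = %f, time = %f s\n", sum, elapsed);
    free(x);
    return 0;
}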
Questions:
- Measure the performance of the serial code (average + standard deviation).
- Implement a parallel version of serial_sum called omp_sum and use the omp parallel for construct to parallelize the program. Run the code with 32 threads and measure the execution time (average + standard deviation). Is the code working correctly, and should it be? If not, why not?
- Implement a new version called omp_critical_sum and use omp critical to protect the code region that might be updated by multiple threads concurrently. Measure the execution time for the codes in questions 2 and 3, varying the number of threads: 1, 2, 4, 8, 16, 20, 24, 28, and 32. How does the performance compare to the programs in questions 1 and 2? What is the reason for the performance gain/loss?
- Try to avoid the use of a critical section. Implement a new version called omp_local_sum: let each thread find the local sum of its own data and then combine the local results to get the final result. For instance, you can use a temporary array indexed by thread number to hold the value found by each thread, e.g. double local_sum[MAX_THREADS];
- Measure the performance of the new implementation, varying the number of threads: 1, 2, 4, 8, 16, 20, 24, 28, and 32. Does the performance increase as expected? If not, why not?
- Write a new version of the code in question 4 called opt_local_sum, using padding to remove false sharing. Measure the performance of the code, varying the number of threads: 1, 2, 4, 8, 16, 20, 24, 28, and 32. (One possible padding layout is sketched after this list.)
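As a hint for the last question, one common way to remove false sharing is to pad each per-thread slot to a full cache line. The sketch below assumes a 64-byte cache line (typical for x86 processors) and an illustrative MAX_THREADS constant:

#define MAX_THREADS 32
#define CACHE_LINE_SIZE 64                  /* assumed cache-line size in bytes */

/* Each thread accumulates into its own padded slot, so no two threads
   write to the same cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE_SIZE - sizeof(double)];
} padded_double;

padded_double local_sum[MAX_THREADS];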
Exercise 4 - DFTW, The Fastest DFT in the West
The FFTW is the most famous library for computing a Fourier Transform (FT) using the Fast Fourier Transform algorithm. FFTW takes its name from the fact that it is the fastest FFT in the West, if not in the world. The goal of this exercise is to develop the fastest DFT (Discrete Fourier Transform, a different algorithm to calculate the FT) in the West by parallelizing the serial C code that is available here.
Your starting point is the serial code DFTW_1.c we have developed to compute a DFT. The code calculates direct and inverse DFTs, timing the total time taken for the computation of the two DFTs. The computational core of the program is the DFT subroutine, which takes the argument idft as 1 for the direct DFT and -1 for the inverse DFT.
The code also prints the value of the first element of the DFT computation. This is equal to the input size (N). You can use this number to quickly check the correctness of your implementation.
You will focus on parallelizing the DFT function:
int DFT(int idft, double* xr, double* xi, double* Xr_o, double* Xi_o, int N) {
    for (int k = 0; k < N; k++) {
        for (int n = 0; n < N; n++) {
            // Real part of X[k]
            Xr_o[k] += xr[n] * cos(n * k * PI2 / N) + idft * xi[n] * sin(n * k * PI2 / N);
            // Imaginary part of X[k]
            Xi_o[k] += -idft * xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
        }
    }
…
Example output:
DFTW calculation with N = 10000
DFTW computation in 2.107117 seconds
Xre[0] = 8000.000000
Instructions and Questions: Here are the steps you need to follow for the exercise submission:
- Parallelize the DFTW code with OpenMP. You can develop different strategies; the important point is that the code produces correct results and is fast!
- Measure the performance on a Beskow node with 32 cores, reporting the average value and standard deviation for DFTW using an input size of 10000 (N=10000). (A sketch of a helper for computing the mean and standard deviation is given after this list.)
- Prepare a speed-up plot varying the number of threads: 1, 2, 4, 8, 12, 16, 20, 24, 28, and 32.
- Which performance optimizations would be suitable for DFT, other than parallelization with OpenMP? Explain; there is no need to implement the optimizations.
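Several questions in this assignment ask for an average and standard deviation over repeated runs; a small helper along the lines of this sketch (the function name is illustrative) can be reused for that:

#include <math.h>

/* Compute the mean and (population) standard deviation of n timing samples. */
void mean_stddev(const double *t, int n, double *mean, double *stddev)
{
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    *mean = sum / n;
    for (int i = 0; i < n; i++)
        sq += (t[i] - *mean) * (t[i] - *mean);
    *stddev = sqrt(sq / n);
}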
Bonus Exercise - Shallow Water Equation Solver
The aim of this exercise is to give hands-on experience in parallelizing a larger program, measure parallel performance, and gain experience in what to expect from modern multi-core architectures. In the exercise, you should use a dual hexadeca-core shared memory Intel Xeon E5-2698v3 Haswell node (use -C Haswell when allocating a node). Running the program should give you realistic timings and speedup characteristics. Your task is to parallelize a finite-volume solver for the two-dimensional shallow water equations. Measure speedup and, if you have time, tune the code. You don’t need to understand the numerics in order to solve this exercise (a short description of the numerics is given at the end of this section).
Algorithm
For this exercise, we solve the shallow water equations on a square domain using a simple dimensional splitting approach: the volumes Q are updated with numerical fluxes F and G, first in the x direction and then in the y direction, as expressed by the following pseudo-code:
for each time step do
    Apply boundary conditions
    for each Q do
        Calculate fluxes F in the x-direction
        Update volume Q with fluxes F
    end
    for each Q do
        Calculate fluxes G in the y-direction
        Update volumes Q with fluxes G
    end
end
In order to obtain good parallel speedup with OpenMP, each sub-task assigned to a thread needs to be rather large. Since the nested loops contain a lot of numerical calculations, the solver is a perfect candidate for OpenMP parallelization. But as you will see in this exercise, it is fairly difficult to obtain optimal speedup on today’s multi-core computers. However, it should be easy to obtain some speedup without too much effort. The difficult task is to make good use of all the available cores. Choose to work with either the given serial C/Fortran 90 code or, if you think you have time, write your own implementation (but don’t waste time and energy). Compile the code (one needs to use the -lm flag to link to the math library when compiling) and execute the program shwater2d with srun.
1. Parallelize the code.
Question: Start with the file shwater2d.c or shwater2d.f90, add OpenMP statements to make it run in parallel, and make sure the computed solution is correct. Some advice is given below:
- How should the work be distributed among threads?
- Don’t parallelize everything.
- What is the difference between the two variants below? (Note that this example uses OpenMP in Fortran; a rough C equivalent is sketched below, after the hint.)
!$omp parallel do
do i=1,n
...
!$omp end parallel do
!$omp parallel do
do j=1,m
...
!$omp end parallel do
and
!$omp parallel
!$omp do
do i=1,n
...
!$omp end do
!$omp do
do j=1,m
...
!$omp end do
!$omp end parallel
Hint: How are threads created/destroyed by OpenMP? How can it impact performance? Check Best Practices: Optimize OpenMP Codes.
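If you work with the C version of the code, the two Fortran variants above correspond roughly to the following sketch; the function names and loop bodies are only illustrative:

/* Variant 1: two separate parallel regions, one per loop. */
void variant_one(double *a, double *b, int n, int m)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] += 1.0;                        /* illustrative work */
    }

    #pragma omp parallel for
    for (int j = 0; j < m; j++) {
        b[j] += 1.0;
    }
}

/* Variant 2: a single parallel region containing two worksharing loops. */
void variant_two(double *a, double *b, int n, int m)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            a[i] += 1.0;
        }

        #pragma omp for
        for (int j = 0; j < m; j++) {
            b[j] += 1.0;
        }
    }
}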
2. Measure parallel performance.
Question: In this exercise, parallel performance refers to the computational speedup S_n = T_1/T_n, using n threads. Measure the run time T for 1, 2, ..., 16 threads and calculate the speedup. Is it linear? If not, why? Finally, is the obtained speedup acceptable? Also, increase the space discretization (M, N) and see if it affects the speedup. Recall from the OpenMP exercises that the number of threads is determined by the environment variable OMP_NUM_THREADS.
Debugging
For debugging purposes, you might want to visualize the computed solution. Uncomment the line calling save_vtk. The result will be stored in result.vtk, which can be opened in ParaView. Transfer the VTK file to your local machine and open it using the ParaView tool. More information on ParaView can be found here. Beware that the resulting file could be rather large unless the space discretization (M, N) is decreased.
About the Finite-Volume solver
In this exercise, we solve the shallow water equations in two dimensions, given by

\[
\frac{\partial}{\partial t}\begin{pmatrix} h \\ hu \\ hv \end{pmatrix}
+ \frac{\partial}{\partial x}\begin{pmatrix} hu \\ hu^2 + \tfrac{1}{2}gh^2 \\ huv \end{pmatrix}
+ \frac{\partial}{\partial y}\begin{pmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2}gh^2 \end{pmatrix}
= 0, \tag{1}
\]

where h is the depth and (u,v) are the velocity vectors. To solve the equations we use a dimensional splitting approach, i.e. reducing the two-dimensional problem to a sequence of one-dimensional problems

\[
Q_t + f(Q)_x = 0, \qquad Q_t + g(Q)_y = 0.
\]

For this exercise, we use the Lax-Friedrichs scheme, with numerical fluxes F, G defined as

\[
F_{i+1/2} = \tfrac{1}{2}\bigl(f(Q_i) + f(Q_{i+1})\bigr) - \frac{\Delta x}{2\Delta t}\bigl(Q_{i+1} - Q_i\bigr),
\qquad
G_{j+1/2} = \tfrac{1}{2}\bigl(g(Q_j) + g(Q_{j+1})\bigr) - \frac{\Delta y}{2\Delta t}\bigl(Q_{j+1} - Q_j\bigr),
\]

where f and g are the flux functions, derived from (1). For simplicity, we use reflective boundary conditions, thus at the boundary the depth is mirrored and the velocity component normal to the boundary changes sign:

\[
h_{\text{boundary}} = h_{\text{interior}}, \qquad u_{\perp,\text{boundary}} = -u_{\perp,\text{interior}}.
\]
Run script for changing OMP_NUM_THREADS
#!/bin/csh
foreach n (`seq 1 1 16`)
env OMP_NUM_THREADS=$n srun -n 1 ./a.out
end
Rubric
Criteria | Ratings | Pts |
---|---|---|
Exercise I | |
Exercise II | |
Exercise III | |
Exercise IV | |
Bonus Exercise | |