Assignment II: Programming with OpenMP
- Due Apr 24, 2021 by 11:59pm
- Points 4
- Submitting a file upload
- File Types pdf
- Available until Jun 30, 2021 at 11:59pm
For each exercise:
- Compile a short report according to the instructions, and answer the questions marked Questions.
- Submit your code to a GitHub repo (have the link either in the report or as a comment on the submission page).
- Presentation booking will open around the deadline of the assignment. More info here.
- All the code used in this assignment can be found here: https://github.com/KTH-HPC/DD2356/tree/master/Assignment-II
- The assignments should preferably be done on Beskow, but your laptop is also acceptable.
Exercise 1 - OpenMP Hello World, get familiar with OpenMP Environment
Here we’re going to implement the first OpenMP program. Expected knowledge includes a basic understanding of the OpenMP environment, how to compile an OpenMP program, how to set the number of OpenMP threads, and retrieve the thread ID number at runtime.
Instructions. Write a C code to make each OpenMP thread print Hello World from Thread X! with X = thread ID. Learn how to launch OpenMP code on Beskow or your laptop. Your code using 4 threads should behave similarly to:
Hello World from Thread 3!
Hello World from Thread 0!
Hello World from Thread 2!
Hello World from Thread 1!
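A minimal sketch of such a program (assuming a compiler with OpenMP support; this is only one possible way to write it) is:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Every thread in the parallel region prints its own thread ID. */
    #pragma omp parallel
    {
        printf("Hello World from Thread %d!\n", omp_get_thread_num());
    }
    return 0;
}

Note that the order of the printed lines will vary between runs, since the threads execute independently.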
Questions:
- Write an OpenMP C code with each thread printing Hello World from Thread X! where X is the thread ID.
- How do you compile the code in question 1? Which compiler and flags did you use?
- How do you run the OpenMP code on Beskow, and what flags did you set?
- How many different ways are there to change the number of threads in OpenMP? What are they?
Exercise 2 - STREAM benchmark and the importance of threads
Here we’re going to measure the sustained memory bandwidth on a node of the Beskow supercomputer and study the impact of using different numbers of threads and schedules. Expected knowledge includes an understanding of parallel for constructs and different OpenMP schedules.
Instructions. Download the STREAM benchmark, measure the bandwidth while varying the number of threads, and vary the schedules for the four STREAM kernels. The program should give an output similar to the following:
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
…
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 11839.5 0.013682 0.013514 0.013894
…
Questions:
- Run the STREAM benchmark five times and record the average values of bandwidth and its standard deviation for the copy kernel. Prepare a plot (with error bars) comparing the bandwidth using 1, 2, 4, 8, 12, 16, 20, 24, 28, and 32 threads.
- How does the measured bandwidth with the copy kernel depend on the number of threads?
- Prepare another plot comparing the bandwidth measured with the copy kernel using static, dynamic, and guided schedules with 32 threads.
- How do you set the schedule in the STREAM code? What is the fastest schedule and why do you think it is so?
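As a hint for the schedule questions, an OpenMP worksharing loop with an explicit schedule clause looks roughly like the sketch below; the function and array names are illustrative and not the actual STREAM code:

#include <stddef.h>

/* Copy-style kernel; replacing schedule(static) with schedule(dynamic)
   or schedule(guided) changes how iterations are assigned to threads. */
void copy_kernel(double *a, const double *b, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i];
    }
}

Alternatively, schedule(runtime) lets you select the schedule at run time through the OMP_SCHEDULE environment variable.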
Exercise 3 - Parallel Sum
Here, you are going to implement and parallelize a simple C code to find the sum of an array with at least 10^7 elements. Expected knowledge includes an understanding of parallel for constructs, race conditions, and false sharing. For this exercise, it is useful to check Lecture: OpenMP and maxloc.
Instructions. Implement a serial version and a number of OpenMP parallel versions of serial_sum. You can use the function omp_get_wtime() for time measurement (a minimal timing sketch is given after the code below).
The simple skeleton version of the code is the following:
double serial_sum(double *x, size_t size)
{
    double sum_val = 0.0;
    for (size_t i = 0; i < size; i++) {
        sum_val += x[i];
    }
    return sum_val;
}
We can initialize the input array as follows:
#include <stdlib.h>
void generate_random(double *input, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        input[i] = rand() / (double)(RAND_MAX);
    }
}
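A minimal timing sketch using omp_get_wtime(), assuming serial_sum and generate_random above are defined earlier in the same file (the array size here is only an example), could be:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t size = 10000000;                 /* at least 10^7 elements */
    double *x = malloc(size * sizeof(double));
    generate_random(x, size);               /* fill with values in [0, 1] */

    double start = omp_get_wtime();         /* wall-clock time in seconds */
    double sum = serial_sum(x, size);
    double elapsed = omp_get_wtime() - start;

    printf("sum = %f, time = %f s\n", sum, elapsed);
    free(x);
    return 0;
}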
Questions:
- Measure the performance of the serial code (average + standard deviation).
- Implement a parallel version of serial_sum called omp_sum and use the omp parallel for construct to parallelize the program. Run the code with 32 threads and measure the execution time (average + standard deviation). Is the code working correctly, and should it be? If not, why not?
- Implement a new version called omp_critical_sum and use omp critical to protect the code region that might be updated by multiple threads concurrently. Measure the execution time for the codes in questions 2 and 3, varying the number of threads: 1, 2, 4, 8, 16, 20, 24, 28, and 32. How does the performance compare to the programs in questions 1 and 2? What is the reason for the performance gain/loss?
- Try to avoid the use of a critical section. Implement a new version called omp_local_sum: let each thread find the local sum of its own data and then combine the local results to get the final result. For instance, you can use a temporary array indexed by thread number to hold the value found by each thread, e.g. double local_sum[MAX_THREADS];
- Measure the performance of the new implementation, varying the number of threads: 1, 2, 4, 8, 16, 20, 24, 28, and 32. Does the performance increase as expected? If not, why not?
- Write a new version of the code in question 4 called opt_local_sum, using padding to remove false sharing. Measure the performance of the code, varying the number of threads: 1, 2, 4, 8, 16, 20, 24, 28, and 32. (One possible padding layout is sketched after this list.)
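As a hint for the last question, one common way to remove false sharing is to pad each per-thread slot to a full cache line. The sketch below assumes a 64-byte cache line (typical for x86 processors) and an illustrative MAX_THREADS constant:

#define MAX_THREADS 32
#define CACHE_LINE_SIZE 64                  /* assumed cache-line size in bytes */

/* Each thread accumulates into its own padded slot, so no two threads
   write to the same cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE_SIZE - sizeof(double)];
} padded_double;

padded_double local_sum[MAX_THREADS];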
Exercise 4 - DFTW, The Fastest DFT in the West
The FFTW is the most famous library for computing a Fourier Transform (FT) using the Fast Fourier Transform algorithm. FFTW takes its name from the fact that it is the fastest FFT in the West, if not in the world. The goal of this exercise is to develop the fastest DFT (Discrete Fourier Transform, a different algorithm to calculate the FT) in the West by parallelizing the serial C code that is available here.
Your starting point is the serial code DFTW_1.c we have developed to compute a DFT. The code calculates direct and inverse DFTs, timing the total time taken for the computation of the two DFTs. The computational core of the program is the DFT subroutine, which takes the argument idft as 1 for the direct DFT and -1 for the inverse DFT.
The code also prints the value of the first element of the DFT computation. This is equal to the input size (N). You can use this number to quickly check the correctness of your implementation.
You will focus on parallelizing the DFT function:
int DFT(int idft, double* xr, double* xi, double* Xr_o, double* Xi_o, int N) {
    for (int k = 0; k < N; k++) {
        for (int n = 0; n < N; n++) {
            // Real part of X[k]
            Xr_o[k] += xr[n] * cos(n * k * PI2 / N) + idft * xi[n] * sin(n * k * PI2 / N);
            // Imaginary part of X[k]
            Xi_o[k] += -idft * xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
        }
    }
…
Example output:
DFTW calculation with N = 10000
DFTW computation in 2.107117 seconds
Xre[0] = 8000.000000
Instructions and Questions: Here are the steps you need to follow for the exercise submission:
- Parallelize the DFTW code with OpenMP. You can develop different strategies; the important point is that the code produces correct results and is fast!
- Measure the performance on a Beskow node with 32 cores, reporting the average value and standard deviation for DFTW using an input size of 10000 (N=10000). (A sketch of a helper for computing the mean and standard deviation is given after this list.)
- Prepare a speed-up plot varying the number of threads: 1, 2, 4, 8, 12, 16, 20, 24, 28, and 32.
- Which performance optimizations would be suitable for DFT, other than parallelization with OpenMP? Explain; there is no need to implement the optimizations.
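Several questions in this assignment ask for an average and standard deviation over repeated runs; a small helper along the lines of this sketch (the function name is illustrative) can be reused for that:

#include <math.h>

/* Compute the mean and (population) standard deviation of n timing samples. */
void mean_stddev(const double *t, int n, double *mean, double *stddev)
{
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    *mean = sum / n;
    for (int i = 0; i < n; i++)
        sq += (t[i] - *mean) * (t[i] - *mean);
    *stddev = sqrt(sq / n);
}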
Bonus Exercise - Shallow Water Equation Solver
The aim of this exercise is to give hands-on experience in parallelizing a larger program, measure parallel performance, and gain experience in what to expect from modern multi-core architectures. In the exercise, you should use a dual hexadeca-core shared memory Intel Xeon E5-2698v3 Haswell node (use -C Haswell when allocating a node). Running the program should give you realistic timings and speedup characteristics. Your task is to parallelize a finite-volume solver for the two-dimensional shallow water equations. Measure speedup and, if you have time, tune the code. You don’t need to understand the numerics in order to solve this exercise (a short description of the numerics is given at the end of this section).
Algorithm
For this exercise, we solve the shallow water equations on a square domain using a simple dimensional splitting approach: the volumes Q are updated with numerical fluxes F and G, first in the x direction and then in the y direction, as expressed by the following pseudo-code:
for each time step do
    Apply boundary conditions
    for each Q do
        Calculate fluxes F in the x-direction
        Update volume Q with fluxes F
    end
    for each Q do
        Calculate fluxes G in the y-direction
        Update volumes Q with fluxes G
    end
end
In order to obtain good parallel speedup with OpenMP, each sub-task assigned to a thread needs to be rather large. Since the nested loops contain a lot of numerical calculations, the solver is a perfect candidate for OpenMP parallelization. But as you will see in this exercise, it is fairly difficult to obtain optimal speedup on today’s multi-core computers. However, it should be easy to obtain some speedup without too much effort. The difficult task is to make good use of all the available cores. Choose to work with either the given serial C/Fortran 90 code or, if you think you have time, write your own implementation (but don’t waste time and energy). Compile the code (one needs to use the -lm flag to link to the math library when compiling) and execute the program shwater2d with srun.
1. Parallelize the code.
Question: Start with the file shwater2d.c or shwater2d.f90, add OpenMP statements to make it run in parallel, and make sure the computed solution is correct. Some advice is given below:
- How should the work be distributed among threads?
- Don’t parallelize everything.
- What is the difference between the two variants below? (Note that this example uses OpenMP in Fortran; a rough C equivalent is sketched below, after the hint.)
!$omp parallel do
do i=1,n
...
!$omp end parallel do
!$omp parallel do
do j=1,m
...
!$omp end parallel do
and
!$omp parallel
!$omp do
do i=1,n
...
!$omp end do
!$omp do
do j=1,m
...
!$omp end do
!$omp end parallel
Hint: How are threads created/destroyed by OpenMP? How can it impact performance? Check Best Practices: Optimize OpenMP Codes.
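If you work with the C version of the code, the two Fortran variants above correspond roughly to the following sketch; the function names and loop bodies are only illustrative:

/* Variant 1: two separate parallel regions, one per loop. */
void variant_one(double *a, double *b, int n, int m)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] += 1.0;                        /* illustrative work */
    }

    #pragma omp parallel for
    for (int j = 0; j < m; j++) {
        b[j] += 1.0;
    }
}

/* Variant 2: a single parallel region containing two worksharing loops. */
void variant_two(double *a, double *b, int n, int m)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            a[i] += 1.0;
        }

        #pragma omp for
        for (int j = 0; j < m; j++) {
            b[j] += 1.0;
        }
    }
}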
2. Measure parallel performance.
Question: In this exercise, parallel performance refers to the computational speedup S_n = T_1/T_n, using n threads. Measure the run time T for 1, 2, ..., 16 threads and calculate the speedup. Is it linear? If not, why? Finally, is the obtained speedup acceptable? Also, increase the space discretization (M, N) and see if it affects the speedup. Recall from the OpenMP exercises that the number of threads is determined by the environment variable OMP_NUM_THREADS.
Debugging
For debugging purposes, you might want to visualize the computed solution. Uncomment the line calling save_vtk. The result will be stored in result.vtk, which can be opened in ParaView. Transfer the VTK file to your local machine and open it using the ParaView tool. More information on ParaView can be found here. Beware that the resulting file could be rather large unless the space discretization (M, N) is decreased.
About the Finite-Volume solver
In this exercise, we solve the shallow water equations in two dimensions, given by

\[
\frac{\partial}{\partial t}\begin{pmatrix} h \\ hu \\ hv \end{pmatrix}
+ \frac{\partial}{\partial x}\begin{pmatrix} hu \\ hu^2 + \tfrac{1}{2}gh^2 \\ huv \end{pmatrix}
+ \frac{\partial}{\partial y}\begin{pmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2}gh^2 \end{pmatrix}
= 0, \tag{1}
\]

where h is the depth and (u,v) are the velocity vectors. To solve the equations we use a dimensional splitting approach, i.e. reducing the two-dimensional problem to a sequence of one-dimensional problems

\[
Q_t + f(Q)_x = 0, \qquad Q_t + g(Q)_y = 0.
\]

For this exercise, we use the Lax-Friedrichs scheme, with numerical fluxes F, G defined as

\[
F_{i+1/2} = \tfrac{1}{2}\bigl(f(Q_i) + f(Q_{i+1})\bigr) - \frac{\Delta x}{2\Delta t}\bigl(Q_{i+1} - Q_i\bigr),
\qquad
G_{j+1/2} = \tfrac{1}{2}\bigl(g(Q_j) + g(Q_{j+1})\bigr) - \frac{\Delta y}{2\Delta t}\bigl(Q_{j+1} - Q_j\bigr),
\]

where f and g are the flux functions, derived from (1). For simplicity, we use reflective boundary conditions, thus at the boundary the depth is mirrored and the velocity component normal to the boundary changes sign:

\[
h_{\text{boundary}} = h_{\text{interior}}, \qquad u_{\perp,\text{boundary}} = -u_{\perp,\text{interior}}.
\]
Run script for changing OMP_NUM_THREADS
#!/bin/csh
foreach n (`seq 1 1 16`)
env OMP_NUM_THREADS=$n srun -n 1 ./a.out
end
Rubric
Criteria | Ratings | Pts |
---|---|---|
Exercise I | |
Exercise II | |
Exercise III | |
Exercise IV | |
Bonus Exercise | |