DD2358 VT22 (hpc22)
Assignment II: Data Structures for HPC

  • Due 15 Feb 2022 by 17:00
  • Points 7
  • Submitting a file upload
  • File types pdf

For this assignment, you need to submit a report in PDF format and a link to a GitHub/GitLab repository. State the link to the repository in the report or as a comment in Canvas.

Exercise 1 - Lists, Tuples, the array Module, and NumPy

Answer the following questions:

- What is the difference between lists and tuples?

- What are the advantages of using lists vs. tuples?

- What is the advantage of using the array module vs. Python lists?

- What is the main difference between array-module arrays and NumPy arrays?

- What are the memory fragmentation problem and the von Neumann bottleneck? How do they affect the performance of a code? How can we try to address them?

- What is a page fault? What is the difference between a minor and a major page fault?

- What is the impact of a cache miss on performance?

- Which HPC libraries does your NumPy installation use?
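For the last question, NumPy can report which BLAS/LAPACK libraries (e.g. OpenBLAS or Intel MKL) it was built against:

```python
import numpy as np

# Print the build configuration, including the BLAS/LAPACK
# libraries this NumPy installation uses.
np.show_config()
```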


Exercise 2 - STREAM Benchmark in Python to Measure the Memory Bandwidth

The STREAM benchmark is a famous benchmark for measuring the bandwidth of a memory system (we studied this in the first module). It was originally developed by John McCalpin, known as "Dr. Bandwidth". Together with Linpack, it is among the most famous benchmarks in HPC.


The goal of this exercise is to measure the bandwidth of your memory system using three different implementations you will develop: 1) Python lists, 2) arrays, and 3) NumPy.

As you know from the lectures, the memory organization of the different Python array types impacts performance because of memory fragmentation and support for vectorization.

How does the STREAM benchmark work? We have three arrays and we time four different "kernels": copy, scale, sum, and triad.

In our benchmark, we first initialize the arrays (for simplicity, shown here using Python lists):

a = [0.0] * STREAM_ARRAY_SIZE
b = [0.0] * STREAM_ARRAY_SIZE
c = [0.0] * STREAM_ARRAY_SIZE
for j in range(STREAM_ARRAY_SIZE):
    a[j] = 1.0
    b[j] = 2.0
    c[j] = 0.0
scalar = 2.0
We then time the four operations (where timer can be timeit.default_timer):

# copy
times[0] = timer()
for j in range(STREAM_ARRAY_SIZE):
    c[j] = a[j]
times[0] = timer() - times[0]

# scale
times[1] = timer()
for j in range(STREAM_ARRAY_SIZE):
    b[j] = scalar * c[j]
times[1] = timer() - times[1]

# sum
times[2] = timer()
for j in range(STREAM_ARRAY_SIZE):
    c[j] = a[j] + b[j]
times[2] = timer() - times[2]

# triad
times[3] = timer()
for j in range(STREAM_ARRAY_SIZE):
    a[j] = b[j] + scalar * c[j]
times[3] = timer() - times[3]

Finally, knowing how much data each kernel moves and the time spent, we calculate the memory bandwidth. The total amount of data moved by each kernel is:

 copy  -> 2 * sizeof(STREAM_ARRAY_TYPE) * STREAM_ARRAY_SIZE
 scale -> 2 * sizeof(STREAM_ARRAY_TYPE) * STREAM_ARRAY_SIZE
 sum   -> 3 * sizeof(STREAM_ARRAY_TYPE) * STREAM_ARRAY_SIZE
 triad -> 3 * sizeof(STREAM_ARRAY_TYPE) * STREAM_ARRAY_SIZE

For your experiments, you can use arrays in either single precision (float) or double precision; this will affect performance when vectorization is used in NumPy.
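Putting the pieces together, a minimal sketch of the triad kernel with NumPy, including the bandwidth calculation; the array size here is an illustrative choice:

```python
import numpy as np
from timeit import default_timer as timer

STREAM_ARRAY_SIZE = 10_000_000  # illustrative; vary it in Task 2.2
a = np.full(STREAM_ARRAY_SIZE, 1.0)
b = np.full(STREAM_ARRAY_SIZE, 2.0)
c = np.zeros(STREAM_ARRAY_SIZE)
scalar = 2.0

t = timer()
a = b + scalar * c          # triad kernel
elapsed = timer() - t

# Triad moves 3 arrays: read b, read c, write a.
bytes_moved = 3 * a.itemsize * STREAM_ARRAY_SIZE
print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.2f} GB/s")
```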

Task 2.1 Implement the STREAM benchmark in Python using Python lists, arrays from the array module, and NumPy arrays.

Task 2.2 Measure the bandwidth for the three implementations, varying STREAM_ARRAY_SIZE, and plot the results. Answer the questions: how does the bandwidth vary as STREAM_ARRAY_SIZE increases, and why? How do the bandwidths of the different implementations compare to each other?


Exercise 3 - PyTest with the Julia Set Code

As part of this exercise, we ask you to experiment with the pytest framework, used for unit testing, and the Julia set code.

We provide a tutorial of pytest at B.1 Tutorial: Writing Unit Tests with Pytest.

In the Julia Set Code, we have a simple assertion to check the correctness of the code.

# This sum is expected for a 1000^2 grid with 300 iterations
# It ensures that our code evolves exactly as we'd intended
assert sum(output) == 33219980

In this exercise, we ask you to develop a unit test to check this assertion using the pytest framework.

Task 3.1 Implement a separate code to test the assertion above using the pytest framework.

Task 3.2 How would you implement the unit test with the possibility of having a different number of iterations and grid points? Implementation is optional.
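As a sketch of how such a test could look with pytest's parametrization — the function name calc_pure_python and the stub below are placeholders standing in for your actual Julia set code, and only the (1000, 300) case has a known expected sum from the assertion above:

```python
# test_julia.py -- pytest sketch; replace the stub with an import of
# your Julia set function (calc_pure_python is an assumed name).
import pytest

def calc_pure_python(desired_width, max_iterations):
    # Placeholder stub so the sketch runs; your real function
    # returns the list of iteration counts for the grid.
    return [33219980] if (desired_width, max_iterations) == (1000, 300) else [0]

@pytest.mark.parametrize("width, iterations, expected", [
    (1000, 300, 33219980),
    # add further (width, iterations, expected_sum) rows here
])
def test_julia_sum(width, iterations, expected):
    output = calc_pure_python(desired_width=width, max_iterations=iterations)
    assert sum(output) == expected
```

Run it with `pytest test_julia.py`; each parameter row becomes a separate test case.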


Exercise 4 - Python DGEMM Benchmark Operation 

DGEMM is an important computational kernel, part of the BLAS library (see the lectures on HPC libraries). It solves the problem C = C + A*B, where A, B, and C are matrices of size NxN. In this exercise, we will use matrices with values in double precision (that is why DGEMM has a D). The three matrices can be initialized however you find convenient.

The pseudo-code in C style (not Python) for DGEMM is the following:

// Multiplying first and second matrices and storing it in result
   for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
         for (int k = 0; k < N; ++k) {
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
         }
      }
   }
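For comparison, the triple loop above collapses to a single expression in NumPy, which dispatches to the underlying BLAS dgemm; a minimal sketch with illustrative sizes:

```python
import numpy as np

N = 64
rng = np.random.default_rng(42)
A = rng.random((N, N))
B = rng.random((N, N))
C = np.zeros((N, N))

C += A @ B  # DGEMM: C = C + A*B, executed by the BLAS behind NumPy
```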

You will need to implement this operation in Python using different approaches as we did for Exercise 2. We will use this DGEMM benchmark to calculate the computational performance of the system you are using.

The goal of this exercise is to implement, profile, analyze, and compare the performance of different Python DGEMM implementations with lists, arrays, and NumPy.

Task 4.1 Implement the DGEMM with:

1. DGEMM with matrices as lists

2. DGEMM with matrices as arrays

3. DGEMM with matrices as NumPy arrays

Task 4.2 Using pytest, develop a unit test to check the correctness of your implementations (see the pytest tutorial at B.1).

Task 4.3 Measure the execution time for each approach, varying the size of the matrix. Report the average and error (std. deviation, min/max, or confidence interval). Answer the question: how does the computational performance vary as the size of the matrices increases, and why?

Task 4.4 What is the size of the L1 cache? Can you observe the effect of caches in the execution times? For example, when the matrices no longer fit into the cache, you should observe a performance loss.

Task 4.5 Using the timing information and the number of operations for the DGEMM, calculate the FLOPS/s. How many operations are carried out in DGEMM with N as the matrix dimension? Hint: think about the number of iterations completed in the loops and the number of flops per iteration. How do the FLOPS/s you measured compare to the theoretical peak of your processor (if we assume one operation per cycle, the peak is the clock frequency)?

Task 4.6 For this task, you will need a Linux machine. This is compulsory only if your group has its own Linux system available. Perform profiling with perf and make a table reporting the different performance counters. Highlight which approach is best for each performance counter.

Task 4.7 For this task, you will need a Linux machine. This is compulsory only if your group has its own Linux system available. Compare and discuss the differences in the profiling information and how they impact the performance.

Task 4.8 To further optimize the NumPy version, use numexpr. Measure and plot the execution times, varying the size of the matrix. Comment on the performance.
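A sketch of converting a measured DGEMM time into FLOPS/s, using the standard 2*N**3 operation count (the triple loop runs N**3 times, with one multiply and one add per iteration); the NumPy version stands in for any of your implementations, and the size is illustrative:

```python
import numpy as np
from timeit import default_timer as timer

N = 256  # illustrative size
rng = np.random.default_rng(0)
A = rng.random((N, N))
B = rng.random((N, N))
C = np.zeros((N, N))

t = timer()
C += A @ B
elapsed = timer() - t

# 2*N**3 flops: N**3 innermost iterations, each doing one
# multiplication and one addition.
flops_per_s = 2 * N**3 / elapsed
print(f"{flops_per_s / 1e9:.2f} GFLOP/s")
```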


Exercise 5 - A Python Discrete Fourier Transform

In signal processing, the discrete Fourier transform (DFT) converts a finite sequence of function samples (typically real numbers) into samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency.

An N-point DFT is simply expressed as the matrix-vector multiplication X = Wx, where x is the original input signal, W is the N-by-N square DFT matrix, and X is the DFT of the signal.

The transformation matrix W can be defined as

W = \left( \frac{\omega^{jk}}{\sqrt{N}} \right)_{j,k=0,\ldots,N-1},

or equivalently:

W = \frac{1}{\sqrt{N}}
\begin{bmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^{2} & \omega^{3} & \cdots & \omega^{N-1} \\
1 & \omega^{2} & \omega^{4} & \omega^{6} & \cdots & \omega^{2(N-1)} \\
1 & \omega^{3} & \omega^{6} & \omega^{9} & \cdots & \omega^{3(N-1)} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega^{N-1} & \omega^{2(N-1)} & \omega^{3(N-1)} & \cdots & \omega^{(N-1)(N-1)}
\end{bmatrix},

where \omega = e^{-2\pi i/N}. Note that the normalization factor 1/\sqrt{N} and the sign of the exponent in \omega are merely conventions and differ in some treatments; in this exercise, we keep the convention above.

The DFT then performs the following calculation:

X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-\frac{i 2\pi}{N} k n}
    = \sum_{n=0}^{N-1} x_n \cdot \left[ \cos\left(\frac{2\pi}{N} k n\right) - i \cdot \sin\left(\frac{2\pi}{N} k n\right) \right],

where x is the real input vector, X is the (complex) output vector, and the matrix W has complex values.

As part of your implementation, you can create the W matrix first, and then calculate X = W x as matrix-vector multiplication.
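A minimal sketch of this matrix-first approach in NumPy, keeping the 1/sqrt(N) normalization from the definition above:

```python
import numpy as np

def dft_matrix(N):
    # W[j, k] = omega**(j*k) / sqrt(N), with omega = exp(-2*pi*i/N)
    j, k = np.meshgrid(np.arange(N), np.arange(N))
    return np.exp(-2j * np.pi / N) ** (j * k) / np.sqrt(N)

x = np.random.rand(8)      # real input signal
X = dft_matrix(8) @ x      # X = W x, a complex vector
```

With this convention, the result equals np.fft.fft(x) / np.sqrt(N), which is a convenient reference for a unit test.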

You can find a detailed explanation of the DFT on the Wikipedia page: https://en.wikipedia.org/wiki/Discrete_Fourier_transform

Note that the fast Fourier transform (FFT) is a fast way to calculate the DFT (without resorting to the matrix-vector multiplication).

In C pseudo-code, the DFT is the following:

void DFT(double* xr, double* xi, double* Xr_o, double* Xi_o, int N) {
    for (int k = 0; k < N; k++) {
        for (int n = 0; n < N; n++) {
            // Real part of X[k]
            Xr_o[k] += xr[n] * cos(n * k * PI2 / N) + xi[n] * sin(n * k * PI2 / N);
            // Imaginary part of X[k]
            Xi_o[k] += -xr[n] * sin(n * k * PI2 / N) + xi[n] * cos(n * k * PI2 / N);
        }
    }
}

Tasks to Complete

Task 5.1 Develop a DFT in Python and a unit test with pytest to check the correctness of the calculation. The data structures (lists, arrays, or NumPy) are of your choice.

Task 5.2 Measure the execution time, varying the input size from 8 to 1024 elements, and plot it.

Task 5.3 Profile the code with all the profiling tools that can be useful for performance analysis (from coarse-grained to fine-grained), fixing the input size, e.g. 1024. Motivate the choice of profiling tools and report the profiling results.

Task 5.4 Discuss the performance and profiler results. Which parts can be improved? Identify 1-2 optimization possibilities and make a hypothesis about the potential improvement.

Task 5.5 Implement the optimization techniques (checking that they don't break the unit test), carry out the profiling, and discuss them.

Task 5.6 Use the DFT implementation from numpy.fft. Make a performance measurement varying the size and compare the results with your optimized version. Why is NumPy faster or slower than your optimized DFT version? Is this what you expected? Motivate your answer by comparing the profiling results.


Exercise 6 - Experiment with the Python Debugger

As part of this exercise, we ask you to complete an online tutorial on the Python pdb debugger. Follow the instructions at https://github.com/spiside/pdb-tutorial

Task 6.1 Reflection: answer the questions: What are the advantages of using a debugger? What are the challenges you found in using the pdb debugger, if any?


Bonus Exercise - Performance Analysis and Optimization of the Game of Life Code

The goal of this exercise is to measure and profile the performance of the Game of Life code and perform at least one optimization (with proven performance improvement).

The Game of Life, also known as Life, is a cellular automaton (CA) invented by John Conway (1937-2020) in 1970. It is a zero-player game: its evolution is determined by its initial state, requiring no further input. One interacts with the Game of Life by creating an initial configuration and observing how it evolves.

The universe of the Game of Life is an infinite, two-dimensional grid of square cells, each of which is in one of two possible states: live or dead (or populated and unpopulated, respectively). Every cell interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent.

In the standard formulation of the game, at each step in time, the following transitions occur:

  1. Any live cell with fewer than two live neighbors dies, as if by underpopulation.
  2. Any live cell with two or three live neighbors lives on to the next generation.
  3. Any live cell with more than three live neighbors dies, as if by overpopulation.
  4. Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.

These rules emphasize the role of both live and dead cells and lead to the traditional implementation of the Game of Life, where the state of the system is represented by a matrix of boolean values in which each (i, j) entry represents a cell's state: true for live, false for dead. You can find a more detailed description at https://www.geeksforgeeks.org/conways-game-life-python-implementation/

Code

You can find the code you are going to analyze and optimize at https://github.com/electronut/pp/blob/master/conway/conway.py

The online code on GitHub will need reformatting; you can use the correctly formatted code at https://www.geeksforgeeks.org/conways-game-life-python-implementation/

For performance measurements and analysis, you will need to turn the visualization and animation off by changing the main function.
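For timing without the GUI, the update rule can also be exercised through a standalone step function; a minimal vectorized sketch (np.roll gives toroidal boundary conditions, as in conway.py; the grid size, density, and iteration count are illustrative):

```python
import numpy as np
from timeit import default_timer as timer

def step(grid):
    # Count the 8 neighbors of every cell (toroidal boundaries).
    neighbors = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # Conway's rules: birth on 3 neighbors, survival on 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)

N = 256
rng = np.random.default_rng(0)
grid = (rng.random((N, N)) < 0.2).astype(np.uint8)

t = timer()
for _ in range(100):
    grid = step(grid)
print(f"100 iterations on a {N}x{N} grid: {timer() - t:.3f} s")
```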

Tasks to Complete

Task B.1 Measure the execution time, varying the size of the grid (with a fixed number of iterations). Make a plot with this information.

Task B.2 Use different profilers (from coarse- to fine-grained profilers) to identify performance bottlenecks and potential performance improvement. Report the results of the profilers. 

Task B.3 Identify 1 or 2 potential optimizations. Motivate your chosen optimization. Hint: https://www.labri.fr/perso/nrougier/from-python-to-numpy/ presents an excellent optimization of the Game of Life using vectorized operations.

Task B.4 Implement the optimization, report the new profiling results and show the performance improvement.

Task B.5 Calculate the sparsity of your grid (dead cells / total number of cells) during different iterations. How would you implement a Game of Life using only information about live cells? What would the advantages be? Describe the data structures and a potential implementation. Implementation is not needed.
