Assignment I: Performance Modeling & Benchmarking
- Due Apr 7, 2021 by 11:59pm
- Points 6
- Submitting a file upload
- File Types pdf
- Available until Jun 8, 2021 at 11:59pm
Go through the exercise steps and answer the questions that are marked "Question:".
- The submission is group-based. Join an empty Assignment Group in the People section even if you are working on your own. You must join a group.
- Submit a PDF. No code submission is required for this assignment, but you must be ready to show and explain your code during the presentation.
- The assignments can be completed either on Beskow or on your local computer. If you decide to complete the exercise on Beskow, read the tutorial on how to compile and run jobs, for example how to allocate and run jobs with `salloc`, or how to specify which kind of computing node to use with `-C Haswell`.
- You must be prepared to answer and justify the questions during the lab presentation to pass the assignment.
- All the code required for this assignment is also available at https://github.com/KTH-HPC/DD2356/tree/master/Assignment-I
Exercise 1. Modeling Sparse Matrix-Vector Multiply
In the lecture, we have seen how to develop a simple performance model for sparse matrix-vector multiplication. The goal of this exercise is to estimate the performance of sparse matrix-vector multiply for different sparse matrix sizes and compare it to actual performance measurements.
The code we use for this exercise, `spmv.c`, is available for download. Compile it using the following:
$ cc -O2 -o spmv.out spmv.c
This code uses a sparse penta-diagonal matrix (five non-zero diagonals; all other entries are zero). A penta-diagonal matrix arises from the 2D finite-difference discretization of the Poisson equation, and it is quite common to use these matrices in scientific computing.
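For instance, discretizing the Poisson equation with the 5-point stencil on an $n \times n$ grid gives a matrix with $\text{nrows} = n^2$ rows whose five non-zero diagonals sit at offsets $0$, $\pm 1$, and $\pm n$ from the main diagonal.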
Remember that the simple performance model of sparse matrix-vector multiply is the following:

$\text{Time} = \text{nnz}\,(2c + 2.5r) + \text{nrows}\,(0.5r + w)$

In this exercise, we will set the nrows value and calculate $\text{nnz} = 5 \cdot \text{nrows}$, since a penta-diagonal matrix has 5 non-zero values per row. We will consider matrices with nrows = $10^2$, $10^4$, $10^6$, and $10^8$.
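As a hedged worked instance of the first hint below (the 2.3 GHz clock and one flop per cycle are assumptions; substitute your processor's actual values): with a 2.3 GHz clock, $c = 1/(2.3 \times 10^9)\,\mathrm{s} \approx 0.43\,\mathrm{ns}$, so for $\text{nrows} = 10^6$ we get $\text{nnz} = 5 \times 10^6$ and, ignoring data movement, $\text{Time} = \text{nnz} \cdot 2c \approx 10^7 \times 0.43\,\mathrm{ns} \approx 4.3\,\mathrm{ms}$.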
Questions:
- What is the performance in total execution time (do not consider data movement) according to your performance model on Beskow or your local computer for the different sparse matrices, nrows = $10^2$, $10^4$, $10^6$, and $10^8$?
  - Hint: Use $\text{Time} = \text{nnz} \cdot 2c$ and calculate $c$ from the given clock speed of the processor in use.
- What is the measured performance in total execution time and floating-point operations per second running `spmv.c` for the different sizes nrows = $10^2$, $10^4$, $10^6$, and $10^8$? Compare the results from the performance model and the experimental results. Discuss the comparison in the report.
  - Note: in `spmv.c`, we set up nrows by setting $n = \sqrt{\text{nrows}}$.
  - Hint: use $\text{nnz} \cdot 2$ and the total execution time to calculate the floating-point operations per second in SpMV.
- What is the main reason for the observed difference between the modeled value and the measured value?
- What are the read bandwidth values you measure running `spmv.c` for the different sizes nrows = $10^2$, $10^4$, $10^6$, and $10^8$?
  - Hint: use $\text{nnz}\,(\text{sizeof(double)} + \text{sizeof(int)}) + \text{nrows}\,(\text{sizeof(double)} + \text{sizeof(int)})$ bytes, from page 11 of the lecture Modeling Sparse Matrix Vector Multiply, and the total execution time to calculate the bandwidth of SpMV.
- What is the bandwidth you obtain by running the STREAM benchmark on your system? How does it compare to the bandwidth you measured in SpMV? Discuss the comparison. Compile the STREAM benchmark with:
$ cc -O2 -o stream.out stream.c
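As a sketch of how the two hints above combine (a minimal example under stated assumptions: the size and the measured time below are placeholders, not measured values):

/* Sketch: turn a measured SpMV time into FLOP/s and read bandwidth,
 * following the hints above. Replace t with your own measurement. */
#include <stdio.h>

int main(void)
{
  long long nrows = 1000000LL;   /* example size: 10^6 rows */
  long long nnz   = 5 * nrows;   /* penta-diagonal: 5 non-zeros per row */
  double t = 4.35e-3;            /* hypothetical measured time in seconds */

  double flops = 2.0 * nnz;      /* one multiply + one add per non-zero */
  double bytes = (double)nnz   * (sizeof(double) + sizeof(int))
               + (double)nrows * (sizeof(double) + sizeof(int));

  printf("FLOP rate:      %.3g GFLOP/s\n", flops / t / 1e9);
  printf("Read bandwidth: %.3g GB/s\n", bytes / t / 1e9);
  return 0;
}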
Exercise 2. The Memory Mountain
This exercise can be done on a Beskow computing node or a local computer.
In this exercise, we measure the memory bandwidth of the various levels of the memory hierarchy, and the impact of the (spatial and temporal) locality of accesses. You can use the provided code to produce the "memory mountain" for a Beskow computing-node processor.
By plotting the memory bandwidth depending on the size and stride, you will create the so-called memory mountain that is on the cover of the book Computer Systems: A Programmer's Perspective by Randal E. Bryant and David R. O'Hallaron.
This program allocates a flat data buffer, walks through it, then computes the actual read bandwidth. The program runs the following loop and measures its execution time.
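The actual loop is in `mountain.c`; the following is only an illustrative sketch of its shape, modeled on the CS:APP memory-mountain code (the names and constants here are assumptions):

#include <stdio.h>

#define MAXELEMS (1 << 24)  /* assumption: tie this to MAXBYTES in mountain.c */
long data[MAXELEMS];        /* the flat buffer being walked */

/* Read `elems` elements with the given stride. Four accumulators expose
   instruction-level parallelism; using the sum keeps the compiler from
   optimizing the reads away. */
long test(int elems, int stride)
{
  long i, sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
  long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
  long limit = elems - sx4;

  for (i = 0; i < limit; i += sx4) {
    acc0 += data[i];
    acc1 += data[i + stride];
    acc2 += data[i + sx2];
    acc3 += data[i + sx3];
  }
  for (; i < elems; i += stride)  /* finish the remainder */
    acc0 += data[i];
  return (acc0 + acc1) + (acc2 + acc3);
}

int main(void)
{
  /* The benchmark times test() over a sweep of sizes and strides. */
  printf("checksum: %ld\n", test(MAXELEMS, 1));
  return 0;
}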
Note: This program can be quite slow and will take 5+ minutes to run on a modern laptop.
To install the code and run it, you will need to:
1. Download the code by cloning the git repository:
$ git clone https://github.com/KTH-HPC/DD2356.git
$ cd DD2356/Assignment-I/memory-mountain-example
Note: If you run this on Beskow, you need to change the value of `#define MAXBYTES` in `mountain.c` to `(1 << 29)` before compiling. This might increase the runtime.
2. Build the code. Ignore the module swap command if you are running on your local computer.
$ module swap PrgEnv-cray PrgEnv-gnu
$ cd memory-mountain-example
$ make
3. You can try to run the benchmark, for example on Beskow:
$ salloc ...
$ srun -n 1 ./mountain.out
4. The program outputs a matrix that represents the bandwidth with respect to stride and data size. Store the result with the following command (ignore salloc and srun if you are running on your laptop):
$ srun -n 1 ./mountain.out > results.txt
5. Transfer the results back to your local computer, if you are running on Beskow, and visualize them using `plot.gplt` with Gnuplot. The script reads `results.txt` and generates a 3D figure; the resulting plot will be in `memory_mountain.png`. You can also run the commands in `plot.gplt` in the interactive mode of Gnuplot so you can rotate the 3D plot interactively. (If the script does not work on your system, the most likely cause is the shell scripting used to process the data file (e.g. `tail`); try to adapt it to make it work.)
$ gnuplot plot.gplt
Questions:
- Report the name of the processor and the sizes of the L1, L2, and L3 caches of the processor you are benchmarking. You can check the specs of your processor online.
- Create the memory mountain following the steps above.
- Save the image of the displayed "memory mountain", and place the resulting image in your pdf file.
- What region (array size, stride) gets the most consistently high performance (ignoring noisy spikes in the graph)? What is the read bandwidth reported?
- What region (array size, stride) gets the most consistently low performance (ignoring noisy spikes in the graph)? What is the read bandwidth reported?
- When you look at the graph for stride=1, you (should) see relatively high performance compared to stride=32. This is true even for large array sizes that are much larger than the L3 cache size. How is this possible, when the array cannot possibly all fit into the cache? Your explanation should include a brief overview of hardware prefetching as it applies to caches.
- What is temporal locality? What is spatial locality?
- Adjusting the total array size impacts temporal locality, why? Will an increased array size increase or decrease temporal locality?
- Adjusting the read stride impacts spatial locality, why? Will an increased read stride increase or decrease spatial locality?
- Repeat the analysis on your own laptop to check the differences. (Optional, if you did your analysis on Beskow)
Exercise 3. Write a Benchmark to Measure Performance
Modify the provided code below and use it for completing the assignment. If you use Beskow, switch to the GNU environment (PrgEnv-gnu) instead of the default one (PrgEnv-cray). If you use your local computer, use the GNU compiler environment.
- Question: How do we switch the compiler environment on Beskow?
Using the GNU compiler environment will allow us to use different optimization flags.
The code is the following:
#define N 5000
#include <stdio.h>
#include <sys/time.h>

double mysecond();

int main(){
  int i, j;
  double t1, t2;           // timers
  double a[N], b[N], c[N]; // arrays

  // init arrays
  for (i = 0; i < N; i++){
    a[i] = 47.0;
    b[i] = 3.1415;
  }

  // measure performance
  t1 = mysecond();
  for (i = 0; i < N; i++)
    c[i] = a[i]*b[i];
  t2 = mysecond();

  printf("Execution time: %11.8f s\n", (t2 - t1));
  return 0;
}

// function with timer
double mysecond(){
  struct timeval tp;
  struct timezone tzp;
  int i;

  i = gettimeofday(&tp, &tzp);
  return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}
- Compile the program in the GNU environment with the optimization flag `-O2` and execute it.
  - Question: What is the average running time?
  - Question: Increase N and recompile the code; what is the average running time now?
- Get the assembly instructions with:
$ cc -S -fverbose-asm -O2 benchmark.c
  More information about getting assembly output from GCC is in this tutorial.
  - Question: Why is the execution time in the previous question the way it is when the `-O2` flag is used? Answer this question using the information you find in the assembly code.
  - Question: What is the average execution time without the `-O2` compilation flag?
- Check the tick (clock granularity) on Beskow or your local computer.
  - Download and compile this code on Beskow or your local computer.
  - Run it from the command line giving it "100" (without quotation marks) as input (the number of times the period between two consecutive time measurements is taken).
  - Question: What is the clock granularity on Beskow or your local computer?
  - Replace the timer of the code using the RDTSC instruction as in this code. Read this tutorial to learn more about the RDTSC instruction. (A sketch of an RDTSC-based timer appears after this list.)
  - Question: What is the clock granularity when using the RDTSC timer?
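The linked code is the reference; as a minimal sketch only, an RDTSC-based cycle counter for x86 with GCC/Clang inline assembly (the function name is illustrative) could look like this:

#include <stdio.h>

/* Read the x86 time-stamp counter. */
static inline unsigned long long rdtsc(void)
{
  unsigned int lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
  /* Two back-to-back readings give a rough lower bound on the
     granularity of this timer, in cycles. */
  unsigned long long t1 = rdtsc();
  unsigned long long t2 = rdtsc();
  printf("back-to-back RDTSC delta: %llu cycles\n", t2 - t1);
  return 0;
}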
4. Modify the program that you used in question 1.1 as follows, such that the code runs properly with `-O2` optimization (a sketch of these fixes appears after the questions below):
  - Avoid cold-start issues by running the loop once before timing.
  - Avoid clock-granularity issues by timing multiple iterations of the same loop, then dividing by the number of outer iterations.
  - Trick a smart compiler by computing something with the result.
- Question (optional): Check if the code is running as expected by checking the assembly code.
- Question: What are the minimum and average run times? Run the tests multiple times to avoid interference.
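A minimal sketch of the three fixes, assuming the mysecond() timer from above (REPS and the overall structure are illustrative; adapt them to your code):

#define N 5000
#define REPS 1000
#include <stdio.h>
#include <sys/time.h>

double mysecond();

int main(){
  static double a[N], b[N], c[N];
  double t1, t2, checksum = 0.0;
  int i, r;

  for (i = 0; i < N; i++){
    a[i] = 47.0;
    b[i] = 3.1415;
  }

  // 1. Avoid cold start: run the loop once before timing.
  for (i = 0; i < N; i++)
    c[i] = a[i]*b[i];

  // 2. Avoid clock granularity: time REPS iterations, divide at the end.
  t1 = mysecond();
  for (r = 0; r < REPS; r++)
    for (i = 0; i < N; i++)
      c[i] = a[i]*b[i];
  t2 = mysecond();

  // 3. Use the result so the compiler cannot remove the timed loop.
  for (i = 0; i < N; i++)
    checksum += c[i];

  printf("Average time: %11.8f s (checksum %f)\n", (t2 - t1)/REPS, checksum);
  return 0;
}

double mysecond(){
  struct timeval tp;
  gettimeofday(&tp, NULL);
  return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

Check the generated assembly (the optional question above) to confirm that the timed loop actually survives `-O2`.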
Exercise 4. Measure the Cache Usage in Matrix-Matrix Multiply
In this exercise, we use the PERF tool to measure the performance of a code for matrix-matrix multiply. You can use Beskow or your local computer for this exercise. In the case of your local computer, PERF must be supported.
A brief overview of PERF is given here.
Before starting the exercise, please take a look at `matrix_multiply.c`. The program uses three matrices A (`matrix_a`), B (`matrix_b`), and C (`matrix_c`). The three matrices are initialized and a warm-up run is executed. During the experiment, the program sets C to zero and calls `multiply_matrices()` to perform matrix multiplication. The test is repeated 10 times. After the tests, a checksum-like value is computed and the average running time per matrix multiplication is reported.
Switch to the GNU environment and compile the program with the following (use the exact environment and flags):
$ cc -g -O2 matrix_multiply.c -o matrix_multiply.out
The main focus of the program is:
void multiply_matrices()
{
int i, j, k ;
// Textbook algorithm
for (i = 0 ; i < MSIZE ; i++) {
for (j = 0 ; j < MSIZE ; j++) {
for (k = 0 ; k < MSIZE ; k++) {
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}
The innermost subscript `k` changes the fastest and the program steps sequentially through the memory elements of `matrix_a`. However, the program steps across the columns of `matrix_b`, resulting in a large physical stride through memory. (Recall that C arrays are arranged in row-major order in primary memory.) Access to `matrix_b` makes inefficient use of the L1 data cache.
The optimized code, using loop-nest interchange, is:
void multiply_matrices()
{
int i, j, k ;
// Loop nest optimized algorithm
for (i = 0 ; i < MSIZE ; i++) {
for (k = 0 ; k < MSIZE ; k++) {
for (j = 0 ; j < MSIZE ; j++) {
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}
The nesting of the innermost loops has been changed. The index variables `j` and `k` change the most frequently, and the access pattern through the two operand matrices is sequential, using a small physical stride (8 bytes). These changes improve access to memory data via the data cache.
Questions:
- Use `perf stat -e ...` to create a table reporting these quantities (a sketch of a possible invocation follows the questions below):
  - Note: On Beskow, where do you put srun in this case, e.g. `srun -n 1 perf ...` or `perf ... srun -n 1 ...`?
| EVENT NAME | MSIZE = 64 Naive | MSIZE = 64 Optimized | MSIZE = 1000 Naive | MSIZE = 1000 Optimized |
|---|---|---|---|---|
| Elapsed time (seconds) | _____________ | _____________ | _____________ | _____________ |
| Instructions per cycle | _____________ | _____________ | _____________ | _____________ |
| L1 cache miss ratio | _____________ | _____________ | _____________ | _____________ |
| L1 cache miss rate PTI | _____________ | _____________ | _____________ | _____________ |
| LLC cache miss ratio | _____________ | _____________ | _____________ | _____________ |
| LLC cache miss rate PTI | _____________ | _____________ | _____________ | _____________ |
- What factors impact the performance of the matrix-multiply operation the most, for the different matrix sizes and implementations (naive vs. optimized)?
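Event names differ between kernels and CPUs; `perf list` shows what your machine supports. As one illustrative possibility on a local Linux machine (here, PTI means misses per thousand instructions, i.e. misses/instructions × 1000), the counters in the table can be derived from:

$ perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./matrix_multiply.out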
Exercise 5. Measure the Cache Usage in Transpose
In this exercise, we use the PERF tool to measure the performance of a code for the matrix transpose operation for different matrix sizes. The code can be compiled with:
$ cc -O2 -o transpose.out transpose.c
Questions:
- Use `perf stat -e ...` to create a table reporting these quantities (the perf invocation sketched in Exercise 4 applies here as well):
| EVENT NAME | N=64 | N=128 | N=2048 | N=2049 |
|---|---|---|---|---|
| Elapsed time (seconds) | _____________ | _____________ | _____________ | _____________ |
| Bandwidth/Rate (from the code, not PERF) | _____________ | _____________ | _____________ | _____________ |
| Instructions per cycle | _____________ | _____________ | _____________ | _____________ |
| L1 cache miss ratio | _____________ | _____________ | _____________ | _____________ |
| L1 cache miss rate PTI | _____________ | _____________ | _____________ | _____________ |
| LLC cache miss ratio | _____________ | _____________ | _____________ | _____________ |
| LLC cache miss rate PTI | _____________ | _____________ | _____________ | _____________ |
- What factors impact the performance of the transpose operation the most, for the different matrix sizes and implementations?
- Which code transformations can be used in the matrix transpose code to improve cache reuse?
Exercise 6. Vectorization
Questions:
- Find out how to ask your compiler (e.g. GCC, the Cray compiler, ...) to apply vectorization by checking online resources. On some systems and compilers, vectorization is enabled by default with certain optimization flags. (An illustrative GCC invocation appears after these questions.)
  - Find out which optimization flag for your compiler includes vectorization.
  - Find out how you can get a report from the compiler about its success at vectorization.
  - In particular, find out which compiler flag enables a vectorization report for your compiler.
  - Read your compiler's documentation to find out what special directives or command-line options can affect vectorization.
- Obtain the vectorization report for the matrix-matrix multiply code with MSIZE = 1000.
- Which lines of the code are not vectorized, if any, and why is the compiler not vectorizing them?
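For example, with a recent GCC (the flags below are GCC-specific assumptions; other compilers and versions use different options, so check your compiler's documentation):

$ cc -O3 -ftree-vectorize -fopt-info-vec-optimized -fopt-info-vec-missed matrix_multiply.c -o matrix_multiply.out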
Bonus exercise: Performance Modeling of N-body Simulation
In an N-body problem, we need to find the positions and velocities of a collection of interacting particles over a period of time. For example, an astrophysicist might want to know the positions and velocities of a collection of stars, while a chemist might want to know the positions and velocities of a collection of molecules or atoms.
An N-body solver is a program that finds the solution to an n-body problem by simulating the behavior of the particles. The input to the problem is the mass, position, and velocity of each particle at the start of the simulation, and the output is the position and velocity of each particle at a sequence of user-specified times.
The N-body solver can be used to solve the three-body problem, as in the book we briefly discussed in the introductory lecture!
Problem
We will write and parallelize a two-dimensional N-body solver that simulates the motions of planets or stars. We'll use Newton's second law of motion and his law of universal gravitation to determine the positions and velocities. Thus, if particle $q$ has position $s_q(t)$ at time $t$, and particle $k$ has position $s_k(t)$, then the force on particle $q$ exerted by particle $k$ is given by

$f_{qk}(t) = -\frac{G m_q m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\bigl(s_q(t) - s_k(t)\bigr)$

Here, $G$ is the gravitational constant ($6.673 \times 10^{-11}\,\mathrm{m^3/(kg \cdot s^2)}$), and $m_q$ and $m_k$ are the masses of particles $q$ and $k$, respectively. Also, $\left|s_q(t) - s_k(t)\right|$ represents the distance from particle $k$ to particle $q$. Note that in general the positions, the velocities, the accelerations, and the forces are vectors.
We can use the equation above to find the total force on any particle by adding the forces due to all the other particles. If our $n$ particles are numbered $0, 1, 2, \ldots, n-1$, then the total force on particle $q$ is given by

$F_q(t) = \sum_{\substack{k=0 \\ k \neq q}}^{n-1} f_{qk}(t)$
The acceleration of an object is given by the second derivative of its position, and Newton's second law of motion states that the force on an object is given by its mass multiplied by its acceleration; so if the acceleration of particle $q$ is $a_q(t)$, then $F_q(t) = m_q a_q(t) = m_q s''_q(t)$, where $s''_q(t)$ is the second derivative of the position $s_q(t)$. Thus, we can use the equation above to find the acceleration of particle $q$:

$a_q(t) = \frac{F_q(t)}{m_q} = -G \sum_{\substack{k=0 \\ k \neq q}}^{n-1} \frac{m_k}{\left|s_q(t) - s_k(t)\right|^3}\,\bigl(s_q(t) - s_k(t)\bigr)$
Therefore, Newton's laws give us a system of differential equations (equations involving derivatives), and our job is to find, at each time $t$ of interest, the position $s_q(t)$ and velocity $v_q(t) = s'_q(t)$. We'll suppose that we either want to find the positions and velocities at the times

$t = 0, \Delta t, 2\Delta t, \ldots, T\Delta t$

or, more often, simply the positions and velocities at the final time $T\Delta t$. Here, $\Delta t$ and $T$ are specified by the user, so the input to the program will be $n$, the number of particles, $\Delta t$, $T$, and, for each particle, its mass, its initial position, and its initial velocity.
Two Serial Programs
In outline, a serial n-body solver can be based on the following pseudocode:
1 Get input data;
2 for each timestep {
3     if (timestep output) Print positions and velocities of particles;
4     for each particle q
5         Compute total force on q;
6     for each particle q
7         Compute position and velocity of q;
8 }
9 Print positions and velocities of particles;
Simple Algorithm
We can use our formula for the total force on a particle to refine our pseudocode for the computation of the forces in Lines 4-5:
for each particle q {
    for each particle k != q {
        x_diff = pos[q][X] - pos[k][X];
        y_diff = pos[q][Y] - pos[k][Y];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
        forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
    }
}
Here, we are assuming that the forces and the positions of the particles are stored as two-dimensional arrays, `forces` and `pos`, respectively. We're also assuming we've defined constants `X = 0` and `Y = 1`. So the x-component of the force on particle q is `forces[q][X]` and the y-component is `forces[q][Y]`. Similarly, the components of the position are `pos[q][X]` and `pos[q][Y]`.
Reduced Algorithm
We can use Newton's third law of motion, that is, for every action there is an equal and opposite reaction, to halve the total number of calculations required for the forces. If the force on particle $q$ due to particle $k$ is $f_{qk}$, then the force on $k$ due to $q$ is $-f_{qk}$. Using this simplification, we can modify our code to compute the forces as follows:
for each particle q
    forces[q] = 0;
for each particle q {
    for each particle k > q {
        x_diff = pos[q][X] - pos[k][X];
        y_diff = pos[q][Y] - pos[k][Y];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        force_qk[X] = G*masses[q]*masses[k]/dist_cubed * x_diff;
        force_qk[Y] = G*masses[q]*masses[k]/dist_cubed * y_diff;
        forces[q][X] += force_qk[X];
        forces[q][Y] += force_qk[Y];
        forces[k][X] -= force_qk[X];
        forces[k][Y] -= force_qk[Y];
    }
}
To better understand this pseudocode, imagine the individual forces as a two-dimensional array in which the entry in row $q$ and column $k$ is $f_{qk}$, the diagonal entries are 0, and the entry below the diagonal in row $k$, column $q$ is $-f_{qk}$. Our original solver simply adds all of the entries in row q to get forces[q]. In our modified solver, when q = 0, the body of the loop "for each particle q" will add the entries in row 0 into forces[0]. It will also add the k-th entry in column 0 into forces[k] for $k = 1, 2, \ldots, n-1$. In general, the q-th iteration will add the entries to the right of the diagonal (that is, to the right of the 0) in row q into forces[q], and the entries below the diagonal in column q will be added into their respective forces, that is, the k-th entry will be added into forces[k]. Note that in using this modified solver, it's necessary to initialize the forces array in a separate loop, since the q-th iteration of the loop that calculates the forces will, in general, add the values it computes into forces[k] for $k = q+1, q+2, \ldots, n-1$, not just forces[q].
Particle Mover
The position and velocity remain to be found. We will use the following pseudocode:

pos[q][X] += delta_t*vel[q][X];
pos[q][Y] += delta_t*vel[q][Y];
vel[q][X] += delta_t/masses[q]*forces[q][X];
vel[q][Y] += delta_t/masses[q]*forces[q][Y];
Here, we’re using pos[q], vel[q], and forces[q] to store the position, the velocity, and the force, respectively, of particle q.
Data structures
We are going to use an array type to store our vectors.
#define DIM 2
typedef double vect_t[DIM];
If the forces are stored in an array using contiguous memory, we can use a fast function such as `memset` to quickly assign zeroes to all of the elements at the beginning of each iteration:
#include <string.h> /* For memset */
...
vect_t *forces = malloc(n*sizeof(vect_t));
...
for (step = 1; step <= n_steps; step++) {
    ...
    /* Assign 0 to each element of the forces array */
    memset(forces, 0, n*sizeof(vect_t));
    for (part = 0; part < n-1; part++)
        compute_force(part, forces, ...);
}
Initialization
We provide a simple example of how to initialize the bodies:
for each particle q {
pos[q][X] = (rand() / (double)(RAND_MAX)) * 2 - 1;
pos[q][Y] = (rand() / (double)(RAND_MAX)) * 2 - 1;
vel[q][X] = (rand() / (double)(RAND_MAX)) * 2 - 1;
vel[q][Y] = (rand() / (double)(RAND_MAX)) * 2 - 1;
mass[q] = fabs((rand() / (double)(RAND_MAX)) * 2 - 1);
}
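Note that `rand()` returns the same sequence on every run unless the generator is seeded; call `srand()` once before this loop if you want varying initial conditions, or a fixed seed for reproducible runs.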
- Question: Implement the simple and reduced algorithms of the N-body simulator and answer the following questions:
  - What is the performance of the serial simple and reduced algorithms in execution time for 500, 1000, and 2000 particles, using 100 cycles and $\Delta t = 0.05$, with no output? Plot the performance results (including error bars) and discuss the results.
  - Which algorithm has better cache reuse, and why is that? Discuss the results. Hint: you can use Perf for monitoring cache counters, or any tool you might find useful.
- Question: Develop a performance model for the simple and reduced N-body codes (a sketch of the form such a model might take follows below):
  - Develop a performance model that takes as input the clock speed of the processor you are using and the number of particles and cycles, and provides the execution time of the code as output, assuming infinitely fast memory (no cost for data movement). Make a plot with the predicted values and compare them with the experimental results. Discuss the results in the report.
  - Develop a performance model that takes as input the clock speed of the processor and the bandwidth of your system, and the number of particles and cycles, and provides the execution time of the code as output, considering the cost of data movement. Make a plot with the predicted values and compare them with the experimental results. Discuss the results in the report.
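As a hedged sketch only (the constants are quantities you must measure or derive yourself, not given values), a zero-data-movement model might take the form

$T_{\text{simple}}(n) \approx S \cdot n(n-1) \cdot f \cdot c, \qquad T_{\text{reduced}}(n) \approx S \cdot \tfrac{n(n-1)}{2} \cdot f \cdot c$

where $S$ is the number of cycles (timesteps), $f$ the number of floating-point operations per pairwise interaction, and $c$ the time per floating-point operation derived from the clock speed.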
Rubric

| Criteria | Pts |
|---|---|
| Exercise I | |
| Exercise II | |
| Exercise III | |
| Exercise IV | |
| Exercise V | |
| Exercise VI | |
| Bonus Exercise | |