Grading Criteria

To achieve a particular grade, you must meet all the criteria for that grade and for every grade below it. For example, to get a B, all criteria for B, C, D, and E must be met.

Grade	Code	Report
A	Completely offload both the particle mover and interpolation to CUDA, with both steps performed in a single kernel. OR Implement a simple particle mover that uses OpenACC.	The report is clear, readable, and of high quality. Explain how you overcame the interpolation challenges that you discussed for grade C. AND Illustrate the performance implications using nvprof. AND Discuss the overall performance change after you have completed all the grade levels. AND If the performance is not as expected, discuss and explain why, using results from nvprof. OR Compare the performance of the OpenACC and CUDA versions.
B	Use pinned memory on the host. Use CUDA streams and asynchronous copies.	Illustrate the performance when using pinned memory and asynchronous data movement, compared with the grade C version, using nvprof. Explain your strategy for overlapping computation. Illustrate the performance when overlapping computation, using nvprof. Discuss general performance changes from the previous version. If the performance is not as expected, explain why.
C	Implement mini-batching of particles in the particle mover so that the application can process more particles than fit on the GPU.	Explain your mini-batching strategy. Illustrate the performance implications using nvprof. Repeat the experiments you did for grade E. Discuss performance changes from the previous version. Explain the challenges in implementing a completely offloaded interpolation.
D	Port part of the interpolation (interpP2G) to CUDA. All CUDA memory management is handled correctly.	Explain how the interpolation is ported. Illustrate the performance implications using nvprof. Repeat the experiments you did for grade E. Discuss the performance bottleneck in this implementation.
E	Port the particle mover (mover_PC) of the code to the GPU, without further optimization. All CUDA memory management is handled correctly. All the input files used to perform the experiments are provided.	The final report has all five sections and is readable. A design specification is provided. Describe your experiments and the testing environment clearly. Explain how to reproduce your results so we can run the code. Plots are readable, have labels, and are clearly explained.