Mar 08


One Week of GPU Hacking as a Mentor

As the 2016 GPU Eurohack in Dresden, Germany was covered extensively on this blog, I’d like to share my personal experiences as a mentor there. I was assigned to the Palm4GPU project together with two other mentors: Alexander Grund (HZDR) and Dave Norton (PGI). Three developers came over from Hannover to advance their existing OpenACC-enabled code base and to see if more performance could be achieved: Helge Knoop, Tobias Gronemeier and Christoph Knigge. Feel free to drop a line in the discussion section below if you have a question or would like to comment on something. But before we continue, here is the executive summary of this post (continue to read if you are interested in the details):

* bring (unit) tests before you port/accelerate your code
* Score-P/Vampir are good for exploring the performance landscape of HPC code
* intrinsic profiling tools can provide quick stats on GPU kernel metrics

Palm itself is a CFD simulation that uses the large eddy approach to solve the Navier-Stokes equations (think large scalar fields with reference to space and velocity and a lot of stencil plus FFT operations). Palm was already running on multiple nodes using an MPI+OpenMP back-end that distributed the data across the cluster and multiple cores to do the heavy lifting.

We started the week by taking Score-P/Vampir profiles and trying to understand the code. The latter was a real challenge, as the code base started out as a one-man show and has since evolved into a team project. The problem was that the architecture of the code did not scale with the growing number of contributors, so unit tests, integration tests and the like were missing entirely. While this is nothing new in academia, it made examining and working with the code very challenging, as the three mentors were of course not literate in large eddy simulations.

On top of that, the OpenACC-enabled version produced different results than the CPU-only version with the same input configuration. This led to quite some bug hunting in the first two days. The three Palm developers adapted as best they could and provided us with bash-based tooling to compare the output of a simulation run with the reference run (simple diffs, basically). But this does not replace the fine-grained capabilities of a solid unit test base when it comes to bug hunting. At some point, the team decided to start over and comment out all prior OpenACC code. One lesson learnt from this is that code that comes with unit tests for its components is easier to port and thus easier to accelerate (think M. Feathers, Working Effectively with Legacy Code)!
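To illustrate what goes one step beyond a plain diff, here is a minimal sketch of a tolerance-based comparison between a reference output and a new run. This is not the Palm team's tooling: the function names, the number format and the default tolerances are assumptions made for this example. The idea is that CPU and GPU results usually differ in the last few digits, so a textual diff flags differences that are numerically harmless.

```python
import math
import re

# Matches plain decimal and exponent notation such as 3, -0.5 or 1.25E-03.
# (Fortran "D" exponents are not handled in this sketch.)
NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def extract_numbers(text):
    """Return all numbers found in the text, in order of appearance."""
    return [float(token) for token in NUMBER.findall(text)]


def compare_outputs(reference, candidate, rel_tol=1e-6, abs_tol=1e-12):
    """Return (index, reference value, candidate value) for every mismatch."""
    ref = extract_numbers(reference)
    cand = extract_numbers(candidate)
    if len(ref) != len(cand):
        raise ValueError(f"{len(ref)} reference values vs {len(cand)} candidate values")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(ref, cand))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
```

An empty list means the two runs agree within the tolerances; anything else points directly at the values that drifted. Suitable tolerances depend on the simulation and have to be chosen by someone who knows the physics.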

On a related note, the OpenACC version was maintained in parallel to the CPU version. Although this might have been a necessity for Palm, it is not needed with pragma-based parallelisation paradigms. As code that works well on a GPU usually does so on a CPU as well, having two separate code bases made hunting bugs even more complicated. So half a lesson learnt from this: if you use OpenACC or similar, try to keep the code base as unified as possible.

After a couple of CPU+MPI runs on the test data, we could identify the top 5 functions that consumed the most runtime during the simulation. Score-P instrumentation and Vampir visualization were instrumental here and helped us pinpoint these hot spots within minutes (see this post on how to profile with Score-P). Going further, we started to throw some `!$acc kernels` sections into the code base and kept profiling to see how the performance of the code evolved. It has to be said, however, that this cycle is quite time-consuming, as some of the traces took 5-10 minutes to load in Vampir. This was mostly because the hot spot subroutines were called at a very high frequency on the nodes, so the traces grew to an unwieldy size (at least for Vampir).

One of the mentors from PGI then pointed out that the PGI compiler translates the OpenACC code to CUDA, which allows us to use the CUDA profiler as well. To do that, you simply compile your code as is for OpenACC, e.g. as

$ pgf90 -fastsse -acc -ta=tesla -Minfo=acc ...

Before you run the application, though, set the following environment variables:

$ export COMPUTE_PROFILE=1 # 1 is on, 0 is off
$ export PGI_ACC_TIME=0 # 1 is on, 0 is off
$ export CUDA_PROFILE_LOG=./cuda_profile_out
$ export CUDA_PROFILE_CONFIG=${HOME}/cuda_prof.config

Once the application runs, it will dump all relevant profiling output into the file cuda_profile_out (the value of CUDA_PROFILE_LOG) inside the current working directory. Last but not least, CUDA_PROFILE_CONFIG points to a configuration file that controls which metrics are included in the profiling output (for more options see the CUDA profiler documentation). In this case, the file looked like:

$ cat ${HOME}/cuda_prof.config

For our medium test case that ran for 60s on one node with one rank and one GPU, this produced a 210 MB ASCII text file. Inside, you can grep for lines like

timestamp=[ 69514056.000 ] method=[ advec_u_ws_acc_2234_gpu ] gputime=[ 1557.472 ] 
  cputime=[ 1586.327 ] gridsize=[ 128, 1 ] threadblocksize=[ 128, 1, 1 ] 
  dynsmemperblock=[ 0 ] stasmemperblock=[ 0 ] regperthread=[ 160 ] occupancy=[ 0.375 ] 
  streamid=[ 1 ]local_load=[ 0 ] local_store=[ 0 ] active_warps=[ 15267214 ] active_cycles=[ 743715 ]
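With a log of that size, grepping by hand only gets you so far. The following is a minimal sketch of how such records could be summed per kernel. It assumes that each record occupies a single line in the log file (the record above is wrapped for display only); the function names are made up for this example.

```python
import re
from collections import defaultdict

# One "key=[ value ]" pair, as in the record shown above.
FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")


def parse_record(line):
    """Return the key=[ value ] fields of one log line as a dict of strings."""
    return dict(FIELD.findall(line))


def gpu_time_per_kernel(lines):
    """Map each method name to (summed gputime, number of records)."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in lines:
        fields = parse_record(line)
        if "method" not in fields or "gputime" not in fields:
            continue  # skip header lines and anything that is not a record
        entry = totals[fields["method"]]
        entry[0] += float(fields["gputime"])
        entry[1] += 1
    return {name: (time, calls) for name, (time, calls) in totals.items()}


def report(path):
    """Print all methods sorted by their summed gputime, largest first."""
    with open(path) as log:  # the file is read line by line, not all at once
        totals = gpu_time_per_kernel(log)
    for name, (time, calls) in sorted(totals.items(), key=lambda kv: -kv[1][0]):
        print(f"{name:40s} calls={calls:8d} gputime={time:14.3f}")
```

Calling `report("./cuda_profile_out")` after a run then lists the most expensive kernels first, which is usually all you need to decide where to look next.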

The most interesting fields in such a record are

method=[ advec_u_ws_acc_2234_gpu ] gputime=[ 1557.472 ]

which report the name of the kernel that was profiled and the time spent on the device during execution. Then,

regperthread=[ 160 ] occupancy=[ 0.375 ]

tell you how well your code was able to exploit parallelism on the device, given the number of registers the kernel required. All of this information gave us a rapid turnaround while doing GPU-specific optimisations. So rather than firing up a full-blown profiler (Score-P/Vampir), you get very quick feedback on the changes you just made to your code. Of course, this is only helpful if you know what you are looking for. If you do not, using NVVP is a must, as it offers quite a few helpful visualizations (e.g. of the occupancy calculation).
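To make the link between register usage and occupancy more tangible, here is a rough sketch of a register-limited occupancy estimate. The hardware limits used as defaults (131072 registers and at most 64 resident warps per multiprocessor, 32 threads per warp) are assumptions picked for illustration; the GPU model is not stated in this post. The sketch also ignores shared memory, register allocation granularity and the per-multiprocessor block limit, so treat it as a back-of-the-envelope check rather than a replacement for the occupancy calculator.

```python
def estimate_occupancy(regs_per_thread, threads_per_block,
                       regs_per_sm=131072, max_warps_per_sm=64, warp_size=32):
    """Resident warps divided by the maximum number of warps per multiprocessor.

    All hardware limits are assumed defaults, see the text above.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    blocks_by_registers = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    resident_blocks = min(blocks_by_registers, blocks_by_warps)
    return resident_blocks * warps_per_block / max_warps_per_sm


# Values taken from the log record: 160 registers per thread, 128 threads per block.
print(estimate_occupancy(160, 128))  # 0.375 with the assumed limits
```

With these assumed limits, only 6 blocks of 4 warps fit into the register file, i.e. 24 of 64 possible warps. Halving the register count per thread would roughly double that number, which is why regperthread is worth watching.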

We finished the week by replacing

!$acc kernels

with proper

!$acc loop

constructs and added more and more OpenACC data regions. It was interesting to see that with every check-in, the app was getting faster. In the end, the code ran and produced correct results. Although we didn’t finish porting the code and all required data transfers, it was nice to see that performance was getting on par with the CPU version (at least with one rank only). But by that time, it was already Friday afternoon and time to leave. Too bad, it was so much fun.

I’d like to thank ORNL, Nvidia and PGI as well as the TU Dresden for supporting this event. I’d also like to stress that I had a great time working with the palm team and with my 2 co-mentors. I learned a lot! Thank you to everyone.

Permanent link to this article: https://gcoe-dresden.de/one-week-of-gpu-hacking-as-a-mentor/