Apr 26

Reproducible Performance | GPU boost

Developing highly performant code for GPUs is a task with many pitfalls, and running that code efficiently with a reproducible outcome is another. This part of our blog covers the issues of runtime analysis under the GPU Boost functionality on Kepler GPUs.
Read the rest of this entry »

Permanent link to this article: https://gcoe-dresden.de/reproducible-performance-gpu-boost/

Apr 26

GCOE Dresden at GTC 2017

The GCOE will be present at the GPU Technology Conference (GTC) 2017, May 8-11, in Silicon Valley, organized by Nvidia. There will be more than 500 sessions during the four days, and the GCOE will participate with the following events:
Read the rest of this entry »

Permanent link to this article: https://gcoe-dresden.de/gcoe-dresden-at-gtc-2017/

Apr 25

Deep Learning Workshop in August 2017


The GCOE Dresden is happy to announce our first Deep Learning workshop, organised in collaboration with the Max Planck Institute of Molecular Cell Biology and Genetics, Scionics and the Helmholtz-Zentrum Dresden-Rossendorf.

This hands-on course will take you from 0 to 100 in Deep Learning with Keras (a high-level neural networks API, written in Python and capable of running on top of either TensorFlow or Theano). Our aim is to teach the fundamentals of deep learning with Convolutional Neural Networks (CNN) based on modern techniques using the Keras API and the TensorFlow backend. By the end, participants will know how to build deep learning models, how to train them, what to avoid and what to check during training, and how to perform model inference, especially for image-based problems. We hope participants will then go out and apply these methods to their own problems and use cases.

The core curriculum is planned from Tuesday (August 22) to Thursday night (August 24) and will take place on the MPI CBG campus, Dresden, Germany. There is a tentative plan to have a voluntary session before the workshop (an introduction to Python) and a hack day afterwards (to work on the challenges participants bring along).

The registration is open until May 31, 2017. If you are interested, please consult the workshop page and register.

Permanent link to this article: https://gcoe-dresden.de/deep-learning-workshop-in-august-2017/

Apr 07

Reaching the shore between Scylla and Charybdis, my Eurohack day 4 afternoon session


Odysseus and Scylla by PinkParasol, Copyright 2012-2017

Don't be too shocked by the image above. I chose it as I deem it quite fitting to the situation I encountered at the Eurohack on my fourth day. For those of you who don't quite recall the background to this image from Homer's Odyssey: the sea monster Charybdis was believed to live under a small rock on one side of a narrow channel. Opposite her was Scylla, another sea monster, who lived inside a much larger rock (Odyssey, Book XII). The sides of the strait were within an arrow-shot of each other, and sailors attempting to avoid one of them would come within reach of the other.

So what happened? As discussed in a previous post, I sat down to code a 3x3 stencil operation using thrust. In order to have a fair comparison, I did the same with plain CUDA. So I coded it in a unit test and ran it. As expected, I got the following results:

[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from stencil_2d_fixture
[ RUN      ] stencil_2d_fixture.cuda_if_else_stencil
[       OK ] stencil_2d_fixture.cuda_if_else_stencil (2696 ms)
[----------] 1 test from stencil_2d_fixture (2696 ms total)
[----------] 1 test from stencil_3d_fixture
[ RUN      ] stencil_3d_fixture.cuda_if_else_stencil
[       OK ] stencil_3d_fixture.cuda_if_else_stencil (87 ms)
[----------] 1 test from stencil_3d_fixture (87 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (2785 ms total)
[  PASSED  ] 2 tests.

Life was great at this point. I was happy, so I built the code in release mode and ran the tests again:

[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from stencil_2d_fixture
[ RUN      ] stencil_2d_fixture.cuda_if_else_stencil
/home/steinbac/development/asynchronator/test/SandboxCUDAStencil.cu:134: Failure
Expected: data[i]*2
Which is: 1028
To be equal to: observed[i]
Which is: 2.95916e+09
[  FAILED  ] stencil_2d_fixture.cuda_if_else_stencil (2719 ms)
[----------] 1 test from stencil_2d_fixture (2719 ms total)
[----------] 1 test from stencil_3d_fixture
[ RUN      ] stencil_3d_fixture.cuda_if_else_stencil
/home/steinbac/development/asynchronator/test/SandboxCUDAStencil.cu:219: Failure
Expected: data[i]*3
Which is: 197382
To be equal to: observed[i]
Which is: 1.65302e+12
[  FAILED  ] stencil_3d_fixture.cuda_if_else_stencil (26 ms)
[----------] 1 test from stencil_3d_fixture (27 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (2746 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] stencil_2d_fixture.cuda_if_else_stencil
[  FAILED  ] stencil_3d_fixture.cuda_if_else_stencil

2 FAILED TESTS

Ok, that looks bad. As a matter of fact, I'd consider this one of the nightmare situations for any seasoned C/C++ developer:

  • the debug version of the code passes the tests
  • the release version with optimization turned-on fails the tests

From the looks of it, we can't use the debugger to dig into the code, as debug symbols are obviously absent in release mode. So what else could it be?

I sat there staring at my code and telling my teammates that something was wrong. I recreated the CMake build files multiple times, ran make with VERBOSE=1 to check if the NVCC configuration in CMake was handed to make correctly, and did all kinds of sanity checks that, in retrospect, were probably irrelevant.

Taking a step back, I thought: maybe the kernel uses memory from regions that were not allocated by me, or maybe the memory access pattern, i.e. the index computation, had a bug. So I added -lineinfo to the compiler flags for release mode and built the tests again.

$ cuda-memcheck ./myapp | c++filt
#...
========= CUDA-MEMCHECK
========= Invalid   global   read of size 8
========= at 0x00000458 in /home/steinbac/development/asynchronator/test/SandboxCUDAStencil.cu:72:stencil_2D(double*, double*, double*, int2, int2)
========= by thread (31,3,0) in block (3,2,0)
========= Address 0x1030b600048 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204235]
========= Host Frame:./test/sandbox_cuda_stencil [0x668e1]
========= Host Frame:./test/sandbox_cuda_stencil [0x84483]
========= Host Frame:./test/sandbox_cuda_stencil (stencil_2d_fixture_cuda_if_else_stencil_Test::TestBody() + 0x63a) [0x4fafa]
#...

Right on the spot! Ok, so I am running into an out-of-bounds error. My kernel is accessing memory that does not belong to my application on the device. So I went back to my source code and continued staring at every line. The bulk of the work in the kernel is done inside a double for-loop that performs the stencil computation:

double new_value = 0.;
int stencil_offset = 0;
int y = -_stencil_shape.y/2;

for(;y<=_stencil_shape.y/2;++y){

  int x = -1*_stencil_shape.x/2;
  for(;x<=_stencil_shape.x/2;++x){

    int device_offset = (idx.y + y)*_shape.x + (idx.x + x);
    new_value += _stencil[stencil_offset++]*_old[device_offset];
  }
}

_new[linear_idx] = new_value;

I condensed the code a bit for the sake of clarity. _stencil_shape and _shape are two simple structs that describe the extent of the memory to work on. In this example, their values are _shape = {512,512} and _stencil_shape = {3,3}. idx is the 2D Cartesian coordinate of the current thread obtained from threadIdx, blockIdx and friends.
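
For context, idx and linear_idx are derived from the built-in thread and block indices in the usual way. The following is an illustrative reconstruction of the kernel prologue, not the verbatim code:

__global__ void stencil_2D(double* _new, double* _old, double* _stencil,
                           int2 _shape, int2 _stencil_shape)
{
    // illustrative reconstruction, not the original kernel source
    int2 idx;
    idx.x = blockIdx.x * blockDim.x + threadIdx.x;  // column in the 512x512 grid
    idx.y = blockIdx.y * blockDim.y + threadIdx.y;  // row in the 512x512 grid

    if (idx.x >= _shape.x || idx.y >= _shape.y)
        return;

    int linear_idx = idx.y * _shape.x + idx.x;      // row-major linear offset

    // ... the double for-loop from the snippet above operates on idx,
    //     linear_idx, _shape and _stencil_shape ...
}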

You should imagine me at this point holding my head and sweating like hell. When I observe myself doing this, an alarm bell rings that tells me to get some help. I asked Rene Widera and Jeffrey Kelling (both HZDR Rossendorf) to lend me a hand.

For a first shot, Rene suggested making the nvcc compiler keep the PTX code to have a look at it - that's what we performance addicts typically do ... look at the assembly! You can achieve this by providing --keep --source-in-ptx to nvcc. It will then compile your code and leave all temporary files in the directory where your binaries are built.

In my case with 2 compute capabilities defined, this resulted in 48 files being generated by the compiler:

./SandboxCUDAStencil.compute_60.cpp1.ii
./SandboxCUDAStencil.compute_60.cpp2.i
./SandboxCUDAStencil.compute_60.cpp3.i
./SandboxCUDAStencil.compute_60.cudafe1.c
./SandboxCUDAStencil.compute_60.cudafe1.gpu
./SandboxCUDAStencil.compute_60.cudafe1.stub.c
./SandboxCUDAStencil.compute_60.cudafe2.c
./SandboxCUDAStencil.compute_60.cudafe2.gpu
./SandboxCUDAStencil.compute_60.cudafe2.stub.c
./SandboxCUDAStencil.compute_60.ptx
./SandboxCUDAStencil.compute_60.sm_60.cubin
./SandboxCUDAStencil.compute_61.cpp1.ii
./SandboxCUDAStencil.compute_61.cpp2.i
./SandboxCUDAStencil.compute_61.cpp3.i
./SandboxCUDAStencil.compute_61.cudafe1.c
./SandboxCUDAStencil.compute_61.cudafe1.cpp
./SandboxCUDAStencil.compute_61.cudafe1.gpu
./SandboxCUDAStencil.compute_61.cudafe1.stub.c
./SandboxCUDAStencil.compute_61.cudafe2.c
./SandboxCUDAStencil.compute_61.cudafe2.gpu
./SandboxCUDAStencil.compute_61.cudafe2.stub.c
./SandboxCUDAStencil.compute_61.ptx
./SandboxCUDAStencil.compute_61.sm_61.cubin
./SandboxCUDAStencil.cpp4.ii
./SandboxCUDAStencil.cu.cpp.ii
./SandboxCUDAStencil.fatbin
./SandboxCUDAStencil.fatbin.c
./SandboxCUDAStencil.module_id

What you want to look at in our case is ./SandboxCUDAStencil.compute_60.ptx, as PTX is not yet the machine code that will run on the device, but something very close to it.

One nice property of the generated PTX file is that it is interleaved with the equivalent C/C++ source lines. For example, the stencil computation looks like this:

//SandboxCUDAStencil.cu:80 new_value += _stencil[stencil_offset++]*_old[device_offset];
.loc 1 80 1
add.s32 %r62, %r62, 1;
ld.global.f64 %fd8, [%rd26];
ld.global.f64 %fd9, [%rd25];
fma.rn.f64 %fd10, %fd9, %fd8, %fd10;

I can make out the increment by 1, two 64-bit load instructions and a fused multiply-add to fill new_value. So far, not good, because we still hadn't found the root cause of our memory violation. So both of my Rossendorf colleagues suggested inserting printf statements. As many of you know, the CUDA runtime supports printf statements in device code. That said, the size of the printf output buffer is limited, so we won't get all the printf calls of every kernel instance on the card's SMs, but we'll get some. So I inserted the printf:


int y = -_stencil_shape.y/2;
for(;y<=_stencil_shape.y/2;++y){

  int x = -1*_stencil_shape.x/2;
  for(;x<=_stencil_shape.x/2;++x){

    device_offset = (idx.y + y)*std::size_t(_shape.x) + (idx.x + x);
    if (!(device_offset < len))
      printf("Arrrg out of bounds!\n");
    if (!(stencil_offset < stencil_len))
      printf("Arrrg out of stencil bounds!\n");
    new_value += _stencil[stencil_offset++]*_old[device_offset];
  }

}

And boom, it fired (multiple times, as expected), reporting an out-of-bounds memory access! Now we were getting closer to knowing where. After a sequence of binary-search-style printf insertions, we observed that the initialization of y and x was corrupted.

int y = -_stencil_shape.y/2;

Contrary to expectations, given _stencil_shape.y=3 in our case, y yielded -2! The same goes for x. According to common integer arithmetic, 3/2 should give 1, and that multiplied by -1 gives -1, as you'd expect on the CPU. But with CUDA 8.0.44 (on x86_64) and 8.0.51 (on Power8), the expression evaluated to -2, as if 3/2 gave 2! At this point, we asked the Nvidia dev techs that were still present for help: Kate Clark and Matthias Wagner. After further inspection of the C and PTX code with them, Kate asked us to come up with a tiny reproducer and submit a bug report. While doing so, we were able to show that this bug is present not only on compute capability 6.0 and 6.1 (Pascal generation), but also on compute capability 5.2 (Maxwell generation). What a day! What an odyssey.

Addendum: I submitted the bug report on March 13th, 2017, after the Eurohack. The Nvidia compiler team was able to reproduce the error described here. On March 30th, 2017, Nvidia updated the bug report stating that the bug has been fixed in the just-released CUDA 8.0.61. I uploaded a minimal reproducing example to this post.
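
To give an impression of what such a reproducer boils down to, here is a minimal sketch of the problematic expression evaluated on the device. This is an illustration only, not the code attached to the bug report, and a reduction this small is not guaranteed to still trigger the mis-compilation on the affected CUDA versions:

#include <cstdio>

// compute -shape.y / 2 in device code from a runtime kernel argument,
// mirroring the expression -_stencil_shape.y/2 from the stencil kernel
__global__ void negative_half(int2 shape, int* result)
{
    *result = -shape.y / 2;
}

int main()
{
    int* d_result;
    int  h_result = 0;
    cudaMalloc(&d_result, sizeof(int));

    negative_half<<<1, 1>>>(make_int2(3, 3), d_result);

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("-3/2 on the device = %d (expected -1)\n", h_result);

    cudaFree(d_result);
    return 0;
}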

Permanent link to this article: https://gcoe-dresden.de/reaching-the-shore-between-scylla-and-charybdis-my-eurohack-day-4-afternoon-session/

Apr 04

GCOE Talk on CFD with CPU-GPGPU | Institute of Fluid Dynamics

On April 10th, 3:30 pm, there will be a GCoE Talk by Immo Huismann from the Chair of Fluid Dynamics, TU Dresden.

CG with the agonizing pain – CPU-GPGPU coupling for CFD

This talk investigates static load balancing models for CPU-GPU coupling from a Computational Fluid Dynamics perspective. While able to generate a benefit, traditional load balancing models are found to be too inaccurate to predict the runtime of a preconditioned conjugate gradient solver. Hence, an expanded model is derived that accounts for the multistep nature of the solver. It is able to predict the runtime to a margin of 5 %, rendering CPU-GPGPU coupling predictable and worthwhile. Lastly, a gas-phase chemical reaction problem is computed using the new load-balancing model, which is able to harness all the resources available on the heterogeneous computation nodes.

The talk will take place as usual in TRE-101 VC, Zellescher Weg 16, TU Dresden.

By the way, we are always looking for GPU-related talks, so if you are interested in presenting your current work to GPU developers and getting feedback from others, just get in contact with us: gcoe@mailbox.tu-dresden.de

Permanent link to this article: https://gcoe-dresden.de/gcoe-talk-cfd/

Mar 22

Reaching the shore with a fog warning, my Eurohack day 4 morning session

Winslow Homer, The Fog Warning (1885, oil on canvas, wikicommons)

My team of four bright young PhD students set out to conquer the HPC world with an open-source, MPI-aware, header-only library implementing asynchronous sparse matrix algebraic operations. The library, called asynchronator, is C++11 based and uses C++11 threads in conjunction with MPI calls to dispatch its operations on HPC hardware. Especially the use of C++11 threads gave us some trouble with score-p, as doing so is still not common in HPC, even though the C++11 standard is six years old by now.

The library itself is based on a flat, text-book style class hierarchy where most classes abstract the underlying type (assume double or std::complex<double> for now) away by means of a template parameter. Apart from that, asynchronator uses neither template meta-programming nor any other state-of-the-art C++ techniques like expression templates or CRTP that might give a compiler in the HPC arena trouble (one hopes). The code values simplicity over complexity, which will certainly pay off as long as the library offers only a limited set of operations. Depending on the evolution of feature requirements, this may be revised as usual.

So my team was very intrigued by the features that the thrust library had to offer. First off, thrust by now comes with every CUDA release and thus is an Nvidia-supported framework that you can rely on as a developer. thrust itself is a C++03 template library that provides an STL-like interface for interacting with Nvidia GPUs and multi-core hardware in general. Thrust has a CUDA, an OpenMP and an Intel TBB backend. Take for example a simple vector addition:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>

#include <thrust/functional.h>

#include <iostream>
#include <cassert>

int main(void)
{
    // initialize all ten integers of a device_vector to 1
    thrust::device_vector<int> d_a(10, 1);

    // set the first seven elements of a vector to 9
    thrust::fill(d_a.begin(), d_a.begin() + 7, 9);

    // initialize a host_vector with the first five elements of d_a
    thrust::host_vector<int> H(d_a.begin(), d_a.begin() + 5);

    // set the elements of H to 0, 1, 2, 3, ...
    thrust::sequence(H.begin(), H.end());

    //transfer the contents of H to d_b
    thrust::device_vector<int> d_b = H;

    //perform b = a + b
    thrust::transform(d_a.begin(),d_a.end(),
        d_b.begin(),
        d_b.begin(),
        thrust::plus<int>());

    //get the results from device vector d_b into host vector H
    thrust::copy(d_b.begin(), d_b.end(), H.begin());

    //check that the computation gave a correct result
    assert(H[1] == 10 && "uuups, thrust implementation of vector addition failed\n");
    return 0;
}

So the API of thrust::device_vector and thrust::host_vector bears a strong resemblance to std::vector and thus should make every intermediate to professional C++ developer feel at home immediately. The explicit thrust::copy function interface for host-device communication, as well as the assignment operator overloads of thrust::device_vector and thrust::host_vector that have the same effect, can make the code very readable and easier to maintain. But Thrust has much more to offer. In case you are curious, jump over to the thrust documentation.

But why on earth did I name this blog post in such a weird fashion? And what is the relation to the fog warning? When we started to work with thrust during the Eurohack 2017, we stumbled upon a couple of issues that I'd like to address here. I hope this helps to get them fixed and keeps other developers from running into them as well.

Something that irritated us from the beginning was that the official thrust homepage listed the 1.8.1 release from 2015 as the most recent stable one at the time of writing this blog post. If you check the CUDA 8.0 root directory, you'll find a file at include/thrust/version.h that contains the following source code:

/*! \def THRUST_VERSION
 *  \brief The preprocessor macro \p THRUST_VERSION encodes the version
 *         number of the Thrust library.
 *
 *         <tt>THRUST_VERSION % 100</tt> is the sub-minor version.
 *         <tt>THRUST_VERSION / 100 % 1000</tt> is the minor version.
 *         <tt>THRUST_VERSION / 100000</tt> is the major version.
 */
#define THRUST_VERSION 100803

In other words, CUDA 8.0 ships thrust 1.8.3! The GitHub repo itself contains the 1.8.3 tag as well. So it looks like the thrust documentation has not been updated for a while. That was the first confusion we encountered.

Another intriguing observation was that the pinned memory allocator that one can supply to the thrust::host_vector type definition is located in ./include/thrust/system/cuda/experimental/pinned_allocator.h. As asynchronator is all about asynchronous methods, I hoped to use asynchronous copies based on pinned memory (think cudaMallocHost and cudaMemcpyAsync) and eventually use this allocator for the host-side handling of asynchronator's data. The fact that the allocator's header file is located in a directory called experimental made the warning bells in my head ring very loudly. We brought these issues up with the Nvidia engineers present, and they told us that the development of thrust has moved to Nvidia-internal servers; the GitHub repo and GitHub page simply have not been updated recently.
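
To give an idea of what using that allocator looks like, here is a minimal sketch. The namespace follows the header's location under experimental, so treat it as subject to change:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

int main()
{
    // host vector backed by page-locked (pinned) memory, a prerequisite
    // for overlapping host<->device copies with cudaMemcpyAsync
    typedef thrust::system::cuda::experimental::pinned_allocator<double> pinned_alloc;
    thrust::host_vector<double, pinned_alloc> h(1 << 20, 1.);

    // the assignment itself is still a synchronous copy; asynchronous
    // transfers would require raw pointers, a stream and cudaMemcpyAsync
    thrust::device_vector<double> d = h;

    return 0;
}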

Shoving all of the above and related concerns aside, we continued to implement a 2D stencil operation on a 2D block of data with thrust. We used thrust::for_each (documentation: https://thrust.github.io/doc/group__modifying.html#ga263741e1287daa9edbac8d56c95070ba) and thrust::counting_iterator (documentation: https://thrust.github.io/doc/classthrust_1_1counting__iterator.html) for that. At some point, our unit tests pointed us at a bug that we had introduced in the device-side function that performed the stencil operation. So we switched on device-side debug symbols during compilation (with the -G flag to nvcc). But the compiler throws an error when doing so:

ptxas fatal   : Unresolved extern function 'cudaDeviceSynchronize'

If you want to reproduce this problem, use the samples that are shipped with CUDA 8 and go into the samples/6_Advanced/radixSortThrust sub-folder. Compile the radixSortThrust example with debug symbols (under *nix by calling make dbg=1) and you'll get the same error. I submitted this as a bug to Nvidia and was told that this is a known issue.

By this point, our frustration had stacked up significantly. On top of that, we thought that compiling thrust code with debug symbols was impossible. And that was a show stopper, as one thing is quite certain when developing software: there will be bugs! So our team decided to move away from thrust and adopt plain CUDA. Feel free to call this decision premature and post comments below, but keep in mind that the irritations with thrust (which might be limited to this specific point in time and the related CUDA release) had stacked up significantly by this point, and our first steps with it mostly yielded problems rather than helping to solve our computational ones. Which is unfortunate, as I still think that thrust is a great library, especially when it comes to bringing a parallel, STL-like interface to GPUs. It's even more unfortunate that most of the trouble we had could have been resolved by up-to-date documentation and, e.g., an open bug tracker for CUDA and CUDA library bugs.

Coming back to the title of this blog post and the painting by Winslow Homer: getting to the shore of the hackathon, i.e. having a GPU implementation, is harder than you think. I lost half a day due to the issues reported above. So this post is meant as a fog warning in case you want to board the thrust boat.

PS. Thrust uses dynamic parallelism internally. When using optimization flags with nvcc, the thrust calls to cudaDeviceSynchronize inside device code (which is in itself an instance of dynamic parallelism) apparently were optimized out, I suspect. In debug mode this was not the case, so the compiler fell over because the device-side implementation of cudaDeviceSynchronize was missing. This can be fixed by adding -rdc true to the nvcc command line. But this opened another can of worms with the cmake build system setup that asynchronator was using.

PPS. Digging the stale GitHub repo wiki of thrust about debugging, I found this statement:

As of version 4.2, nvcc does not support device debugging Thrust code. Thrust functions compiled with (e.g., nvcc -G, nvcc --device-debug 0, etc.) will likely crash.

Even though this statement points at CUDA 4.2, given that the above-mentioned CUDA 8 SDK example works when adding -rdc true, I am even more confused.

Permanent link to this article: https://gcoe-dresden.de/reaching-the-shore-with-a-fog-warning-my-eurohack-day-4-morning-session/

Mar 17

Seeing the shore, Eurohack day 4

The 4th day started off with a quick presentation about good practices for Power8 compiler flags. One suggestion for the gcc frontend to xl was to use the following:

gcc -Ofast -flto -mcpu=power8 -mtune=power8 -mveclibabi=mass

Let's quickly go through them: -Ofast uses optimizations that produce the fastest binaries; -flto enables link-time optimisation, where the optimizer looks for optimisation opportunities across all object files at link time; -mcpu=power8 -mtune=power8 force the compiler to apply Power8 platform-specific optimisations and, if needed, use platform-specific instructions; -mveclibabi=mass uses the IBM vector library MASS where possible. Andreas wrote that up in the Juron-related documentation for the hackathon. The Eurohack slack channel was quite active as well.

First off, the question was raised how to produce a profiling dump with nvprof that can be viewed with nvvp. The recipe shared by HZDR's Alexander Matthes was:

#on the remote machine or in the cluster as part of a submit script
$ nvprof -o metrics.nvprof --analysis-metrics ./yourapp

in case you are interested in the metrics analysis; if not, then do

#on the remote machine or in the cluster as part of a submit script
$ nvprof -o timeline.nvprof ./yourapp

By the way, this trick also works with MPI applications, so there is no need to use heavy artillery like allinea, score-p and friends. After this is done, the produced .nvprof file needs to be imported into nvvp, either on your laptop or in an exported X session with nvvp.

In order to write CPU code that respects the available number of cores, someone asked how to set the number of pthreads in a similar manner to OMP_NUM_THREADS. The community replied to simply call pthread_create only n times. Note, however, that the Linux operating system sets a maximum number of threads to spawn per process, as documented in /proc/sys/kernel/threads-max. My laptop apparently is capable of spawning 125811 threads:

$ cat /proc/sys/kernel/threads-max
125811

which is an interesting number for a 2-core mobile Ivy Bridge box.
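
For illustration, a C++11-threads code base could mimic OMP_NUM_THREADS with an environment variable of its own. This is a minimal sketch; the variable name NUM_THREADS is made up:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// pick the number of worker threads from an environment variable,
// falling back to the hardware concurrency if it is not set
static unsigned worker_count()
{
    if (const char* env = std::getenv("NUM_THREADS"))
        return std::max(1, std::stoi(env));
    return std::max(1u, std::thread::hardware_concurrency());
}

int main()
{
    const unsigned n = worker_count();

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([i] { /* per-thread work goes here */ (void)i; });

    for (auto& t : pool)
        t.join();

    std::cout << "ran with " << n << " threads\n";
    return 0;
}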

A more detailed usage example of the -Minfo flag was given to help you understand what the compiler does with OpenACC code, here for a simple saxpy kernel:

void saxpy(int n, float a, float * restrict x, float * restrict y)
{
#pragma acc kernels
    for(int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

Here is the output of the pgi compiler:

$ pgcc -fast -ta=tesla,multicore -Minfo -c saxpy.c
saxpy:
     4, Generating implicit copy(y[:n])
        Generating implicit copyin(x[:n])
     5, Loop is parallelizable
        Accelerator kernel generated
        Generating Tesla code
         5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     5, Generating Multicore code
         5, #pragma acc loop gang
     5, Generated an alternate version of the loop
        Generated vector simd code for the loop
        Generated 2 prefetch instructions for the loop
        Generated vector simd code for the loop
        Generated 2 prefetch instructions for the loop
        FMA (fused multiply-add) instruction(s) generated

Very impressive.

My team and a neighboring one ran into a clash while using cuda-gdb. Apparently, cuda-gdb creates a lock file upon invocation. While that might be fine on a machine with one user and possibly one GPU, the approach falls flat on a shared node with 4 GPUs: two users may want to use the debugger at the same time, but the cuda-gdb lock will deny at least one of them.

Some teams were struggling with uncoalesced memory access, so someone on the chat suggested the trove library by Baidu's Bryan Catanzaro.

Last but not least, a short bash snippet was posted to pin a job close to the GPU on a node. The goal of this trick is to minimize traffic across the CPU socket interconnect before getting your data onto the device.

#this is bash
CPUCORE=2
GPUDEVICE=$(nvidia-smi topo -c $CPUCORE | tail -n +2 | cut -d "," -f 1)
export CUDA_VISIBLE_DEVICES=$GPUDEVICE
taskset -c $CPUCORE $1

Again, if you have any questions or concerns, feel free to use the comment section below. Also don't miss the news coverage of Eurohack.

Permanent link to this article: https://gcoe-dresden.de/seeing-the-shore-eurohack-day-4/

Mar 16

Treading in Water, Eurohack day 3

As the title suggests, this day felt like the middle of no man's land. The teams started to get a grip on handling the GPU or exploiting its hardware. At the same time, teams typically start to get a feeling for how they need to change their project's source code to flexibly accommodate GPU hardware. The daily scrum again contained a lot of interesting discussions.

It started off with a report on the use and misuse of atomicAdd operations on data in GPU global memory. A handful of mentors immediately made suggestions on how to do atomicAdds more efficiently. Here is their bag of tricks:

  • play with block/grid sizes and check the device utilization in the presence of atomicAdd operations (lower block sizes might bring you better performance by higher occupancy)
  • this GTC talk on replacing atomic operations was recommended
  • warp-aggregated atomics may be of help as well
  • documentation on double-precision atomicAdd emulated with atomicCAS (see the sketch below)
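
The atomicCAS-based emulation mentioned in the last item is essentially the one from the CUDA C Programming Guide; it is needed on devices without native double-precision atomicAdd (compute capability below 6.0). A sketch:

// double-precision atomicAdd emulated with atomicCAS, along the lines of
// the CUDA C Programming Guide; needed on devices of compute capability < 6.0
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // retry if another thread changed the value in the meantime
    } while (assumed != old);

    return __longlong_as_double(old);
}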

There was some discussion comparing CPU performance numbers to GPU speeds, especially on the Power8 systems we were using. One of the mentors suggested taking care to pin the application to one socket of the Power8 system if you want maximal performance (using e.g. numactl -m0 -N0 <yourapp> to bind compute threads and memory allocations to socket 0). Using GCC 5.4.0 might also not be a very good choice on Power8, although it is the highest GCC version supported by CUDA 8 to the best of our knowledge. The discussion boiled down to the expectation that the best-performing binaries are only produced by the hardware vendor's compiler infrastructure. But often these compilers lack support for C++11 and beyond, or choke on the slightest sign of C++ templates.

And the last tip of the day: always consider using -Minfo with the PGI compiler to learn more about what it does with your code.

Permanent link to this article: https://gcoe-dresden.de/treading-in-water-eurohack-day-3/

Mar 15

Our feet are wet, Eurohack day 2

Day 2 of the hackathon went by fast and was filled with progress. This blog post tries to summarize the events and findings that were presented in the daily scrum.

One team resolved a thread divergence on the device by reducing this conditional bifurcation:

if((s > edge[0]) ^ (s > edge[2]))
    e = -(a + f - M[s]) * sgn / 2;
else
    e =  (a + f - M[s]) * sgn / 2;

by replacing it with:

e =  (a + f - M[s]) * sgn / 2;
if((s > edge[0]) ^ (s > edge[2]))
    e = -e;

This modification removed a 99% thread divergence reported by nvvp, which resulted in a 12% performance improvement according to the team.

Another team observed a very low occupancy of their GPU device code, as it had to access an array of precomputed random numbers in global memory. They switched to generating the random numbers on the device instead. This increased their occupancy from 14% to 83%. Related to this, there was an interesting discussion on our Eurohack slack channel about good random number generators for GPUs. Here are the libraries that were suggested besides cuRAND:

So people were benchmarking those to find the ideal balance between performance and quality of random numbers, which is project dependent. Feel free to share your experiences with these in the comments section below.
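
For illustration, generating the random numbers on the device with the cuRAND device API looks roughly like the sketch below (kernel and parameter names are made up; for real use, the generator states would typically be initialized once and reused across launches, since curand_init is comparatively expensive):

#include <curand_kernel.h>

// fill out[0..n) with uniform random numbers generated on the device,
// one generator state per thread
__global__ void fill_random(float* out, int n, unsigned long long seed)
{
    const int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n)
        return;

    curandState state;
    curand_init(seed, id, 0, &state);   // seed, subsequence, offset

    out[id] = curand_uniform(&state);   // uniform float in (0, 1]
}

// launch example: fill_random<<<(n + 255) / 256, 256>>>(d_out, n, 42ull);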

Other teams reported that they spent the day improving their data structures to become GPU-ready, think Array-of-Structures versus Structure-of-Arrays.
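
As a generic reminder of what that refactoring means in code (not any team's actual data structures):

// Array-of-Structures: one struct per element; neighbouring GPU threads
// reading p[i].x touch locations sizeof(particle) bytes apart -> strided access
struct particle { double x, y, z, mass; };

// Structure-of-Arrays: one array per field; neighbouring threads reading x[i]
// touch consecutive locations -> coalesced access on the GPU
struct particles_soa
{
    double* x;
    double* y;
    double* z;
    double* mass;
};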

My team started to port parts of this blog post related to a CUDA-aware MPI implementation of a Jacobi solver that is quite relevant for them. So far we have replaced one function that runs the Jacobi stencil on the device with a thrust-based implementation using thrust::for_each. If people are interested in the code, please post a comment below and I can share it.
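
To sketch the general pattern (this is not the team's actual code, just a minimal illustration of thrust::for_each over a counting_iterator; the names and the 5-point update are illustrative):

#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

// illustrative 5-point Jacobi update over a width x height grid
struct jacobi_point
{
    const double* in;
    double*       out;
    int           width;
    int           height;

    __host__ __device__
    void operator()(int i) const
    {
        const int x = i % width;
        const int y = i / width;

        // keep the boundary untouched for simplicity
        if (x == 0 || y == 0 || x == width - 1 || y == height - 1)
            return;

        out[i] = 0.25 * (in[i - 1] + in[i + 1] + in[i - width] + in[i + width]);
    }
};

void jacobi_step(const thrust::device_vector<double>& in,
                 thrust::device_vector<double>& out,
                 int width, int height)
{
    jacobi_point op = { thrust::raw_pointer_cast(in.data()),
                        thrust::raw_pointer_cast(out.data()),
                        width, height };

    thrust::for_each(thrust::device,
                     thrust::counting_iterator<int>(0),
                     thrust::counting_iterator<int>(width * height),
                     op);
}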

Permanent link to this article: https://gcoe-dresden.de/our-feet-are-wet-eurohack-day-2/

Mar 14

Let’s hack a ton! Eurohack Day 1

On March 6, ten teams met at JSC for the Eurohack 2017 to boost their applications on GPUs or to port them to GPUs in the first place. We had on average 4 team members and 2 mentors per team, so we were topping out at around 70 hackers in one room. This year’s hackathon (see last year’s starting post) was again supported by industry, as Nvidia and IBM staff were around to help as mentors.

The Eurohack 2017 took place in the rotunda, second floor of the round building you see there.

After getting acquainted with the team, work started immediately for all Dresden mentors (7 in total).
In my team, we separated from the rest to dive into a code review and had the main developer explain to us the method of the asynchronator library and its advantages and disadvantages. As my team didn’t have any code running on the GPU yet, a lot of constructive discussion started immediately. As usual, a lot of arguments filled the room based on each individual’s understanding of the computer hardware. At this point, this can be dangerous, as it leads to premature optimization very rapidly. We agreed on an iterative approach guided by profiling and supported by unit tests (the latter being in place already, along with a benchmark suite running on a dedicated CI machine - very impressive).

The first day of the hackathon started off very smoothly thanks to the wonderful organization by JSC. Dirk Pleiter, Andreas Herten, Michael Knobloch and others put the local infrastructure in place. Getting a training account on the GPU-equipped clusters, Jureca and Juron, was a matter of minutes. So everyone was ready to start! Interestingly enough, I discovered that the entire environment modules infrastructure on Jureca is based on Lua scripts (which I had never seen before) that are auto-generated by a tool called EasyBuild. On Juron, all module files are written in Tcl as usual.

Just before lunch, Julian Miller invited all of us to participate in a study collecting data to explore how much development effort is needed for porting applications to GPUs and other parallel architectures. After that, Michael Knobloch gave us an introduction to profiling with score-p, e.g. for use on the local clusters.

In the afternoon, Fernanda Foertter convened everyone for the daily scrum. Every team had 5 minutes to explain what they were after during the hackathon. The range of topics was very diverse this year (see the slides collection). There were three Lattice QCD projects, three projects from brain science, plus fluid dynamics, life sciences, asynchronous mathematical methods and more. It’s interesting to note that 7 out of 10 projects already had accelerated code paths in production, and some of them were already running on supercomputers. For these, the goal of the workshop was to squeeze more performance out of their application or to reach better scalability on large HPC machines. As simple arithmetic tells us, the remaining 3 projects were CPU-only code bases and needed to make the jump onto the accelerator this week. As I know from personal experience, that’s a tough goal to reach.

In the evening, we had a wonderful dinner at “Am Hexenturm” and finished the day with good food and drinks.

The evening dinner took place in a restaurant next to this wonderful 14th-century gate guarded by 2 defense towers, called “Hexenturm” (English: witch tower).

This post marks one of the first in a small series that will cover our observations from this week-long hackathon. The posts will cover suggestions straight from the lips of Nvidia engineers and best practices of seasoned CUDA developers that came up in response to the issues the project teams encountered. If you think I missed something or didn’t convey it correctly, please use the comment section below to let me know.

Permanent link to this article: https://gcoe-dresden.de/lets-hack-a-ton-eurohack-day-1/
