Mar 22

# Reaching the shore with a fog warning, my Eurohack day 4 morning session

My team of four bright young PhD students set out to conquer the HPC world with an open-source, MPI-aware, header-only library implementing asynchronous sparse matrix algebra operations. The library, called asynchronator, is C++11-based and uses C++11 threads in conjunction with MPI calls to dispatch its operations on HPC hardware. Especially the use of C++11 threads gave us some trouble with Score-P, as threading this way is not so common in HPC, even though the C++11 standard is six years old by now.

The library itself is based on a flat, text-book style class hierarchy where most classes abstract the underlying value type (assume double or std::complex<double> for now) away by means of a template parameter. Apart from that, asynchronator uses neither template metaprogramming nor any other state-of-the-art C++ techniques like [expression templates]() or [CRTP]() that might give a compiler in the HPC arena trouble (one hopes). The code values simplicity over complexity, which should pay off as long as the library offers only a limited set of operations. Depending on how the feature requirements evolve, this may be revised as usual.

So my team was very intrigued by the features that the thrust library had to offer. First off, thrust by now ships with every CUDA release and is thus an Nvidia-supported framework that you can rely on as a developer. thrust itself is a C++03 template library that provides an STL-like interface for interacting with Nvidia GPUs and multi-core hardware in general; it has a CUDA, an OpenMP, and an Intel TBB backend. Take for example a simple vector addition:

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>

#include <thrust/functional.h>

#include <cassert>

int main(void)
{
  // initialize all ten integers of a device_vector to 1
  thrust::device_vector<int> d_a(10, 1);

  // set the first seven elements of the vector to 9
  thrust::fill(d_a.begin(), d_a.begin() + 7, 9);

  // initialize a host_vector with the first five elements of d_a
  thrust::host_vector<int> H(d_a.begin(), d_a.begin() + 5);

  // set the elements of H to 0, 1, 2, 3, ...
  thrust::sequence(H.begin(), H.end());

  // transfer the contents of H to device vector d_b
  thrust::device_vector<int> d_b = H;

  // perform b = a + b (note: d_b holds only five elements, so the
  // input range must be limited to d_b.size() to stay in bounds)
  thrust::transform(d_a.begin(), d_a.begin() + d_b.size(),
                    d_b.begin(),
                    d_b.begin(),
                    thrust::plus<int>());

  // copy the results from device vector d_b into host vector H
  thrust::copy(d_b.begin(), d_b.end(), H.begin());

  // check that the computation gave the correct result
  assert(H[1] == 10 && "oops, thrust implementation of vector addition failed");

  return 0;
}
```

So the API of thrust::device_vector and thrust::host_vector strongly resembles std::vector and thus should make every intermediate to professional C++ developer feel at home immediately. The explicit thrust::copy function interface for host-device communication, as well as the assignment operator overloads of thrust::device_vector and thrust::host_vector that have the same effect, can make the code very readable and easier to maintain. But thrust has much more to offer. In case you are curious, jump over to the thrust documentation.

But why on earth did I name this blog post in such a weird fashion? And what is the relation to the fog warning? When we started to work with thrust during Eurohack 2017, we stumbled upon a couple of issues that I'd like to address here. I hope this helps get them fixed and keeps other developers from running into them as well.

Something that irritated us from the beginning was that the official thrust homepage listed the 1.8.1 release (from 2015) as the most recent stable one at the time of writing this blog post. If you check the CUDA 8.0 root directory, you'll find a file at include/thrust/version.h that contains the following source code:

```cpp
/*! \def THRUST_VERSION
 *  \brief The preprocessor macro \p THRUST_VERSION encodes the version
 *         number of the Thrust library.
 *
 *         <tt>THRUST_VERSION % 100</tt> is the sub-minor version.
 *         <tt>THRUST_VERSION / 100 % 1000</tt> is the minor version.
 *         <tt>THRUST_VERSION / 100000</tt> is the major version.
 */
#define THRUST_VERSION 100803
```

In other words, CUDA 8.0 ships thrust 1.8.3! The GitHub repo itself contains the 1.8.3 tag as well. So it looks like the thrust documentation has not been updated for a while. That was the first confusion we encountered.

Another intriguing observation was that the pinned memory allocator that one can supply to the thrust::host_vector type definition is located in ./include/thrust/system/cuda/experimental/pinned_allocator.h. As asynchronator is all about asynchronous methods, I hoped to use asynchronous copies based on pinned memory (think cudaMallocHost and cudaMemcpyAsync) and eventually use this allocator for the host-side handling of asynchronator's data. The fact that the allocator's header file lives in a directory called experimental made the warning bells in my head ring very loudly. We brought these issues up with the Nvidia engineers present, and they told us that the development of thrust has moved to Nvidia-internal servers; the GitHub repo and GitHub page simply had not been updated recently.

Shoving all of the above and related concerns aside, we continued to implement a 2D stencil operation on a 2D block of data with thrust. We used thrust::for_each (see its [documentation](https://thrust.github.io/doc/group__modifying.html#ga263741e1287daa9edbac8d56c95070ba)) and thrust::counting_iterator (see its [documentation](https://thrust.github.io/doc/classthrust_1_1counting__iterator.html)) for that. At some point, our unit tests pointed us at a bug that we had introduced in the device-side function performing the stencil operation. So we switched on device-side debug symbols during compilation (with the -G flag to nvcc). But the compiler threw an error when we did so:

```
ptxas fatal   : Unresolved extern function 'cudaDeviceSynchronize'
```

If you want to reproduce this problem, use the samples that ship with CUDA 8 and go into the samples/6_Advanced/radixSortThrust sub-folder. Compile the radixSortThrust example with debug symbols (under *nix by calling make dbg=1) and you'll get the same error. I submitted this as a bug to Nvidia and was told that it is a known issue.

By this point, our frustration had stacked up significantly. On top of that, we concluded that compiling thrust code with debug symbols was impossible, and that was a show stopper, as one thing is quite certain when developing software: there will be bugs! So our team decided to move away from thrust and adopt plain CUDA. Feel free to call this decision premature and post comments below, but keep in mind that the irritations of using thrust (which may be limited to this specific point in time and the related CUDA release) had stacked up considerably, and our first steps with it mostly produced problems rather than helping solve our computational ones. That is unfortunate, as I still think thrust is a great library, especially when it comes to bringing a parallel, STL-like interface to GPUs. It is even more unfortunate that most of the trouble we had could have been resolved by up-to-date documentation and, say, an open bug tracker for CUDA and CUDA library bugs.

Coming back to the title of this blog post and the painting by Winslow Homer: getting to the shore of the hackathon, i.e. having a working GPU implementation, is harder than you think. I lost half a day to the issues reported above. So this post is meant as a fog warning in case you want to board the thrust boat.

PS. Thrust uses dynamic parallelism internally. When using optimization flags with nvcc, the thrust calls to cudaDeviceSynchronize inside device code (which is in itself an instance of dynamic parallelism) were apparently optimized out, I suspect. In debug mode this was not the case, so the compiler fell over because it was missing a device-side implementation of cudaDeviceSynchronize. This can be fixed by adding -rdc true to the nvcc command line, but that opened another can of worms with the CMake build setup that asynchronator was using.

PPS. Digging through the stale GitHub wiki of thrust for advice on debugging, I found this statement:

> As of version 4.2, nvcc does not support device debugging Thrust code. Thrust functions compiled with device debugging enabled (e.g., nvcc -G, nvcc --device-debug 0, etc.) will likely crash.

Even though this statement refers to CUDA 4.2, and given that the above-mentioned CUDA 8 SDK example works once -rdc true is added, I am even more confused.