
Reaching the shore between Scylla and Charybdis, my Eurohack day 4 afternoon session


Odysseus and Scylla by PinkParasol, Copyright 2012-2017

Don't be too shocked by the image above. I chose it because I find it quite fitting for the situation I ran into on my fourth day of the Eurohack. For those of you who don't quite recall the background of this image from Homer's Odyssey: the sea monster Charybdis was believed to live under a small rock on one side of a narrow channel. Opposite her, inside a much larger rock, lived Scylla, another sea monster. According to Book XII of the Odyssey, the sides of the strait were within an arrow-shot of each other, and sailors attempting to avoid one of them would come within reach of the other.

So what happened? As discussed in a previous post, I sat down to code a 3x3 stencil operation using thrust. To have a fair comparison, I did the same in plain CUDA, put it into a unit test and ran it. As expected, I got the following results:

[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from stencil_2d_fixture
[ RUN      ] stencil_2d_fixture.cuda_if_else_stencil
[       OK ] stencil_2d_fixture.cuda_if_else_stencil (2696 ms)
[----------] 1 test from stencil_2d_fixture (2696 ms total)
[----------] 1 test from stencil_3d_fixture
[ RUN      ] stencil_3d_fixture.cuda_if_else_stencil
[       OK ] stencil_3d_fixture.cuda_if_else_stencil (87 ms)
[----------] 1 test from stencil_3d_fixture (87 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (2785 ms total)
[  PASSED  ] 2 tests.

Life was great at this point. I was happy, so I built the code in release mode and ran the tests again:

[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from stencil_2d_fixture
[ RUN      ] stencil_2d_fixture.cuda_if_else_stencil
/home/steinbac/development/asynchronator/test/SandboxCUDAStencil.cu:134: Failure
Expected: data[i]*2
Which is: 1028
To be equal to: observed[i]
Which is: 2.95916e+09
[  FAILED  ] stencil_2d_fixture.cuda_if_else_stencil (2719 ms)
[----------] 1 test from stencil_2d_fixture (2719 ms total)
[----------] 1 test from stencil_3d_fixture
[ RUN      ] stencil_3d_fixture.cuda_if_else_stencil
/home/steinbac/development/asynchronator/test/SandboxCUDAStencil.cu:219: Failure
Expected: data[i]*3
Which is: 197382
To be equal to: observed[i]
Which is: 1.65302e+12
[  FAILED  ] stencil_3d_fixture.cuda_if_else_stencil (26 ms)
[----------] 1 test from stencil_3d_fixture (27 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (2746 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] stencil_2d_fixture.cuda_if_else_stencil
[  FAILED  ] stencil_3d_fixture.cuda_if_else_stencil

2 FAILED TESTS

Ok, that looks bad. As a matter of fact, I'd consider this one of the nightmare situations for any seasoned C/C++ developer:

  • the debug version of the code passes the tests
  • the release version with optimizations turned on fails the tests

From the looks of it, we can't use the debugger to dig into the code, since debug symbols are obviously absent in release mode. So what else could it be?
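For reference, the assertion that fails is an ordinary googletest equality check over the host copy of the result; condensed, it looks roughly like this (a sketch reconstructed from the failure message, not the literal test code):

// sketch: data holds the input, observed the kernel output copied back to the host
for (std::size_t i = 0; i < observed.size(); ++i)
  ASSERT_EQ(data[i]*2, observed[i]);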

I sat there staring at my code and telling my teammates that something was wrong. I regenerated the CMake build files multiple times, ran make with VERBOSE=1 to check that the NVCC configuration from CMake was handed to make correctly, and did all kinds of sanity checks that, in retrospect, were probably irrelevant.

Taking a step back, I thought that maybe the kernel was reading from memory regions I had never allocated, or maybe the memory access pattern, i.e. the index computation, had a bug. So I added -lineinfo to the compiler flags for release mode, built the tests again and ran them under cuda-memcheck:

$ cuda-memcheck ./myapp | c++filt
#...
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 8
========= at 0x00000458 in /home/steinbac/development/asynchronator/test/SandboxCUDAStencil.cu:72:stencil_2D(double*, double*, double*, int2, int2)
========= by thread (31,3,0) in block (3,2,0)
========= Address 0x1030b600048 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204235]
========= Host Frame:./test/sandbox_cuda_stencil [0x668e1]
========= Host Frame:./test/sandbox_cuda_stencil [0x84483]
========= Host Frame:./test/sandbox_cuda_stencil (stencil_2d_fixture_cuda_if_else_stencil_Test::TestBody() + 0x63a) [0x4fafa]
#...

Right on the spot! So I am running into an out-of-bounds error: my kernel accesses device memory that does not belong to my application. I went back to my source code and continued staring at every line. The bulk of the work in the kernel is done inside a double for-loop that performs the stencil computation:

double new_value = 0.;
int stencil_offset = 0;        // running index into the stencil weights
int y = -_stencil_shape.y/2;   // start at the top edge of the stencil window

for(;y<=_stencil_shape.y/2;++y){

  int x = -1*_stencil_shape.x/2;  // start at the left edge of the stencil window
  for(;x<=_stencil_shape.x/2;++x){

    // row-major offset of the current neighbour in the input image
    int device_offset = (idx.y + y)*_shape.x + (idx.x + x);
    new_value += _stencil[stencil_offset++]*_old[device_offset];
  }
}

_new[linear_idx] = new_value;

I condensed the code a bit for the sake of clarity. _stencil_shape and _shape are two simple structs that describe the extents of the memory to work on; in this example their values are _shape = {512,512} and _stencil_shape = {3,3}. idx is the 2D Cartesian coordinate of the current thread, obtained from threadIdx, blockIdx and friends.
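Spelled out, idx and linear_idx boil down to something like the following (a sketch of the boilerplate I elided, not the verbatim kernel code):

// sketch: 2D thread coordinate and its row-major linearization over the 512x512 image
int2 idx = make_int2(blockIdx.x*blockDim.x + threadIdx.x,
                     blockIdx.y*blockDim.y + threadIdx.y);
int linear_idx = idx.y*_shape.x + idx.x;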

You should imagine me at this point holding my head and sweating like hell. When I observe myself doing this, an alarm bell rings that tells me to get some help. I asked Rene Widera and Jeffrey Kelling (both HZDR Rossendorf) to lend me a hand.

As a first shot, Rene suggested making the nvcc compiler keep the generated PTX code so we could have a look at it - that's what we performance addicts typically do ... look at the assembly! You can achieve this by passing --keep --source-in-ptx to nvcc. It will then compile your code and leave all temporary files in the directory where your binaries are built.

In my case with 2 compute capabilities defined, this resulted in 48 files being generated by the compiler:

./SandboxCUDAStencil.compute_60.cpp1.ii
./SandboxCUDAStencil.compute_60.cpp2.i
./SandboxCUDAStencil.compute_60.cpp3.i
./SandboxCUDAStencil.compute_60.cudafe1.c
./SandboxCUDAStencil.compute_60.cudafe1.gpu
./SandboxCUDAStencil.compute_60.cudafe1.stub.c
./SandboxCUDAStencil.compute_60.cudafe2.c
./SandboxCUDAStencil.compute_60.cudafe2.gpu
./SandboxCUDAStencil.compute_60.cudafe2.stub.c
./SandboxCUDAStencil.compute_60.ptx
./SandboxCUDAStencil.compute_60.sm_60.cubin
./SandboxCUDAStencil.compute_61.cpp1.ii
./SandboxCUDAStencil.compute_61.cpp2.i
./SandboxCUDAStencil.compute_61.cpp3.i
./SandboxCUDAStencil.compute_61.cudafe1.c
./SandboxCUDAStencil.compute_61.cudafe1.cpp
./SandboxCUDAStencil.compute_61.cudafe1.gpu
./SandboxCUDAStencil.compute_61.cudafe1.stub.c
./SandboxCUDAStencil.compute_61.cudafe2.c
./SandboxCUDAStencil.compute_61.cudafe2.gpu
./SandboxCUDAStencil.compute_61.cudafe2.stub.c
./SandboxCUDAStencil.compute_61.ptx
./SandboxCUDAStencil.compute_61.sm_61.cubin
./SandboxCUDAStencil.cpp4.ii
./SandboxCUDAStencil.cu.cpp.ii
./SandboxCUDAStencil.fatbin
./SandboxCUDAStencil.fatbin.c
./SandboxCUDAStencil.module_id

What you want to look at in our case is ./SandboxCUDAStencil.compute_60.ptx. PTX is not yet the machine code that will run on the device, but it is something very close to it.

One nice property of the generated PTX file is that, thanks to --source-in-ptx, it is interleaved with the equivalent C/C++ source lines. For example, the stencil computation looks like this:

//SandboxCUDAStencil.cu:80 new_value += _stencil[stencil_offset++]*_old[device_offset];
.loc 1 80 1
add.s32 %r62, %r62, 1;
ld.global.f64 %fd8, [%rd26];
ld.global.f64 %fd9, [%rd25];
fma.rn.f64 %fd10, %fd9, %fd8, %fd10;

I can make out the increment by 1, two 64-bit load instructions and a fused multiply-add that accumulates into new_value. So far, not good - we still hadn't found the root cause of the memory violation. Both of my Rossendorf colleagues therefore suggested inserting printf statements. As many of you know, the CUDA runtime supports printf in device code. That said, the size of the printf output buffer is limited, so we won't see the output of every kernel instance running on the card's SMs, but we will get some.
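As an aside, that buffer can be enlarged from the host before the kernel is launched, should the default ever be too small; a minimal sketch (we did not need this here):

// sketch: grow the device-side printf FIFO to 16 MB before launching the kernel
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);

So I inserted the printf checks: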


int y = -_stencil_shape.y/2;
for(;y<=_stencil_shape.y/2;++y){

  int x = -1*_stencil_shape.x/2;
  for(;x<=_stencil_shape.x/2;++x){

    // len and stencil_len are the total element counts of _old and _stencil,
    // declared in the part of the kernel omitted from this condensed snippet
    device_offset = (idx.y + y)*std::size_t(_shape.x) + (idx.x + x);
    if (!(device_offset < len))
      printf("Arrrg out of bounds!\n");
    if (!(stencil_offset < stencil_len))
      printf("Arrrg out of stencil bounds!\n");
    new_value += _stencil[stencil_offset++]*_old[device_offset];
  }
}

And boom, it fired (multiple times, as expected), reporting an out-of-bounds memory access! Now we were getting closer to knowing where. After a sequence of binary-search-style printf insertions, we observed that the initialization of y and x was corrupted.

int y = -_stencil_shape.y/2;

Contrary to expectations, with _stencil_shape.y = 3 in our case, y came out as -2! The same goes for x. By common integer arithmetic, 3/2 should give 1, and multiplied by -1 that gives -1, just as you'd expect on the CPU. But with CUDA 8.0.44 (on x86_64) and 8.0.51 (on Power8), 3/2 gave 2! At this point we asked the Nvidia dev techs who were still present for help: Kate Clark and Matthias Wagner. After further inspection of the C and PTX code with them, Kate asked us to come up with a tiny reproducer and submit a bug report. While doing so, we were able to show that this bug is present not only on compute capability 6.0 and 6.1 (Pascal generation), but also on compute capability 5.2 (Maxwell generation). What a day! What an odyssey.
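For the curious, here is a condensed sketch of the shape such a reproducer takes (illustrative only - the actual minimal example we submitted is the one attached in the addendum below, and this stripped-down variant is not guaranteed to still trigger the bug on an affected compiler):

#include <cstdio>

// sketch: mirror the failing expression, int y = -_stencil_shape.y/2;,
// with the small struct passed to the kernel by value
__global__ void divide_kernel(int2 stencil_shape, int* result)
{
  result[0] = -stencil_shape.y/2; // expected -1 for stencil_shape.y == 3
}

int main()
{
  int* d_result = nullptr;
  cudaMalloc(&d_result, sizeof(int));

  divide_kernel<<<1, 1>>>(make_int2(3, 3), d_result);

  int h_result = 0;
  cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  std::printf("-3/2 on the device yields %d (expected -1)\n", h_result);

  cudaFree(d_result);
  return 0;
}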

Addendum: I submitted the bug report on March 13th, 2017, after the Eurohack. The Nvidia compiler team was able to reproduce the error described here. On March 30th, 2017, Nvidia updated the bug report, stating that the bug had been fixed in the just-released CUDA 8.0.61. I uploaded a minimal reproducing example to this post.

Permanent link to this article: https://gcoe-dresden.de/reaching-the-shore-between-scylla-and-charybdis-my-eurohack-day-4-afternoon-session/
