Mar 17


Seeing the shore, Eurohack day 4

The 4th day started off with a quick presentation on good practices for Power8 compiler flags. One suggestion for the gcc frontend to XL was to use the following:

gcc -Ofast -flto -mcpu=power8 -mtune=power8 -mveclibabi=mass

Let's quickly go through them: -Ofast enables optimizations that produce the fastest binaries; -flto enables link-time optimization, where the optimizer scans for optimization opportunities after all object files have been linked; -mcpu=power8 -mtune=power8 tells the compiler to apply Power8-specific optimizations and, where needed, emit platform-specific instructions; -mveclibabi=mass uses IBM's vectorized MASS library where possible. Andreas wrote this up in the Juron-related documentation for the hackathon. The Eurohack Slack channel was quite active.

First off, the question was raised how to produce a profile dump with nvprof that can later be viewed in nvvp. The recipe shared by HZDR's Alexander Matthes was:

#on the remote machine or in the cluster as part of a submit script
$ nvprof -o metrics.nvprof --analysis-metrics ./yourapp

in case you are interested in the metrics analysis. If you only need the timeline, then do:

#on the remote machine or in the cluster as part of a submit script
$ nvprof -o timeline.nvprof ./yourapp

By the way, this trick also works with MPI applications, so there is no need to bring out heavy artillery like Allinea, Score-P and friends. After this is done, the produced .nvprof file needs to be imported into nvvp, either on your laptop or in an exported X session running nvvp.

In order to write CPU code that respects the available number of cores, someone asked how to set the number of pthreads in a manner similar to OMP_NUM_THREADS. The community replied to simply call pthread_create only n times. Note, however, that the Linux kernel caps the total number of threads that can be created system-wide, as documented in /proc/sys/kernel/threads-max. My laptop is apparently capable of spawning 125811 threads:

$ cat /proc/sys/kernel/threads-max
125811

which is an interesting number for a two-core mobile Ivy Bridge box.

A more detailed usage example of the -Minfo flag was given to help you understand the compiler's actions on OpenACC code. For a simple saxpy kernel like this:

void saxpy(int n, float a, float * restrict x, float * restrict y)
{
    #pragma acc kernels
    for(int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

Here is the output of the PGI compiler:

$ pgcc -fast -ta=tesla,multicore -Minfo -c saxpy.c
     4, Generating implicit copy(y[:n])
        Generating implicit copyin(x[:n])
     5, Loop is parallelizable
        Accelerator kernel generated
        Generating Tesla code
         5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     5, Generating Multicore code
         5, #pragma acc loop gang
     5, Generated an alternate version of the loop
        Generated vector simd code for the loop
        Generated 2 prefetch instructions for the loop
        Generated vector simd code for the loop
        Generated 2 prefetch instructions for the loop
        FMA (fused multiply-add) instruction(s) generated

Very impressive.

My team and a neighboring one ran into a clash while using cuda-gdb. Apparently, cuda-gdb creates a lock file upon invocation. While that might be fine on a machine with one user and possibly one GPU, the approach falls flat on a shared node with 4 GPUs: two users may want to use the debugger at the same time, but the cuda-gdb lock denies at least one of them.

Some teams were struggling with uncoalesced memory access, so someone on the chat suggested the trove library by Baidu's Bryan Catanzaro.

Last but not least, a short bash snippet was posted to pin a job to CPU cores close to the GPU it uses. The goal of this trick is to minimize cross-socket traffic before getting your data onto the device.

# this is bash; $CPUCORE holds the core to pin to, $1 the application to run
GPUDEVICE=$(nvidia-smi topo -c $CPUCORE | tail -n +2 | cut -d "," -f 1)
# restrict the app to the GPU closest to that core, then pin it there
CUDA_VISIBLE_DEVICES=$GPUDEVICE taskset -c $CPUCORE $1

Again, if you have any questions or concerns, feel free to use the comment section below. Also don't miss the news coverage of Eurohack.

Permanent link to this article: https://gcoe-dresden.de/seeing-the-shore-eurohack-day-4/