
Aug 23


Reproducible Performance on HPC | Affinity

Developing highly performant code for GPUs is a task with many pitfalls, and running that code efficiently with a reproducible outcome is another. This part of our blog covers the pinning of tasks to cores within a multi-socket system.

Affinity

Processor affinity, or CPU pinning, binds a process or a thread to specific CPU cores, e.g., for efficient communication and caching. To make things more practical, we take a non-uniform memory access (NUMA) architecture, a dual-socket Intel Xeon E5-2680 v3 system, as an example. Running lstopo --no-io --no-legend haswell.png renders the topology of one of our Haswell nodes into an image:

Haswell lstopo

At the top level you can see the two NUMA nodes, represented by the two CPU sockets, each with 32 GB of DRAM. Each socket contains 12 CPU cores with their own L1 and L2 caches, while the 30 MB L3 cache is shared by all 12 cores.
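If the hwloc package (which provides lstopo) is installed, its command-line tools can also answer such topology questions directly in the terminal. A small sketch, assuming the hwloc utilities are available and ./my_app stands in for an actual program:

# List the logical CPUs belonging to the first socket
$ hwloc-calc --intersect pu socket:0

# Run a program bound to the cores of socket 0 and to its local memory
$ hwloc-bind socket:0 --membind node:0 -- ./my_app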

What we cannot see in the picture above is the QuickPath Interconnect (QPI) between the memory controllers of the two sockets (AMD's counterpart is HyperTransport). The QPI bus handles the inter-socket traffic, as illustrated in the following picture (source: Frank Denneman).

NUMA Access with QPI

Depending on the clock rate, QPI delivers 20-25 GB/s of bidirectional bandwidth. What does this have to do with GPUs? GPUs are typically connected via PCIe, often behind PCIe switches, to the PCIe lanes of the CPU die. Sharing data between non-adjacent GPUs therefore incurs a QPI hop, and this forwarding of PCIe packets degrades performance.
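The penalty can be made visible with the p2pBandwidthLatencyTest sample shipped with the CUDA SDK, which measures bandwidth and latency between every pair of GPUs; device pairs behind the same PCIe switch should clearly outperform pairs that have to cross QPI. A sketch, assuming the CUDA samples have been built (the path to the binary depends on the installation):

# Measure bandwidth/latency for all GPU pairs, with and without P2P enabled;
# compare pairs behind the same PCIe switch with pairs across the QPI link
$ ./p2pBandwidthLatencyTest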

In current generations (IvyBridge) of Intel Xeon processor E5-2600 series CPUs, the generally really useful QPI link introduces significant latency when forwarding PCIe packets between processors due to the way the packets are buffered and forwarded. (source: The Cirrascale Blog)

This also has consequences for how you allocate data and which devices you can use to avoid such a detour, and you have to know how your job manager enumerates the cores. If you start a job with 4 CPUs and 4 GPUs (K80) on a dual-socket system, you will probably receive this topology:

$ nvidia-smi topo -m

GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SOC     SOC     SOC     0-3
GPU1    PIX      X      SOC     SOC     SOC     0-3
GPU2    SOC     SOC      X      PIX     PHB     0-3
GPU3    SOC     SOC     PIX      X      PHB     0-3
mlx4_0  SOC     SOC     PHB     PHB      X

Legend:

X   = Self
SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX  = Connection traversing a single PCIe switch
NV#  = Connection traversing a bonded set of # NVLinks
  • CPUs 0-3 are located on NUMA node 0 = first socket
  • CPUs 0-3 connect to GPU0-1 directly via PCIe
  • CPUs 0-3 connect to GPU2-3 by traversing the SMP link / QPI

The allocated CPU cores 0-3 are located on the first socket, while GPU2 and GPU3 are connected to the other socket. numactl helps to show and to control the NUMA policy, and in our example it confirms the topology:

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3
cpubind: 0
nodebind: 0
membind: 0 1

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 32663 MB
node 0 free: 30981 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 32768 MB
node 1 free: 31819 MB
node distances:
node   0   1
0:  10  21
1:  21  10
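With such an allocation (cores on socket 0 only), a pragmatic option is to restrict the application to the GPUs that are local to those cores. A minimal sketch, where ./my_app is a placeholder for the real application:

# CPUs 0-3 live on NUMA node 0, so prefer GPU0/GPU1 (attached to socket 0)
# and keep memory allocations on node 0 as well
$ CUDA_VISIBLE_DEVICES=0,1 numactl -N0 -m0 ./my_app

Using GPU2 and GPU3 from these cores would push every PCIe transfer across QPI; to use them efficiently, the job would also need cores on the second socket.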

Control affinity with numactl

We now allocate a node exclusively, so that all GPUs and CPUs are visible. numactl controls process and memory placement via node IDs.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 32663 MB
node 0 free: 20561 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 32768 MB
node 1 free: 26344 MB
node distances:
node   0   1
0:  10  21
1:  21  10

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1

This is our Slurm job file for the K80 node, used to test the process pinning and the corresponding host-device bandwidth (bandwidthTest from the CUDA SDK).

#!/bin/bash
#SBATCH -J BandwidthK80Pinning
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --time=0:01:00
#SBATCH --mem=2000M # gpu2
#SBATCH --exclusive
#SBATCH --partition=gpu2

srun --gpufreq=2505:823 numactl -m0 -N0 ./bandwidthTest # socket0, memory0
srun --gpufreq=2505:823 numactl -m1 -N1 ./bandwidthTest # socket1, memory1

Results are given in the picture below.

cudaMemcpy Throughput on a Dual-Socket System

The diagrams show what impact an ill-placed GPU job can have. The throughput on the SandyBridge/K20Xm system is mostly limited by PCIe v2, and the download direction might be affected by QPI. While the Haswell/K80 loses about 15% of its throughput, the Haswell/P100 loses over 80% when the pinned memory is transferred from the other CPU's memory over QPI; it looks like there is another factor that kills the PCIe performance. On the more recent Broadwell/P100 system, the throughput seems to be independent of the node pinning and reaches almost 13 GB/s. All these measurements exclude the memory used for ECC (a 20% memory footprint).

Take care of process pinning and node topology, especially when using older systems. To control the pinning you can use numactl or taskset, and some HPC job managers such as Slurm also offer parameters for controlling task affinity (see the sketches below). Keep in mind that job managers can have different policies for distributing your tasks across the cores. The GPU-CPU binding can be found via nvidia-smi topo -m.
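For completeness, here are sketches of these alternatives; the exact Slurm option names depend on the version and site configuration, and ./my_app is again a placeholder:

# taskset: restrict a process to cores 0-3 (the first four cores in our example)
$ taskset -c 0-3 ./my_app

# Slurm: bind each task to cores and print the resulting CPU masks
$ srun --cpu-bind=verbose,cores ./my_app

# Slurm: one task per socket, bound to the cores of its socket
$ srun --ntasks-per-socket=1 --cpu-bind=sockets ./my_app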

