
Apr 26


Reproducible Performance | GPU boost

Developing highly performant code for GPUs is a task with many pitfalls, and running that code efficiently with a reproducible outcome is another one. This part of our blog covers issues with runtime analysis under the GPU boost functionality on Kepler GPUs.

GPU clock rate and GPU boost

GPU boost improves overall performance: Modern Nvidia GPUs offer dynamic clock speeds depending on power and temperature targets and on application requirements. With GPU boost 2.0, the Kepler architecture introduced fine-grained voltage points with corresponding clock speeds. Pascal GPUs with GPU boost 3.0 are even more flexible and expose these controls to the end user by making individual voltage points programmable via the NVAPI.

Performance vs. clock modes (Source: techpowerup.com)

GPU boost is not meant for reproducible performance: For performance benchmarks, the GPU memory and core clock rate settings are important. Generally, there are base clocks, boost clocks and maximum clock speeds. On the Kepler K80 GPU the auto-boost behaviour can be controlled via application clocks. Nevertheless, power-target-based clock throttling can still lower the clocks.

The Nvidia blog post on GPU boost gives more details on how to control the GPU boost functionality, even in a cluster environment. The key point is: applications running in boost clock mode might show higher, but unstable, performance due to clock throttling. This becomes an issue when your application is a benchmark or when you want to compare implemented optimizations. It also affects load balancing.

Fix the GPU clock to a non-boost level: For consistent measurements the clock rate should be fixed to reasonable application clocks and/or the auto boost clock mode should be disabled. Keep in mind that the clock rate arguments are just a recommendation; the GPU is still free to throttle down (you don't want to burn your GPU, do you?). For SLURM, a SPANK plugin (GPLv3) has been developed at our HPC facility to specify the application clocks via srun. nvidia-smi provides the same functionality via the -ac flag. On the K80 I normally start with a 2505 MHz effective DDR clock for the GDDR5 memory and 823 MHz for the GPU core clock. To find the clock rates of your card, check with nvidia-smi:

$ nvidia-smi -q -d CLOCK -i 0
[...]
    Applications Clocks
        Graphics                    : 823 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
        Video                       : 540 MHz
[...]
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
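
To fix the clocks, nvidia-smi can set the application clocks directly. The following is only a minimal sketch: setting clocks typically requires root privileges or relaxed clock permissions, it is limited to the Tesla/Quadro/Titan lines (see the comments below), and the values must be picked from the supported clocks of your GPU:

$ nvidia-smi -q -d SUPPORTED_CLOCKS -i 0       # list supported memory,graphics clock pairs
$ sudo nvidia-smi -i 0 -ac 2505,823            # fix application clocks (memory,graphics in MHz)
$ sudo nvidia-smi -i 0 --auto-boost-default=0  # keep the GPU from boosting above the application clocks
$ # ... run the benchmark ...
$ sudo nvidia-smi -i 0 -rac                    # reset the application clocks to the defaults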

After your application has finished, you can check the clock throttle reasons:

$ nvidia-smi -q -i 0 -d PERFORMANCE
[...]
GPU 0000:04:00.0
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active

Idle is active when no GPU application is running. Applications Clocks Setting is active when an application runs with defined clock speeds. The other reasons indicate possible unexpected deviations in the clock speeds. The GPU clock speed can vary from 562 MHz to 875 MHz on the K80. A small memory bandwidth test with ECC enabled gives an impression of the performance impact (average of 50 benchmark runs):

Throughput vs. GPU clock frequency

At 732 MHz and above the memory bus becomes saturated, because enough memory instructions can be issued per unit of time, so running at 875 MHz is just a waste of energy in this example. A higher GPU clock rate does not improve memory throughput (the effective DDR memory clock rate is always 2505 MHz here). If the kernel were compute-bound, a change of the GPU clock rate would directly affect kernel performance. Note that there is a throughput difference of 20% between running the benchmark at the lowest and at the highest GPU clock speed. Furthermore, a throughput decrease of 3% can be observed for the read-write kernel, where the DRAM suffers from read/write turn-around overhead (see this forum entry).

The clock rate also affects the bandwidth of transfers between host and device (average of 5 runs of the CUDA SDK bandwidthTest):

cudaMemcpyAsync Throughput vs. GPU clock frequency

Increasing the GPU clock rate from the default 562 MHz to 875 MHz increases the host-device throughput accordingly.

The SLURM job scripts used for these measurements share the following header:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1      # 1x GPU
#SBATCH --cpus-per-task=1 # default
#SBATCH --time=0:05:00    # 5min
#SBATCH --mem=2000M       # 2GB per node
#SBATCH --exclusive       # to avoid interferences
#SBATCH --partition=gpu2  # partition with K80s

# ... some work
srun ./bandwidthTest
# ... request fixed clock rates
srun --cpu-freq=medium --gpufreq=2505:823 ./bandwidthTest
# ...

Fix & watch the GPU clock for consistency: For benchmarks as well as for the analysis of optimizations and load balancing, you want consistent clock speed conditions on the GPU. GPU boost is supposed to improve overall performance, but it hinders runtime analysis by throttling the clocks when power targets are reached. Therefore, the application clocks should be set to a reasonable value depending on the application and system environment. If possible, keep track of the clock speeds and power target statistics, as the GPU is always able to throttle down to keep itself safe.
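
One way to do that is to run a small logger next to the application and sample the clocks via nvidia-smi. This is only a sketch: ./my_benchmark is a placeholder for the actual application, and the query fields follow the listing of nvidia-smi --help-query-gpu:

#!/bin/bash
# sample SM/memory clock, power draw and temperature once per second
nvidia-smi -i 0 \
    --query-gpu=timestamp,clocks.sm,clocks.mem,power.draw,temperature.gpu \
    --format=csv -l 1 > gpu_clocks.csv &
LOGGER_PID=$!

srun ./my_benchmark      # placeholder for the actual application

kill $LOGGER_PID         # stop the logger once the benchmark has finished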

CPU Boost

Intel CPUs such as the Core i5/i7 and Xeon lines support dynamic clock rates as well. For example, the Xeon E5-2680 v3 has a 2.5 GHz base and a 3.3 GHz maximum turbo frequency, while its lowest frequency state is 1.2 GHz. Intel calls this Turbo Boost or "dynamic overclocking". Note that the clock frequency can be lowered even further when running AVX code. Again, for benchmarks or load balancing analysis, "dynamic overclocking" is an unwanted feature. Control the CPU performance states either via ACPI or, on an HPC system, with e.g. Slurm. The Slurm documentation of cpu-freq describes several ways to request a CPU clock and power state (requesting a clock range is also possible):

  • srun with command-line argument: srun --cpu-freq=<...>
  • srun without frequency option:
    • frequency depends on default setting of SLURM_CPU_FREQ_REQ:
    • ... can be set via #SBATCH --cpu-freq=<...> (within an SBATCH job script)
    • ... or can be set via export SLURM_CPU_FREQ_REQ=<...>
  • frequency values: low, medium, highm1 (highest non-turbo frequency), high, or a frequency in kilohertz
  • launching the application without srun: not recommended, because:
    • the clock speed is unknown (see the sketch after this list)
    • tasks might run on wrong cores (affinity/pinning)
    • launching with mpirun can be an exception when (commercial) binaries ship their own MPI and srun does not work
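
If you cannot (or do not) launch via srun, the cpufreq sysfs interface at least tells you which governor and frequency are currently active. This is a sketch; the exact files depend on the cpufreq driver in use (e.g., intel_pstate vs. acpi-cpufreq):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # active scaling governor of core 0
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency in kHz
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq   # hardware minimum in kHz
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq   # hardware maximum in kHz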

Fix your CPU clock for reproducible performance: If you do not run your application with srun, the frequency scaling governor might use an OnDemand policy. The result would be volatile runtime measurements due to varying clock speeds. The following diagram shows an example of a single-CPU application whose runtimes vary by the same amount as the CPU clock rate setting. At high clock speeds the application appears to become memory bound.

Runtime vs. CPU clock rate

Here is a job script that shows you the requested frequencies:

#!/bin/bash
#SBATCH -J batch_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00

bash -c 'echo $SLURM_CPU_FREQ_REQ'
srun bash -c 'echo $SLURM_CPU_FREQ_REQ'
srun --cpu-freq=medium bash -c 'echo $SLURM_CPU_FREQ_REQ'

which returned on one of our Haswell nodes:

highm1
Highm1
Medium

Conclusion

So, there are several avoidable factors which can disturb benchmarks, optimizations, and runtime or load-balancing analyses. In this post we recommended to:

  • use exclusive allocation of compute nodes on an HPC system to prevent interference with other users
    • we have seen spontaneous slowdowns of up to 4x or even more
  • watch the performance states and clock frequencies, for CPUs as well as for GPUs
    • to keep track of the environment, dump the most relevant settings right into your results file: CPU, GPU, memory, OS, date, program version, library versions, compiler version, driver version, ... (see the sketch after this list)
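
A small wrapper around the benchmark can collect such an environment dump automatically. This is only a sketch: ./my_benchmark is a placeholder, and the chosen tools are just examples of what might be available on the node:

#!/bin/bash
# dump the most relevant environment settings next to the results
RESULTS=results_$(date +%Y%m%d_%H%M%S)
mkdir -p "$RESULTS"

date                          >  "$RESULTS/env.txt"
uname -a                      >> "$RESULTS/env.txt"
lscpu                         >> "$RESULTS/env.txt"
nvcc --version                >> "$RESULTS/env.txt"
nvidia-smi --query-gpu=name,driver_version --format=csv >> "$RESULTS/env.txt"
nvidia-smi -q -d CLOCK -i 0   >> "$RESULTS/env.txt"

srun ./my_benchmark > "$RESULTS/output.txt"   # placeholder for the actual application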

Further recommendations:

  • compile your kernels for the proper compute capabilities
  • when a warmup is required, measuring the warmup itself is nice to have (it is good to know the whole timeline, and it helps to tune the warmup length)
  • for more reliable results, gather statistics from more than one process launch, where one process already repeats the function to be benchmarked n times (see the sketch after this list)
    • single-core vs. multi-core CPU performance also depends on shared L3 cache usage (e.g., run 8 instances of a single-core application to compare against 8-core performance)
  • if possible, keep track of runtime dynamic metrics like clock rates and memory usage (this has an impact on runtime and latency, which must be measured itself), e.g. via the CUDA API, NVML, nvidia-smi, ...
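
For the statistics over several process launches, a simple loop in the job script is already enough. Again only a sketch: ./my_benchmark is a placeholder and the number of launches is arbitrary:

#!/bin/bash
# repeat the whole benchmark over several independent process launches;
# each launch already repeats the measured function internally
for run in $(seq 1 10); do
    srun --cpu-freq=medium ./my_benchmark >> runtimes.csv
done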

Permanent link to this article: https://gcoe-dresden.de/reproducible-performance-gpu-boost/

2 comments

  1. Peter Steinbach

    I was just informed by Nvidia that setting the frequency of the attached GPU through nvidia-smi is restricted to the Titan, Quadro and Tesla line of Nvidia GPUs only.

  2. Matthias Werner

    Thanks for the valuable information. AfterBurner 4.3.0 seems to allow fixed voltage/frequency settings in realtime for Pascal cards (GPU Boost 3.0), see here.
