May 01

1st Machine Learning Community Workshop Dresden

The “Machine Learning Community” (MLC) Dresden is a joint activity of numerous research institutes to enhance the networking and the knowledge transfer in the fields of Machine and Deep Learning in and around Dresden.

The 1st MLC workshop will take place on May 15th, 2018, with over 50 researchers from Dresden and Leipzig, who will present their current work, upcoming projects and ideas regarding Machine Learning applications. It will also be the kick-off event to announce MLC as a supportive platform, e.g., for regular meetings on general and subject-specific Machine Learning topics. Everybody is welcome to join the MLC, from beginners to experts, including researchers who are only planning to use ML methods in the future.

The MLC founding members come from several institutes of the Technische Universität Dresden, but also from the Helmholtz-Zentrum Dresden – Rossendorf, the University Hospital Dresden and the Max Planck Institute of Molecular Cell Biology and Genetics. As Machine Learning might become a highly data-intensive and time-consuming task, members from the Centre for Information Services and High Performance Computing (ZIH) and from the GPU Center of Excellence (GCOE) joined the MLC to provide HPC support for ML applications.

1st MLC Workshop Dresden

MLC is always looking for interesting talks and research collaboration, so do not hesitate to get in touch: mlc (at)

(The registration is closed, no more slots available.)

Permanent link to this article:

Mar 07

Meet the #GPUhackDD 2018 mentors

The GPU Hackathons would not be as successful, were it not for the mentors. Every one of our nine teams has been assigned two mentors who work closely with the teams to help them achieve their goals. And now – meet the GPU experts of the Dresden GPU Hackathon (in alphabetical order):

Read the rest of this entry »


Mar 02

The Hack is on in 2018!

It’s that time of the year again! Developers of scientific software projects of any discipline come together with GPU experts from across the globe in Dresden to bring their science to the GPU! Stay tuned on this blog or by following the twitter hashtag #GPUHackDD.

So without further ado, here are this year's teams:

Team ZerialLab

We work on microscopy-based image analysis, which requires solving a system of non-linear equations in a multidimensional space (~100 dimensions) using an iterative convergence method. This system is solved for each pixel/voxel of a 2D/3D image. Our home-made implementation takes similar calculation time on a CPU (Intel i7, 12 threads/6 cores) and a high-end GPU (Kepler K20). We have been developing software for quantitative microscopy within a biological laboratory for more than 15 years. The main programming language is C++. The main platforms are Windows for the user interface and Linux for high-throughput computation on HPC systems. We have been implementing parts of our algorithms on the GPU (OpenCL) since 2011.

Team Remeisen

Tessim is part of the Sixte software package developed at the Erlangen Centre for Astroparticle Physics (ECAP) (Sixte Homepage). Using the software package, astronomers all around the world can simulate X-ray observations as they would be carried out by a variety of different satellites in space. This makes it possible to determine the feasibility of observations, create analysis software in advance of satellite missions and carry out performance studies of instruments in development. The Tessim tool itself consists of about 8000 LOC; it is currently implemented in C and uses standard libraries in addition to the GSL. The main user community is the consortium for the XIFU instrument. The satellite will be launched in 2028, and our software package is used in all areas of the performance studies for the mission. A (to be developed) version using the GPU will be used mainly by the XIFU systems team.

Team Uni Graz

Our application is an Eikonal equation solver written in CUDA that simulates wave propagation on a heart mesh. It is also used for the inverse problem of determining patient-specific cardiac conductivity parameters, where the optimization method calls the solver many times until it finally converges. For this reason it is very important to have a very fast solver. We have implemented it in CUDA and currently it runs on only one GPU. We want to port it to cluster computing and the benefits are discussed above. LOC = 4248. It is licensed under the LGPL v2.1. We are part of a large community using the Eikonal equation for different research purposes, including the Medical and Technical Universities of Graz.

Team LeMonADErs

Calculating the interactions of macromolecules/soft matter is computationally demanding and recent simulations on CPU are limited in physical time and length scales. In our approach, we use the Bond Fluctuation Model (BFM)[1,2] as a well-established Monte Carlo method for simulating coarse-grained polymeric materials. Our team has already developed and published a single-CPU-based open-source project "LeMonADE" on GitHub for use in the scientific community. We have already ported the sequential BFM algorithm to the GPU with CUDA for parallel execution of the monomer trial-and-error procedure. Within the hackathon, we want to address further optimization of this approach and novel algorithms in the problem domain.

Team McStas-McXtrace

The McStas neutron Monte Carlo ray-tracing simulation package is a versatile tool for producing accurate simulations of neutron scattering experiments at reactors and at short- and long-pulsed spallation sources. McXtrace is the X-ray counterpart, for X-ray scattering experiments at lab, synchrotron and free-electron laser sources. McStas and McXtrace are extensively used for design and optimization of instruments, virtual experiments, data analysis and user training. McStas was founded as a scientific, open-source collaborative code in 1997; McXtrace was founded in 2009. Technology-wise, a user-written input file defined in our own DSL (lex+yacc grammar) provides input to a code generator, which generates ISO C code that is then compiled and run. Our goals are to (always) be able to simulate the (virtual) experiments faster than they are performed in reality, to allow for in-experiment "data analysis", to shorten execution time for "complicated" simulations, and to shorten development time for the instruments being constructed.

Team TU Dresden HF

Our application is an FDTD (finite-difference time-domain) solver for Maxwell's equations. It can be used to solve electromagnetic propagation problems. The current code base is about 8000 lines of code. The implementation is done in Python and has MPI support. However, the software currently runs on CPU only. Our goal is to port the data structures and the FDTD kernel to the GPU to increase the computational speed of the finite difference algorithms. The software is open-source with a BSD license.

Team Neuro Nerds

We have two applications in total: The first is about speeding up analysis of neurothermic images in neurosurgery. The second is about speeding up recurrent neural networks learning for deep learning in sensor applications. We think that there would be a performance gain of some hours or even days in the RNN case so our networks could become larger to learn more features from our data.


We have developed in-house software for multi-resolution particle and mesh simulations. The novelty of our work comes from a new method we devised, the Adaptive Particle Representation (APR), for detail estimation and tracking in adaptive simulations. The APR was originally developed for efficient and adaptive representation and storage of light-sheet microscopy images; the method has now been extended to adaptive simulations of PDEs, which reduces the number of degrees of freedom (DOF) and the associated computational cost. We think the GPU capabilities would massively decrease our compute times. More interestingly, we would like to quantify the performance characteristics of our novel algorithm. Potential applications include real-time adaptive simulations for systems biology and computer graphics. We also envision an efficient and adaptive pipeline for image-based systems biology where fast and accurate simulations can be run on surfaces/geometries reconstructed from pixel data.

Team Ptycho_imaging

We work on phase reconstruction from intensity measurements obtained by ptychographic imaging. Ptychography is used to reconstruct 3D X-ray far-field images from 2D scattering images. The algorithm is not a specific solution but a general-purpose technique with a lot of applications. We use MATLAB with the Image Processing Toolbox running on a laptop. We would like to run the code on the GPU.


Dec 21

Apply for the GPU Hackathon in Dresden [Submission Deadline: 19.1.2018]

The GPU Hackathon Series is announcing another year of hackathons. In 2018 the first one will take place in Dresden. If you have science code you would like to see running on GPUs, then this is for you!

Read the rest of this entry »


Nov 02

GCOE Dresden at the SC17

The International Conference for High Performance Computing, Networking, Storage and Analysis (SC17) in Denver (Colorado, USA) is the most important conference in the field of high-performance computing (HPC) and will take place from 12 to 17 November 2017. The GCOE partners ZIH and the HZDR will present their know-how on the topics of HPC, Data Intensive Computing as well as Grid and Cloud Computing in talks, tutorials and discussions.

In addition to current research work and software developments, activities in the field of Big Data and data analysis will be presented. The two main topics are online performance analysis and analysis of applications’ I/O behavior. At the ZIH booth, visitors can try out the interactive online analysis prototype Vampir Live and experience new I/O analysis capabilities of Vampir. In particular, the I/O behavior of applications is becoming a bottleneck on the way to exascale high-performance computers. Novel I/O systems based on non-volatile memory technologies address this bottleneck. Together with the EU Horizon2020 project NEXTGenIO, the I/O analysis capabilities of the tools Score-P and Vampir have been significantly enhanced. The online analysis approach presents strategies for dealing with limited I/O capacities on the tool side as well.

Highlighted events

If you want to know more about GPU Hackathons and benefit from the experiences we and other organizers have had over the past hackathons, visit the Workshop on Education for High Performance Computing. The agenda will probably be released on the workshop website.

Get into accelerator programming with the Fourth Workshop on Accelerator Programming Using Directives WACCPD, which is co-organized by Guido Juckeland from the HZDR.

Guido will also be part of the OpenACC BoF and the OpenACC User Group Meeting.

Ronny Tschüter et al. from our ZIH will present an LLVM Instrumentation Plug-In for Score-P.

The Hands-On Practical Hybrid Parallel Application Performance Engineering will show you how to work with the different performance tools from the Virtual Institute - High Productivity Supercomputing (VI-HPS) by hands-on exercises with Scalasca, Vampir, and TAU.

To see other contributions, check the TU Dresden page. To get in contact, you can also visit our ZIH booth (booth 1881). There, you can check out our performance tools or ask us about Machine Learning, Big Data (ScaDS), NEXTGenIO, energy efficiency and performance-related issues.


Sep 01

Interactive Deep Learning on a HPC cluster

One of the key ingredients for the success of our Deep Learning Bootcamp last week was the availability of jupyter notebooks on the GPU nodes of our HPC installation. This avoided many problems we saw in previous workshops when software is run on the hardware that participants bring along. At the same time, it allows the learners to take home their notebooks.

The key ingredient, besides the GPU hardware and the availability of theano, tensorflow and keras on these nodes, was jupyterhub. I started from a blog post by a fellow software carpentry instructor, Andrea Zonca at the San Diego Supercomputer Center. That post is worth reading if you are interested in what is possible beyond our minimal setup (think docker swarm).

For my use case, I wanted to run jupyterhub on the head node of our cluster. So I installed it from pip into our software installation:

$ pip3 install --prefix=/sw/apps/jupyterhub/<version> jupyterhub

In addition, I used two plugins for jupyterhub: batchspawner and ProfilesSpawner from the wrapspawner package. Both can be installed like this:

$ git clone<spawnername>
$ cd <spawnername>
$ pip3 install --prefix=/sw/apps/<spawnername>/<version> .

After adjusting the environment, i.e. PYTHONPATH and PATH, everything is good to go. The last ingredient to get things rolling was to install the nodejs package configurable-http-proxy.

$ npm install -g --prefix /sw/apps/nodejs-modules/ configurable-http-proxy

To configure jupyterhub, I first generated a config file with:

$ jupyterhub --generate-config

After adjusting the core bits related to authentication, logfiles, port numbers, location of the sqlite database and such, the most important parts of the configuration are the following:

c = get_config()
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
c.Spawner.http_timeout = 120
c.SlurmSpawner.req_nprocs = '1'
c.SlurmSpawner.req_runtime = '8:00:00'
c.SlurmSpawner.req_partition = 'gpu'

c.SlurmSpawner.start_timeout = 240

c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --cpus-per-task={nprocs}
#SBATCH --workdir=/home/{username}
#SBATCH --uid={username}
#SBATCH --mem={memory}
#SBATCH {options}

source /sw/env/
source /usr/share/Modules/init/bash
module load courses/env
module load cuda/8.0.61
module load hdf5/1.8.16
which jupyterhub-singleuser
{cmd}
'''
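
To make the role of the {partition}, {runtime} and similar placeholders more tangible, here is a minimal sketch of how such a template could be filled from req_* settings. It is an illustration only and not batchspawner's actual code: the function render_batch_script, the shortened template and the memory value are all made up for this example.

```python
# Illustration only: mimics how a batch-script template with {placeholders}
# could be filled from req_* settings. This is NOT batchspawner's real code.
template = """#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --cpus-per-task={nprocs}
#SBATCH --mem={memory}
"""


def render_batch_script(template, **req_settings):
    """Strip the 'req_' prefix from each setting and substitute it into the template."""
    values = {
        name[len("req_"):]: value
        for name, value in req_settings.items()
        if name.startswith("req_")
    }
    return template.format(**values)


script = render_batch_script(
    template,
    req_partition="gpu",
    req_runtime="8:00:00",
    req_nprocs="1",
    req_memory="32000",  # hypothetical value, chosen only for this example
)
print(script)
```

Printing the rendered script before submitting anything is a cheap way to spot a placeholder that has no matching setting, since str.format raises a KeyError in that case.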

For convenience, I created 3 profiles (plus one test profile) from which the user can choose:

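c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.Spawner.http_timeout = 120

# Each profile is a tuple: (display name, unique key, spawner class, spawner options).
# The keys ('gpu', 'cpu1', 'cpu12') are placeholders, and batchspawner.SlurmSpawner is
# assumed as the spawner class, matching the configuration above.
c.ProfilesSpawner.profiles = [
 ('Furiosa GPU node - 40 cores, 32 GB, 8 hours', 'gpu', 'batchspawner.SlurmSpawner',
   dict(req_nprocs='40', req_partition='gpu', req_runtime='8:00:00', req_memory='32000')),
 ('Furiosa CPU only node - 1 core, 16 GB, 8 hours', 'cpu1', 'batchspawner.SlurmSpawner',
   dict(req_nprocs='1', req_partition='batch', req_runtime='8:00:00', req_memory='16000')),
 ('Furiosa CPU node - 12 cores, 32 GB, 8 hours', 'cpu12', 'batchspawner.SlurmSpawner',
   dict(req_nprocs='12', req_partition='batch', req_runtime='8:00:00', req_memory='32000')),
 ( "Test server", 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip':''} ),
]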
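
Since a malformed profile list is easy to produce by hand, a small sanity check can help before starting the hub. The sketch below assumes the 4-tuple layout (display name, key, spawner class, options) described in the wrapspawner README; the helper check_profiles and the sample entries are hypothetical and not part of wrapspawner.

```python
# Sketch: sanity-check profile entries before handing them to jupyterhub.
# Assumes the 4-tuple layout (display name, key, spawner class, options);
# check_profiles and the sample data below are made up for illustration.
def check_profiles(profiles):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_keys = set()
    for i, entry in enumerate(profiles):
        if not isinstance(entry, tuple) or len(entry) != 4:
            problems.append(f"entry {i}: expected a 4-tuple")
            continue
        display, key, spawner_class, options = entry
        if key in seen_keys:
            problems.append(f"entry {i}: duplicate key {key!r}")
        seen_keys.add(key)
        if not isinstance(options, dict):
            problems.append(f"entry {i}: options must be a dict")
    return problems


good = [
    ("GPU node", "gpu", "batchspawner.SlurmSpawner", dict(req_partition="gpu")),
    ("Test server", "local", "jupyterhub.spawner.LocalProcessSpawner", {}),
]
bad = [("GPU node", dict(req_partition="gpu"))]  # key and spawner class missing

print(check_profiles(good))
print(check_profiles(bad))
```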

With this, you can run jupyterhub as easily as:

# jupyterhub -f /path/to/ --log-file=/path/to/jhub.log

And when I then open my browser and dial over to the IP which I configured, I see this after the account/password prompt:

A user would only see the “Start My Server” button here. When “Start My Server” is clicked, you land on this nice selection page, where you can choose the profile from which you want to spawn a job.

With this, a jupyter notebook opens after some small waiting time and you are good to go.

With this, we enabled interactive use of our GPU nodes without the participants having to learn the job scheduler (SLURM in our case) and allowed extensive experimentation during the workshop which aided learning tremendously.

Further, this allows more people to use the cluster resources and hence provides a higher return on investment with regard to scientific results. Mission accomplished!


Aug 29

A Deep Learning Infusion

Last week, the Deep Learning Bootcamp 2017 took place at the Center for Systems Biology in Dresden. It was a mind blowing experience – both from the intellectual and from the social point of view. If you want to get more direct impressions, see our twitter timeline.

We had two highly motivated instructors, Nico Hoffman (Uniklinikum Dresden) and Kashif Rasul (Zalando/Freie Universität Berlin), who were a perfect match for the course: they came with experience, curiosity and the talent to adapt their teaching ad hoc or even overnight if they felt it was needed. Our instructors met 33 curious learners from industry (e.g. automobile, network security, media) and academia from many disciplines (plasma physics, particle physics, economy, air and spacecraft engineering as well as a lot of life science domains). Our participants traveled to Dresden from as far as Paris, London, Italy and Zurich, as well as from many other cities in Germany.

We were fortunate enough to have wonderful sponsors that allowed us to put up the course on a reduced budget and yet provide all participants with food and drinks, supplemental material and (most importantly) t-shirts. That’s why we deeply thank all of our sponsors at this point: GCOE Dresden, Scionics Computer Innovation GmbH, Google Cloud, Nvidia, Zalando Research, Deep Learning Box, Cray and the German Network for Bioinformatics Infrastructure.

Here is an impression of our wonderful participants, instructors and speakers that made the event interactive and thriving as much as possible:

Last but not least, here is a screenshot of the anonymous feedback I received so far from 20 participants:

I hope the above provides enough evidence that deep learning is here to stay in Dresden. I will add one or two more blog posts in the coming days on some of the GPU-related findings we made during the workshop. So stay tuned!


Aug 23

Reproducible Performance on HPC | Affinity

Developing highly performant code for GPUs is a task with many pitfalls. And to run that code efficiently with a reproducible outcome is another. This part of our blog covers the pinning of tasks to cores within a multi-socket system.


Processor affinity, or CPU pinning, binds a process or thread to specific CPU cores, e.g., for efficient communication and caching. To make this more concrete, we take a non-uniform memory access (NUMA) architecture as an example: a dual-socket system with E5-2680 v3 CPUs. Running lstopo --no-io --no-legend haswell.png visualizes the topology of one of our Haswell nodes:

Haswell lstopo

On the top layer you can see the two NUMA nodes represented by the two CPU sockets, each with 32GB of DRAM. On such a socket there are 12 CPU cores, which have their own L1 and L2 caches. The L3 cache with 30 MB is shared by all the 12 CPU cores.

What we cannot see in the picture above is the QuickPath Interconnect (QPI) between the memory controllers in the two sockets (AMD's counterpart is HyperTransport). The QPI bus takes care of the inter-socket traffic as illustrated by the following picture (source: Frank Denneman).

NUMA Access with QPI

Depending on the clock rate, QPI delivers 20-25 GB/s of bandwidth (bidirectional). So what does this have to do with GPUs? GPUs are typically connected via PCIe to PCIe switches on the CPU die. Sharing data between non-adjacent GPUs incurs a QPI hop. This forwarding of PCIe packets reduces performance.

In current generations (IvyBridge) of Intel Xeon processor E5-2600 series CPUs, the generally really useful QPI link introduces significant latency when forwarding PCIe packets between processors due to the way the packets are buffered and forwarded. (source: The Cirrascale Blog)

This also has consequences for how you allocate the data and which devices you can use to avoid such a detour. You also have to know how your job manager enumerates the cores. If you start a job with 4 CPUs and 4 GPUs (K80) on a dual-socket system, you will probably receive this topology:

$ nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SOC     SOC     SOC     0-3
GPU1    PIX      X      SOC     SOC     SOC     0-3
GPU2    SOC     SOC      X      PIX     PHB     0-3
GPU3    SOC     SOC     PIX      X      PHB     0-3
mlx4_0  SOC     SOC     PHB     PHB      X


X    = Self
SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX  = Connection traversing a single PCIe switch
NV#  = Connection traversing a bonded set of # NVLinks

  • CPUs 0-3 are located on node 0 = first socket
  • CPUs 0-3 connect to GPU0-1 directly via PCIe
  • CPUs 0-3 connect to GPU2-3 by traversing the SMP link / QPI
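
Reading such a matrix by eye gets tedious on larger nodes, so here is a small sketch that parses the text of the matrix and lists the GPU pairs whose connection crosses the socket link. The sample text is the matrix from this post; the function names parse_topo and cross_socket_gpu_pairs are made up, and the label SOC is taken from the output shown here (other driver versions may use different labels).

```python
# Sketch: find GPU pairs that communicate across the CPU socket link.
# TOPO is the matrix printed above; the helper names are hypothetical.
TOPO = """\
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SOC     SOC     SOC     0-3
GPU1    PIX      X      SOC     SOC     SOC     0-3
GPU2    SOC     SOC      X      PIX     PHB     0-3
GPU3    SOC     SOC     PIX      X      PHB     0-3
mlx4_0  SOC     SOC     PHB     PHB      X
"""


def parse_topo(text):
    """Return {device: {other_device: link_type}} from the matrix text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    devices = header[:header.index("CPU")]  # drop the "CPU Affinity" column title
    matrix = {}
    for ln in lines[1:]:
        tokens = ln.split()
        name, links = tokens[0], tokens[1:1 + len(devices)]
        matrix[name] = dict(zip(devices, links))
    return matrix


def cross_socket_gpu_pairs(matrix, label="SOC"):
    """List GPU pairs whose link type equals the given cross-socket label."""
    gpus = sorted(d for d in matrix if d.startswith("GPU"))
    return [(a, b) for i, a in enumerate(gpus) for b in gpus[i + 1:]
            if matrix[a][b] == label]


for a, b in cross_socket_gpu_pairs(parse_topo(TOPO)):
    print(a, "<->", b, "crosses the socket link")
```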

The allocated CPU cores 0-3 are located on the first socket, while GPU2-3 are connected to the cores of the other socket. numactl helps to show and to control the NUMA policy and in our example numactl confirms the topology:

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3
cpubind: 0
nodebind: 0
membind: 0 1

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 32663 MB
node 0 free: 30981 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 32768 MB
node 1 free: 31819 MB
node distances:
node   0   1
0:  10  21
1:  21  10
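
If you want to use this information in a script, the core-to-node mapping can be extracted from the numactl -H text with a few lines of Python. This is only a sketch: the sample string is shortened from the output above, and the helper names cpus_per_node and node_of_cpu are made up.

```python
import re

# Shortened copy of the `numactl -H` output shown above.
NUMACTL_H = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 32663 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 32768 MB
"""


def cpus_per_node(text):
    """Return {node_id: [cpu ids]} parsed from `numactl -H` style text."""
    nodes = {}
    for match in re.finditer(r"^node (\d+) cpus:(.*)$", text, flags=re.MULTILINE):
        nodes[int(match.group(1))] = [int(cpu) for cpu in match.group(2).split()]
    return nodes


def node_of_cpu(nodes, cpu):
    """Return the NUMA node that owns the given CPU id, or None if unknown."""
    for node, cpus in nodes.items():
        if cpu in cpus:
            return node
    return None


nodes = cpus_per_node(NUMACTL_H)
print("CPU 3 lives on node", node_of_cpu(nodes, 3))
print("CPU 14 lives on node", node_of_cpu(nodes, 14))
```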

Control affinity with numactl

We now allocate a node exclusively, so that all GPUs and CPUs are visible. numactl controls process and memory placement by NUMA node ID.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 32663 MB
node 0 free: 20561 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
node 1 size: 32768 MB
node 1 free: 26344 MB
node distances:
node   0   1
0:  10  21
1:  21  10

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1

This is our slurm jobfile for the K80 to test the process pinning and the corresponding bandwidth (bandwidthTest from CUDA SDK).

#!/bin/bash
#SBATCH -J BandwidthK80Pinning
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --time=0:01:00
#SBATCH --mem=2000M # gpu2
#SBATCH --exclusive
#SBATCH --partition=gpu2

srun --gpufreq=2505:823 numactl -m0 -N0 ./bandwidthTest # socket0, memory0
srun --gpufreq=2505:823 numactl -m1 -N1 ./bandwidthTest # socket1, memory1

Results are given in the picture below.

cudaMemcpy Throughput on a Dual-Socket

The diagrams show what impact an ill-placed GPU job can have. The throughput on the SandyBridge/K20Xm is mostly limited by PCIe v2, and downloading might be affected by the QPI. While the Haswell/K80 loses 15% of its throughput, the Haswell/P100 loses over 80% when the pinned memory is transferred from the other CPU's memory over the QPI. It looks like another factor is killing the PCIe performance there. On the more recent Broadwell/P100 system, the throughput seems to be independent of the node pinning and reaches almost 13 GB/s. All these transfers excluded the memory used for ECC (20% memory footprint).

Take care of process pinning and node topology, especially when using older systems. To control the pinning you can use numactl or taskset. Some HPC job managers like Slurm also offer parameters for controlling the task affinity. Keep in mind that job managers can have different policies for distributing your tasks across the cores. The GPU-CPU binding can be found via nvidia-smi topo -m.
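
Pinning can also be done from inside a Python program, which is handy when the launcher cannot be wrapped in numactl or taskset. The sketch below uses os.sched_setaffinity from the standard library, which is only available on some platforms such as Linux. Note that it sets the CPU affinity only and does not bind memory the way numactl -m does; the choice of target CPU is a placeholder for this example.

```python
import os


def pin_to_cpus(cpus):
    """Pin the calling process to the given CPU ids and return the new affinity set."""
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):  # not available on every platform
    original = os.sched_getaffinity(0)  # CPUs we are currently allowed to use
    target = {min(original)}            # placeholder choice: lowest allowed CPU id
    pinned = pin_to_cpus(target)
    print("pinned to CPUs:", sorted(pinned))
    os.sched_setaffinity(0, original)   # restore the previous affinity
```

In a real job you would pick the target set from the cores that belong to the socket next to the GPU you use, for example based on the nvidia-smi topo -m output.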


Aug 08

Google coming to Dresden

As part of our “Deep Learning with Keras” workshop, we are happy to announce that Google will send two data engineers to teach us how to use GPUs in the Google Cloud for deep learning.

On August 24, 2017, there will be a three-hour workshop on using tensorflow in the Google Cloud. This is followed by a general Q&A session on how scientists can use the Google Cloud for research.

There are 50 seats available that will be distributed on a first-come, first-served basis. Only registered participants are admitted.

More details will become available here. To register, please use the event page.



Apr 26

Reproducible Performance | GPU boost

Developing highly performant code for GPUs is a task with many pitfalls. And to run that code efficiently with a reproducible outcome is another. This part of our blog covers the issues with runtime analysis under the GPU boost functionality on Kepler GPUs.
Read the rest of this entry »

