The Dresden GCoE and FZ Juelich put out a call for interested groups, and six teams and 12 mentors have now come to Dresden for the week to port their applications to GPUs or improve their existing GPU applications. The teams are:

Complex Brain Networks from Charite, Berlin |
The connectivity of the brain can be described by defining brain networks, which comprise regions (“nodes”) and interregional structural or functional connections (“edges”). Graph theory, the mathematical study of networks, provides a powerful and comprehensive formalism of global and local topological network properties of complex structural or functional brain connectivity. These applications of graph theory in neuroscience have led to significant advances in basic and clinical research over the past years. Our application, “GraphVar”, is a user-friendly software package for comprehensive graph-theoretical analyses of brain connectivity, including network construction and characterization, statistical analysis of network topological measures, network-based statistics, and interactive exploration of results. GraphVar consists of several thousand lines of code. It is currently built in MATLAB to run on the CPU, and on clusters in the MATLAB distributed computing environment. While the use of a scripting language allowed us to quickly add common graph-theoretical analysis pipelines, it also limits analyses to a few hundred graph nodes. In 2016, we therefore plan to extend and optimize GraphVar to allow the fast analysis of tens of thousands of nodes. Through the hackathon, we would like to explore the possibility of parallelizing our graph-theoretical routines via OpenACC. We believe the ease of use of OpenACC may allow us, neuroscientists without a formal background in computer science, to achieve significant speedups and to parallelize our (currently often recursive) codes. OpenACC may also allow us to run our routines in supercomputer and GPU environments. GraphVar was developed under the GNU General Public License v3.0 and can be downloaded at www.rfmri.org/graphvar or www.nitrc.org/projects/graphvar. Since its publication in early 2015, the application has been downloaded about two thousand times and applied in five peer-reviewed publications. 
We believe that the improvements outlined above would allow researchers to make much more detailed computations by reducing the need for aggregating data, and thus increase the reliability and generalizability of future graph theoretical studies. |
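To make the idea of a local topological measure concrete, here is a minimal sketch in Python that computes node degree and the local clustering coefficient from an adjacency matrix. It is not taken from GraphVar (which is written in MATLAB); the function names and the toy network are assumptions chosen for illustration only.

```python
def degree(adjacency, node):
    """Number of edges attached to a node (illustrative helper, not GraphVar code)."""
    return sum(1 for j, connected in enumerate(adjacency[node]) if connected and j != node)


def clustering_coefficient(adjacency, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = [j for j, connected in enumerate(adjacency[node]) if connected and j != node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention used here: undefined cases count as zero
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if adjacency[neighbours[a]][neighbours[b]]
    )
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: regions 0, 1, 2 form a triangle, region 3 hangs off region 0.
toy_network = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
degrees = [degree(toy_network, n) for n in range(4)]
clustering = [clustering_coefficient(toy_network, n) for n in range(4)]
```

Each node's value depends only on read-only access to the adjacency matrix, which is the kind of independence that makes per-node measures candidates for parallel execution.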

JuRor from FZ Juelich |
Real-time simulations of smoke propagation during fires in complex geometries are challenging. For the sake of computing time, accuracy suffers where precision could impact rescue decisions. To support fire fighters, we are building real-time simulation and prognosis software for smoke propagation in underground stations using GPUs. To provide the necessary data processing, sufficient computational capacity for CFD simulations, and instantaneous access, the computer system needs a minimum amount of main memory available at any time. Limitations in cost, power, cooling and space are further reasons for remote usage on clusters, cloud computing or installing a small GPU appliance on-site (e.g. in the fire station or truck). If the software is needed at different sites, OpenACC stands out with its performance portability, which reduces latency by running on local resources such as clusters or clouds (even on multiple CPU cores when GPUs are not available). Thanks to its interoperability with low-level programming models such as CUDA C/C++, this comes at no cost in efficiency. The code is currently at an early stage, in which we are evaluating which numerical methods best fit the target architecture (GPU). The weakly compressible Navier-Stokes equations are solved numerically using finite differences in a fractional step method with an orthogonal projection. The time marching scheme is split into three major steps: advection, diffusion and pressure calculation. The advection equation is solved using a semi-Lagrangian scheme, the diffusion equation solver is based on an implicit Jacobi scheme, and the pressure equation is handled by a (geometric) multigrid method. This stage of porting the code’s individual solution steps to GPUs is crucial for further development, as it will ensure the predictive potential with respect to the computational power. The code currently comprises about 2500 lines, is written in C++11 and does not depend on any external libraries. 
It will be open source once a beta version is available. The code development is funded by the German Federal Ministry of Education and Research (BMBF) and is part of the ORPHEUS project. The focused, high-level support during this workshop will give this project a great boost. By porting the code to GPUs with OpenACC, we envision it running in real time, and even faster than real time, to ensure its predictive potential. (04/16: The results of the GPU Hackathon can be read in their blog entry [German]) |
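The diffusion step mentioned above can be illustrated with a small sketch. It assumes a 1D grid, fixed boundary values and a backward-Euler discretisation, none of which is confirmed by the team's description; the function name `jacobi_diffuse` and the parameter `alpha` (standing for viscosity times time step divided by grid spacing squared) are hypothetical. The point is only that every cell update in a Jacobi sweep reads the previous iterate, so all cells can be updated independently.

```python
def jacobi_diffuse(u_old, alpha, iterations=50):
    """Approximately solve the implicit diffusion system
        (1 + 2*alpha) * u[i] - alpha * (u[i-1] + u[i+1]) = u_old[i]
    on the interior cells with Jacobi sweeps. The two boundary cells are held
    fixed (an assumption of this sketch, not of the JuRor code)."""
    u = list(u_old)
    n = len(u)
    for _ in range(iterations):
        u_next = list(u)
        # Each interior cell only reads the previous iterate u, never u_next,
        # so this loop could run as one parallel kernel.
        for i in range(1, n - 1):
            u_next[i] = (u_old[i] + alpha * (u[i - 1] + u[i + 1])) / (1.0 + 2.0 * alpha)
        u = u_next
    return u


# Hypothetical input: a single hot cell in the middle of a cold 1D domain.
smoothed = jacobi_diffuse([0.0, 0.0, 1.0, 0.0, 0.0], alpha=0.5)
```

For this small input the iteration converges to a peak value of 4/7 in the middle cell, which can be checked by solving the three-unknown linear system by hand.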

PALM4GPU from Institute of Meteorology and Climatology, Leibniz Universität Hannover |
PALM is a finite-difference large-eddy simulation (LES) model to study turbulent atmospheric and oceanic flows (palm.muk.uni-hannover.de). It is used both for basic research (e.g. turbulent transport above heterogeneous terrain, effect of turbulence on the growth of cloud droplets) and applied research questions (e.g. flow within and behind wind farms, effects of turbulence on airplanes during takeoff and landing, turbulent wind fields in cities). The flow model is coupled with a Lagrangian particle model (LPM). Our largest applications use up to 4000^3 grid points and run on about 10,000 CPU cores. For the LPM we have also used several billion particles. PALM currently consists of about 100,000 lines of Fortran 95-based code with some Fortran 2003 extensions. External libraries such as netCDF and FFTW are used. Parallelization is based on two-dimensional domain decomposition using MPI. Hybrid parallelization with MPI and OpenMP is also implemented. PALM shows excellent strong scaling, so far tested on up to 50,000 CPUs. The main parts have been ported to GPUs using OpenACC, with the data residing completely on the GPU, and in this form PALM was recently included in the SPEC ACCEL benchmark. However, the general performance gain we have achieved so far is disappointing (one K20 board performs about as well as a quad-core CPU) and we hope to improve it considerably during the workshop. Furthermore, as a next step we would like to use multiple GPUs with CUDA-aware MPI. We would also like to discuss the possibility of using GPU and CPU in parallel. PALM is under the GNU GPL (v3). It is currently used by more than 10 research groups worldwide, and the number of registered users is growing strongly (currently we have several hundred individual users). 
Besides a boost in performance on very large machines with thousands of GPUs (like the TSUBAME system at TokyoTech), we expect that an efficient OpenACC port of PALM will also make the code interesting for research groups with more limited computational resources, as well as for small and medium-sized businesses. |

Planet Hunters from the European Southern Observatory, Garching |
Our application is called PyVFit: it is a Python-based application designed to fit observational data collected from state-of-the-art radio telescopes like ALMA (in Chile) and JVLA (in New Mexico, USA). These radio telescopes are made of dozens of big antennas (30 meters diameter) that work synergistically as interferometers and, as a result, measure the Fourier Transform of the sky. Due to the nature of such observations, *any* kind of data modeling has to be done in Fourier space: this is the major reason for moving to GPUs to accelerate the computation. This application is the first and, to date, only one of its kind: it provides an architecture to fit interferometric observations at several wavelengths *simultaneously*, and it has proved fundamental in constraining the growth of dust particles in protoplanetary disks, witnessing the formation of planets (Tazzari et al. 2015, http://www.eso.org/~mtazzari/preprints/Tazzari_et_al_2015.pdf). The main architecture of the code is written in Python (it makes use of the numpy and scipy packages, plus other packages to perform the Bayesian analysis, e.g. emcee). We use compiled C and Fortran functions (wrapped with Cython and f2py, respectively) to carry out the most intensive computations (e.g. interpolation and sampling of Fourier Transforms). The application runs a Monte Carlo analysis that needs to compute a likelihood function approx. 2 million times. For a typical case, every likelihood call takes >= 50 seconds: (1) it computes a synthetic image of the protoplanetary disk (of size 4096×4096 pixels), (2) it takes several Fourier Transforms of the image (e.g. 4x), (3) it samples the resulting Fourier Transforms by interpolating at 4 million points. Steps (2) and (3) are the bottleneck of the computation (70% of the total computing time). We think that by moving steps (2) and (3) to GPUs we can achieve an enormous speed-up in the computation. 
We need very fast algorithms for matrix interpolation and Fourier Transforms: we know that GPUs are very efficient at these tasks, but we do not have the expertise to use them. In the coming years the image sizes will become even larger as the observations improve, which makes it crucial for our research field to find a way to accelerate the analysis. Our current code takes 8000 CPU hours to fit a 4-wavelength dataset, but this is a lower limit, as the incoming observations are growing in size. The code distributes the likelihood computations over multiple CPUs using MPI (implemented in the emcee Python package for Bayesian analysis), with extremely good scalability (tested up to 200 cores). Each MPI process computes 1 likelihood, so in principle it should be easy to make it exploit GPUs (GPUs do not need to be used concurrently). The whole radio astronomy field is moving in the direction of multi-wavelength analysis of observations. With PyVFit we have developed an architecture that is general and versatile and can be used not only in protoplanetary disk studies, but for *all* radio-astronomy studies that want to compare a model with observations. This analysis method has potentially revolutionary applications, but most of its success will depend on how fast the code is. Accelerating our code with OpenACC and GPUs has the potential to deliver a transformational analysis toolkit to a whole community of astronomers. The final aim is certainly to release the code to the scientific community under an Open Source license. The code is not yet public, mostly because it has been developed as a core part of the PhD of Marco Tazzari and needs some polishing before publication. The fitting architecture we are proposing is potentially open to an enormous number of applications in different fields of astronomy that make use of radio observations. 
Potential users of this architecture are thousands of researchers worldwide. This number will grow rapidly in the coming years, since the ALMA telescope is reaching Full Science operability. |
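Steps (2) and (3) of the likelihood call can be sketched on a toy scale as follows. This is not PyVFit code: the naive transform `dft2` stands in for a real FFT library only to keep the example dependency-free, and bilinear interpolation in `sample_bilinear` is an assumed scheme, since the description does not say which interpolation is used. Both function names are hypothetical.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a small square image.
    Stand-in for an FFT; cost grows as N^4, so only usable for toy sizes."""
    n = len(image)
    out = [[0j] * n for _ in range(n)]
    for ku in range(n):
        for kv in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    phase = -2.0 * math.pi * (ku * x + kv * y) / n
                    total += image[x][y] * cmath.exp(1j * phase)
            out[ku][kv] = total
    return out


def sample_bilinear(grid, u, v):
    """Bilinearly interpolate a periodic complex grid at fractional index (u, v).
    The interpolation scheme is an assumption of this sketch."""
    n = len(grid)
    u0, v0 = math.floor(u), math.floor(v)
    fu, fv = u - u0, v - v0
    i0, i1 = u0 % n, (u0 + 1) % n
    j0, j1 = v0 % n, (v0 + 1) % n
    return (
        (1 - fu) * (1 - fv) * grid[i0][j0]
        + fu * (1 - fv) * grid[i1][j0]
        + (1 - fu) * fv * grid[i0][j1]
        + fu * fv * grid[i1][j1]
    )


# Hypothetical 4x4 "image" of uniform brightness and two sampling points.
flat_image = [[1.0] * 4 for _ in range(4)]
spectrum = dft2(flat_image)
samples = [sample_bilinear(spectrum, u, v) for (u, v) in [(0.0, 0.0), (0.5, 0.0)]]
```

Every sampling point is evaluated independently of the others, which is why a workload of millions of such points maps naturally onto many GPU threads.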

Soft Matters! from the Institute for Theoretical Physics of the Georg-August-University Göttingen |
Our research group uses simulations of coarse-grained models to investigate problems in bio- and polymer physics. We investigate properties on a coarse level, such as the physical properties of cell membranes, polymer brushes and copolymer self-assembly. One of the techniques we apply is the single-chain-in-mean-field (SCMF) method, a Monte Carlo method that avoids the problem of finding neighbours in a particle-based simulation by using density fields; it is therefore very fast and scales roughly linearly with the number of particles. The non-bonded interactions are calculated via density fields, which are kept constant during one or more MC steps and updated afterwards. This particular model is perfectly suited to being ported to GPUs, as the Monte Carlo moves for all the chains do not depend on each other. Our group developed an MPI implementation of this model in C that runs on up to 32 cores with a reasonable speed-up. Its only dependencies are zlib and an MPI library. It runs on any system with a C compiler (tested on: IBM Power6, Intel x86, Intel x86-64, ARMv7, …) with OpenMPI, MPICH, MVAPICH2 or Intel MPI. The main version of the program has ~30,000 lines of code; at the workshop we intend to use a version that is stripped down to the core simulation and has ~800 lines of code. The Monte Carlo step routine we intend to port with OpenACC uses less than 50 lines of code. Our intention is to publish the code written during the hackathon under the new BSD (BSD-3) license. From profiling our program we know that by far the largest fraction of the time (>95%) is spent calculating MC moves. We estimate that an OpenACC port will result in a huge speed-up, as we can split the calculations of an MC step into hundreds of thousands of separate compute tasks (the number of polymer chains in a typical simulation), each of which requires the generation of ~100 random numbers and the evaluation of ~30 exponential functions. 
Our user community is our own research group and the associated groups. The SCMF simulations are also used in an EU FP7 project on self-assembly assisted computational lithography (colisa.mmp) and are integrated into a computational lithography package developed by the Fraunhofer Institute for Integrated Circuits (IISB) in Erlangen (Dr. Litho). If the speed-up is in line with our expectations, it is possible that this OpenACC port will be integrated into that package and thus be used by several companies and research institutes that are trying to develop the next generation of techniques for building microprocessors. As a research group we expect that an OpenACC port of our simulation software will enable us to investigate more complex questions about biomembranes and other polymer-based soft matter systems. |
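The structure that makes this method GPU-friendly can be sketched in a few lines. The example below is a heavily simplified assumption, not the group's C code: it uses a 1D periodic box, harmonic bonds, and takes the bead density itself as the interaction field. All function names are hypothetical. What it does reproduce is the pattern described above: the field is frozen during a sweep, so each chain can be advanced without looking at any other chain.

```python
import math
import random


def cell_index(pos, box):
    """Map a position to a cell of the periodic 1D density grid."""
    return int(math.floor(pos)) % box


def bead_energy(chain, k, pos, field, box):
    """Energy of bead k placed at pos: harmonic bonds plus the frozen field.
    Both energy terms are illustrative assumptions."""
    energy = field[cell_index(pos, box)]
    if k > 0:
        energy += 0.5 * (pos - chain[k - 1]) ** 2
    if k < len(chain) - 1:
        energy += 0.5 * (pos - chain[k + 1]) ** 2
    return energy


def mc_step_chain(chain, field, box, rng, step=0.5):
    """One Metropolis sweep over a single chain; it never reads another chain."""
    for k in range(len(chain)):
        trial = chain[k] + rng.uniform(-step, step)
        delta = bead_energy(chain, k, trial, field, box) - bead_energy(chain, k, chain[k], field, box)
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            chain[k] = trial


def density_field(chains, box):
    """Bead count per cell; used directly as the field in this sketch."""
    rho = [0.0] * box
    for chain in chains:
        for pos in chain:
            rho[cell_index(pos, box)] += 1.0
    return rho


def run(n_chains=4, n_beads=5, box=8, sweeps=3, seed=1):
    rng = random.Random(seed)
    chains = [[rng.uniform(0, box)] * n_beads for _ in range(n_chains)]
    for _ in range(sweeps):
        field = density_field(chains, box)  # frozen for the whole sweep
        for chain in chains:                # independent work items
            mc_step_chain(chain, field, box, rng)
    return chains


final_chains = run()
```

In a parallel port each chain would need its own random number stream rather than the single shared generator used here for brevity.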

Twitching Bugs from Max Planck Institute for the Physics of Complex Systems (MPI-PKS), Dresden |
For most bacteria, individual cells are not able to survive on their own. Instead, they increase their chances of survival by living in large colonies known as biofilms. Over the course of evolution, many processes have developed that facilitate the formation of these agglomerates. One prime example is pili: long, thin filaments that grow out of the membranes of individual cells. Bacteria use pili to attach to and move over surfaces and to find other cells. Pili work similarly to a grappling hook: they grow out, attach to a surface or to a pilus from another cell, and retract. Retracting pili can produce an extremely large pulling force (in fact the responsible molecular motor, pilT, is the strongest known in nature) that makes the cells move. In the proposed simulation we look at Neisseria gonorrhoeae, the bacterium that causes the second most common sexually transmitted disease, gonorrhea. The formation of microcolonies plays a fundamental role during the infection process of not only N. gonorrhoeae, but also many other pathogenic bacteria, such as those responsible for meningitis and cholera. Our goal is to explain the growth of N. gonorrhoeae colonies observed by our experimental collaborators (lab of Prof. Nicolas Biais, NY). A quantitative understanding of bacterial colony formation will help to develop new approaches for treating dangerous infections. We want to simulate up to 10,000 cells, each having 5-20 pili. Such a system can be modeled as a group of spheres (the cells) connected to each other via springs (the pili) of dynamic length. By iteratively summing up all forces and computing the displacements of the cells, we are able to create an in silico version of a bacterial microcolony. Right now the simulation of such a system (10^9 steps for the experimentally relevant time of 2-3 h) would take 100-1000 days as a serial computation without any parallelization. 
We hope that the GPU parallelization will allow us to perform these simulations in less than 10 days. The code consists of roughly 5000 lines and is written in C++. The GSL (GNU Scientific Library) is used for solving third-order polynomials and for random number generation. We will use the outcome of the simulations for scientific publications. After publishing all of our data we will make sure that the code is freely available to the public. This will be the first published simulation model of bacteria interacting via pili. We thus offer an important new tool to the scientific community that will be especially useful in the fields of microbiology and theoretical biophysics. |
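The "sum all forces, then displace the cells" loop can be sketched as below. This is a 2D toy version under stated assumptions: linear springs, overdamped motion (displacement proportional to force), and no excluded-volume or surface interactions. The names `spring_forces` and `step`, and the parameters `stiffness` and `mobility`, are hypothetical and not taken from the team's C++ code.

```python
import math


def spring_forces(positions, springs, stiffness=1.0):
    """Sum the forces on every cell from all attached springs (pili).
    Each spring is a tuple (i, j, rest_length); retraction of a pilus would
    correspond to shrinking rest_length over time (not modelled here)."""
    forces = [[0.0, 0.0] for _ in positions]
    for i, j, rest_length in springs:
        dx = positions[j][0] - positions[i][0]
        dy = positions[j][1] - positions[i][1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue  # coincident cells: direction undefined, skip
        magnitude = stiffness * (dist - rest_length)  # positive pulls cells together
        fx = magnitude * dx / dist
        fy = magnitude * dy / dist
        forces[i][0] += fx
        forces[i][1] += fy
        forces[j][0] -= fx
        forces[j][1] -= fy
    return forces


def step(positions, springs, dt=0.1, mobility=1.0):
    """Overdamped update (an assumption): displacement = mobility * force * dt."""
    forces = spring_forces(positions, springs)
    return [
        [x + mobility * dt * fx, y + mobility * dt * fy]
        for (x, y), (fx, fy) in zip(positions, forces)
    ]


# Hypothetical pair of cells joined by one stretched pilus.
cells = [[0.0, 0.0], [2.0, 0.0]]
pili = [(0, 1, 1.0)]
pair_forces = spring_forces(cells, pili)
moved = step(cells, pili)
```

The accumulation into `forces[i]` and `forces[j]` is the part that needs care in a parallel version, because several springs can update the same cell at once.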

Stay tuned to hear more about the progress of the teams!