Interactive Deep Learning on an HPC cluster

One of the key ingredients for the success of our Deep Learning Bootcamp last week was the availability of Jupyter notebooks on the GPU nodes of our HPC installation. This avoided many of the problems we saw in previous workshops, where the software ran on whatever hardware the participants brought along. At the same time, it allowed the learners to take their notebooks home.

The key ingredient, besides the GPU hardware and the availability of Theano, TensorFlow and Keras on these nodes, was JupyterHub. I started from a blog post by a fellow Software Carpentry instructor, Andrea Zonca of the San Diego Supercomputer Center, which is also worth a read if you are interested in what is possible beyond our minimal setup (think Docker Swarm).

For my use case, I wanted to run JupyterHub on the head node of our cluster, so I installed it with pip into our central software tree:

$ pip3 install --prefix=/sw/apps/jupyterhub/<version> jupyterhub

In addition, I used two JupyterHub plugins: batchspawner and ProfilesSpawner (the latter is part of the wrapspawner package). Both can be installed with:

$ git clone https://github.com/jupyterhub/<spawnername>
$ cd <spawnername>
$ pip3 install --prefix=/sw/apps/<spawnername>/<version> .

After adjusting the environment, i.e. PYTHONPATH and PATH (see the sketch below), everything is good to go. The last ingredient to get things rolling was to install the Node.js package configurable-http-proxy:

$ npm install -g --prefix /sw/apps/nodejs-modules/ configurable-http-proxy
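
For completeness, here is a minimal sketch of what that environment adjustment could look like. The exact site-packages directory depends on your Python version, so treat these paths as assumptions rather than a drop-in recipe:

# make the prefix installs visible to the shell that will launch jupyterhub
export PYTHONPATH=/sw/apps/jupyterhub/<version>/lib/python3.X/site-packages:$PYTHONPATH
export PYTHONPATH=/sw/apps/batchspawner/<version>/lib/python3.X/site-packages:$PYTHONPATH
export PYTHONPATH=/sw/apps/wrapspawner/<version>/lib/python3.X/site-packages:$PYTHONPATH
export PATH=/sw/apps/jupyterhub/<version>/bin:/sw/apps/nodejs-modules/bin:$PATH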

To configure JupyterHub, I first generated a config file with:

$ jupyterhub --generate-config
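
The generated jupyterhub_config.py contains commented-out defaults for every option. The first things to adjust are the core bits: authentication, the port, and the location of the sqlite database. A rough sketch of what I mean (the values are placeholders, and PAM authentication against the cluster's regular user accounts is an assumption on my part):

c = get_config()

# public-facing interface and port (served by the proxy)
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000

# authenticate against the system's user database via PAM
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# where the hub keeps its state and cookie secret
c.JupyterHub.db_url = 'sqlite:////path/to/jupyterhub.sqlite'
c.JupyterHub.cookie_secret_file = '/path/to/jupyterhub_cookie_secret'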

Beyond these core bits and details like log files, the most important parts of the configuration are the following:

c = get_config()
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
c.Spawner.http_timeout = 120
c.SlurmSpawner.req_nprocs = '1'
c.SlurmSpawner.req_runtime = '8:00:00'
c.SlurmSpawner.req_partition = 'gpu'

c.SlurmSpawner.start_timeout = 240

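# sbatch template used by batchspawner: {partition}, {runtime}, {nprocs}, {memory}
# and {options} are filled in from the req_* values above (or from the selected
# profile below), and {cmd} expands to the jupyterhub-singleuser command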
c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --cpus-per-task={nprocs}
#SBATCH --workdir=/home/{username}
#SBATCH --uid={username}
#SBATCH --mem={memory}
#SBATCH {options}

source /sw/env/mpi.sh
source /usr/share/Modules/init/bash
module load courses/env
module load cuda/8.0.61
module load hdf5/1.8.16
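# debugging aids: dump the environment and check that jupyterhub-singleuser is on the PATH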
env
which jupyterhub-singleuser
{cmd}
'''
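
With this in place, every notebook server is just another batch job and shows up in the queue like any other, for example when filtering by the job name set in the template above:

$ squeue --name=jupyterhub-spawner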

For convenience, I then switched the spawner class to wrapspawner's ProfilesSpawner and created three profiles (plus one test profile) from which the user can choose:

c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.Spawner.http_timeout = 120

c.ProfilesSpawner.profiles = [
 ('Furiosa GPU node - 40 cores, 32 GB, 8 hours',
  'furiosa-gpu',
  'batchspawner.SlurmSpawner',
   dict(req_nprocs='40', req_partition='gpu', req_runtime='8:00:00', req_memory='32000')
 ),
 ('Furiosa CPU only node - 1 core, 16 GB, 8 hours',
  'furiosa-1cpu',
  'batchspawner.SlurmSpawner',
   dict(req_nprocs='1', req_partition='batch', req_runtime='8:00:00', req_memory='16000')
 ),
 ('Furiosa CPU node - 12 cores, 32 GB, 8 hours',
  'furiosa-12cpu',
  'batchspawner.SlurmSpawner',
   dict(req_nprocs='12', req_partition='batch', req_runtime='8:00:00', req_memory='32000')
 ),
 ('Test server',
  'local',
  'jupyterhub.spawner.LocalProcessSpawner',
   {'ip': '0.0.0.0'}
 )
 ]

With this, you can run JupyterHub as easily as:

# jupyterhub -f /path/to/jupyterhub_config.py --log-file=/path/to/jhub.log
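
A quick sanity check that the hub and its proxy came up (assuming the default port 8000, as in the sketch above):

$ curl -sf http://localhost:8000/hub/login > /dev/null && echo "hub is up"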


When I then point my browser at the IP I configured and log in at the account/password prompt, I land on the hub's control panel.

A regular user only sees the “Start My Server” button there. Clicking it leads to a selection page where you can choose the profile from which to spawn a job.

After a short wait, a Jupyter notebook opens and you are good to go.

With this setup, we enabled interactive use of our GPU nodes without participants having to learn the job scheduler (SLURM in our case), and allowed for extensive experimentation during the workshop, which aided learning tremendously.

Further, this allows more people to use the cluster resources and hence provides a higher return on investment in terms of scientific results. Mission accomplished!

Permanent link to this article: https://gcoe-dresden.de/interactive-deep-learning-on-a-hpc-cluster/