Setting up singularity
For HPC facilities supporting singularity containers, we have provided an ubuntu environment with tensorflow installed. Singularity can be obtained here and the pre-configured image can be downloaded here. Download the file ubuntu.simg, which is your ubuntu singularity image containing CUDA 9.0 and tensorflow.
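Once downloaded, you can sanity-check the file (a quick check, assuming singularity is already available on your machine; on Cobweb it is provided by the module loaded below):
singularity --version
singularity inspect ubuntu.simg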
Setting up singularity on Cobweb
The following instructions are very UCL specific but are likely helpful in the general case.
Cobweb has two GPU nodes available with the following specifications:
- Node 1: NVIDIA K40, 20 CPUs, 256 GB RAM (tensorflow only supported via singularity)
- Node 2: NVIDIA V100, 32 CPUs, 256 GB RAM (both singularity and tensorflow module support)
To start, move the singularity image to your personal folder, e.g. /share/data/[USER]/singularity
and navigate to the folder containing your image.
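For example, assuming the image was downloaded to your home directory (a sketch; substitute your own username for [USER]):
mkdir -p /share/data/[USER]/singularity
mv ~/ubuntu.simg /share/data/[USER]/singularity/
cd /share/data/[USER]/singularity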
Before proceeding, make sure that you have no local copies of tensorflow installed in your ~/.local
folder. To do this, load an anaconda module
module load anaconda/3-4.4.0
if you have been using python 3.x previously, or one of the python 2 equivalents otherwise. Then make sure you uninstall all tensorflow instances in ~/.local
with
pip uninstall tensorflow
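A quick way to confirm that nothing is left behind (the exact site-packages path depends on your python version; no output means you are clean):
pip list | grep -i tensorflow
ls ~/.local/lib/python*/site-packages/ | grep -i tensorflow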
Next, log into one of the GPU nodes with an interactive shell. For the V100:
qsub -I -q gpu -l nodes=1:v100
and for the K40 node:
qsub -I -q gpu -l nodes=1:k40
Once on the node, load the singularity module:
module load singularity/2.5.2
Navigate to the folder in which you have stored the ubuntu.simg
image.
You can then launch a shell inside the singularity image with the following command
singularity shell -B /share/data:/mnt --nv ubuntu.simg
where -B mounts the /share/data directory to /mnt inside the container, and the --nv flag exposes the host's GPU drivers and hardware to the CUDA libraries inside the image.
Once inside the image, you will find that your home directory has been mounted and your data is available under /mnt. The image is essentially a stripped-down Ubuntu operating system with both anaconda and tensorflow distributions pre-installed.
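You can verify both bindings from inside the image: ls should show the contents of /share/data, and nvidia-smi should report the node's GPU:
ls /mnt
nvidia-smi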
Now enter the anaconda tensorflow environment with:
source activate tensorflow
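At this point it is worth checking that tensorflow actually sees the GPU (a minimal check, assuming the image ships a tensorflow 1.x build, consistent with CUDA 9.0; it should print True):
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"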
In principle you can now run ExoGAN by navigating to the folder containing ExoGAN and running the code in the interactive shell. Once this works, we can set up a launch script to run singularity using the Torque scheduler.
Troubleshooting
You may find that pylab
or other python modules are missing when running ExoGAN or similar codes. These can be easily installed inside the image using pip:
pip install --user matplotlib scipy numpy
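A quick import test confirms that the newly installed modules are picked up:
python -c "import matplotlib, scipy, numpy; print('ok')"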
Running singularity using the scheduler
Whilst running ExoGAN in an interactive shell is great, the code requires significant training time. It is hence useful to use the torque
scheduler instead. For this we need to set up two bash scripts: the launch script that gets executed inside singularity and the scheduler script that instructs the scheduler to start singularity.
Launcher script
The following script summarises exactly the same steps as above. Place the script in the ExoGAN directory, obviously replacing the path in the script. It will get called together with singularity later. For reference we name this script singularity_launch_gan.sh
here (but you can call it anything of course…).
#! /bin/bash
#load the anaconda tensorflow environment
source activate tensorflow
#navigate to GAN directory
cd /mnt/[PATH TO WHERE ExoGAN is put]
#run the code
python gan.py
Make sure that it is executable with
chmod +x singularity_launch_gan.sh
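Before involving the scheduler, you can test the launcher end-to-end from the folder containing the image (a sketch using singularity exec, which runs a given command inside the image instead of opening an interactive shell):
singularity exec -B /share/data:/mnt --nv ubuntu.simg /mnt/[PATH TO ExoGAN]/singularity_launch_gan.sh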
Scheduler script
This is a more standard torque
submit script. To run your code on the V100
node:
#! /bin/bash
#PBS -S /bin/bash
#################################
## Script to run ExoGAN on Cobweb using singularity
##################################
## Name of job displayed in the queue
#PBS -N [JOBNAME]
## Queue to submit to: gpu/compute/test
#PBS -q gpu
##run parameters
##Maximum runtime: HH:MM:SS
#PBS -l walltime=30:00:00
##NODES required
#PBS -l nodes=1:v100
##Memory requirement
#PBS -l mem=100gb
##merge stderr into stdout (-j oe) and export the current environment (-V)
#PBS -j oe
#PBS -V
##setting up path
DATADIR=[PATH TO SINGULARITY IMAGE]
cd $DATADIR
##run singularity
singularity shell -B /share/data:/mnt --nv ubuntu.simg /mnt/[PATH TO ExoGAN]/singularity_launch_gan.sh
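Depending on the singularity version, the shell subcommand may ignore the trailing script argument; if your job starts but the launcher never runs, replace shell with exec on the last line (as in the interactive test above).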
To change to the NVIDIA K40 node, change the script to
##NODES required
#PBS -l nodes=1:k40
Once you have your submit script, you're ready to submit it to Cobweb. You can do this with
qsub [SCRIPTNAME.sh]
This will send it to the scheduler and you can see your job listed in the queue using
qstat
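To list only your own jobs rather than the whole queue:
qstat -u [USER]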
Once the required resources become available, your job will be executed. If you need to cancel a job:
qdel [JOBID]
If you need to cancel all your jobs:
qdel all