TauREx Cluster:Cobweb
To run TauREx on Cobweb, you need to set a few environment variables and set up the relevant folders.
Folders
All your data needs to be hosted in /share/data/, as your home directory is relatively small. It is best to create a folder there with your name and keep all your data in it:
mkdir /share/data/[USERNAME]
Copy TauREx into this folder.
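As a quick sketch (the source location of your TauREx copy is just an assumption, adjust it to wherever your checkout actually lives), the setup could look like this:
## create your personal data folder
mkdir -p /share/data/[USERNAME]
## copy an existing TauREx checkout into it (source path is illustrative)
cp -r ~/TauREx /share/data/[USERNAME]/TauREx
cd /share/data/[USERNAME]/TauREx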
Required Modules
As we are currently upgrading some of the software, this may be subject to change. Currently, the following modules need to be loaded:
- anaconda/3-4.4.0
- lapack/3.8.0
- multinest/3.11
- mpi/openmpi/3.1.1-gcc7.2
- compiler/gcc/7.2.0
To see which modules are available:
module avail
To see which ones are currently loaded:
module list
To load a new module:
module load [MODULENAME]
Similarly, to unload a module:
module unload [MODULENAME]
It is best to add the module loading to your .bashrc file. Alternatively, you can load the modules in your submit script. To append a module load command to your .bashrc:
echo 'module load [MODULENAME]' >> ~/.bashrc
(Note the >> to append; a single > would overwrite your existing .bashrc.)
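Putting the module list from above together, appending all currently required modules to your .bashrc looks like this:
## append the required modules to ~/.bashrc so they are loaded at every login
echo 'module load anaconda/3-4.4.0' >> ~/.bashrc
echo 'module load lapack/3.8.0' >> ~/.bashrc
echo 'module load multinest/3.11' >> ~/.bashrc
echo 'module load mpi/openmpi/3.1.1-gcc7.2' >> ~/.bashrc
echo 'module load compiler/gcc/7.2.0' >> ~/.bashrc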
Submit Script
To run the code, you need to submit it to a scheduler, which will run the program on your behalf. On Cobweb we use the Torque scheduler. An example submit script is given below. The script should have a .sh file extension.
#! /bin/bash
#PBS -S /bin/bash
#################################
## Script to run Tau-REx on Cobweb
##
## DATADIR : data directory (e.g. /share/data/taurex)
## SCRATCHDIR : temporary scratch dir. gets deleted at the end
## RUNDIR : same as SCRATCHDIR but can include subfolders
## OUTDIR : final results folder (DATADIR=OUTDIR usually)
##
## NP : number of total cpus, i.e. nodes x ppn
##################################
## Name of job displayed in the queue
#PBS -N [JOBNAME]
## Queue to submit to: gpu/compute/test
#PBS -q compute
##run parameters
##Maximum runtime: HH:MM:SS
#PBS -l walltime=30:00:00
##NODES/CPUs required
#PBS -l nodes=1:ppn=24
##Memory requirement
#PBS -l mem=20gb
##error output and path variables
#PBS -j oe
#PBS -V
##Number of total cores for mpirun command
NP=24
##setting up path
DATADIR=/share/data/ingo/repos/TauREx
cd $DATADIR
##run job
mpirun -np $NP -x PATH -x LD_LIBRARY_PATH --machinefile $PBS_NODEFILE python taurex.py -p [PARAMETER FILE] --plot
You need to set several parameters here:
#PBS -q compute
There are currently three queues on Cobweb: compute, gpu and test; compute should be your standard queue. The compute queue currently has 176 cores available, distributed across six nodes with 24 cores each and one node with 32 cores.
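If you want to check the queues yourself, the standard Torque client tools provide an overview of all queues and their limits (assuming they are on your path, as on the login node):
qstat -q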
#PBS -l walltime=05:00:00
This is the maximum runtime allowed. You can set it as long as you want, but please don't tie up too many nodes with jobs that run for days. Use Legion/Grace/DiRAC for that.
#PBS -l nodes=1:ppn=24
This sets the number of nodes required and the number of processors per node. TauREx runs best if you keep this at nodes=1:ppn=24, as inter-node latency affects MultiNest.
#PBS -l mem=20gb
RAM requirement per node. Each node currently has a maximum of 256 GB of RAM.
QSUB/QDEL/QSTAT
Once you have your submit script, you're ready to submit it to Cobweb. You can do this with:
qsub [SCRIPTNAME.sh]
This will send it to the scheduler, and you can see your job listed in the queue using:
qstat
Once the required resources become available, your job will be executed. If you need to cancel a job:
qdel [JOBID]
If you need to cancel all your jobs:
qdel all
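A typical interaction looks like this (the script name and job ID are only illustrative):
## submit the job; the scheduler returns a job ID such as 12345.cobweb
qsub run_taurex.sh
## list only your own jobs
qstat -u [USERNAME]
## cancel that particular job
qdel 12345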
Troubleshooting
Missing python libraries
You may find that you are missing Python libraries but cannot install them with conda or pip. You can easily install any Python package locally in your home directory using:
pip install --user [PACKAGE NAME]
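As a sketch (the package name here is only an example of something that might be missing), installing and verifying a local package could look like:
## install into ~/.local instead of the system site-packages
pip install --user corner
## check that Python picks the package up from your home directory
python -c 'import corner; print(corner.__file__)'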
Running on fewer than 24 cores per node
If you run a job that needs less than a full node, you may have to change your submit script slightly. Say you want to run 3 jobs with 8 cores each. You may need to limit the number of PSM contexts: there are 16 contexts per node, so with 3 jobs you could use up to 5 contexts per job (i.e. 15 contexts in total). Generally speaking, it's easiest to divide your number of cores by two to get the number of contexts.
Contexts can be set with the environment variable:
export PSM_RANKS_PER_CONTEXT=4
and export the variable to openmpi with
mpirun -x PSM_RANKS_PER_CONTEXT ...
An example submit script is given below:
#! /bin/bash
#PBS -S /bin/bash
#################################
## Script to run Tau-REx on Cobweb
##
## DATADIR : data directory (e.g. /share/data/taurex)
## SCRATCHDIR : temporary scratch dir. gets deleted at the end
## RUNDIR : same as SCRATCHDIR but can include subfolders
## OUTDIR : final results folder (DATADIR=OUTDIR usually)
##
## NP : number of total cpus, i.e. nodes x ppn
##################################
## Name of job displayed in the queue
#PBS -N [JOBNAME]
## Queue to submit to: gpu/compute/test
#PBS -q compute
##run parameters
##Maximum runtime: HH:MM:SS
#PBS -l walltime=30:00:00
##NODES/CPUs required
#PBS -l nodes=1:ppn=8
##Memory requirement
#PBS -l mem=20gb
##error output and path variables
#PBS -j oe
#PBS -V
##Number of total cores for mpirun command
NP=8
##Limit the number of InfiniBand (PSM) contexts per node. Max. of 16 contexts per node
export PSM_RANKS_PER_CONTEXT=4
##setting up path
DATADIR=/share/data/ingo/repos/TauREx
cd $DATADIR
##run job
mpirun -np $NP -x PATH -x LD_LIBRARY_PATH -x PSM_RANKS_PER_CONTEXT --machinefile $PBS_NODEFILE python taurex.py -p [PARAMETER FILE] --plot
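With the settings above, each 8-core job should occupy 8 / 4 = 2 PSM contexts (assuming PSM_RANKS_PER_CONTEXT=4 means four MPI ranks share one context), so three such jobs use 6 of the 16 contexts available on a node, comfortably within the limit.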