

Slurm on the Maris Cluster

All maris nodes have been configured to use slurm as a workload manager, and its use is enforced on every node: direct access to any node other than the headnode `novamaris' is not allowed.

Maris' slurm has been configured to manage consumable resources, such as CPUs and RAM, and generic resources (GPUs) using cgroups.

A snapshot of the cluster usage can be found at http://slurm.lorentz.leidenuniv.nl/ (only accessible within the IL workstations network).

Maris runs SLURM v17.02.


Accounting

The Maris accounting scheme has been set up such that each principal investigator (PI) at IL has their own slurm account. Collaborators, postdocs, PhD and master students associated with a PI submit jobs to the cluster under that PI's slurm account. The account is specified via the option -A or --account to srun, sbatch, etc.
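For instance, assuming your PI's slurm account is called pi_account (a placeholder for the real account name), a batch job would be submitted as

sbatch --account=pi_account myjob.sh

or, equivalently, sbatch -A pi_account myjob.sh.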

Accounting allows system managers to track cluster usage in order to improve services, and it enables the assignment of different powers/priorities to different users. At the moment all accounts but `guests' have been granted the same privileges, shares and priorities. This could however change in the future if deemed necessary by the cluster owners.

Account information for a given <username> can be displayed using

sacctmgr list associations cluster=maris user=<username> format=Account,Cluster,User,Fairshare

If no results are returned, please contact support.

Similarly, if you encounter the following error message upon submission of your batch job

error: Unable to allocate resources: Invalid account or account/partition combination specified

please make sure that you have specified the account associated with your user name.

Available partitions and nodes

Available partitions and their configurations can be displayed by typing `sinfo' with the appropriate options (see man sinfo).
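For example, the following invocation (one possible format string among many; see the FORMAT options in man sinfo) lists each partition together with its time limit, node count, and the CPUs and memory per node:

sinfo -o "%P %l %D %c %m"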

A similar type of information plus all jobs currently in a queue can be seen using the GUI program sview. Besides displaying cluster information, sview lets users manage their jobs via a GUI.

Maris partitions

`playground' is the default execution partition if a different partition name is not specified.

Name         CPUs  Memory    Default Job Mem, Time  GRES   Nodes                      # Nodes  MaxCPUsPU  MaxJobsPU  Max JobTime  QOS         Access
playground   288   851832M   all                           maris0[04-22,29-33,35-46]  36                                          playground  all
notebook     48    193044M   all                           maris0[23-28]              6        4          1                       notebook    all
computation  1552  6578050M  400M, 3 days                  maris0[47-74]              28                                          normal      all
compintel    192   1030000M  400M, 1 day                   maris0[76-77]              2                               3 days      normal      beenakker
ibintel      192   1030000M  400M, 1 day                   maris078                   1                               10 days     normal      beenakker
emergency    384   2773706M  all                           maris0[69-74]              6                                           normal      NOBODY
gpu          56    256000M   400M, 3 days           2 gpu  maris075                   1                                           normal      all

The `playground' partition should be used for test runs.

The `notebook' partition has been set up specifically for our `jupyterhub' users. Nonetheless, you are free to use it at your own convenience.

The `emergency' partition is a special partition to be used in very rare and urgent circumstances. Access to it must be granted by maris' administrators. Jobs scheduled to run on this partition will preempt (suspend) any running jobs should it be necessary.

The `gpu' partition should be used only for jobs requiring GPUs. Note that GPUs must be requested from slurm explicitly, using for instance --gres=gpu:1.

The `computation' and `compintel' partitions should be used for production runs.

Maris QOS

A quality of service (QOS) can be associated with a job, user, partition, etc… and can modify

  • priorities
  • limits

Maris uses the concept of QOS to impose usage limits on the `notebook' and `playground' partitions and on users belonging to the `guests' account.

To display all defined QOS, use sacctmgr:

#sacctmgr show qos format=Name,MaxCpusPerUser,MaxJobsPerUser,GrpNodes,GrpCpus,MaxWallDurationPerJob,Flags
      Name MaxCPUsPU MaxJobsPU GrpNodes  GrpCPUs     MaxWall                Flags 
---------- --------- --------- -------- -------- ----------- -------------------- 
    normal                                                                        
playground                                                             
  notebook         4         1                                        DenyOnLimit 
    guests       128                                                  DenyOnLimit 

 

Any user can submit jobs specifying a QOS via the option --qos; however, be aware that a partition's QOS will override the job's QOS.
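For instance, the syntax is simply (myjob.sh is a placeholder for your batch script):

sbatch --qos=normal --partition=computation myjob.sh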

Slurm job priority on Maris

Maris' slurm uses the multifactor priority plugin. All jobs submitted to the slurm queue and waiting to be scheduled are ordered according to the following `multi' factors:

  • Age
  • Fairshare
  • Job size and TRES
  • Partition
  • QOS

Furthermore, each of the factors above influences a job's execution order to a greater or lesser extent according to a pre-defined number called its weight.

In summary, Maris' slurm has been set up such that:

  • small jobs are given high priority.
  • jobs submitted to the playground partition have higher priority.
  • QOS have no influence on job priority.
  • fairshare is an important factor when ordering the queue.
  • after a wait of 7 days, a job will be given the maximum Age factor weight.
  • fairshare is only based on the past 14 days. That is, usage decays to 0 within 14 days.

The relevant configuration options can be displayed via the command scontrol show config | grep -i prior.
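To see how these factors add up for jobs waiting in the queue, you can also use slurm's sprio utility, for instance

sprio -l

which prints the individual priority components (age, fairshare, job size, partition, QOS) of each pending job.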

Using GPUs with slurm on Maris

maris075 is the only GPU node in the maris cluster. GPUs are configured as generic resources or GRES. In order to use a GPU in your calculations, you must explicitly request it as a generic resource using the --gres option supported by the salloc, sbatch and srun commands. For instance, if you are submitting a batch script to slurm, use the format #SBATCH --gres=gpu:tesla:1 to request one GPU.

:!: Please note that on maris GPUs are configured as non-consumable generic resources (i.e. multiple jobs can use the same GPU).
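A minimal batch script requesting one GPU could look as follows (a sketch; the program name is a placeholder):

#!/bin/env bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:tesla:1
 
srun ./my_gpu_program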

To compile your cuda application on maris using slurm, note that in your submission script you might have to export the libdevice library path and include the path in which the cuda headers can be found, for instance

#!/bin/env bash
....
NVVMIR_LIBRARY_DIR=/usr/share/cuda /usr/bin/nvcc -I/usr/include/cuda my_code.cu

slurm and MPI

OpenMPI on the maris cluster supports launching parallel jobs via all three methods provided by SLURM:

  • with salloc
  • with sbatch
  • with srun

Please read https://www.open-mpi.org/faq/?category=slurm and https://slurm.schedmd.com/mpi_guide.html

In principle, to run an MPI application you could just execute it using mpirun, as shown in the session below

novamaris$ cat slurm_script.sh
#!/bin/env bash
mpirun mpi_app.exe
novamaris$ sbatch -N 4 slurm_script.sh
Submitted batch job 1234
novamaris$

However, it is highly advised that you use slurm's srun to launch a parallel job in all circumstances.

novamaris$ cat slurm_script.sh
#!/bin/env bash
srun mpi_app.exe
novamaris$ sbatch -N 4 slurm_script.sh
Submitted batch job 1234
novamaris$

At the moment maris supports only OpenMPI with slurm, so you are required to load a particular openmpi/slurm module to get things to work, for instance

# load openMPI
module load openmpi-slurm/2.0.2
 
# run on 1 node using 3 CPUs
srun -n 3 <your_MPI_program>
 
# run on 4 nodes using 4 CPUs
srun -N 4 -n 4 mpi_example
 
# if job is multithreaded and requires more than one CPU per task
srun -c 4 mpi_example

:!: module load openmpi-slurm/2.0.2 will be successful only if no other modules that implement MPI are loaded. In other words, unload any other MPI module and then load openmpi-slurm.
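For instance, a clean sequence would be (mpich/3.2 is only an example name for another MPI module that might be loaded):

module list                      # inspect currently loaded modules
module unload mpich/3.2          # unload any other MPI implementation
module load openmpi-slurm/2.0.2  # load the slurm-aware openMPI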

:!: Any application that uses MPI with slurm must be compiled against the MPI library provided by the openmpi-slurm module, otherwise it will behave erratically.

module: openmpi-slurm

It includes a version of openMPI built with slurm support. It also includes MPI-enabled FFTW.
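Once the module is loaded, compile your application with the wrapper compilers shipped with openMPI, for instance (file names are placeholders):

module load openmpi-slurm/2.0.2
mpicc -o mpi_app.exe mpi_app.c   # mpif90 for Fortran codes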

Example batch script

Whenever writing a batch script, users are HIGHLY advised to explicitly specify resources (mem, cpus, etc…).

maris offers a helper program to get users started with their first batch script: just type `swizard' and follow the instructions on the screen.

Batch scripts come in handy when you have several options you would like to pass to slurm. Instead of using a very long command line, you can create a batch script and submit it using `sbatch'. An example batch script is given below:

#!/bin/env bash
##comment out lines by adding at least two `#' at the beginning
#SBATCH --job-name=lel-rAUtV
#SBATCH --account=wyxxl
#SBATCH --partition=computation
#SBATCH --output=/home/lxxx/%x.out
#SBATCH --error=/home/lxxx/%x.err
#SBATCH --time=1-00:00:00
#SBATCH --mem=400
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=1
 
module load openmpi-slurm/2.0.2
 
srun a.out

Example: how to use a node's scratch disks

How can I transfer my files to a node's local scratch space so that they can be used in my slurm jobs?

slurm provides the command sbcast to make single files available to allocated jobs. But how do you achieve the same for directories or multiple files?
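For a single file, a line like the following inside a batch script would do (file and destination paths are placeholders; sbcast copies the file to every node allocated to the job):

sbcast my_input.dat /data1/$USER/my_input.dat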

Consider the batch script below

$ cat slurmcp.sh
#!/bin/env bash
#SBATCH -N 1
#SBATCH --nodelist=maris066
 
# note that #SBATCH directives must precede any executable command
DEFAULT_SOURCE=${SLURM_SUBMIT_DIR}
SOURCE=${1:-$DEFAULT_SOURCE}
 
SCRATCH=/data1/$USER/$SLURM_JOB_ID
 
srun mkdir -p ${SCRATCH} || exit $?
# note that srun cp is equivalent to a loop over each allocated node copying the files
srun cp -r ${SOURCE}/*  ${SCRATCH} || exit $?
 
# now do whatever you need to do with the local data
 
# do NOT forget to remove data that are no longer needed
srun rm -rf ${SCRATCH} || exit $?

and its invocation sbatch slurmcp.sh /tmp/mydir.

Example: instruct slurm to send emails upon job state changes

Slurm can be instructed to email any job state changes to a chosen email address. This is accomplished using the --mail-type option of sbatch, for instance

...
#SBATCH --mail-user=myemail@address.org
#SBATCH --mail-type=ALL
...

If the --mail-user option is not specified, emails will be sent to the submitting user.

In the event of a job failure (exit status different from zero), maris will include in the notification email a few lines from the job's stderr. Please note that this feature only works if the job's stdout and stderr were not specified using filename patterns, that is, names including % characters as described in man sbatch.

Python Notebooks on maris

We have set up a jupyterhub environment that uses the slurm facilities to launch users' notebooks. Please refer to JupyterHub with Slurm on Maris.

Notes

:!: ssh-ing from novamaris to a maris compute node produces a top-like output.
