All maris nodes have been configured to use slurm as a workload manager, and its use is enforced on all nodes. Direct access to any node other than the headnode `novamaris' is not allowed.
Maris' slurm has been configured to manage consumable resources, such as CPUs and RAM, and generic resources (GPUs) using cgroups.
A snapshot of the cluster usage can be found at http://slurm.lorentz.leidenuniv.nl/ (only accessible within the IL workstations network).
Maris runs SLURM v17.02.
Suggested readings:
Maris' accounting scheme has been set up such that each principal investigator (PI) at IL has their own slurm account. Collaborators, postdocs, PhD and master students associated with a PI use that PI's slurm account to submit jobs to the cluster. This is achieved using the option -A (or --account) to srun, sbatch, etc.
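As a sketch (the account name `pi_account' and the script `myjob.sh' are placeholders, not real names on maris):

```shell
# submit a batch job charged to your PI's slurm account (long and short forms)
sbatch --account=pi_account myjob.sh
sbatch -A pi_account myjob.sh
# the same option works for interactive allocations
srun -A pi_account --pty bash
```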
Accounting allows system managers to track cluster usage to improve services, and enables the assignment of different powers/priorities to different users. At the moment all accounts but `guests' have been granted the same privileges, shares and priorities. This could however change in the future if deemed necessary by the cluster owners.
Account information for a given <username> can be displayed using
sacctmgr list associations cluster=maris user=<username> format=Account,Cluster,User,Fairshare
If no results are returned, then please contact support.
Similarly, if you encounter the following error message upon submission of your batch job

error: Unable to allocate resources: Invalid account or account/partition combination specified

then please make sure that you have specified the account associated with your user name.
Available partitions and their configurations are available by typing `sinfo' with the appropriate options (use man sinfo).
A similar type of information, plus all jobs currently in a queue, can be seen using the GUI program sview. Besides displaying cluster information, sview lets users manage their jobs via a GUI.
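For instance (the format string is up to you; these are standard sinfo format specifiers):

```shell
# show each partition with its time limit, node count and node state
sinfo -o "%P %l %D %t"
# node-oriented long listing: which nodes belong to which partition
sinfo -Nel
```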
`playground' is the default execution partition if a different partition name is not specified.
Name | CPUs | Memory | Default Job Memory, Time | GRES | Nodes | # Nodes | MaxCPUsPU | MaxJobsPU | Max JobTime | QOS | Access |
---|---|---|---|---|---|---|---|---|---|---|---|
playground | 288 | 851832M | all | | maris0[04-22,29-33,35-46] | 36 | | | | playground | all |
notebook | 48 | 193044M | all | | maris0[23-28] | 6 | 4 | 1 | | notebook | all |
computation | 1552 | 6578050M | 400M, 3 days | | maris0[47-74] | 28 | | | | normal | all |
compintel | 192 | 1030000M | 400M, 1 day | | maris0[76,77] | 6 | | | 3 days | normal | beenakker |
emergency | 384 | 2773706M | all | | maris0[69-74] | 6 | | | | normal | NOBODY |
gpu | 56 | 256000M | 400M, 3 days | 2 gpu | maris075 | 1 | | | | normal | all |
The `playground' partition should be used for test runs.
The `notebook' partition has been set specifically for our `jupyterhub' users. Nonetheless, you are free to use it at your own convenience.
The `emergency' partition is a special partition to be used in very rare and urgent circumstances. Access to it must be granted by maris' administrators. Jobs scheduled on this partition will preempt (pause) running jobs if necessary.
The `gpu' partition should be used only for jobs requiring GPUs. Note that GPUs must be requested from slurm explicitly, using for instance --gres=gpu:1.
The `computation' and `compintel' partitions should be used for production runs.
A quality of service (QOS) can be associated with a job, user, partition, etc., and can modify a job's priority, allow it to preempt other jobs, and impose resource limits.
Maris uses the concept of QOS to impose usage limits on the `notebook', `playground' partitions and users belonging to the `guests' account.
To display all defined QOS use sacctmgr:

#sacctmgr show qos format=Name,MaxCpusPerUser,MaxJobsPerUser,GrpNodes,GrpCpus,MaxWallDurationPerJob,Flags
      Name MaxCPUsPU MaxJobsPU GrpNodes  GrpCPUs     MaxWall                Flags
---------- --------- --------- -------- -------- ----------- --------------------
    normal
playground
  notebook         4         1                                        DenyOnLimit
    guests       128                                                  DenyOnLimit
Any user can submit jobs specifying a QOS via the option --qos; however, be aware that a partition's QOS will override the job's QOS.
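For example (assuming your account grants access to the `normal' QOS; `myjob.sh' is a placeholder):

```shell
# request a specific QOS; a partition QOS, if set, still takes precedence
sbatch --qos=normal -p computation myjob.sh
```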
Maris' slurm uses the multifactor priority plugin. All jobs submitted to the slurm queue and waiting to be scheduled are ordered according to these `multi' factors. Furthermore, each of the factors above influences a job's execution order to a greater or lesser degree according to a pre-defined number called its weight.
In summary, Maris' slurm has been set up such that, among other things, jobs submitted to the `playground' partition have higher priority.
The relevant configuration options can be displayed via the command scontrol show config | grep -i prior.
Maris075 is the only GPU node in the maris cluster. GPUs are configured as generic resources or GRES. In order to use a GPU in your calculations, you must explicitly request it as a generic resource using the --gres option supported by the salloc, sbatch and srun commands. For instance, if you are submitting a batch script to slurm, then use #SBATCH --gres=gpu:tesla:1 to request one GPU.
Please note that on maris GPUs are configured as not-consumable generic resources (i.e. multiple jobs can use the same GPU).
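A minimal batch-script sketch (the binary `my_cuda_app' and the memory/time values are placeholders). It assumes slurm exposes the granted devices through CUDA_VISIBLE_DEVICES, which is its usual behaviour when GPUs are managed as GRES:

```shell
#!/bin/env bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=4000
#SBATCH --time=01:00:00

# slurm normally exports the index of the granted GPU(s) in CUDA_VISIBLE_DEVICES
echo "GPU(s) granted: ${CUDA_VISIBLE_DEVICES:-none}"
srun ./my_cuda_app
```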
To compile your CUDA application on maris using slurm, note that in your submission script you might have to export the libdevice library path and add the directory containing the CUDA headers to the include path, for instance
#!/bin/env bash
....
NVVMIR_LIBRARY_DIR=/usr/share/cuda /usr/bin/nvcc -I/usr/include/cuda my_code.cu
OpenMPI on the maris cluster supports launching parallel jobs in all three methods that SLURM supports:
Please read https://www.open-mpi.org/faq/?category=slurm and https://slurm.schedmd.com/mpi_guide.html
In principle, to run an MPI application you could just execute it using mpirun, as shown in the session below
novamaris$ cat slurm_script.sh
#!/bin/env bash
mpirun mpi_app.exe
novamaris$ sbatch -N 4 slurm_script.sh
srun: jobid 1234 submitted
novamaris$
However, it is highly advised that you use slurm's srun to submit a parallel job in all circumstances.
novamaris$ cat slurm_script.sh
#!/bin/env bash
srun mpi_app.exe
novamaris$ sbatch -N 4 slurm_script.sh
srun: jobid 1234 submitted
novamaris$
At the moment maris supports only OpenMPI with slurm, so you are required to load a suitable openmpi-slurm module to get things to work, for instance
# load OpenMPI
module load openmpi-slurm/2.0.2
# run on 1 node using 3 CPUs
srun -n 3 <your_MPI_program>
# run on 4 nodes using 4 CPUs
srun -N 4 -n 4 mpi_example
# if the job is multithreaded and requires more than one CPU per task
srun -c 4 mpi_example
Note that module load openmpi-slurm/2.0.2 will succeed only if no other modules that implement MPI are loaded. In other words, unload any other MPI module first and then load openmpi-slurm.
Any application that uses MPI with slurm must be compiled against the MPI implementation in the openmpi-slurm module, otherwise it will behave erratically. The module includes a version of OpenMPI built with slurm support, as well as an MPI-enabled FFTW.
Whenever writing a batch script, users are HIGHLY advised to explicitly specify resources (mem, cpus, etc…).
maris offers a helper program to get users started with their first batch script: just type `swizard' and follow the instructions on the screen.
Batch scripts come in handy when you have several options you would like to pass to slurm. Instead of a very long command line, you can create a batch script and submit it using `sbatch'. An example batch script is given below:
#!/bin/env bash
## comment out lines by adding at least two `#' at the beginning
#SBATCH --job-name=lel-rAUtV
#SBATCH --account=wyxxl
#SBATCH --partition=computation
#SBATCH --output=/home/lxxx/%x.out
#SBATCH --error=/home/lxxx/%x.err
#SBATCH --time=1-00:00:00
#SBATCH --mem=400
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=1

module load openmpi-slurm/2.0.2
srun a.out
How can I transfer my files to a node's local scratch space so that they can be used in my slurm jobs?
slurm provides the command sbcast to make single files available to allocated jobs. But how do you achieve the same for directories or multiple files?
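For a single file, sbcast can be called directly from within a batch script; a sketch (file names and the scratch path mirror the example below and are placeholders):

```shell
# create the per-job scratch directory on every allocated node
srun mkdir -p /data1/$USER/$SLURM_JOB_ID
# broadcast one input file to that directory on every node
sbcast my_input.dat /data1/$USER/$SLURM_JOB_ID/my_input.dat
```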
Consider the batch script below
$ cat slurmcp.sh
#!/bin/env bash
#SBATCH -N 1
#SBATCH --nodelist=maris066
# note: #SBATCH directives are only honoured before the first executable line
DEFAULT_SOURCE=${SLURM_SUBMIT_DIR}
SOURCE=${1:-$DEFAULT_SOURCE}
SCRATCH=/data1/$USER/$SLURM_JOB_ID
srun mkdir -p ${SCRATCH} || exit $?
# note that srun cp is equivalent to a loop over each node copying the files
srun cp -r ${SOURCE}/* ${SCRATCH} || exit $?
# now do whatever you need to do with the local data
# do NOT forget to remove data that are no longer needed
srun rm -rf ${SCRATCH} || exit $?
and its invocation sbatch slurmcp.sh /tmp/mydir
.
Slurm can be instructed to email any job state changes to a chosen email address. This is accomplished using the --mail-type option of sbatch, for instance
...
#SBATCH --mail-user=myemail@address.org
#SBATCH --mail-type=ALL
...
If the --mail-user option is not specified, emails are sent to the submitting user.
In the event of a job failure (exit status different from zero), maris will include in the notification email a few lines from the job's stderr. Please note that this feature only works if the job's stdout and stderr were not specified using filename patterns, i.e. names including % characters as described in man sbatch.
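For instance, output names like the following count as filename patterns and would therefore disable the stderr excerpt in failure emails (the names are illustrative):

```shell
#SBATCH --output=%x-%j.out   # %x expands to the job name, %j to the job id
#SBATCH --error=%x-%j.err
```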
We have set up a jupyterhub environment that uses the slurm facilities to launch users' notebooks. Please refer to JupyterHub with Slurm on Maris.
ssh-ing from novamaris to a maris compute node produces a top-like output.