institute_lorentz:institutelorentz_maris_slurm, revision of 2018/10/08 07:44 by lenocil (page removed 2020/01/15 13:34)
====== Slurm on the Maris Cluster ======
All maris nodes have been configured to use [[http://

Maris' slurm has been configured to manage consumable resources, such as CPUs and RAM, and generic resources (GPUs) using cgroups.

A snapshot of the cluster usage can be found at http://

Maris runs SLURM v17.02.

Suggested readings:
  * [[https://
  * [[:
===== Accounting =====
The Maris accounting scheme has been set up such that each principal investigator (PI) at IL has their own slurm account. Collaborators,

Accounting allows system managers to track cluster usage to improve services and enables the assignment of different powers/

Account information for a given <

<code bash>
sacctmgr list associations cluster=maris user=<
</code>

If no results are returned, please contact ''

Similarly, if you encounter the following error message upon submission of your batch job
<code>
error: Unable to allocate resources: Invalid account or account/
</code>
please make sure that you have specified the account associated with your user name.
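As a sketch, the account can be passed on the command line when submitting (the account name ''pi_account'' and the script name ''myjob.sh'' are hypothetical placeholders; ''--account'' is the standard slurm option):

```shell
# submit specifying the account explicitly (account and script names are hypothetical)
sbatch --account=pi_account myjob.sh
```

The same can be achieved inside the script itself with a ''#SBATCH --account=pi_account'' directive.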
===== Available partitions and nodes =====

Available partitions and their configurations can be listed by typing ''sinfo''

A similar type of information, plus all jobs currently in a queue, can be seen using the GUI program ''

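For a quick text overview, ''sinfo'' accepts a format string; the following session fragment uses standard sinfo format specifiers (the partition name ''computation'' is taken from the table below):

```shell
# list each partition with its time limit, node count and node list
sinfo -o "%P %l %D %N"

# show per-node state, CPUs and memory for the computation partition
sinfo -p computation -N -o "%N %t %c %m"
```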
==== Maris partitions ====
`playground'

^ Name ^ CPUs ^ Memory ^ Default Job Memory,
|playground|288|851832M|all| |maris0[04-22,
|notebook| 48|193044M|all| |maris0[23-28]|6|4| 1| | notebook | all |
|computation| 1552|6578050M |400M, 3Days | |maris0[47-74] | 28 | | | | normal | all |
|compintel| 192 |1030000M|400M,
|ibintel| 192 |1030000M|400M,
|emergency| 384 |2773706M|all| |maris0[69-74] |6| | | | normal | NOBODY |
|gpu| 56 |256000M|400M,

The ''playground''

The ''notebook''

The ''emergency''

The ''gpu'' partition should be used only for jobs requiring GPUs. Note that GPUs must be requested from slurm explicitly using ''
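As a sketch, a batch-script fragment requesting one GPU on this partition could look as follows (''--gres=gpu:1'' is the standard slurm syntax for generic resources; the executable name is hypothetical):

```shell
#!/bin/env bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1        # request one GPU as a generic resource
#SBATCH --time=01:00:00

srun ./my_cuda_app          # hypothetical CUDA executable
```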

The ''computation''
- | |||
===== Maris QOS =====

A quality of service (QOS) can be associated with a job, user, partition, etc. and can modify

  * priorities
  * limits

Maris uses the concept of QOS to impose usage limits on the `notebook',

To display all defined QOS use ''

<code>
#sacctmgr show qos format=Name,
Name MaxCPUsPU MaxJobsPU GrpNodes
---------- --------- --------- -------- -------- ----------- --------------------
normal
playground
notebook
guests
</code>

Any user can submit jobs specifying a QOS via the option ''
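For example, a session fragment submitting to the playground partition with its matching QOS (names taken from the QOS listing above; the script name is hypothetical):

```shell
# request the playground QOS explicitly at submission time
sbatch --partition=playground --qos=playground myjob.sh
```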
===== Slurm job priority on Maris =====

Maris' slurm uses the [[https://
  * Age
  * Fairshare
  * Job size and TRES
  * Partition
  * QOS
Furthermore,

In summary, Maris' slurm has been set up such that:

  * small jobs are given high priority.
  * jobs submitted to the ''
  * QOS have no influence on job priority.
  * fairshare is an important factor when ordering the queue.
  * after a wait of 7 days, a job will be given the maximum Age factor weight.
  * fairshare is only based on the past 14 days. That is, usage decays to 0 within 14 days.

The relevant configuration options can be displayed via the command ''
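One way to inspect these settings is the standard ''scontrol show config'' command; the grep pattern below is just an illustration of filtering for the priority-related options:

```shell
# dump the scheduler configuration and keep the Priority* options
scontrol show config | grep -i '^Priority'
```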
===== Using GPUs with slurm on Maris =====

Maris075 is the only GPU node in the maris cluster. GPUs are configured as generic resources or GRES. In order to use a GPU in your calculations,

:!: Please note that on maris GPUs are configured as __not-consumable__ generic resources (i.e. multiple jobs can use the same GPU).

To compile your CUDA application on maris using slurm, note that in your submission script you might have to export the libdevice library path and include the path in which the CUDA headers can be found, for instance
<code bash>
#!/bin/env bash
...
NVVMIR_LIBRARY_DIR=/
</code>
- | |||
===== slurm and MPI =====

OpenMPI on the maris cluster supports launching parallel jobs in all three methods that SLURM supports:

  * with //salloc//
  * with //sbatch//
  * with //srun//

Please read https://

In principle, to run an MPI application you could just execute it using mpirun, as shown in the session below
<code bash>
novamaris$ cat slurm_script.sh
#!/bin/env bash
mpirun mpi_app.exe
novamaris$ sbatch -N 4 slurm_script.sh
srun: jobid 1234 submitted
novamaris$
</code>
However, __**it is highly advised you use slurm's ''srun'' instead**__:
<code bash>
novamaris$ cat slurm_script.sh
#!/bin/env bash
srun mpi_app.exe
novamaris$ sbatch -N 4 slurm_script.sh
srun: jobid 1234 submitted
novamaris$
</code>
At the moment maris supports only OpenMPI with slurm, so you are required to load a particular openmpi/
- | |||
<code bash>
# load openMPI
module load openmpi-slurm/

# run on 1 node using 3 CPUs
srun -n 3 <

# run on 4 nodes using 4 CPUs
srun -N 4 -n 4 mpi_example

# if the job is multithreaded and requires more than one CPU per task
srun -c 4 mpi_example
</code>

:!: ''

:!: Any application that uses MPI with slurm must be compiled against the MPI in the module openmpi-slurm, otherwise it will behave erratically.

==== module: openmpi-slurm ====
It includes a version of openMPI built with slurm support. It also includes mpi-enabled ''
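As a sketch, compiling and launching a small MPI program against this module (the source file name and module version are hypothetical; ''mpicc'' is the standard OpenMPI compiler wrapper):

```shell
module load openmpi-slurm/<version>     # pick the version installed on maris
mpicc -o mpi_example mpi_example.c      # compile against the slurm-aware OpenMPI
srun -n 4 mpi_example                   # launch 4 tasks under slurm
```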
===== Example batch script =====

Whenever writing a batch script, users are HIGHLY advised to explicitly specify resources (mem, cpus, etc.).

maris offers a helper program to get users started with their first batch script; just type `swizard'

Batch scripts come in handy when you have several options you would like to pass to slurm. Instead of using a very long command line, you can create a batch script and submit it using `sbatch'
An example of a batch script is given below:
- | |||
<code bash>
#!/bin/env bash
## comment out lines by adding at least two `#' at the beginning
#SBATCH --job-name=lel-rAUtV
#SBATCH --account=wyxxl
#SBATCH --partition=computation
#SBATCH --output=/
#SBATCH --error=/
#SBATCH --time=1-00:
#SBATCH --mem=400
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=1

module load openmpi-slurm/

srun a.out
</code>
- | |||
===== Example: how to use a node's scratch disks =====

How can I transfer

slurm provides the command ''

Consider the batch script below

<code bash>
$ cat slurmcp.sh
#!/bin/env bash
## SBATCH directives must come before any executable line
#SBATCH -N 1
#SBATCH --nodelist=maris066

DEFAULT_SOURCE=${SLURM_SUBMIT_DIR}
SOURCE=${1:

SCRATCH=/

srun mkdir -p ${SCRATCH} || exit $?
# note that srun cp is equivalent to a loop over each node copying the files
srun cp -r ${DEFAULT_SOURCE}/

# now do whatever you need to do with the local data

# do NOT forget to remove data that are no longer needed
srun rm -rf ${SCRATCH} || exit $?
</code>

and its invocation ''
- | |||
===== Example: instruct slurm to send emails upon job state changes =====

Slurm can be instructed to email any job state changes to a chosen email address. This is accomplished by using the ''
<code bash>
...
#SBATCH --mail-user=myemail@address.org
#SBATCH --mail-type=ALL
...
</code>

If the ''

In the event of a job failure (exit status different from zero), maris will include in the notification email a few lines from the job's stderr. Please note that this feature will **only** work if a job's stdout and stderr were not
specified using ''
===== Python Notebooks on maris =====
We have set up a jupyterhub environment that uses the slurm facilities to launch users' notebooks. Please
refer to [[institute_lorentz:
===== Notes =====

:!: ssh-ing from novamaris to a maris compute node produces a top-like output.