slurm_tutorial (last modified 2019/01/16 08:35 by lenocil)
Slurm is a resource manager and job scheduler.
Users can submit jobs (i.e. scripts containing execution instructions) to slurm so that it can schedule their execution and allocate the appropriate resources (CPU, RAM, etc.) on the basis of a user's preferences or the limits imposed by the system administrators.

The advantages of using slurm on a computational cluster are multiple. For an overview of them, please read Slurm's online documentation.

Slurm is **free** software distributed under the GNU General Public License.
==== What is parallel computing? ====
//A parallel job consists of tasks that run simultaneously.// Parallelism can be achieved

  * by running a multi-process program, for example using MPI, or
  * by running a multi-threaded program.

A multi-process program consists of multiple tasks orchestrated by MPI and possibly executed on different nodes. A multi-threaded program, on the other hand, consists of multiple threads using several CPUs on the same node.
==== Slurm's common user commands ====
<code>
$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE  NODELIST
playground*  ...
lowmem       ...
lowmem-inf   ...
highmem      ...
highmem-inf  ...
notebook     ...
</code>
A * near a partition name indicates the default partition. See ''man sinfo'' for more details.
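For scripting, ''sinfo'' output can be post-processed with standard text tools. A minimal sketch, assuming hypothetical partitions (the sample output is inlined so the pipeline can be tried without access to a cluster):

```shell
# Hypothetical sample of `sinfo --noheader -o "%P %a"` output,
# inlined so the snippet runs without a cluster at hand.
sample='playground* up
lowmem up
notebook up'

# Strip the trailing "*" that marks the default partition and
# print the bare partition names.
echo "$sample" | awk '{sub(/\*$/, "", $1); print $1}'
```

The same ''awk'' filter works on live output by replacing the ''echo'' with the real ''sinfo'' invocation.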
**Display all active jobs of a user**
<code>
$ squeue -u <username>
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  ...
</code>
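If you only need aggregate numbers rather than the full listing, squeue's format options combine well with standard tools. A sketch with hypothetical states inlined in place of live ''squeue -h -o "%T"'' output:

```shell
# Hypothetical sample of `squeue -h -o "%T"` output: one job state per line.
sample='RUNNING
PENDING
RUNNING
PENDING
PENDING'

# Count jobs per state.
echo "$sample" | sort | uniq -c
```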
<code>
$ scontrol show partition notebook
PartitionName=notebook
   ...
</code>
<code>
$ scontrol show node maris004
NodeName=maris004 Arch=x86_64 CoresPerSocket=4
   ...
</code>
<code>
novamaris [1087] $ scontrol show jobs 1052
JobId=1052 JobName=slurm_engine.sbatch
   ...
</code>
0: maris005
1: maris006
</code>
**Create three tasks running on the same node**
<code>
...
</code>
**Create three tasks running on different nodes specifying which nodes should __at least__ be used**

<code>
srun -N3 -w "..."
</code>
**Create a job script and submit it to slurm for execution**

Suppose ''batch.sh'' contains the following lines

<code>
#!/bin/env bash
#SBATCH -n 2
#SBATCH -w maris00[5-6]
srun hostname
</code>

then submit it using ''sbatch batch.sh''.

See ''man sbatch'' for more details.
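Job scripts are plain text, so they can also be generated programmatically. A minimal sketch (the resource values are placeholders; actual submission with ''sbatch'' of course requires a cluster):

```shell
# Write a minimal job script with a heredoc. The quoted delimiter
# prevents variable expansion inside the script body.
cat > batch.sh <<'EOF'
#!/bin/env bash
#SBATCH -n 2
srun hostname
EOF

# On a cluster you would now run: sbatch batch.sh
# Here we only confirm the directive made it into the file.
grep -c '^#SBATCH' batch.sh
```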
==== Less-common user commands ====

  * **sacctmgr**
  * **sshare**
  * **sprio**
  * **sacct**
=== sacctmgr ===
<code>
$ sacctmgr show qos format=Name,...
      Name MaxCPUsPU MaxJobsPU ...
---------- --------- --------- ...
    normal ...
playground ...
  notebook ...
</code>
</code>
:!: Note that in the example above the job is identified by the id ''<jobid>.batch''. This is because for serial jobs (jobs whose commands are not launched via ''srun'') the statistics are accounted to the ''.batch'' step. For parallel jobs, whose steps are launched with ''srun'', use the plain job or step id instead.
=== sshare ===
<code>
$ sshare -U -u <username>
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
               xxxxx     yyyyyy        ...
</code>

:!: On maris, usage parameters decay over time according to a PriorityDecayHalfLife of 14 days.
=== sprio ===
<code>
$ sprio
...
</code>
To find what priority a running job was given type

<code>
squeue -o %Q -j <jobid>
</code>
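Priorities are plain integers, so queued jobs can be ranked with standard tools. A sketch with hypothetical ''squeue -h -o "%Q %i"'' output (priority, then job id) inlined:

```shell
# Hypothetical sample of `squeue -h -o "%Q %i"` output.
sample='1034 13180
2200 13183
150 13190'

# Numeric sort on the priority column, highest first.
echo "$sample" | sort -k1,1 -rn
```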
=== sacct ===
It displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database. For instance

<code>
$ sacct -o JobID,...
       JobID        ...
------------ ----------
13180        ...
13180.batch  ...
13183        ...
13183.batch  ...
13183.0      ...
</code>

:!: Use ''man sacct'' for a complete list of the available format fields.
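sacct's columns are easy to post-process, for example to total the elapsed time of past jobs. A sketch with made-up ''sacct -X -o Elapsed --noheader'' output inlined:

```shell
# Hypothetical sample of `sacct -X -o Elapsed --noheader` output
# (assumes all runs are shorter than a day, i.e. plain HH:MM:SS).
sample='00:01:05
01:10:00'

# Convert HH:MM:SS to seconds and print the total.
echo "$sample" | awk -F: '{s += $1*3600 + $2*60 + $3} END {print s}'
```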
===== Tips =====
To minimize the time your job spends in the queue you can specify multiple partitions so that the job starts as soon as possible, e.g. ''sbatch -p <partition1>,<partition2> ...''.

To have a rough estimate of when your queued job will start type ''squeue --start -j <jobid>''.

To translate a job script written for a scheduler other than slurm into slurm's syntax, consult the scheduler "Rosetta Stone" on the Slurm website.
=== top-like node usage ===

Should you want to monitor the usage of the cluster nodes in a top-like fashion type

<code>
sinfo -i 5 -S"..."
</code>
=== top-like job stats ===
To monitor the resources consumed by your running job type

<code>
watch -n1 sstat --format JobID,... -j <jobid>
</code>
=== Make local file available to all nodes allocated to a slurm job ===

To transmit a file to all nodes allocated to the currently active Slurm job use ''sbcast''. For example

<code>
> cat my.job
#!/bin/bash
sbcast my.prog /tmp/my.prog
srun /tmp/my.prog

> sbatch --nodes=8 my.job
srun: jobid 145 submitted
</code>
=== Specify nodes for a job ===

For instance ''#SBATCH -w maris00[5-6]'' in a job script requests that at least nodes maris005 and maris006 be allocated to the job.

=== Environment variables available to slurm jobs ===

Type ''env | grep SLURM'' inside a job script to list the environment variables slurm sets for the job.
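The job environment degrades gracefully when inspected outside a job. A sketch (the variable names are standard Slurm ones; the fallback text is mine):

```shell
# Inside a job slurm sets variables such as SLURM_JOB_ID and
# SLURM_JOB_NODELIST; outside a job they are unset, hence the fallbacks.
echo "job id:   ${SLURM_JOB_ID:-not in a job}"
echo "nodelist: ${SLURM_JOB_NODELIST:-not in a job}"
```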