Slurm is a cluster management and job scheduling system. To learn more about Slurm architecture and commands, refer to the official [https://slurm.schedmd.com/overview.html Slurm Workload Manager documentation]. Templates and cheat sheets for using Slurm on USC HPC systems can be found in the [https://www.carc.usc.edu/user-guides/hpc-systems/using-our-hpc-systems CARC HPC User Guide].
+ | |||
+ | This page outlines the transition of USC HPC from Torque to the Slurm job manager. | ||
+ | |||
+ | == HPC script placing on task on one node == | ||
+ | |||
<pre>
This is a script to call plot_vs30_map.py on a compute node, not the head node. Currently this script fails with a memory error when run at the full resolution of 0.01 degree spacing. Try running it on a compute node and asking for unlimited memory.

Two special commands in the slurm file determine how much memory we can access on a node.
These scripts use two commands to control the placement of tasks on nodes:
#SBATCH --ntasks=1   # total number of tasks you will run, one per processor
#SBATCH -N 1         # total number of nodes to use; distribute ntasks onto this number of nodes
# This configuration should run one task on one node with unlimited memory, to try to get rid of the memory error on the high-resolution mesh.
# Next plan is to instrument this memory error, print out the amount of memory used at critical points, and determine at what level it runs out.
</pre>

<pre>
-bash-4.2$ more *.slurm
#!/bin/bash
#SBATCH -p scec
#SBATCH --ntasks=1                   # number of tasks, one per processor
#SBATCH -N 1                         # number of nodes
#SBATCH --mem=0                      # set to 0 for unlimited memory (all memory on the node)
#SBATCH --time=12:00:00
#SBATCH -o ucvm2mesh_mpi_%A.out
#SBATCH -e ucvm2mesh_mpi_%A.err
#SBATCH --export=NONE
#SBATCH --mail-user=maechlin@usc.edu
#SBATCH --mail-type=END

cd /auto/scec-00/maechlin/ucvmc185/utilities
srun -v --mpi=pmi2 /auto/scec-00/maechlin/ucvmc185/utilities/plot_vs30_map.py -b 30.5,-126.0 -u 42.5,-112.5 -s 0.05 -a s -c cs173 -o vs30_cs173.png
</pre>
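
Related to the memory-instrumentation plan noted above, Slurm's accounting command can report how much memory a completed job actually used. A minimal sketch, assuming job accounting is enabled on the cluster; the batch file name plot_vs30.slurm and the job ID are illustrative:

<pre>
# Submit the batch file; sbatch prints the job ID on success.
-bash-4.2$ sbatch plot_vs30.slurm
Submitted batch job 123456

# After the job completes, report its peak memory use (MaxRSS) and run time.
-bash-4.2$ sacct -j 123456 --format=JobID,JobName,MaxRSS,Elapsed,State
</pre>

Comparing MaxRSS between the 0.05 degree run above and a 0.01 degree run should show how close the high-resolution mesh comes to the node's memory limit.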
== Converting from Torque (PBS) to Slurm ==
Here are differences between SLURM and Torque.
- SLURM seems much snappier, at least at Stampede. I can submit a job, then immediately check the job status, and if there are available resources it will have already started. Torque (at HPC at least) has a scheduling delay.
- SLURM can be annoying with the placement of STDOUT/STDERR files. I believe that by default it puts them in your home directory? Or maybe it's poor naming, I can't quite remember. I ended up creating a script called "qsub" (because I'm used to that command) which does the following to give Torque-style .o<ID> and .e<ID> files in the submission directory:
<pre>
sbatch -o ${1}.o%j -e ${1}.e%j $1
</pre>
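
A slightly fuller sketch of that wrapper, with the same sbatch call plus a basic usage check; the script name and error handling here are illustrative additions, not part of the original one-liner:

<pre>
#!/bin/bash
# qsub: submit a Slurm job file with Torque-style output names,
# producing <job-file>.o<jobid> and <job-file>.e<jobid> in the submission directory.
if [ $# -lt 1 ]; then
    echo "usage: qsub <job-file>" >&2
    exit 1
fi
sbatch -o "${1}.o%j" -e "${1}.e%j" "$1"
</pre>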
- There are different commands for managing jobs. The basic ones are listed here (a short example session follows the list):
- qstat -u kmilner => squeue -u kmilner
- qdel <job-id> => scancel <job-id>
- qsub <job-file> => sbatch <job-file> (or use my command above)
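
As a quick illustration of the Slurm versions of these commands in use (the job file name and job ID are made up):

<pre>
-bash-4.2$ sbatch my_job.slurm     # submit; prints "Submitted batch job <jobid>"
-bash-4.2$ squeue -u kmilner       # list that user's pending and running jobs
-bash-4.2$ scancel 123456          # cancel the job with the given ID
</pre>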
- The headers are different. Here are equivalent headers for the same job, in Stampede2-style syntax (might be slightly different for HPCC):
SLURM:
<pre>
#SBATCH -t 00:60:00
#SBATCH -N 2
#SBATCH -n 40
#SBATCH -p scec
</pre>

PBS (Torque):
<pre>
#PBS -q scec
#PBS -l walltime=00:60:00,nodes=2:ppn=20
#PBS -V
</pre>
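
For context, here is a sketch of how those headers might sit in complete, equivalent job files; the file names, executable name, and launch lines are illustrative and would need to be adapted to the actual application:

A Slurm job file (e.g. my_job.slurm):
<pre>
#!/bin/bash
#SBATCH -t 00:60:00    # 60 minute walltime
#SBATCH -N 2           # 2 nodes
#SBATCH -n 40          # 40 total tasks (20 per node)
#SBATCH -p scec        # scec partition/queue

cd $SLURM_SUBMIT_DIR          # start in the directory the job was submitted from
srun ./my_mpi_program         # launch one MPI rank per task
</pre>

The equivalent Torque job file (e.g. my_job.pbs):
<pre>
#!/bin/bash
#PBS -q scec
#PBS -l walltime=00:60:00,nodes=2:ppn=20
#PBS -V

cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
mpirun -np 40 ./my_mpi_program
</pre>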