Slurm

From SCECpedia

USC HPC is moving to the Slurm job manager, so here are some notes on using that system at HPC.

HPC script placing one task on one node

This is a script to call plot_vs30_map.py on a compute node, not the head node. Currently this script fails with a memory error when run at the full resolution of 0.01 degree spacing. Try running on a compute node and asking for unlimited memory.

Two special commands in the Slurm file determine how much memory we can access on a node. These scripts use two commands to control the placement of tasks on nodes:

#SBATCH --ntasks=1 # total number of tasks you will run, one per processor
#SBATCH -N 1 # total number of nodes to use; distribute ntasks onto this number of nodes

This configuration should run one task on one node with unlimited memory, to try to get rid of the memory error on the high-resolution mesh. The next plan is to instrument this memory error, print the amount of memory used at critical points, and determine at what level it is running out.
-bash-4.2$ more *.slurm
#!/bin/bash
#SBATCH -p scec
#SBATCH --ntasks=1 #number of tasks with one per processor
#SBATCH -N 1 # number of nodes
#SBATCH --mem 0 # Set to unlimited memory
#SBATCH --time=12:00:00
#SBATCH -o ucvm2mesh_mpi_%A.out
#SBATCH -e ucvm2mesh_mpi_%A.err
#SBATCH --export=NONE
#SBATCH --mail-user=maechlin@usc.edu
#SBATCH --mail-type=END

cd /auto/scec-00/maechlin/ucvmc185/utilities
srun -v --mpi=pmi2 /auto/scec-00/maechlin/ucvmc185/utilities/plot_vs30_map.py -b 30.5,-126.0 -u 42.5,-112.5 -s 0.05 -a s -c cs173 -o vs30_cs173.png
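
The memory instrumentation mentioned above could be sketched as follows. This is a hedged example, not code from plot_vs30_map.py: the function name report_memory and the checkpoint labels are hypothetical, and it assumes a Linux system, where ru_maxrss is reported in kilobytes.

```python
import resource

def report_memory(label):
    """Print the peak resident set size so far, tagged with a checkpoint label.

    On Linux ru_maxrss is in kilobytes; on macOS it is in bytes, so the
    MB conversion below assumes Linux.
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("[mem] %s: peak RSS %.1f MB" % (label, peak_kb / 1024.0))

# Hypothetical checkpoints inside the plotting script:
report_memory("after reading velocity model")
# ... build the 0.01-degree mesh here ...
report_memory("after building mesh")
```

Sprinkling calls like these between the major steps would show which step pushes memory use past the node's limit.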

HPC website

Converting from Torque (PBS) to Slurm

Here are some differences between Slurm and Torque.

  1. SLURM seems much snappier, at least at Stampede. I can submit a job, then immediately check the job status, and if there are available resources it will have already started. Torque (at HPC at least) has a scheduling delay.
  2. SLURM can be annoying with placement of STDOUT/STDERR files. I believe that by default it puts them in your home directory? Or maybe it's poor naming, I can't quite remember. I ended up creating a script called "qsub" (because I'm used to that command) which does the following to give Torque-style .o<ID> and .e<ID> files in the submission directory:
    sbatch -o ${1}.o%j -e ${1}.e%j $1
  3. There are different commands for managing jobs. The basic ones are:
    1. qstat -u kmilner => squeue -u kmilner
    2. qdel <job-id> => scancel <job-id>
    3. qsub <job-file> => sbatch <job-file> (or use my command above)
  4. The headers are different. Here are equivalent headers for the same job, with Stampede2-style syntax (might be slightly different for HPCC):
SLURM:

#SBATCH -t 00:60:00
#SBATCH -N 2
#SBATCH -n 40
#SBATCH -p scec

PBS (Torque):

#PBS -q scec
#PBS -l walltime=00:60:00,nodes=2:ppn=20
#PBS -V
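
The "qsub" replacement described in the list above could be fleshed out as a small wrapper. This is a sketch under my own assumptions (argument checking, writing it as a shell function), not the author's exact script; the only part taken from the original is the sbatch line with the %j output pattern.

```shell
#!/bin/bash
# Minimal "qsub"-style wrapper around sbatch (a sketch, not the original
# script). It submits the given job file and asks Slurm to write
# Torque-style <jobfile>.o<jobid> and <jobfile>.e<jobid> files in the
# submission directory; %j expands to the Slurm job ID.
qsub() {
  local job_file=$1
  if [ -z "$job_file" ]; then
    echo "usage: qsub <job-file>" >&2
    return 1
  fi
  sbatch -o "${job_file}.o%j" -e "${job_file}.e%j" "$job_file"
}
```

Saved as an executable script on your PATH (with the function body inlined), this lets you keep typing "qsub my_job.slurm" out of habit.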

Related Entries