UCERF3-ETAS Measurements


This page summarizes a performance study of UCERF3-ETAS run locally in Docker and on CARC Discovery, SDSC Expanse, TACC Frontera, and TACC Stampede3. The study evaluates resource requirements for single-node and multiple-node simulations of the Ridgecrest M7.1 ETAS forecast (ci38457511).

Performance Results

In the tables below, "Core Hours" reflect ACCESS SUs used and are computed by dividing the runtime in minutes by 60, then multiplying by the number of nodes used and by the number of CPU cores available on each node. As we are charged for the full node regardless of CPU utilization, this metric doesn't reflect how many cores within a node were actually used. The "Cores / Node" column is the average number of CPU cores available per node, derived by running scontrol show node -a <node_list> over the list of nodes allocated to each job. "RAM / Node" is the RAM made available to ETAS via either ETAS_MEM_GB or MEM_GIGS, not the total RAM on a given node.

Docker Measurements
Number of Nodes | Cores / Node | RAM / Node (GB) | Number of Catalogs | Runtime (min) | Core Hours
1 | 14 | 75 | 10 | 1.3 | 0.30
1 | 14 | 75 | 100 | 4.4 | 1.0
1 | 14 | 75 | 1000 | 25.8 | 6.0
1 | 14 | 75 | 10,000 | 286.3 | 67

Dockerized local runs used a resource allocation of 14 CPUs (1 thread per CPU), 96GB RAM, 1GB Swap, and 64GB Disk. u3etas_launcher uses 80% of available RAM, so ETAS_MEM_GB=75.

Single-node measurements below were collected using ETAS_MEM_GB=32. Multi-node measurements automatically override the default value.

Discovery is a heterogeneous system: not all nodes within the same partition have the same number of cores available. The "Cores / Node" column is calculated by averaging the cores available over the nodes assigned to each job.


Discovery Measurements
Number of Nodes | Cores / Node | RAM / Node (GB) | Threads / Node | Total Threads | RAM / Thread | Scratch Enabled | Number of Catalogs | Runtime (min) | Core Hours
1 | 24 | 32 | 30 | 30 | 1.1 | Y | 10 | 1.7 | 0.68
1 | 24 | 32 | 30 | 30 | 1.1 | Y | 100 | 157.2 | 63
1 | 20 | 32 | 30 | 30 | 1.1 | Y | 1000 | 201.4 | 67
1 | 24 | 32 | 30 | 30 | 1.1 | Y | 10,000 | 424.8 | 170
14 | 20 | 50 | 10 | 140 | 5 | Y | 10 | 2.9 | 14
14 | 46.86 (8x 64, 6x 24) | 50 | 10 | 140 | 5 | Y | 100 | 2.8 | 31
14 | 55.43 (11x 64, 3x 24) | 50 | 10 | 140 | 5 | Y | 1000 | 3.6 | 47
14 | 52.57 (10x 64, 4x 24) | 50 | 10 | 140 | 5 | Y | 10,000 | 17.2 | 210
14 | 20 | 50 | 10 | 140 | 5 | Y | 100,000 | 228.1 | 1064
32 | 64 | 50 | 10 | 320 | 5 | Y | 10 | 0.78 | 27
32 | 64 | 50 | 10 | 320 | 5 | Y | 100 | 1.2 | 41
32 | 60.25 (29x 64, 3x 24) | 50 | 10 | 320 | 5 | Y | 1000 | 3.9 | 125
32 | 50.75 (22x 64, 4x 24, 6x 20) | 50 | 10 | 320 | 5 | Y | 10,000 | 10.5 | 284
32 | 27.50 (4x 64, 16x 24, 12x 20) | 50 | 10 | 320 | 5 | Y | 100,000 | 99.2 | 1455


Expanse Measurements
Number of Nodes | Cores / Node | RAM / Node (GB) | Threads / Node | Total Threads | RAM / Thread | Scratch Enabled | Number of Catalogs | Runtime (min) | Core Hours
1 | 128 | 32 | 10 | 10 | 3.2 |  | 10 | 2.9 | 6.2
1 | 128 | 32 | 10 | 10 | 3.2 |  | 100 | 10.4 | 22
1 | 128 | 32 | 10 | 10 | 3.2 |  | 1000 | 22.6 | 48
1 | 128 | 32 | 10 | 10 | 3.2 |  | 10,000 | 207.7 | 443
1 | 128 | 220 | 44 | 44 | 5 | Y | 10 | 1.0 | 2.1
1 | 128 | 220 | 44 | 44 | 5 | Y | 100 | 2.4 | 5.1
1 | 128 | 220 | 44 | 44 | 5 | Y | 1000 | 14.1 | 30
1 | 128 | 200 | 40 | 40 | 5 | Y | 10,000 | 67.1 | 143
9 | 128 | 200 | 40 | 360 | 5 | Y | 100,000 | 133.9 | 2571
14 | 128 | 50 | 10 | 140 | 5 |  | 10 | 1.8 | 54
14 | 128 | 50 | 10 | 140 | 5 |  | 100 | 2.1 | 63
14 | 128 | 50 | 10 | 140 | 5 |  | 1000 | 5.4 | 161
14 | 128 | 50 | 10 | 140 | 5 |  | 10,000 | 18.9 | 564
14 | 128 | 50 | 10 | 140 | 5 |  | 100,000 | 162.4 | 4850
14 | 128 | 200 | 40 | 560 | 5 | Y | 10 | 1.7 | 51
14 | 128 | 200 | 40 | 560 | 5 | Y | 100 | 2.2 | 66
14 | 128 | 200 | 40 | 560 | 5 | Y | 1000 | 4.1 | 122
14 | 128 | 200 | 40 | 560 | 5 | Y | 10,000 | 15.3 | 457
14 | 128 | 200 | 40 | 560 | 5 | Y | 100,000 | 90.5 | 2703
14 | 128 | 200 | 25 | 350 | 8 | Y | 100,000 | 86.1 | 2572
14 | 128 | 224 | 14 | 196 | 16 | Y | 10,000 | 15.6 | 467
32 | 128 | 50 | 10 | 320 | 5 |  | 10 | 2.1 | 143
32 | 128 | 50 | 10 | 320 | 5 |  | 100 | 2.3 | 157
32 | 128 | 50 | 10 | 320 | 5 |  | 1000 | 3.4 | 232
32 | 128 | 50 | 10 | 320 | 5 |  | 10,000 | 11.3 | 771
32 | 128 | 50 | 10 | 320 | 5 |  | 100,000 | 74.8 | 5106
32 | 128 | 200 | 40 | 1280 | 5 | Y | 10 | 2.0 | 137
32 | 128 | 200 | 40 | 1280 | 5 | Y | 100 | 2.7 | 184
32 | 128 | 200 | 40 | 1280 | 5 | Y | 1000 | 2.7 | 184
32 | 128 | 200 | 40 | 1280 | 5 | Y | 10,000 | 8.8 | 601
32 | 128 | 200 | 40 | 1280 | 5 | Y | 100,000 | 41.8 | 2854
32 | 128 | 224 | 14 | 196 | 16 | Y | 100,000 | 122.2 | 8342


Frontera Measurements
Number of Nodes | Cores / Node | RAM / Node (GB) | Threads / Node | Total Threads | RAM / Thread | Scratch Enabled | Number of Catalogs | Runtime (min) | Core Hours
2 | 56 | 160 | 20 | 40 | 8 | Y | 10 | 2.1 | 3.9
14 | 56 | 160 | 20 | 280 | 8 | Y | 10,000 | 13.2 | 172
14 | 56 | 160 | 20 | 280 | 8 | Y | 100,000 | 103.8 | 1356
18 | 56 | 160 | 20 | 360 | 8 | Y | 100,000 | 81.2 | 1364


Stampede3 Measurements
Number of Nodes | Queue | Cores / Node | RAM / Node (GB) | Threads / Node | Total Threads | RAM / Thread | Scratch Enabled | Number of Catalogs | Runtime (min) | Core Hours | Node Hours | Charge Rate (SU/Node Hour) | Charge (SU)
14 | ICX | 80 | 200 | 25 | 350 | 8 | Y | 10,000 | 10.6 | 198 | 2.48 | 1.5 | 3.7
14 | ICX | 80 | 200 | 25 | 350 | 8 | Y | 100,000 | 83.5 | 1559 | 19.5 | 1.5 | 29.3
14 | SKX | 48 | 144 | 18 | 252 | 8 | Y | 10,000 | 13.9 | 156 | 3.25 | 1 | 3.3
14 | SKX | 48 | 144 | 18 | 252 | 8 | Y | 100,000 | 111.1 | 1244 | 25.9 | 1 | 25.9
14 | SPR | 112 | 104 | 13 | 182 | 8 | Y | 10,000 | 18.9 | 494 | 4.41 | 2 | 8.8
14 | SPR | 112 | 104 | 13 | 182 | 8 | Y | 100,000 | 164.5 | 4299 | 38.4 | 2 | 76.8
20 | SKX | 48 | 144 | 18 | 360 | 8 | Y | 100,000 | 78.7 | 1259 | 26.2 | 1 | 26.2
27 | SPR | 112 | 104 | 13 | 351 | 8 | Y | 100,000 | 87 | 4385 | 39.2 | 2 | 78.4

Stampede3 SUs billed = (# nodes) x (job duration in wall clock hours) x (charge rate per node-hour)

Installation and Configuration

Running ETAS simulations on OpenSHA is simplified through a collection of launcher binaries and scripts called ucerf3-etas-launcher. The process of installation and configuration varies across systems, but the foundations remain the same. Running simulations always occurs in three phases:

  1. Building configuration files for a specified event, where we configure the nodes and number of simulations
  2. Launching the simulation with the configuration files
  3. Consolidating and plotting simulation data

Docker

When running UCERF3-ETAS simulations locally, Docker provides a consistent environment without the need to manage dependencies, and makes it easy to provision resources. Download the Docker image for the M7.1 Ridgecrest main shock by running docker pull sceccode/ucerf3_jup or by searching for "ucerf3_jup" in Docker Desktop. I prefer to use Docker Desktop, but the command line is sufficient.

Under Docker Desktop settings, I allocated 14 CPUs, 96GB of RAM, 1GB of Swap, and 64GB of disk storage to the Docker environment. Open a terminal on a system with the Docker CLI installed and run docker run -d --name ucerf3-etas -p 8888:8888 sceccode/ucerf3_jup. This runs a container forwarding port 8888 for the Jupyter Notebook server. From here, you can navigate to the Jupyter Notebook web application at http://localhost:8888 to access an interactive terminal for the container. Alternatively, you can run the container directly in Docker Desktop and use the "Exec" tab to access the terminal without needing Jupyter Notebook or port-forwarding.

Once inside your container, use the following workflow to run local simulations and plot data, where $NUM_SIM is the number of simulations desired.

u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --nodes 1 --hours 24 --output-dir target/docker-comcat-ridgecrest-m7.1-n1-s$NUM_SIM

u3etas_launcher.sh $HOME/target/docker-comcat-ridgecrest-m7.1-n1-s${NUM_SIM}/config.json | tee target/docker-comcat-ridgecrest-m7.1-n1-s${NUM_SIM}/u3etas_launcher.log

u3etas_plot_generator.sh $HOME/target/docker-comcat-ridgecrest-m7.1-n1-s${NUM_SIM}/config.json

You'll notice that we didn't specify an hpc-site parameter. Since we are running these simulations locally rather than on a High Performance Computing system, we don't need to define a site to generate a corresponding Slurm file. Instead of passing a Slurm file to sbatch, we execute the launcher directly with the generated config.json file. The output is also piped into tee to capture it for logging purposes.

In Docker Desktop, you can navigate to the "Volumes" tab to find the stored data for the containers and download them onto your host system.

Discovery

Establish a Discovery SSH or CARC OnDemand connection and clone the ucerf3-etas-launcher GitHub repository at the path /project/scec_608/$USER/ucerf3/ucerf3-etas-launcher, where $USER is your username.

Edit the bashrc file at $HOME/.bashrc to update the PATH to include the downloaded ETAS scripts and load HPC modules necessary to run ETAS in a multiple-node environment.

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# Switch groups, but only if necessary
if [[ `id -gn` != "scec_608" && $- =~ i ]]
then
#    echo "switching group"
    newgrp scec_608
    exit
fi

PATH=$PATH:$HOME/.local/bin:$HOME/bin
export TERM=linux

## MODULES
module load usc # this is loaded by default on login nodes, but not on compute nodes, so we need to add 'usc' here so that the subsequent modules will work
module load gcc/11.3
module load openjdk
module load git
module load vim
# every once in a while CARC breaks java, and we need this to avoid unsatisfied link errors
# if you get them looking related to libawt_xawt.so: libXext.so.6 or similar, uncomment the following
# previously encountered and then went away, but came back after may 2024 maintenance window
module load libxtst # no clue why we suddenly needed this to avoid a weird JVM unsatisfied link exception

# compute nodes don't have unzip...
which unzip > /dev/null 2> /dev/null
if [[ $? -ne 0 ]];then
        module load unzip
        module load bzip2
fi

## https://github.com/opensha/ucerf3-etas-launcher/tree/master/parallel/README.md
export PROJFS=/project/scec_608/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/etas_sim
export ETAS_MEM_GB=5 # this will be overridden in batch scripts for parallel jobs, set low enough so that the regular U3ETAS scripts can run on the login node to configure jobs
export MPJ_HOME=/project/scec_608/kmilner/mpj/mpj-current
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin/:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH

if [[ `hostname` == e19-* ]];then
        # on a compute node in the SCEC queue
        export OPENSHA_MEM_GB=50
        export FST_HAZARD_SPACING=0.2
        export OPENSHA_JAR_DISABLE_UPDATE=1
elif [[ -n "$SLURM_JOB_ID" ]];then
        # on a compute node otherwise
        export OPENSHA_JAR_DISABLE_UPDATE=1
        unset OPENSHA_MEM_GB
else
        export OPENSHA_MEM_GB=10
fi
export OPENSHA_FST=/project/scec_608/kmilner/git/opensha-fault-sys-tools
export OPENSHA_FS_GIT_BRANCH=master
export PATH=$PATH:$OPENSHA_FST/sbin

You'll notice that the bashrc references user "kmilner"; do not change these paths. The files there are readable by other users and are necessary for running ETAS. There are future plans to migrate much of this configuration out of the user bash file and into an MPJ Express wrapper script, to improve portability and simplify setup.

After editing the bashrc file, either log out and back in, or run source ~/.bashrc to load the new changes.

Using the launcher scripts, you can access an interactive compute node to build configuration files directly on Discovery, rather than building them locally and transferring them over SCP/SFTP. Configuration files are built this way because non-trivial jobs cannot be executed on the head node. Do so now by running slurm_interactive.sh. After waiting for resource provisioning, build the configuration inside the interactive compute node with

cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site USC_CARC --nodes $NUM_NODE --hours 24 --output-dir $ETAS_SIM_DIR/discovery-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM

where $NUM_SIM is the number of simulations to run and $NUM_NODE is the number of nodes to utilize.

The generated configuration requires a bit of manual tweaking prior to execution. Navigate to the generated simulation directory. Unlike the local Docker runs, there is now a Slurm file.

Slurm files invoke the launcher script with the config JSON file through an MPJ wrapper. MPJ (Message Passing in Java) enables the parallel distribution of work across the HPC compute nodes. The Slurm file also specifies the number of nodes and other parameters relevant to work distribution. In your simulation folder you should see the files "etas_sim_mpj.slurm", "plot_results.slurm", and "opensha-all.jar". If the jar file failed to copy, copy it manually from ${ETAS_LAUNCHER}/opensha/opensha-all.jar. Make the following changes to etas_sim_mpj.slurm:

  • Rename partition from scec -> main: #SBATCH -p main
  • Ensure the ETAS JSON config path is prefixed by simulation directory: ETAS_CONF_JSON="${ETAS_SIM_DIR}/...
  • Update scratch directory from scratch2 -> scratch1: SCRATCH_OPTION="--scratch-dir /scratch1/$USER/etas_scratch"
  • If this simulation is on a single-node, don't invoke the MPJ wrapper:
date
echo "RUNNING ETAS-LAUNCHER"
u3etas_launcher.sh --threads $THREADS $ETAS_CONF_JSON
ret=$?
date

Additionally, inside the config.json file, update the "outputDir" to be prefixed with "${ETAS_SIM_DIR}/" prior to the output name, to prevent the creation of a duplicate folder.
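The partition and scratch-directory renames above can also be scripted; a minimal sketch with GNU sed (the stand-in file below only illustrates the two lines being changed in the real etas_sim_mpj.slurm):

```shell
SLURM=etas_sim_mpj.slurm

# stand-in for the generated file, showing only the two affected lines
printf '%s\n' '#SBATCH -p scec' \
  'SCRATCH_OPTION="--scratch-dir /scratch2/$USER/etas_scratch"' > "$SLURM"

# rename the partition scec -> main and the scratch dir scratch2 -> scratch1
sed -i -e 's/-p scec/-p main/' -e 's|/scratch2/|/scratch1/|' "$SLURM"
cat "$SLURM"
```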

After making the necessary changes, place the ETAS simulation on the job queue by running slurm_submit.sh etas_sim_mpj.slurm. You can rename the slurm file prior to submission to set the job name and more easily manage jobs. Stdout and stderr are written to files {JOB}.o{ID} and {JOB}.e{ID} respectively. Runtime is derived from the timestamps in the output file. Results are written to either a results/ directory or the binary "results_complete.bin".

Generate plots with "plot_results.slurm". As with the simulation job, update the partition name from "scec" to "main" and submit the job with slurm_submit.sh. View the final plots in the generated "index.html". If you do not have a graphical session, you may need to download the simulation folder to view the plots locally.

Expanse

The Expanse Configuration takes into consideration the Expanse User Guide and existing Quakeworx Dev Configuration.

In order to establish an SSH connection to Expanse, you must first verify your Expanse project allocation under Expanse Portal -> OnDemand -> Allocation and Usage Information for resource "Expanse". If it is not present, file a troubleshooting ticket at support.access-ci.org.

Unlike on Discovery, we are going to set up our own MPJ Express installation and configure an MPJ Express Wrapper. A similar process will be rolled out to Discovery in the future.

  1. Clone MPJ Express to /expanse/lustre/projects/usc143/$USER/mpj-express: $ git clone https://github.com/kevinmilner/mpj-express.git
  2. Set Wrapper path in mpj-express/conf/mpjexpress.conf: mpjexpress.ssh.wrapper=/expanse/lustre/projects/usc143/$USER/ucerf3/ucerf3-etas-env-wrapper.sh. You may want to explicitly write your username in the path here instead of using $USER.
  3. Create the MPJ Wrapper file at the specified path as follows:
#!/bin/bash

module load cpu/0.15.4
module load openjdk/11.0.2

export PROJFS=/expanse/lustre/projects/usc143/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/u3etas_sim
export MPJ_HOME=$PROJFS/mpj-express
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH

"$@"
exit $?

Add the following to the bashrc:

module load sdsc
module load cpu/0.15.4
module load openjdk/11.0.2

# compute nodes don't have unzip...
which unzip > /dev/null 2> /dev/null
if [[ $? -ne 0 ]];then
        module load unzip
        module load bzip2
fi

# https://github.com/opensha/ucerf3-etas-launcher/tree/master/parallel/README.md
export PROJFS=/expanse/lustre/projects/usc143/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/u3etas_sim
export ETAS_MEM_GB=5 # this will be overridden in batch scripts for parallel jobs, set low enough so that the regular U3ETAS scripts can run on the login node to configure jobs
export MPJ_HOME=$PROJFS/mpj-express
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin/:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH

Single-node simulations won't invoke the MPJ Express Wrapper, which is why these changes are necessary.

Connect to an interactive compute node:

srun --partition=debug  --pty --account=usc143 --nodes=1 --ntasks-per-node=4 --mem=16G -t 00:30:00 --wait=0 --export=ALL /bin/bash

and build the simulation with NUM_SIM catalogs and NUM_NODE nodes.

u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site USC_CARC --nodes $NUM_NODE --hours 24 --output-dir $ETAS_SIM_DIR/expanse-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM

Inside the slurm config file, set the partition to "compute" instead of "scec":

#SBATCH --partition compute

I can't confirm whether it's necessary to set MPJ_HOME in the Slurm configuration, given that our wrapper already sets it, but I have set it here as well.

Take care to explicitly set ETAS_MEM_GB for single-node runs to a value less than MEM_GIGS. Consider the total RAM available per node, which is ~256GB according to the Expanse User Guide.

For single-node simulations, directly invoke u3etas_launcher just as previously done on Discovery.


Quakeworx Dev doesn’t use a scratch file. Scratch files aren't necessary, but may speed up I/O operations. Comment out the SCRATCH parameter in the slurm configuration.

Set the account for your research project, in my case it's "usc143".

#SBATCH --account=usc143

Depending on your account quota, you may struggle to run 32-node simulations. In my case I used another project, "ddp408", for these simulations.

Check the projects available to you for the Expanse resource with expanse-client user -r expanse.

Unlike on Discovery, we must set ntasks-per-node or ntasks.

#SBATCH --ntasks 20

I ran successfully with 20, although you can try a higher value. Too many tasks may result in a job quota failure.

As Expanse has 128 cores available per node, and we are charged for the full node regardless of utilization, take care to set cores-per-node=128. I didn't do this for my runs, but the Measurements table still counts all 128 cores to accurately reflect the cost.

Job execution and data plotting instructions are identical to Discovery.

Frontera

Before attempting to configure Frontera or Stampede3, consider that, as TACC systems, they share the Stockyard filesystem. Configuration for Frontera and Stampede3 is nearly identical, which means many files could be shared between $STOCKYARD/frontera and $STOCKYARD/stampede3. In fact, mpj-express, ucerf3-etas-launcher, and jdk-22 can be stored directly in $STOCKYARD and shared across systems. This wasn't done in our example, as I didn't realize it until later. u3etas_sim should still live under $WORK on each system for easier organization. On both Frontera and Stampede3, the $WORK and $SCRATCH paths inside Stockyard are already set by default.

Firstly, we configure the user bashrc with compute modules and paths for MPJ and ETAS.

############
# SECTION 1
#
# There are three independent and safe ways to modify the standard
# module setup. Below are three ways from the simplest to hardest.
#   a) Use "module save"  (see "module help" for details).
#   b) Place module commands in ~/.modules
#   c) Place module commands in this file inside the if block below.
#
# Note that you should only do one of the above.  You do not want
# to override the inherited module environment by having module
# commands outside of the if block.

if [ -z "$__BASHRC_SOURCED__" -a "$ENVIRONMENT" != BATCH ]; then
  export __BASHRC_SOURCED__=1

  ##################################################################
  # **** PLACE MODULE COMMANDS HERE and ONLY HERE.              ****
  ##################################################################

  module load gcc
  module load git

  # compute nodes don't have unzip...
  which unzip > /dev/null 2> /dev/null
  if [[ $? -ne 0 ]]; then
    module load unzip
    module load bzip2
  fi

fi
############
# SECTION 2
#
# Please set or modify any environment variables inside the if block
# below.  For example, modifying PATH or other path like variables
# (e.g LD_LIBRARY_PATH), the guard variable (__PERSONAL_PATH___)
# prevents your PATH from having duplicate directories on sub-shells.

if [ -z "$__PERSONAL_PATH__" ]; then
  export __PERSONAL_PATH__=1

  ###################################################################
  # **** PLACE Environment Variables including PATH here.        ****
  ###################################################################


  export JAVA_HOME=$WORK/jdk-22.0.1
  export PATH=$HOME/bin:$JAVA_HOME/bin:$PATH

  # https://github.com/opensha/ucerf3-etas-launcher/tree/master/parallel/README.md
  export ETAS_LAUNCHER=$WORK/ucerf3/ucerf3-etas-launcher
  export ETAS_SIM_DIR=$WORK/ucerf3/u3etas_sim
  export ETAS_MEM_GB=5 # this will be overridden in batch scripts for parallel jobs, set low enough so that the regular U3ETAS scripts can run on the login node to configure jobs
  export MPJ_HOME=$WORK/mpj-express
  export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin/:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH

fi

There is no OpenJDK module available on TACC systems, nor any existing MPJ Express installation.

Install Java from the tarball at https://www.oracle.com/java/technologies/javase/jdk22-archive-downloads.html and extract it with tar -xzvf into $WORK/jdk-22.0.1.

Clone Kevin's fork of MPJ Express into $WORK/mpj-express:

git clone https://github.com/kevinmilner/mpj-express

Unlike on Expanse, an MPJ wrapper script was not necessary; just ensure you configure the slurm script and bashrc correctly.

Connect to an interactive node by running idev, which defaults to 30 minutes on the default queue. srun is also available on Frontera, but idev is preferred per the Frontera User Guide.

idev -A EAR20006 -p flex -N 1 -n 4 -m 30

From within an interactive node, build the configuration with desired number of catalogs, NUM_SIM and nodes, NUM_NODE.

cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site TACC_FRONTERA --nodes $NUM_NODE --hours 24 --queue normal --output-dir $ETAS_SIM_DIR/frontera-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM

Navigate to the generated config's etas_sim_mpj.slurm and make the following changes. Note that this step will not be necessary after we update the default slurm script for this hpc-site in the ucerf3-etas-launcher code.

Set SBATCH

#SBATCH -t 24:00:00
#SBATCH --nodes 14
#SBATCH --ntasks 14
#SBATCH --cpus-per-task=56
#SBATCH --partition normal
#SBATCH --mem 0

Don't use FastMPJ; use MPJ Express. Use your own MPJ_HOME path, not mine.

# FMPJ_HOME directory, fine to use mine
#FMPJ_HOME=/home1/00950/kevinm/FastMPJ
MPJ_HOME=/work2/10177/bhatthal/frontera/mpj-express

Update PATH to use MPJ_HOME instead of FMPJ_HOME

export PBS_NODEFILE=$NEW_NODEFILE
export PATH=$PATH:$MPJ_HOME/bin

Add timers and call MPJ Express instead:

t1=$(date +%s) # epoch start time in seconds

date
echo "RUNNING MPJ"
mpjrun_errdetect_wrapper.sh -machinefile $PBS_NODEFILE -np $NP -dev niodev -Djava.library.path=$MPJ_HOME/lib -Xmx${MEM_GIGS}G -cp $JAR_FILE -class scratch.UCERF3.erf.ETAS.launcher.MPJ_ETAS_Launcher --min-dispatch $MIN_DISPATCH --max-dispatch $MAX_DISPATCH --threads $THREADS $TEMP_OPTION $SCRATCH_OPTION $CLEAN_OPTION --end-time `scontrol show job $SLURM_JOB_ID | egrep --only-matching 'EndTime=[^ ]+' | cut -c 9-` $ETAS_CONF_JSON
ret=$?
date

t2=$(date +%s) # epoch end time in seconds
numSec=$(echo $t2 - $t1 | bc -q ) # the number of seconds the process took.
runTime=$(date -ud @$numSec +%T) # Convert the seconds into Hours:Mins:Sec
echo "Time to build: $runTime ($numSec seconds)"

exit $ret

Make sure you’re not on an interactive compute node. From a login node, execute slurm_submit.sh etas_sim_mpj.slurm.

Stampede3

In this example we run 14 nodes on Icelake (ICX), Skylake (SKX), and Sapphire Rapids (SPR). Refer to the Stampede3 User Guide for the full list of queues.

Just like on Frontera, set the user bashrc with modules and paths.

Create an interactive session with idev; in my case I was issued 1 node with 48 tasks per node on skx-dev (Skylake) using project DS-Sybershake.

I noticed that the TACC_STAMPEDE3 enum constant is missing from ucerf3-etas-launcher, even though it is present in the OpenSHA repository: https://github.com/opensha/opensha/blob/9df7200b6ed8984b9024a67f81ad630da8278a92/src/main/java/scratch/UCERF3/erf/ETAS/launcher/util/ETAS_ConfigBuilder.java#L47

This configuration uses FastMPJ anyway, so when we eventually transition to using my configuration in the OpenSHA repository, we’ll update ucerf3-etas-launcher to bundle the latest opensha-all.jar with the TACC_STAMPEDE3 enum instead of the deprecated TACC_STAMPEDE2. We may also move USC_CARC into CARC_DISCOVERY and CARC_ENDEAVOUR, as their configurations are different. We’ll use the TACC_FRONTERA configuration for now.

ICX

cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site TACC_FRONTERA --nodes $NUM_NODE --hours 24 --queue normal --output-dir $ETAS_SIM_DIR/stampede3-icx-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM

SPR

cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site TACC_FRONTERA --nodes $NUM_NODE --hours 24 --queue normal --output-dir $ETAS_SIM_DIR/stampede3-spr-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM

SKX

cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site TACC_FRONTERA --nodes $NUM_NODE --hours 24 --queue normal --output-dir $ETAS_SIM_DIR/stampede3-skx-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM

For plotting results, use skx, as it is the cheapest at 1 SU per node-hour. Set -n 48. In each config's etas_sim_mpj.slurm, set -p to the name of each queue: icx, spr, and skx respectively. Take care to also set MEM_GIGS to less than the amount of RAM present on each node type. RAM per node and CPUs per node for each queue are listed in the System Architecture section of the Stampede3 User Guide.
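The thread counts in the Stampede3 table follow from the 8 GB-per-thread budget used in these runs, i.e. THREADS = MEM_GIGS / 8 for each queue:

```shell
# threads per node = MEM_GIGS / (8 GB RAM per thread), per queue
for q in "ICX 200" "SKX 144" "SPR 104"; do
  set -- $q                      # $1 = queue name, $2 = MEM_GIGS
  echo "$1: $(( $2 / 8 )) threads"
done
# prints ICX: 25, SKX: 18, SPR: 13 threads, matching the measurements table
```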

Here's an example slurm configuration for Skylake:

#!/bin/bash

#SBATCH -t 24:00:00
#SBATCH --nodes 14
#SBATCH --ntasks 14
#SBATCH --cpus-per-task=48
#SBATCH -p skx
#SBATCH --mem 0

######################
## INPUT PARAMETERS ##
######################

# the above '#SBATCH' lines are required, and are supposed to start with a '#'. They must be at the beginning of the file
# the '-t hh:mm:ss' argument is the wall clock time of the job
# the '--nodes 14' argument specifies the number of nodes required, in this case 14
# the '--ntasks 14' and '--cpus-per-task=48' arguments give one task per node, with no more cores than the 48 available per SKX node
# the '-p skx' argument specifies the queue, in this case the Skylake queue

## ETAS PARAMETERS ##

# path to the JSON configuration file
ETAS_CONF_JSON="/work2/10177/bhatthal/stampede3/ucerf3/u3etas_sim/stampede3-skx-comcat-ridgecrest-m7.1-n14-s10000/config.json"

## JAVA/MPJ PARAMETERS ##

# maximum memory in gigabytes. should be close to, but not over, total memory available
MEM_GIGS=144

# number of etas threads. should be approximately MEM_GIGS/5, and no more than the total number of threads available
THREADS=18

# FMPJ_HOME directory, fine to use mine
#FMPJ_HOME=/home1/00950/kevinm/FastMPJ
MPJ_HOME=/work2/10177/bhatthal/stampede3/mpj-express

# path to the opensha-ucerf3 jar file
JAR_FILE=${ETAS_LAUNCHER}/opensha/opensha-all.jar

# simulations are sent out in batches to each compute node. these parameters control the size of those batches
# smaller max size will allow for better checking of progress with watch_logparse.sh, but more wasted time at the end of batches waiting on a single calculation to finish
MIN_DISPATCH=$THREADS
MAX_DISPATCH=500

# this allows for catalogs to be written locally on each compute node in a temporary directory, then only copied back onto shared storage after they complete. this reduces I/O load, but makes it harder to track progress of individual simulations. comment this out to disable this option
TEMP_OPTION="--temp-dir /tmp/etas-results-tmp"

# this allows for the results directory to be hosted on a different filesystem, in this case the $SCRATCH filesystem. this will prevent many files from being written to $WORK, as well as reducing I/O load
SCRATCH_OPTION="--scratch-dir $SCRATCH/etas-results-tmp"

# this automatically deletes subdirectories of the results directory once a catalog has been successfully written to the master binary file. comment out to disable
CLEAN_OPTION="--clean"

##########################
## END INPUT PARAMETERS ##
##   DO NOT EDIT BELOW  ##
##########################

NEW_JAR="`dirname ${ETAS_CONF_JSON}`/`basename $JAR_FILE`"
cp $JAR_FILE $NEW_JAR
if [[ -e $NEW_JAR ]];then
        JAR_FILE=$NEW_JAR
fi

PBS_NODEFILE="/tmp/${USER}-hostfile-${SLURM_JOBID}"
echo "creating PBS_NODEFILE: $PBS_NODEFILE"
scontrol show hostnames $SLURM_NODELIST > $PBS_NODEFILE

NEW_NODEFILE="/tmp/${USER}-hostfile-fmpj-${PBS_JOBID}"
echo "creating PBS_NODEFILE: $NEW_NODEFILE"
hname=$(hostname)
if [ "$hname" == "" ]
then
  echo "Error getting hostname. Exiting"
  exit 1
else
  cat $PBS_NODEFILE | sort | uniq | fgrep -v $hname > $NEW_NODEFILE
fi

export PBS_NODEFILE=$NEW_NODEFILE
export PATH=$PATH:$MPJ_HOME/bin

JVM_MEM_MB=26624

t1=$(date +%s) # epoch start time in seconds

date
echo "RUNNING MPJ"
mpjrun_errdetect_wrapper.sh $PBS_NODEFILE -dev hybdev -Djava.library.path=$MPJ_HOME/lib -Xmx${MEM_GIGS}G -cp $JAR_FILE scratch.UCERF3.erf.ETAS.launcher.MPJ_ETAS_Launcher --min-dispatch $MIN_DISPATCH --max-dispatch $MAX_DISPATCH --threads $THREADS $TEMP_OPTION $SCRATCH_OPTION $CLEAN_OPTION --end-time `scontrol show job $SLURM_JOB_ID | egrep --only-matching 'EndTime=[^ ]+' | cut -c 9-` $ETAS_CONF_JSON
ret=$?
date

t2=$(date +%s) # epoch end time in seconds
numSec=$(echo $t2 - $t1 | bc -q ) # the number of seconds the process took.
runTime=$(date -ud @$numSec +%T) # Convert the seconds into Hours:Mins:Sec
echo "Time to build: $runTime ($numSec seconds)"

exit $ret

Conclusions

NOTE: This section is still under development

After running 100k-catalog simulations across various HPC systems with configurations as comparable as resources permit, I can recommend the following configurations.

  1. Best Value - System with greatest performance over cost: TODO
  2. Cheapest - Overall cheapest system to run in terms of service units charged: TODO
  3. Best Performance - Fastest runtime regardless of cost: Expanse (100k catalogs, 32 nodes, 41.8 minutes)