Difference between revisions of "UCERF3-ETAS Measurements"

Revision as of 22:10, 23 August 2024

Page is under active construction, some sections may be incomplete. Aug 13 2024 - bhatthal@usc.edu

This page summarizes a performance study of UCERF3-ETAS run locally in Docker and on CARC Discovery, SDSC Expanse, and TACC Frontera. The study evaluates resource requirements for single-node and multiple-node simulations of the Ridgecrest M7.1 ETAS forecast (ci38457511).

Installation and Configuration

Running ETAS simulations on OpenSHA is simplified through a collection of launcher binaries and scripts called ucerf3-etas-launcher. Installation and configuration vary across systems, but the fundamentals remain the same. Running simulations always occurs in three phases:

  1. Building configuration files for a specified event, where we configure MPI nodes and number of simulations
  2. Launching the simulation with the configuration files
  3. Consolidating and plotting simulation data

Docker

TODO(bhatthal)

  • Explain Docker installation, DockerHub and Dockerfile set up
  • Docker resource provisioning
  • Access Jupyter notebook terminal via web client
  • Config generation with u3etas_comcat_event_config_builder. Explain hpc-site and slurm file vs config.json
  • u3etas_launcher.sh and plot generator

Discovery

Establish a Discovery SSH or CARC OnDemand connection and clone the ucerf3-etas-launcher GitHub repository at the path /project/scec_608/$USER/ucerf3/ucerf3-etas-launcher, where $USER is your username.

Edit the bashrc file at $HOME/.bashrc to update the PATH to include the downloaded ETAS scripts and load HPC modules necessary to run ETAS in a multiple-node environment.

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# Switch groups, but only if necessary
if [[ `id -gn` != "scec_608" && $- =~ i ]]
then
#    echo "switching group"
    newgrp scec_608
    exit
fi

PATH=$PATH:$HOME/.local/bin:$HOME/bin
export TERM=linux

## MODULES
module load usc # this is loaded by default on login nodes, but not on compute nodes, so we need to add 'usc' here so that the subsequent modules will work
module load gcc/11.3
module load openjdk
module load git
module load vim
# every once in a while CARC breaks java, and we need this to avoid unsatisfied link errors
# if you get them looking related to libawt_xawt.so: libXext.so.6 or similar, uncomment the following
# previously encountered and then went away, but came back after the May 2024 maintenance window
module load libxtst # no clue why we suddenly needed this to avoid a weird JVM unsatisfied link exception

# compute nodes don't have unzip...
which unzip > /dev/null 2> /dev/null
if [[ $? -ne 0 ]];then
        module load unzip
        module load bzip2
fi

## https://github.com/opensha/ucerf3-etas-launcher/tree/master/parallel/README.md
export PROJFS=/project/scec_608/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/etas_sim
export ETAS_MEM_GB=5 # this will be overridden in batch scripts for parallel jobs, set low enough so that the regular U3ETAS scripts can run on the login node to configure jobs
export MPJ_HOME=/project/scec_608/kmilner/mpj/mpj-current
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin/:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH

if [[ `hostname` == e19-* ]];then
        # on a compute node in the SCEC queue
        export OPENSHA_MEM_GB=50
        export FST_HAZARD_SPACING=0.2
        export OPENSHA_JAR_DISABLE_UPDATE=1
elif [[ -n "$SLURM_JOB_ID" ]];then
        # on a compute node otherwise
        export OPENSHA_JAR_DISABLE_UPDATE=1
        unset OPENSHA_MEM_GB
else
        export OPENSHA_MEM_GB=10
fi
export OPENSHA_FST=/project/scec_608/kmilner/git/opensha-fault-sys-tools
export OPENSHA_FS_GIT_BRANCH=master
export PATH=$PATH:$OPENSHA_FST/sbin

You'll notice that the bashrc contains references to the user "kmilner". Do not change these. The files at those paths are readable by other users and are necessary for running ETAS. There are plans to migrate much of this code out of the user bashrc file and into an MPJ Express wrapper script, to improve portability and simplify the configuration process.

After editing the bashrc file, either log out and log back in or run source ~/.bashrc to load the new changes.

The launcher scripts provide access to an interactive compute node, so configuration files can be built directly on Discovery instead of being built locally and transferred over SCP/SFTP. Non-trivial jobs cannot be executed on the head node, which is why the configuration is built on a compute node. Run slurm_interactive.sh and wait for resources to be provisioned, then build the configuration inside the interactive compute node with

cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site USC_CARC --nodes $NUM_NODE --hours 24

where $NUM_SIM is the number of simulations to run and $NUM_NODE is the number of nodes to utilize.

The generated configuration requires some manual tweaking prior to execution. Navigate to the generated simulation directory. You'll notice that, unlike the local Docker runs, we have a Slurm file.

Slurm files invoke the launcher script with the config JSON file through an MPJ wrapper. MPJ, or Message Passing in Java, is used to distribute work in parallel across the HPC compute nodes. The Slurm file also specifies the number of nodes and other parameters relevant to work distribution. In your simulation folder you should see the files "etas_sim_mpj.slurm", "plot_results.slurm", and "opensha-all.jar". If the jar file failed to copy, copy it manually from ${ETAS_LAUNCHER}/opensha/opensha-all.jar. Make the following changes to etas_sim_mpj.slurm:

  • Rename partition from scec -> main: #SBATCH -p main
  • Ensure the ETAS JSON config path is prefixed by simulation directory: ETAS_CONF_JSON="${ETAS_SIM_DIR}/...
  • Set GB of RAM per node close to max. I set MEM_GIGS=150 and THREADS=30. You can try a higher value.
  • Update scratch directory from scratch2 -> scratch1: SCRATCH_OPTION="--scratch-dir /scratch1/$USER/etas_scratch"
  • If this simulation is on a single-node, don't invoke the MPJ wrapper:
date
echo "RUNNING ETAS-LAUNCHER"
u3etas_launcher.sh --threads $THREADS $ETAS_CONF_JSON
ret=$?
date
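The simple substitutions in the list above (partition, memory, threads, scratch path) could be scripted. The following is a minimal sketch, not part of ucerf3-etas-launcher: patch_slurm is a hypothetical helper, and it assumes the generated file contains lines of the form "#SBATCH -p scec", "MEM_GIGS=<n>", "THREADS=<n>", and a scratch path under /scratch2/. Check the generated file before relying on it. The ETAS_CONF_JSON prefix and the single-node change are left as manual edits.

```shell
#!/usr/bin/env bash
# patch_slurm: hypothetical helper, not part of ucerf3-etas-launcher.
# ASSUMPTION: the generated Slurm file contains lines shaped like
#   #SBATCH -p scec
#   MEM_GIGS=<n>
#   THREADS=<n>
# and a scratch path under /scratch2/. Lines that do not match are untouched.
patch_slurm() {
    local file="$1" tmp
    tmp=$(mktemp) || return 1
    sed -e 's/^#SBATCH -p scec$/#SBATCH -p main/' \
        -e 's/^MEM_GIGS=.*/MEM_GIGS=150/' \
        -e 's/^THREADS=.*/THREADS=30/' \
        -e 's|/scratch2/|/scratch1/|' \
        "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the original file's permissions are preserved.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage (run inside the generated simulation directory):
#   patch_slurm etas_sim_mpj.slurm
```

The values 150 and 30 are the ones used in this study. Adjust them in the function if you try a higher memory setting.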

Additionally, inside the config.json file, update the "outputDir" to be prefixed with "${ETAS_SIM_DIR}/" prior to the output name, to prevent the creation of a duplicate folder.
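As a rough sketch of the "outputDir" edit, the prefix can be added with sed. The helper name prefix_output_dir is hypothetical. The sketch assumes the key appears on one line as "outputDir": "<relative path>", and it skips values that already start with "$" or "/".

```shell
#!/usr/bin/env bash
# prefix_output_dir: hypothetical helper, not part of ucerf3-etas-launcher.
# ASSUMPTION: config.json holds the key on a single line, e.g.
#   "outputDir": "some-relative-directory",
# Values already starting with '$' or '/' are left unchanged.
prefix_output_dir() {
    local file="$1" tmp
    tmp=$(mktemp) || return 1
    # Single quotes keep ${ETAS_SIM_DIR} literal, as the text above describes.
    sed -E 's|("outputDir"[[:space:]]*:[[:space:]]*")([^$/])|\1${ETAS_SIM_DIR}/\2|' \
        "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage (run inside the generated simulation directory):
#   prefix_output_dir config.json
```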

After making the necessary changes, place the ETAS simulation on the job queue by running slurm_submit.sh etas_sim_mpj.slurm. You can rename the Slurm file before submission to set the job name, which makes jobs easier to manage. Stdout and stderr are written to the files {JOB}.o{ID} and {JOB}.e{ID}, respectively. Runtime is derived from the timestamps in the output file. Results are written to either a results/ directory or the binary "results_complete.bin".
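The runtime calculation can be sketched as follows, assuming the two timestamps are the lines printed by the date commands in the Slurm script and that GNU date (with the -d option) is available, as it typically is on Linux clusters. The helper name elapsed_minutes and the example timestamps are illustrative only and are not taken from a real run.

```shell
#!/usr/bin/env bash
# elapsed_minutes: hypothetical helper, not part of ucerf3-etas-launcher.
# ASSUMPTION: GNU date is available (its -d option parses the timestamp text).
elapsed_minutes() {
    local start_s end_s
    start_s=$(date -d "$1" +%s) || return 1
    end_s=$(date -d "$2" +%s) || return 1
    # Convert the difference in seconds to minutes with one decimal place.
    awk -v s="$start_s" -v e="$end_s" 'BEGIN { printf "%.1f\n", (e - s) / 60 }'
}

# Made-up timestamps in the default `date` output format.
elapsed_minutes "Fri Aug 23 10:00:00 UTC 2024" "Fri Aug 23 10:10:30 UTC 2024"
```

Copy the first and last timestamp lines from the {JOB}.o{ID} file into the two arguments.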

Generate plots with "plot_results.slurm". Similarly, you must also update the partition name from "scec" to "main" and submit the job with slurm_submit.sh. View final plots in the generated "index.html". If you do not have a graphical session, you may need to download the simulation folder to view plots locally.

Expanse

TODO(bhatthal)

Frontera

TODO(bhatthal)


Performance Results

In the tables below, service units were computed by dividing the runtime in minutes by 60 and multiplying by the number of nodes used. We observe that as the number of nodes increased, runtime decreased linearly while the allocation usage remained mostly flat.
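The service unit formula can be checked with a short shell function. This is a minimal sketch and service_units is a hypothetical helper name. The inputs are runtimes and node counts from the Discovery table below.

```shell
#!/usr/bin/env bash
# service_units: hypothetical helper implementing the formula described above,
#   SUs = (runtime in minutes / 60) * number of nodes
service_units() {
    local runtime_min="$1" nodes="$2"
    awk -v m="$runtime_min" -v n="$nodes" 'BEGIN { printf "%.2f\n", m / 60 * n }'
}

service_units 10.5 32     # 32-node, 10,000-catalog Discovery run
service_units 1312.3 1    # single-node, 10,000-catalog Discovery run
```

The tables report these values rounded to about two significant figures.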

Docker Measurements
Number of Nodes | Number of Catalogs | Runtime (min) | Service Units (SUs) Used
1               | 10                 | 1.3           | 2.2E-2
1               | 100                | 4.4           | 7.3E-2
1               | 1000               | 25.8          | 0.43

Dockerized local runs used a resource allocation of 14 CPUs (1 thread per CPU), 96 GB RAM, 1 GB swap, and 64 GB disk. u3etas_launcher uses 80% of available RAM (ETAS_MEM_GB=75).


Discovery Measurements
Number of Nodes | Number of Catalogs | Runtime (min) | Service Units (SUs) Used
1               | 10                 | 67.9          | 1.1
1               | 100                | 89.6          | 1.5
1               | 1000               | 472.7         | 7.9
1               | 10,000             | 1312.3        | 22
32              | 10                 | 0.78          | 0.42
32              | 100                | 1.2           | 0.64
32              | 1000               | 3.9           | 2.08
32              | 10,000             | 10.5          | 5.6

Single-node runs did not override the ETAS_MEM_GB=5 setting from the bashrc.

```TODO(bhatthal): Consider rerunning single-node with ETAS_MEM_GB=32```


Expanse

Single-node runs override ETAS_MEM_GB=32 TODO(bhatthal)

Frontera

TODO(bhatthal)

Conclusion

TODO(bhatthal)