UCERF3-ETAS Measurements
Revision as of 16:35, 16 September 2024
Page is under active construction; some sections may be incomplete. Aug 13 2024 - bhatthal@usc.edu
This page summarizes the performance study of UCERF3-ETAS run locally in Docker and on CARC Discovery, SDSC Expanse, and TACC Frontera. The study evaluates resource requirements for single-node and multiple-node simulations of the Ridgecrest M7.1 ETAS forecast (ci38457511).
Performance Results
In the tables below, "Core Hours" reflects the ACCESS SUs used. It is computed by dividing the runtime in minutes by 60, multiplying by the number of nodes used, and then multiplying by the number of CPU cores available per node. Because we are charged for the full node regardless of CPU utilization, this metric does not reflect how many cores within a node are actually used. The "Cores / Node" column reports the average number of CPU cores available per node, derived by running scontrol show node -a <node_list> over the list of nodes allocated to each job. "RAM / Node" is the RAM made available to ETAS through either ETAS_MEM_GB or MEM_GIGS, not the total RAM on a given node.
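As a minimal sketch of the "Core Hours" arithmetic described above, the function below applies the formula directly. The name core_hours is a hypothetical helper for illustration and is not part of ucerf3-etas-launcher.

```shell
# Core hours = (runtime in minutes / 60) * nodes * cores per node.
# core_hours is a hypothetical helper, not part of ucerf3-etas-launcher.
# Usage: core_hours RUNTIME_MIN NUM_NODES CORES_PER_NODE
core_hours() {
    awk -v t="$1" -v n="$2" -v c="$3" \
        'BEGIN { printf "%.1f\n", t / 60 * n * c }'
}

# Example: a 90.5 minute run on 14 nodes with 128 cores each.
core_hours 90.5 14 128
```

The tables report these values rounded to two significant figures, so the example corresponds to the 2.7E3 entry for the 14-node, 100,000-catalog Expanse run with scratch enabled.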
Number of Nodes | Cores / Node | RAM / Node (GB) | Number of Catalogs | Runtime (min) | Core Hours |
---|---|---|---|---|---|
1 | 14 | 75 | 10 | 1.3 | 0.30 |
1 | 14 | 75 | 100 | 4.4 | 1.0 |
1 | 14 | 75 | 1000 | 25.8 | 6.0 |
1 | 14 | 75 | 10,000 | 286.3 | 67 |
Dockerized local runs with a resource allocation of 14 CPU (@ 1 thread / CPU), 96GB RAM, 1GB Swap, 64GB Disk. u3etas_launcher uses 80% of available RAM, ETAS_MEM_GB=75.
Single-node measurements below were collected using ETAS_MEM_GB=32. Multi-node measurements automatically override the default value.
Discovery is a heterogeneous system: not all nodes within the same partition have the same number of cores. The "Cores / Node" column is therefore the average number of cores over the nodes assigned to the job.
Number of Nodes | Cores / Node | RAM / Node (GB) | Threads / Node | Total Threads | RAM / Thread | Scratch Enabled | Number of Catalogs | Runtime (min) | Core Hours |
---|---|---|---|---|---|---|---|---|---|
1 | 24 | 32 | 30 | 30 | 1.1 | Y | 10 | 1.7 | 0.68 |
1 | 24 | 32 | 30 | 30 | 1.1 | Y | 100 | 157.2 | 63 |
1 | 20 | 32 | 30 | 30 | 1.1 | Y | 1000 | 201.4 | 67 |
1 | 24 | 32 | 30 | 30 | 1.1 | Y | 10,000 | 424.8 | 1.7E2 |
14 | 20 | 50 | 10 | 140 | 5 | Y | 10 | 2.9 | 14 |
14 | 46.86 (8x 64, 6x 24) | 50 | 10 | 140 | 5 | Y | 100 | 2.8 | 31 |
14 | 55.43 (11x 64, 3x 24) | 50 | 10 | 140 | 5 | Y | 1000 | 3.6 | 49 |
14 | 52.57 (10x 64, 4x 24) | 50 | 10 | 140 | 5 | Y | 10,000 | 17.2 | 2.1E2 |
14 | 20 | 50 | 10 | 140 | 5 | Y | 100,000 | 228.1 | 1.1E3 |
32 | 64 | 50 | 10 | 320 | 5 | Y | 10 | 0.78 | 27 |
32 | 64 | 50 | 10 | 320 | 5 | Y | 100 | 1.2 | 41 |
32 | 60.25 (29x 64, 3x 24) | 50 | 10 | 320 | 5 | Y | 1000 | 3.9 | 1.3E2 |
32 | 50.75 (22x 64, 4x 24, 6x 20) | 50 | 10 | 320 | 5 | Y | 10,000 | 10.5 | 2.8E2 |
32 | 27.50 (4x 64, 16x 24, 12x 20) | 50 | 10 | 320 | 5 | Y | 100,000 | 99.2 | 1.5E3 |
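The averaged "Cores / Node" values in the Discovery table can be reproduced from the node mix shown in parentheses. The sketch below assumes the mix is already known (for example, read off scontrol output by hand); avg_cores and its COUNTxCORES argument format are hypothetical conveniences for illustration.

```shell
# avg_cores: weighted average of cores per node.
# Each argument is COUNTxCORES, e.g. 8x64 means eight 64-core nodes.
# Hypothetical helper; the argument format is an assumption of this sketch.
avg_cores() {
    printf '%s\n' "$@" | awk -F'x' '
        { nodes += $1; cores += $1 * $2 }
        END { printf "%.2f\n", cores / nodes }'
}

# Example: a 14-node job made up of eight 64-core and six 24-core nodes.
avg_cores 8x64 6x24
```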
The measurements below were collected on Expanse, where every node has 128 cores.
Number of Nodes | Cores / Node | RAM / Node (GB) | Threads / Node | Total Threads | RAM / Thread | Scratch Enabled | Number of Catalogs | Runtime (min) | Core Hours |
---|---|---|---|---|---|---|---|---|---|
1 | 128 | 32 | 10 | 10 | 3.2 | | 10 | 2.9 | 6.2 |
1 | 128 | 32 | 10 | 10 | 3.2 | | 100 | 10.4 | 22 |
1 | 128 | 32 | 10 | 10 | 3.2 | | 1000 | 22.6 | 48 |
1 | 128 | 32 | 10 | 10 | 3.2 | | 10,000 | 207.7 | 4.4E2 |
1 | 128 | 220 | 44 | 44 | 5 | Y | 10 | 1.0 | 2.1 |
1 | 128 | 220 | 44 | 44 | 5 | Y | 100 | 2.4 | 5.1 |
1 | 128 | 220 | 44 | 44 | 5 | Y | 1000 | 14.1 | 30 |
1 | 128 | 200 | 40 | 40 | 5 | Y | 10,000 | 67.1 | 1.4E2 |
14 | 128 | 50 | 10 | 140 | 5 | | 10 | 1.8 | 54 |
14 | 128 | 50 | 10 | 140 | 5 | | 100 | 2.1 | 63 |
14 | 128 | 50 | 10 | 140 | 5 | | 1000 | 5.4 | 1.6E2 |
14 | 128 | 50 | 10 | 140 | 5 | | 10,000 | 18.9 | 5.6E2 |
14 | 128 | 50 | 10 | 140 | 5 | | 100,000 | 162.4 | 4.9E3 |
14 | 128 | 200 | 40 | 560 | 5 | Y | 10 | 1.7 | 51 |
14 | 128 | 200 | 40 | 560 | 5 | Y | 100 | 2.2 | 66 |
14 | 128 | 200 | 40 | 560 | 5 | Y | 1000 | 4.1 | 1.2E2 |
14 | 128 | 200 | 40 | 560 | 5 | Y | 10,000 | 15.3 | 4.6E2 |
14 | 128 | 200 | 40 | 560 | 5 | Y | 100,000 | 90.5 | 2.7E3 |
14 | 128 | 224 | 14 | 196 | 16 | Y | 10,000 | 15.6 | 4.7E2 |
32 | 128 | 50 | 10 | 320 | 5 | | 10 | 2.1 | 1.4E2 |
32 | 128 | 50 | 10 | 320 | 5 | | 100 | 2.3 | 1.6E2 |
32 | 128 | 50 | 10 | 320 | 5 | | 1000 | 3.4 | 2.3E2 |
32 | 128 | 50 | 10 | 320 | 5 | | 10,000 | 11.3 | 7.7E2 |
32 | 128 | 50 | 10 | 320 | 5 | | 100,000 | 74.8 | 5.1E3 |
32 | 128 | 200 | 40 | 1280 | 5 | Y | 10 | 2.0 | 1.4E2 |
32 | 128 | 200 | 40 | 1280 | 5 | Y | 100 | 2.7 | 1.8E2 |
32 | 128 | 200 | 40 | 1280 | 5 | Y | 1000 | 2.7 | 1.8E2 |
32 | 128 | 200 | 40 | 1280 | 5 | Y | 10,000 | 8.8 | 6.0E2 |
32 | 128 | 200 | 40 | 1280 | 5 | Y | 100,000 | 41.8 | 2.9E3 |
32 | 128 | 224 | 14 | 196 | 16 | Y | 100,000 | 122.2 | 3.6E3 |
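The "Total Threads" and "RAM / Thread" columns appear to be derived from the other columns: threads per node times the number of nodes, and RAM per node divided by threads per node. A minimal sketch of that arithmetic follows; both function names are hypothetical and the derivation is inferred from the table values rather than stated by the launcher documentation.

```shell
# total_threads NUM_NODES THREADS_PER_NODE
# Hypothetical helper: total threads across all nodes.
total_threads() {
    echo $(( $1 * $2 ))
}

# ram_per_thread RAM_PER_NODE_GB THREADS_PER_NODE
# Hypothetical helper: GB of RAM available to each thread.
ram_per_thread() {
    awk -v ram="$1" -v threads="$2" 'BEGIN { printf "%.1f\n", ram / threads }'
}

# Example: 14 nodes, 224 GB and 14 threads per node.
total_threads 14 14
ram_per_thread 224 14
```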
TODO(bhatthal): I haven't started Frontera runs yet, although I have confirmed that I can run a Slurm job for /bin/date.
Installation and Configuration
Running ETAS simulations on OpenSHA is simplified by a collection of launcher binaries and scripts called ucerf3-etas-launcher. Installation and configuration vary across systems, but the foundations remain the same. Running simulations always occurs in three phases:
- Building configuration files for a specified event, where we configure MPI nodes and number of simulations
- Launching the simulation with the configuration files
- Consolidating and plotting simulation data
Docker
When running UCERF3-ETAS simulations locally, Docker provides a consistent environment without the need to manage dependencies and makes it easy to provision resources. Download the Docker image for the M7.1 Ridgecrest main shock by running docker pull sceccode/ucerf3_jup or by searching for "ucerf3_jup" in Docker Desktop. I prefer to use Docker Desktop, but the command line is sufficient.
Under Docker Desktop settings, I allocated 14 CPUs, 96GB of RAM, 1GB of Swap, and 64GB of disk storage to the Docker environment.
Open a terminal on a system with the Docker CLI installed and run docker run -d -p 8888:8888 --name ucerf3-etas sceccode/ucerf3_jup. This starts a container named ucerf3-etas and forwards port 8888 for the Jupyter Notebook server.
From here, you can navigate to the Jupyter Notebook web application at http://localhost:8888 to access an interactive terminal for the container. Alternatively, you can run the container directly in Docker Desktop and navigate to the "Exec" tab to access the terminal without needing Jupyter Notebook or port forwarding.
Once inside your container, use the following workflow to run local simulations and plot data, where $NUM_SIM is the number of simulations desired.
u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --nodes 1 --hours 24 --output-dir target/docker-comcat-ridgecrest-m7.1-n1-s$NUM_SIM
u3etas_launcher.sh $HOME/target/docker-comcat-ridgecrest-m7.1-n1-s${NUM_SIM}/config.json | tee target/docker-comcat-ridgecrest-m7.1-n1-s${NUM_SIM}/u3etas_launcher.log
u3etas_plot_generator.sh $HOME/target/docker-comcat-ridgecrest-m7.1-n1-s${NUM_SIM}/config.json
You'll notice that we didn't specify an hpc-site parameter. Because we are running these simulations locally rather than on a High Performance Computing system, we don't need to define a site to generate a corresponding Slurm file. Instead of passing a Slurm file to sbatch, we execute the launcher directly with the generated config.json file. I also pipe the output into the tee command to capture it for logging purposes.
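The builder, launcher, and plot generator must all point at the same simulation directory, so it can help to build the directory name once and reuse it. The sketch below follows the naming pattern used on this page; sim_dir_name and SIM_DIR are hypothetical names introduced only for illustration.

```shell
# sim_dir_name SYSTEM NUM_NODE NUM_SIM
# Hypothetical helper that builds the directory name used on this page,
# e.g. docker-comcat-ridgecrest-m7.1-n1-s1000.
sim_dir_name() {
    printf '%s-comcat-ridgecrest-m7.1-n%s-s%s\n' "$1" "$2" "$3"
}

NUM_SIM=1000
SIM_DIR="target/$(sim_dir_name docker 1 "$NUM_SIM")"

# Every step can then reference the same path:
echo "$SIM_DIR/config.json"
echo "$SIM_DIR/u3etas_launcher.log"
```

With SIM_DIR defined this way, the same variable can be passed to --output-dir and reused for the config.json and log paths.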
In Docker Desktop, you can navigate to the "Volumes" tab to find the stored data for the containers and download them onto your host system.
Discovery
Establish a Discovery SSH or CARC OnDemand connection and clone the ucerf3-etas-launcher GitHub repository to the path /project/scec_608/$USER/ucerf3/ucerf3-etas-launcher, where $USER is your username.
Edit the bashrc file at $HOME/.bashrc to update the PATH to include the downloaded ETAS scripts and to load the HPC modules necessary to run ETAS in a multiple-node environment.
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# Switch groups, but only if necessary
if [[ `id -gn` != "scec_608" && $- =~ i ]]
then
    # echo "switching group"
    newgrp scec_608
    exit
fi

PATH=$PATH:$HOME/.local/bin:$HOME/bin
export TERM=linux

## MODULES
module load usc # this is loaded by default on login nodes, but not on compute nodes, so we need to add 'usc' here so that the subsequent modules will work
module load gcc/11.3
module load openjdk
module load git
module load vim
# every once in a while CARC breaks java, and we need this to avoid unsatisfied link errors
# if you get them looking related to libawt_xawt.so: libXext.so.6 or similar, uncomment the following
# previously encountered and then went away, but came back after may 2024 maintenance window
module load libxtst # no clue why we suddenly needed this to avoid a weird JVM unsatisfied link exception

# compute nodes don't have unzip...
which unzip > /dev/null 2> /dev/null
if [[ $? -ne 0 ]];then
    module load unzip
    module load bzip2
fi

## https://github.com/opensha/ucerf3-etas-launcher/tree/master/parallel/README.md
export PROJFS=/project/scec_608/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/etas_sim
export ETAS_MEM_GB=5 # this will be overridden in batch scripts for parallel jobs, set low enough so that the regular U3ETAS scripts can run on the login node to configure jobs
export MPJ_HOME=/project/scec_608/kmilner/mpj/mpj-current
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin/:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH

if [[ `hostname` == e19-* ]];then
    # on a compute node in the SCEC queue
    export OPENSHA_MEM_GB=50
    export FST_HAZARD_SPACING=0.2
    export OPENSHA_JAR_DISABLE_UPDATE=1
elif [[ -n "$SLURM_JOB_ID" ]];then
    # on a compute node otherwise
    export OPENSHA_JAR_DISABLE_UPDATE=1
    unset OPENSHA_MEM_GB
else
    export OPENSHA_MEM_GB=10
fi

export OPENSHA_FST=/project/scec_608/kmilner/git/opensha-fault-sys-tools
export OPENSHA_FS_GIT_BRANCH=master
export PATH=$PATH:$OPENSHA_FST/sbin
You'll notice that the bashrc contains references to the user "kmilner". Do not change these: the files at those paths are readable by other users and are necessary for running ETAS. There are plans to migrate much of this code out of the user bashrc file and into an MPJ Express wrapper script to improve portability and simplify the configuration process.
After editing the bashrc file, either log out and log back in or run source ~/.bashrc to load the new changes.
Using the launcher scripts, you can access an interactive compute node to build configuration files directly on Discovery, as opposed to building them locally and transferring them over SCP/SFTP. Non-trivial jobs cannot be executed on the head node, which is why configuration files are built this way. Do so now by running slurm_interactive.sh. After waiting for resource provisioning, build the configuration inside the interactive compute node with
cd $ETAS_SIM_DIR && u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site USC_CARC --nodes $NUM_NODE --hours 24 --output-dir $ETAS_SIM_DIR/discovery-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM
where $NUM_SIM is the number of simulations to run and $NUM_NODE is the number of nodes to utilize.
The generated configuration requires some manual tweaking prior to execution. Navigate to the generated simulation directory. You'll notice that, unlike the local Docker runs, we now have a Slurm file.
Slurm files invoke the launcher script with the config JSON file through an MPJ wrapper. MPJ, or Message Passing in Java, is utilized to enable the parallel distribution of work across the HPC compute nodes. The Slurm file also specifies the number of nodes and other parameters relevant to work distribution.
In your simulation folder you should see the files "etas_sim_mpj.slurm", "plot_results.slurm", and "opensha-all.jar". If the jar file failed to copy, copy it manually from ${ETAS_LAUNCHER}/opensha/opensha-all.jar
Make the following changes to etas_sim_mpj.slurm:
- Rename partition from scec -> main:
#SBATCH -p main
- Ensure the ETAS JSON config path is prefixed by simulation directory:
ETAS_CONF_JSON="${ETAS_SIM_DIR}/...
- Update scratch directory from scratch2 -> scratch1:
SCRATCH_OPTION="--scratch-dir /scratch1/$USER/etas_scratch"
- If this simulation is on a single-node, don't invoke the MPJ wrapper:
date
echo "RUNNING ETAS-LAUNCHER"
u3etas_launcher.sh --threads $THREADS $ETAS_CONF_JSON
ret=$?
date
Additionally, inside the config.json file, update the "outputDir" to be prefixed with "${ETAS_SIM_DIR}/" prior to the output name, to prevent the creation of a duplicate folder.
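The partition and scratch edits listed above can also be applied non-interactively. The sketch below is only an illustration: it assumes the generated file contains a line beginning with "#SBATCH -p scec" and a scratch path under /scratch2/, which you should confirm in your own file first. patch_slurm is a hypothetical helper, and the JSON path prefix and single-node changes still need to be made by hand.

```shell
# patch_slurm INPUT OUTPUT
# Hypothetical helper: rewrite the partition (scec -> main) and the scratch
# path (/scratch2/ -> /scratch1/) and write the result to OUTPUT.
# Assumes the input uses "#SBATCH -p scec" and a /scratch2/ path.
patch_slurm() {
    sed -e 's/^\(#SBATCH -p \)scec/\1main/' \
        -e 's#/scratch2/#/scratch1/#' \
        "$1" > "$2"
}

# Example usage (review the result before submitting):
#   patch_slurm etas_sim_mpj.slurm etas_sim_mpj.patched.slurm
#   diff etas_sim_mpj.slurm etas_sim_mpj.patched.slurm
```

Writing to a separate output file keeps the generated original available for comparison.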
After making the necessary changes, place the ETAS simulation on the job queue by running slurm_submit.sh etas_sim_mpj.slurm. You can rename the Slurm file prior to submission to set the job name, which makes jobs easier to manage. Stdout and stderr are written to the files {JOB}.o{ID} and {JOB}.e{ID}, respectively. Runtime is derived from the timestamps in the output file. Results are written to either a results/ directory or the binary "results_complete.bin".
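Since runtime is derived from the timestamps printed in the output file, the subtraction can be scripted. The sketch below assumes GNU date (for the -d option) and that you have copied the first and last timestamps out of the output file yourself; elapsed_minutes is a hypothetical helper and the timestamps shown are made-up examples.

```shell
# elapsed_minutes START END
# Hypothetical helper: minutes between two timestamps, to one decimal place.
# Relies on GNU date's -d option to parse the timestamp strings.
elapsed_minutes() {
    start_s=$(date -d "$1" +%s)
    end_s=$(date -d "$2" +%s)
    awk -v s="$start_s" -v e="$end_s" 'BEGIN { printf "%.1f\n", (e - s) / 60 }'
}

# Example with made-up timestamps:
elapsed_minutes "2024-09-16 10:00:00 UTC" "2024-09-16 10:41:48 UTC"
```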
Generate plots with "plot_results.slurm". Similarly, you must also update the partition name from "scec" to "main" and submit the job with slurm_submit.sh. View final plots in the generated "index.html". If you do not have a graphical session, you may need to download the simulation folder to view plots locally.
Expanse
The Expanse Configuration takes into consideration the Expanse User Guide and existing Quakeworx Dev Configuration.
To establish an SSH connection to Expanse, you must first verify your Expanse project allocation at Expanse Portal -> OnDemand -> Allocation and Usage Information, under Resource “Expanse”. If it is not present, file a troubleshooting ticket at support.access-ci.org.
Unlike on Discovery, we are going to set up our own MPJ Express installation and configure an MPJ Express Wrapper. A similar process will be rolled out to Discovery in the future.
- Clone MPJ Express to /expanse/lustre/projects/usc143/$USER/mpj-express:
$ git clone https://github.com/kevinmilner/mpj-express.git
- Set the wrapper path in mpj-express/conf/mpjexpress.conf:
mpjexpress.ssh.wrapper=/expanse/lustre/projects/usc143/$USER/ucerf3/ucerf3-etas-env-wrapper.sh
You may want to explicitly write your username in the path here instead of using $USER.
- Create the MPJ wrapper file at the specified path as follows:
#!/bin/bash
module load cpu/0.15.4
module load openjdk/11.0.2
export PROJFS=/expanse/lustre/projects/usc143/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/u3etas_sim
export MPJ_HOME=$PROJFS/mpj-express
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH
"$@"
exit $?
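The wrapper follows a common pattern: set up the environment, run whatever command was passed in as arguments, and return that command's exit status. The sketch below isolates that pattern in a function; env_wrapper and DEMO_ETAS_VAR are hypothetical stand-ins for the real script and its exports.

```shell
# env_wrapper CMD [ARGS...]
# Hypothetical stand-in for the wrapper script's structure.
env_wrapper() {
    # Run in a subshell so the exported variable does not leak to the caller.
    (
        export DEMO_ETAS_VAR="set-by-wrapper"  # placeholder for the real exports
        "$@"                                   # run the wrapped command as given
    )
    # The subshell's exit status is the wrapped command's exit status.
}

# The wrapped command sees the environment prepared by the wrapper:
env_wrapper sh -c 'echo "$DEMO_ETAS_VAR"'
```

Quoting "$@" preserves each argument exactly as passed, which matters when paths or options contain spaces.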
Add the following to the bashrc:
module load sdsc
module load cpu/0.15.4
module load openjdk/11.0.2

# compute nodes don't have unzip...
which unzip > /dev/null 2> /dev/null
if [[ $? -ne 0 ]];then
    module load unzip
    module load bzip2
fi

# https://github.com/opensha/ucerf3-etas-launcher/tree/master/parallel/README.md
export PROJFS=/expanse/lustre/projects/usc143/$USER
export ETAS_LAUNCHER=$PROJFS/ucerf3/ucerf3-etas-launcher
export ETAS_SIM_DIR=$PROJFS/ucerf3/u3etas_sim
export ETAS_MEM_GB=5 # this will be overridden in batch scripts for parallel jobs, set low enough so that the regular U3ETAS scripts can run on the login node to configure jobs
export MPJ_HOME=$PROJFS/mpj-express
export PATH=$ETAS_LAUNCHER/parallel/slurm_sbin/:$ETAS_LAUNCHER/sbin/:$MPJ_HOME/bin:$PATH
Single-node simulations won't invoke the MPJ Express Wrapper, which is why these changes are necessary.
Connect to an interactive compute node:
srun --partition=debug --pty --account=usc143 --nodes=1 --ntasks-per-node=4 --mem=16G -t 00:30:00 --wait=0 --export=ALL /bin/bash
and build the simulation with NUM_SIM catalogs and NUM_NODE nodes.
u3etas_comcat_event_config_builder.sh --event-id ci38457511 --mag-complete 3.5 --radius 25 --num-simulations $NUM_SIM --days-before 7 --max-point-src-mag 6 --finite-surf-shakemap --finite-surf-shakemap-min-mag 4.5 --hpc-site USC_CARC --nodes $NUM_NODE --hours 24 --output-dir $ETAS_SIM_DIR/expanse-comcat-ridgecrest-m7.1-n${NUM_NODE}-s$NUM_SIM
Inside the Slurm file, use the "compute" partition instead of "scec".
#SBATCH --partition compute
I can't confirm if it's necessary to set MPJ_HOME in the Slurm configuration, given our wrapper sets this already, but I have set it here as well.
Take care to explicitly set ETAS_MEM_GB for single-node runs to the desired amount of memory, which must be less than MEM_GIGS. Consider the total RAM available per node, which is ~256GB according to the Expanse User Guide.
As on Discovery, invoke u3etas_launcher directly for single-node simulations.
Quakeworx Dev doesn’t use a scratch file. Scratch files aren't necessary, but may speed up I/O operations. Comment out the SCRATCH parameter in the slurm configuration.
Set the account for your research project. In my case it's "usc143".
#SBATCH --account=usc143
Depending on your account quota, you may struggle to run 32 node simulations. In my case I used another project "ddp408" for these simulations.
Check available projects for the Expanse resource with expanse-client user -r expanse.
Unlike on Discovery, we must set ntasks-per-node or ntasks.
#SBATCH --ntasks 20
I ran successfully with 20, although you can try a higher value. Too many tasks may result in a job quota failure.
As Expanse has 128 cores available per node, and we are charged for the full node regardless of utilization, take care to set cores-per-node=128. I didn't do this for my runs, but we still reflect the 128 cores in the Measurements table to accurately reflect the cost.
Job execution and data plotting instructions are identical to Discovery.
Frontera
TODO(bhatthal)