Difference between revisions of "CyberShake Study 14.2"

Revision as of 17:18, 21 February 2014

CyberShake Study 14.2 is a computational study to calculate physics-based probabilistic seismic hazard curves under 4 different conditions: CVM-S4.26 with CPU, CVM-S4.26 with GPU, CVM-H without a GTL with CPU, and a 1D model with GPU. It uses the Graves and Pitarka (2010) rupture variations and the UCERF2 ERF. Both the SGT calculations and the post-processing will be done on Blue Waters. The goal is to calculate hazard curves for the standard Southern California site list (286 sites) used in previous CyberShake studies so we can produce comparison curves and maps, and understand the impact of the SGT codes and velocity models on the CyberShake seismic hazard.

Computational Status

Study 14.2 began at 6:35:18 am PST on Tuesday, February 18, 2014.

Data Products

Data products will be available here.

The following parameters can be used to query the CyberShake database on focal.usc.edu for data products from this run:

CVM-S4.26:  Velocity Model ID 5
CVM-H 11.9, no GTL:  Velocity Model ID 7
BBP 1D: Velocity Model ID 8

AWP-ODC-CPU: SGT Variation ID 6
AWP-ODC-GPU: SGT Variation ID 8

Graves & Pitarka 2010: Rupture Variation Scenario ID 4

UCERF 2 ERF: ERF ID 35

Goals

Science Goals

Calculate a hazard map using CVM-S4.26.
Calculate a hazard map using CVM-H without a GTL.
Calculate a hazard map using a 1D model obtained by averaging.

Technical Goals

Show that Blue Waters can be used to perform both the SGT and post-processing phases.
Compare time-to-solution with Study 13.4. We define time-to-solution to be equivalent to the makespan of all of the workflows; that is, the time that elapses between when the first workflow is submitted (to HTCondor for execution) and when all jobs in all workflows have successfully completed execution, which includes calculation of all hazard curves. This metric includes any system downtime or workflow stoppages.
Compare the performance and queue times when using AWP-ODC-SGT CPU vs AWP-ODC-SGT GPU codes.
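The time-to-solution definition above can be sketched in a few lines. This is only an illustration: the function name and the job end times are made up, and only the start time is taken from the Computational Status section.

```python
from datetime import datetime

def makespan(first_submit, job_end_times):
    """Elapsed time from the first workflow submission to the last job completion.

    Downtime and workflow stoppages are included automatically, because only
    the two endpoints matter.
    """
    return max(job_end_times) - first_submit

# Study start time from this page; the end times are hypothetical examples.
study_start = datetime(2014, 2, 18, 6, 35, 18)
example_ends = [
    datetime(2014, 2, 19, 12, 0, 0),
    datetime(2014, 2, 20, 8, 30, 0),
]
print(makespan(study_start, example_ends))
```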

To meet these goals, we will calculate 4 hazard maps:

AWP-ODC-SGT CPU with CVM-S4.26
AWP-ODC-SGT GPU with CVM-S4.26
AWP-ODC-SGT CPU with CVM-H 11.9, no GTL
AWP-ODC-SGT GPU with BBP 1D

Verification

For verification, we will calculate hazard curves for PAS, WNGC, USC, and SBSM under all 4 conditions.

WNGC

	3s	5s	10s
CVM-H (no GTL), CPU
BBP 1D, GPU
CVM-S4.26, CPU
CVM-S4.26, GPU

USC

	3s	5s	10s
CVM-H (no GTL), CPU
BBP 1D, GPU
CVM-S4.26, CPU
CVM-S4.26, GPU

PAS

	3s	5s	10s
CVM-H (no GTL), CPU
BBP 1D, GPU
CVM-S4.26, CPU
CVM-S4.26, GPU

SBSM

	3s	5s	10s
CVM-H (no GTL), CPU
BBP 1D, GPU
CVM-S4.26, CPU
CVM-S4.26, GPU

Sites

We are proposing to run 286 sites around Southern California. Those sites include 46 points of interest, 27 precarious rock sites, 23 broadband station locations, 43 20 km gridded sites, and 147 10 km gridded sites. All of them fall within the Southern California box except for Diablo Canyon and Pioneer Town. You can get a CSV file listing the sites here. A KML file listing the sites is available here.

Fig 1: Sites selected for Study 2.3. Purple are gridded sites, red are precarious rocks, orange are SCSN stations, and yellow are sites of interest.

Performance Enhancements (over Study 13.4)

SGT Codes

Switched to running a single job to generate and write the velocity mesh, as opposed to separate jobs for generating and merging into 1 file.
We have chosen PX and PY to be 10 x 10 for the GPU SGT code; this seems to be a good balance between efficiency and reduced walltimes. X and Y dimensions must be multiples of 20 so that each processor has an even number of grid points in the X and Y dimensions.
We chose the number of CPU processors dynamically, so that each is responsible for ~64x50x50 grid points.
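A minimal sketch of the two decomposition rules above, under stated assumptions: padding a dimension up to the next multiple of 20 is our guess at how the constraint could be met, the assignment of 64, 50, and 50 to the X, Y, and Z axes is assumed, and the mesh dimensions in the usage note are hypothetical.

```python
import math

GPU_PX, GPU_PY = 10, 10      # GPU processor grid from the text
DIM_MULTIPLE = 20            # X and Y must be multiples of 20
CPU_BLOCK = (64, 50, 50)     # approximate grid points per CPU processor (axis order assumed)

def round_up_to_multiple(n, multiple=DIM_MULTIPLE):
    """Smallest multiple of `multiple` that is >= n (assumed padding strategy)."""
    return ((n + multiple - 1) // multiple) * multiple

def gpu_points_per_processor(nx, ny):
    """Grid points per GPU processor in X and Y; both come out even."""
    if nx % DIM_MULTIPLE or ny % DIM_MULTIPLE:
        raise ValueError("X and Y dimensions must be multiples of 20")
    return nx // GPU_PX, ny // GPU_PY

def cpu_processor_count(nx, ny, nz):
    """Number of CPU processors so that each covers roughly CPU_BLOCK points."""
    bx, by, bz = CPU_BLOCK
    return math.ceil(nx / bx) * math.ceil(ny / by) * math.ceil(nz / bz)
```

For a hypothetical 1510 x 800 x 200 mesh, `round_up_to_multiple(1510)` gives 1520, `gpu_points_per_processor(1520, 800)` gives 152 x 80 points per GPU processor, and `cpu_processor_count(1520, 800, 200)` gives 1536 CPU processors.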

PP Codes

Switched to SeisPSA_multi, which synthesizes multiple rupture variations per invocation. We plan to use a factor of 5, so only ~83,000 invocations will be needed. This reduces I/O, since we don't have to read in the extracted SGT files for each rupture variation.
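The invocation count follows from a ceiling division. The sketch below uses the rounded figure of 410K rupture variations per site from the Storage Requirements section, so it lands slightly under the ~83,000 quoted above; the function name is ours.

```python
import math

def invocations_needed(rupture_variations, per_invocation=5):
    """Invocations required when each one synthesizes `per_invocation` rupture variations."""
    return math.ceil(rupture_variations / per_invocation)

# 410K rupture variations/site is the rounded figure used elsewhere on this page.
print(invocations_needed(410_000))
```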

Workflow Management

A single workflow is created which contains the SGT, the PP, and the hazard curve workflows.
Added a cron job on shock to monitor the proxy certificates and send email when the certificates have <24 hours remaining.
Modified the AutoPPSubmit.py cron workflow submission script to first check the Blue Waters jobmanagers and not submit jobs if it cannot authenticate.
Added file locking on pending.txt so only 1 auto-submit instance runs at a time.
Added logic to the planning scripts to capture the TC, the SC, and the RC path and write them to a metadata file.
We only keep the stderr and stdout from a job if it fails.
Added an hourly cron job to clear out held jobs from the HTCondor queue.
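The file locking on pending.txt could look something like the sketch below. This is not the code in AutoPPSubmit.py; it only illustrates the idea with the standard `fcntl.flock` call, and the helper names are hypothetical.

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def exclusive_lock(path):
    """Hold an exclusive, non-blocking lock on `path`.

    Raises BlockingIOError right away if another instance holds the lock.
    """
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)

def try_submit(path, submit):
    """Run `submit()` only if no other instance holds the lock; return whether it ran."""
    try:
        with exclusive_lock(path):
            submit()
            return True
    except BlockingIOError:
        return False
```

A cron-driven script would call `try_submit("pending.txt", submit_pending_workflows)`, where `submit_pending_workflows` stands in for the real submission logic, and simply exit when it returns False. A non-blocking lock is used so that overlapping cron runs skip a cycle instead of queueing up behind each other.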

Codes

The CyberShake codebase used for this study was tagged "study_14.2" in the CyberShake SVN repository on source.

Additional dependencies not in the SVN repository include:

Blue Waters

UCVM 13.9.0 SVN CyberShake 14.2 study version
- Euclid 1.3
- Proj 4.8.0
- CVM-S4.26 SVN CyberShake 14.2 study version
- BBP 1D
- CVM-H 11.9.1

Memcached 1.4.15
- Libmemcached 1.0.15
- Libevent 2.0.21

Pegasus 4.4.0, updated from the Pegasus git repository. pegasus-version reports version 4.4.0cvs-x86_64_sles_11-20140109230844Z.

shock.usc.edu

Pegasus 4.4.0, updated from the Pegasus git repository. pegasus-version reports version 4.4.0cvs-x86_64_rhel_6-20140214200349Z.

HTCondor 8.0.3 Sep 19 2013 BuildID: 174914

Globus Toolkit 5.0.4

Lessons Learned

AWP_ODC_GPU code, under certain situations, produced incorrect filenames.

Incorrect dependency in DAX generator - NanCheckY was a child of AWP_SGTx.

Try out Pegasus cleanup - we accidentally blew away a running directory using find.

Computational and Data Estimates

We will use a 200-node 2-week XK reservation and a 700-node 2-week XE reservation.

Computational Time

SGTs, CPU: 150 node-hrs/site x 286 sites x 2 models = 86K node-hours, XE nodes

SGTs, GPU: 90 node-hrs/site x 286 sites x 2 models = 52K node-hours, XK nodes

Study 13.4 had 29% overrun, so 1.29 x (86K + 52K) = 180K node-hours for SGTs

PP: 60 node-hrs/site x 286 sites x 4 models = 70K node-hours, XE nodes

Study 13.4 had 35% overrun on PP, so 1.35 x 70K = 95K node-hours

Total: 275K node-hours
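The estimate above can be reproduced with a short calculation. Note that the unrounded products come out slightly below the figures quoted, which appear to be rounded up at each step; the variable and function names here are ours.

```python
SITES = 286

def node_hours(per_site, models, sites=SITES):
    """Node-hours for one phase: per-site cost x number of sites x number of models."""
    return per_site * sites * models

sgt_cpu = node_hours(150, 2)   # XE nodes
sgt_gpu = node_hours(90, 2)    # XK nodes
pp = node_hours(60, 4)         # XE nodes

# Overrun factors observed in Study 13.4.
sgt_with_overrun = 1.29 * (sgt_cpu + sgt_gpu)
pp_with_overrun = 1.35 * pp
total = sgt_with_overrun + pp_with_overrun

print(f"SGT CPU: {sgt_cpu:,}  SGT GPU: {sgt_gpu:,}  PP: {pp:,}")
print(f"SGT with overrun: {sgt_with_overrun:,.0f}")
print(f"PP with overrun:  {pp_with_overrun:,.0f}")
print(f"Total:            {total:,.0f}")
```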

Storage Requirements

Blue Waters

Unpurged disk usage to store SGTs: 40 GB/site x 286 sites x 4 models = 45 TB

Purged disk usage: (11 GB/site seismograms + 0.2 GB/site PSA + 690 GB/site temporary) x 286 sites x 4 models = 783 TB

SCEC

Archival disk usage: 12.3 TB seismograms + 0.2 TB PSA files on scec-04 (has 19 TB free) & 93 GB curves, disaggregations, reports, etc. on scec-00 (931 GB free)

Database usage: 3 rows/rupture variation x 410K rupture variations/site x 286 sites x 4 models = 1.4 billion rows x 151 bytes/row = 210 GB (880 GB free on focal.usc.edu disk)

Temporary disk usage: 5.5 TB workflow logs. We're now not capturing the job output if the job runs successfully, which should save a moderate amount of space. scec-02 has 12 TB free.
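The disk and database figures in this section can be checked with the sketch below. It assumes 1 TB = 1024 GB for the disk totals (which matches the 783 TB figure) and decimal gigabytes for the database size; both conventions are our assumptions, as are the names.

```python
SITES, MODELS = 286, 4
GB_PER_TB = 1024  # assumption: binary terabytes, which reproduces the 783 TB figure

def total_tb(gb_per_site):
    """Total terabytes across all sites and models for a given per-site size in GB."""
    return gb_per_site * SITES * MODELS / GB_PER_TB

sgt_tb = total_tb(40)                 # unpurged SGT storage
purged_tb = total_tb(11 + 0.2 + 690)  # seismograms + PSA + temporary files

# Database: 3 rows per rupture variation, ~410K rupture variations per site.
db_rows = 3 * 410_000 * SITES * MODELS
db_gb = db_rows * 151 / 1e9           # 151 bytes/row, decimal GB (assumption)

print(f"SGTs: {sgt_tb:.1f} TB, purged: {purged_tb:.1f} TB")
print(f"Database: {db_rows:,} rows, about {db_gb:.0f} GB")
```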

Metrics

Before beginning the run, Blue Waters reports 15224 jobs and 387,386.00 total node hours executed by scottcal.

Presentations and Papers

Science Readiness Review

Technical Readiness Review

Time To Solution Summary (pdf)

Time To Solution Summary (docx)

Time To Solution Spreadsheet (xlsx)
