Difference between revisions of "CyberShake Study 15.4"
Revision as of 04:04, 20 April 2015
CyberShake Study 15.4 is a computational study to calculate one physics-based probabilistic seismic hazard model for Southern California at 1 Hz, using CVM-S4.26, the GPU implementation of AWP-ODC-SGT, the Graves and Pitarka (2015) rupture variations with uniform hypocenters, and the UCERF2 ERF. The SGT calculations will be split between NCSA Blue Waters and OLCF Titan, and the post-processing will be done entirely on Blue Waters. The goal is to calculate the standard Southern California site list (286 sites) used in previous CyberShake studies so we can produce comparison curves and maps, and data products for the UGMS Committee.
Preparation for production runs
- Check list of Mayssa's concerns
- Update DAX to support separate MD5 sums
- Add MD5 sum job to TC
- Evaluate topology-aware scheduling
- Get DirectSynth working at full run scale, verify results
- Modify workflow to have md5sums be in parallel
- Test of 1 Hz simulation with 2 Hz source - 2/27
- Add a third pilot job type to Titan pilots - 2/27
- Run test of full 1 Hz SGT workflow on Blue Waters - 3/4
- Add cleanup to workflow and test - 3/4
- Test interface between Titan workflows and Blue Waters workflows - 3/4
- Add capability to have files on Blue Waters correctly striped - 3/6
- Add restart capability to DirectSynth - 3/6
- File ticket for extended walltime for small jobs on Titan - 3/6
- Add DirectSynth to workflow tools - 3/6
- Implement and test parallel version of reformat_awp - 3/6
- Set up usage monitoring on Blue Waters and Titan
- Add ability to determine if SGTs are being run on Blue Waters or Titan
- Modify auto-submit system to distinguish between full runs and PP runs
- Science readiness review - 3/18
- Technical readiness review - 3/18
- Create study description file for Run Manager - 3/13
- Simulate curves for 3 sites with final configuration; compare curves and seismograms
- File ticket for 90-day purged space at Blue Waters
- Tag code on shock, Blue Waters, Titan
- Request reservation at Blue Waters
- Follow up on high priority jobs at Titan
- Make changes to technical review slides
- Upgrade UCVM on Blue Waters to match Titan version
- Evaluate using a single workflow rather than split workflows
Computational Status
Study 15.4 began execution at 10:44:11 PDT on April 16, 2015.
Progress can be monitored here (requires SCEC login).
We are estimating a completion date of July 15, 2015.
Data Products
Goals
Science Goals
- Calculate a 1 Hz hazard map of Southern California.
- Produce a contour map at 1 Hz for the UGMS committee.
- Compare the hazard maps at 0.5 Hz and 1 Hz.
- Produce a hazard map with the Graves & Pitarka (2014) rupture generator.
Technical Goals
- Show that Titan can be integrated into our CyberShake workflows.
- Demonstrate scalability for 1 Hz calculations.
- Show that we can split the SGT calculations across sites.
Verification
Forward Comparison
More information on a comparison of forward and reciprocity results is available here.
DirectSynth
The following plots compare 1 Hz results from SeisPSA with results from DirectSynth for WNGC. SeisPSA results are in magenta, DirectSynth results are in black. The two are so close that the magenta is difficult to make out.
(Comparison plots at periods of 2s, 3s, 5s, and 10s.)
2 Hz source
Before beginning Study 15.4, we wanted to investigate our source filtering parameters, to see if it was possible to improve the accuracy of hazard curves at frequencies closer to the CyberShake study frequency.
In describing our results, we will refer to the "simulation frequency" and the "source frequency". The simulation frequency refers to the choice of mesh spacing and dt. The source frequency is the frequency at which the impulse used in the SGT simulation was low-pass filtered (using a 4th order Butterworth filter).
All of these calculations were done for WNGC, ERF 36, with uniform ruptures and the AWP-ODC-GPU SGT code.
Comparisons were done using the following runs:
- 0.5 Hz simulation, 0.5 Hz filtered source (run 3837)
- 0.5 Hz simulation, 1 Hz filtered source (run 3853)
- 1 Hz simulation, 2 Hz filtered source (run 3860)
- 1 Hz simulation, 1 Hz filtered source (run 3861)
First, we performed a run with a 0.5 Hz simulation frequency and a 1.0 Hz source frequency, and compared it to the runs we had been doing in the past, which are 0.5 Hz simulation / 0.5 Hz source. The 1.0 Hz source frequency has an impact on the hazard curves, even at 3 seconds. Semilog curves are on the top row, log/log curves on the bottom.
From spectral plots of the largest 3 sec PSA seismograms, we can see that the PseudoAA response is affected, even at periods much longer than the period corresponding to the filter frequency:
Next, we repeated the same experiment for a 1.0 Hz simulation frequency, with source frequencies of 1.0 Hz and 2.0 Hz:
The hazard curves are practically the same. To try to understand why the hazard curves from the 1.0 Hz experiment don't show the same kind of differences we saw at 0.5 Hz, we first looked at the SGTs.
There is a clear difference in the spectral content of SGTs generated with different frequency content. You can see the difference in these Fourier spectra plots, starting at about 1.0 Hz:
The differences are also clear when examining the frequency content of a large-amplitude seismogram, starting around 0.7 or 0.8 Hz:
However, these differences are about 2 orders of magnitude smaller in amplitude than the largest amplitudes, which occur around 0.1 Hz. This is unlike the 0.5 Hz results, in which we see only about 1 order of magnitude difference. Additionally, we see these differences starting at about 0.7 or 0.8 Hz, so they are not picked up by the 2 second hazard curves. Part of the reason is that the source doesn't have a lot of high frequency content. Rob is investigating this for future updates to the rupture generator.
We also compared Respect and PSA results to verify the spectral response codes; there are some small differences at high frequency, but overall they are very similar:
Using a 2 Hz filter does have small impacts on the seismograms; for example, here are plots of two of the largest seismograms for WNGC with a 1 Hz and 2 Hz source filter. The seismograms generated with the 2 Hz source filter have sharper peaks, which are a result of their higher frequency content, but that content should not be trusted, as the mesh spacing and dt of the simulation do not justify accuracy above 1 Hz:
So seismograms generated with a 2 Hz source should be filtered before they are used in non-frequency-dependent applications.
Based on this analysis, we plan to perform Study 15.4 using a filter of 2 Hz, to capture additional frequency information between 0.7 and 1 Hz. We have updated the database schema so that we can capture the filter frequency used for various runs.
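To give a rough sense of why moving the source filter from 1 Hz to 2 Hz preserves more content between 0.7 and 1 Hz, the sketch below evaluates the textbook magnitude response of an ideal 4th order low-pass Butterworth filter, |H(f)| = 1 / sqrt(1 + (f/fc)^(2n)). This is an illustration only: it assumes a single-pass filter with the corner at the stated source frequency, and it is not the filtering code used to prepare the CyberShake source. The function name `butterworth_gain` is hypothetical.

```python
import math


def butterworth_gain(freq_hz, corner_hz, order=4):
    """Magnitude response of an ideal low-pass Butterworth filter.

    ASSUMPTION: single-pass filter with its -3 dB corner at corner_hz.
    A two-pass (zero-phase) implementation would square this gain.
    """
    return 1.0 / math.sqrt(1.0 + (freq_hz / corner_hz) ** (2 * order))


if __name__ == "__main__":
    # Compare how much amplitude each source filter passes at several frequencies.
    for freq in (0.5, 0.7, 0.8, 1.0):
        gain_1hz = butterworth_gain(freq, 1.0)
        gain_2hz = butterworth_gain(freq, 2.0)
        print(f"{freq:.1f} Hz: 1 Hz corner passes {gain_1hz:.3f}, "
              f"2 Hz corner passes {gain_2hz:.3f}")
```

Under these assumptions the 1 Hz corner already attenuates noticeably as the frequency approaches 1 Hz, while the 2 Hz corner leaves that band almost untouched, which is consistent with the motivation given above.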
Blue Waters vs Titan for SGT calculation
SGT duration
The SGTs are generated for 200 seconds. However, the reciprocity calculations are performed for 300 seconds.
For Parkfield San Andreas events, the farthest sites are PTWN (420 km) and s758 (400 km). The seismograms for the northernmost events at those stations are below.
For Bombay Beach San Andreas events, the farthest site is DBCN (510 km).
Sites
We are proposing to run 286 sites around Southern California. Those sites include 46 points of interest, 27 precarious rock sites, 23 broadband station locations, 43 gridded sites at 20 km spacing, and 147 gridded sites at 10 km spacing. All of them fall within the Southern California box except for Diablo Canyon and Pioneer Town. You can get a CSV file listing the sites here. A KML file listing the sites is available here.
Performance Enhancements (over Study 14.2)
Responses to Study 14.2 Lessons Learned
- AWP_ODC_GPU code, under certain situations, produced incorrect filenames.
This was fixed during the Study 14.2 run.
- Incorrect dependency in DAX generator - NanCheckY was a child of AWP_SGTx.
This was fixed during the Study 14.2 run.
- Try out Pegasus cleanup - accidentally blew away running directory using find, and later accidentally deleted about 400 sets of SGTs.
We have added cleanup to the SGT workflow, since that's where most of the extra data is generated, especially with two copies of the SGTs (the ones generated by AWP-ODC-GPU, and then the reformatted ones).
- 50 connections per IP is too many for hpc-login2 gridftp server; brings it down. Try using a dedicated server next time with more aggregated files.
We have moved our USC gridftp transfer endpoint to hpc-scec.usc.edu, which does very little other than GridFTP transfers.
SGT codes
- We have moved to a parallel version of reformat_awp. With this parallel version, we can reduce the runtime by 65%.
PP codes
- We have switched from using extract_sgt for the SGT extraction and SeisPSA for the seismogram synthesis to DirectSynth, a code which reads in the SGTs across multiple cores and then uses MPI to send them directly to workers, which perform the seismogram synthesis. We anticipate this code will give us an efficiency improvement of at least 50% over the old approach, since it does not require the writing and reading of the extracted SGT files.
Workflow management
- We are using a pilot job daemon on Titan to monitor the shock queue and submit pilot jobs to Titan accordingly.
- The MD5sums calculated on the SGTs at the start of the post-processing now run in parallel with the actual post-processing calculations. If the MD5 sum job fails, the entire workflow will be aborted, but since that is rare, the majority of the time the rest of the post-processing workflow can continue without having the MD5 sums in the critical path.
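The idea of taking the MD5 check off the critical path can be sketched in a few lines. The example below is a single-process illustration using a background thread. It is not how the Pegasus workflow expresses the dependency, and the names `md5_of_file` and `run_with_parallel_checksum` are hypothetical. One difference from the description above: here a mismatch is only detected after the post-processing step has finished, at which point its result is discarded.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_with_parallel_checksum(sgt_path, expected_md5, post_process):
    """Run post_process(sgt_path) while the MD5 check runs in a background thread.

    Raises RuntimeError (discarding the result) if the checksum does not match.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        checksum_future = pool.submit(md5_of_file, sgt_path)
        result = post_process(sgt_path)        # does not wait for the checksum
        actual_md5 = checksum_future.result()  # join before trusting the result
    if actual_md5 != expected_md5:
        raise RuntimeError(f"MD5 mismatch for {sgt_path}: aborting")
    return result


if __name__ == "__main__":
    # Hypothetical stand-ins for an SGT file and a post-processing step.
    payload = b"placeholder SGT bytes"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        expected = hashlib.md5(payload).hexdigest()
        size = run_with_parallel_checksum(tmp.name, expected, os.path.getsize)
        print("post-processing result:", size)
    finally:
        os.remove(tmp.name)
```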
Codes
The CyberShake codebase used for this study was tagged "study_15_4" in the CyberShake SVN repository on source.
Additional dependencies not in the SVN repository include:
Blue Waters
- UCVM 14.3.0
- Euclid 1.3
- Proj 4.8.0
- CVM-S 4.26
- Memcached 1.4.15
- Libmemcached 1.0.18
- Libevent 2.0.21
- Pegasus 4.5.0, updated from the Pegasus git repository. pegasus-version reports version 4.5.0cvs-x86_64_sles_11-20150224175937Z .
Titan
- UCVM 14.3.0
- Euclid 1.3
- Proj 4.8.0
- CVM-S 4.26
- Pegasus 4.5.0, updated from the Pegasus git repository.
- pegasus-version for the login and service nodes reports 4.5.0cvs-x86_64_sles_11-20140807210927Z
- pegasus-version for the compute nodes reports 4.5.0cvs-x86_64_sles_11-20140807211355Z
- HTCondor version: 8.2.1 Jun 27 2014 BuildID: 256063
shock.usc.edu
- Pegasus 4.5.0 RC1. pegasus-version reports 4.5.0rc1-x86_64_rhel_6-20150410215343Z .
- HTCondor 8.2.8 Apr 07 2015 BuildID: 312769
- Globus Toolkit 5.2.5
Lessons Learned
- Some of the DirectSynth jobs couldn't fit their SGTs into the number of SGT handlers, nor finish in the wallclock time. In the future, test against a larger range of volumes and sites.
Computational and Data Estimates
Computational Time
Titan
SGTs (GPU): 1800 node-hrs/site x 143 sites = 258K node-hours = 7.7M SUs
Add 25% margin: 9.6M SUs
Blue Waters
SGTs (GPU): 1300 node-hrs/site x 143 sites = 186K node-hours (3.0M SUs), XK nodes
SGTs (CPU): 100 node-hrs/site x 143 sites = 14K node-hours (458K SUs), XE nodes
PP: 1500 node-hrs/site x 286 sites = 429K node-hours (13.7M SUs), XE nodes
Add 25% margin: 786K node-hours
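The estimates above can be recomputed with a short script, which makes it easy to rerun the arithmetic if the per-site costs change. The charge rates (SUs per node-hour) are not stated on this page. The values in the sketch (30 for Titan, 16 for Blue Waters XK, 32 for Blue Waters XE) are inferred from the node-hour and SU figures quoted above and should be treated as assumptions.

```python
# ASSUMPTION: charge rates (SUs per node-hour) are inferred from the figures
# quoted in the text; they are not stated there explicitly.
SU_PER_NODE_HOUR = {"Titan": 30, "Blue Waters XK": 16, "Blue Waters XE": 32}
MARGIN = 0.25  # the 25% margin from the text

# (system, stage, node-hours per site, number of sites), as given in the text
ESTIMATES = [
    ("Titan", "SGTs (GPU)", 1800, 143),
    ("Blue Waters XK", "SGTs (GPU)", 1300, 143),
    ("Blue Waters XE", "SGTs (CPU)", 100, 143),
    ("Blue Waters XE", "PP", 1500, 286),
]


def summarize(estimates):
    """Return per-line (system, stage, node-hours, SUs) and node-hour totals per machine."""
    lines = []
    totals = {}
    for system, stage, per_site, sites in estimates:
        node_hours = per_site * sites
        sus = node_hours * SU_PER_NODE_HOUR[system]
        lines.append((system, stage, node_hours, sus))
        machine = "Titan" if system == "Titan" else "Blue Waters"
        totals[machine] = totals.get(machine, 0) + node_hours
    return lines, totals


if __name__ == "__main__":
    lines, totals = summarize(ESTIMATES)
    for system, stage, node_hours, sus in lines:
        print(f"{system:15s} {stage:11s} {node_hours:>8,} node-hrs {sus:>11,} SUs")
    for machine, node_hours in totals.items():
        with_margin = node_hours * (1 + MARGIN)
        print(f"{machine}: {node_hours:,} node-hrs, {with_margin:,.0f} with margin")
```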
Storage Requirements
Titan
Purged space to store SGTs while generating: (1.5 TB SGTs + 120 GB mesh + 1.5 TB reformatted SGTs)/site x 143 sites = 446 TB
Blue Waters
Space to store SGTs (delayed purge): 1.5 TB/site x 286 sites = 429 TB
Purged disk usage: (1.5 TB SGTs + 120 GB mesh + 1.5 TB reformatted SGTs)/site x 143 sites + (27 GB/site seismograms + 0.2 GB/site PSA + 0.2 GB/site RotD) x 286 sites = 453 TB
SCEC
Archival disk usage: 7.5 TB seismograms + 0.1 TB PSA files + 0.1 TB RotD files on scec-04 (has 171 TB free) & 24 GB curves, disaggregations, reports, etc. on scec-00 (171 TB free)
Database usage: (5 rows PSA + 7 rows RotD)/rupture variation x 450K rupture variations/site x 286 sites = 1.5 billion rows x 151 bytes/row = 227 GB (4.3 TB free on focal.usc.edu disk)
Temporary disk usage: 515 GB workflow logs. scec-02 has 171 TB free.
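The storage and database figures above follow from a few products, sketched here so they can be rerun with different site counts. The function names are hypothetical, and decimal units (1 GB = 1e9 bytes) are an assumption. Note that the database size in the text multiplies the rounded figure of 1.5 billion rows by 151 bytes, so the unrounded row count used here gives a slightly larger number.

```python
def database_rows(rows_per_variation, variations_per_site, sites):
    """Total database rows: rows per rupture variation x variations x sites."""
    return rows_per_variation * variations_per_site * sites


def sgt_working_space_tb(sites, sgt_tb=1.5, mesh_tb=0.12, reformatted_tb=1.5):
    """Purged space for SGT generation: SGTs + mesh + reformatted SGTs per site."""
    return (sgt_tb + mesh_tb + reformatted_tb) * sites


if __name__ == "__main__":
    # 5 PSA rows + 7 RotD rows per rupture variation, 450K variations per site.
    rows = database_rows(5 + 7, 450_000, 286)
    print(f"database rows: {rows:,}")
    # ASSUMPTION: decimal gigabytes (1e9 bytes).
    print(f"database size: {rows * 151 / 1e9:.0f} GB at 151 bytes/row")
    print(f"SGT working space for 143 sites: {sgt_working_space_tb(143):.0f} TB")
    print(f"SGT storage for 286 sites: {1.5 * 286:.0f} TB")
```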
Performance Metrics
At 8:20 pm PDT on launch day, 102,585,945 SUs available on Titan. 831,995 used in April under username callag.
At 8:30 pm PDT on launch day, 257975.45 node-hours burned under scottcal on Blue Waters. 48571 jobs launched under the project on Blue Waters summary page.
We launched 2 XK reservations on Blue Waters for 852 nodes each starting at 9 pm PDT on April 17th, and 2 XE reservations for 564 nodes each starting at 10 pm PDT on April 17th. Due to XK jobs having slower throughput than we expected, blocking the XE jobs, and Titan SGTs slowing down greatly, we gave back one of the XE reservations at 8:50 am PDT on April 18th.
In preparation for downtimes, we stopped submitting new workflows at 9:03 pm PDT on April 19th.