CyberShake Data

This page provides an overview of CyberShake data and how to access it.

CyberShake data can be broken down into the following elements based on when it is used in simulations:

  1. Input data needed for CyberShake runs, such as which ruptures go with which site. This information is stored in the CyberShake database.
  2. Temporary data generated during CyberShake production runs. This data remains on the cluster and is purged.
  3. Output data products generated by CyberShake runs. This data is transferred from the cluster to SCEC disks, and some of it is inserted into the CyberShake database for quick access.

We will focus on (1) and (3).

CyberShake database overview

CyberShake data is served through two relational database servers running MySQL/MariaDB, as well as an SQLite file for each past study.

MySQL/MariaDB Databases

The two databases used to store CyberShake data are focal.usc.edu ('focal') and moment.usc.edu ('moment').

Examples of accessing data stored in these databases can be found at Accessing_CyberShake_Database_Data.

Moment DB

Moment is the production database server. Currently, it maintains all the necessary inputs, metadata on all CyberShake runs, and results for Study 15.12 and Study 17.3.

Read-only access to moment is:

host: moment.usc.edu
user: cybershk_ro
password: CyberShake2007
database: CyberShake
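
As a minimal sketch of querying moment with these credentials (assuming the third-party pymysql package; any MySQL/MariaDB client library would work the same way), the following lists the verified runs for a site:

<pre>
import pymysql

# Read-only credentials from above.
conn = pymysql.connect(host="moment.usc.edu", user="cybershk_ro",
                       password="CyberShake2007", database="CyberShake")
try:
    with conn.cursor() as cur:
        # Status = 'Verified' means the run completed successfully.
        cur.execute("""
            SELECT R.Run_ID
            FROM CyberShake_Runs R, CyberShake_Sites S
            WHERE S.CS_Short_Name = %s
              AND R.Site_ID = S.CS_Site_ID
              AND R.Status = 'Verified'
        """, ("LADT",))    # replace LADT with the site you want
        for (run_id,) in cur.fetchall():
            print(run_id)
finally:
    conn.close()
</pre>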

Focal DB

Focal is the database server for external user queries. We plan to remove all but the most recent few studies from focal, but this is still in progress, so for now focal has all inputs, metadata, and results up through Study 15.12.

Read-only access to focal is:

host: focal.usc.edu
user: cybershk_ro
password: CyberShake2007
database: CyberShake

SQLite files
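
Each past study is also archived as a standalone SQLite database file, which can be queried without a connection to the production servers. As a minimal sketch (the file name below is hypothetical, and the column names are assumptions based on the CyberShake_Sites description later on this page; see the individual study pages for the actual files), Python's built-in sqlite3 module suffices:

<pre>
import sqlite3

# The file name is hypothetical; see the study pages for the actual
# archives.  Column names are assumptions based on the CyberShake_Sites
# description on this page (name, latitude, longitude).
conn = sqlite3.connect("study_15_12.sqlite")
for name, lat, lon in conn.execute(
        "SELECT CS_Short_Name, CS_Site_Lat, CS_Site_Lon FROM CyberShake_Sites"):
    print(name, lat, lon)
conn.close()
</pre>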

CyberShake input data

At the beginning of a CyberShake run, the database is queried to determine site information (name, latitude, longitude). This can be found in the CyberShake_Sites table.

The database is also used to determine which ruptures fall within the 200 km cutoff. This information is used to construct the necessary volume and select the correct rupture files for processing. It can be found in the CyberShake_Site_Ruptures table, which lists, for each site, the ruptures that fall within a given cutoff distance.
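
A rough sketch of that lookup (the CyberShake_Site_Ruptures column names here are assumptions based on the description above):

<pre>
import pymysql

conn = pymysql.connect(host="moment.usc.edu", user="cybershk_ro",
                       password="CyberShake2007", database="CyberShake")
with conn.cursor() as cur:
    # Column names (CS_Site_ID, Source_ID, Rupture_ID, Cutoff_Dist)
    # are assumptions based on the table description above.
    cur.execute("""
        SELECT SR.Source_ID, SR.Rupture_ID
        FROM CyberShake_Site_Ruptures SR, CyberShake_Sites S
        WHERE S.CS_Short_Name = %s
          AND SR.CS_Site_ID = S.CS_Site_ID
          AND SR.Cutoff_Dist = 200
    """, ("LADT",))
    print(cur.rowcount, "ruptures within 200 km of LADT")
conn.close()
</pre>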

Both of these tables are populated by Kevin when we select new sites for CyberShake processing.

CyberShake output data

CyberShake runs produce the following output data, divided into data staged back from the cluster and local data products:

Data staged from cluster:

  • Seismograms
  • Peak spectral acceleration, X and Y component and geometric mean
  • RotD results (for some studies)
  • Duration results (for some studies)

Local data products:

  • Hazard curves
  • Disaggregation results
  • Hazard maps

CyberShake data staged from cluster

The data products below are all generated on the remote system, then staged back to SCEC storage as part of the workflow. Some of these data products are inserted into the database.

Seismograms

Seismogram access is detailed at Accessing CyberShake Seismograms.
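
Full details are on that page. As a quick illustration, here is a minimal sketch of reading a seismogram in the older flat format described in earlier versions of this page (X component then Y component, 3000 4-byte floats each, velocity in cm/s, dt = 0.1 s; little-endian byte order is an assumption). Newer files carry per-rupture-variation headers, so use the layout documented on the access page for those:

<pre>
import struct

NT = 3000   # samples per component
DT = 0.1    # timestep in seconds

# Example file name; the older flat format is one X component followed
# by one Y component, 3000 4-byte floats each, velocity in cm/s.
# Little-endian byte order is an assumption here.
with open("Seismogram_LADT_128_1100_0.grm", "rb") as f:
    x = struct.unpack("<%df" % NT, f.read(4 * NT))
    y = struct.unpack("<%df" % NT, f.read(4 * NT))

print("peak X velocity: %.3f cm/s" % max(abs(v) for v in x))
</pre>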

Acceleration Data

In CyberShake, we have two kinds of acceleration intensity measure data:

  1. X and Y component and geometric mean data
  2. RotD50 and RotD100 data (since Study 15.4).

How you access this data depends on which periods you want, as some of it is in the database and the rest is in files. Access is detailed at Accessing CyberShake Peak Acceleration Data.
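
For the periods that are in the database, a rough sketch of a query follows; the PeakAmplitudes table and its columns, and the run and IM type ID values, are assumptions here, so check the page above for the authoritative schema:

<pre>
import pymysql

conn = pymysql.connect(host="moment.usc.edu", user="cybershk_ro",
                       password="CyberShake2007", database="CyberShake")
with conn.cursor() as cur:
    # Table and column names (PeakAmplitudes, IM_Type_ID, IM_Value) are
    # assumptions; the run ID and IM type ID values are illustrative.
    cur.execute("""
        SELECT Source_ID, Rupture_ID, Rup_Var_ID, IM_Value
        FROM PeakAmplitudes
        WHERE Run_ID = %s AND IM_Type_ID = %s
    """, (1234, 21))
    rows = cur.fetchall()
conn.close()
print(len(rows), "intensity measure values")
</pre>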

Duration

Duration metric data was populated to the database for Study 15.12 (a study with stochastic components) but not for Study 17.3. Accessing duration data likewise depends on whether what you want is in the database or only in files. Details are available in Accessing CyberShake Duration Data.

CyberShake data products generated locally

These data products are generated locally, on shock.usc.edu, in the final stages of the workflow.

Hazard Curves

Hazard curves are produced by combining the intensity measure data in the database, at a certain period, with the probability of each event. The code for performing this is part of OpenSHA.
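
Conceptually, each point on the curve combines, over all ruptures, the rupture's probability with the fraction of its rupture variations whose intensity measure exceeds that level. The sketch below shows the standard Poissonian combination this kind of calculation is based on; it is not the actual OpenSHA code, and the inputs are hypothetical:

<pre>
import math

def prob_exceed(ruptures, level, years=50.0):
    """P(at least one exceedance of `level` within `years` years).

    `ruptures` is a list of (annual_rate, im_values) pairs, where
    im_values holds the intensity measure of each rupture variation.
    These inputs are hypothetical; the real calculation lives in OpenSHA.
    """
    annual = 0.0
    for annual_rate, im_values in ruptures:
        frac = sum(1 for im in im_values if im > level) / len(im_values)
        annual += annual_rate * frac   # annual rate of exceedances
    return 1.0 - math.exp(-annual * years)

# One rupture with annual rate 0.004 and four rupture variations:
print(prob_exceed([(0.004, [0.3, 0.5, 0.8, 1.1])], level=0.5))
</pre>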

Hazard curves are located in the directory /home/scec-00/cybershk/opensha/curves/<site short name>. The convention for the hazard curve name for a particular run, component, and period is:

<site short name>_ERF<erf ID>_Run<Run ID>_SA_<period>sec_<component>_<yyyy>_<mm>_<dd>.<png or pdf>

Note that the year, month, and day are when the run was completed, not when the hazard curve is produced.

In general, hazard curves are automatically generated for the same periods which are inserted into the database.
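
As a small sketch of turning the convention above into a concrete file name (every value below is illustrative, and the "GEOM" component label is an assumption):

<pre>
# Build a hazard-curve file name from the convention above.  All values
# are illustrative, and the "GEOM" component label is an assumption.
name = "{site}_ERF{erf}_Run{run}_SA_{period}sec_{comp}_{y}_{m:02d}_{d:02d}.png".format(
    site="LADT", erf=35, run=1234, period=3, comp="GEOM",
    y=2022, m=6, d=3)
print(name)   # LADT_ERF35_Run1234_SA_3sec_GEOM_2022_06_03.png
</pre>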

Disaggregations

Disaggregations calculate how much each CyberShake source (by 'source' we mean UCERF source) contributes to the overall hazard at a certain point on the hazard curve.

Disaggregations are automatically performed at an exceedance probability of 4e-4 (2% in 50 years). These disaggregation files are available in /home/scec-00/cybershk/opensha/disagg. The convention for the disaggregation file name is:

<site short name>_ERF<erf ID>_Run<Run ID>_DisaggPOE_<probability level>_SA_<period>sec_<yyyy>_<mm>_<dd>.<pdf, png, or txt>

Note that the year, month, and day are when the run was completed, not when the disaggregation was produced.

The PDF and PNG files are images, showing a breakdown of what magnitude events at what distance contributed to the hazard. The PDF and text files also have a numerical breakdown, by source, of the percent contribution.

Hazard Maps

Hazard maps show the hazard for a region. They are produced by sampling many hazard curves at a certain probability or IM level, calculating the difference between each sampled value and the GMPE basemap value at that site, and interpolating these differences on top of the GMPE basemap.
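
A heavily simplified sketch of that interpolation step follows (assuming numpy and scipy; the site locations and values are made up, and the production map-making code is separate and more involved):

<pre>
import numpy as np
from scipy.interpolate import griddata

# CyberShake site locations, hazard-curve values sampled at one
# probability level, and the GMPE basemap value at each site.
# All numbers here are made up for illustration.
site_lonlat   = np.array([[-118.3, 34.0], [-118.0, 34.2], [-117.8, 33.9]])
cs_values     = np.array([0.55, 0.40, 0.35])     # e.g. 3 s SA, in g
gmpe_at_sites = np.array([0.50, 0.45, 0.30])

# Interpolate the CyberShake-minus-GMPE differences across a grid,
# then add them back on top of the GMPE basemap grid.
lon, lat = np.meshgrid(np.linspace(-118.5, -117.5, 101),
                       np.linspace(33.8, 34.3, 51))
diff = griddata(site_lonlat, cs_values - gmpe_at_sites,
                (lon, lat), method="linear")   # NaN outside the site hull
gmpe_basemap = 0.4 * np.ones_like(lon)         # stand-in basemap grid
hazard_map = gmpe_basemap + diff
</pre>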

Hazard maps are generated at the conclusion of each study by Kevin. Maps are posted on the wiki page for each study, under 'Data Products'.


Related Entries