2016 CyberShake database migration

== Overview of CyberShake data products ==

CyberShake is a multi-layered seismic hazard model. The CyberShake system is designed with two primary interfaces to external programs, both of which operate through a MySQL database. The MySQL database schema is maintained in the CyberShake SVN repository; a recent version is posted [http://hypocenter.usc.edu/research/CyberShake/createDB_sql_v3_7.txt online]. CyberShake has an input interface for UCERF rupture forecasts: OpenSHA uses this interface to populate the CyberShake database with fault models and rupture probabilities. Internal CyberShake programs populate additional database tables with information about sites, velocity models, and other hazard-model-specific information. The CyberShake system then runs on HPC systems, and much of the processing is conducted using scientific workflow tools. The end point of the CyberShake workflow processing is a set of output database tables on the CyberShake production database server. These tables represent the CyberShake output interface, which provides study information, site information, hazard curves, and peak amplitude information. Associated output data not currently in the database include SRF files and seismogram files. These external output data are accessed through file-based lookups, often depending on metadata encoded in the output file names.
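
For example, a downstream tool might read hazard-curve points through the output interface with a query along the following lines. This is a minimal sketch: the server host, account, database name, and the table and column names are illustrative assumptions, not the actual schema (the schema file linked above defines the real tables).

<pre>
# Minimal sketch of reading the CyberShake output interface.
# Host, account, database, and all table/column names are assumptions.
mysql -h <db-server> -u cybershk_ro -p CyberShake <<'SQL'
-- Hazard curve points for one site at one spectral period (names assumed):
SELECT p.x_value AS ground_motion_g,
       p.y_value AS annual_prob_exceedance
FROM   Hazard_Curve_Points p
JOIN   Hazard_Curves    c ON c.Curve_ID = p.Curve_ID
JOIN   CyberShake_Sites s ON s.Site_ID  = c.Site_ID
WHERE  s.Short_Name = 'USC'
  AND  c.Period = 3.0
ORDER  BY p.x_value;
SQL
</pre>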

To clarify terminology:

*"Input data": Rupture data, ERF-related data, sites data. This data is shared between studies.
*"Run data": The parameters used with each run, timestamps, systems, and study membership. A run is part of only a single study.
*"Output DB data": Peak amplitude data, points on hazard curves, and the frequencies for which hazard curves are calculated.
*"Output External data": SRF files and seismogram files.

== CyberShake Database server hardware ==

*Production CyberShake DB (focal): MySQL Server version 5.1.73, source distribution.
*Read-only CyberShake DB (moment): Server version 10.0.23-MariaDB (a quick version check is sketched below).
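
The versions above can be confirmed by asking each server directly. A sketch, assuming the focal/moment hostnames resolve as-is and that a read account exists:

<pre>
# Check server versions (hostnames and account are assumptions):
mysql -h focal  -u cybershk_ro -p -e "SELECT VERSION();"   # expect 5.1.73
mysql -h moment -u cybershk_ro -p -e "SELECT VERSION();"   # expect 10.0.23-MariaDB
</pre>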

== Goals of DB Migration ==

*Improve performance of production CyberShake runs. Indications are that database write performance is slowing down our CyberShake production runs.
*Separate production data from completed studies to reduce the possibility that new production runs will affect completed studies.
*Provide improved read performance for users of the CyberShake external output interface, such as programs that plot CyberShake hazard maps, hazard curves, and peak amplitudes.
*Build CyberShake data access mechanisms and infrastructure that will support the planned UGMS CyberShake MCER web site.

== Status of DB resources following migration ==

*Production MySQL software upgraded to the current version.
*Production MySQL database server running on upgraded computer hardware.
*Read-only MySQL database server running on adequate computer hardware and storage.
*CyberShake production server contains one database with all input data, the runs and output data for Studies 15.12 and 15.4, and the runs and output data for runs which are not associated with any study.
*CyberShake read-only database server contains two databases: one with Study 15.4 data and one with Study 15.12 data (the expected layout is sketched below).
*SQLite3 chosen as the archival format for older studies.
*After the above is complete, migrate older studies to the SQLite3 format and delete them from the production server.
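
A sketch of what this layout might look like from the command line; the database and file names here are illustrative assumptions, not the actual names:

<pre>
# Production server: one database with input data plus Study 15.4/15.12
# and non-study runs and output (database names are assumptions):
mysql -h <production-server> -u cybershk -p -e "SHOW DATABASES;"
#   CyberShake

# Read-only server: one database per retained study:
mysql -h <read-only-server> -u cybershk_ro -p -e "SHOW DATABASES;"
#   CyberShake_Study_15_4
#   CyberShake_Study_15_12

# Older studies: one SQLite3 file per study:
sqlite3 study_14_2.sqlite ".tables"
</pre>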

== Detailed Procedure for CyberShake DB Migration ==

#Run mysqldump on the entire DB on focal. Generate dump files for all the input data, each study's output and runs data, and the runs and output data which are not part of any study (a condensed command sketch follows this list).
#Delete the database on moment.
#Reconfigure the DB on moment (single file per table, etc.).
#Load Study 15.12, Study 15.4, and non-study data into the DB on moment using the InnoDB engine.
#Confirm the reload into moment was successful.
#Convert older study runs, output data, and all input data from the MySQL dump files into SQLite format. Create a different DB for each study.
#Confirm the reloads into SQLite format were successful.
#Delete the database on focal.
#Load input data, Study 15.12 runs+output data, and Study 15.4 runs+output data onto focal for read-only access, using the MyISAM engine. Each study is in a separate database.
#Swap the names of focal and moment so we don't have to change all our scripts.
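
Below is a condensed command-line sketch of the procedure above. All hostnames, accounts, database and dump-file names, table names, the study selection, and the converter script are illustrative assumptions; the real runs would enumerate the actual schema tables and study IDs.

<pre>
# Step 1: dump input data and per-study runs+output data on focal
# (table names and the study selection are assumptions).
mysqldump -h focal -u cybershk -p CyberShake \
    Ruptures Rupture_Variations CyberShake_Sites > input_data.sql
mysqldump -h focal -u cybershk -p \
    --where="Run_ID IN (SELECT Run_ID FROM CyberShake_Runs WHERE Study_ID=<id>)" \
    CyberShake PeakAmplitudes > study_15_12_output.sql
    # repeat for each output table and each study

# Step 3: reconfigure moment for one file per InnoDB table (my.cnf):
#   [mysqld]
#   innodb_file_per_table = 1
#   default-storage-engine = InnoDB

# Step 4: load Study 15.12, Study 15.4, and non-study data into moment.
mysql -h moment -u cybershk -p CyberShake < input_data.sql
mysql -h moment -u cybershk -p CyberShake < study_15_12_output.sql

# Step 5: confirm the reload, e.g. by comparing per-table row counts.
mysql -h focal  -N -e "SELECT COUNT(*) FROM CyberShake.PeakAmplitudes;"
mysql -h moment -N -e "SELECT COUNT(*) FROM CyberShake.PeakAmplitudes;"

# Steps 6-7: convert an older study's dump to SQLite, one DB per study.
# MySQL dump syntax is not valid SQLite input, so a translation step is
# needed, e.g. a converter script such as mysql2sqlite:
./mysql2sqlite study_14_2.sql | sqlite3 study_14_2.sqlite
sqlite3 study_14_2.sqlite "SELECT COUNT(*) FROM PeakAmplitudes;"

# Step 9: load the completed studies back onto focal read-only, one
# database per study, forcing the MyISAM engine in the dump files:
sed 's/ENGINE=InnoDB/ENGINE=MyISAM/' study_15_12_output.sql | \
    mysql -h focal -u cybershk -p CS_Study_15_12
</pre>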

Since the input data is much smaller (~100x) than the output data, we will keep a full copy of it with each study. It would be much more time-intensive to identify which subset of the input data applies to a given study, and the extra space needed to keep it all is trivial. However, for each study, we will only keep the runs data for runs which are associated with that study.