Difference between revisions of "CyberShake Study Database Archiving"

From SCECpedia
Jump to navigationJump to search
Line 45: Line 45:
 
<ol>
 
<ol>
 
<li>mysqldump --lock-all-tables -u cybershk -p --where 'AR_Hazard_Curve_ID in (select distinct C.AR_Hazard_Curve_ID from AR_Hazard_Curves C where C.AR_Hazard_Curve_Dataset_ID in (select distinct D.AR_Hazard_Dataset_ID from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat>=(select distinct D.Min_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat<=(select distinct D.Max_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon>=(select distinct D.Min_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon<=(select distinct D.Max_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) )' CyberShake AR_Hazard_Curve_Points > ar_hazard_curve_points.sql</li>
 
<li>mysqldump --lock-all-tables -u cybershk -p --where 'AR_Hazard_Curve_ID in (select distinct C.AR_Hazard_Curve_ID from AR_Hazard_Curves C where C.AR_Hazard_Curve_Dataset_ID in (select distinct D.AR_Hazard_Dataset_ID from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat>=(select distinct D.Min_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat<=(select distinct D.Max_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon>=(select distinct D.Min_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon<=(select distinct D.Max_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) )' CyberShake AR_Hazard_Curve_Points > ar_hazard_curve_points.sql</li>
<li>mysqldump --lock-all-tables -u cybershk -p --where 'Hazard_Curve_ID in (select C.Hazard_Curve_ID from Hazard_Curves C, Hazard_Datasets D where D.Study_ID=<study id to archive>)' CyberShake AR_Hazard_Curve_Points > ar_hazard_curve_points.sql</li>
 
 
<li>mysqldump --lock-all-tables -u cybershk -p --where 'CS_Site_ID in (select R.Site_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>)' CyberShake CyberShake_Site_Ruptures > cs_site_ruptures.sql</li>
 
<li>mysqldump --lock-all-tables -u cybershk -p --where 'CS_Site_ID in (select R.Site_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>)' CyberShake CyberShake_Site_Ruptures > cs_site_ruptures.sql</li>
 
<li>mysqldump --lock-all-tables -u cybershk -p --where 'ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and (Source_ID, Rupture_ID) in (select distinct SR.Source_ID, SR.Rupture_ID from CyberShake_Site_Ruptures SR, CyberShake_Runs R where SR.CS_Site_ID=R.Site_ID and SR.ERF_ID=R.ERF_ID and R.Study_ID=<study id to archive>)' CyberShake Ruptures > ruptures.sql</li>
 
<li>mysqldump --lock-all-tables -u cybershk -p --where 'ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and (Source_ID, Rupture_ID) in (select distinct SR.Source_ID, SR.Rupture_ID from CyberShake_Site_Ruptures SR, CyberShake_Runs R where SR.CS_Site_ID=R.Site_ID and SR.ERF_ID=R.ERF_ID and R.Study_ID=<study id to archive>)' CyberShake Ruptures > ruptures.sql</li>

Revision as of 23:29, 11 January 2024

SQLite Archiving

The following procedure should be used when we want to archive a study to SQLite which is either in the production or data access database to disk, to free up room in the database.

First, dump the old database contents into a new directory by running the following commands.

Note that if you don't want to lock the tables, you can replace '--lock-all-tables' with '--single-transaction=TRUE' in the following commands.

  1. Determine the Study ID you want to archive, from looking at the Studies table in the DB:
  2. select * from Studies;
  3. Dump the PeakAmplitudes, using:
  4. mysqldump --lock-all-tables -u cybershk -p --where 'Run_ID in (select R.Run_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>)' CyberShake PeakAmplitudes > peak_amps.sql
  5. Dump Hazard_Datasets, using:
  6. mysqldump --lock-all-tables -u cybershk -p --where 'Study_ID=<study id to archive>' CyberShake Hazard_Datasets > hazard_datasets.sql
  7. Dump Hazard_Curves, using:
  8. mysqldump --lock-all-tables -u cybershk -p --where 'Hazard_Dataset_ID in (select D.Hazard_Dataset_ID from Hazard_Datasets D where D.Study_ID=<study id to archive>)' CyberShake Hazard_Curves > hazard_curves.sql
  9. Dump Hazard_Curve_Points, using:
  10. mysqldump --lock-all-tables -u cybershk -p --where 'Hazard_Curve_ID in (select C.Hazard_Curve_ID from Hazard_Curves C, Hazard_Datasets D where D.Study_ID=<study id to archive>)' CyberShake Hazard_Curve_Points > hazard_curve_points.sql
  11. Dump CyberShake_Runs, using:
  12. mysqldump --lock-all-tables -u cybershk -p --where 'Study_ID=<study id to archive>' CyberShake CyberShake_Runs > runs.sql
  13. You now need the rest of the input tables. They used to be pretty small, but now they're up to ~70GB with indices. As of the completion of Study 22.12, here are their approximate sizes (in the DB, not as dump files):
    Table Size (GB)
    AR_Hazard_Curve_Points 44.6
    Points 6.3
    CyberShake_Site_Ruptures 10.3
    Ruptures 2.3
    Rupture_Variations 2.6
    AR_Hazard_Curves 1.4
    Other tables 0.5

    You can either:

    • Dump all of the input tables -- they are small enough that we don't mind capturing data which wasn't directly used in this study.
    • mysqldump -u cybershk -p CyberShake AR_Hazard_Curve_Points AR_Hazard_Curves AR_Hazard_Datasets Atten_Rel_Metadata Atten_Rels CyberShake_Site_Regions CyberShake_Site_Ruptures CyberShake_Site_Types CyberShake_Sites ERF_IDs ERF_Metadata ERF_Probability_Models IM_Types Points Rupture_Variation_Probability_Modifier Rupture_Variation_Scenario_IDs Rupture_Variation_Scenario_Metadata Rupture_Variations Ruptures SGT_Variation_IDs SGT_Variation_Metadata Studies Time_Spans Velocity_Models > input_tables.sql
    • Dump only the sections of the large input tables that you need. Note that the Points table isn't used anymore, so don't bother to dump it.
      1. mysqldump --lock-all-tables -u cybershk -p --where 'AR_Hazard_Curve_ID in (select distinct C.AR_Hazard_Curve_ID from AR_Hazard_Curves C where C.AR_Hazard_Curve_Dataset_ID in (select distinct D.AR_Hazard_Dataset_ID from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat>=(select distinct D.Min_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat<=(select distinct D.Max_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon>=(select distinct D.Min_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon<=(select distinct D.Max_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) )' CyberShake AR_Hazard_Curve_Points > ar_hazard_curve_points.sql
      2. mysqldump --lock-all-tables -u cybershk -p --where 'CS_Site_ID in (select R.Site_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>)' CyberShake CyberShake_Site_Ruptures > cs_site_ruptures.sql
      3. mysqldump --lock-all-tables -u cybershk -p --where 'ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and (Source_ID, Rupture_ID) in (select distinct SR.Source_ID, SR.Rupture_ID from CyberShake_Site_Ruptures SR, CyberShake_Runs R where SR.CS_Site_ID=R.Site_ID and SR.ERF_ID=R.ERF_ID and R.Study_ID=<study id to archive>)' CyberShake Ruptures > ruptures.sql
      4. mysqldump --lock-all-tables -u cybershk -p --where '(Rup_Var_Scenario_ID, ERF_ID) in (select R.Rup_Var_Scenario_ID, R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and (Source_ID, Rupture_ID) in (select distinct SR.Source_ID, SR.Rupture_ID from CyberShake_Site_Ruptures SR, CyberShake_Runs R where SR.CS_Site_ID=R.Site_ID and SR.ERF_ID=R.ERF_ID and R.Study_ID=<study id to archive>)' CyberShake Rupture_Variations > rupture_variations.sql
      5. mysqldump --lock-all-tables -u cybershk -p --where 'AR_Hazard_Dataset_ID in (select distinct D.AR_Hazard_Dataset_ID from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat>=(select distinct D.Min_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lat<=(select distinct D.Max_Lat from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon>=(select distinct D.Min_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID) and Lon<=(select distinct D.Max_Lon from AR_Hazard_Datasets D, CyberShake_Runs R where R.Study_ID=<study id to archive> and D.Time_Span_ID=1 and D.Prob_Model_ID=1 and D.ERF_ID=R.ERF_ID and D.Velocity_Model_ID=R.Velocity_Model_ID)' CyberShake AR_Hazard_Curves > ar_hazard_curves.sql
      6. mysqldump --lock-all-tables -u cybershk -p CyberShake AR_Hazard_Datasets Atten_Rel_Metadata Atten_Rels CyberShake_Site_Regions CyberShake_Site_Types CyberShake_Sites ERF_IDs ERF_Metadata ERF_Probability_Models IM_Types Rupture_Variation_Probability_Modifier Rupture_Variation_Scenario_IDs Rupture_Variation_Scenario_Metadata SGT_Variation_IDs SGT_Variation_Metadata Studies Time_Spans Velocity_Models

Next, convert the SQL dumps into SQLite format using mysql2sqlite:

  1. For each of the 6 dump files, run:
 mysql2sqlite <SQL dump file> > <SQLite dump file>
 Example:  ./mysql2sqlite peak_amps.sql > peak_amps.sqlite

For large tables this may take an hour or two.

Create a SQLite database and import the tables. Use sqlite3 3.7.11 or later, or there will be an error reading the dump files.

  1. Create a database for this study:
  2. sqlite3 <study name>
  3. For each dump file, run the following command:
  4. .read <path/to/dump/file.sqlite>

Run a few queries on the original table on the database, and on the sqlite files to check that the count is the same.

  1. Check the number of PeakAmplitudes:
  2. select count(*) from PeakAmplitudes P, CyberShake_Runs R where R.Study_ID=<study_id> and P.Run_ID=R.Run_ID;
  3. Check the number of rupture variations for each site:
  4. select count(*) from CyberShake_Runs R, CyberShake_Site_Ruptures SR, Rupture_Variations V where R.Site_ID=SR.CS_Site_ID and SR.Source_ID=V.Source_ID and SR.Rupture_ID=V.Rupture_ID and R.ERF_ID=V.ERF_ID and R.ERF_ID=SR.ERF_ID and V.Rup_Var_Scenario_ID=R.Rup_Var_Scenario_ID and R.Study_ID=<study id>;

Create index on Run_ID and IM_Type_ID in PeakAmplitudes table to speed up access.

  1. Copy over the original version to a new file.
  2. Set the SQLITE_TMPDIR variable to point to a filesystem with lots of free space.
  3. Using a current version of sqlite3, open the new database file and run:
  4. CREATE INDEX "idx_PeakAmplitudes_Run_ID_IM_Type_ID" ON "PeakAmplitudes" (`Run_ID`, `IM_Type_ID`);

Move the files to the archive location on project.

  1. Tar up the sqlite files:
  2. tar czvf <study_name>.tgz *.sqlite
  3. SFTP the tgz file to the study sqlite archive location on project at CARC (/project/scec_608/cybershake/results/sqlite_studies/<study name>), using hpc-transfer1.
  4. Extract the database to /project/scec_608/cybershake/results/sqlite_studies/<study_name> .
  5. tar xzvf sqlite_dumps/<study_name>.tgz <study_name>.sqlite

Move study to from production DB to data access DB