CyberShake Study Database Archiving

SQLite Archiving

The following procedure should be used when we want to archive a study that is in either the production or the data access database to SQLite files on disk, to free up room in the database.

First, dump the old database contents into a new directory by running the following commands.

Note that if you don't want to lock the tables, you can replace '--lock-all-tables' with '--single-transaction=TRUE' in the following commands.
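
For example, the CyberShake_Runs dump from the list below, run without table locks, would look like this (a minimal illustration -- only the flag changes):

  mysqldump --single-transaction=TRUE -u cybershk -p --where 'Study_ID=<study id to archive>' CyberShake CyberShake_Runs > runs.sql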

  1. Determine the Study ID you want to archive, from looking at the Studies table in the DB:
     select * from Studies;
  2. Dump the PeakAmplitudes, using:
     mysqldump --lock-all-tables -u cybershk -p --where 'Run_ID in (select R.Run_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>)' CyberShake PeakAmplitudes > peak_amps.sql
  3. Dump Hazard_Datasets, using:
     mysqldump --lock-all-tables -u cybershk -p --where 'Study_ID=<study id to archive>' CyberShake Hazard_Datasets > hazard_datasets.sql
  4. Dump Hazard_Curves, using:
     mysqldump --lock-all-tables -u cybershk -p --where 'Hazard_Dataset_ID in (select D.Hazard_Dataset_ID from Hazard_Datasets D where D.Study_ID=<study id to archive>)' CyberShake Hazard_Curves > hazard_curves.sql
  5. Dump Hazard_Curve_Points, using:
     mysqldump --lock-all-tables -u cybershk -p --where 'Hazard_Curve_ID in (select C.Hazard_Curve_ID from Hazard_Curves C, Hazard_Datasets D where C.Hazard_Dataset_ID=D.Hazard_Dataset_ID and D.Study_ID=<study id to archive>)' CyberShake Hazard_Curve_Points > hazard_curve_points.sql
  6. Dump CyberShake_Runs, using:
     mysqldump --lock-all-tables -u cybershk -p --where 'Study_ID=<study id to archive>' CyberShake CyberShake_Runs > runs.sql
  7. You now need the rest of the input tables. They used to be quite small, but now they total roughly 70 GB with indices. As of the completion of Study 22.12, here are their approximate sizes (in the DB, not as dump files):
     Table                       Size (GB)
     AR_Hazard_Curve_Points      44.6
     Points                       6.3
     CyberShake_Site_Ruptures    10.3
     Ruptures                     2.3
     Rupture_Variations           2.6
     AR_Hazard_Curves             1.4
     Other tables                 0.5

    You can either:

    • Dump all of the input tables -- they are small enough that we don't mind capturing data which wasn't directly used in this study:
      mysqldump -u cybershk -p CyberShake AR_Hazard_Curve_Points AR_Hazard_Curves AR_Hazard_Datasets Atten_Rel_Metadata Atten_Rels CyberShake_Site_Regions CyberShake_Site_Ruptures CyberShake_Site_Types CyberShake_Sites ERF_IDs ERF_Metadata ERF_Probability_Models IM_Types Points Rupture_Variation_Probability_Modifier Rupture_Variation_Scenario_IDs Rupture_Variation_Scenario_Metadata Rupture_Variations Ruptures SGT_Variation_IDs SGT_Variation_Metadata Studies Time_Spans Velocity_Models > input_tables.sql
    • Dump only the sections of the large input tables that you need. Note that the Points table isn't used anymore, so don't bother to dump it.
      1. mysqldump --lock-all-tables -u cybershk -p --where 'Hazard_Curve_ID in (select C.Hazard_Curve_ID from Hazard_Curves C, Hazard_Datasets D where C.Hazard_Dataset_ID=D.Hazard_Dataset_ID and D.Study_ID=<study id to archive>)' CyberShake AR_Hazard_Curve_Points > ar_hazard_curve_points.sql
      2. mysqldump --lock-all-tables -u cybershk -p --where 'CS_Site_ID in (select R.Site_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>)' CyberShake CyberShake_Site_Ruptures > cs_site_ruptures.sql
      3. mysqldump --lock-all-tables -u cybershk -p --where 'ERF_ID in (select R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and (Source_ID, Rupture_ID) in (select distinct SR.Source_ID, SR.Rupture_ID from CyberShake_Site_Ruptures SR, CyberShake_Runs R where SR.CS_Site_ID=R.Site_ID and SR.ERF_ID=R.ERF_ID and R.Study_ID=<study id to archive>)' CyberShake Ruptures > ruptures.sql
      4. mysqldump --lock-all-tables -u cybershk -p --where '(Rup_Var_Scenario_ID, ERF_ID) in (select R.Rup_Var_Scenario_ID, R.ERF_ID from CyberShake_Runs R where R.Study_ID=<study id to archive>) and (Source_ID, Rupture_ID) in (select distinct SR.Source_ID, SR.Rupture_ID from CyberShake_Site_Ruptures SR, CyberShake_Runs R where SR.CS_Site_ID=R.Site_ID and SR.ERF_ID=R.ERF_ID and R.Study_ID=<study id to archive>)' CyberShake Rupture_Variations > rupture_variations.sql
      5. AR_Hazard_Curves
      6. Other tables

    A scripted sketch of running these per-study dumps follows this list.
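
The sketch assumes the study ID is passed as a command-line argument; the script name and the STUDY_ID variable are illustrative, and only a few of the dumps are shown -- the rest follow the same pattern.

  #!/bin/bash
  # dump_study.sh (hypothetical name): run the per-study mysqldump commands.
  # Usage: ./dump_study.sh <study id to archive>
  STUDY_ID="$1"

  # -p prompts for the cybershk password on each dump.
  mysqldump --lock-all-tables -u cybershk -p \
      --where "Run_ID in (select R.Run_ID from CyberShake_Runs R where R.Study_ID=${STUDY_ID})" \
      CyberShake PeakAmplitudes > peak_amps.sql

  mysqldump --lock-all-tables -u cybershk -p \
      --where "Study_ID=${STUDY_ID}" \
      CyberShake Hazard_Datasets > hazard_datasets.sql

  mysqldump --lock-all-tables -u cybershk -p \
      --where "Study_ID=${STUDY_ID}" \
      CyberShake CyberShake_Runs > runs.sql
  # ...the Hazard_Curves, Hazard_Curve_Points, and input-table dumps follow the same pattern.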

Next, convert the SQL dumps into SQLite format using mysql2sqlite:

  1. For each of the dump files (6 of them if you dumped all of the input tables together), run:
     mysql2sqlite <SQL dump file> > <SQLite dump file>
     Example:  ./mysql2sqlite peak_amps.sql > peak_amps.sqlite

For large tables this may take an hour or two.
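
One way to convert every dump in a single pass is a small shell loop like the sketch below, which assumes the mysql2sqlite script is in the current directory and that all of the dump files end in .sql:

  # Convert each MySQL dump in this directory to SQLite syntax.
  for f in *.sql; do
      ./mysql2sqlite "$f" > "${f%.sql}.sqlite"
  done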

Create a SQLite database and import the tables. Use sqlite3 3.7.11 or later, or there will be an error reading the dump files.

  1. Create a database for this study:
     sqlite3 <study name>
  2. For each dump file, run the following command (an example session is shown below):
     .read <path/to/dump/file.sqlite>
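
This example session is purely illustrative -- the database file name and dump file names are assumptions based on the earlier steps; the final .tables command just lists the imported tables as a quick sanity check.

  sqlite3 study_22_12.sqlite
  sqlite> .read peak_amps.sqlite
  sqlite> .read hazard_datasets.sqlite
  sqlite> .read hazard_curves.sqlite
  sqlite> .read hazard_curve_points.sqlite
  sqlite> .read runs.sqlite
  sqlite> .read input_tables.sqlite
  sqlite> .tables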

Run a few queries against the original tables in the database and against the SQLite file, and check that the counts are the same (one way to compare is sketched after the queries below).

  1. Check the number of PeakAmplitudes:
     select count(*) from PeakAmplitudes P, CyberShake_Runs R where R.Study_ID=<study_id> and P.Run_ID=R.Run_ID;
  2. Check the number of rupture variations for each site:
     select count(*) from CyberShake_Runs R, CyberShake_Site_Ruptures SR, Rupture_Variations V where R.Site_ID=SR.CS_Site_ID and SR.Source_ID=V.Source_ID and SR.Rupture_ID=V.Rupture_ID and R.ERF_ID=V.ERF_ID and R.ERF_ID=SR.ERF_ID and V.Rup_Var_Scenario_ID=R.Rup_Var_Scenario_ID and R.Study_ID=<study id>;
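
For instance, the PeakAmplitudes count can be run from the command line against both databases and compared by eye; the SQLite file name here is an assumption:

  # Count in the original MySQL database:
  mysql -u cybershk -p CyberShake -e 'select count(*) from PeakAmplitudes P, CyberShake_Runs R where R.Study_ID=<study_id> and P.Run_ID=R.Run_ID;'
  # Count in the SQLite archive -- the two numbers should match:
  sqlite3 study_22_12.sqlite 'select count(*) from PeakAmplitudes P, CyberShake_Runs R where R.Study_ID=<study_id> and P.Run_ID=R.Run_ID;'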

Move the files to the archive location on project.

  1. Tar up the sqlite files:
     tar czvf <study_name>.tgz *.sqlite
  2. SFTP the tgz file to the study sqlite archive location on project at CARC (/project/scec_608/cybershake/results/sqlite_studies/<study name>), using hpc-transfer1 (an example transfer is sketched below).
  3. Extract the database to /project/scec_608/cybershake/results/sqlite_studies/<study_name>:
     tar xzvf sqlite_dumps/<study_name>.tgz <study_name>.sqlite
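
This sketch assumes hpc-transfer1 resolves from your machine; substitute your own username and the actual study name:

  sftp <username>@hpc-transfer1
  sftp> cd /project/scec_608/cybershake/results/sqlite_studies/<study name>
  sftp> put <study_name>.tgz
  sftp> exit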

Move study from production DB to data access DB