DirectSynth

DirectSynth is a CyberShake post-processing code authored by Scott Callaghan in early 2015. It provides an alternative to separate SGT extraction and seismogram synthesis/PSA/RotD jobs, and requires fewer SUs and less I/O, making it a more efficient choice for CyberShake runs at frequencies at or above 1 Hz.

Overview

At a high level, DirectSynth works by reading in SGTs across a large number of processors. Another set of processors works through a list of seismogram synthesis/PSA/RotD tasks, requesting the SGTs needed for synthesis from the processor(s) which have them in memory. Since the total quantity of SGTs needed to synthesize a seismogram may be larger than what can fit into a single processor's memory, the requests are divided up so as not to exceed 1 GB at a time. Since multiple rupture variations will use the same SGTs, as many rupture variations as can fit into a processor's memory will be synthesized at the same time. Output files (seismograms, PSA, RotD) are sent to a single process for writing. Whenever all the rupture variations for a rupture are completed, that file is fsync()ed and the source and rupture ID are recorded in a checkpoint file.
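
As a concrete illustration of the checkpointing described above (a minimal sketch, not the actual DirectSynth I/O code): once every rupture variation for a rupture has been written, the output file is fsync()ed and the source and rupture IDs are appended to the checkpoint file. The one-pair-per-line text checkpoint format here is an assumption for illustration.

  #include <stdio.h>
  #include <unistd.h>   /* fsync */

  /* Force the completed rupture's output to disk, then record the
   * (source, rupture) pair so a restart can skip this rupture. */
  int checkpoint_rupture(FILE *output_fp, const char *checkpoint_path,
                         int source_id, int rupture_id) {
      if (fflush(output_fp) != 0 || fsync(fileno(output_fp)) != 0)
          return -1;

      FILE *cp = fopen(checkpoint_path, "a");
      if (cp == NULL)
          return -1;
      fprintf(cp, "%d %d\n", source_id, rupture_id);
      fflush(cp);
      return fclose(cp);
  }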

Details

Each process in DirectSynth takes on one of four roles:

  • Master (process rank 0).
    1. Reads in header information, broadcasts to everyone.
    2. Determines which SGT points go to which SGT handlers, broadcasts to everyone.
    3. Sends point header data to relevant SGT handler.
    4. While there is still work to do:
      1. Gathers data from workers and writes to files
      2. Updates checkpoint file as ruptures finish
  • SGT Handlers (processes 1 - <num SGT handlers-1>)
    1. Receive header information from master.
    2. Receive mapping of SGT points to handlers.
    3. Receive point header data.
    4. Together, MPI-read in SGT files, with each handler reading its assigned section.
    5. While there is still work to do:
      1. Handle worker requests for SGTs by sending relevant header info followed by SGT data
  • Task Manager (process <num SGT handlers>)
    1. Read in the list of ruptures for which seismograms are being synthesized
    2. For each rupture in the list, determine how many tasks are needed to calculate all the rupture variations for that rupture (sketched after this list). The limiting factor is memory per worker process, since (size of rupture variation)*num_rupture_variations + SGT buffer < 1.8 GB. If the rupture is in the checkpoint file, it has already been calculated and is excluded from the task list.
    3. While there are still tasks in the task list:
      1. Handle worker requests for more work by sending a task description
  • Workers (processes <num SGT handlers + 1> - end)
    1. Receive mapping of SGT points to handlers.
    2. While there is still work to do:
      1. Request and receive task from Task Manager.
      2. Perform task:
        1. Generate rupture variations
        2. Determine list of SGTs which are required
        3. Divide list into individual requests, not to exceed 1 GB (each request goes to 1 SGT Handler; sketched after this list)
        4. For each request:
          1. Request and obtain SGTs and header information from the correct SGT Handler
          2. Perform processing and convolution of SGTs with each rupture variation
        5. Perform PSA and RotD calculations
      3. Send data to Master for writing to the filesystem.
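
The Task Manager's task-splitting rule (step 2 above) can be sketched as follows, using the 1.8 GB per-worker memory limit from the text. The rupture variation size and SGT buffer size are left as parameters, since their actual values are run-dependent and not specified on this page; treating 1.8 GB as decimal bytes is also an assumption.

  #include <stdint.h>

  /* Largest number of rupture variations a worker can hold next to its SGT
   * buffer:  rv_size * num_rvs + sgt_buffer_bytes < 1.8 GB. */
  int max_rvs_per_task(int64_t rv_size_bytes, int64_t sgt_buffer_bytes) {
      const int64_t worker_limit = 1800000000LL;   /* 1.8 GB per worker (decimal GB assumed) */
      int64_t available = worker_limit - sgt_buffer_bytes;
      if (available <= 0 || rv_size_bytes <= 0)
          return 0;
      return (int)(available / rv_size_bytes);
  }

  /* Tasks needed to cover all rupture variations of one rupture (ceiling). */
  int tasks_for_rupture(int num_rupture_variations, int max_rvs) {
      if (max_rvs <= 0)
          return -1;   /* a single rupture variation does not fit in a worker */
      return (num_rupture_variations + max_rvs - 1) / max_rvs;
  }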
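
Similarly, the Workers' request splitting (step 2.3 above) amounts to cutting the task's SGT point list, grouped by the handler that owns each point, into requests of at most 1 GB each, with each request directed to a single SGT Handler. A minimal sketch, assuming a fixed per-point data size and a point list already sorted by handler (both simplifications for illustration):

  struct sgt_request {                 /* one request message to one SGT handler */
      int handler;                     /* rank of the handler holding the points */
      int first_point, num_points;     /* slice of the task's sorted point list */
  };

  /* handler_of[i] is the handler owning the i-th SGT point the task needs,
   * with the point list already sorted by handler.  Returns the number of
   * requests written to 'out', or -1 if 'out' is too small. */
  int split_requests(const int *handler_of, int num_points, long bytes_per_point,
                     struct sgt_request *out, int max_requests) {
      const long limit = 1024L * 1024 * 1024;      /* 1 GB of SGT data per request */
      int nreq = 0;
      long bytes = 0;
      for (int i = 0; i < num_points; i++) {
          int start_new = (nreq == 0) ||
                          (handler_of[i] != out[nreq - 1].handler) ||
                          (bytes + bytes_per_point > limit);
          if (start_new) {
              if (nreq == max_requests)
                  return -1;
              out[nreq].handler = handler_of[i];
              out[nreq].first_point = i;
              out[nreq].num_points = 0;
              nreq++;
              bytes = 0;
          }
          out[nreq - 1].num_points++;
          bytes += bytes_per_point;
      }
      return nreq;
  }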

Communication patterns

By far, the majority of the communication is between the SGT handlers and the workers. For an average 1 Hz CyberShake run, here are communication estimates.

Assumptions

  • 1 Hz CyberShake run
  • 750 GB SGTs per component (8.39M points)
  • 1.2 GB SGT header per component
  • 4000 timesteps
  • 7000 ruptures
  • Average of 69 rupture variations per rupture
  • Average of 41,500 tasks = 12 rupture variations/task (see the arithmetic check after this list).
  • 116 nodes: 1 master, 1151 SGT handlers, 1 task manager, 2559 workers.
  • Average of 7290 SGT points managed per SGT handler.
  • Average of 38220 points per rupture.
  • Average of 500 SGTs per request.
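
The derived figures above follow directly from the listed inputs; a quick arithmetic check, using only values from the text:

  #include <stdio.h>

  int main(void) {
      double ruptures = 7000, rvs_per_rupture = 69, tasks = 41500;
      double sgt_points = 8.39e6, handlers = 1151;
      int processes = 1 + 1151 + 1 + 2559;   /* master + handlers + task manager + workers */

      printf("rupture variations per task: %.1f\n",
             ruptures * rvs_per_rupture / tasks);   /* ~11.6, listed as 12 */
      printf("SGT points per handler:      %.0f\n",
             sgt_points / handlers);                /* ~7290 */
      printf("processes per node:          %d\n",
             processes / 116);                      /* 3712 / 116 = 32 */
      return 0;
  }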

Communication

  • Broadcast of SGT header, from master: 192 MB
  • Broadcast of SGT points to SGT handler mapping, from master: 4 KB
  • Sending of SGT point headers, from master to individual SGT handlers: 911 KB/SGT handler x 1151 SGT handlers = 1 GB
  • Each worker processes 17.1 tasks, which requires data from the SGT handler to the worker: 17.1 tasks * 137220 SGT points/task * 188 KB/SGT point = 420 GB/worker x 2559 workers = 997 TB.
  • Output is sent from worker to master: 64 KB/rupture variation x 12 rupture variations/task x 16 tasks/worker = 12 MB/worker x 2559 workers = 30 GB.

Walltime is about 10 hours, so this requires average throughput of 28.4 GB/sec.
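
A sketch of the arithmetic behind two of the figures above: the ~188 KB of SGT data per point (assuming the two horizontal SGT components that figure implies) and the ~28.4 GB/s average throughput. Binary units (1 GB = 1024 MB) reproduce the numbers in the text.

  #include <stdio.h>

  int main(void) {
      double bytes_per_component = 750.0 * 1024 * 1024 * 1024;  /* 750 GB of SGTs per component */
      double sgt_points = 8.39e6;
      double components = 2;                  /* two horizontal SGT components assumed */

      printf("SGT data per point: %.1f KB\n",
             components * bytes_per_component / sgt_points / 1024);  /* ~187.5, listed as 188 KB */

      double total_tb = 997;                  /* handler-to-worker traffic, from above */
      double walltime_s = 10 * 3600;          /* ~10 hour walltime */
      printf("average throughput: %.1f GB/s\n",
             total_tb * 1024 / walltime_s);                          /* ~28.4 GB/s */
      return 0;
  }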

Essentially, we have a bundling factor of 12; on average, each SGT we transfer is used to calculate 12 rupture variations. If we permit more memory to be used for rupture variation storage and less for SGTs, we can increase the bundling factor but also increase the number of SGT requests.

If we decrease the SGT request size to 512 MB, this would create the following changes (compared in the sketch after this list):

  • Average number of rupture variations per task increases to 30.
  • Number of tasks decreases to 28,800.
  • Each worker processes 11.3 tasks, which requires data from the SGT handler to the worker: 11.3 tasks * 130310 SGT points/task * 188 KB/SGT point = 262.8 GB/worker x 2559 workers = 657 TB; throughput of 18.7 GB/sec.
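
For comparison, the same throughput arithmetic applied to both request sizes, using the aggregate handler-to-worker volumes from the text, the ~10 hour walltime, and 1 TB = 1024 GB:

  #include <stdio.h>

  int main(void) {
      const char *label[2] = {"1 GB SGT requests  ", "512 MB SGT requests"};
      double volume_tb[2]  = {997, 657};      /* handler-to-worker traffic, from above */
      double walltime_s    = 10 * 3600;

      for (int i = 0; i < 2; i++)
          printf("%s: %4.0f TB moved, %.1f GB/s average\n",
                 label[i], volume_tb[i], volume_tb[i] * 1024 / walltime_s);
      return 0;
  }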