Parallel Broadband


Overview

The original SCEC Broadband Platform executes a simulation workflow serially: each code within each module, and the workflow itself, runs one step at a time. All BBP simulations involve this sequence of steps:

  1. rupture generation
  2. low-frequency seismogram synthesis for N stations
  3. high-frequency seismogram synthesis for N stations
  4. application of site effects for N stations
  5. goodness of fit calculations
  6. plotting


Rupture generation reads in a few input files and outputs a single source rupture description. Seismogram synthesis and site effects run on a list of N stations. Goodness of fit and plotting require information from the previously generated seismograms of all stations.


There are three levels at which the platform can be parallelized:

  1. Parallelize the scientific codes used in each module (MPI, OpenMP)
  2. Parallelize the individual simulation workflow (parallelize the seismogram synthesis by station)
  3. Parallelize groups of simulations (multiple simulation workflows can execute simultaneously)


The approach described here is the second option. Seismogram synthesis and site effects are broken out by station, and those modules are run in parallel. A simulation involving 100 stations may be run on up to 100 cores, yielding up to a 100x speed-up in the station-parallel stages.
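
Conceptually, the station-level parallelism in option 2 amounts to launching the per-station stages as independent processes and waiting for them all to finish. The following shell sketch illustrates the idea only; synth_station and station_list.stl are hypothetical stand-ins, not actual BBP commands or files:

# Illustration only: launch one (hypothetical) synthesis process per
# station listed in a (hypothetical) station list, then wait for all.
while read station; do
    synth_station "$station" &   # hypothetical per-station command
done < station_list.stl
wait                             # block until every station finishes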


Parallel Broadband

A new software layer has been added on top of the existing BBP software stack that allows a user to parallelize an existing serial workflow and execute it either on multiple cores within a single host or on multiple nodes of an HPC resource. The user begins by creating a simulation workflow description in the traditional way (using run_bbp_2G.py) and saving it to disk. The new parallel interface is then invoked on that workflow description: the serial workflow is parsed, parallelized, and optionally executed.


Supported System Configurations

Parallel Broadband can be run on three types of systems:

  1. Any stand-alone, multi-processor computer running Linux
  2. Any HPC resource running Linux with a job scheduler that supports multi-node jobs, and whose MoM (job-launch) nodes can ssh into the compute nodes of the job
  3. Remotely, via GRAM, to any system as described in #2 (requires BBP to be installed on both the local system and the remote HPC host)


For example, Parallel Broadband can be run on a quad-core desktop computer and use all four cores, or it can run as a job submitted to USC HPCC and use as many cores as there are stations in the simulation.


System Requirements

  • Running a parallel workflow locally:
    • BBP installation (with all of its dependencies)
    • Optional job manager/scheduler
  • Running a parallel workflow remotely:
    • BBP installation (with all of its dependencies) on both local and remote system
    • GRAM support on remote system
    • GridFTP support on both local and remote systems
    • Job manager on remote system


Installation

Parallel BBP is fully integrated with the original Broadband platform and shares its installation process. Depending on how the system is run, the parallel functionality may require additional configuration steps.


HPC System Installation Notes

When a parallel simulation is run as a job on an HPC host, the user may need to manually set up the default environment on the compute nodes. Specifically, the user may need to configure environment variables (PATH and PYTHONPATH, for example) so that the correct version of Python is used. The shell script ./comps/setup_bbp_env.sh is the place to declare those custom environment settings.
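
For example, a minimal setup_bbp_env.sh might contain nothing more than the following export statements (the paths are placeholders; point them at your own Python and BBP installations):

#!/bin/bash
# Example environment for BBP compute nodes (placeholder paths)
export PATH=/home/username/opt/Python-2.6.2/bin:/bin:/usr/bin
export PYTHONPATH=/home/username/bbp/comps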

If run on a stand-alone computer, customizing that shell script is unnecessary.


Remote Job Submission Installation Notes

If you want to submit simulations remotely from one computer to an HPC system with GRAM, the Broadband platform must be installed and configured on both systems. In addition, both installations must list a host configuration for the remote HPC system in ./comps/install_cfg.py. The self.HOSTS Python dictionary lists the hostname, jobmanager, gridftp server, various paths, and environment variables for each remote HPC resource. An example self.HOSTS specification for USC HPCC follows:

# GRAM remote job submission sites
# Keys:
#      hostname: fully qualified domain name of computing resource
#      jobmanager: the job manager to use
#      gridftp: fully qualified domain name of the site's gridftp server
#      compsdir: path to remote BBP comps dir
#      startdir: path to remote BBP start dir
#      batchdir: path to remote BBP batch dir
#      env : dictionary of environment variables to define in remote
#            job environment
    self.HOSTS = {"hpc":
                      {"hostname": "hpc-master.usc.edu",
                       "jobmanager": "jobmanager-pbs",
                       "gridftp": "hpc-login2.usc.edu",
                       "compsdir": "/home/rcf-104/patrices/bbp/comps",
                       "startdir": "/home/rcf-104/patrices/bbp/start",
                       "batchdir": "/home/rcf-104/patrices/bbp/batch",
                       "env": {"PYTHONPATH": "/home/rcf-104/patrices/bbp/comps",
                               "PATH": "/home/rcf-104/patrices/opt/Python-2.6.2/bin:/bin:/usr/bin"}}}


Running a Parallel Workflow Locally

On the machine where you wish to run the simulation, generate a serial workflow description using run_bbp_2G.py with the "-g" option.

$ ./run_bbp_2G.py -g

You will be prompted interactively for the codebases and inputs to be used in your simulation unless you provide a response file with the "-o" option. Take note of the path to the XML file that is saved at the end.
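
For example, if you have previously saved your interactive answers to a response file, a non-interactive run might look like the following (the file name is a placeholder):

$ ./run_bbp_2G.py -g -o my_responses.txt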


Then, convert the serial workflow to a parallel workflow:

$ ./bbp_s2p.py -g -s ../xml/1234567.xml

This command takes the serial workflow "../xml/1234567.xml", converts it to a parallel workflow, and saves it to "../parallel/1234567.pxml".


At this point you may execute the parallel workflow. On a stand-alone machine with four processors, execute the following:

$ ./bbp_s2p.py -x ../parallel/1234567.pxml -c 4 -n localhost


Within a PBS script on USC HPCC running as a four-core job, execute the following:

$ ./bbp_s2p.py -x ../parallel/1234567.pxml -c 4 -n $PBS_NODEFILE
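
For reference, a complete PBS script wrapping that command might look like the following sketch. The scheduler directives, queue name, and paths are examples only; adjust them for your site:

#!/bin/bash
#PBS -l nodes=4:ppn=1        # example: a four-core job
#PBS -l walltime=00:30:00    # example walltime
#PBS -q main                 # example queue name
# Placeholder path; change to your BBP comps directory
cd /home/username/bbp/comps
./bbp_s2p.py -x ../parallel/1234567.pxml -c 4 -n $PBS_NODEFILE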


Running a Parallel Workflow Remotely

One or more simulations may be run remotely on an HPC system with GRAM by using the ./comps/run_bbp_2g_gram.py script. This is an advanced capability. In this mode, the user specifies an option template file (containing all the answers to a run_bbp_2G.py interactive session), lists the input files, and specifies job parameters for the remote job scheduler (walltime, number of cores, etc.). The system then performs these steps:

  • zips up the option template and input files into a tar file
  • transfers that data over to the remote HPC system with GridFTP
  • unpacks the input files
  • instantiates the option template into one or more simulations (i.e., one or more parallel workflow .pxml files) in a batching process
  • runs the batch of simulations using the job parameters provided to run_bbp_2g_gram.py
  • zips up the outdata result directories from each simulation
  • transfers the simulation outdata back to the local host with GridFTP


Input files are generally a rupture description and a station list for each simulation that the user wishes to run. The option template file is a parameterized option file similar to what run_bbp_2G.py accepts, except lines in the file may be replaced with variables that expand out into one or more options. In this way, many simulations can be defined. The following is an example option template file for a hanging wall simulation, rob_template.txt:

n
y
1
2
[srcfile:/home/scec-00/patrices/tmp/bbp_inputs/rob_sources.txt]
1
2
[stafile:/home/scec-00/patrices/tmp/bbp_inputs/rob_stations.txt]
1
1
y
y
n

Each line may contain a simple answer or number selection, or it may contain a variable of the form [variable_label:/file/path/containing/responses]. In the above example, the file rob_sources.txt contains these entries:

/home/scec-00/rgraves/NgaW2/FwHw/FaultInfo/Inputs/m6.00_d20_r90_z0.src

And rob_stations.txt contains:

/home/scec-00/rgraves/NgaW2/FwHw/StatInfo/rv01-m6.00_stats.stl


Keep in mind that the paths shown here are all on the local system. When this option file is sent to the remote HPC system, BBP rewrites all of these local paths to point to the unpacked input files that it transferred over with GridFTP.

To run the above example at USC HPCC, execute the following command from the local system:

$ ./run_bbp_2g_gram.py -k hpc -b HANGING_WALL -o rob_template.txt -c 100 -n 25 -q main -w 30

The -k option specifies the host entry in self.HOSTS (./comps/install_cfg.py) to which the job is sent. The -b option assigns a string label to the batch. The -o option provides the option template file. The remaining options specify the job parameters. The program is blocking: it ends only when the remote job has finished and the simulation data has been retrieved.


Simulation Batching

The Broadband platform provides a batching interface that allows multiple simulations to be created and run in a single step. This batching mechanism interacts with the parallel interface, as seen in #Running a Parallel Workflow Remotely, but it can also be used independently.

To create a batch of simulations that can be run together as a unit, use the bbp_batch_sim.py utility:

$ ./bbp_batch_sim.py -b HANGING_WALL -o rob_template.txt -g


The option template file is the same parameterized option file described in #Running a Parallel Workflow Remotely; see that section for an example. The batch name parameter is simply a user-defined label to assign to the batch. The -g option specifies that the batch should only be generated, not executed. A previously generated batch can be executed with the -x option.
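
For example, a previously generated batch might then be executed as follows. This invocation is an assumption based on the flags described above; consult the utility's usage text for the exact form:

$ ./bbp_batch_sim.py -b HANGING_WALL -o rob_template.txt -x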