Difference between revisions of "Distributed Simulated Annealing"

From SCECpedia
Jump to navigationJump to search
Line 29: Line 29:
  
 
== Performance Graphs ==
 
== Performance Graphs ==
For the purposes of benchmarking, we present results for 4 different problems: Northern California (Well Constrained), Northern California (Poorly Constrained), All California (Well Constrained), and All California (Poorly Constrained). These 4 problems help demonstrate the affect of problem size and degree of constraints on the parallel speedup of the parallel SA algorithm.
+
For the purposes of benchmarking, we present results for 4 different problems: Northern California (Well Constrained), Northern California (Poorly Constrained), All California (Well Constrained), and All California (Poorly Constrained). These 4 problems help demonstrate the affect of problem size and degree of constraints on the parallel speedup of the parallel SA algorithm. Due to the random nature of simulated annealing, we did 5 identical runs with each parameterisation and averaged the results.
  
 
{| border="1"
 
{| border="1"
 
!Dataset
 
!Dataset
!Northern California (Well Constrained)
+
!Northern California (Well Constrained)<br>35,554 elements
!Northern California (Poorly Constrained)
+
!Northern California (Poorly Constrained)<br>35,554 elements
!All California (Well Constrained)
+
!All California (Well Constrained)<br>130,786 elements
!All California (Poorly Constrained)
+
!All California (Poorly Constrained)<br>130,786 elements
 
|-
 
|-
 
!Energy vs Time
 
!Energy vs Time
Line 76: Line 76:
  
 
== Conclusions ==
 
== Conclusions ==
The parallel simulated annealing algorithm clearly improves upon the classical serial approach.
+
The parallel simulated annealing algorithm clearly improves upon the classical serial approach. For smaller, well constrained problems, we saw a maximum speedup (averaged over 5 identical runs) of over 20x for a 50 node run (4 threads per node, 200 threads total). For less constrained, and/or larger solution spaces, we saw more marginal (6x-15x) improvement. Although

Revision as of 23:12, 6 September 2011

Introduction

An inversion-based methodology is being developed for the 3rd Uniform California Earthquake Rupture Forecast (UCERF3) that simultaneously satisfies available slip-rate, paleoseismic event-rate, and magnitude-distribution constraints. Simulated Annealing (Kirkpatrick 1983) is a well defined method for solving optomization problems, but can be slow for problems with a large solution space, such as the UCERF3 "Grand Inversion." We present a parallel simulated annealing approach to solve for the rates of all ruptures that extend through the seismogenic thickness on major mapped faults in California.

Serial SA Algorithm

  • s = s0; e = E(s)
  • sbest = s; ebest = e
  • k = 0
  • while k < max_iterations:
    • snew = neighbour(s)
    • enew = E(snew)
    • if P(e, enew, temperature) > random(); then
      • s = snew; e = enew
    • if enew < ebest
      • sbest = snew; ebest = enew
    • k++'

Parallel SA Algorithm

  • s = s0; e = E(s)
  • sbest = s; ebest = e
  • k = 0
  • while k < max_iterations
    • on n processors, do nSubIterations iterations of serial SA
    • find processor with best overall (lowest energy) solution, sbest
    • redistribute sbest, ebest to all processors
    • k += nSubIterations

Implementation

We implemented the parallel simulated annealing algorithm in OpenSHA (http://www.opensha.org), a Java-based framework for Seismic Hazard Analysis which is being used to develop UCERF3. All benchmarking calculations presented here were calculated on the USC HPCC cluster (http://www.usc.edu/hpcc/). There are two levels of parallelization used: cluster lever, and node level. Each HPCC node has 8 processors, so threading is used to make use of all available processors. We determined that 4 threads/node was optimal, possibly due to the use of a parallel sparse matrix multiplication package (used to calculate misfit, and thus energy) becoming overloaded when used with 8 threads/node. For cluster level parallelization, we used MPJ Express (http://mpj-express.org/, Baker 2007), a Java-based MPI implementation.

Performance Graphs

For the purposes of benchmarking, we present results for 4 different problems: Northern California (Well Constrained), Northern California (Poorly Constrained), All California (Well Constrained), and All California (Poorly Constrained). These 4 problems help demonstrate the affect of problem size and degree of constraints on the parallel speedup of the parallel SA algorithm. Due to the random nature of simulated annealing, we did 5 identical runs with each parameterisation and averaged the results.

Dataset Northern California (Well Constrained)
35,554 elements
Northern California (Poorly Constrained)
35,554 elements
All California (Well Constrained)
130,786 elements
All California (Poorly Constrained)
130,786 elements
Energy vs Time ncal_constrained_e_vs_t.small.png ncal_unconstrained_e_vs_t.small.png state_constrained_e_vs_t.small.png state_unconstrained_e_vs_t.small.png
Avg Energy vs Time ncal_constrained_avg_e_vs_t.small.png ncal_unconstrained_avg_e_vs_t.small.png state_constrained_avg_e_vs_t.small.png state_unconstrained_avg_e_vs_t.small.png
Serial Time vs Parallel Time ncal_constrained_st_vs_pt.small.png ncal_unconstrained_st_vs_pt.small.png state_constrained_st_vs_pt.small.png state_unconstrained_st_vs_pt.small.png
Time Speedup vs Time ncal_constrained_spd_vs_t.small.png ncal_unconstrained_spd_vs_t.small.png state_constrained_spd_vs_t.small.png state_unconstrained_spd_vs_t.small.png
Std. Dev. vs Time ncal_constrained_std_dev_vs_t.small.png ncal_unconstrained_std_dev_vs_t.small.png state_constrained_std_dev_vs_t.small.png state_unconstrained_std_dev_vs_t.small.png
Improvement vs Energy ncal_constrained_imp_vs_t.small.png ncal_unconstrained_imp_vs_t.small.png state_constrained_imp_vs_t.small.png state_unconstrained_imp_vs_t.small.png

Conclusions

The parallel simulated annealing algorithm clearly improves upon the classical serial approach. For smaller, well constrained problems, we saw a maximum speedup (averaged over 5 identical runs) of over 20x for a 50 node run (4 threads per node, 200 threads total). For less constrained, and/or larger solution spaces, we saw more marginal (6x-15x) improvement. Although