Difference between revisions of "Distributed Simulated Annealing"
(3 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | '''Download the 2011 SCEC Annual Meeting poster here: [http://opensha.usc.edu/ftp/kmilner/ucerf3/dsa_poster/parallel_sa_2011_scec_meeting.pdf PDF] [http://opensha.usc.edu/ftp/kmilner/ucerf3/dsa_poster/parallel_sa_2011_scec_meeting.png PNG]''' | ||
+ | |||
== Introduction == | == Introduction == | ||
− | An inversion-based methodology is being developed for the 3rd Uniform California Earthquake Rupture Forecast (UCERF3) that simultaneously satisfies available slip-rate, paleoseismic event-rate, and magnitude-distribution constraints. Simulated Annealing (Kirkpatrick 1983) is a well defined method for solving optomization problems, but can be slow for problems with a large solution space, such as the UCERF3 "Grand Inversion." We present a parallel simulated annealing approach to solve for the rates of all ruptures that extend through the seismogenic thickness on major mapped faults in California. | + | An inversion-based methodology is being developed for the 3rd Uniform California Earthquake Rupture Forecast (UCERF3) that simultaneously satisfies available slip-rate, paleoseismic event-rate, and magnitude-distribution constraints. Simulated Annealing (Kirkpatrick 1983) is a well defined method for solving optomization problems, but can be slow for problems with a large solution space, such as the UCERF3 "Grand Inversion." We present a parallel simulated annealing approach to quickly solve for the rates of all ruptures that extend through the seismogenic thickness on major mapped faults in California. |
== Serial SA Algorithm == | == Serial SA Algorithm == | ||
Line 28: | Line 30: | ||
We implemented the parallel simulated annealing algorithm in OpenSHA (http://www.opensha.org), a Java-based framework for Seismic Hazard Analysis which is being used to develop UCERF3. All benchmarking calculations presented here were calculated on the USC HPCC cluster (http://www.usc.edu/hpcc/). There are two levels of parallelization used: cluster lever, and node level. Each HPCC node has 8 processors, so threading is used to make use of all available processors. For cluster level parallelization, we used MPJ Express (http://mpj-express.org/, Baker 2007), a Java-based MPI implementation. | We implemented the parallel simulated annealing algorithm in OpenSHA (http://www.opensha.org), a Java-based framework for Seismic Hazard Analysis which is being used to develop UCERF3. All benchmarking calculations presented here were calculated on the USC HPCC cluster (http://www.usc.edu/hpcc/). There are two levels of parallelization used: cluster lever, and node level. Each HPCC node has 8 processors, so threading is used to make use of all available processors. For cluster level parallelization, we used MPJ Express (http://mpj-express.org/, Baker 2007), a Java-based MPI implementation. | ||
− | There are two parameters used in our parallel approach: the number of threads per node (threads/node), and the number of iterations between communication | + | There are two parameters used in our parallel approach: the number of threads per node (threads/node), and the number of iterations between intra-node communication and distribution of the best overall solution (nSubIterations). Parameter sweeps and analysis of parallel speedup led us to choose 4 threads/node was as, possibly due to the use of a parallel sparse matrix multiplication package (used to calculate misfit, and thus energy) becoming overloaded when used with 8 threads/node. We also determined that setting nSubIterations to 200 resulted in the best balance between quickly redistributing good results to all available processors, and reducing communications overhead. |
== Performance Graphs == | == Performance Graphs == | ||
Line 37: | Line 39: | ||
{| border="1" | {| border="1" | ||
!Dataset | !Dataset | ||
− | !Northern California (Well Constrained)<br> | + | !Northern California (Well Constrained)<br>39,075 elements |
− | !Northern California (Poorly Constrained)<br> | + | !Northern California (Poorly Constrained)<br>39,075 elements |
− | !All California (Well Constrained)<br> | + | !All California (Well Constrained)<br>198,260 elements |
− | !All California (Poorly Constrained)<br> | + | !All California (Poorly Constrained)<br>198,260 elements |
|- | |- | ||
!Energy vs Time | !Energy vs Time | ||
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_e_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_e_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_e_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_e_vs_t.small.png] |
|- | |- | ||
!Avg Energy vs Time | !Avg Energy vs Time | ||
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_avg_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_avg_e_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_avg_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_avg_e_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_avg_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_avg_e_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_avg_e_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_avg_e_vs_t.small.png] |
|- | |- | ||
!Serial Time vs Parallel Time | !Serial Time vs Parallel Time | ||
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_st_vs_pt.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_st_vs_pt.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_st_vs_pt.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_st_vs_pt.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_st_vs_pt.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_st_vs_pt.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_st_vs_pt.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_st_vs_pt.small.png] |
|- | |- | ||
!Time Speedup vs Time | !Time Speedup vs Time | ||
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_spd_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_spd_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_spd_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_spd_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_spd_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_spd_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_spd_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_spd_vs_t.small.png] |
|- | |- | ||
!Std. Dev. vs Time | !Std. Dev. vs Time | ||
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_std_dev_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_std_dev_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_std_dev_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_std_dev_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_std_dev_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_std_dev_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_std_dev_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_std_dev_vs_t.small.png] |
|- | |- | ||
!Improvement vs Energy | !Improvement vs Energy | ||
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_imp_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_imp_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_imp_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_imp_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_imp_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_imp_vs_t.small.png] |
− | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/ | + | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_imp_vs_t.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_imp_vs_t.small.png] |
+ | |- | ||
+ | !Time Speedup vs Threads | ||
+ | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_spd_vs_thrd.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_constrained_spd_vs_thrd.small.png] | ||
+ | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_spd_vs_thrd.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/ncal_unconstrained_spd_vs_thrd.small.png] | ||
+ | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_spd_vs_thrd.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_constrained_spd_vs_thrd.small.png] | ||
+ | |[http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_spd_vs_thrd.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/allcal_unconstrained_spd_vs_thrd.small.png] | ||
|} | |} | ||
+ | |||
+ | Speedup Vs Threads Comparisons | ||
+ | [http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/spd_vs_threads_comp.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/spd_vs_threads_comp.small.png] | ||
+ | [http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/spd_vs_threads_evencomp.png http://opensha.usc.edu/ftp/kmilner/ucerf3/2011agu/spd_vs_threads_evencomp.small.png] | ||
== Conclusions == | == Conclusions == | ||
The parallel simulated annealing algorithm clearly improves upon the classical serial approach. For smaller, well constrained problems, we saw a maximum speedup (averaged over 5 identical runs) of over 20x for a 50 node run (4 threads per node, 200 threads total). For less constrained, and/or larger solution spaces, we saw more marginal yet still significant (6x-15x) improvement. For the All California runs, we noticed a clearer correlation between the number of processors used and the performance of the algorithm (for Northern California this was generally the case, but not always). We also noticed that using a greater number of processors generally reduced the variation in energy levels between identical runs (see Standard Deviation vs Time plots). Although the algorithm scales well to 20 or even 50 nodes (80/200 threads, respectively), it appears that adding more processors beyond this amount leads to diminishing returns. | The parallel simulated annealing algorithm clearly improves upon the classical serial approach. For smaller, well constrained problems, we saw a maximum speedup (averaged over 5 identical runs) of over 20x for a 50 node run (4 threads per node, 200 threads total). For less constrained, and/or larger solution spaces, we saw more marginal yet still significant (6x-15x) improvement. For the All California runs, we noticed a clearer correlation between the number of processors used and the performance of the algorithm (for Northern California this was generally the case, but not always). We also noticed that using a greater number of processors generally reduced the variation in energy levels between identical runs (see Standard Deviation vs Time plots). Although the algorithm scales well to 20 or even 50 nodes (80/200 threads, respectively), it appears that adding more processors beyond this amount leads to diminishing returns. |
Latest revision as of 00:57, 30 November 2011
Download the 2011 SCEC Annual Meeting poster here: PDF PNG
Contents
Introduction
An inversion-based methodology is being developed for the 3rd Uniform California Earthquake Rupture Forecast (UCERF3) that simultaneously satisfies available slip-rate, paleoseismic event-rate, and magnitude-distribution constraints. Simulated Annealing (Kirkpatrick 1983) is a well defined method for solving optomization problems, but can be slow for problems with a large solution space, such as the UCERF3 "Grand Inversion." We present a parallel simulated annealing approach to quickly solve for the rates of all ruptures that extend through the seismogenic thickness on major mapped faults in California.
Serial SA Algorithm
- s = s0; e = E(s)
- sbest = s; ebest = e
- k = 0
- while k < max_iterations:
- snew = neighbour(s)
- enew = E(snew)
- if P(e, enew, temperature) > random(); then
- s = snew; e = enew
- if enew < ebest
- sbest = snew; ebest = enew
- k++'
Parallel SA Algorithm
- s = s0; e = E(s)
- sbest = s; ebest = e
- k = 0
- while k < max_iterations
- on n processors, do nSubIterations iterations of serial SA
- find processor with best overall (lowest energy) solution, sbest
- redistribute sbest, ebest to all processors
- k += nSubIterations
Implementation
We implemented the parallel simulated annealing algorithm in OpenSHA (http://www.opensha.org), a Java-based framework for Seismic Hazard Analysis which is being used to develop UCERF3. All benchmarking calculations presented here were calculated on the USC HPCC cluster (http://www.usc.edu/hpcc/). There are two levels of parallelization used: cluster lever, and node level. Each HPCC node has 8 processors, so threading is used to make use of all available processors. For cluster level parallelization, we used MPJ Express (http://mpj-express.org/, Baker 2007), a Java-based MPI implementation.
There are two parameters used in our parallel approach: the number of threads per node (threads/node), and the number of iterations between intra-node communication and distribution of the best overall solution (nSubIterations). Parameter sweeps and analysis of parallel speedup led us to choose 4 threads/node was as, possibly due to the use of a parallel sparse matrix multiplication package (used to calculate misfit, and thus energy) becoming overloaded when used with 8 threads/node. We also determined that setting nSubIterations to 200 resulted in the best balance between quickly redistributing good results to all available processors, and reducing communications overhead.
Performance Graphs
For the purposes of benchmarking, we present results for 4 different problems: Northern California (Well Constrained), Northern California (Poorly Constrained), All California (Well Constrained), and All California (Poorly Constrained). These 4 problems help demonstrate the affect of problem size and degree of constraints on the parallel speedup of the parallel SA algorithm. Due to the random nature of simulated annealing, we did 5 identical runs with each parameterisation and averaged the results.
Speedup Vs Threads Comparisons
Conclusions
The parallel simulated annealing algorithm clearly improves upon the classical serial approach. For smaller, well constrained problems, we saw a maximum speedup (averaged over 5 identical runs) of over 20x for a 50 node run (4 threads per node, 200 threads total). For less constrained, and/or larger solution spaces, we saw more marginal yet still significant (6x-15x) improvement. For the All California runs, we noticed a clearer correlation between the number of processors used and the performance of the algorithm (for Northern California this was generally the case, but not always). We also noticed that using a greater number of processors generally reduced the variation in energy levels between identical runs (see Standard Deviation vs Time plots). Although the algorithm scales well to 20 or even 50 nodes (80/200 threads, respectively), it appears that adding more processors beyond this amount leads to diminishing returns.