CyberShake migration to Summit
This page is being used to gather information on the effort to migrate CyberShake from Titan to Summit.
Software Status
Initially, we are running each phase on Summit and comparing the output with that on Titan.
- PreCVM: OK
- UCVM:
- Smoothing:
- PreSGT: OK
- PreAWP: OK
- AWP-GPU-SGT:
AWP-GPU-SGT
We arbitrarily selected site s2257 and ran the X-component (the AWP X-component, which is the RWG Y-component) on Summit, to compare to the results on Titan.
Numerical comparison
When comparing all SGT points (test-reference), we get:
Average diff = 1.317764e-12, average percent diff = -3.324251%
Largest diff of 0.000150 at float index 1718660606.
Largest percent diff of 5017404928.000000% at float index 8636382051.
When we only consider points greater than 1e-10 (peak SGT values are usually 1e-4 to 1e-3), we get:
Average diff = 1.317764e-12, average percent diff = 0.133713%, average absolute percent diff = 12.979661%
Largest diff of 0.000150 at float index 1718660606.
Largest percent diff of 7148245.500000% at float index 1626932526.
The average percent diff is much smaller than before. However, we also added the average absolute percent diff, which reveals that many points differ by considerable amounts: positive and negative differences were canceling in the signed average.
As a result, we decided to investigate more closely what is causing the differences between the two systems.
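The statistics above can be computed with a short script. The following is a minimal sketch, not the actual comparison code. The function name `compare`, the definition of percent diff as 100 * (test - reference) / reference, and the rule that the cutoff is applied to the reference value and only affects the percent statistics are all assumptions; the last one was chosen because the average diff and largest diff reported above do not change when the 1e-10 cutoff is applied.

```python
def compare(test, ref, cutoff=0.0):
    """Compare two equal-length sequences of floats point by point.

    Assumed definitions (not taken from the actual comparison script):
    diff = test - ref, percent diff = 100 * diff / ref. Points whose
    reference magnitude is not above `cutoff` are skipped for the
    percent statistics only.
    """
    if len(test) != len(ref):
        raise ValueError("test and reference must have the same length")
    if not ref:
        raise ValueError("nothing to compare")

    diffs = [t - r for t, r in zip(test, ref)]
    max_diff_idx = max(range(len(diffs)), key=lambda i: abs(diffs[i]))

    # (index, percent diff) pairs for points above the cutoff; this also
    # avoids dividing by zero when cutoff is 0.
    pct = [(i, 100.0 * d / r)
           for i, (d, r) in enumerate(zip(diffs, ref)) if abs(r) > cutoff]

    stats = {
        "avg_diff": sum(diffs) / len(diffs),
        "max_diff": diffs[max_diff_idx],
        "max_diff_index": max_diff_idx,
        "points_in_percent_stats": len(pct),
    }
    if pct:
        max_pct_idx, max_pct = max(pct, key=lambda p: abs(p[1]))
        stats.update({
            "avg_pct_diff": sum(p for _, p in pct) / len(pct),
            "avg_abs_pct_diff": sum(abs(p) for _, p in pct) / len(pct),
            "max_pct_diff": max_pct,
            "max_pct_diff_index": max_pct_idx,
        })
    return stats
```

Calling `compare(test, ref, cutoff=1e-10)` corresponds to the second comparison above. The signed and absolute averages diverge whenever positive and negative percent differences cancel, which is why both are worth reporting.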
Initial plots
We identified point ID 71610 as the point which contains the largest difference between the test and reference SGTs.
Below are plots of this point, comparing the Titan and Summit results and the difference. The zoomed-in version focuses on the 25 seconds with the largest differences.
CUDA versions
The default version of CUDA on Titan is 9.1.85 and on Summit it's 9.2.148. We tried rebuilding the AWP code on Summit with 9.1.85 and GCC and rerunning. It didn't make much difference:
Average diff = 1.272838e-12, average percent diff = 0.133971%, average absolute percent diff = 12.971045%
Largest diff of 0.000150 at float index 1718660606.
Largest percent diff of 7148275.000000% at float index 1626932526.
X and Y-only work distribution
We tried distributing the work only across X (PX=120, PY=1, PZ=1) but got similar differences.
X:
Average diff = 1.317764e-12, average percent diff = 0.133713%, average absolute percent diff = 12.979661%
Largest diff of 0.000150 at float index 1718660606.
Largest percent diff of 7148245.500000% at float index 1626932526.
We also tried it across Y (PX=1, PY=210, PZ=1). This resulted in a memory error:
*** Error in `/gpfs/alpine/proj-shared/geo112/CyberShake/software/AWP-GPU-SGT/bin/pmcl3d': free(): invalid size: 0x0000000040728bc0 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x4a0)[0x2000004f9cd0]
/gpfs/alpine/proj-shared/geo112/CyberShake/software/AWP-GPU-SGT/bin/pmcl3d[0x10008d34]
/gpfs/alpine/proj-shared/geo112/CyberShake/software/AWP-GPU-SGT/bin/pmcl3d[0x10008774]
/lib64/libc.so.6(+0x25100)[0x200000485100]
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000004852f4]
Smaller SGTs
We tried generating SGTs for the SMALL site, which are about 6% the size of those for s2257, on both Titan and Summit. We actually saw even larger average differences:
Average diff = 6.027801e-12, average percent diff = -1.673267%, average absolute percent diff = 21.367271%
Largest diff of 0.000067 at float index 175412388.
Largest percent diff of 2091817.000000% at float index 1363132518.
Frequency content
We did a frequency content analysis on the largest difference SGT from the SMALL site:
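One way to examine the frequency content of a difference trace is to take the amplitude spectrum of (test - reference) for a single time series. The sketch below uses a naive discrete Fourier transform in pure Python. The function names and the timestep argument `dt` are hypothetical, and this is not necessarily how the analysis on this page was performed.

```python
import cmath
import math


def amplitude_spectrum(series, dt):
    """Return (frequency in Hz, amplitude) pairs up to the Nyquist frequency.

    Naive O(n^2) DFT, adequate for a single short time series. `dt` is the
    sample spacing in seconds (hypothetical parameter).
    """
    n = len(series)
    if n == 0:
        raise ValueError("empty time series")
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * j / n)
                    for j, x in enumerate(series))
        spectrum.append((k / (n * dt), abs(coeff)))
    return spectrum


def difference_spectrum(test, ref, dt):
    """Amplitude spectrum of the point-by-point difference test - ref."""
    if len(test) != len(ref):
        raise ValueError("test and reference must have the same length")
    return amplitude_spectrum([t - r for t, r in zip(test, ref)], dt)
```

For long time series an FFT routine from a numerical library would be the practical choice; the naive transform is used here only to keep the sketch self-contained.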
Ossian's test
Ossian O'Reilly created a simple, single-GPU test, available here on GitHub.
Running his test on Titan and Summit, and running his comparison script, I obtained the following agreement:
Difference (titan, x=1, y=1) (summit, x=1, y=1),
l2 (abs): 2.050260e-04 l2 (rel) 2.090467e-06
l1 (abs) 1.097141e-01 l1 (rel) 7.702747e-06
linf (abs) 6.659655e-06 linf (rel) 1.872744e-06
Comparing the same two outputs using my scripts, using a cutoff for percent comparisons of 1e-6:
Average diff = 6.023823e-10, average percent diff = -0.026654%, average absolute percent diff = 0.452920%
Largest diff of 0.000007 at float index 128648.
Largest percent diff of 69.524002% at float index 3874621.
Absolute percentage difference histogram:
         0.0001    0.0010    0.0100    0.1000    0.3000    1.0000    3.0000     10.00    100.00    1000.0
    138147    187512    179189    202769    122466    110331     57253     27862      7452         0         0
    13.37%    18.15%    17.35%    19.63%    11.86%    10.68%     5.54%     2.70%     0.72%     0.00%     0.00%
    13.37%    31.53%    48.87%    68.50%    80.36%    91.04%    96.58%    99.28%   100.00%   100.00%   100.00%
Approximately half the points have a percent difference less than 0.01%, about 90% less than 1%, and about 99% less than 10%.
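A histogram like the one above can be built by binning the absolute percent differences against fixed edges, giving one bin below the first edge, one above the last edge, and one between each adjacent pair. This is a minimal sketch: the bin edges are copied from the output above, while the function name and the convention that a value equal to an edge falls in the upper bin are assumptions.

```python
from bisect import bisect_right

# Bin edges (in percent) as printed in the histogram output above.
EDGES = [0.0001, 0.001, 0.01, 0.1, 0.3, 1.0, 3.0, 10.0, 100.0, 1000.0]


def abs_pct_histogram(abs_pct_diffs, edges=EDGES):
    """Return (counts, percentages, cumulative percentages) per bin.

    There are len(edges) + 1 bins: bin 0 holds values below edges[0],
    the last bin holds values at or above edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for value in abs_pct_diffs:
        counts[bisect_right(edges, value)] += 1

    total = sum(counts)
    percentages = [100.0 * c / total if total else 0.0 for c in counts]

    cumulative = []
    running = 0.0
    for p in percentages:
        running += p
        cumulative.append(running)
    return counts, percentages, cumulative
```

The cumulative row is the one to read when making statements such as "half the points differ by less than 0.01%": entry i gives the share of points below `edges[i]`.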
These two comparisons give us different information. The point-by-point comparison can be biased by small values that have little impact on the result: for example, ~80% of the points have values less than 1e-6, while the largest values are greater than 1. The L1 comparison can be biased by the largest points, which carry much more weight than points one or two orders of magnitude smaller, even though those smaller points are still important for agreement.
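To make the contrast concrete, norm-based measures of the kind reported by the comparison script can be sketched as follows. The definitions here are assumptions, not taken from Ossian's script: the absolute value is the norm of (test - reference), and the relative value divides that by the same norm of the reference. The function name is hypothetical.

```python
import math


def norm_differences(test, ref):
    """Return (absolute, relative) dicts with l1, l2 and linf differences.

    Assumed definitions: absolute = ||test - ref||, relative =
    ||test - ref|| / ||ref||, each in the corresponding norm.
    """
    if len(test) != len(ref):
        raise ValueError("test and reference must have the same length")
    if not ref:
        raise ValueError("nothing to compare")

    diffs = [abs(t - r) for t, r in zip(test, ref)]
    mags = [abs(r) for r in ref]

    absolute = {
        "l1": sum(diffs),
        "l2": math.sqrt(sum(d * d for d in diffs)),
        "linf": max(diffs),
    }
    reference = {
        "l1": sum(mags),
        "l2": math.sqrt(sum(m * m for m in mags)),
        "linf": max(mags),
    }
    if reference["linf"] == 0.0:
        raise ValueError("reference is all zeros; relative norms undefined")

    relative = {name: absolute[name] / reference[name] for name in absolute}
    return absolute, relative
```

Because every relative value is normalized by a norm of the whole reference array, a few large-amplitude points dominate the denominator, so sizable percent differences at small-amplitude points barely register. That is the opposite bias from the point-by-point percentages.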
I tried running my code vs Ossian's code on Summit. The only differences are support for debugging MPI I/O in Ossian's version. We see differences similar to those above:
Average diff = 3.603304e-09, average percent diff = -0.000870%, average absolute percent diff = 0.585565%
Largest diff of 0.000475 at float index 3272633.
Largest percent diff of 16670.455078% at float index 3272726.
Absolute percentage difference histogram:
         0.0001    0.0010    0.0100    0.1000    0.3000    1.0000    3.0000   10.0000  100.0000 1000.0000
     67238    114994    199943    268809    145799    121488     68547     38022      7930        84         1
     6.51%    11.13%    19.36%    26.03%    14.12%    11.76%     6.64%     3.68%     0.77%     0.01%     0.00%
     6.51%    17.64%    37.00%    63.03%    77.14%    88.91%    95.54%    99.22%    99.99%   100.00%   100.00%
Additional tests
Comparison using Ossian's parameters with CyberShake code on Titan vs Summit, no optimization:
Average diff = 3.000922e-09, average percent diff = 0.013876%, average absolute percent diff = 0.650546%
Largest diff of 0.000475 at float index 3272633.
Largest percent diff of 16670.525391% at float index 3272726.
Absolute percentage difference histogram:
         0.0001    0.0010    0.0100    0.1000    0.3000    1.0000    3.0000   10.0000  100.0000 1000.0000
     72818    114632    210094    283026    131370    108666     63794     36609     11761        84         1
     7.05%    11.10%    20.34%    27.40%    12.72%    10.52%     6.18%     3.54%     1.14%     0.01%     0.00%
     7.05%    18.15%    38.49%    65.89%    78.61%    89.13%    95.31%    98.85%    99.99%   100.00%   100.00%
Comparison with 200 x 200 x 200 volume:
Comparison with CyberShake source:
Comparison with -s 1: