== Building UCVMC on hpc with MPI ==
Install on a shared file system because it is a large installation (25 GB):
*cd /home/scec-00/maechlin/
*git clone https://github.com/SCECcode/ucvmc.git
Then follow the installation instructions on the github wiki.
I installed all models. Both the "make check" target and the example ucvm_query run worked. Also, I confirmed that the MPI executables were built, including ucvm2mesh-mpi, which we want to test.
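For reference, a minimal sketch of those two checks; the conf file path and the model name (cvms) are assumptions for illustration, not taken from these notes:

<pre>
# Run the built-in test suite from the top of the ucvmc source tree.
make check

# Query one point (lon lat depth in meters) against a single model.
# The conf path and model name are assumed; adjust to the actual install.
echo "-118.0 34.0 0.0" | ucvm_query -f ./conf/ucvm.conf -m cvms
</pre>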
== Confirm MPI executables are built correctly ==
First, I checked the ucvmc/bin directory and confirmed that the MPI executables are present.
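A quick way to make that check (hypothetical commands, not from the original notes):

<pre>
# List the MPI-enabled executables in the install's bin directory;
# ucvm2mesh-mpi should be among them.
cd /home/scec-00/maechlin/ucvmc/bin
ls | grep -i mpi
</pre>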
When I try to run ./ucvm2mesh-mpi without parameters, I get this:
<pre>
-bash-4.2$ ./ucvm2mesh-mpi
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

PMI2_Job_GetId failed failed
--> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_init failed
--> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (14) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[hpc-login2:16596] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
</pre>
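This is the behavior I would expect when a PMI2/SLURM-aware Open MPI binary is launched directly on the login node rather than through the scheduler. As a sketch, one way to exercise it inside a SLURM batch job instead; the ucvm2mesh-mpi arguments and the config file name are assumptions, not taken from these notes:

<pre>
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --time=00:30:00
#SBATCH --output=ucvm2mesh_mpi_%j.out

# Load the same Open MPI environment the binary was built against
# (path taken from the .bash_profile setup shown below).
source /usr/usc/openmpi/default/setup.sh

# Hypothetical invocation; check ucvm2mesh-mpi's usage message for its real flags.
srun ./ucvm2mesh-mpi -f ucvm2mesh_example.conf
</pre>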
== UCVMC installation ==
The UCVMC installation should detect whether MPI is available and, if so, build the MPI codes, including ucvm2mesh-mpi.
The test environment is the USC HPC cluster. First, configure .bash_profile to source the openmpi setup script:
<pre>
#
# Setup MPI
#
if [ -e /usr/usc/openmpi/default/setup.sh ]; then
    source /usr/usc/openmpi/default/setup.sh
fi
#
</pre>
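With that setup sourced, one quick way to confirm that the MPI compiler wrappers needed for the MPI build are visible (commands assumed for illustration, not part of the original notes):

<pre>
# The wrapper compilers and launcher should resolve under /usr/usc/openmpi/...
# once the setup script has been sourced.
which mpicc mpicxx mpirun
mpicc --version
</pre>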
Then the env command shows that the openmpi path and the OMPI environment variables have been set:
<pre>
PATH=/home/scec-00/maechlin/ucvmc/lib/proj4/bin:/home/scec-00/maechlin/anaconda2/bin:/usr/usc/openmpi/1.8.8/slurm/bin:/usr/lib64/qt-3.3/bin:/opt/mam/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
PWD=/home/rcf-01/maechlin
LANG=en_US.UTF-8
BBP_GF_DIR=/home/scec-00/maechlin/bbp/default/bbp_gf
HISTCONTROL=ignoredups
KRB5CCNAME=FILE:/tmp/krb5cc_14364_IcUSJQ
OMPI_MCA_oob_tcp_if_exclude=lo,docker0,usb0,myri0
SHLVL=1
HOME=/home/rcf-01/maechlin
OMPI_CC=gcc
OMPI_MCA_btl_openib_if_exclude=mlx4_0:2
PYTHONPATH=/home/scec-00/maechlin/bbp/default/bbp/bbp/comps:/home/scec-00/maechlin/ucvmc/utilities:/home/scec-00/maechlin/ucvmc/utilities/pycvm
OMPI_MCA_btl_openib_warn_nonexistent_if=0
LOGNAME=maechlin
BBP_VAL_DIR=/home/scec-00/maechlin/bbp/default/bbp_val
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=47.39.67.178 50365 68.181.205.206 22
OMPI_CXX=g++
LESSOPEN=||/usr/bin/lesspipe.sh %s
XDG_RUNTIME_DIR=/run/user/14364
DISPLAY=localhost:16.0
BBP_DATA_DIR=/home/scec-00/maechlin/bbp/default/bbp_data
OMPI_MCA_btl=^scif
_=/usr/bin/env
</pre>
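A quick way to tell whether the Open MPI environment is active in a given shell (hypothetical commands, not part of the original notes); both print nothing once the setup lines are removed from .bash_profile:

<pre>
# Open MPI variables exported by the setup script, and the openmpi entry in PATH.
env | grep '^OMPI_'
echo "$PATH" | tr ':' '\n' | grep openmpi
</pre>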
I removed the lines in my .bash_profile that set up openmpi, and I confirmed that in that case the OMPI variables are not set and the openmpi directory is not in my PATH.

== Related Entries ==
*[[UCVM]]