UCVMC MPI Testing

== Rotation Example ==
*[[image:Rotation pts.png]]

== Configuration processor counts ==
The program errors out with the 4,2,1 configuration. Based on the following code, I think we need to specify 6 processors, not 7:
<pre>
#ifdef UM_ENABLE_MPI
  /* Check MPI related config items */
  if (nproc > 0) {
    if ((cfg->dims.dim[0] % cfg->proc_dims.dim[0] != 0) ||
        (cfg->dims.dim[1] % cfg->proc_dims.dim[1] != 0) ||
        (cfg->dims.dim[2] % cfg->proc_dims.dim[2] != 0)) {
      fprintf(stderr, "[%d] Mesh dims must be divisible by proc dims\n",
              myid);
      return(1);
    }

    if (nproc != cfg->proc_dims.dim[0]*cfg->proc_dims.dim[1]*cfg->proc_dims.dim[2]) {
      fprintf(stderr, "[%d] Proc space does not equal MPI core count\n",
              myid);
      return(1);
    }
  }
</pre>
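
To make the requirement concrete, here is a minimal standalone sketch of the same two checks. The mesh and processor dimensions below are hypothetical example values, not taken from the failing configuration:
<pre>
/* Standalone sketch of the ucvm2mesh-mpi configuration checks shown above.
   All values are hypothetical examples. */
#include <stdio.h>

int main(void) {
  int dims[3]      = {360, 240, 100}; /* example mesh dimensions in x, y, z */
  int proc_dims[3] = {2, 3, 1};       /* example processor decomposition in x, y, z */
  int nproc        = 6;               /* cores actually given to the MPI launcher */
  int needed = proc_dims[0] * proc_dims[1] * proc_dims[2];

  for (int i = 0; i < 3; i++) {
    /* each mesh dimension must divide evenly across its processor dimension */
    if (dims[i] % proc_dims[i] != 0) {
      fprintf(stderr, "Mesh dim %d must be divisible by proc dim %d\n",
              dims[i], proc_dims[i]);
      return 1;
    }
  }
  /* the launcher core count must exactly equal the processor-space size */
  if (nproc != needed) {
    fprintf(stderr, "Proc space (%d) does not equal MPI core count (%d)\n",
            needed, nproc);
    return 1;
  }
  printf("OK: launch the job with exactly %d cores\n", needed);
  return 0;
}
</pre>
In other words, the core count requested from the MPI launcher must exactly equal the product of the three processor dimensions in the config, and each mesh dimension must divide evenly by the corresponding processor dimension.
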
== Confirm MPI executables are built correctly ==
First, I checked the ucvmc/bin directory to confirm that the MPI executables are built. I see the following:

<pre>
-bash-4.2$ ls *MPI* *mpi*
basin_query_mpi  ucvm2etree-extract-MPI  ucvm2etree-merge-MPI  ucvm2etree-sort-MPI  ucvm2mesh-mpi  vs30_query_mpi
</pre>

When I try to run ./ucvm2mesh-mpi without parameters on the login node, I get this:
<pre>
-bash-4.2$ ./ucvm2mesh-mpi
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

PMI2_Job_GetId failed failed
--> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_init failed
--> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (14) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[hpc-login2:16596] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
</pre>
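
This is the kind of failure that typically occurs when an Open MPI program built against the cluster's PMI2/SLURM support is started directly on a login node, outside mpirun/srun and without a job allocation. A hedged sketch of launching it through the scheduler instead; the task count, time limit, config-file path, and the -f flag are illustrative assumptions, not confirmed ucvm2mesh-mpi options:
<pre>
#!/bin/bash
# Sketch of a SLURM batch launch for ucvm2mesh-mpi (all values are examples only).
#SBATCH --ntasks=6            # must equal the product of the three processor dims in the config
#SBATCH --time=01:00:00
#SBATCH --output=ucvm2mesh_%j.out

# Load the same Open MPI environment used for the build
source /usr/usc/openmpi/default/setup.sh

cd /home/scec-00/maechlin/ucvmc/bin
srun ./ucvm2mesh-mpi -f ../conf/ucvm2mesh_example.conf   # hypothetical config path
</pre>
Submitted with sbatch, this would give the processes the PMI2 job environment that the interactive login-node run above was missing.
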
== Building UCVMC on hpc with MPI ==
Install on a shared file system, because it is a large installation (about 25 GB):
* cd /home/scec-00/maechlin/
* git clone https://github.com/SCECcode/ucvmc.git

Then follow the installation instructions on the GitHub wiki:
*[https://github.com/SCECcode/UCVMC/wiki/Installation-From-Source-Code UCVMC Wiki]

I installed all models. Both "make check" and the example ucvm_query run worked. I also confirmed that the MPI executables were built, including ucvm2mesh-mpi, which is the one we want to test.
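
For reference, a couple of quick checks around those steps (a sketch; the paths are the ones used above, and the wiki remains the authoritative build procedure):
<pre>
# Confirm the shared file system has room for the ~25 GB installation
df -h /home/scec-00/maechlin

# Clone into the shared area, then follow the wiki build steps
cd /home/scec-00/maechlin
git clone https://github.com/SCECcode/ucvmc.git

# After the build, confirm the MPI executables were produced
ls /home/scec-00/maechlin/ucvmc/bin | grep -i mpi
</pre>
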
== UCVMC installation ==
UCVMC installation should detect whether MPI is available, and if so, build the MPI codes including ucvm2mesh-mpi.

The test environment is the USC HPC cluster. First, configure .bash_profile to source the cluster's openmpi setup script:
<pre>
#
# Setup MPI
#
if [ -e /usr/usc/openmpi/default/setup.sh ]; then
  source /usr/usc/openmpi/default/setup.sh
fi
#
</pre>

Then the env command shows that the openmpi path has been added to PATH and that the OMPI_* variables are set:
<pre>
PATH=/home/scec-00/maechlin/ucvmc/lib/proj4/bin:/home/scec-00/maechlin/anaconda2/bin:/usr/usc/openmpi/1.8.8/slurm/bin:/usr/lib64/qt-3.3/bin:/opt/mam/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
PWD=/home/rcf-01/maechlin
LANG=en_US.UTF-8
BBP_GF_DIR=/home/scec-00/maechlin/bbp/default/bbp_gf
HISTCONTROL=ignoredups
KRB5CCNAME=FILE:/tmp/krb5cc_14364_IcUSJQ
OMPI_MCA_oob_tcp_if_exclude=lo,docker0,usb0,myri0
SHLVL=1
HOME=/home/rcf-01/maechlin
OMPI_CC=gcc
OMPI_MCA_btl_openib_if_exclude=mlx4_0:2
PYTHONPATH=/home/scec-00/maechlin/bbp/default/bbp/bbp/comps:/home/scec-00/maechlin/ucvmc/utilities:/home/scec-00/maechlin/ucvmc/utilities/pycvm
OMPI_MCA_btl_openib_warn_nonexistent_if=0
LOGNAME=maechlin
BBP_VAL_DIR=/home/scec-00/maechlin/bbp/default/bbp_val
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=47.39.67.178 50365 68.181.205.206 22
OMPI_CXX=g++
LESSOPEN=||/usr/bin/lesspipe.sh %s
XDG_RUNTIME_DIR=/run/user/14364
DISPLAY=localhost:16.0
BBP_DATA_DIR=/home/scec-00/maechlin/bbp/default/bbp_data
OMPI_MCA_btl=^scif
_=/usr/bin/env
</pre>

I then removed the lines in my .bash_profile that set up openmpi, and confirmed that in that case the OMPI_* variables are not set and the openmpi directory is no longer on my PATH.
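
A quick way to check, before running the UCVMC installer, whether the MPI toolchain is actually visible in the current shell (a sketch using the setup script path above; exact output will vary):
<pre>
# Load the cluster's Open MPI environment
source /usr/usc/openmpi/default/setup.sh

# The UCVMC build should be able to find an MPI compiler wrapper on the PATH
which mpicc
mpicc --version

# The Open MPI setup also exports OMPI_* variables, as seen in the env output above
env | grep '^OMPI_'
</pre>
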
== Related Entries ==
*[[UCVM]]