UCVMC MPI Testing
== Configuration processor counts ==
The program errors out with the 4,2,1 processor configuration. Based on the following code, I think we should specify 6 processors, not 7:
<pre>
#ifdef UM_ENABLE_MPI
  /* Check MPI related config items */
  if (nproc > 0) {
    if ((cfg->dims.dim[0] % cfg->proc_dims.dim[0] != 0) ||
        (cfg->dims.dim[1] % cfg->proc_dims.dim[1] != 0) ||
        (cfg->dims.dim[2] % cfg->proc_dims.dim[2] != 0)) {
      fprintf(stderr, "[%d] Mesh dims must be divisible by proc dims\n",
              myid);
      return(1);
    }

    if (nproc != cfg->proc_dims.dim[0]*cfg->proc_dims.dim[1]*cfg->proc_dims.dim[2]) {
      fprintf(stderr, "[%d] Proc space does not equal MPI core count\n",
              myid);
      return(1);
    }
  }
</pre>
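To make the constraint concrete, here is a small standalone sketch of the same two checks. This is not UCVMC code, and the dims, proc_dims, and nproc values are made-up examples; it just shows that each mesh dimension must be divisible by the matching processor dimension and that the MPI core count must equal the product of the processor dimensions.
<pre>
/* check_dims.c -- standalone sketch of the two ucvm2mesh config checks above.
 * The mesh dims, proc dims, and core count below are hypothetical examples. */
#include <stdio.h>

int main(void) {
  int dims[3]      = {384, 248, 25};  /* example mesh dimensions (nx, ny, nz) */
  int proc_dims[3] = {2, 2, 1};       /* example processor dimensions (px, py, pz) */
  int nproc        = 4;               /* MPI core count we plan to request */
  int i;

  /* Each mesh dimension must be divisible by the matching processor dimension. */
  for (i = 0; i < 3; i++) {
    if (dims[i] % proc_dims[i] != 0) {
      fprintf(stderr, "Mesh dim %d (%d) is not divisible by proc dim (%d)\n",
              i, dims[i], proc_dims[i]);
      return 1;
    }
  }

  /* The MPI core count must equal the product of the processor dimensions. */
  if (nproc != proc_dims[0] * proc_dims[1] * proc_dims[2]) {
    fprintf(stderr, "Proc space (%d) does not equal planned core count (%d)\n",
            proc_dims[0] * proc_dims[1] * proc_dims[2], nproc);
    return 1;
  }

  printf("Consistent: run this configuration with %d MPI cores\n", nproc);
  return 0;
}
</pre>
Compiled with a plain C compiler and run on the command line, this reports whether a planned layout is consistent before submitting an MPI job.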
== Confirm MPI executables are built correctly ==
First, I checked the ucvmc/bin directory to confirm that the MPI executables are built. I see the following:
<pre>
-bash-4.2$ ls *MPI* *mpi*
basin_query_mpi  ucvm2etree-extract-MPI  ucvm2etree-merge-MPI  ucvm2etree-sort-MPI  ucvm2mesh-mpi  vs30_query_mpi
</pre>
When I try to run ./ucvm2mesh-mpi without parameters, I get this:
<pre>
-bash-4.2$ ./ucvm2mesh-mpi
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PMI2_Job_GetId failed failed
  --> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (14) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[hpc-login2:16596] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
</pre>
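As a sanity check of the MPI environment itself, separate from UCVMC, a minimal MPI program like the sketch below can be compiled with mpicc and launched through the scheduler. This is my own test code, not part of UCVMC, and the file name mpi_hello.c is arbitrary.
<pre>
/* mpi_hello.c -- minimal check that MPI_Init works in this environment
 * (my own sanity-test program, not part of UCVMC). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("Rank %d of %d initialized OK\n", rank, size);

  MPI_Finalize();
  return 0;
}
</pre>
If this initializes cleanly under srun or mpirun inside a job allocation but ucvm2mesh-mpi still fails, the problem is more likely in the UCVMC build than in the openmpi setup. The orte_init/PMI2 messages above are also consistent with starting the executable directly on the login node without a launcher, so they may not by themselves indicate a build problem.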
== Building UCVMC on hpc with MPI ==
Install on a shared file system because it is a large installation (25GB):
<pre>
cd /home/scec-00/maechlin/
git clone https://github.com/SCECcode/ucvmc.git
</pre>
Then follow the installation instructions on the github wiki.
I installed all models. Both the "make check" tests and the example ucvm_query run worked. Also, I confirmed that the MPI executables were built, including ucvm2mesh-mpi, which we want to test.
== UCVMC installation ==
The UCVMC installation should detect whether MPI is available, and if so, build the MPI codes including ucvm2mesh-mpi.
The test environment is the USC HPC cluster. First, configure .bash_profile to source the openmpi setup script:
<pre>
#
# Setup MPI
#
if [ -e /usr/usc/openmpi/default/setup.sh ]; then
  source /usr/usc/openmpi/default/setup.sh
fi
#
</pre>
Then the env command shows that the paths to the openmpi libraries have been added:
<pre>
PATH=/home/scec-00/maechlin/ucvmc/lib/proj4/bin:/home/scec-00/maechlin/anaconda2/bin:/usr/usc/openmpi/1.8.8/slurm/bin:/usr/lib64/qt-3.3/bin:/opt/mam/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
PWD=/home/rcf-01/maechlin
LANG=en_US.UTF-8
BBP_GF_DIR=/home/scec-00/maechlin/bbp/default/bbp_gf
HISTCONTROL=ignoredups
KRB5CCNAME=FILE:/tmp/krb5cc_14364_IcUSJQ
OMPI_MCA_oob_tcp_if_exclude=lo,docker0,usb0,myri0
SHLVL=1
HOME=/home/rcf-01/maechlin
OMPI_CC=gcc
OMPI_MCA_btl_openib_if_exclude=mlx4_0:2
PYTHONPATH=/home/scec-00/maechlin/bbp/default/bbp/bbp/comps:/home/scec-00/maechlin/ucvmc/utilities:/home/scec-00/maechlin/ucvmc/utilities/pycvm
OMPI_MCA_btl_openib_warn_nonexistent_if=0
LOGNAME=maechlin
BBP_VAL_DIR=/home/scec-00/maechlin/bbp/default/bbp_val
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=47.39.67.178 50365 68.181.205.206 22
OMPI_CXX=g++
LESSOPEN=||/usr/bin/lesspipe.sh %s
XDG_RUNTIME_DIR=/run/user/14364
DISPLAY=localhost:16.0
BBP_DATA_DIR=/home/scec-00/maechlin/bbp/default/bbp_data
OMPI_MCA_btl=^scif
_=/usr/bin/env
</pre>
I then removed the lines in my .bash_profile that set up openmpi, and confirmed that in that case the OMPI variables are not set and the openmpi directory is not in my PATH.