UCVMC MPI Testing


Rotation Example

  • Figure: Rotation pts.png

Configuration processor counts

The program errors out with the 4,2,1 configuration. Based on the following code, I think we should specify 6 processors, not 7:

#ifdef UM_ENABLE_MPI
  /* Check MPI related config items */
  if (nproc > 0) {
    if ((cfg->dims.dim[0] % cfg->proc_dims.dim[0] != 0) ||
        (cfg->dims.dim[1] % cfg->proc_dims.dim[1] != 0) ||
        (cfg->dims.dim[2] % cfg->proc_dims.dim[2] != 0)) {
      fprintf(stderr, "[%d] Mesh dims must be divisible by proc dims\n",
              myid);
      return(1);
    }

    if (nproc != cfg->proc_dims.dim[0]*cfg->proc_dims.dim[1]*cfg->proc_dims.dim[2]) {
      fprintf(stderr, "[%d] Proc space does not equal MPI core count\n",
              myid);
      return(1);
    }
  }
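
In other words, each mesh dimension must divide evenly by the corresponding processor dimension, and the product of the three processor dimensions must equal the MPI core count given to the job. The following standalone sketch (not UCVMC code; the mesh dimensions, processor grid, and core count are hypothetical values chosen for illustration) reproduces the two checks:

#include <stdio.h>

/* Reproduces the two configuration checks from ucvm2mesh above:
   1) mesh dims divisible by proc dims, 2) proc dims product == nproc. */
int check_proc_config(const int dims[3], const int proc_dims[3], int nproc) {
  for (int i = 0; i < 3; i++) {
    if (dims[i] % proc_dims[i] != 0) {
      fprintf(stderr, "Mesh dim %d (%d) is not divisible by proc dim (%d)\n",
              i, dims[i], proc_dims[i]);
      return 1;
    }
  }
  if (nproc != proc_dims[0] * proc_dims[1] * proc_dims[2]) {
    fprintf(stderr, "Proc space (%d) does not equal MPI core count (%d)\n",
            proc_dims[0] * proc_dims[1] * proc_dims[2], nproc);
    return 1;
  }
  return 0;
}

int main(void) {
  int dims[3]      = {384, 248, 25};  /* hypothetical mesh dimensions */
  int proc_dims[3] = {2, 2, 1};       /* hypothetical processor grid  */
  int nproc        = 4;               /* must equal 2*2*1             */

  if (check_proc_config(dims, proc_dims, nproc) == 0) {
    printf("Configuration is consistent: %d cores for a %dx%dx%d proc grid\n",
           nproc, proc_dims[0], proc_dims[1], proc_dims[2]);
  }
  return 0;
}

Whatever processor grid is set in the configuration file, the core count requested in the job submission must match the product of its three dimensions exactly.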

Confirm MPI executables are built correctly

First, I checked the ucvmc/bin directory to confirm that the MPI executables were built. I see the following:

-bash-4.2$ ls *MPI* *mpi*
basin_query_mpi  ucvm2etree-extract-MPI  ucvm2etree-merge-MPI  ucvm2etree-sort-MPI  ucvm2mesh-mpi  vs30_query_mpi

When I try to run ./ucvm2mesh-mpi without parameters, I get this:

-bash-4.2$ ./ucvm2mesh-mpi
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PMI2_Job_GetId failed failed
  --> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value (null) (14) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (14) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[hpc-login2:16596] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
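
The PMI2 / orte_init messages above appear typical of starting an Open MPI executable directly on the login node, without a launcher such as srun or mpirun providing the process-management environment. As a sanity check of the MPI stack itself, a minimal test program (a sketch, not part of UCVMC) can be compiled with mpicc and launched through the scheduler; if it initializes and reports its rank, MPI is working and the ucvm2mesh-mpi behavior is more likely a launch or configuration issue:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
  int myid = 0, nproc = 0;

  /* MPI_Init is the call where ucvm2mesh-mpi aborts above; if this
     succeeds under the scheduler, the MPI environment itself is fine. */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  printf("[%d] of %d ranks initialized successfully\n", myid, nproc);

  MPI_Finalize();
  return 0;
}

Compiled with mpicc and run with, for example, srun -n 4, each rank should print one line.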

Building UCVMC on HPC with MPI

Install on a shared file system, because it is a large installation (25 GB).

Then follow the installation instructions on the GitHub wiki.

I installed all models. Both the "make check" tests and the example ucvm_query run worked. I also confirmed that the MPI executables were built, including ucvm2mesh-mpi, which we want to test.

UCVMC installation

UCVMC installation should detect whether MPI is available and, if so, build the MPI codes, including ucvm2mesh-mpi.

The test environment is the USC HPC cluster. First, configure .bash_profile to source the OpenMPI setup script:

#
# Setup MPI
#
if [ -e /usr/usc/openmpi/default/setup.sh ]; then 
  source /usr/usc/openmpi/default/setup.sh
fi
#

Then the env command shows that the path to the OpenMPI binaries has been added:

PATH=/home/scec-00/maechlin/ucvmc/lib/proj4/bin:/home/scec-00/maechlin/anaconda2/bin:/usr/usc/openmpi/1.8.8/slurm/bin:/usr/lib64/qt-3.3/bin:/opt/mam/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
PWD=/home/rcf-01/maechlin
LANG=en_US.UTF-8
BBP_GF_DIR=/home/scec-00/maechlin/bbp/default/bbp_gf
HISTCONTROL=ignoredups
KRB5CCNAME=FILE:/tmp/krb5cc_14364_IcUSJQ
OMPI_MCA_oob_tcp_if_exclude=lo,docker0,usb0,myri0
SHLVL=1
HOME=/home/rcf-01/maechlin
OMPI_CC=gcc
OMPI_MCA_btl_openib_if_exclude=mlx4_0:2
PYTHONPATH=/home/scec-00/maechlin/bbp/default/bbp/bbp/comps:/home/scec-00/maechlin/ucvmc/utilities:/home/scec-00/maechlin/ucvmc/utilities/pycvm
OMPI_MCA_btl_openib_warn_nonexistent_if=0
LOGNAME=maechlin
BBP_VAL_DIR=/home/scec-00/maechlin/bbp/default/bbp_val
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=47.39.67.178 50365 68.181.205.206 22
OMPI_CXX=g++
LESSOPEN=||/usr/bin/lesspipe.sh %s
XDG_RUNTIME_DIR=/run/user/14364
DISPLAY=localhost:16.0
BBP_DATA_DIR=/home/scec-00/maechlin/bbp/default/bbp_data
OMPI_MCA_btl=^scif
_=/usr/bin/env

I removed the lines in my .bash_profile that set up OpenMPI, and confirmed that the OMPI variables and the OpenMPI entries in my PATH are not set in that case.

Related Entries