Difference between revisions of "UCVM on Frontier"
Revision as of 22:46, 2 May 2024
Contents
- 1 Testing the UCVM installation on Frontier
- 2 Building Process
- 3 Define INSTALL PATH
- 4 Setup Frontier Modules
- 5 Example Installation
- 6 Install Script on Frontier
- 7 Interactive session to run Build UCVM or run Acceptance Tests
- 8 Frontier library not loading
- 9 Example or Typical Configuration Errors
- 10 Size of CyberShake Meshes
- 11 Related Entries
Testing the UCVM installation on Frontier
We are implementing tests of the development version of UCVM used for CyberShake NorCal. At the end of the install, we expect the following models to be available:
- SFCVM
- CCA
- sf1d
Building Process
Building on the head node is very slow, so it looks like we need to build on a compute node. However, compute nodes are not network accessible. So do the git clone and the largefile downloads on the head node; then, when ready to run make, request a compute node.
Define INSTALL PATH
UCVM_INSTALL_PATH /lustre/orion/proj-shared/geo156/pmaech/scratch/TARGET_UCVM_SFCVM/ucvm_install
Setup Frontier Modules
[login03.frontier ~]$ module list
Currently Loaded Modules:
  1) craype-x86-trento
  2) craype-network-ofi
  3) perftools-base/22.12.0
  4) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta
  5) cray-pmi/6.1.8
  6) craype/2.7.19
  7) cray-dsmml/0.2.2
  8) cray-libsci/22.12.1.1
  9) PrgEnv-cray/8.3.3
 10) cray-python/3.9.13.1
 11) libfabric/1.15.2.0
 12) gcc/10.3.0
 13) darshan-runtime/3.4.0
 14) hsi/default
 15) lfs-wrapper/0.0.1
 16) DefApps/default
 17) libtool/2.4.6
 18) cray-mpich/8.1.23
UCVM is typically built by keeping the default modules and loading these:
- module load cray-python
- module load libtool/2.4.6
- module load libfabric
- module load gcc/10.3.0
Testing with two accounts showed that this module set built the UCVM binaries and that the tests (except CCA) passed.
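The module set above can be captured in a small helper so that every build starts from the same environment. This is only a sketch: the function name `load_ucvm_modules` and the `DRY_RUN` switch are assumptions made for illustration and are not part of UCVM or of the Frontier tooling; the module names are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: load the modules listed above on top of the defaults.
# Set DRY_RUN=1 to only print the commands (useful away from Frontier).
UCVM_MODULES="cray-python libtool/2.4.6 libfabric gcc/10.3.0"

load_ucvm_modules() {
    local m
    for m in $UCVM_MODULES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "module load $m"
        else
            # Stop at the first module that fails to load.
            module load "$m" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 load_ucvm_modules` prints the four `module load` commands without touching the environment.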
Example Installation
A UCVM installation has been built on Frontier at: /ccs/home/mei/scratch/TARGET_UCVM_SFCVM/ucvm_install
source conf/ucvm_env.sh
which ucvm_query
ucvm_query -H
Next, run test/run_testing to perform some basic unit tests.
Install Script on Frontier
#!/bin/bash
#
# Clone, stage, and build UCVM under ./scratch/TARGET_UCVM_SFCVM
#
hn=`hostname -d`
ppwd=`pwd`
# Source and install locations
export MY_TOP=$ppwd/scratch
export TOP_UCVM_TARGET=$MY_TOP/TARGET_UCVM_SFCVM
export UCVM_SRC_PATH=$TOP_UCVM_TARGET/UCVM
export UCVM_INSTALL_PATH=$TOP_UCVM_TARGET/ucvm_install
export UCVM_SALLOC_ENV="-A geo156 -q debug"
# Make libfabric visible to the linker and the loader
export LD_LIBRARY_PATH=/opt/cray/libfabric/1.15.2.0/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/cray/libfabric/1.15.2.0/lib64:$LIBRARY_PATH
# Start from a clean target directory (-p also creates $MY_TOP if it is missing)
rm -rf $TOP_UCVM_TARGET
mkdir -p $TOP_UCVM_TARGET
cd $TOP_UCVM_TARGET
git clone https://github.com/SCECcode/ucvm.git -b withSFCVM UCVM
# Download and stage the model large files
cd $UCVM_SRC_PATH/largefiles
./get_largefiles.py -m sfcvm,cca,cvmsi,cvms
cd $UCVM_SRC_PATH/largefiles; ./stage_largefiles.py
# Build, install, and run the checks
cd $UCVM_SRC_PATH
./ucvm_setup.py -d -a -p $UCVM_INSTALL_PATH &> ucvm_setup_install.log
cd $UCVM_SRC_PATH; make check &> make_check.log
echo "..EXITING.."
exit
Interactive session to run Build UCVM or run Acceptance Tests
salloc -A geo156 -N 1 -t 1:30:00 -J UCVM_Tests -q debug
Frontier library not loading
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

It was created by UCVM configure 22.7.0, which was
generated by GNU Autoconf 2.69. Invocation command line was

  $ ./configure --enable-silent-rules --with-fftw-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/fftw/include --with-fftw-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/fftw/lib --with-etree-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/euclid3/include --with-etree-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/euclid3/lib --with-hdf5-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/hdf5/include --with-hdf5-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/hdf5/lib --with-openssl-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/openssl/include --with-openssl-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/openssl/lib --with-tiff-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/tiff/include --with-tiff-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/tiff/lib --with-sqlite-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/sqlite/include --with-sqlite-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/sqlite/lib --with-curl-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/curl/include/curl --with-curl-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/curl/lib --with-proj-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/proj/include --with-proj-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/lib/proj/lib --enable-model-cca --with-cca-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cca/lib --with-cca-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cca/include --enable-model-cvms --with-cvms-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cvms/include --with-cvms-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cvms/lib --with-cvms-model-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cvms/src --enable-model-cvmsi --with-cvmsi-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cvmsi/lib --with-cvmsi-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cvmsi/include --with-cvmsi-model-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/cvmsi/model/i26 --enable-model-sfcvm --with-sfcvm-lib-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/sfcvm/lib --with-sfcvm-include-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/sfcvm/include --with-sfcvm-model-path=/lustre/orion/geo156/scratch/dean316/build_UCVM/build/model/sfcvm/src --prefix=/lustre/orion/geo156/scratch/dean316/build_UCVM/build

## --------- ##
## Platform. ##
## --------- ##

hostname = login13
uname -m = x86_64
uname -r = 5.14.21-150400.24.46_12.0.83-cray_shasta_c
uname -s = Linux
uname -v = #1 SMP Tue May 23 03:16:47 UTC 2023 (c6cda89)

/usr/bin/uname -p = x86_64
/bin/uname -X = unknown

/bin/arch = x86_64
/usr/bin/arch -k = unknown
/usr/convex/getsysinfo = unknown
/usr/bin/hostinfo = unknown
/bin/machine = unknown
/usr/bin/oslevel = unknown
/bin/universe = unknown

PATH: /sw/frontier/lfs-wrapper/0.0.1/bin/lfs
PATH: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/bin
PATH: /opt/cray/pe/mpich/8.1.23/bin
PATH: /sw/sources/hpss/bin
PATH: /sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-10.3.0/darshan-runtime-3.4.0-g5tkbmgrfje7vnnh7ppfb6s5b7frivrl/bin
PATH: /opt/cray/pe/gcc/10.3.0/bin
PATH: /opt/cray/libfabric/1.15.2.0/bin
PATH: /sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-10.3.0/libtool-2.4.6-4kukgkkoovpfysxbav23pbrddwn7kjbm/bin
PATH: /opt/cray/pe/python/3.9.13.1/bin
PATH: /opt/conda/bin
PATH: /opt/clmgr/sbin
PATH: /opt/clmgr/bin
PATH: /opt/sgi/sbin
PATH: /opt/sgi/bin
PATH: /sw/frontier/bin
PATH: /ccs/home/dean316/.local/bin
PATH: /usr/local/bin
PATH: /usr/bin
PATH: /bin
PATH: /opt/bin
PATH: /opt/c3/bin
PATH: /usr/lib/mit/bin
PATH: /opt/puppetlabs/bin
PATH: /sbin

## ----------- ##
## Core tests. ##
## ----------- ##

configure:2426: checking for a BSD-compatible install
configure:2494: result: /usr/bin/install -c
configure:2505: checking whether build environment is sane
configure:2560: result: yes
configure:2711: checking for a thread-safe mkdir -p
configure:2750: result: /usr/bin/mkdir -p
configure:2757: checking for gawk
configure:2773: found /usr/bin/gawk
configure:2784: result: gawk
configure:2795: checking whether make sets $(MAKE)
configure:2817: result: yes
configure:2846: checking whether make supports nested variables
configure:2863: result: yes
configure:3032: checking for ranlib
configure:3048: found /usr/bin/ranlib
configure:3059: result: ranlib
configure:3087: checking build system type
configure:3101: result: x86_64-pc-linux-gnu
configure:3121: checking host system type
configure:3134: result: x86_64-pc-linux-gnu
configure:3213: checking for style of include used by make
configure:3241: result: GNU
configure:3267: checking whether to compile using MPI
configure:3274: result: yes
configure:3330: checking for mpicc
configure:3346: found /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/bin/mpicc
configure:3357: result: mpicc
configure:3431: checking for gcc
configure:3458: result: mpicc
configure:3687: checking for C compiler version
configure:3696: mpicc --version >&5
gcc (GCC) 10.3.0 20210408 (Cray Inc.)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
configure:3707: $? = 0
configure:3696: mpicc -v >&5
mpicc for MPICH version 8.1.23
Using built-in specs.
COLLECT_GCC=/opt/cray/pe/gcc/10.3.0/bin/../snos/bin/gcc
COLLECT_LTO_WRAPPER=/opt/cray/pe/gcc/10.3.0/snos/libexec/gcc/x86_64-suse-linux/10.3.0/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../cpe-gcc-10.3.0-202104220029.0777bcc28ac1d/configure --prefix=/opt/cray/pe/gcc/10.3.0/snos --disable-nls --libdir=/opt/cray/pe/gcc/10.3.0/snos/lib --enable-languages=c,c++,fortran --with-gxx-include-dir=/opt/cray/pe/gcc/10.3.0/snos/include/g++ --with-slibdir=/opt/cray/pe/gcc/10.3.0/snos/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --build=x86_64-suse-linux --with-ppl --with-cloog --disable-multilib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.3.0 20210408 (Cray Inc.) (GCC)
configure:3707: $? = 0
configure:3696: mpicc -V >&5
gcc: error: unrecognized command-line option '-V'
configure:3707: $? = 1
configure:3696: mpicc -qversion >&5
gcc: error: unrecognized command-line option '-qversion'; did you mean '--version'?
configure:3707: $? = 1
configure:3727: checking whether the C compiler works
configure:3749: mpicc conftest.c >&5
/usr/bin/ld: warning: libfabric.so.1, needed by /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so: undefined reference to `fi_version@FABRIC_1.0'
/usr/bin/ld: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so: undefined reference to `fi_dupinfo@FABRIC_1.3'
/usr/bin/ld: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so: undefined reference to `fi_strerror@FABRIC_1.0'
/usr/bin/ld: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so: undefined reference to `fi_freeinfo@FABRIC_1.3'
/usr/bin/ld: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so: undefined reference to `fi_fabric@FABRIC_1.1'
/usr/bin/ld: /opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib/libmpi_gnu_91.so: undefined reference to `fi_getinfo@FABRIC_1.3'
collect2: error: ld returned 1 exit status
configure:3753: $? = 1
configure:3791: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "UCVM"
| #define PACKAGE_TARNAME "ucvm"
| #define PACKAGE_VERSION "22.7.0"
| #define PACKAGE_STRING "UCVM 22.7.0"
| #define PACKAGE_BUGREPORT "software@scec.org"
| #define PACKAGE_URL ""
| #define PACKAGE "ucvm"
| #define VERSION "22.7.0"
| /* end confdefs.h. */
|
| int
| main ()
| {
|
| ;
| return 0;
| }
configure:3796: error: in `/lustre/orion/geo156/scratch/dean316/build_UCVM/UCVM':
configure:3798: error: C compiler cannot create executables
See `config.log' for more details
The log above shows the problem that we are seeing on Frontier. This is the workaround:
- head -10 config.log > r
- edit r so that it contains only the configure command
- call ./r to run the configure command by hand
- make
- make install
- ./ucvm_setup.py -a -r -d -p YOUR_UCVM_INSTALL_PATH
- check that the /conf directory has ucvm_env.sh
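The first two steps of the workaround can be scripted. The following is a minimal sketch, assuming that config.log follows the usual Autoconf layout in which the invocation is recorded on a line beginning with `$ ./configure`. The function name `extract_configure_cmd` is hypothetical. Arguments that originally contained spaces are not quoted in config.log, so review the generated file before running it.

```shell
#!/bin/bash
# Hypothetical helper: copy the configure invocation recorded in a config.log
# into a small executable script (called "r" in the steps above).
# Assumption: the invocation sits on a line of the form "  $ ./configure ...".
extract_configure_cmd() {
    local log="${1:-config.log}" out="${2:-r}"
    local cmd
    # Take the first line that starts with "$ " and mentions configure,
    # and drop the leading "$ " prompt.
    cmd=$(sed -n 's/^[[:space:]]*\$ \(.*configure.*\)$/\1/p' "$log" | head -n 1)
    if [ -z "$cmd" ]; then
        echo "no configure invocation found in $log" >&2
        return 1
    fi
    printf '#!/bin/bash\n%s\n' "$cmd" > "$out"
    chmod +x "$out"
}
```

After `extract_configure_cmd config.log r`, the remaining steps are unchanged: run `./r`, then `make` and `make install`.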
Example or Typical Configuration Errors
#SBATCH -N 8
srun -N8 -n512 block:cyclic ${BIN_DIR}/ucvm2mesh_mpi -f ./norcal_ucvm2mesh.conf
srun: error: Unable to create step for job 1875534: More processors requested than permitted
#
srun -N8 -c -m block:cyclic ${BIN_DIR}/ucvm2mesh_mpi -f ./norcal_ucvm2mesh.conf
srun: error: Invalid numeric value "-m" for --cpus-per-task.
#
srun -N8 -c --ntasks-per-core=1 -m block:cyclic ${BIN_DIR}/ucvm2mesh_mpi -f ./norcal_ucvm2mesh.conf
srun: error: Invalid numeric value "--ntasks-per-core=1" for --cpus-per-task.
#
srun -N8 -c --ntasks-per-node=64 --ntasks-per-core=1 -m block:cyclic ${BIN_DIR}/ucvm2mesh_mpi -f ./norcal_ucvm2mesh.conf
srun: error: Invalid numeric value "--ntasks-per-node=64" for --cpus-per-task.
#
srun -N8 -n256 -m block:cyclic ${BIN_DIR}/ucvm2mesh_mpi -f ./norcal_ucvm2mesh.conf
#PMPI_Type_create_darray(448): Invalid argument array_of_psizes
#
srun -N8 -n512 -m block:cyclic ${BIN_DIR}/ucvm2mesh_mpi -f ./norcal_ucvm2mesh.conf
srun: error: Unable to create step for job 1871180: More processors requested than permitted

[0] expected 1000(processes) divisible by 2000(core count)
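Some of these failures can probably be caught before the job step is launched. The sketch below rests on two assumptions that the page does not confirm: that 56 cores per node are usable (the figure quoted in the mesh-size section), and that the task count must equal the product of the per-dimension process counts in the ucvm2mesh configuration, which is one reading of the `array_of_psizes` error. The function name `check_tasks` and its argument order are hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-flight check for an srun launch of ucvm2mesh_mpi.
# Usage: check_tasks NODES NTASKS PX PY PZ [CORES_PER_NODE]
# Assumptions: CORES_PER_NODE defaults to 56; PX, PY, PZ are the process
# counts per dimension taken from the mesh configuration file.
check_tasks() {
    local nodes=$1 ntasks=$2 px=$3 py=$4 pz=$5 cores=${6:-56}
    local capacity=$((nodes * cores))
    local grid=$((px * py * pz))
    if [ "$ntasks" -gt "$capacity" ]; then
        echo "too many tasks: $ntasks > $nodes nodes x $cores cores = $capacity"
        return 1
    fi
    if [ "$ntasks" -ne "$grid" ]; then
        echo "task count $ntasks does not match process grid $px x $py x $pz = $grid"
        return 2
    fi
    echo "ok: $ntasks tasks on $nodes nodes"
}
```

Under these assumptions, `check_tasks 8 512 8 8 8` reports a capacity of 448, which would be consistent with the "More processors requested than permitted" errors for `-N8 -n512`. The `--cpus-per-task` errors are a separate issue: `-c` expects a numeric value, so the next argument on the command line is parsed as that value.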
Size of CyberShake Meshes
- Sample meshes for s3446 are 5760 x 9680 x 632 ~ 35B points.
- Run on 96 nodes (56 cores/node), and it takes around 10 minutes.
80m spacing:
5760 * 80 = 460,800m
9680 * 80 = 774,400m
632 * 80 = 50,560m
It's a rotated volume, so it's 5760 in the E/W direction but then rotated counter-clockwise 36 degrees.
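The arithmetic above can be reproduced for other mesh dimensions with a short helper. This is an illustrative sketch; the function name `mesh_summary` is hypothetical, and it relies on bash's 64-bit integer arithmetic to hold the point count.

```shell
#!/bin/bash
# Hypothetical helper: point count and physical extent of a regular mesh.
# Usage: mesh_summary NX NY NZ SPACING_M
mesh_summary() {
    local nx=$1 ny=$2 nz=$3 dx=$4
    echo "points=$((nx * ny * nz))"
    echo "extent_m=$((nx * dx)) x $((ny * dx)) x $((nz * dx))"
}

# The s3446 sample mesh at 80 m spacing
mesh_summary 5760 9680 632 80
```

For the s3446 dimensions this prints `points=35238297600` (about 35B) and `extent_m=460800 x 774400 x 50560`.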