If you have problems during the execution of MRCC, please attach the output with an adequate description of your case as well as the following:
  • the way mrcc was invoked
  • the way build.mrcc was invoked
  • the output of build.mrcc
  • compiler version (for example: ifort -V, gfortran -v)
  • blas/lapack versions
  • as well as gcc and glibc versions

This information really helps us during troubleshooting :)

MRCC MPI crashes on AMD Epyc (oom-kill)

  • diefenbach
  • Topic Author
3 years 4 months ago #1039 by diefenbach
MRCC MPI crashes on AMD Epyc (oom-kill) was created by diefenbach
Dear all,

when running the current MRCC binary (2020-02-22) with MPI parallelism on an AMD Epyc (Zen2) architecture, the calculation hangs at the dmrcc_mpi process, which consumes all of the resident memory and then crashes with an oom-kill event:

ps xl
Code:
 PID  PPID WCHAN  STAT TIME COMMAND
8118  8113 do_wai S    0:00 /bin/bash /var/spool/slurm/d/job486747/slurm_script
8137  8118 do_wai S    0:00 /bin/sh /compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin/mpirun -np 1 dmrcc_mpi
8142  8137 poll_s S    0:00 mpiexec.hydra -np 1 dmrcc_mpi
8143  8142 poll_s Ss   0:00 /compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1
8146  8143 6614392 - R 1:12 dmrcc_mpi

top
Code:
KiB Mem : 52823801+total, 42264691+free, 10068935+used, 4901744 buff/cache
KiB Swap: 13421772+total, 13420800+free, 9728 used. 42610931+avail Mem

 PID PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
8741 20  0 80.3g 80.2g 2168 R 99.7 15.9 0:38.76 dmrcc_mpi

Memory consumption continues until total Mem is filled and oom-kill is issued:
Code:
slurmstepd: error: Detected 1 oom-kill event(s) in step 486747.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
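
The memory growth seen in top can also be tracked from a script, so that a runaway dmrcc_mpi is stopped before the cgroup out-of-memory handler steps in. The following is only a minimal sketch, assuming a Linux /proc filesystem; the function names and the 16 GiB limit in the commented usage are hypothetical and not part of MRCC or Slurm.

```shell
#!/bin/bash
# Print the resident set size in KiB from a /proc/<pid>/status stream on stdin.
parse_rss_kib() {
    awk '/^VmRSS:/ {print $2}'
}

# Succeed if the RSS of process PID is larger than LIMIT_KIB.
# Returns 2 if /proc/<pid>/status cannot be read (e.g. the process has exited).
rss_exceeds() {
    local pid=$1 limit_kib=$2 rss
    rss=$(parse_rss_kib < "/proc/${pid}/status") || return 2
    [ -n "$rss" ] && [ "$rss" -gt "$limit_kib" ]
}

# Hypothetical usage: stop dmrcc_mpi once it holds more than 16 GiB,
# i.e. well above the mem=8GB requested in the input file.
# pid=$(pgrep -n -x dmrcc_mpi)
# while kill -0 "$pid" 2>/dev/null; do
#     rss_exceeds "$pid" 16777216 && { echo "RSS limit hit" >&2; kill "$pid"; break; }
#     sleep 5
# done
```

This does not fix anything; it only makes the failure quicker and keeps the rest of the node usable while debugging.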

Does anyone have access to AMD Epyc machines, and has anyone encountered this issue or perhaps found a solution?

Cheers,
Martin

Below is the job script, which runs fine on Intel systems, but crashes when run on AMD Epyc (regardless of submission type, e.g., batch queueing or interactive usage):
Code:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0
export PATH=${MRCC_DIR}:${PATH}

cp ${HOME}/mrcc.inp MINP
dmrcc &> mrcc.out

exit

mrcc.inp:
Code:
basis=cc-pVTZ
calc=CCSDT(Q)
scftype=ROHF
mult=2
mem=8GB
cctol=12
mpitasks=2
unit=bohr
geom
H
N 1 R1
H 2 R1 1 A

R1=2.00000000000
A=104.2458898548


  • nagypeter
  • MRCC developer
3 years 4 months ago #1040 by nagypeter
Replied by nagypeter on topic MRCC MPI crashes on AMD Epyc (oom-kill)
Dear Martin,

unfortunately, we currently do not have access to AMD processors.
I assume the AMD CPUs again work for jobs without MPI and fail only with MPI.
Do you see any other process besides dmrcc_mpi? Perhaps scf_mpi?

If you do not see the scf_mpi processes, the issue is possibly again
in an MPI_Comm_spawn call responsible for spawning the scf_mpi processes,
and could be a similar internode communication issue as before. There could be some AMD-specific (or non-Intel-specific) issue we are unaware of.
Did you try the suggestions made for the previous problem?
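
A quick way to answer the question about scf_mpi is to count the processes by name while the job hangs. This is a rough sketch using standard ps and grep only; the helper name count_procs is made up for illustration.

```shell
#!/bin/bash
# Count the lines of a "ps -o comm=" listing on stdin that are exactly NAME.
# "|| true" keeps the exit status at 0 when grep finds no match and prints 0.
count_procs() {
    grep -c -x -F -- "$1" || true
}

# Hypothetical usage on the compute node while the job is running:
# ps -u "$USER" -o comm= | count_procs dmrcc_mpi
# ps -u "$USER" -o comm= | count_procs scf_mpi
```

If the scf_mpi count stays at 0 while dmrcc_mpi keeps running, the spawn step has most likely not completed.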

You may also try to compile MRCC from source and link it with
- either OpenMPI instead of IntelMPI (please see the manual first about the
required OpenMPI version and patch)
- or a more recent IntelMPI, hoping for better AMD support.

You are again welcome to share some cluster documentation, if there is any.
I hope some of this helps. I am sorry that this is not much help so far.

Best of luck,
Peter & Laszlo


  • diefenbach
  • Topic Author
3 years 3 months ago #1046 by diefenbach
Replied by diefenbach on topic MRCC MPI crashes on AMD Epyc (oom-kill)
Dear Peter and Laszlo,

thanks for your suggestions. Yes, the error occurs with MPI only, and it also occurs if the calculation is run on a single node (i.e., not only for runs across multiple nodes). Serial or OMP parallel runs work fine.

In the meantime I tried a few different IntelMPI versions, including 2019.3, 2019.5, and 2021.1 (the latter is part of the current OneAPI release). In conjunction with the MRCC binary of 2020-02-22, all of these lead to the same result: dmrcc_mpi consumes all of the resident memory and crashes after triggering an oom-kill event. No scf_mpi process appears.

However, compiling the latest 2020-02-22 source code with the ifort 2019.5.281 pre-installed on our cluster and MKL/IntelMPI 2019.3.199 solved this memory issue:
Code:
INTEL_MKL_ROOT=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
INTEL_MPI_ROOT=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
INTEL_CMP_ROOT=/cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux

export PATH=${INTEL_CMP_ROOT}/bin/intel64:${PATH}
export LD_LIBRARY_PATH=${INTEL_CMP_ROOT}/compiler/lib/intel64:${LD_LIBRARY_PATH}
source ${INTEL_MKL_ROOT}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_MPI_ROOT}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0

./build.mrcc Intel -i64 -pOMP -pMPI=IntelMPI

It appears that I can reproduce the MPI error with the oom-kill event if MRCC is compiled using the same compiler/MPI versions, but with statically linked libraries via
Code:
./build.mrcc Intel -i64 -pOMP -pMPI=IntelMPI -s


I am not sure about the origin of this behaviour, but it may be connected to libimf.so?
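
To see which copy of libimf.so a dynamically linked build actually picks up, the output of ldd can be filtered. This is only a sketch: whether libimf is really involved is a guess, the helper name libimf_path is made up, and for a statically linked binary ldd reports that it is not a dynamic executable and prints no library list.

```shell
#!/bin/bash
# Print the resolved path of libimf.so from "ldd" output on stdin.
# ldd lines look like: "<tab>libimf.so => /path/to/libimf.so (0x...)".
libimf_path() {
    awk '/libimf\.so/ {print $3}'
}

# Hypothetical usage, with dmrcc_mpi on the PATH:
# ldd "$(command -v dmrcc_mpi)" | libimf_path
```

An empty result for the dynamic build would mean libimf.so is not among its shared dependencies at all.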

Just for your information - the cluster hardware configuration this was tested on includes Intel and AMD compute nodes with EDR InfiniBand interconnects, where the Intel nodes are dual socket Intel Xeon Gold 6148 (Skylake) with 20 cores per socket (40 cores) and 192 GB RAM per node and the AMD nodes are dual socket AMD EPYC 7452 (Zen2) with 32 cores per socket (64 cores) and 512 GB RAM per node.
 
In any case, my issue with AMD and MPI appears to be solved by compiling MRCC from source using ifort 2019 with dynamically linked libraries.
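
On a mixed cluster like this one, the job script could choose the MRCC build according to the CPU vendor of the node it lands on. The following is a sketch under assumptions: the directory in MRCC_SRC_DIR is a hypothetical install location for the source-compiled, dynamically linked build, and the function names are made up.

```shell
#!/bin/bash
# Pre-built binary release (path as used in the job script above).
MRCC_BIN_DIR=/compuchem/bin/mrcc/2020-02-22.binary
# Hypothetical location of the source-compiled, dynamically linked build.
MRCC_SRC_DIR=/compuchem/bin/mrcc/2020-02-22.source

# Print the vendor_id of the first CPU from a /proc/cpuinfo stream on stdin.
cpu_vendor() {
    awk -F': ' '/^vendor_id/ && !found {print $2; found=1}'
}

# Map a vendor string to the MRCC directory to use.
pick_mrcc_dir() {
    case "$1" in
        AuthenticAMD) echo "$MRCC_SRC_DIR" ;;
        *)            echo "$MRCC_BIN_DIR" ;;
    esac
}

# Hypothetical usage inside the job script:
# MRCC_DIR=$(pick_mrcc_dir "$(cpu_vendor < /proc/cpuinfo)")
# export PATH=${MRCC_DIR}:${PATH}
```

Since the source-compiled build is not reported here to cause problems on Intel nodes, simply using it everywhere may be the simpler choice.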

Best wishes for the new year!
Martin

