If you have problems during the execution of MRCC, please attach the output with an adequate description of your case as well as the following:
  • the way mrcc was invoked
  • the way build.mrcc was invoked
  • the output of build.mrcc
  • compiler version (for example: ifort -V, gfortran -v)
  • blas/lapack versions
  • gcc and glibc versions

This information really helps us during troubleshooting :)

MRCC MPI/OpenMP and SLURM

4 months 1 day ago #1031 by diefenbach
Dear all,

I am trying to run the current MRCC binary (2020-02-22) with hybrid MPI/OpenMP parallelism using the SLURM batch queueing system.

Interactively (without batch queueing), MRCC runs fine as expected. E.g., with "mpitasks=2" and "OMP_NUM_THREADS=40" I obtain 2 MPI tasks via hydra_pmi_proxy, each spawning 40 threads, running successfully to completion using scf_mpi and mrcc_mpi.

With SLURM batch queueing, however, this appears to conflict with SLURM's "srun" command: dmrcc seems to launch instances of "srun" which are calling Intel MPI (hydra_pmi_proxy), and then hangs at the scf_mpi process.

Has anyone else encountered this issue, and perhaps come up with a solution?

Cheers,
Martin

Below is the slurm script, which runs fine interactively via
> salloc --overcommit --nodes=1 --ntasks=2 --cpus-per-task=40
  salloc: Node node45-008 ready for job
> ssh node45-008
> ./mrcc.sh

but hangs when submitted via "sbatch mrcc.sh":
mrcc.sh:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=40

export OMP_NUM_THREADS=40
export MKL_NUM_THREADS=40

MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt

export PATH=${MRCC_DIR}:${PATH}
cp ${HOME}/mrcc.inp MINP
dmrcc &> ${HOME}/mrcc.out
exit

The stalled/hanging processes on the scheduled node:
> ssh node45-001
> ps xl
 PID   PPID    STAT TTY  TIME COMMAND
132202 132197  S    ?    0:00 /bin/bash /var/spool/slurm/d/job480438/slurm_script
132225 132202  S    ?    0:00 /bin/sh /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpirun -np 1 dmrcc_mpi
132231 132225  S    ?    0:00 mpiexec.hydra -np 1 dmrcc_mpi
132232 132231  Ssl  ?    0:00 /usr/bin/srun -N 1 -n 1 --nodelist node45-008 --input none /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node45-008.cm.cluster --upstream-port 45452 --pgid 0 --launcher slurm --launcher-number 1 --base-path /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
132233 132232  S    ?    0:00 /usr/bin/srun -N 1 -n 1 --nodelist node45-008 --input none /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node45-008.cm.cluster --upstream-port 45452 --pgid 0 --launcher slurm --launcher-number 1 --base-path /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
132246 132240  S    ?    0:00 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
132249 132246  S    ?    0:00 dmrcc_mpi
132516 132231  Ss   ?    0:00 /usr/bin/srun -N 1 -n 1 --nodelist node45-008 --input none /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host node45-008.cm.cluster --upstream-port 46735 --pgid 1 --launcher slurm --launcher-number 1 --base-path /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 /cluster/intel/2019.5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9

4 months 4 hours ago #1032 by nagypeter
Replied by nagypeter on topic MRCC MPI/OpenMP and SLURM
Dear Martin,

sorry for the problem, we did not have this issue before.
Your scripts look fine.

Could you share more information? E.g., on some clusters there were cluster-specific MPI
problems. Can you share some documentation on the cluster setup?

You may also try
unset I_MPI_PMI_LIBRARY
to use Intel's internal PMI library instead of SLURM's.
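
A minimal sketch of this suggestion: the helper below reports whether I_MPI_PMI_LIBRARY is set before clearing it, so the batch log records what was changed. The function name clear_pmi_library is hypothetical; only the variable name comes from the suggestion above.

```shell
# Hypothetical helper: report and clear I_MPI_PMI_LIBRARY so that
# Intel MPI falls back to its internal PMI implementation.
clear_pmi_library() {
    if [ -n "${I_MPI_PMI_LIBRARY:-}" ]; then
        echo "I_MPI_PMI_LIBRARY was set to: ${I_MPI_PMI_LIBRARY}"
        unset I_MPI_PMI_LIBRARY
    else
        echo "I_MPI_PMI_LIBRARY was not set"
    fi
}

# Call it directly (not inside $(...)), otherwise the unset happens
# in a subshell and is lost.
clear_pmi_library
```

In a batch script this would go before the dmrcc line.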

You may also try to start dmrcc_mpi explicitly with mpirun, and perhaps also change the bootstrap server:
mpirun -np 1 -bootstrap $bootstrap dmrcc_mpi
where $bootstrap can be ssh, rsh, etc.

Any other output file and error msg info would be helpful, if you could share those.
Is this problem MRCC specific? Can you run other programs with similar setup?
Can you run MRCC via sbatch without relying on MPI?
Do the test jobs run correctly?

Is login with ssh/rsh allowed between nodes? Our code needs to spawn the scf_mpi and mrcc_mpi processes via an MPI_Comm_spawn call in dmrcc_mpi and this has to be allowed.
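
One way to test the ssh question from inside a job is sketched below. The function name check_ssh_nodes is hypothetical, and the sketch assumes passwordless (key-based) ssh is the intended login method. A successful ssh login does not by itself guarantee that MPI_Comm_spawn will work, so treat this only as a first check.

```shell
# Hypothetical helper: try a non-interactive ssh login to each node
# given as an argument and report which ones fail.
check_ssh_nodes() {
    failed=0
    for node in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from reading the script's standard input.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Possible use inside a SLURM job (expands the allocated node list):
#   check_ssh_nodes $(scontrol show hostnames "$SLURM_JOB_NODELIST")
```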

I hope some of this helps.
Best wishes,
Peter

4 months 3 hours ago #1034 by diefenbach
Dear Peter,

many thanks for the reply!

Your suggestion to change the bootstrap server actually does the trick!

If I use dmrcc_mpi with mpirun instead of dmrcc, i.e.
# dmrcc &> mrcc.out
mpirun -np 1 -bootstrap ssh dmrcc_mpi &> mrcc.out
the job runs to completion, using 2 tasks with 40 threads each.
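
For batch scripts that should work with either launch method, the switch can be wrapped in a small function. This is only a sketch: the function name run_mrcc and the DRY_RUN variable are hypothetical, while the two command lines are the ones quoted above.

```shell
# Hypothetical wrapper: start MRCC either through the plain dmrcc driver
# or through mpirun with an explicit bootstrap server (e.g. ssh).
run_mrcc() {
    bootstrap="${1:-}"
    if [ -n "$bootstrap" ]; then
        set -- mpirun -np 1 -bootstrap "$bootstrap" dmrcc_mpi
    else
        set -- dmrcc
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command that would be executed.
        echo "$*"
    else
        "$@"
    fi
}

# Possible use in the batch script:
#   run_mrcc ssh &> "${HOME}/mrcc.out"
```

Setting DRY_RUN=1 beforehand prints the command line instead of running it, which helps when checking the script on a login node.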

There is, however, an error message at the very end concerning fort.17, which only appears in combination with MPI (also without SLURM):
 Total CCSDT[Q] energy [au]:         -55.792496754758
 Total CCSDT(Q)/A energy [au]:       -55.792638783431
 Total CCSDT(Q)/B energy [au]:       -55.792641166686

 Fatal error in cp fort.17 .. 2> /dev/null.
 Program will stop.

 ************************ 2020-12-16 17:21:20 ************************
                   Error at the termination of mrcc.
 *********************************************************************

 ************************ 2020-12-16 17:21:22 ************************
                      Normal termination of mrcc.
 *********************************************************************
Is this something to worry about?

I am running the following input:
basis=cc-pVTZ
calc=CCSDT(Q)
scftype=ROHF
mult=2
cctol=12
mem=8GB
mpitasks=2

unit=bohr
geom
H
N 1 R1
H 2 R1 1 A

R1=2.00000000000
A=104.2458898548

Just as a side note -
The original problem with SLURM was specific to MRCC in combination with MPI. Jobs work without any issues when just asking for OpenMP threading within SLURM (i.e., jobs without "mpitasks=..." in the input file). Also other programs work regularly with sbatch/SLURM, e.g. Molpro with MPI/OpenMP hybrid parallel jobs.

Best wishes,
Martin

3 months 4 weeks ago #1036 by nagypeter
Replied by nagypeter on topic MRCC MPI/OpenMP and SLURM
Dear Martin,

I am glad that it worked out for you.

You can ignore the second issue with the fort.17 copy; the energies are fine.
Thank you for pointing it out. This will be fixed in the next release.
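
Until then, a quick look at the output file can show that the energies were printed and that the final banner is present. The sketch below is a convenience only; the function name check_mrcc_output is hypothetical, and the search strings are taken from the output quoted earlier in this thread. It does not validate the energies themselves.

```shell
# Hypothetical helper: print the CCSDT(Q) energy lines and check that the
# "Normal termination of mrcc" banner is present in an MRCC output file.
check_mrcc_output() {
    out="$1"
    # -F treats the pattern as a fixed string, so the parentheses are literal.
    grep -F 'Total CCSDT(Q)' "$out"
    if grep -q -F 'Normal termination of mrcc' "$out"; then
        echo "normal termination found"
    else
        echo "normal termination NOT found"
        return 1
    fi
}

# Possible use after the job has finished:
#   check_mrcc_output "${HOME}/mrcc.out"
```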

If you wish, you can replace the following lines in mrcc.f

      if(master_thread) then
        lll=.false.
        inquire(file='CCDENSITIES',exist=lll)
        if (lll) call ishell('mv CCDENSITIES .. 2> /dev/null')
        call ishell('cp fort.16 .. 2> /dev/null')
        call ishell('cp fort.17 .. 2> /dev/null')
        call ishell('cp fort.63 .. 2> /dev/null')
      end if

with these

      if(master_thread) then
        lll=.false.
        inquire(file='CCDENSITIES',exist=lll)
        if (lll) call ishell('mv CCDENSITIES .. 2> /dev/null')
        call ishell('cp fort.16 .. 2> /dev/null')
! Copy fort.17 and fort.63 only if they exist
        inquire(file='fort.17',exist=lll)
        if (lll) call ishell('cp fort.17 .. 2> /dev/null')
        inquire(file='fort.63',exist=lll)
        if (lll) call ishell('cp fort.63 .. 2> /dev/null')
      end if

and recompile.
Best wishes,
Peter

3 months 3 weeks ago #1037 by diefenbach
Dear Peter,

thanks again for your support! If the copy message may be ignored, I might as well do so for now and wait for the next binary release.

Concerning the original issue with MPI and SLURM, the culprit is apparently the pre-installed 2019.5 version of Intel MPI on our cluster. After installing version 2019.3, as recommended in the manual, dmrcc works just as intended:
#!/bin/bash
#SBATCH --overcommit
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=40

export OMP_NUM_THREADS=40
export MKL_NUM_THREADS=40

MRCC_DIR=/compuchem/bin/mrcc/2020-02-22.binary
INTEL_DIR=/compuchem/bin/intel/compilers_and_libraries_2019.3.199/linux
source ${INTEL_DIR}/mkl/bin/mklvars.sh intel64 ilp64
source ${INTEL_DIR}/mpi/intel64/bin/mpivars.sh release_mt -ofi_internal=0

export PATH=${MRCC_DIR}:${PATH}
cp ${HOME}/mrcc.inp MINP
dmrcc &> ${HOME}/mrcc.out
exit

I actually did not expect such an impact of a minor version release, and I am not sure about the origin of the issue, but there might have been a change in settings for the HYDRA process manager in IntelMPI 2019.5...
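
To catch this early in a batch script, the release can be read from the INTEL_DIR path and compared against the version that hung here. This is a rough sketch: the function names are hypothetical, and it assumes the install directory follows the compilers_and_libraries_<release> naming used in the scripts above.

```shell
# Hypothetical helper: extract the release string (e.g. 2019.5.281)
# from an Intel install path of the form .../compilers_and_libraries_<release>/...
intel_release_from_dir() {
    case "$1" in
        *compilers_and_libraries_*) ;;
        *) return 1 ;;   # path does not follow the assumed naming scheme
    esac
    rel="${1#*compilers_and_libraries_}"   # drop everything up to the prefix
    rel="${rel%%/*}"                       # drop the trailing path components
    echo "$rel"
}

# Hypothetical helper: warn if the release is the 2019.5 series that
# hung with dmrcc under SLURM in this thread.
check_intel_release() {
    release=$(intel_release_from_dir "$1") || return 1
    case "$release" in
        2019.5.*)
            echo "warning: Intel release $release hung with dmrcc under SLURM"
            return 1 ;;
        *)
            echo "Intel release $release" ;;
    esac
}

# Possible use in the batch script:
#   check_intel_release "${INTEL_DIR}" || exit 1
```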

Anyways, everything works fine now on Intel-based machines! There is, however, another issue with AMD Epyc (Zen2) based architectures - I shall open a new thread for that.

Best regards,
Martin

3 months 3 weeks ago #1038 by nagypeter
Replied by nagypeter on topic MRCC MPI/OpenMP and SLURM
Dear Martin,

I am glad that the Intel side works well now.
We did not have any experience with the 2019.5 Intel MPI version, so it is good to know about this.

Please do open a new thread for the AMD question. Before that, could you have a look at the thread below? The two could be related.
www.mrcc.hu/index.php/forum/running-mrcc...d-2020-binaries#1016

Best wishes,
Peter
