If you have problems while running MRCC, please attach the output with an adequate description of your case, as well as the following:
  • the way mrcc was invoked
  • the way build.mrcc was invoked
  • the output of build.mrcc
  • compiler version (for example: ifort -V, gfortran -v)
  • blas/lapack versions
  • gcc and glibc versions

This information really helps us during troubleshooting :)
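
A minimal sketch of how the compiler, gcc, and glibc versions listed above might be collected in one go. The helper name `report_version` is made up for this illustration, `ldd --version` reports the glibc version only on glibc-based systems, and BLAS/LAPACK versions are not covered because how to query them depends on the library in use.

```shell
# Hypothetical helper: print the first line of a tool's version banner,
# or note that the tool is not installed.
report_version() {
    label="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        # Some tools (e.g. ifort -V) print their banner to stderr, so merge both streams.
        printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$label"
    fi
}

report_version "ifort"    ifort -V
report_version "gfortran" gfortran --version
report_version "gcc"      gcc --version
report_version "glibc"    ldd --version   # assumption: glibc-based system
```

The output can be pasted into the post next to the mrcc and build.mrcc command lines.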

Severe performance regression of CCSDT between the 2019 and 2020 binaries

7 months 1 week ago - 7 months 1 week ago #1016 by TiborGY
Dear All,

I have tested both the 2019 and the 2020 binary releases and, to my dismay, found that the 2020 binary is much slower than the 2019 one: a single CCSDT iteration takes exactly twice as long.
I am using MRCC via the Molpro interface.

The system has an AMD Zen2 CPU (3900X), but a while ago I think I saw similar performance regressions on an Intel Haswell-E based system. I have not kept those output files, so I do not have exact timings for that.

The MRCC output files are attached. The energies, etc. are identical; only the timings show a meaningful difference.

Best,
Tibor Győri
Last edit: 7 months 1 week ago by TiborGY. Reason: newlines

5 months 3 weeks ago #1024 by bkwx97
I'm having a similar issue with the AMD Epyc Rome (Zen 2) architecture. I'm using the 2020 binary and wondering whether there are any suggested settings for OMP_PLACES, OMP_PROC_BIND, and KMP_AFFINITY. Or is it better to compile from source on AMD processors?

4 months 4 weeks ago #1033 by nagypeter
Dear Tibor and bkwx97,

Thanks for reporting this and sorry for the late reply.
Unfortunately, we do not have access to AMD processors and have not seen any slowdown with Intel CPUs, so I cannot check this at the moment.

My first suggestion would be to recompile from source with a recent ifort version that supports your CPUs.

Could you please share the full output files of a problematic job run with both the 2019 and the 2020 binary?

You might indeed try playing around with the settings you mentioned, e.g. OMP_PLACES=cores and OMP_PROC_BIND=spread, but we are not sure which options would work best on AMD.

You may also try to set this:
export MKL_ENABLE_INSTRUCTIONS=AVX2
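
Put together, the settings mentioned above could be exported in the shell (or job script) that launches the calculation. This is only a sketch: the values are the ones suggested in this post, and whether they help on AMD hardware is untested.

```shell
# Thread placement: one place per physical core, threads spread across the places.
export OMP_PLACES=cores
export OMP_PROC_BIND=spread

# Ask MKL to dispatch AVX2 code paths.
export MKL_ENABLE_INSTRUCTIONS=AVX2

# Confirm what the job will inherit before starting the calculation.
echo "OMP_PLACES=$OMP_PLACES"
echo "OMP_PROC_BIND=$OMP_PROC_BIND"
echo "MKL_ENABLE_INSTRUCTIONS=$MKL_ENABLE_INSTRUCTIONS"
```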

As a workaround, you could send us the list of features you want to use, and we can tell you which of them are the same in the 2019 and 2020 codes.

Best wishes,
Peter

4 months 3 weeks ago #1035 by TiborGY
Dear Péter,

Please see the two output files I attached to the first post. I did not let them run to completion, but the timings are there. If they are not sufficient, please let me know.

The results I have uploaded were from serial runs, and I doubt that adjusting anything OMP related would have an impact on single-threaded performance.

As far as I know MKL_ENABLE_INSTRUCTIONS does nothing on AMD CPUs, with the possible exception of the very newest 2020 MKL releases. For a very long time, the relevant environment variable for enabling SSE/AVX usage inside MKL with AMD CPUs had been the undocumented env. var. MKL_DEBUG_CPU_TYPE.

Before one of the 2020 MKL releases, MKL_DEBUG_CPU_TYPE=5 would force the usage of AVX2 in MKL, even on AMD processors. But one of the 2020 MKL releases removed this, although Intel seems to have started to work on recognizing AMD CPUs and running somewhat optimized code on them. So hopefully, eventually, MKL will run great on AMD without any hacks, but until then the best performance is probably achieved by using a 2019 MKL release with the MKL_DEBUG_CPU_TYPE CPU detection override.
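
For reference, a sketch of how the override is applied. It assumes the binary is linked against an MKL release that still honours this undocumented variable (a 2019 release, per the above); newer releases are expected to ignore it.

```shell
# Undocumented MKL CPU-detection override; 5 selects the AVX2 code path.
# Assumption: the MKL in use predates the 2020 release that removed it.
export MKL_DEBUG_CPU_TYPE=5

# Confirm the setting in the shell that will launch the calculation.
echo "MKL_DEBUG_CPU_TYPE=$MKL_DEBUG_CPU_TYPE"
```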

I am also going to test this on Intel systems, and post those results as well.

Best,
Tibor

4 months 1 week ago - 4 months 1 week ago #1042 by TiborGY
Dear All,

I have repeated the calculation on two different Intel CPUs:
  • W-2295 (Cascade Lake-W): essentially equivalent to Skylake-X and Xeon Scalable (Silver, Gold, etc.) CPUs that have only 4 memory channels; like them, it has AVX-512 support
  • i7-5820K (Haswell-E): essentially equivalent to high-clock-speed v3 Xeons; no AVX-512 support

Unfortunately, I have found the same ~2x performance regression on both Intel systems as well. Interestingly, the fastest system for single-thread performance is the AMD 3900X, at least when a single MRCC thread is the only thing running on the system. The older Haswell-E CPU is also slightly faster than the Cascade Lake part, despite having no AVX-512 support and being much older overall.

My results so far (timings are in minutes per CCSDT iteration):
MRCC binary | Intel W-2295 | Intel i7-5820K | AMD 3900X
old         | 235          | 220            | 158
new         | 417          | 409            | 316
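
The slowdown factors implied by these timings can be computed as new time divided by old time. The short sketch below does this with awk; the function name `slowdown` is made up for illustration.

```shell
# Slowdown factor = new time / old time (minutes per CCSDT iteration).
slowdown() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / old }'
}

echo "Intel W-2295:   $(slowdown 235 417)x"   # prints 1.77x
echo "Intel i7-5820K: $(slowdown 220 409)x"   # prints 1.86x
echo "AMD 3900X:      $(slowdown 158 316)x"   # prints 2.00x
```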

Logs for the Intel systems are attached to this post. (I killed the calculations after the timings for the first iteration were printed, so please ignore the errors at the end of the logs)

Best,
Tibor
Last edit: 4 months 1 week ago by TiborGY.

4 months 6 days ago #1047 by TiborGY
Further results: apparently this might not be caused by any code changes between the 2019 and 2020 releases, but rather a regression (?) in ifort.

I have compiled the 2019 source code with the latest ifort, and I am getting the same slowdown as with the 2020 binary.
