CC(n) behaviour when memory is insufficient

2 years 2 months ago #1200 by MXvo5e35
I'm running a series of expensive CC(n) and CC(n)(n+1) calculations for large n (up to 5 or 6) with moderately large basis sets (up to aug-cc-pCVQZ). These naturally place substantial demands on CPU time and, particularly, memory, and I'm interested in pushing the calculations to the limit of the resources I have available. However, I'm seeing some behaviour that I don't understand, so I was wondering whether anybody (probably one of the MRCC devs) could help explain it to me.

Consider the following MINP:
Code:
calc=ccsdtqp
mem=180GB
#mem=1400GB
scftype=uhf
scfiguess=ao
gauss=spher
basis=aug-cc-pCVQZ
charge=0.0
mult=4
scftol=10
scfdtol=11
cctol=10
itol=12
ovltol=0.0
scfmaxit=50
ccmaxit=50
core=1
unit=angstrom
geom=xyz
1

N 0.0 0.0 0.0

The particular parameters aren't inherently important, but note that there are two alternative values for the mem variable, one commented out. What I'm trying to establish is how the behaviour differs depending on which of these is used.

Background: I generally run my calculations on machines with about 192GB of RAM. I also have access to a couple of machines with about 1.5TB of RAM; these underperform and are prone to, e.g., NUMA-related performance issues, but the massive amount of RAM they offer makes them interesting for what follows. I'll call these machines small and large respectively; on each, I set the mem variable in MINP to a safe value approaching the maximum available capacity (hence 180GB for small and 1400GB for large, matching the two entries in the MINP above).
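For reference, a minimal sketch of how such a "safe" value might be derived from the installed RAM; the 6% headroom is my own rule of thumb, not an MRCC recommendation:
Code:
# Hypothetical helper (Linux): derive a conservative mem= value from the
# installed RAM, leaving headroom for the OS and non-MRCC allocations.
import os

def safe_mem_gb(headroom_fraction=0.06):
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(total_bytes / 1024**3 * (1 - headroom_fraction))

print(f"mem={safe_mem_gb()}GB")  # e.g. prints mem=180GB on a 192GB node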

If I execute the job above on a small machine, things don't run the way I would normally expect. In particular, some of the coupled-cluster setup programs (goldstone and xmrcc) seem to be executed multiple times, which I generally don't see in less challenging calculations. The tasks I see executed are listed below, annotated with the relevant output at each step: the exact task each program claims to run and the memory requirements it reports:
Code:
minp
integ
scf
ovirt
goldstone => "Required memory (Mbytes):   82336.5"
xmrcc     => (CC(5)),   "Total:      373281.0590    373281.0590"
goldstone => "Required memory (Mbytes):   16160.8"
xmrcc     => (MRCC(5)), "Total:       12762.2385     13402.6918"
mrcc      => (CCSDTQP)

On a large machine, I see the following execution pattern:
Code:
minp
integ
scf
ovirt
goldstone => "Required memory (Mbytes):   82336.5"
xmrcc     => (CC(5)), "Total:     1115391.3260   1115391.3260"
mrcc      => (CCSDTQP)

If it would help, I can post the full output logs of each calculation here.

For the small calculation, the total memory requirement reported by the first execution of xmrcc (~373GB) clearly exceeds the 180GB that I've specified. In the past I have noticed similar behaviour (repeated executions of goldstone/xmrcc) when insufficient memory is available, but it has always been followed by a crash; I had assumed this was due to some kind of bad memory write and corruption of state. That in itself is not a problem -- I'd much rather a calculation fail fast when insufficient resources are available. Here, however, the small calculation actually proceeds into the main calculation loop and seems to converge to what looks like a sensible value. I haven't finished the run on the large machine yet, but so far (three iterations in) the energies, norms, etc. match the small run, which seems to confirm that everything was OK in the small-memory case.
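To make the comparison concrete, here are those numbers side by side (a sketch; I take 1 GB = 1000 MB to match the figures quoted above):
Code:
# The "Total" values from the two xmrcc passes on the small machine.
mem_limit_gb = 180.0                # mem=180GB from MINP
first_pass   = 373281.0590 / 1000   # ~373 GB for the CC(5) configuration
second_pass  = 13402.6918 / 1000    # ~13 GB for the MRCC(5) configuration

print(first_pass > mem_limit_gb)    # True  -> does not fit; setup reruns
print(second_pass < mem_limit_gb)   # True  -> fits; mrcc proceeds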

1) So what is happening in the small case? Is MRCC trying to reconfigure the problem somehow into a form that can fit within available memory? More generally, if this happens, can I "trust" any resulting output? (I ask this because, as mentioned, I have only ever seen "MRCC(n)" being printed rather than "CC(n)" in cases where memory is insufficient, and almost always followed by a crash.)

2) Why is the initial "total" produced by xmrcc different for the two cases? The small calculation claims that it needs ~373GB, but the large version claims about 1.1TB. One possibility I can see relates to the number of threads being used: the small calculations get 32 threads and the large get 96 -- and 1.1TB is approximately 3 * 373GB...
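A quick numerical check of that thread-scaling hypothesis, using only the two "Total" values quoted above:
Code:
# Does xmrcc's reported "Total" scale with the number of threads?
small_total_mb = 373281.0590    # CC(5) Total on the small machine (32 threads)
large_total_mb = 1115391.3260   # CC(5) Total on the large machine (96 threads)

memory_ratio = large_total_mb / small_total_mb   # ~2.99
thread_ratio = 96 / 32                           # 3.00

print(f"memory ratio {memory_ratio:.2f} vs thread ratio {thread_ratio:.2f}")
# They agree to within ~0.4%, consistent with memory scaling linearly
# with the thread count.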


2 years 2 months ago #1201 by kallay
Replied by kallay on topic CC(n) behaviour when memory is insufficient
Please note that the "Required memory" printed by goldstone is just a rough estimate; you can ignore it. The real memory consumption is calculated by xmrcc.
1) Yes, mrcc will attempt to reconfigure the orbital spaces so that the calculation fits into the available memory. You can trust the results. The program should not crash; please install the patches, which should solve the problem.
2) The memory required for the calculation depends on the number of threads used. Thus, with a larger number of threads, you need more memory.

Best regards,
Mihaly Kallay


2 years 2 months ago #1202 by MXvo5e35
Replied by MXvo5e35 on topic CC(n) behaviour when memory is insufficient
Thanks for the quick answer.

As far as I know, I have all of the currently available patches installed... I think the crashes I've seen stem from cases where the reconfigurations performed are still insufficient to bring the problem down to a manageable memory footprint. In that case, what would the expected behaviour be?

Also, just out of curiosity, is there any information available (in a paper, perhaps?) about what kind of reconfiguration is performed?

