Restarts during perturbative corrections?

5 years 2 months ago #678 by Nike
Dear all,
I understand that if a job crashes during the perturbative corrections, restarting with rest=1 makes MRCC go back to the iterative step for 1 iteration and then start the perturbative corrections all over again from the beginning (spin case 1).

I have just had a job crash on "spin case 19" after about 22 days (32000 minutes) of calculations to do spin cases 1-18:
Code:
Perturbative corrections are calculated...
======================================================================
Spin case 1 Alpha: 2 Beta: 4
Number of excitations: 9674854
CPU time [min]: 18814.564 Wall time [min]: 9849.788
======================================================================
Spin case 2 Alpha: 3 Beta: 3
Number of excitations: 53660880
CPU time [min]: 20286.653 Wall time [min]: 10593.980
======================================================================
Spin case 3 Alpha: 4 Beta: 2
Number of excitations: 29030306
CPU time [min]: 21454.478 Wall time [min]: 11180.785
======================================================================
Spin case 4 Alpha: 5 Beta: 1
Number of excitations: 1377072
CPU time [min]: 21644.188 Wall time [min]: 11276.145
======================================================================
Spin case 5 Alpha: 2 Beta: 4
Number of excitations: 18634890
CPU time [min]: 22963.355 Wall time [min]: 11946.315
======================================================================
Spin case 6 Alpha: 2 Beta: 4
Number of excitations: 41794692
CPU time [min]: 25978.119 Wall time [min]: 13466.476
======================================================================
Spin case 7 Alpha: 3 Beta: 3
Number of excitations: 160982640
CPU time [min]: 31526.117 Wall time [min]: 16290.082
======================================================================
Spin case 8 Alpha: 3 Beta: 3
Number of excitations: 167173416
CPU time [min]: 36412.186 Wall time [min]: 18741.087
======================================================================
Spin case 9 Alpha: 4 Beta: 2
Number of excitations: 120744756
CPU time [min]: 41132.122 Wall time [min]: 21124.516
======================================================================
Spin case 10 Alpha: 4 Beta: 2
Number of excitations: 58067420
CPU time [min]: 43282.776 Wall time [min]: 22209.112
======================================================================
Spin case 11 Alpha: 5 Beta: 1
Number of excitations: 7463460
CPU time [min]: 44156.062 Wall time [min]: 22646.983
======================================================================
Spin case 12 Alpha: 5 Beta: 1
Number of excitations: 1327932
CPU time [min]: 44337.005 Wall time [min]: 22737.989
======================================================================
Spin case 13 Alpha: 2 Beta: 4
Number of excitations: 8318486
CPU time [min]: 44885.268 Wall time [min]: 23013.016
======================================================================
Spin case 14 Alpha: 2 Beta: 4
Number of excitations: 80495544
CPU time [min]: 51382.765 Wall time [min]: 26288.737
======================================================================
Spin case 15 Alpha: 2 Beta: 4
Number of excitations: 62686156
CPU time [min]: 55646.355 Wall time [min]: 28444.712
======================================================================
Spin case 16 Alpha: 3 Beta: 3
Number of excitations: 149058000
CPU time [min]: 60427.369 Wall time [min]: 30840.410
======================================================================
Spin case 17 Alpha: 3 Beta: 3
Number of excitations: 501528416
CPU time [min]: 77089.610 Wall time [min]: 39213.258
======================================================================
Spin case 18 Alpha: 3 Beta: 3
Number of excitations: 160979568
CPU time [min]: 82699.806 Wall time [min]: 42054.491
======================================================================
Spin case 19 Alpha: 4 Beta: 2
Number of excitations: 174126680

Fatal error in mrcc.
Program will stop.

************************ 2019-02-04 08:03:39 *************************
Error at the termination of mrcc.
**********************************************************************

I wonder if it's possible to make it so that we can save the results from spin cases 1-18, and then when we restart, just start with spin-case 19 immediately?

Also, I wonder if the energy contributions for each spin-case can be printed when the spin-case finishes?

The reason is that I don't want to wait another 22 days to redo spin cases 1-18, and when there are only 19 spin cases in total, an estimate based on spin cases 1-18 seems good enough. (Ideally we'd do all 19 spin cases, of course, but if that means restarting from spin case 1 and waiting 22 days, we would rather use the energy estimate from the first 18 of the 19 spin cases.)

Finally, is there an equation we can refer to in order to understand the different spin cases? I have looked at:

M. Kállay and J. Gauss (2008) Approximate treatment of higher excitations in coupled-cluster theory. II. Extension to general single-determinant reference functions and improved approaches for the canonical Hartree–Fock case. J. Chem. Phys. 129, 144101,

in the context of the notation in:

M. Kállay and J. Gauss (2005) Approximate treatment of higher excitations in coupled-cluster theory. J. Chem. Phys. 123, 214105,

and:

Y. J. Bomble, J. F. Stanton, M. Kállay and J. Gauss (2005) Coupled cluster methods including non-iterative approximate quadruple excitation corrections. J. Chem. Phys. 123, 054101.

I understand that usually there are only 3 spin cases, and that the number grows when there isn't enough RAM to handle the 3 spin cases in full, but what are the new spin cases and how are they designed?

With best wishes!
Nike

5 years 2 months ago #679 by kallay
Dear Nike,
Unfortunately, it is not possible to restart the perturbative corrections.

Best regards,
Mihaly Kallay

5 years 1 month ago #681 by Nike
Dear Mihaly,
Is it at least possible to print out the energy contribution from each spin case? I have a case where 125 of 127 spin cases completed before the crash. The calculation took about 100 days. I have started the calculation again, but if it crashes after 125 or 126 spin cases, I'd like to be able to add up the energy contributions from spin cases 1-125 as an "estimate" of what I would get with all 127 spin cases.

With best wishes,
Nike

5 years 1 month ago #685 by kallay
Dear Nike,
Please edit the file pert.f, print out the variables corr1 and corr2 after line 601, and recompile the code. This will write out the cumulative perturbative corrections.

Best regards,
Mihaly Kallay

5 years 2 weeks ago #687 by Nike
Dear Mihaly,
Thanks for the suggestion.
I have added the two WRITE statements below:
Code:
            corr1=corr1+fct*sum1
            corr2=corr2+fct*sum2
          else
            corr2=corr2+sum2
          endif
        endif
        write(iout,"(' corr1 contribution: ',f18.12)") corr1
        write(iout,"(' corr2 contribution: ',f18.12)") corr2
      enddo !while

and re-compiled successfully but did not see any new output when I ran the program.
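
Because corr1 and corr2 are running totals, the contribution of each pass through the loop is the difference between consecutive printed values. Here is a minimal Python sketch of that bookkeeping, for when such lines do show up in the output. It assumes the lines look exactly as the two format strings above would write them. The function names and the sample numbers are hypothetical and do not come from a real MRCC run.

```python
import re

# Matches lines as the two WRITE statements above would print them, e.g.
# " corr1 contribution:    -0.001000000000" (a leading zero is optional).
CORR_LINE = re.compile(r"(corr[12]) contribution:\s*(-?\d*\.\d+)")


def cumulative_corrections(log_text):
    """Collect the printed running totals of corr1 and corr2, in order."""
    totals = {"corr1": [], "corr2": []}
    for name, value in CORR_LINE.findall(log_text):
        totals[name].append(float(value))
    return totals


def increments(values):
    """Differences between consecutive running totals (first value kept as is)."""
    return [cur - prev for prev, cur in zip([0.0] + values[:-1], values)]


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = (
        " corr1 contribution:    -0.001000000000\n"
        " corr2 contribution:    -0.002000000000\n"
        " corr1 contribution:    -0.001500000000\n"
        " corr2 contribution:    -0.002750000000\n"
    )
    totals = cumulative_corrections(sample)
    print("last corr1 total:", totals["corr1"][-1])
    print("corr2 increments:", increments(totals["corr2"]))
```

The last entry of each list is the estimate accumulated up to the crash. Whether the loop runs once per spin case or more often is not stated in this thread, so the increments should be read as per-pass values.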


Also, I can see why one would want to parallelize within each spin case if there are only 1 or 2 spin cases, but when the number of spin cases is about the same as (or greater than) the number of cores, should we parallelize over the spin cases rather than within them? Here is an example with 2 cores:
Code:
Perturbative corrections are calculated...
======================================================================
Spin case 1 Alpha: 3 Beta: 2
Number of excitations: 464494585
CPU time [min]: 10291.275 Wall time [min]: 5419.409
======================================================================
Spin case 2 Alpha: 4 Beta: 1
Number of excitations: 113869560
CPU time [min]: 12329.565 Wall time [min]: 6445.038
======================================================================
Spin case 3 Alpha: 3 Beta: 2
Number of excitations: 1117504190
CPU time [min]: 24282.021 Wall time [min]: 12538.886
======================================================================
Spin case 4 Alpha: 3 Beta: 2
Number of excitations: 755944353
CPU time [min]: 31961.789 Wall time [min]: 16453.636
======================================================================
Spin case 5 Alpha: 4 Beta: 1
Number of excitations: 368830759
CPU time [min]: 39072.617 Wall time [min]: 20057.350
======================================================================
Spin case 6 Alpha: 4 Beta: 1
Number of excitations: 91754365
CPU time [min]: 40736.924 Wall time [min]: 20900.330
======================================================================
Spin case 7 Alpha: 3 Beta: 2
Number of excitations: 876483136
CPU time [min]: 49811.214 Wall time [min]: 25510.303
======================================================================
Spin case 8 Alpha: 3 Beta: 2
Number of excitations: 1818685774
CPU time [min]: 69571.559 Wall time [min]: 35771.351
======================================================================
Spin case 9 Alpha: 3 Beta: 2
Number of excitations: 300908380
CPU time [min]: 72551.242 Wall time [min]: 37269.068
======================================================================
Spin case 10 Alpha: 4 Beta: 1
Number of excitations: 438212395
CPU time [min]: 80907.590 Wall time [min]: 41475.333
======================================================================
Spin case 11 Alpha: 4 Beta: 1
Number of excitations: 297219400

Fatal error in mrcc.
Program will stop.

************************ 2019-03-18 18:24:40 *************************
Error at the termination of mrcc.
**********************************************************************

The speed-up we get, dividing CPU time by wall time, is only about 1.9. Wouldn't it be a perfect 2x speed-up if we did each spin case on a different core (perhaps load-balanced so that each core handles roughly the same total number of excitations)?
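
To make the load-balancing idea concrete, here is a minimal Python sketch that assigns spin cases to cores with a greedy "largest first" heuristic. This only illustrates the proposal and is not an existing MRCC feature. It assumes that the cost of a spin case is proportional to its number of excitations, which the timings in the log only roughly support. The function name `balance` is hypothetical. The excitation counts are the 11 spin cases printed in the 2-core log before the crash.

```python
import heapq


def balance(excitations, n_cores):
    """Greedy 'largest first' assignment of spin cases to cores.

    excitations: dict mapping spin case number -> number of excitations
    Returns one (total_excitations, [spin cases]) tuple per core.
    """
    # Heap entries are (current load, core index, assigned spin cases);
    # the unique core index breaks ties before the list is ever compared.
    heap = [(0, core, []) for core in range(n_cores)]
    heapq.heapify(heap)
    # Hand out the biggest spin cases first, always to the least-loaded core.
    for case, count in sorted(excitations.items(), key=lambda kv: -kv[1]):
        load, core, cases = heapq.heappop(heap)
        cases.append(case)
        heapq.heappush(heap, (load + count, core, cases))
    return [(load, cases) for load, _, cases in sorted(heap, key=lambda t: t[1])]


if __name__ == "__main__":
    # Excitation counts of spin cases 1-11 from the 2-core log above.
    counts = {
        1: 464494585, 2: 113869560, 3: 1117504190, 4: 755944353,
        5: 368830759, 6: 91754365, 7: 876483136, 8: 1818685774,
        9: 300908380, 10: 438212395, 11: 297219400,
    }
    for core, (load, cases) in enumerate(balance(counts, 2)):
        print(f"core {core}: {load} excitations, spin cases {sorted(cases)}")
```

A scheme like this would also need every core to hold its own spin case in memory at the same time, which matters if the spin cases were split up because of memory limits in the first place.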

The situation is much more pronounced with more cores. Here I used 11 cores:

Code:
Perturbative corrections are calculated...
======================================================================
Spin case 1 Alpha: 3 Beta: 2
Number of excitations: 1539990258
CPU time [min]:110395.983 Wall time [min]: 14322.505
======================================================================
Spin case 2 Alpha: 3 Beta: 2
Number of excitations: 1540225714
CPU time [min]:137626.477 Wall time [min]: 16946.664
======================================================================
Spin case 3 Alpha: 4 Beta: 1
Number of excitations: 759732452
CPU time [min]:153466.773 Wall time [min]: 18457.198
======================================================================
Spin case 4 Alpha: 3 Beta: 2
Number of excitations: 3901717692
CPU time [min]:226253.656 Wall time [min]: 25559.211
======================================================================
Spin case 5 Alpha: 3 Beta: 2
Number of excitations: 2625110920
CPU time [min]:273343.943 Wall time [min]: 30067.927
======================================================================
Spin case 6 Alpha: 3 Beta: 2
Number of excitations: 3902968798
CPU time [min]:342327.476 Wall time [min]: 37179.708
======================================================================
Spin case 7 Alpha: 3 Beta: 2
Number of excitations: 2625652424
CPU time [min]:388852.180 Wall time [min]: 41649.681
======================================================================
Spin case 8 Alpha: 4 Beta: 1
Number of excitations: 2583772872
CPU time [min]:443175.096 Wall time [min]: 46691.254
======================================================================
Spin case 9 Alpha: 4 Beta: 1
Number of excitations: 643242402
CPU time [min]:456705.232 Wall time [min]: 47999.934
======================================================================
Spin case 10 Alpha: 3 Beta: 2
Number of excitations: 3247203320

Fatal error in mrcc.
Program will stop.

************************ 2019-01-28 11:48:09 *************************
Error at the termination of mrcc.
**********************************************************************

We can see that the ratio of CPU time to wall time for the first spin case corresponds to only a 7.7x speed-up (110395.983/14322.505 = 7.7), whereas if we did 11 spin cases on 11 different cores, we would get a perfect 11x speed-up. In this case there were 22 spin cases in total, so we could easily have had each core doing a separate spin case.
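
The ratio can be read off the log automatically. Here is a minimal Python sketch that reports, for each finished spin case, both the ratio of the printed times and the ratio of the increase since the previous spin case. It assumes that the printed CPU and wall times are cumulative since the start of the run, which their steady growth suggests but which is not confirmed in this thread. The function name `speedups` is hypothetical.

```python
import re

# One "Spin case" block; the timing part is optional because the last
# spin case in a crashed run has no timing line.
SPIN_CASE = re.compile(
    r"Spin case\s+(\d+)\s+Alpha:\s*(\d+)\s+Beta:\s*(\d+)\s+"
    r"Number of excitations:\s*(\d+)"
    r"(?:\s+CPU time \[min\]:\s*([\d.]+)\s+Wall time \[min\]:\s*([\d.]+))?"
)


def speedups(log_text):
    """Return (spin case, cumulative CPU/wall, incremental CPU/wall or None)."""
    rows = []
    prev = None
    for m in SPIN_CASE.finditer(log_text):
        if m.group(5) is None:
            continue  # this spin case never finished
        case, cpu, wall = int(m.group(1)), float(m.group(5)), float(m.group(6))
        incremental = None
        if prev is not None:
            incremental = (cpu - prev[0]) / (wall - prev[1])
        rows.append((case, cpu / wall, incremental))
        prev = (cpu, wall)
    return rows


if __name__ == "__main__":
    # First two spin cases of the 11-core log above, plus an unfinished one.
    sample = (
        "Spin case 1 Alpha: 3 Beta: 2 Number of excitations: 1539990258 "
        "CPU time [min]:110395.983 Wall time [min]: 14322.505 ===== "
        "Spin case 2 Alpha: 3 Beta: 2 Number of excitations: 1540225714 "
        "CPU time [min]:137626.477 Wall time [min]: 16946.664 ===== "
        "Spin case 3 Alpha: 4 Beta: 1 Number of excitations: 759732452 "
        "Fatal error in mrcc."
    )
    for case, cumulative, incremental in speedups(sample):
        print(case, round(cumulative, 2), incremental)
```

Under that assumption, the 7.7 after spin case 1 also contains everything that ran before the perturbative step. The increase from spin case 1 to spin case 2 gives a ratio of about 10.4 on the 11 cores.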

With best wishes!
Nike

5 years 2 weeks ago #688 by kallay
Dear Nike,
What sort of calculation are you running? I have tested it for CCSDT(Q), and it works correctly.
Concerning the spin cases: please note that the division of the spin cases is related to the memory requirements and not to parallelization.

Best regards,
Mihaly Kallay
