Parallel errors in GW calculations

Concerns issues with computing quasiparticle corrections to the DFT eigenvalues - i.e., the self-energy within the GW approximation (-g n), or considering the Hartree-Fock exchange only (-x)

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano

Guo_BIT
Posts: 36
Joined: Tue Jun 06, 2023 2:55 am

Parallel errors in GW calculations

Post by Guo_BIT » Mon Oct 14, 2024 3:37 am

Dear Developers,
I am currently encountering parallel issues when performing G0W0 calculations. The last part of the log file is as follows:

Code: Select all

 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for K(bz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for Q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for G-vectors on 1 CPU]
 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for K-q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <25s> P84-n37: [LA@Response_G_space_and_IO] PARALLEL linear algebra uses a 6x6 SLK grid (36 cpu)
 <25s> P84-n37: [PARALLEL Response_G_space_and_IO for K(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <25s> P84-n37: [PARALLEL Response_G_space_and_IO for CON bands on 72 CPU] Loaded/Total (Percentual):5/328(2%)
 <25s> P84-n37: [PARALLEL Response_G_space_and_IO for VAL bands on 3 CPU] Loaded/Total (Percentual):24/72(33%)
 <25s> P84-n37: [PARALLEL distribution for RL vectors(X) on 1 CPU] Loaded/Total (Percentual):540225/540225(100%)
 <33s> P84-n37: [MEMORY] Alloc WF%c( 8.984925 [Gb]) TOTAL:  11.47166 [Gb] (traced)  55.00800 [Mb] (memstat)
The report file (r-XXX_gw0_XXX) simply stops at this point, without any error message. :|
I tried modifying the job script, like:

Code: Select all

#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=36
and

Code: Select all

DIP_CPU= "1 72 3"       # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v"         # [PARALLEL] CPUs roles (k,c,v)
DIP_Threads=  0            # [OPENMP/X] Number of threads for dipoles
X_and_IO_CPU= "1 1 1 72 3"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
X_and_IO_nCPU_LinAlg_INV= 216   # [PARALLEL] CPUs for Linear Algebra
X_Threads=  0              # [OPENMP/X] Number of threads for response functions
SE_CPU= " 1 216 1"       # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"         # [PARALLEL] CPUs roles (q,qp,b)
SE_Threads=  0
However, the issue was not resolved. I would therefore be grateful for your assistance and look forward to your response.

Sincerely,
Jingda Guo
Jingda Guo
Beijing Institute of Technology

Daniele Varsano
Posts: 4047
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: Parallel errors in GW calculations

Post by Daniele Varsano » Tue Oct 15, 2024 9:34 am

Dear Jingda,

most probably it is a memory issue. You can try to distribute the memory more efficiently by assigning the CPUs in a more balanced way, but to guide you on this I would need to look at the report file, or at least to know the number of occupied bands. In general, you can also try to use fewer CPUs per node in order to have more memory per task.
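As a minimal sketch of the "fewer CPUs per node" idea (the 6-node, 36-core layout is only assumed from the original job script and the 216 MPI tasks; adapt it to your cluster), one can halve the tasks per node and give the spare cores to OpenMP:

Code: Select all

#SBATCH --nodes=6               # assumed node count (216 tasks / 36 tasks per node)
#SBATCH --ntasks-per-node=18    # half the MPI tasks per node -> roughly twice the memory per task
#SBATCH --cpus-per-task=2       # give the idle cores to OpenMP threads
export OMP_NUM_THREADS=2
Note that the total number of MPI tasks then drops to 108, so the products of the entries in the X_and_IO_CPU and SE_CPU strings have to be adjusted accordingly.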

Then, please also note that this setting is quite unusual:

Code: Select all

SE_CPU= " 1 216 1"       # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"         # [PARALLEL] CPUs roles (q,qp,b)
SE_Threads=  0
How many QP states do you intend to calculate in this run? Assigning CPUs to "b" distributes the memory, and it is recommended. Anyway, the code did not reach this step, as the problem you are facing is related to the screening part of the calculation.
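For instance, keeping the 216 MPI tasks of the original run, a band-distributed self-energy setting would look like the sketch below (the exact split is an assumption, to be adapted to your resources):

Code: Select all

SE_CPU= "1 1 216"        # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"       # [PARALLEL] CPUs roles (q,qp,b)
SE_Threads=  0
Distributing over "b" splits the band summation entering the correlation self-energy, and with it the corresponding wavefunction memory, across the tasks.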

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Guo_BIT
Posts: 36
Joined: Tue Jun 06, 2023 2:55 am

Re: Parallel errors in GW calculations

Post by Guo_BIT » Tue Oct 15, 2024 12:52 pm

Dear Daniele:
Thank you for your response. Below is the relevant part of our r_setup:

Code: Select all

  [X] === Gaps and Widths ===
  [X] Conduction Band Min                           :  0.079528 [eV]
  [X] Valence Band Max                              :  0.000000 [eV]
  [X] Filled Bands                                  :   72
  [X] Empty Bands                                   :    73   500
There are six atoms in each unit cell. In the QE calculations, we considered SOC and turned off symmetry, resulting in a total of 576 k-points (24x24x1).

Additionally, we are only calculating the QP correction for a single transition, so indeed this parallelization approach was not appropriate:

Code: Select all

%QPkrange                        # [GW] QP generalized Kpoint/Band indices
289|289|72|73|
Jingda Guo
Beijing Institute of Technology

Daniele Varsano
Posts: 4047
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: Parallel errors in GW calculations

Post by Daniele Varsano » Wed Oct 16, 2024 6:57 am

Dear Jingda,

given your electronic structure, the following parallel setup should distribute the memory better:

Code: Select all

X_and_IO_CPU= "1 1 1 36 6"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
You can check whether this is enough to overcome the memory problem; otherwise, try using fewer CPUs per node.

Best,

Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Guo_BIT
Posts: 36
Joined: Tue Jun 06, 2023 2:55 am

Re: Parallel errors in GW calculations

Post by Guo_BIT » Wed Oct 16, 2024 7:48 am

Thank you very much for your suggestion. :D
After removing the other parallel settings and setting

Code: Select all

X_and_IO_CPU= "1 1 1 24 6"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
the GW calculation has now moved on to the next step.
Jingda Guo
Beijing Institute of Technology

Guo_BIT
Posts: 36
Joined: Tue Jun 06, 2023 2:55 am

Re: Parallel errors in GW calculations

Post by Guo_BIT » Fri Oct 18, 2024 3:43 am

Dear Daniele:

I apologize for bothering you again, but we have encountered the same issue while performing the BSE calculation. Here are the details:
After successfully completing the GW calculation, I attempted to run the BSE calculation using:

Code: Select all

yambo -J 2D_WR_WC -F yambo_BSE.in -r -o b -X p -y d -k sex -V all
However, I encountered the following error in the LOG file:

Code: Select all

P1-n36: [ERROR] STOP signal received while in[08] Dynamic Dielectric Matrix (PPA) 
 P1-n36: [ERROR] Trying to overwrite variable X_RL_vecs in ./2D_WR_WC//ndb.pp with wrong dimensions
Based on a topic I found in the Yambo community, I removed the ndb.pp* files (no GW-related calculations were performed afterwards). I then reran the BSE calculation, and a similar error occurred once again:

Code: Select all

 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for K(bz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for Q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for G-vectors on 72 CPU]
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for K-q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [LA@Response_G_space_and_IO] PARALLEL linear algebra uses a 6x6 SLK grid (36 cpu)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for K(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for CON bands on 1 CPU] Loaded/Total (Percentual):328/328(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for VAL bands on 1 CPU] Loaded/Total (Percentual):72/72(100%)
 <13s> P71-n41: [WARNING] Allocation attempt of X_par_lower_triangle%blc of zero size.
 <13s> P71-n41: [WARNING] Allocation attempt of X_par%blc of zero size.
 <13s> P71-n41: [PARALLEL distribution for RL vectors(X) on 72 CPU] Loaded/Total (Percentual):0/863041(0%)
 <16s> P71-n41: [MEMORY] Alloc WF%c( 123.9300 [Gb]) TOTAL:  126.4085 [Gb] (traced)  54.88000 [Mb] (memstat)
I attempted some modifications, such as setting `PAR_def_mode = "KQmemory"`, but the error still persists. :cry: I’m concerned that there might be an issue with my calculation workflow.

Best wishes,
Jingda Guo
Jingda Guo
Beijing Institute of Technology

Daniele Varsano
Posts: 4047
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: Parallel errors in GW calculations

Post by Daniele Varsano » Fri Oct 18, 2024 7:39 am

Dear Jingda,

from the log snapshot, it seems you are not parallelizing over bands ("c" and "v"), so the wavefunction memory is not distributed and you end up asking for about 126 Gb, which is unaffordable.
Moreover, if you are now interested in the BSE, you can calculate the static screening only (-X s); this will save you time, but in any case you still need to distribute the memory required for the wavefunctions.
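As a sketch only (the values are taken from the X_and_IO splitting that already worked earlier in this thread, and the balance may need tuning), the BSE run could be restarted with the static screening and the band roles populated:

Code: Select all

yambo -J 2D_WR_WC -F yambo_BSE.in -r -o b -X s -y d -k sex -V all
together with, in yambo_BSE.in:

Code: Select all

X_and_IO_CPU= "1 1 1 24 6"      # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"     # [PARALLEL] CPUs roles (q,g,k,c,v)
Remember that the product of the entries must match the total number of MPI tasks of the job; if your input also exposes BS_CPU/BS_ROLEs (version dependent), the same idea of distributing over bands/transitions applies to the BSE kernel.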

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/

Guo_BIT
Posts: 36
Joined: Tue Jun 06, 2023 2:55 am

Re: Parallel errors in GW calculations

Post by Guo_BIT » Fri Oct 18, 2024 8:11 am

Dear Daniele:

Thank you very much for your suggestion. The calculation is now proceeding normally :D

Besides, I am still curious whether there is any issue with the way I handled

Code: Select all

P1-n36: [ERROR] STOP signal received while in[08] Dynamic Dielectric Matrix (PPA) 
P1-n36: [ERROR] Trying to overwrite variable X_RL_vecs in ./2D_WR_WC//ndb.pp with wrong dimensions
that is, whether I need to rerun any calculation to regenerate the ndb.pp* databases.
Jingda Guo
Beijing Institute of Technology

Daniele Varsano
Posts: 4047
Joined: Tue Mar 17, 2009 2:23 pm
Contact:

Re: Parallel errors in GW calculations

Post by Daniele Varsano » Mon Oct 21, 2024 7:55 am

Dear Jingda,

I do not know exactly what went wrong here; looking at the report file will probably give you a hint. Is it possible that you asked for a number of G-vectors in the self-energy that is not compatible with the ones stored in ndb.pp?
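If that is the case, a hypothetical check (assuming the PPA block size in your input is controlled by NGsBlkXp; the value below is only an example) is to compare the block size requested in the later run with the one used when ndb.pp was generated:

Code: Select all

NGsBlkXp= 4              Ry    # [Xp] Response block size
# this must match the value used when ndb.pp was written;
# if it differs, remove ndb.pp* and recompute the screening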

Best,
Daniele
Dr. Daniele Varsano
S3-CNR Institute of Nanoscience and MaX Center, Italy
MaX - Materials design at the Exascale
http://www.nano.cnr.it
http://www.max-centre.eu/
