
Parallel errors in GW calculations

Posted: Mon Oct 14, 2024 3:37 am
by Guo_BIT
Dear Developers,
I am currently encountering parallel issues when performing G0W0 calculations. The last part of the log file is as follows:

Code: Select all

 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for K(bz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for Q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for G-vectors on 1 CPU]
 <24s> P84-n37: [PARALLEL Response_G_space_and_IO for K-q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <25s> P84-n37: [LA@Response_G_space_and_IO] PARALLEL linear algebra uses a 6x6 SLK grid (36 cpu)
 <25s> P84-n37: [PARALLEL Response_G_space_and_IO for K(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <25s> P84-n37: [PARALLEL Response_G_space_and_IO for CON bands on 72 CPU] Loaded/Total (Percentual):5/328(2%)
 <25s> P84-n37: [PARALLEL Response_G_space_and_IO for VAL bands on 3 CPU] Loaded/Total (Percentual):24/72(33%)
 <25s> P84-n37: [PARALLEL distribution for RL vectors(X) on 1 CPU] Loaded/Total (Percentual):540225/540225(100%)
 <33s> P84-n37: [MEMORY] Alloc WF%c( 8.984925 [Gb]) TOTAL:  11.47166 [Gb] (traced)  55.00800 [Mb] (memstat)
In the report file (r-XXX_gw0_XXX), the output simply stops without any error message. :|
I tried modifying the job script, like:

Code: Select all

#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=36
and

Code: Select all

DIP_CPU= "1 72 3"       # [PARALLEL] CPUs for each role
DIP_ROLEs= "k c v"         # [PARALLEL] CPUs roles (k,c,v)
DIP_Threads=  0            # [OPENMP/X] Number of threads for dipoles
X_and_IO_CPU= "1 1 1 72 3"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
X_and_IO_nCPU_LinAlg_INV= 216   # [PARALLEL] CPUs for Linear Algebra
X_Threads=  0              # [OPENMP/X] Number of threads for response functions
SE_CPU= " 1 216 1"       # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"         # [PARALLEL] CPUs roles (q,qp,b)
SE_Threads=  0
However, the issue was not resolved, so I would be grateful for your assistance and look forward to your response.

Sincerely,
Jingda Guo

Re: Parallel errors in GW calculations

Posted: Tue Oct 15, 2024 9:34 am
by Daniele Varsano
Dear Jingda,

most probably it is a memory issue. You can try to distribute the memory more efficiently by assigning the CPUs in a more balanced way, but to guide you on this I would need to look at the report file, or at least to know the number of occupied states. In general, you can also try to use fewer CPUs per node in order to have more memory per task.
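For instance (just a sketch, assuming 36-core nodes as in your #SBATCH lines and the 216 MPI tasks implied by your input), you can half-populate the nodes:

Code: Select all

#SBATCH --nodes=12             # twice the nodes...
#SBATCH --tasks-per-node=18    # ...with half the MPI tasks on each
#SBATCH --cpus-per-task=1
This keeps the total of 216 tasks unchanged while doubling the memory available to each of them.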

Then, please also note that this setting is quite unusual:

Code: Select all

SE_CPU= " 1 216 1"       # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"         # [PARALLEL] CPUs roles (q,qp,b)
SE_Threads=  0
How many QP corrections are you planning to compute in this run? Assigning CPUs to the "b" role distributes the memory over the bands summation, and it is the suggested choice. In any case, the code did not reach this step, as the problem you are facing is related to the screening part of the calculation.
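If you only need a few QP states, a sketch of a more effective choice (adapt the numbers to your run) would be:

Code: Select all

SE_CPU= "1 1 216"        # [PARALLEL] CPUs for each role
SE_ROLEs= "q qp b"       # [PARALLEL] CPUs roles (q,qp,b)
so that the sum over bands, and with it the wavefunction memory, is distributed over all the tasks.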

Best,
Daniele

Re: Parallel errors in GW calculations

Posted: Tue Oct 15, 2024 12:52 pm
by Guo_BIT
Dear Daniele:
Thank you for your response. Below is the relevant part of our r_setup report:

Code: Select all

  [X] === Gaps and Widths ===
  [X] Conduction Band Min                           :  0.079528 [eV]
  [X] Valence Band Max                              :  0.000000 [eV]
  [X] Filled Bands                                  :   72
  [X] Empty Bands                                   :    73   500
There are six atoms in each unit cell. In the QE calculation, we included SOC and turned off symmetry, resulting in a total of 576 k-points (24x24x1).

Additionally, we are only computing the QP correction for a single transition, so that parallelization over "qp" was indeed not appropriate:

Code: Select all

%QPkrange                        # [GW] QP generalized Kpoint/Band indices
289|289|72|73|
%

Re: Parallel errors in GW calculations

Posted: Wed Oct 16, 2024 6:57 am
by Daniele Varsano
Dear Jingda,

given your electronic structure, the following parallel setup should distribute the memory better:

Code: Select all

X_and_IO_CPU= "1 1 1 36 6"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
You can check whether this is enough to overcome the memory problem; otherwise, try to use fewer CPUs per node. Note that the product of the role entries (1x1x1x36x6 = 216) must match the total number of MPI tasks of the run.

Best,

Daniele

Re: Parallel errors in GW calculations

Posted: Wed Oct 16, 2024 7:48 am
by Guo_BIT
Thank you very much for your suggestion. :D
After removing the other parallel settings and setting

Code: Select all

X_and_IO_CPU= "1 1 1 24 6"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
the GW calculation has now moved on to the next step.

Re: Parallel errors in GW calculations

Posted: Fri Oct 18, 2024 3:43 am
by Guo_BIT
Dear Daniele:

I apologize for bothering you again, but we have encountered the same issue while performing the BSE calculation. Here are the details:
After successfully completing the GW calculation, I attempted to run the BSE calculation using:

Code: Select all

yambo -J 2D_WR_WC -F yambo_BSE.in -r -o b -X p -y d -k sex -V all
However, I encountered the following error in the LOG file:

Code: Select all

P1-n36: [ERROR] STOP signal received while in[08] Dynamic Dielectric Matrix (PPA) 
 P1-n36: [ERROR] Trying to overwrite variable X_RL_vecs in ./2D_WR_WC//ndb.pp with wrong dimensions
Based on a topic I found on the Yambo forum, I removed the ndb.pp* files (and no GW-related calculations were performed afterwards). After that, I reran the BSE calculation, and a similar error occurred once again:

Code: Select all

 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for K(bz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for Q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for G-vectors on 72 CPU]
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for K-q(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [LA@Response_G_space_and_IO] PARALLEL linear algebra uses a 6x6 SLK grid (36 cpu)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for K(ibz) on 1 CPU] Loaded/Total (Percentual):576/576(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for CON bands on 1 CPU] Loaded/Total (Percentual):328/328(100%)
 <13s> P71-n41: [PARALLEL Response_G_space_and_IO for VAL bands on 1 CPU] Loaded/Total (Percentual):72/72(100%)
 <13s> P71-n41: [WARNING] Allocation attempt of X_par_lower_triangle%blc of zero size.
 <13s> P71-n41: [WARNING] Allocation attempt of X_par%blc of zero size.
 <13s> P71-n41: [PARALLEL distribution for RL vectors(X) on 72 CPU] Loaded/Total (Percentual):0/863041(0%)
 <16s> P71-n41: [MEMORY] Alloc WF%c( 123.9300 [Gb]) TOTAL:  126.4085 [Gb] (traced)  54.88000 [Mb] (memstat)
I attempted some modifications, such as setting `PAR_def_mode = "KQmemory"`, but the error still persists. :cry: I’m concerned that there might be an issue with my calculation workflow.

Best wishes,
Jingda Guo

Re: Parallel errors in GW calculations

Posted: Fri Oct 18, 2024 7:39 am
by Daniele Varsano
Dear Jingda,

from the log snapshot, it seems you are not parallelizing over the band roles "c" and "v", so the wavefunction memory is not distributed and you end up asking for 126 GB, which is unaffordable.
Moreover, if you are now interested in the BSE, you can compute the static screening only (-X s); this will save you time, but in any case you need to distribute the memory needed for the wavefunctions.
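For instance (a sketch only, assuming the 72 MPI tasks of your log and the same band distribution that worked for your GW run), you could use:

Code: Select all

yambo -J 2D_WR_WC -F yambo_BSE.in -r -o b -X s -y d -k sex -V all
together with, in yambo_BSE.in:

Code: Select all

X_and_IO_CPU= "1 1 1 24 3"     # [PARALLEL] CPUs for each role
X_and_IO_ROLEs= "q g k c v"       # [PARALLEL] CPUs roles (q,g,k,c,v)
Here 1x1x1x24x3 = 72 matches the total number of MPI tasks, and the "c" and "v" roles distribute the wavefunctions.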

Best,
Daniele

Re: Parallel errors in GW calculations

Posted: Fri Oct 18, 2024 8:11 am
by Guo_BIT
Dear Daniele:

Thank you very much for your suggestion. The calculation is now proceeding normally. :D

Besides, I am still curious whether there was any issue with my handling of

Code: Select all

P1-n36: [ERROR] STOP signal received while in[08] Dynamic Dielectric Matrix (PPA) 
P1-n36: [ERROR] Trying to overwrite variable X_RL_vecs in ./2D_WR_WC//ndb.pp with wrong dimensions
that is, do I need to rerun any calculation to regenerate the ndb.pp* files?

Re: Parallel errors in GW calculations

Posted: Mon Oct 21, 2024 7:55 am
by Daniele Varsano
Dear Jingda,

I do not know exactly what went wrong here; looking at the report file will probably give you a hint. Is it possible that in the self-energy you asked for a number of G vectors not compatible with the ones calculated in the ndb.pp?
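For example (an assumption on my part, as I cannot see your input): if the response block size of the screening, NGsBlkXp, differs from the value used when ndb.pp was first generated, the code refuses to overwrite the database with different dimensions. In that case, either set it back to the original value, or move away the old ndb.pp and let the code recompute it:

Code: Select all

NGsBlkXp= 10            Ry    # [Xp] Response block size (hypothetical value: it must match the one used to create ndb.pp)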

Best,
Daniele