too many communicators parallelization error

Deals with issues related to the computation of optical spectra by solving the Bethe-Salpeter equation.

Moderators: Davide Sangalli, andrea.ferretti, myrta gruning, andrea marini, Daniele Varsano

milesj
Posts: 30
Joined: Thu Jan 26, 2023 9:27 pm

too many communicators parallelization error

Post by milesj » Tue Mar 19, 2024 5:49 am

Hi all,

I keep running into the same error in my calculations for larger k-grids. Everything goes smoothly until the BSE kernel calculation is finished, and then the computation crashes with a "too many communicators" error and no further explanation before it can start the Haydock calculation (I believe this, or something similar, also happens when I try to run a SLEPc calculation, but I'm not sure whether the problems are related).

Code: Select all

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
PMPI_Comm_split(1294)...............: MPI_Comm_split(MPI_COMM_WORLD, color=2015, key=1, new_comm=0x1516916bb858) failed
PMPI_Comm_split(1276)...............:
MPIR_Comm_split_allgather(1005).....:
MPIR_Get_contextid_sparse_group(615): Too many communicators (0/2048 free on this process; ignore_id=0)
I've attached the setup file for my yambo compilation (yambo-5.1.1), as well as the crashed Slurm log file and the yambo LOG file. Sometimes when this issue arises I'm able to run just the Haydock step on a single node without any parallelization, but that's not possible when the memory required for the computation exceeds the RAM of the node I'm using. Sometimes it helps to run with only MPI parallelization across a few nodes and no OpenMP parallelization, but that also sometimes fails.
I'm not really sure how to approach this issue, so any advice is appreciated.

Best,
Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics

Davide Sangalli
Posts: 624
Joined: Tue May 29, 2012 4:49 pm
Location: Via Salaria Km 29.3, CP 10, 00016, Monterotondo Stazione, Italy
Contact:

Re: too many communicators parallelization error

Post by Davide Sangalli » Wed Mar 20, 2024 5:59 pm

Dear Miles,
since this happens after the calculation of the kernel, I suspect it has something to do with the solver.

Checking the code, I see this might be due to the MPI implementation in the Haydock solver.
Indeed, there is this piece of code, which is very likely causing the issue.

Code: Select all

     do i_g=1,BS_nT_grps                                                
       !                                                                
       if (.not.PAR_IND_T_Haydock%element_1D(i_g)) then                 
         local_key=-1                                                   
         PAR_COM_T_Haydock(i_g)%my_CHAIN=BS_nT_grps+1                   
       else                                                             
         !                                                              
         local_key = 1                                                  
         if (PAR_IND_T_groups%element_1D(i_g)) local_key = 0            
         !                                                              
         PAR_COM_T_Haydock(i_g)%n_CPU=PAR_COM_T_Haydock(i_g)%n_CPU+1    
         PAR_COM_T_Haydock(i_g)%my_CHAIN = i_g                          
         !                                                              
       endif                                                            
       !                                                                
       call CREATE_the_COMM(PAR_COM_WORLD%COMM,PAR_COM_T_Haydock(i_g),local_key)
       !                                                                
     enddo                                                              
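Each iteration of that loop ends in an MPI_Comm_split on the world communicator, i.e. one new communicator per transition group, and your error message shows the process running out of its pool of 2048 context ids. Just to illustrate the MPI limit (a minimal sketch, not Yambo code, assuming the split communicators are never freed), the same failure can be reproduced with:

Code: Select all

program comm_exhaust
  ! Illustration only: split MPI_COMM_WORLD once per "group" without ever
  ! calling MPI_Comm_free. With MPICH this aborts with "Too many
  ! communicators" once the per-process context-id pool (~2048) is used up.
  use mpi
  implicit none
  integer, parameter :: n_groups = 5000   ! hypothetical stand-in for BS_nT_grps
  integer :: ierr, my_rank, i_g, new_comm

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

  do i_g = 1, n_groups
    ! one new communicator per iteration; the missing MPI_Comm_free is
    ! what exhausts the context ids
    call MPI_Comm_split(MPI_COMM_WORLD, i_g, my_rank, new_comm, ierr)
  end do

  call MPI_Finalize(ierr)
end program comm_exhaust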
I'll try to get in touch with other developers to discuss how this could be solved.
For now the only suggestion I can give is to not distribute over the variable eh, but rather to use k and t, so as to minimize the number of groups (BS_nT_grps).
Set something like this in the input:

Code: Select all

BS_ROLEs= "k.eh.t"
BS_CPU="nk.1.nt"
with nk*nt = ncpu.
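For example, for a run on 24 MPI tasks a possible (purely illustrative) choice would be nk=4 and nt=6:

Code: Select all

BS_ROLEs= "k.eh.t"
BS_CPU= "4.1.6"
This way the eh index is not distributed at all and the number of groups stays as small as possible.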

Best,
D.
Davide Sangalli, PhD
CNR-ISM, Division of Ultrafast Processes in Materials (FLASHit) and MaX Centre
https://sites.google.com/view/davidesangalli
http://www.max-centre.eu/

milesj
Posts: 30
Joined: Thu Jan 26, 2023 9:27 pm

Re: too many communicators parallelization error

Post by milesj » Mon Apr 15, 2024 3:27 am

Hi Davide,

Sorry I haven't had a chance to try this until recently, but I've gotten the same error both with and without OpenMP parallelization:

Code: Select all

At line 82 of file /global/homes/m/milesj/my_modules/yambo_cpu/yambo-5.1.0/src/parallel/PARALLEL_get_user_structure.F
Fortran runtime error: Bad value during integer read
That's from the Slurm log file; here's the yambo log file:

Code: Select all

<---> P1: [01] MPI/OPENMP structure, Files & I/O Directories
 <---> P1-nid005208: MPI Cores-Threads   : 24(CPU)-1(threads)
 <---> P1-nid005208: MPI Cores-Threads   : BS(environment)-k.eh.t(CPUs)-4.1.6(ROLEs)
 <---> P1-nid005208: [02] CORE Variables Setup
 <---> P1-nid005208: [02.01] Unit cells
 <---> P1-nid005208: [02.02] Symmetries
 <---> P1-nid005208: [02.03] Reciprocal space
 <---> P1-nid005208: [02.04] K-grid lattice
 <---> P1-nid005208: Using the new bz sampling setup
 <---> P1-nid005208: Grid dimensions      :  12  12  12
 <---> P1-nid005208: [02.05] Energies & Occupations
 <06s> P1-nid005208: [03] Transferred momenta grid and indexing
 <06s> P1-nid005208: [MEMORY] Alloc bare_qpg( 1.800992 [Gb]) TOTAL:  1.957100 [Gb] (traced)  2.093616 [Gb] (memstat)
 <13s> P1-nid005208: [04] Dipoles
 <13s> P1-nid005208: DIPOLES parallel ENVIRONMENT is incomplete. Switching to defaults
 <13s> P1-nid005208: [PARALLEL DIPOLES for K(ibz) on 2 CPU] Loaded/Total (Percentual):259/518(50%)
 <13s> P1-nid005208: [PARALLEL DIPOLES for CON bands on 2 CPU] Loaded/Total (Percentual):3/5(60%)
 <13s> P1-nid005208: [PARALLEL DIPOLES for VAL bands on 6 CPU] Loaded/Total (Percentual):2/11(18%)
 <13s> P1-nid005208: [DIP] Checking dipoles header
 <13s> P1-nid005208: [WARNING] [r,Vnl^pseudo] included in position and velocity dipoles.
 <13s> P1-nid005208: [WARNING] In case H contains other non local terms, these are neglected
This was on 24 nodes of the Perlmutter CPU supercomputer with no OpenMP parallelization, but I've also tried it with OMP_NUM_THREADS=128, and also setting the CPU structure to 48.1.64, all giving the same error.

Thanks,
Miles
Miles Johnson
California Institute of Technology
PhD candidate in Applied Physics
