Epsilon run breaks down during k-point list, independent of CPU count

Submitted by jugeb on Thu, 06/01/2017 - 06:23


Hi everyone,

I am running epsilon.cplx.x for a system that I would guess is small to moderate: 26 atoms, 68 electrons, 41 bands, and a 6x6x6 k-point grid (216 k-points in total, generated with kgrid.x). Execution always fails after a couple of k-points, with errors like this:

[ 12:11:47 | 60% ] processor 735 / 1224, remaining: 10 s.
[ 12:11:49 | 70% ] processor 857 / 1224, remaining: 7 s.
[ 12:11:51 | 80% ] processor 980 / 1224, remaining: 5 s.
[ 12:11:53 | 90% ] processor 1102 / 1224, remaining: 2 s.
Finished building polarizability matrix at 12:11:56.
Elapsed time: 23 s.

q-pt 5: Head of Epsilon = 2.228379354373837E+000 7.313642240044020E-019
q-pt 5: Epsilon(2,2) = 1.664736656152439E+000 -1.913058531960379E-019
Rank 73 [Fri May 26 12:11:57 2017] [c2-0c1s9n0] Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(211)..................: MPI_Cart_sub(comm=0xc400629e, remain_dims=0x7ffffffef628, comm_new=0x7ffffffef580) failed
PMPI_Cart_sub(153)..................:
MPIR_Comm_split_impl(276)...........:
MPIR_Get_contextid_sparse_group(674): Too many communicators (0/4096 free on this process; ignore_id=0)
Rank 907 [Fri May 26 12:11:57 2017] [c2-0c1s15n0] Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(211)..................: MPI_Cart_sub(comm=0xc400629e, remain_dims=0x7ffffffef628, comm_new=0x7ffffffef580) failed
PMPI_Cart_sub(153)..................:
MPIR_Comm_split_impl(276)...........:

To me this looked like a memory issue, although the estimate in the output suggests that there is plenty of memory for this job:
Memory available: 3487.1 MB per PE
Memory required for execution: 132.9 MB per PE
Memory required for vcoul: .0 MB per PE

I tried successively increasing the number of nodes from 2 to 4 to 16 (each node with 36 processors), but the calculation aborts after the same number of k-points, at the same position, in each of these cases (18 out of 216 k-points here). Decreasing epsilon_cutoff increases the number of k-points that complete, and increasing the cutoff decreases it, which still suggests that computational resources are the cause. Splitting the calculation into small chunks of k-points and merging them afterwards is an option (a sketch of what I mean is below), but I was hoping to understand and resolve this, since otherwise it might lead to cases where not even a single k-point can be handled anymore.
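For completeness, this is roughly the chunking I have in mind; it is only my own sketch, assuming the q-points sit in the usual "begin qpoints ... end" block of epsilon.inp, and the chunk size and file names are placeholders:

# Hypothetical helper (not part of BerkeleyGW): split the q-point list of
# an epsilon.inp into several smaller input files, so each epsilon run
# handles only a subset of q-points; the resulting eps(0)mat files would
# then be merged afterwards.
CHUNK = 20            # q-points per run (placeholder)
SRC = "epsilon.inp"   # input file with the full q-point list

with open(SRC) as f:
    lines = f.readlines()

# Separate the q-point lines from the rest of the input file.
head, qpts, tail = [], [], []
state = "head"
for line in lines:
    tokens = line.split()
    if state == "head":
        head.append(line)
        if tokens and tokens[0].lower() == "begin" and "qpoints" in line.lower():
            state = "qpts"
    elif state == "qpts":
        if tokens and tokens[0].lower() == "end":
            tail.append(line)
            state = "tail"
        else:
            qpts.append(line)
    else:
        tail.append(line)

# Write one input file per chunk of q-points.
for n, i in enumerate(range(0, len(qpts), CHUNK)):
    with open("epsilon_chunk_%d.inp" % n, "w") as out:
        out.writelines(head + qpts[i:i + CHUNK] + tail)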

Thanks!

Submitted by jdeslip on Mon, 06/05/2017 - 05:11

I'm not sure this is a memory issue. It looks like it is dying inside ScaLAPACK, because ScaLAPACK is creating too many communicators for your MPI library. Can you try a different version of ScaLAPACK or a different MPI library?
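The failure after a fixed number of q-points, independent of the node count, fits that picture: MPICH gives each process only a few thousand communicator context IDs, so if a new communicator is created per block or q-point (e.g. via MPI_Cart_sub) and never freed, the pool runs dry after the same number of steps no matter how many ranks you use. Here is a tiny standalone sketch of that failure mode; it is just my own illustration using mpi4py against the same MPICH, not BerkeleyGW code, and the file name in the comment is a placeholder:

# Minimal illustration of "Too many communicators": every Cart_sub (like
# every MPI_Comm_split) creates a new communicator and consumes a context
# ID.  MPICH only has a few thousand per process, so a loop that never
# frees them fails after a fixed number of iterations, regardless of how
# many ranks are used.  Run with e.g.: srun -n 4 python repro.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
cart = comm.Create_cart(dims=[comm.Get_size()], periods=[False])

subs = []
for i in range(5000):              # more iterations than available context IDs
    subs.append(cart.Sub([True]))  # analogous to MPI_Cart_sub; never freed
    # subs[-1].Free()              # uncommenting this avoids the error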

Submitted by jugeb on Mon, 06/05/2017 - 06:15

Unfortunately I didn't compile the code myself; I used the precompiled module on Cori (module berkeleygw/1.2), since I thought this should be a stable build to start with, given the size of the platform and its proximity to the development group.

Do you guys compile these modules yourselves, or should we contact the Cori staff to get the details of the build process?