Error in sigma.x : Segmentation fault

Submitted by H.Katow on Thu, 03/09/2017 - 01:13

Hello all,

I got a segmentation fault when running sigma.real.x, and the calculation stopped.
The system is graphane, a hexagonal 2D semiconductor (4 atoms, 10 electrons per unit cell).
Here are the inputs and outputs:

sigma.inp:
number_bands 99

screened_coulomb_cutoff 50

band_index_min 1
band_index_max 16

screening_semiconductor

begin kpoints
0.000000000 0.000000000 0.000000000 1.0
0.000000000 0.062500000 0.000000000 1.0
0.000000000 0.125000000 0.000000000 1.0
0.000000000 0.187500000 0.000000000 1.0
0.000000000 0.250000000 0.000000000 1.0
0.000000000 0.312500000 0.000000000 1.0
0.000000000 0.375000000 0.000000000 1.0
0.000000000 0.437500000 0.000000000 1.0
0.000000000 0.500000000 0.000000000 1.0
0.062500000 0.062500000 0.000000000 1.0
0.062500000 0.125000000 0.000000000 1.0
0.062500000 0.187500000 0.000000000 1.0
0.062500000 0.250000000 0.000000000 1.0
0.062500000 0.312500000 0.000000000 1.0
0.062500000 0.375000000 0.000000000 1.0
0.062500000 0.437500000 0.000000000 1.0
0.125000000 0.125000000 0.000000000 1.0
0.125000000 0.187500000 0.000000000 1.0
0.125000000 0.250000000 0.000000000 1.0
0.125000000 0.312500000 0.000000000 1.0
0.125000000 0.375000000 0.000000000 1.0
0.125000000 0.437500000 0.000000000 1.0
0.187500000 0.187500000 0.000000000 1.0
0.187500000 0.250000000 0.000000000 1.0
0.187500000 0.312500000 0.000000000 1.0
0.187500000 0.375000000 0.000000000 1.0
0.250000000 0.250000000 0.000000000 1.0
0.250000000 0.312500000 0.000000000 1.0
0.250000000 0.375000000 0.000000000 1.0
0.312500000 0.312500000 0.000000000 1.0
end

The run ended with the following messages:
sigma.out:
..........
Memory available: 1925.0 MB per PE
Reading header of WFN_inner
Highest occupied band (unshifted grid) = 5
Valence max (unshifted grid) = -2.030282 eV
Conduction min (unshifted grid) = 1.564574 eV
Middle energy (unshifted grid) = -.232854 eV
Fermi energy (unshifted grid) = -.232854 eV

Calculation parameters:
- Cutoff of the bare Coulomb interaction (Ry): 100.00
- Cutoff of the screened Coulomb interaction (Ry): 50.00
- Number of G-vectors up to the bare int. cutoff: 9733
- Number of G-vectors up to the screened int. cutoff: 3495
- Total number of bands in the calculation: 99
- Number of fully occupied valence bands: 5
- Number of partially occ. conduction bands: 0

Memory required for execution: 228.5 MB per PE
Memory required for vcoul: 114.4 MB per PE

Number of electrons per unit cell (from ifmax) = 10.000000
Number of electrons per unit cell (from occupations) = 10.000000
Plasma Frequency = .936401 Ry

Q-grid symmetries are being used.

Parallelization report:
- Using 96 processor(s), 6 pool(s), 16 processor(s) per pool.
- Each pool is computing 2 to 3 diagonal sigma matrix element(s).
- Note: distribution is not ideal because the number of diagonal sigma
matrix elements (16) is not divisible by the number of pools (6).
- Each pool is computing 0 off-diagonal sigma matrix element(s).
- Each pool is holding 0 to 7 band(s).
- Note: distribution is not ideal because the total number of bands
(99) is not divisible by the number of processors per pool (16).

.....

Number of k-points in WFN_inner: 30
Number of k-points in the full BZ of WFN_inner: 256
k+G sampling: 0.072169 0.062500 0.170480 (reciprocal lattice units)
WARNING: detected non-uniform k+G sampling, may cause strange results.
You should verify your answer with different cell-averaging cutoffs.

================================================================================
16:46:48 Dealing with k = 0.000000 0.000000 0.000000 1 / 30
================================================================================

Reading vxc.dat
Number of q-points in the irreducible BZ(k) (nrq): 30

Started calculating Sigma with 90 block(s) at 16:46:48.
[ 16:46:52 | 0% ] block 1 / 90.
[ 16:46:58 | 2% ] block 3 / 90, remaining: 306 s.
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Judging from the available and required memory reported above, it does not seem to be a memory-size problem.
The message "WARNING: detected non-uniform k+G sampling, may cause strange results." may be a hint,
but I don't know how to troubleshoot it.

I would appreciate it if anyone could help me. Thank you for reading.

Best,
Katow

Submitted by babarker on Mon, 03/13/2017 - 16:54

Hello Katow,

If you compiled the code yourself, try recompiling with the various debug options enabled. That will help us to track down the cause of the crash.

Best,
Brad

Submitted by H.Katow on Fri, 04/07/2017 - 23:52

Hello Brad,

Thank you for your comment, and I'm sorry for the late reply.
I tried some debug options (-CB -fpe0 -traceback -g) with maximum verbosity, but I couldn't get any more information than this:
================================================================================
18:50:18 Dealing with k = 0.000000 0.000000 0.000000 1 / 30
================================================================================

Reading vxc.dat
Number of k-points in the irreducible BZ(q) (nrq): 30

q neq indrq itrq kg0

0.00000 0.00000 0.00000 1 1 12 0 0 0
0.00000 0.06250 0.00000 6 2 8 0 0 0
0.00000 0.12500 0.00000 6 3 8 0 0 0
0.00000 0.18750 0.00000 6 4 8 0 0 0
0.00000 0.25000 0.00000 6 5 8 0 0 0
0.00000 0.31250 0.00000 6 6 8 0 0 0
0.00000 0.37500 0.00000 6 7 8 0 0 0
0.00000 0.43750 0.00000 6 8 8 0 0 0
0.00000 0.50000 0.00000 3 9 8 0 0 0
0.06250 0.06250 0.00000 6 10 6 0 0 0
0.06250 0.12500 0.00000 12 11 1 0 0 0
0.06250 0.18750 0.00000 12 12 1 0 0 0
0.06250 0.25000 0.00000 12 13 1 0 0 0
0.06250 0.31250 0.00000 12 14 1 0 0 0
0.06250 0.37500 0.00000 12 15 1 0 0 0
0.06250 0.43750 0.00000 12 16 1 0 0 0
0.12500 0.12500 0.00000 6 17 6 0 0 0
0.12500 0.18750 0.00000 12 18 1 0 0 0
0.12500 0.25000 0.00000 12 19 1 0 0 0
0.12500 0.31250 0.00000 12 20 1 0 0 0
0.12500 0.37500 0.00000 12 21 1 0 0 0
0.12500 0.43750 0.00000 6 22 2 0 1 0
0.18750 0.18750 0.00000 6 23 6 0 0 0
0.18750 0.25000 0.00000 12 24 1 0 0 0
0.18750 0.31250 0.00000 12 25 1 0 0 0
0.18750 0.37500 0.00000 12 26 1 0 0 0
0.25000 0.25000 0.00000 6 27 6 0 0 0
0.25000 0.31250 0.00000 12 28 1 0 0 0
0.25000 0.37500 0.00000 6 29 2 0 1 0
0.31250 0.31250 0.00000 6 30 6 0 0 0

Started calculating Sigma with 120 block(s) at 18:50:18.

qpoint 1 out of 30

Reading Eps Back
*** VERBOSE: Read eps from memory time = 18:50:22.646
*** VERBOSE: nmtx = 9733 ncouls = 3495

q= 0.00000 0.00000 0.00000 n= 3495 head of epsilon inverse = 0.591323

*** VERBOSE: Calling gmap time = 18:50:22.647
*** VERBOSE: Calling genwf time = 18:50:22.649
[ 18:50:22 | 0% ] block 1 / 120.
Computing Sigma diag 1 to 5 of 20
*** VERBOSE: Working on band 1 1st pool time = 18:50:22.656
*** VERBOSE: Calling mtxel time = 18:50:22.656
*** VERBOSE: Creating 32 x 32 x 192 FFTW plans. time = 18:50:22.687
*** VERBOSE: Done creating plans time = 18:50:22.687
*** VERBOSE: Calling mtxel_sxch for diagonal matrix elements time = 18:50:22.738
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

This is from sigma.out. I also ran the previous version (ver.1.1-beta2) and got the same result.
Does this log give any new insight?
Or should I add WRITE statements or something similar to the source code?

Best regards,
Hiroki KATOW

Submitted by babarker on Sun, 04/09/2017 - 20:12

Hello Hiroki,

Do you mind sharing your input file data for the "Epsilon" calculations? The segfault appears to occur when the Sigma executable begins to make use of the data from that step.

Best,
Brad

Submitted by H.Katow on Sun, 04/09/2017 - 21:30

Hello Brad,

Thank you for your prompt reply. Here it is.

epsilon.inp
epsilon_cutoff 100.0

number_bands 99

band_occupation 5*1 94*0

begin qpoints
0.001000000 0.001000000 0.000000000 1.0 1
0.000000000 0.062500000 0.000000000 1.0 0
0.000000000 0.125000000 0.000000000 1.0 0
0.000000000 0.187500000 0.000000000 1.0 0
0.000000000 0.250000000 0.000000000 1.0 0
0.000000000 0.312500000 0.000000000 1.0 0
0.000000000 0.375000000 0.000000000 1.0 0
0.000000000 0.437500000 0.000000000 1.0 0
0.000000000 0.500000000 0.000000000 1.0 0
0.062500000 0.062500000 0.000000000 1.0 0
0.062500000 0.125000000 0.000000000 1.0 0
0.062500000 0.187500000 0.000000000 1.0 0
0.062500000 0.250000000 0.000000000 1.0 0
0.062500000 0.312500000 0.000000000 1.0 0
0.062500000 0.375000000 0.000000000 1.0 0
0.062500000 0.437500000 0.000000000 1.0 0
0.125000000 0.125000000 0.000000000 1.0 0
0.125000000 0.187500000 0.000000000 1.0 0
0.125000000 0.250000000 0.000000000 1.0 0
0.125000000 0.312500000 0.000000000 1.0 0
0.125000000 0.375000000 0.000000000 1.0 0
0.125000000 0.437500000 0.000000000 1.0 0
0.187500000 0.187500000 0.000000000 1.0 0
0.187500000 0.250000000 0.000000000 1.0 0
0.187500000 0.312500000 0.000000000 1.0 0
0.187500000 0.375000000 0.000000000 1.0 0
0.250000000 0.250000000 0.000000000 1.0 0
0.250000000 0.312500000 0.000000000 1.0 0
0.250000000 0.375000000 0.000000000 1.0 0
0.312500000 0.312500000 0.000000000 1.0 0
end

Best,
Hiroki Katow

Submitted by babarker on Mon, 04/10/2017 - 16:04

Hello Hiroki,

The q- and k-grids look consistent. They are 16x16x1? Did you generate the grids WITHOUT explicitly invoking time reversal symmetry? This is an issue with, say, Quantum ESPRESSO; their default is to try and further reduce grids. The input files for pw2bgw in BerkeleyGW have the appropriate field set correctly.
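
As a quick way to confirm the grid dimensions, here is a minimal Python sketch. It takes a few of the fractional k-points from the sigma.inp posted above and finds the smallest grid that contains them all. The helper name `infer_grid` is made up for illustration and is not part of BerkeleyGW. The value 256 is the "Number of k-points in the full BZ of WFN_inner" reported in sigma.out.

```python
from fractions import Fraction
from math import gcd

# A few k-points copied from the posted sigma.inp (fractional coordinates).
kpoints = [
    (0.0, 0.0, 0.0),
    (0.0, 0.0625, 0.0),
    (0.0625, 0.125, 0.0),
    (0.125, 0.4375, 0.0),
    (0.3125, 0.3125, 0.0),
]

def infer_grid(points, max_denominator=1000):
    """Smallest (n1, n2, n3) such that n_i * k_i is an integer for every point.

    Hypothetical helper: assumes an unshifted, uniform grid.
    """
    grid = [1, 1, 1]
    for point in points:
        for axis, coord in enumerate(point):
            denom = Fraction(coord).limit_denominator(max_denominator).denominator
            # Least common multiple of the denominators seen so far on this axis.
            grid[axis] = grid[axis] * denom // gcd(grid[axis], denom)
    return tuple(grid)

grid = infer_grid(kpoints)
full_bz_points = grid[0] * grid[1] * grid[2]
reported_full_bz = 256  # from the sigma.out log above

print("inferred grid:", grid)
print("matches log:", full_bz_points == reported_full_bz)
```

For the points listed here the inferred grid is 16x16x1, which agrees with the 256 full-BZ points in the log. The sketch only inspects the coordinates; it says nothing about which symmetries were used to reduce the grid.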

Also, are the symmetries described in the header of the WFN(_inner) file, from wfn_rho_vxc_info.x, consistent with those found by kgrid.x?

You chose a very large value of "epsilon_cutoff" (100 Ry) but then use only half of it for "screened_coulomb_cutoff" (50 Ry). These should be the same, unless you are doing a convergence test.
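
The mismatch can be caught before launching a run by comparing the two input files. The sketch below uses a hand-rolled parser (not a BerkeleyGW tool) that assumes the simple "keyword value" layout and begin/end blocks seen in the files posted in this thread. The file contents are abbreviated copies of those posts, and comment handling is deliberately left out.

```python
def parse_keywords(text):
    """Collect 'keyword value' lines, skipping blank lines and begin...end blocks."""
    params = {}
    in_block = False
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "begin":
            in_block = True
        elif tokens[0] == "end":
            in_block = False
        elif not in_block:
            # Flags without a value (e.g. screening_semiconductor) map to "".
            params[tokens[0]] = " ".join(tokens[1:])
    return params

# Abbreviated copies of the inputs posted above.
epsilon_inp = """
epsilon_cutoff 100.0
number_bands 99
begin qpoints
0.001000000 0.001000000 0.000000000 1.0 1
end
"""
sigma_inp = """
number_bands 99
screened_coulomb_cutoff 50
band_index_min 1
band_index_max 16
screening_semiconductor
begin kpoints
0.000000000 0.000000000 0.000000000 1.0
end
"""

eps = parse_keywords(epsilon_inp)
sig = parse_keywords(sigma_inp)
cutoffs_match = float(eps["epsilon_cutoff"]) == float(sig["screened_coulomb_cutoff"])

print("epsilon_cutoff         :", eps["epsilon_cutoff"])
print("screened_coulomb_cutoff:", sig["screened_coulomb_cutoff"])
print("cutoffs match          :", cutoffs_match)
```

With the posted values the check reports that the cutoffs do not match (100 Ry versus 50 Ry).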

Also, your choice of bands is exceedingly small. In reality, you don't have this degree of freedom, as it is defined implicitly by your choice of epsilon/screened_coulomb cutoff. A way to numerically estimate the number of empty states corresponding to a given cutoff is to use the "gsphere.py" utility in the "bin" directory. Depending on your vacuum size, you're probably looking at several-thousand empty states for a converged calculation.
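
To get a feel for how quickly the basis grows with the cutoff, the following Python sketch uses the standard continuum estimate for the number of reciprocal-lattice vectors inside a sphere, N ≈ Ω E^(3/2) / (6π²), with the cell volume Ω in bohr³ and the cutoff E in Ry. This is not how gsphere.py works internally; it is only a rough cross-check against the G-vector counts printed in the sigma.out above, and it assumes the cutoff is applied to |G|² in Rydberg atomic units. The function name is invented for illustration.

```python
from math import pi

def gvectors_in_sphere(volume_bohr3, cutoff_ry):
    """Continuum estimate of the number of G-vectors with |G|^2 <= cutoff (Ry)."""
    return volume_bohr3 * cutoff_ry ** 1.5 / (6.0 * pi ** 2)

# Counts reported in the posted sigma.out.
n_bare, cutoff_bare = 9733, 100.0          # bare Coulomb cutoff
n_screened, cutoff_screened = 3495, 50.0   # screened Coulomb cutoff

# Invert the estimate to get the cell volume implied by the bare-cutoff count.
implied_volume = n_bare * 6.0 * pi ** 2 / cutoff_bare ** 1.5

# Predict the count at the screened cutoff and compare with the log.
predicted_screened = gvectors_in_sphere(implied_volume, cutoff_screened)
relative_error = abs(predicted_screened - n_screened) / n_screened

print("implied cell volume (bohr^3):", round(implied_volume, 1))
print("predicted / reported G-vectors at 50 Ry:", round(predicted_screened), "/", n_screened)
print("relative error:", round(relative_error, 3))
```

The prediction agrees with the reported count to within a few percent, which illustrates the E^(3/2) growth: doubling the cutoff multiplies the number of G-vectors by roughly 2.8. If the number of empty states needed follows the same scaling, 99 bands falls far short of what these cutoffs imply.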

Submitted by jdeslip on Wed, 04/19/2017 - 19:21

Can you run this with more and/or fewer MPI ranks and nodes to see if the error disappears? I still wonder if it could be memory related (perhaps there is some edge case here that causes an unaccounted-for array to blow up and run out of memory).

Submitted by H.Katow on Fri, 05/19/2017 - 01:33

Dear Brad and Deslippe,

Sorry for the late reply.

"Did you generate the grids WITHOUT explicitly invoking time reversal symmetry? This is an issue with, say, Quantum ESPRESSO; their default is to try and further reduce grids."
After explicitly setting "noinv = .true." in the PW input file, sigma.x ran correctly.

"Also, your choice of bands is exceedingly small."
It turned out that my current server is not powerful enough to run with thousands of bands. I am now trying to move to a supercomputer running SUSE Linux.

Thank you very much for your advice.

Best regards,
Hiroki KATOW

Submitted by babarker on Sun, 05/21/2017 - 17:31

Hello Hiroki,

Good to hear the update. I recommend manually generating the k-grids and q-grids using the "kgrid.x" executable from BerkeleyGW. See (BerkeleyGW_root_directory)/MeanField/ESPRESSO/kgrid.inp for details. Depending on the version of Espresso, there may be some issues with particular symmetry groups, as I've seen in their discussion forums.

Best,
Brad