epsilon.x crash when writing dielectric matrix to file during tests

Submitted by fcannini on Wed, 08/09/2017 - 10:28



Hello there

While running the epsilon.x tests (both real and complex) of BGW 1.2.0, I get the error shown below with the following inputs:
- Benzene-SAPO/epsilon.inp
- GaAs-EPM/epsilon.inp
- Graphene/epsilon.inp
- Graphene/Graphene_3D.test

These are the only tests failing.

The error itself is the same for all inputs:

Writing dielectric matrix to file
[gputest-0-14:6139] *** An error occurred in MPI_Comm_dup
[gputest-0-14:6139] *** reported by process [1988034561,0]
[gputest-0-14:6139] *** on communicator MPI_COMM_WORLD
[gputest-0-14:6139] *** MPI_ERR_COMM: invalid communicator
[gputest-0-14:6139] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gputest-0-14:6139] *** and potentially your MPI job)

The environment is the following:
- OS : centos 6.5
- CC/CXX/FC : gnu 5.3.0
- MPI : openmpi 1.10.7
- BLAS/LAPACK : openblas 0.2.20
- HDF5+MPI : 1.8.19
- SCALAPACK : 2.0.2

Here's the arch.mk file:

FCPP = $(CPP) -C
F90free = mpif90 -ffree-form -ffree-line-length-none -fbounds-check -Wall -cpp -std=gnu
LINK = mpif90 -fopenmp
CC_COMP = mpicxx -Wall -pedantic-errors -std=c++0x
C_COMP = mpicc -Wall -pedantic-errors -std=c99
C_LINK = mpicxx
C_OPTS = -O2
REMOVE = /bin/rm -f
FFTWLIB = -L/fftw/lib -lfftw3_omp -lfftw3
FFTWINCLUDE = /fftw/include
LAPACKLIB = -L/openblas/lib -lopenblaso
BLACS = -L/scalapack/lib -lmpiblacs
SCALAPACKLIB = -L/scalapack/lib -lscalapack $(BLACS)
HDF5LIB = -L/hdf5_mpi/lib -lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lz
HDF5INCLUDE = /hdf5_mpi/include/shared
TESTSCRIPT = make check-parallel

Any idea of the cause of the problem?


Submitted by jdeslip on Thu, 08/24/2017 - 13:37

Hmm, I guess the HDF5 library is trying to create new communicators and is dying. Do you have another version of HDF5 you could try? You could also use the non-HDF5 BerkeleyGW build, although in general that performs worse.
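
One thing that may be worth ruling out first (an assumption, not something confirmed in this thread) is that epsilon.x and the HDF5 library were built against different MPI implementations, which is a common cause of "invalid communicator" errors. A minimal sketch of a helper that picks the MPI libraries out of `ldd` output so the two lists can be compared; the function name and the library name patterns are illustrative only:

```python
import re

# Hypothetical helper: names and patterns are illustrative, not part of BerkeleyGW.
# Matches libmpi.so*, libmpi_<something>.so*, libmpich*.so*, and the Open MPI
# support libraries, but deliberately not unrelated names such as libmpiblacs.so.
_MPI_LIB = re.compile(r"^lib(mpi|mpi_\w+|mpich\w*|open-pal|open-rte)\.so")

def mpi_libraries(ldd_output):
    """Return the set of MPI-related shared-library names found in `ldd` output."""
    names = set()
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        # First field is the library name (or a full path for the loader).
        name = parts[0].rsplit("/", 1)[-1]
        if _MPI_LIB.match(name):
            names.add(name)
    return names
```

To use it, capture the output of `ldd` on epsilon.x and on the shared HDF5 library, pass each to `mpi_libraries`, and check that both sets name the same MPI library. If HDF5 was linked statically this comparison will not show anything for it.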

Submitted by fcannini on Fri, 08/25/2017 - 11:55

Hi jdeslip, thanks for answering.

Which series should I try, 1.8.x or 1.10.x?
What can I do to dig further into the problem?


Submitted by jdeslip on Thu, 08/31/2017 - 22:57

I'd try building with both if you can and see whether either works. How important HDF5 is probably depends a bit on the system you are studying. For small to medium-sized systems the difference is probably negligible. For large systems you are probably looking at about 100 MB/s I/O without HDF5 and, depending on your filesystem and MPI-IO implementation, more than 1 GB/s with HDF5.
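
To put those rough throughput figures in perspective, here is a small back-of-the-envelope sketch. The 50 GB file size is a made-up example, not a measurement from any BerkeleyGW run, and the function name is hypothetical:

```python
def write_time_seconds(size_gb, throughput_mb_per_s):
    """Estimate the time to write a file, taking 1 GB = 1000 MB."""
    return size_gb * 1000.0 / throughput_mb_per_s

# Hypothetical 50 GB dielectric matrix file (illustrative size only).
size_gb = 50
without_hdf5 = write_time_seconds(size_gb, 100)    # ~100 MB/s without HDF5
with_hdf5 = write_time_seconds(size_gb, 1000)      # ~1 GB/s with HDF5 + MPI-IO
print(f"without HDF5: {without_hdf5:.0f} s, with HDF5: {with_hdf5:.0f} s")
```

For a file of that size the estimate is several minutes versus under a minute per write, while for files of a few hundred MB either path finishes in seconds, which matches the point that the difference only matters for large systems.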