I/O error: failed to open file 'chi_converge.dat'

Submitted by vormar on Mon, 09/17/2012 - 07:55

Dear Users and Developers,

First, I would like to thank the developers for releasing this amazing package.

I have been using BerkeleyGW on IBM-based supercomputers, e.g. Cineca SP6 (now retired), Sara Huygens, and Cineca Fermi. On all of them I ran into several I/O-related problems when running calculations in parallel. The calculations were simple ppGW runs for a molecule, starting from PWscf, just like the benzene example.

During epsilon calculations I found errors of the following type:

"From proc 8: ERROR: Failed to open file 'chi_converge.dat' with error 14"

According to the IBM manual, this error code simply means "Error opening file", which is no surprise.

Here is what I set in the epsilon input file. Since full_chi_conv_log 0 is the default, the chi_converge.dat file should be written during the calculation.

epsilon_cutoff 1.0
number_bands 300
band_occupation 28*1 272*0
cell_box_truncation
number_qpoints 1
begin qpoints
0.0000 0.000 0.000 1.0 1
end
gcomm_matrix
comm_mpi
degeneracy_check_override

Inspecting the source code, I think the problem arises at line 177 of epsilon_main.f90, where the file is opened by all processes. When I added (peinf%ionode .eq. 0) to the if statement, everything went smoothly, e.g. I could reproduce serial calculations performed on another architecture. As expected, the printout is done only by peinf%ionode=0 (from line 1470). Finally, I made the same modification at line 1846, where the file is closed.
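The guard described above (open, write, and close only on the process that does the printout) can be sketched as follows. This is a minimal Python illustration of the pattern, not the BerkeleyGW Fortran source: the ranks are simulated one after another in a single process, and the names ROOT_RANK, run_rank, and simulate are hypothetical.

```python
import os
import tempfile

ROOT_RANK = 0  # hypothetical stand-in for the process that does the printout


def run_rank(rank, path, values):
    """Simulate one process: only the root rank opens, writes, and closes the log."""
    log = None
    if rank == ROOT_RANK:      # guard on open
        log = open(path, "w")
    for v in values:
        if log is not None:    # guard on write
            log.write(f"{v}\n")
    if log is not None:        # guard on close
        log.close()
    return log is not None     # True if this rank touched the file


def simulate(n_ranks, path, values):
    """Run the simulated ranks in turn and return the ranks that opened the file."""
    return [r for r in range(n_ranks) if run_rank(r, path, values)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "chi_converge.dat")
        print(simulate(16, path, [0.1, 0.2]))
```

The point of the sketch is that the same condition protects all three operations, so no non-root process ever attempts an open it has no use for.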

I think similar errors could arise when opening "x.dat" at line 230 of sigma_main.f90 and "vxc.dat" at line 230 of sigma_main.f90.

Let me know what you think and if you need additional information.

Thanks,
Marton

--
PhD student
Department of Atomic Physics
Budapest University of Technology and Economics
Budafoki út 8., H-1111, Budapest, Hungary

Submitted by dstrubbe on Thu, 09/20/2012 - 09:02

Hi Marton,

Thanks for the useful and detailed report. We don't have access to machines of this type right now, so we did not notice these issues. I have made the changes you recommended regarding chi_converge.dat. The x.dat and vxc.dat files are not quite analogous though, since all the processors actually do read from them, although only proc 0 writes to them. Let us know if you do actually find an error from those files and we can change that part too if necessary.
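The difference between the two cases can be sketched roughly as follows. This is only a Python illustration with simulated ranks, not the actual sigma code; the function names rank_reads and rank_writes and the file contents are hypothetical.

```python
import os
import tempfile


def rank_writes(rank, path, values):
    """Only rank 0 opens the file for writing; other ranks skip the open."""
    if rank != 0:
        return False
    with open(path, "w") as f:
        f.writelines(f"{v}\n" for v in values)
    return True


def rank_reads(rank, path):
    """Every rank opens the file read-only and parses the same values."""
    with open(path, "r") as f:
        return [float(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "x.dat")
        writers = [r for r in range(4) if rank_writes(r, path, [1.5, 2.5])]
        data = [rank_reads(r, path) for r in range(4)]
        print(writers, data)
```

Here the read path cannot simply be guarded by a rank check, because every rank needs the data; a fix for those files would have to look different, for example reading on one rank and distributing the values.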

You mention running an example and comparing to serial calculations. That is certainly good, but I strongly recommend that everyone run the testsuite first, before any examples. The tests are much quicker to run and better defined for resolving problems. I presume your problem would have been visible in the testsuite as well.

You mention various European machines you are running on. These seem to be large machines with a wide user base. If you are able to get the code to build and pass the testsuite on them, we would much appreciate it if you could contribute an 'arch.mk' file like the ones in the config directory, and a job script for running the testsuite like those in the testsuite directory. We could then include them in a future release, as we do for some US supercomputers.

David