Compiling BGW on the Stampede KNL system

Submitted by sgao on Fri, 10/14/2016 - 11:56


Dear developer,

I want to compile BGW-1.2.0 on the new Stampede cluster with Intel Knights Landing (KNL) processors. However, the code I compiled runs about 10x slower than the one on the old Stampede system (which was compiled with the recommended arch.mk settings for Stampede with mvapich2). Could you recommend better compilation options for this system?

###########
The modules on the Stampede KNL system are:
intel/17.0.0
fftw3/3.3.5
impi/17.0.0
phdf5/1.8.16

There's no mvapich2/2.1 module for this system.

###########
The arch.mk I used to compile the code is:

COMPFLAG = -DINTEL
PARAFLAG = -DMPI -DOMP
MATHFLAG = -DUSESCALAPACK -DUNPACKED -DUSEFFTW3 -DHDF5

FCPP = cpp -ansi
F90free = mpiifort -free -openmp -no-ipo -ip
LINK = mpiifort -openmp -no-ipo -ip
FOPTS = -O3 -xMIC-AVX512 -fp-model source
FNOOPTS = -O2 -xMIC-AVX512 -fp-model source -no-ip
MOD_OPT = -module
INCFLAG = -I

C_PARAFLAG = -DPARA -DMPICH_IGNORE_CXX_SEEK
CC_COMP = mpicxx -xMIC-AVX512
C_COMP = mpicc -xMIC-AVX512
C_LINK = mpicxx -xMIC-AVX512
C_OPTS = -O3 -xMIC-AVX512 -no-ipo -ip -openmp
C_DEBUGFLAG =

REMOVE = /bin/rm -f

# Math Libraries
#
FFTWLIB = $(TACC_FFTW3_LIB)/libfftw3_omp.a \
$(TACC_FFTW3_LIB)/libfftw3.a
FFTWINCLUDE = $(TACC_FFTW3_INC)

MKLPATH = $(MKLROOT)/lib/intel64
LAPACKLIB = -Wl,--start-group \
$(MKLPATH)/libmkl_intel_lp64.a \
$(MKLPATH)/libmkl_intel_thread.a \
$(MKLPATH)/libmkl_core.a \
$(MKLPATH)/libmkl_blacs_intelmpi_lp64.a \
-Wl,--end-group -lpthread -lm
SCALAPACKLIB = $(MKLPATH)/libmkl_scalapack_lp64.a

HDF5PATH = $(TACC_HDF5_LIB)
HDF5LIB = $(HDF5PATH)/libhdf5hl_fortran.a \
$(HDF5PATH)/libhdf5_hl.a \
$(HDF5PATH)/libhdf5_fortran.a \
$(HDF5PATH)/libhdf5.a \
$(HDF5PATH)/libsz.a \
-lz
HDF5INCLUDE = $(HDF5PATH)/../include

TESTSCRIPT = sbatch stampede.scr

Thanks,
Shiyuan

Submitted by jdeslip on Sun, 12/18/2016 - 22:18

Hi Shiyuan,

We've actually spent quite a bit of time optimizing the code for KNL.

When you say it runs 10x slower, what are you comparing? Which step in the GW process, and run on how many nodes? What kind of calculation were you doing, and what does your batch script look like?

A few things could potentially be going wrong: something wrong with the build, some kind of task/thread binding issue, etc.
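
One quick check (just a sketch; this assumes Intel MPI and an OpenMP 4.0 runtime, which the impi/17 and intel/17 modules should give you) is to have the job print its pinning and OpenMP settings at startup, e.g. by adding these lines before the ibrun command:

export I_MPI_DEBUG=4          # Intel MPI prints the rank-to-core pinning map at startup
export OMP_DISPLAY_ENV=true   # the OpenMP runtime prints its settings (thread count, binding)

That at least tells you whether the ranks and threads end up where you expect them to.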

Your arch.mk looks OK-ish. However, I would use MKL for the FFTs instead of the FFTW build. There could be an issue with the OpenMP library in there (if it was built with a different compiler or compiler version).
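
For example (a sketch only; this assumes the MKL shipped with intel/17 includes its FFTW3 interface in the core libraries you already link through LAPACKLIB, and the include path may differ on the KNL Stampede), the FFTW part of arch.mk could point at MKL instead:

# Sketch: use MKL's FFTW3 interface instead of the standalone FFTW build.
# No separate FFTW library is needed if the wrappers live in the MKL
# libraries already linked via LAPACKLIB/SCALAPACKLIB; only the include
# path changes.
FFTWLIB     =
FFTWINCLUDE = $(MKLROOT)/include/fftw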

Jack

Submitted by sgao on Wed, 12/21/2016 - 15:11

Hi Jack,

I don't know much about parallel computing, and 10x slower may not have been a fair comparison. With more testing, this is what I got:

Test   System                    # nodes   # cores   # MPI tasks   T1**    T2***
1      old Sandy Bridge system      16       256          256       45s      5s
2      new KNL system                1        68          272*     482s     31s
3      new KNL system                4       272         1088      189s     70s
4      new KNL system                4       272          272      181s      8s

*   There are 68 cores/node on the KNL system and 4 hardware threads/core.
**  T1 = time for the calculation of matrix elements in epsilon, per k-point
*** T2 = time for building the polarizability matrix in epsilon, per k-point

So if I use 1 MPI task per core (test 1 vs test 4), the KNL system has a similar speed per node but is about 4x slower per core (the KNL processor runs at roughly half the clock frequency, so should I say the KNL run has 2x lower parallel efficiency?).
If I use 4 MPI tasks per core (test 1 vs tests 2 and 3), there is essentially no improvement, and it may even get worse. I don't know how the hardware threads work, but they are turned on by default if you only specify the number of MPI tasks and not the number of nodes in the batch script. Also, the number of threads per MPI task is always 1 according to the output. Does that mean I'm using the hardware threads wrong? (A hybrid layout I have not tried yet is sketched after my batch script below.)

Anyway, this is my batch script (for test 4):

#!/bin/bash
#SBATCH -J eps # job name
#SBATCH -o o%j # output and error file name (%j expands to jobID)
#SBATCH -n 272 # total number of mpi tasks requested
#SBATCH -N 4 # total number of nodes (this is by default #mpi tasks/272)
#SBATCH -p normal # queue (partition)
#SBATCH -t 00:30:00 # run time (hh:mm:ss)
#
HOMEDIR=`pwd`
TARGET=$SCRATCH/knltest
BINDIR=$WORK/BerkeleyGW-1.2.0-knl/bin

mkdir -p $TARGET/03-eps
cd $TARGET/03-eps
ln -sf $TARGET/02-wfn/WFN_cplx ./WFN
cp $HOMEDIR/epsilon.inp ./epsilon.inp
ibrun $BINDIR/epsilon.cplx.x > ./eps.out
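
For comparison, the hybrid MPI+OpenMP layout mentioned above might look like the sketch below. This is untested; I am assuming that ibrun picks up the SLURM task layout, that OMP_NUM_THREADS is what controls the "threads per MPI task" reported in the output, and that I_MPI_PIN_DOMAIN is the right Intel MPI variable for confining each task to one core:

#SBATCH -N 4                  # 4 KNL nodes
#SBATCH -n 272                # 272/4 = 68 tasks/node, i.e. 1 MPI task per physical core
export OMP_NUM_THREADS=4      # use the 4 hardware threads/core as OpenMP threads per task
export I_MPI_PIN_DOMAIN=core  # (Intel MPI) keep each task and its threads on one core
ibrun $BINDIR/epsilon.cplx.x > ./eps.out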

And this is my epsilon.inp:

epsilon_cutoff 10.0
number_bands 269
band_occupation 13*1 256*0
cell_slab_truncation
degeneracy_check_override
begin qpoints
0.041666667 0.041666667 0.000000000 1.0 0
0.041666667 0.083333333 0.000000000 1.0 0
0.041666667 0.125000000 0.000000000 1.0 0
0.041666667 0.166666667 0.000000000 1.0 0
0.041666667 0.208333333 0.000000000 1.0 0
0.041666667 0.250000000 0.000000000 1.0 0
0.041666667 0.291666667 0.000000000 1.0 0
0.041666667 0.333333333 0.000000000 1.0 0
0.041666667 0.375000000 0.000000000 1.0 0
0.041666667 0.416666667 0.000000000 1.0 0
0.041666667 0.458333333 0.000000000 1.0 0
0.041666667 0.500000000 0.000000000 1.0 0
0.041666667 0.541666667 0.000000000 1.0 0
0.041666667 0.583333333 0.000000000 1.0 0
0.041666667 0.625000000 0.000000000 1.0 0
0.333333333 0.333333333 0.000000000 1.0 0
end

The system is monolayer MoS2 with a 24x24x1 k-grid.

Shiyuan