aportelli / hadrons Goto Github PK

19.0 4.0 30.0 3.54 MB

Grid-based workflow management system for lattice field theory simulations

License: GNU General Public License v2.0

C++ 97.88% Makefile 0.26% Shell 0.37% M4 1.40% HTML 0.09%

grid workflow-management lattice-gauge-theory lattice-qcd lattice-field-theory

hadrons's Issues

Intel compiler

Hi, we had to make some modifications to
Hadrons/Modules/MContraction/Baryon.hpp.
The Intel compiler wanted the last argument to regex_replace to be wrapped in a std::string().
For example:
gammaString = regex_replace(gammaString, std::regex("j12"),std::string("(Identity SigmaXZ)"));
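
The workaround can be tried in isolation with a small standalone sketch. The helper name replaceToken is made up for illustration and is not part of Hadrons; the only point shown is that the replacement text is handed to std::regex_replace as an explicit std::string rather than a string literal.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not a Hadrons function): replace every match of
// `pattern` in `input`, passing the replacement as an explicit std::string.
std::string replaceToken(const std::string &input,
                         const std::string &pattern,
                         const std::string &replacement)
{
    return std::regex_replace(input, std::regex(pattern), std::string(replacement));
}
```

Calling replaceToken(gammaString, "j12", "(Identity SigmaXZ)") would then mirror the modified line quoted above.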

HDF5-related errors when compiling on Skylake at Cambridge

I'm afraid I've had zero luck compiling Grid+Hadrons for the Skylake cluster at Cambridge over the past couple of weeks, and with Tuesday's DiRAC technical case deadline approaching I need to ask for help. Back in Grid issue 175 @aportelli mentioned building successfully at Cambridge, so I'm hoping there may be some world-readable build there from which I could copy workable options and module choices. I have tried to divine some of this from CI builds like this one, so far without success.

I've had several dozen builds fail on this system over the past couple of weeks. In case it might be helpful, I attach some logs from my latest attempt, which fails with the HDF5-related errors mentioned in the subject. (Grid's make check returns an error code but all the tests seem to say they're OK, which is why I'm opening this under Hadrons rather than Grid.)

From these logs I quote:

Grid commit:

commit: 7cf7f11e
branch: develop

Hadrons commit:

HADRONS_BRANCH='develop'
HADRONS_SHA='82ffe4f24a0684c37598440290f35c0b33d10b58'

Grid configure line:

$ /home/dc-scha3/dev/Grid-Sky/configure --prefix=/home/dc-scha3/dev/Grid-Sky/install --with-lime=/home/dc-scha3/dev/c-lime-Sky/install --enable-precision=double --enable-simd=AVX512 --enable-comms=mpi3 --enable-mkl CC=mpiicc CXX=mpiicpc CXXFLAGS="-std=c++11 -qoverride-limits"

Full CPU description:

$ cat /proc/cpuinfo | grep 'model name' | uniq
model name : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz

Full compiler (for this latest attempt):

$ mpiicpc -v | grep -v Copyright
mpiicpc for the Intel(R) MPI Library 2019 Update 3 for Linux*
icpc version 19.0.3.192 (gcc version 8.4.0 compatibility)

For obvious reasons I've made many attempts to try to use what look like the most promising of the 45 different hdf5 modules on offer at Cambridge. The attached messy notes summarize the main ways these attempts have failed. Typically I get either

HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
#000: H5T.c line 1712 in H5Tcopy(): not a datatype or dataset
major: Invalid arguments to routine
minor: Inappropriate type

in Grid's make check or the undefined references in the attached Hadrons make log, in both cases with warnings about

ld: warning: libhdf5.so.8, needed by /lib/../lib64/libhdf5_cpp.so, may conflict with libhdf5.so.103

Google hasn't yet helped me with that.

Thanks in advance for any hints you may be able to offer.

Attachments:
build_script.txt
build_notes.txt
module_list.txt
Grid_config.log
grid.configure.summary.txt
Grid_make.log
Grid_make-install.log
Grid_make-check.log
Hadrons_config.log
Hadrons_make.log
Grid_make-check-old-HDF5.log

Difficult to understand DB error messages in case of graph inconsistency

Running a module with an output used by no other module causes the following db error
database error: error executing query 'INSERT INTO "objects" VALUES(16,'module_name',4,0,'undef',4294967295);'
which does not directly hint at the problem.
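
One way to get a clearer message would be to look for unconsumed outputs before anything is written to the database. The sketch below is only an illustration of that idea: ModuleDesc and findUnconsumedOutputs are hypothetical names, not Hadrons types, and the real module graph is certainly richer than this. (The trailing 4294967295 in the query is 2^32 - 1, which is what an unset 32-bit unsigned index would look like; that reading is a guess.)

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one module in the graph (not a Hadrons type).
struct ModuleDesc
{
    std::string              name;
    std::vector<std::string> inputs;   // object names this module reads
    std::vector<std::string> outputs;  // object names this module produces
};

// Return "module:output" for every output that no module lists as an input.
std::vector<std::string> findUnconsumedOutputs(const std::vector<ModuleDesc> &graph)
{
    std::set<std::string> consumed;
    for (const auto &m : graph)
    {
        consumed.insert(m.inputs.begin(), m.inputs.end());
    }

    std::vector<std::string> dangling;
    for (const auto &m : graph)
    {
        for (const auto &out : m.outputs)
        {
            if (consumed.count(out) == 0)
            {
                dangling.push_back(m.name + ":" + out);
            }
        }
    }
    return dangling;
}
```

A non-empty result could be reported by name before the INSERT is attempted, so the user sees which module is involved.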

Module support for higher representations

Even after the fix in #86, one cannot use Hadrons for many things in higher representations, as none of the modules support any inputs other than the fundamental rep.

To rectify this, most modules need to have higher-representation versions registered and instantiated.

The most obvious way I can see to achieve this is to take lines 75–80 of Global.hpp and duplicate them to create FIMPL_ADJ, FIMPL_ASY and FIMPL_SYM (or similar), and then look for every (or at least almost every) instantiation of a module using FIMPL and duplicate it for the three higher representations.

I have a prototype that takes this approach just enough to allow a spectrum computation in the adjoint rep (so editing just MSource::Z2, MAction::Wilson, MSolver::RBPrecCG, MFermion::GaugeProp, and MContraction::Meson, and referring to WilsonAdjImplR explicitly), and it does seem to work (giving results compatible with HiRep's measure_spectrum using Z2SEMWall sources).

(The solution is complicated by contractions—in that case there would be a minor combinatoric explosion if every combination of representations had to be dealt with (which presumably it would need to if you wanted to allow for chimera baryons and such things). I'm not sure how best to deal with that.)

@aportelli Does that seem a reasonable starting point for an implementation? (I ask because I'm happy to implement and PR it.) Or do you prefer a different solution that I haven't spotted? (Or is there some other way that I've missed of doing this without needing to implement the template instantiations and module registrations in the Hadrons source, e.g. registering my own versions of the modules? The instantiation issues I've had suggest that that won't work, but it's entirely possible that I've missed something.)

Thanks!

Different contractions have different output file structures

I've noticed in recent work that the Meson contraction outputs a rather different XML structure to the WardIdentity contraction. This means downstream tooling to read and analyse these quantities needs to account separately for the two classes of input, rather than using a single generic reader.

Is there a strong reason for this choice? Would it be possible to make them more consistent?

Bug in naive scheduler for multiple trajectories in one job

I found a bug in a job that processes multiple trajectories and uses the naive scheduler. I get the following error at the start of the second trajectory:

HadronsXmlRun: ../../Hadrons/VirtualMachine.cpp:808: Grid::Hadrons::VirtualMachine::makeGarbageSchedule(const Program&) const::<lambda(unsigned int)>: Assertion `it != p.end()' failed.

Change of compile flags

Not so much an issue as a health warning: both @felixerben and I have found that compiling with clang-10 or clang-11 fails on laptops unless CXXFLAGS=-pthread is added when building Hadrons.

Open questions:

Was there a change in the configure script?
This seems to be limited to laptop compiles. Why?

3 Methods to Compute Meson 2-Pt Function

Hello! Recently I've been trying to better understand Hadrons and was looking at the Meson module. The code & documentation say the meson 2-pt function can be computed with three options for the sink parameter. I have attempted to do exactly this and am currently confused by the results. A summary of the problem is:

Using an MSink object (I believe a ScalarSink) reproduces the pion correlation function in Test_hadrons_spectrum.cpp.
Using the same MSource object as was used to compute the propagator produces a correlator with only one non-zero entry, at t=0.
Using a SlicedPropagator produces a symmetric correlation function, but not one that decays exponentially.

Firstly here are a few time slices of each correlator to showcase the problem

meson correlator - ScalarSink
t=0 (0.47442148, 0.)
t=1 (0.14026629, 0.)
t=2 (0.01591819, 0.)

meson correlator - sources
t=0 (0.00053426, 0.)
t=1 (0., 0.)
t=2 (0., 0.)

meson correlator - Slice Propagator

t=0 (0.16742971, 0.)
t=1 (5.06398922, 0.)
t=2 (0.4132842, 0.)
t=3 (1.03891438, 0.)
t=4 (0.40092583, 0.)
t=5 (1.03891438, 0.)

I've also attached the main.cpp.txt main file which computes the correlator each way.

I appreciate I may just be misusing the other sink methods, and would appreciate any help.
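
A quick way to quantify "decays exponentially" is the log effective mass, m_eff(t) = log(C(t)/C(t+1)), which is roughly constant for a single decaying exponential. The sketch below is a generic diagnostic, not part of Hadrons; the function name effectiveMass is made up, and it assumes the real parts of the correlator have already been read into a vector.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Log effective mass m_eff(t) = log(C(t) / C(t+1)) from the real part of a
// correlator. Ratios that are non-positive or not finite (for example a zero
// denominator) are reported as NaN.
std::vector<double> effectiveMass(const std::vector<double> &corr)
{
    std::vector<double> meff;
    for (std::size_t t = 0; t + 1 < corr.size(); ++t)
    {
        const double ratio = corr[t] / corr[t + 1];
        const bool   ok    = std::isfinite(ratio) && ratio > 0.0;
        meff.push_back(ok ? std::log(ratio) : std::nan(""));
    }
    return meff;
}
```

Applied to the numbers quoted above, the ScalarSink values give positive ratios at each step, whereas the SlicedPropagator values give a ratio below one between t=0 and t=1, i.e. a negative effective mass there.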

Unable to compile for SU(4) for A100

I'm working with @LupoA trying to benchmark Hadrons on Tursa, and am hitting an issue that the nest of templates seemingly prevents Hadrons compiling with CUDA. A number of modules (including MGauss, which we need) give the error:

/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4248 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEELi4EEELi4EEEEEEaSINS_9BinaryMulENS1_INS2_INS2_INS2_ISD_EEEEEEEENS2_INS3_INS3_IS7_Li4EEELi4EEEEEEERSH_RKNS_23LatticeBinaryExpressionIT_T0_T1_EEEUlmmmE_EEvmmmT_

CPU compilation is fine.

Do you have any idea how fixing this could be approached, beyond going into Grid and renaming all of the type names to something shorter?

Thanks!

More complicated baryons are computed wrongly in Hadrons

In the first version of Baryon.hpp, which has been part of Hadrons until now, there was a bug regarding baryons which have an interpolator that is a linear combination of two or more quark combinations (like the sigma^0, lambda, ...). The problem is that the quark content is not passed in the correct (shuffled) order to the Grid function in lines 256-261:

            for (int iQ1 = 0; iQ1 < nQ; iQ1++){
                for (int iQ2 = 0; iQ2 < nQ; iQ2++){
                    BaryonUtils<FIMPL>::ContractBaryons_Sliced(q1[t],q2[t],q3[t],gAl,gBl,gAr,gBr,quarks[iQ1].c_str(),quarks[iQ2].c_str(),parity,ch);
                    cs += prefactors[iQ1]*prefactors[iQ2]*ch;
                }
            }

'Simple' baryons like nucleon, omega, sigma^+/-, Delta, etc. are not affected.
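
Reading the double loop quoted above as a formula may help. Assuming the interpolator is a linear combination with real prefactors a_i (the prefactors array in the loop), the accumulated correlator has the form below. The symbols a_i and C_ij are notation introduced here, not names from Hadrons.

```latex
C(t) = \sum_{i=1}^{n_Q} \sum_{j=1}^{n_Q} a_i \, a_j \, C_{ij}(t)
```

Here C_ij(t) stands for the contraction computed with quark content i and quark content j (quarks[iQ1] and quarks[iQ2] in the loop). For a single-term interpolator n_Q = 1 and there are no cross terms, which would be consistent with the 'simple' baryons being unaffected.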

This issue will be solved with Raoul's pull request, which will probably be merged into develop soon.

GPU runtime error with MAction::DWFF

Hadrons crashes at runtime on GPU when executing the MAction::DWFF module. The error is
r6i6n5.419883HadronsXmlRun: CUDA failure: cuIpcOpenMemHandle() (at /nfs/site/home/phcvs2/gitrepo/ifs-all/components/psm/temp.build-cuda/BUILD/libpsm2-11.2.89/ptl_am/am_cuda_memhandle_cache.c:281)returned 208
r6i6n3.432920Error returned from CUDA function.

The error does not seem to occur on a single node.

Test_QED: most results are zero when run on GPUs

Issue

The outputs of the exchange, self-energy and tadpole correlators are zero when run on 8 nodes of tesseract (V100).
The two-point correlator does produce output. Its correctness against a CPU run still needs confirming.
This seems to be related to the seqConserved source module used to insert the photon fields with a conserved current.
Inspection of Grid shows that conserved currents are not compiled when Grid runs on GPUs.
This issue is present in the latest commit on the develop branch of Grid.

Actions required

A GPU ready version of the Sequential Conserved current for each action needs to be implemented in Grid to enable this functionality in Hadrons on GPU nodes.
Note: other modules that rely on Conserved currents will also be affected by this issue.

meson assertion when restoreMemoryProfile=true

I can run a job from a database using xml such as the attached:
Z2_s.3080.xml.txt
However, when I set restoreMemoryProfile=true, the job fails in the first meson contraction with an error such as:
Hadrons : Message : 7747.190612 s : ---------------- Measurement step 461/11249 (module 'meson_3pt_s_GFPW_quark_h0_h2_gT5_dt_20_p_0_0_0_t_16') ----------------
Hadrons : Message : 7747.190628 s : ................ Module execution
Hadrons : Message : 7747.190667 s : Computing meson contractions 'meson_3pt_s_GFPW_quark_h0_h2_gT5_dt_20_p_0_0_0_t_16' using quarks 'Prop_GFPW_h2_p_0_0_0_t_16' and 'PropSeq_GFPW_h0_gT5_dt_20_ps_0_0_0_s_p_0_0_0_t_16'
HadronsXmlRun: /home/dp008/dp008/dc-mars3/.local/Double/include/Grid/lattice/Lattice_base.h:50: void Grid::conformable(Grid::GridBase *, Grid::GridBase *): Assertion `lhs == rhs' failed.
The backtrace shows this happens at Meson.hpp:244, i.e.
buf = sink(c);
This happens repeatably for jobs with circa 10k modules. I did notice this once during testing on my laptop, but assumed it was user error as it did not recur.

Interference caused by using --debug-mem Grid flag

Hadrons memory management 'sees' different sizes for objects depending whether --debug-mem is used or not. An example is the "memory needed" Hadrons message compared between 2 identical runs differing by the use of --debug-mem:
Without --debug-mem:
Hadrons : Message : 0.748221 s : Schedule (memory needed: 79.3 MB):
With --debug-mem:
Hadrons : Message : 0.849924 s : Schedule (memory needed: 1.4 GB):

Unable to build a C++ Hadrons application for GPU

I'm trying to use Hadrons on Tursa, but am not able to compile an app.

What I have tried:

Configure Grid using:

../configure \
     --prefix ${HOME}/prefix2 \
     --enable-comms=mpi \
     --enable-simd=GPU \
     --enable-shm=nvlink \
     --enable-gen-simd-width=64 \
     --enable-accelerator=cuda \
     --enable-Nc=2 \
     --enable-accelerator-cshift \
     --disable-unified \
     --disable-gparity \
     --with-lime=${HOME}/prefix \
     CXX=nvcc \
     LDFLAGS="-cudart shared" \
     CXXFLAGS="-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared"

Then make and install

Configure Hadrons using

../configure --with-grid=${HOME}/prefix2 --prefix=${HOME}/prefix2

Then make and install.

Follow the instructions in the app tutorial to build a test app. Configure using
```
../configure --with-grid=${HOME}/prefix2 --with-hadrons=${HOME}/prefix2
```

The configure step fails, as it sets CXX=g++ rather than the nvcc that was used for compiling Grid and Hadrons. This conflicts with the CXXFLAGS which are designed to work with nvcc, and are not recognised by g++.

configure:4458: g++ -o conftest  -g -O2 -I/home/dp208/dp208/dc-benn2/prefix2/include -I/home/dp208/dp208/dc-benn2/prefix2/include -I/home/dp208/dp208/dc-benn2/prefix2/include -I/home/dp208/dp208/dc-benn2/prefix/include -O3 -ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared -Xcompiler -fno-strict-aliasing --expt-extended-lambda --expt-relaxed-constexpr -Xcompiler -fopenmp    -L/home/dp208/dp208/dc-benn2/prefix2/lib -L/home/dp208/dp208/dc-benn2/prefix2/lib -L/home/dp208/dp208/dc-benn2/prefix2/lib -L/home/dp208/dp208/dc-benn2/prefix/lib -cudart shared -Xcompiler -fopenmp conftest.cpp  -lHadrons  -ldl -lGrid -lz -lcrypto -llime -lmpfr -lgmp -lstdc++ -lm -lcuda -lz >&5
g++: error: mpicxx: No such file or directory
g++: error: unrecognized debug output level 'encode'
g++: error: arch=compute_80,code=sm_80: No such file or directory
g++: error: shared: No such file or directory
g++: error: shared: No such file or directory
g++: error: unrecognized command line option '-ccbin'
g++: error: unrecognized command line option '-cudart'
g++: error: unrecognized command line option '-Xcompiler'; did you mean '--compile'?
g++: error: unrecognized command line option '--expt-extended-lambda'
g++: error: unrecognized command line option '--expt-relaxed-constexpr'
g++: error: unrecognized command line option '-Xcompiler'; did you mean '--compile'?
g++: error: unrecognized command line option '-cudart'
g++: error: unrecognized command line option '-Xcompiler'; did you mean '--compile'?

If I manually set CXX=nvcc, then I get a huge number of errors, where nvcc doesn't recognise the CUDA functions and datatypes; the first couple are, for example:

/home/dp208/dp208/dc-benn2/prefix2/include/Grid/threads/Accelerator.h:109:8: error: 'cudaStream_t' does not name a type; did you mean 'CUstream_st'?
 extern cudaStream_t copyStream;
        ^~~~~~~~~~~~
        CUstream_st
/home/dp208/dp208/dc-benn2/prefix2/include/Grid/threads/Accelerator.h:106:28: error: '__host__' does not name a type; did you mean 'CUhostFn'?
 #define accelerator_inline __host__ __device__ inline

Manually setting CXXFLAGS="" doesn't make any difference, as configure picks up the additional CXXFLAGS from hadrons-config.

Am I missing a step or flags that would allow me to configure to compile for GPU, or is this not currently supported?

Many thanks

Problems installing Hadrons on AMD GPUs

I'm trying to install Hadrons over a Grid build on the Crusher AMD machine at ORNL.

I installed Grid with lime and tested that Grid code works well.
With Grid installed here at <grid_dir>, and the grid prefix dir (with lime, mpfr) in <grid_dir>/grid_prefix , I'm building Hadrons in the following way, using the instructions in: https://aportelli.github.io/Hadrons-doc/#/install

source setup_env.sh
git clone git@github.com:aportelli/Hadrons.git
cd Hadrons
./bootstrap.sh
mkdir build
cd build
 
../configure --prefix=<grid_dir>/grid_prefix  --with-grid=<grid_dir>/grid_prefix
 
make -j 14

At this stage, the code fails with a linker error:

ld.lld: error: undefined symbol: Grid::Hadrons::MContraction::TBaryonGamma3pt<Grid::WilsonImpl<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2u> > > >, Grid::FundamentalRep<3>, Grid::CoeffReal> >::parseGammaLRString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra>, std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra> >, std::allocator<std::pair<std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra>, std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra> > > >&)
>>> referenced by Application.cpp
>>>               Application.o:(vtable for Grid::Hadrons::MContraction::BaryonGamma3pt) in archive ../Hadrons/libHadrons.a
 
ld.lld: error: undefined symbol: typeinfo for Grid::Hadrons::MContraction::TBaryonGamma3pt<Grid::WilsonImpl<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2u> > > >, Grid::FundamentalRep<3>, Grid::CoeffReal> >
>>> referenced by Application.cpp
>>>               Application.o:(typeinfo for Grid::Hadrons::MContraction::BaryonGamma3pt) in archive ../Hadrons/libHadrons.a
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [Makefile:423: HadronsXmlRun] Error 1
make[2]: *** Waiting for unfinished jobs....
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)

I've tested for both N_c = 3 and 4, and both give the same errors.

To make sure the configure is correct, I tried the tests in the folder Hadrons/build/tests
I can successfully do the tests for :

Test_diskvector
Test_field_io
Test_database
Test_em_field

However, Test_free_prop fails with a similar linker error.

MIO::LoadBinary does not check for lattice size

Tobi noticed that the MIO::LoadBinary module does not check whether the file fits the lattice geometry. If Hadrons is run with a grid that is smaller than the configuration in the file it does not throw any error or warning. As the binary file format does not contain metadata like the plaquette, there is no check within Grid.

A simple solution would be to check the file size against the expected size based on the lattice geometry within MIO::LoadBinary.
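
A minimal sketch of such a size check is given below. It assumes, without confirmation from the source, that the file is headerless and stores nLinks colour matrices of size nc x nc per site as complex numbers; all function names are hypothetical and none of this is the MIO::LoadBinary implementation.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Expected payload in bytes, ASSUMING a headerless file holding nLinks
// nc x nc complex matrices per lattice site (layout not confirmed).
std::uint64_t expectedBinaryBytes(const std::vector<int> &dims, int nc,
                                  int nLinks, int bytesPerReal)
{
    std::uint64_t sites = 1;
    for (int d : dims)
    {
        sites *= static_cast<std::uint64_t>(d);
    }
    const std::uint64_t perSite = static_cast<std::uint64_t>(nLinks) * nc * nc
                                  * 2 * bytesPerReal;  // 2 reals per complex
    return sites * perSite;
}

// Size of a file in bytes, or -1 if it cannot be opened.
std::int64_t fileBytes(const std::string &path)
{
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    if (!f)
    {
        return -1;
    }
    return static_cast<std::int64_t>(f.tellg());
}

// True only if the file exists and its size equals the expected payload.
bool fileMatchesGeometry(const std::string &path, const std::vector<int> &dims,
                         int nc, int nLinks, int bytesPerReal)
{
    const std::int64_t actual = fileBytes(path);
    return actual >= 0
           && static_cast<std::uint64_t>(actual)
                  == expectedBinaryBytes(dims, nc, nLinks, bytesPerReal);
}
```

Comparing sizes catches a grid that is smaller or larger than the stored configuration, but it cannot distinguish two geometries with the same volume (for example permuted dimensions).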

Cannot link an app using MGauge::FundtoHirep

I'm trying to do spectroscopy in the adjoint representation; I believe that to do this I would need to create a gauge field and then use MGauge::FundtoHirep<AdjointRepresentation> (aka MGauge::FundtoAdjoint) to re-represent it.

Initially this won't compile, as both FundtoAdjoint and FundtoHirep aren't recognised. Uncommenting the line:

//MODULE_REGISTER_TMP(FundtoAdjoint,   TFundtoHirep<AdjointRepresentation>, MGauge);

from Hadrons/Modules/MGauge/FundtoHirep.hpp allows the app to compile, but it still fails to link due to undefined reference errors:

main.o: In function `std::_Function_handler<std::unique_ptr<Grid::Hadrons::ModuleBase, std::default_delete<Grid::Hadrons::ModuleBase> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), Grid::Hadrons::MGauge::MGaugeFundtoAdjointModuleRegistrar::MGaugeFundtoAdjointModuleRegistrar()::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)#1}>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)':
/home/dp208/dp208/dc-benn2/prefix2/include/Hadrons/Modules/MGauge/FundtoHirep.hpp:66: undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::TFundtoHirep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
main.o: In function `void Grid::Hadrons::VirtualMachine::createModule<Grid::Hadrons::MGauge::FundtoAdjoint>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Grid::Hadrons::MGauge::FundtoAdjoint::Par const&)':
/home/dp208/dp208/dc-benn2/prefix2/include/Hadrons/Modules/MGauge/FundtoHirep.hpp:66: undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::TFundtoHirep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
main.o:(.rodata._ZTIN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTIN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x10): undefined reference to `typeinfo for Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x28): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::getInput[abi:cxx11]()'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x38): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::getOutput[abi:cxx11]()'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x68): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::setup()'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x70): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::execute()'

I imagine this is due to the definition of TFundtoHirep being within the .cpp rather than the .hpp file, so needing an explicit template instantiation somewhere in the code. I tried adding one within FundtoHirep.cpp, but still get the errors above.

I'm probably missing something, but have spent a couple of hours going around in circles, so thought it worth asking what the correct way of doing this is!
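
For reference, the general C++ mechanism at play can be sketched without any Grid or Hadrons types. In the example below TModule, AdjointTag and FundamentalTag are hypothetical stand-ins; the comments mark which parts would normally sit in a header and which in a .cpp file. Whether this matches how Hadrons organises FundtoHirep is not something the sketch can confirm.

```cpp
#include <string>

// Hypothetical representation tags (stand-ins, not Grid types).
struct FundamentalTag { static const char *name() { return "fundamental"; } };
struct AdjointTag     { static const char *name() { return "adjoint"; } };

// --- would live in the header: declaration only ---
template <typename Rep>
class TModule
{
public:
    explicit TModule(const std::string &name);
    std::string describe() const;
private:
    std::string name_;
};

// --- would live in the .cpp file: member definitions ---
template <typename Rep>
TModule<Rep>::TModule(const std::string &name) : name_(name) {}

template <typename Rep>
std::string TModule<Rep>::describe() const
{
    return name_ + ":" + Rep::name();
}

// Explicit instantiation definition: emits the code for this specific
// template argument in the translation unit that holds the definitions,
// so other translation units can link against it.
template class TModule<AdjointTag>;
```

With this layout, a translation unit that only sees the header can link TModule<AdjointTag> but would get undefined references for TModule<FundamentalTag>, since no instantiation of the latter is emitted anywhere. The library containing the .cpp file also has to be rebuilt and reinstalled for a newly added instantiation to become visible to an app.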

Eigenpack writing precision change check

A check could be added when writing an eigenpack in a different precision, stopping the job if norm2(diff) is above some threshold.
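
The idea can be sketched on a plain vector of doubles. The function names and the choice of a relative (rather than absolute) squared norm are assumptions made for illustration; an actual implementation would operate on Grid fields and use Grid's own norm2.

```cpp
#include <vector>

// Relative squared error |v - float(v)|^2 / |v|^2 incurred by rounding each
// component from double to float and back. Returns 0 for a zero vector.
double relativeTruncationError2(const std::vector<double> &v)
{
    double diff2 = 0.0;
    double ref2  = 0.0;
    for (double x : v)
    {
        const double rounded = static_cast<double>(static_cast<float>(x));
        diff2 += (x - rounded) * (x - rounded);
        ref2  += x * x;
    }
    return ref2 > 0.0 ? diff2 / ref2 : 0.0;
}

// Hypothetical guard: allow the single-precision write only if the rounding
// error stays at or below a user-chosen threshold.
bool safeToWriteInSinglePrecision(const std::vector<double> &v, double threshold)
{
    return relativeTruncationError2(v) <= threshold;
}
```

A caller would evaluate the guard for each eigenvector before writing and abort the job when it returns false.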

Time profile breakdown missing modules

The time profile breakdown is missing modules that have run.
This is likely due to using a map with the time as the key.

rtiming[t.second.count()] = t.first;
From Global.cpp in Hadrons::printTimeProfile

So modules that take the same amount of time will collide and overwrite each other.
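
The collision is easy to reproduce in isolation. The sketch below uses plain integer tick counts instead of the duration type in Global.cpp, and the function names are made up; it contrasts the overwriting std::map inversion with a std::multimap, which keeps entries with equal keys. Switching to a multimap is one possible fix, not necessarily the one adopted in Hadrons.

```cpp
#include <map>
#include <string>

// Lossy inversion, mirroring rtiming[t.second.count()] = t.first:
// modules with equal times overwrite each other.
std::map<long, std::string> invertLossy(const std::map<std::string, long> &timing)
{
    std::map<long, std::string> rtiming;
    for (const auto &t : timing)
    {
        rtiming[t.second] = t.first;
    }
    return rtiming;
}

// Inversion that keeps ties: a multimap allows several entries per key.
std::multimap<long, std::string> invertKeepingTies(const std::map<std::string, long> &timing)
{
    std::multimap<long, std::string> rtiming;
    for (const auto &t : timing)
    {
        rtiming.emplace(t.second, t.first);
    }
    return rtiming;
}
```

Iterating over the multimap still visits entries in order of increasing time, so a sorted breakdown can be printed as before.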

Spring cleaning

The code needs some cleaning:

Remove the Grid namespace
Check parameter name consistency
Populate all the getOutputFilenames functions after DB update
Use the new classes from FieldIo.hpp wherever appropriate
