aportelli / hadrons
Grid-based workflow management system for lattice field theory simulations
License: GNU General Public License v2.0
Hi, we had to make some modifications to Hadrons/Modules/MContraction/Baryon.hpp.
The Intel compiler wanted the last argument to regex_replace to be wrapped in a std::string().
For example:
gammaString = regex_replace(gammaString, std::regex("j12"),std::string("(Identity SigmaXZ)"));
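A self-contained sketch of the same workaround, for anyone who wants to try it against their own compiler. The helper name expandJ12 is hypothetical and not part of Hadrons; only the regex_replace call mirrors the line above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper, not part of Hadrons: expands the "j12" alias in a
// gamma string, passing the replacement as an explicit std::string.
std::string expandJ12(const std::string &gammaString)
{
    std::string result = std::regex_replace(gammaString, std::regex("j12"),
                                            std::string("(Identity SigmaXZ)"));

    // After the substitution no "j12" alias should remain.
    assert(result.find("j12") == std::string::npos);

    return result;
}
```

Strings without the alias are returned unchanged, and every occurrence of the alias is replaced.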
I'm afraid I've had zero luck compiling Grid+Hadrons for the Skylake cluster at Cambridge over the past couple of weeks, and with Tuesday's DiRAC technical case deadline approaching I need to ask for help. Back in Grid issue 175 @aportelli mentioned building successfully at Cambridge, so I'm hoping there may be some world-readable build there from which I could copy workable options and module choices. I have tried to divine some of this from CI builds like this one, so far without success.
I've had several dozen builds fail on this system over the past couple of weeks. In case it might be helpful, I attach some logs from my latest attempt, which fails with the HDF5-related errors mentioned in the subject. (Grid's make check returns an error code but all the tests seem to say they're OK, which is why I'm opening this under Hadrons rather than Grid.)
From these logs I quote:
Grid commit:
commit: 7cf7f11e
branch: develop
Hadrons commit:
HADRONS_BRANCH='develop'
HADRONS_SHA='82ffe4f24a0684c37598440290f35c0b33d10b58'
Grid configure line:
$ /home/dc-scha3/dev/Grid-Sky/configure --prefix=/home/dc-scha3/dev/Grid-Sky/install --with-lime=/home/dc-scha3/dev/c-lime-Sky/install --enable-precision=double --enable-simd=AVX512 --enable-comms=mpi3 --enable-mkl CC=mpiicc CXX=mpiicpc CXXFLAGS="-std=c++11 -qoverride-limits"
Full CPU description:
$ cat /proc/cpuinfo | grep 'model name' | uniq
model name : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
Full compiler (for this latest attempt):
$ mpiicpc -v | grep -v Copyright
mpiicpc for the Intel(R) MPI Library 2019 Update 3 for Linux*
icpc version 19.0.3.192 (gcc version 8.4.0 compatibility)
For obvious reasons I've made many attempts to try to use what look like the most promising of the 45 different hdf5 modules on offer at Cambridge. The attached messy notes summarize the main ways these attempts have failed. Typically I get either
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
#000: H5T.c line 1712 in H5Tcopy(): not a datatype or dataset
major: Invalid arguments to routine
minor: Inappropriate type
in Grid's make check, or the undefined references in the attached Hadrons make log, in both cases with warnings about
ld: warning: libhdf5.so.8, needed by /lib/../lib64/libhdf5_cpp.so, may conflict with libhdf5.so.103
Google hasn't yet helped me with that.
Thanks in advance for any hints you may be able to offer.
Attachments:
build_script.txt
build_notes.txt
module_list.txt
Grid_config.log
grid.configure.summary.txt
Grid_make.log
Grid_make-install.log
Grid_make-check.log
Hadrons_config.log
Hadrons_make.log
Grid_make-check-old-HDF5.log
Running a module with an output used by no other module causes the following db error
database error: error executing query 'INSERT INTO "objects" VALUES(16,'module_name',4,0,'undef',4294967295);'
which does not directly hint at the problem.
Even after the fix in #86, one cannot use Hadrons for many things in higher representations, as none of the modules support any inputs other than the fundamental rep.
To rectify this, most modules need to have higher-representation versions registered and instantiated.
The most obvious way I can see to achieve this is to take lines 75–80 of Global.hpp and duplicate them to create FIMPL_ADJ, FIMPL_ASY and FIMPL_SYM (or similar), and then look for every (or at least almost every) instantiation of a module using FIMPL and duplicate it for the three higher representations.
I have a prototype that takes this approach just far enough to allow a spectrum computation in the adjoint rep (so editing just MSource::Z2, MAction::Wilson, MSolver::RBPrecCG, MFermion::GaugeProp, and MContraction::Meson, and referring to WilsonAdjImplR explicitly), and it does seem to work (giving results compatible with HiRep's measure_spectrum using Z2SEMWall sources).
(The solution is complicated by contractions—in that case there would be a minor combinatoric explosion if every combination of representations had to be dealt with (which presumably it would need to if you wanted to allow for chimera baryons and such things). I'm not sure how best to deal with that.)
@aportelli Does that seem a reasonable starting point for an implementation? (I ask because I'm happy to implement and PR it.) Or do you prefer a different solution that I haven't spotted? (Or is there some other way that I've missed of doing this without needing to implement the template instantiations and module registrations in the Hadrons source, e.g. registering my own versions of the modules? The instantiation issues I've had suggest that that won't work, but it's entirely possible that I've missed something.)
Thanks!
I've noticed in recent work that the Meson contraction outputs a rather different XML structure to the WardIdentity contraction. This means downstream tooling to read and analyse these quantities needs to account separately for the two classes of input, rather than using a single generic reader.
Is there a strong reason for this choice? Would it be possible to make them more consistent?
I found a bug in a job that processes multiple trajectories and uses the naive scheduler. I get the following error at the start of the second trajectory:
HadronsXmlRun: ../../Hadrons/VirtualMachine.cpp:808: Grid::Hadrons::VirtualMachine::makeGarbageSchedule(const Program&) const::<lambda(unsigned int)>: Assertion `it != p.end()' failed.
Not so much an issue as a health warning: both @felixerben and I have found that compiling with clang-10 and clang-11 fails unless CXXFLAGS=-pthread is added when building Hadrons on laptops.
Open questions:
Hello! Recently I've been trying to better understand Hadrons and was looking at the Meson module. The code & documentation say the meson 2-pt function can be computed with three options for the sink parameter. I have attempted to do exactly this and am currently confused by the results. A summary of the problem is:
Using an MSink object (I believe a ScalarSink) reproduces the pion correlation function in Test_hadrons_spectrum.cpp.
Using the same MSource object as used to compute the propagator produces a correlator with only one non-zero entry at t=0.
Using a SlicedPropagator produces a symmetric correlation function, but not one that decays exponentially.
Firstly here are a few time slices of each correlator to showcase the problem
meson correlator - ScalarSink
t=0 (0.47442148, 0.)
t=1 (0.14026629, 0.)
t=2 (0.01591819, 0.)
meson correlator - sources
t=0 (0.00053426, 0.)
t=1 (0., 0.)
t=2 (0., 0.)
meson correlator - Slice Propagator
t=0 (0.16742971, 0.)
t=1 (5.06398922, 0.)
t=2 (0.4132842, 0.)
t=3 (1.03891438, 0.)
t=4 (0.40092583, 0.)
t=5 (1.03891438, 0.)
I've also attached the main.cpp.txt main file which computes the correlator each way.
I appreciate I may just be misusing the other sink methods, and would appreciate any help.
I'm working with @LupoA trying to benchmark Hadrons on Tursa, and am hitting an issue where the nest of templates seemingly prevents Hadrons compiling with CUDA. A number of modules (including MGauss, which we need) give the error:
/home/dp208/dp208/dc-benn2/prefix_su4test/include/Grid/threads/Accelerator.h(160): Error: Formal parameter space overflowed (4248 bytes required, max 4096 bytes allowed) in function _ZN4Grid11LambdaApplyIZNS_7LatticeINS_7iScalarINS_7iMatrixINS3_INS_9Grid_simdIN6thrust7complexIdEENS_9GpuVectorILi4ENS_10GpuComplexI7double2EEEEEELi4EEELi4EEEEEEaSINS_9BinaryMulENS1_INS2_INS2_INS2_ISD_EEEEEEEENS2_INS3_INS3_IS7_Li4EEELi4EEEEEEERSH_RKNS_23LatticeBinaryExpressionIT_T0_T1_EEEUlmmmE_EEvmmmT_
CPU compilation is fine.
Do you have any idea how fixing this could be approached, beyond going into Grid and renaming all of the type names to something shorter?
Thanks!
In the first version of Baryon.hpp, which has been part of Hadrons until now, there was a bug regarding baryons which have an interpolator that is a linear combination of two or more quark combinations (like the sigma^0, lambda, ...). The problem is that the quark content is not passed in the correct (shuffled) order to the Grid function in lines 256-261:
for (int iQ1 = 0; iQ1 < nQ; iQ1++){
    for (int iQ2 = 0; iQ2 < nQ; iQ2++){
        BaryonUtils<FIMPL>::ContractBaryons_Sliced(q1[t],q2[t],q3[t],gAl,gBl,gAr,gBr,quarks[iQ1].c_str(),quarks[iQ2].c_str(),parity,ch);
        cs += prefactors[iQ1]*prefactors[iQ2]*ch;
    }
}
'Simple' baryons like nucleon, omega, sigma^+/-, Delta, etc. are not affected.
This issue will be solved with Raoul's pull request, which will probably be merged into develop soon.
Hadrons crashes at runtime on GPU when executing the MAction::DWFF module. The error is
r6i6n5.419883HadronsXmlRun: CUDA failure: cuIpcOpenMemHandle() (at /nfs/site/home/phcvs2/gitrepo/ifs-all/components/psm/temp.build-cuda/BUILD/libpsm2-11.2.89/ptl_am/am_cuda_memhandle_cache.c:281)returned 208
r6i6n3.432920Error returned from CUDA function.
Error does not seem to occur on a single node.
I can run a job from a database using xml such as the attached:
Z2_s.3080.xml.txt
However, when I set restoreMemoryProfile=true, the job fails in the first meson contraction with an error such as:
Hadrons : Message : 7747.190612 s : ---------------- Measurement step 461/11249 (module 'meson_3pt_s_GFPW_quark_h0_h2_gT5_dt_20_p_0_0_0_t_16') ----------------
Hadrons : Message : 7747.190628 s : ................ Module execution
Hadrons : Message : 7747.190667 s : Computing meson contractions 'meson_3pt_s_GFPW_quark_h0_h2_gT5_dt_20_p_0_0_0_t_16' using quarks 'Prop_GFPW_h2_p_0_0_0_t_16' and 'PropSeq_GFPW_h0_gT5_dt_20_ps_0_0_0_s_p_0_0_0_t_16'
HadronsXmlRun: /home/dp008/dp008/dc-mars3/.local/Double/include/Grid/lattice/Lattice_base.h:50: void Grid::conformable(Grid::GridBase *, Grid::GridBase *): Assertion `lhs == rhs' failed.
The backtrace shows this happens at Meson.hpp:244, i.e.
buf = sink(c);
This happens repeatably for jobs with circa 10k modules. I did notice this once during testing on my laptop, but assumed it was user error as it did not recur.
Hadrons memory management 'sees' different sizes for objects depending on whether --debug-mem is used or not. An example is the "memory needed" Hadrons message compared between 2 identical runs differing only by the use of --debug-mem:
Without --debug-mem:
Hadrons : Message : 0.748221 s : Schedule (memory needed: 79.3 MB):
With --debug-mem:
Hadrons : Message : 0.849924 s : Schedule (memory needed: 1.4 GB):
I'm trying to use Hadrons on Tursa, but am not able to compile an app.
What I have tried:
../configure \
--prefix ${HOME}/prefix2 \
--enable-comms=mpi \
--enable-simd=GPU \
--enable-shm=nvlink \
--enable-gen-simd-width=64 \
--enable-accelerator=cuda \
--enable-Nc=2 \
--enable-accelerator-cshift \
--disable-unified \
--disable-gparity \
--with-lime=${HOME}/prefix \
CXX=nvcc \
LDFLAGS="-cudart shared" \
CXXFLAGS="-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared"
../configure --with-grid=${HOME}/prefix2 --prefix=${HOME}/prefix2
../configure --with-grid=${HOME}/prefix2 --with-hadrons=${HOME}/prefix2
The configure step fails, as it sets CXX=g++ rather than the nvcc that was used for compiling Grid and Hadrons. This conflicts with the CXXFLAGS, which are designed to work with nvcc and are not recognised by g++.
configure:4458: g++ -o conftest -g -O2 -I/home/dp208/dp208/dc-benn2/prefix2/include -I/home/dp208/dp208/dc-benn2/prefix2/include -I/home/dp208/dp208/dc-benn2/prefix2/include -I/home/dp208/dp208/dc-benn2/prefix/include -O3 -ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared -Xcompiler -fno-strict-aliasing --expt-extended-lambda --expt-relaxed-constexpr -Xcompiler -fopenmp -L/home/dp208/dp208/dc-benn2/prefix2/lib -L/home/dp208/dp208/dc-benn2/prefix2/lib -L/home/dp208/dp208/dc-benn2/prefix2/lib -L/home/dp208/dp208/dc-benn2/prefix/lib -cudart shared -Xcompiler -fopenmp conftest.cpp -lHadrons -ldl -lGrid -lz -lcrypto -llime -lmpfr -lgmp -lstdc++ -lm -lcuda -lz >&5
g++: error: mpicxx: No such file or directory
g++: error: unrecognized debug output level 'encode'
g++: error: arch=compute_80,code=sm_80: No such file or directory
g++: error: shared: No such file or directory
g++: error: shared: No such file or directory
g++: error: unrecognized command line option '-ccbin'
g++: error: unrecognized command line option '-cudart'
g++: error: unrecognized command line option '-Xcompiler'; did you mean '--compile'?
g++: error: unrecognized command line option '--expt-extended-lambda'
g++: error: unrecognized command line option '--expt-relaxed-constexpr'
g++: error: unrecognized command line option '-Xcompiler'; did you mean '--compile'?
g++: error: unrecognized command line option '-cudart'
g++: error: unrecognized command line option '-Xcompiler'; did you mean '--compile'?
If I manually set CXX=nvcc, then I get a huge number of errors where nvcc doesn't recognise the CUDA functions and datatypes; the first couple are, for example:
/home/dp208/dp208/dc-benn2/prefix2/include/Grid/threads/Accelerator.h:109:8: error: 'cudaStream_t' does not name a type; did you mean 'CUstream_st'?
extern cudaStream_t copyStream;
^~~~~~~~~~~~
CUstream_st
/home/dp208/dp208/dc-benn2/prefix2/include/Grid/threads/Accelerator.h:106:28: error: '__host__' does not name a type; did you mean 'CUhostFn'?
#define accelerator_inline __host__ __device__ inline
Manually setting CXXFLAGS="" doesn't make any difference, as configure picks up the additional CXXFLAGS from hadrons-config.
Am I missing a step or flags that would allow me to configure to compile for GPU, or is this not currently supported?
Many thanks
I'm trying to install Hadrons over a Grid build on the Crusher AMD machine at ORNL.
I installed Grid with lime and tested that Grid code works well.
With Grid installed here at <grid_dir>, and the grid prefix dir (with lime, mpfr) in <grid_dir>/grid_prefix , I'm building Hadrons in the following way, using the instructions in: https://aportelli.github.io/Hadrons-doc/#/install
source setup_env.sh
git clone git@github.com:aportelli/Hadrons.git
cd Hadrons
./bootstrap.sh
mkdir build
cd build
../configure --prefix=<grid_dir>/grid_prefix --with-grid=<grid_dir>/grid_prefix
make -j 14
At this stage, the code fails with a linker error:
ld.lld: error: undefined symbol: Grid::Hadrons::MContraction::TBaryonGamma3pt<Grid::WilsonImpl<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2u> > > >, Grid::FundamentalRep<3>, Grid::CoeffReal> >::parseGammaLRString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::pair<std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra>, std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra> >, std::allocator<std::pair<std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra>, std::pair<Grid::Gamma::Algebra, Grid::Gamma::Algebra> > > >&)
>>> referenced by Application.cpp
>>> Application.o:(vtable for Grid::Hadrons::MContraction::BaryonGamma3pt) in archive ../Hadrons/libHadrons.a
ld.lld: error: undefined symbol: typeinfo for Grid::Hadrons::MContraction::TBaryonGamma3pt<Grid::WilsonImpl<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<HIP_vector_type<double, 2u> > > >, Grid::FundamentalRep<3>, Grid::CoeffReal> >
>>> referenced by Application.cpp
>>> Application.o:(typeinfo for Grid::Hadrons::MContraction::BaryonGamma3pt) in archive ../Hadrons/libHadrons.a
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [Makefile:423: HadronsXmlRun] Error 1
make[2]: *** Waiting for unfinished jobs....
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)
I've tested for both N_c = 3 and 4, and both give the same errors.
To make sure the configure is correct, I tried the tests in the folder Hadrons/build/tests
I can successfully do the tests for :
However, Test_free_prop fails with a similar linker error.
Tobi noticed that the MIO::LoadBinary module does not check whether the file fits the lattice geometry. If Hadrons is run with a grid that is smaller than the configuration in the file, it does not throw any error or warning. As the binary file format does not contain metadata like the plaquette, there is no check within Grid.
A simple solution would be to check the file size against the expected size based on the lattice geometry within MIO::LoadBinary.
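A minimal sketch of such a size check, assuming the binary file contains nothing but field data with a fixed number of bytes per site. The function names are hypothetical and the actual MIO::LoadBinary layout may differ; the 576-byte figure in the usage note is only an illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Expected size in bytes: number of lattice sites times bytes stored per site.
std::uint64_t expectedFileSize(const std::vector<int> &dims, std::uint64_t bytesPerSite)
{
    std::uint64_t volume = 1;

    for (int d : dims)
    {
        assert(d > 0); // every lattice extent must be positive
        volume *= static_cast<std::uint64_t>(d);
    }

    return volume * bytesPerSite;
}

// Size of the file in bytes, or -1 if it cannot be opened.
std::int64_t fileSize(const std::string &path)
{
    std::ifstream f(path, std::ios::binary | std::ios::ate);

    if (!f)
    {
        return -1;
    }

    return static_cast<std::int64_t>(f.tellg());
}

// True only if the file exists and its size matches the lattice geometry.
bool fileMatchesGeometry(const std::string &path, const std::vector<int> &dims,
                         std::uint64_t bytesPerSite)
{
    const std::int64_t size = fileSize(path);

    return (size >= 0)
           && (static_cast<std::uint64_t>(size) == expectedFileSize(dims, bytesPerSite));
}
```

For example, a 4x4x4x8 lattice with 4 links of 3x3 double-precision complex numbers per site (576 bytes) would be expected to occupy 294912 bytes; a module could refuse to load any file for which fileMatchesGeometry returns false.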
I'm trying to do spectroscopy in the adjoint representation; I believe that to do this I would need to create a gauge field and then use MGauge::FundtoHirep<AdjointRepresentation> (aka MGauge::FundtoAdjoint) to re-represent it.
Initially this won't compile, as neither FundtoAdjoint nor FundtoHirep is recognised. Uncommenting the line:
//MODULE_REGISTER_TMP(FundtoAdjoint, TFundtoHirep<AdjointRepresentation>, MGauge);
from Hadrons/Modules/MGauge/FundtoHirep.hpp allows the app to compile, but it still fails to link due to undefined reference errors:
main.o: In function `std::_Function_handler<std::unique_ptr<Grid::Hadrons::ModuleBase, std::default_delete<Grid::Hadrons::ModuleBase> > (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), Grid::Hadrons::MGauge::MGaugeFundtoAdjointModuleRegistrar::MGaugeFundtoAdjointModuleRegistrar()::{lambda(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)#1}>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)':
/home/dp208/dp208/dc-benn2/prefix2/include/Hadrons/Modules/MGauge/FundtoHirep.hpp:66: undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::TFundtoHirep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
main.o: In function `void Grid::Hadrons::VirtualMachine::createModule<Grid::Hadrons::MGauge::FundtoAdjoint>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Grid::Hadrons::MGauge::FundtoAdjoint::Par const&)':
/home/dp208/dp208/dc-benn2/prefix2/include/Hadrons/Modules/MGauge/FundtoHirep.hpp:66: undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::TFundtoHirep(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
main.o:(.rodata._ZTIN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTIN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x10): undefined reference to `typeinfo for Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x28): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::getInput[abi:cxx11]()'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x38): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::getOutput[abi:cxx11]()'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x68): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::setup()'
main.o:(.rodata._ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE[_ZTVN4Grid7Hadrons6MGauge13FundtoAdjointE]+0x70): undefined reference to `Grid::Hadrons::MGauge::TFundtoHirep<Grid::AdjointRep<2> >::execute()'
I imagine this is due to the definition of TFundtoHirep being within the .cpp rather than the .hpp file, so needing an explicit template specialisation somewhere in the code. I tried adding one within FundtoHirep.cpp, but still get the errors above.
I'm probably missing something, but have spent a couple of hours going around in circles, so thought it worth asking what the correct way of doing this is!
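For reference, the general C++ mechanism in play when template member definitions live in a .cpp file is usually called explicit instantiation (rather than specialisation). The single-file sketch below illustrates the pattern only; TConverter, RepA and RepB are hypothetical names, not Hadrons or Grid types, and it does not claim to reproduce or fix the link errors above.

```cpp
#include <cassert>
#include <string>

// --- what would normally sit in the header: declarations only ---
template <typename Rep>
class TConverter
{
public:
    explicit TConverter(const std::string &name);
    std::string describe() const;
private:
    std::string name_;
};

// Two stand-in "representations".
struct RepA { static const char *label() { return "A"; } };
struct RepB { static const char *label() { return "B"; } };

// --- what would normally sit in the .cpp: member definitions ---
template <typename Rep>
TConverter<Rep>::TConverter(const std::string &name) : name_(name)
{
    assert(!name_.empty()); // a module needs a non-empty name
}

template <typename Rep>
std::string TConverter<Rep>::describe() const
{
    return name_ + ":" + Rep::label();
}

// Explicit instantiation definitions: these make the compiler emit the
// constructor, describe() and the rest of the class for the listed types in
// this translation unit, so other translation units can link against them.
template class TConverter<RepA>;
template class TConverter<RepB>;
```

With this layout, only the types listed in the explicit instantiations are available to code that sees just the header. The library containing the .cpp also has to be rebuilt and relinked before a newly added instantiation becomes visible to an application.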
It would be useful to have a check, when writing an eigenpack in a different precision, that stops the job if norm2(diff) is above some threshold.
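A rough sketch of what such a check might compute, using plain std::vector<double> in place of Grid lattice fields. The function names and the choice of an absolute threshold are assumptions for illustration only.

```cpp
#include <cassert>
#include <vector>

// Squared L2 norm of the difference between a double-precision vector and
// the same vector after a round trip through single precision.
double roundTripNorm2(const std::vector<double> &v)
{
    double n2 = 0.;

    for (double x : v)
    {
        const double diff = x - static_cast<double>(static_cast<float>(x));
        n2 += diff * diff;
    }

    return n2;
}

// True if the precision loss is within the (absolute) threshold.
bool safeToWriteSingle(const std::vector<double> &v, double threshold)
{
    assert(threshold >= 0.); // a negative threshold makes no sense

    return roundTripNorm2(v) <= threshold;
}
```

A writer could evaluate this per eigenvector and abort the job when safeToWriteSingle returns false; a threshold relative to norm2 of the original vector may be more appropriate in practice.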
The time profile breakdown is missing modules that have run.
This is likely due to using a map with the time as the key:
rtiming[t.second.count()] = t.first;
This is from Hadrons::printTimeProfile in Global.cpp, so modules that take the same amount of time will collide and overwrite each other.
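The collision can be reproduced in isolation. The sketch below uses a simplified, hypothetical profile type (module name mapped to an integer time) rather than the duration type used in Hadrons, and contrasts std::map with std::multimap, which is one possible way to keep entries with equal times.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the profile: module name -> elapsed time (integer ticks).
using TimeProfile = std::map<std::string, long long>;

// Reversal keyed by time with std::map, as in the quoted line:
// modules with equal times overwrite each other.
std::map<long long, std::string> reverseWithMap(const TimeProfile &timing)
{
    std::map<long long, std::string> rtiming;

    for (auto &t : timing)
    {
        rtiming[t.second] = t.first;
    }

    return rtiming;
}

// Same reversal with std::multimap, which keeps one entry per module
// even when several modules share the same time.
std::multimap<long long, std::string> reverseWithMultimap(const TimeProfile &timing)
{
    std::multimap<long long, std::string> rtiming;

    for (auto &t : timing)
    {
        rtiming.emplace(t.second, t.first);
    }
    assert(rtiming.size() == timing.size()); // nothing was dropped

    return rtiming;
}
```

With three modules of which two take the same time, reverseWithMap returns two entries while reverseWithMultimap returns all three, still sorted by time.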
The code needs some cleaning:
getOutputFilenames functions after DB update
FieldIo.hpp wherever appropriate