diatomic / diy
data-parallel out-of-core library
License: Other
In bov.hpp, should the order passed to MPI_Type_create_subarray be MPI_ORDER_FORTRAN instead of MPI_ORDER_C? [x][y][z] order is column-major, i.e., Fortran order.
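To make the question concrete, here is a minimal sketch of the two linearizations the MPI order constants describe. The helper names offset_c_order and offset_fortran_order are hypothetical and are not part of bov.hpp; the sketch only shows which index varies fastest in each convention, not which one bov.hpp ought to use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only.

// Linear offset with the LAST index varying fastest (the layout that
// MPI_ORDER_C describes).
inline std::size_t offset_c_order(const std::array<std::size_t, 3>& shape,
                                  const std::array<std::size_t, 3>& idx)
{
    return (idx[0] * shape[1] + idx[1]) * shape[2] + idx[2];
}

// Linear offset with the FIRST index varying fastest (the layout that
// MPI_ORDER_FORTRAN describes).
inline std::size_t offset_fortran_order(const std::array<std::size_t, 3>& shape,
                                        const std::array<std::size_t, 3>& idx)
{
    return idx[0] + shape[0] * (idx[1] + shape[1] * idx[2]);
}
```

With shape (2, 3, 4), stepping the first index by one moves 12 elements in C order but only 1 element in Fortran order, so picking the wrong constant silently transposes the data.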
Issue by Hadrien Croubois
Friday Mar 06, 2015 at 22:09 GMT
Let's say we have particles distributed among blocks. If you want to use wrapping, you can check whether those particles intersect the block bounds using diy::near.
Given an OutIter, diy::near will append the identifiers of the affected neighbours, telling us which neighbours we should send those particles to.
However, I noticed very strange results when the grid size along a dimension is 1. The worst-case scenario is a single block: wrapping should send to the same block, and the direction l->direction(neighbour) would help determine which face we went through.
I'm starting to think something is wrong with diy::near, even though I can't say exactly what.
Add more out-of-core tests in light of Utkarsh's finding.
Issue by Dmitriy Morozov
Wednesday Nov 26, 2014 at 18:05 GMT
Currently, foreach() has an optional boolean that tells it to skip a block if it has no incoming queues. Specifically, the block won't be loaded (if it's unloaded), but the callback will be called with 0 as the block pointer. This behavior is useful, for example, to set collectives (in the "until_done" type of execution).
We need a more general behavior. The specific need is introduced by reduce(), where partners.active() indicates which blocks need to be loaded and which do not. Currently, there is no mechanism for skipping them.
Proposed solution.
The argument to foreach should be a struct with at least three functions. The first one, call it pre(), is given either lid or gid and an instance of master. Its job is to determine whether the block needs to be loaded. The second one, call it normal(), is called if the block was loaded in memory. The third one, call it absent(), is called with gid and master proxy when the block is to be skipped. This callback is necessary, for example, to execute collectives (e.g., setting the done flag in the distance function code). absent() should probably be called with the same signature as the normal() callback, but with the block pointer set to 0. That way, one could potentially use the same function for both.
Naturally, an adaptor ought to be provided so that the current single-callback mechanism works (by always loading the block).
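The proposal can be sketched in a few lines. Everything below is a hypothetical stand-in: Block, ForeachCallbacks, always_load, and foreach_blocks are invented for illustration, pre() is simplified to take only a gid, and "loading" a block is modelled as taking its address. It is not DIY's actual interface.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins; none of these names come from DIY itself.
struct Block { int gid; int value; };

struct ForeachCallbacks
{
    std::function<bool(int gid)> pre;             // should the block be loaded?
    std::function<void(Block*, int gid)> normal;  // block is in memory
    std::function<void(Block*, int gid)> absent;  // block skipped, pointer is null
};

// Adaptor for the existing single-callback style: always load the block.
inline ForeachCallbacks always_load(std::function<void(Block*, int)> f)
{
    return ForeachCallbacks{ [](int) { return true; }, f, f };
}

// Toy driver: a real implementation would load or skip out-of-core blocks.
inline void foreach_blocks(std::vector<Block>& blocks, const ForeachCallbacks& cb)
{
    for (Block& b : blocks)
    {
        if (cb.pre(b.gid))
            cb.normal(&b, b.gid);
        else
            cb.absent(nullptr, b.gid);   // same signature, null block pointer
    }
}
```

Because normal() and absent() share a signature, a caller that does not care about the distinction can pass the same function for both and test the pointer.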
Issue by Hadrien Croubois
Saturday Mar 14, 2015 at 03:58 GMT
I noticed that the diy::mpi::io::file functions use a status structure for status management. However, those objects are allocated on the stack inside the function and are not valid later on. That means that if the call is not fast enough, MPI might access a nonexistent object and possibly cause a segfault-like error. Also, we don't have control over the object.
What could be done:
Take a status object as an optional argument of the function, which would enable users to provide an object they can manage.
Use MPI_STATUS_IGNORE to avoid any errors.
Issue by Hadrien Croubois
Monday Mar 02, 2015 at 22:21 GMT
I noticed that diy::io::write_blocks locks up when the number of blocks is not the same on every process.
I am having difficulty using DIY. Either I am not fully understanding how to use reduce and all-to-all, or there may be a bug. I have been using DIY for my image compositing, and it works in most cases, but in a few configurations, I am getting MPI errors and seg faults.
I have simplified my code and posted it here
The readme lists two tests that fail for me across machines and compilers. Can you take a look and see if I am doing something wrong?
Issue by Dmitriy Morozov
Monday Apr 13, 2015 at 16:25 GMT
Need to double check that Master::get() always respects the block limit. It's not uncommon to use it during the final output, so it should respect the memory limits.
Issue by Dmitriy Morozov
Friday Jan 16, 2015 at 22:28 GMT
In the out-of-core regime, there is room to schedule how blocks are ordered during diy::reduce(). In principle, the full order of active blocks is known ahead of time, so one could reorder the blocks optimally so that the number of movements of blocks out of core is minimized. We need to figure out which algorithm to use for this and implement this feature.
Issue by Dmitriy Morozov
Friday Nov 14, 2014 at 18:20 GMT
Currently, decompose() calls the create callback serially. This is useful because it allows one to call MPI functions from create(), where the most common use-case is calling MPI-IO functions to load the data into the block. However, since it's desirable for IO-efficiency to combine the loading of the data with the first computation, it is common to perform initial local computation inside create(). When DIY is allowed to use multiple threads, this creates a problem since it forces serial execution of the local computation.
Proposed solution. Pass two callbacks to decompose(): serial and parallel. The first one would be called in serial and is allowed to use MPI routines directly. The second one will be spun into a new thread, with multiple threads executing at the same time. It's allowed to use only DIY routines for communication.
Note: this would require much tighter coupling between Master and decompose() than exists now.
Issue by Hadrien Croubois
Monday Mar 02, 2015 at 15:26 GMT
At this time, MPI_File_close is called in the destructor.
Therefore, if the object isn't dynamically allocated, it may not be destroyed by the time MPI_Finalize is called.
MPI_Init(&argc, &argv);
...
std::vector<float> pts;
...
diy::mpi::io::file file(world, path, diy::mpi::io::file::rdonly);
pts.resize(file.size()/sizeof(float));
file.read_at_all(0ul, pts);
...
MPI_Finalize();
This could be solved by adding a close() method that closes the file and gives a safe value to fh (I can't find any reference about that, but something like -1 might be OK).
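A minimal sketch of such a close() method follows. It uses a hypothetical integer Handle with -1 as the "closed" value so that it compiles without MPI; the File class and its close counter are invented for illustration. With real MPI the sentinel would more naturally be MPI_FILE_NULL, which MPI_File_close itself assigns to the handle it closes.

```cpp
#include <cassert>

// Hypothetical stand-in for an MPI file handle; with real MPI this role is
// played by MPI_File and the sentinel by MPI_FILE_NULL.
using Handle = int;
constexpr Handle null_handle = -1;

class File
{
public:
    File(Handle h, int* close_count) : fh_(h), close_count_(close_count) {}
    ~File() { close(); }               // still closes if the user forgot

    File(const File&) = delete;
    File& operator=(const File&) = delete;

    // Safe to call more than once: only the first call releases the handle.
    void close()
    {
        if (fh_ == null_handle) return;
        ++*close_count_;               // placeholder for the real close call
        fh_ = null_handle;
    }

    bool is_open() const { return fh_ != null_handle; }

private:
    Handle fh_;
    int* close_count_;
};
```

With this shape, the snippet above could call file.close() right before MPI_Finalize(), and the destructor that runs afterwards would find the handle already reset and do nothing. Wrapping the file object in its own { } scope that ends before MPI_Finalize() achieves the same effect without any API change.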
Issue by Tom Peterka
Friday Jan 16, 2015 at 15:12 GMT
[ 10%] Building CXX object src/CMakeFiles/tess.dir/tess.cpp.o
In file included from /Users/tpeterka/software/tess/2-dev/src/tess.cpp:44:
/Users/tpeterka/software/diy2/include/diy/pick.hpp:133:19: warning: array index 3 is past the end of
the array (which contains 3 elements) [-Warray-bounds]
new_pt[3] = p[3] - r;
^ ~
/Users/tpeterka/software/diy2/include/diy/pick.hpp:41:13: note: in instantiation of function
template specialization 'diy::detail::shift<float [3]>' requested here
detail::shift(new_pt, p, r, link.direction(n), link.dimension());
^
In file included from /homes/tpeterka/software/diy/include/diy/collection.hpp:7:0,
from /homes/tpeterka/software/diy/include/diy/master.hpp:12,
from /homes/tpeterka/software/diy/tests/swap-reduce.cpp:10:
/homes/tpeterka/software/diy/include/diy/storage.hpp: In member function ‘virtual void diy::detail::FileBuffer::load_binary_back(char*, size_t)’:
/homes/tpeterka/software/diy/include/diy/storage.hpp:31:134: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
virtual inline void load_binary_back(char* x, size_t count) override { fseek(file, tail, SEEK_END); fread(x, 1, count, file); tail += count; fseek(file, head, SEEK_SET); }
^
/homes/tpeterka/software/diy/include/diy/storage.hpp: In member function ‘virtual void diy::FileStorage::get(int, diy::MemoryBuffer&, size_t)’:
/homes/tpeterka/software/diy/include/diy/storage.hpp:120:41: warning: ignoring return value of ‘ssize_t read(int, void*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
read(fh, &bb.buffer[0], fr.size);
I have attached a simple test that reproduces the issue. The issue is not hit all the time, so running the test multiple times may be needed. A simple workaround is to uncomment the barrier on line 95.
#include <diy/master.hpp>
#include <diy/mpi.hpp>
#include <cassert>
#include <iostream>
#include <vector>

template <typename T>
inline void PrintVector(std::ostream& out, const std::vector<T>& vec)
{
  out << "size = " << vec.size() << ", values = ";
  for (const T& val : vec)
  {
    out << val << " ";
  }
  out << "\n";
}

template <typename T>
void TestEqual(const std::vector<T>& v1, const std::vector<T>& v2)
{
  bool passed = (v1.size() == v2.size());
  if (passed)
  {
    for (std::size_t i = 0; i < v1.size(); ++i)
    {
      if(v1[i] != v2[i])
      {
        passed = false;
        break;
      }
    }
  }
  if (!passed)
  {
    std::cout << "v1: ";
    PrintVector(std::cout, v1);
    std::cout << "v2: ";
    PrintVector(std::cout, v2);
    abort();
  }
}

//-----------------------------------------------------------------------------
template <typename T>
struct Block
{
  T send;
  T received;
};

template <typename T>
void TestSerializationImpl(const T& obj)
{
  diy::mpi::communicator comm;
  Block<T> block;
  {
    diy::Master master(comm);
    auto nblocks = comm.size();
    diy::RoundRobinAssigner assigner(comm.size(), nblocks);
    std::vector<int> gids;
    assigner.local_gids(comm.rank(), gids);
    assert(gids.size() == 1);
    auto gid = gids[0];
    diy::Link* link = new diy::Link;
    diy::BlockID neighbor;
    // send neighbor
    neighbor.gid = (gid < (nblocks - 1)) ? (gid + 1) : 0;
    neighbor.proc = assigner.rank(neighbor.gid);
    link->add_neighbor(neighbor);
    // recv neighbor
    neighbor.gid = (gid > 0) ? (gid - 1) : (nblocks - 1);
    neighbor.proc = assigner.rank(neighbor.gid);
    link->add_neighbor(neighbor);
    block.send = obj;
    master.add(gid, &block, link);
    // compute, exchange, compute
    master.foreach([](Block<T> *b, const diy::Master::ProxyWithLink& cp) {
      cp.enqueue(cp.link()->target(0), b->send);
    });
    master.exchange();
    master.foreach([](Block<T> *b, const diy::Master::ProxyWithLink& cp) {
      cp.dequeue(cp.link()->target(1).gid, b->received);
    });
    //comm.barrier();
  }
  TestEqual(block.send, block.received);
}

//-----------------------------------------------------------------------------
int main(int argc, char *argv[])
{
  diy::mpi::environment mpienv(argc, argv);
  const std::size_t ArraySize = 10;
  for (int i = 0; i < 2; ++i)
  {
    std::vector<float> array(ArraySize);
    const float step = 0.73;
    float curval = 1.33 * static_cast<float>(i);
    for (auto& v : array)
    {
      v = curval;
      curval += step;
    }
    TestSerializationImpl(array);
  }
  std::cout << "Test completed successfuly\n";
  return 0;
}
Issue by Hadrien Croubois
Wednesday Mar 04, 2015 at 21:14 GMT
The following code gives me an error: double free or corruption
std::vector<std::array<float,3>> particles;
master.foreach([&particles](void* b, const diy::Master::ProxyWithLink&, void*){
  particles.push_back({{0.f, 0.f, 0.f}});
});
I'll try to reproduce this error without the lambda function (C++11).
Issue by Dmitriy Morozov
Wednesday Dec 03, 2014 at 22:08 GMT
Currently, if the first computation following domain decomposition involves neighborhood communication, it can be included inside the decomposition callback. This is useful in the out-of-core mode, so that all the work is done on a block while it's in memory, rather than taking two passes (and unloading and then re-loading a block on the second foreach()).
The same coupling is not supported if the first communication is a global reduce(). The issue, by and large, is that decomposition callback is passed the neighborhood link. It knows nothing about the global communication pattern.
Probably, the entire decomposition mechanism needs to be rethought. This change would need to be coordinated with issue #3.
Issue by Dmitriy Morozov
Friday Dec 05, 2014 at 17:59 GMT
Add link serialization. The main issue is polymorphism of RegularLink (and different link types).
In master.hpp, line 1102: error: no member named 'notice' in 'spdlog::logger' (you will only see this in debug mode).
Use log->debug instead?
In include/diy/storage.hpp, fread is used, but its return value is ignored.
virtual inline void load_binary(char* x, size_t count) { fread(x, 1, count, file); }
fread returns the number of items read, so when 1000 bytes are requested, a call may read just 10, and further calls are necessary.
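One way to honor the return value is to loop until the requested count has arrived. The function name read_fully below is hypothetical and is not taken from storage.hpp; this is a sketch of the idea rather than a proposed patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: keep calling fread until count bytes have arrived or
// the stream reports end-of-file or an error. Returns the bytes actually read.
inline std::size_t read_fully(std::FILE* file, char* x, std::size_t count)
{
    std::size_t total = 0;
    while (total < count)
    {
        std::size_t n = std::fread(x + total, 1, count - total, file);
        if (n == 0)
            break;          // EOF or error; the caller can check feof/ferror
        total += n;
    }
    return total;
}
```

A caller such as load_binary could then compare the result against count and report a failure instead of silently continuing with a partially filled buffer.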
Issue by Dmitriy Morozov
Friday Nov 14, 2014 at 18:08 GMT
For continuous data:
For discrete data:
Issue by Dmitriy Morozov
Wednesday Feb 25, 2015 at 18:19 GMT
Master::comm_exchange() reports an inconsistency after in-memory swap when running Gaia out-of-core. (The size of the buffer doesn't match the size of the record.) It's not clear whether this affects the results of the computation (it doesn't seem to), but it's important to get to the bottom of this.
Issue by Hadrien Croubois
Monday Mar 02, 2015 at 21:02 GMT
When the total number of blocks is smaller than the number of processes running in parallel, calling diy::Master::exchange() gives an error:
[Archteryx:13682] *** Process received signal ***
[Archteryx:13682] Signal: Floating point exception (8)
[Archteryx:13682] Signal code: Integer divide-by-zero (1)
Is that a known problem? I'll try to have a look at where it comes from.
Issue by Dmitriy Morozov
Sunday Feb 08, 2015 at 18:44 GMT
Currently, we launch a new set of threads with every foreach() (and join them all at the end). It might make sense to create a pool of threads when Master is initialized, and then just reuse them during the foreach() computation.
Large queues (> 2GB) are broken into smaller pieces (tagged with tags::piece) to work around MPI limitations. detail::VectorWindow<char> is used to put subsets of the MemoryBuffers into MPI i[s]send routines without having to make a copy. shared_ptrs are used to keep track of the original buffer as long as some in-flight message needs it.
We could use the same mechanism during, e.g., all-to-all operations to avoid making copies between buffers.
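The splitting step can be sketched independently of MPI. The function split_into_pieces below is hypothetical and is not DIY's implementation; it only computes the (offset, length) windows that a view type such as detail::VectorWindow<char> could then expose, assuming the limit being worked around is a maximum byte count per message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, total) into consecutive (offset, length)
// windows of at most max_piece bytes each. No data is copied; each window
// only describes a range of the original buffer.
inline std::vector<std::pair<std::size_t, std::size_t>>
split_into_pieces(std::size_t total, std::size_t max_piece)
{
    std::vector<std::pair<std::size_t, std::size_t>> pieces;
    if (max_piece == 0)
        return pieces;                 // nothing sensible to do
    for (std::size_t offset = 0; offset < total; offset += max_piece)
        pieces.emplace_back(offset, std::min(max_piece, total - offset));
    return pieces;
}
```

For example, a 10-byte buffer with a 4-byte limit yields the windows (0, 4), (4, 4), and (8, 2). The same windows could be handed to an all-to-all exchange, provided the original buffer is kept alive until every window has been sent.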
Issue by Dmitriy Morozov
Sunday Feb 08, 2015 at 18:51 GMT
Many processors (for example, the ones on Edison) place memory next to the core that first touched it. Currently, we do not take this into consideration at all. Initial blocks are created in serial (to avoid problems with threaded MPI IO), which means the input data is probably placed next to the core on which the main thread is running. In some cases, this may be Ok (if extra data structures are created later during parallel computation), but in some cases (all dense data), this is surely suboptimal.
Worse yet, when we run the actual computation we do not keep track of the information about which is the preferred thread for a block (in terms of memory placement). We should consider taking this into account.
Issue by Tom Peterka
Saturday Feb 21, 2015 at 17:10 GMT
When compiling with gcc on linux, getting errors like this:
/homes/tpeterka/software/diy2/include/diy/mpi/collectives.hpp: In static member function static void diy::mpi::Collectives<T, Op>::gather(const diy::mpi::communicator&, const T&, int) [with T = int, Op = void*]:
/homes/tpeterka/software/diy2/include/diy/mpi/collectives.hpp:69:7: error: invalid conversion from const void* to void* [-fpermissive]
/usr/include/mpich2/mpi.h:628:5: error: initializing argument 1 of int MPI_Gather(void*, int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm) [-fpermissive]
As a temporary fix, I added (void*) casts to lines 29, 48, 69, 82, 155, and 165. They look like this:
MPI_Gather((void*)(Datatype::address(in)),
and so on for the other collectives. I did not change all of the collectives, only the ones being used (by block.hpp apparently) in order to compile. I did not push the change.
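The cast is needed because MPI-2 era headers declare send buffers as plain void*, while later MPI versions declare them const. A sketch of keeping the cast in one place, using a mock function instead of a real MPI call, might look like this. Both legacy_sum and sum are hypothetical names invented for the example.

```cpp
#include <cassert>

// Stand-in for an MPI-2 style prototype whose input buffer is a plain void*.
// It only reads from buf, as the send side of a collective does.
inline int legacy_sum(void* buf, int count)
{
    const int* p = static_cast<const int*>(buf);
    int s = 0;
    for (int i = 0; i < count; ++i)
        s += p[i];
    return s;
}

// The wrapper keeps a const-correct interface and confines the cast to a
// single, easily audited spot.
inline int sum(const int* data, int count)
{
    return legacy_sum(const_cast<int*>(data), count);
}
```

A const_cast states the intent (dropping const only) more precisely than a C-style (void*) cast, which would also hide unrelated type errors.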
Issue by Dmitriy Morozov
Friday Nov 14, 2014 at 17:51 GMT
Add the ability to move queues out of core (when their blocks are also out of core). For example, this is essential to support IO-efficient distribution of points (through a swap-reduce). Roughly, the protocol should be:
foreach block:
    process(block)
    exchange-light      # optional
exchange-heavy

exchange-light:
    post queues
    kick outstanding messages
    if #queues too large:
        swap some queues out

exchange-heavy:
    post any unposted queues        # happens if exchange-light wasn't invoked
    kick outstanding messages
    while there are external outgoing queues:
        if #posted queues < #allowed queues:
            load and post more queues
        kick outstanding messages
Issue by Tom Peterka
Monday Nov 17, 2014 at 02:57 GMT
In fill(), should partners.reserve(kv.size - 1) be partners.reserve(kv.size)?
partners.size() is kv.size
Issue by Tom Peterka
Monday Nov 17, 2014 at 02:44 GMT
Windows builds fail as follows:
c:\...diy\storage.hpp(8): fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
Issue by Hadrien Croubois
Monday Mar 16, 2015 at 16:25 GMT
I have trouble receiving std::vectors using diy2.
I managed to reproduce the error with this code:
master.foreach([&](void* b, const diy::Master::ProxyWithLink& cp, void*){
  tess::Block* block = reinterpret_cast<tess::Block*>(b);
  RCLink* l = dynamic_cast<RCLink*>(cp.link());
  if (block->gid)
  {
    diy::BlockID dest;
    dest.gid = 0;
    dest.proc = 0;
    std::vector<char> buffer(block->gid*100);
    cp.enqueue(dest, buffer);
    printf("Sending buffer %d -- size: %d\n", block->gid, buffer.size());
  }
});
master.exchange();
master.foreach([&](void* b, const diy::Master::ProxyWithLink& cp, void*){
  tess::Block* block = reinterpret_cast<tess::Block*>(b);
  RCLink* l = dynamic_cast<RCLink*>(cp.link());
  std::vector<int> gids;
  cp.incoming(gids);
  for (int gid : gids)
    if (cp.incoming(gid).buffer.size())
    {
      printf("Receiving buffer at %d from %d -- size: %d\n", block->gid, gid, cp.incoming(gid).buffer.size());
    }
});
Issue by Tom Peterka
Wednesday Jan 28, 2015 at 21:56 GMT
[ 9%] Building CXX object examples/mpi/CMakeFiles/test-mpi.dir/test-mpi.cpp.o
In file included from /home/tpeterka/software/diy2/include/diy/mpi.hpp:13,
from /home/tpeterka/software/diy2/examples/mpi/test-mpi.cpp:3:
/home/tpeterka/software/diy2/include/diy/mpi/collectives.hpp: In static member function 'static void diy::mpi::Collectives<T, Op>::reduce(const diy::mpi::communicator&, const T&, T&, int, const Op&) [with T = std::vector<int, std::allocator<int> >, Op = std::plus<int>]':
/home/tpeterka/software/diy2/include/diy/mpi/collectives.hpp:241: instantiated from 'void diy::mpi::reduce(const diy::mpi::communicator&, const T&, T&, int, const Op&) [with T = std::vector<int, std::allocator<int> >, Op = std::plus<int>]'
/home/tpeterka/software/diy2/examples/mpi/test-mpi.cpp:29: instantiated from here
/home/tpeterka/software/diy2/include/diy/mpi/collectives.hpp:135: error: invalid conversion from 'const void*' to 'void*'
/home/tpeterka/software/diy2/include/diy/mpi/collectives.hpp:135: error: initializing argument 1 of 'int MPI_Reduce(void*, void*, int, MPI_Datatype, MPI_Op, int, MPI_Comm)'
Issue by Dmitriy Morozov
Saturday May 09, 2015 at 17:21 GMT
For some reason diy::io::{read,write}_blocks are templated by Master. It's not clear why this is necessary, other than one could imagine other Masters in the future. Consider removing the template parameter.
Issue by Dmitriy Morozov
Monday Mar 09, 2015 at 18:35 GMT
diy::reduce() needs to take a void* aux optional parameter (like Master::foreach() does) that it would pass along to the callback functor. Sample use-case for this is sort, where aux could be used to pass the number of bins, so they don't have to be stored in the block.
Issue by Dmitriy Morozov
Saturday Feb 28, 2015 at 16:29 GMT
It would be nice if there were a way to communicate to reduce() when a block could be skipped. A good example is the sorting algorithm, where during the histogram computation, no block needs to be loaded into memory.
Issue by Tom Peterka
Monday Nov 17, 2014 at 02:43 GMT
Tom:
Issue by Dmitriy Morozov
Monday Mar 09, 2015 at 23:34 GMT
The current way of saying that block's queues are out of core if the block is out of core is too crude. For example, when sorting, when computing a histogram, it doesn't make sense to move the queues (that carry the histogram) out of core, just because the block is out of core. There are alternative methods one could imagine (e.g., specify maximum queue size limit below which the queue is never moved). It might be nice to think of a way to abstract it away, similarly to how we provide the skip mechanism for blocks right now.
Issue by Dmitriy Morozov
Tuesday Nov 18, 2014 at 21:37 GMT
Add specialization Serialization< std::tuple<...> >, wrapped in checks for C++11.
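A sketch of what element-wise tuple serialization could look like in C++11 follows. Buffer, save, load, save_tuple, and load_tuple are hypothetical names written for this example; DIY's own buffer type and Serialization<T> interface differ, and the sketch handles only trivially copyable element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <tuple>
#include <type_traits>
#include <vector>

// Hypothetical minimal buffer; not DIY's buffer type.
struct Buffer
{
    std::vector<char> data;
    std::size_t pos = 0;     // read cursor
};

template <class T>
void save(Buffer& bb, const T& x)
{
    static_assert(std::is_trivially_copyable<T>::value, "sketch handles plain data only");
    const char* p = reinterpret_cast<const char*>(&x);
    bb.data.insert(bb.data.end(), p, p + sizeof(T));
}

template <class T>
void load(Buffer& bb, T& x)
{
    static_assert(std::is_trivially_copyable<T>::value, "sketch handles plain data only");
    std::memcpy(&x, bb.data.data() + bb.pos, sizeof(T));
    bb.pos += sizeof(T);
}

// Walk the tuple element by element with compile-time recursion (C++11).
template <std::size_t I = 0, class... Ts>
typename std::enable_if<I == sizeof...(Ts)>::type
save_tuple(Buffer&, const std::tuple<Ts...>&) {}

template <std::size_t I = 0, class... Ts>
typename std::enable_if<(I < sizeof...(Ts))>::type
save_tuple(Buffer& bb, const std::tuple<Ts...>& t)
{
    save(bb, std::get<I>(t));
    save_tuple<I + 1>(bb, t);
}

template <std::size_t I = 0, class... Ts>
typename std::enable_if<I == sizeof...(Ts)>::type
load_tuple(Buffer&, std::tuple<Ts...>&) {}

template <std::size_t I = 0, class... Ts>
typename std::enable_if<(I < sizeof...(Ts))>::type
load_tuple(Buffer& bb, std::tuple<Ts...>& t)
{
    load(bb, std::get<I>(t));
    load_tuple<I + 1>(bb, t);
}
```

Saving element by element, rather than copying the tuple's raw bytes, avoids writing padding and lets each element type use its own serialization rule.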
Issue by Tom Peterka
Monday Nov 17, 2014 at 03:02 GMT
Issue by Tom Peterka
Saturday Feb 21, 2015 at 17:26 GMT
gcc warns about unused return values on the following lines:
/homes/tpeterka/software/diy2/include/diy/io/block.hpp:57:9: warning: ignoring return value of int truncate(const char*, __off_t)
/homes/tpeterka/software/diy2/include/diy/storage.hpp:80:41: warning: ignoring return value of ssize_t read(int, void*, size_t)
/homes/tpeterka/software/diy2/include/diy/storage.hpp:45:37: warning: ignoring return value of ssize_t write(int, const void*, size_t)
Issue by Dmitriy Morozov
Friday Dec 05, 2014 at 18:02 GMT
The most clear use case for this is to serialize RegularDecomposer so that the information could be recovered. A generic way to do this would be to add an optional BinaryBuffer argument to {read,write}_blocks(), which would be stored in the file header.
It would be great to add a feature to automatically set thread affinities. Otherwise, we have to manually call sched_setaffinity() on some systems.
Issue by Dmitriy Morozov
Sunday Nov 23, 2014 at 17:57 GMT
Create a nothread.hpp header that provides all the same classes and functions as thread.hpp, but implements them in serial (and without any dependencies, like pthreads). It should be possible to choose which header master.hpp uses with a compile flag.
We would like diy to be supported on the Windows platform. Currently, there are two Windows MPI libraries that work with diy: Intel MPI and IBM Platform MPI. Both are available for free, but registration with Intel or IBM is required.
MS-MPI is a commonly used MPI implementation for Windows. Unfortunately, it does not support some newer features that are used by diy, and so it is currently unsupported. We should figure out some workarounds to support diy on MS-MPI.
The main issue is that MS-MPI does not have the MPI_Win_flush_local function. It does seem to have the non-local version of the function (MPI_Win_flush). Replacing with the non-local version should still be correct, if not performant, but we are still seeing test failures. Further investigation is needed to find the cause of the failure and an appropriate workaround.
Issue by Dmitriy Morozov
Wednesday Dec 10, 2014 at 18:54 GMT
Serialization operations should support C++11 move semantics. Of course, we cannot just move the binary data into the buffer, but in specializations for standard containers, we could clear the source container after copying the data over. This would benefit memory-constrained situations by getting rid of the originals as soon as the copies are made.
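As a rough illustration of the idea, the hypothetical helper below copies a vector's contents into a byte buffer and then releases the source's storage. The name save_and_release and the use of a plain std::vector<char> as the buffer are assumptions made for the example, not DIY's serialization interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper, not a DIY function: append the contents of src to
// buffer, then free src's storage so only one copy of the data stays alive.
template <class T>
void save_and_release(std::vector<char>& buffer, std::vector<T>&& src)
{
    static_assert(std::is_trivially_copyable<T>::value, "sketch handles plain data only");
    const std::size_t bytes = src.size() * sizeof(T);
    const std::size_t old_size = buffer.size();
    buffer.resize(old_size + bytes);
    if (bytes != 0)
        std::memcpy(buffer.data() + old_size, src.data(), bytes);
    std::vector<T>().swap(src);   // clear() alone would keep the allocation
}
```

Taking an rvalue reference makes the caller write std::move(points) explicitly, which signals at the call site that the container will be emptied.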
As discussed in #59. @utkarshayachit
Issue by Dmitriy Morozov
Saturday Jan 31, 2015 at 21:10 GMT
Right now the BinaryBuffer size grows to match the data stored in it (so appending new data always results in copying). This behavior should be rethought in general, since it surely causes an unreasonable amount of memory reallocation and copying. One place where this has already manifested itself is in cian benchmarks: when appending a footer, we reallocate and copy everything just for a pair of extra integers (from and to).
Related: BinaryBuffer should provide a reserve() function.
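A minimal sketch of the requested behavior, assuming the buffer's storage can be a std::vector<char> whose capacity is allowed to exceed its size. GrowingBuffer is a hypothetical type for illustration and is not DIY's BinaryBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical buffer, not DIY's BinaryBuffer: because the vector's capacity
// can exceed its size, small appends reuse spare room instead of reallocating.
struct GrowingBuffer
{
    std::vector<char> data;

    // Set aside room for at least n bytes up front.
    void reserve(std::size_t n) { data.reserve(n); }

    // Append n bytes; storage is reallocated only when capacity runs out.
    void append(const char* p, std::size_t n) { data.insert(data.end(), p, p + n); }
};
```

With room reserved in advance, appending a footer of two integers leaves the existing bytes where they are, which is the case the cian benchmark ran into.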