Giter Site home page Giter Site logo

Comments (5)

jerinphilip avatar jerinphilip commented on August 29, 2024
Fresh SIGSEGV

Thread 2 "marian-decoder-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6c36700 (LWP 79086)]
0x0000555555659d70 in std::__shared_ptr<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2>::get (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1308
1308          { return _M_ptr; }
(gdb) backtrace
#0  0x0000555555659d70 in std::__shared_ptr<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2>::get (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1308
#1  0x000055555565508e in std::__shared_ptr_access<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1019
#2  0x00005555556502fa in std::__shared_ptr_access<marian::data::SubBatch, (__gnu_cxx::_Lock_policy)2, false, false>::operator-> (this=0x0)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1013
#3  0x0000555555649de9 in marian::data::CorpusBatch::size (this=0x555565a6d100)
    at /home/jphilip/code/bergamot-translator@integration/3rd_party/marian-dev/src/data/corpus_base.h:271
#4  0x0000555555824166 in marian::BeamSearch::search (this=0x55555bf734a0,
    graph=std::shared_ptr<marian::ExpressionGraph> (use count 11, weak count 378) = {...},
    batch=std::shared_ptr<marian::data::CorpusBatch> (use count 2, weak count 0) = {...})
    at /home/jphilip/code/bergamot-translator@integration/3rd_party/marian-dev/src/translator/beam_search.cpp:259
#5  0x0000555555646ec3 in marian::bergamot::BatchTranslator::translate (this=0x55555a3ecf60, batch=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:95
#6  0x000055555564717c in marian::bergamot::BatchTranslator::consumeFrom (this=0x55555a3ecf60, pcqueue=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:109
#7  0x0000555555625ac6 in marian::bergamot::Service::<lambda()>::operator()(void) const (__closure=0x55555a053de8)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/service.cpp:41
#8  0x0000555555627436 in std::__invoke_impl<void, marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(std::__invoke_other,
marian::bergamot::Service::<lambda()> &&) (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#9  0x000055555562727f in std::__invoke<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(marian::bergamot::Service::<lambda()> &&) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#10 0x00005555556287f8 in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x55555a053de8) at /usr/include/c++/8/thread:244
#11 0x00005555556287ce in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::operator()(void) (this=0x55555a053de8) at /usr/include/c++/8/thread:253
#12 0x00005555556287b2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > > >::_M_run(void) (this=0x55555a053de0) at /usr/include/c++/8/thread:196
#13 0x00007ffff78cdd80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff702f6db in start_thread (arg=0x7ffff6c36700) at pthread_create.c:463
#15 0x00007ffff6d58a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

from bergamot-translator.

jerinphilip avatar jerinphilip commented on August 29, 2024

Unable to reproduce since thread adjustments, marking as closed.

from bergamot-translator.

jerinphilip avatar jerinphilip commented on August 29, 2024
[2021-03-26 21:05:24] Received (3, 44) = 0.053427

Thread 48 "marian-decoder-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdf77e700 (LWP 80447)]
0x00005555556700fc in marian::bergamot::Request::processHistory(unsigned long, std::shared_ptr<marian::History>) ()
(gdb)
(gdb)
(gdb) backtrace
#0  0x00005555556700fc in marian::bergamot::Request::processHistory(unsigned long, std::shared_ptr<marian::History>) ()
#1  0x00005555556701aa in marian::bergamot::RequestSentence::completeSentence(std::shared_ptr<marian::History>) ()
#2  0x0000555555679974 in marian::bergamot::Batch::completeBatch(std::vector<std::shared_ptr<marian::History>, std::allocator<std::shared_ptr<marian::History> > > const&) ()
#3  0x0000555555660451 in marian::bergamot::BatchTranslator::translate(marian::bergamot::Batch&) ()
#4  0x0000555555653a71 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::bergamot::Service::initialize_async_translators()::{lambda()#1}> > >::_M_run() ()
#5  0x00007ffff7224d80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bbd6db in start_thread (arg=0x7fffdf77e700) at pthread_create.c:463
#7  0x00007ffff68cea3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

Sus on pcqueue and swap routine and Batch&.

from bergamot-translator.

jerinphilip avatar jerinphilip commented on August 29, 2024

The presently existing cleaveBatch is as follows:

bool Batcher::cleaveBatch(Batch &batch) {
// For now simply iterates on buckets and converts batches greedily. This
// has to be enhanced with optimizing over priority. The baseline
// implementation should at least be as fast as marian's maxi-batch with full
// corpus size as maxi-batch size.
batch.clear();
size_t paddedBatchSize = 0;
for (size_t length = 0; length < bucket_.size(); length++) {
auto p = bucket_[length].begin();
while (p != bucket_[length].end()) {
paddedBatchSize = (batch.size() + 1) * length;
if (paddedBatchSize <= miniBatchWords) {
auto q = p++;
batch.add(*q);
bucket_[length].erase(q);
} else {
// Check if elements exist
assert(batch.size() > 0);
return true;
}
}
}
bool isValidBatch = batch.size() > 0;
return isValidBatch;
}

The following code in jp/translate-perftest has been running for about 3 hours now without segfault.

It takes batch address from outside and tries to clear(), then populate the batch variable with the current batch. Absent batchExt and a local batch, during the run, there ends up cases where there are more than the intended elements (ABORTs fail). This was discovered when attempting to generate random batches at an even higher volume than WNGT20.

bool Batcher::nextRandomBatch(Batch &batchExt) {
std::pair<size_t, size_t> batchInfo;
while (exhaustConfig_.next(batchInfo)) {
// Generate batch
size_t B = batchInfo.first;
size_t T = batchInfo.second;
size_t available = bucket_[T].size();
if (available > 0) {
Batch batch;
batch.clear();
ABORT_IF(batch.size() != 0, "Batch size doesn't seem to be 0");
std::uniform_int_distribution<size_t> dist(0, available - 1);
static std::random_device rd;
static std::mt19937 gen(rd());
std::vector<size_t> idxs;
for (size_t b = 0; b < B; b++) {
idxs.push_back(dist(gen));
}
// ABORT_IF(idxs.size() * T >= miniBatchWords,
// "Too large a batch detected.");
std::sort(idxs.begin(), idxs.end());
size_t sentenceIdx = 0;
auto p = bucket_[T].begin();
for (auto &idx : idxs) {
while (sentenceIdx < idx) {
++sentenceIdx;
++p;
ABORT_IF(
p == bucket_[T].end(),
"Somehow we have reached the end of container, this is illegal.");
}
ABORT_IF(idx >= bucket_[T].size(),
"idx out of bounds. Something's wrong!");
ABORT_IF(idx != sentenceIdx, "idx != sentenceIdx. Something's wrong!");
ABORT_IF(p->numTokens() > T, "p->numTokens() > T. Something's wrong!");
batch.add(*p);
}
// LOG(info, "(idxs, T) = ({}, {}), (eB, eT) = ({}, {})", idxs.size(), T,
// batch.size(), batch.maxLength());
ABORT_IF(batch.size() != idxs.size(),
"Something's off, check above block!");
ABORT_IF(batch.maxLength() != T, "Something's off, check above block!");
batch.log();
batchExt = batch;
return true;
}
}
return false;
}

If such unexpected batches are created, this is what is possibly leading segfaults in certain WNGT20 runs as well. batch& is also connected to pcqueue with an std::swap through ProduceSwap.

from bergamot-translator.

jerinphilip avatar jerinphilip commented on August 29, 2024

We no longer have pcqueue, so this shouldn't be happening again.

from bergamot-translator.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.