The bergamot-translator from browsermt

Jenkins bergamot-translator #1 failed

Build 'bergamot-translator' is failing!

Last 50 lines of build output:

[...truncated 2.44 KB...]
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/bin/gcc-8
-- Check for working C compiler: /usr/bin/gcc-8 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at 3rd_party/CMakeLists.txt:1 (add_subdirectory):
  The source directory

    /var/lib/jenkins/workspace/bergamot-translator/3rd_party/marian-dev

  does not contain a CMakeLists.txt file.


CMake Error at 3rd_party/CMakeLists.txt:2 (add_subdirectory):
  The source directory

    /var/lib/jenkins/workspace/bergamot-translator/3rd_party/ssplit-cpp

  does not contain a CMakeLists.txt file.


CMake Error at 3rd_party/CMakeLists.txt:7 (get_property):
  get_property DIRECTORY scope provided but requested directory was not
  found.  This could be because the directory argument was invalid or, it is
  valid but has not been processed yet.


CMake Error at 3rd_party/CMakeLists.txt:8 (target_include_directories):
  Cannot specify include directories for target "marian" which is not built
  by this project.


CMake Error at 3rd_party/CMakeLists.txt:10 (get_property):
  get_property DIRECTORY scope provided but requested directory was not
  found.  This could be because the directory argument was invalid or, it is
  valid but has not been processed yet.


CMake Error at 3rd_party/CMakeLists.txt:11 (target_include_directories):
  Cannot specify include directories for target "ssplit" which is not built
  by this project.


-- Configuring incomplete, errors occurred!
See also "/var/lib/jenkins/workspace/bergamot-translator/build/CMakeFiles/CMakeOutput.log".
Build step 'Execute shell' marked build as failure
Skipped archiving because build is not successful

Changes since last successful build:


Find a way to prefix 3rd-party includes indicating source library

bergamot-translator has a translator folder, and so does 3rd_party/marian-dev. When files are included, it is not obvious which folder a header comes from, and leaving it this way can lead to issues. For example, the file below includes translator headers from both marian and bergamot-translator, with no way to tell them apart at first glance.

bergamot-translator/app/marian-decoder-new.cpp

Lines 11 to 15 in f89c989

    
    #include "translator/output_collector.h"
    #include "translator/output_printer.h"
    #include "translator/parser.h"
    #include "translator/response.h"
    #include "translator/service.h"

While this file is not strictly necessary here, this should ideally be configured so that the source-library is known, in this case 'marian' or 'bergamot-translator'. Unsure of the solution, but filing an issue.

Absorb BatchTranslator::thread_ into Service

@kpu: All BatchTranslator exposes publicly is launching a thread in the constructor and joining it. Consider making BatchTranslator a function (or functions) and moving the std::thread to Service. Making the thread launch optional would also make it easier to test.
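
A minimal sketch of what this could look like, not the actual refactor: translation becomes a plain function and a Service-like class owns the std::thread objects. ServiceSketch, translateBatch and the string-based "batch" are hypothetical stand-ins, not names from this repository. Passing 0 workers makes everything run inline, which is the "optional thread launch" that helps testing.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for translating one batch: a plain function, no thread inside.
std::string translateBatch(const std::string &batch) {
  return "translated(" + batch + ")";
}

// Hypothetical Service-like class that owns the worker threads itself.
class ServiceSketch {
public:
  // numWorkers == 0 launches no threads; enqueue() then translates inline.
  explicit ServiceSketch(std::size_t numWorkers) {
    for (std::size_t i = 0; i < numWorkers; ++i) {
      workers_.emplace_back([this] { consume(); });
    }
  }

  ~ServiceSketch() { stop(); }

  void enqueue(const std::string &batch) {
    if (workers_.empty()) {
      store(translateBatch(batch));
      return;
    }
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.push(batch);
    }
    ready_.notify_one();
  }

  // Lets workers drain the queue, then joins them. Safe to call more than once.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    ready_.notify_all();
    for (auto &worker : workers_) {
      if (worker.joinable()) worker.join();
    }
  }

  // Returns a copy of the results; order is unspecified with several workers.
  std::vector<std::string> results() {
    std::lock_guard<std::mutex> lock(mutex_);
    return results_;
  }

private:
  void consume() {
    for (;;) {
      std::string batch;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return done_ || !pending_.empty(); });
        if (pending_.empty()) return;  // done_ is set and nothing is left
        batch = pending_.front();
        pending_.pop();
      }
      store(translateBatch(batch));
    }
  }

  void store(std::string result) {
    std::lock_guard<std::mutex> lock(mutex_);
    results_.push_back(std::move(result));
  }

  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<std::string> pending_;
  std::vector<std::string> results_;
  std::vector<std::thread> workers_;
  bool done_ = false;
};
```

With ServiceSketch(0) a test can call enqueue() and inspect results() without any synchronization concerns; ServiceSketch(n) exercises the same translateBatch through n workers.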

QualityEstimation Implementation [RFC]

Input: translated text, source text, model scores for tokens, and the tokenization information needed to make sense of the model scores.

For each sentence, the output is expected to contain the following:

Sentence-level quality score: float
Word-level quality scores: vector<float>, one per Word

where a Word is a space-separated word of a sentence (Mozilla prefers word-level, not subword-level, scores). Continuous values are preferred, since they allow more experimentation.

Let the output be a struct called QualityEstimate. An implementation starting from the skeleton below is tentatively going to be used by Service to make QualityEstimate a member of Response, since the layer above in UnifiedAPI doesn't have access to logprobs.

class QualityEstimator {
public:
  QualityEstimator(Args…) {
    // Use the constructor to load and initialize any trained models.
    // This is where I expect you to load any neural nets into a graph,
    // or prepare the model parameters (e.g. logistic regression) if you're using something simpler.
  }
  QualityEstimate quality(Histories &histories, Response &response) {
    // AnnotatedText has the blob of text and the sentence/word-token information, which should be extracted.
    // modelScores are logprobs; they're already accessible.
    // … your calculation code here
  }
};

This is to be built natively first and, once ready, exported to WASM.
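
To make the expected output shape concrete, here is a naive baseline sketch; it is an assumption for illustration, not the proposed estimator. It averages subword logprobs into word-level scores and averages all logprobs into the sentence score. The field names of QualityEstimate, the function naiveQuality and the startsWord flags are hypothetical; the real inputs would come from Histories and Response.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout of the output struct described above.
struct QualityEstimate {
  float sentenceScore = 0.0f;     // one continuous score per sentence
  std::vector<float> wordScores;  // one continuous score per space-separated word
};

// logProbs[i]   : model log-probability of subword i
// startsWord[i] : true if subword i begins a new space-separated word
// Naive baseline: a word's score is the mean logprob of its subwords.
QualityEstimate naiveQuality(const std::vector<float> &logProbs,
                             const std::vector<bool> &startsWord) {
  QualityEstimate estimate;
  if (logProbs.empty() || logProbs.size() != startsWord.size()) {
    return estimate;  // nothing to score, or inconsistent input
  }

  float wordSum = 0.0f;
  float total = 0.0f;
  std::size_t wordLength = 0;
  for (std::size_t i = 0; i < logProbs.size(); ++i) {
    // A new word begins: flush the word accumulated so far.
    if (startsWord[i] && wordLength > 0) {
      estimate.wordScores.push_back(wordSum / static_cast<float>(wordLength));
      wordSum = 0.0f;
      wordLength = 0;
    }
    wordSum += logProbs[i];
    total += logProbs[i];
    ++wordLength;
  }
  estimate.wordScores.push_back(wordSum / static_cast<float>(wordLength));
  estimate.sentenceScore = total / static_cast<float>(logProbs.size());
  return estimate;
}
```

A trained model would replace the averaging, but the returned struct could stay the same, which is what the UI side needs to start working against.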

@ugermann @abhi-agg @fredblain @mfomicheva /cc @kpu

Texts are treated as sequential blocking requests without async queueing

Scouring through the bergamot-translator-extension sources, it seems that the browser team breaks texts into sentences and submits them as individual requests. This can lead to a significant slowdown, which is referred to in the extension's release docs:

Translation speeds are temporarily low - about 150-200 words per second instead of expected 500-600 words per second

Without threads or async, none of marian::bergamot::Service's mechanisms for collecting requests and packing batches efficiently can take effect, at least under the current state of the unified API's specification. I see two potential solutions: enable threading/async queuing mechanisms, or collect the text as one string and do the processing somewhere in TranslationModel.

bergamot-translator/src/translator/TranslationModel.cpp

Lines 60 to 64 in 51f702e

    
    std::vector<TranslationResult>
    TranslationModel::translate(std::vector<std::string> &&texts,
                                TranslationRequest request) {
      // Implementing a non-async version first. Unpleasant, but should work.
      std::promise<std::vector<TranslationResult>> promise;

Opening the issue for discussion.
/cc @abhi-agg @motin
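
To illustrate why one-sentence requests hurt, here is a rough sketch of greedy batch packing under a padded-token budget. It is a simplified illustration and not the batching code in this repository; packBatches and maxPaddedTokens are hypothetical names. When sentences arrive together they can share batches, whereas a request holding a single sentence can only ever produce a batch of size one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Groups sentence indices into batches so that
// (longest sentence in batch) * (sentences in batch) <= maxPaddedTokens.
// A sentence longer than the budget still gets a batch of its own.
std::vector<std::vector<std::size_t>> packBatches(
    const std::vector<std::size_t> &lengths, std::size_t maxPaddedTokens) {
  // Visit sentences from shortest to longest so padding waste stays small.
  std::vector<std::size_t> order(lengths.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&lengths](std::size_t a, std::size_t b) {
                     return lengths[a] < lengths[b];
                   });

  std::vector<std::vector<std::size_t>> batches;
  std::vector<std::size_t> current;
  for (std::size_t index : order) {
    // Sorted ascending, so this sentence would be the longest in the batch.
    const std::size_t paddedCost = lengths[index] * (current.size() + 1);
    if (!current.empty() && paddedCost > maxPaddedTokens) {
      batches.push_back(current);
      current.clear();
    }
    current.push_back(index);
  }
  if (!current.empty()) batches.push_back(current);
  return batches;
}
```

For example, lengths {5, 2, 3, 9} with a budget of 10 pack the two short sentences together and leave the longer ones alone; submitting the same four sentences as four separate blocking requests would always yield four batches.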

Integration Changes: Discussion

Opening this as a place to discuss changes to the existing 'abstract' API/implementation as the 'concrete' stuff comes in (#8).

Configuration (specifically, TranslationModelConfiguration) needs changes, or could be replaced with Options: if marian is available anyway, why not use Options? Requesting inputs (@abhi-agg, @kpu). The following is the command I use to run a student model, with a main app different from the one in this repository. Each flag is annotated with Required/Present below. A question mark is added where I'm not sure.

./main
    -m $ASSETS/students/ende/model.intgemm.alphas.bin # [Required, Present]
    --vocabs # [Required, Present(?)]
        $ASSETS/students/ende/vocab.deen.spm
        $ASSETS/students/ende/vocab.deen.spm
    --ssplit-mode sentence --beam-size 1 --skip-cost # [Required(?), Not Present]
    --shortlist $ASSETS/students/ende/lex.s2t.gz 50 50 # [Required(?), Not Present]
    --int8shiftAlphaAll # (?)
    --cpu-threads 64 # [Required, Not Present]
    --max-input-sentence-tokens 1024 # [Required, Not Present]
    --max-input-tokens 1024 # [Required, Not Present]
    -w 128 # [?]
    --marian-decoder-alpha # [Not Required]
    --quiet --quiet-translation # [Not Required]
    --log speed/cpu.64.wngt20.log # [Not Required]
    < $ASSETS/wngt20/sources.shuf

Edit 1: The following might also be required; @kpu or @ugermann can probably comment on this.

--split-prefix-file

@abhi-agg has a draft of the configuration (bergamot-translator/src/TranslationModelConfiguration), which is to be passed outside the command line(?). I'm not sure how to connect to this given its limitations, and this blocks connecting to TranslationModel.h.

I propose that @abhi-agg change the configuration class, and that we start repurposing the Service class to suit the requirements and convert it to TranslationModel. Service builds from Options, which can be constructed from a configuration file as well, or from some class like TranslationModelConfiguration if that's what's necessary.

Binary shortlist loading from bytebuffer

@qianqianzhu You're supposed to follow @XapaJIaMnu's PR for model loading as a byte array (#55) and provide the byte buffer for the shortlist (instead of the model) as an argument in 5 places (or fewer). This would be shortlist_memory instead of model_memory.

Whether you build against main or Nick's PR is up to you; if it were up to me, I'd do it on top of Nick's PR and create a mess. Note that since both of you are editing the same places to pass the byte array, expect a (simple to fix) conflict, which one of you will have to deal with depending on who gets their code into main first.

The shortlist loading happens here:

bergamot-translator/src/translator/batch_translator.cpp

Line 21 in f89c989

slgen_ = New<data::LexicalShortlistGenerator>(options_, vocabs_->front(),

All translation is dumped into the BatchTranslator abstraction (where the shortlist is loaded), with Service delegating the query to BatchTranslator after preprocessing and getting the input ready as a batch. Due to bergamot's insistence on a bridge interface to potentially integrate multiple translators, you'll have to pass the const void * (as an argument) through:

AbstractTranslationModel -> TranslationModel -> NonThreadedService -> BatchTranslator, [ServiceBase]. [] means optional, like Nick did.

I'd be grateful if you adjusted it at Service as well, which is the multithreaded implementation. If that's too much, I'm competent enough to do it on my own once the rest is ready. Apologies in advance for this mess.
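
A rough sketch of the plumbing, under the assumption that a null pointer means "fall back to the file path". The class names carry a Sketch suffix because they are hypothetical stand-ins for the chain above, and shortlistSource() only reports which source would be used; the real code would hand the pointer to the shortlist generator.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Innermost layer: decides where the shortlist comes from.
class BatchTranslatorSketch {
public:
  BatchTranslatorSketch(std::string shortlistPath, const void *shortlistMemory)
      : shortlistPath_(std::move(shortlistPath)),
        shortlistMemory_(shortlistMemory) {}

  // A real implementation would parse the bytes or open the file here.
  std::string shortlistSource() const {
    if (shortlistMemory_ != nullptr) return std::string("memory");
    return "file:" + shortlistPath_;
  }

private:
  std::string shortlistPath_;
  const void *shortlistMemory_;  // not owned; the caller keeps the buffer alive
};

// Middle layer: only forwards the pointer.
class NonThreadedServiceSketch {
public:
  NonThreadedServiceSketch(const std::string &shortlistPath,
                           const void *shortlistMemory = nullptr)
      : translator_(shortlistPath, shortlistMemory) {}

  std::string shortlistSource() const { return translator_.shortlistSource(); }

private:
  BatchTranslatorSketch translator_;
};

// Outer layer exposed to the browser side: again just forwards.
class TranslationModelSketch {
public:
  TranslationModelSketch(const std::string &shortlistPath,
                         const void *shortlistMemory = nullptr)
      : service_(shortlistPath, shortlistMemory) {}

  std::string shortlistSource() const { return service_.shortlistSource(); }

private:
  NonThreadedServiceSketch service_;
};
```

Defaulting the pointer to nullptr keeps existing path-based callers compiling unchanged, which should keep the expected merge conflict small.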

Set marian::bergamot::Response to work with token-ranges instead of sentence-ranges

Source and target word byte ranges are required for alignment, which seems to be important to the browser people. Alignment is proposed to be constructed from source byte ranges to target byte ranges.

For the source, we construct sentence byte ranges from source-token byte ranges (i.e., we already have source-word byte ranges). The translated text, however, is stuck operating at sentence level due to a missing API in marian-dev's vocabulary. The current source gives marian::bergamot::Response token-range capability, but glues it to the external browser API by treating a whole translated sentence as a single token, given the inability to extract token-level views at translation.

bergamot-translator/src/translator/response.cpp

Lines 58 to 66 in 45a8309

    
    // TODO(@jerinphilip):
    // Currently considers target tokens as whole text. Needs
    // to be further enhanced in marian-dev to extract alignments.
    for (auto &range : translationRanges) {
      std::vector<string_view> targetMappings;
      const char *begin = &translation_[range.first];
      targetMappings.emplace_back(begin, range.second);
      targetRanges_.push_back(std::move(targetMappings));
    }

An alternative is perhaps to keep the sentence encoded and use the spaces to extract units, but this won't generalize to the other vocabs that are in place in marian.
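
For reference, the space-based alternative could look roughly like the sketch below. wordRanges and ByteRange are hypothetical names, and the function works on plain decoded text, so it carries exactly the limitation mentioned: it says nothing about vocabularies whose units are not delimited by spaces.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Half-open byte range [begin, end) into the translated text.
using ByteRange = std::pair<std::size_t, std::size_t>;

// Splits on ASCII spaces only and returns the byte range of every unit.
std::vector<ByteRange> wordRanges(std::string_view text) {
  std::vector<ByteRange> ranges;
  std::size_t position = 0;
  while (position < text.size()) {
    // Skip any run of spaces before the next unit.
    const std::size_t begin = text.find_first_not_of(' ', position);
    if (begin == std::string_view::npos) break;
    std::size_t end = text.find(' ', begin);
    if (end == std::string_view::npos) end = text.size();
    ranges.emplace_back(begin, end);
    position = end;
  }
  return ranges;
}
```

Each range can be turned back into a view with text.substr(begin, end - begin), which matches the string_view-based targetMappings used in the snippet above.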

WASM builds fail on Ubuntu

I use this Makefile to compile WASM locally. This used to work, but it currently fails near completion with the following error message.

[100%] Linking CXX executable bergamot-translator-worker.js
em++: warning: USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code slowly, see https://github.com/WebAssembly/design/issues/1271 [-Wpthreads-mem-growth]
wasm-ld: error: unknown file type: libpcre2_8_la-pcre2_compile.o
em++: error: '/mnt/Storage/jphilip/bergamot/emsdk/upstream/bin/wasm-ld -o bergamot-translator-worker.wasm CMakeFiles/bergamot-translator-worker.dir/bindings/TranslationModelBindings.cpp.o CMakeFiles/bergamot-translator-worker.dir/bindings/TranslationRequestBindings.cpp.o CMakeFiles/bergamot-translator-worker.dir/bindings/TranslationResultBindings.cpp.o ../src/translator/libbergamot-translator.a ../libmarian.a ../3rd_party/marian-dev/src/3rd_party/sentencepiece/src/libsentencepiece.a ../3rd_party/marian-dev/src/3rd_party/intgemm/libintgemm.a ../3rd_party/marian-dev/src/3rd_party/onnxjs/src/wasm-ops/libonnx-sgemm.a ../libssplit.a ../lib/libpcre2-8.a -L/mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten --whole-archive /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libembind-rtti.a --no-whole-archive /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl-mt.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc-mt.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt-mt.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-mt-noexcept.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-mt-noexcept.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc-mt.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a /mnt/Storage/jphilip/bergamot/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets-mt.a -mllvm 
-combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --lto-legacy-pass-manager --allow-undefined --import-memory
--shared-memory --strip-debug --export main --export stackSave --export stackRestore --export stackAlloc --export __wasm_call_ctors --export __errno_location --export __emscripten_pthread_data_constructor --export __pthread_tsd_run_dtors --export _emscripten_call_on_thread --export _emscripten_do_dispatch_to_thread --export _emscripten_main_thread_futex --export _emscripten_thread_init --export emscripten_current_thread_process_queued_calls --export _emscripten_allow_main_runtime_queued_calls --export emscripten_futex_wake --export emscripten_get_global_libc --export emscripten_main_browser_thread_id --export emscripten_main_thread_process_queued_calls --export emscripten_register_main_browser_thread_id --export emscripten_run_in_main_runtime_thread_js --export
emscripten_stack_set_limits --export emscripten_sync_run_in_main_thread_2 --export emscripten_sync_run_in_main_thread_4 --export emscripten_tls_init --export pthread_self --export memalign --export malloc --export free --export setThrew --export _get_tzname --export _get_daylight --export _get_timezone --export-table -z stack-size=5242880 --initial-memory=16777216 --no-entry --max-memory=2147483648 --global-base=1024' failed (1)
wasm/CMakeFiles/bergamot-translator-worker.dir/build.make:140: recipe for target 'wasm/bergamot-translator-worker.js' failed
make[3]: *** [wasm/bergamot-translator-worker.js] Error 1
make[3]: Leaving directory '/mnt/Storage/jphilip/bergamot/build/wasm'
CMakeFiles/Makefile2:1064: recipe for target 'wasm/CMakeFiles/bergamot-translator-worker.dir/all' failed
make[2]: *** [wasm/CMakeFiles/bergamot-translator-worker.dir/all] Error 2
make[2]: Leaving directory '/mnt/Storage/jphilip/bergamot/build/wasm'
/mnt/Storage/jphilip/bergamot/build/wasm/Makefile:155: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/mnt/Storage/jphilip/bergamot/build/wasm'
Makefile:51: recipe for target 'wasm' failed
make: *** [wasm] Error 2

Pass the model, vocabs and lexical shortlist files to the API as bytes instead of filename paths

Right now, an implementation of marian's API has to provide full paths to the model, vocabulary and lexical shortlist files as arguments in order to have them loaded by the engine.

During an initial security review, this was found to be a potential flaw and also a blocker for fully integrating marian into the browser and moving forward to the next step of the project.

With that said, we request a method in the API that accepts the contents of these files as byte streams instead of filenames.

@jerinphilip @kpu how can we achieve that?

Thanks

Andre

Make marian-decoder-new consume Response instead of Histories

#53 and #65 require this as a prerequisite.

Not having histories_ will break functionality in the replacement decoder used for benchmarks and speed tests. Adjust the marian-decoder replacements to consume the data processed from histories (translation + annotation). Remove the OutputCollector dependency. In the process we can also get rid of the lineNumber information used when streaming from a large corpus, which we do not need in a request-response mode of operation.

Options covered by TranslationRequest

Meta issue to discuss and complete the docs for the keys and possible values of the message passed in, which controls what goes into Response and how it is constructed.

@abhi-agg

We will have to add documentation listing all the keys and the corresponding values that can be provided in a translation request.

@motin This is where I want your inputs. This is not API design; it is a slight change to, and discussion of, what you communicate to me and what I respond with. The Unified API is a wall which changes the objectives to something else and is an unnecessary time sink. I put forth the following configurable parameters.

alignment: true # true | false
alignment-threshold: 0.2f # Float value
quality: false # true | false
quality-score-type: free # free | expensive
concat-strategy: faithful # faithful | space

Explanation

alignment-threshold: Alignments are a (dense) matrix per the Unified API example. This is wasteful, as the matrix is often sparse and your algorithm is expected to operate only on the high-match alignments. I'd therefore like to provide this additional configurability: set it to 0.0f when you need the full alignment (the dense matrix), or to some other tuned value when you want to experiment with different configurations.
quality-score-type: I can offer you a free quality score as of now, which should help you develop UI components. However, I cannot guarantee the API remains the same as we accommodate both Mozilla and Sheffield requirements. We're effectively parallelizing development with a bit of overhead here. I have some background developing UIs, particularly with quality scores, and I mention it here to establish the credentials. You should be able to reuse UI components and run a few iterations while we make slight tweaks in the backend to get quality structures that are different from, but close to, these.
concat-strategy: I am not sure if you want this, but you might already be aware that there are newline/no-newline issues with bergamot-translator. Here you can ask me to translate text faithfully to its source structure or not, if such provisions are present. Say you're translating a .txt: you can offload everything down and print back what we provide, in which case you'd want faithful. Not so much if you're working with sentences picked up from HTML nodes.
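
To make the alignment-threshold semantics concrete, here is a small sketch of how a dense matrix could be reduced to the high-match points. The names AlignmentPoint and sparsify are hypothetical, and the row/column orientation (rows are target tokens, columns are source tokens) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One retained alignment link between a source and a target token.
struct AlignmentPoint {
  std::size_t source;
  std::size_t target;
  float probability;
};

// dense[t][s] is the alignment probability of target token t to source token s.
// Keeps every entry with probability >= threshold, so 0.0f keeps the full matrix.
std::vector<AlignmentPoint> sparsify(
    const std::vector<std::vector<float>> &dense, float threshold) {
  std::vector<AlignmentPoint> points;
  for (std::size_t t = 0; t < dense.size(); ++t) {
    for (std::size_t s = 0; s < dense[t].size(); ++s) {
      if (dense[t][s] >= threshold) {
        points.push_back({s, t, dense[t][s]});
      }
    }
  }
  return points;
}
```

With the default of 0.2f from the list above, a 2x2 matrix such as {{0.9, 0.1}, {0.05, 0.95}} would come back as two points instead of four values.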

We can add many more as we go ahead. With a dict, the possibilities increase. We'll also need some place to document these, maybe the wiki here or generated Sphinx docs. Let me know your suggestions, or any further configurability you want.

Edit: Added quality score yes/no option.

Possible race condition between Batcher and BatchTranslator

http://vali.inf.ed.ac.uk/jenkins/job/bergamot-translator-regression-tests/31/

SIGSEGV: gdb backtrace

Thread 18 "marian-decoder-" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffeec26700 (LWP 56005)]
0x000055555566ac27 in std::__shared_ptr<marian::History, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x8d2df0) at /usr/include/c++/8/bits/shared_ptr_base.h:1078
1078        class __shared_ptr
(gdb) backtrace
#0  0x000055555566ac27 in std::__shared_ptr<marian::History, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x8d2df0) at /usr/include/c++/8/bits/shared_ptr_base.h:1078
#1  0x000055555566ac6f in std::shared_ptr<marian::History>::operator= (this=0x8d2df0) at /usr/include/c++/8/bits/shared_ptr.h:103
#2  0x0000555555669428 in marian::bergamot::Request::processHistory (this=0x55568e2b4840, index=578271, history=std::shared_ptr<marian::History> (use count 3, weak count 0) = {...})
    at /home/jphilip/code/bergamot-translator@integration/src/translator/request.cpp:39
#3  0x0000555555669678 in marian::bergamot::RequestSentence::completeSentence (this=0x55572e949200, history=std::shared_ptr<marian::History> (use count 3, weak count 0) = {...})
    at /home/jphilip/code/bergamot-translator@integration/src/translator/request.cpp:77
#4  0x00005555556395ef in marian::bergamot::Batch::completeBatch (this=0x7fffeec248d0, histories=std::vector of length 256, capacity 256 = {...})
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch.cpp:24
#5  0x0000555555646f27 in marian::bergamot::BatchTranslator::translate (this=0x555558bdd600, batch=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:96
#6  0x00005555556471b2 in marian::bergamot::BatchTranslator::consumeFrom (this=0x555558bdd600, pcqueue=...)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/batch_translator.cpp:109
#7  0x0000555555625ac6 in marian::bergamot::Service::<lambda()>::operator()(void) const (__closure=0x55555a053fe8)
    at /home/jphilip/code/bergamot-translator@integration/src/translator/service.cpp:41
#8  0x0000555555627436 in std::__invoke_impl<void, marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(std::__invoke_other, marian::bergamot::Service::<lambda()> &&)
    (__f=...) at /usr/include/c++/8/bits/invoke.h:60
#9  0x000055555562727f in std::__invoke<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> >(marian::bergamot::Service::<lambda()> &&) (__fn=...)
    at /usr/include/c++/8/bits/invoke.h:95
#10 0x00005555556287f8 in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (
    this=0x55555a053fe8) at /usr/include/c++/8/thread:244
#11 0x00005555556287ce in std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > >::operator()(void) (this=0x55555a053fe8)
    at /usr/include/c++/8/thread:253
#12 0x00005555556287b2 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::bergamot::Service::Service(marian::Ptr<marian::Options>)::<lambda()> > > >::_M_run(void) (
    this=0x55555a053fe0) at /usr/include/c++/8/thread:196
#13 0x00007ffff78cdd80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff702f6db in start_thread (arg=0x7fffeec26700) at pthread_create.c:463
#15 0x00007ffff6d58a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

bergamot-translator/src/translator/request.cpp

Line 39 in c2b1c6e

histories_[index] = history;

bergamot-translator/src/translator/request.cpp

Line 77 in c2b1c6e

request_->processHistory(index_, history);

bergamot-translator/src/translator/batch.cpp

Line 24 in c2b1c6e

sentences_[i].completeSentence(histories[i]);

bergamot-translator/src/translator/batch_translator.cpp

Line 96 in c2b1c6e

batch.completeBatch(histories);

bergamot-translator/src/translator/batch_translator.cpp

Line 109 in c2b1c6e

translate(batch);

Something is amiss with a shared_ptr and copy/move. Unsure what is happening; opening an issue to keep track.
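
This is not a diagnosis, only a defensive sketch that could help narrow things down. It assumes a request pre-sizes its histories once, that each worker writes a distinct slot, and that an out-of-range index should fail loudly instead of writing through a bad pointer. RequestSketch and HistorySketch are hypothetical stand-ins for the classes in the backtrace.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

struct HistorySketch {
  int id;
};

class RequestSketch {
public:
  explicit RequestSketch(std::size_t numSentences)
      : histories_(numSentences), remaining_(numSentences) {}

  // Workers may call this concurrently as long as each index is used once.
  // The vector is never resized after construction, so distinct slots can be
  // written from different threads without a lock.
  void processHistory(std::size_t index, std::shared_ptr<HistorySketch> history) {
    if (index >= histories_.size()) {
      throw std::out_of_range("sentence index out of range");
    }
    histories_[index] = std::move(history);
    remaining_.fetch_sub(1);
  }

  // True once every sentence has reported back.
  bool complete() const { return remaining_.load() == 0; }

private:
  std::vector<std::shared_ptr<HistorySketch>> histories_;
  std::atomic<std::size_t> remaining_;
};
```

If the real failure turns out to be a lifetime problem (the request going away while a batch still refers to it), a bounds check will not catch it, but it would at least separate "bad index" from "bad object".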

Adjust WASM (bindings) side of bytebuffer loads

Model and shortlist byte-array loads are being prepared. However, these are untested on the WASM side of the integration in this repo, particularly for TranslationModel, where only the string-based constructor and not the const void*-based constructor is exported.

bergamot-translator/wasm/bindings/TranslationModelBindings.cpp

Lines 14 to 23 in f89c989

    
    EMSCRIPTEN_BINDINGS(translation_model) {
      class_<TranslationModel>("TranslationModel")
        .constructor<std::string>()
        .function("translate", &TranslationModel::translate)
        .function("isAlignmentSupported", &TranslationModel::isAlignmentSupported)
        ;
      register_vector<std::string>("VectorString");
      register_vector<TranslationResult>("VectorTranslationResult");
    }

Incorrect cmake level requirement

browsermt/marian-dev#33

We need to bump this project's cmake requirements too...

Transfer tiny tests to GitHub CI

Jenkins@vali is dead. While I bring it back up, it seems reasonable to move some tests to GitHub CI. There's also the plus of running the regression tests in a clean environment every single time, unlike Jenkins, which caches models etc.

Many of the tests run quickly on native, right? Why aren't these GitHub Actions?

Maybe not a bad idea. With 30-minute debug cycles, we're looking at:

Use upload artifacts to generate the expected outputs on GitHub runners.
Replace the expected outputs in the bergamot-translator repository with the uploaded artifacts.

CMakeLists.txt reduce to only bergamot-requirements

@abhi-agg

Importing the compilation flags of marian and applying them to all of the bergamot-translator sources doesn't make sense, as a lot of them (e.g. -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DUSE_SENTENCEPIECE -DMKL_ILP64) are not even relevant for these sources. A better way would be to explicitly specify the flags that are relevant for optimization (e.g. -O3) in this project.

Build fails for some users when using the Docker build setup in the wasm-integration branch

Re-filing this here: https://github.com/mozilla-extensions/bergamot-browser-extension/issues/28

Jenkins bergamot-translator #44 failed

Build 'bergamot-translator' is failing!

Last 50 lines of build output:

[...truncated 34.17 KB...]
[ 92%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/sentence_splitter.cpp.o
[ 92%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/batch_translator.cpp.o
[ 93%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/multifactor_priority.cpp.o
[ 93%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/request.cpp.o
[ 94%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/service.cpp.o
[ 94%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/batcher.cpp.o
[ 94%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/response.cpp.o
[ 95%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/batch.cpp.o
[ 95%] Linking CXX executable ../../../marian-vocab
[ 95%] Built target marian_vocab
[ 95%] Building CXX object src/translator/CMakeFiles/bergamot-translator.dir/sentence_ranges.cpp.o
[ 96%] Linking CXX static library libbergamot-translator.a
[ 96%] Built target bergamot-translator
[ 97%] Building CXX object app/CMakeFiles/marian-decoder-new.dir/marian-decoder-new.cpp.o
[ 97%] Building CXX object app/CMakeFiles/bergamot-translator-app.dir/main.cpp.o
[ 98%] Building CXX object app/CMakeFiles/service-cli.dir/main-mts.cpp.o
[ 98%] Linking CXX executable ../../../marian-conv
[ 98%] Linking CXX executable bergamot-translator-app
[ 98%] Linking CXX executable ../../../marian-decoder
[ 98%] Linking CXX executable service-cli
[ 98%] Linking CXX executable marian-decoder-new
[ 98%] Built target marian_conv
[100%] Linking CXX executable ../../../marian-scorer
[100%] Built target marian_decoder
[100%] Built target bergamot-translator-app
[100%] Built target service-cli
[100%] Built target marian-decoder-new
[100%] Built target marian_scorer
[100%] Linking CXX executable ../../../marian
[100%] Built target marian_train
+ cd ..
+ tar zcvf bergamot-translator.tgz build/CMakeCache.txt build/libmarian.a build/libssplit.a build/src/translator/libbergamot-translator.a build/marian build/marian-conv build/marian-decoder build/marian-scorer build/marian-vocab build/spm_decode build/spm_encode build/spm_export_vocab build/spm_normalize build/spm_train build/app/{bergamot-translator-app,service-cli,marian-decoder-new}
build/CMakeCache.txt
build/libmarian.a
build/libssplit.a
build/src/translator/libbergamot-translator.a
build/marian
build/marian-conv
build/marian-decoder
build/marian-scorer
build/marian-vocab
build/spm_decode
build/spm_encode
build/spm_export_vocab
build/spm_normalize
build/spm_train
tar: build/app/{bergamot-translator-app,service-cli,marian-decoder-new}: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
Build step 'Execute shell' marked build as failure

Changes since last successful build:

[aaggarwal] 2538fb6 - Added workflows for compilation with custom marian


Model loading from byte buffer

@XapaJIaMnu

For model loading I see modifications to Service and BatchTranslator. Service is currently subclassed into a non-threaded implementation and a multithreaded one. Calls to initialize BatchTranslator happen in the respective constructors.

The model loading actually happens in BatchTranslator.

bergamot-translator/src/translator/batch_translator.cpp

Lines 26 to 39 in f17f02a

    
    graph_ = New<ExpressionGraph>(true); // always optimize
    auto prec = options_->get<std::vector<std::string>>("precision", {"float32"});
    graph_->setDefaultElementType(typeFromString(prec[0]));
    graph_->setDevice(device_);
    graph_->getBackend()->configureDevice(options_);
    graph_->reserveWorkspaceMB(options_->get<size_t>("workspace"));
    scorers_ = createScorers(options_);
    for (auto scorer : scorers_) {
      scorer->init(graph_);
      if (slgen_) {
        scorer->setShortlistGenerator(slgen_);
      }
    }
    graph_->forward();

So I'll need createScorers (L32) from marian to accept a byte buffer, which I pass all the way down from whichever implementation of Service requires it. The concurrent-queuing implementation is a bit ahead, and both messier and cleaner depending on the place (imo), but I'm taking responsibility for bringing it in sync once you have integrated your changes into main.

Build of bergamot-translator-regression-tests on jenkins@vali are failing

marion-minion @ jenkins on vali: Build bergamot-translator-regression-tests build failed.


Complete TranslationModelConfiguration

If TranslationModelConfiguration needs to be kept, I suggest we write Options (service-cli) -> TranslationModelConfiguration -> Options and then init Service with the resulting Options, to check whether TranslationModelConfiguration is comprehensive enough to cover the configuration required to build the model.

I must say I am of the opinion that we should not re-implement Options' capabilities inside TranslationModelConfiguration, but if this is a hard requirement, somebody needs to do this.

@abhi-agg: Consider this a self-assigned todo?

bergamot-translator/src/translator/TranslationModelConfigToOptionsAdaptor.cpp

Line 15 in fd897dc

// ToDo: Add actual implementation
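
The round-trip check suggested above could be sketched as follows. Everything here is a stand-in: OptionsSketch is a plain string map rather than marian's Options, ConfigSketch holds only two illustrative fields rather than the real TranslationModelConfiguration, and the keys are examples taken from the command-line flags discussed in this tracker.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using OptionsSketch = std::map<std::string, std::string>;

// Hypothetical, deliberately incomplete configuration class.
struct ConfigSketch {
  std::string modelPath;
  std::string vocabPath;
};

// Options -> configuration: anything the struct has no field for is dropped.
ConfigSketch toConfig(const OptionsSketch &options) {
  ConfigSketch config;
  auto model = options.find("models");
  if (model != options.end()) config.modelPath = model->second;
  auto vocab = options.find("vocabs");
  if (vocab != options.end()) config.vocabPath = vocab->second;
  return config;
}

// Configuration -> Options: only what the struct knows about comes back.
OptionsSketch toOptions(const ConfigSketch &config) {
  OptionsSketch options;
  if (!config.modelPath.empty()) options["models"] = config.modelPath;
  if (!config.vocabPath.empty()) options["vocabs"] = config.vocabPath;
  return options;
}

// Returns the keys that are missing or changed after the round trip.
std::vector<std::string> lostKeys(const OptionsSketch &original) {
  const OptionsSketch roundTripped = toOptions(toConfig(original));
  std::vector<std::string> lost;
  for (const auto &entry : original) {
    auto found = roundTripped.find(entry.first);
    if (found == roundTripped.end() || found->second != entry.second) {
      lost.push_back(entry.first);
    }
  }
  return lost;
}
```

An empty result from lostKeys would mean the configuration class covers everything the original Options carried; any key it returns is a field the configuration class is still missing.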

Remove Histories from Response

Moving from #50.

Given alignments and quality-scores have been extracted in #46, there is potentially no further use for keeping histories_. (@kpu let know if we need anything more from histories) This allows removing histories_, any lazy constructions I was keen on keeping before and the remaining data-members are not strictly marian internal anymore.

This could pave the way for replacing TranslationResult (and thus not having two translation results), as mentioned in #50. However, TranslationResult remains constrained by WASM limitations (like https://github.com/mozilla/bergamot-translator-old/issues/14), and I'm not sure we want Response to be constrained by those as well. Response is the intended output struct for the marian-server replacement.

Assigning task to self.

/cc @kpu, please advise on the WASM situation with TranslationResult.

Making ssplit-cpp self contained for bergamot project

Currently, users are required to have the pcre2 library already installed on the system to be able to use ssplit-cpp in the bergamot project. To be able to run it for WASM, we need a pcre2 library that is compiled for WASM, and we need to link it.

There are 2 ways to solve this issue:

Either link against a pcre2 library that is compiled for WASM (a bad way of integrating)
Or include the pcre2 sources in ssplit-cpp in the 3rd_party folder, build them and link them into ssplit-cpp, as an alternative to linking with a pcre2 pre-installed on the system

The 2nd way is clearly the more maintainable and cleaner solution.

I am ready to do this task and can submit a PR.
If @ugermann doesn't want to always compile pcre2 from sources then I am fine with that as well. I can add a cmake option USE_INTERNAL_PCRE2 which can enable/disable building pcre2 from sources.

This is blocking my work and we need to agree fast.

WASM Build Failures

I'm setting up WASM to build locally, on branch wasm-integration. Is there a known solution to the following error, @abhi-agg, @motin?


EMULATOR="/mnt/Storage/jphilip/bergamot/emsdk/node/14.15.5_64bit/bin/node"
-- Project name: marian
-- Project version: v1.9.56+b86f8a7
CMake Warning at 3rd_party/marian-dev/CMakeLists.txt:97 (message):
  CMAKE_BUILD_TYPE not set; setting to Release


-- Checking support for CPU intrinsics
-- Could not find hardware support for SSE3 on this machine.
-- Could not find hardware support for SSSE3 on this machine.
-- Could not find hardware support for SSE4.1 on this machine.
-- Could not find hardware support for AVX on this machine.
-- Could not find hardware support for AVX2 on this machine.
-- Could not find hardware support for AVX512 on this machine.
-- SSE2 support found
CMake Warning at 3rd_party/marian-dev/CMakeLists.txt:462 (message):
  COMPILE_CUDA=off : Building only CPU version


-- Not Found Tcmalloc
CMake Warning at 3rd_party/marian-dev/CMakeLists.txt:499 (message):
  Cannot find TCMalloc library.  Continuing.


CMake Deprecation Warning at 3rd_party/marian-dev/src/3rd_party/onnxjs/deps/eigen/CMakeLists.txt:3 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- Performing Test COMPILER_SUPPORT_std=cpp03
-- Performing Test COMPILER_SUPPORT_std=cpp03 - Failed
CMake Error at 3rd_party/marian-dev/src/3rd_party/onnxjs/deps/eigen/CMakeLists.txt:112 (message):
  Can't link to the standard math library.  Please report to the Eigen
  developers, telling them about your platform.


-- Configuring incomplete, errors occurred!

Here are the commands used to build: https://gist.github.com/jerinphilip/f1caac37ac1eced45e6e9d94380bc4d0

Adapt JS bindings for WASM module to pass files as byte array from browser extension

The current bindings only support the single std::string config constructor. To speed things up, the adjustment at the WASM-TranslationModel boundary, which at least needs to build, needn't wait until the feature is developed on the marian end. Ideally, this part should be testable using stubs, so that if something fails we know whether to look in marian or at the WASM boundary.

mozilla#8 complains about std::string_view exports (which I hope is just an absence of first-class support), while conceptually a string_view is only const void* + size_t. If you need to test with the actual pointer, @XapaJIaMnu has already gotten it as far as the WASM-TranslationModel boundary (7da6392) with the model in current main.

bergamot-translator/wasm/bindings/TranslationModelBindings.cpp

Lines 14 to 22 in e0dca1b

    
    EMSCRIPTEN_BINDINGS(translation_model) {
      class_<TranslationModel>("TranslationModel")
        .constructor<std::string>()
        .function("translate", &TranslationModel::translate)
        .function("isAlignmentSupported", &TranslationModel::isAlignmentSupported)
        ;
      register_vector<std::string>("VectorString");
      register_vector<TranslationResult>("VectorTranslationResult");

Refactoring const void *model_memory, size_t model_memory_size with a MemoryGift struct or something can be done later once I generate JS bindings.

Please prioritize this and get it in asap so that #69 gets timely feedback.

EDIT: Files in question here are model, vocabulary and shortlist files.

Update marian-dev submodule to use its master branch

Currently, the marian-dev submodule of this repository points to the wasm branch of marian-dev.

Change it back to the master branch.

Be careful with integers that can overflow

There's a lot of int in here.

bergamot-translator/src/translator/request.h

Lines 98 to 150 in 45a8309

    
    class Batch {
    public:
      Batch() { reset(); }
      void reset() {
        Id_ = 0;
        sentences_.clear();
      }
      // Convenience function to determine poison.
      bool isPoison() { return (Id_ == -1); }
      static Batch poison() {
        Batch poison_;
        poison_.Id_ = -1;
        return poison_;
      }
      void log() {
        int numTokens{0}, maxLength{0};
        for (auto &sentence : sentences_) {
          numTokens += sentence.numTokens();
          maxLength = std::max(maxLength, static_cast<int>(sentence.numTokens()));
        }
        LOG(info, "Batch(Id_={}, tokens={}, max-length={}, sentences_={})", Id_,
            numTokens, maxLength, sentences_.size());
      }
      void add(const RequestSentence &sentence) { sentences_.push_back(sentence); }
      size_t size() { return sentences_.size(); }
      void setId(int Id) {
        assert(Id > 0);
        Id_ = Id;
        if (Id % 500 == 0) {
          log();
        }
      }
      const RequestSentences &sentences() { return sentences_; }
      void completeBatch(const Histories &histories) {
        for (int i = 0; i < sentences_.size(); i++) {
          sentences_[i].completeSentence(histories[i]);
        }
      }
    private:
      int Id_;
      RequestSentences sentences_;
    };
    } // namespace bergamot
    } // namespace marian

We shouldn't need an integer batch id anyway, so best to get rid of it. Poison can be something else, a bool even. Sequential IDs shouldn't be signed 32-bit integers.
In general sizes can be > 4 billion if size_t permits. Don't be sloppy with types here.

As is, one could run 2^32-1 batches through, and then it would see poison and deadlock.
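One way to act on the points above, as a minimal sketch rather than the project's actual fix: poison becomes an explicit bool, the batch carries no integer id at all, and every count is a size_t. SentenceStub and BatchSketch are hypothetical names that stand in for RequestSentence and Batch; logging is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for RequestSentence; only what the sketch needs.
struct SentenceStub {
  std::vector<int> tokens;
  std::size_t numTokens() const { return tokens.size(); }
};

class BatchSketch {
public:
  // Poison is an explicit flag, so no id value is reserved for it
  // and no counter can wrap around into it.
  static BatchSketch poison() {
    BatchSketch batch;
    batch.poison_ = true;
    return batch;
  }

  bool isPoison() const { return poison_; }

  void add(const SentenceStub &sentence) { sentences_.push_back(sentence); }
  std::size_t size() const { return sentences_.size(); }

  // Token statistics stay in size_t; no narrowing casts to int.
  std::size_t numTokens() const {
    std::size_t total = 0;
    for (const auto &sentence : sentences_) total += sentence.numTokens();
    return total;
  }

  std::size_t maxLength() const {
    std::size_t longest = 0;
    for (const auto &sentence : sentences_)
      longest = std::max(longest, sentence.numTokens());
    return longest;
  }

private:
  bool poison_ = false;
  std::vector<SentenceStub> sentences_;
};
```

If a sequence number is still wanted for logging, it could live next to the flag as a size_t (or a 64-bit unsigned type) without ever being used to signal poison.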

WASM build allows copy-construction of an object that is deliberately non-copyable

This is currently in the source due to @abhi-agg: "I had to hack something together for WASM to work", for which we now have shareable build error logs courtesy of #46.

I deem this an issue. Hack or not, unless it is explained and we get to the bottom of it, this is a point of potential error: we transfer this object through a future, and allowing copy-construction allows for garbled outputs, which is why it was explicitly disallowed.

bergamot-translator/src/TranslationResult.h

Lines 23 to 60 in a9e0d80

    
    #ifdef WASM_BINDINGS
      TranslationResult(const std::string &original, const std::string &translation)
          : originalText(original), translatedText(translation),
            sentenceMappings() {}
    #endif
      TranslationResult(const std::string &original, const std::string &translation,
                        SentenceMappings &sentenceMappings)
          : originalText(original), translatedText(translation),
            sentenceMappings(sentenceMappings) {}
      TranslationResult(TranslationResult &&other)
          : originalText(std::move(other.originalText)),
            translatedText(std::move(other.translatedText)),
            sentenceMappings(std::move(other.sentenceMappings)) {}
    #ifdef WASM_BINDINGS
      TranslationResult(const TranslationResult &other)
          : originalText(other.originalText),
            translatedText(other.translatedText),
            sentenceMappings(other.sentenceMappings) {}
    #endif
      TranslationResult(std::string &&original, std::string &&translation,
                        SentenceMappings &&sentenceMappings)
          : originalText(std::move(original)),
            translatedText(std::move(translation)),
            sentenceMappings(std::move(sentenceMappings)) {}
    #ifndef WASM_BINDINGS
      TranslationResult &operator=(const TranslationResult &) = delete;
    #else
      TranslationResult &operator=(const TranslationResult &result) {
        originalText = result.originalText;
        translatedText = result.translatedText;
        sentenceMappings = result.sentenceMappings;
        return *this;
      }
    #endif

WASM Build errors

I'm getting the errors below when following the README instructions to build a WASM artifact.

macbook-pro:build-wasm anatal$ emcmake cmake -DCOMPILE_WASM=on ../
configure: cmake -DCOMPILE_WASM=on ../ -DCMAKE_TOOLCHAIN_FILE=/Users/anatal/projects/mozilla/bergamot/bergamot-translator/emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DCMAKE_CROSSCOMPILING_EMULATOR="/Users/anatal/projects/mozilla/bergamot/bergamot-translator/emsdk/node/12.18.1_64bit/bin/node"
-- Project name: marian
-- Project version: v1.9.37+0200843
CMake Warning at 3rd_party/marian-dev/CMakeLists.txt:73 (message):
  CMAKE_BUILD_TYPE not set; setting to Release


-- Checking support for CPU intrinsics
-- Could not find hardware support for SSE3 on this machine.
-- Could not find hardware support for SSSE3 on this machine.
-- Could not find hardware support for SSE4.1 on this machine.
-- Could not find hardware support for AVX on this machine.
-- Could not find hardware support for AVX2 on this machine.
-- Could not find hardware support for AVX512 on this machine.
-- SSE2 support found
CMake Warning at 3rd_party/marian-dev/CMakeLists.txt:375 (message):
  COMPILE_CUDA=off : Building only CPU version


-- Not Found Tcmalloc
CMake Warning at 3rd_party/marian-dev/CMakeLists.txt:412 (message):
  Cannot find TCMalloc library.  Continuing.


-- Could NOT find MKL (missing: MKL_LIBRARIES MKL_INCLUDE_DIRS MKL_INTERFACE_LIBRARY MKL_SEQUENTIAL_LAYER_LIBRARY MKL_CORE_LIBRARY)
-- Looking for sgemm_
-- Looking for sgemm_ - not found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - no
-- Could NOT find Threads (missing: Threads_FOUND)
-- Could NOT find BLAS (missing: BLAS_LIBRARIES)
CMake Warning at 3rd_party/marian-dev/src/3rd_party/intgemm/CMakeLists.txt:25 (message):
  Not building AVX512BW-based multiplication because your compiler is
  too old.

  For details rerun cmake with --debug-trycompile then try to build in
  compile_tests/CMakeFiles/CMakeTmp.


CMake Warning at 3rd_party/marian-dev/src/3rd_party/intgemm/CMakeLists.txt:33 (message):
  Not building AVX512VNNI-based multiplication because your compiler is
  too old.

  For details rerun cmake with --debug-trycompile then try to build in
  compile_tests/CMakeFiles/CMakeTmp.


CMake Warning at 3rd_party/marian-dev/src/3rd_party/CMakeLists.txt:63 (message):
  You are compiling SentencePiece binaries with -DUSE_STATIC_LIBS=on.  This
  will cause spm_train to segfault.  No need to worry if you do not intend to
  use that binary.  Marian support for SentencePiece will work fine.


-- VERSION: 0.1.6
CMake Error at /usr/local/Cellar/cmake/3.13.1/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find Protobuf (missing: Protobuf_LIBRARIES Protobuf_INCLUDE_DIR)
Call Stack (most recent call first):
  /usr/local/Cellar/cmake/3.13.1/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  /usr/local/Cellar/cmake/3.13.1/share/cmake/Modules/FindProtobuf.cmake:595 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  3rd_party/marian-dev/src/3rd_party/sentencepiece/src/CMakeLists.txt:15 (find_package)


-- Configuring incomplete, errors occurred!
See also "/Users/anatal/projects/mozilla/bergamot/bergamot-translator/build-wasm/CMakeFiles/CMakeOutput.log".
See also "/Users/anatal/projects/mozilla/bergamot/bergamot-translator/build-wasm/CMakeFiles/CMakeError.log".
macbook-pro:build-wasm anatal$

Callback on Request completion to construct Response

The changes proposed in this issue are to be implemented, if approved, following the Alignments PR (#46), and can help #56 and #53.

tl;dr: f(x; \theta) = ResponseBuilder(histories; vocabs, std::promise<Response>)

It's a bit unnatural to have std::promise<Response> and vocabs in Request, which should only hold request information.

bergamot-translator/src/translator/request.h

Lines 38 to 40 in 317433a

    
    Request(size_t Id, size_t lineNumberBegin,
            std::vector<Ptr<Vocab const>> &vocabs_, AnnotatedBlob &&source,
            Segments &&segments, std::promise<Response> responsePromise);

Similarly Response is constructed from Histories and Vocabs (both of which needn't be there, #53). This can be simplified / neatened by having a ResponseBuilder outside Response, breaking the following messy constructor into meaningful units.

bergamot-translator/src/translator/response.cpp

Lines 11 to 103 in 317433a

    
    Response::Response(AnnotatedBlob &&source, Histories &&histories,
                       std::vector<Ptr<Vocab const>> &vocabs)
        : source(std::move(source)), histories_(std::move(histories)) {
      // Reserving length at least as much as source_ seems like a reasonable thing
      // to do to avoid reallocations.
      target.blob.reserve(source.blob.size());
      // In a first step, the decoded units (individual senteneces) are compiled
      // into a huge string. This is done by computing indices first and appending
      // to the string as each sentences are decoded.
      std::vector<std::pair<size_t, size_t>> translationRanges;
      std::vector<size_t> sentenceBegins;
      size_t offset{0};
      bool first{true};
      for (auto &history : histories_) {
        // TODO(jerin): Change hardcode of nBest = 1
        NBestList onebest = history->nBest(1);
        Result result = onebest[0]; // Expecting only one result;
        Words words = std::get<0>(result);
        auto targetVocab = vocabs.back();
        std::string decoded;
        std::vector<string_view> targetMappings;
        targetVocab->decodeWithByteRanges(words, decoded, targetMappings);
        if (first) {
          first = false;
        } else {
          target.blob += " ";
          ++offset;
        }
        sentenceBegins.push_back(translationRanges.size());
        target.blob += decoded;
        auto decodedStringBeginMarker = targetMappings.front().begin();
        for (auto &sview : targetMappings) {
          size_t startIdx = offset + sview.begin() - decodedStringBeginMarker;
          translationRanges.emplace_back(startIdx, startIdx + sview.size());
        }
        offset += decoded.size();
        // Alignments
        // TODO(jerinphilip): The following double conversion might not be
        // necessary. Hard alignment can directly be exported, but this would mean
        // WASM bindings for a structure deep within marian source.
        auto hyp = std::get<1>(result);
        auto softAlignment = hyp->tracebackAlignment();
        auto hardAlignment = data::ConvertSoftAlignToHardAlign(
            softAlignment, /*threshold=*/0.2f); // TODO(jerinphilip): Make this a
                                                // configurable parameter.
        Alignment unified_alignment;
        for (auto &p : hardAlignment) {
          unified_alignment.emplace_back((Point){p.srcPos, p.tgtPos, p.prob});
        }
        alignments.push_back(std::move(unified_alignment));
        // Quality scores: Sequence level is obtained as normalized path scores.
        // Word level using hypothesis traceback. These are most-likely logprobs.
        auto normalizedPathScore = std::get<2>(result);
        auto wordQualities = hyp->tracebackWordScores();
        wordQualities.pop_back();
        qualityScores.push_back((Quality){normalizedPathScore, wordQualities});
      }
      // Once we have the indices in translation (which might be resized a few
      // times) ready, we can prepare and store the string_view as annotations
      // instead. This is accomplished by iterating over available sentences using
      // sentenceBegin and using addSentence(...) API from Annotation.
      for (size_t i = 1; i <= sentenceBegins.size(); i++) {
        std::vector<string_view> targetMappings;
        size_t begin = sentenceBegins[i - 1];
        size_t safe_end = (i == sentenceBegins.size()) ? translationRanges.size()
                                                       : sentenceBegins[i];
        for (size_t idx = begin; idx < safe_end; idx++) {
          auto &p = translationRanges[idx];
          size_t begin_idx = p.first;
          size_t end_idx = p.second;
          const char *data = &target.blob[begin_idx];
          size_t size = end_idx - begin_idx;
          targetMappings.emplace_back(data, size);
        }
        target.addSentence(targetMappings);
      }

ResponseBuilder will be initialized with the vocabs and the promise, and will take in the histories. Moving the present constructor of Response into ResponseBuilder lets it construct a Response from the histories and vocabs, after which the std::promise<Response> can be set with the newly constructed instance.

Response will thus end up carrying only data members (AnnotatedBlobs of source and target, QualityScores, Alignments).
Consequently, Response should become very thin, and in theory ready for WASM export (conditioned on Annotation being exportable): no offending string_view, with alignments and stub QualityScores ready.
An instance of ResponseBuilder, initialized with vocabs and promise and additionally accepting histories, can be registered as a callback on Request instead of the existing spread mechanism (to be fired once translation of the request is completed), consolidating the processed Request -> Response logic into this callback.
QualityEstimation people can be pointed towards just ResponseBuilder, where they will additionally have access to histories, vocabs, etc. to operate on. Feels like a saner API.
When amend/cancel capable futures are required, the std::promise can be replaced with the std::promise equivalent of the enhanced future.
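The callback idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: HistoryStub, VocabStub, ResponseSketch and ResponseBuilderSketch are hypothetical stand-ins for the marian and bergamot types, a history is reduced to a list of token ids, and alignments and quality scores are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for marian's Histories and Vocab.
using HistoryStub = std::vector<std::size_t>;  // token ids of one sentence
using HistoriesStub = std::vector<HistoryStub>;
using VocabStub = std::vector<std::string>;    // token id -> surface form

// Thin response: data members only, nothing marian-internal.
struct ResponseSketch {
  std::string source;
  std::string target;
};

class ResponseBuilderSketch {
public:
  ResponseBuilderSketch(std::string source, VocabStub vocab,
                        std::promise<ResponseSketch> promise)
      : source_(std::move(source)),
        vocab_(std::move(vocab)),
        promise_(std::move(promise)) {}

  // Callback to fire once every sentence of the request is translated.
  // Builds the response and fulfils the promise; call it exactly once.
  void operator()(const HistoriesStub &histories) {
    ResponseSketch response;
    response.source = std::move(source_);
    bool first = true;
    for (const auto &history : histories) {
      for (std::size_t id : history) {
        if (!first) response.target += " ";
        first = false;
        response.target += vocab_.at(id);
      }
    }
    promise_.set_value(std::move(response));
  }

private:
  std::string source_;
  VocabStub vocab_;
  std::promise<ResponseSketch> promise_;
};
```

With this shape, Request only needs to hold something callable with the histories, and the client keeps the matching std::future.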

Collapse draft API and actual implementation

The following class pairs should collapse into one class.

Service and TranslationModel
Response and TranslationResult
Request and TranslationRequest

My understanding is the Translation classes came from the "Unified API" which was a draft of what the API could look like. During that exercise, we agreed that the API would change with actual implementation. It always does. An API written without implementation inevitably does not capture the full complexity of the task. It also adds unnecessary complexity with functionality that neither the backend nor front end actually uses. Moreover, it does not consider what is efficient to pass and there's no point to multiple format conversions.

Ideally a skeleton API class should be subsumed into the implemented class. There will be no converter between them, which is just a source of bugs.

My understanding is nobody is calling alignments and quality estimation, so it's easiest to merge there.

@jerinphilip Let's avoid arguing about casing for file names and try to avoid reinvention where possible.

@abhi-agg A working implementation should not be taxed because its design differs.

@motin @mlopatka @andrenatal Happy to discuss.

ByteRange capability WASM Bindings

TranslationResult as it stands now only has sentence-mappings which are string_views. However string_view is not supported by embind per mozilla#8, and has since been changed to the following structure.

Proposed alternative is ByteRange:

struct ByteRange {
    size_t begin_byte_offset;
    size_t end_byte_offset;
};

There are no means yet to operate with this in 1) QualityScore (uses string_view), 2) TranslationResult (uses sentence-mappings for string_views, missing word byteranges). Alignments are ready with the above specified datatype.

I propose we export Annotation in WASM instead, to reduce work. Annotation is a simple view on top of the reference string, composed of the ByteRange primitive, and can provide access to both sentences and words using a flat container. These can map one-to-one to indices in an Alignment structure and reduce the overall export surface. An alternate annotation (a view on a string) can simply be exported for a QualityScore, reducing work.
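To make the flat-container idea concrete, here is a minimal sketch. It assumes end_byte_offset is exclusive and that every sentence has at least one word; AnnotationSketch and slice are hypothetical names, and the layout of the real Annotation class may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ByteRange {
  std::size_t begin_byte_offset;
  std::size_t end_byte_offset;  // assumed exclusive
};

// Flat layout: all word ranges in one vector, plus the index of the
// first word of each sentence.
class AnnotationSketch {
public:
  void addSentence(const std::vector<ByteRange> &words) {
    sentenceBegin_.push_back(words_.size());
    words_.insert(words_.end(), words.begin(), words.end());
  }

  std::size_t numSentences() const { return sentenceBegin_.size(); }

  std::size_t numWords(std::size_t s) const {
    return sentenceEnd(s) - sentenceBegin_[s];
  }

  ByteRange word(std::size_t s, std::size_t w) const {
    return words_[sentenceBegin_[s] + w];
  }

  // Sentence range spans from its first word to its last word.
  ByteRange sentence(std::size_t s) const {
    return ByteRange{words_[sentenceBegin_[s]].begin_byte_offset,
                     words_[sentenceEnd(s) - 1].end_byte_offset};
  }

private:
  std::size_t sentenceEnd(std::size_t s) const {
    return s + 1 < sentenceBegin_.size() ? sentenceBegin_[s + 1]
                                         : words_.size();
  }

  std::vector<ByteRange> words_;
  std::vector<std::size_t> sentenceBegin_;
};

// Resolves a range against the reference string it annotates.
inline std::string slice(const std::string &text, ByteRange range) {
  return text.substr(range.begin_byte_offset,
                     range.end_byte_offset - range.begin_byte_offset);
}
```

Only plain integers cross the boundary here, so the (sentence, word) indices can be shared with an alignment structure without exporting string_view.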

Performance Analysis: BatchTranslator

To devise an optimal batching policy, we need to know which combinations of B x T work best, treating BatchTranslator as a black box. The current choice of --mini-batch-words may be suboptimal, as N = BxT needn't give the same translation time for all sequences. A 1000x5 may take longer or shorter than a 200x25. This should vary with hardware, the sequences provided, etc. If the hardware is frozen (for a given machine), we can get real data on the "expected" time taken for batches by running over a real dataset, say WNGT20, and averaging these values.

B = number of sentences in a batch
T = time-axis (maxLength among the sequences in a batch)

We're looking at a command line application variation of marian-decoder-new which logs statistics on a BxT matrix with elements containing expected times.
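The statistics side of such a tool could look roughly like the sketch below. BatchTimingTable is a hypothetical name, and the timing of the actual translate call (for example with std::chrono around BatchTranslator) is assumed to happen elsewhere; the class only accumulates observations per (B, T) cell and reports the mean.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Accumulates observed translation times per (B, T) cell, where
// B = sentences in the batch and T = max sequence length in the batch.
class BatchTimingTable {
public:
  void record(std::size_t B, std::size_t T, double seconds) {
    Cell &cell = cells_[std::make_pair(B, T)];
    cell.total += seconds;
    ++cell.count;
  }

  // Mean observed time for the cell, or -1.0 if it was never observed.
  double expected(std::size_t B, std::size_t T) const {
    auto it = cells_.find(std::make_pair(B, T));
    if (it == cells_.end()) return -1.0;
    return it->second.total / static_cast<double>(it->second.count);
  }

private:
  struct Cell {
    double total = 0.0;
    std::size_t count = 0;
  };
  std::map<std::pair<std::size_t, std::size_t>, Cell> cells_;
};
```

A sparse map is used because most (B, T) combinations never occur in a real dataset; bucketing T into ranges would be a natural next step if the matrix turns out too sparse to be useful.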

Consensus on a style guideline with supported tooling

There is really no fixed style guideline for bergamot-translator. I'm using vim-codefmt with clang-format to get something consistent, adhering to google/styleguide as much as possible. Since the majority of the code adheres to this, I propose to make this the standard.

Also no java style setters and getters please, if possible? Can we get rid of the header files with camel-case names as well?

It's getting a bit inconsistent among all of us, might be convenient to get this sorted.

/cc @kpu @abhi-agg

Create an autosync mechanism between this repo and mozilla/bergamot-translator

Priority handling in the request queue

I've just jotted down an idea on how to handle priorities in the request queue.

I humbly think my suggestion is trivial to implement as a comparison operator between requests, and serves the purpose well. We may want to introduce a patience scaling factor to transform intuitive priorities (e.g. -20 (now!!!!!) to +20 (whenever ...)) into actual patience levels. See my proposal for details. Comments are welcome!

@jerinphilip
cc: @kpu
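For readers without the linked proposal at hand, one possible reading of "priority as a comparison operator" is sketched below. Everything in it is an assumption made for illustration: the linear mapping from priority to patience, the "arrival time plus patience" ordering, and the names RequestSketch, patience and LaterDeadline. The actual proposal may define these differently.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical request record; only the fields the comparison needs.
struct RequestSketch {
  int id;
  double arrival;  // seconds since service start
  int priority;    // -20 (now) .. +20 (whenever)
};

// Assumed mapping: patience grows linearly with priority, scaled by a
// patience scaling factor (seconds per priority step).
inline double patience(int priority, double scale) {
  return (priority + 20) * scale;
}

// Orders requests so that the one whose patience runs out first is
// served first (std::priority_queue pops the "largest" element).
struct LaterDeadline {
  double scale;
  bool operator()(const RequestSketch &a, const RequestSketch &b) const {
    return a.arrival + patience(a.priority, scale) >
           b.arrival + patience(b.priority, scale);
  }
};

using RequestQueue =
    std::priority_queue<RequestSketch, std::vector<RequestSketch>,
                        LaterDeadline>;
```

Under this reading a low-priority request cannot starve: its deadline is fixed at arrival, so newer urgent requests only overtake it until its own patience runs out.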

Getting CI to build: Conflicts

Problem:

C++11: Builds successfully on Mac in CI, fails on Ubuntu. This is due to std::string_view being present in @abhi-agg's code. I can potentially change @abhi-agg's code here to use marian::string_view instead of std::string_view, which is loaded all the way from abseil inside sentencepiece. 😄
C++17: Builds successfully on Ubuntu CPU, fails on Mac due to half_float using the register keyword, which was removed in C++17. I can try to fix it, but I have no Mac, so I will be burning CI minutes.

Alternative ideas? Is this even a standards problem?

Outbound Translation Requirements

A placeholder to gather and document the requirements of OT feature

Jenkins bergamot-translator #24 failed

Build 'bergamot-translator' is failing!

Last 50 lines of build output:

Started by user Jerin Philip
Running as SYSTEM
Building remotely on var (vnni) in workspace /var/lib/jenkins/workspace/bergamot-translator
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/browsermt/bergamot-translator # timeout=10
Fetching upstream changes from https://github.com/browsermt/bergamot-translator
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/browsermt/bergamot-translator +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
 > git rev-parse origin/master^{commit} # timeout=10
ERROR: Couldn't find any revision to build. Verify the repository and branch configuration for this job.

Changes since last successful build:
No changes

View full output

Pass size of model data

We can't just pass a void * for the model data. There needs to be some validation that the size allocated by the client is consistent with the size encoded in the model.
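A sketch of what such a check could look like, under a loudly stated assumption: the blob is taken to start with a 64-bit, host-endian payload size. That header layout is hypothetical and is not marian's actual binary model format; ModelMemory and sizeIsConsistent are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Pointer and size travel together, so the callee can validate.
struct ModelMemory {
  const void *data;
  std::size_t size;  // bytes allocated by the client
};

// Hypothetical header: the first 8 bytes hold the payload size.
inline bool sizeIsConsistent(ModelMemory memory) {
  std::uint64_t payload = 0;
  if (memory.data == nullptr || memory.size < sizeof(payload)) return false;
  // memcpy avoids unaligned reads from the client's buffer.
  std::memcpy(&payload, memory.data, sizeof(payload));
  return payload == memory.size - sizeof(payload);
}
```

The point is only that the size the client claims and the size the model claims get compared before any byte past the header is read.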

Change TranslationRequest into something like TranslationModelConfig

Re-file in bergamot-translator and tag me there if you need feedback on if some particular API design/binding approach is possible/performant in JS.

@motin Would you mind changing the TranslationRequest object into something similar to a JSON object which has key-value pairs and which you pass in as a string, as you now do with TranslationModelConfig? I can then provide you more configurability (for starters, an alignment-threshold, how to combine text for getTranslatedText, maybe different types of quality scores to experiment with on the extension side, and many more as development progresses). The key thing I put forward is that this will be a dict of sorts, easily extensible with key-value pairs. The current TranslationRequest doesn't allow me this flexibility. I see it's rather unused below, at least for now.
https://github.com/mozilla-extensions/bergamot-browser-extension/blob/21f43e6739faff4b2a600dd9aea4ac48321bda58/src/core/static/wasm/bergamot-translator-worker.appendix.js#L101-L118

The above will map one to one with a YAML (and therefore JSON) capable structure which is marian::Options similar to what's done over here. You can pass in the std::string like for TranslationModelConfig and gain a boost with higher configurability on your side.

/cc @kpu

Builds with vani

marion-minion @ jenkins on vali:

Have version information baked into executable for regression-tests mapping

This makes it easy to map regression test outputs to commit hashes. Currently it shows marian's commit hash; this needs to be superseded by the bergamot-translator git commit hash.

Jenkins bergamot-translator #18 failed

Build 'bergamot-translator' is failing!

Last 50 lines of build output:

Started by user Jerin Philip
Running as SYSTEM
Building remotely on var (vnni) in workspace /var/lib/jenkins/workspace/bergamot-translator
Cloning the remote Git repository
Cloning repository https://github.com/browsermt/bergamot-translator
 > git init /var/lib/jenkins/workspace/bergamot-translator # timeout=10
Fetching upstream changes from https://github.com/browsermt/bergamot-translator
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/browsermt/bergamot-translator +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url https://github.com/browsermt/bergamot-translator # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/browsermt/bergamot-translator # timeout=10
Fetching upstream changes from https://github.com/browsermt/bergamot-translator
 > git fetch --tags --progress https://github.com/browsermt/bergamot-translator +refs/heads/*:refs/remotes/origin/*
 > git rev-parse jp/match-marian-decoder^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/jp/match-marian-decoder^{commit} # timeout=10
Checking out Revision 4e26aa32bc009c4a4d809927f224f59ae130f03e (refs/remotes/origin/jp/match-marian-decoder)
Commit message: "Duplicating bergamot-translator for benchmarks"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4e26aa32bc009c4a4d809927f224f59ae130f03e
 > git rev-list 4e26aa32bc009c4a4d809927f224f59ae130f03e # timeout=10
[bergamot-translator] $ /bin/sh -xe /tmp/jenkins7386354468175542899.sh
+ . /etc/environment
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
+ rm -rf build
+ mkdir -p build
+ cd build
+ cat /var/lib/jenkins/cuda-10.2/version.txt
cat: /var/lib/jenkins/cuda-10.2/version.txt: No such file or directory
Build step 'Execute shell' marked build as failure

Changes since last successful build:
No changes

View full output

Collapse Service to one class

Currently there are three Service classes, which leads to bizarre stuff like binary models being a property of a base Service class that doesn't do anything with them. This is effectively a tax on people trying to integrate binary models.

There should be one Service class with a choice of blocking or non-blocking by the user (currently only blocking will be offered in WASM). #ifdef away threads and PCQueue if WASM can't use them, which may lead to a few #ifdefs to avoid initializing stuff.

Patch wasm artifact to enable simd

Examples to construct TranslationRequest and corresponding TranslationResult

The current examples never construct a TranslationResult as a function of a TranslationRequest.

Can @abhi-agg provide:

One code example with a dummy TranslationRequest, but comprehensive (it needs to cover QualityScore, Alignment, etc.). Using this TranslationRequest, construct the expected TranslationResult.
Once a single example is agreed on, construct the TranslationResult from the TranslationRequest for all combinations of TranslationRequest settings (there are a bunch of bools in there, so at least 2^{#bools} cases).

To avoid being concerned with the underlying translation problem and code, treat translation as the identity: an English sentence is translated into the same English sentence, so everything (Alignment, QualityScore, etc.) is known. In short: boilerplate code (C++, not examples in comments) that generates all possible (TranslationRequest, TranslationResult) pairs for one English source text of your choice, identity-mapped to itself.

I will happily copy-paste this where needed and generate the required (TranslationRequest, TranslationResult) pairs when implementing TranslationModel, assuming the boilerplate is approved.
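The request above could be sketched roughly as follows. The struct fields, flag names, and helper functions are all hypothetical (the real TranslationRequest and TranslationResult may look different); the sketch only shows how to enumerate every combination of boolean flags with a bitmask and build the expected result under identity translation.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request with two boolean options.
struct TranslationRequest {
  bool includeQualityScores = false;
  bool includeAlignment = false;
};

// Hypothetical result; optional parts stay empty when not requested.
struct TranslationResult {
  std::string originalText;
  std::string translatedText;
  std::vector<float> wordScores;  // one log-probability per word
  std::vector<std::pair<std::size_t, std::size_t>> alignment;  // (src, tgt)
};

// Counts whitespace-separated words in the text.
std::size_t countWords(const std::string &text) {
  std::istringstream stream(text);
  std::size_t count = 0;
  for (std::string word; stream >> word;) ++count;
  return count;
}

// Identity "translation": target equals source, so every field is known.
TranslationResult identityTranslate(const std::string &text,
                                    const TranslationRequest &request) {
  TranslationResult result;
  result.originalText = text;
  result.translatedText = text;
  const std::size_t numWords = countWords(text);
  if (request.includeQualityScores) {
    result.wordScores.assign(numWords, 0.0f);  // log(1) = 0: full confidence
  }
  if (request.includeAlignment) {
    for (std::size_t i = 0; i < numWords; ++i) {
      result.alignment.emplace_back(i, i);  // word i aligns with word i
    }
  }
  return result;
}

// Enumerates all 2^{#bools} requests and pairs each with its expected result.
std::vector<std::pair<TranslationRequest, TranslationResult>> allPairs(
    const std::string &text) {
  constexpr unsigned kNumFlags = 2;
  std::vector<std::pair<TranslationRequest, TranslationResult>> pairs;
  for (unsigned mask = 0; mask < (1u << kNumFlags); ++mask) {
    TranslationRequest request;
    request.includeQualityScores = (mask & 1u) != 0;
    request.includeAlignment = (mask & 2u) != 0;
    pairs.emplace_back(request, identityTranslate(text, request));
  }
  return pairs;
}
```

Adding another boolean to the request means bumping `kNumFlags` and mapping one more bit of `mask`, which keeps the enumeration exhaustive by construction.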

Jenkins bergamot-translator #26 failed

Build 'bergamot-translator' is failing!

Last 50 lines of build output:

[...truncated 1.69 KB...]
Checking out Revision c28687fffb1ceab1befc094f06ac0671108a2c5c (origin/integration)
Commit message: "Merge pull request #38 from browsermt/wasm-integration"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f c28687fffb1ceab1befc094f06ac0671108a2c5c
 > git rev-list 38e8b3cd6d5a2db561ce201c3e69fb79c676389c # timeout=10
[bergamot-translator] $ /bin/sh -xe /tmp/jenkins6534391889599615024.sh
+ . /etc/environment
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
+ rm -rf build
+ mkdir -p build
+ cd build
+ CMAKE=/usr/bin/cmake
+ CC=/usr/bin/gcc-8 CXX=/usr/bin/g++-8 /usr/bin/cmake -DCOMPILE_CUDA=off -DCOMPILE_TESTS=OFF -DCOMPILE_EXAMPLES=OFF -DUSE_SENTENCEPIECE=ON -DCOMPILE_SERVER=OFF ..
-- The CXX compiler identification is GNU 8.4.0
-- The C compiler identification is GNU 8.4.0
-- Check for working CXX compiler: /usr/bin/g++-8
-- Check for working CXX compiler: /usr/bin/g++-8 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/bin/gcc-8
-- Check for working C compiler: /usr/bin/gcc-8 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Submodule update
Submodule path '3rd_party/marian-dev': checked out '467c43a292a68b7913af2a00d353de97c1740f92'
Submodule 'src/3rd_party/onnxjs' (https://github.com/abhi-agg/onnxjs.git) registered for path '3rd_party/marian-dev/src/3rd_party/onnxjs'
Cloning into '/var/lib/jenkins/workspace/bergamot-translator/3rd_party/marian-dev/src/3rd_party/onnxjs'...
Submodule path '3rd_party/marian-dev/src/3rd_party/intgemm': checked out 'cc71e5c2a69755009667330af1f60a4ed15b5b63'
Submodule path '3rd_party/marian-dev/src/3rd_party/onnxjs': checked out 'dfefde914fcc79b4c0f9eafcfc97e4b606af700e'
Submodule 'deps/eigen' (https://github.com/abhi-agg/eigen-git-mirror.git) registered for path '3rd_party/marian-dev/src/3rd_party/onnxjs/deps/eigen'
Cloning into '/var/lib/jenkins/workspace/bergamot-translator/3rd_party/marian-dev/src/3rd_party/onnxjs/deps/eigen'...
Submodule path '3rd_party/marian-dev/src/3rd_party/onnxjs/deps/eigen': checked out 'fff37f4ca0397af9ed7e04f3bd6b893a1ea2b08e'
From https://github.com/marian-nmt/sentencepiece
 * branch            bd18c834559ef4a25fa3a740b97465df2daae6eb -> FETCH_HEAD
Submodule path '3rd_party/marian-dev/src/3rd_party/sentencepiece': checked out 'bd18c834559ef4a25fa3a740b97465df2daae6eb'
From https://github.com/ugermann/ssplit-cpp
 * branch            432208826ee27e7b3984b53774b1a16d74256d77 -> FETCH_HEAD
Submodule path '3rd_party/ssplit-cpp': checked out '432208826ee27e7b3984b53774b1a16d74256d77'
CMake Error at 3rd_party/marian-dev/CMakeLists.txt:42 (add_compile_definitions):
  Unknown CMake command "add_compile_definitions".


-- Configuring incomplete, errors occurred!
See also "/var/lib/jenkins/workspace/bergamot-translator/build/CMakeFiles/CMakeOutput.log".
Build step 'Execute shell' marked build as failure

Changes since last successful build:

[aaggarwal] 9a54d21 - Updated marian-dev submodule
[aaggarwal] 47b4bae - Changed encodePreservingSource -> encodeWithByteRanges
[aaggarwal] 5683168 - Updated ssplit submodule to a different repository
[aaggarwal] 584700c - Changed translate() API from non-blocking to blocking
[aaggarwal] a2d3269 - Updated ssplit submodule
[aaggarwal] 9747d9b - Add cmake option to compile project on WASM
[aaggarwal] b73d4f4 - Set cmake option to compile marian library only
[aaggarwal] 838547e - Set cmake options of marian properly for this project
[aaggarwal] 9b89650 - cmake compile option changes
[aaggarwal] 79c445a - cmake compile option changes for wasm builds
[aaggarwal] a06530e - Fixed a bug in TranslationModel class
[aaggarwal] 23a9527 - Source code changes to compile the project without threads
[aaggarwal] 7b80003 - Added code to generate proper JS bindings of translator
[aaggarwal] 74b06d8 - Add wasm folder to compile JS bindings
[aaggarwal] de501e8 - Added JS binding files and cmake infrastructure to build them
[aaggarwal] e126470 - Updated README with wasm build and use instructions
[aaggarwal] ff95e37 - Improved cmake option PACKAGE_DIR
[aaggarwal] 28dcf55 - Improved cmake to use wasm compilation flags across project
[aaggarwal] 3b7673b - Updated marian-dev submodule
[github] 9108d9f - Update README.md
[github] 3a53a68 - Update README.md
[github] a97bf7b - Update README.md
[github] 47db659 - Update README.md
[Jerin Philip] 4764f11 - Move BatchTranslator::thread_ to Service (#10)
[Jerin Philip] f1d9f67 - single-threaded run with --cpu-threads 0 (#10)
[Jerin Philip] 77a600b - Removing join() (#10)
[Jerin Philip] 73a56a8 - Refactoring batching-mechanisms into Batcher
[Jerin Philip] e585a9e - Sanitizing Batch construction
[andrenatal] 1e413f7 - Including a more elaborated test page, a node webserver containing the
[Jerin Philip] 47323d2 - Getting rid of unused variables in Batch
[Jerin Philip] ecc91c5 - BatchTranslator* -> unique_ptr
[andrenatal] 0dbc861 - Adding missing bergamot-httpserver.js
[Jerin Philip] 5bd4a1a - Refactor: marian-TranslationResult and associated
[Jerin Philip] 0fc6105 - No more two TranslationResults (sort-of)
[Jerin Philip] 370e9e2 - {translation_result -> response}.h; propogates;
[Jerin Philip] be455a3 - Straightening multithreading in translator workers
[Jerin Philip] 45a8309 - Missed translation_result -> response rename
[motin] d27a96f - Updated wasm readme
[motin] f7c8651 - Update test page package-lock.json
[motin] 26ea5bb - Some cleanup
[motin] d3969bc - Add support for translating multiple sentences on the test page + report
[motin] 28c0ab2 - Tweak words per second metric in the test page log
[motin] a33b3a3 - Add instructions on how to assemble and package the set of files
[motin] 53e0b9f - Fix typo in lexical shortlist argument on test page
[motin] e50dd09 - Ignore contents in models directory
[motin] 7030fa0 - Ignore test page bundled artifacts
[motin] 49ad651 - Add reproducible docker-based builds + let test page use these by
[motin] 77f3954 - Add time it takes to arrive to preRun to test page
[motin] dbdcdab - Avoid use of unsafe eval in glue code
[motin] 70bdcd4 - Fix typo from when fixing typo
[motin] da56501 - Finally found the original typo that made it appear as if loading the
[motin] 1e94d78 - Formatting
[motin] fcc998f - Add 10 lines of esen benchmark sentences to test page
[motin] f3ff1d2 - Make modelConfig an object instead of string (less likelihood of typos)
[motin] 7d6346d - Add model config used in pr6 benchmarks
[motin] 64d57d8 - Use yaml for modelConfig on test page
[aaggarwal] 3dd7a60 - Enabled simd shuffle pattern for intgemm compilation
[motin] 91e45cb - Prepend shortlist path with /
[motin] 9a5ae95 - Turn of assertions and disable exception catching for wasm builds
[motin] 9a5cf30 - Revert "Enabled simd shuffle pattern for intgemm compilation"
[Jerin Philip] ca6ca15 - Changing fn name from enqueue to produceTo(pcqueue)
[aaggarwal] 0374ac4 - Updated marian submodule
[aaggarwal] 3607523 - Enabled COMPILE_WITHOUT_EXCEPTIONS for marian submodule
[aaggarwal] c5c5339 - Re-enable simd shuffle pattern for intgemm compilation
[Jerin Philip] d5a5e75 - Renaming variables; Enhancing documentation
[aaggarwal] 921c2ee - Updated config for min inference time
[motin] b1e72ce - Updated instructions on how to get all relevant models in place for the
[motin] d907400 - Updated test page to use the model structure from bergamot-models repo
[Jerin Philip] 65e7406 - Comments and lazy stuff to response
[Jerin Philip] 4c8b655 - Batch cleanup
[Jerin Philip] 9c907ea - another int to size_t
[Jerin Philip] d7556bc - SentenceRanges: Class to work with string_views
[Jerin Philip] 0296a38 - Bunch of integers on containers to size_ts
[Jerin Philip] 69201ba - Unify options with marian
[Jerin Philip] fba44be - Improving Batcher error message with new option names
[Jerin Philip] c205c82 - Updates to README with option changes
[Jerin Philip] 44a44fa - CMake build with submodule recursive clones
[Jerin Philip] d005f73 - Reverting changes to PCQueue
[aaggarwal] b86f8a7 - Improved README
[Jerin Philip] 72848ba - Fixes UEdin builds after wasm-integration merge
[Jerin Philip] 47b9db0 - Documentation formatting/syntax fix
[Jerin Philip] 7b10c35 - Hard abort if multithread path launched without multithread-support
[Jerin Philip] 70b57ee - Redundant parser include fixed
[Jerin Philip] d723435 - BatchTranslator doesn't do thread_, residue from merge removed
[aaggarwal] 9feebe5 - Allow using relative paths for packaging files
[Jerin Philip] b9d081d - Temporary: Updating marian-dev to wasm branch
[Jerin Philip] d249dcb - Build doc updated with wasm-branch compatible command
[aaggarwal] b75e72e - Added more explanation for FILES_TO_PACKAGE in README
[Jerin Philip] ca9aa64 - Switch to work with ssplit-cpp both pcre2 and pcrecpp
[Jerin Philip] fbff738 - Temporary: Switch to abhi-agg/ssplit-cpp@wasm
[aaggarwal] c2371dd - Replaced "build-wasm-docker" with "build-wasm"
[aaggarwal] 79571ba - Improved wasm/README
[motin] 51f702e - Remove Docker-based builds since they are no more reproducible than
[aaggarwal] 5dcbb72 - Update ssplit submodule to master branch
[aaggarwal] fa4a1ed - Adapted model config in test example of bergamot
[aaggarwal] 462a850 - Changed Sentences to Paragraphs in test page of WASM
[aaggarwal] 458176c - Enable building pcre2 from sources for ssplit submodule
[aaggarwal] 415d16b - Single cmake option to enable/disable wasm compatible marian compilation


Single source file as an Application

Currently, there are three different files in the app folder.

The app folder should contain only one file that demonstrates the use of the unified APIs (which is done in main.cpp).

The other files should be removed once everything has stabilized.

	#include "translator/output_collector.h"
	#include "translator/output_printer.h"
	#include "translator/parser.h"
	#include "translator/response.h"
	#include "translator/service.h"

	std::vector<TranslationResult>
	TranslationModel::translate(std::vector<std::string> &&texts,
	                            TranslationRequest request) {
	  // Implementing a non-async version first. Unpleasant, but should work.
	  std::promise<std::vector<TranslationResult>> promise;

	// TODO(@jerinphilip):
	// Currently considers target tokens as whole text. Needs
	// to be further enhanced in marian-dev to extract alignments.
	for (auto &range : translationRanges) {
	  std::vector<string_view> targetMappings;
	  const char *begin = &translation_[range.first];
	  targetMappings.emplace_back(begin, range.second);
	  targetRanges_.push_back(std::move(targetMappings));
	}

	EMSCRIPTEN_BINDINGS(translation_model) {
	  class_<TranslationModel>("TranslationModel")
	      .constructor<std::string>()
	      .function("translate", &TranslationModel::translate)
	      .function("isAlignmentSupported", &TranslationModel::isAlignmentSupported);

	  register_vector<std::string>("VectorString");
	  register_vector<TranslationResult>("VectorTranslationResult");
	}

	graph_ = New<ExpressionGraph>(true); // always optimize
	auto prec = options_->get<std::vector<std::string>>("precision", {"float32"});
	graph_->setDefaultElementType(typeFromString(prec[0]));
	graph_->setDevice(device_);
	graph_->getBackend()->configureDevice(options_);
	graph_->reserveWorkspaceMB(options_->get<size_t>("workspace"));
	scorers_ = createScorers(options_);
	for (auto scorer : scorers_) {
	  scorer->init(graph_);
	  if (slgen_) {
	    scorer->setShortlistGenerator(slgen_);
	  }
	}
	graph_->forward();

	class Batch {
	public:
	  Batch() { reset(); }
	  void reset() {
	    Id_ = 0;
	    sentences_.clear();
	  }
	  // Convenience function to determine poison.
	  bool isPoison() { return (Id_ == -1); }
	  static Batch poison() {
	    Batch poison_;
	    poison_.Id_ = -1;
	    return poison_;
	  }

	  void log() {
	    int numTokens{0}, maxLength{0};
	    for (auto &sentence : sentences_) {
	      numTokens += sentence.numTokens();
	      maxLength = std::max(maxLength, static_cast<int>(sentence.numTokens()));
	    }

	    LOG(info, "Batch(Id_={}, tokens={}, max-length={}, sentences_={})", Id_,
	        numTokens, maxLength, sentences_.size());
	  }

	  void add(const RequestSentence &sentence) { sentences_.push_back(sentence); }

	  size_t size() { return sentences_.size(); }

	  void setId(int Id) {
	    assert(Id > 0);
	    Id_ = Id;
	    if (Id % 500 == 0) {
	      log();
	    }
	  }

	  const RequestSentences &sentences() { return sentences_; }
	  void completeBatch(const Histories &histories) {
	    // size_t index avoids a signed/unsigned comparison with size().
	    for (size_t i = 0; i < sentences_.size(); i++) {
	      sentences_[i].completeSentence(histories[i]);
	    }
	  }

	private:
	  int Id_;
	  RequestSentences sentences_;
	};

	} // namespace bergamot
	} // namespace marian

	#ifdef WASM_BINDINGS
	TranslationResult(const std::string &original, const std::string &translation)
	    : originalText(original), translatedText(translation),
	      sentenceMappings() {}
	#endif
	TranslationResult(const std::string &original, const std::string &translation,
	                  SentenceMappings &sentenceMappings)
	    : originalText(original), translatedText(translation),
	      sentenceMappings(sentenceMappings) {}

	TranslationResult(TranslationResult &&other)
	    : originalText(std::move(other.originalText)),
	      translatedText(std::move(other.translatedText)),
	      sentenceMappings(std::move(other.sentenceMappings)) {}

	#ifdef WASM_BINDINGS
	TranslationResult(const TranslationResult &other)
	    : originalText(other.originalText),
	      translatedText(other.translatedText),
	      sentenceMappings(other.sentenceMappings) {}
	#endif

	TranslationResult(std::string &&original, std::string &&translation,
	                  SentenceMappings &&sentenceMappings)
	    : originalText(std::move(original)),
	      translatedText(std::move(translation)),
	      sentenceMappings(std::move(sentenceMappings)) {}

	#ifndef WASM_BINDINGS
	TranslationResult &operator=(const TranslationResult &) = delete;
	#else
	TranslationResult &operator=(const TranslationResult &result) {
	  originalText = result.originalText;
	  translatedText = result.translatedText;
	  sentenceMappings = result.sentenceMappings;
	  return *this;
	}
	#endif

	Request(size_t Id, size_t lineNumberBegin,
	        std::vector<Ptr<Vocab const>> &vocabs_, AnnotatedBlob &&source,
	        Segments &&segments, std::promise<Response> responsePromise);

	Response::Response(AnnotatedBlob &&source, Histories &&histories,
	                   std::vector<Ptr<Vocab const>> &vocabs)
	    : source(std::move(source)), histories_(std::move(histories)) {
	  // Reserving length at least as much as source_ seems like a reasonable thing
	  // to do to avoid reallocations. The parameter `source` has been moved from
	  // at this point, so the size is read from the member instead.
	  target.blob.reserve(this->source.blob.size());

	  // In a first step, the decoded units (individual sentences) are compiled
	  // into a huge string. This is done by computing indices first and appending
	  // to the string as each sentence is decoded.
	  std::vector<std::pair<size_t, size_t>> translationRanges;
	  std::vector<size_t> sentenceBegins;

	  size_t offset{0};
	  bool first{true};

	  for (auto &history : histories_) {
	    // TODO(jerin): Change hardcode of nBest = 1
	    NBestList onebest = history->nBest(1);

	    Result result = onebest[0]; // Expecting only one result;
	    Words words = std::get<0>(result);
	    auto targetVocab = vocabs.back();

	    std::string decoded;
	    std::vector<string_view> targetMappings;
	    targetVocab->decodeWithByteRanges(words, decoded, targetMappings);

	    if (first) {
	      first = false;
	    } else {
	      target.blob += " ";
	      ++offset;
	    }

	    sentenceBegins.push_back(translationRanges.size());
	    target.blob += decoded;
	    auto decodedStringBeginMarker = targetMappings.front().begin();
	    for (auto &sview : targetMappings) {
	      size_t startIdx = offset + sview.begin() - decodedStringBeginMarker;
	      translationRanges.emplace_back(startIdx, startIdx + sview.size());
	    }

	    offset += decoded.size();

	    // Alignments
	    // TODO(jerinphilip): The following double conversion might not be
	    // necessary. Hard alignment can directly be exported, but this would mean
	    // WASM bindings for a structure deep within marian source.
	    auto hyp = std::get<1>(result);
	    auto softAlignment = hyp->tracebackAlignment();
	    auto hardAlignment = data::ConvertSoftAlignToHardAlign(
	        softAlignment, /*threshold=*/0.2f); // TODO(jerinphilip): Make this a
	                                            // configurable parameter.

	    Alignment unified_alignment;
	    for (auto &p : hardAlignment) {
	      unified_alignment.emplace_back((Point){p.srcPos, p.tgtPos, p.prob});
	    }

	    alignments.push_back(std::move(unified_alignment));

	    // Quality scores: Sequence level is obtained as normalized path scores.
	    // Word level using hypothesis traceback. These are most-likely logprobs.
	    auto normalizedPathScore = std::get<2>(result);
	    auto wordQualities = hyp->tracebackWordScores();
	    wordQualities.pop_back();
	    qualityScores.push_back((Quality){normalizedPathScore, wordQualities});
	  }

	  // Once we have the indices in translation (which might be resized a few
	  // times) ready, we can prepare and store the string_view as annotations
	  // instead. This is accomplished by iterating over available sentences using
	  // sentenceBegin and using addSentence(...) API from Annotation.

	  for (size_t i = 1; i <= sentenceBegins.size(); i++) {
	    std::vector<string_view> targetMappings;
	    size_t begin = sentenceBegins[i - 1];
	    size_t safe_end = (i == sentenceBegins.size()) ? translationRanges.size()
	                                                   : sentenceBegins[i];

	    for (size_t idx = begin; idx < safe_end; idx++) {
	      auto &p = translationRanges[idx];
	      size_t begin_idx = p.first;
	      size_t end_idx = p.second;

	      const char *data = &target.blob[begin_idx];
	      size_t size = end_idx - begin_idx;
	      targetMappings.emplace_back(data, size);
	    }

	    target.addSentence(targetMappings);
	  }
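The two-pass approach described in the comments above (record byte offsets while the string is still growing, build views only once it is final) can be illustrated in isolation. This is a simplified sketch using `std::string_view` from C++17; `JoinedText`, `joinSentences`, and `sentenceViews` are hypothetical names and not part of the project. Views created during the first loop could dangle, because appending to a `std::string` may reallocate its buffer; offsets remain valid regardless.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Joined text plus one [begin, end) byte range per sentence.
struct JoinedText {
  std::string blob;
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
};

// Pass 1: append sentences separated by a space and record offsets only.
JoinedText joinSentences(const std::vector<std::string> &sentences) {
  JoinedText out;
  for (std::size_t i = 0; i < sentences.size(); ++i) {
    if (i > 0) out.blob += ' ';
    const std::size_t begin = out.blob.size();
    out.blob += sentences[i];  // may reallocate the underlying buffer
    out.ranges.emplace_back(begin, out.blob.size());
  }
  return out;
}

// Pass 2: the blob is no longer modified, so views into it are stable.
std::vector<std::string_view> sentenceViews(const JoinedText &text) {
  std::vector<std::string_view> views;
  for (const auto &range : text.ranges) {
    views.emplace_back(text.blob.data() + range.first,
                       range.second - range.first);
  }
  return views;
}
```

The views returned by `sentenceViews` are only valid while the `JoinedText` they were built from stays alive and unmodified.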
