browsermt / marian-dev Goto Github PK

This project forked from marian-nmt/marian-dev

Fast Neural Machine Translation in C++ - development repository

License: Other

C++ 80.65% CMake 3.82% Cuda 7.91% Shell 0.28% Python 2.12% C 0.08% Batchfile 0.42% Perl 0.10% PowerShell 0.19% Dockerfile 0.16% Makefile 0.14% HTML 3.98% JavaScript 0.13% Vim Script 0.02%

marian-dev's People

Contributors

Stargazers

Watchers

Forkers

andrenatal abhi-agg felipesantosk targoman kroketio joachimschurig purplesparkle

marian-dev's Issues

Log about fallback multiply is causing performance issues

This is causing excessive logging:

marian-dev/src/tensors/cpu/wasm_intgemm_fallback.cpp

Line 91 in 200e81c

LOG(info, "Calling fallback implementation of \"int8MultiplyAndAddBias\"");

Let's reduce that to once at initialization.

Jenkins browsermt-marian-regression-tests #1 failed

Build 'browsermt-marian-regression-tests' is failing!

Last 50 lines of build output:

Started by user Jerin Philip
Running as SYSTEM
Building on master in workspace /var/lib/jenkins/workspace/browsermt-marian-regression-tests
Cloning the remote Git repository
Cloning repository http://github.com/browsermt/marian-regression-tests
 > git init /var/lib/jenkins/workspace/browsermt-marian-regression-tests # timeout=10
Fetching upstream changes from http://github.com/browsermt/marian-regression-tests
 > git --version # timeout=10
 > git fetch --tags --progress http://github.com/browsermt/marian-regression-tests +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url http://github.com/browsermt/marian-regression-tests # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url http://github.com/browsermt/marian-regression-tests # timeout=10
Fetching upstream changes from http://github.com/browsermt/marian-regression-tests
 > git fetch --tags --progress http://github.com/browsermt/marian-regression-tests +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision 97b2f95abab6134c1632b286e373e513ecc52020 (refs/remotes/origin/master)
Commit message: "Merge pull request #56 from marian-nmt/tests-intgemm"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 97b2f95abab6134c1632b286e373e513ecc52020
First time build. Skipping changelog.
Copied 0 artifacts from "browsermt-marian-dev-cpu" build number 1
ERROR: Failed to copy artifacts from browsermt-marian-dev-cpu with filter: marian-dev.tgz

Changes since last successful build:

View full output

Remove popen feature in native builds

As we move towards repositories with models provided by other people, a yaml config file should not be able to execute arbitrary commands with a file name that begins with |

marian-dev/src/common/file_stream.cpp

Lines 25 to 37 in 08b1544

    
             if (marian::utils::endsWith(file, "|")) { 
        
           #if defined(__unix__) && !defined(WASM_COMPATIBLE_SOURCE) 
        
               auto command = file.substr(0, file.size() - 1); 
        
               // open as a pipe 
        
               pipe_ = popen(command.c_str(), "r"); 
        
               ABORT_IF(!pipe_, "Command failed to execute ({}): {}", errno, command); 
        
               // there is no official way to construct a filebuf from a FILE* or fd, so we use /proc/{pid}/fd/{fd} 
        
               // For now, this only works on Linux. There are similar workarounds for Windows. 
        
               file_ = "/proc/" + std::to_string(getpid()) + "/fd/" + std::to_string(fileno(pipe_)); 
        
           #else 
        
               ABORT("Pipe syntax not supported in this build of Marian: {}", file); 
        
           #endif 
        
             } else {

That should be disabled in consumer-oriented compiles of Marian. The feature really only seems useful for training.

Sense int8 or int8Shift

Routing/dispatch precedence can be preconfigured in source. There's already platform detection lying around.

My problem with the existing setup is the easy way for me to maintain a continuous integration/test setup (python) for #79 now requires maintaining two sets of configuration files or detecting platform and switching at the client.

At least for native, compiled marian targetting ARM or Intel should be aware which is the best path to take.
marian-nmt#762 appears to have omitted Shift path entirely.

Separate workspace/allocations from graph

Feature description

In the current state of browsermt/marian-dev, the concept of a workspace which manages allocation of tensors is placed behind a graph accessible to the library API bergamot-translator uses. This leads to a temporarily inefficient implementation of multiple-models handling (browsermt/bergamot-translator#210), where the workspaces grow proportional to the number of models active.

@XapaJIaMnu and @kpu have previously solved swapping multiple models by means of swapping tensors onto an active graph. This is "dynamic" and a reference implementation available at https://github.com/kpu/marian-dev/blob/dynamic_swap_mvp/src/translator/swappable.cpp. While this is doable in the case of shared-architectures without incurring much expense, a change in architecture involves reconstructing the graph (eg: tied embedding model swapped out for a non-tied embedding model).

It is optimal to keep the concept of a workspace bound to threads/workers active instead, separate the graph and architecture aside to avoid the blow-up in memory usage than what is originally required.

This issue is intended to investigate how best to make the modifications to solve the above problem in this repository.

/cc @graemenail

Merge from marian-nmt/marian-dev

There's a bunch of kludges like https://github.com/browsermt/mts/blob/master/3rd_party/CMakeLists.txt#L20-L22 that are fixed in marian-nmt/marian-dev

Tensors allocated on cache does not appear deallocation capable

Assuming TensorAllocator is an attempt at implementing an equivalent of std::allocator.

Both tensors_ and cache_ in ExpressionGraph are TensorAllocators.

The cache_ allocates some tensors that are incapable of being deallocated from outside. cache_ is private to this class, and the missing free is a flaw(?).

marian-dev/src/graph/expression_graph.h

Lines 57 to 64 in 844800e

    
             if(!node->val()) { 
        
               if(node->memoize()) 
        
                 cache_->allocate(node->val(), node->shape(), node->value_type()); 
        
               else 
        
                 tensors_->allocate(node->val(), node->shape(), node->value_type()); 
        
             } 
        
           }

I could not find a cache_->free(), tensors_ on the other hand are capable of being freed manually through external calls.

marian-dev/src/graph/expression_graph.h

Line 70 in 844800e

void free(const Tensor& tensor) { tensors_->free(tensor); }

Core Dumped when vocab size not a multiple of 8

When I run:

/firefox-translations-training/3rd_party/browsermt-marian-dev/build/marian-decoder -m /data/models/spoken-signed/spoken_to_signed/student-finetuned/final.model.npz.best-chrf.npz -v /data/models/spoken-signed/spoken_to_signed/vocab/vocab.spm /data/models/spoken-signed/spoken_to_signed/vocab/vocab.spm -c decoder.yml -i /data/data/spoken-signed/spoken_to_signed/original/devset.spoken.gz -o /data/models/spoken-signed/spoken_to_signed/speed/output.signed --shortlist /data/data/spoken-signed/spoken_to_signed/alignment/lex.s2t.pruned.gz false  --dump-quantmult

I get:

[2023-11-08 12:41:15] Error: Rows of matrix: param must be multiple of 8.
[2023-11-08 12:41:15] Error: Aborted from marian::cpu::integer::PrepareBNodeOp::PrepareBNodeOp(marian::Expr, marian::Expr, float, bool) [with marian::Type vtype = (marian::Type)257u; marian::Expr = IntrusivePtr<marian::Chainable<IntrusivePtrmarian::TensorBase > >] in /firefox-translations-training/3rd_party/browsermt-marian-dev/src/tensors/cpu/intgemm_interface.h:92

I suspect that this is the case because my vocab size is 1668 which is not divisible by 8.
Is there a way to pad the vocabulary for this step particularly? Up until this step, I can train browermt fully.

Port marian to WASM for inference

Merge wasm port of marian for inference to it's main branch

Rebase on main branch
Merge wasm compatible SentencePiece in marian's main branch
- Enable (a cmake option based) decoder-only builds in SentencePiece's main branch
- Update marian's SentencePiece submodule
Merge wasm compatible intgemm submodule in marian's main branch
- Merge wormhole changes to intgemm's main branch
- Update marian's intgemm submodule

~~- [ ] Enable decoder-only build capability in marian sources~~

Add onnxjs submodule (a wasm compatible SGEMM routine) in marian (this is required as long as marian can't solely use intgemm for Int-8 inference)
Enable try/catch free builds in marian (for fast inference)
Enable capability to build marian without multi-threading constructs (this is required until threading starts working in WASM)
Add workflow for building wasm compatible marian
Wasm builds: README and docker based build scripts

Upgrade emsdk to the latest tag

Currently, emsdk 2.0.9 is being used to build engine artifacts. We would like to upgrade it to the latest tag (3.1.8 at the time of writing).

Jenkins browsermt-marian-dev-cuda-10.2 #32 failed

Build 'browsermt-marian-dev-cuda-10.2' is failing!

Last 50 lines of build output:

Started by upstream project "marian-dev-cuda-10.0" build number 188
originally caused by:
 Started by upstream project "marian-dev-cuda-10.1" build number 179
 originally caused by:
  Started by GitHub push by qianqianzhu
  Started by GitHub push by qianqianzhu
Running as SYSTEM
Building on master in workspace /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2
The recommended git tool is: NONE
No credentials specified
 > git rev-parse --resolve-git-dir /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/.git # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/browsermt/marian-dev # timeout=10
Fetching upstream changes from https://github.com/browsermt/marian-dev
 > git --version # timeout=10
 > git --version # 'git version 2.7.4'
 > git fetch --tags --progress https://github.com/browsermt/marian-dev +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 548a1ed2cda7d80b08852f72a65e2b02aa69cbc6 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 548a1ed2cda7d80b08852f72a65e2b02aa69cbc6 # timeout=10
Commit message: "Merge pull request #34 from browsermt/cmake-upgrade"
 > git rev-list --no-walk cd41f7e1c3230679556f1637f057b0eeadf6ff8b # timeout=10
[browsermt-marian-dev-cuda-10.2] $ /bin/sh -xe /tmp/jenkins804140882289464717.sh
+ . /etc/environment
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
+ rm -rf build
+ mkdir -p build
+ cd build
+ cat /var/lib/jenkins/cuda-10.2/version.txt
CUDA Version 10.2.89
+ CC=/usr/bin/gcc-8 CXX=/usr/bin/g++-8 CUDAHOSTCXX=/usr/bin/g++-8 cmake -DCOMPILE_TESTS=ON -DCOMPILE_EXAMPLES=ON -DCUDA_TOOLKIT_ROOT_DIR=/var/lib/jenkins/cuda-10.2 -DUSE_SENTENCEPIECE=ON -DCOMPILE_SERVER=ON ..
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  CMake 3.12.4 or higher is required.  You are running version 3.5.1


-- Configuring incomplete, errors occurred!
Build step 'Execute shell' marked build as failure
Skipped archiving because build is not successful

Changes since last successful build:

[qianqian.zhu] af67906 - some code refactoring and add binary shortlist
[qianqian.zhu] 0a9f49f - update marian decoder with option of loading binary shortlist
[qianqian.zhu] 3a2ad4a - update hash.h for binary shortlist
[qianqian.zhu] 5aa07b5 - add binary shortlist converter
[qianqian.zhu] 20e42c9 - small fixes on option message
[github] 40423d6 - Update GitHub workflows with Ubuntu+CUDA
[qianqian.zhu] e9226e3 - address review comments
[qianqian.zhu] d1e9149 - fix blob naming
[qianqian.zhu] 62f1537 - fix rebase mistakes in translator.h
[qianqian.zhu] 8762daa - fix CI broken and update CHANGELOG
[qianqian.zhu] 7121dd4 - add binary shortlist contructor directly from buffer
[github] 9337105 - Port marian to WASM for inference (#24)
[qianqian.zhu] 55babfc - add logger info and address jerin's comment
[66322306+abhi-agg] bbde958 - Enable loading binary files, against master
[66322306+abhi-agg] 984ca20 - Added USE_WASM_COMPATIBLE_MARIAN cmake option
[66322306+abhi-agg] 42b297d - Added workflow for compiling wasm compatible sources natively
[66322306+abhi-agg] c24d0fd - Added wasm compile workflow for ubuntu
[github] cdf02dc - Update intgemm
[66322306+abhi-agg] 130902c - Replaced compile definition WASM with WASM_COMPATIBLE_SOURCE
[66322306+abhi-agg] 0f0bcf9 - Replaced cmake option USE_WASM_COMPATIBLE_MARIAN with USE_WASM_COMPATIBLE_SOURCE
[66322306+abhi-agg] b77d728 - Fixing Mac and Ubuntu CI failures
[aaggarwal] a37b0c8 - Upgraded cmake_minimum_required to 3.12.4

View full output

What determines whether or not the output layer should be converted as an intgemm?

@XapaJIaMnu explained in #3:

marian already recognises what is inside model.bin (it knows whether it's fbgemm or intgemm) . However, it still needs to be told whether to convert the output layer as an intgemm.

What determines whether or not the output layer should be converted as an intgemm?

Optional apply logprob computation at call site instead of construction site

This issue is meant to track the possibility of some workaround to get QE to be optional at run time, as opposed to construction time (skip-cost= true).

Trace:

skip-cost:

marian-dev/src/translator/scorers.cpp

Lines 41 to 43 in 08b1544

    
           bool skipCost = options->get<bool>("skip-cost"); 
        
           auto encdec = models::createModelFromOptions( 
        
               options, skipCost ? models::usage::raw : models::usage::translation);

createModelFromOptions:

marian-dev/src/models/model_factory.cpp

Lines 370 to 376 in 08b1544

    
           // add (log)softmax if requested 
        
           if (use == usage::translation) { 
        
             if(std::dynamic_pointer_cast<EncoderDecoder>(baseModel)) { 
        
               if(options->get<bool>("output-sampling", false)) 
        
                 return New<Stepwise>(std::dynamic_pointer_cast<EncoderDecoder>(baseModel), New<GumbelSoftmaxStep>()); 
        
               else 
        
                 return New<Stepwise>(std::dynamic_pointer_cast<EncoderDecoder>(baseModel), New<LogSoftmaxStep>());

StepWise:

marian-dev/src/models/costs.h

Lines 309 to 313 in 08b1544

    
           // class to wrap an IEncoderDecoder and a ILogProbStep that are executed in sequence, 
        
           // wrapped again in the IEncoderDecoder interface 
        
           // @TODO: seems we are conflating an interface defition with its implementation? 
        
           // @TODO: needs a better name. Stepwise is an adjective. Classes are things=nouns. StepwiseWhat? 
        
           class Stepwise : public IEncoderDecoder {

StepWise Relevant call site:

marian-dev/src/models/costs.h

Lines 360 to 369 in 08b1544

    
           virtual Ptr<DecoderState> step(Ptr<ExpressionGraph> graph, 
        
                                          Ptr<DecoderState> state, 
        
                                          const std::vector<IndexType>& hypIndices,   // [beamIndex * activeBatchSize + batchIndex] 
        
                                          const Words& words,                         // [beamIndex * activeBatchSize + batchIndex] 
        
                                          const std::vector<IndexType>& batchIndices, // [batchIndex] 
        
                                          int beamSize) override { 
        
             auto nextState = encdec_->step(graph, state, hypIndices, words, batchIndices, beamSize); 
        
             return cost_->apply(nextState); 
        
           }

If I insert a bool skipCost defaulting to false as part of the arguments here and ignore the cost operation if skipCost=true and trigger the param via beamsearch (see below), there is a possibility?

Call-site:

marian-dev/src/translator/beam_search.cpp

Line 421 in 08b1544

    
           states[i] = scorers_[i]->step(graph, states[i], hypIndices, prevWords, batchIndices, (int)maxBeamSize);

marian-dev/src/translator/scorers.h

Lines 129 to 138 in 08b1544

    
           virtual Ptr<ScorerState> step(Ptr<ExpressionGraph> graph, 
        
                                         Ptr<ScorerState> state, 
        
                                         const std::vector<IndexType>& hypIndices, 
        
                                         const Words& words, 
        
                                         const std::vector<IndexType>& batchIndices, 
        
                                         int beamSize) override { 
        
             graph->switchParams(getName()); 
        
             auto wrapperState = std::dynamic_pointer_cast<ScorerWrapperState>(state); 
        
             auto newState = encdec_->step(graph, wrapperState->getState(), hypIndices, words, batchIndices, beamSize); 
        
             return New<ScorerWrapperState>(newState);

Cmake requirements are incorrect. We now require 3.12.4

marian-dev/CMakeLists.txt

Line 67 in 0f0bcf9

add_compile_definitions(USE_PTHREADS)

-- Submodule update
CMake Error at 3rd_party/marian-dev/CMakeLists.txt:67 (add_compile_definitions):
  Unknown CMake command "add_compile_definitions".

add_compile_definitions is only available in 3.12.4
@abhi-agg or @motin , not sure who changed it, as the commit got squashed. Is there an alternative to that, as it breaks compilation on ubuntu 16.04. At the very least we should update the requirements on the top:

marian-dev/CMakeLists.txt

Line 1 in 0f0bcf9

cmake_minimum_required(VERSION 3.5.1)

Multiple ways to change gemmPrecision

--gem-precision and --int* are two ways to do the same thing. Functionality would still work and be accessible without the following.

marian-dev/src/common/config_parser.cpp

Lines 933 to 947 in 844800e

    
           cli.add<bool>("--int8", 
        
               "Optimize speed even more aggressively sacrificing memory or precision by using 8bit integer GEMM with intgemm instead of floats. Only available on CPU. Corresponds to --gemm-precision int8"); 
        
           cli.add<bool>("--int8Alpha", 
        
               "Use a precomputed quantisation multipliers for the activations. Requires a special model. Corresponds to --gemm-precision int8Alpha"); 
        
           cli.add<bool>("--int8shift", 
        
               "Use a faster, shifted integer 8bit GEMM implementation. Corresponds to --gemm-precision int8shift"); 
        
           cli.add<bool>("--int8shiftAlpha", 
        
               "Use a faster, shifted integer 8bit GEMM implementation, with precomputed alphas. Corresponds to --gemm-precision int8shiftAlpha"); 
        
           cli.add<bool>("--int8shiftAll", 
        
               "Use a faster, shifted integer 8bit GEMM implementation even for matrices that don't have a bias. Beneficial on VNNI. Corresponds to --gemm-precision int8shiftAll"); 
        
           cli.add<bool>("--int8shiftAlphaAll", 
        
               "Use a faster, shifted integer 8bit GEMM implementation even for matrices that don't have a bias, with precomputed alphas. Should be the fastest option. Corresponds to --gemm-precision int8shiftAlphaAll"); 
        
           cli.add<std::string>("--gemm-precision", 
        
               "Use lower precision for the GEMM operations only. Supported values: float32, int16, int8, int8Alpha, int8shift, int8shiftAlpha, int8shiftAll, int8shiftAlphaAll", "float32"); 
        
           cli.add<bool>("--dump-quantmult",

Unexplained heavy for-loop

I was looking at profiler output (https://share.firefox.dev/3HWms77)
that reported a remarkable number of samples in marian::Shape::dim() while loading model data.
This seems to stem from the rows() and cols() helpers.
I traced it a bit further to binary.cpp's marian::io::loadItems().
The only function in there that has a hot loop is unquantizeWemb() in integer_common.h
for (size_t i = 0; i < rows(item.shape) * cols(item.shape); i++)
I replaced that i < rows * cols bit with i < n and initialized n just before the loop.
Runtime for a very small test (so most time is spend loading) goes from 6s to 4.4s?
Whut?
It looks like the compiler didn't optimise that loop properly, and maybe there is some vtable trouble happening?

diff --git a/src/tensors/cpu/integer_common.h b/src/tensors/cpu/integer_common.h
index f92028091800eb60f4a04959df1c45ed058857b9..f8b82eabff0607458684378887188e495c456c4b 100644
--- a/src/tensors/cpu/integer_common.h
+++ b/src/tensors/cpu/integer_common.h
@@ -96,7 +96,8 @@ void unquantizeWemb(io::Item& item, const char * input) {
     typedef typename intgemm_<vtype>::type Integer;
     float quantMult = *(reinterpret_cast<const float *>(reinterpret_cast<const Integer *>(input) + item.shape.elements()));
     float * output_tensor = reinterpret_cast<float *>(&(*item.bytes.begin()));
-    for (size_t i = 0; i < rows(item.shape) * cols(item.shape); i++) {
+    size_t n = rows(item.shape) * cols(item.shape);
+    for (size_t i = 0; i < n; i++) {
         output_tensor[i] = reinterpret_cast<const Integer *>(input)[i]*(1/quantMult);
     }
 }

Edit: Looking more into it with Instruments, my stack trace might be wrong. Still, that diff shaved off a second pretty reliably which I don't fully understand yet.

Jenkins browsermt-marian-dev-avx512 #1 failed

Build 'browsermt-marian-dev-avx512' is failing!

Last 50 lines of build output:

[...truncated 53.79 KB...]
[01/27/2021 12:25:50] Running setup script
[01/27/2021 12:25:50] Running tests/models/wngt19/test_model_base_fbgemm_packed16.sh ... OK
[01/27/2021 12:26:04] Test took 00:00:13.158s
[01/27/2021 12:26:04] Running tests/models/wngt19/test_model_base_fbgemm_packed8.sh ... OK
[01/27/2021 12:26:14] Test took 00:00:10.808s
[01/27/2021 12:26:14] Checking directory: tests/models/wnmt18
[01/27/2021 12:26:14] Running setup script
[01/27/2021 12:26:14] Running tests/models/wnmt18/test_student_small.sh ... OK
[01/27/2021 12:26:26] Test took 00:00:11.213s
[01/27/2021 12:26:26] Running tests/models/wnmt18/test_student_small_aan.sh ... OK
[01/27/2021 12:26:36] Test took 00:00:10.118s
[01/27/2021 12:26:36] Running tests/models/wnmt18/test_student_small_aan_intgemm16.sh ... failed
[01/27/2021 12:26:46] Test took 00:00:10.032s
[01/27/2021 12:26:46] Running tests/models/wnmt18/test_student_small_aan_intgemm8.sh ... OK
[01/27/2021 12:26:55] Test took 00:00:9.373s
[01/27/2021 12:26:55] Checking directory: tests/scorer/scores
[01/27/2021 12:26:55] Running setup script
[01/27/2021 12:26:55] Running tests/scorer/scores/test_scores_cpu.sh ... failed
[01/27/2021 12:26:56] Test took 00:00:1.154s
[01/27/2021 12:26:56] Checking directory: tests/server
[01/27/2021 12:26:56] Running setup script
[01/27/2021 12:26:58] Running tests/server/test_ende_cpu.sh ... OK
[01/27/2021 12:27:36] Test took 00:00:38.131s
[01/27/2021 12:27:36] Checking directory: tests/training/restoring/multi-gpu
[01/27/2021 12:27:36] Running setup script
[01/27/2021 12:27:36] Running tests/training/restoring/multi-gpu/test_adam_sync_cpu.sh ... OK
[01/27/2021 12:28:10] Test took 00:00:33.805s
---------------------
Failed:
  - tests/decoder/intgemm/test_intgemm_16bit.sh
  - tests/decoder/intgemm/test_intgemm_16bit_avx2.sh
  - tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
  - tests/decoder/intgemm/test_intgemm_8bit.sh
  - tests/decoder/intgemm/test_intgemm_8bit_avx2.sh
  - tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh
  - tests/models/wnmt18/test_student_small_aan_intgemm16.sh
  - tests/scorer/scores/test_scores_cpu.sh
Logs:
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/decoder/intgemm/test_intgemm_16bit.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/decoder/intgemm/test_intgemm_16bit_avx2.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/decoder/intgemm/test_intgemm_16bit_sse2.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/decoder/intgemm/test_intgemm_8bit.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/decoder/intgemm/test_intgemm_8bit_avx2.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/models/wnmt18/test_student_small_aan_intgemm16.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-dev-avx512/regression-tests/tests/scorer/scores/test_scores_cpu.sh.log
---------------------
Ran 19 tests in 00:04:33.306s, 11 passed, 0 skipped, 8 failed
Build step 'Execute shell' marked build as failure

Changes since last successful build:

View full output

Different translations for the same sentence (same batch size, CPU, different surrounding sentences)

Bug description

Hi!

When I translate the sentence "Nokkuð mun minna en 50% skal gisti tær.", I get different results. The sentences that are provided in the batch size are the only difference. I've prepared a minimum example (sentences1.txt and sentences2.txt):

Files sentences1.txt and sentences2.txt contain very similar sentences, but the only common sentence is just the one which should report the same translation, but it does not: "Anything less than 50% should be edible." vs "Anything less than 50% should be ed.".

I think this might be expected if the execution was in GPU, since the order of the instructions might be different, but I think this should not happen in CPU.

I used the isen.student.base model from browsermt/students#74 (not sure if it is the same version of the PR, since I see changes in the files of the PR and mines, but should be the same).

sentences2.txt
sentences1.txt

How to reproduce

Translation script (marian-translate-is2en.sh):

#!/usr/bin/env bash

THREADS=$([[ -z "$1" ]] && echo "1" || echo "$1")

if [[ ! "$THREADS" =~ ^[0-9]+$ ]]; then
  THREADS="1"
fi

/home/cgarcia/Documentos/marian-dev/marian-dev/build/marian-decoder \
  -c /home/cgarcia/Documentos/marian-dev/models/isen.student.base/config.intgemm8bit.alphas.yml --quiet --cpu-threads "$THREADS"

Translate:

cat sentences1.txt |  marian-dev/scripts/marian-translate-is2en.sh 5

# The HCA percent is very important, make sure that these ranges in between 50 and 60%.
# Anything less than 50% should be edible.
# You need to keep an eye out for uncountable items, binders and fillers.

cat sentences2.txt |  marian-dev/scripts/marian-translate-is2en.sh 5

# The HCA percent is very important, make sure that this limit in between 50 and 60%.
# Anything less than 50% should be ed.
# You need to look out for synthetic components, binders as well as fillers.

If I translate both files together (cat sentences{1,2}.txt or cat sentences{2,1}.txt), the result of the sentence is "Anything less than 50% should be edible.".

Context

Marian version: v1.9.56 a1a82ff 2021-10-18 18:17:11 +0200
CMake command: cannot
Log file: inference

Maybe is related to huggingface/transformers#25921 (?)

compile to wasm

@lhk writes in marian-nmt/marian#343 and I've transferred it here.

Feature description

I would like to embed machine translation in my webapp. There is tensorflow.js but so far I've been unable to find suitable pretrained translation models.

Opus-MT hosts a large repository of pretrained models for many language pairs. It uses marian for the neural machine translation.

The pre- and postprocessing is cheap. I would be able to host the tokenizer on a server. But marian-decoder is too costly to host myself.
It would be great if it was possible to compile the code to webassembly and run it client-side.

I have written small projects in C/C++ and in principle would be happy to dig deeper. But guidance from someone with more experience would be really helpful.

Is this feasible at all?

Fat binary that optimizes execution path with respect to CPU features at runtime instead of at compile time

The current setup optimizes Marian to exploit CPU features at compile time. This is fine as long as we assume that Marian is re-compiled from scratch for each architecture it runs on. For distributable binaries, this is not possible.

Instead, Marian should determine CPU features at run time and optimize its execution path at that time by setting function pointers to functions optimized for the respective architecture / CPU type.

Google provides a library for detecting CPU features at run time: https://github.com/google/cpu_features

In my preliminary assessment it seems to be that the biggest hurdle will be to rework avx_mathfun.h and sse_mathfun.h.

SGEMM is tangled and cumbersome to edit

There are several problems with the existing sgemm integration. The main problem I find is an if-else/ifdef-else ladder determines BLAS provider where we verbose provide precedence and also check for existence.

This begins in CMakeLists.txt. Then trickles down into CPP source. sgemm routes through an sgemm which follows the BLAS defined API to MKL to CBLAS to ONNX SGEMM under ifdefs and multiple wraps. There is also a ProdBatched giving way to a ProdBatchedOld as another level of indirection?

At two ifdefs / ifs are acceptable. At more than two some form of switch/dispatch needs to appear to reduce headhurt. The suggestion to do f32 sgemm with ruy in #79 (review) is also the addition of a fourth provider.

There are ODR compatible ways for multiple sgemm to exist, and what seems to perhaps better in my opinion is a provider::sgemm(...) with sgemm following a standard API. It is possible to allow multiple providers (MKL, OpenBLAS, Accelerate etc) to exist in an ODR compatible way and give them ranks and a mechanism to prioritize which one at runtime (maybe decide at compile-time even).

Jenkins browsermt-marian-dev-cuda-9.2 #24 failed

Build 'browsermt-marian-dev-cuda-9.2' is failing!

Last 50 lines of build output:

[...truncated 1.38 KB...]
Building on master in workspace /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-9.2
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/browsermt/marian-dev # timeout=10
Fetching upstream changes from https://github.com/browsermt/marian-dev
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/browsermt/marian-dev +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision c24d0fd909d319f0a5380e4bcf83587ca3881386 (refs/remotes/origin/master)
Commit message: "Added wasm compile workflow for ubuntu"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f c24d0fd909d319f0a5380e4bcf83587ca3881386
 > git rev-list b08b26b5cfb9c6498ef48c1ee253f5bd5f788236 # timeout=10
[browsermt-marian-dev-cuda-9.2] $ /bin/sh -xe /tmp/jenkins8775765910282522168.sh
+ . /etc/environment
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
+ mkdir -p build
+ rm -rf build/CMakeCache.txt build/CMakeFiles build/CPackConfig.cmake build/CPackSourceConfig.cmake build/CTestTestfile.cmake build/Doxyfile build/Makefile build/Testing build/cmake_install.cmake build/iris_example build/libmarian.a build/local build/marian build/marian-conv build/marian-decoder build/marian-scorer build/marian-vocab build/mnist_example build/spm_decode build/spm_encode build/spm_export_vocab build/spm_normalize build/spm_train build/src build/test_cli build/test_dropout build/test_logger build/test_pooling build/test_prod build/test_sentencepiece_norm build/test_sqlite
+ cd build
+ cmake --version
cmake version 3.5.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).
+ cat /var/lib/jenkins/cuda-9.2/version.txt
CUDA Version 9.2.148
+ cmake -DCOMPILE_TESTS=ON -DCOMPILE_EXAMPLES=ON -DUSE_CUDNN=ON -DCUDA_TOOLKIT_ROOT_DIR=/var/lib/jenkins/cuda-9.2 ..
-- The CXX compiler identification is GNU 5.4.0
-- The C compiler identification is GNU 5.4.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at CMakeLists.txt:64 (add_compile_definitions):
  Unknown CMake command "add_compile_definitions".


-- Configuring incomplete, errors occurred!
See also "/var/lib/jenkins/workspace/browsermt-marian-dev-cuda-9.2/build/CMakeFiles/CMakeOutput.log".
Build step 'Execute shell' marked build as failure
Skipped archiving because build is not successful

Changes since last successful build:

[github] 9337105 - Port marian to WASM for inference (#24)
[66322306+abhi-agg] bbde958 - Enable loading binary files, against master
[66322306+abhi-agg] 984ca20 - Added USE_WASM_COMPATIBLE_MARIAN cmake option
[66322306+abhi-agg] 42b297d - Added workflow for compiling wasm compatible sources natively
[66322306+abhi-agg] c24d0fd - Added wasm compile workflow for ubuntu

View full output

nullptr dereference in WASM code path

Bug description

When trying to run the WASM compiled version of bergamot-translator, some models (specifically, the student.base models from browsermt/students) produce invalid output. Typically a bunch of repetitions of a single word or character per sentence. Not unlike this bug report.

In an attempt to figure out what was going on, I basically compiled the WASM code path into a native app so I could run it through llvm. This caught a nullptr dereference inside intgemm which was ultimately caused by the nullptr which is supposed to be const float* input_bias this line:

marian-dev/src/tensors/cpu/intgemm_interface.h

Line 386 in 53c4f7e

    
           int8PrepareBias((const int8_t *)b->data(), scale_a, 0.0 /*zero_point_a*/, scale_b, 0.0 /*zero_point_b*/, rows(b), cols(b), nullptr/*input_bias*/, val_->data());

This then translates/binds/magics into

marian-dev/src/tensors/cpu/wasm_intgemm_fallback.cpp

Lines 66 to 72 in 53c4f7e

    
             float unquant_factor = (-1) * ((127.0f / scale_A) * (127.0f / scale_B)) / (127.0f); 
        
             intgemm::Int8Shift::PrepareBias( 
        
                 input_B_prepared, 
        
                 width, 
        
                 cols_B, 
        
                 intgemm::callbacks::UnquantizeAndAddBiasAndWrite(unquant_factor, input_bias, output)); 
        
           }

That nullptr then ends up here as config.bias_addr:

template <> class CallbackImpl<CPUType::CPU_NAME, UnquantizeAndAddBiasAndWrite> {
  ...
    auto result = kernels::unquantize(input, mult_reg);
    result = kernels::add_bias(result, config.bias_addr, info.col_idx);
    kernels::write(result, config.output_addr, info.row_idx * info.cols + info.col_idx);
  ...
};

And I'm not sure which of these implementations of kernels::add_bias it ends up at, but none of these seem to be happy with nullptr.

As an experiment, I re-implemented the fallback function to handle the nulltr and call the callback without the bias term if the bias was null:

extern "C" void int8PrepareBias(const int8_t* input_B_prepared,
                                        float scale_A,
                                        float zero_point_A,
                                        float scale_B,
                                        float zero_point_B,
                                        Index width,
                                        Index cols_B,
                                        const float* input_bias,
                                        float* output) {
  float unquant_factor = (-1) * ((127.0f / scale_A) * (127.0f / scale_B)) / (127.0f);
  if (input_bias == nullptr) {
    intgemm::Int8Shift::PrepareBias(
        input_B_prepared,
        width,
        cols_B,
        intgemm::callbacks::UnquantizeAndWrite(unquant_factor, output));
  } else {
    intgemm::Int8Shift::PrepareBias(
        input_B_prepared,
        width,
        cols_B,
        intgemm::callbacks::UnquantizeAndAddBiasAndWrite(unquant_factor, input_bias, output));
  }
}

That fixes both the nullptr dereference error and the broken model output for my non-wasm wasm build.

I'm reporting this as a bug as I imagine there was some reasoning behind writing a nullptr there.

I'm also surprised that what seems to be buggy code compiled to a (mostly) functioning wasm build that works for the tiny models. The base and tiny11 models don't seem to differ all that much. Same layers it seems, they're just larger in base?

Lastly, I'm not sure how to fix this. The quick hack above won't work since that's the fallback code path. Something similar would need to be added to the intgemm code in Mozilla's tree.

Reproduce

Tested by building app/bergamot.cpp from bergamot-translator, after patching cmake files to not pass emcc specific flags when COMPILE_WASM is defined. Also changed wasm_intgemm_interface.h to remove the compiler attributes, and wasm_intgemm_fallback.cpp to remove all the Fallback bits from the names so that those functions get called directly.

I also needed to patch the config.intgemm8bitalpha.yml to include:

max-length-break: 128
mini-batch-words: 1024

After that, this works (or without the patch to fallback.cpp, gives a useful crash):

> lldb -- app/bergamot --model-config-paths ~/.config/translateLocally/ende.student.base-1647129297/config.intgemm8bitalpha.yml
(lldb) target create "app/bergamot"
Current executable set to '/Users/jelmer/Workspace/statmt/firefox-translations/bergamot-translator/build/app/bergamot' (x86_64).
(lldb) settings set -- target.run-args  "--model-config-paths" "/Users/jelmer/.config/translateLocally/ende.student.base-1647129297/config.intgemm8bitalpha.yml"
(lldb) run
Process 3022 launched: '/Users/jelmer/Workspace/statmt/firefox-translations/bergamot-translator/build/app/bergamot' (x86_64)
Hello world!
Hallo Welt!
Process 3022 exited with status = 0 (0x00000000)

Jenkins unit tests (operator) are broken

Minion auto-closed #36.

[2022-03-24 12:38:46] Error: Child 1 has different type (first: float32 != child: float16)
[2022-03-24 12:38:46] Error: Aborted from static marian::Type marian::NaryNodeOp::commonType(const std::vector<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase> > > >&) in /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/src/graph/node.h:199

[CALL STACK]
[0x564a2b34ed0d]                                                       + 0x303d0d
[0x564a2b385305]                                                       + 0x33a305
[0x564a2b3858b4]                                                       + 0x33a8b4
[0x564a2b2fc0a1]                                                       + 0x2b10a1
[0x564a2b245820]                                                       + 0x1fa820
[0x564a2b111773]                                                       + 0xc6773
[0x564a2b1236b5]                                                       + 0xd86b5
[0x564a2b138a50]                                                       + 0xeda50
[0x564a2b139484]                                                       + 0xee484
[0x564a2b13a1ce]                                                       + 0xef1ce
[0x564a2b0dc3eb]                                                       + 0x913eb
[0x7fb357b8e0b3]    __libc_start_main                                  + 0xf3
[0x564a2b106cce]                                                       + 0xbbcce


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
run_operator_tests is a Catch v2.10.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Expression graph supports basic math operations (gpu fp16)
  cross entropy with label smoothing vs logsoftmax with gather
-------------------------------------------------------------------------------
/var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/src/tests/units/operator_tests.cpp:935
...............................................................................

/var/lib/jenkins/workspace/browsermt-marian-dev-cuda-10.2/src/tests/units/operator_tests.cpp:935: FAILED:
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:   2 |   1 passed | 1 failed
assertions: 354 | 353 passed | 1 failed

Disabling temporarily to get to working regression-tests. Leaving this issue open here to attend to later.

Jenkins browsermt-marian-dev-cuda-9.2 #13 failed

Build 'browsermt-marian-dev-cuda-9.2' is failing!

Last 50 lines of build output:

Started by upstream project "marian-dev-cuda-10.0" build number 188
originally caused by:
 Started by upstream project "marian-dev-cuda-10.1" build number 179
 originally caused by:
  Started by GitHub push by qianqianzhu
  Started by GitHub push by qianqianzhu
Running as SYSTEM
Building on master in workspace /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-9.2
The recommended git tool is: NONE
No credentials specified
 > git rev-parse --resolve-git-dir /var/lib/jenkins/workspace/browsermt-marian-dev-cuda-9.2/.git # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/browsermt/marian-dev # timeout=10
Fetching upstream changes from https://github.com/browsermt/marian-dev
 > git --version # timeout=10
 > git --version # 'git version 2.7.4'
 > git fetch --tags --progress https://github.com/browsermt/marian-dev +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
Checking out Revision 548a1ed2cda7d80b08852f72a65e2b02aa69cbc6 (refs/remotes/origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 548a1ed2cda7d80b08852f72a65e2b02aa69cbc6 # timeout=10
Commit message: "Merge pull request #34 from browsermt/cmake-upgrade"
 > git rev-list --no-walk cd41f7e1c3230679556f1637f057b0eeadf6ff8b # timeout=10
[browsermt-marian-dev-cuda-9.2] $ /bin/sh -xe /tmp/jenkins5221462111645576734.sh
+ . /etc/environment
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
+ mkdir -p build
+ rm -rf build/CMakeCache.txt build/CMakeFiles build/CPackConfig.cmake build/CPackSourceConfig.cmake build/CTestTestfile.cmake build/Doxyfile build/Makefile build/Testing build/cmake_install.cmake build/iris_example build/libmarian.a build/local build/marian build/marian-conv build/marian-decoder build/marian-scorer build/marian-vocab build/mnist_example build/spm_decode build/spm_encode build/spm_export_vocab build/spm_normalize build/spm_train build/src build/test_cli build/test_dropout build/test_logger build/test_pooling build/test_prod build/test_sentencepiece_norm build/test_sqlite
+ cd build
+ cmake --version
cmake version 3.5.1

CMake suite maintained and supported by Kitware (kitware.com/cmake).
+ cat /var/lib/jenkins/cuda-9.2/version.txt
CUDA Version 9.2.148
+ cmake -DCOMPILE_TESTS=ON -DCOMPILE_EXAMPLES=ON -DUSE_CUDNN=ON -DCUDA_TOOLKIT_ROOT_DIR=/var/lib/jenkins/cuda-9.2 ..
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  CMake 3.12.4 or higher is required.  You are running version 3.5.1


-- Configuring incomplete, errors occurred!
Build step 'Execute shell' marked build as failure
Skipped archiving because build is not successful

Changes since last successful build:

[qianqian.zhu] af67906 - some code refactoring and add binary shortlist
[qianqian.zhu] 0a9f49f - update marian decoder with option of loading binary shortlist
[qianqian.zhu] 3a2ad4a - update hash.h for binary shortlist
[qianqian.zhu] 5aa07b5 - add binary shortlist converter
[qianqian.zhu] 20e42c9 - small fixes on option message
[github] 40423d6 - Update GitHub workflows with Ubuntu+CUDA
[qianqian.zhu] e9226e3 - address review comments
[qianqian.zhu] d1e9149 - fix blob naming
[qianqian.zhu] 62f1537 - fix rebase mistakes in translator.h
[qianqian.zhu] 8762daa - fix CI broken and update CHANGELOG
[qianqian.zhu] 7121dd4 - add binary shortlist contructor directly from buffer
[github] 9337105 - Port marian to WASM for inference (#24)
[qianqian.zhu] 55babfc - add logger info and address jerin's comment
[66322306+abhi-agg] bbde958 - Enable loading binary files, against master
[66322306+abhi-agg] 984ca20 - Added USE_WASM_COMPATIBLE_MARIAN cmake option
[66322306+abhi-agg] 42b297d - Added workflow for compiling wasm compatible sources natively
[66322306+abhi-agg] c24d0fd - Added wasm compile workflow for ubuntu
[github] cdf02dc - Update intgemm
[66322306+abhi-agg] 130902c - Replaced compile definition WASM with WASM_COMPATIBLE_SOURCE
[66322306+abhi-agg] 0f0bcf9 - Replaced cmake option USE_WASM_COMPATIBLE_MARIAN with USE_WASM_COMPATIBLE_SOURCE
[66322306+abhi-agg] b77d728 - Fixing Mac and Ubuntu CI failures
[aaggarwal] a37b0c8 - Upgraded cmake_minimum_required to 3.12.4

View full output

Improve lexical shortlist loading time

Bug description

It takes a long time to load the lexicon shortlist in memory when operating in an WASM environment. Just as a comparison, we have benchmarks where it takes 14X longer to start a marian object when we load the shortlist along the model, instead of the the model solo.

How to reproduce

Follow the steps in this page: in order to build a marian-dev wasm module: https://github.com/browsermt/bergamot-translator/tree/wasm-integration#bergamot-translator
Follow the steps in this page to run your test page: https://github.com/browsermt/bergamot-translator/tree/wasm-integration/wasm
Run the test page and note the time to load the models in the log area
Comment the shortlist loading here https://github.com/browsermt/bergamot-translator/blob/wasm-integration/wasm/test_page/bergamot.html#L92-L95 and note how faster it runs.

Context

Please refer to the discussion here browsermt/bergamot-translator#28 (comment) for more context.

Get automatic build actions working for this fork

Compressed model crashes mysteriously

Bug description

Trying to translate with a compressed model crashes with:
[2021-12-15 17:17:17] Loading scorer of type transformer as feature F0
[2021-12-15 17:17:17] [memory] Reserving 31 MB, device cpu0
[2021-12-15 17:17:17] [memory] Reserving 8 MB, device cpu0
[2021-12-15 17:17:23] Error: Segmentation fault
[2021-12-15 17:17:23] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /home/pkoehn/statmt/project/marian-browsermt2/src/common/logging.cpp:134

How to reproduce

Model is in https://www.cs.jhu.edu/~phi/system-cpu-v1.tgz

Context

Latest checkout of
git clone https://github.com/browsermt/marian-dev.git marian-browsermt2
compiled with
~/statmt/project/cmake-3.12.2/bin/cmake -DUSE_SENTENCEPIECE=ON ..
make -j

Enable 8-bit integer computations in Attention layer of Marian framework

Feature description

Currently, computations in Attention layer are 32-bit (floating point) while rest of the layers can do integer computations (8-bit and 16-bit). It would be great if the computations in Attention layer can also happen in 8-bit.

We already have intgemm to do 8-bit integer gemm operations in other layers and the same can be used for Attention layer as well.

Some advantages of doing it:

Faster inference
Removal of an sgemm library dependency for consumers who only want to do 8-bit int gemm

cc @andrenatal @kpu @XapaJIaMnu

Avoid calling C functions wrapping intgemm if intrinsic isn't native

I'm letting browsermt/bergamot-translator#265 in because it's a net improvement. But for the use case where Firefox isn't providing native multiplication, it will be slower due to bouncing WASM to JS to WASM every call. Instead the C++ should just not call the C functions and use the C++ interface to intgemm directly.

This requires sensing the presence of the native support at runtime. Options are JS tells us (in configuration) or C++ senses it by e.g. having incorrect implementations of the C functions internally / setting a global variable from the implementation that tells it to use the internal implementation (C++ on top of webassembly).

Wrong use of parameters

marian-dev/src/tensors/cpu/wasm_intgemm_fallback.cpp

Line 92 in 200e81c

float unquant_factor = unquant_multiplier / (scale_A * scale_B);

This looks wrong, Scale_A and Scale_B are already used to compute the unquantisation mutliplier? @kpu @abhi-agg am I missing something? Spotted by @jerinphilip

Support decoding with ranges to avoid a copy

For browsermt/bergamot-translator#202, it was suggested that a view for a vector/range backed by a binary representaion be substituted to eliminate conversions between vector and binary representation.

However, decodeWithByteRanges assumes a vector to be provided:

marian-dev/src/data/sentencepiece_vocab.cpp

Lines 259 to 281 in 62bac85

    
           void decodeWithByteRanges(const Words& sentence, 
        
                                     std::string &decoded, 
        
                                     std::vector<string_view> &byteRanges, 
        
                                     bool ignoreEOS) const override { 
        
             sentencepiece::SentencePieceText spt; 
        
             std::vector<int> spmSentence; 
        
             spmSentence.reserve(sentence.size()); 
        
             for(auto&& word : sentence) 
        
               spmSentence.push_back(word.toWordIndex()); 
        
             spm_->Decode(spmSentence, &spt); 
        
             decoded = spt.text(); // Creates copy of string. 
        
             string_view decoded_view(decoded); 
        
             for(auto piece : spt.pieces()) { 
        
               string_view byteRange = decoded_view.substr(piece.begin(), piece.end() - piece.begin()); 
        
               byteRanges.push_back(byteRange); 
        
             } 
        
             if(ignoreEOS){ 
        
                 byteRanges.pop_back(); 
        
             } 
        
           }

We need to provide something that can work with ranges/iterators instead to avoid the additional copy. Since this is a function only used by bergamot we need not worry about breaking any backwards compatibility. Consistency with the remaining may still be a concern, in which case we can provide an overload which internally calls the range-based method.

Get rid of TargetArch.cmake

TargetArch.cmake was added so we can easily detect ARM. Unfortunately it breaks the apple universal binary compilation because it generates invalid gcc commands.

-Werror=use-after-free in IntrusivePtr

yay builds of translateLocally fail with the following error, coming from marian-dev.

In file included from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/definitions.h:5,
                 from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/cli_wrapper.h:6,
                 from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/config_parser.h:4,
                 from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/config.h:5,
                 from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/graph/expression_graph.h:3,
                 from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/graph/expression_graph.cpp:1:
In function ‘void marian::intrusivePtrRelease(TensorBase*)’,
    inlined from ‘IntrusivePtr<T>::~IntrusivePtr() [with T = marian::TensorBase]’ at /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/intrusive_ptr.h:66:26,
    inlined from ‘void marian::ExpressionGraph::backward(bool, float)’ at /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/graph/expression_graph.cpp:207:14:
/home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/intrusive_ptr.h:24:23: error: pointer used after ‘void operator delete(void*)’ [-Werror=use-after-free]
   24 |     if(x != 0 && --x->references_ == 0) {    \
      |                    ~~~^~~~~~~~~~~
/home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/tensors/tensor.h:30:3: note: in expansion of macro ‘ENABLE_INTRUSIVE_PTR’
   30 |   ENABLE_INTRUSIVE_PTR(TensorBase)
      |   ^~~~~~~~~~~~~~~~~~~~
In file included from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/tensors/tensor_allocator.h:8,
                 from /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/graph/expression_graph.h:7:
In destructor ‘virtual marian::TensorBase::~TensorBase()’,
    inlined from ‘void marian::intrusivePtrRelease(TensorBase*)’ at /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/tensors/tensor.h:30:3,
    inlined from ‘IntrusivePtr<T>::~IntrusivePtr() [with T = marian::TensorBase]’ at /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/common/intrusive_ptr.h:66:26,
    inlined from ‘void marian::Element(Functor, Tensor, Tensors ...) [with Functor = functional::Assign<functional::Var<1>, functional::BinaryFunctor<functional::elem::Clip, functional::Assignee<1>, functional::Capture> >; Tensors = {}]’ at /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/tensors/tensor_operators.h:50:17,
    inlined from ‘void marian::ExpressionGraph::backward(bool, float)’ at /home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/graph/expression_graph.cpp:207:14:
/home/jerin/.cache/yay/translatelocally-git/src/translateLocally/3rd_party/bergamot-translator/3rd_party/marian-dev/src/tensors/tensor.h:67:26: note: call to ‘void operator delete(void*)’ here
   67 |   virtual ~TensorBase() {}
      |                          ^
cc1plus: all warnings being treated as errors
make[2]: *** [3rd_party/bergamot-translator/3rd_party/marian-dev/src/CMakeFiles/marian.dir/build.make:664: 3rd_party/bergamot-translator/3rd_party/marian-dev/src/CMakeFiles/marian.dir/graph/expression_graph.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:557: 3rd_party/bergamot-translator/3rd_party/marian-dev/src/CMakeFiles/marian.dir/all] Error 2
make: *** [Makefile:156: all] Error 2
==> ERROR: A failure occurred in build().
    Aborting...
 -> error making: translatelocally-git

I am assuming the following compiler is used on my ArchLinux:

$ g++ --version
g++ (GCC) 12.1.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Generate Wasm builds of marian without SharedArrayBuffer

Feature description

Described here: mozilla/bergamot-translator#63

Windows build fails on intgemm

For details see here: https://github.com/browsermt/marian-dev/runs/961679913.

browsermt/marian-dev regression-test-failures

Status

tests/scorer/scores/test_scores_cpu.sh
tests/decoder/intgemm/test_intgemm_16bit.sh
tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
tests/decoder/intgemm/test_intgemm_8bit.sh
tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh
tests/models/wnmt18/test_student_small_aan_intgemm16.sh

Logs

Logs
http://vali.inf.ed.ac.uk/jenkins/job/browsermt-marian-regression-tests/7/console

Failed:
  - tests/scorer/scores/test_scores_cpu.sh
  - tests/decoder/intgemm/test_intgemm_16bit.sh
  - tests/decoder/intgemm/test_intgemm_16bit_sse2.sh
  - tests/decoder/intgemm/test_intgemm_8bit.sh
  - tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh
  - tests/models/wnmt18/test_student_small_aan_intgemm16.sh
Logs:
  - /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/scorer/scores/test_scores_cpu.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_16bit.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_16bit_sse2.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_8bit.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/decoder/intgemm/test_intgemm_8bit_ssse3.sh.log
  - /var/lib/jenkins/workspace/browsermt-marian-regression-tests/tests/models/wnmt18/test_student_small_aan_intgemm16.sh.log

Issue updated as I figure what exactly is failing.

Available Machines, vector instructions

ansible -m shell -a "grep -o -e 'avx[^ ]*' -e 'sse[^ ]*' -e ssse3 /proc/cpuinfo | sort | uniq | tr '\n' ' '" gpu --limit '!fulla'
dagr | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
elli | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
baldur | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
bil | CHANGED | rc=0 >>
avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3 
buri | CHANGED | rc=0 >>
sse sse2 sse4_1 sse4_2 ssse3 
hodor | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
frigg | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
hretha | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
gna | CHANGED | rc=0 >>
avx sse sse2 sse4_1 sse4_2 ssse3 
lofn | CHANGED | rc=0 >>
avx sse sse2 sse4_1 sse4_2 ssse3 
mani | CHANGED | rc=0 >>
avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3 
mimir | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
meili | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
rindr | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
sigyn | CHANGED | rc=0 >>
avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3 
startiger | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3 
vor | CHANGED | rc=0 >>
avx avx2 avx512cd avx512f sse sse2 sse4_1 sse4_2 ssse3 
snotra | CHANGED | rc=0 >>
avx sse sse2 sse4_1 sse4_2 ssse3 
thrud | CHANGED | rc=0 >>
avx sse sse2 sse4_1 sse4_2 ssse3 
zisa | CHANGED | rc=0 >>
avx avx2 sse sse2 sse4_1 sse4_2 ssse3

Native implementation of the wasm matrix multiply interface for Intel architecture

Feature description

As per browsermt/bergamot-translator#205, we would like to have an optimized implementation of the generic matrix multiply interface for Intel architecture

cc @andrenatal @kpu @XapaJIaMnu

	if (marian::utils::endsWith(file, "\|")) {
	#if defined(__unix__) && !defined(WASM_COMPATIBLE_SOURCE)
	auto command = file.substr(0, file.size() - 1);
	// open as a pipe
	pipe_ = popen(command.c_str(), "r");
	ABORT_IF(!pipe_, "Command failed to execute ({}): {}", errno, command);
	// there is no official way to construct a filebuf from a FILE* or fd, so we use /proc/{pid}/fd/{fd}
	// For now, this only works on Linux. There are similar workarounds for Windows.
	file_ = "/proc/" + std::to_string(getpid()) + "/fd/" + std::to_string(fileno(pipe_));
	#else
	ABORT("Pipe syntax not supported in this build of Marian: {}", file);
	#endif
	} else {

	if(!node->val()) {
	if(node->memoize())
	cache_->allocate(node->val(), node->shape(), node->value_type());
	else
	tensors_->allocate(node->val(), node->shape(), node->value_type());
	}
	}

	bool skipCost = options->get<bool>("skip-cost");
	auto encdec = models::createModelFromOptions(
	options, skipCost ? models::usage::raw : models::usage::translation);

	// add (log)softmax if requested
	if (use == usage::translation) {
	if(std::dynamic_pointer_cast<EncoderDecoder>(baseModel)) {
	if(options->get<bool>("output-sampling", false))
	return New<Stepwise>(std::dynamic_pointer_cast<EncoderDecoder>(baseModel), New<GumbelSoftmaxStep>());
	else
	return New<Stepwise>(std::dynamic_pointer_cast<EncoderDecoder>(baseModel), New<LogSoftmaxStep>());

	// class to wrap an IEncoderDecoder and a ILogProbStep that are executed in sequence,
	// wrapped again in the IEncoderDecoder interface
	// @TODO: seems we are conflating an interface defition with its implementation?
	// @TODO: needs a better name. Stepwise is an adjective. Classes are things=nouns. StepwiseWhat?
	class Stepwise : public IEncoderDecoder {

	virtual Ptr<DecoderState> step(Ptr<ExpressionGraph> graph,
	Ptr<DecoderState> state,
	const std::vector<IndexType>& hypIndices, // [beamIndex * activeBatchSize + batchIndex]
	const Words& words, // [beamIndex * activeBatchSize + batchIndex]
	const std::vector<IndexType>& batchIndices, // [batchIndex]
	int beamSize) override {
	auto nextState = encdec_->step(graph, state, hypIndices, words, batchIndices, beamSize);
	return cost_->apply(nextState);
	}

	virtual Ptr<ScorerState> step(Ptr<ExpressionGraph> graph,
	Ptr<ScorerState> state,
	const std::vector<IndexType>& hypIndices,
	const Words& words,
	const std::vector<IndexType>& batchIndices,
	int beamSize) override {
	graph->switchParams(getName());
	auto wrapperState = std::dynamic_pointer_cast<ScorerWrapperState>(state);
	auto newState = encdec_->step(graph, wrapperState->getState(), hypIndices, words, batchIndices, beamSize);
	return New<ScorerWrapperState>(newState);

	cli.add<bool>("--int8",
	"Optimize speed even more aggressively sacrificing memory or precision by using 8bit integer GEMM with intgemm instead of floats. Only available on CPU. Corresponds to --gemm-precision int8");
	cli.add<bool>("--int8Alpha",
	"Use a precomputed quantisation multipliers for the activations. Requires a special model. Corresponds to --gemm-precision int8Alpha");
	cli.add<bool>("--int8shift",
	"Use a faster, shifted integer 8bit GEMM implementation. Corresponds to --gemm-precision int8shift");
	cli.add<bool>("--int8shiftAlpha",
	"Use a faster, shifted integer 8bit GEMM implementation, with precomputed alphas. Corresponds to --gemm-precision int8shiftAlpha");
	cli.add<bool>("--int8shiftAll",
	"Use a faster, shifted integer 8bit GEMM implementation even for matrices that don't have a bias. Beneficial on VNNI. Corresponds to --gemm-precision int8shiftAll");
	cli.add<bool>("--int8shiftAlphaAll",
	"Use a faster, shifted integer 8bit GEMM implementation even for matrices that don't have a bias, with precomputed alphas. Should be the fastest option. Corresponds to --gemm-precision int8shiftAlphaAll");
	cli.add<std::string>("--gemm-precision",
	"Use lower precision for the GEMM operations only. Supported values: float32, int16, int8, int8Alpha, int8shift, int8shiftAlpha, int8shiftAll, int8shiftAlphaAll", "float32");
	cli.add<bool>("--dump-quantmult",

	float unquant_factor = (-1) * ((127.0f / scale_A) * (127.0f / scale_B)) / (127.0f);
	intgemm::Int8Shift::PrepareBias(
	input_B_prepared,
	width,
	cols_B,
	intgemm::callbacks::UnquantizeAndAddBiasAndWrite(unquant_factor, input_bias, output));
	}

	void decodeWithByteRanges(const Words& sentence,
	std::string &decoded,
	std::vector<string_view> &byteRanges,
	bool ignoreEOS) const override {
	sentencepiece::SentencePieceText spt;

	std::vector<int> spmSentence;
	spmSentence.reserve(sentence.size());
	for(auto&& word : sentence)
	spmSentence.push_back(word.toWordIndex());
	spm_->Decode(spmSentence, &spt);

	decoded = spt.text(); // Creates copy of string.
	string_view decoded_view(decoded);
	for(auto piece : spt.pieces()) {
	string_view byteRange = decoded_view.substr(piece.begin(), piece.end() - piece.begin());
	byteRanges.push_back(byteRange);
	}

	if(ignoreEOS){
	byteRanges.pop_back();
	}
	}

browsermt / marian-dev Goto Github PK

marian-dev's People

Contributors

Stargazers

Watchers

Forkers

marian-dev's Issues

Feature description

Bug description

How to reproduce

Context

Bug description

Reproduce

Bug description

How to reproduce

Context

Bug description

How to reproduce

Context

Feature description

Feature description

Feature description

Recommend Projects

Recommend Topics

Recommend Org