Comments (29)
A couple of options.
Marian can do model loading via mmap on native; you'll see it in https://github.com/marian-nmt/marian-dev/search?p=1&q=mmap . Of course, on native the mmap is backed by... files. But that call to mmap could be replaced with simply taking a byte pointer from you.
Or the input readers are already abstracted and could change to byte-backed streams instead of file-backed streams.
from bergamot-translator.
Either is fine in my opinion, but I'd like to hear from @abhi-agg what's the best for our environment (of course we need to choose what offers the best performance).
Regardless of the option we choose, would it take too long for @jerinphilip (or the appropriate person on your team) to implement that?
@XapaJIaMnu Comment on complexity?
Is the idea that you'd keep blobs alive for the life of the process? Would we have a mechanism to free them?
I think the ideal would be to free them after the graph has been properly built and populated (memory consumption is another concern from the fx folks, so we need to prevent duplication). So yes, we can have a mechanism in place to free the blob, e.g. a callback from marian once everything is finalized and it's clear to free it (or whatever approach is more appropriate).
@andrenatal, this will not be possible with the MMAP() codepath: models that are memory mapped are never "loaded" into the graph, they become the graph (with zero copy).
In terms of complexity, that again depends on what type of models we are using. If we want to pass a .npz model as a byte stream, we need to go into the library that deals with .npz loading, which I don't really want to do.
When dealing with binary files, this should be relatively simple as we literally work with a byte buffer: https://github.com/marian-nmt/marian-dev/blob/master/src/common/binary.cpp#L95 . Replacing this function should be sufficient.
However binary file loading is still not working correctly in the WASM environment.
@XapaJIaMnu Isn't the mmap better? We'd just never free it while marian is alive, right? (Or replace the munmap call with whatever Firefox uses to free.)
Yeah, true, as long as Firefox doesn't mind marian doing the memory management. But let's have it named something else...
I think getting the model as a blob is the way to go.
class MemoryGift {
const char *data() const;
size_t length() const;
void early_free();
};
Regarding reading npz, it uses fopen-style APIs: https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/3rd_party/cnpy/cnpy.cpp#L161 . Doable but a bit ugly.
However binary file loading is still not working correctly in the WASM environment.
I'm set to start working on a fix to properly load the binary files in the wasm environment later this week, and we expect to use only the quantized models in production (model size was also a no-go in discussions with our product teams).
So, with all that said, what is the best approach you'd suggest?
Yeah, true, as long as Firefox doesn't mind marian doing the memory management. But let's have it named something else...
I think this might be fine...
(I'm pretty sure the binary files not working is due to a rogue size_t in this function: https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30)
I think the preferred approach would be to create an option in the constructor to call this https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30 directly, without the need to go through the file reader wrapper.
It is up to you to decide whether to leave the memory management to marian or to insist that it happen inside FF.
To be clear marian will still need to allocate and deallocate its own scratch. Way too many STL classes; ship has sailed on that.
The model blob can come in as a mmap-like way and just live the entire duration of the model anyway. Managing this on either side is easy; we just need to be clear. And it's probably better for Firefox to manage deallocation of models if it's already allocating them.
Are lexical shortlists also memory mapped? They probably should be if only for loading efficiency...
I'm pretty sure lexical shortlists are not mmap'd...
(I'm pretty sure the binary files not working is due to a rogue size_t in this function: https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30) I think the preferred approach would be to create an option in the constructor to call this https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30 directly, without the need to go through the file reader wrapper.
Thanks for pointing that out, I will take a look later today or tomorrow morning.
It is up to you to decide whether to leave the memory management to marian or to insist that it happen inside FF.
marian can manage it as part of its lifecycle, as long as the blob gets freed when we call delete(), like we are doing here: https://github.com/browsermt/bergamot-translator/blob/wasm-integration/wasm/test_page/bergamot.html#L111.
In this approach, would we be required to destroy the object and instantiate it again to switch to a different model, or would just calling a method to load the different model while keeping the object alive be enough? I'd prefer the latter if possible, but it's OK if not, although keeping the object alive would save the time spent on wasm JIT compilation.
Are lexical shortlists also memory mapped? They probably should be if only for loading efficiency...
Since you touched on this point: we are currently seeing a long slowdown when loading the shortlists in wasm. The model loads in 0.5 seconds, but the shortlist takes 7-10 seconds, and we are trying to find the bottleneck.
I'd appreciate any suggestions you could provide.
We also expect that loading the model via bytes would help with that (besides all the aforementioned situations), given we'd remove the emscripten filesystem emulation from the equation.
The lexical shortlist is ~40 megabytes of plain text that needs to be parsed and then the vocabulary items need to be replaced with vocabulary IDs with sentencepiece. I haven't looked at the code that does the loading before, but when I do now:
We have a std::vector<std::unordered_map<WordIndex, float>> data_ that gets expanded dynamically as the file is read. AFAIK memory allocations are very slow on WASM, and that's most likely your issue.
As for switching the model dynamically without destroying the object, we do not currently have this functionality. There are a couple of issues, and it looks like a lot of work:
- Different models require different vocabularies. Vocabulary is currently always loaded from a plain-text file (either through sentencepiece or from yaml).
- Different models require different shortlists, same as above.
- There is no support whatsoever for swapping the pointers to the data inside the tensors on the graph. It would need to be implemented for dynamic model switching to work.
Aww, and here I thought we were going to do something actually efficient and run inside SECCOMP instead of wasm.
The LexicalShortlistGenerator class is ripe for optimization. It doesn't use the probabilities after pruning, which happens immediately after load, so it shouldn't keep them after pruning either. It can be flattened to this:
std::vector<size_t> word_to_offset_;
std::vector<WordIndex> short_lists_;
where [&short_lists_[word_to_offset_[word]], &short_lists_[word_to_offset_[word + 1]]) is a sorted array of word indices in the shortlist for word. Less formally, we concatenate all the shortlist arrays and find the beginning of a given word's shortlist with word_to_offset_.
These two fixed-length vectors would be easy to replace with something memory mapped.
Since all the shortlists are already sorted, we can take the union in generate using a priority queue to pop the list with the lowest index off each time.
Aww, and here I thought we were going to do something actually efficient and run inside SECCOMP instead of wasm.
The LexicalShortlistGenerator class is ripe for optimization. It doesn't use the probabilities after pruning, which happens immediately after load, so it shouldn't keep them after pruning either. It can be flattened to this: std::vector<size_t> word_to_offset_; std::vector<WordIndex> short_lists_; where [&short_lists_[word_to_offset_[word]], &short_lists_[word_to_offset_[word + 1]]) is a sorted array of word indices in the shortlist for word. Less formally, we concatenate all the shortlist arrays and find the beginning of a given word's shortlist with word_to_offset_. These two fixed-length vectors would be easy to replace with something memory mapped. Since all the shortlists are already sorted, we can take the union in generate using a priority queue to pop the list with the lowest index off each time.
It's great to know there's room for optimization here. That and this issue are the two major bottlenecks we currently have; with them sorted out, our path to integration would be clear.
I'll file another issue just for the shortlist optimization in marian-dev as well.
There are three tasks here:
- Model passed as bytes (complicated but doable)
- Vocab passed as bytes (plumbing)
- Lexical shortlist passed as bytes (may as well be done with optimization)
Given the urgency you appear to be assigning to this and given my team has a deliverable due 28 February, we're going to need help.
I'd suggest Mozilla take a look at 2, the vocab file. This one is relatively well defined even for somebody not familiar with the code, and it will also mean you can put together what the end-to-end stack looks like (because I'm unclear on this). Here's a rough guide.
Your goal is to plumb bytes through to call this method, https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/sentencepiece_processor.h#L204 , in lieu of load.
I suggest the model bytes be passed explicitly as a second argument to the Service object constructor. Then plumb them through loadVocabularies and into IVocab, and add an implementation to https://github.com/browsermt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp that ultimately calls https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/sentencepiece_processor.h#L204 .
OK, I can take a look at the vocab file along with @abhi-agg.
Since @kpu is pointing to an updated source, I'll note here that #30 might be of help to you while compiling the newer code with wasm. (It's not strictly my job to do this; the PR exists because of my curiosity to see how the wasm-related stuff builds and how the two branches are diverging right now.)
1. Ideally, you'll need to update ssplit-cpp (in the wasm build CMake) to use PCRE2 instead of PCRECPP (there are some compiled-from-source modifications for WASM).
2. You can probably manage with some #ifdefs here to adjust it to compile with PCRECPP, a duct-tape fix I wouldn't recommend. (A relevant commit: ca9aa64)
Or the input readers are already abstracted and could change to byte-backed streams instead of file-backed streams.
@kpu Could you please point me to the file that has the input reader abstraction code? I am curious.
On a side note, does mmap work for .bin model files natively? I am asking because it is commented out.
The function that does the graph matrix loading is here: https://github.com/browsermt/marian-dev/blob/467c43a292a68b7913af2a00d353de97c1740f92/src/common/binary.cpp#L91
It has an overload that already accepts a byte stream (the one above wraps the file reading): https://github.com/browsermt/marian-dev/blob/467c43a292a68b7913af2a00d353de97c1740f92/src/common/binary.cpp#L30
As for the MMAP() code, I realise it won't work for our use case.
MMAP() is not enabled by default because it's used only internally at M$, and it works only with binary models that do not do any transformations at load time. Unfortunately for us, we do not know what architecture the user will be running, so we dynamically re-order the matrices into a format suitable for the user's architecture (SSSE3/AVX2/AVX512). That means we would need to modify the byte buffer in order to create our matrices and only then free it.
As for the MMAP() code, I realise it won't work for our use case.
MMAP() is not enabled by default because it's used only internally at M$, and it works only with binary models that do not do any transformations at load time. Unfortunately for us, we do not know what architecture the user will be running, so we dynamically re-order the matrices into a format suitable for the user's architecture (SSSE3/AVX2/AVX512). That means we would need to modify the byte buffer in order to create our matrices and only then free it.
WASM downgrades everybody to SSSE3 anyway 👎 💩, so it is actually a constant. Also remember this isn't a proper read-only mmap against a file anyway; it's a buffer we could agree is modifiable. And maybe, in the general case, we could come up with an in-place reshape operation from a format that's already transposed the right way.
Closing this one, as this capability was added to the latest main branch via #69 and #55 👍