
Comments (29)

kpu commented on July 25, 2024

A couple of options:

Marian can do model loading via mmap natively. You'll see it in https://github.com/marian-nmt/marian-dev/search?p=1&q=mmap . Of course, on native the mmap is backed by... files. But that call to mmap could be replaced with a byte pointer passed in from you.

Or the input readers are already abstracted and could change to byte-backed streams instead of file-backed streams.

from bergamot-translator.

andrenatal commented on July 25, 2024

Either is fine in my opinion, but I'd like to hear from @abhi-agg what's the best for our environment (of course we need to choose what offers the best performance).

Regardless of the option we choose, would it take long for @jerinphilip (or the appropriate person on your team) to implement it?

kpu commented on July 25, 2024

@XapaJIaMnu Comment on complexity?

kpu commented on July 25, 2024

Is the idea that you'd keep blobs alive for the life of the process? Would we have a mechanism to free them?

andrenatal commented on July 25, 2024

I think the ideal would be to free them after the graph is properly built and populated (memory consumption is another concern from the Firefox folks, so we need to prevent duplication). So yes, we can have a mechanism in place to free the blob, for example a callback from Marian once everything is finalized and it is clear to free it (or whatever approach is more appropriate).

XapaJIaMnu commented on July 25, 2024

@andrenatal, this will not be possible with the MMAP() codepath, as memory-mapped models are never "loaded" into the graph; they become the graph (with zero copy).

In terms of complexity, that again depends on what type of model we are using. If we want to pass a .npz model as a byte stream, we need to go into the library that deals with .npz loading, which I don't really want to do.

When dealing with binary files, this should be relatively simple, as we literally work with a byte buffer: https://github.com/marian-nmt/marian-dev/blob/master/src/common/binary.cpp#L95 . Replacing this function should be sufficient.
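For illustration, a byte-backed loader boils down to pointer arithmetic over a buffer instead of fread calls. The sketch below is a hypothetical, self-contained stand-in (readRecords and the length-prefixed layout are invented for the example, not Marian's actual binary format):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical sketch: read length-prefixed records straight out of an
// in-memory byte buffer, the way a byte-backed loadItems overload could
// replace the file-reading wrapper. Each record is a uint64_t length
// followed by that many payload bytes.
std::vector<std::string> readRecords(const char* data, size_t size) {
  std::vector<std::string> records;
  size_t pos = 0;
  while (pos + sizeof(uint64_t) <= size) {
    uint64_t len;
    std::memcpy(&len, data + pos, sizeof(len));  // record length prefix
    pos += sizeof(len);
    records.emplace_back(data + pos, len);       // copy out the payload
    pos += len;
  }
  return records;
}
```

The point is simply that once the bytes are in memory, "loading" is a matter of advancing a cursor; no filesystem (emulated or otherwise) is involved.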

However, binary file loading is still not working correctly in the WASM environment.

kpu commented on July 25, 2024

@XapaJIaMnu Isn't the mmap better? We just never free it while Marian is alive, right? (Or replace the munmap with whatever free Firefox uses.)

XapaJIaMnu commented on July 25, 2024

Yeah, true, as long as Firefox doesn't mind Marian doing the memory management. But let's name it something else...

kpu commented on July 25, 2024

I think getting the model as a blob is the way to go.

```cpp
class MemoryGift {
 public:
  const char *data() const;
  size_t length() const;
  void early_free();
};
```
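A minimal sketch of how such a gift class might be implemented, assuming the embedder hands over ownership of a heap buffer (the constructor and the unique_ptr ownership here are my assumptions, not an agreed interface):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical owning wrapper for a model blob handed to Marian.
// early_free() lets memory be reclaimed before the wrapper itself is
// destroyed; data() returns nullptr afterwards.
class MemoryGift {
 public:
  MemoryGift(std::unique_ptr<char[]> buf, size_t len)
      : buf_(std::move(buf)), len_(len) {}
  const char* data() const { return buf_.get(); }
  size_t length() const { return len_; }
  void early_free() { buf_.reset(); len_ = 0; }

 private:
  std::unique_ptr<char[]> buf_;
  size_t len_;
};
```

Whichever side constructs the buffer, the early_free hook is what addresses the duplication concern raised above: the blob can be dropped as soon as the graph no longer needs it.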

Regarding reading npz, it uses fopen-style APIs: https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/3rd_party/cnpy/cnpy.cpp#L161 . Doable but a bit ugly.

andrenatal commented on July 25, 2024

> However, binary file loading is still not working correctly in the WASM environment.

I'm set to start working on a fix to properly load the binary files in the WASM environment later this week, and we expect to be able to use just the quantized models in production (model size was also a no-go when discussing with our product teams).

So, with all that said, what is the best approach you'd suggest?

andrenatal commented on July 25, 2024

> Yeah, true, as long as Firefox doesn't mind Marian doing the memory management. But let's name it something else...

I think this might be fine...

XapaJIaMnu commented on July 25, 2024

(I'm pretty sure the binary files not working is due to a rogue size_t in this function: https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30)

I think the preferred approach would be to create an option in the constructor to call this https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30 directly, without the need to go through the file reader wrapper.

It is up to you to decide whether to leave the memory management to Marian or insist that it happen inside FF.

kpu commented on July 25, 2024

To be clear, Marian will still need to allocate and deallocate its own scratch memory. Way too many STL classes; that ship has sailed.

The model blob can come in in an mmap-like way and just live for the entire duration of the model anyway. Managing this on either side is easy; we just need to be clear about who does it. And it's probably better for Firefox to manage deallocation of models if it's already allocating them.

Are lexical shortlists also memory mapped? They probably should be if only for loading efficiency...

XapaJIaMnu commented on July 25, 2024

I'm pretty sure lexical shortlists are not mmap'd...

andrenatal commented on July 25, 2024

> (I'm pretty sure the binary files not working is due to a rogue size_t in this function: https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30)
>
> I think the preferred approach would be to create an option in the constructor to call this https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/common/binary.cpp#L30 directly, without the need to go through the file reader wrapper.

Thanks for pointing that out, I will take a look later today or tomorrow morning.

> It is up to you to decide whether to leave the memory management to Marian or insist that it happen inside FF.

Marian can manage it as part of its lifecycle, as long as the blob gets freed when we call delete(), like what we are doing here: https://github.com/browsermt/bergamot-translator/blob/wasm-integration/wasm/test_page/bergamot.html#L111.
With this approach, would we be required to destroy the object and instantiate it again to switch to a different model, or would calling a method to load the different model while keeping the object alive be enough? I'd prefer the latter if possible, but it's OK if not, although keeping the object alive would save time on WASM JIT compilation.

andrenatal commented on July 25, 2024

> Are lexical shortlists also memory mapped? They probably should be, if only for loading efficiency...

Since you touched on this point: we are currently seeing a long slowdown when loading the shortlists in WASM. The model loads in 0.5 seconds, but the shortlist takes 7-10 seconds, and we are still trying to find the bottleneck.

I'd appreciate any suggestions you could provide.

andrenatal commented on July 25, 2024

We also expect that loading the model via bytes would help with that (besides all the aforementioned situations), given we'd remove the Emscripten filesystem emulation from the equation.

XapaJIaMnu commented on July 25, 2024

The lexical shortlist is ~40 megabytes of plain text that needs to be parsed, after which the vocabulary items need to be replaced with vocabulary IDs via sentencepiece. I hadn't looked at the code that does the loading before, but looking at it now:

https://github.com/marian-nmt/marian-dev/blob/6f6d484665dfb35fe1d85b49b6b9d29c2661fd15/src/data/shortlist.h#L149

We have a std::vector<std::unordered_map<WordIndex, float>> data_ that gets expanded dynamically as the file is read. AFAIK memory allocations are very slow in WASM, and that's most likely your issue.

As for switching the model dynamically without destroying the object, we do not currently have this functionality. There are a couple of issues, and it looks like a lot of work:

  1. Different models require different vocabularies. The vocabulary is currently always loaded from a plain-text file (either through sentencepiece or from YAML).
  2. Different models require different shortlists, same as above.
  3. There is no support whatsoever for swapping the pointers to data inside the tensors on the graph. This needs to be implemented if dynamic model switching is to work.

kpu commented on July 25, 2024

Aww, and here I thought we were going to do something actually efficient and run inside SECCOMP instead of WASM.

The LexicalShortlistGenerator class is ripe for optimization. It doesn't use the probabilities after pruning, which happens immediately after load, so it shouldn't keep them after pruning either. It can be flattened to this:

```cpp
std::vector<size_t> word_to_offset_;
std::vector<WordIndex> short_lists_;
```

where [&short_lists_[word_to_offset_[word]], &short_lists_[word_to_offset_[word + 1]]) is a sorted array of word indices in the short list for word. Less formally, we concatenate all the shortlist arrays and find the beginning of a shortlist array with word_to_offset_.

These two fixed-length vectors would be easy to replace with something memory mapped.

Since all the shortlists are already sorted, we can take the union in generate using a priority queue to pop the list with the lowest index off each time.
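The flattened layout and the priority-queue union could look roughly like this (an illustrative sketch; FlatShortlist and this generate are stand-ins following the description above, not actual Marian code):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

using WordIndex = unsigned int;

// Flattened shortlist: short_lists_ concatenates all per-word sorted
// shortlists; [word_to_offset_[w], word_to_offset_[w + 1]) delimits
// the shortlist of word w.
struct FlatShortlist {
  std::vector<size_t> word_to_offset_;
  std::vector<WordIndex> short_lists_;

  // Union of the shortlists for all words in a sentence, via a k-way
  // merge: repeatedly pop the list head with the lowest index.
  std::vector<WordIndex> generate(const std::vector<WordIndex>& words) const {
    // (current value, current position, end of this word's list)
    using Cursor = std::tuple<WordIndex, size_t, size_t>;
    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> pq;
    for (WordIndex w : words) {
      size_t begin = word_to_offset_[w], end = word_to_offset_[w + 1];
      if (begin < end) pq.push({short_lists_[begin], begin, end});
    }
    std::vector<WordIndex> out;
    while (!pq.empty()) {
      auto [val, pos, end] = pq.top();
      pq.pop();
      if (out.empty() || out.back() != val) out.push_back(val);  // dedupe
      if (pos + 1 < end) pq.push({short_lists_[pos + 1], pos + 1, end});
    }
    return out;
  }
};
```

Both vectors are fixed-length after construction, so they map cleanly onto a memory-mapped (or byte-blob) representation with no per-entry allocation at load time.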

andrenatal commented on July 25, 2024

> Aww, and here I thought we were going to do something actually efficient and run inside SECCOMP instead of WASM.
>
> The LexicalShortlistGenerator class is ripe for optimization. It doesn't use the probabilities after pruning, which happens immediately after load, so it shouldn't keep them after pruning either. It can be flattened to this:
>
> std::vector<size_t> word_to_offset_;
> std::vector<WordIndex> short_lists_;
>
> where [&short_lists_[word_to_offset_[word]], &short_lists_[word_to_offset_[word + 1]]) is a sorted array of word indices in the short list for word. Less formally, we concatenate all the shortlist arrays and find the beginning of a shortlist array with word_to_offset_.
>
> These two fixed-length vectors would be easy to replace with something memory mapped.
>
> Since all the shortlists are already sorted, we can take the union in generate using a priority queue to pop the list with the lowest index off each time.

It's great to know there's room for optimization here. That, together with this issue, makes up the two major bottlenecks we currently have; with them sorted out, our path to integration would be clear.

I'll file another issue just for the shortlist optimization in marian-dev as well.

kpu commented on July 25, 2024

There are three tasks here:

  1. Model passed as bytes (complicated but doable)
  2. Vocab passed as bytes (plumbing)
  3. Lexical shortlist passed as bytes (may as well be done with the optimization)

Given the urgency you appear to be assigning to this, and given my team has a deliverable due 28 February, we're going to need help.

I'd suggest Mozilla take a look at 2: the vocab file. This one is relatively well-defined to somebody not familiar with the code and will also mean that you can put together what the end-to-end stack looks like (because I'm unclear on this). Here's a rough guide to how.

Your goal is to plumb bytes to call this method: https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/sentencepiece_processor.h#L204 in lieu of load.

I suggest the model bytes be passed explicitly as a second argument to the Service object constructor:

```cpp
Service::Service(Ptr<Options> options)
```

Then loadVocabularies at

```cpp
vocabs_(std::move(loadVocabularies(options))),
```

changes

```cpp
m.first->second->load(vfiles[i]);
```

to use a new loadFromSerialized method in Vocab: https://github.com/browsermt/marian-dev/blob/2f65280459737c37c270e4ad0b6d41de215d11e0/src/data/vocab.cpp#L69 which then implies a new method in IVocab, and an implementation added to https://github.com/browsermt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp that ultimately calls https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/sentencepiece_processor.h#L204
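The plumbing pattern, reduced to a self-contained mock (the IVocab/loadFromSerialized names follow the description above, but the bodies are invented for illustration; in the real code the bytes would ultimately reach sentencepiece's LoadFromSerializedProto):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative mock of the vocab plumbing: add a byte-based method
// alongside the existing file-based one, and have the concrete vocab
// forward the bytes to the underlying processor.
struct IVocab {
  virtual ~IVocab() = default;
  virtual bool load(const std::string& path) = 0;                // existing path-based API
  virtual bool loadFromSerialized(const std::string& bytes) = 0; // new byte-based API
};

class MockSentencePieceVocab : public IVocab {
 public:
  bool load(const std::string& path) override {
    // would open and read the file; omitted in this sketch
    return !path.empty();
  }
  bool loadFromSerialized(const std::string& bytes) override {
    model_ = bytes;  // stands in for forwarding to the processor
    return !model_.empty();
  }

 private:
  std::string model_;
};
```

The virtual-method-plus-override shape is the whole task: no format parsing changes, just routing bytes past the file layer.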

andrenatal commented on July 25, 2024

Ok, I can take a look at the vocab file along with @abhi-agg.

jerinphilip commented on July 25, 2024

Since @kpu is pointing to an updated source, I'll point out here that #30 might be of help to you while compiling the newer code with WASM. (It's not strictly my job to do this; the PR exists because of my curiosity to see how the WASM-related stuff builds and how the two branches are diverging right now.)

  1. The ideal case is that you update ssplit-cpp (in the WASM build CMake) to use PCRE2 instead of PCRECPP (there are some compiled-from-source modifications for WASM).
  2. You can probably manage with some #ifdefs here to adjust it to compile with PCRECPP, a duct-tape fix I wouldn't recommend. (A relevant commit: ca9aa64)

abhi-agg commented on July 25, 2024

> Or the input readers are already abstracted and could change to byte-backed streams instead of file-backed streams.

@kpu Could you please point me to the file that has the input reader abstraction code? I am curious.

On a side note, does mmap work for .bin model files natively? I am asking because it is commented out.

XapaJIaMnu commented on July 25, 2024

@abhi-agg ,

The function that does the graph matrix loading is here: https://github.com/browsermt/marian-dev/blob/467c43a292a68b7913af2a00d353de97c1740f92/src/common/binary.cpp#L91
It has an overload that already accepts a byte stream (the previous one goes through the file-reading wrapper): https://github.com/browsermt/marian-dev/blob/467c43a292a68b7913af2a00d353de97c1740f92/src/common/binary.cpp#L30

As for the MMAP() code, I realise it won't work for our use case.
MMAP() is not enabled by default because it's only used internally at M$, and it works only with binary models that do not do any transformations at load time. Unfortunately for us, we do not know what architecture the user will be running, so we dynamically re-order the matrices into a format suitable for the user's architecture (SSSE3/AVX2/AVX512). That means we would need to modify the byte buffer in order to create our matrices, and only then free it.
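As a toy illustration of why a read-only mapping is not enough here: the load-time reordering mutates the buffer in place. Below, an in-place transpose stands in for the real SSSE3/AVX2/AVX512 repacking (the function name and layout are invented for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for load-time matrix reordering: transpose a square
// row-major matrix in place. The takeaway is that the model buffer must
// be writable, so a read-only mmap of the file would not suffice.
void reorderInPlace(std::vector<float>& m, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (size_t j = i + 1; j < n; ++j)
      std::swap(m[i * n + j], m[j * n + i]);
}
```

With a writable byte blob handed over by the embedder (as discussed above), this mutation is fine; with a shared read-only mapping it would fault.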

kpu commented on July 25, 2024

> As for the MMAP() code, I realise it won't work for our use case.
> MMAP() is not enabled by default because it's only used internally at M$, and it works only with binary models that do not do any transformations at load time. Unfortunately for us, we do not know what architecture the user will be running, so we dynamically re-order the matrices into a format suitable for the user's architecture (SSSE3/AVX2/AVX512). That means we would need to modify the byte buffer in order to create our matrices, and only then free it.

WASM downgrades everybody to SSSE3 anyway 👎 💩 so it is actually a constant. Also, remember this isn't a proper read-only mmap against a file anyway; it's a buffer we could agree is modifiable. And maybe in the general case we could come up with an in-place reshape operation from a format that's already transposed the right way.

abhi-agg commented on July 25, 2024

Closing this one, as this capability was added in the latest main branch via #69 and #55 👍
