
Comments (4)

loretoparisi commented on May 26, 2024

Thanks for the reference. Have a look at hf-tokenizers-experiments: there I have put together the whole tokenizer pipeline for a SentencePiece BPE tokenizer.
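
For reference, a minimal sketch of what such a pipeline can look like with the SentencePieceBPETokenizer helper exported by the tokenizers npm package (the vocab/merges paths are placeholders, and the actual pipeline in hf-tokenizers-experiments may be configured differently):

```
// Minimal sketch, assuming the `tokenizers` npm package and placeholder
// vocab/merges files; not the exact pipeline from hf-tokenizers-experiments.
const { SentencePieceBPETokenizer } = require("tokenizers");

(async () => {
    // Load a SentencePiece BPE tokenizer from its vocabulary and merges files.
    const tokenizer = await SentencePieceBPETokenizer.fromOptions({
        vocabFile: "./vocab.json",   // placeholder path
        mergesFile: "./merges.txt"   // placeholder path
    });

    // Encode a sentence and inspect the resulting encoding object.
    const encoded = await tokenizer.encode("Hello World!");
    console.log(encoded);
})();
```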


DoctorSlimm commented on May 26, 2024

🤩🤩🤩


DoctorSlimm commented on May 26, 2024

@loretoparisi do you know why the node tokenizer returns zero padding tokens after the input ids are finished?

```
{
  input_ids: [
    101, 7592, 2088, 999, 102, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,   0, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,   0, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,   0, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,   0, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,   0, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,   0, 0, 0, 0, 0, 0, 0, 0,
      0,    0,    0,   0,
```

```
let { promisify } = require("util");
let { Tokenizer } = require("tokenizers/bindings/tokenizer");

(async () => {
    // Load the serialized tokenizer shipped with the model.
    const tokenizer = Tokenizer.fromFile('./MiniLM-L6-v2/tokenizer.json');
    console.log(tokenizer);

    // The raw bindings use callbacks, so promisify encode/decode.
    const encode = promisify(tokenizer.encode.bind(tokenizer));
    const decode = promisify(tokenizer.decode.bind(tokenizer));

    const encoded = await encode("Hello World!");

    const modelInputs = {
        input_ids: encoded.getIds(),
        attention_mask: encoded.getAttentionMask(),
        token_type_ids: encoded.getTypeIds()
    };

    console.log(modelInputs);
})();
```
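
One way to confirm that those trailing zeros are padding rather than real tokens is to compare them against the attention mask, which is 1 for real tokens and 0 for padding. A minimal sketch, reusing the `encoded` object from the snippet above:

```
// Minimal sketch, reusing `encoded` from the snippet above: the attention
// mask marks real tokens with 1 and padding with 0, so it can be used to
// recover the unpadded ids.
const mask = encoded.getAttentionMask();
const realLength = mask.filter((m) => m === 1).length;
const unpaddedIds = encoded.getIds().slice(0, realLength);
console.log(unpaddedIds); // [ 101, 7592, 2088, 999, 102 ]
```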


loretoparisi commented on May 26, 2024

> @loretoparisi do you know why the node tokenizer returns zero padding tokens after the input ids are finished?

Have a look at my examples here: you will find that there is an option to pad the input to the max length:

```
var lpTokenizer = await LPSentencePieceBPETokenizer.fromOptions({
    padMaxLength: false,  // set to true to pad every encoding to the max sequence length
    vocabFile: "../vocab/minilm/minilm-vocab.json",
    mergesFile: "../vocab/minilm/minilm-merges.txt"
});
```

Padding to the max length is necessary to feed the model inputs of the correct, fixed size (typically the model's maximum sequence length).
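
In other words, with pad-to-max-length enabled, input_ids, attention_mask and token_type_ids all come back with the same fixed length. A minimal sketch, reusing `modelInputs` from the earlier snippet (512 is only a placeholder for the model's actual max sequence length):

```
// Minimal sketch, reusing `modelInputs` from the earlier snippet.
// MAX_SEQ_LEN is a placeholder; the real value depends on the model.
const MAX_SEQ_LEN = 512;
console.assert(modelInputs.input_ids.length === MAX_SEQ_LEN);
console.assert(modelInputs.attention_mask.length === modelInputs.input_ids.length);
console.assert(modelInputs.token_type_ids.length === modelInputs.input_ids.length);
```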

