
Comments (5)

leitro avatar leitro commented on May 1, 2024 23

In my case, I do:

# Join the word pieces, then strip the '##' continuation markers
tokens = ['[UNK]', '[CLS]', '[SEP]', 'want', '##ed', 'wa', 'un', 'runn', '##ing', ',']
text = ' '.join(tokens)
fine_text = text.replace(' ##', '')
# fine_text == '[UNK] [CLS] [SEP] wanted wa un running ,'

from transformers.

pertschuk avatar pertschuk commented on May 1, 2024 22

@thomwolf could you point to the specific section of run_squad.py that handles this? I'm having trouble finding it.

EDIT: is it this bit from processors/squad.py?

tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    sub_tokens = tokenizer.tokenize(token)
    for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)
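To see what the two alignment tables contain, here is a minimal self-contained sketch; `toy_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` (the real code uses the model's WordPiece tokenizer):

```python
# Toy wordpiece-style splitter standing in for tokenizer.tokenize()
def toy_tokenize(word):
    vocab_splits = {'wanted': ['want', '##ed'], 'running': ['runn', '##ing']}
    return vocab_splits.get(word, [word])

doc_tokens = ['he', 'wanted', 'running', 'shoes']

tok_to_orig_index = []   # sub-token position -> index of the original word
orig_to_tok_index = []   # original word index -> position of its first sub-token
all_doc_tokens = []
for i, token in enumerate(doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    for sub_token in toy_tokenize(token):
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)

print(all_doc_tokens)     # ['he', 'want', '##ed', 'runn', '##ing', 'shoes']
print(tok_to_orig_index)  # [0, 1, 1, 2, 2, 3]
print(orig_to_tok_index)  # [0, 1, 3, 5]
```

With these tables, any span of sub-tokens (e.g. a predicted answer span) can be mapped back to the original words it came from.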


thomwolf avatar thomwolf commented on May 1, 2024 19

Yes. I don't plan to include a reverse conversion of tokens in the tokenizer.
For an example of how to keep track of the original character positions, please read the run_squad.py example.


artemisart avatar artemisart commented on May 1, 2024

You can remove ' ##', but you cannot know whether there was a space around punctuation tokens, nor recover the casing of uppercase words.
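To illustrate the loss: with the join-and-strip approach from the first comment (sketched here as a hypothetical `detok` helper), both "hello, world" and "hello , world" produce the same token sequence under a basic punctuation-splitting tokenizer, so the original spacing cannot be recovered:

```python
def detok(tokens):
    # Join-and-strip detokenization, as in the first comment
    return ' '.join(tokens).replace(' ##', '')

# '##' pieces are re-attached correctly...
print(detok(['run', '##ning']))          # running
# ...but both "hello, world" and "hello , world" tokenize to the
# same pieces, so detokenization cannot tell them apart:
print(detok(['hello', ',', 'world']))    # hello , world
```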


igrinis avatar igrinis commented on May 1, 2024

An apostrophe is treated as a punctuation mark, but it is often an integral part of a word. The regular .tokenize() always turns an apostrophe into a standalone token, so the information about which word it belongs to is lost. If the original sentence contains apostrophes, it is impossible to recreate it from its tokens (for example, when an apostrophe is the last symbol of a word, convert_tokens_to_string() will join it with the following word). To overcome this, one can check the surroundings of each apostrophe and add ## immediately after tokenization. For example:

sent = "The Smiths' used their son's car" 
tokens = tokenizer.tokenize(sent)

now if you fix tokens to look like:

original =>['the', 'smith', '##s', "'", 'used', 'their', 'son', "'", 's', 'car']
fixed => ['the', 'smith', '##s', "##'", 'used', 'their', 'son', "##'", '##s', 'car']

you will be able to restore the original words.


