
Comments (5)

leitro avatar leitro commented on May 1, 2024 23

In my case, I do:

# Join the word pieces, then strip the '##' continuation markers
tokens = ['[UNK]', '[CLS]', '[SEP]', 'want', '##ed', 'wa', 'un', 'runn', '##ing', ',']
text = ' '.join(tokens)
fine_text = text.replace(' ##', '')
# fine_text == '[UNK] [CLS] [SEP] wanted wa un running ,'

from transformers.

pertschuk avatar pertschuk commented on May 1, 2024 22

@thomwolf could you point to the specific section of run_squad.py that handles this? I'm having trouble finding it.

EDIT: is it this bit from processors/squad.py?

tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    sub_tokens = tokenizer.tokenize(token)
    for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)
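To see what the two alignment tables contain, here is a minimal self-contained sketch; `toy_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` (the real code uses the model's WordPiece tokenizer):

```python
# Toy wordpiece-style splitter standing in for tokenizer.tokenize()
def toy_tokenize(word):
    vocab_splits = {'wanted': ['want', '##ed'], 'running': ['runn', '##ing']}
    return vocab_splits.get(word, [word])

doc_tokens = ['he', 'wanted', 'running', 'shoes']

tok_to_orig_index = []   # sub-token position -> index of the original word
orig_to_tok_index = []   # original word index -> position of its first sub-token
all_doc_tokens = []
for i, token in enumerate(doc_tokens):
    orig_to_tok_index.append(len(all_doc_tokens))
    for sub_token in toy_tokenize(token):
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)

print(all_doc_tokens)     # ['he', 'want', '##ed', 'runn', '##ing', 'shoes']
print(tok_to_orig_index)  # [0, 1, 1, 2, 2, 3]
print(orig_to_tok_index)  # [0, 1, 3, 5]
```

With these tables, any span of sub-tokens (e.g. a predicted answer span) can be mapped back to the original words it came from.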


thomwolf avatar thomwolf commented on May 1, 2024 19

Yes. I don't plan to include a reverse conversion of tokens in the tokenizer.
For an example of how to keep track of the original character positions, please read the run_squad.py example.


artemisart avatar artemisart commented on May 1, 2024

You can remove ' ##', but you cannot know whether there was a space around punctuation tokens, nor recover the casing of uppercase words.
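To illustrate the loss: with the join-and-strip approach from the first comment (sketched here as a hypothetical `detok` helper), both "hello, world" and "hello , world" produce the same token sequence under a basic punctuation-splitting tokenizer, so the original spacing cannot be recovered:

```python
def detok(tokens):
    # Join-and-strip detokenization, as in the first comment
    return ' '.join(tokens).replace(' ##', '')

# '##' pieces are re-attached correctly...
print(detok(['run', '##ning']))          # running
# ...but both "hello, world" and "hello , world" tokenize to the
# same pieces, so detokenization cannot tell them apart:
print(detok(['hello', ',', 'world']))    # hello , world
```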


igrinis avatar igrinis commented on May 1, 2024

An apostrophe is treated as a punctuation mark, but it is often an integral part of a word. The regular .tokenize() always turns an apostrophe into a standalone token, so the information about which word it belongs to is lost. If the original sentence contains apostrophes, it is impossible to recreate it from its tokens (for example, when an apostrophe is the last symbol of a word, convert_tokens_to_string() will join it with the following word). To overcome this, one can check the surroundings of each apostrophe and add ## immediately after tokenization. For example:

sent = "The Smiths' used their son's car" 
tokens = tokenizer.tokenize(sent)

now if you fix tokens to look like:

original =>['the', 'smith', '##s', "'", 'used', 'their', 'son', "'", 's', 'car']
fixed => ['the', 'smith', '##s', "##'", 'used', 'their', 'son', "##'", '##s', 'car']

you will be able to restore the original words.


