lucidrains / charformer-pytorch
Implementation of the GBST block from the Charformer paper, in Pytorch
License: MIT License
Hello, I was attempting to adapt this guide for use with Charformer Pytorch. Colab notebook for that guide is here.
I'd like to be able to use GBST on the same data, https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt, but I'm not sure how to pass that in.
I tried looking at the source code, and the other issues here, but haven't yet found the details.
Some specific questions:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)
When I tried doing that line, I got the following error:
/usr/local/lib/python3.7/dist-packages/transformers/data/datasets/language_modeling.py:124: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
  FutureWarning,
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-1688c68b48be> in <module>()
5 tokenizer=tokenizer,
6 file_path="./oscar.eo.txt",
----> 7 block_size=128,
8 )
1 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
TypeError: forward() got an unexpected keyword argument 'add_special_tokens'
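The error suggests the `tokenizer` variable holds a GBST module rather than a Hugging Face tokenizer, and GBST's `forward` does not accept tokenizer-style keyword arguments like `add_special_tokens`. GBST consumes raw byte IDs, so no subword tokenizer is needed at all. A minimal sketch of how one line of `oscar.eo.txt` could be turned into a fixed-length byte-ID sequence (the `PAD_ID` value and helper name are my own assumptions, not part of this repo):

```python
# Sketch (not part of this repo): GBST operates on raw byte IDs, so each
# line of text can be encoded as UTF-8 byte values (0-255) directly.
# One extra ID is reserved here for padding (an assumption; with it,
# GBST's num_tokens would be 257).

PAD_ID = 256          # assumed padding id
BLOCK_SIZE = 128      # mirrors block_size in the guide being adapted

def encode_line(line, block_size=BLOCK_SIZE, pad_id=PAD_ID):
    """Turn one line of text into a fixed-length list of byte IDs."""
    ids = list(line.encode("utf-8"))[:block_size]   # truncate long lines
    ids += [pad_id] * (block_size - len(ids))       # pad short lines
    return ids

ids = encode_line("Saluton mondo")  # an Esperanto greeting, as in oscar.eo.txt
assert len(ids) == BLOCK_SIZE
assert ids[:7] == [83, 97, 108, 117, 116, 111, 110]  # "Saluton" in UTF-8
```

Batches of these fixed-length ID lists, together with a boolean mask marking the non-pad positions, could then be fed to GBST in place of a tokenized dataset.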
The authors address the difference between bytes and characters in footnote 2; it seems the byte vocabulary is just a character embedding table of size 256. However, the footnote's last sentence says: "For other languages, each character corresponds to 2–3 bytes in general. For simplicity and to align with prior work, we will generally talk about characters unless stated otherwise."
Taking the example 子词分词, it becomes 子子子词词词分分分词词词, with 3 bytes for every character.
What I want to know is: do the 3 bytes mean we replicate every single character three times and then feed that into the embedding? If so, how do we decide the number of bytes?
Thank you.
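As a side note on this question: UTF-8 itself decides the byte count per character, and a CJK character expands to three distinct byte values rather than three copies of the character. A short sketch:

```python
# Sketch: UTF-8 determines the number of bytes per character automatically.
# A CJK character expands to three *distinct* byte values in [0, 255],
# not three copies of one character embedding.

text = "子词分词"                  # the footnote's example string
ids = list(text.encode("utf-8"))  # byte IDs, directly usable as GBST input

assert len(ids) == 3 * len(text)          # 3 bytes per character here
assert ids[:3] == [0xE5, 0xAD, 0x90]      # the UTF-8 bytes of 子
assert len("a".encode("utf-8")) == 1      # ASCII stays a single byte
```

So the "replication" is really a per-byte expansion, and the embedding table indexes the 256 possible byte values rather than characters.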
Sorry about the newbie questions, but I was wondering if you could quickly show me how you would use this in an autoencoder.
I'm also wondering whether it would be possible to use this model with a transformer to generate text. Will I need to use upsampling during generation, as part of the (de)tokenizing step? Could you point me in the right direction?
After downsampling, the sequence has been shortened. But how can I return the sequence to its original length, since I may need to do sentence generation for error correction?
Thank you!
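One simple approach to restoring the original length, not something this repo implements, is nearest-neighbor upsampling: repeat each downsampled position `downsample_factor` times and let later layers refine the repeated vectors. A toy sketch:

```python
# Sketch (not implemented in this repo): restore the pre-downsampling
# length by repeating each block representation downsample_factor times,
# i.e. nearest-neighbor upsampling along the sequence axis.

def upsample(seq, downsample_factor):
    """Repeat every element of `seq` `downsample_factor` times."""
    return [item for item in seq for _ in range(downsample_factor)]

downsampled = ["b0", "b1", "b2"]      # 3 block representations
restored = upsample(downsampled, 4)   # back to the original 12 positions
assert restored == ["b0"] * 4 + ["b1"] * 4 + ["b2"] * 4
```

On real tensors the same operation would be `torch.repeat_interleave(x, downsample_factor, dim=1)` for a `(batch, seq, dim)` tensor.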
In section 2.1.1 of the paper, the authors claim that adding intra-block positional embeddings (https://github.com/lucidrains/charformer-pytorch/blob/main/charformer_pytorch/charformer_pytorch.py#L90-L96) makes the block representations aware of the position of each character. However, if one does mean pooling as the authors propose, wouldn't this amount to just adding the mean of the positional embeddings to every block? If anyone has any insights, please leave a comment.
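The concern can be checked numerically: because the mean is linear, adding intra-block position embeddings before the pool equals adding their mean after it, and since every block of a given size shares the same embeddings, that mean is one constant offset. A toy 1-d sketch (values are made up for illustration):

```python
# Sketch: mean pooling is linear, so mean(x_i + p_i) == mean(x_i) + mean(p),
# and mean(p) is the same constant for every block of the same size.

p = [0.5, -1.0, 2.0, 0.1]   # toy 1-d intra-block position embeddings, block size 4

def mean(xs):
    return sum(xs) / len(xs)

for block in ([1.0, 2.0, 3.0, 4.0], [0.0, -2.0, 5.0, 1.0]):
    pooled_with_pos = mean([x + e for x, e in zip(block, p)])
    pooled_plus_const = mean(block) + mean(p)
    assert abs(pooled_with_pos - pooled_plus_const) < 1e-9
```

So after mean pooling, the per-character positional information is indeed reduced to a constant shift; it could only matter where the computation is nonlinear or position-weighted rather than a plain mean.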