lucidrains / charformer-pytorch
Implementation of the GBST block from the Charformer paper, in Pytorch
License: MIT License
Hello, I was attempting to adapt this guide for use with Charformer Pytorch. Colab notebook for that guide is here.
I'd like to be able to use GBST on the same data, https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt, but I'm not sure how to pass that in.
I tried looking at the source code, and the other issues here, but haven't yet found the details.
Some specific questions:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)
When I tried doing that line, I got the following error:
/usr/local/lib/python3.7/dist-packages/transformers/data/datasets/language_modeling.py:124: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the 🤗 Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py
  FutureWarning,
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-1688c68b48be> in <module>()
5 tokenizer=tokenizer,
6 file_path="./oscar.eo.txt",
----> 7 block_size=128,
8 )
1 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
TypeError: forward() got an unexpected keyword argument 'add_special_tokens'
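The error suggests the `tokenizer` variable holds a GBST module rather than a Hugging Face tokenizer, and GBST's `forward` does not accept tokenizer-style keyword arguments like `add_special_tokens`. GBST consumes raw byte IDs, so no subword tokenizer is needed at all. A minimal sketch of how one line of `oscar.eo.txt` could be turned into a fixed-length byte-ID sequence (the `PAD_ID` value and helper name are my own assumptions, not part of this repo):

```python
# Sketch (not part of this repo): GBST operates on raw byte IDs, so each
# line of text can be encoded as UTF-8 byte values (0-255) directly.
# One extra ID is reserved here for padding (an assumption; with it,
# GBST's num_tokens would be 257).

PAD_ID = 256          # assumed padding id
BLOCK_SIZE = 128      # mirrors block_size in the guide being adapted

def encode_line(line, block_size=BLOCK_SIZE, pad_id=PAD_ID):
    """Turn one line of text into a fixed-length list of byte IDs."""
    ids = list(line.encode("utf-8"))[:block_size]   # truncate long lines
    ids += [pad_id] * (block_size - len(ids))       # pad short lines
    return ids

ids = encode_line("Saluton mondo")  # an Esperanto greeting, as in oscar.eo.txt
assert len(ids) == BLOCK_SIZE
assert ids[:7] == [83, 97, 108, 117, 116, 111, 110]  # "Saluton" in UTF-8
```

Batches of these fixed-length ID lists, together with a boolean mask marking the non-pad positions, could then be fed to GBST in place of a tokenized dataset.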
The authors address the difference between bytes and characters in footnote 2; it seems the byte vocabulary is just a character embedding table of size 256. However, the footnote's last sentence says: "For other languages, each character corresponds to 2–3 bytes in general. For simplicity and to align with prior work, we will generally talk about characters unless stated otherwise."
Taking the example 子词分词, it becomes 子子子词词词分分分词词词, with 3 bytes for every character.
What I want to know is: do the 3 bytes mean we replicate every single character three times and then feed that into the embedding? If so, how do we decide the number of bytes?
Thank you.
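As a side note on this question: UTF-8 itself decides the byte count per character, and a CJK character expands to three distinct byte values rather than three copies of the character. A short sketch:

```python
# Sketch: UTF-8 determines the number of bytes per character automatically.
# A CJK character expands to three *distinct* byte values in [0, 255],
# not three copies of one character embedding.

text = "子词分词"                  # the footnote's example string
ids = list(text.encode("utf-8"))  # byte IDs, directly usable as GBST input

assert len(ids) == 3 * len(text)          # 3 bytes per character here
assert ids[:3] == [0xE5, 0xAD, 0x90]      # the UTF-8 bytes of 子
assert len("a".encode("utf-8")) == 1      # ASCII stays a single byte
```

So the "replication" is really a per-byte expansion, and the embedding table indexes the 256 possible byte values rather than characters.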
Sorry about the newbie questions, but I was wondering if you could quickly show me how you would use this in an autoencoder.
I'm also wondering whether it would be possible to use this model with a transformer to generate text. Will I need to use upsampling during generation, as part of the (de)tokenizing step? Could you point me in the right direction?
After downsampling, the sequence has been shortened. But how can I return the sequence to its original length, since I may need to do sentence generation for error correction?
Thank you!
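One simple approach to restoring the original length, not something this repo implements, is nearest-neighbor upsampling: repeat each downsampled position `downsample_factor` times and let later layers refine the repeated vectors. A toy sketch:

```python
# Sketch (not implemented in this repo): restore the pre-downsampling
# length by repeating each block representation downsample_factor times,
# i.e. nearest-neighbor upsampling along the sequence axis.

def upsample(seq, downsample_factor):
    """Repeat every element of `seq` `downsample_factor` times."""
    return [item for item in seq for _ in range(downsample_factor)]

downsampled = ["b0", "b1", "b2"]      # 3 block representations
restored = upsample(downsampled, 4)   # back to the original 12 positions
assert restored == ["b0"] * 4 + ["b1"] * 4 + ["b2"] * 4
```

On real tensors the same operation would be `torch.repeat_interleave(x, downsample_factor, dim=1)` for a `(batch, seq, dim)` tensor.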
In section 2.1.1 of the paper, the authors claim that adding intra-block positional embeddings (https://github.com/lucidrains/charformer-pytorch/blob/main/charformer_pytorch/charformer_pytorch.py#L90-L96) makes the block representations aware of the position of each character. However, if one does mean pooling as the authors propose, wouldn't this amount to just adding the mean of the positional embeddings to every block? If anyone has any insights, please leave a comment.
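The concern can be checked numerically: because the mean is linear, adding intra-block position embeddings before the pool equals adding their mean after it, and since every block of a given size shares the same embeddings, that mean is one constant offset. A toy 1-d sketch (values are made up for illustration):

```python
# Sketch: mean pooling is linear, so mean(x_i + p_i) == mean(x_i) + mean(p),
# and mean(p) is the same constant for every block of the same size.

p = [0.5, -1.0, 2.0, 0.1]   # toy 1-d intra-block position embeddings, block size 4

def mean(xs):
    return sum(xs) / len(xs)

for block in ([1.0, 2.0, 3.0, 4.0], [0.0, -2.0, 5.0, 1.0]):
    pooled_with_pos = mean([x + e for x, e in zip(block, p)])
    pooled_plus_const = mean(block) + mean(p)
    assert abs(pooled_with_pos - pooled_plus_const) < 1e-9
```

So after mean pooling, the per-character positional information is indeed reduced to a constant shift; it could only matter where the computation is nonlinear or position-weighted rather than a plain mean.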