
indobenchmark-toolkit's Introduction

Indobenchmark Toolkit


Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia, built in collaboration with institutions such as Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Toolkit Modules

IndoNLGTokenizer

IndoNLGTokenizer is the tokenizer used by both the IndoBART and IndoGPT models. Examples of using the IndoNLGTokenizer are shown below:

  • IndoNLGTokenizer for IndoGPT
## Encode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')
# inputs: {'input_ids': tensor([[    0,  4693, 39956,  1119,  3447]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

## Decode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
text = tokenizer.decode([0,  4693, 39956,  1119,  3447])
# text: '<s> hai, bagaimana kabar'
  • IndoNLGTokenizer for IndoBART
## Encode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indobart')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', return_tensors='pt',
                       lang_token='[indonesian]', decoder_lang_token='[indonesian]')
# inputs: {'input_ids': tensor([    0,  4693, 39956,  1119,  3447,     2, 40002]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1])}

## Decode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indobart')
text = tokenizer.decode([0,  4693, 39956,  1119,  3447, 2, 40002])
# text: '<s> hai, bagaimana kabar </s> [indonesian]'

Note: IndoNLGTokenizer automatically lower-cases the text input, since the IndoNLGTokenizer, IndoBART, and IndoGPT models are all trained only on lower-cased text.
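
For instance, the following sketch (an illustration, assuming the standard tokenizer call interface) encodes a mixed-case and a lower-cased version of the same sentence and checks that they yield identical token ids:

## Lower-casing check (illustrative sketch) ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
mixed = tokenizer('Hai, Bagaimana Kabar')['input_ids']
lower = tokenizer('hai, bagaimana kabar')['input_ids']
assert mixed == lower  # the tokenizer lower-cases internally, so the ids match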

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020; you can find the details in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

IndoNLG has been accepted at EMNLP 2021; you can find the details in our paper: https://arxiv.org/abs/2104.08200. If you use any component of IndoNLG, including Indo4B-Plus, IndoBART, or IndoGPT, in your work, please cite the following paper:

@misc{cahyawijaya2021indonlg,
      title={IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation}, 
      author={Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Leylia Khodra and Ayu Purwarianti and Pascale Fung},
      year={2021},
      eprint={2104.08200},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite Pretrained Language Models [Link]
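
As a sketch of how such a checkpoint is typically loaded with Hugging Face transformers ('indobenchmark/indobert-base-p1' is one of the published variants; substitute the one you need):

## Loading an IndoBERT checkpoint (sketch) ##
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
model = AutoModel.from_pretrained('indobenchmark/indobert-base-p1')
encoded = tokenizer('aku adalah anak yang baik', return_tensors='pt')
outputs = model(**encoded)  # outputs.last_hidden_state holds the contextual embeddings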

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

  • FastText model (11.9 GB) [Link]
  • Vector file (3.9 GB) [Link]

We provide smaller FastText models with reduced vocabularies for each of the 12 downstream tasks.
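
A minimal sketch for loading the FastText files with gensim (the filenames below are hypothetical placeholders; use the names of the files you downloaded):

## Loading the FastText model and vectors with gensim (sketch) ##
from gensim.models.fasttext import load_facebook_model
from gensim.models import KeyedVectors
ft = load_facebook_model('indo4b-fasttext-uncased.bin')  # hypothetical .bin filename
vectors = KeyedVectors.load_word2vec_format('indo4b-fasttext-uncased.vec')  # hypothetical .vec filename
print(vectors.most_similar('jakarta', topn=5))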

IndoBART and IndoGPT Models

We provide IndoBART and IndoGPT Pretrained Language Models [Link]
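
A sketch of loading both generation checkpoints (GPT2LMHeadModel for IndoGPT matches the issue report below; MBartForConditionalGeneration for IndoBART is an assumption based on its mBART-style architecture):

## Loading IndoGPT and IndoBART (sketch) ##
from transformers import GPT2LMHeadModel, MBartForConditionalGeneration
from indobenchmark import IndoNLGTokenizer
gpt_model = GPT2LMHeadModel.from_pretrained('indobenchmark/indogpt')
bart_model = MBartForConditionalGeneration.from_pretrained('indobenchmark/indobart')  # assumed model class
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')
outputs = gpt_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))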

indobenchmark-toolkit's People

Contributors

bryanwilie, maziyank, mettikodeva, samuelcahyawijaya


indobenchmark-toolkit's Issues

'IndoNLGTokenizer' object has no attribute 'sp_model'

Hello, I tried to fine-tune the IndoGPT model using the transformers package from Hugging Face. It asked me to use the IndoNLGTokenizer, but when I run the sample code to encode the text, it gives me AttributeError: 'IndoNLGTokenizer' object has no attribute 'sp_model'.

The sample code:

from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')

Installed related packages on Google Colab:

  • transformers==4.34.0
  • torch==2.0.1+cu118
  • sentencepiece==0.1.99

Is there any way to resolve the error?
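
One workaround that is often suggested for this class of error (an assumption, not an official fix): transformers 4.34 reworked slow-tokenizer initialization so that sp_model is accessed during __init__, before the subclass has set it. Pinning an earlier transformers release usually avoids this:

## Hedged workaround: pin transformers to a release before the 4.34 tokenizer refactor ##
!pip install "transformers<4.34" sentencepiece==0.1.99
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')  # sp_model should now be set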

IndoGPT: TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found

I tried to run the example notebook examples/load_indonlg.ipynb to test the IndoGPT model. It ran on Colaboratory with Python 3.10, which is consistent with setup.py. Here are all of my code cells:

!git clone https://github.com/indobenchmark/indobenchmark-toolkit.git
!pip install /content/indobenchmark-toolkit/.
import os, sys
sys.path.append("/content/indobenchmark-toolkit")
import torch
from transformers import GPT2LMHeadModel
from src.indobenchmark import IndoNLGTokenizer
from torch.utils.data import DataLoader
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())
%%time
gpt_model = GPT2LMHeadModel.from_pretrained("indobenchmark/indogpt")
gpt_tokenizer = IndoNLGTokenizer.from_pretrained("indobenchmark/indogpt")
gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
gpt_out = gpt_model.generate(**gpt_input)
gpt_tokenizer.decode(gpt_out[0]) # <-- Error exists here.

However, the last cell triggered an error. Here is the full traceback:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-6-d94e6e0a3dd3> in <cell line: 3>()
      1 gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
      2 gpt_out = gpt_model.generate(**gpt_input)
----> 3 gpt_tokenizer.decode(gpt_out[0])

3 frames

/usr/local/lib/python3.10/dist-packages/indobenchmark/tokenization_indonlg.py in decode(self, inputs, skip_special_tokens)
    343 
    344     def decode(self, inputs, skip_special_tokens=False):
--> 345         outputs = super().decode(inputs, skip_special_tokens=skip_special_tokens)
    346         return outputs.replace(' ','').replace('▁', ' ')
    347 

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3744         token_ids = to_py_obj(token_ids)
   3745 
-> 3746         return self._decode(
   3747             token_ids=token_ids,
   3748             skip_special_tokens=skip_special_tokens,

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
   1022                 current_sub_text.append(token)
   1023         if current_sub_text:
-> 1024             sub_texts.append(self.convert_tokens_to_string(current_sub_text))
   1025 
   1026         if spaces_between_special_tokens:

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py in convert_tokens_to_string(self, tokens)
    987 
    988     def convert_tokens_to_string(self, tokens: List[str]) -> str:
--> 989         return " ".join(tokens)
    990 
    991     def _decode(

TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found

I also tried pip install indobenchmark-toolkit instead of cloning from GitHub, but the result was the same as before. Any solution for this?
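
One hedged workaround (a sketch, not an official fix): newer transformers releases can hand tokenizers.AddedToken objects to convert_tokens_to_string, and the " ".join(...) there fails on non-string items. Decoding manually and unwrapping each token sidesteps the join; the string handling below mirrors the tokenizer's custom decode:

## Hedged workaround: decode around the AddedToken objects (sketch) ##
ids = gpt_out[0].tolist()
tokens = gpt_tokenizer.convert_ids_to_tokens(ids)
tokens = [t.content if hasattr(t, 'content') else t for t in tokens]  # unwrap AddedToken objects
text = ''.join(tokens).replace('▁', ' ').strip()  # same post-processing as IndoNLGTokenizer.decode
print(text)

Pinning transformers to an earlier release, as suggested for the sp_model issue above, may also avoid the error entirely.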
