
indobenchmark-toolkit's Introduction

Indobenchmark Toolkit


Indobenchmark is a collection of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia, built in collaboration with institutions such as Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Toolkit Modules

IndoNLGTokenizer

IndoNLGTokenizer is the tokenizer used by both the IndoBART and IndoGPT models. Examples of using the IndoNLGTokenizer are shown below:

  • IndoNLGTokenizer for IndoGPT
## Encode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')
# inputs: {'input_ids': tensor([[    0,  4693, 39956,  1119,  3447]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

## Decode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
text = tokenizer.decode([0,  4693, 39956,  1119,  3447])
# text: '<s> hai, bagaimana kabar'
  • IndoNLGTokenizer for IndoBART
## Encode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indobart')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', return_tensors='pt',
                       lang_token='[indonesian]', decoder_lang_token='[indonesian]')
# inputs: {'input_ids': tensor([    0,  4693, 39956,  1119,  3447,     2, 40002]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1])}

## Decode ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indobart')
text = tokenizer.decode([0,  4693, 39956,  1119,  3447, 2, 40002])
# text: '<s> hai, bagaimana kabar </s> [indonesian]'

Note: IndoNLGTokenizer automatically lower-cases the text input, since the IndoNLGTokenizer, IndoBART, and IndoGPT models are all trained only on lower-cased text.
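
For instance, the following sketch (an illustration, assuming the standard tokenizer call interface) encodes a mixed-case and a lower-cased version of the same sentence and checks that they yield identical token ids:

## Lower-casing check (illustrative sketch) ##
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
mixed = tokenizer('Hai, Bagaimana Kabar')['input_ids']
lower = tokenizer('hai, bagaimana kabar')['input_ids']
assert mixed == lower  # the tokenizer lower-cases internally, so the ids match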

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020; you can find the details in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

IndoNLG has been accepted at EMNLP 2021; you can find the details in our paper: https://arxiv.org/abs/2104.08200. If you use any component of IndoNLG, including Indo4B-Plus, IndoBART, or IndoGPT, in your work, please cite the following paper:

@misc{cahyawijaya2021indonlg,
      title={IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation}, 
      author={Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Leylia Khodra and Ayu Purwarianti and Pascale Fung},
      year={2021},
      eprint={2104.08200},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite Pretrained Language Models [Link]
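
As a sketch of how such a checkpoint is typically loaded with Hugging Face transformers ('indobenchmark/indobert-base-p1' is one of the published variants; substitute the one you need):

## Loading an IndoBERT checkpoint (sketch) ##
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
model = AutoModel.from_pretrained('indobenchmark/indobert-base-p1')
encoded = tokenizer('aku adalah anak yang baik', return_tensors='pt')
outputs = model(**encoded)  # outputs.last_hidden_state holds the contextual embeddings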

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

  • FastText model (11.9 GB) [Link]
  • Vector file (3.9 GB) [Link]

We provide smaller FastText models with reduced vocabularies for each of the 12 downstream tasks.
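
A minimal sketch for loading the FastText files with gensim (the filenames below are hypothetical placeholders; use the names of the files you downloaded):

## Loading the FastText model and vectors with gensim (sketch) ##
from gensim.models.fasttext import load_facebook_model
from gensim.models import KeyedVectors
ft = load_facebook_model('indo4b-fasttext-uncased.bin')  # hypothetical .bin filename
vectors = KeyedVectors.load_word2vec_format('indo4b-fasttext-uncased.vec')  # hypothetical .vec filename
print(vectors.most_similar('jakarta', topn=5))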

IndoBART and IndoGPT Models

We provide IndoBART and IndoGPT Pretrained Language Models [Link]
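
A sketch of loading both generation checkpoints (GPT2LMHeadModel for IndoGPT matches the issue report below; MBartForConditionalGeneration for IndoBART is an assumption based on its mBART-style architecture):

## Loading IndoGPT and IndoBART (sketch) ##
from transformers import GPT2LMHeadModel, MBartForConditionalGeneration
from indobenchmark import IndoNLGTokenizer
gpt_model = GPT2LMHeadModel.from_pretrained('indobenchmark/indogpt')
bart_model = MBartForConditionalGeneration.from_pretrained('indobenchmark/indobart')  # assumed model class
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')
outputs = gpt_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))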

indobenchmark-toolkit's People

Contributors

bryanwilie, maziyank, mettikodeva, samuelcahyawijaya


indobenchmark-toolkit's Issues

'IndoNLGTokenizer' object has no attribute 'sp_model'

Hello, I tried to fine-tune the IndoGPT model using the transformers package from Hugging Face. It asked me to use the IndoNLGTokenizer, but when I run the sample code to encode the text, it gives me AttributeError: 'IndoNLGTokenizer' object has no attribute 'sp_model'.

The sample code:

from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')
inputs = tokenizer.prepare_input_for_generation('hai, bagaimana kabar', model_type='indogpt', return_tensors='pt')

Installed related packages on Google Colab:

  • transformers==4.34.0
  • torch==2.0.1+cu118
  • sentencepiece==0.1.99

Is there any way to resolve the error?
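
One workaround that is often suggested for this class of error (an assumption, not an official fix): transformers 4.34 reworked slow-tokenizer initialization so that sp_model is accessed during __init__, before the subclass has set it. Pinning an earlier transformers release usually avoids this:

## Hedged workaround: pin transformers to a release before the 4.34 tokenizer refactor ##
!pip install "transformers<4.34" sentencepiece==0.1.99
from indobenchmark import IndoNLGTokenizer
tokenizer = IndoNLGTokenizer.from_pretrained('indobenchmark/indogpt')  # sp_model should now be set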

IndoGPT: TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found

I tried to run the example notebook examples/load_indonlg.ipynb to test the IndoGPT model. It ran on Colaboratory with Python 3.10, which is consistent with setup.py. Here are all of my code cells:

!git clone https://github.com/indobenchmark/indobenchmark-toolkit.git
!pip install /content/indobenchmark-toolkit/.
import os, sys
sys.path.append("/content/indobenchmark-toolkit")
import torch
from transformers import GPT2LMHeadModel
from src.indobenchmark import IndoNLGTokenizer
from torch.utils.data import DataLoader
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())
%%time
gpt_model = GPT2LMHeadModel.from_pretrained("indobenchmark/indogpt")
gpt_tokenizer = IndoNLGTokenizer.from_pretrained("indobenchmark/indogpt")
gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
gpt_out = gpt_model.generate(**gpt_input)
gpt_tokenizer.decode(gpt_out[0]) # <-- Error exists here.

However, the last cell triggered an error. Here is the full traceback:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-6-d94e6e0a3dd3> in <cell line: 3>()
      1 gpt_input = gpt_tokenizer.prepare_input_for_generation('aku adalah anak', model_type='indogpt', return_tensors='pt')
      2 gpt_out = gpt_model.generate(**gpt_input)
----> 3 gpt_tokenizer.decode(gpt_out[0])

3 frames

/usr/local/lib/python3.10/dist-packages/indobenchmark/tokenization_indonlg.py in decode(self, inputs, skip_special_tokens)
    343 
    344     def decode(self, inputs, skip_special_tokens=False):
--> 345         outputs = super().decode(inputs, skip_special_tokens=skip_special_tokens)
    346         return outputs.replace(' ','').replace('▁', ' ')
    347 

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3744         token_ids = to_py_obj(token_ids)
   3745 
-> 3746         return self._decode(
   3747             token_ids=token_ids,
   3748             skip_special_tokens=skip_special_tokens,

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens, **kwargs)
   1022                 current_sub_text.append(token)
   1023         if current_sub_text:
-> 1024             sub_texts.append(self.convert_tokens_to_string(current_sub_text))
   1025 
   1026         if spaces_between_special_tokens:

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py in convert_tokens_to_string(self, tokens)
    987 
    988     def convert_tokens_to_string(self, tokens: List[str]) -> str:
--> 989         return " ".join(tokens)
    990 
    991     def _decode(

TypeError: sequence item 0: expected str instance, tokenizers.AddedToken found

I also tried pip install indobenchmark-toolkit instead of cloning from GitHub, but the result was the same as before. Any solution for this?
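
One hedged workaround (a sketch, not an official fix): newer transformers releases can hand tokenizers.AddedToken objects to convert_tokens_to_string, and the " ".join(...) there fails on non-string items. Decoding manually and unwrapping each token sidesteps the join; the string handling below mirrors the tokenizer's custom decode:

## Hedged workaround: decode around the AddedToken objects (sketch) ##
ids = gpt_out[0].tolist()
tokens = gpt_tokenizer.convert_ids_to_tokens(ids)
tokens = [t.content if hasattr(t, 'content') else t for t in tokens]  # unwrap AddedToken objects
text = ''.join(tokens).replace('▁', ' ').strip()  # same post-processing as IndoNLGTokenizer.decode
print(text)

Pinning transformers to an earlier release, as suggested for the sp_model issue above, may also avoid the error entirely.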
