
japanese-pretrained-models

(previously: japanese-gpt2)


This repository provides the code for training Japanese pretrained models. This code has been used for producing japanese-gpt2-medium, japanese-gpt2-small, japanese-gpt2-xsmall, and japanese-roberta-base released on HuggingFace model hub by rinna Co., Ltd.

Currently supported pretrained models include GPT-2 and RoBERTa.

Table of Contents
Update log
Use tips
Use our pretrained models via Huggingface
Train japanese-gpt2-xsmall from scratch
Train japanese-roberta-base from scratch
License

Please open an issue (in English/日本語) if you encounter any problem using the code or using our models via Huggingface.

If you find this work useful, please cite the following paper:

@article{rinna_pretrained2021,
    title={日本語自然言語処理における事前学習モデルの公開},
    author={趙 天雨 and 沢田 慶},
    journal={人工知能学会研究会資料 言語・音声理解と対話処理研究会},
    volume={93},
    pages={169-170},
    year={2021},
    doi={10.11517/jsaislud.93.0_169}
}

Update log

  • 2022/01/25 Updated link to rinna/japanese-gpt-1b in the model summary table.

  • 2022/01/17 Updated citation information.

  • 2021/11/01 Updated corpora links.

  • 2021/09/13 Added tips on using position_ids with japanese-roberta-base. Refer to issue 3 for details.

  • 2021/08/26 [Important] Updated license from the MIT license to the Apache 2.0 license due to the use of the Wikipedia pre-processing code from cl-tohoku/bert-japanese. See issue 1 for details.

  • 2021/08/23 Added Japanese Wikipedia to training corpora. Published code for training rinna/japanese-gpt2-small, rinna/japanese-gpt2-xsmall, and rinna/japanese-roberta-base.

  • 2021/08/18 Changed the repo name from japanese-gpt2 to japanese-pretrained-models.

  • 2021/06/15 Fixed best PPL tracking bug when using a checkpoint.

  • 2021/05/04 Fixed random seeding bug for Multi-GPU training.

  • 2021/04/06 Published code for training rinna/japanese-gpt2-medium.


Use tips

Tips for rinna/japanese-roberta-base

  • Use [CLS]: To predict a masked token, be sure to prepend a [CLS] token to the sentence, because the model was trained with it and needs it to encode the input correctly.

  • Use [MASK] after tokenization: (A) typing [MASK] directly in the input string and (B) replacing a token with [MASK] after tokenization yield different token sequences, and thus different prediction results. Using [MASK] after tokenization is more appropriate, since it is consistent with how the model was pretrained. Note, however, that the Huggingface Inference API only supports typing [MASK] in the input string, so its predictions are less robust. A sketch comparing the two ways follows this list.

  • Provide position_ids as an argument explicitly: When position_ids are not provided to a Roberta* model, Huggingface's transformers constructs them automatically, but starting from padding_idx instead of 0 (see the issue and the function create_position_ids_from_input_ids() in Huggingface's implementation). This unfortunately does not work as expected with rinna/japanese-roberta-base, since the padding_idx of the corresponding tokenizer is not 0. So please be sure to construct the position_ids yourself and make them start from position id 0, as in the example below.
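
As a rough illustration of the [MASK] tip, the following sketch (assuming the same rinna/japanese-roberta-base tokenizer setup as in the example below) compares the two ways of introducing [MASK]; the exact segmentation depends on the tokenizer, so treat it as an illustration rather than reference output.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # same workaround as in the example below

# A) type [MASK] directly in the input string
tokens_a = tokenizer.tokenize("[CLS]4年に1度[MASK]は開かれる。")

# B) replace a token with [MASK] after tokenization
tokens_b = tokenizer.tokenize("[CLS]4年に1度オリンピックは開かれる。")
tokens_b[5] = tokenizer.mask_token

# The two token sequences (and therefore the predictions) generally differ,
# e.g. in how the text surrounding the mask is segmented, which is why B)
# is the recommended usage.
print(tokens_a)
print(tokens_b)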


Use our pretrained models via Huggingface

Model summary

language model               # params  # layers  # emb dim  # epochs  dev ppl  training time*
rinna/japanese-gpt-1b        1.3B      24        2048       10+       13.9     n/a**
rinna/japanese-gpt2-medium   336M      24        1024       4         18       45 days
rinna/japanese-gpt2-small    110M      12        768        3         21       15 days
rinna/japanese-gpt2-xsmall   37M       6         512        3         28       4 days

masked language model        # params  # layers  # emb dim  # epochs  dev ppl  training time*
rinna/japanese-roberta-base  110M      12        768        8         3.9      15 days

* Training was conducted on an 8x V100 32GB machine.

** Training was conducted using a different codebase and a different computing environment.

Example: use rinna/japanese-roberta-base to predict a masked token

import torch
from transformers import T5Tokenizer, RobertaForMaskedLM

# load tokenizer
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # workaround for a bug in tokenizer config loading

# load model
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")
model = model.eval()

# original text
text = "4年に1度オリンピックは開かれる。"

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。']

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)  # output: ['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。']

# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
token_tensor = torch.LongTensor([token_ids])

# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)  # output: [0, 1, 2, 3, 4, 5, 6, 7, 8]
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

"""
0 総会
1 サミット
2 ワールドカップ
3 フェスティバル
4 大会
5 オリンピック
6 全国大会
7 党大会
8 イベント
9 世界選手権
"""

Train japanese-gpt2-xsmall from scratch

Install dependencies

Install required packages by running the following command under the repo directory:

pip install -r requirements.txt

Data construction and model training

  1. Set up the fugashi tokenizer for preprocessing the Wikipedia corpus by running:
python -m unidic download
  2. Download the training corpus Japanese CC-100 and extract the ja.txt file.

  3. Move the ja.txt file, or modify src/corpus/jp_cc100/config.py so that self.raw_data_dir matches the filepath of ja.txt.

  4. Split ja.txt into smaller files by running:

cd src/
python -m corpus.jp_cc100.split_to_small_files
  5. First check the available versions of the Wikipedia dump at Wikipedia cirrussearch and fill in self.download_link (in src/corpus/jp_wiki/config.py) with the link to your preferred Wikipedia dump version. Then download the training corpus Japanese Wikipedia and split it by running:
python -m corpus.jp_wiki.build_pretrain_dataset
python -m corpus.jp_wiki.split_to_small_files
  6. Train an xsmall-sized GPT-2 on, for example, 4 V100 GPUs by running:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m task.pretrain_gpt2.train \
    --n_gpus 4 \
    --save_model True \
    --enable_log True \
    --model_size xsmall \
    --model_config_filepath model/gpt2-ja-xsmall-config.json \
    --batch_size 20 \
    --eval_batch_size 40 \
    --n_training_steps 1600000 \
    --n_accum_steps 3 \
    --init_lr 0.0007

Interact with the trained model

Assuming you have run the training script and saved your xsmall-sized GPT-2 to data/model/pretrain_gpt2/gpt2-ja-xsmall-xxx.checkpoint, run the following command to use it to complete text on one GPU with nucleus sampling (p=0.95, k=40):

CUDA_VISIBLE_DEVICES=0 python -m task.pretrain_gpt2.interact \
    --checkpoint_path ../data/model/pretrain_gpt2/gpt2-ja-xsmall-xxx.checkpoint \
    --gen_type top \
    --top_p 0.95 \
    --top_k 40

Prepare files for uploading to Huggingface

  1. Make your Huggingface account. Create a model repo. Clone it to your local machine.

  2. Create model and config files from a checkpoint by running:

python -m task.pretrain_gpt2.checkpoint2huggingface \
    --checkpoint_path ../data/model/gpt2-medium-xxx.checkpoint \
    --save_dir {huggingface's model repo directory}
  3. Validate the created files by running:
python -m task.pretrain_gpt2.check_huggingface \
    --model_dir {huggingface's model repo directory}
  4. Add the files, commit, and push them to your Huggingface repo.

Customize your GPT-2 training

Check available arguments of GPT-2 training script by running:

python -m task.pretrain_gpt2.train --help

Train japanese-roberta-base from scratch

Assuming you have finished the data construction process described above, run the following command to train a base-sized Japanese RoBERTa on, for example, 8 V100 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m task.pretrain_roberta.train \
    --n_gpus 8 \
    --save_model True \
    --enable_log True \
    --model_size base \
    --model_config_filepath model/roberta-ja-base-config.json \
    --batch_size 32 \
    --eval_batch_size 32 \
    --n_training_steps 3000000 \
    --n_accum_steps 16 \
    --init_lr 0.0006

License

The Apache 2.0 license


japanese-pretrained-models's Issues

Tensor size does not match

Description

GPT-2 train fails with an error "RuntimeError: The size of tensor a (768) must match the size of tensor b (1024) at non-singleton dimension 3".

I followed the steps of "Train japanese-gpt2-xsmall from scratch", except that n_gpus was set to 1 and mecab_dict_path was changed to the path of unidic-csj-3.0.1.1.

What's wrong?

Full output of python -m task.pretrain_gpt2.train:

local rank: [0], global_rank: [0]
Number of training files: 502
Number of dev files: 1
----- Loading dev data -----
{'n_docs': 10000, 'n_sents': 131762, 'n_tokens': 4241376}
----- Hyper-parameters -----
balanced_corpora: None
batch_size: 20
check_loss_after_n_step: 100.0
checkpoint_path: None
corpora: ['jp_cc100', 'jp_wiki']
enable_log: True
eval_batch_size: 40
filename_note: None
init_lr: 0.0007
l2_penalty: 0.01
master_port: 12321
max_grad_norm: 1.0
max_seq_len: 1024
model_config_filepath: model/gpt2-ja-xsmall-config.json
model_size: xsmall
n_accum_steps: 3
n_epochs: 10
n_gpus: 1
n_nodes: 1
n_train_files_per_group: 10
n_training_steps: 1600000
n_warmup_steps: 2000.0
node_rank: 0
resume_training: False
save_model: True
seed: 42
small_data: False
use_amp: True
validate_after_n_step: 5000.0
world_size: 1
{'n_docs': 1367409, 'n_sents': 8632681, 'n_tokens': 288213354}
Traceback (most recent call last):
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/var/tmp/hiroki/japanese-pretrained-models/src/task/pretrain_gpt2/train.py", line 580, in <module>
    train(0, config)
  File "/var/tmp/hiroki/japanese-pretrained-models/src/task/pretrain_gpt2/train.py", line 409, in train
    loss, ppl = forward_step(model, tokenizer, batch_data)
  File "/var/tmp/hiroki/japanese-pretrained-models/src/task/pretrain_gpt2/train.py", line 85, in forward_step
    gpt2_outputs = model(input_ids=input_ids, return_dict=True)
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 904, in forward
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 752, in forward
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 290, in forward
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 241, in forward
  File "/Users/hiroki/.conda/envs/transformers/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/models/gpt2/modeling_gpt2.py", line 176, in _attn
RuntimeError: The size of tensor a (768) must match the size of tensor b (1024) at non-singleton dimension 3

Environment

python == 3.8.13
PyTorch == 1.12.1
transformers == 4.4.2

Please add "tokenizer_class" in "config.json"

Please add tokenizer_class to config.json, like:

  "tokenizer_class": "T5Tokenizer",

This enables use of AutoTokenizer, like:

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-1b")

instead of

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt-1b")

(Other models can be changed in the same way.)

Related to: cl-tohoku/bert-japanese#24

rinna RoBERTa's max_length is 510 not 512?

Hi, I have been using rinna RoBERTa for a while now, and I have a question.
The max_length of rinna RoBERTa is 510 (not 512), right?
Is this intended? If so, why did you use 510 instead of 512 for max_length?

rinna RoBERTa's padding_idx is 3 (not 1), so I think the starting position in position_embeddings is padding_idx + 1 = 4 (as with the position_ids issue described above), but the size of position_embeddings in rinna RoBERTa is (514, 768). If I actually input text with a length of 512, I get an index error.
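
For reference, the 510 figure can be reproduced from the published config and tokenizer. This is only a sketch, under the assumption that position ids are auto-generated starting from padding_idx + 1 (as in Huggingface's create_position_ids_from_input_ids()):

from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

pad_id = tokenizer.pad_token_id                   # 3 for this tokenizer
max_pos = model.config.max_position_embeddings    # 514, matching the (514, 768) embedding size

# With auto-generated position ids running from pad_id + 1 upward, the longest
# sequence that fits is max_pos - pad_id - 1 rather than max_pos - 2.
print(max_pos - pad_id - 1)  # 510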

The load_docs_from_filepath method in src/task/pretrain_roberta/train.py just returns an empty list.

The load_docs_from_filepath method in src/task/pretrain_roberta/train.py only returns an empty list.
Is this intended behavior?
Thank you.

def load_docs_from_filepath(filepath, tokenizer):
    docs = []
    with open(filepath, encoding="utf-8") as f:
        doc = []
        for line in f:
            line = line.strip()
            if line == "":
                if len(doc) > 0:
                    docs.append(doc)
                doc = []
            else:
                sent = line
                tokens = tokenizer.tokenize(sent)
                token_ids = tokenizer.convert_tokens_to_ids(tokens)
                if len(token_ids) > 0:
                    doc.append(token_ids)
    return docs

Train japanese-gpt2-xsmall from scratch

After the following command,

python -m corpus.jp_wiki.build_pretrain_dataset

the following command is necessary for training japanese-gpt2-xsmall from scratch.

python -m corpus.jp_wiki.split_to_small_files

If so, please update the usage.

Japanese Wikipedia dump link has changed

First of all, thanks for great project!

Currently, the Wikipedia link is fixed to https://dumps.wikimedia.org/other/cirrussearch/20210329/jawiki-20210329-cirrussearch-content.json.gz. However, it looks like the maintainers dispose of a dump as it becomes older. The latest version is https://dumps.wikimedia.org/other/cirrussearch/20211025/jawiki-20211025-cirrussearch-content.json.gz. I would be grateful if you could note this in the README.

(I also found that the CC-100 link is broken now, but that is not your fault.)

Can I use `rinna/japanese-roberta-base` through `AutoTokenizer` ?

Hi, thank you very much for publishing such a wonderful Japanese pre-trained model! I am very happy to use this model.

I would like to load the pre-trained tokenizer with AutoTokenizer.from_pretrained, but I encountered the following error. Do you support loading the pre-trained tokenizer with AutoTokenizer.from_pretrained?

$ python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('rinna/japanese-roberta-base')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 423, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1709, in from_pretrained
    return cls._from_pretrained(
  File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1722, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1781, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/models/roberta/tokenization_roberta.py", line 159, in __init__
    super().__init__(
  File "/home/shunk031/.pyenv/versions/japanese-dev/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 179, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

Fortunately, AutoModel.from_pretrained can be run successfully (the warning message can be ignored this time).

$ python -c "from transformers import AutoModel; AutoModel.from_pretrained('rinna/japanese-roberta-base')"
Some weights of RobertaModel were not initialized from the model checkpoint at rinna/japanese-roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

The following is my system environment:

  • python 3.8.8
  • transformers 4.5.1

I would appreciate any advice on how to load it this way. Thanks.
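
Until a tokenizer_class entry is added to the hosted config.json (see the tokenizer_class issue above), one workaround consistent with the README example is to load the slow tokenizer explicitly:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # same tokenizer config loading workaround as in the README example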

Please update the data URL.

I noticed that the Wikipedia dumps have been updated for all languages.

as-is (src/corpus/jp_wiki/config.py)

class Config(object):
    def __init__(self):
        self.corpus_name = "jp_wiki"

        # Management
        self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20211025-cirrussearch-content.json.gz"
        self.raw_data_dir = "../data/jp_wiki/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
        self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
        self.doc_data_dir = "../data/jp_wiki/doc_data"

to-be (src/corpus/jp_wiki/config.py)

class Config(object):
    def __init__(self):
        self.corpus_name = "jp_wiki"

        # Management
        self.download_link = "https://dumps.wikimedia.org/other/cirrussearch/current/jawiki-20220228-cirrussearch-content.json.gz"
        self.raw_data_dir = "../data/jp_wiki/raw_data"
        self.raw_data_path = f"{self.raw_data_dir}/wiki.json.gz"
        self.extracted_data_path = f"{self.raw_data_dir}/wiki.extracted.txt"
        self.doc_data_dir = "../data/jp_wiki/doc_data"
