ymcui / chinese-xlnet

Pre-Trained Chinese XLNet (Chinese XLNet pre-trained models)

Home Page: http://xlnet.hfl-rc.com

License: Apache License 2.0

natural-language-processing xlnet tensorflow pytorch nlp

chinese-xlnet's Introduction


This project provides pre-trained XLNet models for Chinese, aiming to enrich Chinese natural language processing resources and offer a more diverse choice of Chinese pre-trained models. We welcome researchers and practitioners to download and use them, and to jointly advance the development of Chinese language resources.

This project is based on the official XLNet from CMU/Google: https://github.com/zihangdai/xlnet


Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation toolkit TextBrewer | Model pruning toolkit TextPruner

More resources released by the Joint Laboratory of HIT and iFLYTEK Research (HFL): https://github.com/ymcui/HFL-Anthology

News

2023/3/28 We open-sourced the Chinese LLaMA & Alpaca large language models, which can be quickly deployed on a PC. See: https://github.com/ymcui/Chinese-LLaMA-Alpaca

2022/10/29 We proposed LERT, a pre-trained model that incorporates linguistic information. See: https://github.com/ymcui/LERT

2022/3/30 We open-sourced a new pre-trained model, PERT. See: https://github.com/ymcui/PERT

2021/12/17 HFL released the model pruning toolkit TextPruner. See: https://github.com/airaria/TextPruner

2021/10/24 HFL released CINO, a pre-trained model for Chinese minority languages. See: https://github.com/ymcui/Chinese-Minority-PLM

2021/7/21 The book 《自然语言处理:基于预训练模型的方法》 (Natural Language Processing: A Pre-trained Model Approach), written by several scholars from HIT-SCIR, has been published; you are welcome to purchase it.

2021/1/27 All models now support TensorFlow 2. Please load or download them via the transformers library: https://huggingface.co/hfl

Past news: 2020/9/15 Our paper ["Revisiting Pre-Trained Models for Chinese Natural Language Processing"](https://arxiv.org/abs/2004.13922) was accepted as a long paper at [Findings of EMNLP](https://2020.emnlp.org).

2020/8/27 HFL reached the top of the GLUE benchmark for general natural language understanding; see the GLUE leaderboard and the related news.

2020/3/11 To better understand users' needs, we invite you to fill out a survey so that we can provide better resources.

2020/2/26 HFL released the knowledge distillation toolkit TextBrewer.

2019/12/19 The models released in this repository are now available through Huggingface-Transformers; see Quick Load.

2019/9/5 XLNet-base is now available for download; see Model Download.

2019/8/19 Released the Chinese XLNet-mid model trained on a large-scale general-domain corpus (5.4B tokens); see Model Download.

Table of Contents

| Section | Description |
| ------- | ----------- |
| Model Download | Download links for the Chinese pre-trained XLNet models |
| Baseline Performance | Results of several baseline systems |
| Pre-training Details | Description of the pre-training process |
| Fine-tuning Details | Description of fine-tuning on downstream tasks |
| FAQ | Frequently asked questions |
| Citation | Technical report for this repository |

Model Download

  • XLNet-mid:24-layer, 768-hidden, 12-heads, 209M parameters
  • XLNet-base:12-layer, 768-hidden, 12-heads, 117M parameters
| Model | Training Data | Google Download | Baidu Netdisk Download |
| ----- | ------------- | --------------- | ---------------------- |
| XLNet-mid, Chinese | Chinese Wikipedia + general-domain data[1] | TensorFlow / PyTorch | TensorFlow (password: 2jv2) |
| XLNet-base, Chinese | Chinese Wikipedia + general-domain data[1] | TensorFlow / PyTorch | TensorFlow (password: ge7w) |

[1] The general-domain data includes encyclopedia, news, and QA data, totaling 5.4B tokens; it is the same corpus used to train our released BERT-wwm-ext.

PyTorch Version

If you need the PyTorch version:

1) Convert the TensorFlow checkpoint yourself using the conversion script provided by 🤗 Transformers, or

2) Download the PyTorch weights directly from the Hugging Face Hub: https://huggingface.co/hfl

How: click the model you want to download → scroll to the bottom and click "List all files in model" → download the bin and json files from the pop-up box.
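
If you have internet access from Python, an alternative to downloading the files manually is to let transformers fetch and cache the weights and then save a local copy. This is only a minimal sketch (the target directory name is arbitrary):

from transformers import AutoTokenizer, AutoModel

# Fetch the PyTorch weights from the Hugging Face Hub and cache them locally.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-mid")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-mid")

# Save a self-contained local copy (config, weights, and tokenizer files).
model.save_pretrained("./chinese-xlnet-mid-local")
tokenizer.save_pretrained("./chinese-xlnet-mid-local")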

Usage Notes

Users in mainland China are advised to use the Baidu Netdisk download links; users elsewhere should use the Google links. The XLNet-mid model file is about 800MB. Taking the TensorFlow version of XLNet-mid, Chinese as an example, unzipping the downloaded file yields:

chinese_xlnet_mid_L-24_H-768_A-12.zip
    |- xlnet_model.ckpt      # model weights
    |- xlnet_model.meta      # model meta information
    |- xlnet_model.index     # model index information
    |- xlnet_config.json     # model configuration
    |- spiece.model          # vocabulary (SentencePiece model)

Quick Load

With Huggingface-Transformers (version 2.2.2 or later), the models above can be loaded easily:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")

The corresponding MODEL_NAME values are listed below:

| Model | MODEL_NAME |
| ----- | ---------- |
| XLNet-mid | hfl/chinese-xlnet-mid |
| XLNet-base | hfl/chinese-xlnet-base |
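
As a quick end-to-end check, here is a minimal sketch assuming a recent transformers version and PyTorch installed; the sample sentence is arbitrary:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

# Encode a sample sentence and run a forward pass.
inputs = tokenizer("哈工大讯飞联合实验室发布了中文XLNet预训练模型。", return_tensors="pt")
outputs = model(**inputs)

# Hidden size is 768 for both XLNet-base and XLNet-mid.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)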

Baseline Performance

To compare against baselines, we evaluated on the following Chinese datasets, comparing Chinese BERT, BERT-wwm, and BERT-wwm-ext with XLNet-base and XLNet-mid. The results for Chinese BERT, BERT-wwm, and BERT-wwm-ext are taken from the Chinese BERT-wwm project. Due to limited time and resources we could not cover more task types; please feel free to try others yourself.

Note: To ensure the reliability of the results, each model was run 10 times with different random seeds, and we report both the maximum and the average performance. Your own results should, with high probability, fall within this range.

In the tables below, the number in parentheses is the average and the number outside is the maximum.

Simplified Chinese Reading Comprehension: CMRC 2018

The **CMRC 2018 dataset** is a Chinese machine reading comprehension dataset released by HFL. Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD. Metrics: EM / F1

| Model | Dev | Test | Challenge |
| ----- | --- | ---- | --------- |
| BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
| XLNet-base | 65.2 (63.0) / 86.9 (85.9) | 67.0 (65.8) / 87.2 (86.8) | 25.0 (22.7) / 51.3 (49.5) |
| XLNet-mid | 66.8 (66.3) / 88.4 (88.1) | 69.3 (68.5) / 89.2 (88.8) | 29.1 (27.1) / 55.8 (54.9) |

Traditional Chinese Reading Comprehension: DRCD

The **DRCD dataset**, released by Delta Research Center, is an extractive reading comprehension dataset in Traditional Chinese, in the same format as SQuAD. Metrics: EM / F1

| Model | Dev | Test |
| ----- | --- | ---- |
| BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
| XLNet-base | 83.8 (83.2) / 92.3 (92.0) | 83.5 (82.8) / 92.2 (91.8) |
| XLNet-mid | 85.3 (84.9) / 93.5 (93.3) | 85.5 (84.8) / 93.6 (93.2) |

Sentiment Classification: ChnSentiCorp

For sentiment classification we use the ChnSentiCorp dataset, where the model must classify a text as positive or negative. Metric: Accuracy

| Model | Dev | Test |
| ----- | --- | ---- |
| BERT | 94.7 (94.3) | 95.0 (94.7) |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
| XLNet-base | - | - |
| XLNet-mid | 95.8 (95.2) | 95.4 (94.9) |

Pre-training Details

The following describes the pre-training details, using the XLNet-mid model as an example.

Generating the Vocabulary

Following the official XLNet tutorial, the vocabulary is first generated with SentencePiece. In this project we use a vocabulary size of 32000; the remaining parameters use the defaults from the official example.

spm_train \
	--input=wiki.zh.txt \
	--model_prefix=sp10m.cased.v3 \
	--vocab_size=32000 \
	--character_coverage=0.99995 \
	--model_type=unigram \
	--control_symbols=\<cls\>,\<sep\>,\<pad\>,\<mask\>,\<eod\> \
	--user_defined_symbols=\<eop\>,.,\(,\),\",-,–,£,€ \
	--shuffle_input_sentence \
	--input_sentence_size=10000000
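
After training, the resulting model file (sp10m.cased.v3.model, following the --model_prefix above) can be loaded with the SentencePiece Python bindings for a quick sanity check; this is only an illustrative sketch with an arbitrary sentence:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("sp10m.cased.v3.model")  # produced by the spm_train command above

print(sp.get_piece_size())                        # should be 32000
print(sp.encode_as_pieces("哈工大讯飞联合实验室"))   # sub-word pieces
print(sp.encode_as_ids("哈工大讯飞联合实验室"))      # corresponding vocabulary ids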

Generating tf_records

After generating the vocabulary, the raw text corpus is used to produce the tf_records files for training. The raw text is formatted as in the original tutorial (a toy example is sketched after this list):

  • one sentence per line
  • an empty line marks the end of a document
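
For illustration only, a toy corpus file in this format (with made-up sentences) could be written like this; the file name matches the *.proc.txt glob used in the command further below:

import os

# Two tiny "documents": one sentence per line, an empty line between documents.
docs = [
    ["今天天气很好。", "我们去公园散步。"],
    ["XLNet是一种预训练语言模型。"],
]

os.makedirs("data", exist_ok=True)
with open("data/toy.proc.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")  # document boundary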

Below is the command used to generate the data (set num_task according to the actual number of data shards):

SAVE_DIR=./output_b32
INPUT=./data/*.proc.txt

python data_utils.py \
	--bsz_per_host=32 \
	--num_core_per_host=8 \
	--seq_len=512 \
	--reuse_len=256 \
	--input_glob=${INPUT} \
	--save_dir=${SAVE_DIR} \
	--num_passes=20 \
	--bi_data=True \
	--sp_path=spiece.model \
	--mask_alpha=6 \
	--mask_beta=1 \
	--num_predict=85 \
	--uncased=False \
	--num_task=10 \
	--task=1

Pre-training

Once the data above is ready, XLNet pre-training begins. The model is called XLNet-mid because, compared with XLNet-base, only the number of layers is increased (from 12 to 24); all other hyperparameters are unchanged, mainly due to limited compute. The command used is as follows:

DATA=YOUR_GS_BUCKET_PATH_TO_TFRECORDS
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
TPU_NAME=v3-xlnet
TPU_ZONE=us-central1-b

python train.py \
	--record_info_dir=$DATA \
	--model_dir=$MODEL_DIR \
	--train_batch_size=32 \
	--seq_len=512 \
	--reuse_len=256 \
	--mem_len=384 \
	--perm_size=256 \
	--n_layer=24 \
	--d_model=768 \
	--d_embed=768 \
	--n_head=12 \
	--d_head=64 \
	--d_inner=3072 \
	--untie_r=True \
	--mask_alpha=6 \
	--mask_beta=1 \
	--num_predict=85 \
	--uncased=False \
	--train_steps=2000000 \
	--save_steps=20000 \
	--warmup_steps=20000 \
	--max_save=20 \
	--weight_decay=0.01 \
	--adam_epsilon=1e-6 \
	--learning_rate=1e-4 \
	--dropout=0.1 \
	--dropatt=0.1 \
	--tpu=$TPU_NAME \
	--tpu_zone=$TPU_ZONE \
	--use_tpu=True

Fine-tuning Details for Downstream Tasks

Downstream fine-tuning was done on Google Cloud TPU v2 (64GB HBM); the configuration for each task is briefly described below. If you fine-tune on a GPU, please adjust the parameters accordingly, especially batch_size and learning_rate. The relevant code is in the src directory.

CMRC 2018

For reading comprehension tasks, tf_records data must be generated first. Please refer to the SQuAD 2.0 preprocessing in the official XLNet tutorial; it is not repeated here. Below are the script arguments used for the CMRC 2018 Chinese machine reading comprehension task:

XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b

python -u run_cmrc_drcd.py \
	--spiece_model_file=./spiece.model \
	--model_config_path=${XLNET_DIR}/xlnet_config.json \
	--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
	--tpu_zone=${TPU_ZONE} \
	--use_tpu=True \
	--tpu=${TPU_NAME} \
	--num_hosts=1 \
	--num_core_per_host=8 \
	--output_dir=${DATA_DIR} \
	--model_dir=${MODEL_DIR} \
	--predict_dir=${MODEL_DIR}/eval \
	--train_file=${DATA_DIR}/cmrc2018_train.json \
	--predict_file=${DATA_DIR}/cmrc2018_dev.json \
	--uncased=False \
	--max_answer_length=40 \
	--max_seq_length=512 \
	--do_train=True \
	--train_batch_size=16 \
	--do_predict=True \
	--predict_batch_size=16 \
	--learning_rate=3e-5 \
	--adam_epsilon=1e-6 \
	--iterations=1000 \
	--save_steps=2000 \
	--train_steps=2400 \
	--warmup_steps=240

DRCD

Below are the script arguments used for the DRCD Traditional Chinese machine reading comprehension task:

XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b

python -u run_cmrc_drcd.py \
	--spiece_model_file=./spiece.model \
	--model_config_path=${XLNET_DIR}/xlnet_config.json \
	--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
	--tpu_zone=${TPU_ZONE} \
	--use_tpu=True \
	--tpu=${TPU_NAME} \
	--num_hosts=1 \
	--num_core_per_host=8 \
	--output_dir=${DATA_DIR} \
	--model_dir=${MODEL_DIR} \
	--predict_dir=${MODEL_DIR}/eval \
	--train_file=${DATA_DIR}/DRCD_training.json \
	--predict_file=${DATA_DIR}/DRCD_dev.json \
	--uncased=False \
	--max_answer_length=30 \
	--max_seq_length=512 \
	--do_train=True \
	--train_batch_size=16 \
	--do_predict=True \
	--predict_batch_size=16 \
	--learning_rate=3e-5 \
	--adam_epsilon=1e-6 \
	--iterations=1000 \
	--save_steps=2000 \
	--train_steps=3600 \
	--warmup_steps=360

ChnSentiCorp

Unlike reading comprehension tasks, classification tasks do not require generating tf_records in advance. Below are the script arguments used for the ChnSentiCorp sentiment classification task:

XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b

python -u run_classifier.py \
	--spiece_model_file=./spiece.model \
	--model_config_path=${XLNET_DIR}/xlnet_config.json \
	--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
	--task_name=csc \
	--do_train=True \
	--do_eval=True \
	--eval_all_ckpt=False \
	--uncased=False \
	--data_dir=${RAW_DIR} \
	--output_dir=${DATA_DIR} \
	--model_dir=${MODEL_DIR} \
	--train_batch_size=48 \
	--eval_batch_size=48 \
	--num_hosts=1 \
	--num_core_per_host=8 \
	--num_train_epochs=3 \
	--max_seq_length=256 \
	--learning_rate=2e-5 \
	--save_steps=5000 \
	--use_tpu=True \
	--tpu=${TPU_NAME} \
	--tpu_zone=${TPU_ZONE}

FAQ

Q: Will you release larger models?
A: Not necessarily, and we make no promises. If we obtain a significant performance improvement, we will consider releasing it.

Q: The model performs poorly on some datasets?
A: Try another model, or continue pre-training from this checkpoint on your own data.

Q: Will the pre-training data be released?
A: Sorry, we cannot release it due to copyright issues.

Q: How long did it take to train XLNet?
A: XLNet-mid was trained on Cloud TPU v3 (128GB HBM) for 2M steps with batch size 32, which took about three weeks. XLNet-base was trained for 4M steps.

Q: Why hasn't the official XLNet team released a Multilingual or Chinese XLNet?
A: (Personal opinion) We don't know; many people have asked for one, see XLNet issue #3. With the official team's expertise and compute, training such a model would not be hard (a multilingual version may be more complicated, since the balance among languages must be considered; see the description in multilingual BERT). On the other hand, the authors are under no obligation to do so. Their technical contribution is already sufficient, and they should not be criticized for not releasing such models; please treat others' work reasonably.

Q: Is XLNet better than BERT in most cases?
A: At least on the tasks above the results look good, and the training data is the same as that used for our released BERT-wwm-ext.


Citation

If the resources in this repository help your research, please cite the following technical report in your paper: https://arxiv.org/abs/2004.13922

@inproceedings{cui-etal-2020-revisiting,
    title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
    author = "Cui, Yiming  and
      Che, Wanxiang  and
      Liu, Ting  and
      Qin, Bing  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
    pages = "657--668",
}

Acknowledgements

Project authors: Yiming Cui (HFL), Wanxiang Che (HIT), Ting Liu (HIT), Shijin Wang (iFLYTEK), Guoping Hu (iFLYTEK)

This project is supported by the Google TensorFlow Research Cloud (TFRC) program.

The following repositories were consulted while building this project; we thank their authors:

Disclaimer

This is not the official Chinese XLNet released by the XLNet authors, nor is it an official product of HIT or iFLYTEK. The content in this repository is for technical research purposes only and should not be used as a basis for any conclusive claims. Users may use the models freely within the scope of the license, but we are not responsible for any direct or indirect loss caused by using this repository.

Follow Us

Follow the official WeChat account of the Joint Laboratory of HIT and iFLYTEK Research (HFL).


Feedback & Contributions

If you have questions, please submit them via GitHub Issues.
We do not provide dedicated support and encourage users to help each other solve problems.
If you find an implementation problem or would like to help build this project, please submit a Pull Request.


chinese-xlnet's Issues

How do you handle token-level tasks (e.g., extractive reading comprehension such as CMRC 2018, or NER) when the Chinese pre-trained model uses SentencePiece tokenization?

For example, on CMRC 2018, SentencePiece often splits the answer at odd boundaries, which lowers EM.
For instance, the original sentence is "1994年被任命为XXX" and the answer is "1994年". But because "年被" occurs very frequently, SentencePiece merges "年被" into one piece, so the extracted span becomes "1994年被". This happens at both training and test time. How did you solve this? Many thanks!
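
To make the mismatch concrete, here is a hypothetical sketch (using the released spiece.model; whether "年被" is actually merged depends on the learned vocabulary) of how piece boundaries can disagree with character-level answer spans:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spiece.model")

text = "1994年被任命为XXX"
answer = "1994年"
pieces = sp.encode_as_pieces(text)
print(pieces)
# If one piece (e.g. "年被") straddles the end of the answer span,
# any span built from whole pieces cannot end exactly after "年",
# so the extracted answer becomes "1994年被" and EM drops.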

XLNet-mid OOM

XLNet-mid runs out of memory (OOM) when running run_classifier.py.
Parameters: max_seq_length=512, train_batch_size=1.
GPU: V100 with 32GB of memory.
What parameters or anything else can I tune, besides max_seq_length and train_batch_size, to avoid the OOM?

SPIECE_UNDERLINE

Hi, what role does SPIECE_UNDERLINE play in the code? encode_pieces prepends SPIECE_UNDERLINE to the beginning of every sentence while tokenizing.

How to load this pre-train model in pytorch?

I found the code that converts the TF model into PyTorch using Hugging Face's transformers.
So I downloaded the PyTorch pre-trained model and used Hugging Face's code to load it.
The code is as follows:

model_class = XLNetForSequenceClassification
tokenizer_class = XLNetTokenizer
pretrained_weights = "./pretrain_model" #here is a dir where the pretrain model  is unzipped
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights, num_labels=10)

However, model_class.from_pretrained(...) throws an error (shown as a screenshot in the original issue).

Any idea is appreciated, thanks!
Another question: when using the Chinese pre-trained model, what should I do with the tokenizer? Should the Chinese input be split into words or sub-words?

Question about sequence labeling

Since SentencePiece sub-word tokenization is used here instead of characters, how should labels be mapped for sequence labeling? Any suggestions?

XLNet's performance on the DRCD dev set differs greatly from its performance on the DRCD test set

  • Problem: I fine-tuned Chinese XLNet on the DRCD train set and applied the model to the DRCD dev and test sets, and the results differ greatly. I don't know which step went wrong.

| Model | Max sequence length | Batch size | Learning rate | Train steps | Dev set (EM/F1) | Test set (EM/F1) |
| ----- | ------------------- | ---------- | ------------- | ----------- | --------------- | ---------------- |
| XLNet-Chinese | 256 | 2 | 3e-5 | 12000 | 85.44 / 93.32 | 4.85 / 0.43 |

  • Steps:
  1. First run preprocess.sh with max_seq_length = 256; my GPU cannot handle anything longer.
#!/bin/bash
#### local path
DRCD_DIR=raw_data/
INIT_CKPT_DIR=XLNet/xlnet_pretrain_model/chinese_xlnet_mid_L-24_H-768_A-12
#### google storage path
GS_ROOT=
GS_PROC_DATA_DIR=XLNet/proc_data
python3 XLNet/xlnet/run_squad.py \
  --use_tpu=False \
  --do_prepro=True \
  --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
  --train_file=${DRCD_DIR}/DRCD_training.json \
  --output_dir=${GS_PROC_DATA_DIR} \
  --uncased=False \
  --max_seq_length=256 \
  $@
  2. Fine-tune on the DRCD train set, predict the dev set, and finally use cmrc2018_evaluate.py to compute EM and F1 on the dev set.

Fine tune on train set and predict dev set

#!/bin/bash
#### local path
DRCD_DIR=raw_data/
INIT_CKPT_DIR=XLNet/xlnet_pretrain_model/chinese_xlnet_mid_L-24_H-768_A-12
PROC_DATA_DIR=XLNet/proc_data
MODEL_DIR=XLNet/experiment/chinese_xlnet_mid_L-24_H-768_A-12_S-256_B-2
CUDA_VISIBLE_DEVICES=0,1 python3 XLNet/xlnet/run_squad.py \
  --use_tpu=False \
  --num_hosts=1 \
  --num_core_per_host=3 \
  --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
  --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
  --output_dir=${PROC_DATA_DIR} \
  --init_checkpoint=${INIT_CKPT_DIR}/xlnet_model.ckpt \
  --model_dir=${MODEL_DIR}/model_ckpt \
  --train_file=${DRCD_DIR}/DRCD_training.json \
  --predict_file=${DRCD_DIR}/DRCD_dev.json \
  --predict_dir=${MODEL_DIR}/predict_result/dev \
  --uncased=False \
  --max_seq_length=256 \
  --do_train=True \
  --train_batch_size=2 \
  --do_predict=True \
  --predict_batch_size=32 \
  --learning_rate=3e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=1000 \
  --train_steps=12000 \
  --warmup_steps=1000 \
  $@

Evaluate on dev set

#!/bin/bash
####local path
DRCD_DIR=raw_data/
EVALUATE_DIR=XLNet/xlnet/
PREDICT_RESULT=XLNet/experiment/chinese_xlnet_mid_L-24_H-768_A-12_S-256_B-2/predict_result
python2 $EVALUATE_DIR/cmrc2018_evaluate.py $DRCD_DIR/DRCD_dev.json $PREDICT_RESULT/dev/predictions.json

Performance on dev set: shown as a screenshot in the original issue.

  3. Use the fine-tuned XLNet to predict the test set, and use cmrc2018_evaluate.py to compute EM and F1 on the test set.

Predict test set

#!/bin/bash
#### local path
DRCD_DIR=raw_data/
INIT_CKPT_DIR=XLNet/xlnet_pretrain_model/chinese_xlnet_mid_L-24_H-768_A-12
INIT_CKPT_DIR_1=XLNet/experiment/chinese_xlnet_mid_L-24_H-768_A-12_S-256_B-2
PROC_DATA_DIR=XLNet/proc_data
MODEL_DIR=XLNet/experiment/chinese_xlnet_mid_L-24_H-768_A-12_S-256_B-2
CUDA_VISIBLE_DEVICES=0,1 python3 XLNet/xlnet/run_squad.py \
  --use_tpu=False \
  --num_hosts=1 \
  --num_core_per_host=3 \
  --model_config_path=${INIT_CKPT_DIR}/xlnet_config.json \
  --spiece_model_file=${INIT_CKPT_DIR}/spiece.model \
  --output_dir=${PROC_DATA_DIR} \
  --init_checkpoint=${INIT_CKPT_DIR_1}/model.ckpt-12000 \
  --model_dir=${MODEL_DIR}/model_ckpt \
  --train_file=${DRCD_DIR}/DRCD_training.json \
  --predict_file=${DRCD_DIR}/DRCD_test.json \
  --predict_dir=${MODEL_DIR}/predict_result/test \
  --uncased=False \
  --max_seq_length=256 \
  --do_train=False \
  --train_batch_size=2 \
  --do_predict=True \
  --predict_batch_size=32 \
  --learning_rate=3e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=1000 \
  --train_steps=12000 \
  --warmup_steps=1000 \
  $@

Evaluate on test set

#!/bin/bash
####local path
DRCD_DIR=raw_data/
EVALUATE_DIR=XLNet/xlnet/
PREDICT_RESULT=XLNet/experiment/chinese_xlnet_mid_L-24_H-768_A-12_S-256_B-2/predict_result
python2 $EVALUATE_DIR/cmrc2018_evaluate.py $DRCD_DIR/DRCD_test.json $PREDICT_RESULT/test/predictions.json

Performance on test set: shown as a screenshot in the original issue.

Bug in the learning-rate decay code when fine-tuning classification tasks

In get_train_op, when the "poly" decay is computed, decay_steps=FLAGS.train_steps - FLAGS.warmup_steps is used, but FLAGS.train_steps is still the initial value of 1000. It looks like run_classifier.py forgets to assign the actual computed number of training steps back to FLAGS.train_steps.

Will the corpus for training be open-sourced?

As we all know, Chinese NLP research has been slowed down by the unavailability of large open-source corpora, and this issue has become more and more severe due to the recent advances of large pre-trained LMs. Could you make the training corpus open source, for further research or follow-up work?

Error reported to Coordinator: Expected float32, got '/part_0' of type 'str' instead

I am using the xlnet-base pre-trained model for a multi-label classification task; my labels are of type float32. When the code reaches line 82 of model_utils.py (tf.train.init_from_checkpoint(init_checkpoint, assignment_map)), it fails with: Error reported to Coordinator: Expected float32, got '/part_0' of type 'str' instead. Could loading the pre-trained model be affected by my label type?

OOM when loading the model

Thanks a lot for the excellent work! When I try to run a text classification task, I get this error message:
2019-11-09 11:49:17.220991: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at training_ops.cc:2816 : Resource exhausted: OOM when allocating tensor with shape[3072,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
So I want to know whether there is any chance of running XLNet on a 2080 Ti.
Setting train_batch_size=1 and max_seq_length=100 does not work!
Thanks again!

train.py

Hi, has the script used for pre-training been removed from this repository?
Thanks a lot.

Error when loading the vocabulary with sentencepiece

Loading the downloaded vocabulary directly fails with the following error:
RuntimeError: Internal: /sentencepiece/src/sentencepiece_processor.cc(73) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

Data preprocessing scripts

Could you provide the preprocessing scripts or tfrecord files for the CMRC and DRCD train/dev/test sets? The preprocessing scripts in the official XLNet repo are not directly compatible.

How should the vocabulary be used?

Hi, I have fine-tuned BERT before, but after downloading your XLNet I found that its vocabulary is not in BERT's txt format. How should I use this pre-trained model? Is there a related library, such as pytorch_pretrained_bert (I use PyTorch)?

Which TensorFlow versions are supported?

With TensorFlow 2.0 there are syntax errors, and with tensorflow-gpu 1.13 the eval step fails:
Traceback (most recent call last):
File "run_classifier.py", line 1002, in
tf.app.run()
File "/root/miniconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_classifier.py", line 912, in main
global_step = int(cur_filename.split("-")[-1])
ValueError: invalid literal for int() with base 10: '/home/workspace/models/xlnet_model.ckpt'
It seems the checkpoint file naming differs between TensorFlow versions. Could you give a range of supported TensorFlow versions?

some config parameters of downloaded pre-trained model are different from the config in 'Readme'

The config of the downloaded model ('chinese_xlnet_base_pytorch'):

Model config {
"attn_type": "bi",
"bi_data": false,
"clamp_len": -1,
"d_head": 64,
"d_inner": 3072,
"d_model": 768,
"dropout": 0.1,
"end_n_top": 5,
"ff_activation": "relu",
"finetuning_task": null,
"initializer_range": 0.02,
"layer_norm_eps": 1e-12,
"mem_len": null,
"n_head": 12,
"n_layer": 12,
"n_token": 32000,
"num_labels": 2,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pruned_heads": {},
"reuse_len": null,
"same_length": false,
"start_n_top": 5,
"summary_activation": "tanh",
"summary_last_dropout": 0.1,
"summary_type": "last",
"summary_use_proj": true,
"torchscript": false,
"untie_r": true,
"use_bfloat16": false
}

The pre-training config in the README:

python train.py
--record_info_dir=$DATA
--model_dir=$MODEL_DIR
--train_batch_size=32
--seq_len=512
--reuse_len=256
--mem_len=384
--perm_size=256
--n_layer=24
--d_model=768
--d_embed=768
--n_head=12
--d_head=64
--d_inner=3072
--untie_r=True
--mask_alpha=6
--mask_beta=1
--num_predict=85
--uncased=False
--train_steps=2000000
--save_steps=20000
--warmup_steps=20000
--max_save=20
--weight_decay=0.01
--adam_epsilon=1e-6
--learning_rate=1e-4
--dropout=0.1
--dropatt=0.1
--tpu=$TPU_NAME
--tpu_zone=$TPU_ZONE
--use_tpu=True

My question is: why are 'mem_len' and 'reuse_len' null (None) in the downloaded models? Thanks.

xlnet-base loss=-0.0

Why does fine-tuning the Chinese xlnet-base model with Google's official XLNet code show loss=-0.0?
It shows loss=0.0 from step 0 all the way to the end of fine-tuning.

Thanks
weizhen

Where can I download this pre-trained XLNet?

I was recently building a Chinese NER project and wanted to use Chinese vectors, but I could not find the pre-trained XLNet model. Could you please give me a download page URL? Thank you so much.

Chinese XLNet for text generation?

Hi, Thanks for your work.

I was trying to use your model to generate Chinese text (as can be done for English with XLNetLMHeadModel in huggingface transformers). But I got:

    "You tried to generate sequences with a model that does not have a LM Head."
AttributeError: You tried to generate sequences with a model that does not have a LM Head.Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`)

Does this model contain an LM head for the text generation task, and is there a plan to release one?

How can I train with multiple GPUs on a single machine?

For example, in a classification task, do I still need to pass --use_tpu=True?
Does XLNet itself support distributed training well?
Does multi-GPU training support Google's usual approach, i.e.:
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py

Looking forward to your reply.

About the MRC task

When fine-tuning XLNet on an MRC task, the results are much worse than RoBERTa-wwm-ext-large. The data loading part should be fine. Could the problem be with the model?

Failed to get matching files

When I ran run_cmrc_drcd.py, I got a "Failed to get matching files" error when creating the checkpoint. I guess it's because there is no xlnet_model.ckpt file among the pre-trained model files. I renamed xlnet_model.ckpt.meta to xlnet_model.ckpt, but it still cannot find xlnet_model.ckpt.

INFO:tensorflow:Create CheckpointSaverHook.
I0120 13:58:49.598975 140028613015424 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
I0120 13:58:50.135126 140028613015424 estimator.py:1150] Done calling model_fn.
INFO:tensorflow:TPU job name tpu_worker
I0120 13:58:53.244385 140028613015424 tpu_estimator.py:506] TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
I0120 13:58:55.594104 140028613015424 monitored_session.py:240] Graph was finalized.
ERROR:tensorflow:Error recorded from training_loop: From /job:tpu_worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on /content/drive/My Drive/chinese_xlnet_mid_L-24_H-768_A-12/xlnet_model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: '/content/drive/My Drive/chinese_xlnet_mid_L-24_H-768_A-12/xlnet_model.ckpt')
	 [[node checkpoint_initializer_117 (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'checkpoint_initializer_117':
  File "content/drive/My Drive/Chinese-PreTrained-XLNet-master/src/run_cmrc_drcd.py", line 1292, in <module>
    tf.app.run()
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "content/drive/My Drive/Chinese-PreTrained-XLNet-master/src/run_cmrc_drcd.py", line 1193, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3184, in _model_fn
    scaffold = _get_scaffold(scaffold_fn)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3749, in _get_scaffold
    scaffold = scaffold_fn()
  File "content/drive/My Drive/Chinese-PreTrained-XLNet-master/src/model_utils.py", line 77, in tpu_scaffold
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
    init_from_checkpoint_fn)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1940, in merge_call
    return self._merge_call(merge_fn, args, kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1947, in _merge_call
    return merge_fn(self._strategy, *args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 286, in <lambda>
    ckpt_dir_or_file, assignment_map)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 334, in _init_from_checkpoint
    _set_variable_or_list_initializer(var, ckpt_file, tensor_name_in_ckpt)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 458, in _set_variable_or_list_initializer
    _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 412, in _set_checkpoint_initializer
    ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

How large is the gap between the TensorFlow and PyTorch versions?

Hi!
Thank you very much for open-sourcing the models!
However, in my own offline tests the PyTorch version performs much worse than the TensorFlow version. Have you evaluated the PyTorch version, or could the problem be in my own code?

The number of prediction examples differs from the actual number of examples

Why does the number of prediction examples differ from the actual number of examples?
Is it because of the code below?
I am not using a TPU; I am using a GPU.

  if FLAGS.do_eval:
    # TPU requires a fixed batch size for all batches, therefore the number
    # of examples must be a multiple of the batch size, or else examples
    # will get dropped. So we pad with fake examples which are ignored
    # later on. These do NOT count towards the metric (all tf.metrics
    # support a per-instance weight, and these get a weight of 0.0).
    #
    # Modified in XL: We also adopt the same mechanism for GPUs.
    while len(eval_examples) % FLAGS.eval_batch_size != 0:
      eval_examples.append(PaddingInputExample())

Error when using PyTorch

I followed the huggingface Quick tour.
Code:
import torch
import tokenization_xlnet
import modeling_xlnet
tokenizer = tokenization_xlnet.XLNetTokenizer.from_pretrained('xlnet-mid-chinese')
model = modeling_xlnet.XLNetModel.from_pretrained('xlnet-mid-chinese')
input_ids = torch.tensor([tokenizer.encode("我 喜欢 吃 西红柿 炒 鸡蛋", add_special_tokens=True)])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]
all_hidden_states, all_attentions = model(input_ids)[-2:]
traced_model = torch.jit.trace(model, (input_ids,))
model.save_pretrained('./test_save') # save

The error:
/py3.6/lib/python3.6/site-packages/torch/tensor.py:389: RuntimeWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
'incorrect results).', category=RuntimeWarning)
Traceback (most recent call last):
File "xlnet_test.py", line 14, in
traced_model = torch.jit.trace(model, (input_ids,))
File "/py3.6/lib/python3.6/site-packages/torch/jit/init.py", line 772, in trace
check_tolerance, _force_outplace, _module_class)
File "/py3.6/lib/python3.6/site-packages/torch/jit/init.py", line 904, in trace_module
module._c._create_method_from_trace(method_name, func, example_inputs, var_lookup_fn, _force_outplace)
RuntimeError: Tracer cannot infer type of (tensor([[[ 1.8302, -0.2841, 1.7623, ..., -4.0171, -2.8738, -2.7551],
[-0.1806, -0.4168, -0.9308, ..., -3.9143, -1.5399, -1.9979],
[ 1.8243, 1.3354, -0.4644, ..., -3.2942, -1.5304, -1.4603],
...,
[-2.4907, -0.2998, 1.6560, ..., -1.6929, 2.9048, 0.2806],
[-3.3055, 2.5498, 2.3597, ..., -2.5295, 1.5212, -1.0081],
[-0.8349, 0.0219, 1.2810, ..., -3.9269, 1.6507, -0.4940]]],
grad_fn=), (None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None))
:Cannot infer type of a None value (toTraceableIValue at /pytorch/torch/csrc/jit/pybind_utils.h:268)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f256bee8273 in /py3.6/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x44e288 (0x7f256cf27288 in /py3.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: + 0x4bdda2 (0x7f256cf96da2 in /py3.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x4d1d81 (0x7f256cfaad81 in /py3.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x1d3ef4 (0x7f256ccacef4 in /py3.6/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #6: python() [0x5067b0]
frame #8: python() [0x504232]
frame #9: python() [0x505e83]
frame #10: python() [0x5066f0]
frame #12: python() [0x504232]
frame #13: python() [0x505e83]
frame #14: python() [0x5066f0]
frame #16: python() [0x504232]
frame #18: python() [0x647fa2]
frame #23: __libc_start_main + 0xf0 (0x7f2570fb4830 in /lib/x86_64-linux-gnu/libc.so.6)
