Awesome Pretrained Chinese NLP Models
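In natural language processing, pretrained language models have become a foundational technology. This repository collects high-quality Chinese pretrained models that are publicly available online (thanks to everyone who shared these resources) and will be updated continuously.
Note: 🤗 Hugging Face model download locations: 1. the official Hugging Face site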
2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, et al. | arXiv | PDF
2019 | Pre-Training with Whole Word Masking for Chinese BERT | Yiming Cui, et al. | arXiv | PDF
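Notes:
wwm stands for **Whole Word Masking**: if any WordPiece subword of a complete word is masked, the other subwords belonging to that word are masked as well.
ext indicates that the model was trained on additional datasets.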
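The whole word masking rule in the notes above can be illustrated with a minimal sketch. It assumes the usual WordPiece convention that a token starting with `##` continues the previous word; the function names `group_wordpieces` and `whole_word_mask` and the `mask_prob` parameter are hypothetical and chosen only for illustration. It is a simplification: it decides per word with a fixed probability and always substitutes `[MASK]`, and for Chinese text the word boundaries would come from a word segmenter rather than from `##` prefixes, which is not shown here.

```python
import random


def group_wordpieces(tokens):
    """Group token indices into whole words.

    Assumption: a token starting with '##' continues the previous word.
    """
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: every subword of a selected word is replaced."""
    rng = random.Random(seed)
    masked = list(tokens)
    for word in group_wordpieces(tokens):
        # One random draw per word, so its subwords share the same fate.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


if __name__ == "__main__":
    tokens = ["the", "phil", "##am", "##mon", "sang"]
    print(group_wordpieces(tokens))
    print(whole_word_mask(tokens, mask_prob=0.5, seed=1))
```

Because the random draw happens once per word rather than once per token, a word such as `phil ##am ##mon` is either fully masked or left fully intact.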
2021 | ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information | Zijun Sun, et al. | arXiv | PDF
2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Yinhan Liu, et al. | arXiv | PDF
2019 | ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations | Zhenzhong Lan, et al. | arXiv | PDF
2019 | NEZHA: Neural Contextualized Representation for Chinese Language Understanding | Junqiu Wei, et al. | arXiv | PDF
2020 | Revisiting Pre-Trained Models for Chinese Natural Language Processing | Yiming Cui, et al. | arXiv | PDF
2020 | 提速不掉点:基于词颗粒度的中文WoBERT [Faster without losing accuracy: a word-granularity Chinese WoBERT] | Jianlin Su (苏剑林) | spaces | Blog post
2019 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Zhilin Yang, et al. | arXiv | PDF
2020 | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | Kevin Clark, et al. | arXiv | PDF
2019 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | Shizhe Diao, et al. | arXiv | PDF
2019 | ERNIE: Enhanced Representation through Knowledge Integration | Yu Sun, et al. | arXiv | PDF
2020 | SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis | Hao Tian, et al. | arXiv | PDF
2020 | ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding | Dongling Xiao, et al. | arXiv | PDF
Notes:
To convert PaddlePaddle weights to TensorFlow, see: tensorflow_ernie
To convert PaddlePaddle weights to PyTorch, see: ERNIE-Pytorch
2021 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation | Yu Sun, et al. | arXiv | PDF
2021 | ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation | Shuohuan Wang, et al. | arXiv | PDF
Model | Version | PaddlePaddle | PyTorch | Author | Source | Domain
ernie-3.0-base | 12-layer, 768-hidden, 12-heads | link | - | PaddlePaddle | github | General
ernie-3.0-medium | 6-layer, 768-hidden, 12-heads | link | - | PaddlePaddle | github | General
ernie-3.0-mini | 6-layer, 384-hidden, 12-heads | link | - | PaddlePaddle | github | General
ernie-3.0-micro | 4-layer, 384-hidden, 12-heads | link | - | PaddlePaddle | github | General
ernie-3.0-nano | 4-layer, 312-hidden, 12-heads | link | - | PaddlePaddle | github | General
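To compare the ERNIE 3.0 variants listed above, a version string such as `12-layer, 768-hidden, 12-heads` can be parsed into a small config object. The sketch below is only illustrative: `EncoderConfig`, `parse_version`, and `approx_encoder_params` are hypothetical names, and the parameter estimate assumes a standard Transformer encoder whose feed-forward size is 4x the hidden size. It ignores embeddings, biases, and layer normalization, so the numbers are rough orders of magnitude, not official figures.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    layers: int
    hidden: int
    heads: int

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.hidden // self.heads

    def approx_encoder_params(self, ffn_multiplier: int = 4) -> int:
        """Rough weight count for the encoder stack only.

        Assumption: 4 hidden-by-hidden attention projections per layer plus
        two feed-forward matrices of size hidden x (ffn_multiplier * hidden).
        """
        attention = 4 * self.hidden ** 2
        ffn = 2 * ffn_multiplier * self.hidden ** 2
        return self.layers * (attention + ffn)


def parse_version(text: str) -> EncoderConfig:
    """Parse strings of the form '12-layer, 768-hidden, 12-heads'."""
    match = re.fullmatch(r"(\d+)-layer, (\d+)-hidden, (\d+)-heads", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return EncoderConfig(*map(int, match.groups()))


if __name__ == "__main__":
    for version in ("12-layer, 768-hidden, 12-heads", "4-layer, 312-hidden, 12-heads"):
        cfg = parse_version(version)
        print(version, cfg.head_dim, cfg.approx_encoder_params())
```

Under these assumptions the per-layer cost scales with the square of the hidden size, which is why the smaller variants shrink much faster than their layer counts alone would suggest.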
2021 | RoFormer: Enhanced Transformer with Rotary Position Embedding | Jianlin Su, et al. | arXiv | PDF
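2021 | Transformer升级之路:2、博采众长的旋转式位置编码 [The road to upgrading the Transformer, part 2: a rotary position embedding that draws on many strengths] | Jianlin Su (苏剑林) | spaces | Blog post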
2019 | StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding | Wei Wang, et al. | arXiv | PDF
Model | Version | TensorFlow | PyTorch | Author | Source | Domain
StructBERT | large(L24) | Alibaba Cloud (阿里云) | - | Alibaba | github | General
2021 | Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models | Yuxuan Lai, et al. | arXiv | PDF
2021 | Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese | Zhuosheng Zhang, et al. | arXiv | PDF
2021 | TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning | Yixuan Su, et al. | arXiv | PDF
2021 | MC-BERT: Conceptualized Representation Learning for Chinese Biomedical Text Mining | alibaba-research | arXiv | PDF
2022 | PERT: Pre-Training BERT with Permuted Language Model | Yiming Cui, et al. | arXiv | PDF
2020 | MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices | Zhiqing Sun, et al. | arXiv | PDF
2022 | GAU-α: (FLASH) Transformer Quality in Linear Time | Weizhe Hua, et al. | arXiv | PDF | blog
2018 | Improving Language Understanding by Generative Pre-Training | Alec Radford, et al. | arXiv | PDF
2019 | Language Models are Unsupervised Multitask Learners | Alec Radford, et al. | arXiv | PDF
2019 | Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | Zihang Dai, et al. | arXiv | PDF
2020 | Language Models are Few-Shot Learners | Tom B. Brown, et al. | arXiv | PDF
2019 | NEZHA: Neural Contextualized Representation for Chinese Language Understanding | Junqiu Wei, et al. | arXiv | PDF
2018 | Improving Language Understanding by Generative Pre-Training | Alec Radford, et al. | arXiv | PDF
2020 | CPM: A Large-scale Generative Chinese Pre-trained Language Model | Zhengyan Zhang, et al. | arXiv | PDF
Notes:
To convert PyTorch weights to TensorFlow, see: CPM-LM-TF2
To convert PyTorch weights to PaddlePaddle, see: CPM-Generate-Paddle
2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Colin Raffel, et al. | arXiv | PDF
2019 | PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization | Jingqing Zhang, et al. | arXiv | PDF
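2021 | T5 PEGASUS:开源一个中文生成式预训练模型 [T5 PEGASUS: open-sourcing a Chinese generative pretrained model] | Jianlin Su (苏剑林) | spaces | Blog post
To convert Keras weights to PyTorch, see: t5-pegasus-pytorch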
2021 | Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese | Zhuosheng Zhang, et al. | arXiv | PDF
2021 | PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation | Wei Zeng, et al. | arXiv | PDF
2021 | EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training | Hao Zhou, et al. | arXiv | PDF
2019 | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | Mike Lewis, et al. | arXiv | PDF
2019 | Unified Language Model Pre-training for Natural Language Understanding and Generation | Li Dong, et al. | arXiv | PDF
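2020 | 鱼与熊掌兼得:融合检索和生成的SimBERT模型 [Having it both ways: SimBERT, a model that combines retrieval and generation] | Jianlin Su (苏剑林) | spaces | Blog post
2021 | SimBERTv2来了!融合检索和生成的RoFormer-Sim模型 [SimBERTv2 is here: RoFormer-Sim, a model that combines retrieval and generation] | Jianlin Su (苏剑林) | spaces | Blog post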
2021 | CPM-2: Large-scale Cost-effective Pre-trained Language Models | Zhengyan Zhang, et al. | arXiv | PDF
2021 | CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation | Yunfan Shao, et al. | arXiv | PDF
2022 | GLM: General Language Model Pretraining with Autoregressive Blank Infilling | Zhengxiao Du, et al. | arXiv | PDF
2021 | WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training | Yuqi Huo, et al. | arXiv | PDF
2021 | CogView: Mastering Text-to-Image Generation via Transformers | Ming Ding, et al. | arXiv | PDF
2021 | Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese | Zhuosheng Zhang, et al. | arXiv | PDF
2022 | Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework | Chunyu Xie, et al. | arXiv | PDF
2021 | Learning Transferable Visual Models From Natural Language Supervision | Alec Radford, et al. | arXiv | PDF
2021 | Improving Text-to-SQL with Schema Dependency Learning | Binyuan Hui, et al. | arXiv | PDF
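2022.07.10 Added Chinese-CLIP, a Chinese version of the CLIP model trained on large-scale Chinese data (about 200 million image-text pairs), intended to help users with cross-modal retrieval and image representation in Chinese.
2022.06.29 Added ERNIE 3.0: large-scale knowledge-enhanced pretraining for language understanding and generation.
2022.06.22 Added Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework. The vision-language pretraining framework R2D2 is trained on the large-scale Chinese cross-modal benchmark dataset Zero and targets large-scale cross-modal learning.
2022.06.15 Added GLM: General Language Model Pretraining with Autoregressive Blank Infilling, which proposes a new General Language Model (GLM). It is pretrained with an autoregressive blank-infilling objective and can be fine-tuned for a range of natural language understanding and generation tasks.
2022.05.16 Added GAU-α, which introduces GAU (Gated Attention Unit), a new design that fuses the attention layer and the FFN layer. GAU is the key to making the new model faster, cheaper, and better, and it leaves the whole model with a single layer type, which is also more elegant.
2022.03.27 Added RoFormer-V2, an upgraded RoFormer that gains speed mainly by simplifying the architecture and improves quality by combining unsupervised and supervised pretraining, a win for both speed and quality.
2022.03.02 Added MobileBERT, a slimmer version of BERT-large that uses a bottleneck structure and carefully balances self-attention against the feed-forward network.
2022.02.24 Added PERT: Pre-Training BERT with Permuted Language Model, a pretrained model based on a permuted language model that learns text semantics in a self-supervised way without introducing the [MASK] token.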
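2021.12.06 Added SDCUP: Improving Text-to-SQL with Schema Dependency Learning. AliceMind, the deep language model family from DAMO Academy, released SDCUP, described as the first table pretraining model in the Chinese community.
2021.11.27 Added RWKV, a Chinese pretrained generative model similar to GPT-2; reference implementation: RWKV-LM
2021.11.27 Added the Fengshenbang (封神榜) series of language models open-sourced by IDEA Research Institute, including Erlangshen (二郎神), Zhouwenwang (周文王), Wenzhong (闻仲), and Yuyuan (余元).
2021.11.25 Added MC-BERT: Conceptualized Representation Learning for Chinese Biomedical Text Mining, a Chinese pretrained model for the biomedical domain.
2021.11.24 Added TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning, a pretrained model that uses token-aware contrastive learning.
2021.10.18 Added Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese, the Mengzi model series developed with methods such as incorporating linguistic information and accelerating training.
2021.10.14 Added Chinese BART, a reasonably reliable Chinese BART that provides a baseline for Chinese generation tasks such as summarization.
2021.10.14 Added CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation, a Chinese pretrained model for both understanding and generation.
2021.10.13 Added the Zidong Taichu (紫东太初) multimodal large model, described as the world's first image-text-audio multimodal pretrained model. It builds a tri-modal pretrained large model with a unified representation across vision, text, and speech.
2021.09.19 Added CogView: Mastering Text-to-Image Generation via Transformers, described as the world's largest Chinese multimodal generative model. It supports downstream tasks in multiple domains built on text-to-image generation.
2021.09.10 Added WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, described as the first general-purpose, large-scale Chinese image-text multimodal pretrained model.
2021.09.10 Added EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training, an open-domain Chinese dialogue pretrained model.
2021.08.19 Added Chinese-Transformer-XL: a GPT-3-style model trained on the Chinese pretraining corpus WuDaoCorpus (290 GB).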
2021.08.16 Added CPM-2: Large-scale Cost-effective Pre-trained Language Models
2021.08.16 Added Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models
2021.07.19 Added roformer-sim-v2: a version enhanced with labeled data
2021.07.15 Added BERT-CCPoem: BERT trained on a classical Chinese poetry corpus
2021.07.06 Added ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
2021.06.22 Added StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
2021.06.14 Added RoFormer: Enhanced Transformer with Rotary Position Embedding
2021.05.25 Added ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding
2021.04.28 Added PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
2021.03.16 Added T5-PEGASUS: an open-source Chinese generative pretrained model
2021.03.09 Added the UER series of models
2021.03.04 Added WoBERT: a Chinese pretrained model based on word granularity
2020.11.11 Initialized the repository with the BERT series of models (BERT)