sinovation / zen
A BERT-based Chinese Text Encoder Enhanced by N-gram Representations
License: Apache License 2.0
args.bert_model needs a path to pre-trained BERT weights, but I don't know how to locate them because they are cached automatically.
Hi~
1. Is ZEN trained from a base BERT (e.g. Google's) or from scratch? If from scratch, I guess the n-gram embeddings are randomly initialized; if from a base BERT, are the n-gram embeddings perhaps the average of the embeddings of the characters they contain?
2. The paper says "We use the same parameter setting for the n-gram encoder as in BERT". Are the parameters of the n-gram encoder shared with the BERT tower (maybe the bottom six layers?), or are they initialized and trained independently?
Thank you~
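Hello, I would like to ask which tool ZEN used to build its n-gram dictionary. I want to build an n-gram dictionary from text in my own domain, but I don't know a good way to do it.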
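For readers with the same question, here is a minimal sketch of one simple approach: count character n-grams in a corpus and keep those above a frequency threshold. This is not necessarily the method the ZEN authors used. The function names, the thresholds, and the "ngram,frequency" line format are all assumptions and should be checked against the ngram.txt shipped with the released model.

```python
import re
from collections import Counter


def count_ngrams(lines, min_n=2, max_n=4, min_freq=2):
    """Count character n-grams and keep those seen at least min_freq times.

    min_n, max_n and min_freq are illustrative defaults, not ZEN's settings.
    """
    counts = Counter()
    for line in lines:
        # Work on runs of CJK characters so n-grams never span punctuation.
        for segment in re.findall(r"[\u4e00-\u9fff]+", line):
            for n in range(min_n, max_n + 1):
                for i in range(len(segment) - n + 1):
                    counts[segment[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def format_lexicon(freqs):
    """Render one 'ngram,frequency' line per entry (assumed file format)."""
    items = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{gram},{count}" for gram, count in items]


corpus = ["我爱北京天安门。", "北京天安门很大"]
freqs = count_ngrams(corpus)
lexicon_lines = format_lexicon(freqs)
```

A pure frequency cut-off tends to keep many fragments that are not real words; a statistical filter such as pointwise mutual information could be layered on top of the counts if cleaner entries are needed.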
Excuse me, does the "fine-tuned model for NER" mean that I can test directly on any of the datasets (OntoNotes, Resume, MSRA, ...) without training, or only on MSRA without training?
Thanks!
python run_token_level_classification.py \
    --task_name cwsmsra \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir data/msra_ner \
    --bert_model data/ZEN_pretrain_base_v0.1.0 \
    --max_seq_length 256 \
    --train_batch_size 96 \
    --num_train_epochs 30 \
    --warmup_proportion 0.1
For example, I want to run the fine-tuning above, but what format should the training data for this task (cwsmsra) be in, and where can I get it conveniently?
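As a rough illustration of what character-level training data for word segmentation often looks like, the sketch below converts space-segmented sentences into one "character<TAB>tag" line per character, with a blank line between sentences. The B/I/E/S tag set, the tab separator, and the helper names are assumptions made for illustration; the exact format expected by run_token_level_classification.py should be confirmed against the repository's data processors.

```python
def words_to_tags(words):
    """Map a list of words to (character, tag) pairs using B/I/E/S tags."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))  # single-character word
        else:
            pairs.append((word[0], "B"))  # first character of the word
            pairs.extend((ch, "I") for ch in word[1:-1])  # inner characters
            pairs.append((word[-1], "E"))  # last character of the word
    return pairs


def to_tsv(sentences):
    """Convert space-segmented sentences to tab-separated text.

    Sentences are separated by a blank line (assumed convention).
    """
    blocks = []
    for sentence in sentences:
        pairs = words_to_tags(sentence.split())
        blocks.append("\n".join(f"{ch}\t{tag}" for ch, tag in pairs))
    return "\n\n".join(blocks) + "\n"


example = to_tsv(["我 喜欢 天安门"])
```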
We have the same configuration (NVIDIA Tesla V100 GPUs with 16GB memory) and plan to switch to Baidu Baike for pre-training. Roughly how long does one epoch take?
Also, which fine-tuning setup is best for semantic similarity matching? Thanks~
First of all, thanks a lot for your open-source contribution.
Could you please provide some Python scripts for converting the original official dataset formats to TSV? For example, XML to TSV for the MSRA NER task. That would make your project much more convenient to use.
Thanks a lot again.
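As the title says, I have been trying to download for a long time without success.
Hello, and thank you for your work! After downloading a Chinese dataset, THUCNew, what should I do (or which command should I use) so that ./examples/create_pre_train_data.py runs properly and generates a correct training set?
I look forward to your reply. Many thanks!
Can the classification task be run directly, or is fine-tuning required?
I downloaded all the data and ran this directly, which gave an error: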
python run_sequence_level_classification.py \
    --task_name ChnSentiCorp \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir /path/to/dataset/ChnSentiCorp \
    --bert_model /path/to/zen_model \
    --max_seq_length 512 \
    --train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 30.0
07/20/2020 22:14:06 - INFO - ZEN.tokenization - loading vocabulary file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/vocab.txt
07/20/2020 22:14:06 - INFO - ZEN.ngram_utils - loading ngram frequency file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/ngram.txt
07/20/2020 22:14:08 - INFO - ZEN.modeling - loading weights file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/pytorch_model.bin
07/20/2020 22:14:08 - INFO - ZEN.modeling - loading configuration file /data/ceph/arikchen/TitleScoring_withData/zen_ngram/ZEN_ft_NLI_v0.1.0/config.json
07/20/2020 22:14:08 - INFO - ZEN.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_hidden_word_layers": 6,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128,
"word_size": 104089
}
Traceback (most recent call last):
  File "examples/run_sequence_level_classification.py", line 396, in <module>
    main()
  File "examples/run_sequence_level_classification.py", line 361, in main
    if task_name not in processors:
  File "/data/anaconda3/lib/python3.6/site-packages/ZEN-0.1.0-py3.6.egg/ZEN/modeling.py", line 839, in from_pretrained
RuntimeError: Error(s) in loading state_dict for ZenForSequenceClassification:
    size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
    size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
sh-4.2$
thanks a lot.
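The paths in the log point at a checkpoint named ZEN_ft_NLI_v0.1.0, whose classifier has 3 outputs, while the model built for this run has a 2-output classifier. One likely remedy is to point --bert_model at the pre-trained (not fine-tuned) model. Another option, sketched below under assumptions, is to drop the classifier weights from the checkpoint before loading so that a fresh head is trained. The sketch uses plain dictionaries of shapes so it runs without PyTorch; the helper names are hypothetical and not part of ZEN.

```python
def mismatched_keys(checkpoint_shapes, model_shapes):
    """Return the parameter names whose shapes differ between the two maps."""
    return sorted(
        name
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    )


def drop_head(state_dict, prefixes=("classifier.",)):
    """Return a copy of state_dict without entries under the given prefixes."""
    prefixes = tuple(prefixes)
    return {k: v for k, v in state_dict.items() if not k.startswith(prefixes)}


# Shapes taken from the error message above; "encoder.weight" is a
# made-up stand-in for the rest of the model.
checkpoint = {
    "encoder.weight": (768, 768),
    "classifier.weight": (3, 768),
    "classifier.bias": (3,),
}
model = {
    "encoder.weight": (768, 768),
    "classifier.weight": (2, 768),
    "classifier.bias": (2,),
}

bad = mismatched_keys(checkpoint, model)
filtered = drop_head(checkpoint)
```

With real tensors, the filtered dictionary would typically be passed to the model's load_state_dict with strict=False, which leaves the classifier at its fresh initialization. Whether ZEN's from_pretrained offers a hook for this is not confirmed here.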
Is it obtained only from the pre-training corpus and the downstream task corpora, or is some other unsupervised corpus used as well?
I trained for 30 epochs on a classification dataset, following the run_sequence_level_classification.py instructions in examples/README.md, and got poor performance (test acc < 50), well below what I expected (test acc > 80 with BERT).
Or should I fine-tune for only 3 epochs, as when fine-tuning BERT?
Hi, this is nice work!
Could you give some more details about the hyperparameters used in pre-training?
ZEN (P) is trained on top of Google BERT. How many epochs were used in the additional pre-training?
Thanks!
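Thank you for open-sourcing this.
I would like to know how to process my own dataset into N-gram.txt.
Where can I get the n-gram dictionary? Could you provide a link? Many thanks!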
An error occurs when running python run_pre_train.py:
Traceback (most recent call last):
  File "run_pre_train.py", line 33, in <module>
    from ZEN import WEIGHTS_NAME, CONFIG_NAME
ModuleNotFoundError: No module named 'ZEN'
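This error usually means Python cannot find the ZEN package on its import path. The other log in this list shows ZEN loaded from a ZEN-0.1.0-py3.6.egg under site-packages, which suggests the package is normally installed before the example scripts are run. As a stopgap, a script can add the repository root to sys.path itself. The sketch below assumes the ZEN package directory sits directly under that root; the helper name is hypothetical.

```python
import sys
from pathlib import Path


def ensure_on_path(repo_root):
    """Put repo_root at the front of sys.path if it is not already listed."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# Example: when running from the examples/ directory, the repository root
# would be the parent directory (an assumption about the layout).
root = ensure_on_path("..")
```

Calling this before the line `from ZEN import WEIGHTS_NAME, CONFIG_NAME` should let the import resolve, provided the assumed layout holds.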
The original website http://zen.chuangxin.com/ZEN/models/ZEN_pretrain_base_v0.1.0.zip is too slow to download.
Excuse me, could you please release the MSRA data? I can't download it from http://sighan.cs.uchicago.edu/bakeoff2006
Thanks!!!
Thank you for ZEN! Researchers now have another great choice of pre-trained NLP model. We have seen that ZEN compares favorably with BERT on many NLP tasks. Would you like to evaluate ZEN on the CLUE benchmark?
Our group, CLUE, is also devoted to promoting the progress of Chinese NLP. We have chosen 9 representative Chinese tasks, and the leaderboard is now open (including human performance).
We hope to see ZEN on this leaderboard : )
CLUE Group: https://github.com/CLUEbenchmark/CLUE
CLUE Benchmark: https://www.cluebenchmarks.com/
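As I said, thanks.
Hello,
I really like your work!
We would like to follow up on it, so we want to know which BERT implementation ZEN is based on. That would also make it easier for us to use BERT for comparison later.
Could you share the BERT implementation you used, with a link?
Thanks!
As the title says.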