supercoderhawk / deep-keyphrase


A set of seq2seq-based keyphrase generation models, including CopyRNN, CopyCNN and CopyTransformer

Python 99.00% Shell 1.00%

seq2seq keyphrase-generation keyphrase-extraction keyword-extraction copynet pytorch

deep-keyphrase's Introduction

deep-keyphrase

Implementations of several keyphrase generation algorithms

Description

Implemented Papers

CopyRNN

Deep Keyphrase Generation (Meng et al., 2017)

ToDo List

CopyCNN

CopyTransformer

Usage

required files (4 files in total)

vocab_file: one word per line (do not include an index):
```
this
paper
proposes
```
training, validation and test files
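
A vocab file in the one-word-per-line layout described above can be built from tokenized records. The sketch below is only an illustration: the helpers `build_vocab` and `write_vocab`, the frequency ordering and the `max_size` parameter are assumptions, not part of deep-keyphrase. Whether special tokens (padding, unknown and so on) must appear in the file is not stated here, so the sketch does not add any.

```python
from collections import Counter


def build_vocab(token_lists, max_size=None):
    """Return words ordered by descending frequency (ties broken alphabetically)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in ordered]
    return words if max_size is None else words[:max_size]


def write_vocab(words, path):
    """Write one word per line, with no index column."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")
```

For example, `write_vocab(build_vocab(token_lists, max_size=50000), 'data/vocab_kp20k.txt')` would write at most 50000 words; both the size and the path are placeholders.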

data format for the training, validation and test files

JSON Lines format; every line is a dict:

{'tokens': ['this', 'paper', 'proposes', 'using', 'virtual', 'reality', 'to', 'enhance', 'the', 'perception', 'of', 'actions', 'by', 'distant', 'users', 'on', 'a', 'shared', 'application', '.', 'here', ',', 'distance', 'may', 'refer', 'either', 'to', 'space', '(', 'e.g.', 'in', 'a', 'remote', 'synchronous', 'collaboration', ')', 'or', 'time', '(', 'e.g.', 'during', 'playback', 'of', 'recorded', 'actions', ')', '.', 'our', 'approach', 'consists', 'in', 'immersing', 'the', 'application', 'in', 'a', 'virtual', 'inhabited', '3d', 'space', 'and', 'mimicking', 'user', 'actions', 'by', 'animating', 'avatars', '.', 'we', 'illustrate', 'this', 'approach', 'with', 'two', 'applications', ',', 'the', 'one', 'for', 'remote', 'collaboration', 'on', 'a', 'shared', 'application', 'and', 'the', 'other', 'to', 'playback', 'recorded', 'sequences', 'of', 'user', 'actions', '.', 'we', 'suggest', 'this', 'could', 'be', 'a', 'low', 'cost', 'enhancement', 'for', 'telepresence', '.'] ,
'keyphrases': [['telepresence'], ['animation'], ['avatars'], ['application', 'sharing'], ['collaborative', 'virtual', 'environments']]}
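
The example above is displayed as a Python dict literal (single quotes, wrapped over two lines). If the files are parsed with a standard JSON parser, which is an assumption here, each record needs double-quoted strings and must sit on a single physical line. A minimal sketch for writing and re-reading such files follows; the helper names `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json


def write_jsonl(records, path):
    """Serialize each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read records back, checking the two fields described above."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            for field in ("tokens", "keyphrases"):
                if not isinstance(record.get(field), list):
                    raise ValueError(f"line {line_no}: '{field}' must be a list")
            records.append(record)
    return records
```

Using `json.dumps` rather than `str(record)` avoids the single-quote form, which `json.loads` rejects.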

Training

Download the kp20k dataset

mkdir data
mkdir data/raw
mkdir data/raw/kp20k_new
# !! please unzip the kp20k data and put the files into the folder above manually
python -m nltk.downloader punkt
bash scripts/prepare_kp20k.sh
bash scripts/train_copyrnn_kp20k.sh

# start tensorboard
# enter the experiment result dir; the suffix is the time the experiment started
cd data/kp20k/copyrnn_kp20k_basic-20191212-080000
# start the tensorboard service
tensorboard --bind_all --logdir logs --port 6006

Notes

Compared with the original seq2seq-keyphrase-pytorch:
1. fixes implementation errors:
  
  the copy mechanism
  
  training and inference did not match (training had no input feeding, while inference did)
2. easier data preparation
3. TensorBoard support
4. faster beam search (6x faster on CPU and more than 10x faster on GPU)


deep-keyphrase's Issues

generation_logits = torch.cat([generation_logits, generation_oov_logits], dim=1) RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 49999 but got size 100 for tensor number 1 in the list.

I ran into this problem. Has anyone else encountered it?
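
The message follows from the rule `torch.cat` enforces: all tensors must agree in every dimension except the one being concatenated. With `dim=1`, "Expected size 49999 but got size 100" therefore points to a mismatch in another dimension, not in the axis being joined. The pure-Python sketch below only mirrors that shape rule for illustration; the helper name `cat_shape` and the example shapes are assumptions, and it does not identify which tensor in deep-keyphrase has the unexpected shape.

```python
def cat_shape(shapes, dim):
    """Return the shape produced by concatenating tensors of `shapes` along `dim`."""
    first = shapes[0]
    total = 0
    for idx, shape in enumerate(shapes):
        if len(shape) != len(first):
            raise ValueError(
                f"tensor {idx} has {len(shape)} dims, expected {len(first)}")
        for axis, (expected, got) in enumerate(zip(first, shape)):
            # Only the concatenation axis is allowed to differ.
            if axis != dim and expected != got:
                raise ValueError(
                    f"sizes must match except in dimension {dim}: "
                    f"expected {expected} but got {got} for tensor {idx}")
        total += shape[dim]
    return first[:dim] + (total,) + first[dim + 1:]
```

With hypothetical shapes `(32, 50000)` and `(32, 100)`, `cat_shape(..., dim=1)` gives `(32, 50100)`. Shapes such as `(49999, 1)` and `(100, 1)` raise an error like the one reported, so printing `generation_logits.shape` and `generation_oov_logits.shape` just before the call is a reasonable first check.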

error

AttributeError: 'Namespace' object has no attribute 'fix_batch_size'

OSError: [Errno 12] Cannot allocate memory

[2021-03-18 09:34:44,921] [train] destination dir:/home/yons/deep-keyphrase/data/kp20k/copyrnn_kp20k_basic-20210318-093444/
[2021-03-18 09:35:14,404] [train] exception occurred
[2021-03-18 09:35:14,406] [train] Traceback (most recent call last):
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/base_trainer.py", line 70, in train
self.train_func()
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/base_trainer.py", line 92, in train_func
for batch_idx, batch in enumerate(self.train_loader):
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/dataloader.py", line 55, in __iter__
return iter(KeyphraseDataIterator(self))
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/dataloader.py", line 129, in __init__
worker.start()
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Why does this error occur?
I tried changing worker_num to 0 and to 1; neither worked.
Setting the optimizer to SGD, the batch size to 32, and max_length to 600 did not help either.

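Error when running the script

Hello, I get an error when running the script bash scripts/train_copyrnn_kp20k.sh. The error is as follows:
it reports that memory has run out. Is my memory sufficient? I then reduced the batch size to 32, but the same error still occurs.
Is there any way to solve this? Thanks.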

Error when running train

2020-07-19 09:20:26.977419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
[2020-07-19 09:20:30,777] [train] destination dir:../destination/json-20200719-092026/
[2020-07-19 09:21:06,466] [train] exception occurred
[2020-07-19 09:21:06,494] [train] Traceback (most recent call last):
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\base_trainer.py", line 70, in train
self.train_func()
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\base_trainer.py", line 92, in train_func
for batch_idx, batch in enumerate(self.train_loader):
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\dataloader.py", line 55, in __iter__
return iter(KeyphraseDataIterator(self))
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\dataloader.py", line 129, in __init__
worker.start()
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle generator objects

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
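
The first traceback ends inside `reduction.dump(process_obj, to_child)`: on Windows, `multiprocessing` starts workers with the `spawn` method, which pickles the process object and everything it references before handing it to the child. Generator objects cannot be pickled, which matches the `TypeError`; the `EOFError` in the child is most likely a side effect of the parent failing partway through. Below is a minimal sketch of the failure and of one common workaround: pass the worker something picklable, such as a file path, and create the generator inside the worker. The names `iter_lines`, `is_picklable` and `worker_main` are hypothetical and not taken from deep-keyphrase.

```python
import pickle


def iter_lines(path):
    """Generator that yields stripped lines from a text file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def is_picklable(obj):
    """Return True if `obj` survives pickling, as spawn-based workers require."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError, AttributeError):
        return False


def worker_main(path, queue):
    """Worker entry point: build the generator here, inside the child process."""
    for line in iter_lines(path):
        queue.put(line)
    queue.put(None)  # sentinel: no more data
```

Here `is_picklable(iter_lines('train.jsonl'))` is false while `is_picklable('train.jsonl')` is true, so a process created with `args=(path, queue)` can be started under `spawn`, whereas one holding a live generator cannot.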

# Your model path
model_path = 'data/kp20k/copyrnn_kp20k_basic-20210318-152102/copyrnn_kp20k_basic_epoch_3_batch_1355000.model'
# your vocab path
vocab_path = 'data/vocab_kp20k.txt'
keyword_generator = CopyRnnPredictor({'model': model_path},
                                     vocab_info=vocab_path,
                                     beam_size=50,
                                     max_target_len=5,
                                     max_src_length=800)

# test some single cases, or use as component in online service
tokens = ['numerous', 'studies', 'have', 'demonstrated', 'that', 'h2o2-induced', 'apoptosis', 'is', 'mediated', 'by', 'activation', 'of', 'mapks']
keyword_generator.predict([tokens], delimiter=' ', tokenized=True)

# evaluate file
from munch import Munch
src_filename = 'data/kp20k.test.jsonl'
dest_filename = 'data/kp20k_pred.jsonl'
config = read_json('data/kp20k/copyrnn_kp20k_basic-20210318-152102/copyrnn_kp20k_basic_epoch_3_batch_1355000.json')
keyword_generator.eval_predict(src_filename, dest_filename, args=Munch(config))

This successfully produced the kp20k_pred.jsonl file. How should the final results be evaluated?
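
One common way to score keyphrase predictions is precision, recall and F1 over the top-k predicted phrases against the gold phrases; the CopyRNN paper reports F1@5 and F1@10. The sketch below makes several assumptions: phrases are token lists as in the data format above, matching is exact after lowercasing, and no stemming is applied (published results typically stem both sides first). The layout of `kp20k_pred.jsonl` is not described here, so the function takes plain lists, and the name `f1_at_k` is hypothetical.

```python
def f1_at_k(predicted, gold, k):
    """Precision, recall and F1 of the top-k predicted phrases against gold."""
    def norm(phrase):
        return " ".join(tok.lower() for tok in phrase)

    gold_set = {norm(p) for p in gold}
    top_k = []
    for phrase in predicted:
        if len(top_k) >= k:
            break
        key = norm(phrase)
        if key not in top_k:  # drop duplicates, keep ranking order
            top_k.append(key)
    correct = sum(1 for p in top_k if p in gold_set)
    # Precision is taken over the phrases actually returned (at most k).
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Averaging the per-document F1 over the test set gives a macro-averaged score. Some evaluations divide precision by k even when fewer than k phrases are predicted, so the convention should be fixed before comparing numbers.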