supercoderhawk / deep-keyphrase


A set of seq2seq-based keyphrase generation models, including CopyRNN, CopyCNN and CopyTransformer

Python 99.00% Shell 1.00%
seq2seq keyphrase-generation keyphrase-extraction keyword-extraction copynet pytorch

deep-keyphrase's Introduction

deep-keyphrase

Implementations of several keyphrase generation algorithms

Description

Implemented Paper

CopyRNN

Deep Keyphrase Generation (Meng et al., 2017)

ToDo List

CopyCNN

CopyTransformer

Usage

Required files (4 files in total)

  1. vocab_file: one word per line (do not include an index):

    this
    paper
    proposes
  2. training, validation and test files
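
As a rough illustration of the vocabulary file layout described above, the following sketch builds a frequency-ordered word list from tokenized documents and writes it one word per line, with no index column. The helper names `build_vocab_words` and `write_vocab_file` are hypothetical and not part of this repository, and whether the project expects special tokens (padding, unknown word, and so on) in the file is not stated here, so none are added.

```python
from collections import Counter
from pathlib import Path


def build_vocab_words(token_lists, max_size=None):
    """Return words ordered by descending frequency (ties broken alphabetically)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in ordered]
    return words if max_size is None else words[:max_size]


def write_vocab_file(words, path):
    """Write one word per line, without any index column."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
```

For example, `build_vocab_words([["this", "paper"], ["this", "proposes"]])` yields the three words in the order `this`, `paper`, `proposes`.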

Data format for the training, validation and test files

JSON Lines format; every line is a dict:

{'tokens': ['this', 'paper', 'proposes', 'using', 'virtual', 'reality', 'to', 'enhance', 'the', 'perception', 'of', 'actions', 'by', 'distant', 'users', 'on', 'a', 'shared', 'application', '.', 'here', ',', 'distance', 'may', 'refer', 'either', 'to', 'space', '(', 'e.g.', 'in', 'a', 'remote', 'synchronous', 'collaboration', ')', 'or', 'time', '(', 'e.g.', 'during', 'playback', 'of', 'recorded', 'actions', ')', '.', 'our', 'approach', 'consists', 'in', 'immersing', 'the', 'application', 'in', 'a', 'virtual', 'inhabited', '3d', 'space', 'and', 'mimicking', 'user', 'actions', 'by', 'animating', 'avatars', '.', 'we', 'illustrate', 'this', 'approach', 'with', 'two', 'applications', ',', 'the', 'one', 'for', 'remote', 'collaboration', 'on', 'a', 'shared', 'application', 'and', 'the', 'other', 'to', 'playback', 'recorded', 'sequences', 'of', 'user', 'actions', '.', 'we', 'suggest', 'this', 'could', 'be', 'a', 'low', 'cost', 'enhancement', 'for', 'telepresence', '.'] ,
'keyphrases': [['telepresence'], ['animation'], ['avatars'], ['application', 'sharing'], ['collaborative', 'virtual', 'environments']]}
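
The record above is shown in Python dict notation with single quotes. If the files are read with a standard JSON parser, each record would need to be serialized with double quotes and kept on a single line. A minimal sketch of writing and checking such a line follows; `to_json_line` and `parse_json_line` are hypothetical helpers, and the assumption that the loader uses strict JSON parsing is not confirmed by this README.

```python
import json


def to_json_line(tokens, keyphrases):
    """Serialize one record as a single JSON line (no embedded newline)."""
    record = {"tokens": tokens, "keyphrases": keyphrases}
    return json.dumps(record, ensure_ascii=False)


def parse_json_line(line):
    """Parse one line and check that it has the two fields described above."""
    record = json.loads(line)
    tokens = record.get("tokens")
    keyphrases = record.get("keyphrases")
    if not isinstance(tokens, list):
        raise ValueError("'tokens' must be a list of strings")
    if not isinstance(keyphrases, list) or not all(isinstance(p, list) for p in keyphrases):
        raise ValueError("'keyphrases' must be a list of token lists")
    return record
```

Usage: `parse_json_line(to_json_line(["this", "paper"], [["telepresence"]]))` returns the original dict.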

Training

Download the kp20k dataset

mkdir data
mkdir data/raw
mkdir data/raw/kp20k_new
# !! please unzip the kp20k data and put the files into the folder above manually
python -m nltk.downloader punkt
bash scripts/prepare_kp20k.sh
bash scripts/train_copyrnn_kp20k.sh

# start tensorboard
# enter the experiment result dir; the suffix is the time the experiment started
cd data/kp20k/copyrnn_kp20k_basic-20191212-080000
# start the tensorboard service
tensorboard --bind_all --logdir logs --port 6006

Notes

  1. Compared with the original seq2seq-keyphrase-pytorch:
    1. fixes implementation errors:
      1. copy mechanism
      2. training and inference were inconsistent (training did not use input feeding, while inference did)
    2. easier data preparation
    3. tensorboard support
    4. faster beam search (6x faster on CPU and more than 10x faster on GPU)
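
For readers unfamiliar with the beam search mentioned in the last item, here is a minimal, generic sketch in plain Python. It is not this repository's implementation and says nothing about how the reported speedups were achieved. The `step_fn` callback is a hypothetical stand-in for a decoder: it takes the current prefix and returns a dict mapping each candidate next token to its probability.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size, max_len):
    """Return finished (sequence, log_prob) pairs, best first."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # Expand every live hypothesis by one token.
            for token, prob in step_fn(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + [token], score + math.log(prob)))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return sorted(finished, key=lambda item: item[1], reverse=True)
```

Hypotheses that are still unfinished after `max_len` steps are dropped in this sketch; a real decoder may prefer to keep them.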

deep-keyphrase's People

Contributors

gokyori, supercoderhawk

deep-keyphrase's Issues

error

AttributeError: 'Namespace' object has no attribute 'fix_batch_size'

OSError: [Errno 12] Cannot allocate memory

[2021-03-18 09:34:44,921] [train] destination dir:/home/yons/deep-keyphrase/data/kp20k/copyrnn_kp20k_basic-20210318-093444/
[2021-03-18 09:35:14,404] [train] exception occurred
[2021-03-18 09:35:14,406] [train] Traceback (most recent call last):
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/base_trainer.py", line 70, in train
self.train_func()
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/base_trainer.py", line 92, in train_func
for batch_idx, batch in enumerate(self.train_loader):
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/dataloader.py", line 55, in __iter__
return iter(KeyphraseDataIterator(self))
File "/home/yons/.virtualenvs/kpg/lib/python3.7/site-packages/deep_keyphrase/dataloader.py", line 129, in __init__
worker.start()
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/home/yons/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

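Why does this error occur?
I tried changing worker_num to 0 and to 1, and neither worked.
Setting the optimizer to SGD, the batch size to 32, and max_length to 600 did not help either.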

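Error when running the script

Hello, I get an error when running the script bash scripts/train_copyrnn_kp20k.sh.
It reports that memory has run out. Is my memory sufficient? I then reduced the batch size to 32, but the same error is still reported.
Is there any way to solve this? Thanks.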

Error when running train

2020-07-19 09:20:26.977419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
[2020-07-19 09:20:30,777] [train] destination dir:../destination/json-20200719-092026/
[2020-07-19 09:21:06,466] [train] exception occurred
[2020-07-19 09:21:06,494] [train] Traceback (most recent call last):
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\base_trainer.py", line 70, in train
self.train_func()
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\base_trainer.py", line 92, in train_func
for batch_idx, batch in enumerate(self.train_loader):
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\dataloader.py", line 55, in __iter__
return iter(KeyphraseDataIterator(self))
File "D:\PycharmProject\DeepLearning\deep_keyphrase\deep_keyphrase\dataloader.py", line 129, in __init__
worker.start()
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle generator objects

Traceback (most recent call last):
File "", line 1, in
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Software\Anaconda3\envs\tf-gpu\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
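
The first traceback above ends inside `popen_spawn_win32.py`, where the `spawn` start method pickles the process object before sending it to the child. Generator objects cannot be pickled, so any generator reachable from that object triggers the `TypeError`. The sketch below reproduces only this underlying restriction; which attribute of the data loader holds the generator is not visible in the traceback, and the names `token_stream` and `can_pickle` are hypothetical.

```python
import pickle


def token_stream():
    """A generator standing in for any lazily produced data."""
    yield "this"
    yield "paper"


def can_pickle(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True
```

Here `can_pickle(token_stream())` is false, while `can_pickle(list(token_stream()))` is true, which suggests that materializing the data before a worker process is started would avoid this particular failure.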

How to run test?

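Hello,
I followed your steps and got everything running, but I have two questions:
1) How do I enable the GPU?
2) How do I run validation on the test data?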

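How long does one training run take?

Is it expected that no information is printed during training? The only message so far is [2021-03-17 16:47:03,133] [train] destination dir:/home/yons/deep-keyphrase/data/kp20k/copyrnn_kp20k_basic-20210317-164703/. Is that normal? Is all the information only visualized with tensorboardX after training finishes?

Also, when training copyrnn I got an error saying the backend parameter could not be found. I found that train.py does not add this parameter, while train_tf adds it with tf as the default, yet training needs to check whether the model is a torch or a tf model. So I added the parameter to train.py with torch as the default and trained again. Is that okay?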

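About the final test and evaluation

In your answer to the "How to run test?" issue you attached some code for testing and said to run copy_rnn/predict.py. However, that file has no main function, and predict_kp20k.sh runs predict_runner.py instead. So I changed the main function of predict_runner.py to: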

# Your model path
model_path = 'data/kp20k/copyrnn_kp20k_basic-20210318-152102/copyrnn_kp20k_basic_epoch_3_batch_1355000.model'
# your vocab path
vocab_path = 'data/vocab_kp20k.txt'
keyword_generator = CopyRnnPredictor({'model': model_path},
                                     vocab_info=vocab_path,
                                     beam_size=50,
                                     max_target_len=5,
                                     max_src_length=800)

# test some single cases, or use as component in online service
tokens = ['numerous', 'studies', 'have', 'demonstrated', 'that', 'h2o2-induced', 'apoptosis', 'is', 'mediated', 'by', 'activation', 'of', 'mapks']
keyword_generator.predict([tokens], delimiter=' ', tokenized=True)

# evaluate file
from munch import Munch
src_filename = 'data/kp20k.test.jsonl'
dest_filename = 'data/kp20k_pred.jsonl'
config = read_json('data/kp20k/copyrnn_kp20k_basic-20210318-152102/copyrnn_kp20k_basic_epoch_3_batch_1355000.json')
keyword_generator.eval_predict(src_filename, dest_filename, args=Munch(config))

This successfully produced the kp20k_pred.jsonl file. How can the final results then be evaluated?
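
One common way to score keyphrase predictions is F1@k, which compares the top k predicted phrases with the gold phrases for each document. The sketch below computes it for a single document by exact match on token lists. It is an illustration under assumptions: the layout of kp20k_pred.jsonl is not described on this page, `f1_at_k` is a hypothetical helper and not a function of this project, and published evaluations often stem phrases before matching, which is omitted here.

```python
def f1_at_k(predicted, gold, k):
    """F1 between the top-k unique predicted phrases and the gold phrases.

    predicted: ranked list of phrases, each a list of tokens
    gold: list of gold phrases, each a list of tokens
    """
    top_k = []
    seen = set()
    for phrase in predicted:
        key = tuple(phrase)
        if key not in seen:  # ignore duplicate predictions
            seen.add(key)
            top_k.append(key)
        if len(top_k) == k:
            break
    gold_set = {tuple(phrase) for phrase in gold}
    if not top_k or not gold_set:
        return 0.0
    correct = sum(1 for phrase in top_k if phrase in gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Averaging this value over all test documents gives a corpus-level (macro-averaged) score.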
