vsuthichai / paraphraser

Paraphrase generation at the sentence level

Home Page: http://pair-a-phrase.it

License: MIT License

Language: Python 100.00%

Topics: paraphrase-generation, paraphrases, paraphrase, tensorflow, bidirectional-lstm, sentence-generator, insight-data-science, insight-ai, insight-artificial-intelligence, deep-learning

paraphraser's Introduction

Paraphraser

This project provides a clean and simple API for generating paraphrases of sentences. A demo can be seen here: pair-a-phrase
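For reference, a minimal usage sketch in Python is shown below. It is based on the sample_paraphrase call that appears in the issue reports further down this page; treat the exact module, object, and argument names as assumptions rather than a documented API.

# Minimal sketch, assuming the inference module exposes a `paraphraser` object
# with a sample_paraphrase() method as reported in the issues below.
from inference import *

source_sentence = "The quick brown fox jumps over the lazy dog ."
# sampling_temp (randomness) and how_many (number of outputs) are assumed parameter meanings.
paraphrases = paraphraser.sample_paraphrase(source_sentence, sampling_temp=0.75, how_many=5)
for p in paraphrases:
    print(p)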

The paraphraser was developed under the Insight Data Science Artificial Intelligence program.

Model

The underlying model is a bidirectional LSTM encoder and LSTM decoder with attention, trained using TensorFlow. A download link is available here: paraphrase model
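The repository's actual model code is not reproduced on this page; the following is a minimal sketch of this kind of architecture using the TensorFlow 1.x seq2seq APIs that were current around version 1.4. The vocabulary size, embedding dimension, and hidden sizes are illustrative assumptions.

# Sketch only: a bidirectional LSTM encoder feeding an attention-equipped LSTM decoder.
import tensorflow as tf

VOCAB_SIZE = 30000   # assumed vocabulary size
EMBED_DIM = 300      # assumed embedding dimension
HIDDEN_DIM = 512     # assumed LSTM hidden size

# Token ids and lengths for a batch of source and target sentences.
src_ids = tf.placeholder(tf.int32, [None, None])
src_len = tf.placeholder(tf.int32, [None])
tgt_ids = tf.placeholder(tf.int32, [None, None])
tgt_len = tf.placeholder(tf.int32, [None])

embedding = tf.get_variable("embedding", [VOCAB_SIZE, EMBED_DIM])
src_emb = tf.nn.embedding_lookup(embedding, src_ids)
tgt_emb = tf.nn.embedding_lookup(embedding, tgt_ids)

# Bidirectional LSTM encoder.
fw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM)
bw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, src_emb, sequence_length=src_len, dtype=tf.float32)
encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)

# LSTM decoder with Luong-style attention over the encoder outputs.
attention = tf.contrib.seq2seq.LuongAttention(
    HIDDEN_DIM, encoder_outputs, memory_sequence_length=src_len)
dec_cell = tf.contrib.seq2seq.AttentionWrapper(
    tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM), attention)
helper = tf.contrib.seq2seq.TrainingHelper(tgt_emb, tgt_len)
decoder = tf.contrib.seq2seq.BasicDecoder(
    dec_cell, helper,
    initial_state=dec_cell.zero_state(tf.shape(src_ids)[0], tf.float32),
    output_layer=tf.layers.Dense(VOCAB_SIZE))
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
logits = outputs.rnn_output   # [batch, time, vocab] scores used for the training loss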

Prerequisites

  • python 3.5
  • TensorFlow 1.4.1
  • spacy

Inference Execution

Download the model checkpoint from the link above and run:

python inference.py --checkpoint=<checkpoint_path/model-171856>

Datasets

The dataset used to train this model is an aggregation of many different public datasets. To name a few:

  • para-nmt-5m
  • Quora question pairs
  • SNLI
  • Semeval
  • And more!

I have not included the aggregated dataset as part of this repo. If you're curious and would like to know more, contact me. Pretrained embeddings come from John Wieting's para-nmt-50m project.
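The aggregated corpus itself is not included, but based on the traceback quoted in the issues below, embeddings.py unpickles the vocabulary mappings, the embedding matrix, and the special token ids from the para-nmt-50m download. A hedged sketch of that loading step follows; the single-pickle layout and field order are assumptions inferred from that traceback.

# Sketch only: layout inferred from load_sentence_embeddings() as quoted in the issues below.
import pickle

with open("../../para-nmt-50m/data/ngram-word-concat-40.pickle", "rb") as f:
    word_to_id, idx_to_word, embedding, start_id, end_id, unk_id, mask_id = pickle.load(f)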

Training

Training was done for 2 epochs on an Nvidia GTX 1080 and evaluated with the BLEU score. The TensorBoard training curves can be seen below; the grey curve is train and the orange curve is dev.
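The project's own evaluation code is not shown on this page; purely as an illustration of BLEU scoring between reference paraphrases and model outputs, one could use NLTK (an assumption, not necessarily the tooling used here):

# Illustrative BLEU computation with NLTK; the tokens are toy examples.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["the", "disease", "causes", "the", "covid-19", "pandemic", "."]]]  # one list of references per hypothesis
hypotheses = [["the", "disease", "is", "the", "cause", "of", "the", "covid-19", "epidemic", "."]]
score = corpus_bleu(references, hypotheses, smoothing_function=SmoothingFunction().method1)
print(score)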

TODOs

  • pip installable package
  • Explore deeper models (more layers)
  • Recurrent layer dropout
  • Greater dataset augmentation
  • Try residual layers
  • Model compression
  • Byte pair encoding for out-of-set vocabulary

Citations

@inproceedings{wieting-17-millions,
    author = {John Wieting and Kevin Gimpel},
    title = {Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations},
    booktitle = {arXiv preprint arXiv:1711.05732},
    year = {2017}
}

@inproceedings{wieting-17-backtrans,
    author = {John Wieting and Jonathan Mallinson and Kevin Gimpel},
    title = {Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext},
    booktitle = {Proceedings of Empirical Methods in Natural Language Processing},
    year = {2017}
}

Additional Setup Requirements

Create the environment in the /paraphraser directory

conda env create -f env.yml

Activate the environment

conda activate paraphraser-env

Download the model checkpoint from above, rename it to "checkpoints", and place it within the /paraphraser/paraphraser directory

Download para-nmt-50m here

  • Rename it to para-nmt-50m and place it inside the /paraphraser directory

You MAY need to run the following three commands (when prompted)

conda install tensorflow==1.14
conda install spacy
python3 -m spacy download en_core_web_sm

Run the inference.py script

cd paraphraser
python inference.py --checkpoint=checkpoints/model-171856
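If the checkpoint path is rejected, note that --checkpoint should point at the model prefix (model-171856), not at an individual shard file. A typical TensorFlow checkpoint directory for this model would look roughly like the following; the file names other than the prefix are assumptions based on the standard TensorFlow checkpoint layout:

checkpoints/
    checkpoint                          # text file listing available checkpoints
    model-171856.data-00000-of-00001    # variable values
    model-171856.index                  # variable index
    model-171856.meta                   # graph definition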

paraphraser's People

Contributors: akshar-code, dependabot[bot], shellydeng, vsuthichai

paraphraser's Issues

some examples:

My prompts are after "source:"

Source: The disease is the cause of the COVID-19 pandemic.



/usr/local/lib/python3.7/dist-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)
Paraph #0: disease is caused by the cause of the disease .
Paraph #1: disease 's disease cause a covid-19 pandemic .
Paraph #2: the disease is the cause of a epidemics against covid-19 .
Paraph #3: disease is causing a flurry of meningitis for covid-19 .
Paraph #4: that disease is cause a disease of covid-19 .
Paraph #5: the disease is the cause of the covid-19 epidemic .
Paraph #6: the disease is the cause of the disease .
Paraph #7: the disease is the cause of an covid-19 pandemic .
Paraph #8: this disease is the cause of a covid-19 pandemic disease .
Paraph #9: disease are caused by covid-19 pandemic .

--------------

Source: According to an April 2020 study by the American Gastroenterological Association, COVID-19 can make sick people vomit or have diarrhea but this is rare.

/usr/local/lib/python3.7/dist-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)
Paraph #0: according to april 2020 , the american covid-19 association of america may infect patients vomit or diarrhea , but this is rare .
Paraph #1: according to the un covid-19 association , the american association is not possible to puke or receive diarrhea , but this is rare .
Paraph #2: according to april 2020 , the american association of america is not the same .
Paraph #3: according to a report by the american covid-19 association , excellent people vomit or diarrhea , but this is rare .
Paraph #4: according to april 2020 , he can arrange sick people vomit or diarrhea , but this is rare .
Paraph #5: according to the study of an american covid-19 association , an american covid-19 association can make sick people vomit or enjoy diarrhea , but this is rare .
Paraph #6: according to april 2020 , the csa association of america is giving up sick people vomit or clots , but this is rare .
Paraph #7: according to april 2020 , the u.s . level is harmful to diseases , and the disease is a waste of vomit , or that 's rare .
Paraph #8: according to april 2020 , we can do sick people vomit or diarrhea , but this is rare .
Paraph #9: according to april 2020 , the u.s . system can produce sick people vomiting or diarrhea , but this is rare .

No matching distribution found for en-core-web-sm==2.0.0 (from -r requirements.txt (line 13))

Installing the pip dependencies gives me the following error:

(env3.6) huntsman-ve501-0094:paraphraser daniel$ pip3.6 install -r requirements.txt 
  Could not find a version that satisfies the requirement en-core-web-sm==2.0.0 (from -r requirements.txt (line 13)) (from versions: )
No matching distribution found for en-core-web-sm==2.0.0 (from -r requirements.txt (line 13))

python3 paraphraser/setup.py install --> not found (or not a regular file)

running install
running bdist_egg
running egg_info
writing paraphraser.egg-info/PKG-INFO
writing dependency_links to paraphraser.egg-info/dependency_links.txt
writing top-level names to paraphraser.egg-info/top_level.txt
package init file 'paraphraser/__init__.py' not found (or not a regular file)
file paraphraser/synonym_model.py (for module paraphraser.synonym_model) not found
file paraphraser/inference.py (for module paraphraser.inference) not found
file paraphraser/download_models.py (for module paraphraser.download_models) not found

Paraphraser hangs when invoked repeatedly in a loop

I am getting this info in the console; I'm not sure where it comes from or how to make the change:

The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
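That message comes from Jupyter itself rather than from the paraphraser. A commonly used workaround, not specific to this repository, is to raise the limit when starting the notebook server:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10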

How to run the paraphraser? Meeting some problems

Traceback (most recent call last):
  File "inference.py", line 3, in <module>
    from preprocess_data import preprocess_batch
  File "/data/jquan/codes/paraphraser-master/paraphraser/preprocess_data.py", line 26, in <module>
    word_to_id, idx_to_word, embedding, start_id, end_id, unk_id, mask_id = load_sentence_embeddings()
  File "/data/jquan/codes/paraphraser-master/paraphraser/embeddings.py", line 9, in load_sentence_embeddings
    with open("../../para-nmt-50m/data/ngram-word-concat-40.pickle", 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../../para-nmt-50m/data/ngram-word-concat-40.pickle'
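Because open() resolves the path relative to the working directory, this error means there is no para-nmt-50m/data/ngram-word-concat-40.pickle two levels above the directory from which inference.py is run. A sketch of a layout that satisfies the path when running from paraphraser-master/paraphraser; directory names other than those in the traceback are assumptions:

codes/
    paraphraser-master/
        paraphraser/        # run inference.py from here
            embeddings.py
    para-nmt-50m/
        data/
            ngram-word-concat-40.pickle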

Unrecognized Arguments: Checkpoint

Whenever running the following command on Google Colab
!python /content/home/ami/paraphraser/paraphraser/inference.py --checkpoint = /content/model-171856.data-00000-of-00001

I get the following error:
usage: inference.py [-h] [--checkpoint CHECKPOINT]
inference.py: error: unrecognized arguments: /content/model-171856.data-00000-of-00001

Don't know how to address this. Runtime type: Python 3, GPU.
I have uploaded the train folder containing the model to Google Drive and mounted Google Drive on Colab to access it. I downloaded the model from the link provided in the paraphraser README.

Thank You,
Ami Agarwal.
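A likely cause, not confirmed in this thread: argparse treats the path after the stray spaces around '=' as an extra positional argument, and the flag should point at the checkpoint prefix rather than at the .data shard. A sketch of the corrected invocation, assuming the checkpoint files were uploaded together under /content:

!python /content/home/ami/paraphraser/paraphraser/inference.py --checkpoint=/content/model-171856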

Trained model

I am engaged in the study of paraphrase generation and really liked your implementation. Could you upload a trained model that we can run right away? I understand that this is an additional difficulty for you, but it would be a huge help in research. Thank you so much in advance.

Training data (aggregate_paraphrase_corpus_0)

Hello Victor,
I would like to thank you first for your contribution.

I am trying to retrain your model, but the aggregate_paraphrase_corpus_0 is missing.
Could you share the files with me, or maybe explain their format?

Thanks

Training data volume

Hi, Great project btw!
So I'm looking to get slightly more variety in the paraphrased sentences.
The 50m dataset has 30 million high-quality paraphrase pairs.

How many did you use to train the model? Also, do you reckon using more would increase performance by a lot?

Thanks!

paraphraser get stuck

Hi,
Sometimes the paraphraser gets stuck and never finishes, e.g.
when running:

from inference import *
source_sentence="That 's a potential nightmare scenario for the GOP establishment : a populist outsider with unlimited resources attacking their nominee from the right in the general election , raising hell -- and attracting votes -- with his rhetoric on issues like illegal immigration ."
paraphrases = paraphraser.sample_paraphrase(source_sentence, sampling_temp=0.75, how_many=1)

I tried changing the parameters, but that doesn't help either.
Do you have any idea how to fix this problem?

Thanks!

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Steps:

Step1:
Downloaded models to ~/Documents/Projects/paraphraser/checkpoint_path/train-20180325-001253

Step2:
cd ~/Documents/Projects/paraphraser/paraphraser

Step3:
ishandutta2007@MacBook-Pro:~/Documents/Projects/paraphraser/paraphraser$ python inference.py --checkpoint=../checkpoint_path/train-20180325-001253/model-171856

/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expect
return f(*args, **kwds)
/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expect
return f(*args, **kwds)
/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expect
return f(*args, **kwds)
Traceback (most recent call last):
File "inference.py", line 3, in
from preprocess_data import preprocess_batch
File "/Users/ishandutta2007/Documents/Projects/paraphraser/paraphraser/preprocess_data.py", line 23, in
from nlp_pipeline import openmp_nlp_pipeline
File "/Users/ishandutta2007/Documents/Projects/paraphraser/paraphraser/nlp_pipeline.py", line 7, in
nlp = spacy.load('en')
File "/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/site-packages/spacy/init.py", line 21, in load
return util.load_model(name, **overrides)
File "/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/site-packages/spacy/util.py", line 119, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
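With spaCy 2.x, spacy.load('en') needs the 'en' shortcut link, which the download command below creates; this mirrors the spaCy download step in the setup section above, and it installs spaCy's default small English model (en_core_web_sm):

python -m spacy download en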
