vsuthichai / paraphraser

Paraphrase generation at the sentence level

Home Page: http://pair-a-phrase.it

License: MIT License

Language: Python 100.00%

Topics: paraphrase-generation, paraphrases, paraphrase, tensorflow, bidirectional-lstm, sentence-generator, insight-data-science, insight-ai, insight-artificial-intelligence, deep-learning

paraphraser's Introduction

Paraphraser

This project provides a clean and simple API for generating paraphrases of sentences. A demo can be seen here: pair-a-phrase
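For reference, a minimal usage sketch in Python is shown below. It is based on the sample_paraphrase call that appears in the issue reports further down this page; treat the exact module, object, and argument names as assumptions rather than a documented API.

# Minimal sketch, assuming the inference module exposes a `paraphraser` object
# with a sample_paraphrase() method as reported in the issues below.
from inference import *

source_sentence = "The quick brown fox jumps over the lazy dog ."
# sampling_temp (randomness) and how_many (number of outputs) are assumed parameter meanings.
paraphrases = paraphraser.sample_paraphrase(source_sentence, sampling_temp=0.75, how_many=5)
for p in paraphrases:
    print(p)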

The paraphraser was developed under the Insight Data Science Artificial Intelligence program.

Model

The underlying model is a bidirectional LSTM encoder and LSTM decoder with attention, trained using TensorFlow. A download link is available here: paraphrase model
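The repository's actual model code is not reproduced on this page; the following is a minimal sketch of this kind of architecture using the TensorFlow 1.x seq2seq APIs that were current around version 1.4. The vocabulary size, embedding dimension, and hidden sizes are illustrative assumptions.

# Sketch only: a bidirectional LSTM encoder feeding an attention-equipped LSTM decoder.
import tensorflow as tf

VOCAB_SIZE = 30000   # assumed vocabulary size
EMBED_DIM = 300      # assumed embedding dimension
HIDDEN_DIM = 512     # assumed LSTM hidden size

# Token ids and lengths for a batch of source and target sentences.
src_ids = tf.placeholder(tf.int32, [None, None])
src_len = tf.placeholder(tf.int32, [None])
tgt_ids = tf.placeholder(tf.int32, [None, None])
tgt_len = tf.placeholder(tf.int32, [None])

embedding = tf.get_variable("embedding", [VOCAB_SIZE, EMBED_DIM])
src_emb = tf.nn.embedding_lookup(embedding, src_ids)
tgt_emb = tf.nn.embedding_lookup(embedding, tgt_ids)

# Bidirectional LSTM encoder.
fw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM)
bw_cell = tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, src_emb, sequence_length=src_len, dtype=tf.float32)
encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)

# LSTM decoder with Luong-style attention over the encoder outputs.
attention = tf.contrib.seq2seq.LuongAttention(
    HIDDEN_DIM, encoder_outputs, memory_sequence_length=src_len)
dec_cell = tf.contrib.seq2seq.AttentionWrapper(
    tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM), attention)
helper = tf.contrib.seq2seq.TrainingHelper(tgt_emb, tgt_len)
decoder = tf.contrib.seq2seq.BasicDecoder(
    dec_cell, helper,
    initial_state=dec_cell.zero_state(tf.shape(src_ids)[0], tf.float32),
    output_layer=tf.layers.Dense(VOCAB_SIZE))
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
logits = outputs.rnn_output   # [batch, time, vocab] scores used for the training loss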

Prerequisites

  • python 3.5
  • TensorFlow 1.4.1
  • spacy

Inference Execution

Download the model checkpoint from the link above and run:

python inference.py --checkpoint=<checkpoint_path/model-171856>

Datasets

The dataset used to train this model is an aggregation of many different public datasets. To name a few:

  • para-nmt-5m
  • Quora question pairs
  • SNLI
  • Semeval
  • And more!

I have not included the aggregated dataset as part of this repo. If you're curious and would like to know more, contact me. Pretrained embeddings come from John Wieting's para-nmt-50m project.
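The aggregated corpus itself is not included, but based on the traceback quoted in the issues below, embeddings.py unpickles the vocabulary mappings, the embedding matrix, and the special token ids from the para-nmt-50m download. A hedged sketch of that loading step follows; the single-pickle layout and field order are assumptions inferred from that traceback.

# Sketch only: layout inferred from load_sentence_embeddings() as quoted in the issues below.
import pickle

with open("../../para-nmt-50m/data/ngram-word-concat-40.pickle", "rb") as f:
    word_to_id, idx_to_word, embedding, start_id, end_id, unk_id, mask_id = pickle.load(f)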

Training

Training was done for 2 epochs on an Nvidia GTX 1080 and evaluated with the BLEU score. The TensorBoard training curves can be seen below; the grey curve is train and the orange curve is dev.
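The project's own evaluation code is not shown on this page; purely as an illustration of BLEU scoring between reference paraphrases and model outputs, one could use NLTK (an assumption, not necessarily the tooling used here):

# Illustrative BLEU computation with NLTK; the tokens are toy examples.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["the", "disease", "causes", "the", "covid-19", "pandemic", "."]]]  # one list of references per hypothesis
hypotheses = [["the", "disease", "is", "the", "cause", "of", "the", "covid-19", "epidemic", "."]]
score = corpus_bleu(references, hypotheses, smoothing_function=SmoothingFunction().method1)
print(score)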

TODOs

  • pip installable package
  • Explore deeper models (more layers)
  • Recurrent layer dropout
  • Greater dataset augmentation
  • Try residual layers
  • Model compression
  • Byte pair encoding for out-of-set vocabulary

Citations

@inproceedings{wieting-17-millions,
    author = {John Wieting and Kevin Gimpel},
    title = {Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations},
    booktitle = {arXiv preprint arXiv:1711.05732},
    year = {2017}
}

@inproceedings{wieting-17-backtrans,
    author = {John Wieting and Jonathan Mallinson and Kevin Gimpel},
    title = {Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext},
    booktitle = {Proceedings of Empirical Methods in Natural Language Processing},
    year = {2017}
}

Additional Setup Requirements

Create the environment in the /paraphraser directory

conda env create -f env.yml

Activate the environment

conda activate paraphraser-env

Download the model checkpoint from above, rename it to "checkpoints", and place it within the /paraphraser/paraphraser directory

Download para-nmt-50m here

  • Rename it to para-nmt-50m and place it inside the /paraphraser directory

You MAY need to run the following three commands (when prompted)

conda install tensorflow==1.14
conda install spacy
python3 -m spacy download en_core_web_sm

Run the inference.py script

cd paraphraser
python inference.py --checkpoint=checkpoints/model-171856
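If the checkpoint path is rejected, note that --checkpoint should point at the model prefix (model-171856), not at an individual shard file. A typical TensorFlow checkpoint directory for this model would look roughly like the following; the file names other than the prefix are assumptions based on the standard TensorFlow checkpoint layout:

checkpoints/
    checkpoint                          # text file listing available checkpoints
    model-171856.data-00000-of-00001    # variable values
    model-171856.index                  # variable index
    model-171856.meta                   # graph definition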

paraphraser's People

Contributors: akshar-code, dependabot[bot], shellydeng, vsuthichai

paraphraser's Issues

some examples:

My prompts are after "source:"

Source: The disease is the cause of the COVID-19 pandemic.



/usr/local/lib/python3.7/dist-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)
Paraph #0: disease is caused by the cause of the disease .
Paraph #1: disease 's disease cause a covid-19 pandemic .
Paraph #2: the disease is the cause of a epidemics against covid-19 .
Paraph #3: disease is causing a flurry of meningitis for covid-19 .
Paraph #4: that disease is cause a disease of covid-19 .
Paraph #5: the disease is the cause of the covid-19 epidemic .
Paraph #6: the disease is the cause of the disease .
Paraph #7: the disease is the cause of an covid-19 pandemic .
Paraph #8: this disease is the cause of a covid-19 pandemic disease .
Paraph #9: disease are caused by covid-19 pandemic .

--------------

Source: According to an April 2020 study by the American Gastroenterological Association, COVID-19 can make sick people vomit or have diarrhea but this is rare.

/usr/local/lib/python3.7/dist-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)
Paraph #0: according to april 2020 , the american covid-19 association of america may infect patients vomit or diarrhea , but this is rare .
Paraph #1: according to the un covid-19 association , the american association is not possible to puke or receive diarrhea , but this is rare .
Paraph #2: according to april 2020 , the american association of america is not the same .
Paraph #3: according to a report by the american covid-19 association , excellent people vomit or diarrhea , but this is rare .
Paraph #4: according to april 2020 , he can arrange sick people vomit or diarrhea , but this is rare .
Paraph #5: according to the study of an american covid-19 association , an american covid-19 association can make sick people vomit or enjoy diarrhea , but this is rare .
Paraph #6: according to april 2020 , the csa association of america is giving up sick people vomit or clots , but this is rare .
Paraph #7: according to april 2020 , the u.s . level is harmful to diseases , and the disease is a waste of vomit , or that 's rare .
Paraph #8: according to april 2020 , we can do sick people vomit or diarrhea , but this is rare .
Paraph #9: according to april 2020 , the u.s . system can produce sick people vomiting or diarrhea , but this is rare .

No matching distribution found for en-core-web-sm==2.0.0 (from -r requirements.txt (line 13))

Installing the pip dependencies gives me the following error:

(env3.6) huntsman-ve501-0094:paraphraser daniel$ pip3.6 install -r requirements.txt 
  Could not find a version that satisfies the requirement en-core-web-sm==2.0.0 (from -r requirements.txt (line 13)) (from versions: )
No matching distribution found for en-core-web-sm==2.0.0 (from -r requirements.txt (line 13))

python3 paraphraser/setup.py install --> not found (or not a regular file)

running install
running bdist_egg
running egg_info
writing paraphraser.egg-info/PKG-INFO
writing dependency_links to paraphraser.egg-info/dependency_links.txt
writing top-level names to paraphraser.egg-info/top_level.txt
package init file 'paraphraser/__init__.py' not found (or not a regular file)
file paraphraser/synonym_model.py (for module paraphraser.synonym_model) not found
file paraphraser/inference.py (for module paraphraser.inference) not found
file paraphraser/download_models.py (for module paraphraser.download_models) not found

Paraphraser hangs when invoked repeatedly in a loop

I am getting this info in the console; I'm not sure where it comes from or how to make the change:

The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
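That message comes from Jupyter itself rather than from the paraphraser. A commonly used workaround, not specific to this repository, is to raise the limit when starting the notebook server:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10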

How to run the paraphraser? Meeting some problems

Traceback (most recent call last):
  File "inference.py", line 3, in <module>
    from preprocess_data import preprocess_batch
  File "/data/jquan/codes/paraphraser-master/paraphraser/preprocess_data.py", line 26, in <module>
    word_to_id, idx_to_word, embedding, start_id, end_id, unk_id, mask_id = load_sentence_embeddings()
  File "/data/jquan/codes/paraphraser-master/paraphraser/embeddings.py", line 9, in load_sentence_embeddings
    with open("../../para-nmt-50m/data/ngram-word-concat-40.pickle", 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../../para-nmt-50m/data/ngram-word-concat-40.pickle'
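Because open() resolves the path relative to the working directory, this error means there is no para-nmt-50m/data/ngram-word-concat-40.pickle two levels above the directory from which inference.py is run. A sketch of a layout that satisfies the path when running from paraphraser-master/paraphraser; directory names other than those in the traceback are assumptions:

codes/
    paraphraser-master/
        paraphraser/        # run inference.py from here
            embeddings.py
    para-nmt-50m/
        data/
            ngram-word-concat-40.pickle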

Unrecognized Arguments: Checkpoint

Whenever running the following command on Google Colab
!python /content/home/ami/paraphraser/paraphraser/inference.py --checkpoint = /content/model-171856.data-00000-of-00001

I get the following error:
usage: inference.py [-h] [--checkpoint CHECKPOINT]
inference.py: error: unrecognized arguments: /content/model-171856.data-00000-of-00001

Don't know how to address this. Runtime type: Python 3, GPU.
I have uploaded the train folder containing the model to Google Drive and mounted Google Drive on Colab to access it. I downloaded the model from the link provided in the paraphraser README.

Thank You,
Ami Agarwal.
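A likely cause, not confirmed in this thread: argparse treats the path after the stray spaces around '=' as an extra positional argument, and the flag should point at the checkpoint prefix rather than at the .data shard. A sketch of the corrected invocation, assuming the checkpoint files were uploaded together under /content:

!python /content/home/ami/paraphraser/paraphraser/inference.py --checkpoint=/content/model-171856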

Trained model

I am engaged in the study of paraphrase generation and really liked your implementation. Could you upload a trained model that we can run right away? I understand that this is an additional difficulty for you, but it would be a huge help in research. Thank you so much in advance.

Training data (aggregate_paraphrase_corpus_0)

Hello Victor,
I would like to thank you first for your contribution.

I am trying to retrain your model, but the aggregate_paraphrase_corpus_0 is missing.
Could you share the files with me, or maybe explain their format?

Thanks

Training data volume

Hi, Great project btw!
So I'm looking to get slightly more variety in the paraphrased sentences.
The 50m dataset has 30 million high-quality paraphrase pairs.

How many did you use to train the model? Also, do you reckon using more would increase performance by a lot?

Thanks!

paraphraser get stuck

Hi,
Sometimes the paraphraser gets stuck and never finishes, e.g.
when running:

from inference import *
source_sentence="That 's a potential nightmare scenario for the GOP establishment : a populist outsider with unlimited resources attacking their nominee from the right in the general election , raising hell -- and attracting votes -- with his rhetoric on issues like illegal immigration ."
paraphrases = paraphraser.sample_paraphrase(source_sentence, sampling_temp=0.75, how_many=1)

I tried changing the parameters, but that doesn't help either.
Do you have any idea how to fix this problem?

Thanks!

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Steps:

Step1:
Downloaded models to ~/Documents/Projects/paraphraser/checkpoint_path/train-20180325-001253

Step2:
cd ~/Documents/Projects/paraphraser/paraphraser

Step3:
ishandutta2007@MacBook-Pro:~/Documents/Projects/paraphraser/paraphraser$ python inference.py --checkpoint=../checkpoint_path/train-20180325-001253/model-171856

/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expect
return f(*args, **kwds)
/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expect
return f(*args, **kwds)
/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expect
return f(*args, **kwds)
Traceback (most recent call last):
File "inference.py", line 3, in
from preprocess_data import preprocess_batch
File "/Users/ishandutta2007/Documents/Projects/paraphraser/paraphraser/preprocess_data.py", line 23, in
from nlp_pipeline import openmp_nlp_pipeline
File "/Users/ishandutta2007/Documents/Projects/paraphraser/paraphraser/nlp_pipeline.py", line 7, in
nlp = spacy.load('en')
File "/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/site-packages/spacy/init.py", line 21, in load
return util.load_model(name, **overrides)
File "/Users/ishandutta2007/.pyenv/versions/3.5.0/lib/python3.5/site-packages/spacy/util.py", line 119, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
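With spaCy 2.x, spacy.load('en') needs the 'en' shortcut link, which the download command below creates; this mirrors the spaCy download step in the setup section above, and it installs spaCy's default small English model (en_core_web_sm):

python -m spacy download en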
