paradox's Introduction

Paradox: Automatic Paraphrase Identification

Given two sentences, Paradox returns a continuous-valued similarity score on a scale from 0 to 5, where 0 indicates that the meanings of the sentences are completely independent and 5 indicates semantic equivalence. Paradox uses pre-trained GloVe models.
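To make the 0 to 5 scale concrete, here is a minimal sketch that averages word vectors per sentence and rescales their cosine similarity. It is not Paradox's actual pipeline: the tiny `TOY_VECTORS` table holds made-up values standing in for real GloVe embeddings, and the names `sentence_vector` and `toy_similarity` are hypothetical.

```python
import math

# Made-up 3-dimensional vectors standing in for real GloVe embeddings (assumption).
TOY_VECTORS = {
    "a": [0.1, 0.1, 0.1],
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "sleeps": [0.1, 0.9, 0.0],
    "naps": [0.2, 0.8, 0.1],
    "stocks": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of all known tokens; return None if no token is known."""
    known = [vectors[t] for t in sentence.lower().split() if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def toy_similarity(s1, s2, vectors=TOY_VECTORS):
    """Rescale the cosine similarity of the averaged vectors to the range 0..5."""
    v1 = sentence_vector(s1, vectors)
    v2 = sentence_vector(s2, vectors)
    if v1 is None or v2 is None:
        return 0.0
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    if norm == 0:
        return 0.0
    cosine = dot / norm
    # Clamp to [0, 1] so negative cosines map to 0 and rounding never exceeds 5.
    return 5.0 * min(max(cosine, 0.0), 1.0)


if __name__ == "__main__":
    print(toy_similarity("a cat sleeps", "a kitten naps"))
    print(toy_similarity("a cat sleeps", "stocks fell"))
```

With these toy vectors, the paraphrase-like pair scores higher than the unrelated pair. A real system would load GloVe vectors from disk and would typically learn the mapping to the 0 to 5 scale from annotated data rather than rescaling a cosine directly.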

How to install

Paradox is dockerized! First install Docker and then run the following commands:

cd paradox
make install
make download_glove
make download_models

Training Corpus

For training, the semantic similarity corpora from SemEval (2012-2016) are used. The training data are available under /corpus.

Evaluation

The evaluation script reports the results on the test data set of the SemEval-2016 challenge. To see the report, run the following commands:

source env/bin/activate
python benchmark.py
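For readers unfamiliar with the reported numbers: semantic textual similarity systems are usually scored by the Pearson correlation between predicted and gold scores, often alongside the mean squared error. The sketch below shows both in plain Python. The function names mirror the `pearson` and `mse` names that benchmark.py imports, but the bodies are an assumption and not the project's own implementation; the sample scores are invented.

```python
import math


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between gold and predicted scores."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mse(y_true, y_pred):
    """Mean squared error between gold and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    gold = [0.0, 2.5, 5.0]       # invented gold similarity scores
    predicted = [0.5, 2.5, 4.5]  # invented system outputs
    print("PC:  %0.2f" % pearson(gold, predicted))
    print("MSE: %0.2f" % mse(gold, predicted))
```

Note that a high correlation does not imply a low error: predictions that are a compressed but perfectly linear version of the gold scores, as in this sample, still correlate perfectly.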

Citation

This repository contains the code for the DeepLDA approach introduced in the following paper. Use the following bibtex entry to cite us:

@InProceedings{liebeck-EtAl:2016:SemEval,
    author    = {Liebeck, Matthias and Pollack, Philipp and Modaresi, Pashutan and Conrad, Stefan},
    title     = {HHU at SemEval-2016 Task 1: Multiple Approaches to Measuring Semantic Textual Similarity},
    booktitle = {Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)},
    month     = {June},
    year      = {2016},
    address   = {San Diego, California},
    publisher = {Association for Computational Linguistics},
    pages     = {607--613},
    url       = {TOBEFILLED-http://www.aclweb.org/anthology/W/W05/W05-0292}
}

ToDos:

  • Implement topical similarity based on the LDA models.

paradox's People

Contributors

aljohri, liebeck, qlaym-backup

paradox's Issues

Refactor keras_test

Separate functions for:

  • get_data(language, task)
  • model generation in separate files
  • compile step into separate file, multiple methods for different optimizers

[feature] verbose mode for similarity transformer

I added a verbose mode for the similarity transformer using tqdm. Let me know if you want a PR for this. This is what it looks like when running benchmark.py:

[Screenshot: screen shot 2018-08-08 at 10 28 13 pm]

The benchmark script takes a rather long time on my computer, so I wanted to see what was going on and also save the model for reuse.

diff --git a/paradox/benchmark.py b/paradox/benchmark.py
index b0beba2..5922ac7 100644
--- a/paradox/benchmark.py
+++ b/paradox/benchmark.py
@@ -1,3 +1,4 @@
+import logging
 from metrics import pearson, mse
 from pipeline import pipeline
 import k_neighbors_regressor
@@ -5,6 +6,7 @@ import numpy as np
 import similarity
 import parser
 
+logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
 
 def report(correlations, errors, y_pred_fold):
     print("PC:\t\t\t%0.2f\t(+/- %0.2f)" % (np.mean(correlations),
@@ -28,11 +30,14 @@ def test(model=None, categories=[]):
 pairs = parser.parse(mode="train")
 X = [pair[0] for pair in pairs]
 y = [pair[1] for pair in pairs]
-transformer = similarity.build()
+transformer = similarity.build(verbose=True)
 estimator = k_neighbors_regressor.build(n_neighbors=4)
 p = pipeline(transformers=[transformer], estimator=estimator)
 p.fit(X, y)
 
+import pickle
+with open('model.pickle', 'wb') as f:
+    pickle.dump(p, f)
 
 test(p, categories=["answer-answer"])
 test(p, categories=["question-question"])
diff --git a/paradox/similarity.py b/paradox/similarity.py
index d5b20b5..6496787 100644
--- a/paradox/similarity.py
+++ b/paradox/similarity.py
@@ -41,8 +41,8 @@ def similarity(text1, text2, levels=['surface', 'context']):
     return sims
 
 
-def build(levels=['surface', 'context']):
-    pipeline = Pipeline([('transformer', Similarity(levels=levels))])
+def build(levels=['surface', 'context'], verbose=False):
+    pipeline = Pipeline([('transformer', Similarity(levels=levels, verbose=verbose))])
     return ('similarity', pipeline)
 
 
@@ -52,15 +52,24 @@ def param_grid():
 
 
 class Similarity(BaseEstimator):
-    def __init__(self, levels=['surface']):
+    def __init__(self, levels=['surface'], verbose=False):
         self.levels = levels
+        self.verbose = verbose
 
     def fit(self, X, y):
         return self
 
     def transform(self, X):
         a = []
-        for x in X:
+
+        tqdm = lambda x: x
+        if self.verbose:
+            try:
+                from tqdm import tqdm
+            except ImportError:
+                pass
+
+        for x in tqdm(X):
             a.append(self._transform(x))
         return a

Refactor main keras call

Each model should specify the following parameters

  • The representation in which the data set should be loaded (chars / words)
  • get_model should be refactored
  • compile_optimizer should be in the same function as the model creation; no overengineering
