pasmod / paradox

An Automatic Paraphrase Detection System

License: MIT License

Makefile 1.08% Python 27.39% Perl 71.52%

paraphrase paraphrase-detection paraphrase-identification

paradox's Issues

Check if the Hindi tokenizer works from right to left?

Add evaluation data

Add CrossValidation

Encode sequence with vocabulary indexes like in reorderer

Experiment with character sequences and LSTMs

Train metric should be F1 not accuracy

Add Normalizer to SVM Baselines

Move result_logger into logger module

Add training data for the remaining three languages if available

Baseline: One-Hot-Encoding of the words

Word sequence max length via median

https://github.com/pasmod/paradox/blob/master/paradox/loaders/keras_loader.py#L40
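
A minimal sketch of what choosing the maximum sequence length via the median could look like. The linked loader code is not reproduced here, so the helper names `median_max_length` and `pad_or_truncate`, the padding value 0, and the toy data are all assumptions made for illustration, not the repository's implementation.

```python
from statistics import median


def median_max_length(sequences):
    """Return the median sequence length, rounded down to an int."""
    return int(median(len(seq) for seq in sequences))


def pad_or_truncate(sequences, max_len, pad_value=0):
    """Cut sequences longer than max_len and right-pad shorter ones."""
    result = []
    for seq in sequences:
        clipped = list(seq[:max_len])
        result.append(clipped + [pad_value] * (max_len - len(clipped)))
    return result


# Toy index-encoded sentences (made-up data, not from the project).
sequences = [[4, 8], [15, 16, 23, 42], [7], [1, 2, 3]]
max_len = median_max_length(sequences)
padded = pad_or_truncate(sequences, max_len)
```

Using the median instead of the maximum keeps a few very long sentences from inflating the input size, at the cost of truncating roughly half of the sequences.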

[feature] verbose mode for similarity transformer

I added a verbose mode for the similarity transformer using tqdm. Let me know if you want a PR for this. It looks like this when running benchmark.py

The benchmark script takes a rather long time on my computer, so I wanted to see what was going on and also save the model for reuse.

diff --git a/paradox/benchmark.py b/paradox/benchmark.py
index b0beba2..5922ac7 100644
--- a/paradox/benchmark.py
+++ b/paradox/benchmark.py
@@ -1,3 +1,4 @@
+import logging
 from metrics import pearson, mse
 from pipeline import pipeline
 import k_neighbors_regressor
@@ -5,6 +6,7 @@ import numpy as np
 import similarity
 import parser
 
+logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
 
 def report(correlations, errors, y_pred_fold):
     print("PC:\t\t\t%0.2f\t(+/- %0.2f)" % (np.mean(correlations),
@@ -28,11 +30,14 @@ def test(model=None, categories=[]):
 pairs = parser.parse(mode="train")
 X = [pair[0] for pair in pairs]
 y = [pair[1] for pair in pairs]
-transformer = similarity.build()
+transformer = similarity.build(verbose=True)
 estimator = k_neighbors_regressor.build(n_neighbors=4)
 p = pipeline(transformers=[transformer], estimator=estimator)
 p.fit(X, y)
 
+import pickle
+with open('model.pickle', 'wb') as f:
+    pickle.dump(p, f)
 
 test(p, categories=["answer-answer"])
 test(p, categories=["question-question"])
diff --git a/paradox/similarity.py b/paradox/similarity.py
index d5b20b5..6496787 100644
--- a/paradox/similarity.py
+++ b/paradox/similarity.py
@@ -41,8 +41,8 @@ def similarity(text1, text2, levels=['surface', 'context']):
     return sims
 
 
-def build(levels=['surface', 'context']):
-    pipeline = Pipeline([('transformer', Similarity(levels=levels))])
+def build(levels=['surface', 'context'], verbose=False):
+    pipeline = Pipeline([('transformer', Similarity(levels=levels, verbose=verbose))])
     return ('similarity', pipeline)
 
 
@@ -52,15 +52,24 @@ def param_grid():
 
 
 class Similarity(BaseEstimator):
-    def __init__(self, levels=['surface']):
+    def __init__(self, levels=['surface'], verbose=False):
         self.levels = levels
+        self.verbose = verbose
 
     def fit(self, X, y):
         return self
 
     def transform(self, X):
         a = []
-        for x in X:
+
+        tqdm = lambda x: x
+        if self.verbose:
+            try:
+                from tqdm import tqdm
+            except ImportError:
+                pass
+
+        for x in tqdm(X):
             a.append(self._transform(x))
         return a
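
The optional-dependency pattern in the patch above can also be factored into a small helper. This is only a sketch: `progress` and `VerboseTransformer` are hypothetical names, the class is a stand-in for `Similarity` without the scikit-learn base class, and `_transform` is a placeholder rather than the real similarity computation.

```python
def progress(iterable, verbose=False):
    """Wrap iterable in a tqdm progress bar if verbose and tqdm is installed."""
    if verbose:
        try:
            from tqdm import tqdm
            return tqdm(iterable)
        except ImportError:
            pass  # tqdm is optional; fall back to plain iteration
    return iterable


class VerboseTransformer:
    """Hypothetical stand-in for the Similarity transformer."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def fit(self, X, y=None):
        return self

    def _transform(self, x):
        # Placeholder feature: the real class computes similarity scores.
        return len(x)

    def transform(self, X):
        return [self._transform(x) for x in progress(X, self.verbose)]
```

Keeping the import inside the helper means tqdm stays an optional dependency: with `verbose=False`, or without tqdm installed, the transformer behaves exactly as before.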

Refactor the choice of the baseline estimator into a function.

estimate_svm_baseline(... ) and estimate_svm_baseline(..., True) should be wrapped

Add model checkpoints

Solve the problem of absolute paths

Refactor evaluation.count_vectorizer_word_baseline

Move hindi_tokenizer into separate file
Split the creation of the pipeline and the estimate_svm_baseline into different files

Experiment with character n-grams of sizes 2 and 3
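
For reference, a plain-Python sketch of what character n-grams of sizes 2 and 3 are. The function name `char_ngrams` is an assumption made for illustration; with scikit-learn the same features would typically come from `CountVectorizer(analyzer="char", ngram_range=(2, 3))`.

```python
def char_ngrams(text, sizes=(2, 3)):
    """Return all character n-grams of the given sizes, in order of size."""
    return [text[i:i + n]
            for n in sizes
            for i in range(len(text) - n + 1)]


# Example on a single word.
grams = char_ngrams("para")
```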

Baseline: Scikit CountVectorizer with SVM as baseline

Experiment with other deep learning architectures

Experiment with split random_state

Save the trained neural net

Baseline: Scikit char-based CountVectorizer with SVM as baseline

Add optional message to ResultLogger

Refactor keras_test

Separate functions for:

get_data(language, task)
model generation in separate files
compile step into separate file, multiple methods for different optimizers

Refactor main keras call

Each model should specify the following parameters

The representation in which the data set should be loaded (chars / words)
get_model should be refactored
compile_optimizer should be in the same function as the model creation; no overengineering

Baseline: One-Hot-Encoding of the characters

Implement STS Similarity

result_logger should be parallelizable

Set batchsize to max length
