brmson / dataset-sts Goto Github PK

Semantic Text Similarity Dataset Hub

R 1.82% Perl 1.04% Python 93.57% Shell 3.57%

dataset-sts's Introduction

Semantic Text Similarity Dataset Hub

A typical NLP machine learning task involves classifying a sequence of tokens such as a sentence or a document, i.e. approximating a function

f_1(s) ∈ [0,1]

(where f_1 may determine a domain, sentiment, etc.). But there is a large class of problems that are often harder and involve classifying a pair of sentences:

f_2(s1, s2) ∈ [0,1]*c

(where s1, s2 are sequences of tokens and c is a rescaling factor like c=5).

Typically, the function f_2 denotes some sort of semantic similarity, that is whether (or how much) the two parameters "say the same thing". (However, the function could do something else - like classify entailment or contradiction or just topic relatedness. We may include such datasets as well.)

This repo aims to gather a variety of standard datasets and tools for training and evaluating such models in a single place, with the base belief that it should be possible to build generic models for f_2 that aren't tailored to particular tasks (and even multitask learning should be possible).

Most of the datasets are pre-existing; text similarity datasets that may be redistributed (at least for research purposes) are included. Always check the licence of a particular dataset. Some datasets may be original though, because we are working on many applied problems that pertain training such a function...

The contents of dataset-sts and baseline results are described in the paper Sentence Pair Scoring: Towards Unified Framework for Text Comprehension. Hypothesis evaluation task is described in the paper Joint Learning of Sentence Embeddings for Relevance and Entailment presented at the repl4nlp workshop at ACL2016 (poster).

Pull requests welcome that extend the datasets, or add important comments, references or attributions. Please let us know if we misread some licence terms and shouldn't be including something, we'll take that down right away!

Pull request that include simple baselines for f_2 models are also welcome. (Simple == they fit in a couple of screenfuls of code and are batch-runnable. Python is preferred, but not mandatory.)

Software Tools

To get started with simple classifiers that use task-specific code, look at the examples/ directory. To get started with task-universal deep learning models, look at the tools/, models/ and tasks/ directory.

pysts/ Python module contains various tools for easy loading, manipulation and evaluation of the dataset.
pysts/kerasts the KeraSTS allows easy prototyping of deep learning models for many of the included tasks using the Keras library.
examples/ contains a couple of simple, self-contained baselines on various tasks.
models/ directory contains various strong baseline models using the KeraSTS toolkit, including state-of-art neural networks
tasks/ directory contains model-independent interfaces to datasets for various tasks (from Answer Sentence Selection to Paraphrasing)
tools/ directory contains tools that put models and tasks together; training, evaluating, tuning and transferring models on tasks

Datasets

This is for now as much a TODO list as an overview.

"Paraphrasing" Task

These datasets are about binary classification of independent sentence (or multi-sentence) pairs regarding whether they say the same thing; for example if they describe the same event (with same data), ask the same question, etc.

data/para/msr/ MSR Paraphrase Dataset (TODO: pysts manipulation tools)
data/para/askubuntu/ AskUbuntu StackOverflow Similar Questions
PPDB: The Paraphrase Database contains only short phrase snippets, but tens of millions of pairs
More Stack Exchange data? (some is also contained in the new STS datasets)

"Semantic Text Similarity" Task

These datasets consider the semantic similarity of independent pairs of texts (typically short sentences) and share a precise similarity metric definition of assigning a number between 0 to 5 to each pair denoting the level of similarity/entailment.

data/sts/semeval-sts/ SemEval STS Task - multiple years, each covers a bunch of topics that share the same precise similarity metric definition
data/sts/sick2014/ SemEval SICK2014 Task
SemEval 2014 Cross-level Semantic Similarity Task (TODO; 500 paragraph-to-sentence training items)

"Entailment" Task

These datasets classify independent pairs of "hypothesis" and "fact" sentences as entailment, contradiction or unknown.

data/rte/sick2014/ SemEval SICK2014 Task also includes entailment data
data/rte/snli/ The Stanford Natural Language Inference (SNLI) Corpus (570k pairs dataset for an RTE-type task).
RTE Datasets up to RTE-3 http://nlp.stanford.edu/RTE3-pilot/ (TODO)

"Answer Sentence Selection" Task

These datasets concern a "bipartite ranking" task. That is, each tuple of sentences is binary classified, but there are many different S1 sentences given the same S0 and the ultimate goal is to sort positive-labelled S1s above negative-labelled S1s for each S0.

Typically, S0 is a question and S1 are potentially-answer-bearing passages (in that case, identifying the actual answer might be an auxiliary task to consider; see anssel-yodaqa). However, other scenarios are possible, like the Ubuntu Dialogue Corpus where S1 are dialogue followups to S0.

data/anssel/wang/ Answer Sentence Selection - original Wang dataset
data/anssel/yodaqa/ Answer Sentence Selection - YodaQA-based
InsuranceQA Dataset (used in recent IBM papers, 25k question-answer pairs; unclear licencing)
data/anssel/wqmprop/ Property Path Selection (based on WebQuestions + YodaQA)
data/anssel/ubuntu/ The Ubuntu Dialogue Corpus contains pairs of sequences where the second sentence is a candidate for being a followup in a community techsupport chat dialog. 10M pairs make this awesome.

"Hypothesis Evidencing" Task

Similar to the "Answer Sentence Selection" task, these datasets need to consider a variety of S1 given a fixed S0 - the desired output should be however a judgement about S0 alone (typically true / false).

data/hypev/argus/ Argus Dataset (Yes/No Question vs. News Headline)
data/hypev/ai2-8grade/ AI2 8th Grade Science Questions are 641 school Science quiz questions (A/B/C/D test format), stemming from The Allen AI Science Challenge We are going to produce a dataset that merges questions and answers in a single sentence, and pairs each with potential-evidencing sentences from Wikipedia and CK12 textbooks. This will be probably the hardest dataset by far included in this repo for some time. (We may also want to include the Elementary dataset.)
bAbI has a variety of datasets, especially re memory networks (memory relevant to a given question), though with an extremely limited vocabulary.
data/hypev/mctest/ Machine Comprehension Test (MCTest) contains 300 children stories with many sentences and 4 questions each. A share-alike type licence.
More "Entrance Exam" tasks solving multiple-choice school tests.

Other Datasets

Some datasets are not universally available, but we may accept contributions regarding code to load them.

Non-Redistributable Datasets

Some datasets cannot be redistributed. Therefore, some scientists may not be able to agree with the licence and download them, and/or may decide not to use them for model development and research (if it is in commercial setting), but only for some final benchmarks to benefit cross-model comparisons. We discourage using these datasets.

Microsoft Research Video Description Corpus (video annotation task, 120k sentences in 2k clusters)
Microsoft Research WikiQA Corpus (Answer Selection task, 3k questions and 29k answers with 1.5k correct)
STS2013 Joint Student Response Analysis (RTE-8)

Non-free Datasets

Some datasets are completely non-free and not available on the internet, therefore as strong believers in reproducible experiments and open science, we strongly discourage their usage.

STS2013 Machine Translation pairs translated and translated-and-postedited newswire headlines. Payment required.
TAC tracks RTE-4 to RTE-7. Printed user agreement required.

Algorithm References

Here, we refer to some interesting models for sentence pair classification. We focus mainly on papers that consider multiple datasets or are hard to find; you can read e.g. about STS winners on the STS wiki, about anssel/wang models on the ACL wiki, about RTE models on the SNLI page.

https://github.com/alvations/stasis contains several baselines and another view of the datasets (incl. the CLSS task)
https://github.com/ryankiros/skip-thoughts
Standard memory networks (MemNN, MemN2N) are in fact f_2 models at their core; very similar to http://arxiv.org/abs/1412.1632

Licence and Attribution

Always check the licences of the respective datasets you are using! Some of them are plain CC-BY, others may be heavily restricted e.g. for non-commercial use only. Default licence for anything else in this repository is ASLv2 for the code, CC-BY 4.0 for data.

Work on this project has been in part kindly sponsored by the Medialab foundation (http://medialab.cz/), a Czech Technical University incubator. The rest of contributions by Petr Baudiš is licenced as open source via Ailao (http://ailao.eu/). (Ailao also provides commercial consulting, customization, deployment and support services.)

dataset-sts's People

Contributors

Stargazers

Watchers

Forkers

natsheh silvicek amoliu jfsantos vyskoto4 svundamati benjaminhess binbinbian protonish yanweifu alvaromorales peratham codeaudit hjk41 gali472 stevenlol anukat2015 arnavkj1995 yangliuy cuihengbin andradeandrey netankit angelo337 technobotz fancyerii rafat-islam1186 liuyang1123 beige90 wuzhongdehua kaidongyu mzdu phpmind colinmegill corbyrosset hemina robi56 lanseyege ye-lun fatmas1982 lumiqai abhishekkumarsingh schangpi chiranjibsuruf lc222 jeff19921021 praneetdutta vyraun 0xsimulacra fangzheng354 yukioichida cziszero tuan1101 locosoft1986 sindhuraelluri manjunath-s durgaprasd datagold2017 colinsongf mhjabreel royshan egaebel stephensebastin dconger owenljn jamesoneill12 lizihan021 uoops amitunix aa1607 zghzgx5 dhwajraj dattatrayshinde nidhinkrishnanv rikenshah w121211 silentflame ravi2493 reflen mldeveloper01 kyung-min connieshen0503 cryan2016 osamamukhtar11 yuminzhou threefoldo zhaosm fanglanting houzhenzhen nadaben josemarcosrf souravghosh97 plzdaye shadowridgedev contemn1 billaram nprathibha jeeveshn zacateras dsp6414 sabirdvd

dataset-sts's Issues

sts: Add 2016 gold standard

http://alt.qcri.org/semeval2016/task1/data/uploads/sts2016-english-with-gs-v1.0.zip

Experiment: Preinitialize layers by identity matrices

It is becoming popular to preinitialize matrices, especially projection matrices and MLP matrices) with identity. Recommended e.g. by the Maluuba guys in "A Parallel-Hierarchical Model for Machine Comprehension on Sparse Data".

Another part of this is taking a more serious look at relu again as a transfer function - of course with tanh() the identity will be a bit skewed even though repeated tanh() near zero doesn't have a big effect.

Argus dataset: Create validation set split

We need that for proper reporting, split from the training set.

Model: Bilinear form for attention

We should come back to attn1511 while trying out a bilinear form for attention rather than the dot product or elementwise weighed sum. This is basically an analog of our projection layer, and what MemNNs use for memory-level attention and what Danqi Chen, Jason Bolton and Christopher D. Manning report as quite helpful for the CNN/Daily Mail Reading Comprehension Task.

Model: Lexical (de)composition

We should include the model of http://arxiv.org/abs/1602.07019 - should be a relatively easy one, with probably some custom lambdas in the decomposition step.

Task: AskUbuntu is broken

Somehow, the AskUbuntu (asku) task is broken and the models don't get trained:

RunID: asku-avg--5e692e270bddb64b-00  ({"Ddim": "1", "balance_class": "False", "batch_size": "192", "deep": "0", "e_add_flags": "True", "embdim": "300", "epoch_fract": "0.25", "f_add_kw": "False", "fix_layers": "[]", "inp_e_dropout": "0.333333333333", "inp_w_dropout": "0", "l2reg": "1e-05", "loss": "<function ranknet at 0x995b5f0>", "mlpsum": "sum", "nb_epoch": "16", "nb_runs": "4", "nnact": "relu", "nninit": "glorot_uniform", "opt": "adam", "pact": "tanh", "pdim": "1", "prescoring": "None", "prescoring_input": "None", "prescoring_prune": "None", "project": "True", "ptscorer": "<function mlp_ptscorer at 0x99631b8>", "wact": "linear", "wdim": "1", "wproject": "False"})
Model
Training
Epoch 1/16
323637/323518 [==============================] - 384s - loss: 0.6949                                                          val mrr 0.471191
Epoch 2/16
323543/323518 [==============================] - 373s - loss: 0.6933                                                          val mrr 0.463537
Epoch 3/16
323620/323518 [==============================] - 374s - loss: 0.6932                                                          val mrr 0.455145
Epoch 4/16
323677/323518 [==============================] - 384s - loss: 0.6932                                                          val mrr 0.454692

Model: Attentive pooling

We should implement+benchmark the http://arxiv.org/abs/1602.03609 model of attentive pooling. I wonder if some DimShuffles will be enough for the "horizontal maxpooling" in the final composition step.

Clarification needed for pearsonobj function .

In the pearsonobj function (implemented in objectives.py) , class to score conversion is done on both y_true and y_pred and it is used as the loss function for the sts data set. But in sts data set the true values are floating point numbers so is it necessary to do class to score conversion(for y_true) here? If I am going wrong somewhere could you clarify the approach in pearsonobj you are using in context of sts dataset.

Model: "Sequential RNN with attention" SNLI-based models

For the SNLI task, models

are doing well. The basic idea is that we have two RNNs for the two sentences, but the second one is initialized by the output of the first one. Plus there is one-direction attention.

The models seem to be pretty incremental tweaks of each other, so it would be probably easiest to implement this as a single model with configurable features. Not sure how to coerce Keras to perform the initialization of the second RNN, though, might require Keras modifications.

Task (anssel): Make use of extra supervision

In the yoda datasets of anssel task, we have extra supervision in the form of binary markers for tokens that actually denote the answer. We should try to make use of this supervision during training by just passing that as another set of NLP-style token flags. It should be pretty easy (except that it'll of course break transfer learning; but there's a way to keep transfer learning working if we make some changes that include fixing N to original embedding size).

Dataset annotating both Semantic Similarity and Relatedness

Is there some dataset that has sentence pairs annotated for both semantic similarity and semantic relatedness?
Similarity here is 'same meaning', whereas relatedness is more general with similarity being one of the relationships between concepts.

I know there exists one for word pairs (WordSim), but is there such a dataset with sentence pairs?

Multi-level RNN: Check skip-layer connections

We didn't have much success with multi-level RNNs, but skip-layer connections represent an important innovation in that regard.

Task: AI2 8th grade dataset

This should be a hard one! Let's use our chios infrastructure to generate it from what AI2 publicly released.

Include chios output CSVs as v0 dataset
Add evaluation tools for the ABCD question choice (qid based accuracy) to hypev_eval
Benchmark models

Future plans: The winner models are now on github, use their output instead

Visualize individual RNN neurons

Add support for easy exploration of whether individual neurons are learning specific concepts - say similar as the heatmap table, but with extra javascript code that lets you quickly flip through highlighting based on individual dimensions rathre than the whole norm.

Model (rnn): Include skip-connections in multi-level

Skip-connections (TODO what's the reference?) mean connecting both input and previous layer in the inner layers of the RNN. rnnlevels>1 (introduced for anssel in (Wang+Nyberg, 2015)) didn't work great for us, but this might help.

Non-neural baselines

We should have some non-neural baselines.

TF-IDF baseline (maybe use the code in ubottu?)
BM25 baseline

anything else?

HypEv: Check Ddim=2, seems much better on MCTest

Accidentally discovered that mctest Ddim=2 results are much better than the Ddim=0 default.

Hyperparameter tuning using Spearmint

So far, we have support just for parameter tuning using random search (tools/anssel_tune.py). Since this search is pretty high dimensional, what we then sometimes do is look at what works in general and manually focus the parameters on that. But this is sort of inexact, and quite tiresome + boring too. We should use a smarter way to tune stuff!

I don't know if there's a better choice than https://github.com/JasperSnoek/spearmint .

(Using software allowing commercial usage etc. and ideally without CLA is pretty important to me.)

Model: Inner attention by Wang-Liu-Zhao

"Inner Attention based Recurrent Neural Networks for Answer Selection", attention applied before RNN is awesome, apparently. We could also try to "sandwich" attention in two layers of a multi-level RNN.

Model (general): Add support for TF-IDF ensembling

In the anssel task, it is semi-standard practice to ensemble the NN-based scores with TF-IDF-based scores in an additional logreg-like layer. That should be easy for us to do with the termfreq model.

Another approach that hasn't been done before but is eminently important from practical POV is to prerank by TF-IDF-like measure and then filter out just top N for NN scoring.

STS data set

For pairs of sentences that do not display the score , what is the similarity score ?
2015.test.tsv has 12250 pairs while 2015.train.tsv has 3000. Is that correct?

Task interface for the msr-para dataset

Python loading tools
tools/para_train.py
Python eval tools (accuracy, F-score)

Model: hypev-specific argus style relevance modelling

Model wrapper for relevance modelling in hypev selection.

Which keras version to use?

Thanks for the wonderful tool! But I got some errors when I tried one command in the readme: python tools/train.py cnn para data/para/msr/msr-para-train.tsv data/para/msr/msr-para-val.tsv

Using the latest version of keras (1.0.7), I found the following error:
ImportError: cannot import name LambdaMerge
I replaced all the LambdaMerge by Merge in blocks.py and re-run the command, another error appears:
Exception: Layer e0[0] does not support masking, but was passed an input_mask: Elemwise{neq,no_inplace}.0

Then I uninstall it and install keras 0.3.2, but another error occur.
AssertionError: Keyword argument not understood: dropout

Appreciate the help!

Model: Skip-thoughts

Implement a model that uses skip-thoughts (sentence-wide embeddings) to generate aggregate sentence representations. We probably want to just rely on an external component to precompute the representations. Not sure if we can meaningfully combine this with other architectures, but it should serve at least as a really strong baseline.

Check weights and gradients for anomalies

We are maybe a little careless in the way we train the more complex (CNN, RNN etc.) models, in that we should carefully check the gradients and also the actual weight matrices; for CNN, some papers renormalize weights if their norm is too large, for RNN something similar might be necessary. I suspect that since we are getting reasonable-looking results, it's probably not a crucial issue, nevertheless we might get some improvements from deeply understanding the practical progression of training in our models.

Keras deleted LambaMerge

Hi,

I am evaluating existing algorithms for paraphrasing and entailment. I wanted to run para.py, but I am unable to do so.

Keras have removed LambdaMerge and hence your code doesn't work. I am new to Keras and the topic.

Here is the link that talks about how to do lamda merge without it.
keras-team/keras#2342

Can you update your code?

thanks,

Port to Keras 1.0

We still depend on Keras 0.3.2 and its Graph model interface. Porting to Keras 1.0 functional interface is top priority.

Refactoring: Separate tools and tasks

We want fineval for all tasks, which is getting bothersome (I really don't want to implement a separate one for Ubuntu too...), and ubuntu_transfer_* also shows clear scaling limitations and we'd want a universal transfer script instead. Let's make a class interface and implement each task in that.

Work will happen in f/tasksep.

tasks/ for everything
tools/ for training
tools/ for evaluation
tools/ for tuning
tools/ for transfer learning
sts fineval per-test-split
phase out all per-task tools (possible edit notebooks... but maybe we can just ignore that until later)

Problem with evaluating "termfreq" model

Hi,

Thanks for this useful project.
There is an issue I guess in the evaluation of the termfreq model.

I'm running:
python3 tools/train.py termfreq anssel ./data/anssel/wang/train.csv ./data/anssel/wang/test.csv inp_e_dropout=1/2 nb_epoch=1

However, I get this error:

"... tools/pysts/eval.py", line 28, in binclass_accuracy rawacc = np.sum((ypred > 0.5) == (y > 0.5)) / ypred.shape[0]
TypeError: unorderable types: dict() > float()

The origin of this issue is that y_pred should be replaced by y_pred['score'] for this particular task (as the function predict in termfreq.py returns a dictionary).

This is also the case in the other function aggregate_s0.

Still after fixing this, I get another issue:

... /tools/pysts/eval.py", line 122, in mrr if yy[1] in ysd: TypeError: unhashable type: 'numpy.ndarray'

I appreciate your feedback on this issue or on whether I am running something incorrectly.

Limit embedding training to just randomly initialized embeddings

A common practice in neural NLP models is to have the embedding matrix adaptable, but only the portion of it that covers randomly initialized rather than preinitialized word embeddings. This might help overfitting.

Unfortunately, this is not completely straightforward in Keras. A possible idea would be to transform word indices to index tuples and have two embedding matrices, one fixed and another trainable. Or modify Keras to allow per-row trainability, but I don't know how hard that would be.

Progress tacker for conversion outlined below:

brmson / dataset-sts Goto Github PK

dataset-sts's Introduction

Semantic Text Similarity Dataset Hub

Software Tools

Datasets

"Paraphrasing" Task

"Semantic Text Similarity" Task

"Entailment" Task

"Answer Sentence Selection" Task

"Hypothesis Evidencing" Task

Other Datasets

Non-Redistributable Datasets

Non-free Datasets

Algorithm References

Licence and Attribution

dataset-sts's People

Contributors

Stargazers

Watchers

Forkers

dataset-sts's Issues

Recommend Projects

Recommend Topics

Recommend Org