otanadzetsotne / paraphrase_datasets

Paraphrase Datasets: notes on and links to datasets that can be used to train sentence paraphrase models.

dataset paraphrase paraphrase-generation paraphrased-data paraphrasing

paraphrase_datasets's Introduction

Paraphrasing datasets

GLUE (General Language Understanding Evaluation benchmark)

Home page ->

tensorflow ->

github ->
MRPC (Microsoft Research Paraphrase Corpus)

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically retrieved from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent.

Home page ->

Download ->
CoLA (The Corpus of Linguistic Acceptability)

The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

Home page ->

Download ->
QQP (Quora Question Pairs)

The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether the two questions in a pair are semantically equivalent.

Kaggle ->
STS (The Semantic Textual Similarity Benchmark)

The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.

Home page ->

Download ->
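
Because STS pairs carry a graded score rather than a yes/no label, they usually need a conversion step before they can be mixed with binary paraphrase data. The sketch below shows one possible conversion. The function names and the threshold of 4.0 are assumptions made for illustration; they are not part of the benchmark.

```python
def normalize_score(score: float, max_score: float = 5.0) -> float:
    """Map a 0-to-5 similarity score onto the [0, 1] range."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return score / max_score


def to_binary_label(score: float, threshold: float = 4.0) -> int:
    """Return 1 if the score reaches the assumed paraphrase threshold, else 0."""
    return int(score >= threshold)


# Made-up scores, only to exercise the two helpers.
scores = [0.0, 2.5, 4.0, 5.0]
normalized = [normalize_score(s) for s in scores]
labels = [to_binary_label(s) for s in scores]
print(normalized, labels)
```

A lower threshold keeps more pairs as positives at the cost of noisier paraphrases, so the value is worth tuning against the target task.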
PAWS (Paraphrase Adversaries from Word Scrambling)

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order for paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.

paper ->

github ->

Download (Wiki) (labeled) ->

Download (Wiki) (labeled, swap-only) ->
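
To see why word order matters for pairs like these, here is a minimal sketch that compares two sentences with a bag-of-words measure. The sentence pair is illustrative and is not read from the PAWS files, and `jaccard` is a helper defined here for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two sentences."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


# Illustrative pair: same words, different order, different meaning.
s1 = "flights from New York to Florida"
s2 = "flights from Florida to New York"

overlap = jaccard(s1, s2)                              # identical word sets
same_order = s1.lower().split() == s2.lower().split()  # but the order differs
print(overlap, same_order)
```

The overlap is maximal even though the sentences are not paraphrases, so a model that relies on word overlap alone cannot separate such pairs.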
PAWS-x

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.

github ->

Download ->
PIT (Paraphrase and Semantic Similarity in Twitter)

Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.

github ->
SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.

Home page ->

Paper ->

Download ->
TURL (Twitter News URL Corpus)

Requires Access

The Twitter News URL Corpus is described by its authors as the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and as the first cross-domain benchmark for automatic paraphrase identification.

github ->
CQADupStack

CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.

Home page ->

github ->

Download ->
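
Duplicate annotations can be turned into positive paraphrase pairs. The sketch below assumes a hypothetical in-memory mapping from a question id to the ids it is marked as a duplicate of. This layout and the `duplicate_pairs` helper are assumptions for illustration and do not reflect the interface of the script shipped with CQADupStack.

```python
from typing import Dict, List, Tuple

# Hypothetical annotations: question id -> ids of questions it duplicates.
duplicates: Dict[str, List[str]] = {
    "q3": ["q1"],
    "q4": ["q1", "q2"],
    "q5": [],
}


def duplicate_pairs(dups: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Collect each duplicate relation once as a sorted (id, id) pair."""
    pairs = set()
    for question_id, targets in dups.items():
        for target_id in targets:
            # Sort the two ids so (a, b) and (b, a) count as the same pair.
            pairs.add(tuple(sorted((question_id, target_id))))
    return sorted(pairs)


pairs = duplicate_pairs(duplicates)
print(pairs)
```

Negative pairs would have to be sampled separately, for example from questions in the same subforum that are not marked as duplicates.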
Paralex

Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

Home page ->

Download ->
Benchmark for Neural Paraphrase Detection

This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.

Home page ->

Download ->
