otanadzetsotne / paraphrase_datasets

Paraphrase Datasets: notes on and links to datasets that can be used to train sentence paraphrase models.

dataset paraphrase paraphrase-generation paraphrased-data paraphrasing

Paraphrasing datasets

  • GLUE (General Language Understanding Evaluation benchmark)

    Home page ->

    tensorflow ->

    github ->

  • MRPC (Microsoft Research Paraphrase Corpus)

    The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically retrieved from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent.

    Home page ->

    Download ->

  • CoLA (The Corpus of Linguistic Acceptability)

    The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

    Home page ->

    Download ->

  • QQP (Quora Question Pairs)

    The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether the two questions in a pair are semantically equivalent.

    Kaggle ->

  • STS (The Semantic Textual Similarity Benchmark)

    The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.

    Home page ->

    Download ->

  • PAWS (Paraphrase Adversaries from Word Scrambling)

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset.

    paper ->

    github ->

    Download (Wiki) (labeled) ->

    Download (Wiki) (labeled, swap only) ->

  • PAWS-x

    This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.

    github ->

    Download ->

  • PIT (Paraphrase and Semantic Similarity in Twitter)

    Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.

    github ->

  • SciTail

    The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.

    Home page ->

    Paper ->

    Download ->

  • TURL (Twitter News URL Corpus)

    Requires Access

    According to its authors, Twitter News URL Corpus is the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmark for automatic paraphrase identification.

    github ->

  • CQADupStack

    CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.

    Home page ->

    github ->

    Download ->

  • Paralex

    Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

    Home page ->

    Download ->

  • Benchmark for Neural Paraphrase Detection

    This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.

    Home page ->

    Download ->
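
The PAWS entry above stresses word order. As a rough illustration (a sketch, not part of any listed dataset's tooling), the snippet below computes a bag-of-words Jaccard overlap for two sentences. The sentence pair and the `tokenize` and `jaccard_overlap` helpers are made up for this example. The point is only that an order-insensitive measure gives a perfect score to a word-swapped pair whose meaning differs.

```python
def tokenize(sentence: str) -> list[str]:
    """Lowercase and split on whitespace (deliberately naive)."""
    return sentence.lower().split()


def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two sentences."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical pair: identical words, different meaning.
s1 = "flights from New York to Florida"
s2 = "flights from Florida to New York"

overlap = jaccard_overlap(s1, s2)
same_order = tokenize(s1) == tokenize(s2)
print(overlap, same_order)  # prints: 1.0 False
```

A model trained only on order-insensitive features would likely label such a pair as a paraphrase, which is the failure mode the word-scrambled pairs are meant to expose.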

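Some of the datasets above attach a binary label to each sentence pair (for example MRPC and QQP), while STS attaches a similarity score from 0 to 5. A minimal sketch of bringing both into one record type follows. The `SentencePair` class, the `from_sts_score` and `normalize_sts_score` helpers, the example sentences, and the 4.0 cut-off are all assumptions made for illustration; none of the datasets prescribes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One training example: two sentences and a paraphrase flag."""
    first: str
    second: str
    is_paraphrase: bool


def normalize_sts_score(score: float) -> float:
    """Map an STS score from the 0..5 range onto 0..1."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"STS score out of range: {score}")
    return score / 5.0


def from_sts_score(first: str, second: str, score: float,
                   threshold: float = 4.0) -> SentencePair:
    """Turn a graded STS pair into a binary pair.

    The threshold is an assumed cut-off, not part of the benchmark.
    """
    normalize_sts_score(score)  # reuse the range check
    return SentencePair(first, second, score >= threshold)


# Hypothetical rows, not taken from any dataset.
close = from_sts_score("A man is playing a guitar.",
                       "A man plays the guitar.", 4.6)
far = from_sts_score("A man is playing a guitar.",
                     "A woman is slicing an onion.", 0.4)
print(close.is_paraphrase, far.is_paraphrase)  # prints: True False
```

Where the cut-off sits is a design choice: a higher threshold yields fewer but cleaner positive pairs for paraphrase training.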