gpt-2-output-dataset

This dataset contains:

  • 250K documents from the WebText test set
  • For each GPT-2 model (trained on the WebText training set), 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation

We look forward to the research produced using this data!

Download

For each model, we have a training split of 250K generated examples, as well as validation and test splits of 5K examples each.

All data is located in Google Cloud Storage, under the directory gs://gpt-2/output-dataset/v1.

There, you will find the following files:

  • webtext.${split}.jsonl
  • small-117M.${split}.jsonl
  • small-117M-k40.${split}.jsonl
  • medium-345M.${split}.jsonl
  • medium-345M-k40.${split}.jsonl
  • large-762M.${split}.jsonl
  • large-762M-k40.${split}.jsonl
  • xl-1542M.${split}.jsonl
  • xl-1542M-k40.${split}.jsonl

where ${split} is one of train, test, or valid.
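
Each file is newline-delimited JSON. A minimal reader for one split, once downloaded (see below), might look like the following sketch; the assumption that each record stores its document under a "text" key is not stated in this README:

```python
import json

def load_split(path):
    # One JSON object per line; the "text" field name is an assumption.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]

webtext = load_split("data/webtext.test.jsonl")
print(len(webtext), webtext[0][:80])
```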

We've provided a script, download_dataset.py, to download all of them.
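
If you'd rather fetch individual files, here is a minimal sketch. It assumes the bucket is publicly readable over HTTPS at storage.googleapis.com, the standard public endpoint for a gs:// path; download_dataset.py is the authoritative version:

```python
import os
import requests

# Assumed public HTTPS endpoint for gs://gpt-2/output-dataset/v1
BASE = "https://storage.googleapis.com/gpt-2/output-dataset/v1"

def download(filename, dest_dir="data"):
    """Stream one dataset file to disk."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    with requests.get(f"{BASE}/{filename}", stream=True) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                f.write(chunk)
    return dest

download("webtext.test.jsonl")
download("xl-1542M-k40.test.jsonl")
```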

Detectability baselines

We're interested in seeing research on the detectability of GPT-2 model family generations.

We've provided a starter baseline in baseline.py, which trains a logistic regression detector on TF-IDF unigram and bigram features; a sketch of this approach appears after the table. The baseline achieves the following accuracies:

Model   Temperature 1   Top-K 40
117M    88.29%          96.79%
345M    88.94%          95.22%
762M    77.16%          94.43%
1542M   74.31%          92.69%
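
Here is a minimal sketch of such a detector using scikit-learn. It illustrates the approach rather than reproducing baseline.py; the file paths and the load_split helper from the reader sketch above are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Real WebText documents are the positives; GPT-2 samples the negatives.
real_train = load_split("data/webtext.train.jsonl")
fake_train = load_split("data/xl-1542M-k40.train.jsonl")
real_valid = load_split("data/webtext.valid.jsonl")
fake_valid = load_split("data/xl-1542M-k40.valid.jsonl")

# TF-IDF over unigrams and bigrams, as in the baseline description.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X_train = vectorizer.fit_transform(real_train + fake_train)
y_train = [1] * len(real_train) + [0] * len(fake_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

X_valid = vectorizer.transform(real_valid + fake_valid)
y_valid = [1] * len(real_valid) + [0] * len(fake_valid)
print("validation accuracy:", clf.score(X_valid, y_valid))
```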

Initial Analysis

Unsurprisingly, shorter documents are harder to detect, and performance improves gradually with length. Detection accuracy on short documents of around 500 characters (a long paragraph) is about 15% lower.
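
One way to observe this length effect, reusing the names from the detector sketch above (purely illustrative):

```python
# Truncate validation documents to 500 characters and re-score;
# accuracy should drop noticeably relative to full documents.
texts_valid = real_valid + fake_valid
X_short = vectorizer.transform([t[:500] for t in texts_valid])
print("500-char accuracy:", clf.score(X_short, y_valid))
```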

Truncated sampling, which is commonly used for high-quality generations from the GPT-2 model family, shifts the part-of-speech distribution of the generated text relative to real text. A clear example is the underuse of proper nouns and the overuse of pronouns, which are more generic. This shift contributes to the 8% to 18% higher detection rate of Top-K 40 samples compared to random samples across models.
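
A rough way to see this shift is to compare proper-noun and pronoun rates with spaCy part-of-speech tags. This is a sketch, not part of the released code; it assumes the en_core_web_sm model is installed and reuses load_split from above:

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

def pos_rates(texts, n=200):
    """Fraction of tokens tagged as proper nouns vs. pronouns."""
    counts, total = Counter(), 0
    for doc in nlp.pipe(texts[:n]):
        counts.update(tok.pos_ for tok in doc)
        total += len(doc)
    return {"PROPN": counts["PROPN"] / total, "PRON": counts["PRON"] / total}

print("webtext: ", pos_rates(load_split("data/webtext.test.jsonl")))
print("top-k 40:", pos_rates(load_split("data/xl-1542M-k40.test.jsonl")))
```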

Data removal requests

If you believe your work is included in WebText and would like us to remove it, please let us know at [email protected].
