This project forked from voidful/awesome-chatgpt-dataset

Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

License: GNU General Public License v3.0

awesome-chatgpt-dataset

| Dataset Name | Size | Languages | Source | License |
| --- | --- | --- | --- | --- |
| TheoremQA | 1K | English | We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics and Finance. | mit |
| lima | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |
| im-feeling-curious | 3K | English | This public dataset is an extract from Google's "i'm feeling curious" feature. To learn more about this feature, search for "i'm feeling curious" on Google. | - |
| Puffin | 3K | English | Puffin dataset. Exactly 3,000 examples with each response created using GPT-4. | - |
| cc_sbu_align | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |
| qa_feedback | 4K | English | We reconstruct the ASQA data and collect human feedback for it. We name the resulting dataset qa-feedback. | - |
| SLF5K | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
| blended_skill_talk | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |
| GSM-IC | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |
| ChatAlpaca | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |
| PKU-SafeRLHF-10K | 10K | English | PKU-SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences. | - |
| Dolly | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC BY-SA 3.0 |
| WebGPT | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |
| Code Alpaca | 20K | English | Code generation task involving 20,022 samples | - |
| openapi-function-invocations-25k | 25K | English | The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis. | mit |
| LongForm | 28K | English | The LongForm dataset is created by leveraging English corpus examples with augmented instructions. | The LongForm project is subject to an MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the license of language models (OPT, LLaMA, and T5). |
| HC3 | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and humans | - |
| Mol-Instructions | 48K | English | An open, large-scale biomolecular instruction dataset for large language models. | CC BY 4.0 |
| RefGPT | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
| arxiv-math-instruct-50k | 50K | English | Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories | - |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |
| Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | |
| Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC By NC 4.0; OpenAI terms of use |
| Alpaca Dataset | 52K | English | 175 seed instructions by OpenAI API | CC By NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - |
| Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
| Dynosaur | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
| Finance | 69K | English | 68,912 finance-related instructions | - |
| evol | 70K | English | This is the training data of WizardLM. | - |
| Vicuna Dataset | 75K | English | ~100k ShareGPT conversations | - |
| InstructionTranslation | 80K | Multi-lingual | Translations were generated by M2M 12B and the output generations were limited to 512 tokens due to VRAM limit (40G). | MIT |
| Self-Instruct | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |
| OASST1 | 89K | Multi-lingual | A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
| HH-RLHF | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |
| InstructionWild | 104K | English, Chinese | 429 seed instructions and follow Alpaca to generate 52K | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - |
| Tapir-Cleaned | 117K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | CC BY-NC 4.0 |
| WizardLM_evol_instruct_V2_196k | 143K | English | This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. | - |
| LLaVA Visual Instruct | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |
| Prosocial Dialog | 166K | English | 165,681 instructions produced by GPT-3 rewrites questions and human feedback | - |
| COIG | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |
| Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
| dromedary | 361K | English | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
| SHP | 385K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
| ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |
| ign_clean_instruct_dataset_500k | 509K | English | This dataset contains ~508k prompt-instruction pairs with high-quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment-focused responses or NSFW content. | apache-2.0 |
| ELI5 | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
| GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, StackOverflow Question, BigScience/P3 dataset. Answered by OpenAI API. | - |
| Instruct | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |
| MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
| Firefly | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |
| LaMini-Instruction | 3M | English | A total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
| Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
| BELLE | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |
| OIG-43M Dataset | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |
| xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
| CodeParrot | - | Python | The database was queried for all Python files less than 1MB in size, resulting in a 180GB dataset with over 20M files. | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | ODC-By |
| stack-exchange-paired | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
| LangChainDatasets | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
| ParlAI | - | English | 100+ popular datasets for dialogue models available in one place, from open-domain chitchat, to task-oriented dialogue, to visual question answering. | - |
| GPTeacher | - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |
| silk-road/Wizard-LM-Chinese-instruct-evol | - | Chinese | Wizard-LM-Chinese | - |
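
The Size column uses `K` and `M` suffixes, with `-` where no size is listed. To shortlist datasets by size, those cells can be converted to approximate counts. The following is a minimal sketch, not part of the original project: the helper names `parse_size` and `smaller_than` are hypothetical, and the sample rows are copied by hand from the table.

```python
from typing import List, Optional, Tuple

# Multipliers for the suffixes used in the Size column.
_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_size(size: str) -> Optional[int]:
    """Convert a Size cell such as '52K' or '3M' to an approximate count.

    Returns None when the cell is '-' or empty (size not listed).
    """
    text = size.strip().upper()
    if text in ("", "-"):
        return None
    if text[-1] in _SUFFIXES:
        return int(round(float(text[:-1]) * _SUFFIXES[text[-1]]))
    return int(text)


# (name, size, license) rows copied by hand from the table (hypothetical sample).
ROWS: List[Tuple[str, str, str]] = [
    ("TheoremQA", "1K", "mit"),
    ("ChatAlpaca", "10K", "Apache-2.0 license"),
    ("OASST1", "89K", "apache-2.0"),
    ("HH-RLHF", "91K", "MIT"),
    ("Natural Instructions", "5M", "-"),
    ("CodeParrot", "-", "-"),
]


def smaller_than(rows: List[Tuple[str, str, str]], limit: int) -> List[str]:
    """Return names of datasets whose listed size is below `limit`.

    Rows without a listed size are skipped rather than guessed.
    """
    names = []
    for name, size, _license in rows:
        count = parse_size(size)
        if count is not None and count < limit:
            names.append(name)
    return names


if __name__ == "__main__":
    print(smaller_than(ROWS, 50_000))
```

The counts are only as precise as the rounded values in the table, so treat them as a rough filter and check each dataset's own page for exact sizes and license terms.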
