
Instruction in the Wild: A User-based Instruction Dataset

News

We release InstructWild v2 under the data v2 directory, which includes over 110K high-quality user-based instructions. We did not use Self-Instruct to generate any of them. We also label a subset of these instructions with instruction type and special tags. Please see the README for details.

Introduction

Instruction tuning is a key component of ChatGPT. OpenAI tuned it on a user-based instruction dataset, but unfortunately this dataset is not open-sourced. Self-Instruct released a small seed set of 175 instructions written by humans. The Stanford Alpaca team then generated 52K instructions with the text-davinci-003 model from those 175 seed instructions.

This project targets a larger and more diverse instruction dataset. To this end, we collected instructions from shared ChatGPT usage (110K in the v2 dataset, 429 seed instructions in the v1 dataset) and released both English and Chinese versions. We found these instructions to be very diverse. For v1, we follow Alpaca to generate 52K instructions and their responses. All data can be found in the data and data v2 directories.

Note: This is an ongoing project. We are still collecting and improving the data. We release this dataset as early as possible to speed up our LLM research. We will also release a whitepaper soon.

Data Release

Our dataset uses the same format as Alpaca for fast and easy adoption. Our instructions have no input field, as illustrated below.
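
For illustration, a record follows the Alpaca schema with the input field left empty. The sketch below assumes a placeholder file name (`data/instinwild_en.json`); check the data directories for the actual file names.

```python
import json

# Each record follows the Alpaca schema; in InstructWild the
# "input" field is always empty (the instruction is self-contained).
example = {
    "instruction": "Write a short poem about the ocean.",
    "input": "",
    "output": "The waves roll in beneath a silver sky...",
}

# Load a released file (name assumed) and inspect the first record.
with open("data/instinwild_en.json", encoding="utf-8") as f:
    records = json.load(f)
print(records[0]["instruction"])
```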

Data Collection (InstructWild v1)

[Figure: data collection pipeline]

We scraped over 700 noisy instructions from Twitter and filtered out the noisy ones, keeping 429 clean instructions to ensure high quality.

We use a method similar to Alpaca's to collect instructions; however, we do not need outputs for the seed instructions, which avoids human involvement. The generated prompts are more diverse and cover more topics than Alpaca's.

We provide 5 prompts as examples for generating new instructions through the OpenAI API. After collecting the prompts, we collect responses to these instructions from the OpenAI API. The English and Chinese datasets are generated separately. In total, $880 was spent to collect the dataset. There are 52K instructions for English (around 24M tokens) and 52K instructions for Chinese.
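
As a rough sketch of this pipeline (not the exact script we used; the seed-prompt wording below is illustrative, and text-davinci-003 is assumed following the Alpaca setup), instruction generation and response collection with the pre-1.0 `openai` SDK could look like this:

```python
import openai  # pre-1.0 SDK; set openai.api_key before running

# A seed prompt in the spirit of our 5 example prompts
# (wording here is illustrative, not the actual prompt).
SEED_PROMPT = (
    "Here are some example instructions:\n"
    "1. Explain quantum computing in simple terms.\n"
    "2. Write a cover letter for a data analyst position.\n"
    "Generate 20 new, diverse instructions in the same style:"
)

# Step 1: generate new instructions from the seed examples.
gen = openai.Completion.create(
    model="text-davinci-003",
    prompt=SEED_PROMPT,
    max_tokens=1024,
    temperature=1.0,
)
instructions = [line.strip() for line in gen.choices[0].text.splitlines() if line.strip()]

# Step 2: collect a response for each generated instruction.
for inst in instructions:
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=inst,
        max_tokens=512,
    )
    print({"instruction": inst, "input": "", "output": resp.choices[0].text.strip()})
```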

How Good is InstructWild?

Colossal-AI used our dataset to train the ColossalChat model. ColossalChat-7B (after stage 1 only) is trained on the original Alpaca dataset combined with ours. We compare ColossalChat-7B against Alpaca-7B to see what improvement our dataset brings.
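
Since both datasets use the same Alpaca-style schema, combining them is a simple concatenation. A minimal sketch (file names are placeholders, not necessarily the names in either release):

```python
import json

# File names below are assumptions for illustration.
with open("alpaca_data.json", encoding="utf-8") as f:
    alpaca = json.load(f)
with open("instinwild_en.json", encoding="utf-8") as f:
    wild = json.load(f)

merged = alpaca + wild  # same schema, so a list concat suffices
print(f"{len(merged)} training examples")

with open("merged_data.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)
```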

It is difficult to evaluate chatbots, so we human-evaluated several examples under different categories of instructions. Our main findings are:

Pros

  • Our new dataset improves the model's ability on Generation, Open QA, and Mind Storm instructions. This matches our data collection process: the data comes from Twitter, where users tend to share their most interesting prompts, mostly of the generation, open-QA, and mind-storm types.

Limitations of LLaMA-finetuned models

  • Both Alpaca and ColossalChat are based on LLaMA, so it is hard to compensate for knowledge missing from the pre-training stage.
  • Lack of counting ability: cannot count the number of items in a list.
  • Lack of logic (reasoning and calculation).
  • Tendency to repeat the last sentence (failure to produce the end-of-text token).
  • Poor multilingual results: LLaMA is mainly trained on English data (generation performs better than QA).

Limitations of the dataset

  • Lack of summarization ability: no such instructions in the fine-tuning dataset.
  • Lack of multi-turn chat and role-playing: no such instructions in the fine-tuning dataset.
  • Lack of self-recognition: no such instructions in the fine-tuning dataset.
  • Lack of safety:
    • When the input contains false facts, the model makes up facts and explanations to match them.
    • Cannot abide by OpenAI's content policy: prompts generated through the OpenAI API always comply with that policy, so the dataset contains no violation cases for the model to learn to refuse.

Detailed Comparison

See HERE for a detailed comparison.

TODO

  • Dataset v1
  • Dataset v2
  • Fine-grained Labeling (v2)
  • Larger Dataset

Authors

This project is currently maintained by Jinjie Ni, Fuzhao Xue, Kabir Jain, Mahir Hitesh Shah, Zangwei Zheng, and Yang You.

We also acknowledge valuable suggestions from Prof. Aixin Sun and Dr. Tom Young.

Citation

Please cite this repository if you use its data or code.

@misc{instructionwild,
  author = {Jinjie Ni and Fuzhao Xue and Kabir Jain and Mahir Hitesh Shah and Zangwei Zheng and Yang You},
  title = {Instruction in the Wild: A User-based Instruction Dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/XueFuzhao/InstructionWild}},
}


instructionwild's Issues

Updates?

Hi, thank you for your work!

You mention working on a large update; is there an approximate timeline?

Structure of the training data

Hey guys,
I have two questions regarding the structure of your dataset. In the README you say that you use the "Alpaca" approach, where a sample is a triple of instruction, input, and output.
Why do you place the concrete instruction at the end and not at the beginning?
I couldn't find any samples where you provide a full triplet with instruction, input, and output. What is the reason for that?
More concretely, you use sample_seed.jsonl to create instructions; where do you store the input and output?

