allenai / natural-instructions

Expanding natural instructions

Home Page: https://instructions.apps.allenai.org/

License: Apache License 2.0


natural-instructions's Introduction

A Repository of Language Instructions for NLP Tasks

TLDR; this repository maintains a community effort to create a large collection of tasks and their natural language definitions/instructions. Check the releases for the summary of the latest changes and additions to the tasks.
If you have any suggestions to improve the data, let us know. We're looking for more contributions to make this data better and bigger! 🙌

News Bulletin

  • May 2022: We released several models trained on our data. Check out the code and checkpoints.
  • April 2022: A paper on our data is out!
  • October 15, 2021: the goal date for our v2 dataset.
    • The community has contributed over 1500 tasks! 🎉
    • We are working on cleaning up the new tasks and publishing a paper summarizing our new findings!
    • You can still submit new tasks! The new tasks will be part of the future data releases.
  • Sept 2021: general call for contributions is out!
  • June 2021: we initiated this repository with 61 tasks!

Background

Why define tasks in natural language?

While the current dominant paradigm (supervised learning with task-specific labeled examples) has been successful in building task-specific models, such models can't generalize to unseen tasks; for example, a model that is supervised to answer questions cannot solve a classification task. We hypothesize that a model equipped with the ability to understand and reason over natural language instructions should be able to generalize to any task that can be defined in terms of natural language.

Any empirical evidence that this might be true?

In our earlier effort, we built a smaller dataset (61 tasks) and observed that language models benefit from language instructions, i.e., their generalization to unseen tasks improves when they are provided with instructions.
Also, generalization to unseen tasks improves as the model is trained on more tasks.

Why build this dataset?

We believe that our earlier work is just scratching the surface and there is probably much more that can be studied in this setup. We hope to put together a much larger dataset that covers a wider range of reasoning abilities. We believe that this expanded dataset will serve as a useful playground for the community to study and build the next generation of AI/NLP models. See this blog post for a summary of the motivation behind this work.

Task schema

Each task consists of input/output instances. For example, think of the task of sentiment classification:

  • Input: I thought the Spiderman animation was good, but the movie disappointed me.
  • Output: Mixed

Here is another example from the same task:

  • Input: The pumpkin was one of the worst that I've had in my life.
  • Output: Negative

Additionally, each task contains a task definition:

Given a tweet, classify it into one of 4 categories: Positive, Negative, Neutral, or Mixed.

Overall, each task follows this schema:

Or, if you're comfortable with json files, here is how it looks:

{
  "Contributors": [""],
  "Source": [""],
  "URL": [""],
  "Categories": [""],
  "Reasoning": [""],
  "Definition": [""],
  "Input_language": [""], 
  "Output_language": [""],
  "Instruction_language": [""],  
  "Domains": [""],    
  "Positive Examples": [ { "input": "", "output": "",  "explanation": ""} ], 
  "Negative Examples": [ { "input": "", "output": "",  "explanation": ""} ],
  "Instances": [ { "id": "", "input": "", "output": [""]} ],
}
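
For a quick illustration, here is a minimal Python sketch for loading and inspecting a task file (the file name is just the example from the naming guideline below; any file under tasks/ works):

import json

# Load one task file (example file name; any file under tasks/ works)
with open("tasks/task001_quoref_question_generation.json", encoding="utf-8") as f:
    task = json.load(f)

print(task["Definition"])                     # the natural language instructions
print(task["Positive Examples"][0]["input"])  # a demonstrated input
print(len(task["Instances"]))                 # number of input/output pairs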

How to contribute

We would appreciate any external contributions! 🙏 You can contribute in a variety of ways.

  • If you think an important task is missing, you can contribute it via Pull-Request. You can also get inspiration from the task suggestions in the GitHub issues, which you can sign up to work on.
  • If you have any other suggested tasks but you're not sure if they're a good fit, bring them up in the issues.
  • If you have any questions or suggestions, please use the issues feature.
  • If you're adding a new task, make sure to review the following guidelines:
    • Each task must contain a .json file that contains the task content. You can look inside the tasks/ directory for several examples.
      • Make sure that your json is human readable (use proper indentation; e.g., in Python: json.dumps(your_json_string, indent=4, ensure_ascii=False))
      • Make sure that your json file is not bigger than 50MB.
      • Make sure your task has no more than 6.5k instances (input/output pairs).
      • Each instance must have a unique id, which should be the task number plus a string generated by uuid.uuid4().hex. E.g., task1356-bb5ff013dc5d49d7a962e85ed1de526b. (A sketch of generating these ids follows these guidelines.)
      • Make sure to include task category and domains, based on this list.
      • Make sure to number your task json correctly
        • Look at the task number in the latest pull request; the task number in your submission should be the next number.
        • Make sure to include the source dataset name and the task type when naming your task json file.
          • You can use this format: taskabc_<source_dataset>_<task_type>.json E.g. in task001_quoref_question_generation.json, the source dataset is quoref and the task is question generation.
      • Note that the source need not necessarily be a dataset; it can be a website, e.g., leetcode.
        • If you have created the json without any reference, use synthetic in place of source.
      • You should have one pull request per dataset. Name your pull request as Task Name <start_task_number>-<end_task_number>.
      • If you're building your tasks based on existing datasets and their crowdsourcing templates, see these guidelines.
    • Add your task to our list of tasks.
    • To make sure that your addition is formatted correctly, run the tests: > python src/test_all.py
      • To only test the formatting of a range of tasks, run > python src/test_all.py --task <begin_task_number> <end_task_number>. For example, running > python src/test_all.py --task 5 10 will run the test from task005 to task010.
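
As referenced in the guidelines above, here is a minimal sketch of generating instance ids and writing a task file in the expected format; the task number, field values, and output file name are placeholders, not real data:

import json
import uuid

TASK_NUMBER = 1356      # placeholder; use the next free task number
MAX_INSTANCES = 6500    # guideline limit on the number of input/output pairs

instances = [
    {
        "id": f"task{TASK_NUMBER}-{uuid.uuid4().hex}",
        "input": "I thought the Spiderman animation was good, but the movie disappointed me.",
        "output": ["Mixed"],
    },
    # ... more instances, up to MAX_INSTANCES ...
]
assert len(instances) <= MAX_INSTANCES

task = {
    "Contributors": ["..."],
    "Source": ["..."],
    "Definition": ["Given a tweet, classify it into one of 4 categories: Positive, Negative, Neutral, or Mixed."],
    # ... remaining schema fields ...
    "Instances": instances,
}

# Human-readable, UTF-8 friendly output, as required by the guidelines above.
with open(f"task{TASK_NUMBER}_sourcedataset_tasktype.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(task, indent=4, ensure_ascii=False))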

Benchmarking cross-task generalization

As introduced in our paper, this dataset can be used for a systematic study of cross-task generalization, i.e., training on a subset of tasks and evaluating on the remaining unseen ones. To make comparison among different methods easier, we created an official split here, as described in the paper. You can follow the instructions to set up your experiments.

We also released our experiment code and checkpoints for reproducibility and future research.

License

All the data here (except the instances of each task) are released under the Apache-2.0 license. The instances of each task are subject to the license under which the original dataset was released. This license information is available under the "Instance License" field within each task file.

Misc.

If you want to use Natural Instructions v1, here's the code: link

Feel free to cite us.

@inproceedings{naturalinstructions,
  title={Cross-task generalization via natural language crowdsourcing instructions},
  author={Mishra, Swaroop and Khashabi, Daniel and Baral, Chitta and Hajishirzi, Hannaneh},
  booktitle={ACL},
  year={2022}
}
@inproceedings{supernaturalinstructions,
  title={Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks},
  author={Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and others},
  booktitle={EMNLP},
  year={2022}
}


natural-instructions's Issues

zest: how to formulate

@yeganehkordi asks:

I'm extracting tasks from the zest dataset for the natural-instructions-expansion project. We should combine the questions given in groups to write new, natural-sounding questions for the combination task. Here is an example:

"input": [
                        "What college did PRESIDENT attend?",
                        "Where did PRESIDENT meet his wife?"
],
"output": "Did PRESIDENT meet his wife in college?"

Since the test doesn't accept lists as input, is there any special formatting in this case, or should I concatenate the questions?

Make sure to follow the schema outlined in the main readme: input should be a string and output should be a list of equally valid responses.
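
For instance, a hedged sketch of how the instance above could be rewritten to fit that schema (the joining format and task number are only illustrative):

import uuid

questions = [
    "What college did PRESIDENT attend?",
    "Where did PRESIDENT meet his wife?",
]

instance = {
    "id": f"task9999-{uuid.uuid4().hex}",                   # placeholder task number
    "input": " ".join(questions),                           # a single string
    "output": ["Did PRESIDENT meet his wife in college?"],  # list of equally valid answers
}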

In terms of what tasks to include, we should follow the annotations template/results here:
https://github.com/allenai/zest/tree/master/mturk-templates

FYI @swarooprm

naming of the task files

I suggest we change the naming so that we don't have a task number in the file name (which creates a lot of headaches now that we have many submissions). The new naming would be: {source}_{task}.json

Is the CODAH dataset a good addition?

I'm currently looking at the CODAH dataset that was listed in the spreadsheet (https://github.com/Websail-NU/CODAH). Is this a good addition? The paper states that this dataset was adversarially constructed so GPT-1 and BERT would struggle, but after looking at the dataset, I'm pretty confident that current large SOTA models (GPT-3, T5, RoBERTa, etc.) will crush this. Should I still go ahead and format this json file?

list of task types

  • Have a readme file containing all the task categories and their descriptions.
  • The test script should ensure that the provided category type is included in the above readme file (a rough sketch follows this list).
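
A rough sketch of what that check could look like (the path and format of the categories readme are assumptions, not the real layout):

import json
import sys

# Assumed layout: one category per line in a categories readme (hypothetical path).
with open("doc/task_categories.md", encoding="utf-8") as f:
    allowed = {line.strip("-• \n").lower() for line in f if line.strip()}

# Check a single task file passed on the command line.
with open(sys.argv[1], encoding="utf-8") as f:
    task = json.load(f)

for category in task["Categories"]:
    assert category.lower() in allowed, f"Unknown category: {category}"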

Randomization issue in tests

@nrjvarshney reports that there is possibly some randomization in the testing script. Sometimes an error appears, sometimes it does not. Also, sometimes all tests pass locally, but on the repo the tests fail.

Define an NER task

  • The task definition would have to define a list of tags (these could be coarse labels, such as Person or Location, or fine-grained ones like president, politician, or doctor).
  • Each instance can be: what is the semantic type of the mention x in the following sentence: y? (A rough sketch follows this list.)
  • There are a couple of datasets out there we can use for this: https://github.com/juand-r/entity-recognition-datasets
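
A minimal sketch of turning tagged data into such instances (the question wording, label, and task number are illustrative only):

import uuid

def make_ner_instance(sentence, mention, label, task_number=9999):
    """Build one instance asking for the semantic type of a mention (hypothetical task number)."""
    return {
        "id": f"task{task_number}-{uuid.uuid4().hex}",
        "input": f'What is the semantic type of the mention "{mention}" in the following sentence: {sentence}',
        "output": [label],
    }

example = make_ner_instance("Barack Obama visited Berlin.", "Barack Obama", "Person")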

Allow multiple task definitions

Currently, we support only a single task definition.
It might be a good idea to have a list of task definitions, in case people add different task definitions to the data in the future.
And some of these task definitions could be various forms of "reframed" task definitions.

Multiple possible outputs for one input

Hi,
If the "input"s are the same for several data points, but there're several possible outputs, should they be merged into one data point with multiple "output"s?

Code Templates For Formatting Datasets!

Hey all, I decided to post all of my code for the tasks I have contributed so far here in a public repository on my account. They follow a pretty structured and neat approach (besides the one for Detoxifying LMs since that was my first contribution when I was getting the hang of things). I hope they help! Feel free to clone and ask any questions.

preventing search engine crawlers

We should add headers to the files so that internet crawlers won't crawl our data.
This will prevent future language models from pre-training on this data.
We may want to get some inspiration from Big-bench on this.

Repeated instances found in the tasks

  • @kurbster It looks like task 77, merged in #15, has some repeated instances. For example, check the following input, which appears twice:
Step 1: For each row in Movie table, find the corresponding rows in Rating table.
Step 2: find each value of director in the results of step 1 along with the minimum stars of the corresponding rows to each value

Note that the two outputs corresponding to this input are different. Assuming that both these outputs are valid outputs, you'd need to group them to form one single instance (one input and two distinct valid outputs).
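
A minimal sketch of such grouping (function and variable names are illustrative):

from collections import defaultdict

def group_duplicate_inputs(instances):
    """Merge instances that share the same input into one instance listing all distinct outputs."""
    outputs = defaultdict(list)
    ids = {}
    for inst in instances:
        ids.setdefault(inst["input"], inst["id"])  # keep the first id seen for each input
        for out in inst["output"]:
            if out not in outputs[inst["input"]]:
                outputs[inst["input"]].append(out)
    return [
        {"id": ids[text], "input": text, "output": outs}
        for text, outs in outputs.items()
    ]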

Please send a PR with the fix.

Parsing error in tasks from Natural Instructions

  1. Categories need to be fixed: they should be 'question generation', 'answer generation', etc., and not just 'generation'. The readme contains the correct categories.
  2. All the negative examples are missing (e.g., task 039 in comparison with https://instructions.apps.allenai.org/dataset_viewer?file=subtask039_qasc_find_overlapping_words.json); we need to add those to various tasks inside the tasks folder.
  3. Do we need to add the prefixes "things to avoid", "emphasis", etc. while adding content to the definition? Maybe we can remove those.

Subtasks based on dialog datasets

Hi. Are the sub-tasks created from a dialogue dataset in the scope of this project? For instance, utterance-level or dialogue-level classification tasks with natural instructions? I cannot find any other tasks created in this way. Can you share an example?

I would like to add our recent work at NAACL 2021 (https://aclanthology.org/2021.naacl-main.254.pdf). Our dataset contains negotiation dialogues between two participants (MTurkers) and associated strategy annotations.

Potential sub-tasks can include predicting the negotiation strategy used by participants from a given utterance (and some context) and predicting participant satisfaction given a complete dialogue.

Let me know! I would be happy to work towards including these subtasks for this project.

Should we separate it to different tasks for different topics in one dataset?

The AFS (argument facet similarity) dataset has argument pairs with similarity scores on three different topics (gun control, death penalty, gay marriage). They are all basically the same task, but the topics they discuss are different. I was thinking that I should combine them into one argument similarity task, but they provided their data in separate files, and they also separated all of their evaluation results for each topic (see https://arxiv.org/pdf/1709.01887.pdf Table 4). So I wonder if they are concerned about some domain differences and if I should also separate them into three different tasks. Thanks

Human evaluation of the tasks

Purpose

We need to ensure the quality of the presented tasks.
One way to do this is to ask human annotators (crowd workers) to read our instructions and answer them.

Stages

I am assuming that we're gonna use AMTI for this: https://github.com/allenai/amti

  • Have a crowdsourcing template with place-holders for the instructions / positive examples / negative examples.
  • Have a script that takes the task name as a parameter and spits out a subset of it in an appropriate format (a rough sketch follows this list). The resulting file should have all the placeholders needed for the template file.
  • Evaluate a few of our tasks with this architecture to make sure that we have a reasonable pipeline.
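
A rough sketch of that script (the tasks path, subset size, and placeholder names are assumptions):

import json
import random
import sys

# Hypothetical usage: python sample_for_crowdsourcing.py task001_quoref_question_generation.json
task_name = sys.argv[1]
with open(f"tasks/{task_name}", encoding="utf-8") as f:
    task = json.load(f)

# One row per sampled instance, with the placeholders the template needs.
rows = [
    {
        "definition": " ".join(task["Definition"]),
        "positive_examples": json.dumps(task["Positive Examples"], ensure_ascii=False),
        "negative_examples": json.dumps(task["Negative Examples"], ensure_ascii=False),
        "input": inst["input"],
    }
    for inst in random.sample(task["Instances"], k=min(20, len(task["Instances"])))
]
print(json.dumps(rows, indent=2, ensure_ascii=False))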

Text transformations

Pretty much any of the text transformations here can be described in plain English and hence can be included as part of natural instructions.

Add test to remove instance repetition

Whenever samples are being created automatically using code, there is a higher chance of repetition among instances. Add a test to check and remove that.
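
A minimal sketch of such a check (how it would plug into src/test_all.py is an assumption):

from collections import Counter

def check_no_repeated_inputs(task):
    """Fail if any input text appears in more than one instance of a task."""
    counts = Counter(inst["input"] for inst in task["Instances"])
    repeated = [text for text, n in counts.items() if n > 1]
    assert not repeated, f"{len(repeated)} inputs appear more than once"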
