lupantech / scienceqa

Data and code for NeurIPS 2022 Paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering".

License: MIT License

Python 97.57% Shell 2.43%


scienceqa's Issues

Sometimes the images are not related to the question at all

Hi expert, is it by design?
For example, the picture below shows only some green plants in small trays, but solving the problem actually depends on language understanding.
"180":{ "question":"Which of the following was a dependent variable in this experiment?", "choices":[ "the temperature of the heating pad", "the number of days until a seed germinated" ], "answer":1, "hint":"The passage below describes an experiment. Read the passage and think about the variables that are described.\n\nKenneth wanted to grow cucumbers from seeds. He read that using a heating pad to heat up potting soil could help make seeds germinate, or sprout, faster. Kenneth wondered whether the temperature of the heating pad would affect how quickly the seeds germinated.\nKenneth prepared two potting trays, each made up of ten small pots of soil. He planted one cucumber seed in each small pot and arranged the potting trays near a sunny window. He set an electric heating pad to 75\u00b0F and placed it under one potting tray. He set a second heating pad to 85\u00b0F and placed it under the other potting tray. Kenneth observed the pots daily, and he counted the number of days it took until a seed germinated in each pot.\nHint: An independent variable is a variable whose effect you are investigating. A dependent variable is a variable that you measure.\nFigure: germinating plants in a potting tray.", "image":"image.png", "task":"closed choice", "grade":"grade6", "subject":"natural science", "topic":"science-and-engineering-practices", "category":"Designing experiments", "skill":"Identify independent and dependent variables", "lecture":"Experiments have variables, or parts that change. You can design an experiment to find out how one variable affects another variable. For example, imagine that you want to find out if fertilizer affects the number of tomatoes a tomato plant grows. To answer this question, you decide to set up two equal groups of tomato plants. Then, you add fertilizer to the soil of the plants in one group but not in the other group. 
Later, you measure the effect of the fertilizer by counting the number of tomatoes on each plant.\nIn this experiment, the amount of fertilizer added to the soil and the number of tomatoes were both variables.\nThe amount of fertilizer added to the soil was an independent variable because it was the variable whose effect you were investigating. This type of variable is called independent because its value does not depend on what happens after the experiment begins. Instead, you decided to give fertilizer to some plants and not to others.\nThe number of tomatoes was a dependent variable because it was the variable you were measuring. This type of variable is called dependent because its value can depend on what happens in the experiment.", "solution":"", "split":"test" },

How did you limit the answer generation space to just the options you provided?

Hello,

I am trying to finetune MiniGPT-4 on a student engagement dataset. Its labels are image IDs and captions about the images. I managed to get it to perform okay on the dataset. However, it hallucinates or gives answers outside of what I want, since MiniGPT-4 is designed for free-form, open-ended VQA.

How did you limit the answer generation space to just the options you provided? Does the paper mention using a linear classifier at the end to limit the output tokens to just those answer options?

Tony,
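
One generic way to keep a free-form model's answer inside a fixed option set is to post-process the generated text and snap it to the closest option. The sketch below is not confirmed to be the method used in the paper; `snap_to_choice` is a hypothetical helper, and the similarity measure is plain string matching from the Python standard library rather than anything model-specific.

```python
import difflib


def snap_to_choice(generated, choices):
    """Return the index of the option that best matches free-form model output.

    Hypothetical helper for illustration: `generated` is the raw model text,
    `choices` is the list of allowed answer strings (must be non-empty).
    """
    text = generated.strip().lower()
    normalized = [c.strip().lower() for c in choices]

    # Prefer an option that appears verbatim in the output. The longest
    # match wins so a short option contained in a longer one cannot shadow it.
    verbatim = [i for i, c in enumerate(normalized) if c and c in text]
    if verbatim:
        return max(verbatim, key=lambda i: len(normalized[i]))

    # Otherwise fall back to fuzzy string similarity against every option.
    scores = [difflib.SequenceMatcher(None, text, c).ratio() for c in normalized]
    return max(range(len(choices)), key=scores.__getitem__)


choices = [
    "the temperature of the heating pad",
    "the number of days until a seed germinated",
]
print(snap_to_choice("I think it is the number of days until a seed germinated.", choices))
```

Because the function always returns a valid index, the final prediction can never fall outside the provided options, whatever the model generates.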

How come the results are different between the leaderboard and the original paper?

Thanks for your evaluation dataset. However, I found the evaluation results are different between the leaderboard and the original paper. You do not show their best two results in the leaderboards. You actually show the ablation results instead. Why?

Original paper link: https://arxiv.org/pdf/2302.00923v4.pdf

#5 @lupantech

About experiments code

Hey, I think your work is very meaningful, and I found in the paper that you experimented on the UnifiedQA model, but this part of the code is not currently available in the repo.
Will you release the code next?

Topic-level evaluation result

Hi, this is a great dataset. Thanks for the hard work.

Instead of subject-level result, I am interested in topic-level evaluation result such as biology, chemistry, etc, and thus I wonder:

How do I compute accuracy for topic level?
Do the models on the leaderboard provide topic-level results or provide the raw result files so that people can compute topic-level accuracy for them?

Thanks.
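
Topic-level accuracy can be computed by grouping test problems on the `topic` field that each record in problems.json carries (the sample record earlier on this page has `topic`, `answer`, and `split` keys). The following is a minimal sketch, assuming predictions are available as a mapping from question ID to predicted choice index; that prediction format and the function name `accuracy_by_field` are assumptions, not the repository's result-file format.

```python
from collections import defaultdict


def accuracy_by_field(problems, predictions, field="topic", split="test"):
    """Group problems by `field` and return accuracy per group.

    problems:    dict of qid -> problem record (as in problems.json)
    predictions: dict of qid -> predicted choice index (assumed format)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, problem in problems.items():
        if problem.get("split") != split:
            continue  # only score the requested split
        key = problem.get(field, "unknown")
        total[key] += 1
        # A missing prediction counts as wrong.
        if predictions.get(qid) == problem["answer"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy data for illustration only; real records come from problems.json.
problems = {
    "1": {"topic": "biology", "answer": 0, "split": "test"},
    "2": {"topic": "biology", "answer": 1, "split": "test"},
    "3": {"topic": "chemistry", "answer": 2, "split": "test"},
    "4": {"topic": "chemistry", "answer": 0, "split": "train"},
}
predictions = {"1": 0, "2": 0, "3": 2}
print(accuracy_by_field(problems, predictions))
```

Passing `field="subject"` or `field="category"` groups by those keys instead, so the same helper covers other granularities.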

Question about lecture

Hi expert, I've looked at problems.json, and the lecture does not look reasonable. How is the lecture generated? For example, for qid 5, the lecture is: "People can use the engineering-design process to develop solutions to problems. One step in the process is testing if a potential solution meets the requirements of the design. How can you determine what a test can show? You need to figure out what was tested and what was measured.\nImagine an engineer needs to design a bridge for a windy location. She wants to make sure the bridge will not move too much in high wind. So, she builds a smaller prototype, or model, of a bridge. Then, she exposes the prototype to high winds and measures how much the bridge moves.\nFirst, identify what was tested. A test can examine one design, or it may compare multiple prototypes to each other. In the test described above, the engineer tested a prototype of a bridge in high wind.\nThen, identify what the test measured. One of the criteria for the bridge was that it not move too much in high winds. The test measured how much the prototype bridge moved.\nTests can show how well one or more designs meet the criteria. The test described above can show whether the bridge would move too much in high winds."
However, the question is about Gordon's test.

Hugging Face dataset

Hey, awesome work! I wanted to make this more accessible by putting it on the Hugging Face Hub: https://huggingface.co/datasets/derek-thomas/ScienceQA

There were a lot of fields in the description card that I filled in as best as I could. Would you consider reviewing this and after it meets your expectations could you add a link on your github repo?

Thanks,
Derek

Add LLaMA-SciTune results to the leaderboard

Hi, it would be great to add LLaMA-SciTune (developed on top of the LLaVA architecture with scientific visual-language data) results to the leaderboard.

Google drive link

It seems that the files at the Google Drive link cannot be downloaded. Is there a OneDrive download link?

Question about "context" in the data set

First of all, thank you for open-sourcing a very good dataset.

Your official website shows an example figure, and I am a little confused about the "context" in the red box. I didn't find a matching key in the "problems.json" file you provided. Can you tell me which parts make up "context"?

I would like to incorporate your dataset into my multimodal work. I would be very grateful if you could reply.
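
For readers with the same question, here is a rough sketch of how a "context" string could be assembled from a record. It assumes the context corresponds to the `hint` field plus a textual description of the image; that mapping is a guess that the maintainers have not confirmed in this thread. `build_context`, the `caption` argument, the `Image:` label, and the `N/A` fallback are all hypothetical.

```python
def build_context(problem, caption=None):
    """Combine the textual hint and an optional image caption into one string.

    Assumption: "context" = problem["hint"] + a caption describing the image.
    The caption must come from elsewhere (e.g. a captioning model).
    """
    parts = []
    hint = (problem.get("hint") or "").strip()
    if hint:
        parts.append(hint)
    if caption and caption.strip():
        parts.append(f"Image: {caption.strip()}")
    # Fall back to a placeholder when a problem has neither hint nor caption.
    return "\n".join(parts) if parts else "N/A"


problem = {"hint": "The passage below describes an experiment.", "image": "image.png"}
print(build_context(problem, caption="germinating plants in a potting tray"))
```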

Questions about the process of dataset building

Thanks for your awesome work! It paves the way towards multimodal reasoning agents. I noticed that the questions are collected from IXL Learning. Since IXL Learning is a website, would you mind explaining in detail how you obtained the data from it, and how you processed the crawled data?

Thanks in advance :)

Request to add Honeybee results to the leaderboard

Hello,

Thank you for this great project.

Could you add the results of Honeybee? (code: https://github.com/kakaobrain/honeybee)

Both results are based on the 13B models.

Thanks!

import datasets

data = datasets.load_dataset('derek-thomas/ScienceQA', 'test')
for sample in data['test']:
    sample['image']

However, sample['image'] is a dictionary with the keys 'bytes' and 'path' rather than a PIL image, and I don't know how to process it.
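
A dictionary of that shape can usually be turned into a PIL image by decoding the raw bytes, or by opening the path when no bytes are present. The sketch below assumes Pillow is installed and that the field looks as described above (`'bytes'` holding encoded image data, `'path'` holding a file path, either possibly `None`); `to_pil` is a hypothetical helper name.

```python
import io


def to_pil(image_field):
    """Convert a {'bytes': ..., 'path': ...} dict into a PIL image.

    Returns None for text-only problems where the field is None.
    """
    if image_field is None:
        return None

    # Imported here so text-only samples can be handled without Pillow.
    from PIL import Image

    if isinstance(image_field, Image.Image):
        return image_field  # already decoded, nothing to do

    raw = image_field.get("bytes")
    if raw is not None:
        return Image.open(io.BytesIO(raw))

    path = image_field.get("path")
    if path:
        return Image.open(path)
    return None
```

Inside the loop above this would be used as `image = to_pil(sample['image'])`, followed by a `None` check for questions that have no image.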