webqna / webqa

License: Creative Commons Zero v1.0 Universal

Shell 100.00%

webqa's Introduction

News

Oct 15 Update: We decided to release the output files of our baseline models in case they are helpful for future investigations. Feel free to check them out!

Oct 9 Update: Please note that we've updated the image reading method from cv2 to PIL in the demo notebook. Setting ImageFile.LOAD_TRUNCATED_IMAGES = True is the key to avoiding the "Image NoneType error".

Download Data

Main Data

The main data is split into two files: one for train+val (36,766+4,966 samples) and the other for test (7,540 samples).

Images

The large image file is compressed and split into 51 chunks of 1 GB each. You can download all chunks at once by running this script.

To unzip and merge all chunks, run 7z x imgs.7z.001
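Before extracting, it can help to confirm that every chunk is present and non-empty, since a missing or truncated chunk is a common cause of decompression errors. The sketch below is only an illustration: it assumes the chunks sit in one directory and follow the usual 7z split naming (imgs.7z.001, imgs.7z.002, and so on), of which only imgs.7z.001 is confirmed above. The helper name `missing_chunks` is hypothetical.

```python
import os
import tempfile


def missing_chunks(directory, total=51, prefix="imgs.7z"):
    """Return the names of expected chunks that are absent or empty.

    Assumes chunks are named <prefix>.001 ... <prefix>.<total> (an assumption,
    not something the download script is known to guarantee).
    """
    missing = []
    for index in range(1, total + 1):
        name = f"{prefix}.{index:03d}"
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Toy demonstration: expect 3 chunks, but only chunks 1 and 3 exist.
demo_dir = tempfile.mkdtemp()
for index in (1, 3):
    with open(os.path.join(demo_dir, f"imgs.7z.{index:03d}"), "wb") as handle:
        handle.write(b"placeholder")

demo_missing = missing_chunks(demo_dir, total=3)
print(demo_missing)
```

For the real download, calling `missing_chunks(".")` in the download directory and re-fetching anything it lists would be one way to rule out incomplete chunks before running 7z.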

We also provide Google Drive download links.

You are all set once you have WebQA_train_val.json, WebQA_test.json, imgs.lineidx and imgs.tsv.
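The pairing of imgs.tsv with imgs.lineidx suggests random access by byte offset. The following is a minimal sketch under stated assumptions, not the official loader: it assumes imgs.lineidx stores one byte offset per line and that each line of imgs.tsv is `image_id<TAB>base64-encoded image`. Please check the demo notebook for the actual layout and for how an image id maps to a row. The function name `read_image_bytes` and the toy files are hypothetical.

```python
import base64
import os
import tempfile


def read_image_bytes(tsv_path, lineidx_path, row):
    """Seek to the row-th record of the TSV and return (image_id, raw bytes).

    Assumed layout (not confirmed by this README): the lineidx file has one
    byte offset per line; each TSV line is 'image_id<TAB>base64 payload'.
    """
    with open(lineidx_path) as index_file:
        offsets = [int(line) for line in index_file if line.strip()]
    with open(tsv_path, "rb") as tsv_file:
        tsv_file.seek(offsets[row])
        image_id, payload = tsv_file.readline().rstrip(b"\n").split(b"\t", 1)
    return image_id.decode(), base64.b64decode(payload)


# Build a tiny pair of files in the assumed layout to exercise the reader.
tmp_dir = tempfile.mkdtemp()
tsv_path = os.path.join(tmp_dir, "toy.tsv")
idx_path = os.path.join(tmp_dir, "toy.lineidx")
records = [("101", b"first image bytes"), ("102", b"second image bytes")]
with open(tsv_path, "wb") as tsv_file, open(idx_path, "w") as index_file:
    for image_id, payload in records:
        index_file.write(f"{tsv_file.tell()}\n")  # offset where this record starts
        tsv_file.write(image_id.encode() + b"\t" + base64.b64encode(payload) + b"\n")

result = read_image_bytes(tsv_path, idx_path, 1)
```

The decoded bytes could then be wrapped in `io.BytesIO` and opened with PIL, as the demo notebook does.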

Explore Data

Output Format (A json file with guids as keys)

{<guid>: {'sources': [<image_id>/<snippet_id>, ..., ],
          'answer': "xxxxxxx" },
 <guid>: {...},
 <guid>: {...},

}
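As an illustration of the format above, here is a minimal sketch that assembles such a file from model predictions. The guids, source ids and answers are placeholders, and the helper name `build_submission` is hypothetical; use whatever id types the dataset actually provides.

```python
import json


def build_submission(predictions):
    """Turn {guid: (sources, answer)} into the guid-keyed output format."""
    submission = {}
    for guid, (sources, answer) in predictions.items():
        submission[guid] = {"sources": list(sources), "answer": str(answer)}
    return submission


# Placeholder predictions: guid -> (selected source ids, generated answer).
predictions = {
    "guid_0001": (["image_id_1", "snippet_id_1"], "xxxxxxx"),
    "guid_0002": ([], "yyyyyyy"),
}

submission = build_submission(predictions)
serialized = json.dumps(submission, indent=2)
print(serialized)
```

Writing `serialized` to a file gives a JSON document keyed by guid, with a `sources` list and an `answer` string per entry.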

webqa's People

chaochun dangiankit stephen0808 ankitshah009 jacobswan1 shubhamphal tpavankalyan

webqa's Issues

Different snippet id but Same fact and url for text document

Hi, I'm a student looking at a dataset.

I took a look at the dataset and realized that there was data in the text document that had a different snippet id but the exact same fact and wiki url.

For example, in WebQA_train_val.json

{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bbd0e20dba11ecb1e81171463288e9_7"
}
{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bbd13c0dba11ecb1e81171463288e9_8"
}
{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bcc8440dba11ecb1e81171463288e9_14"
}

All three examples have different snippet_ids,
but the same fact: "The theme song of the 2008 Summer Olympics was "You and Me," which was composed by Chen Qigang, the musical director of the opening ceremony.",
and the same url: "https://en.wikipedia.org/wiki/2008_Summer_Olympics".

My understanding is that different text documents are given different snippet_ids.
If there is something I am missing or misunderstanding, I would appreciate it if you could let me know.

I haven't figured out if there are more examples like this, but I'd like to correct my misconceptions first.

Thank you for your help.
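For anyone who wants to check how widespread this is, here is a minimal sketch that groups snippets by (fact, url) and reports groups with more than one snippet_id. It only assumes each snippet is a dict with the "fact", "url" and "snippet_id" keys shown above; how to collect all snippets out of WebQA_train_val.json is left out, and the function name is hypothetical.

```python
from collections import defaultdict


def group_duplicate_snippets(snippets):
    """Return {(fact, url): [snippet_id, ...]} for facts that appear more than once."""
    groups = defaultdict(list)
    for snippet in snippets:
        groups[(snippet["fact"], snippet["url"])].append(snippet["snippet_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


fact = ('The theme song of the 2008 Summer Olympics was "You and Me," which was '
        "composed by Chen Qigang, the musical director of the opening ceremony.")
url = "https://en.wikipedia.org/wiki/2008_Summer_Olympics"
snippets = [
    {"fact": fact, "url": url, "snippet_id": "d5bbd0e20dba11ecb1e81171463288e9_7"},
    {"fact": fact, "url": url, "snippet_id": "d5bbd13c0dba11ecb1e81171463288e9_8"},
    {"fact": fact, "url": url, "snippet_id": "d5bcc8440dba11ecb1e81171463288e9_14"},
]

duplicates = group_duplicate_snippets(snippets)
for (dup_fact, dup_url), ids in duplicates.items():
    print(len(ids), "snippet ids share one fact from", dup_url)
```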

"This site can’t be reached" when clicking on "Download Data -> Main Data"

Hi,

I cannot connect to the server. I also tried to use wget from PSC, but it timed out. Could you please check the server status?

How to download the text collection for full retrieval?

Could you please provide the download URL for the text document collection?

Metrics

In your paper, you tested on image-based and text-based datasets. However, on the leaderboard this split is not so clear. Do the leaderboard results come from an average?

Not all examples contain positive images?

Hi, I went through WebQA_train_val.json and found that out of 41739 examples, only 21465 have positive image ids. Is this normal, or did I make a mistake during preprocessing?
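A count like this can be reproduced with a few lines. The sketch below is only illustrative: it assumes the loaded JSON is a dict keyed by guid and that positive images are stored as a list under a key such as "img_posFacts". That key name is an assumption here, so adjust it to whatever the file actually uses. The toy data is made up.

```python
def count_with_positive_images(dataset, key="img_posFacts"):
    """Count examples whose list under `key` is present and non-empty.

    `key` is an assumed field name; pass the real one for your data version.
    """
    return sum(1 for example in dataset.values() if example.get(key))


# Made-up toy data in the assumed shape: guid -> example dict.
toy_dataset = {
    "guid_a": {"img_posFacts": [{"image_id": 1}]},
    "guid_b": {"img_posFacts": []},
    "guid_c": {},
}

with_images = count_with_positive_images(toy_dataset)
print(with_images, "of", len(toy_dataset), "examples have positive images")
```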

Dataset JSON file doesn't have "Keywords_A" for any question

In WebQA_train_val.json, Keywords_A is missing for every query.
However, this key is used in your baseline evaluation and in Take_a_look_WebQA.ipynb.
Could you please share a version of the data that includes this key, so that we can evaluate our models better?

Is there any template for the short explanation paper?

To be included in the Neurips2021 Competition write-up, authors must provide a short explanation (~1 page) of what they did and what insights they discovered by Oct 29, 2021.

As stated on the webqna homepage, authors must submit a short explanation. Is there any template for this paper?

Dataset download failed

All 51 chunks were downloaded, but errors always occur during decompression.

The data downloaded from Google Drive is inconsistent with the demo

The data downloaded from Google Drive is inconsistent with the data displayed by Have_a_Look_WebQA.ipynb in the demo folder, and the content of Keywords_A is missing. Is there any way to download the data shown in the demo?

JSON file required but eval returns TSV

Hi, I really like your work and am trying to run some benchmark evaluations on WebQA.
From my understanding, I need to run vlp/eval.py, which gives me predictions in TSV format.
However, uploading to the server requires a JSON file, so I was wondering whether I am missing a step: is the conversion done for me, or do I have to create the JSON file manually? Thank you!
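If the conversion does have to be done by hand, something along these lines might work. This is a sketch under assumptions, not the repository's own tooling: the column layout of the TSV written by vlp/eval.py is not described here, so the header names "guid", "sources" (comma-separated ids) and "answer" are hypothetical and would need to be adapted to the real file.

```python
import csv
import io
import json


def tsv_to_submission(tsv_text):
    """Convert TSV predictions into the guid-keyed JSON output format.

    Assumes a header row with hypothetical columns: guid, sources, answer.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    submission = {}
    for row in reader:
        # Split the comma-separated ids, dropping empty entries.
        sources = [item for item in row["sources"].split(",") if item]
        submission[row["guid"]] = {"sources": sources, "answer": row["answer"]}
    return submission


# Made-up predictions in the assumed column layout.
example_tsv = (
    "guid\tsources\tanswer\n"
    "g1\timg_1,snip_2\tyes\n"
    "g2\t\tno\n"
)

converted = tsv_to_submission(example_tsv)
print(json.dumps(converted, indent=2))
```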