
webqa's Introduction

News

Oct 15 Update: We have decided to release the output files of our baseline models in case they are helpful for future investigations. Feel free to check them out!

Oct 9 Update: Please note that we've updated the image reading method in the demo notebook from cv2 to PIL. Setting ImageFile.LOAD_TRUNCATED_IMAGES = True is the key to avoiding the "Image NoneType error".


Download Data

The main data is split into two files: one for train+val (36,766+4,966 samples) and the other for test (7,540 samples).

  • Images

The large image file is compressed and split into 51 chunks of 1 GB each. You can download all chunks at once by running this script.

To unzip and merge all chunks, run 7z x imgs.7z.001

We also provide Google Drive download links.

You are all set once you have WebQA_train_val.json, WebQA_test.json, imgs.lineidx and imgs.tsv.


Explore Data

Output Format (A json file with guids as keys)

{<guid>: {'sources': [<image_id>/<snippet_id>, ..., ],
          'answer': "xxxxxxx" },
 <guid>: {...},
 <guid>: {...},

}
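The output format above can be produced with a few lines of Python. This is a minimal sketch, not the official submission tooling: the function names `build_submission` and `write_submission`, the tuple layout of `predictions`, and the example guid and source ids are all hypothetical and used only for illustration.

```python
import json


def build_submission(predictions):
    """Build a dict keyed by guid from (guid, source_ids, answer) tuples.

    The tuple layout is an assumption made for this sketch; adapt it to
    however your model stores its predictions.
    """
    submission = {}
    for guid, source_ids, answer in predictions:
        submission[guid] = {
            "sources": list(source_ids),  # image ids and/or snippet ids
            "answer": answer,
        }
    return submission


def write_submission(predictions, path):
    """Serialize the submission dict to a JSON file at `path`."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_submission(predictions), f, ensure_ascii=False, indent=2)


# Example with placeholder ids (not real WebQA guids):
# write_submission([("guid_0", ["snippet_0", "image_0"], "xxxxxxx")], "output.json")
```

Passing `ensure_ascii=False` keeps any non-ASCII characters in the answers readable in the file.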

webqa's People

Contributors

webqna, zdxdsw


webqa's Issues

Different snippet ids but same fact and url for text documents

Hi, I'm a student looking at a dataset.

I took a look at the dataset and noticed that some text documents have different snippet ids but exactly the same fact and wiki url.

For example, in WebQA_train_val.json

{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bbd0e20dba11ecb1e81171463288e9_7"
}
{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bbd13c0dba11ecb1e81171463288e9_8"
}
{
    "title": "2008 Summer Olympics",
    "fact": "The theme song of the 2008 Summer Olympics was \"You and Me,\" which was composed by Chen Qigang, the musical director of the opening ceremony.",
    "url": "https://en.wikipedia.org/wiki/2008_Summer_Olympics",
    "snippet_id": "d5bcc8440dba11ecb1e81171463288e9_14"
}

All three examples have different snippet_ids,
but the same fact: "The theme song of the 2008 Summer Olympics was "You and Me," which was composed by Chen Qigang, the musical director of the opening ceremony.",
and the same url: "https://en.wikipedia.org/wiki/2008_Summer_Olympics".

My understanding is that different text documents are given different snippet_ids.
If there is something I am missing or misunderstanding, I would appreciate it if you could let me know.

I haven't figured out if there are more examples like this, but I'd like to correct my misconceptions first.

Thank you for your help.
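One way to check how widespread this is would be to group text facts by their (fact, url) pair and list every group with more than one snippet_id. The sketch below assumes only the four fields shown in the example above; the function name `find_duplicate_facts` is hypothetical, and how the fact dicts are collected from WebQA_train_val.json is left to the reader.

```python
from collections import defaultdict


def find_duplicate_facts(facts):
    """Group fact dicts by (fact, url) and return groups with several snippet ids.

    `facts` is any iterable of dicts that have "fact", "url" and
    "snippet_id" keys, as in the example above.
    """
    groups = defaultdict(list)
    for item in facts:
        groups[(item["fact"], item["url"])].append(item["snippet_id"])
    # Keep only (fact, url) pairs that appear under more than one snippet id.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

The returned dict maps each repeated (fact, url) pair to the list of snippet ids that share it, so its length gives the number of duplicated facts.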

Metrics

In your paper, you tested on Img-based and Txt-based datasets. However, on the leaderboard, this split is not so clear. Is the leaderboard result an average of the two?

Not all examples contain positive images?

Hi, I went through WebQA_train_val.json and found that out of 41739 examples only 21465 have positive image ids. Is this normal, or did I make a mistake during preprocessing?
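A count like the one above can be reproduced with a short check. This is only a sketch under two assumptions that the text here does not confirm: that the JSON file is a dict mapping guids to samples, and that positive images are stored as a list under a key named `img_posFacts`. The function name is hypothetical.

```python
def count_with_positive_images(data, key="img_posFacts"):
    """Count samples whose `key` field is present and non-empty.

    `data` is assumed to map guid -> sample dict; `key` is an assumed
    field name and may need to be changed to match the actual file.
    """
    return sum(1 for sample in data.values() if sample.get(key))


# Usage, assuming the file has been downloaded:
# import json
# with open("WebQA_train_val.json", encoding="utf-8") as f:
#     data = json.load(f)
# print(count_with_positive_images(data), "of", len(data))
```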

Dataset JSON file doesn't have "Keywords_A" for any question.

In WebQA_train_val.json, Keywords_A is missing for every query.
But I see this key used in your baseline evaluation and also in the Take_a_look_WebQA.ipynb file.
Kindly help with this: please share a data version that has this key, so that we can evaluate our models better.
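To confirm which fields a downloaded copy actually contains, one can tally how many samples carry each top-level key. The sketch below assumes the file is a dict mapping guids to sample dicts, which is not confirmed here; `key_coverage` is a hypothetical helper name.

```python
from collections import Counter


def key_coverage(data):
    """Return a Counter mapping each top-level key to the number of samples that have it.

    `data` is assumed to map guid -> sample dict.
    """
    coverage = Counter()
    for sample in data.values():
        coverage.update(sample.keys())
    return coverage


# Usage, assuming `data` was loaded with json.load:
# cov = key_coverage(data)
# print(cov["Keywords_A"], "of", len(data), "samples have Keywords_A")
```

A key that never appears simply has a count of 0, since `Counter` returns 0 for missing entries.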

Is there a template for the short explanation paper?

To be included in the NeurIPS 2021 Competition write-up, authors must provide a short explanation (~1 page) of what they did and what insights they discovered by Oct 29, 2021.

The above is stated on the webqna homepage. Is there a template for this short explanation paper?

The data downloaded from Google Drive is inconsistent with the demo

The data downloaded from Google Drive is inconsistent with the data displayed by Have_a_Look_WebQA.ipynb in the demo folder, and the content of Keywords_A is missing. Is there any way to download the data displayed in the demo version?

JSON file required but eval returns TSV

Hi, I really like your work and I am trying to evaluate some benchmarks on WebQA.
From my understanding, I need to run the vlp/eval.py file, which gives me predictions in TSV format.
However, uploading to the server requires a JSON file, so I was wondering whether I am missing a step: I do not know whether the conversion is done for me or whether I have to create the JSON file manually. Thank you!
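If the conversion does have to be done by hand, it might look like the sketch below. The column layout of the TSV written by vlp/eval.py is not described here, so this assumes a hypothetical layout of three tab-separated columns: guid, comma-separated source ids, and the answer. Adjust the indices to the real file. The function name `tsv_to_submission` is also hypothetical.

```python
import csv
import json


def tsv_to_submission(tsv_file):
    """Convert TSV rows into the {guid: {"sources": [...], "answer": ...}} format.

    Assumed (hypothetical) columns: guid, comma-separated sources, answer.
    """
    submission = {}
    # QUOTE_NONE keeps quotation marks inside answers as literal characters.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        guid, sources, answer = row[0], row[1], row[2]
        submission[guid] = {
            "sources": [s for s in sources.split(",") if s],
            "answer": answer,
        }
    return submission


# Usage with placeholder file names:
# with open("predictions.tsv", encoding="utf-8", newline="") as f:
#     result = tsv_to_submission(f)
# with open("submission.json", "w", encoding="utf-8") as f:
#     json.dump(result, f, ensure_ascii=False)
```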
