azfarniaz / communityquestionanswering

This project forked from whiskeyromeo/communityquestionanswering

Task 3: SemEval 2017 Community Question Answering

Python 84.38% Perl 10.62% Java 1.92% Batchfile 0.11% Shell 1.67% JavaScript 0.20% HTML 0.12% CSS 0.24% Raku 0.74%

communityquestionanswering's Introduction

SemEval 2017 Task 3


Note

Code for the project and the final results may be found in the FinalProject directory. Additionally, there is over 130MB of unlabeled questions and comments that I crawled from QatarLiving, which may be of use to those studying the topic in the future. The crawler itself is no longer functional because the QatarLiving site layout has changed; however, the compressed, unlabeled data may be found here.

Introduction

Community forums are gaining popularity as a way to pose questions and receive honest and open answers. These forums are rarely moderated, so anyone can ask or respond to a question. The lack of moderation has advantages: users can post anything they want, which results in some well-thought-out responses. However, this openness has a downside: people also post responses that are not relevant to the question asked. Ranking the comments by their relevance to the question saves the user from sifting through hundreds of responses. Further, providing a list of similar questions gives the user a bank of comments that may contain the answer they are seeking.

The data in SemEval Task 3 comes from Qatar Living, a forum where users can post questions about life in Qatar and receive responses from the community. We are focused on subtasks A and B, which are question-comment similarity and question-question similarity. During this first stage, we have primarily focused on question-question similarity. Our goals for the next stage of the project are to improve the accuracy of the question-question similarity and to work on question-comment similarity.

Method

To find related questions, we first built a term frequency-inverse document frequency (TF-IDF) matrix in which each column is a question and each row is a vocabulary term. We then tried two ways of representing the questions: Doc2Vec, and latent semantic indexing (LSI) applied to the TF-IDF matrix. LSI gave us slightly better results than Doc2Vec. Finally, we calculated the cosine similarity between the vectors produced by LSI, which gave us a score for each pair of questions. From these scores, we were able to rank which questions were most similar to a given question.
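The pipeline described above (TF-IDF, then LSI, then cosine similarity) can be sketched as follows. This is a minimal illustration and not the code in the FinalProject directory: it assumes NumPy is available, uses a smoothed IDF weighting and a plain truncated SVD for the LSI step, and the function names, the toy questions, and the choice of `k=2` latent dimensions are all hypothetical.

```python
import math
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_matrix(docs):
    """Build a TF-IDF matrix with vocabulary terms as rows and documents as columns."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    row_of = {term: i for i, term in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    matrix = np.zeros((len(vocab), n_docs))
    for col, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            # Smoothed IDF (an assumption; other weightings are common).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            matrix[row_of[term], col] = count * idf
    return matrix, vocab


def lsi_document_vectors(matrix, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Scale the top-k right singular vectors; the result has one row per document.
    return (np.diag(s[:k]) @ vt[:k]).T


def rank_by_cosine(doc_vectors, query_index):
    """Return (index, similarity) pairs for all other documents, most similar first."""
    query = doc_vectors[query_index]
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
    sims = (doc_vectors @ query) / np.where(norms == 0, 1.0, norms)
    order = [int(i) for i in np.argsort(-sims) if int(i) != query_index]
    return [(i, float(sims[i])) for i in order]


# Toy questions (hypothetical, not taken from the Qatar Living data).
questions = [
    "Where can I renew my visa in Doha?",
    "How do I renew a residence visa in Doha?",
    "Best restaurants for seafood near the Corniche?",
    "Which seafood restaurants are good in Doha?",
]

matrix, vocab = tfidf_matrix(questions)
vectors = lsi_document_vectors(matrix, k=2)
ranking = rank_by_cosine(vectors, query_index=0)
for index, score in ranking:
    print(f"{score:.3f}  {questions[index]}")
```

On real data the number of latent dimensions would be much larger than 2 and would need tuning; a library such as gensim or scikit-learn would normally replace the hand-written TF-IDF and SVD steps.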

Tasks

From the SemEval Site:

Our main CQA task, as in 2016, is: “given (i) a new question and (ii) a large collection of question-answer threads created by a user community, rank the answer posts that are most useful for answering the new question.”

Additionally, we propose two sub-tasks:

[1] Question Similarity (QS): given the new question and a set of related questions from the collection, rank the similar questions according to their similarity to the original question (with the idea that the answers to the similar questions should be answering the new question as well).

[2] Relevance Classification (RC): given a question from a question-answer thread, rank the answer posts according to their relevance with respect to the question.


SubTasks

  • Subtask A: Question-Comment Similarity:

    Given a question and its first 10 comments in the question thread, rerank these 10 comments according to their relevance with respect to the question.

  • Subtask B: Question-Question Similarity:

    Given a new question (aka original question) and the set of the first 10 related questions (retrieved by a search engine), rerank the related questions according to their similarity with respect to the original question.

  • Subtask C: Question-External Comment Similarity (this is the main English subtask):

    Given a new question (aka the original question), the set of the first 10 related questions (retrieved by a search engine), each associated with its first 10 comments appearing in its thread, rerank the 100 comments (10 questions x 10 comments) according to their relevance with respect to the original question.

  • Subtask E: Multi-Domain Duplicate Detection (CQADupStack Task): Identify duplicate questions in StackExchange.

    Given a new question (aka the original question) and a set of 50 candidate questions, rerank the 50 candidate questions according to their relevance with respect to the original question, and truncate the result list in such a way that only "PerfectMatch" questions appear in it.
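All of the subtasks above reduce to reranking a fixed candidate list by a relevance score, and the duplicate detection subtask additionally truncates the list. A minimal sketch of that final step is shown below, assuming some model has already produced one score per candidate. The candidate IDs, the scores, and the fixed `threshold` used for truncation are hypothetical; the task description does not prescribe how the cut-off is chosen.

```python
def rerank(candidates, scores):
    """Return candidates sorted by descending score (ties keep their input order)."""
    paired = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [candidate for candidate, _ in paired]


def rerank_and_truncate(candidates, scores, threshold):
    """Rerank, then keep only candidates scoring at least `threshold`.

    The threshold stands in for a "PerfectMatch" decision and is an assumption.
    """
    paired = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [candidate for candidate, score in paired if score >= threshold]


# Hypothetical candidate IDs and model scores.
related = ["R1", "R2", "R3", "R4"]
similarity = [0.12, 0.87, 0.45, 0.91]

reranked = rerank(related, similarity)
kept = rerank_and_truncate(related, similarity, threshold=0.8)
print(reranked)
print(kept)
```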


Links

Directly Relevant to the Competition

Former SemEval Projects

communityquestionanswering's People

Contributors

whiskeyromeo, kleiner617

Watchers

James Cloos, Azfar Niaz
