Giter Site home page Giter Site logo

zalo-vietnamese-wiki-qa's Introduction

Vietnamese Wikipedia Question Answering

Table of contents

Problem statement

Given a question, related paragraphs from a Wikipedia article (in the shuffle order) in Vietnamese, the task is finding paragraph which answers the question of each test case.

This challenge is categorized into the set of basic Question-Answering problems. We need to build a system that automatically answer questions posed by humans in a natural language. But this challenge is simpler, we just need to detect the answer provided for the question is good (which can fully answer the question) or bad (which cannot satisfy the question).

This is the challenge provided by Zalo in the Zalo AI Challenge 2019.

Data acquisition

We have found three datasets which are closed to the requirements of this challenge.

  • First, we have the dataset provided by Zalo - the host of the contest. This dataset provides 18108 rows of question-answer pairs.

  • Second, we have the dataset provided by Stanford University called SQuAD 2.0 (the Stanford Question-Answering Dataset). This is the standard dataset for the Question-Answering problems for English language. We intend to translate this English data into Vietnamese. This dataset provides 141979 rows of question-answer pairs.

  • Finally, we have the dataset called Vietnamese SQuAD, which is the combination of two smaller datasets. The first one is MLQA (MultiLingual Question Answering) provided by Facebook Research, which contains a part of SQuAD 2.0 that is manually translated into Vietnamese. The second one is the Vietnamese dataset provided by Long Mai. This combination dataset provides 9460 rows of question-answer pairs.

Approaches

We have tried many different baseline models but most of models includes two steps: features extraction and classification.

  • With feature extraction, we use: Bag of words, Bag of words + NGrams, Tf-Idf, Tf-Idf + NGrams, Word2Vec and Average Word2Vec.

  • With classification, we use: Naive Bayes, KNN, Rocchio, Random Forest, Extra Trees, Linear SVC, LightGBM, XGBoost, CRNN + Attention.

Evaluation

Model Macro-F1 score
KNN 0.6495627076335737
Rocchio 0.6524710046296189
Naive Bayes 0.6575807856090050
Extra Trees 0.7004486941326811
LinearSVC 0.7073252865333168
LightGBM 0.7181613875112443
CRNN + Attention 0.7668151799465259
BERT 0.79 and higher

The current optimal approach for this problem is BERT, which is a Transformer-based machine learning technique for NLP pre-training developed by Google. Referencing the Github repository namnv1113, they use the BERT model and get the approximately 0.79 of F1 score. This is currently also the optimal approach for most of Question-Answering problems.

This year (2020), VinAI have released their new researcher based on BERT called PhoBERT. This is the pre-trained models for Vietnamese. We have not tried to used this pre-trained models but we believe that the performance can even higher than the previous one.

zalo-vietnamese-wiki-qa's People

Contributors

newluminous avatar quocanuetcs avatar

Stargazers

Lưu Văn Đức Thiệu avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.