

DrQA Chinese implementation

Introduction

This is a modified version of Facebook's DrQA module that supports the Chinese language. This repository is for study purposes only. The module can be used to answer questions over any given context. Initial tuning targets questions about one specific university. The project is neither fully tested nor complete.

DrQA Introduction

DrQA is a system for reading comprehension applied to open-domain question answering. In particular, DrQA is targeted at the task of "machine reading at scale" (MRS). In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (that may not be redundant). Thus the system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).

Our experiments with DrQA focus on answering factoid questions while using Wikipedia as the unique knowledge source for documents. Wikipedia is a well-suited source of large-scale, rich, detailed information. In order to answer any question, one must first retrieve the few potentially relevant articles among more than 5 million, and then scan them carefully to identify the answer.

Note that DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, DrQA can be straightforwardly applied to any collection of documents, as described in the retriever README.

Installation

This is a modified version of Facebook's DrQA module and is for study only. To install it, first install PyTorch following the instructions at pytorch.org, then run setup.py in a Python 3 environment (3.5 and 3.6 both work well). Note that the setup may overwrite an existing installation of Facebook's DrQA. If any requirements are missing, install them with pip. Next, install CoreNLP with the Chinese package following the official CoreNLP instructions; you can specify the classpath either in the environment or in the file drqa\tokenizers\Zh_tokenizer.py. Finally, download the vectors and training sets to start working.
Download link: Data, secret: 232d
Merge the downloaded drqa folder with the original drqa folder; the download contains common data files and zh_dict.json for Chinese-English translation.
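
As a minimal sketch of the "classpath in the environment" option, the helper below appends a CoreNLP directory to the Java classpath from Python before any tokenizer is started. It assumes the environment variable is named CLASSPATH (as in upstream DrQA) and that the CoreNLP jars sit in a single directory; the function name and the example path are hypothetical and not part of this project.

```python
import os


def add_corenlp_to_classpath(corenlp_dir):
    """Append '<corenlp_dir>/*' to the CLASSPATH environment variable.

    Assumption: the tokenizer reads the CLASSPATH variable. The trailing
    '*' is the standard Java wildcard that matches every jar in a directory.
    """
    entry = os.path.join(corenlp_dir, "*")
    current = os.environ.get("CLASSPATH", "")
    # Split the existing value and drop empty pieces.
    parts = [p for p in current.split(os.pathsep) if p]
    if entry not in parts:
        parts.append(entry)
    os.environ["CLASSPATH"] = os.pathsep.join(parts)
    return os.environ["CLASSPATH"]


if __name__ == "__main__":
    # Hypothetical install location; replace with your own CoreNLP folder.
    print(add_corenlp_to_classpath("/opt/corenlp"))
```

Calling the helper twice with the same directory leaves the variable unchanged, so it is safe to run at the top of a script.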

Structures

/data : stores all the data  
    /vector 
    /'training set'
    /'db' : retriever db
    /'module' : saved module
    ...
/drqa : main modules
    /features : common features file shared in project
    /pipline : connects the reader and the retriever
        drqa.py : original pipeline manager
        simpleDrQA.py : a simple version of the pipeline manager
    /reader : reader module
        ...
    /retriever : retriever module
        ...
        net_retriever.py : retrieves context by querying a search engine (Baidu) and using the search results as context
    /tokenizers : tokenizer features
        /Zh_tokenizer.py : CoreNLP Chinese tokenizer (select it with the option '--tokenizer zh')
        /zh_features.py : common Chinese features
        ...
/scripts : common command-line scripts
    ...
    /pipline
        sinteractive.py : interactive session that uses the simple DrQA agent
        ...

Files shared with the original project are not listed here; see Facebook's DrQA for details.

Features

Please see Facebook's DrQA module for the design of the features.
As a Chinese implementation of the original module, this project provides full Chinese support with the full set of Chinese linguistic tags. The Chinese lemma tag is replaced with an English translation. All expressions are passed through Chinese normalization (symbols, simplified and traditional characters).
It includes functions for various Chinese feature transformations:

  1. simplified to traditional
  2. Chinese to pinyin
  3. Chinese number to number
  4. SBC case to DBC case (full-width to half-width characters)
  5. common symbol transformation
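
The SBC-to-DBC step in the list above can be illustrated with a short sketch. This is not the project's own code (which this README does not show); it relies only on the Unicode layout, where the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII and the ideographic space U+3000 corresponds to an ordinary space. The function name is hypothetical.

```python
def fullwidth_to_halfwidth(text):
    """Convert full-width (SBC) characters to half-width (DBC) ones.

    Characters outside the full-width ASCII block, such as Chinese
    ideographs, are returned unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic space -> ASCII space.
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants -> plain ASCII.
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


if __name__ == "__main__":
    print(fullwidth_to_halfwidth("ＤｒＱＡ　２０１７？"))  # DrQA 2017?
```

Normalizing width before tokenization means that a question typed with a full-width question mark and one typed with an ASCII question mark reach the model as the same string.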

The module embeds a mapping of common words (abbreviation <-> full spelling, etc.).
It provides a simple context scoring function for better answer ranking.
It provides a simple context retriever (works with the Baidu search engine).
It provides a parsed and tested training set (based on WebQA) and word embeddings (60 and 200 dimensions), along with a testing module.
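
The abbreviation-to-full-spelling mapping mentioned above can be sketched as a dictionary lookup applied to the question before retrieval. The table below is a hypothetical two-entry example, and the function name is an assumption; the project's actual mapping and code are not shown in this README.

```python
import re

# Hypothetical sample table, for illustration only.
ABBREVIATIONS = {
    "西交": "西安交通大学",
    "西安交大": "西安交通大学",
}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace each known abbreviation in text with its full spelling.

    A single regex pass is used so that already expanded text is never
    scanned again; longer keys are tried first so that a short key
    cannot match inside a longer one.
    """
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


if __name__ == "__main__":
    print(expand_abbreviations("西交图书馆的全名?"))
```

With this sample table, the abbreviated university name in the question is rewritten to its full name, which is the kind of rewrite visible in the "question after filting" log line of the result example.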

Result

Example output from sinteractive.py:

>>> process("西交图书馆的全名?", doc_n=1, net_n=3)
09/27/2017 04:45:38 PM: [ [question after filting : 西安交通大学图书馆的全名? ] ]
09/27/2017 04:45:39 PM: [ [retreive from net : 3 | expect : 3] ]

...

09/27/2017 04:45:43 PM: [ [retreive from db] ]
=================raw text==================
...侧,目前为工程训练中心、实验室及艺术庭院.西安交大图书馆北楼始建于1961年7月,共三层,建筑面积11200平方米,是和老教学主楼一并设计建设“中苏风格”建筑,风格朴实宏伟.和北楼相连的南楼建筑面积18000平方米,于1991年3月投入使用,地上13层,地下2层.设计上南楼保留了北楼的设计元素,外形呈金字塔形,被部分同学们戏称为“铁甲小宝”.图书馆南楼顶部有报时的大钟,报时音乐为“东方红”,2010年曾改为**名曲“茉莉花”,后因国际形势变化改回“东方红”.1995年,图书馆经钱学森本人同意及中宣部批准改名钱学森图书馆,并由时任****总书记、国家主席***题写馆名.现今该图书馆拥有阅览座位3518席,累计藏书522.8万册(件),报刊10053种,现刊4089种. ===================================

....

图书馆南楼顶部有报时的大钟,报时音乐为“东方红”,2010年曾改为**名曲“茉莉花”,后因国际形势变化改回“东方红” 1995年,图书馆经钱学森本人同意及中宣部批准改名钱学森图书馆,并由时任****总书记、国家主席***题写馆名
========
answer :钱学森图书馆
answer score : 0.0819935
context score : 9.164698700898501
Time: 12.1489

Trained on the WebQA training dataset, the reader reaches a 65% exact match rate on the validation set.
The retriever module and the full pipeline have not been evaluated (our document set is far from complete, and the retriever module seems to perform poorly). How the retrieved context is processed is vital to final performance.
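
For readers unfamiliar with the exact match metric quoted above, the following is a simplified sketch of how such a rate can be computed. It only strips surrounding whitespace before comparing; the normalization actually used by the evaluation scripts may differ, and the function name and sample data are hypothetical.

```python
def exact_match_rate(predictions, references):
    """Return the fraction of predictions that equal a reference answer.

    predictions: list of predicted answer strings.
    references: list of lists; each inner list holds the accepted
    answers for the question at the same position.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, golds in zip(predictions, references):
        # A prediction counts if it matches any accepted answer exactly.
        if pred.strip() in {g.strip() for g in golds}:
            hits += 1
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["钱学森图书馆", "北京"]
    golds = [["钱学森图书馆"], ["上海"]]
    print(exact_match_rate(preds, golds))  # 0.5
```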

License

DrQA_cn is BSD-licensed based on DrQA.

The training set, WebQA, is licensed by Baidu. This dataset is released for research purposes only. Copyright (C) 2016 Baidu.com, Inc. All Rights Reserved.
