
wikiqa's Introduction

WikiQA

My Question-Answering System: uses Wikipedia articles relevant to the question to generate answers.

Built on the Flask framework; the interface looks like this:

home

Components

wikiQA
│  app.py
│  stopwords.txt
│  tools.py
│  v2.db  database storing the wiki articles
│  
├─model  models needed for ranking (TODO: could be merged into a single map)
│      corpus.pkl
│      doc2idx.pkl
│      sklearn_tfidf.pkl
│      tfidf_matrix.pkl
│      
├─scripts  scripts for building the database and preprocessing
│  │  __init__.py
│  │  
│  └─retriever
│          build_db.py
│          build_tdidf.py
│          convert.py
│          prep_file.py
│          prep_text.py
│          __init__.py
│          
├─static 
│  ├─css
│  │      bootstrap.css
│  │      custom.min.css
│  │      
│  └─js
│          jquery-3.4.1.min.js
│          load.js
│          
├─templates
│      index.html
│      
└─wikiqa
    │  __init__.py
    │  
    └─retriever
            doc_db.py
            elasticsearch_ranker.py TODO
            jieba_tfidf_ranker.py  defective
            sklearn_tfidf_ranker.py  usable
            utils.py
            __init__.py
            

Retriever

Data acquisition

Download from the Wikipedia dumps: zhwiki-latest-pages-articles.xml.bz2

Process the data with WikiExtractor:

python WikiExtractor.py -b 1024M -o ../extracted --json zhwiki-latest-pages-articles.xml.bz2
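
The extracted files can then be read back article by article. The sketch below is a minimal reader, assuming that the --json output contains one JSON object per line with "title" and "text" fields and that the files sit somewhere under the output directory. The helper name iter_articles is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def iter_articles(root):
    """Yield (title, text) pairs from every file under `root`.

    Assumes one JSON object per line with "title" and "text" keys.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                doc = json.loads(line)
                yield doc["title"], doc["text"]
```

For example, `for title, text in iter_articles("../extracted"): ...` would walk every extracted article in a stable file order.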

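Preprocessing

After extraction with WikiExtractor, many problems still remain in the text.

Structural preprocessing

  1. convert.py: converts between Simplified and Traditional Chinese using the external tool opencc.exe
  2. prep_file.py: strips useless special symbols (e.g. '「『')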
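
The symbol-stripping step could look roughly like the sketch below. It is not the code in prep_file.py: the README names only 「 and 『 as examples, so the closing counterparts and the names UNWANTED and strip_symbols are assumptions made for illustration.

```python
# Hypothetical symbol set: the README gives only 「 and 『 as examples;
# the closing brackets 」 and 』 are an assumption.
UNWANTED = "「」『』"

# Translation table that maps every unwanted character to "deleted".
_TABLE = str.maketrans("", "", UNWANTED)


def strip_symbols(text):
    """Return `text` with every character in UNWANTED removed."""
    return text.translate(_TABLE)
```

Using str.translate keeps the pass to a single scan of each article, which matters for a dump of this size.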

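Content preprocessing

prep_text.py: when the database is built, a Pool from multiprocessing spreads the work across several worker processes. Passing initargs=(preprocess,) makes each worker call prep_text, which applies the following content filters.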

  1. Delete articles whose title contains "消歧义" (disambiguation)

    e.g. 北京 (消歧义) https://zh.wikipedia.org/wiki/%E5%8C%97%E4%BA%AC_(%E6%B6%88%E6%AD%A7%E4%B9%89)

  2. Delete articles whose title contains "大纲" (outline), "列表" (list), or "索引" (index)

    e.g. 心理学大纲 https://zh.wikipedia.org/wiki/%E5%BF%83%E7%90%86%E5%AD%A6%E5%A4%A7%E7%BA%B2

    中国大陆报纸列表 https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86%E6%8A%A5%E7%BA%B8%E5%88%97%E8%A1%A8

    世界政区索引 https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E6%94%BF%E5%8D%80%E7%B4%A2%E5%BC%95

  3. Delete pages that end with one of [可以指:|可以是:|指的可能是:] ("may refer to:", "may be:", "may possibly refer to:")

    e.g. 首都省; 我的**梦; **话

  4. Delete articles whose text is identical to the title

    e.g. **瀑布
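
The four rules above can be sketched as a single filter function. This is an illustration under assumptions, not the code in prep_text.py: it assumes each article is a dict with "title" and "text" keys and that a dropped article is signalled by returning None. The constant names are hypothetical.

```python
# Rules 1 and 2: keywords that mark a title as a non-article page.
TITLE_KEYWORDS = ("消歧义", "大纲", "列表", "索引")
# Rule 3: phrases that end a disambiguation-style page (colon stripped).
DISAMBIG_ENDINGS = ("可以指", "可以是", "指的可能是")


def preprocess(article):
    """Return a cleaned article dict, or None if any rule drops it."""
    title = article["title"].strip()
    text = article["text"].strip()
    # Rules 1 and 2: title contains a blocked keyword.
    if any(keyword in title for keyword in TITLE_KEYWORDS):
        return None
    # Rule 3: text ends with a disambiguation phrase; trailing ASCII or
    # full-width colons and whitespace are ignored before the check.
    if text.rstrip(":: \n").endswith(DISAMBIG_ENDINGS):
        return None
    # Rule 4: the body is nothing but the title.
    if text == title:
        return None
    return {"title": title, "text": text}
```

Returning None for rejected articles lets the caller drop them with a simple `if result is not None` check before writing to the database.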

Building the database

Use sqlite3 to store each article's title and text.
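
A minimal sketch of such a store follows. The table name documents, its column layout, and the helper names build_db and get_text are assumptions for illustration; the actual schema used by build_db.py is not shown in this README.

```python
import sqlite3


def build_db(db_path, articles):
    """Create a `documents` table and insert (title, text) pairs."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(title TEXT PRIMARY KEY, text TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO documents VALUES (?, ?)", articles
        )
    return conn


def get_text(conn, title):
    """Return the stored text for `title`, or None if it is absent."""
    row = conn.execute(
        "SELECT text FROM documents WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None
```

Making title the primary key gives the ranker an indexed lookup from a retrieved title back to the article text.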

Building the ranking model

  1. 1-gram Tf-idf Ranker

    Uses TfidfVectorizer from sklearn.

  2. 2-gram

    tfidf_model = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=stopwords, ngram_range=(1, 2)).fit(corpus)

    Adjust the ngram_range parameter.
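
To make the effect of ngram_range concrete: with (1, 1) every token is a feature, and with (1, 2) each pair of adjacent tokens is added as a feature too. The plain-Python sketch below only illustrates which features result. It is not sklearn's implementation, word_ngrams is a hypothetical helper, and it assumes the text has already been segmented into tokens.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of `tokens` for every n in the inclusive
    `ngram_range`, joined with spaces, shorter n-grams first."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams
```

Bigrams let the ranker tell apart queries that share the same words in a different order, at the cost of a much larger vocabulary and TF-IDF matrix.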

Document Reader

(TODO: extract answers from the top-5 articles)
