

Bert-for-WebQA

A Chinese reading-comprehension question-answering training and evaluation framework built from scratch with torch and the transformers package. The data is Baidu's WebQA dataset, which is similar to the Stanford SQuAD dataset. The model is based on bert-base-chinese plus a CRF layer, and can be updated as needed.

Model

Input: the string [CLS] + Question + [SEP] + Evidence
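The input layout can be illustrated with a minimal, dependency-free sketch. Since bert-base-chinese tokenizes Chinese text character by character, plain character lists are a faithful stand-in here; the real code would use a transformers tokenizer, and `build_input`/`build_answer_labels` are illustrative helpers, not functions from this repo.

```python
# Sketch of [CLS] + Question + [SEP] + Evidence construction and the
# task-2 label sequence (1 where the answer appears in the Evidence).
def build_input(question: str, evidence: str):
    tokens = ["[CLS]"] + list(question) + ["[SEP]"] + list(evidence)
    # segment ids: 0 for [CLS] + question + [SEP], 1 for the evidence
    segment_ids = [0] * (len(question) + 2) + [1] * len(evidence)
    return tokens, segment_ids

def build_answer_labels(tokens, answer: str):
    # Mark evidence positions covered by the answer with 1, everything else 0.
    labels = [0] * len(tokens)
    sep = tokens.index("[SEP]")
    evidence = "".join(tokens[sep + 1:])
    start = evidence.find(answer)
    if start >= 0:
        for i in range(start, start + len(answer)):
            labels[sep + 1 + i] = 1
    return labels

tokens, seg = build_input("首都是哪里", "北京是首都")
labels = build_answer_labels(tokens, "北京")
print(tokens)
print(labels)
```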

Model framework: multi-task joint training with two tasks:

       Task 1. Use the [CLS] representation to judge whether the pair is a genuine Question-Evidence match;

       Task 2. Run the BERT representation of Question + [SEP] + Evidence through a CRF layer for sequence labeling, locating the answer inside the Evidence.

Output:

       Task 1. A 0-1 sequence of shape [batch_size, 1]; 1 means the passage contains the answer to the question, 0 means it does not;

       Task 2. A 0-1 sequence of shape [batch_size, seq_len]; positions in the Evidence where the answer appears are labeled 1, all others 0.

Note: the [CLS] token is used for Question-Evidence matching because large-scale document retrieval usually returns some deceptively similar negative passages, and the [CLS] classifier provides a second filtering pass.
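A hedged sketch of the two-head architecture in plain torch. A randomly initialized transformer stands in for bert-base-chinese and a linear token head stands in for the repo's CRF layer, so only the shapes match the description above, not the trained weights.

```python
# Two-task reader sketch: a [CLS] match head and a per-token answer head.
import torch
import torch.nn as nn

class TwoTaskReader(nn.Module):
    def __init__(self, hidden=768, vocab=21128):
        super().__init__()
        # stand-in encoder; the real model is transformers' BertModel
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.match_head = nn.Linear(hidden, 2)  # task 1: Q-E match from [CLS]
        self.span_head = nn.Linear(hidden, 2)   # task 2: per-token answer tag

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))
        match_logits = self.match_head(h[:, 0])  # [batch, 2], from [CLS]
        tag_logits = self.span_head(h)           # [batch, seq_len, 2]
        return match_logits, tag_logits

model = TwoTaskReader()
ids = torch.randint(0, 21128, (16, 64))
m, t = model(ids)
print(m.shape, t.shape)  # torch.Size([16, 2]) torch.Size([16, 64, 2])
```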

Training results

       Eval on test data   Eval-Loss: 15.383  Eval-Result (recall): R = 0.796

       Eval on dev data    Eval-Loss: 13.986  Eval-Result (recall): R = 0.795

Dataset: https://pan.baidu.com/s/1QUsKcFWZ7Tg1dk_AbldZ1A (extraction code: 2dva)

Baseline paper: https://arxiv.org/abs/1607.06275

Trained model (Google Drive share link): https://drive.google.com/open?id=1KHlCnT6VEpDCvtJp8FfwMtU5_ABrYzH9

==================== Hyperparameters ====================

       early_stop = 1
               lr = 1e-05
               l2 = 1e-05
         n_epochs = 5
        Negweight = 0.01
         trainset = data/me_train.json
           devset = data/me_validation.ann.json
          testset = data/me_test.ann.json
   knowledge_path = data/me_test.ann.json
    Stopword_path = data/stop_words.txt
           device = cuda
             mode = train
       model_path = save_model/latest_model.pt
       model_back = save_model/back_model.pt
       batch_size = 16

Note: the results above come from only half an epoch of training. Because of the pandemic I was at home without a server, so training was done on Google Cloud with a Tesla P100; answering one question takes about 40 ms on average.

QA module

The QA module provides two functions:

1. Reading comprehension QA over a given passage;

2. Open QA: quickly retrieve passages from the knowledge base for a question, then run reading comprehension QA on them. The answer must exist in the knowledge base!

Reading QA example:

(screenshot)

Open QA example:

(screenshot)

Document retrieval

       Step 0. Prepare the knowledge base

       Step 1. Segment the text with jieba

       Step 2. Remove stopwords

       Step 3. Build a unigram + bigram bag of words and compute the TF-IDF matrix with sklearn

       Step 4. Score the query against the knowledge base's TF-IDF matrix and rank the 10 most relevant documents.

With a knowledge base built from the test set (3,024 documents), retrieval accuracy is 89%: selecting 15 documents per query, the result set contains the correct Evidence 89% of the time.
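The retrieval steps above can be sketched with scikit-learn's TfidfVectorizer over a unigram + bigram bag of words. The toy documents here are already segmented (the repo uses jieba for step 1) and joined with spaces, so a simple whitespace token pattern suffices; the corpus and query are illustrative.

```python
# TF-IDF retrieval sketch: vectorize the knowledge base, then rank
# documents by cosine similarity against the query vector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "北京 是 中国 的 首都",
    "上海 是 中国 的 金融 中心",
    "长城 位于 中国 北方",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
doc_matrix = vectorizer.fit_transform(docs)  # sparse [n_docs, n_features]

query = "中国 的 首都 是 哪里"
scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
top = scores.argsort()[::-1][:10]  # indices of the top-10 documents
print(top[0])  # index of the best-matching document
```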

Usage

       Training:    %run TrainAndEval.py --batch_size=8 --mode="train" --model_path='save_model/latest_model.pt'

       Evaluation:  %run TrainAndEval.py --mode="eval" --model_path='save_model/latest_model.pt'

       Reading QA:  %run TrainAndEval.py --mode="demo" --model_path='save_model/latest_model.pt'

       Open QA:     %run TrainAndEval.py --mode="QA" --model_path='save_model/latest_model.pt'

Limitations

  1. The model extracts the correct answer from a correct Evidence with high accuracy, but struggles to reject deceptively similar wrong Evidence; the next step is to improve the [CLS] classifier's rejection of wrong Evidence. In the open QA module this causes several candidate passages to be identified as containing the correct answer, with no way to decide which one is actually right.

  2. For large-scale document retrieval, computing the TF-IDF matrix is slow because the bag of words is large; the next step is to switch to sparse matrices and hashed features, following Facebook's DrQA retrieval module.


Contributors

hanlard

