chatopera / insuranceqa-corpus-zh

Insurance-industry corpus for chatbots

Home Page: https://www.chatopera.com/

License: Other

Python 93.45% Shell 6.55%
chatbot corpus dataset insurance insuranceqa-corpus-zh machine-learning natural-language-processing natural-language-understanding qasystem question-answering

insuranceqa-corpus-zh's Issues

Error when installing insuranceqa_data with pip

File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/setuptools/dist.py", line 257, in finalize_options
ep.require(installer=self.fetch_build_egg)
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2029, in require
working_set.resolve(self.dist.requires(self.extras),env,installer))
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 579, in resolve
env = Environment(self.entries)
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 748, in __init__
self.scan(search_path)
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 777, in scan
for dist in find_distributions(item):
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 1757, in find_on_path
path_item,entry,metadata,precedence=DEVELOP_DIST
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2151, in from_location
py_version=py_version, platform=platform, **kw
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 2128, in __init__
self.project_name = safe_name(project_name or 'Unknown')
File "/Library/Python/2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 1139, in safe_name
return re.sub('[^A-Za-z0-9.]+', '-', name)
RuntimeError: maximum recursion depth exceeded

v2.1 is available

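Hi, folks

In v1 of insuranceqa-corpus-zh, the corpus was not well suited to machine learning, because the text had not been word-segmented, stripped of punctuation and stop words, or labeled. v2.1 adds load_pairs_test, load_pairs_train, and load_pairs_valid, as well as load_pairs_vocab. The data is vocabulary-based: the test, train, and valid sets all use word IDs, and each pair carries a Label that marks the reply as a positive or a negative example. The pairs-based data makes it easier to train with libraries such as: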

DeepQA2

InsuranceQA TensorFlow

Chatbot Retrieval

Detailed documentation: https://github.com/Samurais/insuranceqa-corpus-zh/releases/tag/v2.1

Quick upgrade

pip install --upgrade insuranceqa_data

@sjqzhang, @rgtjf, @fssqawj

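Why does the data have no punctuation?

Overview

The data contains no punctuation.

After conversion with id2word, the text in utterance has no punctuation or sentence breaks. Could punctuation be added?

Ideal solution

Could a version of the data with punctuation be provided?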

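How are negative examples selected?

Hello, and thank you very much for your work. I have a small question: the processed dataset forms QA pairs with a positive-to-negative ratio of 1:10. How were the negative examples selected? Were they randomly sampled, or retrieved with an algorithm such as BM25?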
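
The thread does not say how the negatives were actually chosen. Purely to illustrate the "random sample" option raised in the question, here is a minimal sketch of drawing 10 negatives per question from a shared answer pool. The function name `sample_negatives`, its parameters, and the toy answer IDs are all hypothetical and are not taken from the project.

```python
import random


def sample_negatives(correct_ids, all_answer_ids, k=10, seed=None):
    """Pick k distinct answer IDs that are not correct answers.

    Hypothetical helper: illustrates uniform random negative sampling,
    not the method the corpus authors used (which is not stated).
    """
    rng = random.Random(seed)
    correct = set(correct_ids)
    # Only answers that are not labeled correct may serve as negatives.
    candidates = [a for a in all_answer_ids if a not in correct]
    # random.sample raises ValueError if fewer than k candidates exist.
    return rng.sample(candidates, k)


# Toy pool of 20 answers, invented for illustration.
pool = ["a%d" % i for i in range(1, 21)]
negatives = sample_negatives(["a1"], pool, k=10, seed=0)
print(len(negatives))  # prints: 10
```

A retrieval-based alternative such as BM25 would replace the uniform draw with the top-scoring wrong answers, which usually yields harder negatives.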

After a plain git clone, the dataset cannot be opened with tar

Command 1:
tar xvf iqa.train.json.gz
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

Command 2:
tar -xvzf iqa.train.json.gz
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

Command 3:
unzip iqa.train.json.gz
Archive: iqa.train.json.gz
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of iqa.train.json.gz or
iqa.train.json.gz.zip, and cannot find iqa.train.json.gz.ZIP, period.

Please check whether something is wrong with the data format.
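
A note on the three commands above: a file named `*.json.gz` is normally a single gzip-compressed file, not a tar archive or a zip file, so `tar` and `unzip` would be expected to reject it even when the data is intact. Assuming the file is a valid gzip stream containing UTF-8 JSON (neither is confirmed in this thread), a minimal Python sketch for checking and reading it might look like the following. The helper names `is_gzip` and `load_json_gz` are made up for illustration.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def load_json_gz(path):
    """Decompress a gzip file and parse its contents as UTF-8 JSON."""
    if not is_gzip(path):
        raise ValueError("%s does not look like a gzip stream" % path)
    # Text mode ("rt") decodes the bytes, so json.load receives str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

If `is_gzip` returns False for the cloned file, the file itself is worth inspecting (for example, an incomplete download) before blaming the format. On the command line, `gzip -d iqa.train.json.gz` is the matching tool for a gzip file.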

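Question about the pair dataset, thanks

Hello:
Could you explain in detail what the numeric values in iqa.train.tokenlized.pair.json, under the corpus directory of the insuranceqa-corpus-zh project, correspond to? In particular, it is unclear how the "question" field maps back to the original text.

I need to reference this dataset for an upcoming experiment, so I would appreciate a prompt reply. Thanks.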
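
The v2.1 announcement in this list says the pair data stores word IDs and ships a vocabulary through load_pairs_vocab, and another issue mentions an id2word mapping. Under the assumption that "question" holds a list of word IDs and that id2word maps each ID to a token, a minimal sketch of the reverse lookup could be the following. The toy vocabulary, the toy record, and the helper `ids_to_text` are invented for illustration; the real file layout is not confirmed here.

```python
def ids_to_text(word_ids, id2word, unknown="<UNK>", sep=" "):
    """Map a sequence of word IDs back to tokens and join them."""
    tokens = []
    for wid in word_ids:
        # JSON object keys are always strings, so try the string form first.
        token = id2word.get(str(wid))
        if token is None:
            token = id2word.get(wid, unknown)
        tokens.append(token)
    return sep.join(tokens)


# Toy vocabulary and record, invented for illustration only.
id2word = {"0": "<UNK>", "1": "保险", "2": "理赔", "3": "流程"}
record = {"qid": "q1", "question": [1, 2, 3], "utterance": [2, 3], "label": 1}

print(ids_to_text(record["question"], id2word))  # prints: 保险 理赔 流程
```

Because the corpus was word-segmented and stripped of punctuation and stop words before IDs were assigned, a lookup like this would recover the token sequence, not the original sentence.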

How do I use the model?

Hello, how should I use the model? For example, given an input question, how do I get the corresponding answer?

IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:726)

Description

Running insuranceqa.load_pairs_train() always raises "IOError: [Errno socket error] EOF occurred in violation of protocol (_ssl.c:726)". The failure appears to come from "wget.download("https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz", out = os.path.join(curdir, 'pairs'))". Am I missing a dependency?

Feature

Environment

python 2.7
python 3.6

Operating system

Windows 10

Code version

Git commit hash (git rev-parse HEAD)

TypeError: the JSON object must be str, not 'bytes'

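Whenever I run the program, it fails.
with gzip.open(data_path, 'rb') as f:
    data = json.loads(f.read())
return data
The problem occurs here: json expects a str, but gzip returns a different type after reading. I tried converting the type, but that did not work either. Is there a fix?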
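
For context, `gzip.open(..., 'rb')` returns bytes, and `json.loads` accepts bytes only from Python 3.6 onward, so this error message points to an older Python 3 interpreter. One possible workaround, assuming the file is UTF-8 encoded (an assumption, not something stated in the issue), is to decode the bytes before parsing. The function name `load_gz_json` is hypothetical.

```python
import gzip
import json


def load_gz_json(data_path):
    """Read a gzip-compressed JSON file on any Python 3 version."""
    with gzip.open(data_path, "rb") as f:
        raw = f.read()  # bytes, because the file was opened in binary mode
    # Decode explicitly so json.loads always receives str.
    # UTF-8 is assumed here; adjust if the file uses another encoding.
    return json.loads(raw.decode("utf-8"))
```

Opening the file with `gzip.open(data_path, "rt", encoding="utf-8")` and calling `json.load(f)` would be an equivalent alternative.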

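Problem with the dataset

Friend, setting the positive-to-negative ratio to 1:10 is really absurd. A model trained this way labels every pair as negative, whether it is actually positive or negative, so the accuracy tops out at 10/11. You should pick a dataset that reflects real conditions; only then are the results convincing.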
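
The 10/11 figure follows from simple arithmetic: with 1 positive and 10 negatives per question, a classifier that always answers "negative" is right on 10 of every 11 pairs while never finding a correct answer. A small sketch of that calculation, with the hypothetical helper `all_negative_baseline`:

```python
from fractions import Fraction


def all_negative_baseline(n_pos, n_neg):
    """Accuracy and recall of a classifier that always predicts negative."""
    total = n_pos + n_neg
    # Every negative is classified correctly; every positive is missed.
    accuracy = Fraction(n_neg, total)
    # No positive is ever predicted, so recall on positives is zero.
    recall = Fraction(0, n_pos)
    return accuracy, recall


acc, rec = all_negative_baseline(n_pos=1, n_neg=10)
print(acc, rec)  # prints: 10/11 0
```

This suggests that plain accuracy says little on such pairs. Ranking the 11 candidates for each question and scoring the top-ranked one is one way to evaluate that is not dominated by the class ratio.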

Unable to fetch the data

Fetching the data through the API fails; the link appears to be dead.

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pairs_train()

 [insuranceqa_data] downloading data https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz ... 

... other log output omitted ...

File /usr/local/lib/python3.8/socket.py:796, in create_connection(address, timeout, source_address)
    794 if source_address:
    795     sock.bind(source_address)
--> 796 sock.connect(sa)
    797 # Break explicitly a reference cycle
    798 err = None

OSError: [Errno socket error] [Errno 110] Connection timed out
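
The traceback shows a connection timeout, which by itself does not tell whether the URL is dead or the network path to it is blocked or slow. If the file is still being served, one option is to download it separately with an explicit timeout and a few retries. The sketch below uses only the standard library; `download_with_retry` and its parameters are hypothetical and are not part of insuranceqa_data.

```python
import time
import urllib.request


def download_with_retry(url, dest, attempts=3, timeout=30, backoff=2.0,
                        opener=urllib.request.urlopen):
    """Download url to dest, retrying on network errors.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as resp:
                data = resp.read()
            # Write only after the whole body arrived, to avoid partial files.
            with open(dest, "wb") as out:
                out.write(data)
            return dest
        except OSError as err:  # URLError and socket timeouts subclass OSError
            last_error = err
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
    raise last_error


# Example call (not executed here because it needs network access):
# download_with_retry(
#     "https://github.com/Samurais/insuranceqa-corpus-zh/raw/release/corpus/pairs/iqa.test.json.gz",
#     "iqa.test.json.gz")
```

If every attempt times out, the cause is more likely reachability of the host from the current network than anything in the calling code.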
