ganymedenil / document.ai Goto Github PK

View Code? Open in Web Editor NEW

3.6K 43.0 311.0 78 KB

基于向量数据库与GPT3.5的通用本地知识库方案(A universal local knowledge base solution based on vector database and GPT3.5)

License: GNU Affero General Public License v3.0

Python 64.55% HTML 35.45%

document.ai's Introduction

document.ai

基于向量数据库与GPT3.5的通用本地知识库方案(A universal local knowledge base solution based on vector database and GPT3.5)

流程

整个流程非常简单，也没有复杂的地方，相信关注GPT领域的都会看到过如上的流程。

主要就以下几个点：

将本地答案数据集，转为向量存储到向量数据
当用户输入查询的问题时，把问题转为向量然后从向量数据库中查询相近的答案topK 这个时候其实就是我们最普遍的问答查询方案，在没有GPT的时候就直接返回相关的答案整个流程就结束了
现在有GPT了可以优化回答内容的整体结构，在单纯的搜索场景下其实这个优化没什么意义。但如果在客服等的聊天场景下，引用相关领域内容回复时，这样就会显得不那么的突兀。

使用范围

请参考 OpenAI 的使用政策

https://openai.com/policies/usage-policies

我的 MSD 案例只是探索其中一个垂直领域的可行性，你可以把这个项目迁移到任何你熟悉的领域中，而不必拘泥于医疗领域

难点

查询数据不准确

基于数据的优化

问答拆分查询

在上面的例子中，我们直接将问题和答案做匹配，有些时候因为问题的模糊性会导致匹配不相关的答案。

如果在已经有大量的问答映射数据的情况下，问题直接搜索问题集，然后基于已有映射返回当前问题匹配的问题集的答案，这样可以提升一定的问题准确性。

抽取主题词生成向量数据

因为答案中有大量非答案的内容，可以通过抽取答案主题然后组合生成向量数据，也可以在一定程度上提升相似度，主题算法有LDA、LSA等。

基于自训练的Embedding模型

openAI 的Embedding模型数据更多是基于普遍性数据训练，如果你要做问答的领域太过于专业有可能就会出现查询数据不准确的情况。

解决方案是自训练 Embedding 模型，在这里我推荐一个项目 text2vec ，shibing624 已经给出了一个模型基于 CoSENT + MacBERT +STS-B，shibing624/text2vec-base-chinese。

我也在前些日子训练了基于 CoSENT + LERT + STS-B的两个模型一个隐层大小是1024的text2vec-large-chinese，另一个是768的text2vec-base-chinese。也欢迎比对。

为了做这个Demo我还训练了两个医疗问答相关的模型基于cMedQQ数据集，其他与上面的一致分别是text2vec-cmedqq-lert-large和text2vec-cmedqq-lert-base。

基于 Fine-tune

目前我自身测试下来，使用问答数据集对GPT模型进行Fine-tune后，对于该类问题的准确性大幅提高。你可以理解为GPT通过大量的专业领域数据的训练后，当你对它提问的时候会更像在和这个领域的专家对话，然后配合调小接口中temperature参数，可以得到更确定的结果。

但现在 Fine-tune 训练和使用成本还是太高，每天都会有新的数据，不可能高频的进行 Fine-tune。我的一个想法是每隔一个长周期对数据进行 Fine-tune ，然后配合外置的向量数据库的相似查询来补足 Fine-tune 模型本身的数据准确性问题。

Buy me a coffee

document.ai's People

Contributors

Stargazers

Watchers

Forkers

phantomtide jjqtony kevinsun2017 nero520 johnliu33 iseeyo hqman blackwhites lyhiving pdkyll glaceage piggypiggyrun howiechen95 8-diagrams zxm9988 alphacaicai chris-han tide999 xiispace yayawawo cyoyo-geek tiwentichat topsvcloud duoluo unliu lavineleo zhongerxin xiangtuo hnkama suzg cosmoslazycat xingcici qraccess yezhwi rexsu nough1 gemnioo dalian-ai taozhijiang zs1621 citypages maxwelledisons ideal19dev20 acamelq yxybyq tonyxia2016 griffan igen90 itsharex linuer vjimrunning wishgale katherineq11 bigbrother666sh xuexiaogang linecode fangqiluxatu mplebron antiboson bilalnawaz072 vitekrubtsov nicocanada circlestarzero blm666 toread-jxj gebilaoman haozech junit burakakrishna qhxin danecryessx living198x nsongbai jangocheng techventurebuilder kokoosik jiangtao itsbean mengmajun ch8os kai2002 techthiyanes ai-jie01 louiscklaw ouichien git-models chuan0668 chring32 kang9779 mamingsuper jackcashman xiaolingis bravohaha yi-ge asdlei99 sherry0429 harveyvd newbeeyoung forksx dujingcen

document.ai's Issues

数据量级

仿照大佬的项目搭了一个，暴力计算的相似度，大概多少量级数据需要用到向量库么
https://github.com/thinksoso/MedChat

如何在现有Embedding模型基础上使用无监督数据微调？

感谢分享！

有一个问题想咨询一下您，按照我的理解，GanymedeNil/text2vec-large-chinese模型是基于LERT预训练模型，使用CoSENT方法，在中文STS-B数据集上训练得到的。

我现在有一些特定领域文本，想使用该Embedding模型在特定领域文本上微调，但这些文本是无标注的，无法使用CoSENT方法进行有监督微调。
那是不是我的可行做法可以是：

使用LERT在这些领域文本上进行MLM无监督微调，再在STS-B上微调；
尝试利用特定领域文本构建文本对，利用CoSENT方法微调；

直觉上看，这两种做法会有效吗？希望听一下您的见解。
期待您的回复

向量数据库查询到的应该是和问题相似的内容吧

把用户的 q 转化为向量后，在向量数据库中查询到的 topK 个结果应该是和 q 相似的吧，而不是 q 对应的答案吧？还是说我对向量数据库理解有问题，谢谢。

关于自训练 Embeddings 问题

你好，非常感谢作者的贡献，让我更加理解实现思路，我遇到了点问题，想请教您。

如果我自己依据想构建的知识库的数据去训练 Embedding 模型，然后向量化本地数据的时候，同时把训练 Embedding 模型的数据也向量化存储在Qdrant，这样做是不是不合适？

我想基于我们公司已有的想沉淀为知识库的数据进行训练 Embedding，这样期望进行向量化存储和搜索的时候，相似性和准确率稍微可以高点，我该怎么做呢？

数据集爬虫咨询

您好，很感谢您的项目，学习到很多~

请问你的默沙东的数据集是自己从官网爬的吗，我看openAI官方的例子有webQA的爬虫的例子，不知道是不是这样弄也可以

能基于已有的embedding模型，在特定的知识领域上做微调吗？

感谢作者大大的分享，想知道对于计算资源不足的人而言，能不能基于已有的embedding模型在特定的领域知识上做微调，或者类似于lora的方式做增量

关于 OpenAI 接口请求超时

长时间无法响应是因为OpenAI接口已经被ban，已经有很多公开的方案了，请善用搜索

text2vec模型效果怎么样

大佬自训练的版本，看起来不错，效果怎样，有在相关数据集上的评估指标可以分享吗？

比如 text2vec-large-chinese 和 text2vec-base-chinese 的效果对比，便于大家选用

谢谢！

启动后提示：openai.error.APIConnectionError: Error communicating with OpenAI。这是因为 ssl 请求失败吗

openai.error.APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/embeddings (Caused by SSLError(

2023-04-02 13:45:47,078] ERROR in app: Exception on /search [POST]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 10, in map_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 94, in connect_tcp
sock = socket.create_connection(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 845, in create_connection
raise err
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/socket.py", line 833, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 218, in handle_request
resp = self._pool.handle_request(req)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 253, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 237, in handle_request
response = connection.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 86, in handle_request
raise exc
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 63, in handle_request
stream = self._connect(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 111, in _connect
stream = self._network_backend.connect_tcp(**kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/backends/sync.py", line 93, in connect_tcp
with map_exceptions(exc_map):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in exit
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc)
httpcore.ConnectError: [Errno 61] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 95, in send_inner
response = self._client.send(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 908, in send
response = self._send_handling_auth(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 936, in _send_handling_auth
response = self._send_handling_redirects(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 973, in _send_handling_redirects
response = self._send_single_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_client.py", line 1009, in _send_single_request
response = transport.handle_request(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
with map_httpcore_exceptions():
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 153, in exit
self.gen.throw(typ, value, traceback)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 61] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 105, in search
res = query(search)
File "/Users/xiaolu/Desktop/github/document.ai/code/server/server.py", line 64, in query
search_result = client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_client.py", line 253, in search
return self._client.search(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/qdrant_remote.py", line 419, in search
search_result = self.http.points_api.search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 963, in search_points
return self._build_for_search_points(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api/points_api.py", line 488, in build_for_search_points
return self.api_client.request(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 68, in request
return self.send(request, type)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 85, in send
response = self.middleware(request, self.send_inner)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 188, in call
return call_next(request)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/qdrant_client/http/api_client.py", line 97, in send_inner
raise ResponseHandlingException(e)
qdrant_client.http.exceptions.ResponseHandlingException: [Errno 61] Connection refused

资料库包含tag应该怎么整理

比如像 Notion 中自己关于某个 topic 的笔记，就应该记成类似如下形式吗？

{
    title: "某个领域的研究",
    text: "具体的某个研究的内容，可能非常复杂和杂碎"
}

另外就是，很多知识库都会有 tag 系统，我对某个内容会进行 tag，这个信息怎么纳入知识库或者 vector 中？

SentenceTransformer 调用的问题

很感谢作者的分享！
我现在使用LangChain-chatglm项目时碰到了一些加载模型文件的问题，大概确定了是因为模型文件缺少了modules.json，pooling文件夹，还有sentence_xlnet_config.json，这些都从shibing624/text2vec那边复制一份过来会有问题吗，需要有其他特别的设置吗