zake7749 / chatbot

A context-aware chatbot based on vector matching

License: GNU General Public License v3.0

Python 100.00%

chatbot's Introduction

Mianbot

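Mianbot is a chatbot built on template-based and retrieval-based models. It currently has two ways of generating replies, and the project is still under development :)

The first (left figure) classifies short phrases using word vectors, then implements feature extraction and reply memory in the module selected by the classification, so that multi-turn conversations are possible. For the matching method, see Semantic Graph (still under construction ΣΣΣ (」○ ω○ )／).
The second (right figure), apart from weather responses, mainly uses PTT Gossiping as its knowledge base: it compares text similarity to retrieve the article title most similar to the user's input, then picks the most reliable reply from that article's comments. For the code and the experiments, see PTT-Chat_Generator.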

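Matching examples

More examples can be found in example/output.txt

Input: 明天早上叫我起床。 (Wake me up tomorrow morning.)

Similarity	Concept	Matched term
0.4521	鬧鐘 (alarm)	起床 (get up)
0.3904	天氣 (weather)	早上 (morning)
0.3067	住宿 (lodging)	起床 (get up)
0.1747	病症 (illness)	起床 (get up)
0.1580	購買 (purchase)	早上 (morning)
0.1270	股票 (stocks)	早上 (morning)
0.1096	觀光 (sightseeing)	早上 (morning)

Input: 明天上海會不會下雨？ (Will it rain in Shanghai tomorrow?)

Similarity	Concept	Matched term
0.5665	天氣 (weather)	下雨 (rain)
0.3918	鬧鐘 (alarm)	下雨 (rain)
0.1807	病症 (illness)	下雨 (rain)
0.1362	住宿 (lodging)	下雨 (rain)
0.0000	股票 (stocks)
0.0000	觀光 (sightseeing)
0.0000	購買 (purchase)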
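
The tables above rank concepts by how close their terms are to the keywords of the input. The following is a minimal sketch of that idea, assuming cosine similarity over word vectors and a "best keyword per concept" rule; the toy vectors, `VECTORS`, `cosine`, and `rank_concepts` are all hypothetical names made up for illustration, and the project's actual matcher in rulebase.py may work differently.

```python
import math

# Toy 3-dimensional "word vectors" (made-up numbers). A real setup would
# look these up in a trained word2vec model instead.
VECTORS = {
    "起床": [0.9, 0.1, 0.0],
    "鬧鐘": [0.8, 0.3, 0.1],
    "下雨": [0.0, 0.9, 0.2],
    "天氣": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_concepts(keywords, concepts):
    """Return (similarity, concept, matched keyword) triples, best first."""
    results = []
    for concept in concepts:
        best_sim, best_kw = 0.0, ""
        for kw in keywords:
            if kw in VECTORS and concept in VECTORS:
                sim = cosine(VECTORS[kw], VECTORS[concept])
                if sim > best_sim:
                    best_sim, best_kw = sim, kw
        results.append((best_sim, concept, best_kw))
    return sorted(results, key=lambda r: r[0], reverse=True)


ranking = rank_concepts(["起床"], ["鬧鐘", "天氣"])
for sim, concept, kw in ranking:
    print("%.4f\t%s\t%s" % (sim, concept, kw))
```

A concept with no usable keyword keeps a similarity of 0 and an empty matched term, which is the shape of the 0.0000 rows in the second table.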

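Requirements

Install a python3 development environment
Install gensim – Topic Modelling in Python
Install jieba (结巴中文分词, Chinese word segmentation)
Have a pre-trained Chinese word vector model, and adjust the initialization parameters of the Console class to match its file location.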

import Chatbot.console as console
c = console.Console(model_path='your_model')

To use the QA module, first set it up as described in the QA test dataset section, or disable the QA module by setting self.github_qa_unupdated to True in chatbot.py.

Usage

Chatbot

import Chatbot.chatbot as chatbot

chatter = chatbot.Chatbot(w2v_model_path='your_model')
chatter.waiting_loop()

Computing the match score

import Chatbot.console as console

c = console.Console(model_path='your_model')
speech = input('Input a sentence:')
res,path = c.rule_match(speech)
c.write_output(speech,res,path)

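Rule format

Rules are written in JSON. Template rules are placed in \RuleMatcher\rule:

    {
        "domain": "the abstract concept this rule represents",
        "response": [
            "after an input is matched to this rule,",
            "these are the replies the bot can give;",
            "the bot picks one response at random"
        ],
        "concepts": [
            "the possible ways of expressing this rule"
        ],
        "children": ["the sub-rules of this rule", "e.g. 購買 (purchase) -> 購買飲料 (buy drinks), 購買衣服 (buy clothes)......"]
    }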

Example

    {
        "domain": "購買",
        "response": [
            "正在將您導向購物模組"
        ],
        "concepts": [
            "購買","購物","訂購"
        ],
        "children": [
            "購買生活用品",
            "購買家電",
            "購買食物",
            "購買飲料",
            "購買鞋子",
            "購買衣服",
            "購買電腦產品"
        ]
    },
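
As a rough illustration of how a rule in this format could be consumed, the sketch below parses one rule and picks a random entry from its response list. It only assumes the four keys shown above; `RULE_JSON` and `pick_response` are hypothetical names, not part of the project's code.

```python
import json
import random

# A shortened copy of the example rule above, embedded for illustration.
RULE_JSON = """
{
    "domain": "購買",
    "response": ["正在將您導向購物模組"],
    "concepts": ["購買", "購物", "訂購"],
    "children": ["購買飲料", "購買衣服"]
}
"""


def pick_response(rule):
    """Return one random response of the rule, or None if it has none."""
    responses = rule.get("response", [])
    return random.choice(responses) if responses else None


rule = json.loads(RULE_JSON)
reply = pick_response(rule)
print(rule["domain"], "->", reply)
```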

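QA test dataset

Click here to download part of the test dataset, which contains about 250,000 non-news question-answer items from the PTT C_Chat and Gossiping boards. After extracting the archive, place the files in the QuestionAnswering/data/ folder, and place the folder extracted from reply.rar under QuestionAnswering/data/processed: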

QuestionAnswering
└── data
   ├── SegTitles.txt
   ├── processed
   │   └── reply
   │       ├── 0.json
   │       ├── .
   │       ├── .
   │       ├── .
   │       └── xxx.json
   └── Titles.txt

Once the setup is complete, you can set self.github_qa_unupdated to False in chatbot.py to enable the QA module for testing.
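
Before enabling the QA module, it may help to confirm that the files landed where the tree above expects them. The sketch below is one way to do that; `missing_qa_paths` and `EXPECTED` are hypothetical helpers written for this page, and the only paths they check are the ones shown in the tree.

```python
import os

# Paths taken from the directory tree above, relative to QuestionAnswering/.
EXPECTED = [
    os.path.join("data", "Titles.txt"),
    os.path.join("data", "SegTitles.txt"),
    os.path.join("data", "processed", "reply"),
]


def missing_qa_paths(base_dir="QuestionAnswering"):
    """Return the expected dataset paths that do not exist under base_dir."""
    return [p for p in EXPECTED
            if not os.path.exists(os.path.join(base_dir, p))]


for path in missing_qa_paths():
    print("missing:", path)
```

An empty result only says the layout is in place; it does not validate the contents of the files.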

Development log

Special thanks

網路探勘暨跨語知識系統實驗室
智慧型知識管理實驗室
Legoly
Every friend who has helped me and exchanged ideas with me

chatbot's Issues

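Conversation

Dear author, I added a model I trained with word2vec to the project and it runs successfully; every sentence I enter gets a reply. The problem is that all replies are limited to a few phrases such as “你好” (hello), “我不太明白你的意思” (I don't quite understand what you mean), “原来如此” (I see) and “是吗” (is that so), repeated over and over with no targeted answers. How can this be improved? The conversation log is attached below:
你好，我是 MianBot (Hello, I am MianBot)
你好 (Hello)
Handler of '問候' have not implemented
你好 (Hello)
很高兴认识你 (Nice to meet you)
我不太明白你的意思 (I don't quite understand what you mean)
你现在哪 (Where are you now)
是嗎? (Is that so?)
今天的天气怎么样 (How is the weather today)
是嗎? (Is that so?)
明天去哪玩 (Where shall we go tomorrow)
我不太明白你的意思 (I don't quite understand what you mean)
早上好 (Good morning)
是嗎? (Is that so?)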

Runtime error

FileNotFoundError: [Errno 2] No such file or directory: 'model/ch-corpus-3sg.bin'

NameError: name 'exit' is not defined

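Running demo_chatbot.py fails: the exit() call at line 216 of rulebase.py in the Chatbot-master/Chatbot/RuleMatcher folder raises NameError: name 'exit' is not defined
Running demo.py gives the same error, because it also calls this rulebase.py
My environment is python3.6.3. What is going on?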
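
One possible explanation: the built-in names `exit` and `quit` are added by Python's `site` module for interactive use, so they can be missing in environments where `site` is not loaded. A common workaround is to call `sys.exit()` instead. The sketch below shows that replacement; `abort` is a hypothetical helper, and whether this is the cause in the reporter's setup is an assumption.

```python
import sys


def abort(message, code=1):
    """Print an error message and stop the program with the given exit code.

    sys.exit() is always available after `import sys`, unlike the `exit`
    helper that the site module injects for interactive sessions.
    """
    print(message, file=sys.stderr)
    sys.exit(code)
```

In rulebase.py this would amount to adding `import sys` at the top and replacing the bare `exit()` call with `sys.exit()`.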

[Errno 2] No such file or directory

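After training a model with your tutorial "使用 gensim 訓練中文詞向量" (Training Chinese word vectors with gensim), I put the model in that folder
and changed the console call as follows
c = console.Console(model_path='word2vec-tutorial-master/word2vec.model')
but I get [Errno 2] No such file or directory: 'word2vec-tutorial-master/word2vec.model'
Is there a step I have not completed?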
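
A possible cause, judging from the Console constructor quoted in a later issue on this page: it calls os.chdir() on its own directory before loading the model, so a relative model_path is resolved against the Chatbot package folder rather than the folder the script was started from. Resolving the path to an absolute one before passing it in avoids that. The helper name `absolute_model_path` below is hypothetical.

```python
import os


def absolute_model_path(path):
    """Resolve a model path against the current working directory."""
    return os.path.abspath(os.path.expanduser(path))


# Resolve the path *before* any code changes the working directory.
model_path = absolute_model_path("word2vec-tutorial-master/word2vec.model")
print(model_path, "exists:", os.path.exists(model_path))
```

The resolved value can then be passed as `console.Console(model_path=model_path)`.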

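QA test dataset file is missing

Hello,
the download link
https://drive.google.com/file/d/0BxfXm7KkNKc-RkY2Z1pONUlqODg/view?usp=sharing
shows
"Sorry, the file you requested does not exist.
Please make sure the URL is correct and that the file exists."
Could you upload it again?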

Downloading the data

The QA data cannot be downloaded from Google Drive

model

No such file or directory: 'model/ch-corpus-3sg.bin'. How can I obtain this model?

How do I train my own model?

bugs

File "test.py", line 9, in main
model = models.Word2Vec.load_word2vec_format('ch-corpus-3sg.bin',binary=True)

gensim/models/word2vec.py", line 1608, in load_word2vec_format
raise DeprecationWarning("Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.")
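
As the message itself suggests, newer gensim versions expect word2vec-format files to be loaded through `KeyedVectors`. The sketch below wraps that call; `load_vectors` is a hypothetical helper, and `binary=True` is an assumption carried over from the failing call above.

```python
import os


def load_vectors(path, binary=True):
    """Load word2vec-format vectors with gensim's KeyedVectors loader."""
    if not os.path.isfile(path):
        raise FileNotFoundError("word vector file not found: %s" % path)
    # Imported lazily so the path check above works even without gensim.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=binary)
```

For the test.py line in the traceback, this would correspond to `model = load_vectors('ch-corpus-3sg.bin', binary=True)`. The returned object holds only the vectors, so code that relied on training-related attributes of a full `Word2Vec` model may need adjusting.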

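[ERROR] Getting >> [Gensim] 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

How can this error be solved? Please help; I feel I am only one small step away from getting it to work.
The code of my console file is as follows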

==========================================================
import random
import os

import jieba
import jieba.analyse

import RuleMatcher.rulebase as rulebase

class Console(object):

    """
    Build some nlp function as an package.
    """

    def __init__(self,model_path="model/ch-corpus-3sg.bin",
                 rule_path="RuleMatcher/rule/",
                 stopword="jieba_dict/stopword.txt",
                 jieba_dic="jieba_dict/dict.txt.big",
                 jieba_user_dic="jieba_dict/userdict.txt"):

        print("[Console] Building a console...")

        cur_dir = os.getcwd()
        curPath = os.path.dirname(__file__)
        os.chdir(curPath)

        # jieba custom setting.
        self.init_jieba(jieba_dic, jieba_user_dic)
        self.stopword = self.load_stopword(stopword)

        # build the rulebase.
        self.rb = rulebase.RuleBase()

        print("[Console] Loading the word embedding model...")

        try:
            self.rb.load_model(model_path)
            # models.Word2Vec.load('word2vec.model')

        except FileNotFoundError as e:
            print("[Console] 請確定詞向量模型有正確配置")
            print(e)
            exit()
        except Exception as e:
            print("[Gensim]")
            print(e)
            exit()

        print("[Console] Loading pre-defined rules.")
        self.rb.load_rules_from_dic(rule_path)

        print("[Console] Initialized successfully :>")

        os.chdir(cur_dir)


    def listen(self):
        #into interactive console
        while True:
            self.show_information()
            choice = input('Your choice is: ')
            choice = choice.lower()
            if choice == 'e':
                res = self.jieba_tf_idf()
                for tag, weight in res:
                    print('%s %s' % (tag, weight))
            elif choice == 'g':
                res = self.jieba_textrank()
                for tag, weight in res:
                    print('%s %s' % (tag, weight))
            elif choice == 'p':
                print(self.rb)
            elif choice == 'r':
                self.rb.load_rules('RuleMatcher/rule/',reload=True)
            elif choice == 'd':
                self.test_speech()
            elif choice == 'm':
                speech = input('Input a sentence:')
                res,path = self.rule_match(speech)
                self.write_output(speech,res,path)
            elif choice == 'b':
                exit()
            elif choice == 's':
                rule_id = input('Input a rule id:')
                res = self.get_response(rule_id)
                if res is not None:
                    print(res)
            elif choice == 'o':
                self.rb.output_as_json()
            else:
                print('[Opps!] No such choice: ' + choice + '.')

    def jieba_textrank(self):

        """
        Use textrank in jieba to extract keywords in a sentence.
        """

        speech = input('Input a sentence: ')
        return jieba.analyse.textrank(speech, withWeight=True, topK=20)

    def jieba_tf_idf(self):

        """
        Use tf/idf in jieba to extract keywords in a sentence
        """

        speech = input('Input a sentence: ')
        return jieba.analyse.extract_tags(speech, topK=20, withWeight=True)

    def show_information(self):
        print('Here is chatbot backend, enter your choice.')
        print('- D)emo the data in speech.txt.')
        print('- E)xtract the name entity.')
        print('- G)ive me the TextRank.')
        print('- M)atch a sentence with rules.')
        print('- P)rint all rules in the rulebase.')
        print('- R)eload the base rule.')
        print('- O)utput all rules to rule.json.')
        print('- S)how me a random response of a rule')
        print('- B)ye.')

    def init_jieba(self, seg_dic, userdic):

        """
        jieba custom setting.
        """

        jieba.load_userdict(userdic)
        jieba.set_dictionary(seg_dic)
        with open(userdic,'r',encoding='utf-8') as input:
            for word in input:
                word = word.strip('\n')
                jieba.suggest_freq(word, True)

    def load_stopword(self, path):

        stopword = set()
        with open(path,'r',encoding='utf-8') as stopword_list:
            for sw in stopword_list:
                sw = sw.strip('\n')
                stopword.add(sw)
        return stopword

    def word_segment(self, sentence):

        words = jieba.cut(sentence, HMM=False)
        #clean up the stopword
        keyword = []
        for word in words:
            if word not in self.stopword:
                keyword.append(word)
        return keyword

    def rule_match(self, sentence, best_only=False, search_from=None, segmented=False):

        """
        Match the sentence with rules.

        Args:
            - sentence  : the string you want to match with rules.
            - best_only : if True, only return the best matched rule.
            - root      : a domain name, then the rule match will start
                          at searching from that domain, not from forest roots.
            - segmented : the sentence is segmented or not.
        Return:
            - a list of candiate rule
            - the travel path of classification tree.
        """
        keyword = []
        if segmented:
            keyword = sentence
        else:
            keyword = self.word_segment(sentence)

        if search_from is None: # use for classification (rule matching).
            result_list,path = self.rb.match(keyword,threshold=0.1)
        else:  # use for reasoning.
            result_list,path = self.rb.match(keyword,threshold=0.1,root=search_from)

        if best_only:
            return [result_list[0], path]
        else:
            return [result_list, path]


    def get_response(self, rule_id):

        """
        Get a random response from the given rule's response'list.
        """
        rule = self.rb.rules[rule_id]
        res_num = rule.has_response()
        if res_num == 0:
            return None
        else:
            return rule.response[random.randrange(0,res_num)]

    def test_speech(self):

        """
        Try matching all sentence in 'example/output.txt'
        """

        output = open('example/output.txt','w',encoding='utf-8')
        # load sample data
        with open('example/speech.txt','r',encoding='utf-8') as input:
            for speech in input:
                speech = speech.strip('\n')
                result,path = self.rule_match(speech)
                self.write_output(speech, result, path, output)

    def write_output(self, org_speech, result, path, output = None):

        """
        Show the matching result.

            Args:
                - org_speech: the original input string.
                - result: a sorted array, refer match() in rulebase.py.
                - path: the travel path in classification tree.
                - output: expect as a file writer, if none, print
                  the result to stdio.
        """
        result_information = ''
        result_information += "Case# " + str(org_speech) + '\n'
        result_information += "------------------\n"
        for similarity,rule,matchee in result:
            str_sim = '%.4f' % similarity
            result_information += str_sim+'\t'+path+rule+'\t\t'+matchee+'\n'
        result_information += "------------------\n"

        if output is None:
            print(result_information)
        else:
            output.write(result_information)

if __name__ == '__main__':
    main()

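QA test dataset is missing

Hello, dear author!
The link https://drive.google.com/file/d/11JlbmYmuu00TfmfdAfyoGK_E3VMp8Vd_/view shows that the page does not exist. Could you upload the QA test dataset again, or send me a new download link? Thank you!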

Generating the rulebase

Hello, I would like to ask how the rulebase in this project was generated.

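Question about running demo.py and test.py

Hello, and thank you for making this resource available.
As the title says, I ran into a problem when running demo.py (error attached below).
From questions asked by other users,
I learned that the word2vec model has to be trained,
so I followed your other article and trained one successfully.
Testing demo.py (the one from the word2vec project, not this chatbot's demo.py) also ran without problems.
However, when I plugged that model back into the chatbot
and updated every file that needs the model path, such as demo.py and demo_chatbot.py,
the error message still appeared.
I would be very grateful if the author could help.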

https://images.plurk.com/3eU8qiGvmUsxseftD1eQ.jpg

Cannot run the project

Hello, I need to build a chatbot for a school assignment,

but I cannot run this project

python3.6 demo_chatBot.py

[Gensim]
'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
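
A likely reading of this error, offered as an assumption rather than a confirmed diagnosis: a word2vec-format file starts with a plain-text header, whereas a file written by gensim's own `save()` is a pickle, and pickles of protocol 2 or later start with the byte 0x80. Feeding the latter to a word2vec-format loader would fail on that first byte. The sketch below inspects the first byte to suggest which loader to try; `looks_like_pickle` and `suggest_loader` are hypothetical helper names.

```python
def looks_like_pickle(path):
    """True if the file starts with 0x80, the pickle protocol 2+ marker."""
    with open(path, "rb") as f:
        return f.read(1) == b"\x80"


def suggest_loader(path):
    """Name the gensim loader that probably matches the file format."""
    if looks_like_pickle(path):
        # Probably written by model.save(...)
        return "gensim.models.Word2Vec.load"
    # Probably a word2vec text/binary export
    return "gensim.models.KeyedVectors.load_word2vec_format"
```

If the model was trained and stored with `model.save(...)`, loading it with `Word2Vec.load(...)` is the matching call. Because Console loads the model through rulebase.py's `load_model`, that is presumably where the loader would need to change.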

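Taiba word segmentation tool

Hello, when running the bot I found that it depends on the Taiba word segmentation tool, but this project can currently not be installed with pip, and it has also been deleted from GitHub ( https://github.com/fann1993814 ).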

Error when running test.py

Hello @zake7749

I first ran pip install word2vec, which succeeded:
root@ubuntu:~# pip install word2vec
Collecting word2vec
Using cached word2vec-0.9.2.tar.gz
Requirement already satisfied: numpy in /usr/local/lib/python3.5/dist-packages (from word2vec)
Requirement already satisfied: cython in /usr/local/lib/python3.5/dist-packages (from word2vec)
Building wheels for collected packages: word2vec
Running setup.py bdist_wheel for word2vec ... done
Stored in directory: /root/.cache/pip/wheels/81/d0/9d/93f56c6111d24248341bbe355fd7d5ef6243f89260af5e91b3
Successfully built word2vec
Installing collected packages: word2vec
Successfully installed word2vec-0.9.2

I changed the line to model = models.Word2Vec.load('/usr/local/bin/word2vec')

Then I ran test.py

root@ubuntu:/home/liaotian/Chatbot/Chatbot/model# python test.py
2018-02-06 23:20:31,550 : INFO : loading Word2Vec object from /usr/local/bin/word2vec
Traceback (most recent call last):
File "test.py", line 41, in <module>
main()
File "test.py", line 9, in main
model = models.Word2Vec.load('/usr/local/bin/word2vec')
File "/usr/local/lib/python3.5/dist-packages/gensim/models/word2vec.py", line 975, in load
return super(Word2Vec, cls).load(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/base_any2vec.py", line 629, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/gensim/models/base_any2vec.py", line 278, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/gensim/utils.py", line 395, in load
obj = unpickle(fname)
File "/usr/local/lib/python3.5/dist-packages/gensim/utils.py", line 1302, in unpickle
return _pickle.load(f, encoding='latin1')
_pickle.UnpicklingError: invalid load key, '.

I got this error. How can I solve it?

Thank you

Can a dialogue corpus be used to train the model?

Hello, what should I do if I want to train the model on some personal dialogue data?

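The test dataset can no longer be downloaded

I found a problem: downloading the "test dataset" required by the QA module shows "Oops! There was a problem with the network Download".

Also, how were the xx.json files inside reply.rar generated?
Thank you!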

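Dialogue intent extraction

Hello, how is dialogue intent extraction implemented here? Is there any detailed material explaining it? I am quite interested in this part and would like to understand it in more depth.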

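Problem when using the QA module

Thank you for sharing the code for everyone's reference.
I first set things up following the QA test dataset section,
but I ran into a problem when running demo_qa.py that I do not know how to solve, as shown in the image:
https://i.imgur.com/drsgZIp.jpg
I hope you can help me with this.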

Web API

Dear author, how can the chat conversation be presented on the Web? How is the Web API interaction implemented?

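Reply memory feature

Hello, this project has been a great help to me as a beginner with chatbots. Thank you for sharing it.
I would like to understand how the chatbot achieves its reply memory feature.

Taking the left figure as an example, how does the chatbot remember "高雄" (Kaohsiung), an option given earlier in the conversation?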

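Traditional Chinese conversion

@zake7749 Dear author, I found that almost all replies in the QA test dataset are in Traditional Chinese, and input in Simplified Chinese rarely matches. Is there a way to convert all the Traditional Chinese characters in the QA test dataset to Simplified Chinese?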

Question about the dataset

/Reply/*.json
Titles.txt
SigTitles.txt
What is the relationship between them?
It looks like Sig corresponds to Reply, but what about Titles.txt?

TypeError

TypeError: cannot use a string pattern on a bytes-like object
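Dear author, how can this problem be solved? demo_chatbot.py starts successfully, but when I enter the sentence “很高兴认识你” (Nice to meet you), I do not get a proper answer.
The following output appears instead: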
Traceback (most recent call last):
File "E:/python_work/pycharm/Chatbot-master/demo_chatbot.py", line 4, in <module>
chatter.waiting_loop()
File "E:\python_work\pycharm\Chatbot-master\Chatbot\chatbot.py", line 65, in waiting_loop
res = self.listen(speech)
File "E:\python_work\pycharm\Chatbot-master\Chatbot\chatbot.py", line 109, in listen
response,stauts,target,candiates = self.getResponseOnRootDomains(target)
File "E:\python_work\pycharm\Chatbot-master\Chatbot\chatbot.py", line 149, in getResponseOnRootDomains
status,response = handler.get_response(self.speech, self.speech_domain, target)
File "E:\python_work\pycharm\Chatbot-master\Chatbot\task_modules\other\stock.py", line 30, in get_response
stock_no = self.get_stock_no(nm)
File "E:\python_work\pycharm\Chatbot-master\Chatbot\task_modules\other\stock.py", line 69, in get_stock_no
m = re.search('([0-9]{4}[ ]{2}|[0-9]{5}[ LRU]{1}|[0-9]6),([^ ]*)( *),',col)
File "D:\python\python-3.5.4\lib\re.py", line 173, in search
return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object

Process finished with exit code 1
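
This TypeError is what Python 3 raises when a `str` regular expression is applied to `bytes`, for example to raw data read from a file opened in binary mode or from a network response. A minimal sketch of one workaround is to decode the data before searching. The simplified pattern, the sample row, the UTF-8 default, and the helper name `search_text` are all assumptions for illustration; the real encoding of the data handled in stock.py is not stated in the traceback.

```python
import re

# Hypothetical, simplified pattern: a 4-digit code, a comma, then a name.
PATTERN = re.compile(r"([0-9]{4}),([^ ,]*)")


def search_text(pattern, data, encoding="utf-8"):
    """Search with a str pattern, decoding bytes input first if needed."""
    if isinstance(data, (bytes, bytearray)):
        data = data.decode(encoding, errors="replace")
    return pattern.search(data)


# Made-up sample row, passed as bytes to mimic the failing case.
m = search_text(PATTERN, "1234,ExampleCo".encode("utf-8"))
print(m.group(1), m.group(2))
```

The alternative is to keep the data as bytes and compile a bytes pattern (`re.compile(rb"...")`); which of the two fits better depends on how `col` is produced in stock.py.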