Chat4U

Train your own personal chatbot on your WeChat chat history.

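Obtaining the database key

WeChat stores chat history encrypted in SQLite databases, so the first step is to obtain the database key. You need a macOS laptop; the phone can be either Android or iPhone. Follow these steps: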

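  1. Sync your WeChat chat history to the Mac: in WeChat on your phone, go to Settings > Chats > Chat History Migration & Backup > Migrate > Migrate to WeChat on computer (设置 > 聊天 > 聊天记录迁移与备份 > 迁移 > 迁移到电脑微信), and select the chats to sync.
  2. Download the dtrace script provided by nalzok/wechat-decipher-macos.
git clone https://github.com/nalzok/wechat-decipher-macos
  3. Start WeChat on the computer, stay on the login screen, and run the following command.
sudo ./wechat-decipher-macos/macos/dbcracker.d -p $(pgrep WeChat) | tee dbtrace.log
  4. If you do not have dtrace permission, temporarily disable SIP (the original README links to instructions), then run the command above again.
  5. Log in to WeChat on the computer, then quit WeChat. The script should extract the keys of all WeChat databases and save them to dbtrace.log, as in the sample below.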
sqlcipher '/Users/<user>/Library/Containers/com.tencent.xinWeChat/Data/Library/Application Support/com.tencent.xinWeChat/2.0b4.0.9/5976edc4b2ac64741cacc525f229c5fe/Message/msg_0.db'
--------------------------------------------------------------------------------
PRAGMA key = "x'<384_bit_key>'";
PRAGMA cipher_compatibility = 3;
PRAGMA kdf_iter = 64000;
PRAGMA cipher_page_size = 1024;
........................................
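
To make the log layout concrete, here is a minimal sketch of how the database paths and keys could be pulled out of dbtrace.log. It assumes every entry looks like the sample above (a `sqlcipher '<path>'` line followed by a `PRAGMA key` line). The function name `parse_dbtrace` is hypothetical and is not taken from this repository's decrypt.py.

```python
import re

# Assumed layout, taken from the sample above: a "sqlcipher '<path>'" line,
# later followed by a PRAGMA key line that holds the raw hex key.
DB_LINE = re.compile(r"^sqlcipher '(?P<path>.+)'\s*$")
KEY_LINE = re.compile(r"^PRAGMA key = \"x'(?P<key>[0-9a-fA-F]+)'\";\s*$")


def parse_dbtrace(text):
    """Return {database path: hex key} for every complete entry in the log text."""
    keys = {}
    current_path = None
    for line in text.splitlines():
        db_match = DB_LINE.match(line)
        if db_match:
            current_path = db_match.group("path")
            continue
        key_match = KEY_LINE.match(line)
        if key_match and current_path is not None:
            keys[current_path] = key_match.group("key")
            current_path = None
    return keys
```

A database that appears several times in the log simply overwrites its earlier entry, so the result holds one key per path.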

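Users of other operating systems can try the following approaches. These were only researched, not verified, and are listed for reference: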

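Decrypting the database

On my macOS laptop, the chat history is stored in msg_0.db through msg_9.db, so only these databases need to be decrypted.

Decryption requires sqlcipher. On macOS, install it with: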

brew install sqlcipher

Run the following script to parse dbtrace.log automatically, decrypt each msg_x.db, and export it to plain_msg_x.db:

python3 decrypt.py
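
The contents of decrypt.py are not shown here, so the following is only a sketch of what a decryption step of this kind might look like. It uses SQLCipher's standard plaintext export (`ATTACH ... KEY ''` plus `sqlcipher_export`) and copies the PRAGMA values from the sample log above. The helper names `build_export_sql` and `decrypt` are hypothetical.

```python
import subprocess


def build_export_sql(hex_key, plain_path):
    """Build the SQL that unlocks an encrypted database and copies it to plain_path."""
    escaped = plain_path.replace("'", "''")  # escape quotes for the SQL string literal
    return "\n".join([
        f"PRAGMA key = \"x'{hex_key}'\";",
        # Cipher settings as printed in the sample dbtrace.log (an assumption
        # that they apply to every database).
        "PRAGMA cipher_compatibility = 3;",
        "PRAGMA kdf_iter = 64000;",
        "PRAGMA cipher_page_size = 1024;",
        # An empty key makes the attached database unencrypted.
        f"ATTACH DATABASE '{escaped}' AS plaintext KEY '';",
        "SELECT sqlcipher_export('plaintext');",
        "DETACH DATABASE plaintext;",
        "",
    ])


def decrypt(db_path, hex_key, plain_path):
    """Run the sqlcipher CLI on db_path, feeding it the export script on stdin."""
    subprocess.run(
        ["sqlcipher", db_path],
        input=build_export_sql(hex_key, plain_path),
        text=True,
        check=True,
    )
```

For example, `decrypt(path, key, "plain_msg_0.db")` would be called once per msg_x.db entry found in dbtrace.log.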

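Data processing

Open a decrypted database plain_msg_x.db with https://sqliteviewer.app/, find the table that holds the chat history you want, and fill in the database and table names in prepare_data.py. Then run the script below to generate the training data train.json. The current strategy is fairly simple: it handles only single-turn dialogue and merges consecutive messages sent within 5 minutes.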

python3 prepare_data.py

A sample of the training data:

[
    {"instruction": "你好", "output": "你好"},
    {"instruction": "你是谁", "output": "你猜猜"}
]
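
The merging strategy described above can be sketched roughly as follows. This is one possible reading, not the actual prepare_data.py: it assumes messages arrive as hypothetical `(timestamp_seconds, from_me, text)` tuples sorted by time, joins merged messages with a newline, and pairs each turn from the other person with the reply that directly follows it.

```python
import json

MERGE_WINDOW_SECONDS = 5 * 60


def merge_turns(messages):
    """Merge consecutive messages from the same speaker sent within the window.

    messages: list of (timestamp_seconds, from_me, text), sorted by time.
    Returns a list of [last_timestamp, from_me, merged_text] turns.
    """
    turns = []
    for ts, from_me, text in messages:
        last = turns[-1] if turns else None
        if last and last[1] == from_me and ts - last[0] <= MERGE_WINDOW_SECONDS:
            last[0] = ts               # extend the window from the latest message
            last[2] += "\n" + text     # newline join is an assumption
        else:
            turns.append([ts, from_me, text])
    return turns


def build_samples(messages):
    """Turn each (other person's turn, my reply) pair into one training record."""
    turns = merge_turns(messages)
    samples = []
    for prev, cur in zip(turns, turns[1:]):
        if not prev[1] and cur[1]:
            samples.append({"instruction": prev[2], "output": cur[2]})
    return samples


def save_samples(samples, path="train.json"):
    """Write records in the same JSON list format as the sample above."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=4)
```

Passing `ensure_ascii=False` keeps the Chinese text readable in train.json instead of escaping it.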

Model training

Prepare a Linux machine with GPUs and scp train.json to it.

I used stanford_alpaca to run full fine-tuning of LLaMA-7B. On 8 V100-SXM2-32GB GPUs, training on 90k samples for 3 epochs took only about 1 hour.
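
For context on how train.json records are consumed: stanford_alpaca's train.py wraps each record in a fixed prompt template before tokenizing, and records without an `input` field use the shorter template. The sketch below reproduces that formatting as I understand it; treat the exact template strings as an assumption and check them against train.py in the version you clone. `format_prompt` is a hypothetical helper name.

```python
# Template strings as understood from stanford_alpaca's train.py (assumption:
# verify against the cloned version before relying on them).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)
PROMPT_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def format_prompt(example):
    """Pick the template by whether the record carries a non-empty input field."""
    has_input = example.get("input", "") != ""
    template = PROMPT_INPUT if has_input else PROMPT_NO_INPUT
    return template.format_map(example)
```

Records in the format shown earlier (only `instruction` and `output`) take the no-input branch. The same prompt should be used at inference time so that generation matches what the model saw during training.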

# clone the alpaca repo
git clone https://github.com/tatsu-lab/stanford_alpaca.git && cd stanford_alpaca
# adjust deepspeed config ... such as disabling offloading
vim ./configs/default_offload_opt_param.json
# train with deepspeed zero3
torchrun --nproc_per_node=8 --master_port=23456 train.py \
    --model_name_or_path huggyllama/llama-7b \
    --data_path ../train.json \
    --model_max_length 128 \
    --fp16 True \
    --output_dir ../llama-wechat \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 False
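
As a rough sanity check on these settings, the number of optimizer steps can be estimated from the batch configuration. The sketch below is back-of-the-envelope arithmetic, not the trainer's exact bookkeeping; it assumes exactly 90,000 samples (the text only says 90k), and `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_samples, num_gpus, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Estimate global batch size, steps per epoch, total steps, and warmup steps."""
    global_batch = num_gpus * per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / global_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return global_batch, steps_per_epoch, total_steps, warmup_steps


# Values from the command above: 8 GPUs, batch 8 per device, no accumulation.
print(training_steps(90_000, 8, 8, 1, 3, 0.03))  # (64, 1407, 4221, 127)
```

With a global batch of 64, the run comes to roughly 4,200 steps, of which about 127 are warmup. With `--logging_steps 10` that means a loss line roughly every 640 samples.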

DeepSpeed ZeRO-3 saves the weights in shards, which need to be merged into a single PyTorch checkpoint file:

cd llama-wechat
python3 zero_to_fp32.py . pytorch_model.bin

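On consumer GPUs, you can try alpaca-lora to fine-tune only the LoRA weights, which significantly reduces GPU memory use and training cost.

Model deployment

Frontend debugging

You can use alpaca-lora to deploy a gradio frontend for debugging. If you did full fine-tuning, comment out the peft-related code so that only the base model is loaded.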

git clone https://github.com/tloen/alpaca-lora.git && cd alpaca-lora
CUDA_VISIBLE_DEVICES=0 python3 generate.py --base_model ../llama-wechat

Result:

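WeChat integration

You need to deploy a model service compatible with the OpenAI API. Here it is lightly adapted from llama4openai-api.py; see llama4openai-api.py in this repository. Start the service: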

CUDA_VISIBLE_DEVICES=0 python3 llama4openai-api.py

Test whether the endpoint works:

curl http://127.0.0.1:5000/chat/completions -v -H "Content-Type: application/json" -H "Authorization: Bearer $OPENAI_API_KEY" --data '{"model":"llama-wechat","max_tokens":128,"temperature":0.95,"messages":[{"role":"user","content":"你好"}]}'
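
The same check can be scripted. The sketch below mirrors the curl call using only the Python standard library; the URL, model name, and sampling parameters are copied from the command above, while the function names are hypothetical. It returns the decoded JSON as-is, since the exact response shape of llama4openai-api.py is not shown here (an OpenAI-style `choices[0].message.content` field is a reasonable guess, not a confirmed fact).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:5000", api_key="", model="llama-wechat"):
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": model,
        "max_tokens": 128,
        "temperature": 0.95,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to the running service and return the decoded JSON reply."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return json.load(resp)
```

Calling `chat("你好")` while the service is running should return the server's JSON reply. Building the request separately from sending it makes the payload easy to inspect without a live server.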

Use wechat-chatgpt to connect to WeChat, setting the API address to your own local model service:

docker run -it --rm --name wechat-chatgpt \
    -e API=http://127.0.0.1:5000 \
    -e OPENAI_API_KEY=$OPENAI_API_KEY \
    -e MODEL="gpt-3.5-turbo" \
    -e CHAT_PRIVATE_TRIGGER_KEYWORD="" \
    -v $(pwd)/data:/app/data/wechat-assistant.memory-card.json \
    holegots/wechat-chatgpt:latest

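Result:

"刚接入" ("just connected") was the first thing the bot said, and the other person never caught on, even by the end of the conversation.

Overall, a bot trained on chat history will inevitably make some common-sense mistakes, but it already imitates the chat style fairly well.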

Contributors

li-plus

chat4u's Issues

dtrace: failed to compile script `dbcracker.d`

Hi, in step 3 of "Obtaining the database key", after temporarily disabling SIP, running the script fails with:

dtrace: failed to compile script ./wechat-decipher-macos/macos/dbcracker.d: line 102: extraneous argument '1073' ($3 is not referenced)

System information:
ProductName: macOS
ProductVersion: 12.7.1
BuildVersion: 21G920

output repeat instruction

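Thanks to the author! I followed the guide and, after fine-tuning with alpaca-lora, tried entering prompts in gradio. Earlier, for data-format reasons, I had added an input item between instruction and output in the original sample. Now, when I enter a Chinese instruction, the output most of the time just repeats the instruction once or several times. Any help would be appreciated. Thanks in advance!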
