
lixiang0 / web_kg

Crawl Chinese pages from Baidu Baike, extract triples, and build a Chinese knowledge graph.

Home Page: http://kg.rubenxiao.com

Languages: Python 100.00%
Topics: spider, baidu, baike, wiki, neo4j, knowledge-graph, nlp

web_kg's Introduction

An open-source web knowledge graph project

  • Crawl Chinese pages from Baidu Baike
  • Parse triples and page content
  • Build a Chinese knowledge graph
  • Build a Baike chatbot (in progress)
Update 2020-07-20

For deployment on Windows, see "How to deploy on Windows". Thanks, LMY-nlp0701!

Update 2019-11-21
  • Migrated the code to the scrapy crawler framework
  • Improved the extraction code
  • Moved data persistence to MongoDB
  • Fixed the broken chatbot
  • Opened the neo4j console so the knowledge graph can be inspected as it takes shape
Tips
  • For problems with the project, please open an issue.
  • For anything not suitable for public discussion, send an email.
  • For the ChatBot, visit this link.
  • To see the finished Baike knowledge graph, visit this link (username: neo4j, password: 123). It looks like this:

Environment

  • python 3.6
  • re: regex matching of URLs
  • scrapy: web crawling and page parsing
  • neo4j: the graph database for the knowledge graph; for installation, see this link
  • pip install neo4j-driver: the neo4j Python driver
  • pip install pymongo: MongoDB support for Python (the package is pymongo, not pymongodb)
  • MongoDB: for installation, see this link
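
Before the first crawl, it can help to confirm that both stores are reachable. A minimal sketch, assuming the connection defaults that appear elsewhere on this page (bolt://localhost:7687 with auth ("neo4j", "123"), MongoDB on its default port); it is not part of the project itself:

from neo4j import GraphDatabase
from pymongo import MongoClient

# Check neo4j: RETURN 1 succeeds only if the server is reachable and auth is valid.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "123"))
with driver.session() as session:
    print(session.run("RETURN 1").single()[0])
driver.close()

# Check MongoDB: server_info() raises if mongod is not running.
client = MongoClient("mongodb://localhost:27017")
print(client.server_info()["version"])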

Running the code:

cd WEB_KG/baike
scrapy crawl baike

The crawl in progress (press Ctrl+C to stop):

Knowledge graph visualization

Page content stored in MongoDB

Triples stored in MongoDB

The neo4j console

web_kg's People

Contributors: lixiang0


web_kg's Issues

No such file or directory. I really can't tell what's missing

D:\bandzip\WEB_KG-master\baike\spiders>python baike.py
Traceback (most recent call last):
  File "baike.py", line 20, in <module>
    class BaikeSpider(scrapy.Spider):
  File "baike.py", line 30, in BaikeSpider
    driver = GraphDatabase.driver(
  File "d:\ProgramData\Anaconda3\lib\site-packages\neo4j\__init__.py", line 120, in driver
    return Driver(uri, **config)
  File "d:\ProgramData\Anaconda3\lib\site-packages\neo4j\__init__.py", line 161, in __new__
    return subclass(uri, **config)
  File "d:\ProgramData\Anaconda3\lib\site-packages\neo4j\__init__.py", line 235, in __new__
    pool.release(pool.acquire())
  File "d:\ProgramData\Anaconda3\lib\site-packages\neobolt\direct.py", line 715, in acquire
    return self.acquire_direct(self.address)
  File "d:\ProgramData\Anaconda3\lib\site-packages\neobolt\direct.py", line 608, in acquire_direct
    connection = self.connector(address, error_handler=self.connection_error_handler)
  File "d:\ProgramData\Anaconda3\lib\site-packages\neo4j\__init__.py", line 232, in connector
    return connect(address, **dict(config, **kwargs))
  File "d:\ProgramData\Anaconda3\lib\site-packages\neobolt\direct.py", line 972, in connect
    raise last_error
  File "d:\ProgramData\Anaconda3\lib\site-packages\neobolt\direct.py", line 963, in connect
    s, der_encoded_server_certificate = _secure(s, host, security_plan.ssl_context, **config)
  File "d:\ProgramData\Anaconda3\lib\site-packages\neobolt\direct.py", line 854, in _secure
    s = ssl_context.wrap_socket(s, server_hostname=host if HAS_SNI and host else None)
  File "d:\ProgramData\Anaconda3\lib\ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "d:\ProgramData\Anaconda3\lib\ssl.py", line 1040, in _create
    self.do_handshake()
  File "d:\ProgramData\Anaconda3\lib\ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
FileNotFoundError: [Errno 2] No such file or directory

extract-para.py

When I run extract-para.py, the program does not terminate after all the txt files have been processed. What could be the reason?

neo4j.v1

from neo4j.v1 import GraphDatabase
ModuleNotFoundError: No module named 'neo4j.v1'
Why?
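
Newer versions of the neo4j Python driver dropped the neo4j.v1 namespace; with the driver versions mentioned elsewhere on this page (1.7.6 and later), the class is imported from the top-level package:

from neo4j import GraphDatabase  # replaces: from neo4j.v1 import GraphDatabase

Alternatively, pin an older driver release that still ships neo4j.v1.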

About re-initialization

If I want to delete all previously collected data and re-run the crawl for only the data I need, is deleting everything under data enough?
Also, if I only want to crawl a target set I'm interested in, given this block:

items = set(response.xpath(
    '//a[contains(@href, "/item/")]/@href').re(r'/item/[A-Za-z0-9%\u4E00-\u9FA5]+'))
for item in items:
    new_url = 'https://baike.baidu.com' + urllib.parse.unquote(item)
    new_item_name = re.sub(
        '/', '', re.sub('https://baike.baidu.com/item/', '', new_url))
    if new_item_name not in self.olds:
        yield response.follow(new_url, callback=self.parse)

can I just comment this block out and substitute the URLs I need? (One possible filter is sketched below.)
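
One possible answer, sketched rather than taken from the project: keep the loop but filter against a whitelist of items, so only the pages you care about are followed. This is a drop-in variant of the loop quoted above, inside the spider's parse method; the whitelist contents are hypothetical:

# Hypothetical whitelist of the only items we want to follow.
TARGET_ITEMS = {'人工智能', '知识图谱'}

for item in items:
    new_url = 'https://baike.baidu.com' + urllib.parse.unquote(item)
    new_item_name = re.sub(
        '/', '', re.sub('https://baike.baidu.com/item/', '', new_url))
    # Follow only whitelisted items that have not been crawled before.
    if new_item_name in TARGET_ITEMS and new_item_name not in self.olds:
        yield response.follow(new_url, callback=self.parse)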

python3.6 get UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)

Traceback (most recent call last):
  File "html_parser.py", line 55, in <module>
    new_urls, _ = parser.parse(content)
  File "html_parser.py", line 44, in parse
    is_saved = self._save_new_data(soup, html_cont)
  File "html_parser.py", line 34, in _save_new_data
    with open(os.path.join(path, title+'.html'), 'w') as f:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
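
The failing line opens the file with the platform default codec, which on this system is 'ascii', so any non-ASCII title crashes the write. A sketch of the usual fix (not the author's patch), using the names from the traceback above:

import os

# Sketch of _save_new_data with an explicit UTF-8 encoding, so non-ASCII
# titles and page bodies bypass the default 'ascii' codec.
def _save_new_data(path, title, html_cont):
    with open(os.path.join(path, title + '.html'), 'w', encoding='utf-8') as f:
        f.write(html_cont)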

Extracting other Baike content

Hello, I ran your project successfully; it is an excellent project.

However, most of the crawled content is about people. Where should I modify the code to crawl other kinds of content?

Successfully deployed the project on Win10

Steps and screenshots of a successful run

1. Start the MongoDB service
[screenshot 1]

2. Start the neo4j service
[screenshot 2]

3. Open http://localhost:7474/ in a browser
The following page appears:
[screenshot 3]
The username and password match the code: auth=("neo4j", "123")
[screenshot 4]

4. Start PyCharm, enter the run command in the terminal, and start the crawl
[screenshot 5]
Note: there are still warnings, but so far they have not affected the run.

5. Final result: the neo4j knowledge graph updates dynamically as triples are extracted
[screenshots 6-9]
Note: the generated knowledge graph still looks a little odd; I need to study it further.

Win10 environment setup

Preface: I have uploaded all the packages needed on Windows to Baidu Netdisk; download them if you need them.
Link: https://pan.baidu.com/s/1buizBSSuT4wIgPUFtUQW9g
Extraction code: jay1

[screenshot 10]
Step-by-step configuration:
1. Install PyCharm Community Edition + Python 3.7.8
2. Install MongoDB 3.2.22
MongoDB installation guide
Note: remember to follow the guide and install MongoDB as a service.
3. Install neo4j
neo4j installation guide
Note: as the first part of the guide explains, a Java runtime must be installed first; I used jdk-14.0.2_windows-x64_bin.
Unzip neo4j-community-4.1.1-windows straight to the D drive, and start Neo4j as described in part four of the guide.
4. Install the packages in PyCharm:
scrapy 1.6.0
pymongo 3.10.1
neo4j 1.7.6
neo4j-driver 1.7.6

One last note: line 29 of WEB_KG-master\baike\spiders\baike.py should be changed to: driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "123"), encrypted=False).
Also, the logging error in baike.py happens because Linux and Windows write folder paths differently; you can fix the path or simply comment the logging out.

Many thanks to the author for open-sourcing this work; I hope everyone can learn from it together. I still have a few bugs of my own, so that's all for now!

OSError: [Errno 22] Invalid argument:

I ran scrapy crawl baike in the project's baike directory and got: "OSError: [Errno 22] Invalid argument: 'D:\code\program\WEB_KG-master\baike\logs\Sun_Mar_22_21:30:37_2020.log'". How can I fix this? Many thanks!
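
The colons in the timestamp are the culprit: Windows forbids ':' in file names, so the log file cannot be created. A sketch of one fix (not the author's patch) that formats the timestamp without colons:

import os
import time

# Produces e.g. 'Sun_Mar_22_21-30-37_2020.log': dashes where the original
# name had the colons that Windows rejects in file names.
log_name = time.strftime('%a_%b_%d_%H-%M-%S_%Y') + '.log'
log_path = os.path.join('logs', log_name)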

Crawled content is empty

The crawled text is empty, and when adding triples, attrs and values are empty too, so nothing ever makes it into the triple store.

neo4j.exceptions.AuthError: The client is unauthorized due to authentication failure. What causes this on Ubuntu? I'm at the last step. Thanks!

Traceback (most recent call last):
  File "/home/lab548/Downloads/nlp/QA/WEB_KG-master/kg/insert_to_neo4j.py", line 13, in <module>
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "123"))
  File "/usr/local/lib/python3.5/dist-packages/neo4j/v1/api.py", line 94, in driver
    return Driver(uri, **config)
  File "/usr/local/lib/python3.5/dist-packages/neo4j/v1/api.py", line 133, in __new__
    return subclass(uri, **config)
  File "/usr/local/lib/python3.5/dist-packages/neo4j/v1/direct.py", line 73, in __new__
    pool.release(pool.acquire())
  File "/usr/local/lib/python3.5/dist-packages/neo4j/v1/direct.py", line 44, in acquire
    return self.acquire_direct(self.address)
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 450, in acquire_direct
    connection = self.connector(address, self.connection_error_handler)
  File "/usr/local/lib/python3.5/dist-packages/neo4j/v1/direct.py", line 70, in connector
    return connect(address, security_plan.ssl_context, error_handler, **config)
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 704, in connect
    raise last_error
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 696, in connect
    connection = _handshake(s, resolved_address, der_encoded_server_certificate, error_handler, **config)
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 668, in _handshake
    connection.init()
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 207, in init
    self.sync()
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 380, in sync
    detail_delta, summary_delta = self.fetch()
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 287, in fetch
    return self._fetch()
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/connection.py", line 327, in _fetch
    response.on_failure(summary_metadata or {})
  File "/usr/local/lib/python3.5/dist-packages/neo4j/bolt/response.py", line 61, in on_failure
    raise AuthError(message)
neo4j.exceptions.AuthError: The client is unauthorized due to authentication failure.

How to import the triples into neo4j

Hello. Reading your code I have a question: I can't see how the extracted triples (stored in your project as pickled .bin files) get imported into the neo4j database. The project does not seem to contain this code; if convenient, could you publish it?
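
For reference, a minimal sketch of one way to do this with the official driver. The file name triples.bin and the (subject, predicate, object) layout of each pickled entry are assumptions about the project's format:

import pickle
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "123"))

with open('triples.bin', 'rb') as f:  # hypothetical file name
    triples = pickle.load(f)

with driver.session() as session:
    for subj, pred, obj in triples:
        # MERGE keeps the graph free of duplicate nodes and edges.
        session.run(
            "MERGE (s:Entity {name: $s}) "
            "MERGE (o:Entity {name: $o}) "
            "MERGE (s)-[:REL {name: $p}]->(o)",
            s=subj, p=pred, o=obj)
driver.close()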

Also, knowledge graph results are usually visualized on the web these days. Do you already have code, or at least an approach, for that part? If convenient, please share that as well.

Many thanks.

list index out of range

Hi, I'm trying to crawl pages of my own choosing, so steps two and three seem necessary, but step two fails and I don't know how to fix it. The error is at print(pages[0]): IndexError: list index out of range, even though the page actually exists.

How to improve performance: node growth slows down after crawling for a while

As the title says: the neo4j database took about three hours to reach 90k nodes, after which nodes were added very slowly. How can this be optimized?

The rate curve looks roughly like this:
[rate-curve screenshot]

Because this is deployed on an Alibaba Cloud host with limited memory, I replaced the set-based dedup in the code with a Bloom filter and kept crawl state on disk with -s JOBDIR= (a sketch follows).
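
A sketch of the Bloom-filter swap described above, assuming the pybloom-live package (the project itself uses a plain Python set):

from pybloom_live import BloomFilter

# A Bloom filter answers "seen before?" in constant memory, at the cost
# of a small false-positive rate (some new items may be wrongly skipped).
seen = BloomFilter(capacity=10_000_000, error_rate=0.001)

def is_new(item_name):
    if item_name in seen:
        return False
    seen.add(item_name)
    return True

Persisting crawl state is standard scrapy: scrapy crawl baike -s JOBDIR=crawls/baike lets a stopped crawl resume where it left off (the directory name here is arbitrary).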

Visualization problems at large data volumes

Hello, I'd like to ask: once the data grows beyond a certain size, the visualization stutters and takes a long time to render. How did you optimize this part?
