
bilibilispider's Introduction

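Bilibili Video Information Spider

Development environment: Python 3.6.5, Scrapy 1.5.0

Bilibili's video statistics (play count, danmaku count, coin count, and so on) are loaded dynamically, so they cannot be found by analyzing the original page with XPath or CSS selectors. Inspecting the page's network requests reveals the API endpoint that serves them: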

image

Open the Request_URL in a browser to see the required information:

image
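
As a rough illustration of reading such a response, here is a minimal sketch using only the standard library. The original does not list the response fields, so the key names `code`, `data`, `view`, `danmaku`, and `coin` are assumptions, the sample body is made up, and `parse_stat` is a hypothetical helper rather than part of the spider.

```python
import json

# Hypothetical sample body. The keys ("code", "data", "view", "danmaku",
# "coin") are assumed for illustration and the numbers are made up.
SAMPLE_BODY = '{"code": 0, "data": {"aid": 12345, "view": 100, "danmaku": 20, "coin": 3}}'


def parse_stat(body):
    """Return the play, danmaku and coin counts from an API response body."""
    payload = json.loads(body)
    if payload.get("code") != 0:
        # Assumption: a non-zero code means the response carries no usable data.
        return None
    data = payload.get("data") or {}
    return {
        "view": data.get("view"),
        "danmaku": data.get("danmaku"),
        "coin": data.get("coin"),
    }


stats = parse_stat(SAMPLE_BODY)
```

Returning `None` for a non-zero code lets the caller skip av numbers that have no usable statistics instead of raising.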

Page request logic

# Excerpt from the spider's parse callback. av_num and match_re are set elsewhere in the spider;
# assumes `import scrapy` and `from scrapy import Request` at module level.
title = response.css("#viewbox_report h1 span::text").extract()
url = response.url
UP = response.css(".info a[href]::text").extract()
up_url = response.css(".info a::attr(href)").extract()
video_type = response.css("span.crumb:nth-child(2) > a:nth-child(1)::text").extract()
up_time = response.css(".tm-info.tminfo time::text").extract()

# Request the JSON statistics from the API, passing the fields scraped from the page along in meta
yield scrapy.Request("https://api.bilibili.com/x/web-interface/archive/stat?aid={0}".format(av_num),
                   meta={
                         "title":title,"url":url, "UP":UP,"up_url":up_url,"video_type":video_type,
                         'up_time':up_time,"av_num":av_num,
                          },callback=self.parse_item)

# Request the next video page (static HTML)
yield Request(url='https://www.bilibili.com/video/av{0}'.format(match_re), callback=self.parse)

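The spider crawls the information for all videos by iterating over av numbers:

When requesting an original URL, the spider first scrapes the information available on the page itself and passes it along through meta. It then builds the API address 'https://api.bilibili.com/x/web-interface/archive/stat?aid=' + av number, sends a second request to that address, scrapes the remaining information, and assembles the item. Once that is done, it requests the next URL.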
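
The flow described above can be sketched without Scrapy as plain URL construction plus a merge step. This is only an illustration. The two URL patterns come from the text, while `iter_targets`, `build_item`, the av range, and the sample fields are hypothetical, and the actual spider may obtain the next av number differently.

```python
API_TEMPLATE = "https://api.bilibili.com/x/web-interface/archive/stat?aid={0}"
PAGE_TEMPLATE = "https://www.bilibili.com/video/av{0}"


def iter_targets(first_av, last_av):
    """Yield (av_num, page_url, api_url) for each av number in the range."""
    for av_num in range(first_av, last_av + 1):
        yield av_num, PAGE_TEMPLATE.format(av_num), API_TEMPLATE.format(av_num)


def build_item(meta, stats):
    """Merge fields scraped from the page (meta) with fields from the API."""
    item = dict(meta)   # copy so the caller's meta dict is not modified
    item.update(stats)  # API fields are added on top of the page fields
    return item


# Hypothetical usage: three av numbers and made-up field values.
targets = list(iter_targets(1, 3))
item = build_item({"title": ["demo"], "av_num": 1}, {"view": 10})
```

In the spider itself, the page fields travel to the second callback through the request's meta, which plays the role of the `meta` argument here.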

Visualizing the data with pyecharts

Share of plays by category on bilibili in January

image

Top 15 uploaders (UP) on bilibili by play count in January

image

Download link for the 370,000 raw records: https://pan.baidu.com/s/1_NKz0nmQ7ZsjmXRJI1rQ0Q
