
librauee / reptile


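🏀 Hands-on Python3 web crawlers (some with detailed tutorials): Maoyan, Tencent Video, Douban, Yanzhaowang (graduate admissions site), Weibo, Biquge novels, Baidu Hot Topics, Bilibili, CSDN, NetEase Cloud Reading, Ali Literature, Baidu Stocks, Toutiao, WeChat Official Accounts, NetEase Cloud Music, Lagou, Youdao, unsplash, Shixiseng, Autohome, League of Legends Box, Dianping, Lianjia, LPL schedule, typhoons, the Fantasy Westward Journey and Onmyoji treasure pavilions (藏宝阁), weather, Nowcoder, Baidu Wenku, bedtime stories, Zhihu, Wish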

Python 99.88% HTML 0.12%
python3 requests scrapy spider

reptile's Introduction


Spider Learning

  • Language : Python3
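  • Content : A collection of crawler learning examples and my own hands-on crawler projects, split into a beginner stage and an intermediate stage. Techniques covered include XPath, BeautifulSoup, regular expressions, Ajax asynchronous loading, proxy IPs, multithreading, packet capture tools, font-based anti-scraping, JS reverse engineering, the Scrapy framework, anti-debugging, and CAPTCHAs.
  • Notice : You are welcome to follow my WeChat official account and grow along with me.
  • It offers plenty of Python learning resources, e-books, and videos; scan the QR code to follow.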

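Beginner stage

XPath

BeautifulSoup

Regular expressions

Ajax asynchronous loading

Proxy IPs

Multithreading

Packet capture tool: Fiddler

Intermediate stage

Font-based anti-scraping

JS reverse engineering

Scrapy framework

Anti-debugging

CAPTCHAs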


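Number | Website | Article
1 | Douban | Douban movie rankings
2 | University rankings |
3 | Weibo |
4 | Yanzhaowang | Scraping graduate admission transfer (调剂) information from Yanzhaowang
5 | Proxy IPs |
6 | Taobao |
7 | Stocks |
8 | Maoyan | Scraping tens of thousands of reviews of The Wandering Earth from Douban and Maoyan
9 | Children's stories | Sending my girlfriend a bedtime story on a schedule
10 | CSDN |
11 | Baidu Hot Topics |
12 | Biquge |
13 | Tencent Video | Scraping bullet comments (danmaku) from Tencent Video TV series
14 | English short essays |
15 | Bus information |
16 | NetEase Cloud Reading |
17 | Toutiao |
18 | NetEase Cloud Music | JS reverse engineering: NetEase Cloud Music
19 | Lagou |
20 | Youdao Translate | A first look at JS reverse engineering: Youdao Translate
21 | Ali Literature | JS reverse engineering: Ali Literature
22 | unsplash | Scrapy in practice: unsplash
23 | Zhangshang Yingxiong Lianmeng (League of Legends mobile app) | One-click scraping of Zhangmeng articles
24 | WeChat Official Accounts | Batch-downloading articles
25 | Lianjia |
26 | Shixiseng | Font-based anti-scraping: Shixiseng
27 | Autohome | Font-based anti-scraping: Autohome
28 | Dianping | Font-based anti-scraping: Dianping
29 | Onmyoji |
30 | Fantasy Westward Journey |
31 | Typhoons |
32 | Nationwide historical weather |
33 | Nowcoder | Scraping a large volume of interview write-ups with Python
34 | PentaQ Esports | Scraping League of Legends professional match data with Python
35 | Baidu Wenku | Removed due to force majeure
36 | Zhihu | A huge collection of Zhihu stickers
37 | wish |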


reptile's People

Contributors

fattotora, librauee


reptile's Issues

Duplicate entry

The link for "Regular expressions: Bus information" is the same as the one for "Proxy IPs".

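A problem with the typhoon script

I copied the typhoon history script by hand, and the same error occurs every time I run it.
Could you help me find what went wrong?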

-*- mode: compilation; default-directory: "~/spider/spider/spiders/" -*-
Compilation started at Thu Mar  4 14:19:50

python3 typhoon.py
Traceback (most recent call last):
  File "typhoon.py", line 114, in <module>
    tfcraw.get_tf_detail()
  File "typhoon.py", line 62, in get_tf_detail
    tf_list = self.get_tf_list()
  File "typhoon.py", line 44, in get_tf_list
    year_list = self.get_year()
  File "typhoon.py", line 34, in get_year
    years = r.json()
  File "/home/steiner/.local/lib/python3.6/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Compilation exited abnormally with code 1 at Thu Mar  4 14:19:51

Here is the code:

import requests
from pymongo import MongoClient
import time
import random

class Typhoon:
    def __init__(self):
        self.user_agent = [
                           "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
                           "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
                           "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
                           "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
                           "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        ]

        self.base_url = 'http://www.wztf121.com/data/complex/{}.json'
        self.headers = {
            'Cookie': '_gscu_1378142123=65572018r5on4x80; _gscbrs_1378142123=1; vjuids=30469f88b.16c835d32ea.0.8062809782e9b; vjlast=1565572019.1565572019.30; Hm_lvt_e592d6befa4f9918e6496980d22c5649=1565572019; Wa_lvt_1=1565572019; Wa_lpvt_1=1565576034; _gscs_1378142123=65572018v2ofkf80|pv:8; Hm_lpvt_e592d6befa4f9918e6496980d22c5649=1565576061',
            'Host': 'www.wztf121.com',
            'Referer': 'http://www.wztf121.com/history.html',
            'User-Agent': random.choice(self.user_agent)
        }

        self.client = MongoClient()
        self.db     = self.client.typhoon



    def get_year(self):
        year_list = []
        years_url = self.base_url.format('years')

        r = requests.get(years_url, headers = self.headers)
        years = r.json()

        for year in years:
            year_list.append(year['year'])

        print('Fetched the years of all typhoon records')
        return year_list

    def get_tf_list(self):
        tf_list = []
        year_list = self.get_year()

        for year in year_list:
            url = self.base_url.format(year)

            r = requests.get(url, headers = self.headers)
            tfs = r.json()

            for tf in tfs:
                tfbh = tf['tfbh']
                tf_list.append(tfbh)

            time.sleep(random.random())

        print('Fetched all typhoon IDs, formatted as year + sequence number')
        return tf_list

    def get_tf_detail(self):
        tf_list = self.get_tf_list()
        count = 1
        for tf in tf_list:
            tf_url = self.base_url.format(tf)
            r = requests.get(tf_url, headers = self.headers)
            tf_detail = r.json()

            begin_time = tf_detail[0]['begin_time']
            ename      = tf_detail[0]['ename']
            end_time   = tf_detail[0]['end_time']
            name       = tf_detail[0]['name']
            points     = tf_detail[0]['points']

            for point in points:
                latitude  = point['latitude']
                longitude = point['longitude']
                power     = point['power']
                speed     = point['speed']
                pressure  = point['pressure']
                strong    = point['strong']
                real_time = point['time']

                detail = {
                    'name': name,
                    'ename': ename,
                    'latitude': latitude,
                    'longitude': longitude,
                    'power': power,
                    'speed': speed,
                    'pressure': pressure,
                    'strong': strong,
                    'time': real_time,
                }
                self.db['detail'].insert_one(detail)


            time.sleep(5 * random.random())
            tf_info = {
                'name': name,
                'ename': ename,
                'begin_time': begin_time,
                'end_time': end_time,
            }

            self.db['info'].insert_one(tf_info)
            print('Stored the detailed record for typhoon #{}!'.format(count))
            count += 1

                
            
        
tfcraw = Typhoon()
tfcraw.get_tf_detail()
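
A note on the traceback: "Expecting value: line 1 column 1 (char 0)" means the JSON parser found no valid value at the very first character of the response body. That usually points to an empty body or a non-JSON payload (for example an HTML error page) rather than to a typo in the copied script, although which of these applies here cannot be told from the traceback alone. Below is a minimal sketch of a defensive parse step. `safe_json` is a hypothetical helper, not part of this repository or of requests; it only assumes the response object exposes `.status_code` and `.text`, as `requests.Response` does. The stand-in objects let it run without network access.

```python
import json
from types import SimpleNamespace


def safe_json(response):
    """Return the parsed JSON body, or None when the body is not usable JSON.

    `response` is any object with `.status_code` and `.text`
    (hypothetical helper; a requests.Response fits this shape).
    """
    if response.status_code != 200:
        print('Unexpected status code:', response.status_code)
        return None
    try:
        return json.loads(response.text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Show the start of the body so the real payload is visible.
        print('Body is not JSON, starts with:', repr(response.text[:80]))
        return None


# Stand-ins for real responses (assumed payloads, for illustration only).
empty_body = SimpleNamespace(status_code=200, text='')
html_body = SimpleNamespace(status_code=200, text='<html>Not Found</html>')
good_body = SimpleNamespace(status_code=200, text='[{"year": 2019}]')

print(safe_json(empty_body))
print(safe_json(html_body))
print(safe_json(good_body))
```

In the script above, `get_year` could call `safe_json(r)` instead of `r.json()` and stop early when it returns None; the printed body prefix then shows what the server actually sent for `years.json`.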
