A fast and powerful web scraping library

Home Page: https://scalpel.readthedocs.io/en/latest/

License: Apache License 2.0


Pyscalpel

Your easy-to-use, fast and powerful web scraping library.

Why?

I already knew scrapy, which is the reference for web scraping in Python, but two things bothered me:

  • I feel that scrapy cannot be integrated easily into an existing project; you have to treat your web scraping code as a project of its own.
  • scrapy relies on Twisted, a veteran of asynchronous programming, but I think there are better asynchronous frameworks today. Note that this second point is no longer true as I write this document, since scrapy has added support for asyncio.

After making these observations, I decided to create pyscalpel. And let's be honest, I also wanted to have my own web scraping library, and it is fun to write one ;)

Installation

pip install pyscalpel  # to only use the asyncio backend
pip install pyscalpel[gevent] # to install the gevent backend
pip install pyscalpel[trio] # to install the trio backend
pip install pyscalpel[full] # to install all the backends

If you know about poetry you can use it instead of pip.

poetry add pyscalpel  # to only use the asyncio backend
poetry add pyscalpel[gevent] # to install the gevent backend
poetry add pyscalpel[trio] # to install the trio backend
poetry add pyscalpel[full] # to install all the backends
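
If you are unsure which optional backends ended up in your environment, a small check like the following can help. This is a minimal sketch that relies only on the standard library and is not part of pyscalpel; the helper name `available_backends` is hypothetical, and it assumes the extras install the `gevent` and `trio` packages under those import names.

```python
from importlib.util import find_spec


def available_backends() -> dict:
    """Report which concurrency backends can be imported in this environment.

    Hypothetical helper for illustration, not a pyscalpel function.
    """
    # asyncio ships with the standard library, so it is always importable
    backends = {'asyncio': True}
    for name in ('gevent', 'trio'):
        # find_spec returns None when a top-level module cannot be found
        backends[name] = find_spec(name) is not None
    return backends


if __name__ == '__main__':
    for name, installed in available_backends().items():
        print(f'{name}: {"available" if installed else "not installed"}')
```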

pyscalpel works with Python 3.7 and later. It relies on robust packages:

  • configuror: A configuration toolkit.
  • httpx: A modern http client.
  • selenium: A library for controlling a browser.
  • gevent: An asynchronous framework that lets you write code in a synchronous style. (optional)
  • trio: A modern asynchronous framework using async/await syntax. (optional)
  • anyio: An asynchronous networking and concurrency library that works on top of either asyncio or trio.
  • parsel: A library to extract elements from HTML/XML documents.
  • attrs: A library helping to write classes without pain.
  • fake-useragent: A simple library to fake a user agent.
  • rfc3986: A library for url parsing and validation.
  • msgpack: A library allowing for fast serialization/deserialization of data structures.

Documentation

The documentation is available at https://scalpel.readthedocs.io/en/latest/.

Usage

To give you an overview of what can be done, here is a simple quote-scraping example. Don't hesitate to look at the examples folder for more snippets.

with gevent

from pathlib import Path

from scalpel import Configuration
from scalpel.green import StaticSpider, StaticResponse, read_mp

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a').xpath('@href').get()
    if next_link is not None:
        response.follow(next_link)

if __name__ == '__main__':
    backup = Path(__file__).parent / 'backup.mp'
    config = Configuration(backup_filename=f'{backup}')
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)
    spider.run()
    print(spider.statistics())
    # you can do whatever you want with the results
    for quote_data in read_mp(filename=backup, decoder=spider.config.msgpack_decoder):
        print(quote_data)

with anyio

from pathlib import Path

import anyio
from scalpel import Configuration
from scalpel.any_io import StaticResponse, StaticSpider, read_mp


async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        await spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a').xpath('@href').get()
    if next_link is not None:
        await response.follow(next_link)

async def main():
    backup = Path(__file__).parent / 'backup.mp'
    config = Configuration(backup_filename=f'{backup}')
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=parse, config=config)
    await spider.run()
    print(spider.statistics())
    # you can do whatever you want with the results
    async for item in read_mp(backup, decoder=spider.config.msgpack_decoder):
        print(item)

if __name__ == '__main__':
    # By default this runs on the asyncio backend. To use the trio backend, install the trio
    # package first and replace the following line with: anyio.run(main, backend='trio').
    anyio.run(main)
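
To see what the XPath queries in the two `parse` callbacks select, here is a minimal sketch that runs equivalent queries against a tiny hand-written page. It uses `xml.etree.ElementTree` from the standard library as a stand-in, so it does not go through pyscalpel or parsel. The sample markup and the helpers `extract_quotes` and `find_next_link` are made up for illustration, and ElementTree only accepts well-formed markup and a subset of XPath, which real pages will often not satisfy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page that imitates the structure the callbacks expect
SAMPLE_PAGE = """
<html><body>
  <div class="quote">
    <span class="text">Sample quote one</span>
    <span>by <small class="author">Author A</small></span>
    <div class="tags"><a class="tag">tag1</a><a class="tag">tag2</a></div>
  </div>
  <div class="quote">
    <span class="text">Sample quote two</span>
    <span>by <small class="author">Author B</small></span>
    <div class="tags"><a class="tag">tag3</a></div>
  </div>
  <nav><ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul></nav>
</body></html>
"""


def extract_quotes(markup: str) -> list:
    """Collect one dict per quote block, mirroring the fields built in parse()."""
    root = ET.fromstring(markup.strip())
    quotes = []
    for quote in root.findall('.//div[@class="quote"]'):
        quotes.append({
            # findtext returns the text of the first matching element
            'message': quote.findtext('./span[@class="text"]'),
            'author': quote.findtext('./span/small'),
            'tags': [link.text for link in quote.findall('./div/a')],
        })
    return quotes


def find_next_link(markup: str):
    """Return the href of the "next" pagination link, or None on the last page."""
    root = ET.fromstring(markup.strip())
    link = root.find('.//nav/ul/li[@class="next"]/a')
    return None if link is None else link.get('href')


if __name__ == '__main__':
    for item in extract_quotes(SAMPLE_PAGE):
        print(item)
    print(find_next_link(SAMPLE_PAGE))
```

In the real callbacks, each dict is handed to `spider.save_item` and the next link to `response.follow`; the sketch only shows which parts of the page those calls receive.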

Known limitations

pyscalpel aims to handle SPAs (single-page applications) through the use of selenium. However, because selenium is synchronous, it is hard to take advantage of the asynchronous features of anyio and gevent, so you will notice that the selenium spider is slower than the static spider. For more information, see the documentation.
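
To illustrate why a synchronous call is awkward in asynchronous code in general, here is a minimal sketch using only asyncio from the standard library. It does not show how pyscalpel drives selenium; `blocking_fetch` is a hypothetical stand-in for any blocking browser call, simulated with `time.sleep`. Called directly from coroutines, the blocking calls run one after another because each one stalls the event loop; handed to worker threads, they can overlap.

```python
import asyncio
import time


def blocking_fetch(delay: float) -> float:
    """Hypothetical stand-in for a synchronous call such as a browser page load."""
    time.sleep(delay)
    return delay


async def run_inline(delays) -> float:
    """Call the blocking function directly inside coroutines and time the batch."""
    async def call(delay):
        # Nothing is awaited here, so the event loop is stalled for the whole call
        return blocking_fetch(delay)

    start = time.perf_counter()
    await asyncio.gather(*(call(delay) for delay in delays))
    return time.perf_counter() - start


async def run_in_threads(delays) -> float:
    """Hand each blocking call to the default thread pool and time the batch."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    await asyncio.gather(
        *(loop.run_in_executor(None, blocking_fetch, delay) for delay in delays)
    )
    return time.perf_counter() - start


def compare(delays=(0.2, 0.2, 0.2)):
    """Return (inline_seconds, threaded_seconds) for the same batch of calls."""
    inline = asyncio.run(run_inline(delays))
    threaded = asyncio.run(run_in_threads(delays))
    return inline, threaded


if __name__ == '__main__':
    inline, threaded = compare()
    print(f'inline: {inline:.2f}s, threaded: {threaded:.2f}s')
```

Threads help with simple blocking calls like this one, but a browser session adds its own startup and rendering cost, which is one reason a selenium-based spider stays slower than a static one.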

Warning

pyscalpel is a young project, so expect breaking changes in the API that do not follow the semver principle. For now, it is recommended to pin the version you are using.
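
If you pin a version, you may also want to verify at startup that the installed release is the one you expect. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`) and the distribution name `pyscalpel` used in the install commands; the helpers `installed_version` and `matches_pin` are hypothetical and not part of pyscalpel.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def matches_pin(distribution: str, pinned: str) -> bool:
    """Tell whether the installed version is exactly the pinned one."""
    return installed_version(distribution) == pinned


if __name__ == '__main__':
    # Prints None when pyscalpel is not installed in the current environment
    print(installed_version('pyscalpel'))
```

Calling `matches_pin('pyscalpel', pinned)` with the version string from your own requirements file gives a quick guard against an accidental upgrade.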

scalpel's Issues

Add support for asyncio

I don't like the design of asyncio, to be honest, but the work done by the author of anyio is awesome.
It brings the trio API to asyncio, which makes asyncio easy to support. I will work on it soon.
