roniemartinez / dude

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

Home Page: https://roniemartinez.github.io/dude/

License: GNU Affero General Public License v3.0

Languages: Python 99.24%, Makefile 0.63%, Dockerfile 0.13%
Topics: python, scraping, framework, playwright, scraper, xpath, css, web-scraping, beautifulsoup4, parsel

dude's Introduction

I am a software developer from the Philippines.

Philippines 🇵🇭 ➡️ Germany 🇩🇪 ➡️ United Kingdom 🇬🇧

I am now in the UK. 🇬🇧

So what do I do in the world of software engineering?

Sponsors keep me motivated in writing, maintaining projects, helping open source, and building new things.

Buy Me A Coffee

Use my DigitalOcean referral link.



dude's Issues

Cache first before saving

To support concurrency in the future, scraped data should be cached first (File or Redis) before calling a function decorated with @save().

This will ensure that the @save() approach can be easily implemented by anyone.

Requirement:

  • Users should be able to switch easily between several caching backends in order to scale (see the sketch below)
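A minimal sketch of what pluggable caching backends could look like; the CacheBackend protocol and FileCache class are hypothetical names, not part of Dude:

import json
from pathlib import Path
from typing import Protocol


class CacheBackend(Protocol):
    def put(self, item: dict) -> None: ...
    def drain(self) -> list[dict]: ...


class FileCache:
    def __init__(self, path: str = "cache.jsonl") -> None:
        self.path = Path(path)

    def put(self, item: dict) -> None:
        # Append each scraped item as one JSON line.
        with self.path.open("a") as f:
            f.write(json.dumps(item) + "\n")

    def drain(self) -> list[dict]:
        # Read everything back and clear the cache file.
        items = [json.loads(line) for line in self.path.read_text().splitlines()]
        self.path.unlink()
        return items

The scraper would put() each scraped item into the configured backend and only pass drain()'s output to the @save()-decorated function at the end of a run.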

Add new decorators: css(), xpath(), regex() and text()

BeautifulSoup4 (#19, #32) and Parsel (#33) have methods like .css(), .xpath(), etc., and we can use these to add a more useful and readable way to find elements in a page. For developers/web scrapers, the words "CSS" and "XPath" sound more familiar than the word "select".

New decorators:

  1. @css() - for CSS selectors
  2. @xpath() - for XPath selectors
  3. @regex() - so as to not be confused with the standard library re
  4. @text() - this is supported by Playwright; for BeautifulSoup4 and Parsel it can be built on top of their regex support
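A rough sketch of how the proposed decorators might be used, assuming each one takes the selector string as its only argument (mirroring @select()); what each handler receives is also an assumption:

from dude import css, regex, text, xpath  # proposed decorators; not yet part of the API


@css("h1.title")
def page_title(element):
    return {"title": element.text_content()}


@xpath("//a[@class='next']")
def next_link(element):
    return {"next": element.get_attribute("href")}


@regex(r"Price: (\d+\.\d{2})")
def price(match):
    return {"price": match}


@text("Add to cart")
def add_to_cart(element):
    return {"button": element.text_content()}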

Explode merged table cells (colspan and rowspan)

It will be easier to extract individual table cell data if colspan and rowspan are exploded into single cells.
Add options (--explode-rowspan, --explode-colspan, and/or --explode-table-cells) to enable this (see the sketch after the examples below).

For example, this table:

<table>
  <tr>
    <td rowspan="2">Two rows</td>
    <td colspan="2">Two columns</td>
  </tr>
  <tr>
    <td>Single A</td>
    <td>Single B</td>
  </tr>
</table>

Should be converted to this:

<table>
  <tr>
    <td>Two rows</td>
    <td>Two columns</td>
    <td>Two columns</td>
  </tr>
  <tr>
    <td>Two rows</td>
    <td>Single A</td>
    <td>Single B</td>
  </tr>
</table>
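Not part of Dude today, but a rough sketch of how the explosion could be implemented with BeautifulSoup4 (explode_cells is a hypothetical helper and the in-place strategy is illustrative only):

from copy import copy

from bs4 import BeautifulSoup


def explode_cells(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    # Expand colspan: duplicate the cell horizontally within the same row.
    for cell in soup.find_all("td", colspan=True):
        count = int(cell["colspan"])
        del cell["colspan"]
        for _ in range(count - 1):
            cell.insert_after(copy(cell))

    # Expand rowspan: copy the cell into the same column of the following rows.
    for cell in soup.find_all("td", rowspan=True):
        count = int(cell["rowspan"])
        del cell["rowspan"]
        row = cell.find_parent("tr")
        column = next(i for i, td in enumerate(row.find_all("td")) if td is cell)
        for sibling in row.find_next_siblings("tr", limit=count - 1):
            clone = copy(cell)
            cells = sibling.find_all("td")
            if column == 0 or not cells:
                sibling.insert(0, clone)
            else:
                cells[min(column, len(cells)) - 1].insert_after(clone)

    return str(soup)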

KeyError when using follow_urls = True on run()

How to reproduce:

from dude import select

@select(css="a")
def result_url(element):
    return {"url": element.get_attribute("href")}

if __name__ == "__main__":
    import dude

    dude.run(urls=["https://www.google.com"], follow_urls=True, ignore_robots_txt=True)

Error location:

https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/base.py#LL65C13-L65C40

Error origin:

https://github.com/roniemartinez/dude/blob/53d53c2bd840ea52fc341089313f122735dd6ab4/dude/scraper.py#LL96C44-L96C55

Changes to URL pattern matching

1. Rename url param to url_match

The new name gives more meaning and makes it flexible enough to support... (next item)

2. URL matching functions

Instead of doing

@select(..., url_match=r"example\.com")
def myfunc(element):
    ...

it should be possible to use a custom function/lambda

@select(..., url_match=lambda x: x.startswith("example"))
def myfunc(element):
    ...

This makes it extensible to other use cases as well.

3. Change the URL filter from regex to fnmatch

fnmatch is simpler and easier to understand than regex. In cases where regex is needed, use a function for the filter.
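A quick sketch comparing the two styles, assuming the url_match parameter from item 1 (patterns and handlers are only examples):

from fnmatch import fnmatch

from dude import select


# fnmatch glob pattern as the default filter:
@select(css="a", url_match="https://example.com/products/*")
def product_links(element):
    return {"url": element.get_attribute("href")}


# a function filter when fnmatch is not expressive enough:
@select(css="a", url_match=lambda url: fnmatch(url, "*/products/*") and "draft" not in url)
def published_product_links(element):
    return {"url": element.get_attribute("href")}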

Run function on page load

Issue

Currently, @select()-decorated functions only run when selectors are matched. It should be possible to run functions on page load.

Sample use case

  • Taking page screenshots - selector is not required.

Possible option

from dude import onload  # or some other decorator name possible?


@onload
def take_screenshot(page):
    page.screenshot(path="screenshot.png", full_page=True)

NOTES

We also have @select(..., setup=True), which will run functions before the data extraction (e.g. dismissing dialogs); see the documentation. We probably need to align these options or redesign them. Improvements and suggestions are appreciated.

Follow dynamically-built URLs

Use case

There are instances where IDs or slugs are embedded in other elements and can be built into URLs that can be "followed".

<div id="project-id">Project: eca514fc</div>

The project ID eca514fc can be extracted and built into a URL, for example https://example.com/projects/eca514fc.
There should be an option to follow this URL from a decorated function.

Solution

Implement the proposed solution in #62

@select(css=".project-id")
def get_link(element, scraper):  # <-- pass scraper object
    project_id = element.text_content().removeprefix("Project:").strip()
    url = f"https://example.com/projects/{project_id}"

    scraper.follow_url(url)  # <-- add to the URLs that will be scraped by the scraper

    return {"project_id": project_id}

Final solution: #146

Selector for JSON contents

There are existing ways to extract data from JSON without traversing the contents one by one.

Proposed style

@select(jsonpath="$.store.book[0].title")
def extract_title(title):
    return {"title": title}


@select(jmespath="locations[?state == 'WA'].name | sort(@)")
def extract_washington_cities(cities):
    return {"cities": cities}

Notes

  • Only applies if the content type is application/json (see the sketch below)
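Not in Dude yet, but a minimal sketch of how a JSON-content backend could evaluate both selector types, assuming the jsonpath-ng and jmespath packages (run_json_selectors is a hypothetical helper):

import json

import jmespath
from jsonpath_ng import parse as jsonpath_parse


def run_json_selectors(body: str, content_type: str):
    if not content_type.startswith("application/json"):
        return None  # only applies to JSON responses
    data = json.loads(body)
    # JsonPath: collect every matching value.
    title = [match.value for match in jsonpath_parse("$.store.book[0].title").find(data)]
    # JMESPath: the expression itself handles filtering and sorting.
    cities = jmespath.search("locations[?state == 'WA'].name | sort(@)", data)
    return {"title": title, "cities": cities}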

Introduce group() decorator

Currently, grouping is done by specifying the group parameter.

@select(selector="...", group="<group-selector>")
@select(selector="...", group="<group-selector>")
def handler(element):
    return {"<key>": "<value-extracted-from-element>"}

This is alright for Playwright because it is smart enough to identify the selector type, but for other parsers like BeautifulSoup4, lxml (not implemented - #38) and Parsel (not implemented - #33), there is no easy way to distinguish the type of selector used for grouping.

By introducing a new @group() decorator, we can specify the type using any of the available arguments: css, xpath, text, regex, or just selector (Playwright does not really need the other options).

@group(css="<group-selector>")
@select(selector="...")
@select(selector="...")
def handler(element):
    return {"<key>": "<value-extracted-from-element>"}

The @group() decorator also prevents repeating the same group parameter on each @select() decorator.

The group parameter in the @select() decorator will be retained and can be used to override the @group() decorator, but only for that specific rule. It will not affect other @select() decorators.

Limitations to implement:

  1. There should be only one @group() decorator per function. Throw an error, or overwrite it in a "group rule" list based on Python's precedence when stacking multiple decorators (a possible sketch follows).
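A rough sketch of how @group() could record the grouping selector on the handler; the attribute name and error handling are illustrative only:

def group(selector=None, css=None, xpath=None, text=None, regex=None):
    def wrapper(func):
        # Enforce limitation 1: only one @group() per function.
        if getattr(func, "_group_rule", None) is not None:
            raise ValueError("Only one @group() decorator is allowed per function.")
        func._group_rule = {"selector": selector, "css": css, "xpath": xpath, "text": text, "regex": regex}
        return func
    return wrapper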

Group data into separate tables

Output is always flattened into a single list of dictionaries before saving to CSV, JSON, etc.

By grouping data into separate tables, it will be easier to post-process and merge the tables, e.g. connect data from separate pages into one row.

@select(css="...", table="table1")
def function1(element):
    return {"data1": element.text_content()}

@select(css="...", table="table2")
def function2(element):
    return {"data2": element.text_content()}
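For illustration, the saved output for the snippet above could then be keyed by table name instead of one flat list (the exact shape is open for discussion):

{
    "table1": [{"data1": "..."}, {"data1": "..."}],
    "table2": [{"data2": "..."}, {"data2": "..."}]
}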

User-Agent

  • Set a dedicated Dude User-Agent instead of the default value of each parser backend (e.g. pydude/{version} (+https://github.com/roniemartinez/dude))
  • Add option to override the User-Agent
  • For Playwright, Pyppeteer and Selenium, the User-Agent should keep the original (Chromium, Firefox, WebKit) value with the Dude User-Agent inserted into the string (see the sketch below)
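A minimal sketch of how the combined User-Agent could be composed; build_user_agent and its arguments are hypothetical, and "pydude" matches the distribution name in the example above:

from importlib.metadata import version
from typing import Optional

DUDE_UA = f"pydude/{version('pydude')} (+https://github.com/roniemartinez/dude)"


def build_user_agent(browser_ua: Optional[str] = None, override: Optional[str] = None) -> str:
    if override:        # user-supplied override wins
        return override
    if browser_ua:      # browser backends keep the original UA with the Dude UA appended
        return f"{browser_ua} {DUDE_UA}"
    return DUDE_UA      # HTTP-only backends use the Dude UA alone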

Add wait option

Some JS-enabled websites have multiple pop-ups (agree dialog -> newsletter dialog -> subscribe dialog -> now you can use the website) and these dialogs take time to appear. Adding a wait option will make sure that the setup handlers wait first.

Option 1: Explicit

@select(text="AGREE", setup=True, wait=1000)  # or wait_for="visible"
def handler(element, page):
    with page.expect_navigation():
        element.click()

Option 2: Implicit

This will be done in the background by Dude (a possible approach is sketched below).
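For the Playwright backend, a rough sketch of what the implicit behaviour might do under the hood (run_setup_handler is a hypothetical internal helper):

def run_setup_handler(page, selector, handler, timeout=1000):
    # Wait until the setup selector is visible (or the timeout elapses) before
    # running the handler; Playwright raises a TimeoutError if it never appears.
    element = page.wait_for_selector(selector, state="visible", timeout=timeout)
    handler(element, page)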

Block ads?

Ads often pop up for Playwright, Pyppeteer and Selenium and get clicked instead of the actual target elements (one possible mitigation is sketched below).
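Not something Dude does today, but one possible mitigation for the Playwright backend is request interception; the AD_HOSTS list is only an example:

AD_HOSTS = ("doubleclick.net", "googlesyndication.com", "adservice.google.com")


def block_ads(page):
    def handle(route, request):
        # Abort requests to known ad hosts, let everything else through.
        if any(host in request.url for host in AD_HOSTS):
            route.abort()
        else:
            route.continue_()

    page.route("**/*", handle)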

Spider

Possible ways to implement a simple Spider

Expose "scraper" object in the handler functions

@select(css="a")
def get_link(element, scraper):  # <-- pass scraper object
    url = element.get_attribute("href")
    scraper.follow(url)  # <-- add to the URLs that will be scraped by the scraper 
    return {"url": url}

Include the URLs in the return value

@select(css="a")
def get_link(element):
    url = element.get_attribute("href")
    return {"url": url}, [url, ...]  # <- return a tuple of dict result and list of URLs

Final implementation

Just use --follow-urls or pass follow_urls=True to run() (#90).
This is less complicated than managing the URLs to crawl yourself inside the code.

WARNING: Do not use until #27 is implemented as this option will crawl indefinitely and will not save the data.

Option to download/save files by extension

  1. Download by file extension
  2. Download by mimetype, e.g. png should also match the image/png mimetype

dude scrape ... --download png,jpg  # download all png and jpg files
dude scrape ... --download *  # download all files
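A small sketch of how an extension filter could also cover mimetypes using the standard library (should_download is a hypothetical helper):

import mimetypes


def should_download(url: str, content_type: str, extensions: list[str]) -> bool:
    if "*" in extensions:
        return True
    for ext in extensions:
        if url.lower().endswith(f".{ext}"):
            return True  # match by file extension
        if content_type == mimetypes.types_map.get(f".{ext}"):
            return True  # match by mimetype, e.g. "png" -> "image/png"
    return False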

Add option to use Parsel

With the introduction of BeautifulSoup4 in #19 and #32, it should be even more possible to use Parsel as an alternative to Playwright and BeautifulSoup4.
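For reference, the Parsel API that a new backend could wrap looks like this (the HTML and extracted values are only illustrative):

from parsel import Selector

html = "<html><body><a href='/about'>About</a></body></html>"
selector = Selector(text=html)

links = selector.css("a::attr(href)").getall()   # CSS with Parsel's ::attr() syntax
labels = selector.xpath("//a/text()").getall()   # XPath works on the same Selector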
