
thewebscrapingclub / webscraping-from-0-to-hero


The web scraping open project repository aims to share knowledge and experiences about web scraping with Python

playwright python scrapy scrapy-spider scrapysplash webscraping

webscraping-from-0-to-hero's Introduction

Web scraping from 0 to hero

Originally named "Web Scraping Open Project", this repository aims to build a common body of knowledge among web scraping practitioners, interesting enough for both rookies and experts in the field. Anyone can submit content if it adds value to the project. We won't accept AI-generated content or salesy and sponsored material. Some sections are dedicated to commercial tools, but they're based on user experience, not on marketing.

Why this repository?

Web scraping is becoming harder and more expensive: anti-bot solutions are getting more aggressive and often require commercial tools to bypass. At the same time, the need for web data is growing exponentially, following the post-Covid-19 increase in digitalization. On top of this, AI models will need more and more data to be trained, and the main source is usually the web (just ask Reddit and Twitter). So while the challenges are increasing, there are also more and more opportunities for developers who want to embark on a career as a web data engineer. In this repository we're gathering in one place the sparse and fragmented content found around the web, and sharing experience with tools, languages, and best practices, to create a great basecamp for those starting now and a source of inspiration for experts looking for new tools and solutions.

Who am I?

I'm Pierluigi Vinciguerra, co-founder and CTO at Databoutique.com, and I've been working in web scraping for more than 10 years. I've always felt the need to centralize the information about web scraping that is scattered around the web. At first I just took notes, and in 2022 I decided to share them with everyone by starting a free Substack called The Web Scraping Club. It has been quite successful considering the niche I write for, even if it's only my voice that is heard. With this repository, I want to create a chorus of web scraping experts sharing their experiences and ideas so that the whole industry can benefit from it.

How does this repository work?

This repository aims to be a central hub for information about web scraping. To keep it readable and ordered, this page is used as a table of contents, with links to all the topics covered. Anyone can add a topic if it is relevant and adds value to the repository. I tend to use the pages for short content (about 400-500 words max) and link to external pages when longer content is needed, but that's not a rule. You can write an excerpt of a longer blog post on these pages and then link to the full article. Feel free to add your contributions: sharing each other's knowledge will boost the value of this repository for everyone.

Content not allowed:

  • Out of scope content
  • Promotional content
  • Referral codes
  • AI-generated content

The table of contents below will be updated regularly as new topics come to mind. If an entry doesn't link to an article, the page doesn't exist yet, so feel free to add one.

Table of contents

1. Before scraping a website

1.1 Is scraping that website legal?

1.2 Preliminary website study

  • Does the website have an API (internal or exposed)?
  • Does it have some JSON inside the HTML?

2. Best practices

  • Use JSON instead of HTML, if possible
  • Selectors
  • Data formatting
  • Reducing the number of requests
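
As a rough illustration of the first best practice above (and of the "JSON inside the HTML" check in 1.2), here is a minimal sketch using only the Python standard library. It assumes the page embeds its data in a `<script type="application/ld+json">` tag, which is one common pattern but not the only one. The sample HTML, the `JsonScriptParser` class, and the `extract_embedded_json` helper are all hypothetical names made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonScriptParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only when we enter a JSON-LD script tag.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip blocks that are not valid JSON.


def extract_embedded_json(html):
    """Return every JSON document found in JSON-LD script tags (hypothetical helper)."""
    parser = JsonScriptParser()
    parser.feed(html)
    parser.close()
    return parser.documents


# Made-up page: the same product appears as structured JSON and as rendered HTML.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example shoe",
 "offers": {"price": "59.90", "priceCurrency": "EUR"}}
</script>
</head><body><h1>Example shoe</h1><span class="price">59,90 EUR</span></body></html>
"""

products = extract_embedded_json(SAMPLE_HTML)
for product in products:
    print(product["name"], product["offers"]["price"])
```

Reading the embedded JSON avoids writing selectors against the rendered markup and sidesteps locale-specific formatting such as the comma in the displayed price. Real pages may embed data in other places, so treat this as a starting point rather than a general solution.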

3. Free Tools

3.1. Headless Python scrapers

3.2. Python scrapers with fully rendered browsers

3.3. Non-Python scrapers with fully rendered browsers

3.4. Non-Python full-featured web scraping libraries

4. Commercial Tools

  • Proxy solutions
  • Scraping API

5. Common anti-bot software & techniques

5.1. Anti-bot Software

5.2. Anti-bot Techniques

6. Test websites for your scraper

7. How to make money with web scraping

  • Freelancing
  • Sell your scrapers with Apify
  • Sell your data on Databoutique.com

Contributors

abe-101, ohld, pervillalva, pigivinci
