Web scraping from 0 to hero

Originally named "Web Scraping Open Project", this repository aims to build a shared body of knowledge among web scraping practitioners, interesting enough for both rookies and experts in the field. Anyone can submit content if it adds value to the project. We won't accept AI-generated, sales-oriented, or sponsored material. Some sections are dedicated to commercial tools, but they're based on user experience, not on marketing.

Why this repository?

Web scraping is becoming harder and more expensive, as anti-bot systems grow more aggressive and often require commercial tools to bypass. At the same time, the need for web data is growing rapidly, following the post-Covid-19 increase in digitalization. On top of this, AI models will need more and more data to be trained, and the main source is usually the web (just ask Reddit and Twitter). So while the challenges are increasing, there are more and more opportunities for developers who want to embark on a career as a web data engineer. In this repository we're gathering the sparse and fragmented content found around the web and sharing our experience with tools, languages, and best practices, to create a great basecamp for those who are starting now and a source of inspiration for experts looking for new tools and solutions.

Who am I?

I'm Pierluigi Vinciguerra, co-founder and CTO at Databoutique.com, and I've been working in web scraping for more than 10 years. I've always felt the need to centralize in one place the information about web scraping that is scattered around the web. At first I just took notes, and in 2022 I decided to share them with everyone by starting a free Substack called The Web Scraping Club. It has been quite successful considering the niche I'm writing for, even if mine is the only voice heard there. With this repository, I want to create a chorus of web scraping experts sharing their experiences and ideas so that the whole industry can benefit.

How does this repository work?

This repository is meant to be a central hub for information about web scraping. To keep it readable and ordered, this page is used as a table of contents, with links to all the topics covered. Anyone can add a topic if it is relevant and adds value to the repository. I tend to use the pages for short content (about 400-500 words at most) and link to external pages when longer content is needed, but that's not a rule. You can write an excerpt of a longer blog post on these pages and then link to the full article. Feel free to add your contributions: sharing knowledge with each other will boost the value of this repository for everyone.

Content not allowed:

Out-of-scope content
Promotional content
Referral codes
AI-generated content

The table of contents below will be updated regularly as new topics come to mind. If an entry doesn't link to an article, the page doesn't exist yet, so feel free to add one.

Table of contents

1. Before scraping a website

1.1 Is scraping that website legal?

1.2 Preliminary website study

Does the website have an API (internal or exposed)?
Does it have some JSON inside the HTML?

2. Best practices

Use JSON instead of HTML, if possible
Selectors
Data formatting
Reducing the number of requests

3. Free Tools

3.1. Headless Python scrapers

Scrapy
scrapy_splash

3.2. Python scrapers with fully rendered browsers

3.3. Non-Python scrapers with fully rendered browsers

Puppeteer

3.4. Non-Python full-featured web scraping libraries

Crawlee

4. Commercial Tools

Proxy solutions
Scraping API

5. Common anti-bot software & techniques

5.1. Anti-bot Software

5.2. Anti-bot Techniques

Passive fingerprinting including:
Browser Fingerprinting techniques including:

6. Test websites for your scraper

https://bot.incolumitas.com/ one of the most complete sets of tests for your scrapers
https://pixelscan.net/ checks your IP and your machine
https://bot.sannysoft.com/ another great list of tests
https://abrahamjuliot.github.io/creepjs/ a set of fingerprinting tests
https://fingerprintjs.com/products/bot-detection/ a page about BotD, a JavaScript bot detection library included in Cloudflare, where you can also test your configuration

7. How to make money with web scraping

Freelancing
Sell your scrapers with Apify
Sell your data on Databoutique.com
