Le Monde is one of the best-known newspapers in France. It publishes thousands of articles on its website.
This project browses the most recent articles on that website and stores them in a SQLite database, saving the following fields for each article:
- URL
- Title
- Description (short summary)
- Article content
- Author
- Illustration (blob)
- Date
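
The fields listed above could map to a single SQLite table. The sketch below is only an illustration: the table name `articles`, the column names, and the helpers `open_database` and `save_article` are assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical table and column names; the project's real schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url          TEXT PRIMARY KEY,  -- unique per article
    title        TEXT,
    description  TEXT,              -- short summary
    content      TEXT,
    author       TEXT,
    illustration BLOB,              -- raw image bytes
    date         TEXT               -- e.g. an ISO 8601 string
);
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_article(conn: sqlite3.Connection, article: dict) -> None:
    """Insert one article; a row with the same URL is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO articles "
        "(url, title, description, content, author, illustration, date) "
        "VALUES (:url, :title, :description, :content, :author, :illustration, :date)",
        article,
    )
    conn.commit()
```

Using the URL as the primary key is one simple way to make sure the same article is never stored twice.
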
Features:
- Persisting login cookies
- Article caching: only new articles are crawled
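
As a rough illustration of the article caching feature, the crawler can ask the database which candidate URLs are already stored and visit only the rest. This is a minimal sketch, assuming a hypothetical `articles` table with a `url` column; `filter_new_urls` is an illustrative name, not a function from this project.

```python
import sqlite3
from typing import Iterable, List


def filter_new_urls(conn: sqlite3.Connection, urls: Iterable[str]) -> List[str]:
    """Return the URLs not yet stored, keeping order and dropping duplicates."""
    new_urls = []
    seen = set()
    for url in urls:
        if url in seen:
            continue  # same link found twice on the page
        seen.add(url)
        row = conn.execute(
            "SELECT 1 FROM articles WHERE url = ?", (url,)
        ).fetchone()
        if row is None:  # not in the database yet, so it must be crawled
            new_urls.append(url)
    return new_urls
```
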
This project uses Playwright.
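
One way to persist login cookies with Playwright is its storage state: a context can write its cookies to a JSON file, and a later context can be created from that file. The sketch below assumes this approach; the file name `state.json` and both helper names are hypothetical, and the project may persist cookies differently.

```python
from pathlib import Path

# Hypothetical file name for the saved session.
STATE_FILE = Path("state.json")


def new_logged_in_context(browser, state_file: Path = STATE_FILE):
    """Create a browser context, reusing saved cookies when the file exists.

    `browser` is expected to be a Playwright Browser object.
    """
    if state_file.exists():
        return browser.new_context(storage_state=str(state_file))
    return browser.new_context()


def save_login(context, state_file: Path = STATE_FILE) -> None:
    """Write the context's cookies and local storage to disk after login."""
    context.storage_state(path=str(state_file))
```

With Playwright's sync API, `browser` would typically come from `sync_playwright()` followed by `p.chromium.launch()`, and `save_login` would be called once the login form has been submitted successfully.
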
Name | Type | Description |
---|---|---|
LEMONDE_EMAIL | str | Your Le Monde email address |
LEMONDE_PASSWORD | str | Your Le Monde password |
START_LINK | str | After login, start scraping articles from this page |
RETRIEVE_RELATED_ARTICLE_LINKS | bool | Crawl links in the currently scraped article that point to related articles |
RETRIEVE_EACH_ARTICLE_LINKS | bool | Crawl all article links present in the currently scraped article |
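
The variables in the table above could be read from the environment along these lines. This is a sketch only: the `Settings` class, the `parse_bool` helper, the accepted boolean spellings, and the default values are assumptions rather than the project's documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class Settings:
    email: str
    password: str
    start_link: str
    retrieve_related_article_links: bool
    retrieve_each_article_links: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables; credentials are required."""
    return Settings(
        email=env["LEMONDE_EMAIL"],        # KeyError if missing
        password=env["LEMONDE_PASSWORD"],  # KeyError if missing
        # The defaults below are assumptions, not the project's defaults.
        start_link=env.get("START_LINK", "https://www.lemonde.fr/"),
        retrieve_related_article_links=parse_bool(
            env.get("RETRIEVE_RELATED_ARTICLE_LINKS", "false")
        ),
        retrieve_each_article_links=parse_bool(
            env.get("RETRIEVE_EACH_ARTICLE_LINKS", "false")
        ),
    )
```

Failing early with a `KeyError` when a credential is missing makes a misconfigured `.env` file easier to spot than a failed login later on.
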
- Copy `.env.example` to `.env` and fill in your credentials: `cp .env.example .env`. Edit `LEMONDE_EMAIL` and `LEMONDE_PASSWORD` to match your Le Monde credentials (we recommend a premium account to avoid any limit).
- Run the container: `docker-compose up`

To run the crawler without Docker, you must have Python >= 3.7 and `pip` installed.
- Install dependencies: `pip3 install -r requirements.txt`
- Run the CLI: `LEMONDE_EMAIL='...' LEMONDE_PASSWORD='...' python3 ./scripts/crawler.py`
- You might be interested in Prefect to schedule this crawling task so that it runs automatically each day.