
nbeny / lemonde-crawler


This project forked from flavienbwk/lemonde-crawler


Browse articles from Le Monde's website and store them in a SQLite database.

License: Apache License 2.0

Shell 0.26% Python 93.63% Dockerfile 6.11%


🕷️ Le Monde crawler

⚠️ THIS PROJECT ISN'T MAINTAINED ANYMORE, PLEASE VISIT News Crawler, THE SUCCESSOR OF THIS PROJECT.

Le Monde is one of the best-known newspapers in France. It offers thousands of articles on its website.


This project browses the most recent articles on the Le Monde website and stores the following fields for each article in a SQLite database:

  • URL
  • Title
  • Description (short summary)
  • Article content
  • Author
  • Illustration (blob)
  • Date
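The fields listed above map naturally onto a single SQLite table. The following is a minimal sketch using Python's standard sqlite3 module. The table name, column names, the save_article helper and the sample values are all assumptions made for illustration; the project's actual schema may differ.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions,
# not necessarily the ones used by the project.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url          TEXT PRIMARY KEY,
    title        TEXT,
    description  TEXT,
    content      TEXT,
    author       TEXT,
    illustration BLOB,
    date         TEXT
)
"""


def save_article(conn, article):
    """Insert (or overwrite) one article given as a dict keyed by column name."""
    conn.execute(
        "INSERT OR REPLACE INTO articles "
        "(url, title, description, content, author, illustration, date) "
        "VALUES (:url, :title, :description, :content, :author, "
        ":illustration, :date)",
        article,
    )
    conn.commit()


# Usage with an in-memory database and made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
save_article(conn, {
    "url": "https://example.com/article-1",
    "title": "Sample title",
    "description": "Short summary",
    "content": "Full article text",
    "author": "Jane Doe",
    "illustration": b"\x89PNG",  # raw image bytes stored as a blob
    "date": "2021-01-01",
})
```

Using the URL as the primary key is one simple way to guarantee that the same article is never stored twice.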

Features:

  • Persisting login cookies
  • Article caching: only new articles are crawled
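Article caching can be pictured as a lookup against the database before each fetch. The sketch below is not the project's implementation; it assumes a hypothetical articles table keyed by URL and a hypothetical filter_new_urls helper, and only shows the general idea.

```python
import sqlite3


def filter_new_urls(conn, urls):
    """Return the URLs that are not stored yet, preserving their order."""
    new_urls = []
    for url in urls:
        row = conn.execute(
            "SELECT 1 FROM articles WHERE url = ?", (url,)
        ).fetchone()
        if row is None:  # not in the database, so it still has to be crawled
            new_urls.append(url)
    return new_urls


# Usage with an in-memory database and made-up URLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY)")
conn.execute(
    "INSERT INTO articles (url) VALUES (?)", ("https://example.com/a",)
)

candidates = ["https://example.com/a", "https://example.com/b"]
to_crawl = filter_new_urls(conn, candidates)
```

Here only the second URL remains in to_crawl, because the first one is already stored.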

This project uses Playwright.

⚠️ DISCLAIMER: This project is for educational purposes only! Do NOT use it for any other purpose. It was developed as a fun side project to practice my scraping skills.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| LEMONDE_EMAIL | str | Your Le Monde email address |
| LEMONDE_PASSWORD | str | Your Le Monde password |
| START_LINK | str | After login, start scraping articles from this page |
| RETRIEVE_RELATED_ARTICLE_LINKS | bool | Crawl links in the currently scraped article that point to other similar articles |
| RETRIEVE_EACH_ARTICLE_LINKS | bool | Crawl all article links present in the currently scraped article |
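Since these parameters are passed as environment variables, the bool values arrive as strings and have to be parsed. The following sketch shows one way to do that. The env_bool and load_settings helpers, the accepted truthy spellings and the default values are assumptions for illustration, not the project's actual behavior.

```python
import os


def env_bool(name, default=False, environ=os.environ):
    """Parse a boolean environment variable (assumed truthy spellings)."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(environ=os.environ):
    """Collect the crawler parameters into a dict; defaults are assumptions."""
    return {
        "email": environ.get("LEMONDE_EMAIL", ""),
        "password": environ.get("LEMONDE_PASSWORD", ""),
        "start_link": environ.get("START_LINK", ""),
        "retrieve_related": env_bool(
            "RETRIEVE_RELATED_ARTICLE_LINKS", environ=environ
        ),
        "retrieve_each": env_bool(
            "RETRIEVE_EACH_ARTICLE_LINKS", environ=environ
        ),
    }


# Usage with an explicit mapping instead of the real environment.
settings = load_settings({
    "LEMONDE_EMAIL": "user@example.com",
    "RETRIEVE_RELATED_ARTICLE_LINKS": "True",
})
```

Passing the mapping explicitly keeps the helpers easy to test without touching the real environment.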

Usage (Docker)

  1. Copy the example file and fill in your credentials in .env:

    cp .env.example .env

    Set LEMONDE_EMAIL and LEMONDE_PASSWORD to your Le Monde credentials (we recommend a premium account to avoid any article limit).

  2. Run the container

    docker-compose up

Usage (CLI)

You must have Python>=3.7 and pip installed.

  1. Install dependencies

    pip3 install -r requirements.txt

  2. Run CLI

    LEMONDE_EMAIL='...' LEMONDE_PASSWORD='...' python3 ./scripts/crawler.py

Ideas

  • To run this crawling task automatically every day, you might be interested in Prefect
