fundus's Introduction

A very simple news crawler in Python. Developed at Humboldt University of Berlin.


Fundus is:

  • A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code, whether from live websites or the CC-NEWS dataset.

  • An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!


Quick Start

To install from pip, simply do:

pip install fundus

Fundus requires Python 3.8+.

Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

That's all it takes!

If you run this code, it should print out something like this:

Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:   FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:   FoxNews (2023-05-09 14:37)

This printout tells you that you successfully crawled two articles!

For each article, the printout details:

  • the "Title" of the article, i.e. its headline
  • the "Text", i.e. the main article body text
  • the "URL" from which it was crawled
  • the news source it is "From"
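If you want to work with these fields in code rather than print them, a minimal sketch could look like the following. It assumes that article objects expose `title` and `plaintext` attributes; the helper `summarize` and the stand-in `demo` object are hypothetical and exist only for this illustration.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize(article: Any, max_chars: int = 60) -> Dict[str, str]:
    """Collect a few fields from an article-like object into a plain dict.

    Assumption: the object exposes `title` and `plaintext` attributes.
    Missing or empty attributes fall back to an empty string.
    """
    title = getattr(article, "title", None) or ""
    text = getattr(article, "plaintext", None) or ""
    return {"title": title, "preview": text[:max_chars]}


# Stand-in object so the sketch runs without network access; in practice
# you would pass the articles yielded by crawler.crawl(...).
demo = SimpleNamespace(title="Example headline", plaintext="Example body text.")
print(summarize(demo))
```

Using `getattr` with a fallback keeps the helper tolerant of articles where a field could not be extracted.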

Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:

from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

Example 3: Crawl articles from CC-NEWS

If you're not familiar with CC-NEWS, check out their paper.

from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

Tutorials

We provide quick tutorials to get you started with the library:

  1. Tutorial 1: How to crawl news with Fundus
  2. Tutorial 2: How to crawl articles from CC-NEWS
  3. Tutorial 3: The Article Class
  4. Tutorial 4: How to filter articles
  5. Tutorial 5: How to search for publishers

If you wish to contribute, check out these tutorials:

  1. How to contribute
  2. How to add a publisher

Currently Supported News Sources

You can find the publishers currently supported here.

Also: Adding a new publisher is easy - consider contributing to the project!

Evaluation benchmark

Check out our evaluation benchmark.

Scraper        Precision      Recall         F1-Score
Fundus         99.89±0.57     96.75±12.75    97.69±9.75
Trafilatura    90.54±18.86    93.23±23.81    89.81±23.69
BTE            81.09±19.41    98.23±8.61     87.14±15.48
jusText        86.51±18.92    90.23±20.61    86.96±19.76
news-please    92.26±12.40    86.38±27.59    85.81±23.29
BoilerNet      84.73±20.82    90.66±21.05    85.77±20.28
Boilerpipe     82.89±20.65    82.11±29.99    79.90±25.86
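
To make the three columns concrete, here is a minimal sketch of precision, recall, and F1 computed from token overlap between an extracted text and a reference text. It is a simplified illustration and not the scoring used by the benchmark; the function name `overlap_scores` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter
from typing import Tuple


def overlap_scores(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) based on whitespace-token overlap.

    Simplified sketch: tokens are compared as multisets, so word order
    is ignored and repeated words count once per occurrence.
    """
    ext_tokens = Counter(extracted.split())
    ref_tokens = Counter(reference.split())
    # Multiset intersection: tokens present in both texts.
    overlap = sum((ext_tokens & ref_tokens).values())

    n_ext = sum(ext_tokens.values())
    n_ref = sum(ref_tokens.values())
    precision = overlap / n_ext if n_ext else 0.0
    recall = overlap / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# An extraction that keeps the full article but adds two boilerplate tokens:
# recall is perfect, precision is not.
p, r, f = overlap_scores("menu share the article text", "the article text")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

If scores are computed per article and then averaged, the mean F1 generally differs from the harmonic mean of the mean precision and mean recall.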

Cite

Please cite the following paper when using Fundus or building upon our work:

@misc{dallabetta2024fundus,
      title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions}, 
      author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
      year={2024},
      eprint={2403.15279},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Please email your questions or comments to Max Dallabetta.

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

License

MIT
