ilias-ant / american-alpine-club-articles

Kaggle dataset: American Alpine Club articles - climbing accidents and major new climbs.

Home Page: https://www.kaggle.com/datasets/iantonopoulos/american-alpine-club-articles

License: Apache License 2.0

Python 65.25% Jupyter Notebook 34.75%
aac accidents alpinism climbing mountaineering

american-alpine-club-articles's Introduction

american-alpine-club-datasets

AAC: climbing accidents and major new climbs.

Code style: black

Articles from the American Alpine Club's publications: the American Alpine Journal (AAJ) and Accidents in North American Climbing (ANAC).

This work has been published as a Kaggle dataset.

The project consists of the following components:

  • collectors: a Scrapy project, responsible for scraping articles from publications.americanalpineclub.org.
  • opensearch-cluster: an OpenSearch cluster, where the scraped articles are indexed.
  • publishers: functionality responsible for publishing the articles index (e.g. as a Kaggle dataset).
  • notebooks: a collection of Jupyter notebooks, for various dataset-based explorations and applications.
  • dataset: the raw dataset, in CSV format.
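
As a rough illustration of working with the CSV component, the sketch below counts rows per value of one column using only the standard library. The file path and the column name `publication` are assumptions made for the example; the actual file name and schema are not stated here and should be checked against the dataset itself.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_column(csv_path: Path, column: str) -> Counter:
    """Count rows per distinct value of `column` in a CSV file."""
    counts: Counter = Counter()
    with csv_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Empty or absent values are grouped under a placeholder key.
            counts[row.get(column) or "<missing>"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical path and column name, for illustration only.
    path = Path("dataset/articles.csv")
    if path.exists():
        print(count_by_column(path, "publication").most_common())
```

If article bodies are stored in a single field, they may exceed the `csv` module's default field size limit; `csv.field_size_limit` can raise it.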

Citation

@misc{ilias_antonopoulos_2022,
      title={AAC: climbing accidents and major new climbs}, 
      url={https://www.kaggle.com/dsv/4457812}, 
      DOI={10.34740/KAGGLE/DSV/4457812}, 
      publisher={Kaggle}, 
      author={Ilias Antonopoulos},
      year={2022} 
}

american-alpine-club-articles's Issues

reduce no. of dropped items

From the Scrapy stats of the initial run:

2022-11-07 04:06:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 72,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 59,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 13,
 'downloader/request_bytes': 16174589,
 'downloader/request_count': 29300,
 'downloader/request_method_count/GET': 29300,
 'downloader/response_bytes': 561325860,
 'downloader/response_count': 29228,
 'downloader/response_status_count/200': 29228,
 'dupefilter/filtered': 6,
 'elapsed_time_seconds': 107792.388109,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 7, 2, 6, 4, 242959),
 'item_dropped_count': 217,
 'item_dropped_reasons_count/DropItem': 217,
 'item_scraped_count': 27617,
 'log_count/DEBUG': 139779,
 'log_count/INFO': 29423,
 'log_count/WARNING': 218,
 'memusage/max': 78548992,
 'memusage/startup': 65867776,
 'request_depth_max': 2,
 'response_received_count': 29228,
 'retry/count': 72,
 'retry/reason_count/twisted.internet.error.TimeoutError': 59,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 13,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 29299,
 'scheduler/dequeued/memory': 29299,
 'scheduler/enqueued': 29299,
 'scheduler/enqueued/memory': 29299,
 'start_time': datetime.datetime(2022, 11, 5, 20, 9, 31, 854850)}

we have:

...
'item_dropped_count': 217,
'item_dropped_reasons_count/DropItem': 217,
...

Can we reduce this number?
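
To put the dropped count in proportion, here is a minimal sketch that derives a few ratios from the stats above. The numbers are copied from the dump; the helper name `summarize` is made up for this example, and it assumes every item that reached the pipelines was counted as either scraped or dropped.

```python
# Values copied from the Scrapy stats dump of the initial run.
stats = {
    "downloader/request_count": 29300,
    "downloader/response_count": 29228,
    "item_dropped_count": 217,
    "item_scraped_count": 27617,
    "retry/count": 72,
    "elapsed_time_seconds": 107792.388109,
}


def summarize(stats: dict) -> dict:
    """Derive a few ratios from a Scrapy stats dict (hypothetical helper)."""
    dropped = stats["item_dropped_count"]
    scraped = stats["item_scraped_count"]
    processed = dropped + scraped  # assumption: scraped + dropped covers all items
    requests = stats["downloader/request_count"]
    return {
        "items_processed": processed,
        "drop_rate_pct": 100 * dropped / processed,
        "retry_rate_pct": 100 * stats["retry/count"] / requests,
        "requests_without_response": requests - stats["downloader/response_count"],
        "elapsed_hours": stats["elapsed_time_seconds"] / 3600,
    }


summary = summarize(stats)
for name, value in summary.items():
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

Under that assumption, the drop rate is a little under 1% of processed items. Note that all 217 drops are filed under the single key `DropItem`, so the stats alone do not show which check rejected each item.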

scrape Article referer as well

That is, the listing-page URLs of the form /articles?page=x.

This will ease debugging, since some of the scraped elements also occur at the page level (not only at the article level).
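
One possible way to approach this, sketched under assumptions: inside a Scrapy callback the referring URL is usually available as `response.request.headers.get("Referer")` (a bytes value, set by Scrapy's referer middleware when it is enabled). The helper below, whose name is hypothetical, turns such a value into the listing page number so it can be stored next to the article. It uses only the standard library and does not reflect how the collectors project is actually structured.

```python
from typing import Optional, Union
from urllib.parse import parse_qs, urlsplit


def listing_page_from_referer(referer: Union[bytes, str, None]) -> Optional[int]:
    """Return x from an '/articles?page=x' referer, or None if it does not match."""
    if not referer:
        return None
    if isinstance(referer, bytes):
        referer = referer.decode("utf-8", errors="replace")
    parts = urlsplit(referer)
    # Only accept URLs whose path ends in /articles (the listing pages).
    if not parts.path.rstrip("/").endswith("/articles"):
        return None
    values = parse_qs(parts.query).get("page")
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None
```

In a spider, the raw referer string and the parsed page number could both be added as item fields, which would make it possible to trace a problematic article back to the listing page it was discovered on.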
