
web_crawl's Introduction

Incremental news crawling project

The incremental news crawling project crawls newly published articles every day from 39 sources (17 en, 2 id, 2 ms, 5 ta, 3 vi, 10 zh):

| Language | Sources |
|---|---|
| English | ABC, BBC, Bernama, Chinadaily, CNA, CNN, france24news, koreaherald, MoscowTimes, Mothership, oneindia, straitstimes, techcrunch, theguardian, Theindependent, thenational, Weekender |
| Indonesian | koranjakarta, mediaindonesia |
| Malay | Bernama, Brudirect |
| Tamil | BBC, dinamani, hindutamil, oneindia, Theekkathir |
| Vietnamese | nguoiviet, nhandan, tuoitre |
| Chinese | ABC, ABC, Chinadaily, Chinanews, Newsmarket, Sina, Twreporter, uschinapress, voachinese, zaobao |
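
One way to picture the "daily incremental" behaviour is a crawler that remembers which article URLs it has already processed and keeps only the unseen ones on each run. The sketch below is an assumption about how this could work, not the project's actual code; the function names and the JSON state file are hypothetical.

```python
import json
from pathlib import Path


def load_seen(state_path):
    """Load the set of already-crawled URLs (hypothetical JSON state file)."""
    path = Path(state_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def select_new(candidate_urls, seen):
    """Return URLs not crawled before, in order, and mark them as seen."""
    new_urls = []
    for url in candidate_urls:
        if url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls


def save_seen(state_path, seen):
    """Persist the updated URL set so the next daily run skips these URLs."""
    Path(state_path).write_text(json.dumps(sorted(seen)), encoding="utf-8")
```

A daily run would call `load_seen`, pass the URLs found on each source's listing pages through `select_new`, crawl only the returned URLs, and finish with `save_seen`.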

Project Script Files

  • /home/xuanlong/web_crawl/web_crawl

Project data and log files

  • Crawled articles are stored in the Elasticsearch data pool, in the collections news_articles_en / news_articles_id / news_articles_ms / news_articles_vi / news_articles_ta / news_articles_zh. At the same time, the articles are split into sentences and saved in .jsonl format under /home/xuanlong/web_crawl/data/news_article/ for subsequent processing and back translation.
  • Log files for daily crawling are written to /home/xuanlong/web_crawl/data/
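
The sentence-splitting and .jsonl step described above could look roughly like the following. This is a minimal sketch under stated assumptions: the regex splitter, the record fields (`url`, `sentence_id`, `sentence`) and the one-file-per-language naming are all hypothetical, and the Elasticsearch write is not shown.

```python
import json
import re
from pathlib import Path

# Naive splitter (an assumption, not the project's tokenizer): split after
# Latin sentence punctuation followed by whitespace, or directly after CJK
# sentence punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def split_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def append_sentences(article, out_dir):
    """Append one JSON record per sentence to <out_dir>/<language>.jsonl.

    `article` is a hypothetical dict with "url", "language" and "text" keys.
    Returns the number of sentences written.
    """
    out_path = Path(out_dir) / f"{article['language']}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sentences = split_sentences(article["text"])
    with out_path.open("a", encoding="utf-8") as f:
        for i, sentence in enumerate(sentences):
            record = {"url": article["url"], "sentence_id": i, "sentence": sentence}
            # ensure_ascii=False keeps Tamil, Chinese, etc. readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(sentences)
```

Writing one JSON object per line keeps the output appendable across daily runs and easy to stream in later processing steps.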

Quick Start

python ./web_crawl/runner.py

Independent crawlers

Independent crawlers are one-off crawlers that are run once on demand; each crawler covers a single source:

| Language | Sources |
|---|---|
| English | hardwarezone, reddit |
| Indonesian | detik |
| Thai | ch3plus, koratdaily, prachachat, thansettakij |

Project Script Files

  • /home/xuanlong/web_crawl/crawlers

Project data and log files

  • Crawled articles are split into sentences and saved in .jsonl format under /home/xuanlong/web_crawl/data/ for subsequent processing and back translation.
  • Log files for daily crawling are written to /home/xuanlong/web_crawl/data

Quick Start

for source hardwarezone

python ./web_crawl/crawlers/forum_en_hardwarezone.py

for source reddit

python ./web_crawl/crawlers/forum_reddit.py

for source detik

python ./web_crawl/crawlers/forum_id_detik.py

for source ch3plus

python ./web_crawl/crawlers/th_ch3plus_Spider.py

for source koratdaily

python ./web_crawl/crawlers/th_koratdaily_Spider.py

for source prachachat

python ./web_crawl/crawlers/th_prachachat_Spider.py

for source thansettakij

python ./web_crawl/crawlers/th_thansettakij_Spider.py
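
If all independent crawlers need to be run in one go, a small wrapper can launch the scripts listed above one after another. This is only a convenience sketch, not part of the project: `run_all` and `build_command` are hypothetical helpers, and the relative paths assume the same working directory as the commands above.

```python
import subprocess
import sys

# Script paths copied from the Quick Start commands above.
CRAWLER_SCRIPTS = [
    "./web_crawl/crawlers/forum_en_hardwarezone.py",
    "./web_crawl/crawlers/forum_reddit.py",
    "./web_crawl/crawlers/forum_id_detik.py",
    "./web_crawl/crawlers/th_ch3plus_Spider.py",
    "./web_crawl/crawlers/th_koratdaily_Spider.py",
    "./web_crawl/crawlers/th_prachachat_Spider.py",
    "./web_crawl/crawlers/th_thansettakij_Spider.py",
]


def build_command(script, python=sys.executable):
    """Build the argument list for running one crawler script."""
    return [python, script]


def run_all(scripts=CRAWLER_SCRIPTS, runner=subprocess.run):
    """Run each script sequentially and return {script: exit code}.

    A failing crawler does not stop the remaining ones (check=False).
    """
    results = {}
    for script in scripts:
        completed = runner(build_command(script), check=False)
        results[script] = completed.returncode
    return results
```

Calling `run_all()` runs the seven crawlers in sequence; a non-zero exit code in the returned dictionary marks a crawler that needs to be rerun.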

web_crawl's People

Contributors

zouxunlong
