Tumblr_Crawler

This is a multi-threaded crawler for Tumblr. It can download an entire blog or any single post that you like.

There are two crawler modules: one for video and one for images, including GIFs. The main file is Crawler.

Change Log

Update 2.0: download any post

This version of TumblrCrawler combines video and image (including GIF) downloading in the same file. It can also detect whether the main content of a post is video or photo. The current version only downloads from a post page directly.
Searching a whole blog is still in progress. The search should be simple and will not need to deal with the JS. My plan is to use the archive page to collect all the post pages, then visit each one and download its content.

Update 3.0

This version is the final one. It adds crawling of a whole blog, which means the crawler can download all the files of one blog, both images and video, in a single run.
The crawler uses the threading.Thread module. Tumblr shows 10 posts per page, and each page is handled by its own thread, so multi-threading speeds up the whole process. It needs no cookie and can crawl any account. Of course, the more posts there are, the longer it takes to crawl them all.
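
The one-thread-per-page idea can be sketched roughly as follows. This is only an illustration, not the project's actual code: `page_url`, `crawl_page`, and `crawl_blog` are hypothetical names, the `/page/N` URL pattern is an assumption, and the network download is replaced by a placeholder so the threading pattern stays visible.

```python
import threading

POSTS_PER_PAGE = 10  # page size described in the change log


def page_url(blog_url, page_number):
    # Assumption: blog pages are reachable as <blog>/page/<n>.
    return blog_url.rstrip("/") + "/page/" + str(page_number)


def crawl_page(blog_url, page_number, results, lock):
    # Placeholder for the real work (fetching the page and downloading
    # its media). Here we only record which page URL and which post
    # positions this thread would handle.
    first = (page_number - 1) * POSTS_PER_PAGE
    handled = list(range(first, first + POSTS_PER_PAGE))
    with lock:  # protect the shared dict while writing
        results[page_number] = (page_url(blog_url, page_number), handled)


def crawl_blog(blog_url, total_pages):
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_page, args=(blog_url, n, results, lock))
        for n in range(1, total_pages + 1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every page has been processed
    return results
```

Calling `crawl_blog("http://name.tumblr.com/", 3)` would start three threads, one per page. For very large blogs a bounded pool of workers might be a safer choice than one thread per page.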

Update 4.0

I found that some blogs install a personal theme, which means they use a different stylesheet from the default one. Crawling from the home page does not work for them, so I switched to the Archive page as the default page to search. All Archive pages use the same stylesheet. However, each Archive page holds 50 posts, so a single thread has to process the downloads for 50 post URLs. This definitely slows things down a lot, but it cannot be avoided.

The PersonalThemeSearch.py module determines whether a blog uses the default stylesheet or a personal one.

The ArchiveSearch.py module crawls all the post URLs in the Archive pages, 50 post URLs per page. By comparison, the original way of crawling the main page gets 10 posts per page.
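
Collecting post URLs from an Archive page could look something like the sketch below. It is not the code in ArchiveSearch.py: the function name `extract_post_urls`, the regular expression, and the sample HTML are all assumptions made for illustration, based only on the common `/post/<id>` shape of Tumblr post links.

```python
import re

# Assumption: post links appear as href="http(s)://<host>/post/<digits>...".
POST_LINK = re.compile(r'href="(https?://[^"/]+/post/\d+[^"]*)"')


def extract_post_urls(archive_html):
    """Return post URLs in page order, without duplicates."""
    seen = set()
    urls = []
    for url in POST_LINK.findall(archive_html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical fragment of an archive page, for demonstration only.
sample = '''
<a href="http://name.tumblr.com/post/123/first-post">one</a>
<a href="http://name.tumblr.com/post/456">two</a>
<a href="http://name.tumblr.com/post/123/first-post">one again</a>
<a href="http://name.tumblr.com/archive">archive</a>
'''

print(extract_post_urls(sample))
```

Because the Archive page uses the same markup for every blog, a single pattern like this can work regardless of the blog's theme, which is the motivation given above.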

This version only solves finding all the posts on blogs with any kind of stylesheet. A more universal function for crawling the content of posts that use a personal template still needs to be designed.

This version also fixes some exceptions on pages that are not posts and a small logic problem with the input. There are some special cases of URL format, like "https://.*?" and "http://wanimal1983.org/" (a redirection, it seems? http://wanimal1983.tumblr.com).

Update 5.0

This may be the final version. It fixes the problem that content from blogs with a special stylesheet could not be downloaded, along with all the problems in the last version. It also distinguishes a homepage from a post page, so the user can download a whole blog or a specific post.
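
One simple way to tell a homepage from a post page is to look at the URL path. The sketch below assumes post URLs contain `/post/<id>`; the function name `classify_url` is hypothetical and the crawler itself may decide differently.

```python
import re
from urllib.parse import urlparse

# Assumption: a post URL path starts with /post/ followed by a numeric id.
POST_PATH = re.compile(r"^/post/\d+")


def classify_url(url):
    """Return 'post' for a single-post URL, otherwise 'blog'."""
    path = urlparse(url).path
    return "post" if POST_PATH.match(path) else "blog"
```

With this check, `classify_url("http://name.tumblr.com/")` is treated as a whole-blog request, while a URL such as `http://name.tumblr.com/post/123/some-title` is treated as a single post. Custom domains like `http://wanimal1983.org/` are classified the same way, since only the path is inspected.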

The main function works for lots of blogs, including ones with a special URL or theme. Of course, there may be some unusual blog stylesheets that are incompatible. You are welcome to let me know if you find one. :)

Update 5.5 (stable version)

Fixed the URL decoding problem, so there should be no more 'url not found' errors for pages that can be viewed in the browser.
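
The change log does not say which encoding caused the problem. As a rough illustration only, a URL pulled out of page source may carry HTML entities or percent-escapes, and both can be undone with the standard library. The helper name `decode_url` is hypothetical.

```python
import html
from urllib.parse import unquote


def decode_url(raw):
    # First undo HTML entities (e.g. &amp;), then percent-escapes
    # (e.g. %E4%BD%A0), which are decoded as UTF-8 by default.
    return unquote(html.unescape(raw))
```

For example, `decode_url("http://name.tumblr.com/post/1/%E4%BD%A0%E5%A5%BD?a=1&amp;b=2")` gives back the URL with the readable slug `你好` and a plain `&` in the query string.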

Update 6.0

Tumblr changed the format of video URLs, so versions before 6.0 may fail to download video. I modified the regular expression to match.

Environment

Developed under Python 3.5 with some basic packages, such as requests.

Run

Run TumblrCrawler.py directly. The input can be a blog's URL, such as http://name.tumblr.com/, or the URL of any single post that you like.

Finally, enjoy your interesting and exciting download! :)

Contributors

sparrow629
