
imagecrawler's Introduction

image_crawler v1.0.4

This is an image crawler written in pure shell script. Given a keyword, it collects image URLs from the web (the URLs, not the image files themselves). It imitates a human visiting Google Image or Baidu Image, runs a search with the keywords you provide, and then parses and records the image URLs that Google or Baidu returns. Once you have the image URLs, you can download the actual images using any scripting language you like. This tool can also be used to compare search quality and relevance between Google Image and Baidu Image.
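
As an illustration of the "download the actual images" step, here is a minimal bash sketch that reads a URL list (one URL per line) and saves each image. The helper names fetch_one and download_urls, the numbered file names, and the 30-second timeout are assumptions made for this example; they are not taken from image_crawler.sh.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_one and download_urls are hypothetical helpers,
# not functions from image_crawler.sh.

# Download a single URL to a target path with whichever tool is installed.
fetch_one() {
  local url=$1 target=$2
  if command -v curl >/dev/null 2>&1; then
    curl -fsSL --max-time 30 -o "$target" "$url"
  else
    wget -q -T 30 -O "$target" "$url"
  fi
}

# Read a URL list (one URL per line) and save each image under out_dir.
# Prints the number of URLs attempted.
download_urls() {
  local list_file=$1 out_dir=$2
  local n=0 url target
  mkdir -p "$out_dir"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -z "$url" ] && continue            # skip blank lines
    n=$((n + 1))
    # Name files by position so characters in the URL cannot break the path.
    target=$(printf '%s/%04d' "$out_dir" "$n")
    fetch_one "$url" "$target" || echo "failed: $url" >&2
  done < "$list_file"
  echo "$n"
}
```

Calling download_urls with one of the generated URL list files and a target directory would attempt every URL in that file and report failures on stderr without stopping.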

What's NEW?

  • It now supports both wget and curl.
  • Performance is about 10x faster, thanks to increased parallelism from running multiple background jobs in bash.
  • The Baidu/Google image search URLs are updated. This script was created about 5 years ago, and since then both Baidu and Google have changed their image search query parameters and rules. I spent a good part of this weekend working out what changed and updating the script to make it work again.
  • A Python script is introduced to decode the Baidu image objURL. I originally planned to do this in bash, but it needs a hash table, which is complicated to get working in bash alone. So I'll leave it as a TODO for now: port this Python script to a bash script with the same functionality.
  • [Experiment]: Downloading images after parsing out the image URLs. You can turn this feature off by specifying 'EXPERIMENT="OFF"'.
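
The background-job parallelism mentioned above can be sketched roughly as follows. This is only an illustration of the general bash pattern (start each unit of work with `&`, then `wait` for all of them); crawl_keyword and crawl_all are hypothetical names, and the per-keyword work is a placeholder rather than the script's real fetch-and-parse logic.

```shell
#!/usr/bin/env bash
# Sketch of fan-out with bash background jobs. crawl_keyword is a
# hypothetical stand-in for the per-keyword work, not the script's code.

crawl_keyword() {
  local index=$1 keyword=$2 out_dir=$3
  # A real crawler would fetch and parse search results here.
  printf 'results for %s\n' "$keyword" > "$out_dir/${index}_${keyword}"
}

crawl_all() {
  local out_dir=$1
  shift
  local i=0 keyword
  mkdir -p "$out_dir"
  for keyword in "$@"; do
    i=$((i + 1))
    crawl_keyword "$i" "$keyword" "$out_dir" &   # one background job per keyword
  done
  wait    # block until every background job has finished
}
```

Because each job writes to its own file, the jobs do not need any locking. With many keywords, a cap on the number of simultaneous jobs would be a sensible addition.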

TODO:

  • Continue to improve performance.
  • Add proxy support?
  • Implement a bash equivalent of the Python script that decodes the Baidu image URL.
  • ...

How to use it?

  • Input: A file named query_list.txt, with one keyword per line.
  • Usage: $ ./image_crawler.sh google <num>
    • google can be replaced by baidu.
    • <num> is the number of images you want to download for a given keyword.
  • Output: The script generates a directory named google/ (or baidu/, depending on which you chose), which contains files named in the format "i_objURL-list_keyword[i]", where i is the index of the keyword in query_list.txt. Each of these files contains num lines, one image URL per line.
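
To make the input and output layout concrete, the sketch below maps each keyword in a query list to the result file it would correspond to. It rests on two assumptions that the README does not confirm: that the index i starts at 1, and that the file name is literally "<i>_objURL-list_<keyword>". The function name expected_outputs is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the result files expected for a query list.
# Assumptions (not confirmed by the script): the index starts at 1 and
# the file name is "<i>_objURL-list_<keyword>".

expected_outputs() {
  local query_file=$1 engine=$2
  local i=0 keyword
  while IFS= read -r keyword || [ -n "$keyword" ]; do
    [ -z "$keyword" ] && continue        # ignore blank lines
    i=$((i + 1))
    printf '%s/%d_objURL-list_%s\n' "$engine" "$i" "$keyword"
  done < "$query_file"
}
```

Running `wc -l` on each listed file is a quick way to check whether a crawl returned the requested number of URLs.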

Performance

I've tested this script with 10 keywords (the ones in query_list.txt), crawling 300 results per keyword from Google and from Baidu.
The results are as follows:
[unix14 ~/imagecrawler]$ time ./image_crawler.sh google 300
real 0m5.766s
user 0m2.425s
sys 0m2.254s

[unix14 ~/imagecrawler]$ time ./image_crawler.sh baidu 300
real 0m11.419s
user 0m1.254s
sys 0m1.044s

The results are decent, and I plan to make the script more concurrent in the future.

Note:

It works on any platform that provides bash, egrep, awk, python, and either wget or curl, such as Ubuntu or macOS.
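
A quick way to confirm that a machine meets these requirements is to check that each tool is on the PATH before running the crawler. The helpers below, missing_tools and have_downloader, are hypothetical and only sketch such a check; they are not part of image_crawler.sh.

```shell
#!/usr/bin/env bash
# Sketch: verify the required tools are on PATH before crawling.
# missing_tools and have_downloader are hypothetical helpers.

# Print the name of every listed tool that cannot be found.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed if at least one of wget or curl is available.
have_downloader() {
  command -v wget >/dev/null 2>&1 || command -v curl >/dev/null 2>&1
}
```

For example, `missing_tools bash egrep awk python` prints nothing when all four are installed, and `have_downloader || echo "install wget or curl"` covers the either-or requirement.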

imagecrawler's People

Contributors

dryruner
