
imagecrawler's Introduction

image_crawler v1.0.4

This is an image crawler written in pure shell script that downloads image URLs for a given keyword. It visits Google Image or Baidu Image with the keywords you provide, mimicking a human search, and then parses and records the results (image URLs) returned by Google or Baidu. Once you have the image URLs, you can download the actual images with any scripting language you like. The tool can also be used to compare search quality and relevance between Google Image and Baidu Image.

What's NEW?

  • Now it supports both wget and curl!
  • Performance is roughly 10x faster, achieved by increasing parallelism with multi-process background jobs in bash.
  • The Baidu/Google image search URLs have been updated. The script was written about five years ago, and both Baidu and Google have since changed their image search query parameters and rules, so I spent quite some time this weekend figuring out what changed and getting the script working again.
  • A Python script has been introduced to decode the Baidu image objURL. I originally planned to do this in bash, but it requires a hash table, which is painful to implement in pure bash, so porting the Python decoder to bash is left as a TODO.
  • [Experiment]: Download the images after parsing out the image URLs. You can turn this feature off by setting EXPERIMENT="OFF".
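The multi-process speedup mentioned above comes from launching each fetch as a bash background job and waiting for all of them to finish. A minimal sketch of the pattern (fetch_page is a hypothetical stand-in for the script's real per-page fetcher, not code from the repository):

```shell
#!/usr/bin/env bash
# Sketch of the multi-process pattern: one background job per page
# of results, then wait for all of them before continuing.
fetch_page() {
    local page=$1
    # in the real script this would be something like:
    #   curl -s "${SEARCH_URL}&start=$((page * 20))" > "page_${page}.html"
    echo "fetched page $page"
}

for page in 0 1 2 3 4; do
    fetch_page "$page" &    # run each fetch in its own background process
done
wait                        # block until every background job finishes
echo "all pages done"
```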

TODO:

  • Continue to improve the performance.
  • Add proxy support?
  • Implement the bash equivalent of the Python objURL decoder.
  • ...
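On the bash decoder TODO: bash 4+ associative arrays can serve as the hash table. The sketch below only demonstrates the mechanics, using a made-up substitution table, NOT Baidu's real objURL encoding:

```shell
#!/usr/bin/env bash
# Character-substitution decoding with a bash 4+ associative array.
# The mapping here is an illustrative example, not Baidu's actual
# substitution table.
declare -A char_map=( [w]=a [k]=b [v]=c [t]=i [p]=t )

decode_url() {
    local in=$1 out="" c i
    for (( i = 0; i < ${#in}; i++ )); do
        c=${in:i:1}
        out+=${char_map[$c]:-$c}   # substitute if mapped, else keep as-is
    done
    printf '%s\n' "$out"
}
```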

How to use it?

  • Input: a file named query_list.txt, one keyword per line.
  • Usage: $ ./image_crawler.sh google <num>
    • google can be replaced by baidu.
    • <num> is the number of images you want to download for each keyword.
  • Output: the script generates a directory named google/ (or baidu/, as chosen), containing files named "i_objURL-list_keyword[i]", where i is the index of the keyword in query_list.txt. Each file contains num lines, one image URL per line.
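Once the URL-list files exist, the actual images can be fetched with a short loop. This is a sketch assuming the output layout above (download_list is a hypothetical helper, not part of the script); it prefers wget and falls back to curl:

```shell
#!/usr/bin/env bash
# Download every URL in a list file produced by the crawler,
# one image URL per line, numbering the output files.
download_list() {
    local list=$1 outdir=$2 n=0 url
    mkdir -p "$outdir"
    while IFS= read -r url; do
        [ -n "$url" ] || continue      # skip blank lines
        n=$((n + 1))
        if command -v wget >/dev/null 2>&1; then
            wget -q -O "$outdir/$n.jpg" "$url"
        else
            curl -s -o "$outdir/$n.jpg" "$url"
        fi
    done < "$list"
}
# Example: download_list google/1_objURL-list_keyword1 images/keyword1
```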

Performance

I tested the script with 10 keywords (the ones shipped in query_list.txt), crawling 300 results per keyword.
Results are as follows:
[unix14 ~/imagecrawler]$ time ./image_crawler.sh google 300
real 0m5.766s user 0m2.425s sys 0m2.254s

[unix14 ~/imagecrawler]$ time ./image_crawler.sh baidu 300
real 0m11.419s user 0m1.254s sys 0m1.044s

The results are not bad, and in the future I'll tweak the script into an even more concurrent version.

Note:

It works on any platform that provides bash, egrep, awk, python, and either wget or curl: Ubuntu, macOS, etc.
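The dependency list above can be checked up front before running the crawler. A small sketch (check_deps is a hypothetical helper, not part of the script):

```shell
#!/usr/bin/env bash
# Report any missing tools from a list; returns non-zero if one or
# more are absent. Usage: check_deps bash egrep awk python
check_deps() {
    local missing=0 tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing: $tool" >&2
            missing=1
        }
    done
    return $missing
}
```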

imagecrawler's People

Contributors

dryruner


imagecrawler's Issues

[Known Issue] grep/egrep ships in different versions (GNU and BSD, at least); both need to be supported.

  1. On ubuntu:
    $ grep -V;
    grep (GNU grep) 2.16
    ...
    $ egrep -V;
    egrep (GNU grep) 2.16
    ...
  2. On Mac OSX (BSD):
    $ grep -V;
    grep (BSD grep) 2.5.1-FreeBSD
    $ egrep -V;
    egrep (BSD grep) 2.5.1-FreeBSD

** What's the impact?

GNU grep and BSD grep have slightly different functionality.

GNU grep supports Perl-style non-greedy matching:
grep -P '"objURL":".*?"' -o
but egrep [-P] '"objURL":".*?"' -o does not work (GNU egrep differs from BSD egrep).

BSD egrep supports it directly:
egrep '"objURL":".*?"' -o
but grep -P '"objURL":".*?"' -o does not work there (BSD grep doesn't support -P).
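A portable alternative sidesteps the difference entirely: a negated character class [^"]* stops at the closing quote without needing non-greedy matching, so plain grep -o works on both families. A sketch (the function name is mine, not the script's):

```shell
#!/usr/bin/env bash
# Extract "objURL":"..." fields from a results page. The negated
# character class [^"]* stops at the next double quote, so neither
# Perl-style -P nor non-greedy matching is required; this works with
# both GNU and BSD grep.
extract_objurls() {
    grep -o '"objURL":"[^"]*"' "$1"
}
```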

Empty Files on Ubuntu

Hello,
I'm trying to run this on my Ubuntu machine; it runs fine, but all the files it creates are empty (for both Google and Baidu). Do you have any idea what the issue might be?

Thank you very much for this,

download google image like *.jpg

hi jonnyhsy, thanks for your great work.
I tried your code on Ubuntu 16.04 and it works well. But I also want to crawl images from Google Image search for queries such as "bar chart". How can I do that with your code? Thanks very much!
