mf1024 / imagenet-datasets-downloader

ImageNet dataset downloader. Creates a custom dataset by specifying the required number of classes and images in a class.

Home Page: https://mf1024.github.io/2019/06/09/how-to-scrape-the-imagenet/

Python 100.00%

imagenet-datasets-downloader's Introduction

ImageNet Downloader

This is an ImageNet dataset downloader. You can create new datasets from subsets of ImageNet by specifying how many classes you need and how many images per class you need. It works by downloading images from the URLs provided by the ImageNet API.

I describe how and why I wrote the tool in more detail in this blog post, which also includes a short analysis of the current state of the ImageNet URLs.

This software is written in Python 3.

Usage

The following command randomly selects 100 ImageNet classes that have at least 200 images each and starts downloading:

python ./downloader.py \
    -data_root /data_root_folder/imagenet \
    -number_of_classes 100 \
    -images_per_class 200

The following command downloads 500 images from each of the selected classes:

python ./downloader.py \
    -data_root /data_root_folder/imagenet \
    -use_class_list True \
    -class_list n09858165 n01539573 n03405111 \
    -images_per_class 500

You can find the class list in this csv, where I list every class that appears in ImageNet together with its total number of URLs and its number of Flickr URLs.
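
The random selection described under Usage can be sketched roughly as follows. This is a minimal sketch and not the tool's actual code: the column names `synid` and `total_urls` are hypothetical, so check them against the header of the real csv. Note also that a class having enough URLs does not guarantee that many images can still be downloaded, since some URLs may be dead.

```python
import csv
import random


def load_url_counts(csv_path, id_column="synid", count_column="total_urls"):
    """Read {class id: URL count} from a csv file.

    The default column names are assumptions, not taken from the real file.
    """
    with open(csv_path, newline="") as f:
        return {row[id_column]: int(row[count_column]) for row in csv.DictReader(f)}


def pick_classes(url_counts, number_of_classes, images_per_class, seed=None):
    """Randomly pick classes that list at least images_per_class URLs."""
    eligible = sorted(
        class_id for class_id, count in url_counts.items()
        if count >= images_per_class
    )
    if len(eligible) < number_of_classes:
        raise ValueError(
            "only {} classes have at least {} URLs".format(
                len(eligible), images_per_class
            )
        )
    # A seeded Random instance makes the selection reproducible.
    return random.Random(seed).sample(eligible, number_of_classes)
```

With the counts loaded, `pick_classes(counts, 100, 200)` would correspond to the first command above.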

Multiprocessing workers

I've implemented parallel request processing and added a multiprocessing_workers parameter, which defaults to 8. You can set it higher, but I haven't yet tested the limits of Flickr's allowed bandwidth myself, so use it with care; you will have to find the limits yourself if you want maximum speed.

You can do something like this:

python ./downloader.py \
    -data_root /data_root_folder/imagenet \
    -number_of_classes 1000 \
    -images_per_class 500 \
    -multiprocessing_workers 24
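
For readers who want to see what a bounded pool of workers looks like, here is a minimal sketch. It is an illustration only: it uses the standard library's thread-based `ThreadPool` for brevity, whereas the tool itself may well use separate processes, and `download_all` and `fetch` are hypothetical names.

```python
from multiprocessing.pool import ThreadPool


def download_all(urls, fetch, workers=8):
    """Apply fetch to every URL using at most `workers` concurrent threads.

    `fetch` is any callable taking one URL; results come back in input order.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    with ThreadPool(processes=workers) as pool:
        return pool.map(fetch, urls)
```

Raising `workers` increases the number of simultaneous requests, which is the knob the bandwidth warning above is about.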

imagenet-datasets-downloader's People

Contributors

91abdullah, christian-rauch, mf1024

imagenet-datasets-downloader's Issues

Nothing gets downloaded

Ubuntu 18, Python 3.7.3

python3 downloader.py -data_root ./ -number_of_classes 3 -images_per_class 5

Output:

Picked the following clases:
['thunderer', 'shuttle helicopter', 'scratcher']
Scraping images for class "thunderer"
Multiprocessing workers: 8
Scraping images for class "shuttle helicopter"
Multiprocessing workers: 8
Scraping images for class "scratcher"
Multiprocessing workers: 8

But there are no images within these folders, nothing's actually downloaded. I ran this a couple of times.

Cheers

freeze_support() issue

Anyone know how to fix this issue?

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
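
The idiom the error message asks for looks roughly like the sketch below. This is not the tool's actual code; `scrape_class` is a hypothetical stand-in for the per-class work. On Windows, and on macOS since Python 3.8, new processes are started with "spawn", which re-imports the main module in every child, so any code that starts processes has to sit behind the `__main__` guard.

```python
import multiprocessing


def scrape_class(class_name):
    # Hypothetical stand-in for the per-class work; it downloads nothing.
    return "scraped " + class_name


def main():
    class_names = ["thunderer", "scratcher"]
    with multiprocessing.Pool(processes=2) as pool:
        for line in pool.map(scrape_class, class_names):
            print(line)


if __name__ == "__main__":
    # Only needed for frozen executables, harmless otherwise.
    multiprocessing.freeze_support()
    main()
```

If the script starts its workers at module level, moving that code into a function called from under the guard is the usual fix.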

No labels found?

Does this tool only download images? I can't find any labels in the dataset.

This is my command:

python3 ./downloader.py -data_root {IMG_NET} -use_class_list True -class_list {class_list} -images_per_class 2000

Why are there more than 1000 image classes?

Can anyone please tell me why more than 1000 classes are listed in the .csv file? As far as I know, ImageNet has only the 1000 classes on which torchvision.models resnet101, densenet161 and other models are trained.

Wrong classes being downloaded?

I like the idea of this type of downloader and was going to try it out. So, I wanted to do 2 classes, hornet and wasp. But I can't seem to find the correct set of parameters to download just those two.

For example:

python ./downloader.py -data_root ../../ --use_class_list true -class_list n02213543 n02213107 -images_per_class 500
Picked the following clases:
['sea boat', 'pavlova', 'corn', 'tank', 'army officer', 'record player', 'monk', 'lovebird', 'loom', 'shark']
Scraping images for class "sea boat"

I am not sure why it is showing 10 or so other classes that were not selected.

Am I using the parameters incorrectly?
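
One possible explanation is the double dash in `--use_class_list`, since the README spells the flag with a single dash. The sketch below shows how that could go unnoticed. It rests on assumptions that are not confirmed by anything on this page: that the script uses `argparse` with `parse_known_args`, that `-use_class_list` is converted from the string "True", and that the defaults are 10 classes and 10 images.

```python
import argparse


def build_parser():
    # Flag names come from the README; types and defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("-data_root", default="")
    parser.add_argument("-use_class_list", default=False,
                        type=lambda s: str(s).lower() == "true")
    parser.add_argument("-class_list", default=[], nargs="*")
    parser.add_argument("-number_of_classes", default=10, type=int)
    parser.add_argument("-images_per_class", default=10, type=int)
    return parser


def parse(argv):
    # parse_known_args returns unrecognized arguments instead of exiting.
    return build_parser().parse_known_args(argv)


args, ignored = parse(["-data_root", "../../", "--use_class_list", "true",
                       "-class_list", "n02213543", "n02213107",
                       "-images_per_class", "500"])
print(args.use_class_list, ignored)
```

Under these assumptions the double-dash flag lands in the ignored list, `use_class_list` keeps its default, and the script would fall back to picking random classes. Retrying with `-use_class_list True` is a cheap way to check.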

Downloads more images per class than requested, and the counts vary randomly

I tried downloading the data with the following command:

python ./downloader.py \
    -data_root /imagenet \
    -number_of_classes 19000 \
    -images_per_class 2 \
    -multiprocessing_workers 24

It downloaded an arbitrary number of images per class, and I can't see why. Some classes ended up with 21 images, others with 3, and so on.

Validation/Test set urls

Hi,

Thanks for this incredible tool. Huge timesaver! I had one concern though.

Is there a way to tell the downloader to download images only from the validation/test set? Currently, I am not sure whether the images are downloaded from the training set or the test set. This may be needed to evaluate a pretrained classifier on test set images.

Facing a problem with data_root

I am facing a problem with the -data_root argument. What exactly is expected there?

Here is the command that I ran in the command prompt:
python ./downloader.py -data_root f:\imagenet -number_of_classes 100

File "./downloader.py", line 35
  logging.error(f'folder {args.data_root} does not exist! please provide existing folder in -data_root arg!')
                                                                                                           ^
SyntaxError: invalid syntax

I thought the problem was in how I wrote the command, so I ran it without any arguments so that it would take the default values. I still got the same error:

python ./downloader.py
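
A SyntaxError pointing at the end of an f-string usually means the interpreter is older than Python 3.6, where f-strings were introduced, so the -data_root value is probably not the cause. The sketch below is one way to check the interpreter version. It is not part of the tool, and because a SyntaxError is raised before any code in the same file runs, a check like this only helps if it lives in a separate file that contains no f-strings.

```python
import sys

MIN_VERSION = (3, 6)  # f-strings (PEP 498) were added in Python 3.6


def is_supported(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings."""
    return tuple(version_info[:2]) >= MIN_VERSION


if not is_supported():
    # str.format is used so that older interpreters can parse this file.
    sys.exit("Python {}.{} or newer is required, found {}.{}".format(
        MIN_VERSION[0], MIN_VERSION[1],
        sys.version_info[0], sys.version_info[1]))
```

Running `python --version` shows which interpreter the `python` command actually starts.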

Missing license for this repository

Hello, my name is Janos Pauli. I opened this issue to ask about the license, and thus the usage rights, of this repository. As part of a research project in my master's program, I would like to run an ML analysis focusing on some of the ImageNet stimuli and their respective categories. Since I need the classes corresponding to the stimuli, I would like to use the resources provided in this repository. However, I was not able to find its license, and therefore its usage rights/permissions, and wanted to ask whether this is noted somewhere and, if not, whether the information could be added. I'm truly sorry if I missed something obvious, and thank you for this great resource.
Best,
Janos :)

Putting this on PyPI?

Hi,

I am so happy that this repo exists - it is easy to use and helps a lot of people. Maybe publishing it on PyPI and making it installable via pip would open new possibilities.

If @mf1024 is interested, I will gladly try to support you!

Greetings,

kevinkit

Error: Image Reset by Peer

When running the command:
python ./downloader.py -data_root /data_root_folder/imagenet -number_of_classes 100 -images_per_class 200

I am getting the following error:

Traceback (most recent call last):
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 54] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/util/retry.py", line 368, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./downloader.py", line 324, in <module>
    resp = requests.get(url_urls)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/chiarabigarella/.virtualenvs/imgnet-downloader/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(

Could anyone please help?
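
A reset connection is often transient, so one common workaround is to retry the failing request with a growing delay. The helper below is a minimal sketch with hypothetical names, not something the tool provides. It catches `OSError` because both the built-in `ConnectionResetError` and `requests.exceptions.ConnectionError` derive from it.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    Re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The call from the traceback could then be wrapped as `with_retries(lambda: requests.get(url_urls))`. If the server keeps resetting the connection, retrying will not help and the endpoint itself may be unavailable.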

Is the 2010 or 2012 API used?

I tried to scrape classes for zero-shot learning, where 360 of my classes are from the 2010 ImageNet set [and are not present in the 2012 set]. The remaining 1k classes are from the 2012 set.

Is there a difference between the APIs of the two sets? I am able to pull the 360 classes, but scraping the 1k classes stops right at the beginning because the classes are not present in class_info_dict.

image_per_class fails

Hi.

Trying to download specific classes with -image_per_class 500 but it only downloads 10 images.

Any idea why?
