This project forked from adamrain/yfcc15m_downloader


YFCC15M_downloader

YFCC15M is a subset of YFCC100M. This repository provides download tools, checking scripts, and web drive links for downloading the dataset (uncompressed).


We followed DeCLIP's dataset preparation process.

  1. First, download DeCLIP's YFCC15M label file 'yfcc15m_clean_open_data.json' from Google Drive.

  2. Extract the URLs from the JSON file and split them into several URL list files for download using split_download_task.py.

  3. Crawl the images directly from the URLs using auto_download.bat (we use Wget here, so you may need to install it). The bat file is for Windows; on Linux you will need to rewrite it as a shell script. Or simply download from the links below!

    • If something goes wrong, you can stop the process and start it again later. Wget will skip the files that are already downloaded and clean the log files.
    • Errors are recorded in the log files. Before restarting the download, it is recommended to run clean_err_file_from_logs.py to filter out and delete the broken files.
  4. Check the downloaded images using check_images.py.
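Step 2 can be sketched as follows. This is a minimal illustration, not the code of split_download_task.py: the helper names are hypothetical, and the layout of the label file (a list of records, or a dict whose values are records, each with a "url" field) is an assumption that should be checked against the actual JSON.

```python
import json
from pathlib import Path


def load_records(json_path):
    """Load the label file; ASSUMES a list of records or a dict of records."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return list(data.values()) if isinstance(data, dict) else list(data)


def extract_urls(records, url_key="url"):
    """Collect non-empty URLs; the "url" field name is an assumption."""
    return [r[url_key] for r in records if isinstance(r, dict) and r.get(url_key)]


def split_into_chunks(urls, num_chunks):
    """Split a list into num_chunks contiguous parts of nearly equal size."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size, extra = divmod(len(urls), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks receive one additional URL each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(urls[start:end])
        start = end
    return chunks


def write_url_lists(chunks, out_dir, prefix="urls_"):
    """Write one text file per chunk, one URL per line, and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out_dir / f"{prefix}{i:03d}.txt"
        path.write_text("".join(url + "\n" for url in chunk), encoding="utf-8")
        paths.append(path)
    return paths
```

Each resulting list file can then be handed to a separate download process, so the lists can be fetched in parallel or resumed one at a time.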

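For Linux users, one alternative to rewriting the bat file is to launch Wget from Python. The sketch below is not a port of auto_download.bat, whose exact options are not shown here; the function names, file names, retry count, and timeout are all illustrative assumptions. It uses only standard Wget options: -i (read URLs from a file), -P (target directory), -nc (skip files that already exist), -o (log file), -t (retries), and -T (timeout).

```python
import subprocess
from pathlib import Path


def build_wget_command(url_list, out_dir, log_file, tries=2, timeout=30):
    """Build a Wget command line for one URL list file (values are assumptions)."""
    return [
        "wget",
        "-i", str(url_list),   # read the URLs to fetch from this file
        "-P", str(out_dir),    # save downloaded files under this directory
        "-nc",                 # no-clobber: skip files that already exist
        "-o", str(log_file),   # write Wget's messages to this log file
        "-t", str(tries),      # number of attempts per URL
        "-T", str(timeout),    # network timeout in seconds
    ]


def run_download(url_list, out_dir, log_file):
    """Run Wget for one URL list and return its exit code (requires Wget on PATH)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_wget_command(url_list, out_dir, log_file)).returncode
```

Because of -nc, running the same command again after an interruption only fetches the files that are still missing.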

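As a rough idea of what an integrity check can look like, the sketch below flags files that do not begin and end with the JPEG start and end markers. It is a cheap heuristic under the assumption that the images are JPEG files with a .jpg extension, and it is not necessarily what check_images.py does; a valid file with trailing bytes after the end marker would be flagged, and fully decoding each image with an image library is more thorough.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # SOI (start of image) marker
JPEG_END = b"\xff\xd9"    # EOI (end of image) marker


def looks_like_complete_jpeg(path):
    """Return True if the file starts with SOI and ends with EOI."""
    path = Path(path)
    if path.stat().st_size < 4:
        return False
    with path.open("rb") as f:
        head = f.read(2)
        f.seek(-2, 2)  # jump to the last two bytes of the file
        tail = f.read(2)
    return head == JPEG_START and tail == JPEG_END


def find_suspect_files(image_dir, pattern="*.jpg"):
    """List files under image_dir that fail the marker check, sorted by path."""
    return sorted(
        p for p in Path(image_dir).rglob(pattern)
        if not looks_like_complete_jpeg(p)
    )
```

Truncated downloads and HTML error pages saved under an image name both fail this check, which makes it a quick first pass before a slower full decode.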
Dataset info:

  • The dataset should contain 15,388,848 images.
  • We managed to crawl 15,061,747 of them.
  • Total space occupied: 867.73G.
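The two image counts above imply the following coverage. This is plain arithmetic on the numbers stated in this README; the variable names are illustrative.

```python
EXPECTED_IMAGES = 15_388_848  # images listed for the dataset
CRAWLED_IMAGES = 15_061_747   # images successfully crawled

missing = EXPECTED_IMAGES - CRAWLED_IMAGES
coverage = CRAWLED_IMAGES / EXPECTED_IMAGES

print(f"missing images: {missing:,}")  # missing images: 327,101
print(f"coverage: {coverage:.2%}")     # coverage: 97.87%
```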

Web Drive links:


If a link fails, please leave a message in the issues.

