AVA-Scraper


What is AVA?

AVA: A Large-Scale Database for Aesthetic Visual Analysis

The AVA dataset was released in 2012 for research on image aesthetics. It consists of over 250,000 images. Each image comes with the number of votes for each rating (1-10), semantic tags, and the challenge it is associated with.
PAPER: AVA: A Large-Scale Database for Aesthetic Visual Analysis

Later, in 2016, comments on the respective images were released (labeled AVA-Comments). They comprise over 1.5 million comments.
PAPER: Joint Image and Text Representation for Aesthetics Analysis

Both comments and images were taken from dpchallenge.com.


What is AVA-Scraper?

Images posted after 2012 are not part of the original AVA dataset, but they are still available on dpchallenge.com. This scraper downloads those images, their comments, and other metadata. It is divided into three parts:

  • Image scraper: Used for extracting images, their ratings, and number of votes. Stored as IMAGE_ID.jpg

  • Comment scraper: Used for extracting the comments on each image, applying some text cleaning (for example, removing URLs, carriage return characters, etc.), and storing the result. Stored as IMAGE_ID.txt, with one line per comment.

  • Others: Used for extracting new challenges and existing rules

The new data is stored under the name AVA 2.0.
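
The comment cleaning and the one-line-per-comment storage format could be sketched as follows. This is a minimal illustration in Python, not the scraper's actual code: the function names `clean_comment` and `comments_to_lines` and the URL pattern are assumptions made for this example.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_comment(text: str) -> str:
    """Remove URLs and line-break characters, then collapse whitespace."""
    text = URL_PATTERN.sub("", text)
    text = text.replace("\r", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()


def comments_to_lines(comments) -> str:
    """Build the body of an IMAGE_ID.txt file: one cleaned comment per line."""
    cleaned = (clean_comment(c) for c in comments)
    # Comments that are empty after cleaning are skipped.
    return "\n".join(c for c in cleaned if c)
```

Replacing carriage returns and newlines inside a comment is what keeps the "one line per comment" guarantee, so each line of IMAGE_ID.txt can be read back as exactly one comment.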

How does it work?

Scraping takes place in the following order:

  1. New challenges are extracted from dpchallenge.com. Extraction stops when the last challenge from the first AVA dataset is reached.

  2. Going one challenge at a time, images are extracted along with their ratings and votes.

  3. The ID of each extracted image is saved. Then, looping over one image at a time, the comments and semantic tags are extracted.
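
The three steps above can be sketched as a small driver. This is only an outline under stated assumptions: `new_challenges`, `run_pipeline`, and the injected `fetch_images` / `fetch_comments` callables are hypothetical names, and the real page fetching and parsing are left out entirely.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def new_challenges(ids_newest_first: Iterable[str], last_ava_id: str) -> List[str]:
    """Step 1: collect challenge IDs until the last original AVA challenge is hit."""
    found = []
    for challenge_id in ids_newest_first:
        if challenge_id == last_ava_id:
            break  # everything from here on is already in the first AVA dataset
        found.append(challenge_id)
    return found


def run_pipeline(
    ids_newest_first: Iterable[str],
    last_ava_id: str,
    fetch_images: Callable[[str], List[str]],    # hypothetical: challenge ID -> image IDs
    fetch_comments: Callable[[str], List[str]],  # hypothetical: image ID -> comments
) -> Tuple[List[str], Dict[str, List[str]]]:
    # Step 2: go through one challenge at a time and collect its image IDs.
    image_ids: List[str] = []
    for challenge_id in new_challenges(ids_newest_first, last_ava_id):
        image_ids.extend(fetch_images(challenge_id))

    # Step 3: loop over the saved image IDs and collect the comments.
    comments = {image_id: fetch_comments(image_id) for image_id in image_ids}
    return image_ids, comments
```

Passing the fetch functions in as arguments keeps the ordering logic separate from the network code, so it can be exercised with stub data.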

As of 11th August, 2017: 81,986 new images have been extracted. This only includes images WITH ratings.

NOTE: The scraper always waits 60 seconds between requests, as required by robots.txt. If you get blocked, it takes around a week to get unblocked.
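
One way to enforce such a delay is a small throttle that is called before every request. The sketch below is an assumption about how this could be done, not the scraper's own implementation; the `Throttle` class and its parameters are hypothetical.

```python
import time


class Throttle:
    """Ensure at least `delay_seconds` pass between consecutive calls to wait()."""

    def __init__(self, delay_seconds=60.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None    # time of the previous request, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps the gap at 60 seconds or more, regardless of how long the parsing in between takes.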

An emergency function has been added in case of issues, such as the site's server being down or a loss of internet connection. The function resumes scraping from where it left off :)
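
A resume mechanism of this kind can be built on a checkpoint file. The following is a sketch under assumptions, not the emergency function itself: the names `load_done`, `save_done`, and `scrape_with_resume`, the JSON checkpoint format, and the choice of `OSError` as the failure signal are all made up for illustration.

```python
import json
import os


def load_done(path):
    """Return the set of image IDs already scraped, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def save_done(path, done):
    """Write the checkpoint to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def scrape_with_resume(image_ids, scrape_one, checkpoint_path):
    """Scrape each ID once; return False if interrupted, True when finished."""
    done = load_done(checkpoint_path)
    for image_id in image_ids:
        if image_id in done:
            continue  # already handled in an earlier run
        try:
            scrape_one(image_id)  # hypothetical callable doing the real work
        except OSError:
            # Assumed failure mode: server down or connection lost.
            return False
        done.add(image_id)
        save_done(checkpoint_path, done)
    return True
```

Running `scrape_with_resume` again with the same checkpoint path skips every ID recorded as done and continues with the first one that failed.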

Contributors

tazeek
