cuulee / fastdup

This project forked from visual-layer/fastdup

License: Other

Python 99.38% Dockerfile 0.62%

fastdup's Introduction

FastDup | A tool for gaining insights from a large image collection

Large Image Datasets Today are a Mess | Blog Post | Video Tutorial

FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near-duplicate images, and clusters of similar images, and it can learn the normal behavior of, and temporal interactions between, images. It can be used for smart subsampling to obtain a higher-quality dataset, for outlier removal, and for novelty detection of new information to be sent for tagging. FastDup scales to millions of images running on CPU only.

From the authors of GraphLab and Turi Create.

Identify duplicates

Duplicates and near duplicates identified in the MS-COCO and ImageNet-21K datasets.

Find corrupted and broken images

Thousands of broken ImageNet images that have confusing labels of real objects.

Find outliers

IMDB-WIKI outliers (the dataset is intended for face recognition and for gender and age classification).

Find similar persons

Can you tell how many different persons there are?

Find wrong labels

Wrong labels in the ImageNet-21K dataset.

Find images with contradicting labels

Cluster of wrong labels in the ImageNet-21K dataset. No human can tell those red wines apart from their images.

Fun labels in the ImageNet-21K dataset.

Coming soon: image graph search (please reach out if you would like to beta test)

Upcoming new features: image graph search!

Results on Key Datasets (full results here)

We have thoroughly tested fastdup on a range of well-known visual datasets, from pillar academic datasets to Kaggle competitions. A key finding is that the ImageNet-21K dataset contains ~1.2M (!) duplicate images, of which 104K pairs belong to both the train and the val splits (this amounts to 20% of the validation set). This is a new, previously unreported result! Full results are below.

* Train/val splits are taken from https://github.com/Alibaba-MIIL/ImageNet21
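To make the train/val overlap claim concrete, here is a minimal sketch of how cross-split duplicates could be counted once duplicate pairs are known. It does not use the fastdup API: the helper names `cross_split_pairs` and `leaked_fraction`, the file names, and the toy data are all hypothetical, and the split of each image is assumed to be available as a plain dictionary.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def cross_split_pairs(pairs: Iterable[Pair], split_of: Dict[str, str]) -> List[Pair]:
    """Return the duplicate pairs whose two images sit in different splits."""
    leaks = []
    for a, b in pairs:
        split_a, split_b = split_of.get(a), split_of.get(b)
        # Skip images with an unknown split; keep only train/val style mismatches.
        if split_a is not None and split_b is not None and split_a != split_b:
            leaks.append((a, b))
    return leaks


def leaked_fraction(leaks: Iterable[Pair], split_of: Dict[str, str], split: str = "val") -> float:
    """Fraction of images in `split` that appear in at least one cross-split pair."""
    total = sum(1 for s in split_of.values() if s == split)
    touched = {img for pair in leaks for img in pair if split_of.get(img) == split}
    return len(touched) / total if total else 0.0


# Toy data (hypothetical file names): two pairs cross the train/val boundary.
pairs = [
    ("train/a.jpg", "val/a_copy.jpg"),
    ("train/b.jpg", "train/b2.jpg"),
    ("train/c.jpg", "val/a_copy.jpg"),
]
split_of = {
    "train/a.jpg": "train",
    "train/b.jpg": "train",
    "train/b2.jpg": "train",
    "train/c.jpg": "train",
    "val/a_copy.jpg": "val",
    "val/d.jpg": "val",
}

leaks = cross_split_pairs(pairs, split_of)
print(len(leaks), leaked_fraction(leaks, split_of))
```

Note that the sketch separates the number of leaking pairs from the fraction of validation images affected, since one validation image can take part in several pairs.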

| Dataset | Total images | Cost [$] | Spot cost [$] | Processing [sec] | Identical pairs | Anomalies |
| --- | --- | --- | --- | --- | --- | --- |
| imagenet21k-resized | 11,582,724 | 4.98 | 1.24 | 11,561 | 1,194,059 | Anomalies, Wrong Labels |
| imdb-wiki | 514,883 | 0.65 | 0.16 | 1,509 | 187,965 | View |
| places365-standard | 2,168,460 | 1.01 | 0.25 | 2,349 | 93,109 | View |
| herbarium-2022-fgvc9 | 1,050,179 | 0.69 | 0.17 | 1,598 | 33,115 | View |
| landmark-recognition-2021 | 1,590,815 | 0.96 | 0.24 | 2,236 | 2,613 | View |
| visualgenome | 108,079 | 0.05 | 0.01 | 124 | 223 | View |
| iwildcam2021-fgvc9 | 261,428 | 0.29 | 0.07 | 682 | 54 | View |
| coco | 163,957 | 0.09 | 0.02 | 218 | 54 | View |
| sku110k | 11,743 | 0.03 | 0.01 | 77 | 7 | View |

  • The experiments presented were run on a 32-core Google Cloud machine with 128GB RAM (no GPU required).
  • All experiments can also be reproduced on an 8-core, 32GB machine (excluding ImageNet-21K).
  • We ran on the full ImageNet-21K dataset (11.5M images), comparing all pairs of images in roughly 3 hours WITHOUT a GPU (at a Google Cloud cost of about $5).
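The figures in the results table can be turned into throughput and per-image cost with a little arithmetic. The sketch below copies two rows from the table; the `summarize` helper and its output keys are illustrative names only, not part of fastdup.

```python
# (dataset, total images, on-demand cost in $, spot cost in $, processing seconds),
# copied from the results table.
ROWS = [
    ("imagenet21k-resized", 11_582_724, 4.98, 1.24, 11_561),
    ("coco", 163_957, 0.09, 0.02, 218),
]


def summarize(name, images, cost, spot_cost, seconds):
    """Derive throughput and cost-per-million-images figures for one table row."""
    millions = images / 1_000_000
    return {
        "dataset": name,
        "images_per_sec": images / seconds,
        "hours": seconds / 3600,
        "usd_per_million": cost / millions,
        "spot_usd_per_million": spot_cost / millions,
    }


for row in ROWS:
    s = summarize(*row)
    print(
        f"{s['dataset']}: {s['images_per_sec']:.0f} images/sec, "
        f"{s['hours']:.2f} h, ${s['usd_per_million']:.2f} per million images "
        f"(${s['spot_usd_per_million']:.2f} on spot)"
    )
```

For the ImageNet-21K row this works out to roughly a thousand images per second on the 32-core machine.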

Quick Installation

For Python 3.7 and 3.8 (Ubuntu 20.04, Ubuntu 18.04, Mac M1, or Intel Mac running Mojave and up):

python3.8 -m pip install fastdup
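Before installing, it may help to confirm that the interpreter is one of the Python versions named in the installation note. The following is a small sketch using only the standard library; `SUPPORTED` and `is_supported` are hypothetical names introduced here, and the version set simply mirrors the 3.7 and 3.8 requirement stated in this README.

```python
import sys

# Python versions listed in the installation note (3.7 and 3.8).
SUPPORTED = {(3, 7), (3, 8)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is one the README lists."""
    return (version_info[0], version_info[1]) in SUPPORTED


if is_supported():
    print("This interpreter matches the versions listed for fastdup.")
else:
    print(
        f"Python {sys.version_info[0]}.{sys.version_info[1]} is not listed; "
        "consider a 3.7 or 3.8 environment."
    )
```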

Running the code

import fastdup
fastdup.run(input_dir="/path/to/your/folder", work_dir="out")              # main running function
fastdup.create_duplicates_gallery("out/similarity.csv", save_path=".")    # create a visual gallery of found duplicates
fastdup.create_outliers_gallery("out/outliers.csv", save_path=".")        # create a visual gallery of anomalies
fastdup.create_components_gallery("out", save_path=".")                   # create a visualization of connected components

Working on the Food-101 dataset: detecting identical pairs, similar pairs (search), and outliers (non-food images).
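Beyond the galleries, the pair list written to `out/similarity.csv` can be post-processed directly. The sketch below is not fastdup's own implementation: it assumes the file has `from`, `to`, and `distance` columns and that a larger `distance` value means a closer match (both are assumptions to verify against your output), uses an arbitrary 0.9 threshold, and groups the surviving pairs into connected components with a simple union-find. The helper names `read_pairs` and `connected_components` are hypothetical.

```python
import csv
from collections import defaultdict


def read_pairs(lines, min_score=0.9):
    """Parse CSV lines with assumed columns from,to,distance; keep close pairs.

    `lines` is any iterable of text lines, for example an open file object.
    """
    pairs = []
    for row in csv.DictReader(lines):
        if float(row["distance"]) >= min_score:
            pairs.append((row["from"], row["to"]))
    return pairs


def connected_components(pairs):
    """Group images linked by duplicate pairs, largest group first (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


# In-memory demo with hypothetical file names.
demo = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("x.jpg", "y.jpg")]
print(connected_components(demo))
```

With a real run, the same helpers could be applied as `read_pairs(open("out/similarity.csv", newline=""))`, provided the column assumptions hold.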

Getting started examples

TensorBoard Projector integration is explained in our Colab notebook

Detailed instructions

Support

Join our Slack channel

Technology

We build upon several excellent open source tools: Microsoft's ONNX Runtime, Facebook's Faiss, OpenCV, Pillow Resize, Apple's Turi Create, Minio, Amazon's awscli, and TensorBoard.

About Us

Danny Bickson, Amir Alush

fastdup's People

Contributors

amiralush, dbickson, sourabmaity, visualdatabase
