
imgdupes's Introduction

jpegdupes

jpegdupes (previously known as imgdupes) is a command-line tool that finds duplicated images in a directory tree. The difference from other file-oriented utilities (like fdupes) is that jpegdupes is specifically tailored to JPEG files, and compares only the image data chunk, ignoring any metadata present in the file (EXIF info, tags, titles, orientation tag...). This makes it possible to find duplicated images even when a file's metadata has been modified by imaging software and byte-by-byte comparators fail to report the files as equal. This might happen in a number of situations, for example:

  • Modifying EXIF info to adjust date taken (typically when the camera has been set at a wrong date/time)
  • Adjusting the rotation flag, or rotating the photo losslessly (e.g. using jpegtran)
  • Adding tags, or setting title, description, rating...
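To illustrate the idea of ignoring metadata, a baseline JPEG can be hashed segment by segment while skipping the APPn and COM metadata segments, so that two files differing only in EXIF data get the same signature. This is a simplified sketch only, not jpegdupes' actual code (which relies on libturbojpeg):

```python
import hashlib

# Illustrative sketch, NOT jpegdupes' real implementation. It hashes a
# baseline JPEG while skipping metadata segments (APP0-APP15 and COM),
# so files that differ only in EXIF/tags/titles get the same signature.
def image_data_hash(jpeg_bytes):
    h = hashlib.md5()
    i = 2  # skip the SOI marker (FF D8)
    while i + 4 <= len(jpeg_bytes) and jpeg_bytes[i] == 0xFF:
        marker = jpeg_bytes[i + 1]
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        segment = jpeg_bytes[i:i + 2 + length]
        # APPn (FFE0-FFEF) and COM (FFFE) hold metadata: skip them
        if not (0xE0 <= marker <= 0xEF or marker == 0xFE):
            h.update(segment)
        i += 2 + length
        if marker == 0xDA:  # SOS: the rest is entropy-coded image data
            h.update(jpeg_bytes[i:])
            break
    return h.hexdigest()
```

With this, two files whose only difference is an APP1 (EXIF) segment hash identically, while files with different image data do not.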

A common scenario that leads to this kind of duplicate is importing a file into your favourite image management program, altering its metadata in some way (tagging, rotating or whatever), and then re-importing it later because the software isn't smart enough to realize that the modified image is the same as the original, unmodified file still on the camera. These duplicates are annoying and hard to find, because standard file duplication utilities won't report them (they really are different files), so human checking is almost always required. jpegdupes tries to automate this task.

Invocation of jpegdupes is intentionally similar to that of the UNIX command fdupes. This is just for clarity and ease of use: jpegdupes is not meant as a direct replacement for fdupes, which is much more mature and well tested. The recommended workflow is to run fdupes first to find and delete byte-by-byte duplicates, and only then run jpegdupes to look for duplicates that fdupes might have missed. Even though the user interface is inspired by fdupes, not all fdupes parameters are implemented, and there are even some differences in global behaviour (e.g. recursive exploration is optional in fdupes but mandatory in jpegdupes), so be sure to run jpegdupes --help to check the details. jpegdupes only considers files with the extension .jpeg or .jpg (case insensitive); all other image files are ignored.

How to run jpegdupes

A simple invocation of jpegdupes would be:

jpegdupes filepath

where filepath is a folder containing jpg images, for example the root folder of your photo library. jpegdupes will recursively analyze the directory tree and, at the end, show a list of the duplicates it has found. With the --delete parameter, it will instead ask you, for each set of duplicates, which one should be preserved, and delete the rest. If, in addition to --delete, the flag --auto is passed, jpegdupes will automatically choose one file of each set to keep and delete the others without asking.

Analyzing each image's data chunk in order to compare and find duplicates is a time consuming task. So, to speed up future executions, jpegdupes creates a cache file inside the directory it's analyzing, containing the image signatures already generated. It's a small file called .signatures, stored in Python's pickle format. If you don't feel comfortable with the idea of jpegdupes writing to your disk, the parameter --clean may be used, which ensures that nothing is written to disk. The disadvantage is that all images will need to be re-analyzed each time jpegdupes is executed, which might take a while with a big collection.
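For the curious, a pickle file like this can be inspected from Python. Note that the internal structure of the pickled object is undocumented (the sketch below treats it as opaque and simply falls back to an empty cache when the file is missing or unreadable):

```python
import pickle

# Sketch: load a .signatures-style pickle cache. The shape of the pickled
# object is an assumption; jpegdupes does not document it.
def load_signature_cache(path=".signatures"):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (FileNotFoundError, pickle.UnpicklingError):
        return {}  # missing or unreadable cache: start fresh
```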

Filtering duplicates before importing

Prevention is better than cure. A common use case is to copy photos from your camera or phone to a temporary folder to_import on your computer, then use a photo manager application to import the new photos from the to_import folder into your photo library folder. To prevent duplicates from entering your library, run jpegdupes like this:

jpegdupes to_import --library /path/to/library --delete

This will analyze both the to_import folder with new photos and your existing library folder. Any jpg files in the to_import folder that already exist in the library will be deleted from the to_import folder; without the --delete flag they will only be printed. The remaining files are truly new ones, which can now be imported with your photo manager application. This way, no new duplicates will be added to your library.
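The filtering described above amounts to a set-membership check on image signatures. A minimal sketch with hypothetical names (jpegdupes' internal functions differ):

```python
# Hypothetical sketch of --library filtering: both arguments map
# filename -> image signature. Files in the import set whose signature
# already appears anywhere in the library are flagged as duplicates.
def find_already_imported(import_sigs, library_sigs):
    known = set(library_sigs.values())
    return [name for name, sig in import_sigs.items() if sig in known]
```

Anything returned by this check would be deleted (with --delete) or printed; the rest are genuinely new files.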

Notes

WARNING: If migrating from a previous Python 2.x version of jpegdupes, you'll probably get a nasty encoding error. Due to changes in Python 3's encoding management, signature files (.signatures) created with previous versions of jpegdupes aren't readable anymore, so you'll have to delete them and let jpegdupes regenerate them from scratch.

As a final disclaimer, jpegdupes is provided as is, and I can't be held responsible for any damage it might cause to your collection. I use jpegdupes myself, so I'm reasonably confident that it works, and I'm the first to want it free of bugs, but I can't guarantee that. Also keep in mind that, even if jpegdupes reports that two files correspond to the same image, that doesn't necessarily mean you have to delete one of them. It's up to you to decide which cases correspond to software mistakes (e.g. re-importing an existing image that had already been imported and tagged) and which ones are legitimate.

Requirements

Since v2, jpegdupes requires Python 3. The following external packages are required to run jpegdupes:

  • GExiv2: JPEG metadata reading
  • jpeginfo: Not strictly needed, but I've found a number of corrupt JPEG files that only jpeginfo is able to detect. If jpegdupes finds it installed, it will use it as an extra validation step; so if jpegdupes gets stuck on certain files, try installing jpeginfo with your system's package manager.
  • Other dependencies: Python 3 CFFI support, libturbojpeg...

All these packages are usually installable on any Linux distribution using its package manager.

In Ubuntu, the following commands should install everything:

sudo apt-get install python3-dev libjpeg-dev gir1.2-gexiv2-0.10 jpeginfo python3-cffi libturbojpeg0-dev python3-gi

For Arch Linux there are AUR packages jpegdupes and jpegdupes-git.

History

v2.1

New --library option, and several cleanup and code improvements from hilkoc (thanks a lot!)

v2.0

  • Migration from Python 2 to 3
  • Packaged for PyPI distribution (and renamed to jpegdupes, since imgdupes already existed as another project)
  • A lot of minor tweaks by me and some very kind contributors (thanks plenaerts for his several contributions, lagerspetz for his ideas and tweaks in his own fork, and probably others I don't remember right now)

v1.2

  • Added multi-CPU support
  • Removed the -r parameter; calculation of all possible rotations is now mandatory
  • Removed the identify command as a hash method. It complicated things, and MD5 and CRC are faster and available everywhere.

v1.1

Support for losslessly rotated image detection

v1.0

Initial release

imgdupes's People

Contributors

hilkoc, jesjimher, tomhoover, top-on


imgdupes's Issues

Detect duplicated images rotated with jpegtran

When a JPEG file has been rotated using jpegtran or any other JPEG lossless rotation utility, imgdupes can't find duplicates, because this kind of rotation alters the original image data. "Standard" rotation (switching the EXIF rotation tag) is fully detected.

One way to detect this kind of transformation would be to generate and store in .signatures up to 4 hashes (one per possible rotation) instead of just one. This would slow things down a bit, albeit perhaps not that much, since the image data would already be in memory and imgdupes is usually I/O bound. Some multiprocessing would help.

One thing to note is that jpegtran also allows losslessly flipping images, so in theory imgdupes should store all 4 possible rotations, 2 possible flips (horizontal and vertical), and all possible combinations of rotation+flip. Since this is obviously unfeasible, I think flipping may be ignored for the moment. After all, it's not as common an operation as rotation.
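The rotation idea can be sketched as a "canonical" hash: hash all four rotations and keep, say, the smallest digest, so rotated copies collapse to the same signature. A toy illustration on a 2D pixel matrix (the real code would operate on decoded JPEG data, not Python lists):

```python
import hashlib

def rotate90(pixels):
    """Rotate a 2D list of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]

def rotation_invariant_hash(pixels):
    """Toy sketch: the minimum digest over all 4 rotations, so rotated
    copies of the same image map to the same signature."""
    digests = []
    for _ in range(4):
        digests.append(hashlib.md5(repr(pixels).encode()).hexdigest())
        pixels = rotate90(pixels)
    return min(digests)
```

Because the four rotations of an image and of its rotated copy form the same set, both get the same minimum digest, at the cost of hashing four variants per file.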

delete option crashes it

I just did a
$ jpegdupes -d /home/turgut/Pictures/

and got:
(...)
Exploring ./2018/07
Exploring ./2018/07/06
Exploring ./2018/07/07
Exploring ./2018/07/08
Exploring ./2018/07/24
Exploring ./2018/07/27
Exploring ./2018/08
Exploring ./2018/08/19
Exploring ./2018/08/21
Exploring ./2018/08/23
Exploring ./2018/08/11
Exploring ./2018/08/12
Exploring ./2018/08/17
Exploring ./2018/08/18
Exploring ./2018/08/20
Exploring ./2018/09
Exploring ./2018/09/07
Exploring ./2018/09/08
Exploring ./2018/09/09
Exploring ./2018/09/14
Exploring ./2018/09/16
Exploring ./2009
Exploring ./2009/09
Exploring ./2009/09/22

Traceback (most recent call last):
File "/usr/local/bin/jpegdupes", line 11, in
load_entry_point('jpegdupes==2.0.13', 'console_scripts', 'jpegdupes')()
File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 337, in main
File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 337, in
File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 145, in metadata_summary
AttributeError: 'Metadata' object has no attribute 'get_tags'

recompile with -fPIC

Hi, I tried running this script (Linux Mint).
I had to add
import gi
gi.require_version('GExiv2', '0.10')
before the from gi.repository import GExiv2 line in order not to get an error.

The next attempt left me with the below error on the top and the script just finished after crawling some folders:

/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libturbojpeg.a(libturbojpeg_la-turbojpeg.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libturbojpeg.a: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

Dupes found are actually the same file

Got result like this:

(all 4 files are SAME file)

.... dupes that are ok ...

./2018-09-17/IMG_20180819_193752.jpg 
 ./2018-09-17/IMG_20180819_193752.jpg 
 ./2018-09-17/IMG_20180819_193752.jpg 
 ./2018-09-17/IMG_20180819_193752.jpg 

... some more dupes that are ok ...

I don't know why, but it worked perfectly (detected all the dupes it should have detected) except when it thought this one file was a dupe of itself... weird.

Have you thought about doing whole-file MD5 for other image types such as png and nef?


I have forked your code and made some adjustments. I have some files that crash imgdupes because they have "truncated jpg block" data. Also, imgdupes seems to show the same file multiple times for HDR files re-developed by Shotwell; choosing one to keep then fails with the error that it cannot delete the extras, e.g.

If you are still interested in this project, I'm planning to send you some PRs for:

  1. Do not crash on truncated JPG data blocks, catch the exception and do whole-file hash for those
  2. Do not crash on files with non-jpg content such as misnamed PNG files
  3. Automatic mode: non-interactively select to keep the best duplicate of a set, with the most tags, residing in the shallowest directory tree, and with the longest directory path in case of ties (prefer more descriptive directory names and shallower trees)
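Item 1 above could be sketched as a wrapper that catches decoding errors and falls back to a whole-file MD5. All names here are hypothetical stand-ins (decode_hash represents whatever function normally hashes the decoded image data):

```python
import hashlib

def safe_signature(path, decode_hash):
    """Sketch of the fallback idea: try the normal image-data hash, and
    fall back to a whole-file MD5 when the JPEG data is truncated or the
    file is not actually a JPEG. decode_hash is a hypothetical stand-in
    for the normal image-data hashing step."""
    try:
        return decode_hash(path)
    except Exception:  # truncated blocks, misnamed PNGs, etc.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()
```

Whole-file hashing still catches exact byte-for-byte duplicates of the problematic files, which is strictly better than crashing.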

Exception not properly caught

imgdupes stops on this image:

  Calculating hash of ./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg...
Traceback (most recent call last):
  File "/root/imgdupes/imgdupes.py", line 256, in <module>
    'hash':hashcalc(ruta,pool,args.method),
  File "/root/imgdupes/imgdupes.py", line 66, in hashcalc
    results=pool.map(phash,lista)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
NameError: global name 'path' is not defined
root@nas:/yyyyy/yyyyy/yyyyyy/yyyyyy# ls -l "./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg"
-rwxrwxr-x+ 1 root root 2960434 Jun  8  2007 ./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg
root@nas:/yyyyy/yyyyy/yyyyyy/yyyyyy# file "./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg"
./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg: JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=11, manufacturer=CASIO COMPUTER CO.,LTD , model=QV-R61 , orientation=upper-left, xresolution=178, yresolution=186, resolutionunit=2, software=1.00                 , datetime=2007:02:01 22:11:18]

releases on github

Hi,

Thanks for notifying me of your updates & posting them on pypi.

For the Arch Linux AUR package you should publish releases on GitHub. The AUR package downloads sources from GitHub, not PyPI. I've worked around this for now by creating the jpegdupes-git package, which downloads your latest commit as source, but the non-git version of the AUR package requires a release on GitHub.

For simplicity's sake it would be nice to rename the github repo to jpegdupes as well. People will be wondering what line 4 in my PKGBUILD is.

Thanks!

Pieter

‘struct.error: integer out of range for 'H' format code’

Hello,

I get an error after a certain number of files analysed:

Traceback (most recent call last):
  File "/home/gilles/bin/imgdupes.py", line 259, in <module>
    'hash':hashcalc(ruta,pool,args.method),
  File "/home/gilles/bin/imgdupes.py", line 63, in hashcalc
    results=pool.map(phash,lista)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'H' format code

If I relaunch the command, it continues from where it stopped until the next error.

$ python --version 
Python 2.7.6

and

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:    14.04
Codename:   trusty

Thank you!

installation on ubuntu 20.04 failed

root@myhostname:/tmp/imgdupes# apt-get install python3-dev libjpeg-dev gir1.2-gexiv2-0.10 jpeginfo
...
root@myhostname:/tmp/imgdupes# python3 setup.py build
...
root@myhostname:/tmp/imgdupes# python3 setup.py install
ModuleNotFoundError: No module named 'cffi'
