kdeldycke / mail-deduplicate

📧 CLI to deduplicate mails from mail boxes.

Home Page: https://kdeldycke.github.io/mail-deduplicate

License: GNU General Public License v2.0

Python 100.00%
python mail dedupe cli maildir mailbox mbox deduplication cleanup babyl

mail-deduplicate's Introduction

Mail Deduplicate

What is Mail Deduplicate?

Provides the mdedup CLI, a utility to deduplicate mails from a set of boxes.

Features

  • Duplicate detection based on cherry-picked and normalized mail headers.
  • Fetch mails from multiple sources.
  • Reads and writes to mbox, maildir, babyl, mh and mmdf formats.
  • Deduplication strategies based on size, content, timestamp, file path or random choice.
  • Copy, move or delete the resulting set of duplicates.
  • Dry-run mode.
  • Protection against false-positives with safety checks on size and content differences.
  • Supports macOS, Linux and Windows.
  • Standalone executables for Linux, macOS and Windows.
  • Shell auto-completion for Bash, Zsh and Fish.
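
To illustrate the first feature, here is a minimal sketch of header-based duplicate detection using only the Python standard library. The header selection (HEADERS), the whitespace normalization and the function name header_hash are assumptions made for this example; they are not mdedup's actual implementation.

```python
import hashlib
from email import message_from_string

# Hypothetical selection of headers; mdedup's real list may differ.
HEADERS = ("date", "from", "to", "subject", "message-id")


def header_hash(raw_mail: str) -> str:
    """Hash a mail from a few normalized headers, ignoring the body."""
    message = message_from_string(raw_mail)
    parts = []
    for name in HEADERS:
        # Collapse runs of whitespace so re-folded headers compare equal.
        value = " ".join((message.get(name) or "").split())
        parts.append(f"{name}:{value}")
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()


original = "From: a@example.com\nSubject: Hello   world\n\nBody\n"
duplicate = "Subject: Hello world\nFrom: a@example.com\nX-Extra: 1\n\nOther body\n"
other = "From: a@example.com\nSubject: Bye\n\nBody\n"

print(header_hash(original) == header_hash(duplicate))
print(header_hash(original) == header_hash(other))
```

Because only selected headers feed the hash, two copies that differ in body or in unrelated headers still collide, which is why the size and content safety checks listed above matter.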

⚠️ Warning: Performance

The mdedup implementation is quite naive at the moment: everything resides in memory.

This is good enough for a volume of a couple of gigabytes, but the more emails mdedup tries to parse, the closer you get to the memory limits of your machine. At that point mdedup will exit abruptly, zapped by the OOM killer of your OS. Of course, your mileage may vary depending on your hardware.

You can influence implementation of this feature with pull requests, or purchase of business support 🤝 and sponsorship 🫶.

Installation

From sources

The easiest way is to install mdedup from sources with pipx:

$ pipx install mail-deduplicate

Alternative installation methods are available in the documentation.

Executables

Standalone executables of mdedup's latest version are available for several platforms and architectures:

Platform   x86_64
Linux      Download mdedup-linux-x64.bin
macOS      Download mdedup-macos-x64.bin
Windows    Download mdedup-windows-x64.exe

mail-deduplicate's Issues

Use click as CLI framework

As discussed in #21, installation and usage of the script are highly dysfunctional.

I plan to upgrade the whole CLI code base to reuse a proper framework, namely click. I already use it extensively in other projects with great success.

AssertionError in self.stats['set_deduplicated'] in python3

I use version 3.0.0, installed via pip3 on Ubuntu Bionic. That means my default Python interpreter is Python 2.7.17. This leads to the following AssertionError.

mdedup deduplicate ~/Maildir/.DAMAGED20200924.Usenet-Speicher/ ~/Maildir/.Usenet-Speicher/
[...]
Check that mail differences are within the limits.
╒════════════╤══════════╕
│ Mails      │   Metric │
╞════════════╪══════════╡
│ Found      │       70 │
├────────────┼──────────┤
│ Skipped    │        0 │
├────────────┼──────────┤
│ Rejected   │        0 │
├────────────┼──────────┤
│ Kept       │       70 │
├────────────┼──────────┤
│ Unique     │        0 │
├────────────┼──────────┤
│ Duplicates │       70 │
├────────────┼──────────┤
│ Deleted    │        0 │
╘════════════╧══════════╛
╒══════════════════════════════════════╤══════════╕
│ Duplicate sets                       │   Metric │
╞══════════════════════════════════════╪══════════╡
│ Total                                │       35 │
├──────────────────────────────────────┼──────────┤
│ Ignored                              │        0 │
├──────────────────────────────────────┼──────────┤
│ Skipped                              │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (bad encoding)              │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (too dissimilar in size)    │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (too dissimilar in content) │        0 │
├──────────────────────────────────────┼──────────┤
│ Deduplicated                         │        0 │
╘══════════════════════════════════════╧══════════╛
Traceback (most recent call last):
  File "/home/leggewie/.local/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/leggewie/.local/lib/python3.6/site-packages/mail_deduplicate/cli.py", line 219, in deduplicate
    dedup.report()
  File "/home/leggewie/.local/lib/python3.6/site-packages/mail_deduplicate/deduplicate.py", line 648, in report
    self.stats['set_deduplicated'])
AssertionError

AttributeError: Message instance has no attribute 'get_all'

Not having much luck getting this to work:

Traceback (most recent call last):
  File "/tmp/python/bin/mdedup", line 9, in <module>
    load_entry_point('maildir-deduplicate==1.0.1.dev0', 'console_scripts', 'mdedup')()
  File "/tmp/python/local/lib/python2.7/site-packages/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/python/local/lib/python2.7/site-packages/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/tmp/python/local/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/python/local/lib/python2.7/site-packages/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/python/local/lib/python2.7/site-packages/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/tmp/python/local/lib/python2.7/site-packages/click/decorators.py", line 16, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/tmp/maildir-deduplicate/maildir_deduplicate/cli.py", line 137, in deduplicate
    dedup.add_maildir(maildir)
  File "/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 72, in add_maildir
    mail_file, message, self.use_message_id)
  File "/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 95, in compute_hash
    canonical_headers_text = cls.canonical_headers(mail_file, message)
  File "/tmp/maildir-deduplicate/maildir_deduplicate/deduplicate.py", line 115, in canonical_headers
    for value in mail.get_all(header):
AttributeError: Message instance has no attribute 'get_all'

Re-implement statistics

I broke statistics while reimplementing the whole deduplication strategy in commit 728b3fe.

Statistics need to be re-implemented before we can release 2.0.0.

AttributeError: 'NoneType' object has no attribute 'replace' at apply_strategy:187

Just got a crash on mdedup -n, version 2.0.2 (devel):

Traceback (most recent call last):
  File "/usr/local/bin/mdedup", line 9, in <module>
    load_entry_point('maildir-deduplicate==2.0.2', 'console_scripts', 'mdedup')()
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/cli.py", line 194, in deduplicate
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 547, in run
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 187, in apply_strategy
AttributeError: 'NoneType' object has no attribute 'replace'

AttributeError: 'module' object has no attribute 'init' on startup

To reproduce:

mdedup --help

results in

Traceback (most recent call last):
  File "/usr/bin/mdedup", line 7, in <module>
    from maildir_deduplicate.cli import cli
  File "/usr/lib/python3.6/site-packages/maildir_deduplicate/cli.py", line 51, in <module>
    @click_log.init(logger)
AttributeError: module 'click_log' has no attribute 'init'

Cause:

The user interface of click_log has changed in version 0.2.0.
See

Confusing size/content flag description

Looking at the source, setting -1 to either the size or content flags completely ignores potential changes in the size or content.

The help text implies that setting to -1 is equivalent to setting to zero (i.e. no difference in size/content allowed), whereas it's actually the opposite (any difference in size/content allowed).
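
The semantics described in this issue can be sketched as follows. The function name within_limit is hypothetical, and the behaviour shown is the one reported here from reading the source, not a verified copy of mdedup's code.

```python
def within_limit(difference: int, threshold: int) -> bool:
    """Return True if a measured size/content difference passes the check."""
    if threshold < 0:
        # Negative threshold (e.g. -1): the check is disabled, anything passes.
        return True
    # Zero means "no difference allowed"; positive values set the ceiling.
    return difference <= threshold


print(within_limit(500, -1))  # check disabled
print(within_limit(1, 0))     # strictest setting
```

Seen this way, -1 and 0 sit at opposite ends of the scale, which is the distinction the help text should spell out.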

mdedup appears to stop for no reason

I have a rather large Maildir folder with a total of 11966 files. Almost every mail is duplicated, courtesy of a misconfigured mail sync tool. When running mdedup on it, it seems to just stop with exit code 1 and no interesting log output. It reaches 4% before stopping, but setting the -i flag has it stop at 2%, oddly enough.

Here's a paste of said log in debug mode: http://dpaste.com/3X57Z0V

Does not run on Python 2.6.6

Not sure if a minimum Python version needs to be specified, but I can't even run mdedup --help without running into an error when it's installed on CentOS 6.x via PyPI.

Traceback (most recent call last):
  File "usr/bin/mdedup", line 7, in <module>
    from maildir_deduplicate.cli import cli
  File "/usr/lib/python2.6/site-packages/maildir_deduplicate/cli.py", line 94, in <module>
    DEFAULT_SIZE_DIFFERENCE_THRESHOLD))
ValueError: zero length field name in format
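
For context, this ValueError is what Python 2.6 raises for auto-numbered {} fields in str.format(), which are only supported from Python 2.7 and 3.1 onwards. The sketch below contrasts the two spellings; the help string and the value 512 are placeholders for illustration, not the real content of cli.py.

```python
DEFAULT_SIZE_DIFFERENCE_THRESHOLD = 512  # placeholder value for illustration

# Auto-numbered field: works on Python 2.7+ and 3.1+, fails on 2.6.
auto = "Defaults to {} bytes.".format(DEFAULT_SIZE_DIFFERENCE_THRESHOLD)

# Explicit index: also accepted by Python 2.6.
explicit = "Defaults to {0} bytes.".format(DEFAULT_SIZE_DIFFERENCE_THRESHOLD)

print(auto == explicit)
```

Either switching to explicit indices or declaring a minimum Python version in the package metadata would avoid the crash at import time.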

please consider to test syntactical correctness before computation of hashes

Computing the hashes is computationally expensive and potentially time-consuming. I think it might be a good idea to test syntactical correctness before that task instead of after. In my case, mdedup had been running 20 to 30 minutes, hashing 20.000 to 30.000 mails before failing.

$ mdedup deduplicate -n -s delete-matching-path -rDAMAGED2020 ~/Maildir/.DAMAGED20200924.Trash/ ~/Maildir/.Trash/
[...]
=== Phase #2: deduplicate mails.
The delete-matching-path strategy will be applied on each duplicate set.
--- 2 mails sharing hash bb2bc9037f04dbdc50519fc7ce88e7694fcb2ececbf49a3fb9d37f5b
Check that mail differences are within the limits.
Traceback (most recent call last):
  File "/home/leggewie/.local/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/leggewie/.local/lib/python3.6/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/leggewie/.local/lib/python3.6/site-packages/mail_deduplicate/cli.py", line 217, in deduplicate
    dedup.run()
  File "/home/leggewie/.local/lib/python3.6/site-packages/mail_deduplicate/deduplicate.py", line 593, in run
    duplicates.apply_strategy(self.conf.strategy)
TypeError: apply_strategy() takes 1 positional argument but 2 were given
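
The fail-fast ordering requested in this issue could look roughly like the sketch below: cheap option checks run before any mail is hashed. All names here (STRATEGIES, validate_options, run) are hypothetical. Note that the traceback above points at a call-signature mismatch inside the tool rather than at user input, so a check like this would probably not have caught that particular crash; it only illustrates the ordering.

```python
import hashlib
import re
from typing import Optional

# Hypothetical registry; the names mirror strategies mentioned on this page.
STRATEGIES = frozenset(
    {"delete-older", "delete-newer", "delete-smaller", "delete-matching-path"}
)


def validate_options(strategy: str, regexp: Optional[str] = None) -> None:
    """Fail fast on unusable options, before any mail is read."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    if strategy == "delete-matching-path":
        if not regexp:
            raise ValueError("delete-matching-path needs a regular expression")
        re.compile(regexp)  # raises re.error on a malformed pattern


def run(strategy, regexp, mails):
    validate_options(strategy, regexp)  # cheap checks first
    # Stand-in for the expensive hashing phase.
    return [hashlib.sha224(mail.encode("utf-8")).hexdigest() for mail in mails]


print(len(run("delete-matching-path", "DAMAGED2020", ["mail one", "mail two"])))
```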

No deletion: all mail differences within limits

I am using

mdedup -C -1 -S -1 -s delete-smaller maildirname

But I still get "Check that mail differences are within the limits." messages, and nothing is deleted. All duplicates are either ignored or skipped.

│ Mails      │   Metric │
╞════════════╪══════════╡
│ Found      │      227 │
├────────────┼──────────┤
│ Rejected   │        0 │
├────────────┼──────────┤
│ Kept       │      227 │
├────────────┼──────────┤
│ Unique     │        7 │
├────────────┼──────────┤
│ Duplicates │      220 │
├────────────┼──────────┤
│ Deleted    │        0 │
╘════════════╧══════════╛
╒══════════════════════════════════════╤══════════╕
│ Duplicate sets                       │   Metric │
╞══════════════════════════════════════╪══════════╡
│ Total                                │      117 │
├──────────────────────────────────────┼──────────┤
│ Ignored                              │        7 │
├──────────────────────────────────────┼──────────┤
│ Skipped                              │      110 │
├──────────────────────────────────────┼──────────┤
│ Rejected (bad encoding)              │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (too dissimilar in size)    │        0 │
├──────────────────────────────────────┼──────────┤
│ Rejected (too dissimilar in content) │        0 │
├──────────────────────────────────────┼──────────┤
│ Deduplicated                         │        0 │

What might be wrong?

Is there a way to rename the files rather than delete duplicates?

Thanks for help.

Vikas

Add Python 3.4 support

Hi,

I have tried to use mdedup either via pip install or by cloning the repo, and in both cases I get an error from __init__.py:

 File "SOME_LOCAL_PATH/maildir_deduplicate/__init__.py", line 94, in <module>
    from deduplicate import Deduplicate
ImportError: No module named 'deduplicate'

This can be fixed by replacing

from deduplicate import Deduplicate

by

from maildir_deduplicate.deduplicate import Deduplicate

I am currently using Python 3.4 on Debian.

Best regards,

Christophe

Numerous problems with click_log

click_log no longer has init() or get_level(), so maildir-deduplicate doesn't work. Any chance of replacing click_log with something that has a more stable API?

No graceful handling of crashes/further Unicode issues

Hi,

I've had a number of unhandled exceptions related to all the Unicode issues (#32, #33, #35).
I accidentally originally installed the Python 2 version of maildir-deduplicate, and it is better on the Python 3 version, but several of my mails still manage to cause unhandled exceptions.

These seem to predominantly be mails which have been wrongly encoded - which should have been marked and encoded as UTF, but haven't been.
Python does seem to recognize something is off and shift those characters to 0xFFFD, the Unicode Replacement Character, but nevertheless fails with UnicodeEncodeError: 'ascii' codec can't encode character '\ufffd' in position 13: ordinal not in range(128).

I understand that maildir-deduplicate can't magically know in what particular way a fucked up mail was fucked up and treat the wrong data correctly. That mail was encoded wrongly and that's my problem, not maildir-deduplicate's.

What's annoying me is the lack of handling on the exception.
There are over 4000 mails in that maildir, the vast majority of which are perfectly RFC-compliant, and I can't parse that folder because a handful of mails are screwed and maildir-deduplicate doesn't properly handle it.

I don't expect the software to magically fix broken input. But if I have 1 broken e-mail out of a thousand, I do expect it to just skip the broken one and do the other 999.

Unicode-errors are the most common ones, but it's not limited to that: I've also had one run fail on me because of a missing header in a collection:

  File "/usr/local/lib/python3.4/dist-packages/maildir_deduplicate/deduplicate.py", line 355, in get_lines_from_message_body
    header_text, sep, body = message.as_string().partition("\n\n")
  File "/usr/lib/python3.4/email/message.py", line 159, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.4/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.4/email/generator.py", line 178, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.4/email/generator.py", line 211, in _dispatch
    meth(msg)
  File "/usr/lib/python3.4/email/generator.py", line 269, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.4/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.4/email/generator.py", line 178, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.4/email/generator.py", line 211, in _dispatch
    meth(msg)
  File "/usr/lib/python3.4/email/generator.py", line 269, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.4/email/generator.py", line 112, in flatten
    self._write(msg)
  File "/usr/lib/python3.4/email/generator.py", line 186, in _write
    msg.replace_header('content-transfer-encoding', munge_cte[0])
  File "/usr/lib/python3.4/email/message.py", line 559, in replace_header
    raise KeyError(_name)
KeyError: 'content-transfer-encoding'

The fact that you're using exceptions at all is good. But not handling exceptions is bad, and not handling exceptions in a program designed for batch processing is just wrong.

I will try to hack something up for my local installation and I will submit a patch if I succeed, but this is ultimately a question of design mentality: You are currently placing the burden of dealing with problematic input on the user. You're essentially saying "this program will work fine...if you made sure those 10000 mails you want to scan are all RFC-compliant in advance!".

I do believe it would greatly increase the usefulness of this tool if you expected it to fail on some messages and dealt with that gracefully, instead of just crashing back into the terminal in the middle of processing.
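
The "skip the broken one and do the other 999" behaviour requested here could be sketched like this. The function hash_all, the logger name and the choice of exception types are assumptions for illustration, not code from maildir-deduplicate.

```python
import logging

logger = logging.getLogger("dedup-sketch")


def hash_all(mails, compute_hash):
    """Hash every mail, skipping the ones that raise instead of aborting."""
    hashes, skipped = {}, []
    for mail_id, mail in mails.items():
        try:
            hashes[mail_id] = compute_hash(mail)
        except (UnicodeError, KeyError, ValueError) as error:
            # Record the failure and carry on with the remaining mails.
            logger.warning("Skipping %s: %r", mail_id, error)
            skipped.append(mail_id)
    return hashes, skipped


# Toy stand-in for the real hashing step: it fails on non-ASCII input.
mails = {"good": "plain text", "bad": "broken \ufffd text"}
hashes, skipped = hash_all(mails, lambda mail: mail.encode("ascii"))
print(sorted(hashes), skipped)
```

Skipped mails are never candidates for deletion, so the batch finishes while the problematic files stay untouched for manual inspection.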

Thank you for your efforts. I haven't actually gotten this tool to work yet, but thanks to your work, I at least have a shot at dealing with these mails. I do appreciate the time you're investing.

does not remove duplicates, returns 1

When I run mdedup, the program says it found more than one duplicate, but it does not remove them. An exit code of 1 is returned.

marco@myhost:~$ mdedup --version
mdedup, version 2.1.0
marco@myhost:~$ mdedup deduplicate -s delete-older -t date-header ~/Maildir/
=== Start phase #1: load mails and compute hashes.
Opening maildir at /home/users/marco/Maildir ...
10 mails found.
100%|###########################################################################################################################################################################|
marco@myhost:~$ echo $?
1

I run mdedup 2.1.0 with Python 2.7.9 on Debian 8.8 (stable) that runs Courier IMAP.

Assertion Error at deduplicate.py:596

Failed assert on developer version:

Traceback (most recent call last):
  File "/usr/local/bin/mdedup", line 9, in <module>
    load_entry_point('maildir-deduplicate==2.0.2', 'console_scripts', 'mdedup')()
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/cli.py", line 196, in deduplicate
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 596, in report
AssertionError

ignore non-conforming emails instead of dying

When maildir-deduplicate encounters a mail which does not contain all the headers it wants, it dies with a fatal error, aborting everything. Maildirs do sometimes contain messages without all the wanted fields: it could be corrupted mail, but more often it is spam with headers mangled or missing on purpose. It could also be completely valid mail: RFC 5322, in section 3.6, like its predecessors, requires only the "Date:" and "From:" fields; all others, such as "To" and "Subject", are optional. This makes the script useless, as it dies with "No canonical headers found" or "Not enough data from canonical headers to compute reliable hash" errors. Recovering requires manually inspecting, backing up and moving the problematic mail, then restarting the process from scratch, which is very tedious and impractical, especially if more than a few e-mails are affected.

What I propose is: when we don't have enough headers, or there is a problem with any of them, simply print a warning and then ignore the message (as if it was never in the Maildir), thus playing safe and never even considering it for removal, while allowing the script to continue working on the remaining 99% of mails. The safety level would remain the same, while the script would be able to actually do its work.
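
A rough sketch of this proposal, using the standard library only. The header list, the minimum of two headers and both function names are hypothetical; they are not the values or names used by maildir-deduplicate.

```python
from email import message_from_string

# Hypothetical header list and minimum; not the tool's real values.
CANONICAL_HEADERS = ("date", "from", "to", "subject", "message-id")
MINIMUM_HEADERS = 2


def usable_headers(raw_mail):
    """Return the canonical headers present in a mail, or None if too few."""
    message = message_from_string(raw_mail)
    found = {name: message[name] for name in CANONICAL_HEADERS if message[name]}
    if len(found) < MINIMUM_HEADERS:
        return None
    return found


def select_candidates(raw_mails):
    """Split mails into deduplication candidates and ignored ones."""
    candidates, ignored = [], []
    for raw_mail in raw_mails:
        headers = usable_headers(raw_mail)
        if headers is None:
            print("warning: not enough headers, ignoring mail")
            ignored.append(raw_mail)
        else:
            candidates.append((raw_mail, headers))
    return candidates, ignored


good = "Date: Fri, 28 Feb 2003 09:02:38 +0100\nFrom: a@example.com\n\nhi\n"
bad = "X-Spam: yes\n\nbuy now\n"
candidates, ignored = select_candidates([good, bad])
print(len(candidates), len(ignored))
```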

TypeError: 'NoneType' object has no attribute '__getitem__' round 2

Hi,

thanks for your software. I encountered a bug.

The command:

$ mdedup deduplicate -t date-header -s delete-newer -n .Archives.2001

The version (installed from source):

$ mdedup --version
mdedup, version 2.2.0

OS version

$ cat /etc/issue
Ubuntu Bionic Beaver (development branch) \n \l

The log:

=== Start phase #1: load mails and compute hashes.
Opening maildir at .............Archives.2001 ...
2 mails found.
100%|##########################################################################################################################################################|
=== Start phase #2: deduplicate mails.
The delete-newer strategy will be applied on each duplicate set.
---
Check that mail differences are within the limits.
Traceback (most recent call last):
  File "/usr/local/bin/mdedup", line 11, in <module>
    load_entry_point('maildir-deduplicate==2.2.0', 'console_scripts', 'mdedup')()
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/cli.py", line 209, in deduplicate
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 554, in run
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 206, in dedupe
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 192, in apply_strategy
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 279, in delete_newer
  File "/usr/local/lib/python2.7/dist-packages/boltons/cacheutils.py", line 660, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 99, in oldest_timestamp
  File "/usr/local/lib/python2.7/dist-packages/boltons/cacheutils.py", line 660, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/mail.py", line 82, in timestamp
  File "/usr/lib/python2.7/email/_parseaddr.py", line 154, in mktime_tz
    if data[9] is None:
TypeError: 'NoneType' object has no attribute '__getitem__'

The culprit: I've tracked the bug down to two duplicate mails which are simply missing a date header. If I add the following line to the mails the program completes successfully.

Date: Fri, 28 Feb 2003 09:02:38 +0100

Maybe this case is related to #24?
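
The crash in the traceback above comes from passing the result of a failed date parse straight to mktime_tz. A defensive version might look like the sketch below; the function name date_timestamp is hypothetical, and returning None for an unusable date is just one possible policy.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def date_timestamp(raw_mail):
    """Return the Date header as a UTC timestamp, or None if unusable."""
    header = message_from_string(raw_mail).get("Date")
    if header is None:
        # No Date header at all, as in the reduced example mail.
        return None
    parsed = parsedate_tz(header)
    if parsed is None:
        # Header present but not parseable as a date.
        return None
    return mktime_tz(parsed)


print(date_timestamp("Subject: no date here\n\nbody\n"))
print(date_timestamp("Date: Fri, 28 Feb 2003 09:02:38 +0100\n\nbody\n"))
```

A time-based strategy such as delete-newer could then skip any duplicate set in which a mail has no usable timestamp.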

Example mail (reduced case)

X-Mozilla-Status: 0009
X-Mozilla-Status2: 00000000
X-Account-Key: account3
Subject: 
Message-ID: <000501cf4ab3$da480c70$8ed82550$@Domain>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_NextPart_000_0006_01CF4ABC.3C0C7470"
X-Mailer: Microsoft Outlook 15.0
Content-Language: de
X-MailStoreMapiMimePostProcessor: 8.1.0.9075
X-MailStore-Folder-UTF7: default/Outlook Outlook Data File/Personal
X-MailStore-Message-ID: <[email protected]>
X-MailStore-Header-Hash: 0000000000000000000000000000000000000000
X-MailStore-Date: 20091216134939
X-MailStore-Flags: 0

------=_NextPart_000_0006_01CF4ABC.3C0C7470
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit


....


------=_NextPart_000_0006_01CF4ABC.3C0C7470
Content-Type: text/html;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
....
</BODY>
</HTML>
------=_NextPart_000_0006_01CF4ABC.3C0C7470--

Assertion error: Kept != Unique + Duplicates

Running mdedup deduplicate -n -s delete-older -t date-header . (from my maildir folder) raises an assertion error:

(...)

  File "/home/salexan/.local/lib/python3.6/site-packages/maildir_deduplicate/cli.py", line 209, in deduplicate
    dedup.report()
  File "/home/salexan/.local/lib/python3.6/site-packages/maildir_deduplicate/deduplicate.py", line 597, in report
    self.stats['mail_duplicates'])
AssertionError

This is related to the following numbers in the stats:
Kept │ 3215 │
Unique │ 2416 │
Duplicates │ 3215 │
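
The mismatch can be checked by hand with the numbers reported above. The identity tested is the one named in the issue title (Kept = Unique + Duplicates); the dictionary keys are hypothetical.

```python
# Figures copied from the stats in this issue.
stats = {"kept": 3215, "unique": 2416, "duplicates": 3215}

expected_kept = stats["unique"] + stats["duplicates"]
print(expected_kept)                   # 5631
print(stats["kept"] == expected_kept)  # False
```

With 2416 + 3215 = 5631 against 3215 mails kept, the assertion cannot hold for this run.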

TypeError: splitlines() takes no keyword arguments

I use version 3.0.0, installed via pip on Ubuntu Bionic. That means my default Python interpreter is Python 2.7.17. This leads to the following TypeError when the splitlines function is called.

$ mdedup deduplicate -n ~/Maildir/.DAMAGED20200924.Usenet-Speicher/ ~/Maildir/.Usenet-Speicher/
=== Phase #1: load mails and compute hashes.
Opening /home/leggewie/Maildir/.DAMAGED20200924.Usenet-Speicher as a maildir...
35 mails found.
100%|################################################################################################################|
Opening /home/leggewie/Maildir/.Usenet-Speicher as a maildir...
35 mails found.
100%|################################################################################################################|
=== Phase #2: deduplicate mails.
warning: No removal strategy will be applied.
--- 2 mails sharing hash 38deda92254b1c34c6a538a2737de8904e636f29b96865242bc27681
Check that mail differences are within the limits.
Traceback (most recent call last):
  File "/home/leggewie/.local/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/home/leggewie/.local/lib/python2.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/leggewie/.local/lib/python2.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/leggewie/.local/lib/python2.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/leggewie/.local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/leggewie/.local/lib/python2.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/leggewie/.local/lib/python2.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/leggewie/.local/lib/python2.7/site-packages/mail_deduplicate/cli.py", line 217, in deduplicate
    dedup.run()
  File "/home/leggewie/.local/lib/python2.7/site-packages/mail_deduplicate/deduplicate.py", line 579, in run
    duplicates.check_differences()
  File "/home/leggewie/.local/lib/python2.7/site-packages/mail_deduplicate/deduplicate.py", line 132, in check_differences
    size_difference = abs(mail_a.size - mail_b.size)
  File "/home/leggewie/.local/lib/python2.7/site-packages/boltons/cacheutils.py", line 610, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/home/leggewie/.local/lib/python2.7/site-packages/mail_deduplicate/mail.py", line 118, in size
    return len(''.join(self.body_lines))
  File "/home/leggewie/.local/lib/python2.7/site-packages/boltons/cacheutils.py", line 610, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/home/leggewie/.local/lib/python2.7/site-packages/mail_deduplicate/mail.py", line 159, in body_lines
    body.extend(part_body.splitlines(keepends=True))
TypeError: splitlines() takes no keyword arguments
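
For context: on Python 2, `str.splitlines` accepts `keepends` only as a positional argument, while Python 3 also accepts it as a keyword. A minimal sketch of a call that behaves the same on both interpreters (the sample string is made up for illustration):

```python
# Hypothetical sample body; the real input is the decoded mail part.
part_body = "first line\nsecond line\n"

# Passing keepends positionally works on Python 2 and Python 3 alike,
# whereas splitlines(keepends=True) is rejected by Python 2.
lines = part_body.splitlines(True)
print(lines)
```

This only sidesteps the one call shown in the traceback; it does not say whether the rest of the package is meant to run on Python 2.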

Replace progressbar2 by Click's

progressbar2 has been added in f5275e0. It is a nice addition and progressbar2 itself is a cool package.

Since we only use the basic features of progressbar2 and we already depend on click, I propose to directly use Click's built-in progress bar instead.

This is not a high-priority task, it is not blocking releases. It's just a nice-to-have to reduce the number of dependencies and avoid duplicating features.

Move mails instead of deleting them

The CLI deletes duplicates. This feature request proposes to allow the user to choose another course of action, and move the duplicate mails elsewhere.

The first simple implementation that comes to mind would be to pass an option to the CLI to set the type (maildir/mbox) and the location of a brand new, non-existing destination path. Then move duplicate mails there.

Based on a suggestion by @vikasrawal at #97.
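
A rough sketch of the idea, using only the standard-library `mailbox` module. The function name `move_messages` and the demo paths are hypothetical, and only the maildir case is covered; this is not how the CLI implements it.

```python
import mailbox
import os
import tempfile


def move_messages(source_path, keys, destination_path):
    """Move the mails identified by `keys` into a brand new maildir."""
    if os.path.exists(destination_path):
        raise ValueError("destination already exists: %s" % destination_path)
    source = mailbox.Maildir(source_path, create=False)
    destination = mailbox.Maildir(destination_path, create=True)
    moved = 0
    for key in keys:
        # Copy first, then remove, so a failed copy never loses a mail.
        destination.add(source.get_message(key))
        source.remove(key)
        moved += 1
    destination.flush()
    source.flush()
    return moved


# Demo on a throwaway maildir with two made-up mails.
demo_root = tempfile.mkdtemp()
demo_source = os.path.join(demo_root, "source")
demo_destination = os.path.join(demo_root, "duplicates")
box = mailbox.Maildir(demo_source, create=True)
kept_key = box.add("Subject: original\n\nbody\n")
duplicate_key = box.add("Subject: original\n\nbody\n")
box.flush()

moved_count = move_messages(demo_source, [duplicate_key], demo_destination)
remaining = len(mailbox.Maildir(demo_source, create=False))
relocated = len(mailbox.Maildir(demo_destination, create=False))
```

Refusing an existing destination keeps the "brand new, non-existing path" constraint from the proposal explicit.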

Crashes on some Maildirs

Hi,

I use maildir-deduplicate as of commit 33e5108 with python 2.6.6.
On some maildirs the script crashes like that:

[user@host maildir-deduplicate]$ python maildir-deduplicate.py -d ../.randomfolder/
Processing 8583 mails in ~/Maildir/.randomfolder ........................Beendet
[user@host maildir-deduplicate]$

(where Beendet is German and means Terminated or Stopped etc.)

It also happens on some maildirs which do not contain more than a few hundred e-mails, so the number of mails seems to be irrelevant to this.

I am not a Python expert, so I would love to get any hints on how to debug that one.

MemoryError

I get a memory error on a maildir with 838 emails:

Processing 838 mails in info/Maildir/.INBOX.leveranciers ...Trimmed Subject to Betaling ontvangen
Trimmed Subject to Bestelbevestiging
....Trimmed Subject to Welkom bij Golfsets!
Traceback (most recent call last):
  File "maildir-deduplicate.py", line 554, in <module>
    main()
  File "maildir-deduplicate.py", line 500, in main
    duplicates_run(opts, maildir_paths)
  File "maildir-deduplicate.py", line 518, in duplicates_run
    mail_count += collate_folder_by_hash(mails_by_hash, maildir, opts.message_id)
  File "maildir-deduplicate.py", line 299, in collate_folder_by_hash
    for mail_id, message in mail_folder.iteritems():
  File "/usr/lib/python2.7/mailbox.py", line 126, in iteritems
    value = self[key]
  File "/usr/lib/python2.7/mailbox.py", line 82, in __getitem__
    return self.get_message(key)
  File "/usr/lib/python2.7/mailbox.py", line 337, in get_message
    msg = MaildirMessage(f)
  File "/usr/lib/python2.7/mailbox.py", line 1427, in __init__
    Message.__init__(self, message)
  File "/usr/lib/python2.7/mailbox.py", line 1399, in __init__
    self._become_message(email.message_from_file(message))
  File "/usr/lib/python2.7/email/__init__.py", line 66, in message_from_file
    return Parser(*args, **kws).parse(fp)
  File "/usr/lib/python2.7/email/parser.py", line 71, in parse
    feedparser.feed(data)
  File "/usr/lib/python2.7/email/feedparser.py", line 157, in feed
    self._call_parse()
  File "/usr/lib/python2.7/email/feedparser.py", line 161, in _call_parse
    self._parse()
  File "/usr/lib/python2.7/email/feedparser.py", line 375, in _parsegen
    payload = payload[:-len(mo.group(0))]
MemoryError

Automatically keep usage text in sync

I'm not sure it's a good idea to copy'n'paste the usage text into the README. Violating the DRY rule makes it uncomfortably easy to forget to update the README whenever the usage text changes in the code, and then the two get out of sync. But if you are convinced neither you nor any future maintainer will ever forget, then I guess you can ignore me and close this ;-)
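
One way to catch such drift is a test that compares the live help text with the README. The sketch below is an assumption about how that could look, not something the project ships; `usage_in_sync` and `_normalize` are hypothetical names.

```python
import re


def _normalize(text):
    # Collapse whitespace so indentation and line wrapping do not matter.
    return re.sub(r"\s+", " ", text).strip()


def usage_in_sync(readme_text, help_text):
    """Return True when the CLI help text appears verbatim in the README."""
    return _normalize(help_text) in _normalize(readme_text)


# In a test, help_text could come from running the CLI, for example:
#   subprocess.run(["mdedup", "--help"], capture_output=True, text=True).stdout
# and readme_text from reading the README file.
```

A failing check then points at a stale README instead of relying on a maintainer's memory.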

Python 3 - TypeError: 'cmp' is an invalid keyword argument for this function

This is with:

thunderstorm murray # python --version
Python 3.5.1
thunderstorm murray # mdedup --version
mdedup, version 1.2.0

The system is a Gentoo Linux x86_64 system.

I am testing with a small Maildir which I know has some duplicates in it, this is what I'm seeing when I run mdedup:

thunderstorm murray # mdedup -v  DEBUG deduplicate -n Maildir
Processing 35 mails in /home/murray/Maildir
debug: Hash is 2432b264412f83b743f76820494027b6920b6d5822d6c6d2a472d7bc for mail '1466129749.M534299P7950.thunderstorm,S=503405,W=509974'.
debug: Hash is 03b7105677137585e934bf43b7de418b2f9638ed2efe1a3da6029e32 for mail '1466129749.M534325P7950.thunderstorm,S=5053,W=5142'.
debug: Hash is a258a1c6dc1204b907e381169cd1ad0dec65c77a36886f4184933f10 for mail '1466129749.M534304P7950.thunderstorm,S=14792,W=15108'.
debug: Hash is 28ad5b710b388667b0bf0ee847e2cf7174e15f5d823d566d68256dda for mail '1466129749.M534306P7950.thunderstorm,S=8607,W=8732'.
debug: Hash is 624d4ea689096d26b3de13bf3f54dd2f02635c5a9b79797af3978d97 for mail '1466129749.M534295P7950.thunderstorm,S=41767,W=42408'.
debug: Hash is 2fb8ad807a3197cf6837551ca428dadb74821a9b93f96e2eb6415ec5 for mail '1466129749.M534307P7950.thunderstorm,S=26740,W=27240'.
debug: Hash is b720bef3e4b27ce0f703ee13941532478e24d9e713b07d9903ab3a05 for mail '1466129749.M534293P7950.thunderstorm,S=5898,W=6043'.
debug: Hash is 16192e2336bf9919991257e3b4d00de7eca399d6e69f3ef6dd84673b for mail '1466129749.M534298P7950.thunderstorm,S=42484,W=43049'.
debug: Hash is 3160628605746bc50097d8742379f3a5c3ab6e8dd589d3c2eddffa31 for mail '1466129749.M534291P7950.thunderstorm,S=973277,W=985987'.
debug: Hash is e18bcdc38cb10701ec0a56eb85f6d8383e27abcb67b4bd3f09c3c48d for mail '1466129749.M534294P7950.thunderstorm,S=4654,W=4716'.
debug: Hash is b011ffbfd46f40a349979b05347c6eb51f84ed38e8b5ea8d4908c886 for mail '1466129749.M534303P7950.thunderstorm,S=48249,W=48921'.
debug: Hash is d3f39b687a4eebfa17ddaedda8eb39736d623748f956313a7d9a5d30 for mail '1466129749.M534292P7950.thunderstorm,S=338951,W=343443'.
debug: Hash is 5033048eaeec5b379e9c843aa0e0ce57e031fbc647b22a627544766a for mail '1466129749.M534316P7950.thunderstorm,S=9215,W=9358'.
debug: Hash is 6525592585c5e47b62383082f213af2c714ab1dd7d8b0ae607eaa733 for mail '1466129749.M534323P7950.thunderstorm,S=119429,W=121141'.
debug: Hash is 5549def4c02ea0e1210f002e967781a3bc8c2bece635f004336abc89 for mail '1466129749.M534297P7950.thunderstorm,S=38415,W=39209'.
debug: Hash is 664e868121a40651be43b054c6c8a0bf2b3a46bdfa17fa6fc1a63c77 for mail '1466129749.M534311P7950.thunderstorm,S=1112515,W=1126976'.
debug: Hash is 748772fb4e7a3d32acaf47f40496d73867aa41fa41c386001625e1e2 for mail '1466129749.M534296P7950.thunderstorm,S=101196,W=102544'.
debug: Hash is 7fedb8343a6735bbdda8d01d208e6204497a41f4b219788d70544f81 for mail '1466129749.M534319P7950.thunderstorm,S=3741134,W=3789766'.
debug: Hash is 5033048eaeec5b379e9c843aa0e0ce57e031fbc647b22a627544766a for mail '1466129749.M534317P7950.thunderstorm,S=7193,W=7310'.
debug: Hash is 79ff9e78a43a8a4c54eae798d64c6fc66d2cc4c67677f38816fb0929 for mail '1466129749.M534301P7950.thunderstorm,S=25528,W=25993'.
debug: Hash is 8a82df1dbd771a67bb7ee49c202a3578b30a82012dc952329962779b for mail '1466129749.M534300P7950.thunderstorm,S=5827,W=5925'.
debug: Hash is 3d678c15b79cc82e08ded13806b9a46a85df10f8f8a0fd140a51fca8 for mail '1466129749.M534305P7950.thunderstorm,S=28634,W=29143'.
debug: Hash is 4db378bed3f7b340f7d04234991504cf113cb1c91bc4a568fe412994 for mail '1466129749.M534308P7950.thunderstorm,S=50835,W=51542'.
debug: Hash is fce21633281251a5e7a697fde97e52ba770ff30b0439abdf1ac1b8a1 for mail '1466129749.M534302P7950.thunderstorm,S=33100,W=33558'.
debug: Hash is 0f781e5d7632351335e2386c68535b0ad1e7120e065c60ac0bf65516 for mail '1466129749.M534321P7950.thunderstorm,S=7406134,W=7502437'.
debug: Hash is 23e8315bb0f1cf1d29c1987efe7cf53a32fbb7338fb89d34073de595 for mail '1466129749.M534324P7950.thunderstorm,S=8175924,W=8282150'.
debug: Hash is 83afd5fc7a59445061f87b4e1ec793ca7452cf13696437fa9ce09a1b for mail '1466129749.M534314P7950.thunderstorm,S=5896,W=5995'.
debug: Hash is 84c6f6e223fec5a208aaab354e1b022ce0443cddf1c6651597f84f04 for mail '1466129749.M534322P7950.thunderstorm,S=9247928,W=9368076'.
debug: Hash is 1ae2e0a13bd557ffd75a7610c7003a109d60fbfb14701d72525cf5ad for mail '1466129749.M534309P7950.thunderstorm,S=36576,W=37079'.
debug: Hash is 664e868121a40651be43b054c6c8a0bf2b3a46bdfa17fa6fc1a63c77 for mail '1466129749.M534313P7950.thunderstorm,S=445424,W=451222'.
debug: Hash is 8433ac85e3449f0611e40dcaeb4e9465b57ae2c0b9a8c8327579fb2f for mail '1466129749.M534318P7950.thunderstorm,S=29628,W=30231'.
debug: Hash is 7fedb8343a6735bbdda8d01d208e6204497a41f4b219788d70544f81 for mail '1466129749.M534320P7950.thunderstorm,S=3741134,W=3789766'.
debug: Hash is 5033048eaeec5b379e9c843aa0e0ce57e031fbc647b22a627544766a for mail '1466129749.M534315P7950.thunderstorm,S=11217,W=11386'.
debug: Hash is 664e868121a40651be43b054c6c8a0bf2b3a46bdfa17fa6fc1a63c77 for mail '1466129749.M534310P7950.thunderstorm,S=620741,W=628816'.
debug: Hash is 664e868121a40651be43b054c6c8a0bf2b3a46bdfa17fa6fc1a63c77 for mail '1466129749.M534312P7950.thunderstorm,S=469343,W=475452'.
Subject: Prayers of the faithfull
Traceback (most recent call last):
  File "/usr/bin/mdedup", line 9, in <module>
    load_entry_point('maildir-deduplicate==1.2.0', 'console_scripts', 'mdedup')()
  File "/usr/lib64/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib64/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib64/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib64/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib64/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib64/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib64/python3.5/site-packages/maildir_deduplicate/cli.py", line 146, in deduplicate
    dedup.run()
  File "/usr/lib64/python3.5/site-packages/maildir_deduplicate/deduplicate.py", line 229, in run
    sorted_messages_size = self.size_sort(messages)
  File "/usr/lib64/python3.5/site-packages/maildir_deduplicate/deduplicate.py", line 343, in size_sort
    sizes.sort(cmp=_sort_by_size)
TypeError: 'cmp' is an invalid keyword argument for this function
thunderstorm murray # 
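
Python 3 removed the `cmp` keyword from `list.sort()` and `sorted()`. A minimal sketch of the two usual replacements follows; the sample data and the body of `_sort_by_size` are stand-ins, since the real comparison function is not shown in the traceback.

```python
from functools import cmp_to_key

# Hypothetical (size, name) pairs standing in for the real duplicate set.
sizes = [(5053, "b"), (503405, "a"), (14792, "c")]

# Option 1: replace the comparison function with a key function.
sizes.sort(key=lambda item: item[0])
print(sizes)


# Option 2: keep an existing comparison function and wrap it.
def _sort_by_size(a, b):
    # Classic cmp contract: negative, zero or positive.
    return (a[0] > b[0]) - (a[0] < b[0])


largest_first = sorted(sizes, key=cmp_to_key(_sort_by_size), reverse=True)
print(largest_first)
```

Both forms also run on Python 2.7, so they do not force dropping the older interpreter.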

TypeError: expected string or buffer

The old script runs, but it is not very clear about what it's doing. It failed with the following error after a run on some Maildir:

 Traceback (most recent call last):
   File "./maildir-deduplicate.py", line 204, in <module>
     main()
   File "./maildir-deduplicate.py", line 200, in main
    duplicates = findDuplicates(mails_by_hash, delete)
   File "./maildir-deduplicate.py", line 107, in findDuplicates
    subject, count = re.subn('\s+', ' ', subject)
   File "/usr/lib64/python2.6/re.py", line 162, in subn
    return _compile(pattern, 0).subn(repl, string, count)
  TypeError: expected string or buffer

Reported by @jult at #21 (comment)
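
One plausible cause, not confirmed by the report, is a mail without a `Subject` header: looking up a missing header returns `None`, which `re.subn()` rejects. A minimal sketch of a guard, using a made-up message:

```python
import email
import re

# A made-up message that has no Subject header at all.
message = email.message_from_string("From: a@example.com\n\nbody\n")

# message["Subject"] would be None here; fall back to an empty string.
subject = message.get("Subject", "")
subject, count = re.subn(r"\s+", " ", subject)
print(repr(subject), count)
```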

UnicodeDecodeError

I tried to use mdedup on my maildir with 8276 mails and got the following error:

Traceback (most recent call last):
  File "/home/user/bin/mdedup", line 11, in <module>
    sys.exit(cli())
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/user/.local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/cli.py", line 139, in deduplicate
    dedup.add_maildir(maildir)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 80, in add_maildir
    mail_file, message, self.use_message_id)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 103, in compute_hash
    canonical_headers_text = cls.canonical_headers(mail_file, message)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 125, in canonical_headers
    canonical_value = cls.canonical_header_value(header, value)
  File "/home/user/.local/lib/python2.7/site-packages/maildir_deduplicate/deduplicate.py", line 148, in canonical_header_value
    value = re.sub('\s+', ' ', value).strip()
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 4: ordinal not in range(128)

Unfortunately, I can't identify the message that causes this error. mdedup -v doesn't show more information. How can I find the problematic message?
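
Until the tool reports the offending file itself, one workaround is to walk the maildir by hand and record the key of every mail on which a processing step raises. This is a sketch under assumptions: `find_failing_keys` and `require_date` are hypothetical helpers, and `require_date` merely stands in for whatever step crashes.

```python
import mailbox
import os
import tempfile


def find_failing_keys(box, check):
    """Run `check(message)` on every mail and collect the keys that raise."""
    failures = []
    for key in box.iterkeys():
        try:
            check(box.get_message(key))
        except Exception as error:
            failures.append((key, repr(error)))
    return failures


def require_date(message):
    # Hypothetical check standing in for the processing step that crashes.
    if message["Date"] is None:
        raise ValueError("no Date header")


# Demo on a throwaway maildir holding one good and one bad mail.
demo_path = os.path.join(tempfile.mkdtemp(), "Maildir")
demo_box = mailbox.Maildir(demo_path, create=True)
good_key = demo_box.add("Date: Thu, 01 Jan 2015 00:00:00 +0000\n\nok\n")
bad_key = demo_box.add("Subject: no date\n\nbroken\n")

failures = find_failing_keys(demo_box, require_date)
for key, error in failures:
    print(key, error)
```

In a maildir, the key is the leading part of the file name, so it is enough to locate the file under `cur/` or `new/`.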

Large memory usage

Hi,

I'm running maildir-deduplicate on a very large maildir (> 8GB) on Debian 7.0, and I found that maildir-deduplicate is eating a large amount of memory, almost the same size as the maildir.

It looks like maildir-deduplicate loads every mail file into memory, which causes a lot of pain in the case of a very large mailbox. It would be better if we just stored a hash or message-id in memory.
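
The suggestion can be sketched as an index that retains only a digest and the mail key. The header list and the SHA-224 choice below are assumptions for illustration; `index_by_hash` is a hypothetical name, not the project's code.

```python
import hashlib
import mailbox
import os
import tempfile
from collections import defaultdict


def index_by_hash(box, headers=("date", "from", "to", "message-id")):
    """Map a header digest to the keys of the mails sharing it."""
    index = defaultdict(list)
    for key in box.iterkeys():
        message = box.get_message(key)
        canonical = "\n".join(
            "%s: %s" % (name, message.get(name, "")) for name in headers
        )
        digest = hashlib.sha224(canonical.encode("utf-8", "replace")).hexdigest()
        # Only the digest and the key are kept; the parsed message goes
        # out of scope at the end of each iteration.
        index[digest].append(key)
    return index


# Demo: two identical made-up mails and one distinct mail.
demo_box = mailbox.Maildir(os.path.join(tempfile.mkdtemp(), "Maildir"), create=True)
raw = "Date: Thu, 01 Jan 2015 00:00:00 +0000\nFrom: a@example.com\n\nhello\n"
demo_box.add(raw)
demo_box.add(raw)
demo_box.add("Date: Fri, 02 Jan 2015 00:00:00 +0000\nFrom: b@example.com\n\nbye\n")

group_sizes = sorted(len(keys) for keys in index_by_hash(demo_box).values())
print(group_sizes)
```

Memory then grows with the number of mails rather than with their total size; bodies can be re-read on demand for the duplicate sets that need a content check.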

Subfolders are not processed

Only 69 emails are processed, although I have 30k+ in my .Maildir. Those 69 are just in my INBOX folder; the rest are in subfolders, which leads me to assume that subfolders are not processed.

Deduplication command on .Maildir

$ mdedup --hash-header date --hash-header from --hash-header to --hash-header message-id --strategy select-smallest --dry-run --export .Maildir.tmp .Maildir
โ— Phase #0 - Load mails
Opening /(...)/.Maildir ...
maildir detected.
69 mails found.
(...)

File count in .Maildir

$ find .Maildir -type f | wc -l
36674

All data on execution context as provided by $ mdedup --version:

$ mdedup --version
mdedup 6.0.1
{'username': '-', 'guid': '77601cc7d08d280102bb709be48d881', 'hostname': '-', 'hostfqdn': '-', 'uname': {'system': 'Linux', 'node': '-', 'release': '3.10.105', 'version': '#25426 SMP Wed Jul 8 03:10:21 CST 2020', 'machine': 'armv7l', 'processor': ''}, 'linux_dist_name': '', 'linux_dist_version': '', 'cpu_count': 2, 'fs_encoding': 'utf-8', 'ulimit_soft': 1024, 'ulimit_hard': 4096, 'cwd': '-', 'umask': '0o2', 'python': {'argv': '-', 'bin': '-', 'version': '3.8.2 (tags/Contacts-1.0.0-0232-200617:57e5f51, Jun 29 2020, 09:34:08) [GCC 4.9.3 20150311 (prerelease)]', 'compiler': 'GCC 4.9.3 20150311 (prerelease)', 'build_date': 'Jun 29 2020 09:34:08', 'version_info': [3, 8, 2, 'final', 0], 'features': {'openssl': 'OpenSSL 1.0.2u-fips  20 Dec 2019', 'expat': 'expat_2.2.1', 'sqlite': '3.10.2', 'tkinter': '', 'zlib': '1.2.8', 'unicode_wide': True, 'readline': True, '64bit': False, 'ipv6': True, 'threading': True, 'urandom': True}}, 'time_utc': '2020-11-04 00:19:10.939181', 'time_utc_offset': 1.0, '_eco_version': '1.0.1'}
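
For reference, the standard-library `mailbox.Maildir` only iterates over the `cur/` and `new/` directories of the folder it opens; Maildir++ subfolders have to be visited explicitly. A minimal sketch, assuming the Maildir++ layout (dot-prefixed subfolder directories) and using a hypothetical helper name `iter_maildirs`:

```python
import mailbox
import os
import tempfile


def iter_maildirs(root):
    """Yield (name, maildir) for the root folder and its Maildir++ subfolders."""
    yield "", root
    for name in root.list_folders():
        yield name, root.get_folder(name)


# Demo: one mail in the inbox, two in a made-up "Archive" subfolder.
demo_root = mailbox.Maildir(os.path.join(tempfile.mkdtemp(), "Maildir"), create=True)
demo_root.add("Subject: inbox\n\n1\n")
archive = demo_root.add_folder("Archive")
archive.add("Subject: archived\n\n2\n")
archive.add("Subject: archived too\n\n3\n")

counts = {name: len(box) for name, box in iter_maildirs(demo_root)}
print(counts)
```

Whether the 36674 files in the report live in such subfolders is an assumption; a different on-disk layout would need a different walk.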

Use Flanker to improve parsing of badly encoded mail

Detection and skipping of badly encoded mails has been added in #47. We can go further and improve parsing of these mails.

Flanker is a good effort to better parse email content in Python: https://github.com/mailgun/flanker . The idea would be to reuse its parsing utilities.

Unfortunately, Flanker doesn't target Python 3: mailgun/flanker#106. Some efforts are being made on side branches and forks to make Flanker Python 3 ready: mailgun/flanker#106 (comment). But reconciliation of these forks is unlikely to happen. 😢

TypeError: apply_strategy() takes 1 positional argument but 2 were given

I'm using version 3.0.0 of mdedup. Thank you so much for providing this tool.

My Python-fu is not that strong, but I believe I've come across a genuine bug and have a rough idea where it is coming from. The apply_strategy function in the DuplicateSet class accepts no positional argument other than self, yet it is being called with self.conf.strategy as a parameter. Kindly have a look.

$ mdedup deduplicate -n -s delete-matching-path -rDAMAGED2020 ~/Maildir/.DAMAGED20200924.Trash/ ~/Maildir/.Trash/
[...]...
=== Phase #2: deduplicate mails.
The delete-matching-path strategy will be applied on each duplicate set.
--- 2 mails sharing hash bb2bc9037f04dbdc50519fc7ce88e7694fcb2ececbf49a3fb9d37f5b
Check that mail differences are within the limits.
Traceback (most recent call last):
[...]
  File "/home/leggewie/.local/lib/python3.6/site-packages/mail_deduplicate/deduplicate.py", line 593, in run
    duplicates.apply_strategy(self.conf.strategy)
TypeError: apply_strategy() takes 1 positional argument but 2 were given
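
The mismatch described above can be reproduced in isolation. The class below is a stripped-down stand-in written for this illustration, not the project's real `DuplicateSet`:

```python
class DuplicateSet:
    """Stripped-down stand-in; not the project's real class."""

    def __init__(self, strategy):
        self.strategy = strategy

    def apply_strategy(self):
        # Accepts no argument besides self, as described in the report.
        return self.strategy


duplicates = DuplicateSet("delete-matching-path")
try:
    # Passing the strategy explicitly makes two positional arguments.
    duplicates.apply_strategy("delete-matching-path")
except TypeError as error:
    message = str(error)
    print(message)
```

Either the method signature or the call site has to change; the report alone does not say which one was intended.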

discussion on stackoverflow

Traceback on parsing email without info which email caused it.

I'm deduping a large maildir.

Command line:

mdedup -n -S 0 -C 0 -a move-selected -e maildir -E ~/mail/Maildir/.GMAllMailNonDupe -s select-oldest ~/mail/Maildir/

Output ends with:

Check mail differences are below the thresholds.
Select all mails sharing the oldest 1299180509 timestamp...
warning: Skip set: all 5 mails within were selected. The strategy criterion was not able to discard some.
◼ 7 mails sharing hash faad932b06e1bed2520658105bcae3d483acf05fd605db23f365adc6
Check mail differences are below the thresholds.
Traceback (most recent call last):
  File "/usr/local/bin/mdedup", line 8, in <module>
    sys.exit(mdedup())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/cli.py", line 385, in mdedup
    dedup.select_all()
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/deduplicate.py", line 417, in select_all
    candidates = duplicates.select_candidates()
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/deduplicate.py", line 277, in select_candidates
    selected = apply_strategy(self.conf.strategy, self)
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/strategy.py", line 262, in apply_strategy
    return set(method(duplicates))
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/strategy.py", line 49, in select_oldest
    f"Select all mails sharing the oldest {duplicates.oldest_timestamp} "
  File "/usr/local/lib/python3.9/site-packages/boltons/cacheutils.py", line 610, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/deduplicate.py", line 156, in oldest_timestamp
    return min(map(attrgetter("timestamp"), self.pool))
  File "/usr/local/lib/python3.9/site-packages/boltons/cacheutils.py", line 610, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/lib/python3.9/site-packages/mail_deduplicate/mail.py", line 118, in timestamp
    value = email.utils.mktime_tz(email.utils.parsedate_tz(value))
  File "/usr/local/Cellar/[email protected]/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/email/_parseaddr.py", line 185, in mktime_tz
    if data[9] is None:
TypeError: 'NoneType' object is not subscriptable

It looks to me like there's a defective email, but how am I supposed to figure out which one?
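
The traceback is consistent with `email.utils.parsedate_tz()` returning `None` for a `Date` header it cannot parse, and that `None` then being handed to `mktime_tz()`. A minimal sketch of a guard; `safe_timestamp` is a hypothetical helper, not part of mdedup:

```python
import email.utils


def safe_timestamp(date_header):
    """Return a UTC timestamp, or None when the Date header is unusable."""
    if not date_header:
        return None
    parsed = email.utils.parsedate_tz(date_header)
    if parsed is None:
        # parsedate_tz() returns None for values it cannot parse, and
        # mktime_tz(None) is what raises the TypeError seen above.
        return None
    return email.utils.mktime_tz(parsed)


print(safe_timestamp("Thu, 01 Jan 1970 00:00:10 +0000"))
print(safe_timestamp("not a date"))
```

A caller that receives `None` can log the path of the mail it was processing, which answers the "which one?" question without a crash.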

latest version crashes with "Not enough data from canonical headers to compute reliable hash"

WARNING: ignoring problematic /var/qmail/mailnames/xx.nl/xx/Maildir/new/1436223346.M611808P3746.xx.xx.eu,S=321,W=328: Not enough data from canonical headers to compute reliable hash!
Headers:
--------- 8< --------- 8< --------- 8< --------- 8< --------- 8< ---------
Message-ID: <8[10
--------- 8< --------- 8< --------- 8< --------- 8< --------- 8< ---------

..Traceback (most recent call last):
  File "./maildir-deduplicate.py", line 616, in <module>
    main()
  File "./maildir-deduplicate.py", line 560, in main
    duplicates_run(opts, maildir_paths)
  File "./maildir-deduplicate.py", line 578, in duplicates_run
    mail_count += collate_folder_by_hash(mails_by_hash, maildir, opts.message_id)
  File "./maildir-deduplicate.py", line 331, in collate_folder_by_hash
    mail_hash, header_text = compute_hash_key(mail_file, message, use_message_id)
  File "./maildir-deduplicate.py", line 315, in compute_hash_key
    canonical_headers_text = get_canonical_headers(mail_file, message)
  File "./maildir-deduplicate.py", line 226, in get_canonical_headers
    canonical_value = get_canonical_header_value(header, value)
  File "./maildir-deduplicate.py", line 285, in get_canonical_header_value
    utc_timestamp = email.utils.mktime_tz(parsed)
  File "/usr/lib64/python2.6/email/_parseaddr.py", line 142, in mktime_tz
    if data[9] is None:
TypeError: 'NoneType' object is unsubscriptable

The error is not very descriptive to me, what does it mean?

Crash TypeError: 'NoneType' object has no attribute '__getitem__'

I am getting a reproducible error:

I use this command
python init.py -o /path-to-my-dir/getmaildir/

This is the whole output (yes, there are a lot of mails in that dir)

Processing 460581 mails in /mnt/rraid/COMPRESSED/Backup/Mails/getmaildir .........................Traceback (most recent call last):
  File "__init__.py", line 572, in <module>
    main()
  File "__init__.py", line 516, in main
    duplicates_run(opts, maildir_paths)
  File "__init__.py", line 534, in duplicates_run
    mail_count += collate_folder_by_hash(mails_by_hash, maildir, opts.message_id)
  File "__init__.py", line 287, in collate_folder_by_hash
    mail_hash, header_text = compute_hash_key(mail_file, message, use_message_id)
  File "__init__.py", line 271, in compute_hash_key
    canonical_headers_text = get_canonical_headers(mail_file, message)
  File "__init__.py", line 182, in get_canonical_headers
    canonical_value = get_canonical_header_value(header, value)
  File "__init__.py", line 241, in get_canonical_header_value
    utc_timestamp = email.utils.mktime_tz(parsed)
  File "/usr/lib/python2.7/email/_parseaddr.py", line 154, in mktime_tz
    if data[9] is None:
TypeError: 'NoneType' object has no attribute '__getitem__'

any ideas?

(update: the error message)

python regexp.pattern error

--- ./maildir-deduplicate.py.old    2013-05-09 13:03:03.603295769 +0200
+++ ./maildir-deduplicate.py    2013-05-09 12:19:31.000000000 +0200
@@ -385,10 +385,10 @@
     if len(doomed) == len(duplicate_set):
         if opts.remove_matching:
             sys.stderr.write("/%s/ matched whole set; not removing any duplicates.\n" %
-                             opts.remove_matching.pattern)
+                             opts.remove_matching)
         elif opts.remove_not_matching:
             sys.stderr.write("/%s/ matched whole set; not removing any duplicates.\n" %
-                             opts.remove_not_matching.pattern)
+                             opts.remove_not_matching)
         else:
             fatal("BUG: removal strategy tried to remove all duplicates in set!")
         return { }

Message sizes not being calculated correctly?

I'm using the -S switch to specify a maximum allowed difference between the size of duplicates. However, I'm noticing some weird behavior.

I have a mailbox where two messages are exactly the same, with the only thing differing being the filename. When deduplication is run (with -o (remove older dupes), -S 0, and -D 0), I see the following in logs for a number of files.

For hash key xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, sizes differ by 2532 > 0 bytes:
   1406217655 /var/mail/xxx/xxx/Maildir/xxx/cur/1406064211_4.1556,U=263,FMD5=abcdefg:2,S
   1406220187 /var/mail/xxx/xxx/Maildir/xxx/cur/1406216514_6.1664,U=263,FMD5=abcdefg:2,

However, when I do a diff and ls of these files, they are identical with the exact same file size. What could be the issue here?

I saw the commented line in sort_messages_by_size where now the size is being determined from the message_body rather than using the size in the OS, but I still don't see how two different file sizes can be calculated if both files have exactly the same content.
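
One reading of the log, offered as a hypothesis rather than a diagnosis: the two leading numbers differ by exactly the reported 2532, and both decode to plausible dates, so the check may be comparing timestamps instead of sizes. The arithmetic can be verified as follows:

```python
from datetime import datetime, timezone

# The two leading numbers from the log lines quoted above.
first, second = 1406217655, 1406220187

difference = second - first
as_dates = [
    datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    for value in (first, second)
]
print(difference)
print(as_dates)
```

Both values fall on 24 July 2014, the same period as the timestamps embedded in the two file names.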

Cache hashes on filesystem

When hashing about 20,000 to 30,000 mails, mdedup used up 2 GB of RAM. I think it would be a good idea to offload storing the hashes to a temporary file, one that could potentially be reused on a subsequent run.
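
A minimal sketch of such a cache, assuming a JSON file keyed by mail path and invalidated by size and modification time. All names are hypothetical, and the whole file is hashed for brevity, whereas mdedup hashes selected headers.

```python
import hashlib
import json
import os
import tempfile


def load_cache(cache_path):
    """Read the cache file, or start empty when it does not exist yet."""
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            return json.load(handle)
    return {}


def save_cache(cache_path, cache):
    with open(cache_path, "w") as handle:
        json.dump(cache, handle)


def cached_digest(path, cache):
    """Return the file digest, recomputing only when size or mtime changed."""
    stat = os.stat(path)
    signature = [stat.st_size, stat.st_mtime_ns]
    entry = cache.get(path)
    if entry and entry["signature"] == signature:
        return entry["digest"]
    with open(path, "rb") as handle:
        digest = hashlib.sha224(handle.read()).hexdigest()
    cache[path] = {"signature": signature, "digest": digest}
    return digest


# Demo: hash a made-up mail, persist the cache, then reuse it.
workdir = tempfile.mkdtemp()
mail_path = os.path.join(workdir, "mail")
with open(mail_path, "wb") as handle:
    handle.write(b"Subject: hi\n\nbody\n")
cache_path = os.path.join(workdir, "hashes.json")

cache = load_cache(cache_path)
first_digest = cached_digest(mail_path, cache)
save_cache(cache_path, cache)

reloaded = load_cache(cache_path)
second_digest = cached_digest(mail_path, reloaded)
```

The signature check keeps a stale entry from being reused after a mail file has been rewritten.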

'module' object has no attribute 'init'

# pip3 install --user maildir-deduplicate
[...]
Successfully installed boltons-18.0.0 click-6.7 click-log-0.2.1 maildir-deduplicate-2.1.0 progressbar2-3.37.1 python-utils-2.3.0 tabulate-0.8.2
# ~/.local/bin/mdedup --help
Traceback (most recent call last):
  File "/root/.local/bin/mdedup", line 7, in <module>
    from maildir_deduplicate.cli import cli
  File "/root/.local/lib64/python3.4/site-packages/maildir_deduplicate/cli.py", line 51, in <module>
    @click_log.init(logger)
AttributeError: 'module' object has no attribute 'init'
# python3 --version
Python 3.4.6
# pip3 --version
pip 9.0.1 from /usr/lib64/python3.4/site-packages (python 3.4)

This is on a Funtoo system. I used --user so that dependencies wouldn't collide with the package manager.

Serious documentation issue: Strategies reversed

The documentation both on GitHub and in the script states:

Removal strategies for each set of mail duplicates:
- older: remove all but the newest message (determined by ctime).
- newer: remove all but the oldest message (determined by ctime).

I am running with --strategy newer, so I would have expected the oldest message to survive.
Nevertheless, the debug output states:

Subject: Test 2
left 1 1472685012.4187646 /home/renegade/[REDACTED]
removed 2 1472685012.3827646 /home/renegade/[REDACTED]
removed 3 1472680756.5782707 /home/renegade/[REDACTED]
removed 4 1472680756.5702705 /home/renegade/[REDACTED]

1472685012 is 2016-08-31T23:10:12+00:00
1472680756 is 2016-08-31T21:59:16+00:00

The message marked as "4" above is quite clearly the oldest message and should have survived, according to the documentation. Instead, the newest message was chosen.

This is how Python sorting orders the values:

>>> test=[1472685012.3827646,1472680756.5782707,1472680756.5702705,1472685012.4187646]
>>> test.sort()
>>> test
[1472680756.5702705, 1472680756.5782707, 1472685012.3827646, 1472685012.4187646]
>>> test.sort(reverse=True)
>>> test
[1472685012.4187646, 1472685012.3827646, 1472680756.5782707, 1472680756.5702705]

It is quite obvious that, for the result above, maildir-deduplicate chose the reverse=True sorting argument.

    def time_sort(messages, old_to_new):
        ctimes = []
        for mail_file, message in messages:
            ctime = os.path.getctime(mail_file)
            ctimes.append((ctime, mail_file, message))

        ctimes.sort(reverse=old_to_new)

        return ctimes

The code chooses whether to reverse sort depending on the second argument to time_sort.

                if self.strategy == OLDER:
                    sorted_messages_ctime = self.time_sort(messages, False)
                elif self.strategy == NEWER:
                    sorted_messages_ctime = self.time_sort(messages, True)

If the strategy NEWER is chosen, the code chooses reverse sorting, which results in the largest number being sorted first. Since Unix timestamps are ever-increasing, the largest number is the newest (youngest) date.
The code later chooses the survivor by simply skipping the first record:

        for i, duplicate in enumerate(duplicate_set):
            size, mail_file, message = duplicate
            if self.strategy in [SMALLER, OLDER, NEWER]:
                if i > 0:
                    doomed[mail_file] = 1

Unless I'm missing something major here, the actual deletion behaviour of the code is the exact opposite of what the documentation states!

The documentation is laid out in the sense of "which messages should I delete?".
The code operates in the sense of "which message should I keep?".
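
The behaviour described above can be checked with the ctimes from the debug output. The `survivor` helper is a hypothetical condensation of the quoted `time_sort` and skip-the-first-record logic, not the project's code:

```python
# ctimes copied from the debug output quoted above.
ctimes = [
    1472685012.4187646,
    1472685012.3827646,
    1472680756.5782707,
    1472680756.5702705,
]


def survivor(values, reverse):
    # Mirrors the quoted logic: sort, keep index 0, doom everything else.
    ordered = sorted(values, reverse=reverse)
    return ordered[0]


kept_with_newer = survivor(ctimes, reverse=True)   # NEWER passes True
kept_with_older = survivor(ctimes, reverse=False)  # OLDER passes False
print(kept_with_newer, kept_with_older)
```

With `reverse=True` the surviving record is the largest timestamp, which matches the "left 1" line in the debug output.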

maildir-deduplicate fails with UnicodeDecodeError

Reproducer e-mail: https://emergent.unpythonic.net/files/sandbox/maildir-encoding-error.zip

I tested at ref 26a167c on debian stretch.

=== Start phase #1: load mails and compute hashes.
Opening maildir at /tmp/maildir-test ...
1 mails found.
N/A%|                                                                          |Traceback (most recent call last):
  File "/home/jepler/bin/mdedup", line 11, in <module>
    load_entry_point('maildir-deduplicate==2.1.1', 'console_scripts', 'mdedup')()
  File "/home/jepler/.local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/jepler/.local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/jepler/.local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jepler/.local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jepler/.local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/jepler/.local/lib/python2.7/site-packages/click-6.7-py2.7.egg/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/cli.py", line 204, in deduplicate
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/deduplicate.py", line 523, in add_maildir
  File "/home/jepler/.local/lib/python2.7/site-packages/boltons-17.1.0-py2.7.egg/boltons/cacheutils.py", line 658, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/mail.py", line 138, in hash_key
  File "/home/jepler/.local/lib/python2.7/site-packages/boltons-17.1.0-py2.7.egg/boltons/cacheutils.py", line 658, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/mail.py", line 157, in canonical_headers
  File "build/bdist.linux-x86_64/egg/maildir_deduplicate/mail.py", line 187, in canonical_header_value
  File "/usr/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)
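
The byte `0x96` is not valid ASCII; in Windows-1252 it encodes an en dash, which hints at an undeclared 8-bit header. As a hedged sketch of one way to avoid the crash (written for Python 3, whereas the report above is on Python 2.7), a header value can be decoded defensively before the whitespace is collapsed. `to_text`, `canonical_whitespace` and the choice of fallback encoding are assumptions, not part of maildir-deduplicate.

```python
import re


def to_text(value, fallback="cp1252"):
    """Coerce a raw header value to str without raising on 8-bit bytes."""
    if isinstance(value, bytes):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Assumed fallback; errors="replace" keeps undecodable bytes from raising.
            return value.decode(fallback, errors="replace")
    return str(value)


def canonical_whitespace(value):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", to_text(value)).strip()


print(canonical_whitespace(b"Re:  budget \x96 Q3\r\n update"))
```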

TypeError: expected string or bytes-like object

I'm scanning a 500K message collection, which I suspect is 75% dupes from various IMAP syncs gone awry. About 1/4 through I always get a crash with this output:

Traceback (most recent call last):
  File "/home/scott/.local/bin/mdedup", line 11, in <module>
    load_entry_point('maildir-deduplicate==2.2.0', 'console_scripts', 'mdedup')()
  File "/home/scott/.local/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/scott/.local/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/scott/.local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/scott/.local/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/scott/.local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/scott/.local/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/scott/.local/lib/python3.6/site-packages/maildir_deduplicate/cli.py", line 206, in deduplicate
    dedup.add_maildir(maildir)
  File "/home/scott/.local/lib/python3.6/site-packages/maildir_deduplicate/deduplicate.py", line 523, in add_maildir
    mail_hash = mail.hash_key
  File "/home/scott/.local/lib/python3.6/site-packages/boltons/cacheutils.py", line 658, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/home/scott/.local/lib/python3.6/site-packages/maildir_deduplicate/mail.py", line 138, in hash_key
    return hashlib.sha224(self.canonical_headers).hexdigest()
  File "/home/scott/.local/lib/python3.6/site-packages/boltons/cacheutils.py", line 658, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/home/scott/.local/lib/python3.6/site-packages/maildir_deduplicate/mail.py", line 157, in canonical_headers
    canonical_value = self.canonical_header_value(header, value)
  File "/home/scott/.local/lib/python3.6/site-packages/maildir_deduplicate/mail.py", line 187, in canonical_header_value
    value = re.sub(r'\s+', ' ', value).strip()
  File "/usr/local/lib/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object

It would be nicer if the malformed email that I presume is causing this were skipped with a warning instead of crashing the app. Alternatively, the app could print the path of each email before checking it (I could pipe that to a log file), so that after a crash I can go back, remove the problematic email, and try again.

I'm running maildir-deduplicate f1c6ff2 on OpenBSD 6.1-stable with Python 3.6.
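
The request above (skip the offending message with a warning, and log which file is being processed) could look roughly like the following. This is a sketch under assumptions: `hash_mail`, `hash_all` and the `(path, headers)` input shape are hypothetical, and it has not been checked against the project's own classes. The `str()` coercion assumes the `TypeError` comes from a header value that is not a plain string, such as the `email.header.Header` objects Python 3 may return for headers containing raw non-ASCII bytes.

```python
import hashlib
import logging
import re

logger = logging.getLogger("mdedup-sketch")


def hash_mail(headers):
    """Hash normalized header values; raise TypeError on unusable ones."""
    parts = []
    for name, value in headers:
        if value is None:
            raise TypeError(f"header {name!r} has no value")
        # str() also covers header objects that are not plain strings.
        parts.append(name + ": " + re.sub(r"\s+", " ", str(value)).strip())
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()


def hash_all(mails):
    """Return ({path: digest}, [skipped paths]) without aborting the run."""
    hashes, skipped = {}, []
    for path, headers in mails:
        logger.debug("Hashing %s", path)  # lets a crash be traced to a file
        try:
            hashes[path] = hash_mail(headers)
        except (TypeError, UnicodeError) as exc:
            logger.warning("Skipping %s: %s", path, exc)
            skipped.append(path)
    return hashes, skipped


mails = [
    ("cur/1", [("Subject", "Hello   world")]),
    ("cur/2", [("Subject", None)]),
]
hashes, skipped = hash_all(mails)
print(sorted(hashes), skipped)  # ['cur/1'] ['cur/2']
```

Collecting the skipped paths, rather than only logging them, makes it possible to print a summary at the end of a long run.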

Unclear how to install

Since transitioning from a simple script to a package, it's not at all clear how to use this any more. README.rst is missing any installation instructions, and the usage text begins:

Usage: __init__.py [OPTIONS] [MAILDIR [MAILDIR ...]]

which is obviously wrong.

Confirm non-existence of export mailbox before hashing mails

mdedup wants to create a new mailbox in the location specified by --export, and if the location exists a FileExistsError is raised by create_box.

Because this comes after the mail has been hashed, and the hashing is the time-consuming part of the process, one ends up having to re-hash if the mailbox already exists. Although it is a user error, it is annoying to have to re-hash.

So it would be nice if mdedup could check that the export destination doesn't exist, and throw the exception if it does, before the hashing takes place.
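
A fail-fast check of this kind could be sketched as below. `ensure_export_is_free` and `run` are hypothetical names, and the hashing step is a placeholder callable; only the idea of raising `FileExistsError` before the expensive work comes from the report.

```python
from pathlib import Path
import tempfile


def ensure_export_is_free(export_path):
    """Raise FileExistsError early if the export destination already exists."""
    path = Path(export_path)
    if path.exists():
        raise FileExistsError(f"Export destination already exists: {path}")
    return path


def run(export_path, hash_mails):
    # Validate the destination first, so a user error costs no hashing time.
    target = ensure_export_is_free(export_path)
    return target, hash_mails()


with tempfile.TemporaryDirectory() as tmp:
    calls = []
    try:
        run(tmp, lambda: calls.append("hashed"))
    except FileExistsError as exc:
        print("refused:", exc)
    print(calls)  # [] because hashing never started
```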
