Comments (12)
@vincentbernat is right. None of the *-discarded actions are implemented yet.
In the meantime, you can indeed emulate the effect of -a delete-discarded -s select-one with -a delete-selected -s select-all-but-one, i.e. by reversing the logic.
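To make the reversed logic concrete, here is a minimal sketch on a single duplicate set. The functions `select_one`, `select_all_but_one` and `discarded` are hypothetical stand-ins working on plain lists, not the project's strategy implementations. They only illustrate that, within one set of duplicates, the mails discarded by one strategy are the mails selected by the other.

```python
# Hypothetical stand-ins for the two strategies, operating on plain lists.
def select_one(pool):
    """Keep exactly one mail from the pool (here: the first)."""
    return pool[:1]


def select_all_but_one(pool):
    """Keep every mail from the pool except one (here: the first)."""
    return pool[1:]


def discarded(pool, selected):
    """Mails of the pool that the strategy did not select."""
    return [mail for mail in pool if mail not in selected]


pool = ["mail-a", "mail-b", "mail-c"]

# "select-one" + act on discarded  vs.  "select-all-but-one" + act on selected
via_discarded = discarded(pool, select_one(pool))
via_selected = select_all_but_one(pool)
assert via_discarded == via_selected
```

This only covers a set holding two or more mails. How single, non-duplicated mails are treated is a separate question raised further down the thread.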
The thing is, all *-discarded actions are syntactic sugar. They should implement the reverse of the *-selected actions at:
mail-deduplicate/mail_deduplicate/action.py
Lines 82 to 91 in f423b41
Same for strategies by the way. They all have their own aliases to allow users to better map their own worldview to the operational logic. See:
mail-deduplicate/mail_deduplicate/strategy.py
Lines 206 to 224 in f423b41
The problem is at the selection process step, right before we perform the action:
Applying the strategy returns the list of selected mails, not the ones that were discarded. See how each implemented strategy only returns a subset of the duplicate pool:
mail-deduplicate/mail_deduplicate/strategy.py
Lines 37 to 39 in f423b41
mail-deduplicate/mail_deduplicate/strategy.py
Lines 52 to 56 in f423b41
and so on...
By the time we need to perform the action, we only have a subset of the initial duplicate pool. We do not have the list of those that were discarded.
This is a limit of the code architecture inherited from the initial dumb script I wrote 10 years ago. Returning only the selected subset was an early optimization to reduce the memory footprint. Given that context, the *-discarded actions will be hard to implement with the current code structure.
I propose to first tackle #87, i.e. keep a cache of canonical hashes used to ID each mail. That way we'll be in a position to only deal with sets of hashes in our selection/action phases instead of parsed mail objects. This will bring much cleaner and more flexible code to implement the missing actions.
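The idea of working on sets of IDs can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: the `timestamps` cache, the `select_newest` function and the `h1`, `h2`, `h3` identifiers are all hypothetical. Once a duplicate pool is a set of IDs, the discarded mails are simply the set difference between the pool and the selection.

```python
# Hypothetical cache mapping each mail's ID (e.g. a hash) to a timestamp.
timestamps = {"h1": 100, "h2": 250, "h3": 175}


def select_newest(pool, timestamps):
    """Return the subset of IDs sharing the most recent timestamp."""
    newest = max(timestamps[mail_id] for mail_id in pool)
    return {mail_id for mail_id in pool if timestamps[mail_id] == newest}


pool = set(timestamps)
selected = select_newest(pool, timestamps)

# With sets, the discarded mails stay recoverable after the selection step.
discarded = pool - selected
```

Because only small identifiers are kept around, both the selected and the discarded subsets are available at action time without holding parsed mail objects in memory.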
from mail-deduplicate.
I also ran into this bug. For some reason, only the *-selected actions are implemented. So, in many cases, you can use -a delete-selected (and maybe repeat several times if you get several duplicates).
Edit: very bad idea. Non-duplicated mails will be selected too...
I've managed to do the intended action by making it skip unique emails:
mail_deduplicate/mail_deduplicate/deduplicate.py Lines 404 to 422
Thanks, it's the exact place where I'm coding right now for myself, but on top of that I introduced a new option.
FYI: just added an issue and PR regarding the option to only act on duplicates (#203 and #204, respectively)
The missing *-discarded actions have been implemented in #290 and are now available upstream.
same is true for move-discarded
So the core functionality of the tool is not implemented... Can we prioritize this bug (which actually makes the tool useless) so it gets fixed sooner rather than later?
@kdeldycke
The problem is that single mails get removed by your proposal. So it's not an equivalent substitution; please revert the "bug" tag and prioritize the fix.
mdedup -n -i maildir -s select-all-but-one -a delete-selected -C 32 .
● Phase #4 - Report and statistics
╒════════════╤══════════╤══════════════════════════════════════════════════════════════╕
│ Mails │ Metric │ Description │
╞════════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ Found │ 221373 │ Total number of mails encountered from all mail sources. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Rejected │ 1067 │ Number of mails individuality rejected because they were │
│ │ │ unparseable or did not had enough metadata to compute │
│ │ │ hashes. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Retained │ 220306 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Hashes │ 165621 │ Number of unique hashes. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Unique │ 121943 │ Number of unique mails (which where automaticcaly added to │
│ │ │ selection). │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │ 98363 │ Number of duplicate mails (sum of mails in all duplicate │
│ │ │ sets with at least 2 mails). │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Skipped │ 8644 │ Number of mails ignored in the selection phase because the │
│ │ │ whole set they belongs to was skipped. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Discarded │ 40268 │ Number of mails discarded from the final selection. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Selected │ 171394 │ Number of mails kept in the final selection on which the │
│ │ │ action will be performed. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Copied │ 0 │ Number of mails copied from their original mailbox to │
│ │ │ another. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Moved │ 0 │ Number of mails moved from their original mailbox to │
│ │ │ another. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Deleted │ 171394 │ Number of mails deleted from their mailbox in-place. │
╘════════════╧══════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤══════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets │ Metric │ Description │
╞════════════════════╪══════════╪════════════════════════════════════════════════════════════╡
│ Total │ 165621 │ Total number of duplicate sets. │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Single │ 121943 │ Total number of sets containing a single mail and did not │
│ │ │ had to have a strategy applied to. They were automatticaly │
│ │ │ kept in the final selection. │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they had encoding issues. │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size │ 510 │ Number of sets skipped from the selection process because │
│ │ │ they were too disimilar in size. │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipeed - Content │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they were too disimilar in content. │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │ 0 │ Number of sets skipped from the selection process because │
│ │ │ the strategy could not be applied. │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated │ 40268 │ Number of valid sets on which the selection strategy was │
│ │ │ successfully applied. │
╘════════════════════╧══════════╧════════════════════════════════════════════════════════════╛
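The figures in the report above can be cross-checked with a few lines of arithmetic. The variable names are only labels for the rows of the "Mails" table, and the numbers are copied from it. The check suggests that the 171394 mails counted as deleted include all 121943 unique mails.

```python
# Figures copied from the "Mails" table of the report above.
unique = 121943
duplicates = 98363
skipped = 8644
discarded = 40268
selected = 171394
deleted = 171394

# Mails from duplicate sets that ended up in the final selection.
selected_from_duplicates = duplicates - skipped - discarded

# The selection is the unique mails plus the selected duplicates,
# and everything selected is counted as deleted.
assert unique + selected_from_duplicates == selected
assert deleted == selected

print(f"{unique} of the {deleted} mails counted as deleted have no duplicate")
```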
I've managed to do the intended action by making it skip unique emails:
mail_deduplicate/mail_deduplicate/deduplicate.py
Lines 404 to 422
    # Unique mails are always selected. No need to mobilize the whole
    # DuplicateSet machinery.
    if mail_count == 1:
        self.stats["mail_unique"] += 1
        self.stats["set_single"] += 1
        if self.conf.action == "delete-selected":
            logger.debug("Skipped deletion of unique mail.")
            self.stats["mail_skipped"] += 1
        else:
            logger.debug("Add unique message to selection.")
            self.stats["mail_selected"] += 1
            candidates = mail_set
    # We need to resort to a selection strategy to discriminate mails
    # within the set.
    else:
        duplicates = DuplicateSet(hash_key, mail_set, self.conf)
        candidates = duplicates.select_candidates()
        # Merge duplicate set's stats to global stats.
        self.stats += duplicates.stats
Any progress on this? select-newer also doesn't work as expected. Single mails without duplicates are still selected.
│ Mails │ Metric │ Description │
╞════════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ Found │ 1060 │ Total number of mails encountered from all mail sources. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Rejected │ 0 │ Number of mails rejected individually because they were │
│ │ │ unparseable or did not have enough metadata to compute │
│ │ │ hashes. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Retained │ 1060 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Hashes │ 1060 │ Number of unique hashes. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Unique │ 1060 │ Number of unique mails (which where automatically added to │
│ │ │ selection). │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │ 0 │ Number of duplicate mails (sum of mails in all duplicate │
│ │ │ sets with at least 2 mails). │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Skipped │ 0 │ Number of mails ignored in the selection phase because the │
│ │ │ whole set they belong to was skipped. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Discarded │ 0 │ Number of mails discarded from the final selection. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Selected │ 1060 │ Number of mails kept in the final selection on which the │
│ │ │ action will be performed. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Copied │ 0 │ Number of mails copied from their original mailbox to │
│ │ │ another. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Moved │ 0 │ Number of mails moved from their original mailbox to │
│ │ │ another. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Deleted │ 1060 │ Number of mails deleted from their mailbox in-place. │
Command line was
mdedup -s select-newer -a delete-selected -n -t ctime maildir
@dschrempf Can you please open a new ticket regarding select-newer misbehaving?
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.