Comments (3)
One other thing that isn't clear from the documentation:
If two items tie, e.g. have the same datestamp, is a tiebreak applied? That would be logical, but a strict reading of the documentation suggests that BOTH emails are selected.
That is, if 1A and 1B have identical timestamps, are BOTH selected and acted upon? Or just one, for actions that typically select one message?
from mail-deduplicate.
Just to follow up: I still could not determine the behavior. I fed each source file to GPT-4 to look for the code that would answer my question, but I could not identify which code handles unique emails during deduplication or resolves ties in duplicate selection.
So I am constructing a toy mbox to understand the behavior, but now I am more confused than ever:
From [email protected] Thu Jan 1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: [email protected]
To: [email protected]
This is a duplicate email.
From [email protected] Thu Jan 1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: [email protected]
To: [email protected]
This is a duplicate email.
From [email protected] Thu Jan 1 00:01:00 2021
Subject: Slightly Different Email
Date: Thu, 1 Jan 2021 00:01:00 +0000
From: [email protected]
To: [email protected]
This email is slightly different.
From [email protected] Thu Jan 1 00:02:00 2021
Subject: Unique Email
Date: Thu, 1 Jan 2021 00:02:00 +0000
From: [email protected]
To: [email protected]
This is a unique email.
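For anyone who wants to reproduce this, here is a minimal sketch that rebuilds an equivalent toy mbox with Python's standard `mailbox` module. The addresses are placeholders (the ones above are obfuscated on this page), and `build_toy_mbox`, `SENDER`, `RECIPIENT` and `TOY_MAILS` are names made up for this example.

```python
import mailbox
import os
import tempfile
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
from email.utils import format_datetime

# Placeholder addresses: the real ones are not visible in the thread.
SENDER = "sender@example.com"
RECIPIENT = "recipient@example.com"

# (subject, minutes after 2021-01-01 00:00:00 UTC, body)
TOY_MAILS = [
    ("Duplicate Email 1", 0, "This is a duplicate email."),
    ("Duplicate Email 1", 0, "This is a duplicate email."),
    ("Slightly Different Email", 1, "This email is slightly different."),
    ("Unique Email", 2, "This is a unique email."),
]


def build_toy_mbox(path):
    """Write the four toy mails to a new mbox file at `path` and return the path."""
    base = datetime(2021, 1, 1, tzinfo=timezone.utc)
    box = mailbox.mbox(path)  # the file is created if it does not exist
    try:
        for subject, minutes, body in TOY_MAILS:
            msg = EmailMessage()
            msg["Subject"] = subject
            # format_datetime derives the weekday from the date itself.
            msg["Date"] = format_datetime(base + timedelta(minutes=minutes))
            msg["From"] = SENDER
            msg["To"] = RECIPIENT
            msg.set_content(body)
            box.add(msg)
        box.flush()
    finally:
        box.close()
    return path


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "toy.mbox")
    print("wrote", build_toy_mbox(target))
```

The generated `Date` headers will read `Fri` rather than the hand-typed `Thu`, since 1 Jan 2021 was a Friday. The first two mails still carry the same date and time, which is the tie this test is meant to exercise.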
Running `mdedup` on this mbox with the `select-newest` strategy and the `move-discarded` action gives:
● Step #5 - Report and statistics
╒════════════╤════════╤══════════════════════════════════════════════════════════════╕
│ Mails │ Metric │ Description │
╞════════════╪════════╪══════════════════════════════════════════════════════════════╡
│ Found │ 4 │ Total number of mails encountered from all mail sources. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Rejected │ 0 │ Number of mails rejected individually because they were │
│ │ │ unparseable or did not have enough metadata to compute │
│ │ │ hashes. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Retained │ 4 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Hashes │ 3 │ Number of unique hashes. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Unique │ 0 │ Number of unique mails (which where automatically added to │
│ │ │ selection). │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │ 4 │ Number of duplicate mails (sum of mails in all duplicate │
│ │ │ sets with at least 2 mails). │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Skipped │ 4 │ Number of mails ignored in the selection step because the │
│ │ │ whole set they belong to was skipped. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Discarded │ 0 │ Number of mails discarded from the final selection. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Selected │ 0 │ Number of mails kept in the final selection on which the │
│ │ │ action will be performed. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Copied │ 0 │ Number of mails copied from their original mailbox to │
│ │ │ another. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Moved │ 0 │ Number of mails moved from their original mailbox to │
│ │ │ another. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Deleted │ 0 │ Number of mails deleted from their mailbox in-place. │
╘════════════╧════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets │ Metric │ Description │
╞════════════════════╪════════╪════════════════════════════════════════════════════════════╡
│ Total │ 3 │ Total number of duplicate sets. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Single │ 0 │ Total number of sets containing only a single mail with no │
│ │ │ applicable strategy. They were automatically kept in the │
│ │ │ final selection. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they had encoding issues. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they were too dissimilar in size. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Content │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they were too dissimilar in content. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │ 3 │ Number of sets skipped from the selection process because │
│ │ │ the strategy could not be applied. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated │ 0 │ Number of valid sets on which the selection strategy was │
│ │ │ successfully applied. │
╘════════════════════╧════════╧════════════════════════════════════════════════════════════╛
This suggests:
- If there are enough headers, the email is considered (not sure what term you use here) and will either be selected or discarded. Otherwise it is "rejected", which is important but not well documented, since rejected emails are not acted upon.
- Since the two duplicate emails have identical timestamps, they are both selected. This is technically correct, but hard to work around when you do want tiebreak behavior.
Anyway, what is clear is that all emails are selected, so `move-discarded` moved none. So `move-selected` should move ALL of them, right? But when I run the same command with `move-selected`, nothing happens and the mbox is unchanged!
● Step #3 - Select mails in each group
info: select-newest strategy will be applied on each duplicate set to select candidates.
info: ◼ 2 mails sharing hash 05a3285c1254315fa50966ae1bed99e47ab51a592d9e728a7a70e526
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459200 timestamp...
warning: Skip set: all 2 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459260 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459320 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.
● Step #4 - Perform action on selected mails
info: Perform move-selected action...
warning: No mail selected to perform action on.
● Step #5 - Report and statistics
╒════════════╤════════╤══════════════════════════════════════════════════════════════╕
│ Mails │ Metric │ Description │
╞════════════╪════════╪══════════════════════════════════════════════════════════════╡
│ Found │ 4 │ Total number of mails encountered from all mail sources. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Rejected │ 0 │ Number of mails rejected individually because they were │
│ │ │ unparseable or did not have enough metadata to compute │
│ │ │ hashes. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Retained │ 4 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Hashes │ 3 │ Number of unique hashes. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Unique │ 0 │ Number of unique mails (which where automatically added to │
│ │ │ selection). │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │ 4 │ Number of duplicate mails (sum of mails in all duplicate │
│ │ │ sets with at least 2 mails). │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Skipped │ 4 │ Number of mails ignored in the selection step because the │
│ │ │ whole set they belong to was skipped. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Discarded │ 0 │ Number of mails discarded from the final selection. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Selected │ 0 │ Number of mails kept in the final selection on which the │
│ │ │ action will be performed. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Copied │ 0 │ Number of mails copied from their original mailbox to │
│ │ │ another. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Moved │ 0 │ Number of mails moved from their original mailbox to │
│ │ │ another. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Deleted │ 0 │ Number of mails deleted from their mailbox in-place. │
╘════════════╧════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets │ Metric │ Description │
╞════════════════════╪════════╪════════════════════════════════════════════════════════════╡
│ Total │ 3 │ Total number of duplicate sets. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Single │ 0 │ Total number of sets containing only a single mail with no │
│ │ │ applicable strategy. They were automatically kept in the │
│ │ │ final selection. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they had encoding issues. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they were too dissimilar in size. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Content │ 0 │ Number of sets skipped from the selection process because │
│ │ │ they were too dissimilar in content. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │ 3 │ Number of sets skipped from the selection process because │
│ │ │ the strategy could not be applied. │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated │ 0 │ Number of valid sets on which the selection strategy was │
│ │ │ successfully applied. │
╘════════════════════╧════════╧════════════════════════════════════════════════════════════╛
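One possible reading of the log above, sketched in Python: within each set the strategy picks every mail sharing the newest timestamp, and if that leaves nothing to discard, the whole set is skipped and none of its mails reach the final selection. This is only a model inferred from the log messages and the report counters, not mail-deduplicate's actual code. `select_newest`, `TOY_SETS` and the `break_ties` option are hypothetical names for this example, and `break_ties` is not an existing feature of the tool.

```python
# Toy data mirroring the log: hash label -> list of (mail_id, unix_timestamp).
TOY_SETS = {
    "dup": [("1A", 1609459200), ("1B", 1609459200)],
    "different": [("2", 1609459260)],
    "unique": [("3", 1609459320)],
}


def select_newest(duplicate_sets, break_ties=False):
    """Return (selected_mail_ids, skipped_set_labels).

    Model of the logged behaviour: a set in which every mail is a
    candidate is skipped entirely, so nothing from it is selected.
    """
    selected, skipped = [], []
    for label, mails in duplicate_sets.items():
        newest = max(ts for _, ts in mails)
        candidates = [mail_id for mail_id, ts in mails if ts == newest]
        # Hypothetical tiebreak: keep only the first candidate of a real
        # duplicate set instead of all mails that share the newest timestamp.
        if break_ties and len(mails) > 1:
            candidates = candidates[:1]
        if len(candidates) == len(mails):
            # "Skip set: all N mails within were selected."
            skipped.append(label)
            continue
        selected.extend(candidates)
    return selected, skipped


if __name__ == "__main__":
    print(select_newest(TOY_SETS))                   # no tiebreak
    print(select_newest(TOY_SETS, break_ties=True))  # hypothetical tiebreak
```

Under this model the toy mbox yields an empty selection and three skipped sets, which would match the `Selected 0` and `Skipped - Strategy 3` rows in the report and explain why `move-selected` has nothing to move. With the hypothetical tiebreak, only `1A` would be selected from the duplicate pair.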