
Comments (3)

turian avatar turian commented on September 22, 2024

One other thing that isn't clear from the documentation:

If two items tie, e.g. have the same datestamp, is a tiebreak made? This would be logical, but a strict reading of the documentation suggests that BOTH emails are selected.

Meaning, if 1A and 1B have identical timestamps, are BOTH selected and acted upon? Or just one, for actions that typically select one message?

from mail-deduplicate.

turian avatar turian commented on September 22, 2024

Just to follow up, I still could not determine the behavior. I used GPT-4 and plugged in each source file to see if I could find the code that would answer my question. However, I was unable to determine which code handles unique emails in the deduplication process, or which code resolves ties in duplicate selection.


turian avatar turian commented on September 22, 2024

So I am constructing a toy mbox to understand the behavior, but now I am more confused than ever:

From [email protected] Thu Jan  1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: [email protected]
To: [email protected]

This is a duplicate email.

From [email protected] Thu Jan  1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: [email protected]
To: [email protected]

This is a duplicate email.

From [email protected] Thu Jan  1 00:01:00 2021
Subject: Slightly Different Email
Date: Thu, 1 Jan 2021 00:01:00 +0000
From: [email protected]
To: [email protected]

This email is slightly different.

From [email protected] Thu Jan  1 00:02:00 2021
Subject: Unique Email
Date: Thu, 1 Jan 2021 00:02:00 +0000
From: [email protected]
To: [email protected]

This is a unique email.
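For anyone who wants to reproduce this, here is a minimal sketch that writes an equivalent toy mbox with Python's standard `mailbox` module. The addresses are placeholders (the originals are redacted above), and `build_toy_mbox` and `toy.mbox` are names made up for this example. Note that `mailbox` writes its own `From ` separator lines, so those will not match the ones shown above byte for byte.

```python
import mailbox
from email.message import Message

# (Subject, Date header, body) for the four toy messages shown above.
TOY_MAILS = [
    ("Duplicate Email 1", "Thu, 1 Jan 2021 00:00:00 +0000",
     "This is a duplicate email."),
    ("Duplicate Email 1", "Thu, 1 Jan 2021 00:00:00 +0000",
     "This is a duplicate email."),
    ("Slightly Different Email", "Thu, 1 Jan 2021 00:01:00 +0000",
     "This email is slightly different."),
    ("Unique Email", "Thu, 1 Jan 2021 00:02:00 +0000",
     "This is a unique email."),
]


def build_toy_mbox(path):
    """Append the four toy messages to an mbox file; return the mail count."""
    box = mailbox.mbox(path)  # the file is created if it does not exist
    try:
        for subject, date, body in TOY_MAILS:
            msg = Message()  # legacy Message keeps header strings verbatim
            msg["Subject"] = subject
            msg["Date"] = date
            # Placeholder addresses: assumptions, not the original values.
            msg["From"] = "sender@example.com"
            msg["To"] = "recipient@example.com"
            msg.set_payload(body + "\n")
            box.add(msg)
    finally:
        box.close()  # flushes pending messages to disk
    return len(mailbox.mbox(path))


if __name__ == "__main__":
    print(build_toy_mbox("toy.mbox"), "mails written")
```

Run it against a fresh path each time, since `add` appends to an existing file.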

Giving:


● Step #5 - Report and statistics
╒════════════╤════════╤══════════════════════════════════════════════════════════════╕
│ Mails      │ Metric │ Description                                                  │
╞════════════╪════════╪══════════════════════════════════════════════════════════════╡
│ Found      │      4 │ Total number of mails encountered from all mail sources.     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │      0 │ Number of mails rejected individually because they were      │
│            │        │ unparseable or did not have enough metadata to compute       │
│            │        │ hashes.                                                      │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │      4 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │      3 │ Number of unique hashes.                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │      0 │ Number of unique mails (which where automatically added to   │
│            │        │ selection).                                                  │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │      4 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │        │ sets with at least 2 mails).                                 │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │      4 │ Number of mails ignored in the selection step because the    │
│            │        │ whole set they belong to was skipped.                        │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │      0 │ Number of mails discarded from the final selection.          │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │      0 │ Number of mails kept in the final selection on which the     │
│            │        │ action will be performed.                                    │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │      0 │ Number of mails copied from their original mailbox to        │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │      0 │ Number of mails moved from their original mailbox to         │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │      0 │ Number of mails deleted from their mailbox in-place.         │
╘════════════╧════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets     │ Metric │ Description                                                │
╞════════════════════╪════════╪════════════════════════════════════════════════════════════╡
│ Total              │      3 │ Total number of duplicate sets.                            │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Single             │      0 │ Total number of sets containing only a single mail with no │
│                    │        │ applicable strategy. They were automatically kept in the   │
│                    │        │ final selection.                                           │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they had encoding issues.                                  │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size     │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in size.                          │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Content  │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in content.                       │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │      3 │ Number of sets skipped from the selection process because  │
│                    │        │ the strategy could not be applied.                         │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated       │      0 │ Number of valid sets on which the selection strategy was   │
│                    │        │ successfully applied.                                      │
╘════════════════════╧════════╧════════════════════════════════════════════════════════════╛

This suggests:

  • If an email has enough headers, it is considered (not sure what term you use here) and will either be selected or discarded. Otherwise, it is "rejected", which is important but not well documented, since rejected emails are not acted upon.
  • Since the two duplicate emails have identical timestamps, they are both selected. This is technically correct, but it is hard to work around when you do want tiebreak behavior.
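To make "tiebreak behavior" concrete, here is a small sketch of what I would expect: pick exactly one mail per set, preferring the newest timestamp and falling back to the earliest position in the mailbox when timestamps tie. This is purely hypothetical. `pick_one_newest` and the `position` field are made up for illustration and are not part of mail-deduplicate.

```python
def pick_one_newest(mails):
    """Pick a single mail id from a duplicate set.

    `mails` is a list of (mail_id, timestamp, position) tuples, where
    `position` is the index of the mail in its mailbox (hypothetical field).
    The newest timestamp wins; ties go to the lowest position.
    """
    if not mails:
        raise ValueError("empty duplicate set")
    # Negating the position makes max() prefer the earliest mail on a tie.
    best = max(mails, key=lambda m: (m[1], -m[2]))
    return best[0]


# The two tied duplicates from the toy mbox: only one is picked.
tied = [("dup-a", 1609459200, 0), ("dup-b", 1609459200, 1)]
print(pick_one_newest(tied))  # prints: dup-a
```

Any other deterministic secondary key (mailbox path, size, and so on) would work equally well. The point is only that a tie yields one mail instead of all of them.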

Anyway, what is clear is that all emails are selected, so move-discarded moved none. So move-selected should move ALL of them, right? But when I run the same command with move-selected, nothing happens and the mbox is unchanged!

● Step #3 - Select mails in each group
info: select-newest strategy will be applied on each duplicate set to select candidates.
info: ◼ 2 mails sharing hash 05a3285c1254315fa50966ae1bed99e47ab51a592d9e728a7a70e526
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459200 timestamp...
warning: Skip set: all 2 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459260 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459320 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.

● Step #4 - Perform action on selected mails
info: Perform move-selected action...
warning: No mail selected to perform action on.
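My reading of the log above, written out as a sketch: every mail sharing the newest timestamp becomes a candidate, and when that leaves nothing to discard, the whole set is skipped and contributes nothing to the final selection. This is a reconstruction from the log messages only, not the tool's actual source. `select_newest` and the set labels are invented for the example. The timestamps are the ones printed in the log.

```python
def select_newest(duplicate_sets):
    """Return (selected, skipped) mail ids, mimicking the logged behavior.

    `duplicate_sets` maps a set label to a list of (mail_id, timestamp).
    """
    selected, skipped = [], []
    for mails in duplicate_sets.values():
        newest = max(ts for _, ts in mails)
        candidates = [mail_id for mail_id, ts in mails if ts == newest]
        if len(candidates) == len(mails):
            # "Skip set: all N mails within were selected."
            # Nothing could be discarded, so the set is ignored entirely.
            skipped.extend(mail_id for mail_id, _ in mails)
            continue
        selected.extend(candidates)
    return selected, skipped


# The three sets from the toy mbox (labels are placeholders, not real hashes).
TOY_SETS = {
    "set-1": [("dup-a", 1609459200), ("dup-b", 1609459200)],
    "set-2": [("different", 1609459260)],
    "set-3": [("unique", 1609459320)],
}

print(select_newest(TOY_SETS))
# prints: ([], ['dup-a', 'dup-b', 'different', 'unique'])
```

Under this reading, ties and single-mail sets both end up skipped, which would match the report: 4 mails skipped, 0 selected, and therefore nothing for move-selected to act on.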

● Step #5 - Report and statistics
╒════════════╤════════╤══════════════════════════════════════════════════════════════╕
│ Mails      │ Metric │ Description                                                  │
╞════════════╪════════╪══════════════════════════════════════════════════════════════╡
│ Found      │      4 │ Total number of mails encountered from all mail sources.     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │      0 │ Number of mails rejected individually because they were      │
│            │        │ unparseable or did not have enough metadata to compute       │
│            │        │ hashes.                                                      │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │      4 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │      3 │ Number of unique hashes.                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │      0 │ Number of unique mails (which where automatically added to   │
│            │        │ selection).                                                  │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │      4 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │        │ sets with at least 2 mails).                                 │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │      4 │ Number of mails ignored in the selection step because the    │
│            │        │ whole set they belong to was skipped.                        │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │      0 │ Number of mails discarded from the final selection.          │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │      0 │ Number of mails kept in the final selection on which the     │
│            │        │ action will be performed.                                    │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │      0 │ Number of mails copied from their original mailbox to        │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │      0 │ Number of mails moved from their original mailbox to         │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │      0 │ Number of mails deleted from their mailbox in-place.         │
╘════════════╧════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets     │ Metric │ Description                                                │
╞════════════════════╪════════╪════════════════════════════════════════════════════════════╡
│ Total              │      3 │ Total number of duplicate sets.                            │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Single             │      0 │ Total number of sets containing only a single mail with no │
│                    │        │ applicable strategy. They were automatically kept in the   │
│                    │        │ final selection.                                           │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they had encoding issues.                                  │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size     │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in size.                          │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Content  │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in content.                       │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │      3 │ Number of sets skipped from the selection process because  │
│                    │        │ the strategy could not be applied.                         │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated       │      0 │ Number of valid sets on which the selection strategy was   │
│                    │        │ successfully applied.                                      │
╘════════════════════╧════════╧════════════════════════════════════════════════════════════╛

