
Comments (12)

kdeldycke commented on July 23, 2024 3

@vincentbernat is right. All *-discarded actions are not implemented yet.

In the meantime, you can indeed emulate the effect of -a delete-discarded -s select-one with -a delete-selected -s select-all-but-one, i.e. by reversing the logic.

The thing is, all *-discarded actions are syntactic sugar. They should implement the reverse of the *-selected actions defined at:

ACTIONS = FrozenDict(
    {
        COPY_SELECTED: copy_selected,
        COPY_DISCARDED: None,
        MOVE_SELECTED: move_selected,
        MOVE_DISCARDED: None,
        DELETE_SELECTED: delete_selected,
        DELETE_DISCARDED: None,
    }
)
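
To illustrate what "syntactic sugar" could mean here, the sketch below dispatches a *-discarded action to the same function as its *-selected counterpart, only on the complement of the selection. It assumes the whole duplicate pool is still available when the action runs, which is not the case in the current code. The `delete` and `perform` helpers and the string action IDs are hypothetical stand-ins, not the project's API.

```python
from typing import Callable, Dict, Set


def delete(mails: Set[str]) -> Set[str]:
    """Toy stand-in for a real action: return the mails it would delete."""
    return set(mails)


# Both variants share one implementation; only the target set differs.
ACTIONS: Dict[str, Callable[[Set[str]], Set[str]]] = {
    "delete-selected": delete,
    "delete-discarded": delete,
}


def perform(action_id: str, pool: Set[str], selected: Set[str]) -> Set[str]:
    """Run the action on the selection, or on its complement for *-discarded."""
    if action_id.endswith("-discarded"):
        target = pool - selected
    else:
        target = selected
    return ACTIONS[action_id](target)


pool = {"a", "b", "c"}
selected = {"a"}  # e.g. what a select-one strategy might return
deleted_discarded = perform("delete-discarded", pool, selected)
deleted_selected = perform("delete-selected", pool, selected)
```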

The same goes for strategies, by the way. They all have their own aliases to allow users to better map their own worldview to the operational logic. See:

# Groups strategy aliases and their definitions. Aliases are great usability
# features as they help users to better reason about the selection operators
# depending on their mental models.
STRATEGY_ALIASES = frozenset(
    [
        (SELECT_NEWEST, DISCARD_OLDER),
        (SELECT_NEWER, DISCARD_OLDEST),
        (SELECT_OLDEST, DISCARD_NEWER),
        (SELECT_OLDER, DISCARD_NEWEST),
        (SELECT_BIGGEST, DISCARD_SMALLER),
        (SELECT_BIGGER, DISCARD_SMALLEST),
        (SELECT_SMALLEST, DISCARD_BIGGER),
        (SELECT_SMALLER, DISCARD_BIGGEST),
        (SELECT_NON_MATCHING_PATH, DISCARD_MATCHING_PATH),
        (SELECT_MATCHING_PATH, DISCARD_NON_MATCHING_PATH),
        (SELECT_ALL_BUT_ONE, DISCARD_ONE),
        (SELECT_ONE, DISCARD_ALL_BUT_ONE),
    ]
)
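
As a rough illustration of how such alias pairs can be consumed, the sketch below builds a lookup table that maps either name of a pair to the select-* form. The string IDs are assumptions made for the example (the project defines its own constants), and `resolve` is a hypothetical helper.

```python
# Assumed string IDs, for illustration only.
STRATEGY_ALIASES = frozenset(
    [
        ("select-newest", "discard-older"),
        ("select-one", "discard-all-but-one"),
        ("select-all-but-one", "discard-one"),
    ]
)

# Map every name, select-* or discard-*, to the select-* name of its pair.
CANONICAL = {}
for select_id, discard_id in STRATEGY_ALIASES:
    CANONICAL[select_id] = select_id
    CANONICAL[discard_id] = select_id


def resolve(strategy_id: str) -> str:
    """Return the select-* name for a strategy ID, raising on unknown IDs."""
    try:
        return CANONICAL[strategy_id]
    except KeyError:
        raise ValueError(f"Unknown strategy: {strategy_id}") from None
```

With this table, a user asking for discard-one and a user asking for select-all-but-one end up on the same code path.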

The problem is at the selection process step, right before we perform the action:

selected = apply_strategy(self.conf.strategy, self)

Applying the strategy returns the set of selected mails, not the ones that were discarded. See how each implemented strategy only returns a subset of the duplicate pool:

return {
    mail for mail in duplicates.pool if mail.timestamp < duplicates.newest_timestamp
}

return {
    mail
    for mail in duplicates.pool
    if mail.timestamp == duplicates.oldest_timestamp
}

and so on...

By the time we need to perform the action, we only have a subset of the initial duplicate pool. We do not have the list of those that were discarded.
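
A minimal sketch of the missing piece: if the pool were kept around, a strategy's result could be turned into both halves at once. The `Mail` class, `select_oldest`, and `partition` below are simplified, hypothetical stand-ins for the project's own objects.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass(frozen=True)
class Mail:
    path: str
    timestamp: float


def select_oldest(pool: Set[Mail]) -> Set[Mail]:
    """Keep only the mails sharing the oldest timestamp of the pool."""
    oldest = min(mail.timestamp for mail in pool)
    return {mail for mail in pool if mail.timestamp == oldest}


def partition(
    pool: Set[Mail], strategy: Callable[[Set[Mail]], Set[Mail]]
) -> Tuple[Set[Mail], Set[Mail]]:
    """Return (selected, discarded) instead of the selection alone."""
    selected = strategy(pool)
    return selected, pool - selected


pool = {Mail("a", 1.0), Mail("b", 2.0), Mail("c", 3.0)}
selected, discarded = partition(pool, select_oldest)
```

The cost is that the full pool has to stay in memory until the action phase, which is the trade-off the original design avoided.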

This is a limitation of the code architecture inherited from the initial dumb script I wrote 10 years ago. Returning only the selection was an early optimization to reduce the memory footprint. Given that context, it will be hard to implement the missing actions with the current code structure.

I propose to first tackle #87, i.e. keep a cache of the canonical hashes used to ID each mail. That way we'll be in a position to only deal with sets of hashes in our selection/action phases instead of parsed mail objects. This will give us much cleaner and more flexible code in which to implement the missing actions.
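
To make the idea of a hash cache more concrete, here is a small sketch that maps a canonical hash to the IDs of the mails sharing it, so later phases only manipulate lightweight strings. The header list, the use of SHA-256, and the `canonical_hash` and `build_cache` helpers are all assumptions for illustration and do not reflect the project's actual hashing scheme.

```python
import hashlib
from collections import defaultdict
from email import message_from_string
from typing import Dict, Set

# Assumed subset of headers used to identify a mail.
HEADERS = ("Date", "From", "To", "Subject", "Message-ID")


def canonical_hash(raw: str) -> str:
    """Hash a fixed list of headers so that copies of a mail collide."""
    msg = message_from_string(raw)
    canonical = "\n".join(f"{name}: {msg.get(name, '')}" for name in HEADERS)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_cache(mails: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each canonical hash to the set of mail IDs sharing it."""
    cache = defaultdict(set)
    for mail_id, raw in mails.items():
        cache[canonical_hash(raw)].add(mail_id)
    return dict(cache)


mails = {
    "inbox/1": "Subject: hi\nFrom: a@example.com\n\nbody",
    "archive/7": "Subject: hi\nFrom: a@example.com\n\nanother body",
    "inbox/2": "Subject: bye\nFrom: a@example.com\n\nbody",
}
cache = build_cache(mails)
```

Once mails are reduced to IDs grouped by hash, the discarded mails of a set are simply the set of IDs minus the selected IDs.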

from mail-deduplicate.

vincentbernat commented on July 23, 2024 1

I also ran into this bug. For some reason, only the *-selected actions are implemented. So, in many cases, you can use -a delete-selected (and maybe repeat several times if you get several duplicates).

Edit: very bad idea. Non-duplicated mails will be selected too...


andy-shev commented on July 23, 2024 1

I've managed to do the intended action by making it skip unique emails:

mail_deduplicate/mail_deduplicate/deduplicate.py Lines 404 to 422

Thanks, it's the exact place where I'm coding right now for myself, but on top of that I introduced a new option.


djwf commented on July 23, 2024 1

FYI: just added an issue and PR regarding the option to only act on duplicates (#203 and #204, respectively)


kdeldycke commented on July 23, 2024 1

The missing *-discarded actions have been implemented in #290 and are now available upstream.


leggewie commented on July 23, 2024

same is true for move-discarded


andy-shev commented on July 23, 2024

So the core functionality of the tool is not implemented... Can we prioritize this bug (which actually makes the tool useless) to be fixed sooner rather than later?


andy-shev commented on July 23, 2024

@kdeldycke
The problem is that single mails get removed by your proposal. So it's not an equivalent substitution; please revert the "bug" tag and prioritize the fix.

mdedup -n -i maildir -s select-all-but-one -a delete-selected -C 32 .

● Phase #4 - Report and statistics
╒════════════╤══════════╤══════════════════════════════════════════════════════════════╕
│ Mails      │   Metric │ Description                                                  │
╞════════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ Found      │   221373 │ Total number of mails encountered from all mail sources.     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │     1067 │ Number of mails individuality rejected because they were     │
│            │          │ unparseable or did not had enough metadata to compute        │
│            │          │ hashes.                                                      │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │   220306 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │   165621 │ Number of unique hashes.                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │   121943 │ Number of unique mails (which where automaticcaly added to   │
│            │          │ selection).                                                  │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │    98363 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │          │ sets with at least 2 mails).                                 │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │     8644 │ Number of mails ignored in the selection phase because the   │
│            │          │ whole set they belongs to was skipped.                       │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │    40268 │ Number of mails discarded from the final selection.          │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │   171394 │ Number of mails kept in the final selection on which the     │
│            │          │ action will be performed.                                    │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │        0 │ Number of mails copied from their original mailbox to        │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │        0 │ Number of mails moved from their original mailbox to         │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │   171394 │ Number of mails deleted from their mailbox in-place.         │
╘════════════╧══════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤══════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets     │   Metric │ Description                                                │
╞════════════════════╪══════════╪════════════════════════════════════════════════════════════╡
│ Total              │   165621 │ Total number of duplicate sets.                            │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Single             │   121943 │ Total number of sets containing a single mail and did not  │
│                    │          │ had to have a strategy applied to. They were automatticaly │
│                    │          │ kept in the final selection.                               │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │        0 │ Number of sets skipped from the selection process because  │
│                    │          │ they had encoding issues.                                  │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size     │      510 │ Number of sets skipped from the selection process because  │
│                    │          │ they were too disimilar in size.                           │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipeed - Content  │        0 │ Number of sets skipped from the selection process because  │
│                    │          │ they were too disimilar in content.                        │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │        0 │ Number of sets skipped from the selection process because  │
│                    │          │ the strategy could not be applied.                         │
├────────────────────┼──────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated       │    40268 │ Number of valid sets on which the selection strategy was   │
│                    │          │ successfully applied.                                      │
╘════════════════════╧══════════╧════════════════════════════════════════════════════════════╛
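
The mail counts in this report can be cross-checked with a few lines of arithmetic. The breakdown below is a reading of the figures, not something the tool prints: it assumes that every duplicate mail is either skipped, discarded, or selected. Under that assumption, the 171394 mails marked for deletion include all 121943 unique mails.

```python
# Figures copied from the "Mails" table of the report.
found, rejected = 221373, 1067
unique, duplicates = 121943, 98363
skipped, discarded = 8644, 40268
reported_selected = 171394

retained = found - rejected
# Assumption: duplicates that are neither skipped nor discarded are selected.
selected_from_duplicates = duplicates - skipped - discarded
selected = unique + selected_from_duplicates

print(retained)                  # 220306, matches "Retained"
print(selected_from_duplicates)  # 49451
print(selected)                  # 171394, matches "Selected" and "Deleted"
```

So only 49451 of the deleted mails come from duplicate sets, and the remaining 121943 are mails that had no duplicate at all.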


nutria007 commented on July 23, 2024

I've managed to do the intended action by making it skip unique emails:

https://github.com/nutria007/mail-deduplicate/blob/872ec4a10dd391c9ad1650e7317b75dd52d24abe/mail_deduplicate/deduplicate.py#L404-L422

mail_deduplicate/mail_deduplicate/deduplicate.py
Lines 404 to 422

            # Unique mails are always selected. No need to mobilize the whole
            # DuplicateSet machinery.
            if mail_count == 1:
                self.stats["mail_unique"] += 1
                self.stats["set_single"] += 1
                if self.conf.action == "delete-selected":
                    logger.debug("Skipped deletion of unique mail.")
                    self.stats["mail_skipped"] += 1
                else:
                    logger.debug("Add unique message to selection.")
                    self.stats["mail_selected"] += 1
                    candidates = mail_set
            # We need to resort to a selection strategy to discriminate mails
            # within the set.
            else:
                duplicates = DuplicateSet(hash_key, mail_set, self.conf)
                candidates = duplicates.select_candidates()
                # Merge duplicate set's stats to global stats.
                self.stats += duplicates.stats
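
Stripped of the statistics bookkeeping, the idea behind this patch can be sketched as a selection loop that never offers single-mail sets to a destructive action. The `select_candidates` and `all_but_one` helpers and the plain string mail IDs are hypothetical simplifications, not the project's code.

```python
from typing import Callable, Dict, Set


def all_but_one(mails: Set[str]) -> Set[str]:
    """Select every mail of a set except one (the smallest ID, for determinism)."""
    return set(mails) - {min(mails)}


def select_candidates(
    mail_sets: Dict[str, Set[str]],
    strategy: Callable[[Set[str]], Set[str]],
) -> Set[str]:
    """Apply the strategy to real duplicate sets only, skipping unique mails."""
    candidates: Set[str] = set()
    for mails in mail_sets.values():
        if len(mails) < 2:
            continue  # unique mail: never a candidate for deletion
        candidates |= strategy(mails)
    return candidates


mail_sets = {"hash-1": {"a"}, "hash-2": {"b", "c", "d"}}
to_delete = select_candidates(mail_sets, all_but_one)
```

In this example the unique mail "a" is left alone and one copy from the duplicate set survives.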


dschrempf commented on July 23, 2024

Any progress on this? select-newer also doesn't work as expected. Single mails without duplicates are still selected.

│ Mails      │   Metric │ Description                                                  │
╞════════════╪══════════╪══════════════════════════════════════════════════════════════╡
│ Found      │     1060 │ Total number of mails encountered from all mail sources.     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │        0 │ Number of mails rejected individually because they were      │
│            │          │ unparseable or did not have enough metadata to compute       │
│            │          │ hashes.                                                      │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │     1060 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │     1060 │ Number of unique hashes.                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │     1060 │ Number of unique mails (which where automatically added to   │
│            │          │ selection).                                                  │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │        0 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │          │ sets with at least 2 mails).                                 │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │        0 │ Number of mails ignored in the selection phase because the   │
│            │          │ whole set they belong to was skipped.                        │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │        0 │ Number of mails discarded from the final selection.          │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │     1060 │ Number of mails kept in the final selection on which the     │
│            │          │ action will be performed.                                    │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │        0 │ Number of mails copied from their original mailbox to        │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │        0 │ Number of mails moved from their original mailbox to         │
│            │          │ another.                                                     │
├────────────┼──────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │     1060 │ Number of mails deleted from their mailbox in-place.         │

Command line was

 mdedup -s select-newer -a delete-selected -n -t ctime maildir


leggewie commented on July 23, 2024

@dschrempf Can you please open a new ticket regarding select-newer misbehaving?


github-actions commented on July 23, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
