duff - Duplicate file finder
============================

0. Introduction
===============

Duff is a command-line utility for identifying duplicates in a given set of
files.  It attempts to be usably fast, and uses SHA1 checksums as a part of
the comparisons.

The project website is here:

  http://duff.sourceforge.net/

Duff resides in public CVS on cvs.sourceforge.net.  The CVSROOT for anonymous,
read-only access is:

  :pserver:[email protected]:/cvsroot/duff

The CVS module for duff 0.x is `duff'.

The version numbering scheme for duff is as follows:

 * The first number is the major version.  This will be updated upon what the
   author considers a round of feature completion.  The only feature currently
   missing for the next major release is i18n.

 * The second number is the minor version number.  This is updated for releases
   that include minor new features, or changes that do not alter the
   functionality of the program.

 * The third number, if present, is the bugfix release number.  This indicates
   a release which only fixes bugs present in a previous major or minor release.


1. License and copyright
========================

Duff is copyright (c) 2005 Camilla Berglund <[email protected]>

Duff is licensed under the zlib/libpng license.  See the file `COPYING' for
license details.  The license is also included at the top of each source file.

Duff contains sha1-asaddi.
Copyright (c) 2001-2003 Allan Saddi <[email protected]>
See the files `sha1.c' or `sha1.h' for license details.


2. Project news
===============

See the file `NEWS'.


3. Building Duff
================

If you got this source tree from a CVS repository, you will need to bootstrap
the build environment using `bootstrap.sh'.  Note that this script requires
autoconf and automake to run.

If (or once) you have a `configure' script, go ahead and run it.  No additional
magic should be required.  If it is, then that's a bug and should be reported.

This release of duff has been successfully built on the following systems:

  Arch Linux x86
  Darwin 7.9.0 powerpc
  Debian Etch powerpc
  Debian Sarge alpha
  FreeBSD 4.11 x86
  FreeBSD 5.4 x86
  NetBSD 1.6.1 sparc
  SunOS 5.9 sparc64
  Ubuntu Breezy x86

Earlier releases have been successfully built on the following systems:

  Arch Linux x86
  Darwin 7.9.0 powerpc
  Debian Etch powerpc
  Debian Sarge alpha
  FreeBSD 4.11 x86
  FreeBSD 5.4 x86
  SunOS 5.9 sparc64

The tools used were gcc and GNU or BSD make, but duff should build on most
Unix systems without modification.


4. Installing Duff
==================

See the file `INSTALL'.


5. Using Duff
=============

See the accompanying manpage duff(1).

To read the manpage before installation, use the following command:

  groff -mdoc -Tascii duff.1 | less -R

On Linux systems, however, the following command may suffice:

  man -l duff.1


6. Hacking Duff
===============

See the file `HACKING'.


7. Bugs, feedback and patches
=============================

Please send bug reports, feedback, patches and cookies to:
Camilla Berglund <[email protected]>

For more involved discussions, please join the mailing list:
http://lists.sourceforge.net/lists/listinfo/duff-devel


8. Disambiguation
=================

This is duff, the Unix command-line utility, and not DUFF, the Windows program.
If you wish to find duplicate files on Windows, use DUFF.


9. Release history
==================

Version 0.1 was named `duplicate', and was never released anywhere.

Version 0.2 was the first release named duff.  It lacked a real checksumming
algorithm, and was thus only released to a few individuals, during the first
half of 2005.

Version 0.3 was the first official release, on November 22, 2005, after a
prolonged search for a suitably licensed implementation of SHA1.

Version 0.3.1 was a bugfix release, on November 27, 2005, adding a single
feature (-z), which just happened to get included.

Version 0.4 was the second feature release, on January 13, 2006, adding a
number of missing and/or requested features as well as bug fixes.  It was the
first release to be considered stable and safe enough for everyday use.

Version 0.5 improves the algorithm that searches for duplicates by
sorting the list of entries.  The changes in this version were contributed
by James Craig Burley <[email protected]>.

duff's Issues

Implement a more efficient data structure

See duffdriver.c:155 for this TODO (has_recurse_directory).

Change to getopt_long and add long options

Is this necessary?

Optimize counting number of entries by incrementing the count each time we add an entry

See duffdriver.c:344 for more (report_clusters).
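A minimal sketch of the idea, not duff's actual code: the list type carries a running count that is bumped whenever an entry is added, so reporting never has to walk the list just to count it. The names `struct entry`, `struct entry_list`, `entry_list_add` and `entry_list_free` are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list types; duff's real structures may differ. */
struct entry {
    struct entry *next;
    const char *path;       /* borrowed pointer, not copied */
};

struct entry_list {
    struct entry *head;
    size_t count;           /* kept in sync by the functions below */
};

/* Prepend an entry and bump the running count.
 * Returns 0 on success, -1 if allocation fails. */
int entry_list_add(struct entry_list *list, const char *path)
{
    struct entry *e = malloc(sizeof *e);

    if (e == NULL)
        return -1;
    e->path = path;
    e->next = list->head;
    list->head = e;
    list->count++;
    return 0;
}

/* Release every node and reset the count. */
void entry_list_free(struct entry_list *list)
{
    while (list->head != NULL) {
        struct entry *next = list->head->next;

        free(list->head);
        list->head = next;
    }
    list->count = 0;
}
```

With this layout, a reporting function could read `list->count` directly instead of iterating, at the cost of one extra field that every add and remove path must keep up to date.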

join-duplicates.sh throws this error: "mktemp: too few X's in template `...'"

Probably need to quote the result of the dirname command substitution; currently only its $file argument is quoted. Otherwise, spaces in the path can lead mktemp to think a template is being supplied, rather than using the default.

Fix manpage formatting of sample commands

Detect duplicate file arguments on command line

Use a read buffer

See duffentry.c:352 (compare_entry_contents).
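One way a read buffer might look, as a sketch under assumptions rather than a patch for duff: both streams are read in fixed-size blocks and compared with `memcmp`. The function name `streams_equal` and the buffer size are hypothetical, and how this would be wired into compare_entry_contents is not shown.

```c
#include <stdio.h>
#include <string.h>

#define COMPARE_BUFFER_SIZE 8192    /* arbitrary size chosen for this sketch */

/* Compare two open streams block by block.
 * Returns 1 if the remaining contents are identical, 0 if they differ,
 * and -1 if a read error occurs. */
int streams_equal(FILE *a, FILE *b)
{
    unsigned char buf_a[COMPARE_BUFFER_SIZE];
    unsigned char buf_b[COMPARE_BUFFER_SIZE];

    for (;;) {
        size_t n_a = fread(buf_a, 1, sizeof buf_a, a);
        size_t n_b = fread(buf_b, 1, sizeof buf_b, b);

        if (ferror(a) || ferror(b))
            return -1;
        if (n_a != n_b || memcmp(buf_a, buf_b, n_a) != 0)
            return 0;
        if (n_a < sizeof buf_a)
            return 1;   /* short read on both sides: end of both streams */
    }
}
```

Since duff would normally compare only files of equal size, the length mismatch check is mostly a safeguard against files changing while they are being read.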

0.5 ChangeLog, CHANGES, etc. need to be more thorough

Not all changes have been documented.

Implement i18n through gettext

Level of interest? Licensing?

"Stable" sort not necessarily stable

duffdriver.c's cmpentryp() uses the relationship between the left and right pointers to determine an ordering if nothing else differs. The idea here is to provide a "stable" sort, so the order of output (reporting), within a set of duplicates, is the same as the input ordering.

Whether this is even needed, I don't know. In any case, I'm concerned that qsort() might be moving those pointers around. To be safe, something intrinsic to the relative Entry objects should be used -- perhaps something as simple as a monotonically increasing counter so each Entry has a unique ID.
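A sketch of the counter idea, assuming entries are sorted through an array of pointers: each entry gets a sequence number when it is created, and the comparator falls back to that number instead of to addresses. The type `struct sort_entry`, its fields, and the function names are hypothetical; the real cmpentryp() compares more than size.

```c
#include <stdlib.h>

/* Hypothetical entry type; only the fields needed for the sketch. */
struct sort_entry {
    long size;
    unsigned long seq;  /* assigned once, in input order, never reused */
};

/* qsort comparator over an array of entry pointers.
 * Ties on size are broken by seq, which does not depend on where
 * qsort has moved the pointers, so the result is deterministic. */
static int cmp_sort_entry_ptrs(const void *lp, const void *rp)
{
    const struct sort_entry *l = *(const struct sort_entry *const *)lp;
    const struct sort_entry *r = *(const struct sort_entry *const *)rp;

    if (l->size != r->size)
        return (l->size > r->size) - (l->size < r->size);
    return (l->seq > r->seq) - (l->seq < r->seq);
}

void sort_entries(struct sort_entry **entries, size_t n)
{
    qsort(entries, n, sizeof *entries, cmp_sort_entry_ptrs);
}
```

Because no two entries share a sequence number, the comparator never returns 0 for distinct entries, which gives the effect of a stable sort even though qsort() itself makes no stability guarantee.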

Add ALGORITHM section (or similar) to manpage

Check all malloc return values for NULL

Possibly tolerate out-of-memory conditions by backing off some aggressive optimizations? But at least don't blindly proceed assuming it always returns a non-NULL value.
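The simplest form of "don't blindly proceed" is a checked wrapper, sketched here. The name `xmalloc` and the error message are assumptions for illustration, not part of duff; backing off optimizations under memory pressure would need a different design in which callers handle a NULL return themselves.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocate or exit with a diagnostic.
 * A zero-byte request is rounded up to one byte so that a NULL
 * return always means failure. */
void *xmalloc(size_t size)
{
    void *p = malloc(size ? size : 1);

    if (p == NULL) {
        fprintf(stderr, "duff: out of memory (%lu bytes requested)\n",
                (unsigned long) size);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

Call sites would then use `xmalloc(n)` in place of `malloc(n)` wherever exiting on failure is acceptable.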

Extend to support "internal" files?

This would mean allowing e.g. an email to be pulled apart and the individual attachments treated as (virtual) entries; ditto a tarball, a compressed file, a compressed tarball, etc.

Make this less suboptimal

See TODO in duffdriver.c:263 for details (process_path).

Compare contents directly for clusters of two in excess mode?

(enhancement request) Order by pathname (for excess mode)?

I rely on duff a lot and often find that the file that appears in the excess mode is not the copy of the file I wish to delete (one file has been sorted into the right subfolder, the other has not).

I try to manipulate the names of the subdirectories to force the order to be how I want, but I don't think it can be fully controlled.

How can I control which copy appears in excess mode? If there is no such control, how difficult would it be to introduce such a feature?
