Duometer - near-duplicate detection tool

Duometer efficiently identifies near-duplicate pairs of documents in large collections of texts. It is written in Scala and implements a MinHash algorithm.

For example, to extract text from all files in ~/text-files and identify those that have similar content, run:

./duometer -i ~/text-files -o text-files.duplicates

For more information about how to use duometer see this tutorial.

Features

  • Efficiently finds pairs of documents that contain similar text.
  • Automatically extracts text from files in a wide range of formats (including HTML, PDF, Microsoft Office document formats and the OpenDocument format).
  • Works well with very large collections of documents.
  • Makes use of multiple CPU cores.
  • The default settings should work well in most cases, but you can customize the duplicate detection for your purposes (run ./duometer --help for a full set of options).

Installation

All platforms

The only prerequisite is a Java runtime.

  1. Download the current version of the tool here.
  2. Extract the archive and go to ./bin.
  3. Run ./duometer (on Linux and Mac) or duometer.bat (on Windows).

Debian (Ubuntu)

Download a package and run:

sudo dpkg -i duometer_0.1.3_all.deb

duometer is now installed and should be available as a shell command.

Building

Duometer uses sbt-native-packager. You can build the tool from source by running the dist command in sbt, which creates a .zip archive that can be run on any machine with Java installed.

A Debian binary package can be created by executing debian:packageBin in sbt.

MinHash algorithm

For background information about the algorithm, see the relevant chapter in Manning, Raghavan, and Schütze (2008) or read the original Broder (1997) paper.
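To give a rough feel for the idea, here is a minimal MinHash sketch in Python. It is not Duometer's implementation (the tool itself is written in Scala), and every name in it is hypothetical. It assumes documents are compared by sets of three-word shingles and that hash functions are simulated by salting BLAKE2b with a seed; both choices are illustrative only. The fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word n-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, salted with an integer seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    if not shingle_set:
        raise ValueError("cannot compute a signature for an empty document")
    return [
        min(seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions (estimates Jaccard similarity)."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    sh_a, sh_b = shingles(doc_a), shingles(doc_b)
    estimate = estimated_similarity(minhash_signature(sh_a), minhash_signature(sh_b))
    print(f"exact Jaccard: {jaccard(sh_a, sh_b):.3f}, MinHash estimate: {estimate:.3f}")
```

Because signatures have a fixed length regardless of document size, comparing them is much cheaper than comparing full shingle sets. More hash functions give a tighter estimate at the cost of more computation per document.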

Supported file formats

Duometer uses Apache Tika to extract text from a huge number of different file types. For the full list of supported formats see here.

Contribute

Authors

The tool was developed at the Center for Reading Research, Ghent University, by Paweł Mandera.

License

The project is licensed under the Apache License 2.0.

References

Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings (pp. 21–29). IEEE.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press.
