This project forked from pan-webis-de/benedetto02

Reimplementation of the authorship attribution approach described in "Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto. Language trees and zipping. Physical Review Letters, 88 (4), 048702, 2002" as part of the ECIR 2016 reproducibility study "Who Wrote the Web?"

Home Page: http://www.uni-weimar.de/medien/webis/publications/papers/stein_2016d.pdf

Python 100.00%

benedetto02 - An Approach to Authorship Attribution

This is a reimplementation of the approach to authorship attribution originally described in

Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto. Language trees and zipping. Physical Review Letters, 88 (4), 048702, 2002 [paper]

It was reimplemented as part of a science reproducibility study alongside 14 other authorship attribution approaches. The results of the reproducibility study can be found in

Martin Potthast, Sarah Braun, Tolga Buz, Fabian Duffhauss, Florian Friedrich, Jörg Marvin Gülzow, Jakob Köhler, Winfried Lötzsch, Fabian Müller, Maike Elisa Müller, Robert Paßmann, Bernhard Reinke, Lucas Rettenmeier, Thomas Rometsch, Timo Sommer, Michael Träger, Sebastian Wilhelm, Benno Stein, Efstathios Stamatatos, and Matthias Hagen. Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval. In Advances in Information Retrieval. 38th European Conference on IR Research (ECIR 16), volume 9626 of Lecture Notes in Computer Science, Berlin Heidelberg New York, March 2016. Springer. [paper] [bib]

If you use this reimplementation in your own research, please make sure to cite both of the above papers.
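For orientation, the core idea of the cited paper can be sketched in a few lines: append the unknown text to each candidate's known text, compress both versions, and attribute the unknown text to the candidate whose compressed size grows the least. This is only an illustration, not the code in benedetto02.py. The function names, the use of zlib, and the compression level are assumptions made for this sketch.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (level 9 is an assumption)."""
    return len(zlib.compress(data, 9))


def relative_cost(known: bytes, unknown: bytes) -> int:
    """Extra compressed bytes needed when `unknown` is appended to `known`."""
    return compressed_size(known + unknown) - compressed_size(known)


def attribute(unknown: str, candidates: Dict[str, str]) -> str:
    """Return the candidate whose known text makes `unknown` cheapest to encode.

    `candidates` maps a (hypothetical) author name to that author's known text.
    """
    unknown_bytes = unknown.encode("utf-8")
    return min(
        candidates,
        key=lambda name: relative_cost(candidates[name].encode("utf-8"), unknown_bytes),
    )
```

A smaller relative cost means the compressor could reuse more patterns from the candidate's text, which the approach treats as evidence of shared authorship.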

Usage

To run the software, install it together with all of its dependencies, then execute the following command:

python3 benedetto02.py -i <path-to-input-data> -o <output-path>

Input and Output Formats

The software accepts authorship attribution datasets that are formatted according to the corresponding PAN shared task on authorship attribution. A number of datasets are available from that shared task, and all of them are formatted as follows.

A dataset's TOP_DIRECTORY contains a file meta-file.json, which specifies

  • the language of the texts within (e.g., EN, GR, etc.),
  • the names of the subdirectories that contain texts from candidate authors,
  • the name of the subdirectory that contains texts of unknown authorship, and
  • the name of each file of unknown authorship that is to be attributed to one of the candidate authors.
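The four items above could be read from meta-file.json roughly as follows. This is a minimal sketch: the JSON key names ("language", "candidate-authors", "author-name", "folder", "unknown-texts", "unknown-text") are assumptions about the PAN format and are not stated in this README, so check them against an actual dataset. The function name load_meta is hypothetical.

```python
import json
from pathlib import Path


def load_meta(top_directory):
    """Read meta-file.json from a dataset's TOP_DIRECTORY.

    All JSON key names used below are assumptions about the PAN format.
    """
    meta_path = Path(top_directory) / "meta-file.json"
    with meta_path.open(encoding="utf-8") as handle:
        meta = json.load(handle)
    return {
        # Language code of the texts, e.g. "EN" or "GR".
        "language": meta["language"],
        # Subdirectories holding texts of the candidate authors.
        "candidates": [entry["author-name"] for entry in meta["candidate-authors"]],
        # Subdirectory holding the texts of unknown authorship.
        "unknown_folder": meta["folder"],
        # File names of the texts to be attributed.
        "unknown_texts": [entry["unknown-text"] for entry in meta["unknown-texts"]],
    }
```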

The software takes as input the path to the TOP_DIRECTORY of an inflated (unpacked) dataset and starts the authorship attribution process from there. It writes a file answers.json to the OUTPUT_PATH, formatted as follows:

{
  "answers": [
    {"unknown_text": "unknown00001.txt", "author": "candidate00001", "score": 0.8},
    {"unknown_text": "unknown00002.txt", "author": "candidate00002", "score": 0.9}
  ]
}

where unknown_text is the name of an unknown text as per meta-file.json, author is the name of a candidate author as per meta-file.json, and score is a real value in the range [0,1] that indicates the software's confidence in its attribution (0 means completely uncertain, 1 means completely sure).
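As an illustration of this output format, a file with the structure described above could be produced like this. The sketch is not taken from benedetto02.py; the function name write_answers and the tuple layout of its input are hypothetical.

```python
import json
from pathlib import Path


def write_answers(output_path, attributions):
    """Write answers.json into output_path.

    `attributions` is an iterable of (unknown_text, author, score) tuples;
    this input layout is an assumption made for the sketch.
    """
    answers = []
    for unknown_text, author, score in attributions:
        # The README defines score as a confidence value in [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {unknown_text}: {score}")
        answers.append(
            {"unknown_text": unknown_text, "author": author, "score": score}
        )
    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "answers.json"
    with out_file.open("w", encoding="utf-8") as handle:
        json.dump({"answers": answers}, handle, indent=2)
    return out_file
```

Validating the score before writing keeps out-of-range values from reaching the output file.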

License

Copyright (c) 2015 Bernhard Reinke

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
