
License: GNU General Public License v3.0


Cross-Language Dataset

Description

This is a multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection. More precisely, its characteristics are the following:

  • it is multilingual: French, English and Spanish;
  • it proposes cross-language alignment information at different granularities: document-level, sentence-level and chunk-level;
  • it is based on both parallel and comparable corpora;
  • it contains both human and machine translated text;
  • part of it has been altered (to make cross-language similarity detection harder) while the rest remains noise-free;
  • documents were written by several kinds of authors, ranging from amateurs to professionals.

Scientific paper

The dataset and its construction are described in the following paper:

A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab. In the 10th edition of the Language Resources and Evaluation Conference (LREC 2016).

Characteristics

| Sub-corpus | Alignment | Authors | Translations | Translators | Alteration | NE (%) |
|---|---|---|---|---|---|---|
| JRC-Acquis [2] | Parallel | Politicians | Human | Professional | No | 3.74 |
| Europarl [1] | Parallel | Politicians | Human | Professional | No | 7.74 |
| Wikipedia [2] | Comparable | Anyone | - | - | Noise | 8.37 |
| PAN-PC-11 [3] | Parallel | Professional authors | Human | Professional | Yes | 3.24 |
| APR (Amazon Product Reviews) [4] | Parallel | Anyone | Machine | Google Translate | No | 6.04 |
| Conference papers | Comparable | Computer scientists | Human | Computer scientists | Noise | 9.36 |

Statistics

| Sub-corpus | # Aligned documents | # Aligned sentences | # Aligned noun chunks |
|---|---|---|---|
| JRC-Acquis [2] | 10,000 | 149,506 | 10,094 |
| Europarl [1] | 9,431 | 475,834 | 25,603 |
| Wikipedia [2] | 10,000 | 4,792 | 132 |
| PAN-PC-11 [3] | 2,920 | 88,977 | 1,360 |
| APR (Amazon Product Reviews) [4] | 6,000 | 23,235 | 2,603 |
| Conference papers | 35 | 1,304 | 272 |

For more statistics, see the stats/ directory.

Repository description

  • In the dataset/documents/ directory, you can find the dataset of parallel and comparable files aligned at document-level (one file represents one document).
  • In the dataset/sentences/ directory, you can find the dataset of parallel and comparable files aligned at sentence-level (one line of a file represents one sentence).
  • In the dataset/chunks/ directory, you can find the dataset of parallel and comparable files aligned at chunk-level (one line of a file represents one noun chunk).
  • In the dataset/documents/Conference_papers/ directory, you can also find a pdf_conference_papers/ directory containing the original scientific papers in PDF format.
  • In the dataset/*/PAN11/ sub-directories, you can also find a metadata/ directory containing additional information about the PAN-PC-11 [3] alignments.
  • In the docs/ directory, you can find the papers related to the dataset.
  • In the masks/ directory, you can find the masks (which describe the pairs) used to build the folds during our evaluation.
  • In the scripts/ directory, you can find all the useful files to re-build the dataset from the pre-existing corpora.
  • In the stats/ directory, you can find the XLSX file with statistics on the dataset.
  • In the study/ directory, you can find XLSX files related to the study conducted in the BUCC 2017 paper: Deep Investigation of Cross-Language Plagiarism Detection Methods.
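The layout above implies that, at sentence and chunk level, each pair of language files is line-parallel: line i of one file aligns with line i of the other. A minimal Python reading sketch (the file paths and file naming here are hypothetical, not verified paths from the dataset):

```python
from pathlib import Path

def read_aligned_pairs(src_file: str, tgt_file: str):
    """Read two line-parallel files, where line i of each file
    forms one cross-language pair (sentence or noun chunk)."""
    src_lines = Path(src_file).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_file).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("aligned files must have the same number of lines")
    return list(zip(src_lines, tgt_lines))
```

At document level the same idea applies per file rather than per line: one file on each side represents one aligned document.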

Scripts directory

This directory contains the scripts we used to build the corpus. We also provide them in case someone is interested in extending it.

  • In the scripts/chunking/ directory, you can find a script to extract noun chunks from a TreeTagger [5] POS sequence.
  • In the scripts/create_translations_dico/ directory, you can find a script to build a unigram translation dictionary for use with HunAlign [6].
  • In the scripts/create_verif_align/ directory, you can find a script to print and save the alignments in a readable format.
  • In the scripts/enrich_dico_with_dbnary/ directory, you can find a script to enrich a unigram translation dictionary with DBNary [7] entries.
  • In the scripts/parse_APR_collection/ directory, you can find a script to parse the Webis-CLS-10 [4] corpus and extract the English-French pairs.
  • In the scripts/parse_PAN_collection/ directory, you can find a script to parse the PAN-PC-11 [3] corpus and extract the English-Spanish pairs with their metadata.
  • In the scripts/parse_conf_papers_bibtex/ directory, you can find a script to parse the TALN Archives BibTeX [8] and crawl the web in order to build the French-English conference paper pairs.

To manage the encoding of the files, we use the ForceUTF8 [9] class by Sebastián Grignoli.
To detect the language of a text, we use Nicholas Pisarro's PHP implementation [10] of the Cavnar and Trenkle (1994) [11] classification algorithm.
To query DBNary [7], we use a PHP class interface [12].

If you have additional questions, please send them to me by email at [email protected].

References, tools used and pre-existing collections

  1. Europarl
    Philipp Koehn (2005).
    Europarl: A Parallel Corpus for Statistical Machine Translation.
    In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86. AAMT.
    url: http://opus.lingfil.uu.se/Europarl.php

  2. CL-PL-09 (JRC-Acquis + Wikipedia)
    Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso (2011).
    Cross-Language Plagiarism Detection.
    In Language Resources and Evaluation, volume 45, pages 45–62.
    url: http://users.dsic.upv.es/grupos/nle/downloads.html

  3. PAN-PC-11
    Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso (2010).
    An Evaluation Framework for Plagiarism Detection.
    In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, August 2010. Association for Computational Linguistics.
    url: http://www.uni-weimar.de/en/media/chairs/webis/corpora/pan-pc-11/

  4. Webis-CLS-10 (Amazon Product Reviews)
    Peter Prettenhofer and Benno Stein (2010).
    Cross-language text classification using structural correspondence learning.
    In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1118–1127.
    url: http://www.uni-weimar.de/en/media/chairs/webis/corpora/corpus-webis-cls-10/

  5. TreeTagger
    Helmut Schmid (1994).
    Probabilistic Part-of-Speech Tagging Using Decision Trees.
    In Proceedings of the International Conference on New Methods in Language Processing.
    url: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

  6. HunAlign
    Dániel Varga, Péter Halácsy, Viktor Nagy, László Németh, András Kornai, and Viktor Trón (2005).
    Parallel corpora for medium density languages.
    In Recent Advances in Natural Language Processing (RANLP 2005), pages 590–596.
    url: http://mokk.bme.hu/en/resources/hunalign/
    licence: GNU LGPL version 2.1 or later

  7. DBNary
    Gilles Sérasset (2014).
    DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF.
    To appear in the Semantic Web Journal (special issue on Multilingual Linked Open Data).
    url: http://kaiko.getalp.org/about-dbnary/
    licence: Creative Commons Attribution-ShareAlike 3.0

  8. TALN Archives
    Florian Boudin (2013).
    TALN Archives: a digital archive of French research articles in Natural Language Processing (TALN Archives : une archive numérique francophone des articles de recherche en Traitement Automatique de la Langue) [in French].
    In Proceedings of TALN 2013 (Volume 2: Short Papers), pages 507–514.
    url: https://github.com/boudinfl/taln-archives
    licence: Creative Commons Attribution-NonCommercial 3.0

  9. ForceUTF8
    url: https://github.com/neitanod/forceutf8
    licence: BSD

  10. Text Language Detect
    url: https://github.com/webmil/text-language-detect
    licence: BSD

  11. William B. Cavnar and John M. Trenkle (1994).
    N-Gram-Based Text Categorization.
    In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175.

  12. DBNary PHP Interface
    url: https://github.com/FerreroJeremy/DBNary-PHP-Interface
    licence: Creative Commons Attribution-ShareAlike 4.0 International

Credit

When you use this dataset, please cite:

@inproceedings{CrossLanguageDatasetLREC2016,
  TITLE = {{A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection}},
  AUTHOR = {J{\'e}r{\'e}my Ferrero and Fr{\'e}d{\'e}ric Agn{\`e}s and Laurent Besacier and Didier Schwab},
  BOOKTITLE = {{The 10th edition of the Language Resources and Evaluation Conference (LREC 2016)}},
  ADDRESS = {Portoro{\v z}, Slovenia},
  YEAR = {2016},
  MONTH = May,
  KEYWORDS = {Cross-language plagiarism detection ; Dataset ; Cross-language dataset ; Cross-language similarity detection ; Evaluation},
}


Issues

Can't unzip files in /mask

Hello,

Could you please check whether the .zip files under the /mask directory are corrupted? When I attempt to unzip one, a .cpgz file is created, and when I unzip the .cpgz file, another .zip file is created... it's an endless loop.

Thanks for your help!!

Meaning of file name in masks

Hello again,

Another question about the mask files: in the following line from the README,

{"_id":{"$oid":"56bdbf0fe405a41c1f8b4569"},"0":0,"1":2,"2":"1462817114-25","3":"727911955-101"}

what does the 25 in 1462817114-25 denote (and similarly, the 101 in 727911955-101)? The README says these are file names, but I checked the corpus and found only file names such as 1462817114 and 727911955, without the trailing part. Do 25 and 101 refer to line indices? (Oftentimes this number exceeds the total number of lines.)

Thanks again for your help!
