This project forked from heindorf/www19-fair-feature-extraction

WWW 2019 Paper: Debiasing Vandalism Detection Models at Wikidata: Feature Extraction Component

License: MIT License

Debiasing Vandalism Detection Models at Wikidata: Feature Extraction

The Wikidata Vandalism Detectors FAIR-E and FAIR-S are machine learning models for automatic vandalism detection in Wikidata without discriminating against anonymous editors. They were developed as a joint project between Paderborn University and Leipzig University.

This is the feature extraction component that extracts features for FAIR-E and FAIR-S. Classification and evaluation for FAIR-E, FAIR-S and the baselines WDVD, ORES, and FILTER can be done with the corresponding classification and evaluation component.

Paper

This source code forms the basis for our WWW 2019 paper Debiasing Vandalism Detection Models at Wikidata. When using the code, please cite the paper as follows:

@inproceedings{heindorf2019debiasing,
  author    = {Stefan Heindorf and
               Yan Scholten and
               Gregor Engels and
               Martin Potthast},
  title     = {Debiasing Vandalism Detection Models at Wikidata},
  booktitle = {{WWW}},
  publisher = {{ACM}},
  year      = {2019}
}

Feature Extraction Component

Requirements

The code was tested with Java 8 under Linux 4.9.0-8-amd64 on a machine with 16 cores and 256 GB RAM.

We require an installation of 7z for decompression.

Installation

We assume the following project structure:

www19-fair
├── data
├── www19-fair-feature-classification
└── www19-fair-feature-extraction

Required Data

Before you can start the feature extraction, you need to download the following data:

  1. Wikidata Vandalism Corpus 2016:

    Expected Path: www19-fair/data/external/wdvc-2016/

  2. Wikidata JSON dump of February 29, 2016:

    Expected Path: www19-fair/data/external/wikidata-20160229-all.json.bz2

  3. WDVD features:

    Expected Path: www19-fair/data/features/wdvd_features.csv.bz2
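Before starting the extraction, it can help to confirm that the three inputs are where the component expects them. The following is a minimal sketch in Python and is not part of this repository: the helper name `missing_inputs` is hypothetical, and the paths are the expected paths listed above, resolved relative to the directory that contains www19-fair.

```python
from pathlib import Path

# Expected input locations, copied from the list above.
REQUIRED_INPUTS = [
    "www19-fair/data/external/wdvc-2016",
    "www19-fair/data/external/wikidata-20160229-all.json.bz2",
    "www19-fair/data/features/wdvd_features.csv.bz2",
]


def missing_inputs(base_dir="."):
    """Return the expected input paths that do not exist under base_dir.

    Hypothetical helper for illustration; base_dir is assumed to be the
    directory that contains www19-fair.
    """
    base = Path(base_dir)
    return [p for p in REQUIRED_INPUTS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:")
        for path in missing:
            print("  " + path)
    else:
        print("All required inputs found.")
```

The check only tests for existence; it does not verify that the downloads are complete or uncorrupted.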

Execute

To start the feature extraction, execute ./run.sh.

Computed Features

This feature extraction component will compute the following feature files:

www19-fair/data/
├── features/
│   ├── test/
│   │   ├── embeddings/
│   │   └── features.csv.bz2
│   ├── training/
│   │   ├── embeddings/
│   │   └── features.csv.bz2
│   └── validation/
│       ├── embeddings/
│       └── features.csv.bz2
├── item-properties/
│   └── item-properties.bz2
└── wikidata-graph/
    └── wikidata-graph.csv.bz2

features: Contains the features for the models FAIR-E and FAIR-S. The file has the following columns: revisionId, isEditingTool, subject, predicate, object, superSubject, superObject. Each row was extracted from the Wikidata Vandalism Corpus 2016 and represents a revision that adds, removes, or updates statements between two Wikidata items.
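As a rough sketch of how a features file could be inspected, the Python snippet below streams rows from a features.csv.bz2 file without decompressing it to disk. It assumes the file is comma-separated and starts with a header row carrying exactly the column names listed above; neither detail is confirmed by this README, so adjust the reader if the actual layout differs. The function name `read_features` is hypothetical.

```python
import bz2
import csv

# Column names as listed above. That they appear as a header row,
# in this order, is an assumption.
COLUMNS = ["revisionId", "isEditingTool", "subject", "predicate",
           "object", "superSubject", "superObject"]


def read_features(path, limit=None):
    """Yield rows of a features.csv.bz2 file as dicts keyed by column name."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != COLUMNS:
            raise ValueError("unexpected header: %r" % (reader.fieldnames,))
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            yield row
```

For example, `list(read_features("www19-fair/data/features/training/features.csv.bz2", limit=5))` would return the first five revisions of the training split under these assumptions.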

embeddings: This folder contains predicate embeddings as described in the paper. We store embeddings in four CSR (compressed sparse row) matrices: subjectOut, predicate, objectOut, objectIn.
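For readers unfamiliar with the CSR layout, the sketch below shows the idea in plain Python: only nonzero entries are kept, in three arrays. It illustrates the general format only and does not reproduce the on-disk format of the embeddings folder, which this README does not specify. The function names `to_csr` and `csr_row` are hypothetical.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the nonzero value itself
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # offset where the next row starts
    return data, indices, indptr


def csr_row(data, indices, indptr, row, n_cols):
    """Rebuild one dense row from the CSR arrays."""
    dense_row = [0] * n_cols
    for k in range(indptr[row], indptr[row + 1]):
        dense_row[indices[k]] = data[k]
    return dense_row


if __name__ == "__main__":
    matrix = [[0, 1, 0],
              [0, 0, 0],
              [2, 0, 3]]
    data, indices, indptr = to_csr(matrix)
    print(data, indices, indptr)  # [1, 2, 3] [1, 0, 2] [0, 1, 1, 3]
```

Because feature matrices of this kind are typically mostly zeros, the three arrays are much smaller than the dense matrix, and row slices stay cheap to read.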

item-properties: The list of Wikidata item properties extracted from the Wikidata JSON dump of February 29, 2016. Item properties are the Wikidata properties used solely to describe relations between two Wikidata items.

wikidata-graph: Statements between two Wikidata items extracted from the Wikidata JSON dump of February 29, 2016. This file contains subject-predicate-object triples, where subject and object are Wikidata items and the predicate is an item property.
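To make the triple structure concrete, here is a small Python sketch that groups already-parsed triples by subject and optionally keeps only those whose predicate is an item property. It works on in-memory tuples on purpose, because the exact file layout of wikidata-graph.csv.bz2 and item-properties.bz2 is not described here. The function name `group_by_subject` and the sample identifiers are illustrative only.

```python
from collections import defaultdict


def group_by_subject(triples, item_properties=None):
    """Map each subject to its list of (predicate, object) pairs.

    triples: iterable of (subject, predicate, object) string tuples.
    item_properties: optional set of predicate IDs to keep; if None,
    every triple is kept.
    """
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        if item_properties is not None and predicate not in item_properties:
            continue
        index[subject].append((predicate, obj))
    return dict(index)


if __name__ == "__main__":
    # Illustrative triples, not taken from the actual output file.
    sample = [
        ("Q42", "P31", "Q5"),
        ("Q42", "P19", "Q350"),
        ("Q350", "P17", "Q145"),
    ]
    print(group_by_subject(sample, item_properties={"P31", "P17"}))
```

A per-subject index like this is one convenient way to look up all outgoing statements of an item; whether the feature extraction itself uses such a structure is not stated in this README.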

Contact

For questions and feedback, please contact:

Stefan Heindorf, Paderborn University
Yan Scholten, Paderborn University
Gregor Engels, Paderborn University
Martin Potthast, Leipzig University

License

The code by Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast is licensed under the MIT License.
