This project forked from vefthym/minoaner

Minoan ER is an Entity Resolution (ER) framework, built by researchers in Crete (the land of the ancient Minoan civilization). Entity resolution aims to identify descriptions that refer to the same entity within or across knowledge bases.

Home Page: http://www.csd.uoc.gr/~vefthym/minoanER/

License: Apache License 2.0

Java 100.00%

MinoanER

The website of the project is http://www.csd.uoc.gr/~vefthym/minoanER/

The functionality of this framework is described in detail in the following PhD thesis (mostly in Chapter 4):
http://csd.uoc.gr/~vefthym/DissertationEfthymiou.pdf

MinoanER is implemented in Java 8+, using Apache Spark. We assume that a Spark cluster is available. Our code has been tested in a Spark cluster with HDFS and Mesos.

The steps followed by MinoanER are Blocking, Meta-blocking and Matching. Currently, the (token) blocking step is taken from https://github.com/vefthym/ERframework/blob/master/src/NewApproaches/ExportDatasets.java, but it could easily be incorporated into this repository as a Spark task as well.

To cite this work, please use the following reference:
"Vasilis Efthymiou, George Papadakis, Kostas Stefanidis, Vassilis Christophides: MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities. EDBT 2019: 373-384"
Pdf available here: https://openproceedings.org/2019/conf/edbt/EDBT19_paper_44.pdf

Running MinoanER

The main file is https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/workflow/Main.java. As documented in this file, it expects five input paths and one output path, passed as runtime arguments:

inputBlocking:
The resulting blocks from token blocking. You can generate such a file from https://github.com/vefthym/ERframework/blob/master/src/NewApproaches/ExportDatasets.java. Each line corresponds to a block and its contents. The formatting should be:
blockId TAB entityIdFromD1#entityIdFromD1# ... ;entityIdFromD2#entityIdFromD2# ...
All of these IDs must be positive integers.
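
As a minimal sketch of the line format above, the following helper parses one block line into its block ID and the two lists of entity IDs. The class name BlockLine and its methods are hypothetical and not part of MinoanER. It assumes that '#' separates (and may also trail) the IDs and that a single ';' separates the two datasets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of MinoanER: parses one inputBlocking line. */
final class BlockLine {
    final int blockId;
    final List<Integer> d1Ids;
    final List<Integer> d2Ids;

    private BlockLine(int blockId, List<Integer> d1Ids, List<Integer> d2Ids) {
        this.blockId = blockId;
        this.d1Ids = d1Ids;
        this.d2Ids = d2Ids;
    }

    /** Expects: blockId TAB id#id#...;id#id#... */
    static BlockLine parse(String line) {
        String[] idAndContents = line.split("\t", 2);
        if (idAndContents.length != 2) {
            throw new IllegalArgumentException("Expected 'blockId TAB contents': " + line);
        }
        String[] sides = idAndContents[1].split(";", -1);
        if (sides.length != 2) {
            throw new IllegalArgumentException("Expected exactly one ';' separator: " + line);
        }
        return new BlockLine(parsePositive(idAndContents[0]),
                parseIds(sides[0]), parseIds(sides[1]));
    }

    private static List<Integer> parseIds(String side) {
        List<Integer> ids = new ArrayList<>();
        for (String token : side.split("#")) {
            if (!token.trim().isEmpty()) { // tolerate a trailing '#'
                ids.add(parsePositive(token));
            }
        }
        return ids;
    }

    private static int parsePositive(String text) {
        int value = Integer.parseInt(text.trim());
        if (value <= 0) {
            throw new IllegalArgumentException("IDs must be positive integers: " + text);
        }
        return value;
    }

    public static void main(String[] args) {
        BlockLine block = parse("7\t1#2#;10#11#");
        System.out.println(block.blockId + " -> " + block.d1Ids + " | " + block.d2Ids);
    }
}
```

Running such a check over a sample of the blocking file before submitting the Spark job can catch malformed lines early.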

inputTriples1/2:
The raw RDF triples of the first/second KB in N-triples format (without the trailing " ." part).
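
If your dump is in standard N-Triples, each statement ends with a '.' terminator that has to be removed first. The sketch below shows one way to do that per line; NTriplesTrimmer is a hypothetical name, not a MinoanER class. It relies on the fact that no N-Triples term (IRI, literal or blank node) ends with '.', and it does not handle comment lines.

```java
/** Hypothetical helper, not part of MinoanER: drops the N-Triples statement terminator. */
final class NTriplesTrimmer {

    /** Returns the line without its trailing '.' terminator, if it has one. */
    static String stripTerminator(String line) {
        String trimmed = line.trim();
        // IRIs end with '>', literals with '"', a language tag or a datatype IRI,
        // so a '.' as the last character can only be the statement terminator.
        if (trimmed.endsWith(".")) {
            return trimmed.substring(0, trimmed.length() - 1).trim();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String triple = "<http://example.org/a> <http://example.org/name> \"A. B.\" .";
        System.out.println(stripTerminator(triple));
    }
}
```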

entityIds1/2:
To save space, all entity URLs are replaced with numeric (positive integer) IDs. You must provide a file containing this mapping. Each line corresponds to one mapping and should have the form:
entityURL TAB numericId
The same numericId must not be assigned to two different entityURLs, and the entityURLs must be the ones appearing in the raw RDF input (inputTriples1/2).
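
The constraints on the mapping file can be checked before a run. The following is a minimal sketch, with a hypothetical class name EntityIdMapping, that parses the lines of one mapping file and rejects non-positive IDs and IDs that are assigned to two different URLs. It works on an in-memory list of lines; reading them from HDFS or a local file is left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of MinoanER: validates an entityIds mapping. */
final class EntityIdMapping {

    /** Parses "entityURL TAB numericId" lines into a URL -> ID map. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> urlToId = new HashMap<>();
        Map<Integer, String> idToUrl = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'entityURL TAB numericId': " + line);
            }
            String url = parts[0];
            int id = Integer.parseInt(parts[1].trim());
            if (id <= 0) {
                throw new IllegalArgumentException("numericId must be positive: " + line);
            }
            // put() returns the URL previously stored under this ID, if any.
            String previousUrl = idToUrl.put(id, url);
            if (previousUrl != null && !previousUrl.equals(url)) {
                throw new IllegalArgumentException(
                        "numericId " + id + " is assigned to both " + previousUrl + " and " + url);
            }
            urlToId.put(url, id);
        }
        return urlToId;
    }

    public static void main(String[] args) {
        List<String> lines = java.util.Arrays.asList(
                "http://example.org/a\t1", "http://example.org/b\t2");
        System.out.println(parse(lines));
    }
}
```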

outputPath:
The (HDFS) path in which the output mappings will be stored. The format of the generated output is:
entityIdFromD1 TAB entityIdFromD2
for each pair of entities that have been found to match.
WARNING: the outputPath directory is deleted on each run.
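
Since the output contains only numeric IDs, you will typically want to translate each matched pair back to entity URLs. A minimal sketch, assuming you have already inverted your entityIds1/2 mappings into ID -> URL maps (MatchResolver is a hypothetical name, not a MinoanER class):

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper, not part of MinoanER: maps an output pair back to URLs. */
final class MatchResolver {

    /** Resolves one "entityIdFromD1 TAB entityIdFromD2" line to {urlFromD1, urlFromD2}. */
    static String[] resolve(String outputLine,
                            Map<Integer, String> idToUrl1,
                            Map<Integer, String> idToUrl2) {
        String[] ids = outputLine.split("\t");
        if (ids.length != 2) {
            throw new IllegalArgumentException("Expected 'id1 TAB id2': " + outputLine);
        }
        String url1 = idToUrl1.get(Integer.parseInt(ids[0].trim()));
        String url2 = idToUrl2.get(Integer.parseInt(ids[1].trim()));
        if (url1 == null || url2 == null) {
            throw new IllegalArgumentException("Unknown entity ID in: " + outputLine);
        }
        return new String[] {url1, url2};
    }

    public static void main(String[] args) {
        String[] pair = resolve("1\t5",
                Collections.singletonMap(1, "http://kb1.example.org/a"),
                Collections.singletonMap(5, "http://kb2.example.org/a"));
        System.out.println(pair[0] + " matches " + pair[1]);
    }
}
```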

You can find examples of the datasets used in MinoanER on the project's website: http://csd.uoc.gr/~vefthym/minoanER/datasets.html.

Setup and Tuning

You can tune the Spark session parameters (number of workers, executors, memory, parallelism, etc.) through the setUpSpark method in https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/utils/Utils.java. The body of this method should be adjusted to reflect the resources of your Spark cluster.
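
The actual body of setUpSpark is not reproduced here. As a rough sketch of the kind of settings involved, the helper below derives a few standard Spark properties from the cluster's resources. The class SparkSettings, its parameters and the example values are assumptions for illustration only. The factor of 3 tasks per core follows the general recommendation of 2-3 tasks per CPU core in the Spark tuning guide, and is not a value taken from MinoanER.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of MinoanER: derives Spark properties from cluster size. */
final class SparkSettings {

    /** Builds a property map; the keys are standard Spark configuration properties. */
    static Map<String, String> forCluster(int totalCores, int coresPerExecutor,
                                          String executorMemory) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.cores.max", String.valueOf(totalCores));
        conf.put("spark.executor.cores", String.valueOf(coresPerExecutor));
        conf.put("spark.executor.memory", executorMemory);
        // Assumption: 3 tasks per core, within the commonly recommended 2-3 range.
        conf.put("spark.default.parallelism", String.valueOf(totalCores * 3));
        return conf;
    }

    public static void main(String[] args) {
        // Example values only; replace them with the resources of your own cluster.
        for (Map.Entry<String, String> entry : forCluster(32, 4, "8g").entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}
```

Each entry can then be passed to SparkSession.builder().config(key, value) when the session is created.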

In the main method, you will find some hardcoded attributes that act as entity names (labels) for the datasets that we have tested. Those attributes have been generated automatically by getting the top attributes of each KB based on the harmonic mean of support and discriminability (see related publications). You can hardcode the corresponding attributes for your KBs, or find them automatically by calling the methods found in the class https://github.com/vefthym/MinoanER/blob/master/src/main/java/minoaner/relationsWeighting/RelationsRank.java.
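
To illustrate the ranking criterion, the sketch below scores each attribute by the harmonic mean of its support and discriminability and returns the top k. It is not the RelationsRank implementation: AttributeRanker is a hypothetical class, and the two scores are taken as given inputs in [0, 1], since their exact definitions are in the related publications.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of MinoanER: ranks attributes by harmonic mean. */
final class AttributeRanker {

    /** Harmonic mean of two non-negative scores; 0 when both are 0. */
    static double harmonicMean(double support, double discriminability) {
        double sum = support + discriminability;
        return sum == 0.0 ? 0.0 : 2.0 * support * discriminability / sum;
    }

    /** stats maps each attribute to {support, discriminability}. */
    static List<String> topAttributes(Map<String, double[]> stats, int k) {
        List<String> attributes = new ArrayList<>(stats.keySet());
        // Sort by descending harmonic mean of the two scores.
        attributes.sort(Comparator.comparingDouble(
                (String a) -> harmonicMean(stats.get(a)[0], stats.get(a)[1])).reversed());
        return new ArrayList<>(attributes.subList(0, Math.min(k, attributes.size())));
    }

    public static void main(String[] args) {
        Map<String, double[]> stats = new HashMap<>();
        stats.put("label", new double[] {0.9, 0.9});
        stats.put("type", new double[] {1.0, 0.01});
        System.out.println(topAttributes(stats, 1));
    }
}
```

The harmonic mean favors attributes that score well on both criteria: an attribute that is present on every entity but has almost no distinct values, such as a type attribute, ranks below one that is both frequent and distinctive.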
