scify / JedAIToolkit

An open source, high scalability toolkit in Java for Entity Resolution.

Home Page: http://jedai.scify.org

License: Apache License 2.0

Java 100.00%
entity-resolution entity-matching scalability blocking

jedaitoolkit's Introduction

Please check pyJedAI for an implementation of JedAI in native Python.

Please check our paper for a detailed description of version 3.0.

The code for running JedAI on Apache Spark is available here.

The Web Application for running JedAI is available here. A video explaining how to use it is available here.

JedAI is also available as a Docker image here. See below for more details.

The latest version of JedAI-gui is available here.

Java gEneric DAta Integration (JedAI) Toolkit

JedAI is an open source, highly scalable toolkit that offers out-of-the-box solutions for any data integration task, e.g., Record Linkage, Entity Resolution and Link Discovery. At its core lies a set of domain-independent, state-of-the-art techniques that apply to both RDF and relational data. These techniques rely on approximate, schema-agnostic functionality based on (meta-)blocking for high scalability.

JedAI can be used in three different ways:

  1. As an open source library that implements numerous state-of-the-art methods for all steps of the end-to-end ER workflow presented in the figure below.
  2. As a desktop application with an intuitive Graphical User Interface that can be used by both expert and lay users.
  3. As a workbench that compares the relative performance of different (configurations of) ER workflows.

This repository contains the code (in Java 8) of JedAI's open source library. The code of JedAI's desktop application and workbench is available in this repository.

Several datasets already converted into the serialized data type of JedAI can be found here.

You can find a short presentation of JedAI Toolkit here.

Citation

If you use JedAI, please cite the following paper:

George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "The return of JedAI: End-to-End Entity Resolution for Structured and Semi-Structured Data", in VLDB 2018 (pdf).

Consortium

JedAI is a collaborative project involving the following partners:

JedAI Workflow

JedAI supports 3 workflows, as shown in the following images:

Below, we explain in more detail the purpose and the functionality of every step.

Data Reading

It transforms the input data into a list of entity profiles. An entity is a uniquely identified set of name-value pairs (e.g., an RDF resource with its URI as identifier and its set of predicates and objects as name-value pairs).

The following formats are currently supported:

  1. CSV
  2. RDF in any format, e.g., RDF/XML, OWL, HDT, JSON
  3. Relational Databases (MySQL, PostgreSQL)
  4. SPARQL endpoints
  5. Java serialized objects
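As a rough illustration of the data model these readers produce (a hypothetical Python sketch mirroring org.scify.jedai.datamodel.EntityProfile, not the library's actual Java code), an entity profile is just a unique identifier plus a set of name-value pairs:

```python
# Minimal sketch of JedAI's entity-profile abstraction (names hypothetical):
# an entity is a unique identifier plus a list of name-value pairs.
class EntityProfile:
    def __init__(self, entity_url):
        self.entity_url = entity_url   # unique identifier (e.g., an RDF URI)
        self.attributes = []           # list of (name, value) pairs

    def add_attribute(self, name, value):
        self.attributes.append((name, value))

# Reading one record into a profile
profile = EntityProfile("http://example.org/resource/1")
profile.add_attribute("title", "Entity Resolution Survey")
profile.add_attribute("year", "2018")
print(profile.entity_url, len(profile.attributes))
```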

Schema Clustering

This is an optional step, suitable for highly heterogeneous datasets with a schema comprising a large diversity of attribute names. To this end, it groups together attributes that are syntactically similar, but are not necessarily semantically equivalent.

The following methods are currently supported:

  1. Attribute Name Clustering
  2. Attribute Value Clustering
  3. Holistic Attribute Clustering

For more details on the functionality of these methods, see here.

Block Building

It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly after transforming the keys, based on the equality or similarity of each key to the other keys.

The following methods are currently supported:

  1. Standard/Token Blocking
  2. Sorted Neighborhood
  3. Extended Sorted Neighborhood
  4. Q-Grams Blocking
  5. Extended Q-Grams Blocking
  6. Suffix Arrays Blocking
  7. Extended Suffix Arrays Blocking
  8. LSH MinHash Blocking
  9. LSH SuperBit Blocking

For more details on the functionality of these methods, see here.
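To make the blocking-key idea concrete, here is a minimal Python sketch of Standard/Token Blocking (an illustration, not JedAI's implementation): every token of every attribute value becomes a key, and entities sharing a key land in the same block.

```python
from collections import defaultdict

def token_blocking(profiles):
    """Standard/Token Blocking sketch: every token in every attribute value
    becomes a blocking key; entities sharing a key share a block."""
    blocks = defaultdict(set)
    for entity_id, attributes in profiles.items():
        for value in attributes.values():
            for token in value.lower().split():
                blocks[token].add(entity_id)
    # keep only blocks that yield at least one comparison
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}

profiles = {
    1: {"title": "entity resolution toolkit"},
    2: {"title": "entity matching survey"},
    3: {"name": "unrelated record"},
}
print(token_blocking(profiles))  # entities 1 and 2 share the token "entity"
```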

Block Cleaning

Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities.

The following methods are currently supported:

  1. Size-based Block Purging
  2. Cardinality-based Block Purging
  3. Block Filtering
  4. Block Clustering

All methods are optional, but complementary with each other and can be used in combination. For more details on the functionality of these methods, see here.
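As an example of the coarse, block-level granularity of this step, the following is a simplified Python sketch of Block Filtering (the real method is described in the linked material; the ratio parameter and tie-breaking here are illustrative): each entity retains only a portion of its smallest, most discriminative blocks.

```python
from collections import defaultdict

def block_filtering(blocks, ratio=0.5):
    """Block Filtering sketch: each entity keeps only a fraction (ratio) of its
    smallest blocks, discarding its largest, least discriminative ones."""
    # blocks: dict blocking key -> set of entity ids
    entity_blocks = defaultdict(list)
    for key, entities in blocks.items():
        for e in entities:
            entity_blocks[e].append(key)
    # per entity: sort its blocks by size and keep the smallest ones
    kept = {
        e: set(sorted(keys, key=lambda k: len(blocks[k]))[:max(1, int(ratio * len(keys)))])
        for e, keys in entity_blocks.items()
    }
    # an entity stays in a block only if that block is among its kept ones
    return {k: {e for e in v if k in kept[e]} for k, v in blocks.items()}

blocks = {"a": {1, 2, 3}, "b": {1, 2}, "c": {2, 3}}
print({k: sorted(v) for k, v in block_filtering(blocks).items()})
```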

Comparison Cleaning

Similar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.

The following methods are currently supported:

  1. Comparison Propagation
  2. Cardinality Edge Pruning (CEP)
  3. Cardinality Node Pruning (CNP)
  4. Weighted Edge Pruning (WEP)
  5. Weighted Node Pruning (WNP)
  6. Reciprocal Cardinality Node Pruning (ReCNP)
  7. Reciprocal Weighted Node Pruning (ReWNP)
  8. BLAST
  9. Canopy Clustering
  10. Extended Canopy Clustering

Most of these methods are Meta-blocking techniques. All methods are optional but competitive, in the sense that at most one of them can be part of an ER workflow. For more details on the functionality of these methods, see here. They can be combined with one of the following weighting schemes:

  1. Aggregate Reciprocal Comparisons Scheme (ARCS)
  2. Common Blocks Scheme (CBS)
  3. Enhanced Common Blocks Scheme (ECBS)
  4. Jaccard Scheme (JS)
  5. Enhanced Jaccard Scheme (EJS)
  6. Pearson chi-squared test
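To illustrate how such a weighting scheme scores comparisons, here is a minimal Python sketch of the Common Blocks Scheme (CBS), which is an illustration of the idea rather than JedAI's code: the weight of a comparison is simply the number of blocks its two entities share.

```python
from collections import Counter
from itertools import combinations

def cbs_weights(blocks):
    """Common Blocks Scheme (CBS) sketch: the weight of a comparison (e_i, e_j)
    equals the number of blocks that contain both entities."""
    weights = Counter()
    for entities in blocks.values():
        for pair in combinations(sorted(entities), 2):
            weights[pair] += 1
    return weights

blocks = {"x": {1, 2, 3}, "y": {1, 2}, "z": {2, 3}}
w = cbs_weights(blocks)
print(w[(1, 2)], w[(2, 3)], w[(1, 3)])  # 2 2 1
```

Meta-blocking methods then prune the comparisons with the lowest weights, since entities co-occurring in few blocks are unlikely to match.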

Entity Matching

It compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities.

The following schema-agnostic methods are currently supported:

  1. Group Linkage
  2. Profile Matcher, which aggregates all attribute values of an individual entity into a textual representation.

Both methods can be combined with the following representation models:

  1. character n-grams (n=2, 3 or 4)
  2. character n-gram graphs (n=2, 3 or 4)
  3. token n-grams (n=1, 2 or 3)
  4. token n-gram graphs (n=1, 2 or 3)

For more details on the functionality of these bag and graph models, see here.

The bag models can be combined with the following similarity measures, using both TF and TF-IDF weights:

  1. ARCS similarity
  2. Cosine similarity
  3. Jaccard similarity
  4. Generalized Jaccard similarity
  5. Enhanced Jaccard similarity
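As a concrete example of a bag model paired with a similarity measure, the following Python sketch computes Jaccard similarity over character 2-grams (a simplified illustration, without the TF/TF-IDF weighting the library supports):

```python
def jaccard(a, b, n=2):
    """Jaccard similarity over character n-grams (here bigrams), i.e. the size
    of the intersection of the two n-gram sets over the size of their union."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(round(jaccard("JedAI", "JedAI Toolkit"), 3))
```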

The graph models can be combined with the following graph similarity measures:

  1. Containment similarity
  2. Normalized Value similarity
  3. Value similarity
  4. Overall Graph similarity

Any word or character-level pre-trained embeddings are also supported in combination with cosine similarity or Euclidean distance.

Entity Clustering

It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.

The following domain-independent methods are currently supported for Dirty ER:

  1. Connected Components Clustering
  2. Center Clustering
  3. Merge-Center Clustering
  4. Ricochet SR Clustering
  5. Correlation Clustering
  6. Markov Clustering
  7. Cut Clustering

For more details on the functionality of these methods, see here.
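As an illustration of the simplest of these methods, here is a Python sketch of Connected Components Clustering (a union-find implementation of the general idea, not JedAI's code): edges of the similarity graph below a threshold are dropped, and each remaining connected component becomes a cluster.

```python
def connected_components(num_entities, edges, threshold=0.5):
    """Connected Components Clustering sketch: keep similarity-graph edges at or
    above the threshold and return each connected component as a cluster."""
    parent = list(range(num_entities))

    def find(x):  # find the root of x with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, sim in edges:
        if sim >= threshold:
            parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for e in range(num_entities):
        clusters.setdefault(find(e), set()).add(e)
    return sorted(sorted(c) for c in clusters.values())

edges = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
print(connected_components(5, edges))  # [[0, 1, 2], [3], [4]]
```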

For Clean-Clean ER, the following methods are supported:

  1. Unique Mapping Clustering
  2. Row-Column Clustering
  3. Best Assignment Clustering

For more details on the functionality of the first method, see here. The second algorithm implements an efficient approximation of the Hungarian Algorithm, while the third one implements an efficient, heuristic solution to the assignment problem in unbalanced bipartite graphs.
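The core idea of Unique Mapping Clustering can be sketched as follows (a simplified Python illustration, not JedAI's implementation): comparisons are processed in decreasing similarity, and each entity of either dataset is matched at most once, reflecting the Clean-Clean ER assumption that each dataset is duplicate-free.

```python
def unique_mapping_clustering(edges, threshold=0.5):
    """Unique Mapping Clustering sketch for Clean-Clean ER: greedily accept the
    highest-similarity pairs, matching every entity at most once."""
    matched_left, matched_right, pairs = set(), set(), []
    for left, right, sim in sorted(edges, key=lambda e: -e[2]):
        if sim < threshold:
            break  # remaining edges are even weaker
        if left not in matched_left and right not in matched_right:
            matched_left.add(left)
            matched_right.add(right)
            pairs.append((left, right))
    return pairs

edges = [("a1", "b1", 0.9), ("a1", "b2", 0.8), ("a2", "b2", 0.7)]
print(unique_mapping_clustering(edges))  # [('a1', 'b1'), ('a2', 'b2')]
```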

Similarity Join

Similarity Join provides state-of-the-art algorithms for accelerating the computation of a specific character- or token-based similarity measure in combination with a user-defined similarity threshold.

The following token-based similarity join algorithms are supported:

  1. AllPairs
  2. PPJoin
  3. SilkMoth

The following character-based similarity join algorithms are also supported:

  1. FastSS
  2. PassJoin
  3. PartEnum
  4. EdJoin
  5. AllPairs
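To clarify what these algorithms accelerate, here is the naive computation they all improve upon (a quadratic baseline sketch in Python; AllPairs, PPJoin and the others obtain the same result far faster using prefix and length filters):

```python
from itertools import combinations

def jaccard_join(records, threshold=0.8):
    """Naive token-based similarity join: return all record pairs whose
    token-set Jaccard similarity reaches the threshold."""
    toks = [set(r.lower().split()) for r in records]
    out = []
    for i, j in combinations(range(len(records)), 2):
        union = len(toks[i] | toks[j])
        if union and len(toks[i] & toks[j]) / union >= threshold:
            out.append((i, j))
    return out

records = ["entity resolution toolkit",
           "entity resolution toolkit java",
           "other text"]
print(jaccard_join(records, threshold=0.7))  # [(0, 1)]
```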

Comparison Prioritization

Comparison Prioritization associates every comparison in a block collection with a weight that is proportional to the likelihood that it involves duplicates, and then emits the comparisons iteratively, in decreasing order of weight.

The following methods are currently supported:

  1. Local Progressive Sorted Neighborhood
  2. Global Progressive Sorted Neighborhood
  3. Progressive Block Scheduling
  4. Progressive Entity Scheduling
  5. Progressive Global Top Comparisons
  6. Progressive Local Top Comparisons

For more details on the functionality of these methods, see here.
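The flavor of these progressive methods can be conveyed with a simplified Python sketch of a Progressive Sorted Neighborhood scheme (an illustration only; the actual methods use more elaborate weighting): entities are sorted by a blocking key, and comparisons are emitted window by window, so the closest, most promising pairs come first.

```python
def progressive_sorted_neighborhood(entities, key=lambda e: e):
    """Progressive Sorted Neighborhood sketch: sort entities by a blocking key
    and emit comparisons with a growing window, nearest neighbors first."""
    order = sorted(range(len(entities)), key=lambda i: key(entities[i]))
    for w in range(1, len(entities)):      # growing window distance
        for i in range(len(entities) - w):
            yield order[i], order[i + w]   # pair of entity indices

entities = ["carol", "alice", "bob"]
print(list(progressive_sorted_neighborhood(entities)))
```

A downstream consumer can stop consuming this stream at any time, having executed the most promising comparisons first.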

How to add JedAI as a dependency to your project

Visit https://search.maven.org/artifact/org.scify/jedai-core

How to run JedAI as a Docker image

After installing Docker on your machine, type the following commands:

docker pull gmandi/jedai-webapp

docker run -p 8080:8080 gmandi/jedai-webapp

Then, open your browser and go to localhost:8080. JedAI should be running on your browser!

How to use JedAI with Python

You can combine JedAI with Python through PyJNIus (https://github.com/kivy/pyjnius).

Preparation Steps:

  1. Install Python 3 and PyJNIus (https://github.com/kivy/pyjnius).
  2. Install Java 8 (OpenJDK) and OpenJFX for Java 8, and configure it as the default Java.
  3. Create a directory or a jar file with jedai-core and its dependencies. One approach is to use the maven-assembly-plugin (https://maven.apache.org/plugins/maven-assembly-plugin/usage.html), which packages everything into a single jar file: jedai-core-3.0-jar-with-dependencies.jar

The following Python 3 example reads the ACM.csv file found under JedAIToolkit/data/cleanCleanErDatasets/DBLP-ACM and prints the entities it contains:

import jnius_config
jnius_config.add_classpath('jedai-core-3.0-jar-with-dependencies.jar')

from jnius import autoclass

# Load the JedAI CSV reader class from the JVM
CsvReader = autoclass('org.scify.jedai.datareader.entityreader.EntityCSVReader')

filePath = 'path_to/ACM.csv'
csvReader = CsvReader(filePath)
csvReader.setAttributeNamesInFirstRow(True)
csvReader.setSeparator(",")
csvReader.setIdIndex(0)

profiles = csvReader.getEntityProfiles()
profilesIterator = profiles.iterator()
while profilesIterator.hasNext():
    profile = profilesIterator.next()
    print("\n\n" + profile.getEntityUrl())
    attributesIterator = profile.getAttributes().iterator()
    while attributesIterator.hasNext():
        print(attributesIterator.next().toString())

jedaitoolkit's People

Contributors

alexjoom, cvedetect, cyber-cypher, eni-veres, gabrielepisciotta, ggianna, gioargyr, gpapadis, lwj5, moemode, mthanos, murray1991, swamikevala, xstrtok, zikani03


jedaitoolkit's Issues

Unable to load csv

I am using the latest JedAI-gui: jedai-ui.7z. I tried loading DBLP-ACM .csv data:

  1. ACM.csv
  2. DBLP2.csv
  3. DBLP-ACM_perfectMapping.csv

and I get the following error (please see attached):

GtCSVReader problems with jgrapht ConnectivityInspector

This issue arose when I attempted to reproduce the workflow in: org.scify.jedai.demoworkflows.CsvDblpAcm.java.

During the reading process of the ground truths in DBLP-ACM_perfectMapping.csv (specifically the GtCSVReader.getDuplicatePairs method), the detection of connected components by the jgrapht package seems to not work.

For some reason I obtain a single cluster of size 2225 and then 5375 more clusters of size 1, which is obviously incorrect since the csv contains about 2225 unique pairs (which should in turn produce 2225 clusters of size 2).

Have you seen this problem before? Maybe the jgrapht package expects a different format than it did previously?

Question about Data

Hi, I found that the dataset sizes in this repository do not seem to match the originals, and I would like to know whether the data has been processed. For example, the original Amazon-Google has 1363 and 3226 entities and 1300 matches respectively, but the numbers are smaller in this project.

Also, a lot of the dirty datasets seem to simply mix the two tables together. Is there any other processing?

Unable to build jedai-core - missing dependencies

Hi,

I'm unable to build the project.
The following dependencies can't be found:

  • com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0
  • gr.demokritos:JInsect:jar:1.1
  • salvo.jesus:OpenJGraph:jar:1.1

The first one can't be found at all; the other two seem to be hosted on an unreachable repository, http://backend1.scify.org:60004/artifactory/pub-release-local

mvn clean install -U
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] jedai                                                              [pom]
[INFO] jedai-core                                                         [jar]
[INFO] jedai-ui                                                           [jar]
[INFO]
[INFO] ---------------------------< gr.scify:jedai >---------------------------
[INFO] Building jedai 1.3                                                 [1/3]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ jedai ---
[INFO]
[INFO] --- maven-install-plugin:2.4:install (default-install) @ jedai ---
[INFO] Installing C:\projet\JedAIToolkit\pom.xml to C:\Users\nicolas.lledo\.m2\repository\gr\scify\jedai\1.3\jedai-1.3.pom
[INFO]
[INFO] ------------------------< gr.scify:jedai-core >-------------------------
[INFO] Building jedai-core 1.3                                            [2/3]
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/com/esotericsoftware/minlog/minlog/1.2-slf4j-jdanbrown-0/minlog-1.2-slf4j-jdanbrown-0.pom
[WARNING] The POM for com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/gr/demokritos/JInsect/1.1/JInsect-1.1.pom
[WARNING] The POM for gr.demokritos:JInsect:jar:1.1 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/salvo/jesus/OpenJGraph/1.1/OpenJGraph-1.1.pom
[WARNING] The POM for salvo.jesus:OpenJGraph:jar:1.1 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/com/esotericsoftware/minlog/minlog/1.2-slf4j-jdanbrown-0/minlog-1.2-slf4j-jdanbrown-0.jar
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/salvo/jesus/OpenJGraph/1.1/OpenJGraph-1.1.jar
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/gr/demokritos/JInsect/1.1/JInsect-1.1.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for jedai 1.3:
[INFO]
[INFO] jedai .............................................. SUCCESS [  0.452 s]
[INFO] jedai-core ......................................... FAILURE [  1.671 s]
[INFO] jedai-ui ........................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.393 s
[INFO] Finished at: 2019-02-27T17:50:24+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project jedai-core: Could not resolve dependencies for project gr.scify:jedai-core:jar:1.3: The following artifacts could not be resolved: com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0, gr.demokritos:JInsect:jar:1.1, salvo.jesus:OpenJGraph:jar:1.1: Could not find artifact com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0 in nexus.somecompany.com (http://nexus.somecompany.com/repository/maven-public/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :jedai-core

Apply JedAI blocking programmatically - missing documentation

Hi!

I have successfully made the Web application work and I also made my first successful steps by using JedAI with Python.

But now I want to do it programatically with Python and without the Web application, so I want to apply the full workflow but only with the terminal and the VS Code.

But I couldn't find any detailed documentation how I can do blocking, cleaning ... programatically.

ArrayIndexOutOfBoundsException when blocking with schema clusters

I got the following error when I tried blocking with schema clusters:

java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.lambda$parseIndex$10(AbstractBlockBuilding.java:167)
	at java.base/java.util.HashMap.forEach(HashMap.java:1336)
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.parseIndex(AbstractBlockBuilding.java:164)
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.readBlocks(AbstractBlockBuilding.java:196)
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.getBlocks(AbstractBlockBuilding.java:96)
	at org.scify.jedai.gui.utilities.WorkflowManager.runBlockBuilding(WorkflowManager.java:824)
	at org.scify.jedai.gui.utilities.WorkflowManager.runBlockingBasedWorkflow(WorkflowManager.java:896)
	at org.scify.jedai.gui.utilities.WorkflowManager.executeFullBlockingBasedWorkflow(WorkflowManager.java:393)
	at org.scify.jedai.gui.utilities.WorkflowManager.executeFullWorkflow(WorkflowManager.java:695)
	at org.scify.jedai.gui.controllers.steps.CompletedController.lambda$runAlgorithmBtnHandler$6(CompletedController.java:316)
	at java.base/java.lang.Thread.run(Thread.java:834)

There is a String split operation in the parseIndex function that is not working properly:
final String[] entropyString = key.split(CLUSTER_SUFFIX);
The delimiters used in key are equivalent to CLUSTER_PREFIX, not CLUSTER_SUFFIX, and they contain a dollar sign that has to be escaped. I worked around the issue by changing the above line to
final String[] entropyString = key.split("#\\$!cl");

I'd suggest changing the values of the prefix and suffix to something that is compatible with regex - the workaround above is less readable, after all.

JedAI for Data matching

Hello,
I am trying to run the Web-based application for a data matching task. I have two tables in CSV format: the first table contains 1.2k rows and the second contains 7k queries. I want to use JedAI to match each query with a row from the first table. When I run a "block-based workflow", the process gets stuck at table loading.
I am a bit lost about how to configure the model. So far I have tried the settings in the video tutorial and some other settings, but the application never generates any output. I share the tables with this message; please let me know if there is anything wrong with the way I generated them.

Unable to Read csv or json files

I have my own custom CSV files for both the dataset and the ground truth.
Can anyone help me use these files to get results?
They throw some errors when used as input:

Exception in thread "main" java.lang.IllegalArgumentException: loops not allowed
	at org.jgrapht.graph.AbstractBaseGraph.addEdge(AbstractBaseGraph.java:218)
	at org.scify.jedai.datareader.groundtruthreader.GtCSVReader.getDuplicatePairs(GtCSVReader.java:206)
	at org.scify.jedai.datareader.groundtruthreader.AbstractGtReader.getDuplicatePairs(AbstractGtReader.java:58)
	at org.scify.jedai.workflowbuilder.Main.main(Main.java:254)

can anyone help me?

Make jedai-core Extensible

Users of jedai-core are unable to extend the library to utilize a custom similarity metric or entity matching method due to the enums defined in the project (e.g. SimilarityMetric, EntityMatchingMethod, BlockCleaningMethod, etc.). Instead, if these features utilized an extension mechanism (for example, java.util.ServiceLoader or something equivalent), custom features would be possible.

Converting the DBPedia dataset into non-Java format

Hello,
I'm working on converting the DBPedia dataset into a format accessible without Java.
I have already converted cleanDBPedia1/2.
However, I do not understand the ground truth format.
The profiles have attributes and a URI, but the pairs in the ground truth consist of numbers.
When I interpret these numbers as offsets into either file, I end up with non-matching pairs.
I wrote the entities into the files in the order they appeared in the deserialized Java list.
How can I find the matching pairs / understand the ground truth?
Kind regards

No URLs to Download

I am trying to download the pre-compiled version from the http://jedai.scify.org website.

When I click on Download desktop app for both "Desktop application for Entity Resolution" and "Workbench tool," I get a "Page Not Found" on Github.

Additionally, I created an issue for this, because the webpage doesn't have any contact information. :/

^ I tried compiling it on my machine, but it slowed to a crawl and took over an hour, so I decided to try to download the precompiled JARs instead.

Error on TestGtRDFReader

Hi, I tried some tests with the JedAI tool.
This tool is useful for my job and I think that it has great potential.
I've downloaded the attached files in nt format: source.nt, target.nt.
In the first step, I successfully executed the TestRdfReader class present in the test package for both datasets. After that, I tried to execute the TestGtRDFReader class with the same datasets, but I got the following error:
Exception in thread "main" java.lang.IllegalArgumentException: loops not allowed
	at org.jgrapht.graph.AbstractBaseGraph.addEdge(AbstractBaseGraph.java:203)
	at org.scify.jedai.datareader.groundtruthreader.GtRDFReader.performReading(GtRDFReader.java:236)
	at org.scify.jedai.datareader.groundtruthreader.GtRDFReader.getDuplicatePairs(GtRDFReader.java:92)
	at org.scify.jedai.datareader.groundtruthreader.AbstractGtReader.getDuplicatePairs(AbstractGtReader.java:57)
	at org.scify.jedai.datareader.TestGtRDFReader.main(TestGtRDFReader.java:39)

datasets.zip

Thanks in advance!

SiGMa Similarity

I had a look at the code of the SiGMa Similarity in class CharacterNGramsWithGlobalWeights and it seems to be exactly the same code as in the Generalized Jaccard Similarity. Am I missing something or is SiGMa not really implemented?

UI and Docker's Web Application get stuck in Data Reading Phase

I get the following error after specifying input sources and then pressing "Next" button in Data Reading Step in JedAI UI:

The input files could not be read successfully.

Details: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Character

In the terminal of Docker's Web Application I have the following:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Character
	at kr.di.uoa.gr.jedaiwebapp.models.Dataset.<init>(Dataset.java:86) ~[classes!/:0.0.1-SNAPSHOT]
	at kr.di.uoa.gr.jedaiwebapp.controllers.WorkflowController.validate_DataRead(WorkflowController.java:75) ~[classes!/:0.0.1-SNAPSHOT]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
	at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190) ~[spring-web-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
...

Reduce memory footprint of SimilarityPairs

We're using jedai-core (not jedai-ui) in our application and we ran into some Out of Memory errors and started profiling our application. The largest chunk of memory was from SimilarityPairs. We experimented with reducing the size of the similarities from double to float and that reduced the memory footprint by about 25% (630 MB -> 470 MB).

I'm assuming we don't need the extra precision afforded by double, is that correct?

Better structure for match results output file

The PrintToFile.toCSV() method should output the original entity urls, and should be in a format which is easier to import into a database, e.g. 3 columns: cluster_id, dataset, entity_url

Documentation or examples for the open source library

I cannot seem to find any documentation or examples of a standard workflow implemented in Python or Java in your repository. Do either of these exist? If so, where could I find them? If not, it would be very useful to have them, since a new user of your tool, like me, currently has to go through all of the Java classes to learn how to use it, which takes a lot of time.

Cannot read ground truth

There is a bug in the code that prevents the ground truth in CSV format from being read. I tried the samples provided, and the web-based Docker image failed to load them. I downloaded the code, ran it step by step, and I think there is a problem with the GtCSVReader. The reading part takes strings like "thisisastring" where only thisisastring should be read. I tried to add nextLine[0] = nextLine[0].substring(1, nextLine[0].length()-1); on line 200 in that file, but no success so far. I need to make it work to test some CSV entity matchings, so maybe somebody has the fix for this issue?

Regarding JedAIToolkit sample csv file

I am looking at the source code of JedAIToolkit on GitHub.

I am not able to find the sample csv file for testing.

Can I get the cd_gold.csv and cd.csv files which have been used for testing purposes in TestGtCSVReader.java and TestEntityCSVReader.java?

Unable to build

I cloned the project to my local machine and followed the steps listed in the README, but it fails to build with the error below:

git clone https://github.com/scify/JedAIToolkit.git
cd JedAIToolkit
git submodule update --init
mvn clean package
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ jedai-ui ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] jedai .............................................. SUCCESS [  0.259 s]
[INFO] jedai-core ......................................... SUCCESS [ 59.511 s]
[INFO] jedai-ui ........................................... FAILURE [  6.408 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:06 min
[INFO] Finished at: 2018-12-11T15:42:46-05:00
[INFO] Final Memory: 42M/406M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default) on project jedai-ui: Error reading assemblies: Error locating assembly descriptor: assembly.xml
[ERROR]
[ERROR] [1] [INFO] Searching for file location: C:\Users\Yeikel\Documents\JedAIToolkit\jedai-ui\assembly.xml
[ERROR]
[ERROR] [2] [INFO] File: C:\Users\Yeikel\Documents\JedAIToolkit\jedai-ui\assembly.xml does not exist.
[ERROR]
[ERROR] [3] [INFO] Invalid artifact specification: 'assembly.xml'. Must contain at least three fields, separated by ':'.
[ERROR]
[ERROR] [4] [INFO] Failed to resolve classpath resource: assemblies/assembly.xml from classloader: ClassRealm[plugin>org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5, parent: sun.misc.Launcher$AppClassLoader@33909752]
[ERROR]
[ERROR] [5] [INFO] Failed to resolve classpath resource: assembly.xml from classloader: ClassRealm[plugin>org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5, parent: sun.misc.Launcher$AppClassLoader@33909752]
[ERROR]
[ERROR] [6] [INFO] File: C:\Users\Yeikel\Documents\JedAIToolkit\assembly.xml does not exist.
[ERROR]
[ERROR] [7] [INFO] Building URL from location: assembly.xml
[ERROR] Error:
[ERROR] java.net.MalformedURLException: no protocol: assembly.xml
[ERROR]         at java.net.URL.<init>(URL.java:593)
[ERROR]         at java.net.URL.<init>(URL.java:490)
[ERROR]         at java.net.URL.<init>(URL.java:439)
[ERROR]         at org.apache.maven.shared.io.location.URLLocatorStrategy.resolve(URLLocatorStrategy.java:54)
[ERROR]         at org.apache.maven.shared.io.location.Locator.resolve(Locator.java:81)
[ERROR]         at org.apache.maven.plugin.assembly.io.DefaultAssemblyReader.addAssemblyFromDescriptor(DefaultAssemblyReader.java:309)
[ERROR]         at org.apache.maven.plugin.assembly.io.DefaultAssemblyReader.readAssemblies(DefaultAssemblyReader.java:125)
[ERROR]         at org.apache.maven.plugin.assembly.mojos.AbstractAssemblyMojo.execute(AbstractAssemblyMojo.java:352)
[ERROR]         at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR]         at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR]         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
[ERROR]         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
[ERROR]         at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
[ERROR]         at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
[ERROR]         at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
[ERROR]         at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
[ERROR]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcces                                                                   sorImpl.java:62)
[ERROR]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMet                                                                   hodAccessorImpl.java:43)
[ERROR]         at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhan                                                                   ced(Launcher.java:289)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Laun                                                                   cher.java:229)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExi                                                                   tCode(Launcher.java:415)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launch                                                                   er.java:356)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e swit                                                                   ch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please rea                                                                   d the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionE                                                                   xception
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :jedai-ui

Remove maven-assembly-plugin Configuration From jedai-core

If another project depends on jedai-core, bundling the transitive dependencies inside jedai-core can cause conflicts whenever that project needs different versions of the same dependencies. Since jedai-ui already assembles its transitive dependencies, removing the assembly from jedai-core should not affect the UI.

[WorkflowBuilder.Main] Error: can't locate dataset

Using the library from the CLI (Linux), it raises this exception:

Please choose one of the available Clean-clean ER datasets:
1 - Abt-Buy
2 - DBLP-ACM
3 - DBLP-Scholar
4 - Amazon-Google Products
5 - IMDB-DBPedia Movies
1
Abt-Buy has been selected!
0 [main] ERROR com.esotericsoftware.minlog  - Error in data reading
java.io.FileNotFoundException: data/cleanCleanErDatasets/amazonProfiles (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at org.scify.jedai.datareader.AbstractReader.loadSerializedObject(AbstractReader.java:54)
	at org.scify.jedai.datareader.entityreader.EntitySerializationReader.getEntityProfiles(EntitySerializationReader.java:48)
	at org.scify.jedai.workflowbuilder.Main.main(Main.java:241)
Exception in thread "main" java.lang.NullPointerException
	at java.util.ArrayList.addAll(ArrayList.java:581)
	at org.scify.jedai.datareader.entityreader.EntitySerializationReader.getEntityProfiles(EntitySerializationReader.java:48)
	at org.scify.jedai.workflowbuilder.Main.main(Main.java:241)
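Since the missing dataset surfaces first as a FileNotFoundException and then as a NullPointerException, a defensive existence check before deserializing would give a clearer message. A minimal sketch, assuming the default dataset location; the class and method names below are illustrative, not part of the JedAI API:

```java
import java.io.File;

public class DatasetCheck {

    // Returns true only if the serialized dataset file exists and is readable.
    public static boolean datasetAvailable(String path) {
        final File f = new File(path);
        return f.exists() && f.isFile() && f.canRead();
    }

    public static void main(String[] args) {
        // Illustrative path: the serialized datasets must be downloaded
        // separately and placed under data/cleanCleanErDatasets, and the
        // tool must be run from the repository root.
        final String path = "data/cleanCleanErDatasets/amazonProfiles";
        if (!datasetAvailable(path)) {
            System.err.println("Dataset not found: " + path
                    + " - download it and re-run from the repository root.");
        }
    }
}
```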

Dependency org.apache.httpcomponents:httpclient-cache, leading to CVE problem

Hi, in /maven-plugins/sitegen-maven-plugin there is a dependency **org.apache.httpcomponents:httpclient-cache:jar:4.2.6** that calls a risky method.

CVE-2020-13956

The affected version range of this CVE is [, 4.5.13).

After further analysis, the main API called in this project is org.apache.http.client.utils.URIUtils: extractHost(java.net.URI)Lorg.apache.http.HttpHost;

Repair link for the risky method: GitHub

CVE invocation path (length: 7):

org.scify.jedai.datawriter.BlocksPerformanceWriter: printDetailedResultsToSPARQL(java.util.List,java.util.List,java.lang.String,java.lang.String)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.sparql.modify.UpdateProcessRemoteForm: execute()V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.riot.web.HttpOp: execHttpPostForm(java.lang.String,org.apache.jena.sparql.engine.http.Params,java.lang.String,org.apache.jena.riot.web.HttpResponseHandler,org.apache.http.client.HttpClient,org.apache.http.protocol.HttpContext,org.apache.jena.atlas.web.auth.HttpAuthenticator)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.riot.web.HttpOp: exec(java.lang.String,org.apache.http.client.methods.HttpUriRequest,java.lang.String,org.apache.jena.riot.web.HttpResponseHandler,org.apache.http.client.HttpClient,org.apache.http.protocol.HttpContext,org.apache.jena.atlas.web.auth.HttpAuthenticator)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.impl.client.AbstractHttpClient: execute(org.apache.http.client.methods.HttpUriRequest,org.apache.http.protocol.HttpContext)Lorg.apache.http.HttpResponse; /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.impl.client.AbstractHttpClient: determineTarget(org.apache.http.client.methods.HttpUriRequest)Lorg.apache.http.HttpHost; /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.client.utils.URIUtils: extractHost(java.net.URI)Lorg.apache.http.HttpHost;

Dependency tree:

[INFO] org.scify:jedai-core:jar:3.2.1
[INFO] +- org.jgrapht:jgrapht-core:jar:1.4.0:compile
[INFO] |  \- org.jheaps:jheaps:jar:0.11:compile
[INFO] +- net.sf.trove4j:trove4j:jar:3.0.3:compile
[INFO] +- com.esotericsoftware:minlog:jar:1.3.1:compile
[INFO] +- info.debatty:java-lsh:jar:0.11:compile
[INFO] |  \- info.debatty:java-string-similarity:jar:0.12:compile
[INFO] +- org.apache.commons:commons-lang3:jar:3.4:compile
[INFO] +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] +- org.apache.jena:jena-arq:jar:3.1.0:compile
[INFO] |  +- org.apache.jena:jena-core:jar:3.1.0:compile
[INFO] |  |  +- org.apache.jena:jena-iri:jar:3.1.0:compile
[INFO] |  |  +- xerces:xercesImpl:jar:2.11.0:compile
[INFO] |  |  |  \- xml-apis:xml-apis:jar:1.4.01:compile
[INFO] |  |  +- commons-cli:commons-cli:jar:1.3:compile
[INFO] |  |  \- org.apache.jena:jena-base:jar:3.1.0:compile
[INFO] |  |     \- com.github.andrewoma.dexx:collection:jar:0.6:compile
[INFO] |  +- org.apache.jena:jena-shaded-guava:jar:3.1.0:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.2.6:compile
[INFO] |  |  +- org.apache.httpcomponents:httpcore:jar:4.2.5:compile
[INFO] |  |  \- commons-codec:commons-codec:jar:1.6:compile
[INFO] |  +- com.github.jsonld-java:jsonld-java:jar:0.7.0:compile
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-core:jar:2.3.3:compile
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-databind:jar:2.3.3:compile
[INFO] |  |  |  \- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO] |  |  \- commons-io:commons-io:jar:2.4:compile
[INFO] |  +- org.apache.httpcomponents:httpclient-cache:jar:4.2.6:compile
[INFO] |  +- org.apache.thrift:libthrift:jar:0.9.2:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.20:compile
[INFO] |  +- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.7.20:compile
[INFO] +- org.apache.jena:jena-cmds:jar:3.1.0:compile
[INFO] |  +- org.apache.jena:apache-jena-libs:pom:3.1.0:compile
[INFO] |  |  \- org.apache.jena:jena-tdb:jar:3.1.0:compile
[INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.20:compile
[INFO] |  \- log4j:log4j:jar:1.2.17:compile
[INFO] +- com.opencsv:opencsv:jar:3.7:compile
[INFO] +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] +- org.scify:JInsect:jar:1.1:compile
[INFO] |  \- org.scify:OpenJGraph:jar:1.1:compile
[INFO] +- org.rdfhdt:hdt-java-core:jar:1.1:compile
[INFO] |  +- com.beust:jcommander:jar:1.32:compile
[INFO] |  +- org.rdfhdt:hdt-api:jar:1.1:compile
[INFO] |  \- org.apache.commons:commons-compress:jar:1.6:compile
[INFO] |     \- org.tukaani:xz:jar:1.4:compile
[INFO] +- com.google.guava:guava-testlib:jar:30.1.1-jre:test
[INFO] |  +- com.google.code.findbugs:jsr305:jar:3.0.2:test
[INFO] |  +- org.checkerframework:checker-qual:jar:3.8.0:test
[INFO] |  +- com.google.errorprone:error_prone_annotations:jar:2.5.1:test
[INFO] |  +- com.google.j2objc:j2objc-annotations:jar:1.3:test
[INFO] |  +- com.google.guava:guava:jar:30.1.1-jre:test
[INFO] |  |  +- com.google.guava:failureaccess:jar:1.0.1:test
[INFO] |  |  \- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:test
[INFO] |  \- junit:junit:jar:4.13.2:test
[INFO] |     \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] +- org.hamcrest:hamcrest:jar:2.2:test
[INFO] +- org.junit.jupiter:junit-jupiter-api:jar:5.7.2:test
[INFO] |  +- org.apiguardian:apiguardian-api:jar:1.1.0:test
[INFO] |  +- org.opentest4j:opentest4j:jar:1.2.0:test
[INFO] |  \- org.junit.platform:junit-platform-commons:jar:1.7.2:test
[INFO] \- org.junit.jupiter:junit-jupiter-engine:jar:5.7.2:test
[INFO]    \- org.junit.platform:junit-platform-engine:jar:1.7.2:test

Suggested solutions:

Update dependency version

Thank you very much.

Dirty ER examples input .csv

Hi, is it possible to have sample files in .csv format for

  • entity profile D1
  • ground truth

since .csv files with any formatting I try will not work? The error from JedAI-gui is the following:

[error screenshot not captured]

Thank you for the support.

data pairs shown as false negatives and as true positives

I found some cases where data pairs showed up in the final results as false negatives and true positives simultaneously.
The cause lies in the class UnilateralDuplicatePropagation, in the following methods:

public boolean isSuperfluous(int entityId1, int entityId2) {
    final IdDuplicates duplicatePair1 = new IdDuplicates(entityId1, entityId2);
    final IdDuplicates duplicatePair2 = new IdDuplicates(entityId2, entityId1);
    if (duplicates.contains(duplicatePair1)
            || duplicates.contains(duplicatePair2)) {
        if (entityId1 < entityId2) {
            detectedDuplicates.add(duplicatePair1);
        } else {
            detectedDuplicates.add(duplicatePair2);
        }
    }

    return false;
}

public Set<IdDuplicates> getFalseNegatives() {
    final Set<IdDuplicates> falseNegatives = new HashSet<>(duplicates);
    falseNegatives.removeAll(detectedDuplicates);
    return falseNegatives;
}

Only one of the two possible ID orderings is written to detectedDuplicates, while the other ordering may still exist in duplicates. When detectedDuplicates is removed from duplicates to build falseNegatives, those leftover orderings remain and are exported as false negatives, even though the corresponding pairs in detectedDuplicates are exported as true positives.
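One way to avoid the mismatch is to normalize every pair to a canonical id order before it enters either set, so that removeAll cancels detected pairs regardless of how the ground truth stored them. A minimal sketch, using a simplified stand-in Pair type rather than JedAI's IdDuplicates:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class CanonicalPairs {

    // Simplified stand-in for IdDuplicates, with value equality.
    static final class Pair {
        final int a, b;

        // Store the smaller id first, so (1,2) and (2,1) are the same pair.
        Pair(int x, int y) { this.a = Math.min(x, y); this.b = Math.max(x, y); }

        @Override public boolean equals(Object o) {
            return o instanceof Pair && ((Pair) o).a == a && ((Pair) o).b == b;
        }
        @Override public int hashCode() { return Objects.hash(a, b); }
    }

    final Set<Pair> duplicates = new HashSet<>();
    final Set<Pair> detectedDuplicates = new HashSet<>();

    void addGroundTruth(int id1, int id2) { duplicates.add(new Pair(id1, id2)); }

    void markDetected(int id1, int id2) {
        final Pair p = new Pair(id1, id2);
        if (duplicates.contains(p)) {
            detectedDuplicates.add(p);
        }
    }

    Set<Pair> falseNegatives() {
        final Set<Pair> fn = new HashSet<>(duplicates);
        fn.removeAll(detectedDuplicates);  // cancels regardless of original id order
        return fn;
    }
}
```

Because every pair is canonicalized on construction, a pair detected as (1, 2) removes the ground-truth entry even if it was stored as (2, 1), so it can no longer appear as both a true positive and a false negative.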

Change comparison counts type to int

We're using jedai-core in our application and ran into cases where the number of executed comparisons in ComparisonIterator exceeded the total number of comparisons. We identified that this happened because executedComparisons and totalComparisons are floats; changing them to ints fixed the problem. A float cannot represent every large integer exactly, and comparing two floats for exact equality is generally discouraged in Java.
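The underlying reason is that a float has only a 24-bit significand, so above 2^24 = 16,777,216 it can no longer represent every integer and increments start to be silently lost. A small demonstration of the effect:

```java
public class FloatCounter {
    public static void main(String[] args) {
        // Above 2^24, incrementing a float by 1 can be a no-op:
        // 16_777_217 is not representable and rounds back down.
        float f = 16_777_216f;
        f += 1f;
        System.out.println(f == 16_777_216f);  // true - the increment was lost

        // An int (or long for very large workloads) counts exactly.
        int i = 16_777_216;
        i += 1;
        System.out.println(i);  // 16777217
    }
}
```

This is exactly the situation where an executed-comparisons counter kept as a float can drift past a total kept as a float, since neither value is counting exactly.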

Dirty datasets in CSV format

Hi, I was wondering if you have the dirty datasets available in CSV format? Otherwise I can just create a quick script that reads the JSO files and converts them myself, but I figured there is no harm in asking first! Thanks in advance.

CSV headers with upper case don't work for PPJoin

On the Similarity Join page of the UI, when the "Select attribute of Dataset 1" and "Select attribute of Dataset 2" values are given in uppercase, e.g. "INSTANCE ID", the algorithm fails to match any results. On further investigation I found that in the class AbstractSimilarityJoin, the method getAttributeValue(String attributeName, EntityProfile profile) on line 67 should use attributeName.toLowerCase() so that attribute names are compared properly; otherwise the if condition is simply never entered.
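The suggested fix amounts to comparing attribute names case-insensitively. A minimal sketch of the idea, using a plain Map as a stand-in for an EntityProfile's attribute list; the helper below is illustrative, not JedAI's actual method:

```java
import java.util.Map;

public class AttributeLookup {

    // Case-insensitive attribute lookup: returns the value of the first
    // attribute whose name matches, ignoring case, or null if none matches.
    public static String getAttributeValue(String attributeName,
                                           Map<String, String> profile) {
        for (Map.Entry<String, String> e : profile.entrySet()) {
            if (e.getKey().equalsIgnoreCase(attributeName)) {  // was: equals()
                return e.getValue();
            }
        }
        return null;
    }
}
```

With equalsIgnoreCase (or lower-casing both sides once, up front), a UI selection of "INSTANCE ID" still matches a profile attribute stored as "instance id".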
