scify / JedAIToolkit

An open source, high scalability toolkit in Java for Entity Resolution.

Home Page: http://jedai.scify.org

License: Apache License 2.0

Java 100.00%
entity-resolution entity-matching scalability blocking

jedaitoolkit's Introduction

Please check pyJedAI for an implementation of JedAI in native Python.

Please check our paper for a detailed description of version 3.0.

The code for running JedAI on Apache Spark is available here.

The Web Application for running JedAI is available here. A video explaining how to use it is available here.

JedAI is also available as a Docker image here. See below for more details.

The latest version of JedAI-gui is available here.

Java gEneric DAta Integration (JedAI) Toolkit

JedAI is an open source, highly scalable toolkit that offers out-of-the-box solutions for any data integration task, e.g., Record Linkage, Entity Resolution and Link Discovery. At its core lies a set of domain-independent, state-of-the-art techniques that apply to both RDF and relational data. These techniques rely on approximate, schema-agnostic functionality based on (meta-)blocking for high scalability.

JedAI can be used in three different ways:

  1. As an open source library that implements numerous state-of-the-art methods for all steps of the end-to-end ER workflow presented in the figure below.
  2. As a desktop application with an intuitive Graphical User Interface that can be used by both expert and lay users.
  3. As a workbench that compares the relative performance of different (configurations of) ER workflows.

This repository contains the code (in Java 8) of JedAI's open source library. The code of JedAI's desktop application and workbench is available in this repository.

Several datasets already converted into the serialized data type of JedAI can be found here.

You can find a short presentation of JedAI Toolkit here.

Citation

If you use JedAI, please cite the following paper:

George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas and Manolis Koubarakis: "The return of JedAI: End-to-End Entity Resolution for Structured and Semi-Structured Data", in VLDB 2018 (pdf).

Consortium

JedAI is a collaborative project involving the following partners:

JedAI Workflow

JedAI supports 3 workflows, as shown in the following images:

Below, we explain in more detail the purpose and the functionality of every step.

Data Reading

It transforms the input data into a list of entity profiles. An entity is a uniquely identified set of name-value pairs (e.g., an RDF resource with its URI as identifier and its set of predicates and objects as name-value pairs).

The following formats are currently supported:

  1. CSV
  2. RDF in any format, e.g., RDF/XML, OWL, HDT, JSON
  3. Relational Databases (MySQL, PostgreSQL)
  4. SPARQL endpoints
  5. Java serialized objects
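As a rough illustration of the data model these readers produce (a hypothetical Python sketch mirroring org.scify.jedai.datamodel.EntityProfile, not the library's actual Java code), an entity profile is just a unique identifier plus a set of name-value pairs:

```python
# Minimal sketch of JedAI's entity-profile abstraction (names hypothetical):
# an entity is a unique identifier plus a list of name-value pairs.
class EntityProfile:
    def __init__(self, entity_url):
        self.entity_url = entity_url   # unique identifier (e.g., an RDF URI)
        self.attributes = []           # list of (name, value) pairs

    def add_attribute(self, name, value):
        self.attributes.append((name, value))

# Reading one record into a profile
profile = EntityProfile("http://example.org/resource/1")
profile.add_attribute("title", "Entity Resolution Survey")
profile.add_attribute("year", "2018")
print(profile.entity_url, len(profile.attributes))
```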

Schema Clustering

This is an optional step, suitable for highly heterogeneous datasets with a schema comprising a large diversity of attribute names. To this end, it groups together attributes that are syntactically similar, but are not necessarily semantically equivalent.

The following methods are currently supported:

  1. Attribute Name Clustering
  2. Attribute Value Clustering
  3. Holistic Attribute Clustering

For more details on the functionality of these methods, see here.

Block Building

It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly after transforming the keys, based on the equality or similarity of each key to the other keys.

The following methods are currently supported:

  1. Standard/Token Blocking
  2. Sorted Neighborhood
  3. Extended Sorted Neighborhood
  4. Q-Grams Blocking
  5. Extended Q-Grams Blocking
  6. Suffix Arrays Blocking
  7. Extended Suffix Arrays Blocking
  8. LSH MinHash Blocking
  9. LSH SuperBit Blocking

For more details on the functionality of these methods, see here.
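To make the blocking-key idea concrete, here is a minimal Python sketch of Standard/Token Blocking (an illustration, not JedAI's implementation): every token of every attribute value becomes a key, and entities sharing a key land in the same block.

```python
from collections import defaultdict

def token_blocking(profiles):
    """Standard/Token Blocking sketch: every token in every attribute value
    becomes a blocking key; entities sharing a key share a block."""
    blocks = defaultdict(set)
    for entity_id, attributes in profiles.items():
        for value in attributes.values():
            for token in value.lower().split():
                blocks[token].add(entity_id)
    # keep only blocks that yield at least one comparison
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}

profiles = {
    1: {"title": "entity resolution toolkit"},
    2: {"title": "entity matching survey"},
    3: {"name": "unrelated record"},
}
print(token_blocking(profiles))  # entities 1 and 2 share the token "entity"
```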

Block Cleaning

Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities.

The following methods are currently supported:

  1. Size-based Block Purging
  2. Cardinality-based Block Purging
  3. Block Filtering
  4. Block Clustering

All methods are optional, but complementary with each other and can be used in combination. For more details on the functionality of these methods, see here.
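As an example of the coarse, block-level granularity of this step, the following is a simplified Python sketch of Block Filtering (the real method is described in the linked material; the ratio parameter and tie-breaking here are illustrative): each entity retains only a portion of its smallest, most discriminative blocks.

```python
from collections import defaultdict

def block_filtering(blocks, ratio=0.5):
    """Block Filtering sketch: each entity keeps only a fraction (ratio) of its
    smallest blocks, discarding its largest, least discriminative ones."""
    # blocks: dict blocking key -> set of entity ids
    entity_blocks = defaultdict(list)
    for key, entities in blocks.items():
        for e in entities:
            entity_blocks[e].append(key)
    # per entity: sort its blocks by size and keep the smallest ones
    kept = {
        e: set(sorted(keys, key=lambda k: len(blocks[k]))[:max(1, int(ratio * len(keys)))])
        for e, keys in entity_blocks.items()
    }
    # an entity stays in a block only if that block is among its kept ones
    return {k: {e for e in v if k in kept[e]} for k, v in blocks.items()}

blocks = {"a": {1, 2, 3}, "b": {1, 2}, "c": {2, 3}}
print({k: sorted(v) for k, v in block_filtering(blocks).items()})
```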

Comparison Cleaning

Similar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.

The following methods are currently supported:

  1. Comparison Propagation
  2. Cardinality Edge Pruning (CEP)
  3. Cardinality Node Pruning (CNP)
  4. Weighted Edge Pruning (WEP)
  5. Weighted Node Pruning (WNP)
  6. Reciprocal Cardinality Node Pruning (ReCNP)
  7. Reciprocal Weighted Node Pruning (ReWNP)
  8. BLAST
  9. Canopy Clustering
  10. Extended Canopy Clustering

Most of these methods are Meta-blocking techniques. All methods are optional but competitive, in the sense that at most one of them can be part of an ER workflow. For more details on the functionality of these methods, see here. They can be combined with one of the following weighting schemes:

  1. Aggregate Reciprocal Comparisons Scheme (ARCS)
  2. Common Blocks Scheme (CBS)
  3. Enhanced Common Blocks Scheme (ECBS)
  4. Jaccard Scheme (JS)
  5. Enhanced Jaccard Scheme (EJS)
  6. Pearson chi-squared test
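To illustrate how such a weighting scheme scores comparisons, here is a minimal Python sketch of the Common Blocks Scheme (CBS), which is an illustration of the idea rather than JedAI's code: the weight of a comparison is simply the number of blocks its two entities share.

```python
from collections import Counter
from itertools import combinations

def cbs_weights(blocks):
    """Common Blocks Scheme (CBS) sketch: the weight of a comparison (e_i, e_j)
    equals the number of blocks that contain both entities."""
    weights = Counter()
    for entities in blocks.values():
        for pair in combinations(sorted(entities), 2):
            weights[pair] += 1
    return weights

blocks = {"x": {1, 2, 3}, "y": {1, 2}, "z": {2, 3}}
w = cbs_weights(blocks)
print(w[(1, 2)], w[(2, 3)], w[(1, 3)])  # 2 2 1
```

Meta-blocking methods then prune the comparisons with the lowest weights, since entities co-occurring in few blocks are unlikely to match.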

Entity Matching

It compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities.

The following schema-agnostic methods are currently supported:

  1. Group Linkage
  2. Profile Matcher, which aggregates all attribute values of an individual entity into a textual representation.

Both methods can be combined with the following representation models:

  1. character n-grams (n=2, 3 or 4)
  2. character n-gram graphs (n=2, 3 or 4)
  3. token n-grams (n=1, 2 or 3)
  4. token n-gram graphs (n=1, 2 or 3)

For more details on the functionality of these bag and graph models, see here.

The bag models can be combined with the following similarity measures, using both TF and TF-IDF weights:

  1. ARCS similarity
  2. Cosine similarity
  3. Jaccard similarity
  4. Generalized Jaccard similarity
  5. Enhanced Jaccard similarity
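As a concrete example of a bag model paired with a similarity measure, the following Python sketch computes Jaccard similarity over character 2-grams (a simplified illustration, without the TF/TF-IDF weighting the library supports):

```python
def jaccard(a, b, n=2):
    """Jaccard similarity over character n-grams (here bigrams), i.e. the size
    of the intersection of the two n-gram sets over the size of their union."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(round(jaccard("JedAI", "JedAI Toolkit"), 3))
```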

The graph models can be combined with the following graph similarity measures:

  1. Containment similarity
  2. Normalized Value similarity
  3. Value similarity
  4. Overall Graph similarity

Any word or character-level pre-trained embeddings are also supported in combination with cosine similarity or Euclidean distance.

Entity Clustering

It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.

The following domain-independent methods are currently supported for Dirty ER:

  1. Connected Components Clustering
  2. Center Clustering
  3. Merge-Center Clustering
  4. Ricochet SR Clustering
  5. Correlation Clustering
  6. Markov Clustering
  7. Cut Clustering

For more details on the functionality of these methods, see here.
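As an illustration of the simplest of these methods, here is a Python sketch of Connected Components Clustering (a union-find implementation of the general idea, not JedAI's code): edges of the similarity graph below a threshold are dropped, and each remaining connected component becomes a cluster.

```python
def connected_components(num_entities, edges, threshold=0.5):
    """Connected Components Clustering sketch: keep similarity-graph edges at or
    above the threshold and return each connected component as a cluster."""
    parent = list(range(num_entities))

    def find(x):  # find the root of x with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, sim in edges:
        if sim >= threshold:
            parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for e in range(num_entities):
        clusters.setdefault(find(e), set()).add(e)
    return sorted(sorted(c) for c in clusters.values())

edges = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
print(connected_components(5, edges))  # [[0, 1, 2], [3], [4]]
```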

For Clean-Clean ER, the following methods are supported:

  1. Unique Mapping Clustering
  2. Row-Column Clustering
  3. Best Assignment Clustering

For more details on the functionality of the first method, see here. The second algorithm implements an efficient approximation of the Hungarian Algorithm, while the third one implements an efficient, heuristic solution to the assignment problem in unbalanced bipartite graphs.
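The core idea of Unique Mapping Clustering can be sketched as follows (a simplified Python illustration, not JedAI's implementation): comparisons are processed in decreasing similarity, and each entity of either dataset is matched at most once, reflecting the Clean-Clean ER assumption that each dataset is duplicate-free.

```python
def unique_mapping_clustering(edges, threshold=0.5):
    """Unique Mapping Clustering sketch for Clean-Clean ER: greedily accept the
    highest-similarity pairs, matching every entity at most once."""
    matched_left, matched_right, pairs = set(), set(), []
    for left, right, sim in sorted(edges, key=lambda e: -e[2]):
        if sim < threshold:
            break  # remaining edges are even weaker
        if left not in matched_left and right not in matched_right:
            matched_left.add(left)
            matched_right.add(right)
            pairs.append((left, right))
    return pairs

edges = [("a1", "b1", 0.9), ("a1", "b2", 0.8), ("a2", "b2", 0.7)]
print(unique_mapping_clustering(edges))  # [('a1', 'b1'), ('a2', 'b2')]
```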

Similarity Join

Similarity Join provides state-of-the-art algorithms for accelerating the computation of a specific character- or token-based similarity measure in combination with a user-defined similarity threshold.

The following token-based similarity join algorithms are supported:

  1. AllPairs
  2. PPJoin
  3. SilkMoth

The following character-based similarity join algorithms are also supported:

  1. FastSS
  2. PassJoin
  3. PartEnum
  4. EdJoin
  5. AllPairs
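To clarify what these algorithms accelerate, here is the naive computation they all improve upon (a quadratic baseline sketch in Python; AllPairs, PPJoin and the others obtain the same result far faster using prefix and length filters):

```python
from itertools import combinations

def jaccard_join(records, threshold=0.8):
    """Naive token-based similarity join: return all record pairs whose
    token-set Jaccard similarity reaches the threshold."""
    toks = [set(r.lower().split()) for r in records]
    out = []
    for i, j in combinations(range(len(records)), 2):
        union = len(toks[i] | toks[j])
        if union and len(toks[i] & toks[j]) / union >= threshold:
            out.append((i, j))
    return out

records = ["entity resolution toolkit",
           "entity resolution toolkit java",
           "other text"]
print(jaccard_join(records, threshold=0.7))  # [(0, 1)]
```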

Comparison Prioritization

Comparison Prioritization associates every comparison in a block collection with a weight that is proportional to the likelihood that it involves duplicates, and then emits the comparisons iteratively, in decreasing order of weight.

The following methods are currently supported:

  1. Local Progressive Sorted Neighborhood
  2. Global Progressive Sorted Neighborhood
  3. Progressive Block Scheduling
  4. Progressive Entity Scheduling
  5. Progressive Global Top Comparisons
  6. Progressive Local Top Comparisons

For more details on the functionality of these methods, see here.
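The flavor of these progressive methods can be conveyed with a simplified Python sketch of a Progressive Sorted Neighborhood scheme (an illustration only; the actual methods use more elaborate weighting): entities are sorted by a blocking key, and comparisons are emitted window by window, so the closest, most promising pairs come first.

```python
def progressive_sorted_neighborhood(entities, key=lambda e: e):
    """Progressive Sorted Neighborhood sketch: sort entities by a blocking key
    and emit comparisons with a growing window, nearest neighbors first."""
    order = sorted(range(len(entities)), key=lambda i: key(entities[i]))
    for w in range(1, len(entities)):      # growing window distance
        for i in range(len(entities) - w):
            yield order[i], order[i + w]   # pair of entity indices

entities = ["carol", "alice", "bob"]
print(list(progressive_sorted_neighborhood(entities)))
```

A downstream consumer can stop consuming this stream at any time, having executed the most promising comparisons first.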

How to add JedAI as a dependency to your project

Visit https://search.maven.org/artifact/org.scify/jedai-core

How to run JedAI as a Docker image

After installing Docker on your machine, type the following commands:

docker pull gmandi/jedai-webapp

docker run -p 8080:8080 gmandi/jedai-webapp

Then, open your browser and go to localhost:8080. JedAI should be running on your browser!

How to use JedAI with Python

You can combine JedAI with Python through PyJNIus (https://github.com/kivy/pyjnius).

Preparation Steps:

  1. Install Python 3 and PyJNIus (https://github.com/kivy/pyjnius).
  2. Install Java 8 (OpenJDK) and OpenJFX for Java 8, and configure it as the default Java.
  3. Create a directory or a jar file with jedai-core and its dependencies. One approach is to use the maven-assembly-plugin (https://maven.apache.org/plugins/maven-assembly-plugin/usage.html), which packages everything into a single jar file: jedai-core-3.0-jar-with-dependencies.jar

The following Python 3 example reads the ACM.csv file found under JedAIToolkit/data/cleanCleanErDatasets/DBLP-ACM and prints the entities it contains:

import jnius_config
jnius_config.add_classpath('jedai-core-3.0-jar-with-dependencies.jar')

from jnius import autoclass

# Load the JedAI CSV reader class from the JVM
CsvReader = autoclass('org.scify.jedai.datareader.entityreader.EntityCSVReader')

filePath = 'path_to/ACM.csv'
csvReader = CsvReader(filePath)
csvReader.setAttributeNamesInFirstRow(True)
csvReader.setSeparator(",")
csvReader.setIdIndex(0)

profiles = csvReader.getEntityProfiles()
profilesIterator = profiles.iterator()
while profilesIterator.hasNext():
    profile = profilesIterator.next()
    print("\n\n" + profile.getEntityUrl())
    attributesIterator = profile.getAttributes().iterator()
    while attributesIterator.hasNext():
        print(attributesIterator.next().toString())

jedaitoolkit's People

Contributors

alexjoom, cvedetect, cyber-cypher, eni-veres, gabrielepisciotta, ggianna, gioargyr, gpapadis, lwj5, moemode, mthanos, murray1991, swamikevala, xstrtok, zikani03


jedaitoolkit's Issues

Unable to load csv

I am using the latest JedAI-gui: jedai-ui.7z. I tried loading DBLP-ACM .csv data:

  1. ACM.csv
  2. DBLP2.csv
  3. DBLP-ACM_perfectMapping.csv

and I get the following error (please see attached):

GtCSVReader problems with jgrapht ConnectivityInspector

This issue arose when I attempted to reproduce the workflow in: org.scify.jedai.demoworkflows.CsvDblpAcm.java.

During the reading process of the ground truths in DBLP-ACM_perfectMapping.csv (specifically the GtCSVReader.getDuplicatePairs method), the detection of connected components by the jgrapht package seems to not work.

For some reason I obtain a single cluster of size 2225 and then 5375 more clusters of size 1, which is obviously incorrect since the csv contains about 2225 unique pairs (which should in turn produce 2225 clusters of size 2).

Have you seen this problem before? Maybe the jgrapht package expects a different format than it did previously?

Question about Data

Hi, I found that the dataset sizes in this repository do not seem to match the originals, and I would like to know whether the data has been processed. For example, the original Amazon-Google has 1363 and 3226 entities and 1300 matches respectively, but the numbers are smaller in this project.

Also, a lot of the dirty datasets seem to simply mix the two tables together. Is there any other processing?

Unable to build jedai-core - missing dependencies

Hi,

I'm unable to build the project.
The following dependencies can't be found:

  • com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0
  • gr.demokritos:JInsect:jar:1.1
  • salvo.jesus:OpenJGraph:jar:1.1

The first one can't be found at all; the other two seem to be hosted on an unreachable repository, http://backend1.scify.org:60004/artifactory/pub-release-local

mvn clean install -U
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] jedai                                                              [pom]
[INFO] jedai-core                                                         [jar]
[INFO] jedai-ui                                                           [jar]
[INFO]
[INFO] ---------------------------< gr.scify:jedai >---------------------------
[INFO] Building jedai 1.3                                                 [1/3]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ jedai ---
[INFO]
[INFO] --- maven-install-plugin:2.4:install (default-install) @ jedai ---
[INFO] Installing C:\projet\JedAIToolkit\pom.xml to C:\Users\nicolas.lledo\.m2\repository\gr\scify\jedai\1.3\jedai-1.3.pom
[INFO]
[INFO] ------------------------< gr.scify:jedai-core >-------------------------
[INFO] Building jedai-core 1.3                                            [2/3]
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/com/esotericsoftware/minlog/minlog/1.2-slf4j-jdanbrown-0/minlog-1.2-slf4j-jdanbrown-0.pom
[WARNING] The POM for com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/gr/demokritos/JInsect/1.1/JInsect-1.1.pom
[WARNING] The POM for gr.demokritos:JInsect:jar:1.1 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/salvo/jesus/OpenJGraph/1.1/OpenJGraph-1.1.pom
[WARNING] The POM for salvo.jesus:OpenJGraph:jar:1.1 is missing, no dependency information available
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/com/esotericsoftware/minlog/minlog/1.2-slf4j-jdanbrown-0/minlog-1.2-slf4j-jdanbrown-0.jar
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/salvo/jesus/OpenJGraph/1.1/OpenJGraph-1.1.jar
Downloading from nexus.somecompany.com: http://nexus.somecompany.com/repository/maven-public/gr/demokritos/JInsect/1.1/JInsect-1.1.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for jedai 1.3:
[INFO]
[INFO] jedai .............................................. SUCCESS [  0.452 s]
[INFO] jedai-core ......................................... FAILURE [  1.671 s]
[INFO] jedai-ui ........................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.393 s
[INFO] Finished at: 2019-02-27T17:50:24+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project jedai-core: Could not resolve dependencies for project gr.scify:jedai-core:jar:1.3: The following artifacts could not be resolved: com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0, gr.demokritos:JInsect:jar:1.1, salvo.jesus:OpenJGraph:jar:1.1: Could not find artifact com.esotericsoftware.minlog:minlog:jar:1.2-slf4j-jdanbrown-0 in nexus.somecompany.com (http://nexus.somecompany.com/repository/maven-public/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :jedai-core

Apply JedAI blocking programmatically - missing documentation

Hi!

I have successfully made the Web application work and I also made my first successful steps by using JedAI with Python.

But now I want to do it programatically with Python and without the Web application, so I want to apply the full workflow but only with the terminal and the VS Code.

But I couldn't find any detailed documentation how I can do blocking, cleaning ... programatically.

ArrayIndexOutOfBoundsException when blocking with schema clusters

I got the following error when I tried blocking with schema clusters:

java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.lambda$parseIndex$10(AbstractBlockBuilding.java:167)
	at java.base/java.util.HashMap.forEach(HashMap.java:1336)
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.parseIndex(AbstractBlockBuilding.java:164)
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.readBlocks(AbstractBlockBuilding.java:196)
	at org.scify.jedai.blockbuilding.AbstractBlockBuilding.getBlocks(AbstractBlockBuilding.java:96)
	at org.scify.jedai.gui.utilities.WorkflowManager.runBlockBuilding(WorkflowManager.java:824)
	at org.scify.jedai.gui.utilities.WorkflowManager.runBlockingBasedWorkflow(WorkflowManager.java:896)
	at org.scify.jedai.gui.utilities.WorkflowManager.executeFullBlockingBasedWorkflow(WorkflowManager.java:393)
	at org.scify.jedai.gui.utilities.WorkflowManager.executeFullWorkflow(WorkflowManager.java:695)
	at org.scify.jedai.gui.controllers.steps.CompletedController.lambda$runAlgorithmBtnHandler$6(CompletedController.java:316)
	at java.base/java.lang.Thread.run(Thread.java:834)

There is a String split operation in the parseIndex function that is not working properly:
final String[] entropyString = key.split(CLUSTER_SUFFIX);
The delimiters used in key are equivalent to CLUSTER_PREFIX, not CLUSTER_SUFFIX, and they contain a dollar sign that has to be escaped. I worked around the issue by changing the above line to
final String[] entropyString = key.split("#\\$!cl");

I'd suggest changing the values of the prefix and suffix to something that is compatible with regex - the workaround above is less readable, after all.

JedAI for Data matching

Hello,
I am trying to run the Web-based application for a data matching task. I have two tables in CSV format: the first table contains 1.2k rows and the second contains 7k queries. I want to use JedAI to match each query with a row from the first table. When I run a "block-based workflow", the process gets stuck at table loading.
I am a bit lost about how to configure the model. So far I have tried the settings in the video tutorial and some other settings, but the application never generates any output. I share the tables with this message; please let me know if there is anything wrong with the way I generated them.

Unable to Read csv or json files

I have my own custom CSV files for both the dataset and the ground truth.
Can anyone help me use these files to get results?
They throw some errors when used as input:

Exception in thread "main" java.lang.IllegalArgumentException: loops not allowed
	at org.jgrapht.graph.AbstractBaseGraph.addEdge(AbstractBaseGraph.java:218)
	at org.scify.jedai.datareader.groundtruthreader.GtCSVReader.getDuplicatePairs(GtCSVReader.java:206)
	at org.scify.jedai.datareader.groundtruthreader.AbstractGtReader.getDuplicatePairs(AbstractGtReader.java:58)
	at org.scify.jedai.workflowbuilder.Main.main(Main.java:254)

can anyone help me?

Make jedai-core Extensible

Users of jedai-core are unable to extend the library to utilize a custom similarity metric or entity matching method due to the enums defined in the project (e.g. SimilarityMetric, EntityMatchingMethod, BlockCleaningMethod, etc.). Instead, if these features utilized an extension mechanism (for example, java.util.ServiceLoader or something equivalent), custom features would be possible.

Converting the DBPedia dataset into non-Java format

Hello,
I'm working on converting the DBPedia dataset into a format accessible without Java.
I have already converted cleanDBPedia1/2.
However, I do not understand the ground truth format.
The profiles have attributes and a URI, but the pairs in the ground truth consist of numbers.
When I interpret these numbers as offsets into either file, I end up with non-matching pairs.
I wrote the entities into the files in the order they appeared in the deserialized Java list.
How can I find the matching pairs / understand the ground truth?
Kind regards

No URLs to Download

I am trying to download the pre-compiled version from the http://jedai.scify.org website.

When I click on Download desktop app for both "Desktop application for Entity Resolution" and "Workbench tool," I get a "Page Not Found" on Github.

Additionally, I created an issue for this, because the webpage doesn't have any contact information. :/

^ I tried compiling it on my machine, but it slowed to a crawl and took over an hour, so I decided to try to download the precompiled JARs instead.

Error on TestGtRDFReader

Hi, I tried some tests with the JedAI tool.
This tool is useful for my job and I think that it has great potential.
I've downloaded the attached files in nt format: source.nt, target.nt.
In the first step, I successfully executed the TestRdfReader class present in the test package for both datasets. After that, I tried to execute the TestGtRDFReader class with the same datasets, but I got the following error:
Exception in thread "main" java.lang.IllegalArgumentException: loops not allowed
	at org.jgrapht.graph.AbstractBaseGraph.addEdge(AbstractBaseGraph.java:203)
	at org.scify.jedai.datareader.groundtruthreader.GtRDFReader.performReading(GtRDFReader.java:236)
	at org.scify.jedai.datareader.groundtruthreader.GtRDFReader.getDuplicatePairs(GtRDFReader.java:92)
	at org.scify.jedai.datareader.groundtruthreader.AbstractGtReader.getDuplicatePairs(AbstractGtReader.java:57)
	at org.scify.jedai.datareader.TestGtRDFReader.main(TestGtRDFReader.java:39)

datasets.zip

Thanks in advance!

SiGMa Similarity

I had a look at the code of the SiGMa Similarity in class CharacterNGramsWithGlobalWeights and it seems to be exactly the same code as in the Generalized Jaccard Similarity. Am I missing something or is SiGMa not really implemented?

UI and Docker's Web Application get stuck in Data Reading Phase

I get the following error after specifying input sources and then pressing "Next" button in Data Reading Step in JedAI UI:

The input files could not be read successfully.

Details: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Character

In the terminal of Docker's Web Application I have the following:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Character
	at kr.di.uoa.gr.jedaiwebapp.models.Dataset.<init>(Dataset.java:86) ~[classes!/:0.0.1-SNAPSHOT]
	at kr.di.uoa.gr.jedaiwebapp.controllers.WorkflowController.validate_DataRead(WorkflowController.java:75) ~[classes!/:0.0.1-SNAPSHOT]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_212]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_212]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_212]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_212]
	at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190) ~[spring-web-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
...

Reduce memory footprint of SimilarityPairs

We're using jedai-core (not jedai-ui) in our application and we ran into some Out of Memory errors and started profiling our application. The largest chunk of memory was from SimilarityPairs. We experimented with reducing the size of the similarities from double to float and that reduced the memory footprint by about 25% (630 MB -> 470 MB).

I'm assuming we don't need the extra precision afforded by double, is that correct?

Better structure for match results output file

The PrintToFile.toCSV() method should output the original entity urls, and should be in a format which is easier to import into a database, e.g. 3 columns: cluster_id, dataset, entity_url

Documentation or examples for the open source library

I cannot seem to find any documentation or examples of a standard workflow implemented in Python or Java in your repository. Do either of these exist? If so, where could I find them? If not, it would be very useful to have them, since a new user of your tool, like me, currently has to go through all of the Java classes to learn how to use it, which takes a lot of time.

Cannot read ground truth

There is a bug in the code that prevents the ground truth in CSV format from being read. I tried the samples provided, and the web-based Docker image failed to load them. I downloaded the code, ran it step by step, and I think there is a problem with the GtCSVReader. The reading part takes strings like "thisisastring" where only thisisastring should be read. I tried to add nextLine[0] = nextLine[0].substring(1, nextLine[0].length()-1); on line 200 in that file, but no success so far. I need to make it work to test some CSV entity matchings, so maybe somebody has the fix for this issue?

Regarding JedAIToolkit sample csv file

I am looking at the source code of JedAIToolkit on GitHub.

I am not able to find the sample csv file for testing.

Can I get the cd_gold.csv and cd.csv files which have been used for testing purposes in TestGtCSVReader.java and TestEntityCSVReader.java?

Unable to build

I cloned the project to my local machine and followed the steps listed in the README, but it fails to build with the error below:

git clone https://github.com/scify/JedAIToolkit.git
cd JedAIToolkit
git submodule update --init
mvn clean package
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ jedai-ui ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] jedai .............................................. SUCCESS [  0.259 s]
[INFO] jedai-core ......................................... SUCCESS [ 59.511 s]
[INFO] jedai-ui ........................................... FAILURE [  6.408 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:06 min
[INFO] Finished at: 2018-12-11T15:42:46-05:00
[INFO] Final Memory: 42M/406M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default) on project jedai-ui: Error reading assemblies: Error locating assembly descriptor: assembly.xml
[ERROR]
[ERROR] [1] [INFO] Searching for file location: C:\Users\Yeikel\Documents\JedAIToolkit\jedai-ui\assembly.xml
[ERROR]
[ERROR] [2] [INFO] File: C:\Users\Yeikel\Documents\JedAIToolkit\jedai-ui\assembly.xml does not exist.
[ERROR]
[ERROR] [3] [INFO] Invalid artifact specification: 'assembly.xml'. Must contain at least three fields, separated by ':'.
[ERROR]
[ERROR] [4] [INFO] Failed to resolve classpath resource: assemblies/assembly.xml from classloader: ClassRealm[plugin>org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5, parent: sun.misc.Launcher$AppClassLoader@33909752]
[ERROR]
[ERROR] [5] [INFO] Failed to resolve classpath resource: assembly.xml from classloader: ClassRealm[plugin>org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5, parent: sun.misc.Launcher$AppClassLoader@33909752]
[ERROR]
[ERROR] [6] [INFO] File: C:\Users\Yeikel\Documents\JedAIToolkit\assembly.xml does not exist.
[ERROR]
[ERROR] [7] [INFO] Building URL from location: assembly.xml
[ERROR] Error:
[ERROR] java.net.MalformedURLException: no protocol: assembly.xml
[ERROR]         at java.net.URL.<init>(URL.java:593)
[ERROR]         at java.net.URL.<init>(URL.java:490)
[ERROR]         at java.net.URL.<init>(URL.java:439)
[ERROR]         at org.apache.maven.shared.io.location.URLLocatorStrategy.resolve(URLLocatorStrategy.java:54)
[ERROR]         at org.apache.maven.shared.io.location.Locator.resolve(Locator.java:81)
[ERROR]         at org.apache.maven.plugin.assembly.io.DefaultAssemblyReader.addAssemblyFromDescriptor(DefaultAssemblyReader.java:309)
[ERROR]         at org.apache.maven.plugin.assembly.io.DefaultAssemblyReader.readAssemblies(DefaultAssemblyReader.java:125)
[ERROR]         at org.apache.maven.plugin.assembly.mojos.AbstractAssemblyMojo.execute(AbstractAssemblyMojo.java:352)
[ERROR]         at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR]         at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR]         at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
[ERROR]         at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR]         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
[ERROR]         at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
[ERROR]         at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
[ERROR]         at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
[ERROR]         at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
[ERROR]         at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
[ERROR]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcces                                                                   sorImpl.java:62)
[ERROR]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMet                                                                   hodAccessorImpl.java:43)
[ERROR]         at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhan                                                                   ced(Launcher.java:289)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Laun                                                                   cher.java:229)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExi                                                                   tCode(Launcher.java:415)
[ERROR]         at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launch                                                                   er.java:356)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e swit                                                                   ch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please rea                                                                   d the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionE                                                                   xception
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :jedai-ui

Remove maven-assembly-plugin Configuration From jedai-core

If another project depends on jedai-core, bundling the transitive dependencies inside jedai-core can cause conflicts whenever that project needs different versions of the same dependencies. Since jedai-ui already assembles its transitive dependencies, removing the assembly from jedai-core should not affect the UI.

[WorkflowBuilder.Main] Error: can't locate dataset

Using the library from the CLI (Linux), it raises this exception:

Please choose one of the available Clean-clean ER datasets:
1 - Abt-Buy
2 - DBLP-ACM
3 - DBLP-Scholar
4 - Amazon-Google Products
5 - IMDB-DBPedia Movies
1
Abt-Buy has been selected!
0 [main] ERROR com.esotericsoftware.minlog  - Error in data reading
java.io.FileNotFoundException: data/cleanCleanErDatasets/amazonProfiles (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at java.io.FileInputStream.<init>(FileInputStream.java:93)
	at org.scify.jedai.datareader.AbstractReader.loadSerializedObject(AbstractReader.java:54)
	at org.scify.jedai.datareader.entityreader.EntitySerializationReader.getEntityProfiles(EntitySerializationReader.java:48)
	at org.scify.jedai.workflowbuilder.Main.main(Main.java:241)
Exception in thread "main" java.lang.NullPointerException
	at java.util.ArrayList.addAll(ArrayList.java:581)
	at org.scify.jedai.datareader.entityreader.EntitySerializationReader.getEntityProfiles(EntitySerializationReader.java:48)
	at org.scify.jedai.workflowbuilder.Main.main(Main.java:241)
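Since the missing dataset surfaces first as a FileNotFoundException and then as a NullPointerException, a defensive existence check before deserializing would give a clearer message. A minimal sketch, assuming the default dataset location; the class and method names below are illustrative, not part of the JedAI API:

```java
import java.io.File;

public class DatasetCheck {

    // Returns true only if the serialized dataset file exists and is readable.
    public static boolean datasetAvailable(String path) {
        final File f = new File(path);
        return f.exists() && f.isFile() && f.canRead();
    }

    public static void main(String[] args) {
        // Illustrative path: the serialized datasets must be downloaded
        // separately and placed under data/cleanCleanErDatasets, and the
        // tool must be run from the repository root.
        final String path = "data/cleanCleanErDatasets/amazonProfiles";
        if (!datasetAvailable(path)) {
            System.err.println("Dataset not found: " + path
                    + " - download it and re-run from the repository root.");
        }
    }
}
```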

Dependency org.apache.httpcomponents:httpclient-cache, leading to CVE problem

Hi, in /maven-plugins/sitegen-maven-plugin there is a dependency **org.apache.httpcomponents:httpclient-cache:jar:4.2.6** that calls a risky method.

CVE-2020-13956

The affected version range of this CVE is [, 4.5.13).

After further analysis, the main API called in this project is org.apache.http.client.utils.URIUtils: extractHost(java.net.URI)Lorg.apache.http.HttpHost;

Repair link for the risky method: GitHub

CVE invocation path (length: 7):

org.scify.jedai.datawriter.BlocksPerformanceWriter: printDetailedResultsToSPARQL(java.util.List,java.util.List,java.lang.String,java.lang.String)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.sparql.modify.UpdateProcessRemoteForm: execute()V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.riot.web.HttpOp: execHttpPostForm(java.lang.String,org.apache.jena.sparql.engine.http.Params,java.lang.String,org.apache.jena.riot.web.HttpResponseHandler,org.apache.http.client.HttpClient,org.apache.http.protocol.HttpContext,org.apache.jena.atlas.web.auth.HttpAuthenticator)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.jena.riot.web.HttpOp: exec(java.lang.String,org.apache.http.client.methods.HttpUriRequest,java.lang.String,org.apache.jena.riot.web.HttpResponseHandler,org.apache.http.client.HttpClient,org.apache.http.protocol.HttpContext,org.apache.jena.atlas.web.auth.HttpAuthenticator)V /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.impl.client.AbstractHttpClient: execute(org.apache.http.client.methods.HttpUriRequest,org.apache.http.protocol.HttpContext)Lorg.apache.http.HttpResponse; /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.impl.client.AbstractHttpClient: determineTarget(org.apache.http.client.methods.HttpUriRequest)Lorg.apache.http.HttpHost; /home/hjf/.m2/repository/org/apache/jena/jena-cmds/3.1.0/jena-cmds-3.1.0.jar
org.apache.http.client.utils.URIUtils: extractHost(java.net.URI)Lorg.apache.http.HttpHost;

Dependency tree:

[INFO] org.scify:jedai-core:jar:3.2.1
[INFO] +- org.jgrapht:jgrapht-core:jar:1.4.0:compile
[INFO] |  \- org.jheaps:jheaps:jar:0.11:compile
[INFO] +- net.sf.trove4j:trove4j:jar:3.0.3:compile
[INFO] +- com.esotericsoftware:minlog:jar:1.3.1:compile
[INFO] +- info.debatty:java-lsh:jar:0.11:compile
[INFO] |  \- info.debatty:java-string-similarity:jar:0.12:compile
[INFO] +- org.apache.commons:commons-lang3:jar:3.4:compile
[INFO] +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] +- org.apache.jena:jena-arq:jar:3.1.0:compile
[INFO] |  +- org.apache.jena:jena-core:jar:3.1.0:compile
[INFO] |  |  +- org.apache.jena:jena-iri:jar:3.1.0:compile
[INFO] |  |  +- xerces:xercesImpl:jar:2.11.0:compile
[INFO] |  |  |  \- xml-apis:xml-apis:jar:1.4.01:compile
[INFO] |  |  +- commons-cli:commons-cli:jar:1.3:compile
[INFO] |  |  \- org.apache.jena:jena-base:jar:3.1.0:compile
[INFO] |  |     \- com.github.andrewoma.dexx:collection:jar:0.6:compile
[INFO] |  +- org.apache.jena:jena-shaded-guava:jar:3.1.0:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.2.6:compile
[INFO] |  |  +- org.apache.httpcomponents:httpcore:jar:4.2.5:compile
[INFO] |  |  \- commons-codec:commons-codec:jar:1.6:compile
[INFO] |  +- com.github.jsonld-java:jsonld-java:jar:0.7.0:compile
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-core:jar:2.3.3:compile
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-databind:jar:2.3.3:compile
[INFO] |  |  |  \- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO] |  |  \- commons-io:commons-io:jar:2.4:compile
[INFO] |  +- org.apache.httpcomponents:httpclient-cache:jar:4.2.6:compile
[INFO] |  +- org.apache.thrift:libthrift:jar:0.9.2:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.20:compile
[INFO] |  +- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.7.20:compile
[INFO] +- org.apache.jena:jena-cmds:jar:3.1.0:compile
[INFO] |  +- org.apache.jena:apache-jena-libs:pom:3.1.0:compile
[INFO] |  |  \- org.apache.jena:jena-tdb:jar:3.1.0:compile
[INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.20:compile
[INFO] |  \- log4j:log4j:jar:1.2.17:compile
[INFO] +- com.opencsv:opencsv:jar:3.7:compile
[INFO] +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] +- org.scify:JInsect:jar:1.1:compile
[INFO] |  \- org.scify:OpenJGraph:jar:1.1:compile
[INFO] +- org.rdfhdt:hdt-java-core:jar:1.1:compile
[INFO] |  +- com.beust:jcommander:jar:1.32:compile
[INFO] |  +- org.rdfhdt:hdt-api:jar:1.1:compile
[INFO] |  \- org.apache.commons:commons-compress:jar:1.6:compile
[INFO] |     \- org.tukaani:xz:jar:1.4:compile
[INFO] +- com.google.guava:guava-testlib:jar:30.1.1-jre:test
[INFO] |  +- com.google.code.findbugs:jsr305:jar:3.0.2:test
[INFO] |  +- org.checkerframework:checker-qual:jar:3.8.0:test
[INFO] |  +- com.google.errorprone:error_prone_annotations:jar:2.5.1:test
[INFO] |  +- com.google.j2objc:j2objc-annotations:jar:1.3:test
[INFO] |  +- com.google.guava:guava:jar:30.1.1-jre:test
[INFO] |  |  +- com.google.guava:failureaccess:jar:1.0.1:test
[INFO] |  |  \- com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:test
[INFO] |  \- junit:junit:jar:4.13.2:test
[INFO] |     \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] +- org.hamcrest:hamcrest:jar:2.2:test
[INFO] +- org.junit.jupiter:junit-jupiter-api:jar:5.7.2:test
[INFO] |  +- org.apiguardian:apiguardian-api:jar:1.1.0:test
[INFO] |  +- org.opentest4j:opentest4j:jar:1.2.0:test
[INFO] |  \- org.junit.platform:junit-platform-commons:jar:1.7.2:test
[INFO] \- org.junit.jupiter:junit-jupiter-engine:jar:5.7.2:test
[INFO]    \- org.junit.platform:junit-platform-engine:jar:1.7.2:test

Suggested solutions:

Update dependency version

Thank you very much.

Dirty ER examples input .csv

Hi, is it possible to have sample files in .csv format for

  • entity profile D1
  • ground truth

since .csv files with any formatting I try will not work? The error from JedAI-gui is the following:

[error screenshot not captured]

Thank you for the support.

data pairs shown as false negatives and as true positives

I found some cases where data pairs showed up in the final results as false negatives and true positives simultaneously.
The cause lies in the class UnilateralDuplicatePropagation, in the following methods:

public boolean isSuperfluous(int entityId1, int entityId2) {
    final IdDuplicates duplicatePair1 = new IdDuplicates(entityId1, entityId2);
    final IdDuplicates duplicatePair2 = new IdDuplicates(entityId2, entityId1);
    if (duplicates.contains(duplicatePair1)
            || duplicates.contains(duplicatePair2)) {
        if (entityId1 < entityId2) {
            detectedDuplicates.add(duplicatePair1);
        } else {
            detectedDuplicates.add(duplicatePair2);
        }
    }

    return false;
}

public Set<IdDuplicates> getFalseNegatives() {
    final Set<IdDuplicates> falseNegatives = new HashSet<>(duplicates);
    falseNegatives.removeAll(detectedDuplicates);
    return falseNegatives;
}

Only one of the two possible ID orderings is written to detectedDuplicates, while the other ordering may still exist in duplicates. When detectedDuplicates is removed from duplicates to build falseNegatives, those leftover orderings remain and are exported as false negatives, even though the corresponding pairs in detectedDuplicates are exported as true positives.
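One way to avoid the mismatch is to normalize every pair to a canonical id order before it enters either set, so that removeAll cancels detected pairs regardless of how the ground truth stored them. A minimal sketch, using a simplified stand-in Pair type rather than JedAI's IdDuplicates:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class CanonicalPairs {

    // Simplified stand-in for IdDuplicates, with value equality.
    static final class Pair {
        final int a, b;

        // Store the smaller id first, so (1,2) and (2,1) are the same pair.
        Pair(int x, int y) { this.a = Math.min(x, y); this.b = Math.max(x, y); }

        @Override public boolean equals(Object o) {
            return o instanceof Pair && ((Pair) o).a == a && ((Pair) o).b == b;
        }
        @Override public int hashCode() { return Objects.hash(a, b); }
    }

    final Set<Pair> duplicates = new HashSet<>();
    final Set<Pair> detectedDuplicates = new HashSet<>();

    void addGroundTruth(int id1, int id2) { duplicates.add(new Pair(id1, id2)); }

    void markDetected(int id1, int id2) {
        final Pair p = new Pair(id1, id2);
        if (duplicates.contains(p)) {
            detectedDuplicates.add(p);
        }
    }

    Set<Pair> falseNegatives() {
        final Set<Pair> fn = new HashSet<>(duplicates);
        fn.removeAll(detectedDuplicates);  // cancels regardless of original id order
        return fn;
    }
}
```

Because every pair is canonicalized on construction, a pair detected as (1, 2) removes the ground-truth entry even if it was stored as (2, 1), so it can no longer appear as both a true positive and a false negative.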

Change comparison counts type to int

We're using jedai-core in our application and ran into cases where the number of executed comparisons in ComparisonIterator exceeded the total number of comparisons. We identified that this happened because executedComparisons and totalComparisons are floats; changing them to ints fixed the problem. A float cannot represent every large integer exactly, and comparing two floats for exact equality is generally discouraged in Java.
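The underlying reason is that a float has only a 24-bit significand, so above 2^24 = 16,777,216 it can no longer represent every integer and increments start to be silently lost. A small demonstration of the effect:

```java
public class FloatCounter {
    public static void main(String[] args) {
        // Above 2^24, incrementing a float by 1 can be a no-op:
        // 16_777_217 is not representable and rounds back down.
        float f = 16_777_216f;
        f += 1f;
        System.out.println(f == 16_777_216f);  // true - the increment was lost

        // An int (or long for very large workloads) counts exactly.
        int i = 16_777_216;
        i += 1;
        System.out.println(i);  // 16777217
    }
}
```

This is exactly the situation where an executed-comparisons counter kept as a float can drift past a total kept as a float, since neither value is counting exactly.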

Dirty datasets in CSV format

Hi, I was wondering if you have the dirty datasets available in CSV format? Otherwise I can just create a quick script that reads the JSO files and converts them myself, but I figured there is no harm in asking first! Thanks in advance.

CSV headers with upper case don't work for PPJoin

On the Similarity Join page of the UI, when the "Select attribute of Dataset 1" and "Select attribute of Dataset 2" values are given in uppercase, e.g. "INSTANCE ID", the algorithm fails to match any results. On further investigation I found that in the class AbstractSimilarityJoin, the method getAttributeValue(String attributeName, EntityProfile profile) on line 67 should use attributeName.toLowerCase() so that attribute names are compared properly; otherwise the if condition is simply never entered.
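The suggested fix amounts to comparing attribute names case-insensitively. A minimal sketch of the idea, using a plain Map as a stand-in for an EntityProfile's attribute list; the helper below is illustrative, not JedAI's actual method:

```java
import java.util.Map;

public class AttributeLookup {

    // Case-insensitive attribute lookup: returns the value of the first
    // attribute whose name matches, ignoring case, or null if none matches.
    public static String getAttributeValue(String attributeName,
                                           Map<String, String> profile) {
        for (Map.Entry<String, String> e : profile.entrySet()) {
            if (e.getKey().equalsIgnoreCase(attributeName)) {  // was: equals()
                return e.getValue();
            }
        }
        return null;
    }
}
```

With equalsIgnoreCase (or lower-casing both sides once, up front), a UI selection of "INSTANCE ID" still matches a profile attribute stored as "instance id".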
