zentity-io / zentity

Entity resolution for Elasticsearch.

Home Page: https://zentity.io

License: Apache License 2.0

Java 99.75% FreeMarker 0.25%
elasticsearch elasticsearch-plugin entity-resolution gdpr identity-resolution name-matching address-matching entity-matching

zentity's Introduction


zentity

zentity is an Elasticsearch plugin for entity resolution.

zentity aims to be:

  • Simple - Entity resolution is hard. zentity makes it easy.
  • Fast - Get results in real-time. From milliseconds to low seconds.
  • Generic - Resolve anything. People, companies, locations, sessions, and more.
  • Transitive - Resolve over multiple hops. Recursion finds dynamic identities.
  • Multi-source - Resolve over multiple indices with disparate mappings.
  • Accommodating - Operate on data as it exists. No changing or reindexing data.
  • Logical - Logic is easier to read, troubleshoot, and optimize than statistics.
  • 100% Elasticsearch - Elasticsearch is a great foundation for entity resolution.

Documentation

Documentation is hosted at https://zentity.io/docs

Quick start

Once you have installed Elasticsearch, you can install zentity from a remote URL or a local file.

  1. Browse the releases.
  2. Find a release that matches your version of Elasticsearch. Copy the name of the .zip file.
  3. Install the plugin using the elasticsearch-plugin script that comes with Elasticsearch.

Example:

elasticsearch-plugin install https://zentity.io/releases/zentity-1.8.3-elasticsearch-8.13.3.zip

Read the installation docs for more details.

Next steps

Read the documentation to learn about entity models, how to manage entity models, and how to resolve entities.

This software is licensed under the Apache License, version 2 ("ALv2"), quoted below.

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

zentity's People

Contributors

austince, davemoore-


zentity's Issues

Release Plan: 1.8.1

Changes

  • Bug Fix: Obtain attributes from object arrays during resolution jobs (Issue: #85, PR: #86)

Release

  • Create branch 1.8.1 from 1.8.0 branch
  • Update zentity.version in pom.xml and version numbers in README.md
  • Tag as zentity-1.8.1 and push to build and deploy its release artifacts
  • Verify successful deployment

Post-release

  • Update zentity.io home
  • Update zentity.io releases
  • Update zentity.io documentation

Allow integers to be given as inputs to fields that require floats

The score and quality fields in entity models are floats. Currently, zentity strictly requires the inputs of those fields to be floats. If an integer is submitted to one of these fields, zentity will throw a validation exception.

This behavior is too restrictive for some clients. JavaScript's JSON.stringify() serializer forces numbers such as 0.0 or 1.0 to be serialized as 0 or 1, and there is no easy way around this.

zentity should allow integers as inputs to float fields, and then convert those fields to floats for its own purposes.
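
As an illustration (a fragment of a hypothetical model; the attribute-level "score" field is one of the float fields this issue refers to), a JavaScript client that sets the score to 1.0 would actually send:

PUT _zentity/models/person
{
  "attributes": {
    "name": {
      "type": "string",
      "score": 1
    }
  }
}

Under the current behavior, "score": 1 (an integer) is rejected with a validation exception, while the semantically identical "score": 1.0 is accepted.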

Add logging to zentity

zentity should take advantage of the logging architecture of Elasticsearch to aid troubleshooting. This can be implemented as needed instead of creating a dedicated feature branch for logging.

To implement this:

  1. Add the following property to any class that will use the logger (substituting MyClass with the name of the class):
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class MyClass {
   private static final Logger logger = LogManager.getLogger(MyClass.class);
}
  2. Invoke the logger's methods as needed:
logger.catching(e);
logger.fatal(message);
logger.error(message);
logger.warn(message);
logger.info(message);
logger.debug(message);
logger.trace(message);
  3. Add the following configurations to the elasticsearch.yml file of each node to write the log messages to the Elasticsearch log files:
logger.org.elasticsearch.plugin.zentity: DEBUG
logger.io.zentity: DEBUG

Error on resolution request using an embedded model

Whenever I try to perform a resolution request with an embedded entity model, I get a validation_exception, You must specify either an entity type or an entity model. The structure of my request follows the pattern described in the doc:

POST _zentity/resolution 
{
  "attributes": {...},
  "model": {...}
}
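
For reference, a complete request of this shape with a minimal embedded model (the attribute, matcher, and index names below are hypothetical) looks like:

POST _zentity/resolution
{
  "attributes": {
    "name": [ "Alice Jones" ]
  },
  "model": {
    "attributes": {
      "name": { "type": "string" }
    },
    "resolvers": {
      "name_only": { "attributes": [ "name" ] }
    },
    "matchers": {
      "exact": { "clause": { "term": { "{{ field }}": "{{ value }}" } } }
    },
    "indices": {
      "my_index": {
        "fields": {
          "name.keyword": { "attribute": "name", "matcher": "exact" }
        }
      }
    }
  }
}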

Not matching

I am using Elasticsearch 6.2.4 and the zentity plugin for Elasticsearch 6.2.4 (version 1.0.0).

URL: /_zentity/models/test

My model is:
{ "attributes": { "name": { "type": "string" }, "ssn": { "type": "string" } }, "resolvers": { "name_ssn": { "attributes": [ "name", "ssn" ] }, "name_only": { "attributes": [ "name" ] } }, "matchers": { "exact": { "clause": { "term": { "{{ field }}": "{{ value }}" } } }, "fuzzy": { "clause": { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 2 } } } } }, "indices": { "test": { "fields": { "FIRST_NAME": { "attribute": "name", "matcher": "fuzzy" }, "LAST_NAME": { "attribute": "name", "matcher": "fuzzy" }, "SSN": { "attribute": "ssn", "matcher": "exact" } } } } }

URL: /_zentity/resolution/test

My resolution is:
{ "attributes": { "name": [ "Muruga Mani" ], "ssn": [ "111-22-3333" ] }, "include": { "indices": ["test"], "resolvers": ["name_only"] } }

or

{ "attributes": { "name": [ "Muruga Mani" ], "ssn": [ "111-22-3333" ] } }

My data in ES is (in index test and type person):
{"LAST_NAME":"Mani","FIRST_NAME":"Muruga","SSN":"111-22-3333"}

But it is not resolving the match.

My response is:
{"took":8,"hits":{"total":0,"hits":[]}}

What am I missing here?

maxClauseCount exception thrown for a simple model

I commented on the Elasticsearch Discuss forum about an issue I am having, where a simple model with a resolver for name and phone leads to an exception due to a large clause, and someone recommended I post it over here.

The exception looks like this:

org.elasticsearch.ElasticsearchException$1: maxClauseCount is set to 1024
	at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:639) ~[elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:137) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:264) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:105) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) [elasticsearch-7.3.2.jar:7.3.2]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.3.2.jar:7.3.2]

This is on an index with about 13 million entries, and a model with a single resolver that looks at name and phone number. Iterating through a test set amounting to 1,000 records total, where each record has a name and phone number, I get the above exceptions thrown periodically.

What's worse, anytime these errors are thrown, it takes around 10-30 seconds to resolve itself, which makes it too slow for processing the full data set (around 70k entries).

Just before the exception, the console dumps part of the query to stderr and it looks like a giant query with all of the different phone numbers in the index.

Is there something I can do to prevent this from happening? Is this a result of something I have configured incorrectly?
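
One mitigation that is sometimes suggested for this class of error (an Elasticsearch node setting, not a zentity feature, and it treats the symptom rather than the size of the generated query) is to raise the clause limit in elasticsearch.yml on each node:

indices.query.bool.max_clause_count: 4096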

Unrelated values are being pulled in Fuzzy

Hi,

I am using Elasticsearch 6.2.4 and the zentity plugin for Elasticsearch 6.2.4 (version 1.0.0).

URL: /_zentity/models/test
{ "attributes": { "name": { "type": "string" }, "ssn": { "type": "string" } }, "resolvers": { "name_ssn": { "attributes": [ "name", "ssn" ] }, "name_only": { "attributes": [ "name" ] } }, "matchers": { "exact": { "clause": { "term": { "{{ field }}": "{{ value }}" } } }, "fuzzy": { "clause": { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": 100 } } } } }, "indices": { "test": { "fields": {"firstName": { "attribute": "name", "matcher": "fuzzy" }, "middleName": { "attribute": "name", "matcher": "fuzzy" }, "lastName": { "attribute": "name", "matcher": "fuzzy" }, "otherFirstName": { "attribute": "name", "matcher": "fuzzy" }, "otherLastName": { "attribute": "name", "matcher": "fuzzy" }, "ssn.keyword": { "attribute": "ssn", "matcher": "exact" } } } } }

I want firstName, middleName, lastName, otherFirstName, and otherLastName to be considered as the name attribute.

I have 5 documents in Elasticsearch:
[{"_index":"test","_type":"identity","_id":"5","_version":1,"_score":1,"_source":{"firstName":"test","middleName":null,"lastName":"Beena","otherFirstName":null,"otherLastName":"William","ssn":"109520107"}},{"_index":"test","_type":"identity","_id":"2","_version":3,"_score":1,"_source":{"firstName":"test","middleName":null,"lastName":"test","otherFirstName":null,"otherLastName":null,"ssn":"109520107"}},{"_index":"test","_type":"identity","_id":"4","_version":3,"_score":1,"_source":{"firstName":"Williamz","middleName":null,"lastName":"Beena","otherFirstName":null,"otherLastName":null,"ssn":"109520107"}},{"_index":"test","_type":"identity","_id":"1","_version":1,"_score":1,"_source":{"firstName":"Bina","middleName":null,"lastName":"William","otherFirstName":null,"otherLastName":null,"ssn":"109520107"}},{"_index":"test","_type":"identity","_id":"3","_version":2,"_score":1,"_source":{"firstName":"Beena","middleName":null,"lastName":"Williamz","otherFirstName":null,"otherLastName":null,"ssn":"109520107"}}]

When I hit the URL _zentity/resolution/test with the request below

{ "attributes": { "name": [ "BEENA", "", "WILLIAM" ], "ssn": [ "109520107" ] }, "include": { "indices": [ "test" ], "resolvers": [ "name_ssn" ] } }

I got a response with all of the documents:
{ "took": 53, "hits": { "total": 5, "hits": [ { "_index": "test", "_type": "identity", "_id": "5", "_hop": 0, "_attributes": { "name": "William", "ssn": "109520107" }, "_source": { "firstName": "test", "middleName": null, "lastName": "Beena", "otherFirstName": null, "otherLastName": "William", "ssn": "109520107" } }, { "_index": "test", "_type": "identity", "_id": "4", "_hop": 0, "_attributes": { "name": null, "ssn": "109520107" }, "_source": { "firstName": "Williamz", "middleName": null, "lastName": "Beena", "otherFirstName": null, "otherLastName": null, "ssn": "109520107" } }, { "_index": "test", "_type": "identity", "_id": "1", "_hop": 0, "_attributes": { "name": null, "ssn": "109520107" }, "_source": { "firstName": "Bina", "middleName": null, "lastName": "William", "otherFirstName": null, "otherLastName": null, "ssn": "109520107" } }, { "_index": "test", "_type": "identity", "_id": "3", "_hop": 0, "_attributes": { "name": null, "ssn": "109520107" }, "_source": { "firstName": "Beena", "middleName": null, "lastName": "Williamz", "otherFirstName": null, "otherLastName": null, "ssn": "109520107" } }, { "_index": "test", "_type": "identity", "_id": "2", "_hop": 1, "_attributes": { "name": null, "ssn": "109520107" }, "_source": { "firstName": "test", "middleName": null, "lastName": "test", "otherFirstName": null, "otherLastName": null, "ssn": "109520107" } } ] } }

I was not expecting the documents "_id": "2" and "_id": "5", since the names are totally off.

Can anyone please check on this?

  1. Why are "_id": "2" and "_id": "5" returned? What am I doing wrong?
  2. The name attribute is null in the response, but I see the name fields populated in the _source field.
  3. I tried changing the fuzziness ("fuzziness") from 0 to 100, and I do not see any change in the response. Are there any reference docs for fuzziness?

Unable to detect nested object attribute from the resolver results

Hi,
I created my zentity model, which includes some nested object array attributes to be resolved.
However, when I run the resolution, the _attributes list returned in the result only includes the attributes that were declared at the root level of the document.

The nested object attributes are not detected, so they cannot be used for subsequent recursive resolution traversal.

Is there any way to do this? Thanks.

E.g.:
{ "attributes": { "firstName": { "type": "string" }, "lastName": { "type": "string" }, "licenseNumber": { "type": "string" } }, "resolvers": { "name": { "attributes": ["lastName", "firstName"] }, "license": { "attributes": ["licenseNumber","firstName"] } }, "matchers": { "exact": { "clause": { "term": { "{{ field }}": "{{ value }}" } } }, "exact_license_nested": { "clause": { "nested": { "path": "license", "query": { "term": { "{{ field }}": "{{ value }}" } } } } }, "fuzzy": { "clause": { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": "auto", "operator": "AND" } } } } }, "indices": { "my_index": { "fields": { "firstName": { "attribute": "firstName", "matcher": "fuzzy" }, "lastName": { "attribute": "lastName", "matcher": "fuzzy" }, "license.number.keyword": { "attribute": "licenseNumber", "matcher": "exact_license_nested" } } } } }
When I run the resolution, the _attributes portion of the result only consists of firstName and lastName but not licenseNumber, although license.number is inside the document in the form of

license: [ { number: 1 }, { number: 2 } ]

As a result, only the "name" resolver is traversed for subsequent hops, not the "license" resolver.

Support for ES 7.7.1 or 7.8

We discovered while planning a cluster upgrade that we'd have to remove Zentity functionality completely because the plugin does not register itself as being compatible. Are there any plans to address this?

Add integration tests for Elasticsearch security features

While zentity should run seamlessly with native Elasticsearch security features and has proven to do so in practice, it would be a good idea to write automated tests for zentity operating within the constraints of those security features. The tests will provide assurance that zentity functions as designed in a secured cluster, that zentity does not somehow circumvent those security features, and that zentity properly handles security exceptions.

Features to test

Release Plan: 1.6.2

Changes

  • Bug fix: Blocking calls leads to cluster instability (Issue: #56) (PR: #67)

Blockers

  • Enable assertions in integration tests (Issue: #64) (PR: #67)
  • Use multi-node clusters for integration tests (Issue: #68) (PR: #70)
  • Migrate from Travis CI to GitHub Actions (Issue: #54) (PR: #69)

Post-release

  • Update zentity.io releases

Multi-Node Cluster Hangs on Zentity 1.6.1 Requests

Hey there, I'm currently experiencing an issue running Zentity 1.6.1 with Elasticsearch 7.10.1 inside a multi-node cluster, but not on a single-node cluster. When sending alternating setup/delete requests (as well as other requests), it sometimes hangs, and it looks like the Elasticsearch CoordinatorPublication gets gummed up. I can replicate this both in a local docker-compose setup (attached below) and in Kubernetes with an elastic-on-k8s cluster with 3 master and 3 data nodes.

Here are the logs from the docker-compose setup, where I've deleted then created the index, and the coordination hangs for 30+ seconds:

elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:56:16,893Z", "level": "INFO", "component": "o.e.c.m.MetadataDeleteIndexService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "[.zentity-models/kCCUX_6bS3CZeDQzImGi2A] deleting index", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "deprecation", "timestamp": "2021-01-22T15:56:31,234Z", "level": "DEPRECATION", "component": "o.e.d.c.m.MetadataCreateIndexService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "index name [.zentity-models] starts with a dot '.', in the next major version, index names starting with a dot are reserved for hidden indices and system indices", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:56:31,309Z", "level": "INFO", "component": "o.e.c.m.MetadataCreateIndexService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "[.zentity-models] creating index, cause [api], templates [], shards [1]/[1]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:56:41,313Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "docker-cluster", "node.name": "primary", "message": "after [10s] publication of cluster state version [928] is still waiting for {es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{btWNi8GkTJaAjVjbcQxe2g}{172.19.0.2}{172.19.0.2:9300}{dir} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:57:01,314Z", "level": "WARN", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "docker-cluster", "node.name": "primary", "message": "after [30s] publication of cluster state version [928] is still waiting for {es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{btWNi8GkTJaAjVjbcQxe2g}{172.19.0.2}{172.19.0.2:9300}{dir} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }

Do you think this originates in the plugin or in a misconfiguration of the clusters?

Docker Compose file
version: '3.7'

x-plugin-volume: &plugin-volume "./target/releases/:/plugins"

x-base-es: &base-es
  image: docker.elastic.co/elasticsearch/elasticsearch-oss:${ES_VERSION:-7.10.2}
  user: "elasticsearch"
  # install all plugins in mounted /plugin directory and start the elasticsearch server
  command:
    - /bin/bash
    - -c
    - elasticsearch-plugin install --batch https://zentity.io/releases/zentity-1.6.1-elasticsearch-7.10.2.zip && elasticsearch
  ulimits:
    nofile:
      soft: 65536
      hard: 65536
    memlock:
      soft: -1
      hard: -1
  environment: &base-env
    cluster.name: docker-cluster
    network.host: 0.0.0.0
    # minimum_master_nodes need to be explicitly set when bound on a public IP
    # set to 1 to allow single node clusters
    # Details: elastic/elasticsearch#17288
    discovery.zen.minimum_master_nodes: "1"
    # Reduce virtual memory requirements, see docker/for-win#5202 (comment)
    bootstrap.memory_lock: "false"
    ES_JAVA_OPTS: "-Xms512m -Xmx512m -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0.0.0.0:5050"
    http.cors.enabled: "true"
    http.cors.allow-origin: "*"
    cluster.initial_master_nodes: primary
  networks:
    - elastic

x-base-primary-node: &base-primary-node
  <<: *base-es
  environment:
    <<: *base-env
    node.name: primary
    node.master: "true"
    node.data: "false"
    node.ingest: "false"

x-base-data-node: &base-data-node
  <<: *base-es
  environment:
    <<: *base-env
    discovery.zen.ping.unicast.hosts: elasticsearch
    node.master: "false"
    node.data: "true"
    node.ingest: "true"

services:
  elasticsearch:
    <<: *base-primary-node
    hostname: elasticsearch
    container_name: elasticsearch
    volumes:
      - *plugin-volume
      - es-primary:/usr/share/elasticsearch/data
    ports:
      - "${ES_PORT:-9200}:9200" # http
      - "${DEBUGGER_PORT:-5050}:5050" # debugger

  es-data-1:
    <<: *base-data-node
    hostname: es-data-1
    container_name: es-data-1
    volumes:
      - *plugin-volume
      - es-data-1:/usr/share/elasticsearch/data
    ports:
      - "${DEBUGGER_PORT_DATA_1:-5051}:5050" # debugger

  es-data-2:
    <<: *base-data-node
    hostname: es-data-2
    container_name: es-data-2
    volumes:
      - *plugin-volume
      - es-data-2:/usr/share/elasticsearch/data
    ports:
      - "${DEBUGGER_PORT_DATA_2:-5052}:5050" # debugger

  kibana:
    image: docker.elastic.co/kibana/kibana-oss:${KIBANA_VERSION:-7.10.1}
    hostname: kibana
    container_name: kibana
    logging:
      driver: none
    environment:
      - server.host=0.0.0.0
      - server.name=kibana.local
      - elasticsearch.url=http://elasticsearch:9200
    ports:
      - '${KIBANA_PORT:-5601}:5601'
    networks:
      - elastic

volumes:
  es-primary:
    driver: local
  es-data-1:
    driver: local
  es-data-2:
    driver: local

networks:
  elastic:

Elastic K8s manifest
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 1.3.1
    elasticsearch.k8s.elastic.co/cluster-uuid: 8xDpRuE4T8ufu_KSJV4hFw
  creationTimestamp: "2021-01-20T17:28:29Z"
  generation: 4
  labels:
    app.kubernetes.io/instance: eck-entity-resolution
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: eck-entity-resolution
    app.kubernetes.io/part-of: eck
    app.kubernetes.io/version: 1.1.2
    helm.sh/chart: eck-entity-resolution-0.3.0
  name: eck-entity-resolution
  namespace: entity-resolution
  resourceVersion: "273469952"
  selfLink: /apis/elasticsearch.k8s.elastic.co/v1/namespaces/entity-resolution/elasticsearches/eck-entity-resolution
  uid: cff37de2-c6c3-4ebd-a230-e45f00bdc7e7
spec:
  auth:
    fileRealm:
    - secretName: eck-entity-resolution-users
    roles:
    - secretName: eck-entity-resolution-roles
  http:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
    tls:
      certificate: {}
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - config:
      node.data: false
      node.ingest: false
      node.master: true
    count: 3
    name: primary-node
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms500m -Xmx500m
          name: elasticsearch
          resources:
            limits:
              cpu: 1
              memory: 1Gi
            requests:
              cpu: 0.5
              memory: 1Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch https://github.com/zentity-io/zentity/releases/download/zentity-1.6.1/zentity-1.6.1-elasticsearch-7.10.1.zip
          name: install-plugins
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: standard-expandable
  - config:
      node.data: true
      node.ingest: true
      node.master: false
    count: 3
    name: data-node
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms4g -Xmx4g -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0.0.0.0:5005
          name: elasticsearch
          resources:
            limits:
              cpu: 2
              memory: 8Gi
            requests:
              cpu: 0.5
              memory: 8Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch https://github.com/zentity-io/zentity/releases/download/zentity-1.6.1/zentity-1.6.1-elasticsearch-7.10.1.zip
          name: install-plugins
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
        storageClassName: sdd-fast-expandable
  transport:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
  updateStrategy:
    changeBudget: {}
  version: 7.10.1
status:
  availableNodes: 6
  health: green
  phase: Ready
  version: 7.10.1

Bulk Resolution Support

Similar to the ES bulk API, do you think it is feasible to add bulk resolution support to limit the number of network requests? It seems rare to want to resolve just a single entity if you're doing NER on a decently large piece of text.
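
One hypothetical request shape, mirroring the NDJSON convention of the Elasticsearch Bulk API (the endpoint and line format below are illustrative sketches, not a committed design):

POST _zentity/resolution/_bulk
{ "entity_type": "person" }
{ "attributes": { "name": [ "Alice Jones" ] } }
{ "entity_type": "person" }
{ "attributes": { "name": [ "Bob Smith" ] } }

Each pair of lines would carry the parameters and payload of one resolution request, so a client doing NER over a large text could resolve many candidate entities in a single network round trip.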

Support OpenSearch

We're moving from Elasticsearch 7.10.2 to OpenSearch 1.2.4. The OpenSearch version is a fork of the OSS build of the Elasticsearch version plus a bunch of plugins.

After skimming through the plugin migration documentation (https://github.com/opensearch-project/opensearch-plugins/blob/main/UPGRADING.md), I managed to build zentity for OpenSearch.

Here is a proof of concept:

https://github.com/netom/zentity/tree/opensearch-1.2.4

I wonder if it would be possible for the project to be built for both Elasticsearch and OpenSearch.

I imagine this would require a complete re-design of the interface between the core functionality and the back end.

Non-Latin Language Support

Hi Everyone,

Thank you for the fantastic project.

I read the documentation and tried the project with the provided data, which was in English.

I was wondering whether this is transferable to other non-Latin languages out of the box, or whether there are modifications that need to happen.

Regards,

Releases 7.12.1.jar instead of .zip

In the Releases section there is no zentity-1.8.1-elasticsearch-7.12.1.zip. Instead, a .jar is in its place. Not sure if that is intended.

[Bug]

Environment

  • zentity version: 1.8.2
  • Elasticsearch version: 7.15.1

Describe the bug

I installed zentity successfully on my Windows machine.
While running http://localhost:9200/_zentity it throws the error below:
{"error":{"root_cause":[{"type":"invalid_index_name_exception","reason":"Invalid index name [_zentity], must not start with '_'.","index_uuid":"na","index":"_zentity"}],"type":"invalid_index_name_exception","reason":"Invalid index name [_zentity], must not start with '_'.","index_uuid":"na","index":"_zentity"},"status":400}

Expected behavior

Zentity should run as expected after installation.

Question: How to increase performance on entity resolution?

Hi!
First of all thanks for this very useful project.
I have a question about performance.
One of our use cases is to create entity groups from a pretty large index (8M documents, ~4 GB). The model we use has ~10 attributes and 8 resolvers, and uses matchers with fuzziness. We have fixed the number of hops to 2 to prevent "snowballing".
We call the resolution API ~3M times, and we would like to reduce the total computing time.

Do you think that adding more nodes to our Elasticsearch cluster can improve the response time of the resolution API? For now it's just a single node.

Because of unrelated constraints, we run a pretty old version of Elasticsearch and therefore zentity (6.2.3 and 1.0.0). Do you think upgrading could improve performance?
Do you have any other suggestions?

Resolution output should list attribute values as arrays

Currently, each document returned by a resolution job has an "_attributes" object in which each field is the name of an attribute mapped to a single value (source).

Example:

{
  "_attributes": {
    "name": "Alice Jones",
    "phone": "555-123-4567"
  },
  "_source": {
    "indexed_name": "Alice Jones",
    "indexed_phone": "555-123-4567"
  }
}

However, two common situations could lead to information being lost in the output:

  1. The value of an indexed field can be an array of values. In the example above, if the "_source" object listed "indexed_phone": [ "555-123-4567", "555-987-6543" ], then only one of those values would be mapped to the "phone" attribute in the "_attributes" object.
  2. An entity model can map multiple index fields to the same attribute. In the example above, if the "_source" object listed "indexed_phone_1" and "indexed_phone_2", it's valid to have an entity model that maps both index fields to the "phone" attribute.

In both examples above, the desired behavior would be to ensure that every value is returned as an array in the "_attributes" object. For example:

{
  "_attributes": {
    "name": [ "Alice Jones" ],
    "phone": [ "555-123-4567", "555-987-6543", "555-000-1234" ]
  },
  "_source": {
    "indexed_name": "Alice Jones",
    "indexed_phone_1": [ "555-123-4567", "555-987-6543" ],
    "indexed_phone_2": [ "555-000-1234" ]
  }
}

Without this enhancement, the completeness or accuracy of resolution outputs can't be guaranteed whenever a matching document has multiple values mapped to the same attribute.

This enhancement would be a breaking change that affects most users. But it should be easy for most users to adapt to this change.

Question: Debugging instructions

Hi,

I am trying to run the integration tests but get Integration tests are skipped: got: "Connection refused", expected: not a string containing "Connection refused" from the io.zentity.resolution.AbstractITCase.startRestClient() method, because it can't connect to Elasticsearch. Is there another step I am missing?

I am using IntelliJ, but I am not a Java developer.

Confidence scores for matched documents

Background

Many users have requested a way to indicate the confidence of a match for documents returned in the output of a resolution job. Often the request is for a score, where a higher score indicates greater confidence in the match. Some users envisioned a score for each document. Others envisioned a score for specific fields such as the value of a name.

Currently, and with no change in mind for the future, zentity submits boolean queries to find matching documents in Elasticsearch. Therefore, by the standards of Elasticsearch, every matching document has a constant score of 1.

zentity also offers an "_explanation" field for each document, which describes the resolvers, matchers, and values that caused the document to match.

I believe it would be possible to let users assign scores for various concepts in zentity, to combine those scores to produce an overall confidence score for a matching document, and to implement this in a way that intuitively fits the design of zentity and does not incur a significant performance penalty.

@cmwaters89 deserves recognition for demonstrating the feasibility of this concept. Thank you for your contribution!

Concept

The following zentity components could be extended to support a user-defined score that contributes to an overall confidence score for any matching document:

  • Attribute
  • Resolver
  • Matcher
  • Index
  • Index Field
  • Hop

Examples:

  • Attributes - Users might observe that some attributes, like ssn or email, are likely to identify a single entity more accurately than other attributes like name or dob.
  • Resolvers - Likewise, some combinations of attributes (i.e. "resolvers") are likely to identify a single entity more accurately than others.
  • Matchers - Exact matchers could be considered more confident than fuzzy, phonetic, or ranged matchers.
  • Indices - Some indices could be known to have data that's more or less reliable (clean, accurate, trustworthy) than others.
  • Index Fields - Likewise, some index fields in particular could be known to have data that's more or less reliable than others.
  • Hops - Documents found in later hops could be considered less confident than documents found in earlier hops.

The concept of the feature is to let users define a base score for any of the components listed above in whatever way they see fit for their use case. These base scores could be defined in the entity model and overridden at query time.

Upon query execution, zentity would then, for each document, combine the component base scores to produce an overall match confidence score for the document. This may involve calculating the product or average of the base scores. The way in which zentity combines the scores could also be configured by the user.
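
As a sketch of what this might look like (entirely hypothetical syntax; the placement and names of the "score" fields below are illustrative, not a committed design), an entity model could assign base scores to components:

{
  "attributes": {
    "ssn":  { "type": "string", "score": 0.95 },
    "name": { "type": "string", "score": 0.6 }
  },
  "matchers": {
    "exact": {
      "clause": { "term": { "{{ field }}": "{{ value }}" } },
      "score": 0.9
    },
    "fuzzy": {
      "clause": { "match": { "{{ field }}": { "query": "{{ value }}", "fuzziness": "auto" } } },
      "score": 0.5
    }
  }
}

A document matched by the fuzzy matcher on name alone would then combine 0.6 and 0.5 (by product, average, or whatever combination the user configures) into its overall confidence score, while any component whose base score is left undefined would simply be skipped.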

Default behavior

The default score for any document should be 1, null, or simply not present in the document at all, both to remain agnostic and to be consistent with past versions of the plugin.

The default base score for any component listed above should not influence the overall confidence score of the document. It should only contribute to that score when the user defines the base score. Potentially one solution for this is to let the default base score be null (i.e. undefined) and to skip any such scores when calculating the overall score.

Deconflicting base scores

It's possible for documents to match for multiple reasons. A document could match multiple resolvers and matchers. A case could be made for doing either of the following:

  1. Use the base score for the best resolver or matcher.
  2. Use the combined base scores for every resolver or matcher.

This behavior could be configured by the user, too.

Score thresholds

It should be easy enough to allow the user to define a threshold that a document score must pass in order to be kept in the results. This has been another common and related feature request.

Optional and opt-in

The implementation of this feature would depend on information from the "_explanation" field. This information is only gathered when the client sets _explanation=true in the URL, because the queries become slightly more complex and incur a slight performance penalty. Likewise, I believe this scoring feature should be made optional and "opt-in" for the users who care about it.

Generic

One of the tenets of zentity is to be generic and not domain-specific. zentity should remain agnostic to the actual scoring process, and instead provide a framework in which the user can define a scoring process that fits their use case.

Scoring based on past scores

If a document matches with a relatively low score, should its values penalize the scores of documents that match it in subsequent queries? I have yet to think this through in detail.

Field level scores

This feature should work at the document level. I'm not confident (no pun intended) that this could be done at the field level without adding a lot of complexity to the plugin. By field level, I mean a score that indicates the confidence that the value of a specific field matches an input value. zentity is able to explain the reason for a match by using named filters in Elasticsearch. Named filters can inform zentity of the matchers that led to a hit, but they don't provide details on why they led to a hit.

Users who desire field level scores have a couple alternatives:

  1. Have the client application take the "input_value" and "target_value" from the "_explanation" field to derive a confidence score for the value outside of zentity. Perhaps the client application derives a score by comparing the length and edit distance of the two values, for example.
  2. Upon the release of this feature, treat the score from the matcher as a proxy for the score for the value, since the matcher already describes why two values match.

Dynamic field mapping

First of all, thank you for the awesome work done with the library!

We have a use case where the user profile properties are dynamic and we want to allow resolving the identity using one of these fields:

{
    "name": "Marcos",
    "last_name": "Passos",
    "extra": {
        "loyalty_number": "123"
    }
}

In this example, loyalty_number is a dynamic field, unknown at mapping time. However, we would like to match (exactly) this field in some cases.

Is it supported? What is the recommended approach?
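
One approach the existing API shape would permit (a sketch, not necessarily the recommended one) is to supply an embedded entity model at query time, mapping the dynamic field once its name is known. The index name user_profiles and the .keyword subfield below are assumptions:

POST _zentity/resolution
{
  "attributes": {
    "loyalty_number": [ "123" ]
  },
  "model": {
    "attributes": {
      "loyalty_number": { "type": "string" }
    },
    "resolvers": {
      "loyalty": { "attributes": [ "loyalty_number" ] }
    },
    "matchers": {
      "exact": { "clause": { "term": { "{{ field }}": "{{ value }}" } } }
    },
    "indices": {
      "user_profiles": {
        "fields": {
          "extra.loyalty_number.keyword": { "attribute": "loyalty_number", "matcher": "exact" }
        }
      }
    }
  }
}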

Release Plan: 1.7.0

Changes

  • Feature: Allow attributes to be represented as nested fields. (Issue: #77) (PR: #80)
  • Breaking Change: Enforce naming requirements for entity types (Issue: #58) (PR: #63)
  • Breaking Change: Enforce naming requirements for attributes, resolvers, and matchers (Issue: #73) (PR: #75)
  • Minor: Allow more lenient validation of empty objects in Models API (Issue: #72) (PR: #74)
  • Minor: Allow integers to be given as inputs to float fields (Issue: #59)

Release

  • Create branch 1.7.0 from main branch
  • Update zentity.version in pom.xml and version numbers in README.md
  • Tag as zentity-1.7.0 and push to build and deploy its release artifacts
  • Verify successful deployment

Post-release

  • Update zentity.io home
  • Update zentity.io releases
  • Update zentity.io documentation

_attribute not valorised and hops not executed

We are trying to use Zentity 1.5.1 with Elasticsearch 7.3.2 for a pilot project. We're not able to make a chain of resolvers work, most probably because of the problem reported in the title.
I report here the index mapping of the data:

{
    "obj-person": {
        "aliases": {},
        "mappings": {
            "dynamic": "strict",
            "properties": {
                "completeName": {
                    "type": "text",
                    "fields": {
                        "phonetic": {
                            "type": "text",
                            "analyzer": "phonetic_analyzer"
                        }
                    }
                },
                "entry": {
                    "properties": {
                        "createdBy": {
                            "type": "keyword"
                        },
                        "createdDate": {
                            "type": "date"
                        },
                        "infoObject": {
                            "properties": {
                                "dateOfBirth": {
                                    "properties": {
                                        "originalValue": {
                                            "type": "text",
                                            "copy_to": [
                                                "search"
                                            ]
                                        },
                                        "value": {
                                            "type": "text",
                                            "copy_to": [
                                                "search"
                                            ]
                                        }
                                    }
                                },
                                "firstName": {
                                    "properties": {
                                        "originalValue": {
                                            "type": "text",
                                            "copy_to": [
                                                "search"
                                            ]
                                        },
                                        "value": {
                                            "type": "text",
                                            "fields": {
                                                "phonetic": {
                                                    "type": "text",
                                                    "analyzer": "phonetic_analyzer"
                                                }
                                            },
                                            "copy_to": [
                                                "search",
                                                "completeName"
                                            ]
                                        }
                                    }
                                },
                                "lastName": {
                                    "properties": {
                                        "originalValue": {
                                            "type": "text",
                                            "copy_to": [
                                                "search"
                                            ]
                                        },
                                        "value": {
                                            "type": "text",
                                            "fields": {
                                                "phonetic": {
                                                    "type": "text",
                                                    "analyzer": "phonetic_analyzer"
                                                }
                                            },
                                            "copy_to": [
                                                "search",
                                                "completeName"
                                            ]
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                "search": {
                    "type": "text",
                    "fields": {
                        "graphically_similar": {
                            "type": "text",
                            "analyzer": "normalize_graphically_similar_analyzer"
                        },
                        "normalized": {
                            "type": "text",
                            "analyzer": "normalize_alphanum_analyzer"
                        },
                        "phonetic": {
                            "type": "text",
                            "analyzer": "phonetic_analyzer"
                        }
                    }
                },
                "search_sensitive": {
                    "type": "text"
                },
                "type": {
                    "type": "keyword"
                }
            }
        },
        "settings": {
            "index": {
                "number_of_shards": "1",
                "auto_expand_replicas": "1-5",
                "provided_name": "obj-person",
                "creation_date": "1582637094035",
                "analysis": {
                    "filter": {
                        "phonetic_filter": {
                            "replace": "true",
                            "type": "phonetic",
                            "encoder": "double_metaphone"
                        }
                    },
                    "analyzer": {
                        "phonetic_analyzer": {
                            "filter": [
                                "phonetic_filter"
                            ],
                            "tokenizer": "standard"
                        },
                        "normalize_graphically_similar_analyzer": {
                            "filter": [
                                "uppercase"
                            ],
                            "char_filter": [
                                "strip_special_chars",
                                "replace_graphically_similar"
                            ],
                            "type": "custom",
                            "tokenizer": "keyword"
                        },
                        "normalize_alphanum_analyzer": {
                            "filter": [
                                "uppercase",
                                "reverse"
                            ],
                            "char_filter": "strip_special_chars",
                            "type": "custom",
                            "tokenizer": "keyword"
                        }
                    },
                    "char_filter": {
                        "replace_graphically_similar": {
                            "type": "mapping",
                            "mappings": [
                                "O => 0",
                                "D => 0",
                                "I => 1",
                                "B => 8",
                                "S => 5",
                                "Z => 2",
                                "G => 6",
                                "E => 3",
                                "o => 0",
                                "d => 0",
                                "i => 1",
                                "b => 8",
                                "s => 5",
                                "z => 2",
                                "g => 6",
                                "e => 3"
                            ]
                        },
                        "strip_special_chars": {
                            "pattern": "[^\\w]",
                            "type": "pattern_replace",
                            "replacement": ""
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "JFGNOU6xR4i_BHM8e0nB5Q",
                "version": {
                    "created": "7030199"
                }
            }
        }
    }
}

Creating a zentity model like this:

PUT _zentity/models/zentity_test_resolution_person 
{
    "attributes" : {
      "first_name" : {
        "type" : "string"
      },
      "last_name" : {
        "type" : "string"
      },
      "dob" : {
        "type" : "string"
      }
    },
    "resolvers" : {
      "name_only" : {
        "attributes" : [
          "first_name",
          "last_name"
        ]
      },
      "dob" : {
        "attributes" : [
          "dob"
        ]
      }
    },
    "matchers" : {
      "simple" : {
        "clause" : {
          "match" : {
            "{{ field }}" : "{{ value }}"
          }
        }
      },
      "fuzzy" : {
        "clause" : {
          "match" : {
            "{{ field }}" : {
              "query" : "{{ value }}",
              "fuzziness" : "{{ params.fuzziness }}"
            }
          }
        },
        "params" : {
          "fuzziness" : "auto"
        }
      }
    },
    "indices" : {
      "obj-person" : {
        "fields" : {
          "entry.infoObject.firstName.value" : {
            "attribute" : "first_name",
            "matcher" : "fuzzy"
          },
          "entry.infoObject.lastName.value" : {
            "attribute" : "last_name",
            "matcher" : "fuzzy"
          },
         "entry.infoObject.dateOfBirth.value" : {
            "attribute" : "dob",
            "matcher" : "simple"
          }
        }
      }
    }
}

Using three objects whose core data is:
Person1:

"firstName" : {
  "value" : "Nolan"
},
"lastName" : {
  "value" : "Hendricks"
},
"dateOfBirth" : {
  "value" : "633-9242"
}

Person2:

"firstName" : {
  "value" : "Nolan"
},
"lastName" : {
  "value" : "Hendricks"
},
"dateOfBirth" : {
  "value" : "677-9999"
}

Person3:

"firstName" : {
  "value" : "Noln"
},
"lastName" : {
  "value" : "Hendricks"
},
"dateOfBirth" : {
  "value" : "677-9999"
}

If we execute this resolution:

POST _zentity/resolution/zentity_test_resolution_person?_source=false&_explanation=false
{
  "attributes": {
    "first_name": {
      "values":  ["Nolan"],
      "params": {
        "fuzziness": "0"
      }
    },
    "last_name": ["Hendricks"]
  }
}

the result is this:

{
  "took" : 2,
  "hits" : {
    "total" : 2,
    "hits" : [ {
      "_index" : "obj-person",
      "_type" : "_doc",
      "_id" : "2D6F8CCF-227B-4FBF-A749-16C098BB0C0A",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "dob" : [ ],
        "first_name" : [ ],
        "last_name" : [ ]
      }
    }, {
      "_index" : "obj-person",
      "_type" : "_doc",
      "_id" : "1CE639AD-B3FD-4FAB-9D9B-A469DE75C943",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "dob" : [ ],
        "first_name" : [ ],
        "last_name" : [ ]
      }
    } ]
  }
}

As you can see, the _attributes are not populated at all, and not all of the hops have been performed: I'd expect to see the third result, based on the "dob" field.
Can you point me in the right direction?

Implement and enforce requirements for names of attributes, resolvers, and matchers.

Attributes, resolvers, and matchers should have the same naming requirements as entity type names (see #58) for the same reasons listed in that issue.

This change should be included in the same release as #58 so that users can fix deprecated names in their entity models in one release.

Index names and index field names can be excluded from this validation. Elasticsearch validates those, but only zentity can validate attributes, resolvers, and matchers.

Bulk Model Management

Implement an API endpoint to submit multiple entity model management operations in bulk, borrowing the functionality for bulk operations introduced in #50. This will enable more efficient handling of multiple entity model management operations. One envisioned implementation is an entity model management user interface that provides checkboxes to delete multiple models in one request.

Proposed syntax

POST /_zentity/models/_bulk[?PARAMS]
{ PARAMS }
{ PAYLOAD }
...
POST /_zentity/models/ENTITY_TYPE/_bulk[?PARAMS]
{ PARAMS }
{ PAYLOAD }
...

Accepted operations

The bulk endpoint would support operations that create, update, or delete entity models. Unlike the implementation in #50, this will require a field in PARAMS that indicates which action is to be executed. See the Elasticsearch Bulk API implementation for reference, which requires this convention, too.
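
A concrete instance of the proposed syntax might look like the following (the "action" parameter name and payload shapes are illustrative, not a committed design):

POST /_zentity/models/_bulk
{ "action": "create", "entity_type": "person" }
{ "attributes": { ... }, "resolvers": { ... }, "matchers": { ... }, "indices": { ... } }
{ "action": "delete", "entity_type": "organization" }
{}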

Implement and enforce requirements for entity type names

Currently, entity type names can be any arbitrary string. As noted in another discussion, this is problematic when implementing API endpoints that could conflict with the names of entity types. There may be other unforeseen issues with using arbitrary strings. Entity type names are meant to be identifiers, not necessarily human-readable descriptions, and therefore should be expected to follow some constraints.

Proposal

Enforce the same requirements as the Elasticsearch index name requirements:

Index names must meet the following criteria:

  • Lowercase only
  • Cannot include \, /, *, ?, ", <, >, |, (space character), ,, #
  • Indices prior to 7.0 could contain a colon (:), but that's been deprecated and won't be supported in 7.0+
  • Cannot start with -, _, +
  • Cannot be . or ..
  • Cannot be longer than 255 bytes (note it is bytes, so multi-byte characters will count towards the 255 limit faster)
  • Names starting with . are deprecated, except for hidden indices and internal indices managed by plugins

Entity type names should follow the same rules (though allowing names to start with .). This will prevent entity type names from conflicting with reserved API terms such as _bulk and may help avoid other unforeseen issues related to syntax.

This would introduce a breaking change for existing entity models whose names do not meet these criteria.

Proposal: Move CI from Travis to GitHub Actions

Overview

With Travis CI's new pricing model, Zentity is going to be transitioned to a new plan with limited free use. From their announcement:

[we'll be moved to the] trial (free) plan with a 10K credit allotment (which allows around 1000 minutes in a Linux environment).
When your credit allotment runs out - we'd love for you to consider which of our plans will meet your needs.
We will be offering an allotment of OSS minutes that will be reviewed and allocated on a case by case basis.

Given that, GitHub Actions makes a nice alternative for a few reasons:

  • Completely free for all OSS projects
  • Much faster build starts than Travis, which frequently has long queue times
  • Easier for contributors on GitHub to fork and automatically have CI setup
    • Could potentially make it easy for Zentity members to approve running PRs from untrusted public forks via PR comments, labels, CODEOWNER approval, etc.

GH Actions do have some areas that aren't so nice:

  • More customizable, and therefore more complicated to set up, than Travis
  • Each action that is used in a CI workflow is like a new dependency, and comes with similar drawbacks

Migration Details

I think we can both test and release in one workflow, but I'll start with tests.

Tests

In one job:

Releases

This is where the workflow would be quite different than Travis.

In a second job:

  • Only run on tags
    • This could be the current format, or could be extended to allow for easy Release Candidate creation in the format *.*.*-rc*
    • see on.push.tags
  • Create a GitHub release with the tag via actions/create-release
  • Upload artifacts for each of the matrix ES versions and the Zentity tag to the created release via actions/upload-release-asset

Optionally, we could also automatically create a changelog for the release body (or to add to a separate file) via something like mikepenz/release-changelog-builder-action.

Conclusion

How does this compare with your current process for releasing Zentity?

Please let me know what you think! I'm happy to adjust, find more resources on, or talk about anything in here!

Does zentity support multi_match query

This is the query I am looking for:

POST _zentity/resolution/zentity_tutorial_1_person?pretty&_source=false
{
  "attributes": {
    "first_name": [ "Allie" ],
    "last_name": [ "Jones" ]
  }
}

but it returns an error:

{
  "took": 2,
  "error": {
    "by": "elasticsearch",
    "type": "org.elasticsearch.common.ParsingException",
    "reason": "[multi_match] query does not support [first_name]",
    "stack_trace": "ParsingException[[multi_match] query does not support [first_name]]

The entity model is defined here:

PUT _zentity/models/zentity_tutorial_1_person
{
  "attributes": {
    "first_name": { "type": "string" },
    "last_name": { "type": "string" }
  },
  "resolvers": {
    "name_only": {
      "attributes": [ "first_name", "last_name" ]
    }
  },
  "matchers": {
    "simple": {
      "clause": {
        "match": {
          "{{ field }}": "{{ value }}"
        }
      }
    },
    "multi_match": {
      "clause": {
        "multi_match": {
          "{{ field }}": "{{ value }}",
          "type": "cross_fields",
          "fields": [ "firs_tname", "last_name" ],
          "operator": "and"
        }
      }
    }
  },
  "indices": {
    "zentity_tutorial_1_exact_name_matching": {
      "fields": {
        "first_name": {
          "attribute": "first_name",
          "matcher": "multi_match"
        },
        "last_name": {
          "attribute": "last_name",
          "matcher": "multi_match"
        }
      }
    }
  }
}
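
For what it's worth, the multi_match clause in this model is not valid Elasticsearch query DSL: multi_match takes the search text in a query parameter and the field list in fields, rather than a field name as an object key. A well-formed multi_match matcher clause would look more like the following sketch (whether this interacts correctly with zentity's "{{ field }}" templating is a separate question):

"multi_match": {
  "clause": {
    "multi_match": {
      "query": "{{ value }}",
      "type": "cross_fields",
      "fields": [ "first_name", "last_name" ],
      "operator": "and"
    }
  }
}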

support recent patch versions for ES 7.17.x

Looks like the latest version has to match the ES patch version exactly. Any plan to support the most recent ES patch version?

#7 10.81 Exception in thread "main" java.lang.IllegalArgumentException: Plugin [zentity] was built for Elasticsearch version 7.17.0 but version 7.17.5 is running

Release Plan: 1.8.0

Changes

  • Feature: Bulk resolution (Issue: #50, PR: #79)
  • Feature: Bulk model management (Issue: #57)

Release

  • Create branch 1.8.0 from main branch
  • Update zentity.version in pom.xml and version numbers in README.md
  • Tag as zentity-1.8.0 and push to build and deploy its release artifacts
  • Verify successful deployment

Post-release

  • Update zentity.io home
  • Update zentity.io releases
  • Update zentity.io documentation

Resolve nested objects in Zentity

We are using Zentity 6.2.4. We have processed at least 30 million records for entity resolution, and we are happy with the performance of Zentity. Until now our records contained flat data, but we have a new requirement to store arrays of objects.

For example, license information stored as an array:

{ "firstName":"John", "lastName":"Doe", "license" : [ { "number" : "123" }, { "number" : "456" } ] }

We would like to use Zentity to resolve this record with the license number as well, but I am not able to do it. I have given the index and other details below for your assistance.
Index Mapping
http://ELKHOST/my_index

{
  "mappings": {
    "_doc": {
      "properties": {
        "firstName": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "lastName": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "license": {
          "type": "nested",
          "properties": {
            "number": {
              "type": "text",
              "fields": {
                "keyword": { "type": "keyword", "ignore_above": 256 }
              }
            }
          }
        }
      }
    }
  }
}

Insert the first document

POST http://ELKHOST/my_index/_doc/1

{ "firstName":"John", "lastName":"Doe", "license" : [ { "number" : "123" }, { "number" : "456" } ] }

Zentity Model
http://ELKHOST/_zentity/models/my_index

{
  "attributes": {
    "firstName": { "type": "string" },
    "lastName": { "type": "string" },
    "licenseNumber": { "type": "string" }
  },
  "resolvers": {
    "name": {
      "attributes": [ "lastName", "firstName" ]
    },
    "license": {
      "attributes": [ "licenseNumber" ]
    }
  },
  "matchers": {
    "exact": {
      "clause": {
        "term": { "{{ field }}": "{{ value }}" }
      }
    },
    "fuzzy": {
      "clause": {
        "match": {
          "{{ field }}": {
            "query": "{{ value }}",
            "fuzziness": "auto",
            "operator": "AND"
          }
        }
      }
    }
  },
  "indices": {
    "my_index": {
      "fields": {
        "firstName": { "attribute": "firstName", "matcher": "fuzzy" },
        "lastName": { "attribute": "lastName", "matcher": "fuzzy" },
        "license.number": { "attribute": "licenseNumber", "matcher": "exact" }
      }
    }
  }
}
Resolve the document using the below model
http://ELKHOST/_zentity/resolution/my_index

{
  "attributes": {
    "lastName": [ "Doe" ],
    "firstName": [ "John" ]
  },
  "scope": {
    "include": {
      "indices": [ "my_index" ],
      "resolvers": [ "name" ]
    }
  }
}

I get the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "validation_exception",
        "reason": "Expected 'string' attribute data type."
      }
    ],
    "type": "validation_exception",
    "reason": "Expected 'string' attribute data type."
  },
  "status": 400
}

But when I bypass zentity and query Elasticsearch directly, it works fine.
http://ELKHOST/my_index/_search

{
  "query": {
    "nested": {
      "path": "license",
      "query": {
        "bool": {
          "must": [
            { "match": { "license.number": "456" } }
          ]
        }
      }
    }
  }
}
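
Two things stand out here. First, the "Expected 'string' attribute data type" error matches the object-array bug reported later in this document, where zentity fails to read attribute values out of arrays of objects in "_source". Second, the working Elasticsearch query above wraps its clause in a nested query, while the model's exact matcher does not, so license.number likely needs a nested-aware matcher. Below is a sketch of such a matcher, modeled on the nested matcher pattern shown further below; the matcher name is illustrative.

"matchers": {
  "exact_nested_license": {
    "clause": {
      "nested": {
        "path": "license",
        "query": {
          "term": {
            "{{ field }}": "{{ value }}"
          }
        }
      }
    }
  }
}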

Optional equals Matchers?

Hi,

In zentity, do we have matchers for optional matching: if a value is available, then match on it; if not available, match without that value?
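
There is no built-in "optional" flag on matchers as far as I know, but zentity only runs the resolvers whose attributes all have input values, so one way to approximate optional matching, worth verifying against the docs, is to define one resolver with the optional attribute and one without it. A minimal sketch, using hypothetical attribute names:

"resolvers": {
  "name_and_phone": {
    "attributes": [ "name", "phone" ]
  },
  "name_only": {
    "attributes": [ "name" ]
  }
}

If a phone value is supplied, both resolvers can run; if it is absent, only name_only runs, which matches without that value.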

Support for Elasticsearch 7.13.x and 7.14.x

Zentity does not currently have a release compatible with the newer versions of Elasticsearch. Will there be a release for zentity that will include support for those versions?

Compound (grouped) attributes

In some use cases "matching" data is spread between several attributes.
For example, let's look at a use case with two indices: people and cars.
The people index contains person_name, DOB, and address fields.
The cars index contains person_name, DOB, and the car's license_plate fields.
We would like to be able to find all people living at a particular address and all cars connected to these people. At first glance, two resolvers should suffice:

  1. address
  2. person_name + DOB

The first resolver will give us all people living at a particular address, and the second resolver will pull out license plates for cars connected to these people.

The problem appears when a person (person-A) who lives at address-A shares the same name with another person (person-B1) and shares the same DOB with yet another person (person-B2), where person-B1 and person-B2 happen to live at the same address-B. If we search for data starting from address-B, we are going to find person-B1 and person-B2 (correctly), but then by combining the name of person-B1 and the DOB of person-B2 we'll find person-A (incorrectly).

To avoid this issue there needs to be a way to mark the person_name and DOB attributes as "compound" or "grouped", so that the name and DOB of a person to look for would always come from the same record (a hypothetical sketch follows).
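
A hypothetical sketch of what such a marker could look like in an entity model. This "grouped" property does not exist in zentity today; it is only meant to illustrate the proposal:

"resolvers": {
  "name_dob": {
    "attributes": [ "person_name", "DOB" ],
    "grouped": true
  }
}

With such a marker, the person_name and DOB values fed into the name_dob resolver would have to originate from the same source document, preventing the person-B1 + person-B2 cross-match described above.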

Error not raised when unrecognized field exists in resolution request

Observed in zentity-1.0.0-elasticsearch-6.2.4 from Issue #7.

Example:

POST _zentity/resolution/test
{
  "attributes": {
    "name": ["Alice Jones"]
  },
  "foo": {
    "bar": "baz"
  }
}

This request should fail because "foo" is an unrecognized field. Currently the request is processed.

Whenever there's an unrecognized field in a request, an error should be raised to prevent any confusion for the client.
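
For illustration, a hypothetical error response for the request above, following the validation_exception shape that zentity returns elsewhere in this document; the exact reason text is made up:

{
  "error": {
    "root_cause": [
      {
        "type": "validation_exception",
        "reason": "'foo' is not a recognized field."
      }
    ],
    "type": "validation_exception",
    "reason": "'foo' is not a recognized field."
  },
  "status": 400
}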

OpenSearch support

Are there any plans on your side to support OpenSearch in addition to Elasticsearch in the future?

OpenSearch was started as a fork of Elasticsearch 7.10 due to the license change made by Elastic. OpenSearch 1.0.0 aims to be fully compatible with Elasticsearch 7.10.2. However, plugins of course need to be adapted to compile against OpenSearch. Based on what I've read and heard (I'm not an ES plugin developer), it's quite simple to update a plugin so that it also compiles against OpenSearch (basically run a search & replace).

OpenSearch also provides a documentation on what needs to be done to upgrade a plugin from Elasticsearch to OpenSearch: https://github.com/opensearch-project/opensearch-plugins/blob/main/UPGRADING.md

We plan to move from Elasticsearch to OpenSearch due to the license and are currently looking into using your plugin for a use case.

Use a multi-node cluster in production mode for integration tests

Issues like #56 have shown that a multi-node cluster in production mode would be better to run integration tests on than a single-node cluster in development mode. The plugin should be tested in an environment that more closely represents the desired state in which it will operate.

  • Migrate to GitHub Actions (Issue: #54) (PR: #69)
  • Modify GitHub Actions workflows to install docker-compose (if needed)
  • Create docker-compose file with a multi-node production-mode cluster for use in testing (PR: #70)
  • Retire Ant in favor of running the test cluster with docker-compose, replacing ./src/test/ant/integration-tests.xml (PR: #70)

Match field which is of type "array of objects"

Hello Dave,
Thanks for developing such a wonderful project. I am trying to match a field of the "array of objects" type but am getting a ValidationException. I might be missing something here; can you please look into it?
index mapping:

"test" : {
  "mappings" : {
    "properties" : {
      "education" : {
        "properties" : {
          "major" : {
            "type" : "text",
            "fields" : {
              "keyword" : { "type" : "keyword", "ignore_above" : 256 }
            }
          },
          "school" : {
            "type" : "text",
            "fields" : {
              "keyword" : { "type" : "keyword", "ignore_above" : 256 }
            }
          }
        }
      },
      "name" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      }
    }
  }
}
sample doc:

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "name" : "John Wick",
    "education" : [
      { "major" : "Master Of Science In Information Management", "school" : "Syracuse University" },
      { "major" : "Certification Of Advanced Study In Data Science" },
      { "major" : "Bachelor Of Technology", "school" : "Charotar University Of Science And Technology" }
    ]
  }
}

zentity model:

PUT _zentity/models/name_education
{
  "attributes" : {
    "name" : { "type": "string" },
    "school" : { "type": "string" }
  },
  "resolvers" : {
    "name_education" : {
      "attributes" : [ "name", "school" ]
    }
  },
  "matchers" : {
    "simple" : {
      "clause" : {
        "match" : { "{{ field }}" : "{{ value }}" }
      }
    },
    "fuzzy" : {
      "clause" : {
        "match" : {
          "{{ field }}" : { "query" : "{{ value }}", "fuzziness" : "1" }
        }
      }
    },
    "exact" : {
      "clause" : {
        "term" : { "{{ field }}" : "{{ value }}" }
      }
    }
  },
  "indices" : {
    "test" : {
      "fields" : {
        "name" : { "attribute" : "name", "matcher" : "simple" },
        "education.school" : { "attribute" : "school", "matcher" : "simple" }
      }
    }
  }
}
resolution request:

POST _zentity/resolution/name_education?pretty&_source=true&_explanation=true&_score=true
{
  "attributes": {
    "school": [ "Syracuse University", "", "Charotar University Of Science And Technology" ],
    "name": [ "John Wick" ]
  }
}

Here I am trying to do a simple match on both the attributes name and education.school. The above request results in the following error (same error if I use the 'fuzzy' matcher):
"error": { "by": "zentity", "type": "io.zentity.model.ValidationException", "reason": "Expected 'string' attribute data type.", "stack_trace": "io.zentity.model.ValidationException: Expected 'string' attribute data type.\n\tat io.zentity.resolution.input.value.StringValue.validate(StringValue.java:35)\n\tat io.zentity.resolution.input.value.Value.<init>(Value.java:18)\n\tat io.zentity.resolution.input.value.StringValue.<init>(StringValue.java:11)\n\tat io.zentity.resolution.input.value.Value.create(Value.java:40)\n\tat io.zentity.resolution.Job.traverse(Job.java:1346)\n\tat io.zentity.resolution.Job.run(Job.java:1539)\n\tat org.elasticsearch.plugin.zentity.ResolutionAction.lambda$prepareRequest$0(ResolutionAction.java:118)\n\tat org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:108)\n\tat org.elasticsearch.xpack.security.rest.SecurityRestFilter.lambda$handleRequest$0(SecurityRestFilter.java:58)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$writeAuthToContext$24(AuthenticationService.java:570)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.writeAuthToContext(AuthenticationService.java:579)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.finishAuthentication(AuthenticationService.java:560)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.consumeUser(AuthenticationService.java:510)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$consumeToken$16(AuthenticationService.java:404)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)\n\tat org.elasticsearch.xpack.core.common.IteratingActionListener.onResponse(IteratingActionListener.java:120)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$consumeToken$13(AuthenticationService.java:374)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.lambda$authenticateWithCache$1(CachingUsernamePasswordRealm.java:145)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.handleCachedAuthentication(CachingUsernamePasswordRealm.java:196)\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.lambda$authenticateWithCache$2(CachingUsernamePasswordRealm.java:137)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225)\n\tat org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106)\n\tat org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68)\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.authenticateWithCache(CachingUsernamePasswordRealm.java:132)\n\tat 
org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.authenticate(CachingUsernamePasswordRealm.java:103)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$consumeToken$15(AuthenticationService.java:365)\n\tat org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:102)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.consumeToken(AuthenticationService.java:408)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$extractToken$11(AuthenticationService.java:335)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.extractToken(AuthenticationService.java:345)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$checkForApiKey$3(AuthenticationService.java:288)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.xpack.security.authc.ApiKeyService.authenticateWithApiKeyIfPresent(ApiKeyService.java:325)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.checkForApiKey(AuthenticationService.java:269)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$0(AuthenticationService.java:252)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.xpack.security.authc.TokenService.getAndValidateToken(TokenService.java:379)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:248)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:306)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:317)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:244)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:196)\n\tat org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:122)\n\tat org.elasticsearch.xpack.security.rest.SecurityRestFilter.handleRequest(SecurityRestFilter.java:55)\n\tat org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:222)\n\tat org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:295)\n\tat org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:166)\n\tat org.elasticsearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:322)\n\tat org.elasticsearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:372)\n\tat org.elasticsearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:301)\n\tat org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:69)\n\tat org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:31)\n\tat io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:58)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)\n\tat io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:326)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478)\n\tat io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227)\n\tat io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274)\n\tat io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:503)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:442)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:281)\n\tat 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:600)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:554)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:830)\n" }

However, if I use 'education.school.keyword' and do an exact match, there's no error and it matches the record:
PUT _zentity/models/name_education
{
  "attributes" : {
    "name" : { "type": "string" },
    "school" : { "type": "string" }
  },
  "resolvers" : {
    "name_education" : {
      "attributes" : [ "name", "school" ]
    }
  },
  "matchers" : {
    "simple" : {
      "clause" : {
        "match" : { "{{ field }}" : "{{ value }}" }
      }
    },
    "fuzzy" : {
      "clause" : {
        "match" : {
          "{{ field }}" : { "query" : "{{ value }}", "fuzziness" : "1" }
        }
      }
    },
    "exact" : {
      "clause" : {
        "term" : { "{{ field }}" : "{{ value }}" }
      }
    }
  },
  "indices" : {
    "test" : {
      "fields" : {
        "name" : { "attribute" : "name", "matcher" : "simple" },
        "education.school.keyword" : { "attribute" : "school", "matcher" : "exact" }
      }
    }
  }
}
response for the same resolution request (used earlier):

"hits" : {
  "total" : 1,
  "hits" : [
    {
      "_index" : "test",
      "_type" : "_doc",
      "_id" : "1",
      "_hop" : 0,
      "_query" : 0,
      "_score" : null,
      "_attributes" : {
        "name" : [ "John Wick" ]
      },
      "_explanation" : {
        "resolvers" : {
          "name_education" : {
            "attributes" : [ "name", "school" ]
          }
        },
        "matches" : [
          {
            "attribute" : "name",
            "target_field" : "name",
            "target_value" : "John Wick",
            "input_value" : "John Wick",
            "input_matcher" : "simple",
            "input_matcher_params" : { },
            "score" : null
          },
          {
            "attribute" : "school",
            "target_field" : "education.school.keyword",
            "target_value" : null,
            "input_value" : "Charotar University Of Science And Technology",
            "input_matcher" : "exact",
            "input_matcher_params" : { },
            "score" : null
          },
          {
            "attribute" : "school",
            "target_field" : "education.school.keyword",
            "target_value" : null,
            "input_value" : "Syracuse University",
            "input_matcher" : "exact",
            "input_matcher_params" : { },
            "score" : null
          }
        ]
      },
      "_source" : {
        "name" : "John Wick",
        "education" : [
          { "major" : "Master Of Science In Information Management", "school" : "Syracuse University" },
          { "major" : "Certification Of Advanced Study In Data Science" },
          { "major" : "Bachelor Of Technology", "school" : "Charotar University Of Science And Technology" }
        ]
      }
    }
  ]
}

Thanks
Abhishek

Allow creation of entity models with empty top-level objects.

Currently, requesting to create entity models with empty top-level objects (example below) will result in a validation error:

{
  "attributes": {},
  "resolvers": {},
  "matchers": {},
  "indices": {}
}

zentity should allow models like these to be created, and instead validate that they are complete before running a resolution job. This will make it possible to build an application that guides users through the process of creating an entity model from scratch and lets them save their progress on incomplete models.
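
For illustration, under this proposal a request like the following would succeed and save the empty model, whereas today it fails validation; completeness would be checked later when a resolution job runs. The model name is illustrative:

PUT _zentity/models/my_draft_model
{
  "attributes": {},
  "resolvers": {},
  "matchers": {},
  "indices": {}
}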

[Bug] zentity fails to obtain attribute values from object arrays during resolution

Environment

  • zentity version: 1.8.0
  • Elasticsearch version: 7.11.1

Describe the bug

During a resolution job, zentity fails to access attributes whose values appear in an array of objects in the "_source" field of the matching documents. This is likely due to the use of JsonPointer to access attributes from documents (see also here), because the JSON Pointer syntax requires the index value for array elements. A potential solution is to replace the use of JsonPointer with JsonPath, which supports a syntax that can return all values within an array.
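
To illustrate the difference with the sample document from the steps below, JSON Pointer can only address one array element at a time, while JSONPath can return every value in the array with a wildcard (expressions are illustrative):

JSON Pointer (one element per expression):
  /phone/0/number   ->  "555-123-4567"
  /phone/1/number   ->  "555-987-6543"

JSONPath (wildcard over the array):
  $.phone[*].number ->  [ "555-123-4567", "555-987-6543" ]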

Related issues: #46, #49

Expected behavior

zentity should assume (like Elasticsearch) that each object in an array of objects has the same schema, and then during a resolution job, zentity should obtain attribute values from arrays of objects just like it obtains attribute values from object values or arrays of values.

Steps to reproduce

Step 1. Create an index with a nested object.

PUT my_index
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text"
      },
      "last_name": {
        "type": "text"
      },
      "phone": {
        "type": "nested",
        "properties": {
          "number": {
            "type": "keyword"
          },
          "type": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Step 2. Index two documents.

POST my_index/_bulk?refresh
{"index":{"_id":1}}
{"first_name":"alice","last_name":"jones","phone":[{"number":"555-123-4567","type":"home"},{"number":"555-987-6543","type":"mobile"}]}
{"index":{"_id":2}}
{"first_name":"allison","last_name":"jones","phone":[{"number":"555-987-6543","type":"mobile"}]}

Step 3. Create an entity model.

PUT _zentity/models/my_entity_model
{
  "attributes": {
    "first_name": {},
    "last_name": {},
    "phone": {}
  },
  "resolvers": {
    "name_phone": {
      "attributes": [
        "last_name",
        "phone"
      ]
    }
  },
  "matchers": {
    "exact": {
      "clause": {
        "term": {
          "{{ field }}": "{{ value }}"
        }
      }
    },
    "exact_phone": {
      "clause": {
        "nested": {
          "path": "phone",
          "query": {
            "term": {
              "{{ field }}": "{{ value }}"
            }
          }
        }
      }
    }
  },
  "indices": {
    "my_index": {
      "fields": {
        "first_name": {
          "attribute": "first_name",
          "matcher": "exact"
        },
        "last_name": {
          "attribute": "last_name",
          "matcher": "exact"
        },
        "phone.number": {
          "attribute": "phone",
          "matcher": "exact_phone"
        }
      }
    }
  }
}

Step 4. Run a resolution job. Expect the first hop to match the given name and phone number (555-123-4567), and expect the second hop to match the new phone number (555-987-6543) from the document in the first hop.

POST _zentity/resolution/my_entity_model?queries
{
  "attributes": {
    "first_name": [ "alice" ],
    "last_name": [ "jones" ],
    "phone": [ "555-123-4567" ]
  }
}

Step 5. The resolution job fails with the following error message:

io.zentity.model.ValidationException: Expected 'string' attribute data type.
	at io.zentity.resolution.input.value.StringValue.validate(StringValue.java:52)
	at io.zentity.resolution.input.value.Value.<init>(Value.java:35)
	at io.zentity.resolution.input.value.StringValue.<init>(StringValue.java:28)
	at io.zentity.resolution.input.value.Value.create(Value.java:57)
	at io.zentity.resolution.Job.onSearchComplete(Job.java:755)
	at io.zentity.resolution.Job.access$000(Job.java:50)
	at io.zentity.resolution.Job$1.onResponse(Job.java:1052)
	at io.zentity.resolution.Job$1.onResponse(Job.java:1045)
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:77)
	at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:253)
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:595)
	at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:109)
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:372)
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:366)
	at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:219)
	at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:101)
	at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:107)
	at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:36)
	at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:84)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:830)

Additional context

The following request shows the query that zentity submits to Elasticsearch in the first hop, and the response that zentity receives from Elasticsearch to process. The error occurs when zentity tries to parse the values of the phone numbers, which are inside of an object array.

Request:

GET my_index/_search
{
  "_source": true,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "last_name": "jones"
          }
        },
        {
          "nested": {
            "path": "phone",
            "query": {
              "term": {
                "phone.number": "555-123-4567"
              }
            }
          }
        }
      ]
    }
  },
  "size": 1000
}

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "first_name" : "alice",
          "last_name" : "jones",
          "phone" : [
            {
              "number" : "555-123-4567",
              "type" : "home"
            },
            {
              "number" : "555-987-6543",
              "type" : "mobile"
            }
          ]
        }
      }
    ]
  }
}

Adding an extra attribute causes resolution to fail

I have a bunch of resolvers:

{
        "resolvers": {
            "name_full_address": {
                "attributes": [
                    "name", "address_line_1", "address_line_2", "city", "state", "postal_code"
                ],
                "weight": 100
            },
            "name_full_address_one_line": {
                "attributes": [
                    "name", "address_line_1", "city", "state", "postal_code"
                ],
                "weight": 100
            },
            "loose_name_exact_address": {
                "attributes": [
                    "loose_name", "address_line_1_exact", "city", "state", "postal_code"
                ],
                "weight": 100
            },
            "name_phone": {
                "attributes": [
                    "name", "phone_number"
                ],
                "weight": 100
            },
            "name_address_postal": {
                "attributes": [
                    "name", "address_line_1", "postal_code"
                ],
                "weight": 100
            },
            "name_address_city_state": {
                "attributes": [
                    "name", "address_line_1", "city", "state"
                ],
                "weight": 100
            },
            "address_phone": {
                "attributes": [
                    "phone_number", "address_line_1", "city", "state", "postal_code"
                ],
                "weight": 100
            },
            "name_city_state_postal": {
                "attributes": [
                    "name", "city", "state", "postal_code"
                ],
                "weight": 150
            }
        }
    }

Now I have an entity with name, city, state, and postal_code set.

name = "Piotr's Restaurant"
state = "NY"
zip = "11217"
city = "New York"
// phone_number and other attributes are null by default

Resolving with just those attributes yields a match, thanks to the name_city_state_postal resolver:

    // Works just fine
    {
        "attributes": {
            "name": [name],
            "city": [city],
            "state": [state],
            "postal_code": [zip]
        }
    }

Now, if I add a phone number to the resolution request, it does not match the entity:

    // Does not work as expected
    {
        "attributes": {
            "name": [name],
            "city": [city],
            "state": [state],
            "postal_code": [zip],
            "phone_number": ["2063108455"]
        }
    }

My question is whether this is the expected behavior. From the docs (https://zentity.io/docs/entity-models/specification/):

"The weight level of the resolver. Resolvers with higher weight levels take precedence over resolvers with lower weight levels. If a resolution job uses resolvers with different weight levels, then the higher weight resolvers either must match or must not exist. This behavior can help prevent false matches."

Meaning the highest-weight resolver, name_city_state_postal (weight 150), should match, and it looks like it does. No other resolver matches (there is no address_line_1 or phone_number linked to the entity). IIUC, this query should also match the defined entity. Am I doing something wrong here?
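
One way to narrow this down, sketched below: rerun the failing request with the queries URL parameter and the _explanation flag, both of which appear elsewhere in this document, to inspect which resolvers zentity ran on each hop and which clauses they generated. The model name is illustrative:

POST _zentity/resolution/my_model?queries&_explanation=true
{
  "attributes": {
    "name": [ "Piotr's Restaurant" ],
    "city": [ "New York" ],
    "state": [ "NY" ],
    "postal_code": [ "11217" ],
    "phone_number": [ "2063108455" ]
  }
}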

Allow attributes to be represented as nested fields

Currently the "_attributes" section of the resolution response is a flat object, where each key is the name of an attribute. Allowing the attributes to be nested will allow users to save results in an index that follows the guidelines and best practices for the Elastic Common Schema (ECS), which encourages nesting by way of prefixes.

If this feature is released at the same time as #73, then it would create one breaking change instead of two.

Proposal

Allow periods (.) to be used in the attribute names of entity models, and use them to nest fields in the "_attributes" section of the resolution response.

Example

Entity model - Attribute names are flat and may contain periods. This example shows attributes which are grouped by prefixes.

{
  "attributes": {
    "name.first": {},
    "name.middle": {},
    "name.last": {},
    "location.address.street": {},
    "location.address.city": {},
    "location.address.state": {},
    "location.address.zip": {}
  }
}

Resolution request - Attribute names are flat and retain their periods. Nesting would not be allowed at this point. Rationale: Attributes may be arrays of values or objects with values and params (source), and allowing nested attributes here would make it difficult to determine whether the nested object was an attribute value or a nested attribute name.

{
  "attributes": {
    "name.first": [ "Alice" ],
    "name.middle": [ "Q" ],
    "name.last": [ "Jones" ]
  }
}

Resolution response - Attribute names are split and nested by their periods.

{
  "_attributes": {
    "name": {
      "first": [ "Alice" ],
      "middle": [ "Quincy" ],
      "last": [ "Jones" ]
    },
    "location": {
      "address": {
        "street": [ "123 Main St" ],
        "city": [ "Washington" ],
        "state": [ "DC" ],
        "zip": [ "20001" ]
      }
    }
  }
}
