Motivation

The iRODS indexing capability provides a policy framework around both full-text and metadata indexing for the purpose of enhanced data discovery. Logical collections are annotated with metadata indicating that any data objects or nested collections of data objects should be indexed with a particular indexing technology, index type, and index name.

Configuration

Collection Metadata

Collections are annotated with metadata indicating that they should be indexed. The metadata is formatted as follows:

irods::indexing::index <index name>::<index type> <technology>

Where <index name> is an index assumed to already exist within the given technology, and <index type> is either full_text, meaning the data object will be read, processed, and then submitted to the index, or metadata, meaning the metadata triples associated with qualifying data objects will be indexed.

The <technology> in the triple names the indexing technology; currently only elasticsearch is supported. This string is used to dynamically build the policy invocations when the indexing policy is triggered, in order to delegate the operations to the appropriate rule engine plugin.

The attribute name, irods::indexing::index, is configurable within the plugin_specific_configuration of the indexing rule engine plugin.
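As an illustration of the triple format described above, the following Python sketch builds and parses the annotation and assembles the corresponding imeta argument list. The function names and the example collection path are hypothetical and not part of the plugin; the sketch only formats strings and does not contact an iRODS server.

```python
# Attribute name as shown in the text above; the plugin allows it to be reconfigured.
DEFAULT_ATTRIBUTE = "irods::indexing::index"
INDEX_TYPES = ("full_text", "metadata")


def indexing_avu(index_name, index_type, technology="elasticsearch",
                 attribute=DEFAULT_ATTRIBUTE):
    """Return the (attribute, value, unit) triple for a collection annotation."""
    if index_type not in INDEX_TYPES:
        raise ValueError(f"index type must be one of {INDEX_TYPES}, got {index_type!r}")
    return (attribute, f"{index_name}::{index_type}", technology)


def parse_indexing_value(value):
    """Split '<index name>::<index type>' back into its two parts."""
    index_name, sep, index_type = value.rpartition("::")
    if not sep or not index_name or index_type not in INDEX_TYPES:
        raise ValueError(f"malformed indexing value: {value!r}")
    return index_name, index_type


def imeta_add_collection_args(collection, avu):
    """Argument list for annotating a collection with imeta (built, not executed)."""
    return ["imeta", "add", "-C", collection, *avu]
```

For example, imeta_add_collection_args("/tempZone/home/rods/coll", indexing_avu("full_text_index", "full_text")) yields the arguments for tagging a hypothetical collection for full-text indexing into an index named full_text_index.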

Resource Metadata

An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data. Should a storage resource be at the edge, that resource may not be appropriate for indexing. In order to indicate a resource is available for indexing it may be annotated with metadata:

imeta add -R <resource name> irods::indexing::index true

By default, if no resource is tagged, all resources are assumed to be available for indexing. If the tag exists on any resource in the system, only the tagged resources are considered available for indexing.
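The selection rule can be sketched in Python as follows. This is a minimal illustration, not the plugin's implementation: the dictionary shape used for resource metadata is hypothetical, and the sketch assumes that only a value of true counts as a tag.

```python
RESOURCE_ATTRIBUTE = "irods::indexing::index"


def resources_available_for_indexing(resource_metadata):
    """Apply the tagging rule to a mapping of resource name -> {attribute: value}.

    The mapping shape is a hypothetical stand-in for a catalog query.
    """
    tagged = [name for name, avus in resource_metadata.items()
              if avus.get(RESOURCE_ATTRIBUTE) == "true"]
    # No tag anywhere: every resource is considered available.
    # At least one tag: only the tagged resources are available.
    return tagged if tagged else list(resource_metadata)
```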

Plugin Settings

There are currently three rule engine plugins to configure for the indexing capability which should be added to the "rule_engines" section of /etc/irods/server_config.json:

    "rule_engines": [
            {
                "instance_name": "irods_rule_engine_plugin-indexing-instance",
                "plugin_name": "irods_rule_engine_plugin-indexing",
                "plugin_specific_configuration": {
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
                "plugin_name": "irods_rule_engine_plugin-elasticsearch",
                "plugin_specific_configuration": {
                    "hosts" : ["http://localhost:9200/"],
                    "bulk_count" : 100,
                    "read_size" : 4194304
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-document_type-instance",
                "plugin_name": "irods_rule_engine_plugin-document_type",
                "plugin_specific_configuration": {
                }
            }
        ]

The first is the main indexing rule engine plugin, the second is the plugin responsible for implementing the policy for the indexing technology, and the third is responsible for implementing the document type introspection. Currently the default simply returns text as the document type. This policy can be overridden to call out to services like Tika for better introspection of the data.
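A quick way to confirm that all three plugins are configured is to load the server configuration and compare plugin names. The Python sketch below assumes the file is valid JSON at the path given above; the helper names are hypothetical.

```python
import json

REQUIRED_PLUGINS = (
    "irods_rule_engine_plugin-indexing",
    "irods_rule_engine_plugin-elasticsearch",
    "irods_rule_engine_plugin-document_type",
)


def missing_indexing_plugins(server_config):
    """Return the required plugin names absent from server_config['rule_engines']."""
    configured = {entry.get("plugin_name")
                  for entry in server_config.get("rule_engines", [])}
    return [name for name in REQUIRED_PLUGINS if name not in configured]


def load_server_config(path="/etc/irods/server_config.json"):
    """Read and parse the server configuration file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling missing_indexing_plugins(load_server_config()) on a server returns an empty list when all three stanzas are present.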

Within each plugin configuration stanza, the "plugin_specific_configuration" object may contain a number of key-value pairs. The following pairs are currently applicable for the purpose of setting the indexing capability's operating parameters:

key	type	plugin component	purpose
minimum_delay_time	string	indexing	lower limit for randomly generated delay-task intervals
maximum_delay_time	string	indexing	upper limit for randomly generated delay-task intervals
es_version	string	elasticsearch	set to "6.x" or "7.x" depending on Elasticsearch version (ES 7 default)
job_limit_per_collection_indexing_operation	string	indexing	integer limit to number of concurrent collection operations ("" means no limit)
bulk_count	string	elasticsearch	the number of text chunks processed at once for ES full-text indexing
read_size	int	elasticsearch	the size of individual text chunks processed for ES full-text indexing
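To make the string-typed settings concrete, the following Python sketch interprets three of them. It rests on several assumptions that the table does not state: the delay values are integer strings, the random interval is drawn uniformly, and the fallback defaults (1 and 30) are placeholders chosen purely for illustration.

```python
import random


def pick_delay(plugin_config, default_min=1, default_max=30, rng=random):
    """Draw a delay between minimum_delay_time and maximum_delay_time (inclusive).

    default_min and default_max are hypothetical placeholders, not plugin defaults.
    """
    minimum = int(plugin_config.get("minimum_delay_time", default_min))
    maximum = int(plugin_config.get("maximum_delay_time", default_max))
    if minimum > maximum:
        raise ValueError("minimum_delay_time exceeds maximum_delay_time")
    return rng.randint(minimum, maximum)


def job_limit(plugin_config):
    """Return the concurrent-job limit as an int, or None when "" means no limit."""
    raw = plugin_config.get("job_limit_per_collection_indexing_operation", "")
    return int(raw) if raw != "" else None
```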

Policy Implementation

Policy names are dynamically crafted by the indexing plugin in order to invoke a particular technology. The four policies an indexing technology must implement are crafted from base strings suffixed with the name of the technology, as indicated by the collection metadata annotation.

Indexing Technology Policies

irods_policy_indexing_object_index_<technology>
irods_policy_indexing_object_purge_<technology>
irods_policy_indexing_metadata_index_<technology>
irods_policy_indexing_metadata_purge_<technology>

Document Type Policy

irods_policy_indexing_document_type_<technology>
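The naming scheme can be illustrated with a short Python sketch that joins each base string listed above to the technology name. The helper names are hypothetical; only the base strings come from this document.

```python
TECHNOLOGY_POLICY_BASES = (
    "irods_policy_indexing_object_index",
    "irods_policy_indexing_object_purge",
    "irods_policy_indexing_metadata_index",
    "irods_policy_indexing_metadata_purge",
)
DOCUMENT_TYPE_POLICY_BASE = "irods_policy_indexing_document_type"


def policy_name(base, technology):
    """Join a base string and the technology taken from the collection annotation."""
    return f"{base}_{technology}"


def required_policies(technology):
    """The four policy names an indexing technology must implement."""
    return [policy_name(base, technology) for base in TECHNOLOGY_POLICY_BASES]
```

For the elasticsearch technology, required_policies("elasticsearch") starts with irods_policy_indexing_object_index_elasticsearch.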
