irods_capability_indexing's Introduction

Motivation

The iRODS indexing capability provides a policy framework around both full-text and metadata indexing for the purpose of enhanced data discovery. Logical collections are annotated with metadata indicating that any data objects or nested collections of data objects they contain should be indexed with a particular indexing technology, index type, and index name.

Configuration

Collection Metadata

Collections are annotated with metadata indicating they should be indexed. The metadata is formatted as follows:

irods::indexing::index <index name>::<index type> <technology>

Here <index name> is the name of an index assumed to already exist within the given technology, and <index type> is either full_text, meaning the data object will be read, processed, and then submitted to the index, or metadata, meaning the metadata triples associated with qualifying data objects will be indexed.

The <technology> in the triple names the indexing technology; currently only elasticsearch is supported. This string is used to dynamically build the policy invocations when the indexing policy is triggered, so that the operations are delegated to the appropriate rule engine plugin.

The attribute is configurable within the plugin_specific_configuration for the indexing rule engine plugin.
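
The annotation format can be illustrated with a minimal Python sketch. The helper names (IndexAnnotation, parse_annotation, imeta_command) and the example collection path are hypothetical and not part of the plugin, and the attribute is assumed to be left at the irods::indexing::index value shown above.

```python
from typing import NamedTuple

# Assumption: the attribute name has not been changed in the plugin configuration.
INDEXING_ATTRIBUTE = "irods::indexing::index"
INDEX_TYPES = {"full_text", "metadata"}


class IndexAnnotation(NamedTuple):
    """Hypothetical container for the three parts of the annotation."""
    index_name: str
    index_type: str
    technology: str


def parse_annotation(value: str, technology: str) -> IndexAnnotation:
    """Split '<index name>::<index type>' and pair it with the technology."""
    # rpartition splits on the last '::', so the index type is always the tail.
    index_name, sep, index_type = value.rpartition("::")
    if not sep or not index_name:
        raise ValueError(f"expected '<index name>::<index type>', got {value!r}")
    if index_type not in INDEX_TYPES:
        raise ValueError(f"unknown index type {index_type!r}")
    return IndexAnnotation(index_name, index_type, technology)


def imeta_command(collection: str, ann: IndexAnnotation) -> list[str]:
    """Build the imeta argument list that would attach the annotation."""
    value = f"{ann.index_name}::{ann.index_type}"
    return ["imeta", "add", "-C", collection, INDEXING_ATTRIBUTE, value, ann.technology]


if __name__ == "__main__":
    ann = parse_annotation("full_text_index::full_text", "elasticsearch")
    # The collection path below is a made-up example.
    print(" ".join(imeta_command("/tempZone/home/rods/docs", ann)))
```

The sketch only builds the command; running it against a zone is left to the administrator.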

Resource Metadata

An administrator may wish to restrict indexing activities to particular resources, for example when automatically ingesting data. Should a storage resource be at the edge, that resource may not be appropriate for indexing. In order to indicate a resource is available for indexing it may be annotated with metadata:

imeta add -R <resource name> irods::indexing::index true

By default, if no resource is tagged, all resources are assumed to be available for indexing. If the tag exists on any resource in the system, only the tagged resources are considered available for indexing.
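
The tagging rule can be expressed as a short Python sketch. The function name and the input mapping are hypothetical; the mapping is assumed to record, for each resource, whether it carries the irods::indexing::index true annotation.

```python
def indexable_resources(resource_tags: dict[str, bool]) -> set[str]:
    """Return the resources considered available for indexing.

    resource_tags maps a resource name to True when the resource is
    annotated with 'irods::indexing::index true' (hypothetical input).
    """
    tagged = {name for name, is_tagged in resource_tags.items() if is_tagged}
    # No tags anywhere: every resource is available.
    # At least one tag: only the tagged resources are available.
    return tagged if tagged else set(resource_tags)


if __name__ == "__main__":
    print(indexable_resources({"demoResc": False, "edgeResc": False}))
    print(indexable_resources({"demoResc": True, "edgeResc": False}))
```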

Plugin Settings

There are currently three rule engine plugins to configure for the indexing capability which should be added to the "rule_engines" section of /etc/irods/server_config.json:

    "rule_engines": [
            {
                "instance_name": "irods_rule_engine_plugin-indexing-instance",
                "plugin_name": "irods_rule_engine_plugin-indexing",
                "plugin_specific_configuration": {
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
                "plugin_name": "irods_rule_engine_plugin-elasticsearch",
                "plugin_specific_configuration": {
                    "hosts" : ["http://localhost:9200/"],
                    "bulk_count" : 100,
                    "read_size" : 4194304
                }
            },
            {
                "instance_name": "irods_rule_engine_plugin-document_type-instance",
                "plugin_name": "irods_rule_engine_plugin-document_type",
                "plugin_specific_configuration": {
                }
            }
        ]

The first is the main indexing rule engine plugin, the second is the plugin responsible for implementing the policy for the indexing technology, and the third is responsible for implementing the document type introspection. Currently the default simply returns text as the document type. This policy can be overridden to call out to services like Tika for better introspection of the data.

Within each plugin configuration stanza, the "plugin_specific_configuration" object may contain a number of key-value pairs. The following pairs are currently applicable for the purpose of setting the indexing capability's operating parameters:

| key | type | plugin component | purpose |
| --- | --- | --- | --- |
| minimum_delay_time | string | indexing | lower limit for randomly generated delay-task intervals |
| maximum_delay_time | string | indexing | upper limit for randomly generated delay-task intervals |
| es_version | string | elasticsearch | set to "6.x" or "7.x" depending on Elasticsearch version (ES 7 default) |
| job_limit_per_collection_indexing_operation | string | indexing | integer limit to number of concurrent collection operations ("" means no limit) |
| bulk_count | string | elasticsearch | the number of text chunks processed at once for ES full-text indexing |
| read_size | int | elasticsearch | the size of individual text chunks processed for ES full-text indexing |
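
As a rough illustration of where these pairs live, the following Python sketch looks up the "plugin_specific_configuration" object for a given plugin. The plugin_settings helper is hypothetical, and the sketch works on an inline copy of the "rule_engines" list rather than on the real server_config.json, whose surrounding structure is not shown here.

```python
import json

# Inline copy of part of the "rule_engines" list, for illustration only.
RULE_ENGINES_JSON = """
[
    {
        "instance_name": "irods_rule_engine_plugin-indexing-instance",
        "plugin_name": "irods_rule_engine_plugin-indexing",
        "plugin_specific_configuration": {}
    },
    {
        "instance_name": "irods_rule_engine_plugin-elasticsearch-instance",
        "plugin_name": "irods_rule_engine_plugin-elasticsearch",
        "plugin_specific_configuration": {
            "hosts": ["http://localhost:9200/"],
            "bulk_count": 100,
            "read_size": 4194304
        }
    }
]
"""


def plugin_settings(rule_engines: list, plugin_name: str) -> dict:
    """Return the plugin_specific_configuration of the named plugin (hypothetical helper)."""
    for stanza in rule_engines:
        if stanza.get("plugin_name") == plugin_name:
            return stanza.get("plugin_specific_configuration", {})
    raise KeyError(plugin_name)


# json.loads rejects trailing commas, so a stray comma in a stanza fails here.
rule_engines = json.loads(RULE_ENGINES_JSON)
es_settings = plugin_settings(rule_engines, "irods_rule_engine_plugin-elasticsearch")

if __name__ == "__main__":
    for key, value in es_settings.items():
        print(key, "=", value)
```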

Policy Implementation

Policy names are dynamically crafted by the indexing plugin in order to invoke a particular technology. The four policies an indexing technology must implement are crafted from base strings suffixed with the name of the technology, as indicated by the collection metadata annotation.

Indexing Technology Policies

irods_policy_indexing_object_index_<technology>
irods_policy_indexing_object_purge_<technology>
irods_policy_indexing_metadata_index_<technology>
irods_policy_indexing_metadata_purge_<technology>

Document Type Policy

irods_policy_indexing_document_type_<technology>
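
The naming scheme above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch only mirrors the string pattern listed here; it does not show how the plugin itself builds or invokes the policies.

```python
# Base strings taken from the policy names listed above.
TECHNOLOGY_POLICY_BASES = (
    "irods_policy_indexing_object_index",
    "irods_policy_indexing_object_purge",
    "irods_policy_indexing_metadata_index",
    "irods_policy_indexing_metadata_purge",
)
DOCUMENT_TYPE_POLICY_BASE = "irods_policy_indexing_document_type"


def technology_policy_names(technology: str) -> list[str]:
    """Return the four policy names an indexing technology must implement."""
    return [f"{base}_{technology}" for base in TECHNOLOGY_POLICY_BASES]


def document_type_policy_name(technology: str) -> str:
    """Return the document type policy name for the technology."""
    return f"{DOCUMENT_TYPE_POLICY_BASE}_{technology}"


if __name__ == "__main__":
    for name in technology_policy_names("elasticsearch"):
        print(name)
    print(document_type_policy_name("elasticsearch"))
```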

Contributors

d-w-moore, swooshycueb, trel, alanking, jassigill2000
