airavata-data-catalog's Introduction

Apache Airavata Data Catalog

Getting started

Start the PostgreSQL database in a docker container

docker-compose up

Run the API server

mvn install
cd data-catalog-api/server/service
mvn spring-boot:run

Run the API client

mvn install
cd data-catalog-api/client
mvn exec:java -Dexec.mainClass=org.apache.airavata.datacatalog.api.client.DataCatalogAPIClient

airavata-data-catalog's People

Contributors

isururanawaka, lahirujayathilake, machristie, smarru

airavata-data-catalog's Issues

Investigate Apache Calcite for query rewriting

We want to use https://calcite.apache.org/ to take a high-level SQL query that is written against a metadata schema and translate that into the actual PostgreSQL query.

From the design doc, we want to take something like this

SELECT *
FROM smilesdb
WHERE created_date > '2020-01-01' AND absorb < 300.0
ORDER BY created_date desc
LIMIT 10;

and transform it into this

SELECT
    dp.*
FROM
    data_product dp
    INNER JOIN data_product_metadata_schema dpms ON dp.data_product_id = dpms.data_product_id
    INNER JOIN metadata_schema ms ON ms.metadata_schema_id = dpms.metadata_schema_id
WHERE
    nullif(dp.metadata ->> 'absorb', '') :: float < 300.0
    AND dp.created_date > '2020-01-01'
    AND ms.schema_name = 'smilesdb'
ORDER BY
    dp.created_date DESC
LIMIT
    10;

Note: the query above is missing an authorization where clause, but I think for the initial investigation being able to produce the above query will be sufficient.
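
As a rough illustration of the translation target (not of the Calcite integration itself), the sketch below renders a single numeric metadata field comparison as the JSONB expression used in the target query above. The class name JsonbPredicateSketch, its methods, and the operator allow-list are all hypothetical; only the shape of the nullif(dp.metadata ->> 'absorb', '') :: float expression comes from the design doc query.

```java
import java.util.Set;

public class JsonbPredicateSketch {

    // Comparison operators this sketch accepts (an assumption); anything else
    // is rejected so that arbitrary text never ends up in the generated SQL.
    private static final Set<String> OPERATORS = Set.of("<", "<=", "=", ">=", ">", "<>");

    // Hypothetical helper: renders a reference to a numeric metadata field as
    // the JSONB extraction expression that appears in the target query.
    public static String floatField(String fieldName) {
        // Double any single quotes so the field name stays a valid SQL string literal.
        String escaped = fieldName.replace("'", "''");
        return "nullif(dp.metadata ->> '" + escaped + "', '') :: float";
    }

    // Hypothetical helper: renders "<field expression> <operator> <value>".
    public static String compare(String fieldName, String operator, double value) {
        if (!OPERATORS.contains(operator)) {
            throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
        return floatField(fieldName) + " " + operator + " " + value;
    }

    public static void main(String[] args) {
        // Renders the "absorb < 300.0" condition from the high-level query.
        System.out.println(compare("absorb", "<", 300.0));
    }
}
```

A real implementation would presumably derive the cast (float, timestamp, text) from the registered field type rather than hard-coding float as this sketch does.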

Support querying by parent or child of a data product

The use case is that you want to get all of the children of parents that have certain metadata values. Or you want to get all of the parents of children that have certain metadata values.

For example, getting children when filtering by parents:

select child.*
from my_schema child
inner join my_other_schema parent
on parent.data_product_id = child.parent_data_product_id
where parent.field1 = 'abc'
-- etc.

See also #26 which deals with general JOIN support. Note that for this issue, supporting a JOIN is not strictly necessary. Some investigation is needed to find the best way to satisfy the requirement.

Metadata schema management APIs

Implement API to manage metadata schemas and their fields

  • createMetadataSchema(UserInfo userInfo, String schemaName)
  • createMetadataSchemaField(UserInfo userInfo, String schemaName, MetadataSchemaField metadataSchemaField)
  • updateMetadataSchemaField(UserInfo userInfo, String schemaName, MetadataSchemaField metadataSchemaField)
  • removeMetadataSchemaField(UserInfo userInfo, String schemaName, String metadataSchemaFieldName)
  • createDataProduct() - allows specifying zero or more schemas that metadata conforms to
  • addDataProductToMetadataSchema() - allows updating a data product by adding a metadata schema that the metadata conforms to
    • a boolean flag whether to recurse this operation for COLLECTIONs
  • removeDataProductFromMetadataSchema() reverses the previous operation
    • a boolean flag whether to recurse this operation for COLLECTIONs

MetadataSchemaField

Evaluate the overhead (absolute and relative) of loading JSONB documents with and without a GIN index

This came up in the design review. From Amila:

  1. The performance -- each time we insert/update documents the inverted index needs to be rebuilt. If we are rebuilding the inverted index synchronously, the API call might need to wait longer. Asynchronous updates might be more suitable.

Perhaps we can do asynchronous updates, but first I think it would be good to quantify what overhead there is with updating the GIN index each time a JSONB document (i.e., the metadata column) is inserted into the data_product table.

Investigate supporting an in memory database

The data catalog implementation leans heavily on PostgreSQL features, but perhaps it's possible to provide an alternate implementation for something like the H2 in-memory database. This would make it easier to get started with development on data catalog and could also be used for unit tests.

In general, if the in-memory database supports a JSON column type, the integration should be possible. The search need not be as performant or reliant upon indexes as the PostgreSQL implementation.

Parent/child relationship APIs

API methods to get parent/child of a given data product

  • getChildDataProducts(data_product_id)
    • this should be paginated
  • getParentDataProduct(data_product_id)
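
One possible shape for these two methods is sketched below against an in-memory map. The class name ParentChildSketch, the addChild helper, and the offset and limit parameters are assumptions for illustration; the issue only states that getChildDataProducts should be paginated, not how.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParentChildSketch {

    // Hypothetical in-memory store: child data product id -> parent data product id.
    // LinkedHashMap keeps insertion order so that pages are stable between calls.
    private final Map<String, String> parentByChild = new LinkedHashMap<>();

    // Hypothetical helper used only to populate the sketch.
    public void addChild(String parentId, String childId) {
        parentByChild.put(childId, parentId);
    }

    // Returns the parent id, or null when the data product has no parent.
    public String getParentDataProduct(String dataProductId) {
        return parentByChild.get(dataProductId);
    }

    // Returns one page of child ids; offset and limit are assumed parameters.
    public List<String> getChildDataProducts(String dataProductId, int offset, int limit) {
        List<String> children = new ArrayList<>();
        for (Map.Entry<String, String> entry : parentByChild.entrySet()) {
            if (entry.getValue().equals(dataProductId)) {
                children.add(entry.getKey());
            }
        }
        // Clamp the requested window to the available range.
        int from = Math.min(Math.max(offset, 0), children.size());
        int to = (int) Math.min((long) from + Math.max(limit, 0), children.size());
        return new ArrayList<>(children.subList(from, to));
    }
}
```

In the PostgreSQL-backed service the same window would more likely be expressed with LIMIT and OFFSET (or keyset pagination) in the query rather than by filtering in memory.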

Custos integration

See the SharingManager.java interface that needs to be implemented for Custos integration.

We might also want a separate issue for SCIM integration with Custos.

Tasks

  • Data Catalog Sharing Exception(s) instead of just Exception and get rid of CustosSharingException
  • initialize permissions, entity types
  • sharing integration should create entities if not existing
  • include who is doing the sharing in the API. So you have the sharer and the sharee. (but come up with better names)
  • error: Not a managed type: class org.apache.custos.sharing.core.persistance.model.Entity
  • implement the data product sharing view
  • resolving a user: need to look up the user record by Custos user id, fetching it from Custos if it doesn't exist
  • need to create tenant or super-tenant configuration in config or database
  • can we get a snapshot release of Custos sharing-core-impl pushed to maven central repository?
  • simple sharing implementation or a no-op sharing implementation
    • this can be really really simple. Just a user table, group table, sharing table
  • integrate SharingManager into the DataCatalogAPIService
    • create entity when createDataProduct
    • add sharing management APIs
    • use sharing APIs to check permissions
  • create tenant if not exists
  • add tenant id to interface for public sharing methods? (remove it from the data product model?)
  • How to create a special public group in Custos?
  • Need to resolve Custos group and tenant as well
  • add group membership to data product sharing view. We want a row in the view for every member of a group when an entity is shared with a group.

Support system scope metadata schemas

Some metadata records will adhere to schemas that are the same across all tenants. For example, Airavata will have a metadata schema for experiment instances and that schema would apply to all tenants that have Airavata experiments. So it would be useful to not have to redefine global schemas again for each tenant.

On the query side of things, I think the API will support a query that looks like this:

select * from system.experiment where ...

where experiment is a system scoped metadata schema.
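
To make the naming convention concrete, here is a minimal sketch of how a table reference from such a query might be classified as system scoped or tenant scoped. The class SchemaScopeSketch and its methods are hypothetical, and treating the system. prefix as case-insensitive is an assumption.

```java
import java.util.Locale;

public class SchemaScopeSketch {

    // Prefix that marks a system scoped metadata schema, as in "system.experiment".
    private static final String SYSTEM_PREFIX = "system.";

    // True when the table reference names a system scoped schema.
    public static boolean isSystemScoped(String tableReference) {
        return tableReference.toLowerCase(Locale.ROOT).startsWith(SYSTEM_PREFIX);
    }

    // Strips the "system." prefix, if present, leaving the bare schema name.
    public static String schemaName(String tableReference) {
        return isSystemScoped(tableReference)
                ? tableReference.substring(SYSTEM_PREFIX.length())
                : tableReference;
    }

    public static void main(String[] args) {
        System.out.println(isSystemScoped("system.experiment") + " " + schemaName("system.experiment"));
        System.out.println(isSystemScoped("smilesdb") + " " + schemaName("smilesdb"));
    }
}
```

A tenant scoped lookup would then filter by tenant id, while a system scoped lookup would skip that filter.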

Get a list of values for a metadata field

Implement one or more API methods to get a list of values for a metadata field. For a field with discrete values, the method might just return all known values. For a field with continuous values, however, it should return a set of value ranges (bins) such that roughly the same number of records falls within each range (similar to the Pandas qcut function). The method should take a parameter for how many bins to return for continuous values.

We might want to add a "continuous or discrete" attribute to metadata fields API since this seems like an important thing to describe about a metadata field.

Also, binning into ranges applies only to fields with continuous data types, such as numeric and date-time types.
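
The equal-frequency binning described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the proposed API: the class name QuantileBinsSketch and the binEdges signature are hypothetical, and the edges are computed by linear interpolation between sorted values, which is one common quantile definition.

```java
import java.util.Arrays;

public class QuantileBinsSketch {

    // Returns binCount + 1 edges such that each consecutive pair of edges
    // bounds roughly the same number of values (equal-frequency binning).
    public static double[] binEdges(double[] values, int binCount) {
        if (values.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("Need at least one value and one bin");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);

        double[] edges = new double[binCount + 1];
        for (int i = 0; i <= binCount; i++) {
            // Fractional index of the i-th quantile within the sorted values.
            double position = (double) i / binCount * (sorted.length - 1);
            int lower = (int) Math.floor(position);
            int upper = (int) Math.ceil(position);
            // Interpolate linearly between the two neighbouring values.
            double fraction = position - lower;
            edges[i] = sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
        }
        return edges;
    }

    public static void main(String[] args) {
        double[] absorb = {8, 1, 7, 2, 6, 3, 5, 4};
        System.out.println(Arrays.toString(binEdges(absorb, 4)));
    }
}
```

When many records share the same value, neighbouring edges can coincide, so a real implementation would need to decide whether to merge such bins. In the PostgreSQL implementation the quantiles would more plausibly be computed in the database than by loading all values into memory.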

Support joins in metadata schema SQL query

Some use cases:

  • an inner join between two metadata schemas would return only data products that belong to both schemas, and one could also filter by fields in either or both schemas
  • join to the parent data product and then you can filter by the parent's metadata schema fields but return their children

Some thoughts:

  • would be nice if the client doesn't need to know how to do the join but the MetadataSchemaQueryExecutor handles it. Clients could issue a NATURAL JOIN in the query like so
select * from my_schema NATURAL JOIN other_schema;
  • maybe register a virtual parent_data_product table so clients can issue a query joining to the parent data product, again, without having to know the details of how to do the join:
select * from my_schema NATURAL JOIN parent_data_product;

Support projections in metadata schema SQL query

Currently the SQL query will return all matching DataProduct instances. So, for example, it only ever makes sense to issue a query that begins SELECT * FROM .... But it could be useful to return only a subset of metadata schema fields, both because it would be a smaller payload and because the client wouldn't need to retrieve the metadata schema field values from the 'metadata' field of each DataProduct.

It would be good to develop a compelling use case first though.

Investigate free form queries against the metadata column

The Calcite integration only supports queries against pre-registered metadata schema fields, like so

SELECT
    *
FROM
    my_schema
WHERE
    (
        field1 < 5
        OR field3 = 'bar'
    )
    AND field1 > 0
    AND external_id = 'fff';

But it would be nice if one could query against unregistered metadata schema fields that are known to exist within the metadata JSONB column, something like:

SELECT
    *
FROM
    my_schema
WHERE
    metadata.some_other_field > 0;

There are two challenges. The first is how to relax Calcite's validation to allow referencing fields that aren't known ahead of time. The second is how to support a syntax for referencing a JSON field that Calcite will parse.

One option might be to have the client queries use JSON functions that Calcite supports: https://calcite.apache.org/docs/reference.html#json-functions

For example:

SELECT
    *
FROM
    my_schema
WHERE
    JSON_EXISTS(metadata, '$.some_other_field > 0');

But, PostgreSQL doesn't yet natively support these functions (see https://www.depesz.com/2022/04/01/waiting-for-postgresql-15-sql-json-query-functions/) so they would need to be rewritten.

Investigate json-schema and JSON-LD for specifying the metadata schema

Instead of implementing our own mechanism for specifying a JSON schema, should we just use https://json-schema.org/ ?

The investigation should evaluate the advantages of adopting either json-schema or JSON-LD. What do we get by adopting them over our own API approach? Also, where are they deficient? I imagine we'll still need some metadata schema definition API to specify things like the type of index to create, etc.
