apache / airavata-data-catalog

Apache Airavata Data Catalog

Home Page: https://airavata.apache.org

License: Apache License 2.0

Java 100.00%

airavata apache data-catalog metadata schema search

airavata-data-catalog's Introduction

Apache Airavata Data Catalog

Getting started

Start the PostgreSQL database in a docker container

docker-compose up

Run the API server

mvn install
cd data-catalog-api/server/service
mvn spring-boot:run

Run the API client

mvn install
cd data-catalog-api/client
mvn exec:java -Dexec.mainClass=org.apache.airavata.datacatalog.api.client.DataCatalogAPIClient

airavata-data-catalog's Issues

Investigate Apache Calcite for query rewriting

We want to use https://calcite.apache.org/ to take a high-level SQL query that is written against a metadata schema and translate that into the actual PostgreSQL query.

From the design doc, we want to take something like this

SELECT *
FROM smilesdb
WHERE created_date > '2020-01-01' AND absorb < 300.0
ORDER BY created_date desc
LIMIT 10;

and transform it into this

SELECT
    dp.*
FROM
    data_product dp
    INNER JOIN data_product_metadata_schema dpms ON dp.data_product_id = dpms.data_product_id
    INNER JOIN metadata_schema ms ON ms.metadata_schema_id = dpms.metadata_schema_id
WHERE
    nullif(dp.metadata ->> 'absorb', '') :: float < 300.0
    AND dp.created_date > '2020-01-01'
    AND ms.schema_name = 'smilesdb'
ORDER BY
    dp.created_date DESC
LIMIT
    10;

Note: the query above is missing an authorization where clause, but I think for the initial investigation being able to produce the above query will be sufficient.
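As a rough illustration of the per-field part of this rewrite, the sketch below maps a logical field name to the PostgreSQL expression used in the target query above. It does not use Calcite; the class, the enum, and the set of native columns are hypothetical names made up for this example, and it assumes registered field names have already been validated.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: rewrites a logical field name into a PostgreSQL expression. */
final class FieldExpressionSketch {

    /** Field types used in this sketch only. */
    enum FieldType { FLOAT, STRING }

    /** Assumption: these fields are real columns on data_product, not JSON metadata. */
    private static final Set<String> NATIVE_COLUMNS = Set.of("created_date");

    static String toPostgresExpression(String fieldName, Map<String, FieldType> registeredFields) {
        if (NATIVE_COLUMNS.contains(fieldName)) {
            return "dp." + fieldName;
        }
        FieldType type = registeredFields.get(fieldName);
        if (type == null) {
            // Only pre-registered names are ever spliced into the SQL text.
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        String asText = "dp.metadata ->> '" + fieldName + "'";
        // Empty strings become NULL first so that the float cast does not fail on them.
        return type == FieldType.FLOAT ? "nullif(" + asText + ", '') :: float" : asText;
    }

    public static void main(String[] args) {
        Map<String, FieldType> fields = Map.of("absorb", FieldType.FLOAT);
        System.out.println(toPostgresExpression("absorb", fields) + " < 300.0");
        System.out.println(toPostgresExpression("created_date", fields) + " > '2020-01-01'");
    }
}
```

The joins to data_product_metadata_schema and metadata_schema, and the schema_name filter, are left out of this sketch.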

CLI client interface for data catalog

Support order by, pagination in SQL query

Support querying by parent or child of a data product

The use case is that you want to get all of the children of parents that have certain metadata values. Or you want to get all of the parents of children that have certain metadata values.

For example, getting children when filtering by parents:

select child.*
from my_schema child
inner join my_other_schema parent
on parent.data_product_id = child.parent_data_product_id
where parent.field1 = 'abc'
-- etc.

See also #26 which deals with general JOIN support. Note that for this issue, supporting a JOIN is not strictly necessary. Some investigation is needed to find the best way to satisfy the requirement.

Metadata schema management APIs

Implement API to manage metadata schemas and their fields

createMetadataSchema(UserInfo userInfo, String schemaName)
createMetadataSchemaField(UserInfo userInfo, String schemaName, MetadataSchemaField metadataSchemaField)
updateMetadataSchemaField(UserInfo userInfo, String schemaName, MetadataSchemaField metadataSchemaField)
removeMetadataSchemaField(UserInfo userInfo, String schemaName, String metadataSchemaFieldName)
createDataProduct() - allows specifying zero or more schemas that metadata conforms to
addDataProductToMetadataSchema() - allows updating a data product by adding a metadata schema that the metadata conforms to
- a boolean flag whether to recurse this operation for COLLECTIONs
removeDataProductFromMetadataSchema() reverses the previous operation
- a boolean flag whether to recurse this operation for COLLECTIONs

MetadataSchemaField

String fieldId
String fieldName
String jsonPath
- see jsonpath specification: https://goessner.net/articles/JsonPath/
  and https://www.postgresql.org/docs/current/functions-json.html#FUNCTIONS-SQLJSON-PATH
enum dataType (INTEGER, FLOAT, STRING, DATESTRING, BOOLEAN, INTEGER[], FLOAT[], STRING[], DATESTRING[], BOOLEAN[])
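The field description above could be modeled roughly as follows. This is only a sketch, not the project's actual class: the class name is hypothetical, and because `INTEGER[]` is not a valid Java enum constant, the array types are spelled with an `_ARRAY` suffix as an assumption.

```java
import java.util.Objects;

/** Hypothetical sketch of the MetadataSchemaField model described above. */
final class MetadataSchemaFieldSketch {

    /** Array variants use an _ARRAY suffix because "INTEGER[]" is not a legal identifier. */
    enum DataType {
        INTEGER, FLOAT, STRING, DATESTRING, BOOLEAN,
        INTEGER_ARRAY, FLOAT_ARRAY, STRING_ARRAY, DATESTRING_ARRAY, BOOLEAN_ARRAY;

        boolean isArray() {
            return name().endsWith("_ARRAY");
        }
    }

    final String fieldId;
    final String fieldName;
    final String jsonPath; // e.g. "$.absorb", locating the value inside the metadata document
    final DataType dataType;

    MetadataSchemaFieldSketch(String fieldId, String fieldName, String jsonPath, DataType dataType) {
        this.fieldId = Objects.requireNonNull(fieldId, "fieldId");
        this.fieldName = Objects.requireNonNull(fieldName, "fieldName");
        this.jsonPath = Objects.requireNonNull(jsonPath, "jsonPath");
        this.dataType = Objects.requireNonNull(dataType, "dataType");
    }

    public static void main(String[] args) {
        MetadataSchemaFieldSketch field =
                new MetadataSchemaFieldSketch("f-1", "absorb", "$.absorb", DataType.FLOAT);
        System.out.println(field.fieldName + " -> " + field.jsonPath + " (" + field.dataType + ")");
    }
}
```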

Support group sharing in Custos Sharing Manager

Handle resolving Custos group
Add group membership to the data product sharing view. We want a row in the view for every member of a group when an entity is shared with a group.

Include Tenant in the database model

Create a TenantEntity to model tenants and include a FK to tenants in UserEntity. MetadataSchemaEntity will need the same FK, and MetadataSchemaEntity.name will need to be unique for a given tenant_id. See https://gist.github.com/machristie/0c7e28f11347d735f5517c3b5bf14d57 for schema design.

Evaluate the overhead (absolute and relative) of loading JSONB documents with and without a GIN index

This came up in the design review. From Amila:

The performance -- each time we insert/update documents the inverted index needs to be rebuilt. If we are rebuilding the inverted index synchronously, the API call might need to wait longer. Asynchronous updates might be more suitable.

Perhaps we can do asynchronous updates, but first I think it would be good to quantify what overhead there is with updating the GIN index each time a JSONB document (i.e., the metadata column) is inserted into the data_product table.

Investigate supporting an in memory database

The data catalog implementation leans heavily on PostgreSQL features, but perhaps it's possible to provide an alternative implementation for something like the H2 in-memory database. This would make it easier to get started with development on the data catalog and could also be used for unit tests.

In general, if the in memory database supports a JSON column type, the integration should be possible. The search need not be as performant or reliant upon indexes as the PostgreSQL implementation.

GitHub Action for running a build on commit

Parent/child relationship APIs

API methods to get parent/child of a given data product

getChildDataProducts(data_product_id)
- this should be paginated
getParentDataProduct(data_product_id)
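One possible shape for the paginated call is an offset/limit window over the children. The sketch below applies that window to an in-memory list; the class and method names are hypothetical, and a real implementation would presumably push the offset and limit down into the database query instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of offset/limit pagination for getChildDataProducts. */
final class ChildPageSketch {

    /** Returns at most {@code limit} items starting at {@code offset}; empty when past the end. */
    static <T> List<T> page(List<T> allChildren, int offset, int limit) {
        if (offset < 0 || limit < 1) {
            throw new IllegalArgumentException("offset must be >= 0 and limit must be >= 1");
        }
        int from = Math.min(offset, allChildren.size());
        // Use long arithmetic so that a very large limit cannot overflow.
        int to = (int) Math.min((long) from + limit, allChildren.size());
        return new ArrayList<>(allChildren.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> childIds = List.of("c1", "c2", "c3", "c4", "c5");
        System.out.println(page(childIds, 0, 2));
        System.out.println(page(childIds, 4, 2));
    }
}
```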

Create test cases for PostgresqlMetadataSchemaQueryWriterImpl

Test different types of metadata schema fields (INTEGER, STRING, etc.). Test different kinds of conditions on those filters (<=, >=, !=, etc.). See DataCatalogAPIClient.java for some tests that have already been implemented.

Custos integration

See the SharingManager.java interface that needs to be implemented for Custos integration.

We might also want a separate issue for SCIM integration with Custos.

Support system scope metadata schemas

Some metadata records will adhere to schemas that are the same across all tenants. For example, Airavata will have a metadata schema for experiment instances and that schema would apply to all tenants that have Airavata experiments. So it would be useful to not have to redefine global schemas again for each tenant.

On the query side of things, I think the API will support a query that looks like this:

select * from system.experiment where ...

where experiment is a system scoped metadata schema.

Apply sharing APIs to metadata schemas

#12 applies the sharing APIs to data products. We could do the same for metadata schemas to allow users to collaborate on metadata schemas.

API methods to get parent/child data products

getChildDataProducts() for a given data product id
getParentDataProduct() for a given data product id

Get a list of values for a metadata field

Implement one or more API methods to get a list of values for a metadata field. For a field with discrete values, this API method might just return all known values. However, for a field with continuous values, the API method should return a set of ranges of values (bins) where the number of records that fall within each range is roughly equal (similar to the pandas qcut function). The API method should provide a parameter for how many bins to return for continuous values.

We might want to add a "continuous or discrete" attribute to metadata fields API since this seems like an important thing to describe about a metadata field.

Also, binning into ranges only applies to continuous data types, such as numeric and date-time types.
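To make the equal-frequency binning concrete, here is a minimal in-memory sketch that computes bin edges at evenly spaced quantiles, using linear interpolation between neighbouring sorted values. The class and method names are hypothetical, and an actual implementation would more likely compute the quantiles in the database rather than load all values into memory.

```java
import java.util.Arrays;

/** Hypothetical sketch: equal-frequency (quantile) bin edges for a continuous field. */
final class QuantileBinsSketch {

    /** Returns bins + 1 edges; consecutive edges delimit one bin each. */
    static double[] binEdges(double[] values, int bins) {
        if (values.length == 0 || bins < 1) {
            throw new IllegalArgumentException("need at least one value and one bin");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] edges = new double[bins + 1];
        for (int i = 0; i <= bins; i++) {
            // Fractional position of the i-th quantile within the sorted values.
            double position = (double) i * (sorted.length - 1) / bins;
            int lower = (int) Math.floor(position);
            int upper = (int) Math.ceil(position);
            double fraction = position - lower;
            // Interpolate linearly between the two neighbouring values.
            edges[i] = sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
        }
        return edges;
    }

    public static void main(String[] args) {
        double[] absorb = {8, 3, 1, 6, 2, 7, 4, 5};
        System.out.println(Arrays.toString(binEdges(absorb, 4)));
    }
}
```

When many records share the same value, neighbouring edges can coincide, so a caller may want to merge such empty-width bins.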

Implement public sharing with Custos SharingManager

create a special "public" group for new tenants

Data Product CRUD API

Implement API methods to create, read, update, and delete Data Product records.

Support joins in metadata schema SQL query

Some use cases:

an inner join between two metadata schemas would return only data products that belong to both schemas, and one could also filter by fields in either or both schemas
join to the parent data product and then you can filter by the parent's metadata schema fields but return their children

Some thoughts:

would be nice if the client doesn't need to know how to do the join but the MetadataSchemaQueryExecutor handles it. Clients could issue a NATURAL JOIN in the query like so

select * from my_schema NATURAL JOIN other_schema;

maybe register a virtual parent_data_product table so clients can issue a query joining to the parent data product, again, without having to know the details of how to do the join:

select * from my_schema NATURAL JOIN parent_data_product;

Integrate liquibase and database initialization scripts

Support projections in metadata schema SQL query

Currently the SQL query will return all matching DataProduct instances. So, for example, it only ever makes sense to issue a query that begins SELECT * FROM .... But it could be useful to return only a subset of metadata schema fields, both because it would be a smaller payload and because the client wouldn't need to retrieve the metadata schema field values from the 'metadata' field of each DataProduct.

It would be good to develop a compelling use case first though.

Support aliases for schema and field names

To allow schema and field names to evolve without breaking API level SQL queries, the API should support adding aliases for schema and field names.
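A minimal sketch of how alias resolution might work: each alias maps to one canonical name, and names without an alias resolve to themselves. The class and method names here are hypothetical and not part of the existing API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: resolves schema or field aliases to canonical names. */
final class AliasResolverSketch {

    private final Map<String, String> aliasToCanonical = new HashMap<>();

    /** Registers an alias; an alias may point to only one canonical name. */
    void addAlias(String alias, String canonicalName) {
        String existing = aliasToCanonical.putIfAbsent(alias, canonicalName);
        if (existing != null && !existing.equals(canonicalName)) {
            throw new IllegalStateException("Alias already maps to " + existing);
        }
    }

    /** Returns the canonical name for an alias, or the name itself if it is not an alias. */
    String resolve(String name) {
        return aliasToCanonical.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        AliasResolverSketch resolver = new AliasResolverSketch();
        resolver.addAlias("absorption", "absorb"); // old query name kept after a rename
        System.out.println(resolver.resolve("absorption"));
        System.out.println(resolver.resolve("created_date"));
    }
}
```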

Support creating indexes on metadata schema fields

Automatically create a B-Tree database index on a metadata schema field and make sure it is used in queries that check for a range of values.

Use parameter binding for the metadata search query

The MetadataSchemaQueryWriter should return not just Sql but a Collection of parameters to be bound to the query. This is beneficial for both security and performance.
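As a sketch of what returning SQL plus parameters could look like, the builder below emits `?` placeholders and collects the values separately, so they can later be bound with `PreparedStatement.setObject`. The class name, the operator whitelist, and the fixed SELECT prefix are assumptions for illustration; the real MetadataSchemaQueryWriter interface may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: builds a WHERE clause with placeholders and collects the bound values. */
final class BoundQuerySketch {

    private static final Set<String> OPERATORS = Set.of("<", "<=", ">", ">=", "=", "!=");

    private final StringBuilder where = new StringBuilder();
    private final List<Object> parameters = new ArrayList<>();

    /** Adds "expression operator ?" and remembers the value for later binding. */
    BoundQuerySketch and(String expression, String operator, Object value) {
        if (!OPERATORS.contains(operator)) {
            throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
        if (where.length() > 0) {
            where.append(" AND ");
        }
        where.append(expression).append(' ').append(operator).append(" ?");
        parameters.add(value);
        return this;
    }

    String sql() {
        String base = "SELECT dp.* FROM data_product dp";
        return where.length() == 0 ? base : base + " WHERE " + where;
    }

    /** Values in placeholder order. */
    List<Object> parameters() {
        return Collections.unmodifiableList(parameters);
    }

    public static void main(String[] args) {
        BoundQuerySketch query = new BoundQuerySketch()
                .and("dp.created_date", ">", "2020-01-01")
                .and("nullif(dp.metadata ->> 'absorb', '') :: float", "<", 300.0);
        System.out.println(query.sql());
        System.out.println(query.parameters());
    }
}
```

Only the values are parameterized here; the column expressions are still SQL text and would have to come from registered field definitions, never from user input.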

Add additional fields to the DataProduct model

Review the design doc and add any missing fields

Investigate free form queries against the metadata column

The Calcite integration only supports queries against pre-registered metadata schema fields, like so

SELECT
    *
FROM
    my_schema
WHERE
    (
        field1 < 5
        OR field3 = 'bar'
    )
    AND field1 > 0
    AND external_id = 'fff';

But it would be nice if one could query against unregistered metadata schema fields that are known to exist within the metadata JSONB column, something like:

SELECT
    *
FROM
    my_schema
WHERE
    metadata.some_other_field > 0;

There are two challenges. The first is how to relax Calcite's validation to allow referencing fields that aren't known ahead of time. The second is how to support a syntax for referencing a JSON field that Calcite will parse.

One option might be to have the client queries use JSON functions that Calcite supports: https://calcite.apache.org/docs/reference.html#json-functions

For example:

SELECT
    *
FROM
    my_schema
WHERE
    JSON_EXISTS(metadata, '$.some_other_field > 0');

But, PostgreSQL doesn't yet natively support these functions (see https://www.depesz.com/2022/04/01/waiting-for-postgresql-15-sql-json-query-functions/) so they would need to be rewritten.

Investigate json-schema and JSON-LD for specifying the metadata schema

Instead of implementing our own mechanism for specifying a JSON schema, should we just use https://json-schema.org/ ?

Suresh: Some examples of JSON schema in scientific context - https://github.com/materials-data-facility/data-schemas/tree/master/schemas
Suresh: Another option is to consider JSON-LD. Particularly it will go well with schema.org vocabulary.
- Geoscience examples - https://github.com/ESIPFed/science-on-schema.org and https://geocodes.earthcube.org/#/landing
- psychoinformatics-de/datalad-hirni#102

The investigation should evaluate the advantages of adopting either json-schema or JSON-LD. What do we get by adopting them over our own API approach? Also, where are they deficient? I imagine we'll still need some metadata schema definition API to specify things like the type of index to create, etc.
