The enceladus from absaoss

Timestamp in Unix Epoch

Source may contain Timestamp data as milliseconds since the Unix Epoch:

1528797510650 = 2018/09/12 09:58:30.650
As this is not a standard SimpleDateFormat there would need to be a custom Stadardisation specific string, e.g.: UNIXEPOCH

Provide a mechanism to convert native copy book to Spark Struct so this can be uploaded and managed directly in Menas.
Copybook contains too much information to easily store in Spark Struct schema, maybe we should rather store it in Mongo as an uploaded attachment.

Model versioning

With the new UI, all changes even micro changes are incurring multiple versions, this is good and bad as we have a full roll back on any change to a very granular point.
The proposed idea, is to have a separate version system for tracking changes in Menas (when creating and updating definition) compared to version of the dataset we run when doing spark-submit, we will have too many versions due to micro-changes that Menas is going to introduce

View updates even from canceled edit window

If a user clicks edit over Basic Info of any of the models (Schemas, Datasets, Mapping tables), changes description, and then clicks cancel, the change still gets shown in the current window. It almost seems as the update went through.

This goes away after window refresh and does not send any call to database or such.

Audit Trails

Testing

We need to prioritise stability for version 1.0.0.
To do this, we need to make up for missing unit, integration and system tests.

Historical view for entities

Have a view of each entity so that we can see historical versions (ie those that have been "deleted")

Old data lingers even if all items in view are deleted

After user deletes all the items in the view from the left column, the last item is still showing in the main view. The user can still go through all the tabs of it.

This should be disabled or nulled, same as in the left column where it just says No Data.

Search, Sort and Filter entities

Users should be able to search filter and sort their schemas/mapping tables/datasets in the Menas UI for easier navigation.

Menas Rest API Integration Tests

We need to add integration tests for the Menas Rest API.

New Menas Build

Handling string formatting in represented by string numerics

Handle string based formatting that source may output e.g.:
"1,000.00" = 1000.00

Conformance Unit Tests

We need to improve coverage of our unit tests in the conformance module.

Optimize the array type conformance by merging conformance rules

Array transformations are expensive because we explode and collapse arrays for each conformance rule operating on array columns. We could improve performance by merging rules that operate on the same array columns and decreasing to a single explosion-collapse.

Validate parameters moved from runtime to definition time

The UI should validate that whitespaces are not present in HDFS paths, that those paths exists, etc.

Entity names shouldn't have whitespace

Dataset/Mapping table/Schema names should not be allowed to have whitespace as that can cause issues

Allow entity Import/Export

Background

Currently, entity definitions in Menas cannot be exported. The only thing you can export is a file uploaded for a Schema entity.

Feature

It would be useful if users could export and import any entity definition as a json.

An open question is how to handle conflicts (cases where a dataset definition with the same name and version already exists). This conflict resolution should be drawn up in a shared doc and once resolved posted here.

This task is similar in nature to #594, but meant for manual single item import/export, while #594 is meant for direct promotion via HTTP and in bulk.

Expected task

API
UI adjustment

Implement type validation of default values

Menas should be able to validate that the specified default schema values are valid fo the associated data type.

Capture outbound schema back into Menas

Post Conformance result schema should be published back into Menas so we have inbound and outbound data schema
Maybe turning Menas into a central schema store and allow harvesting from other tools

Unit tests of handling arrays in XML

The unit tests should look like this:

For a given xml dataset and a schema check if the resulting Spark dataset is as expected
Tests should cover:
- empty arrays
- optional arrays
- single value arrays
- arrays of primitive values
- arrays of struct values
- array as a single attribute of a struct field

Elaborate conformance rule descriptions

Conformance rules should have a more elaborate description.

Utils Unit Tests

We need to improve coverage of our unit tests in the utils module.

Add scaladocs

We lack scaladocs on most APIs, would really help to improve documentation.

Better dataset declaration / runtime parameter design

This involves the reduction of parameters submitted to the spark for both Standardization and Conformance. Instead these will be provided via Menas.
This should greatly improve the platform's integration capabilities, e.g. with scheduling tools like Oozie.

Admin UI

Database Integration Tests

We need database integration tests to check all database interactions Menas can perform.
This involves for each test case:

Fixtures being imported into a live DB instance (containerized or not)
Running the test
Cleaning up the DB state for the next test

Runs view

The dataset should have a runs page showing run metadata, as per old Menas.
Reminder: add thousands separator.

More COBOL integration

This involves Cobrix integration in Standardiation.

Implement optimistic locking or similar concurrency control

Similarity Schema Checks

We can implement a set of routines that provide schema difference give 2 schemes to track schema changes between versions. That may include:

The list of new fields
The list of deleted fields
The list of changed properties of fields (type, nullability, etc)

But this will require some UI work as well to visualize the difference. Probably the easiest way is to display the above lists as text boxes.

Warning on outdated entity references

We could put a warning on views that have entity references pointing entities with old versions.
Example:

Schema A with latest version 5
Dataset X references schema Schema A (version4)
Show warning next to Schema A (version 4) on Dataset A view: "Schema A has newer version 5"

Dataset Conformance Rules CRUD

We need a view in Menas to add/update/delete/reorder Conformance Rules.

Enceladus and Menas model compatibility

Currently we have no way of ensuring that the models used in Enceladus (std and conformance) are compatible versions with the ones stored in Menas. We can add a header to the HTTP requests that says what model version Enceladus is using, then Menas can check compatibility and respond accordingly.

If the header is not specified Menas should ignore the compatibility check to avoid breaking compatibility for people using Encealdus as a library.

Standardization Unit Tests

We need to improve coverage of our unit tests in the standardization module.

Validate duplicate target columns for MT default values

Create schema validation for existing fields with errCol name

Ensure that if the input to standardization contains the pre-defined error column name, the schema of the field matches our predefined schema, otherwise fail.

Authorization and permissions

User and admin roles need to be differentiated. User Read/Write permissions need to be restricted to either their own entities or there could be some sort of group restrictions.

End-to-end Testing

We need to set up real e2e system tests which include everything

Drop (Do not persist) fields that are not registered in schema

Do not write parquet file with attributes not registered in the schema, this is usually a symptom where source has attributes, but the schema in the schema repository has not been updated.

Want to use this to ensure owners keep up with schema changes and give control of data distribution for attributes to data owners

Uber JAR Standardisation and Conformance

Provide 1 JAR artifact for both Conformance and Standardisation.
This will simplify configuration files and also reduce risk of running mismatched STD and CONF versions by users.

Rerun overwrite and also data access reader

Enable an efficient and clean way to implement rerun and also expected override capability for data loads (ie End of Month, End of Day and etc)

Handle schema insertion where schema has 0 fields

Currently conformance allows for 0 field schemas to be inserted and used, this is only picked up when running Enceladus. This is usually an error with schema definition generation or manual manipulation

Use RESTful http methods in Menas API

Currently disable calls calls to the new rest API use HTTP GET requests, but GET should be idempotent. Instead we should use HTTP DELETE.
POST to create entity.
PUT to update entity.

Table grouping

I would like to suggest that there be a different folder structure or way of searching on Menas.

I think it would be good to have a folder per source system and then the tables that belong to that source will be in there instead of there just being a list of tables with no indication of which source system they are from.

Performance optimisation for standardisation

Consider performance optimisation with respect to wide data (tables with thousands of columns). The options are discussed at Spark DEV Mailing Lists http://apache-spark-developers-list.1001551.n3.nabble.com/Performance-Spark-DataFrame-is-slow-with-wide-data-Polynomial-complexity-on-the-number-of-columns-is-tt24635.html

Add additional metadata for datasets

Allow users to capture the following information for dataset level definition:

Business Description
Frequency

Frequency definitely could feed into our discussion for integration with Oozie and schedulers

Currently we recommend user populate as additionalInfo metadata in INFO file, which should live in Dataset definition rather than "run" metadata (INFO file)

Schema editor

Users should be able to construct and validate their Schemas through the Menas UI.

Refactor Validation Utils

We have two separate ValidationExceptions:

za.co.absa.enceladus.conformance.interpreter.rules.ValidationException (only used in Standardization)
za.co.absa.enceladus.utils.validation.ValidationException (only used in Conformance)
Conformance also dips into za.co.absa.enceladus.utils.validation.ValidationUtils before throwing its own ValidationException, which blurs the line between which ValidationException is responsible for what.

Good design dictates a clear separation between the two or a unification.

There is also some code duplication in conformance rules, specifically aimed at validation, that would best be extracted to a common location.

Add performance summary to UI

It would be helpful to add some performance metrics to Run screen, such as:

input data size (in GB)
output data size (in GB)
Time it took to do standardization
Time it took to do dynamic conformance
Total processing time
For each checkpoint it would be helpful to show the elapsed time as a difference between start and end processing time, if available (for source and raw it may not be available).
Maybe also the number of records (they already shown as control measurements), could be part of performance summary as well.

An input data size and an output data size should be provided by backend. All other information can be derived from control checkpoints i guess.
There is a generic key/map pair in _INFO file. I guess we can use that to provide information about size of raw, standardized and published datasets. Also, possibly it should include information about standardization and conformance elapsed time.

The raw folder size, std folder size and conformed folder size can be saved in metadata fields of an _INFO file and Run object during std/dynconf jobs execution.

Create sample end to end working sample for custom rule

This involves creating an example module with an implementation of a Spark job using the enceladus-conformance as a library to implement a CustomConformanceRule and its respective interpreter.

absaoss / enceladus Goto Github PK

enceladus's People

Contributors

Stargazers

Watchers

Forkers

enceladus's Issues

Background

Feature

Expected task

Recommend Projects

Recommend Topics

Recommend Org