Data Discovery Platforms and Their Open Source Solutions

https://eugeneyan.com/writing/data-discovery-platforms/ about eugeneyan-comments HOT 7 OPEN

eugeneyan commented on May 23, 2024

https://eugeneyan.com/writing/data-discovery-platforms/

from eugeneyan-comments.

Comments (7)

LePeti commented on May 23, 2024

Great post! I took a lot of advice and ideas from it while researching which tool to pick for my team. In the end we went with Google's Data Catalog b/c we are heavy users of GCP. Do you have experience with that? I'd be curious to hear your opinion. I'm not 100% happy b/c there is so much we still have to build ourselves (tools for uploading templates and tagging resources, automated metric stats calculators, etc.). But maybe these obstacles are there for most platforms.

hey-jude commented on May 23, 2024

Please check SpotHero's link. It is the same as Saxo Bank's.

eugeneyan commented on May 23, 2024

I'm afraid I haven't looked at any proprietary data discovery tools (including Google Data Catalog). I haven't been able to find many reviews of it either, though this may be helpful.

eugeneyan commented on May 23, 2024

Thanks for raising this JeongHoon! Fixed.

xiphl commented on May 23, 2024

I feel that you should include CKAN in the comparison list. While admittedly an older solution, it is viable as a data catalog for flat files. Its limiting constraint is that it doesn't include live connections to data sources. It is a relatively mature product compared to these still-developing ones.

FreddieSun commented on May 23, 2024

Great post! Lots of useful insights.

I have a question about feature reuse. Consider a case where 10 users use feature A, and feature A's owner is Jack. Can Jack modify and optimize the feature? Does Jack need to guarantee the quality of the feature, both its performance and its accuracy? In that case, feature reuse increases the feature owner's burden. So how does this kind of collaboration work?

eugeneyan commented on May 23, 2024

@FreddieSun This is an interesting question and I don't have it all worked out yet.

If Jack is simply creating features as exhaust of his own machine learning pipeline, he can make the features available without guarantees of performance (i.e., caveat emptor). Thus, Jack can publish features he's using for his own use case and make them available to others without taking on the ops burden of updating/maintaining them for other use cases.

Alternatively, Jack can go the extra mile and maintain multiple versions of the feature in the short term. Thus, if he's updating a feature from v1 -> v2, he might provide v1 and v2 simultaneously for a period (e.g., a month) before deprecating v1. Consumers of Jack's features can be identified by looking at query logs, and then sent a notification. Nonetheless, this is more burdensome and IMHO, Jack is in no way obliged to do this.

On the other hand, if Jack is part of a team that provides features for internal users and their downstream use cases, he'll probably have to adhere to some contract with downstream users, such as ensuring the quality of embeddings, accuracy of imputed data, etc.
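To make the v1 -> v2 overlap idea a bit more concrete, here is a minimal sketch of how one might find consumers to notify from query logs and compute a deprecation date. Everything here is an assumption for illustration: the `QueryLogEntry` shape, the helper names, the user names, and the 30-day overlap are hypothetical and not taken from any particular feature store.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class QueryLogEntry:
    """One row of a hypothetical query log: who read which feature version, and when."""
    user: str
    feature: str
    version: str
    queried_on: date


def consumers_to_notify(log, feature, old_version, since):
    """Return the sorted, de-duplicated users who queried the old version on or after `since`."""
    return sorted({
        entry.user
        for entry in log
        if entry.feature == feature
        and entry.version == old_version
        and entry.queried_on >= since
    })


def deprecation_date(v2_release, overlap_days=30):
    """Date on which v1 can be retired, assuming both versions are served for `overlap_days`."""
    return v2_release + timedelta(days=overlap_days)


# Hypothetical log entries for feature A (assumed data, for illustration only).
log = [
    QueryLogEntry("alice", "feature_a", "v1", date(2024, 5, 20)),
    QueryLogEntry("bob", "feature_a", "v2", date(2024, 5, 21)),
    QueryLogEntry("carol", "feature_a", "v1", date(2024, 3, 1)),
    QueryLogEntry("alice", "feature_a", "v1", date(2024, 5, 22)),
]

v2_release = date(2024, 5, 23)
# Only look at users who were active on v1 in the 30 days before v2 shipped.
to_notify = consumers_to_notify(
    log, "feature_a", "v1", since=v2_release - timedelta(days=30)
)
cutoff = deprecation_date(v2_release)
print(to_notify, cutoff)
```

The look-back window is a judgment call: a short window avoids pinging users who stopped using v1 long ago, while a longer one is safer for features that are only queried occasionally (e.g., by monthly jobs).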
