Comments (7)

LePeti commented on May 23, 2024

Great post! I took a lot of advice and ideas from it while researching which tool to pick for my team. In the end we went with Google's Data Catalog b/c we are heavy users of GCP. Do you have experience with it? I'd be curious to hear your opinion. I'm not 100% happy b/c there is still so much we need to develop ourselves (tools for uploading templates and tagging resources, automated metric stats calculators, etc.). But maybe these obstacles exist on most platforms.

from eugeneyan-comments.

hey-jude commented on May 23, 2024

Please check SpotHero's link. It is the same as Saxo Bank's.

eugeneyan commented on May 23, 2024

I'm afraid I haven't looked at any proprietary data discovery tools (including Google Data Catalog). I haven't been able to find many reviews of it either, though this may be helpful.

eugeneyan commented on May 23, 2024

Thanks for raising this, JeongHoon! Fixed.

xiphl commented on May 23, 2024

I feel that you should include CKAN in the comparison list. While admittedly an older solution, it is viable as a data catalog for flat files. Its limiting constraint is that it doesn't include live connections to data sources. It is relatively mature as a product compared to these developing products.

FreddieSun commented on May 23, 2024

Great post! Lots of useful insights.

I have a question about feature reuse. Consider a case where 10 users are using feature A, and feature A's owner is Jack. Can Jack modify and optimize the feature? Does Jack need to guarantee the quality of the feature, both its performance and its accuracy? In that case, feature reuse increases the feature owner's burden. So how does this kind of collaboration work?

eugeneyan commented on May 23, 2024

@FreddieSun This is an interesting question and I don't have it all worked out yet.

If Jack is simply creating features as exhaust of his own machine learning pipeline, he can make the features available without guarantees of performance (i.e., Caveat Emptor). Thus, Jack can publish the features he's using for his own use case and make them available to others without taking on the ops burden of updating/maintaining them for other use cases.

Alternatively, Jack can go the extra mile and maintain multiple versions of the feature in the short term. Thus, if he's updating a feature from v1 -> v2, he might provide v1 and v2 simultaneously for a period (e.g., a month) before deprecating v1. Consumers of Jack's features can be identified via looking at query logs before sending them a notification. Nonetheless, this is more burdensome and IMHO, Jack is in no way obliged to do this.
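
To make the overlap-then-deprecate idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `FeatureVersion` class, the helper names, the 30-day default overlap, and the `(user, feature_name, version)` shape of the query log are all hypothetical and not tied to any particular feature store.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class FeatureVersion:
    """Hypothetical record for one published version of a feature."""
    name: str
    version: str
    owner: str
    deprecated_on: Optional[date] = None  # None means no sunset date is planned

    def is_available(self, today: date) -> bool:
        # A version is served until (but not on) its deprecation date.
        return self.deprecated_on is None or today < self.deprecated_on


def release_new_version(old: FeatureVersion, new_version: str,
                        today: date, overlap_days: int = 30) -> FeatureVersion:
    """Publish a new version and schedule the old one for deprecation.

    Both versions stay available during the overlap window (assumed 30 days).
    """
    old.deprecated_on = today + timedelta(days=overlap_days)
    return FeatureVersion(old.name, new_version, old.owner)


def consumers_to_notify(query_log, feature: FeatureVersion):
    """Find users who queried this exact feature version, excluding the owner.

    query_log is assumed to be an iterable of (user, feature_name, version).
    """
    return sorted({
        user
        for user, name, version in query_log
        if name == feature.name
        and version == feature.version
        and user != feature.owner
    })
```

For example, Jack could call `release_new_version(v1, "v2", today)` when v2 ships, then pass his query logs to `consumers_to_notify(query_log, v1)` to get the list of people to message before v1 goes away.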

On the other hand, if Jack is part of a team that provides features for internal users and their downstream use cases, he'll probably have to adhere to some contract with downstream users, such as ensuring the quality of embeddings, accuracy of imputed data, etc.
