Comments (10)

josh-tobin commented on July 29, 2024

Suggestion from @sayakpaul

Add a section to the troubleshooting lecture on things that are common practice in the research world that may not be worth the added complexity in the real world. Examples:

  • If one is careless, data augmentation can degrade image quality unnecessarily. So, how can we incorporate augmentation while preserving as much quality as possible? The fast.ai team came up with a simple augmentation policy for this called presizing.
  • My model is getting stuck in local minima. What approaches should I take? E.g., LR schedules with decay, cyclical learning rates, etc. (just providing examples).
  • If I incorporate batch normalization in my model training, I might be making the model's performance highly dependent on the batch statistics, which might not be desirable during inference. How should I go about handling this?
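
As an illustration of the cyclical learning rate idea mentioned in the list above, here is a minimal sketch of a triangular schedule in plain Python. The function name and the default values for `base_lr`, `max_lr`, and `step_size` are assumptions chosen for the example, not recommendations from the thread.

```python
import math


def triangular_clr(step, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Learning rate at `step` for a triangular cyclical schedule.

    The rate climbs linearly from base_lr to max_lr over `step_size` steps,
    then falls back to base_lr over the next `step_size` steps, and repeats.
    All default values are illustrative assumptions.
    """
    # Index of the current cycle (1-based); one cycle spans 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    for step in (0, 1000, 2000, 3000, 4000):
        print(step, triangular_clr(step))
```

The periodic increase in the rate is what is meant to help the optimizer move out of poor regions; whether it does so on a given problem has to be checked empirically.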

from course-gitbook.

chesterxgchen commented on July 29, 2024

In data management, the course should include new data formats such as:
Apache Iceberg -- originally from Netflix, currently used by many big companies (Netflix, Apple, Alibaba, Tencent, Adobe, LinkedIn (?), ...)

Delta Lake -- originated at Databricks, open-sourced under the Linux Foundation, with the greatest momentum due to its integration with Databricks Cloud, Spark/Spark SQL/Spark Streaming, and MLflow

Apache Hudi -- originated at Uber, well suited to upserts.

All three have the time-travel support needed for data versioning. MLflow is now integrated with Delta Lake to do data version control.
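
To make the time-travel idea concrete, here is a toy in-memory sketch. It is not the API of Iceberg, Delta Lake, or Hudi; the class `ToyVersionedTable` and its methods are hypothetical and only show the general idea of keeping immutable snapshots so that an earlier version of the data can be read back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Hypothetical stand-in for a table format that keeps immutable snapshots."""

    _snapshots: list = field(default_factory=list)

    def commit(self, rows):
        """Append `rows` as a new snapshot and return its version number."""
        previous = self._snapshots[-1] if self._snapshots else ()
        # Each snapshot is an immutable tuple, so older versions never change.
        self._snapshots.append(previous + tuple(rows))
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Return the rows as of `version` (the latest snapshot if omitted)."""
        if not self._snapshots:
            return []
        if version is None:
            version = len(self._snapshots) - 1
        if not 0 <= version < len(self._snapshots):
            raise IndexError(f"unknown version: {version}")
        return list(self._snapshots[version])


if __name__ == "__main__":
    table = ToyVersionedTable()
    v0 = table.commit([{"id": 1, "label": "cat"}])
    v1 = table.commit([{"id": 2, "label": "dog"}])
    print("as of v0:", table.read(v0))
    print("latest:  ", table.read())
```

Recording the version number alongside a training run is what would let the same data be read again later for that run.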

sayakpaul commented on July 29, 2024

@josh-tobin do you mind sharing the platforms/approaches/tools you have in mind for covering "Managing data at a larger scale"? Or do you plan to keep it platform-independent?

One approach that I have found useful for quite a while now (apologies if it is naive):

  • Convert my data to multiple shards of TensorFlow Records
  • Copy them over to a GCS bucket in the same zone where my cloud VM resides for improved performance
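
The sharding step above can be sketched roughly as follows. To stay dependency-free, this example writes JSON Lines files as a stand-in for TFRecord shards; the function name, the `prefix`, and the `name-00000-of-00004` file naming are assumptions made for illustration. Copying the resulting files to a bucket is left out.

```python
import json
import tempfile
from pathlib import Path


def write_shards(records, out_dir, num_shards=4, prefix="train"):
    """Distribute `records` round-robin across `num_shards` files.

    JSON Lines is used here only as a stand-in for a binary record format.
    Returns the list of shard paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [
        out_dir / f"{prefix}-{i:05d}-of-{num_shards:05d}.jsonl"
        for i in range(num_shards)
    ]
    files = [p.open("w", encoding="utf-8") for p in paths]
    try:
        for index, record in enumerate(records):
            # Round-robin assignment keeps shard sizes within one record of each other.
            files[index % num_shards].write(json.dumps(record) + "\n")
    finally:
        for f in files:
            f.close()
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_shards(({"x": i} for i in range(10)), tmp):
            print(path.name)
```

Multiple similarly sized shards let several reader workers pull data in parallel instead of contending for one large file.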

josh-tobin commented on July 29, 2024

This is somewhat similar to how I've done it in the past. Some considerations are HDFS vs GCS/S3/etc, and how to build performant data loaders. I'd want to dig into this some more before making any concrete recommendations though.

sayakpaul commented on July 29, 2024

Great! This could also be stretched to show how much impact the data input pipeline has on training a model with good hardware utilization.
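
One common way an input pipeline affects utilization is whether data loading overlaps with compute. The following is a minimal sketch of background prefetching using only the standard library; the `prefetch` helper and its `buffer_size` default are assumptions for illustration, and real frameworks ship their own, more capable equivalents.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` ahead in a thread.

    While the consumer is busy with one item (e.g. a training step), the
    background thread keeps preparing the next ones.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks when the buffer is full
        finally:
            # Always signal completion so the consumer cannot wait forever.
            buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


if __name__ == "__main__":
    for batch in prefetch(range(5)):
        print("training on batch", batch)
```

Without such overlap, the accelerator sits idle every time the next batch is being read and decoded.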

chesterxgchen commented on July 29, 2024

Monitoring --
Data quality monitoring: feature quality, feature distribution visualization, feature skew, data distribution change over time, mismatch between training/test/validation feature distributions, etc.
Model monitoring --
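
As one possible way to quantify the feature distribution changes listed under data quality monitoring above, here is a minimal sketch of the Population Stability Index (PSI) for a single numeric feature. The equal-width binning, the `num_bins` default, and the `eps` smoothing value are assumptions made for the example; both inputs are assumed to be non-empty.

```python
import math


def population_stability_index(expected, actual, num_bins=10, eps=1e-6):
    """PSI between a reference sample (`expected`) and a new sample (`actual`).

    Bins are equal-width over the range of `expected`; values in `actual`
    that fall outside that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * num_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), num_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


if __name__ == "__main__":
    training = [i / 100 for i in range(100)]
    serving = [x + 0.5 for x in training]  # simulated shift
    print("no shift:", population_stability_index(training, training))
    print("shifted: ", population_stability_index(training, serving))
```

A value near zero means the two samples fill the bins similarly; larger values indicate a bigger mismatch, and the alert threshold is a judgment call per feature.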

KDDS commented on July 29, 2024

More aspects of data engineering, such as implementing massively parallel programming techniques and other cutting-edge solutions that make big data ready for DL/ML.

DanielhCarranza commented on July 29, 2024

Data
Dataset shifts:

  • Proactive approaches (e.g. causal diagrams, DAGs, PAGs)
  • Reactive approaches

How to effectively handle Long-tail Data

josh-tobin commented on July 29, 2024

@DanielhCarranza say more about the reactive / proactive approaches you have in mind?

nickdavidhaynes commented on July 29, 2024

I'd love to see a discussion of peer review (maybe it fits in the section on teams, or in the testing/deployment section?). There are a lot of pieces that need to be reviewed!

  • Training code/configuration
  • Serving code
  • Modeling approach — data and features selected, model architecture, choice of metric(s)
  • Experiment results
  • Plan and code for monitoring performance

I'm aware of a couple good blog posts on the topic, but I'm not sure anything definitive exists.
