googlecloudplatform / ml-design-patterns

Source code accompanying O'Reilly book: Machine Learning Design Patterns

License: Apache License 2.0

Jupyter Notebook 99.76% Python 0.23% Dockerfile 0.01% Shell 0.01%

ml-design-patterns's Introduction

This is not an official Google product

ml-design-patterns

Source code accompanying O'Reilly book:
Title: Machine Learning Design Patterns
Authors: Valliappa (Lak) Lakshmanan, Sara Robinson, Michael Munn

https://www.oreilly.com/library/view/machine-learning-design/9781098115777/

Buy from O'Reilly
Buy from Amazon

We will update this repo with source code as we write each chapter. Stay tuned!

Chapters

  • Preface
  • The Need for ML Design Patterns
  • Data Representation Design Patterns
    • #1 Hashed Feature
    • #2 Embedding
    • #3 Feature Cross
    • #4 Multimodal Input
  • Problem Representation Design Patterns
    • #5 Reframing
    • #6 Multilabel
    • #7 Ensemble
    • #8 Cascade
    • #9 Neutral Class
    • #10 Rebalancing
  • Patterns That Modify Model Training
    • #11 Useful Overfitting
    • #12 Checkpoints
    • #13 Transfer Learning
    • #14 Distribution Strategy
    • #15 Hyperparameter Tuning
  • Resilience Patterns
    • #16 Stateless Serving Function
    • #17 Batch Serving
    • #18 Continuous Model Evaluation
    • #19 Two Phase Predictions
    • #20 Keyed Predictions
  • Reproducibility Patterns
    • #21 Transform
    • #22 Repeatable Sampling
    • #23 Bridged Schema
    • #24 Windowed Inference
    • #25 Workflow Pipeline
    • #26 Feature Store
    • #27 Model Versioning
  • Responsible AI
    • #28 Heuristic Benchmark
    • #29 Explainable Predictions
    • #30 Fairness Lens
  • Summary

ml-design-patterns's People

Contributors

laksantea, lakshmanok, munnm, sararob


ml-design-patterns's Issues

Chapter 2 embeddings last cell - change request

Dear authors,

The last cell of the notebook could be confusing: when using pre-trained embedding vectors, there is no need to define the EMBED_DIM parameter. Fortunately, the build_hub_model() method does not use it. :)
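
For context, here is a minimal sketch of a hub-based text model in which the embedding dimension is fixed by the pre-trained module itself (the module URL and layer sizes are illustrative assumptions; the notebook's build_hub_model() may differ):

import tensorflow as tf
import tensorflow_hub as hub

def build_hub_model():
    # The pre-trained module fixes the embedding dimension (50 for this
    # module), so a separate EMBED_DIM parameter would go unused.
    hub_layer = hub.KerasLayer(
        "https://tfhub.dev/google/nnlm-en-dim50/2",
        input_shape=[], dtype=tf.string, trainable=True)
    return tf.keras.Sequential([
        hub_layer,
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])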

Best regards
Jerome

Chapter 3: Cascade

The GCP AI Platform Notebook image does not come with kfp installed.

Installing it with !pip install kfp flags no incompatibilities.

The pipeline run then completes with output: 569.155..

Dataflow Batch job creates a Zero Byte TensorFlow record file

I am working through the following notebook: https://github.com/GoogleCloudPlatform/ml-design-patterns/blob/master/02_data_representation/weather_search/wx_embeddings.ipynb. I am running a GCP AI Notebook VM with JupyterLab.

When I run the following line of code: %run -m wxsearch.hrrr_to_tfrecord -- --startdate 20190915 --enddate 20190916 --outdir gs://{BUCKET}/wxsearch/data/2019 --project {PROJECT}, the Dataflow batch job reports that it runs fine to completion (first screenshot below). However, it produces a zero-byte TensorFlow record file (second screenshot below). The zero elements per second on the create_tfr step seems concerning, although I don't know whether it is the actual problem.

Any thoughts as to what may be happening? The only modifications I made were to the bucket and project variables where I wrote my own bucket and project values into the command.

[Screenshots: the Dataflow job status page and the zero-byte output file]
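
One way to check the output independently of the Dataflow UI is to count the records actually written (a minimal sketch; substitute your own bucket, as in the command above):

import tensorflow as tf

# List the files the job wrote and count the records in each
# (an empty TFRecord file yields 0 records).
files = tf.io.gfile.glob('gs://BUCKET/wxsearch/data/2019/*')
for f in files:
    size = tf.io.gfile.stat(f).length
    n_records = sum(1 for _ in tf.data.TFRecordDataset(f))
    print(f, size, 'bytes,', n_records, 'records')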

Chapter 5: Continued Evaluation - Class label consistency

The order of the publications is inconsistent between the original CLASSES definition and the source name function:

CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}

labels = tf.constant(['github', 'techcrunch', 'nytimes'], dtype=tf.string)

Suggest:

labels = tf.constant(['github', 'nytimes', 'techcrunch'], dtype=tf.string)
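
A more defensive alternative (a sketch, not from the notebook) is to derive the label tensor from CLASSES itself, so the two definitions cannot drift apart:

import tensorflow as tf

CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}

# Sorting the class names by their numeric label guarantees that the
# string at index i is always the class whose label is i.
labels = tf.constant(sorted(CLASSES, key=CLASSES.get), dtype=tf.string)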

Chapter 2: Training Times Machine Spec

It would be helpful to indicate likely training times for a typical machine specification.

Using a GCP n1-standard notebook, I found the DNN and linear model training times to be significantly longer than stated in the draft book.

Chapter 5: Evaluation job not available?

From the continuous_eval.ipynb notebook:

[Image from the notebook, referenced above]

However, the "Evaluation" option is not available (disabled):

[Screenshot: the Evaluation tab, greyed out]

What's the recommended course of action here?

Chapter 5: Two Phase Predictions

NameError: name 'image_batch' is not defined
feature_batch = mobilenet(image_batch)

Suggest:
train_image, train_label = next(train_data_gen)
feature_batch = mobilenet(train_image)

Phase 2 (identifying instrument sounds) needs the '/audio_train_spectro' directory repopulated with images, because the files were moved into audio_spectros/not_instrument/ or audio_spectros/instrument/ in the previous example.

Similarly suggest:

feature_batch = vgg_model(image_instrument_train)
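
For reference, a self-contained sketch of the mobilenet suggestion above (the input size, batch size, and directory name are illustrative assumptions, not the notebook's exact values):

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Frozen MobileNetV2 feature extractor.
mobilenet = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights='imagenet')
mobilenet.trainable = False

# 'audio_spectros/' stands in for the repopulated spectrogram directory.
train_data_gen = ImageDataGenerator(rescale=1 / 255.).flow_from_directory(
    'audio_spectros/', target_size=(128, 128), batch_size=32)

# Pull a batch explicitly instead of referencing the undefined image_batch.
train_image, train_label = next(train_data_gen)
feature_batch = mobilenet(train_image)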

Chapter 1: Colab Auth

When running the chapter 1 notebook on a GCP Notebook VM, Google Colab is not installed. The notebook executes successfully without running this cell.

Installing the package (google-colab) flags some incompatibilities and then produces:

ImportError: cannot import name 'ordereddict' from 'pandas.compat'

causing the remainder of the notebook to fail.
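
A common workaround (a sketch, not from the notebook) is to guard the Colab-specific cell so it becomes a no-op elsewhere:

# On a GCP Notebook VM the instance's service-account credentials are
# already in place, so the Colab auth step can simply be skipped.
try:
    from google.colab import auth
    auth.authenticate_user()
except ImportError:
    pass  # not running on Colab; nothing to do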

Chapter 6: How should I determine the number of bridged examples?

Hello, I am studying this book. Thank you for writing such a well-structured textbook.

I have a question about the Bridged Schema pattern (design pattern #23). How should I determine the amount of old data to add to the training data?

In this repository, it is stated that adding 60,000 old examples is best. However, in the line graph of the number of examples against the R² value, 60,000 sits near the bottom of the R² range. Since a higher R² value means better predictions, it looks as though prediction accuracy decreases as more old data is added.
From this graph, I might instead conclude that prediction accuracy is higher when training only on new data.

I would be glad if anyone could tell me why adding 60,000 old examples was judged to be best.
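
For what it's worth, the pattern's selection criterion is performance on held-out new data: train once per candidate amount of old data and keep the amount with the best validation metric. A toy, fully synthetic sketch of that sweep (nothing here is the notebook's data or code):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def make_data(n, shift):
    # 'shift' mimics old, bridged data being slightly off-distribution.
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + shift + rng.normal(scale=0.5, size=n)
    return X, y

X_new, y_new = make_data(5_000, shift=0.0)
X_old, y_old = make_data(60_000, shift=0.3)
X_val, y_val = make_data(2_000, shift=0.0)   # held-out *new* data

# Evaluate each candidate amount of old data on new-data validation R^2
# and pick the best; the winner depends on the data, not on a fixed rule.
for n_old in [0, 10_000, 30_000, 60_000]:
    X = np.vstack([X_new, X_old[:n_old]])
    y = np.concatenate([y_new, y_old[:n_old]])
    model = LinearRegression().fit(X, y)
    print(n_old, r2_score(y_val, model.predict(X_val)))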

Chapter 5: Up the patience to at least 5?

@munnm, in the 05_resilience/continuous_eval.ipynb notebook, please set some value for the patience argument (preferably > 1) in the EarlyStopping() callback. Otherwise, training stops after the first epoch without improvement. Users not familiar with Keras's EarlyStopping callback API might not understand what's wrong at first glance.
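
A minimal sketch of the suggested change (restore_best_weights is an extra suggestion, not something the notebook currently sets):

from tensorflow.keras.callbacks import EarlyStopping

# patience=5 tolerates transient plateaus in val_loss instead of stopping
# at the first epoch without improvement; restore_best_weights rolls the
# model back to the best epoch when training stops.
early_stopping = EarlyStopping(monitor='val_loss', patience=5,
                               restore_best_weights=True)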

Chapter 2: Mixed Representation

mixed_image_tabular_model = Model(inputs=[image_tabular_input, tiled_input], outputs=merged_image_output)

NameError: name 'tiled_input' is not defined

Should be 'image_input'?
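
If image_input is the Input layer defined earlier in the notebook, that looks right. A minimal self-contained sketch of the shape of the fix (all shapes and names here are illustrative):

import tensorflow as tf
from tensorflow.keras import Input, Model, layers

image_input = Input(shape=(28, 28, 1), name='image')
image_tabular_input = Input(shape=(10,), name='tabular')

x = layers.concatenate(
    [layers.Flatten()(image_input), image_tabular_input])
merged_image_output = layers.Dense(1, activation='sigmoid')(x)

# Every tensor passed to inputs= must be an Input that actually feeds the
# graph; referencing the undefined tiled_input raises the NameError above.
mixed_image_tabular_model = Model(
    inputs=[image_tabular_input, image_input],
    outputs=merged_image_output)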

Chapter 6: Storm Reports

With tfx==0.24, I had to substitute:

from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen

Also, the Transform step fails with a kernel restart; see tensorflow/tfx#2598.
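
For reference, the substitution side by side (the pre-0.24 import path is approximate):

# Earlier tfx releases:
# from tfx.components import BigQueryExampleGen

# tfx==0.24:
from tfx.extensions.google_cloud_big_query.example_gen.component import (
    BigQueryExampleGen)

example_gen = BigQueryExampleGen(query='SELECT ...')  # query elided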

Chapter 2: Feature Cross - Error in explanatory text

Dear authors,

The text

'''Creating a Feature Cross with BQML
Next, we'll create a feature cross of the features is_male and mother_race. To create a feature cross we apply ML.FEATURE_CROSS to a STRUCT of the features is_male and mother_race cast as a string. The STRUCT clause creates an ordered pair of the two features. The TRANSFORM clause is used for engineering features of our model. This allows us to specify all preprocessing during model creation and apply those preprocessing steps during prediction and evaluation. The rest of the features within the TRANSFORM clause remain unchanged.'''

has to be changed, as the two features crossed are is_male and plurality, not is_male and mother_race.
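
For reference, a hedged sketch of the kind of statement the corrected text describes (the dataset, model name, and column list are illustrative, not the notebook's exact query):

from google.cloud import bigquery

client = bigquery.Client()

# Cross is_male and plurality (cast to STRING) inside a TRANSFORM clause,
# so the same preprocessing is applied at prediction and evaluation time.
ddl = """
CREATE OR REPLACE MODEL `mydataset.natality_model`
TRANSFORM(
    weight_pounds,
    ML.FEATURE_CROSS(
        STRUCT(CAST(is_male AS STRING) AS is_male,
               CAST(plurality AS STRING) AS plurality)) AS gender_X_plurality
)
OPTIONS (model_type='linear_reg', input_label_cols=['weight_pounds']) AS
SELECT weight_pounds, is_male, plurality
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds IS NOT NULL
"""
client.query(ddl).result()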

Thanks

Best regards

Jerome

Chapter 3: Cascade evaluate ValueError: The pyarrow library is not installed

Dear authors,
The evaluate component of the pipeline fails because the pyarrow module is missing.

Solved by changing the packages_to_install request in the pipeline definition:

@dsl.pipeline(
    name='Cascade pipeline on SF bikeshare',
    description='Cascade pipeline on SF bikeshare'
)
def cascade_pipeline(
    project_id = PROJECT_ID
):
    ddlop = comp.func_to_container_op(run_bigquery_ddl, packages_to_install=['google-cloud-bigquery'])
        
    c1 = train_classification_model(ddlop, PROJECT_ID)
    c1_model_name = c1.outputs['created_table']
    
    c2a_input = create_training_data(ddlop, PROJECT_ID, c1_model_name, 'Typical')
    c2b_input = create_training_data(ddlop, PROJECT_ID, c1_model_name, 'Long')
    
    c3a_model = train_distance_model(ddlop, PROJECT_ID, c2a_input.outputs['created_table'], 'Typical')
    c3b_model = train_distance_model(ddlop, PROJECT_ID, c2b_input.outputs['created_table'], 'Long')
    
    evalop = comp.func_to_container_op(evaluate, packages_to_install=['google-cloud-bigquery[bqstorage,pandas]', 'pandas'])
    error = evalop(PROJECT_ID, c1_model_name, c3a_model.outputs['created_table'], c3b_model.outputs['created_table'])
    print(error.output)

Best Regards

Jerome

Chapter 3: Rebalancing

  • On a GCP AI Notebook, install xgboost first: !pip install xgboost
  • Colab auth fails, as per issue #5.
  • Needed to manually create the natality dataset in the US region.
  • Replaced df['NEAREST_CENTROIDS_DISTANCE'].iloc[0] with average_pred['NEAREST_CENTROIDS_DISTANCE'].iloc[0]

Errors in two_phase_predictions.ipynb

  • "audio" path should be "audio_spectros".
  • "image_batch" variable not found.
  • fit_generator() method deprecated, prefer fit().
  • The spectrometer png files get moved, so need to redownload or copy them.

Chapter 5: Continuous evaluation

Dear authors,

it seems to me that Section 2 of the continuous evaluation notebook needs to be updated. Setting up continuous evaluation is no longer straightforward and requires more setup information to be used flawlessly.

Thanks
Best Regards

Jerome

Chapter 5: Continued Evaluation: Dataset Access, EarlyStopping, Evaluation

  1. The munn-sandbox project is not publicly available, so the txtcls data is not available.

I recreated the data using code from https://datalab.office.datisan.com.au/notebooks/training-data-analyst/blogs/textclassification/txtcls.ipynb, as:

query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
(SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
title
FROM
bigquery-public-data.hacker_news.stories
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.
://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

from google.cloud import bigquery
client = bigquery.Client()
df = client.query(query).to_dataframe()
df.to_csv('titles_full.csv', header=False, index=False, encoding='utf-8', sep=',')

I had to swap the column order:
COLUMNS = ['source', 'title']

  2. With EarlyStopping enabled, training finished after just 2 epochs:

callbacks=[EarlyStopping(), TensorBoard(model_dir)],

Without it, the loss was minimised after 20 epochs.

  3. The Evaluation job section is still a to-do:

"some stuff here about setting up Eval jobs"
