googlecloudplatform / ml-design-patterns

Source code accompanying O'Reilly book: Machine Learning Design Patterns

License: Apache License 2.0

Jupyter Notebook 99.76% Python 0.23% Dockerfile 0.01% Shell 0.01%

ml-design-patterns's Introduction

This is not an official Google product

ml-design-patterns

Source code accompanying O'Reilly book:
Title: Machine Learning Design Patterns
Authors: Valliappa (Lak) Lakshmanan, Sara Robinson, Michael Munn

https://www.oreilly.com/library/view/machine-learning-design/9781098115777/

Buy from O'Reilly
Buy from Amazon

We will update this repo with source code as we write each chapter. Stay tuned!

Chapters

  • Preface
  • The Need for ML Design Patterns
  • Data Representation Design Patterns
    • #1 Hashed Feature
    • #2 Embedding
    • #3 Feature Cross
    • #4 Multimodal Input
  • Problem Representation Design Patterns
    • #5 Reframing
    • #6 Multilabel
    • #7 Ensemble
    • #8 Cascade
    • #9 Neutral Class
    • #10 Rebalancing
  • Patterns That Modify Model Training
    • #11 Useful Overfitting
    • #12 Checkpoints
    • #13 Transfer Learning
    • #14 Distribution Strategy
    • #15 Hyperparameter Tuning
  • Resilience Patterns
    • #16 Stateless Serving Function
    • #17 Batch Serving
    • #18 Continuous Model Evaluation
    • #19 Two Phase Predictions
    • #20 Keyed Predictions
  • Reproducibility Patterns
    • #21 Transform
    • #22 Repeatable Sampling
    • #23 Bridged Schema
    • #24 Windowed Inference
    • #25 Workflow Pipeline
    • #26 Feature Store
    • #27 Model Versioning
  • Responsible AI
    • #28 Heuristic Benchmark
    • #29 Explainable Predictions
    • #30 Fairness Lens
  • Summary

ml-design-patterns's People

Contributors

laksantea, lakshmanok, munnm, sararob


ml-design-patterns's Issues

Chapter 2 embeddings last cell - change request

Dear authors,

The last cell of the notebook could be confusing: when using pre-trained embedding vectors, there is no need to define the EMBED_DIM parameter. Fortunately, the build_hub_model() method does not use it. :)
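
For context, here is a minimal sketch of a hub-based text model in which the embedding dimension is fixed by the pre-trained module itself (the module URL and layer sizes are illustrative assumptions; the notebook's build_hub_model() may differ):

import tensorflow as tf
import tensorflow_hub as hub

def build_hub_model():
    # The pre-trained module fixes the embedding dimension (50 for this
    # module), so a separate EMBED_DIM parameter would go unused.
    hub_layer = hub.KerasLayer(
        "https://tfhub.dev/google/nnlm-en-dim50/2",
        input_shape=[], dtype=tf.string, trainable=True)
    return tf.keras.Sequential([
        hub_layer,
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])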

Best regards
Jerome

Chapter 3: Cascade

The GCP AI Platform Notebook image does not come with kfp installed.

Installing it with !pip install kfp flags no incompatibilities.

The pipeline run then completes with output: 569.155..

Dataflow Batch job creates a Zero Byte TensorFlow record file

I am working through the following notebook: https://github.com/GoogleCloudPlatform/ml-design-patterns/blob/master/02_data_representation/weather_search/wx_embeddings.ipynb. I am running a GCP AI Notebook VM with JupyterLab.

When I run the following line of code: %run -m wxsearch.hrrr_to_tfrecord -- --startdate 20190915 --enddate 20190916 --outdir gs://{BUCKET}/wxsearch/data/2019 --project {PROJECT}, the Dataflow batch job reports that it runs fine to completion (first screenshot below). However, it produces a zero-byte TensorFlow record file (second screenshot below). The zero elements per second on the create_tfr step seems concerning, although I don't know whether it is the actual problem.

Any thoughts as to what may be happening? The only modifications I made were to the bucket and project variables where I wrote my own bucket and project values into the command.

[Screenshots: the Dataflow job status page and the zero-byte output file]
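
One way to check the output independently of the Dataflow UI is to count the records actually written (a minimal sketch; substitute your own bucket, as in the command above):

import tensorflow as tf

# List the files the job wrote and count the records in each
# (an empty TFRecord file yields 0 records).
files = tf.io.gfile.glob('gs://BUCKET/wxsearch/data/2019/*')
for f in files:
    size = tf.io.gfile.stat(f).length
    n_records = sum(1 for _ in tf.data.TFRecordDataset(f))
    print(f, size, 'bytes,', n_records, 'records')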

Chapter 5: Continued Evaluation - Class label consistency

The order of the publications is inconsistent between the original CLASSES definition and the source name function:

CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}

labels = tf.constant(['github', 'techcrunch', 'nytimes'], dtype=tf.string)

Suggest:

labels = tf.constant(['github', 'nytimes', 'techcrunch'], dtype=tf.string)
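
A more defensive alternative (a sketch, not from the notebook) is to derive the label tensor from CLASSES itself, so the two definitions cannot drift apart:

import tensorflow as tf

CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}

# Sorting the class names by their numeric label guarantees that the
# string at index i is always the class whose label is i.
labels = tf.constant(sorted(CLASSES, key=CLASSES.get), dtype=tf.string)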

Chapter 2: Training Times Machine Spec

It would be helpful to indicate likely training times for a typical machine specification.

Using a GCP n1-standard notebook, I found the DNN and linear model training times to be significantly longer than stated in the draft book.

Chapter 5: Evaluation job not available?

From the continuous_eval.ipynb notebook:

[Image from the notebook, referenced above]

However, the "Evaluation" option is not available (disabled):

[Screenshot: the Evaluation tab, greyed out]

What's the recommended course of action here?

Chapter 5: Two Phase Predictions

NameError: name 'image_batch' is not defined
feature_batch = mobilenet(image_batch)

Suggest:
train_image, train_label = next(train_data_gen)
feature_batch = mobilenet(train_image)

Phase 2 (identifying instrument sounds) needs the '/audio_train_spectro' directory repopulated with images, because the files were moved into audio_spectros/not_instrument/ or audio_spectros/instrument/ in the previous example.

Similarly suggest:

feature_batch = vgg_model(image_instrument_train)
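
For reference, a self-contained sketch of the mobilenet suggestion above (the input size, batch size, and directory name are illustrative assumptions, not the notebook's exact values):

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Frozen MobileNetV2 feature extractor.
mobilenet = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights='imagenet')
mobilenet.trainable = False

# 'audio_spectros/' stands in for the repopulated spectrogram directory.
train_data_gen = ImageDataGenerator(rescale=1 / 255.).flow_from_directory(
    'audio_spectros/', target_size=(128, 128), batch_size=32)

# Pull a batch explicitly instead of referencing the undefined image_batch.
train_image, train_label = next(train_data_gen)
feature_batch = mobilenet(train_image)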

Chapter 1: Colab Auth

When running the chapter 1 notebook on a GCP Notebook VM, Google Colab is not installed. The notebook executes successfully without running this cell.

Installing the package (google-colab) flags some incompatibilities and then produces:

ImportError: cannot import name 'ordereddict' from 'pandas.compat'

causing the remainder of the notebook to fail.
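
A common workaround (a sketch, not from the notebook) is to guard the Colab-specific cell so it becomes a no-op elsewhere:

# On a GCP Notebook VM the instance's service-account credentials are
# already in place, so the Colab auth step can simply be skipped.
try:
    from google.colab import auth
    auth.authenticate_user()
except ImportError:
    pass  # not running on Colab; nothing to do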

Chapter 6: How should I determine the number of bridged examples?

Hello, I am studying this book. Thank you for writing such a well-structured textbook.

I have a question about the Bridged Schema pattern (design pattern #23). How should I determine the amount of old data to add to the training data?

In this repository, it is stated that adding 60,000 old examples is best. However, in the line graph of the number of examples against the R² value, 60,000 sits near the bottom of the R² range. Since a higher R² value means better predictions, it looks as though prediction accuracy decreases as more old data is added.
From this graph, I might instead conclude that prediction accuracy is higher when training only on new data.

I would be glad if anyone could tell me why adding 60,000 old examples was judged to be best.
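
For what it's worth, the pattern's selection criterion is performance on held-out new data: train once per candidate amount of old data and keep the amount with the best validation metric. A toy, fully synthetic sketch of that sweep (nothing here is the notebook's data or code):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def make_data(n, shift):
    # 'shift' mimics old, bridged data being slightly off-distribution.
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + shift + rng.normal(scale=0.5, size=n)
    return X, y

X_new, y_new = make_data(5_000, shift=0.0)
X_old, y_old = make_data(60_000, shift=0.3)
X_val, y_val = make_data(2_000, shift=0.0)   # held-out *new* data

# Evaluate each candidate amount of old data on new-data validation R^2
# and pick the best; the winner depends on the data, not on a fixed rule.
for n_old in [0, 10_000, 30_000, 60_000]:
    X = np.vstack([X_new, X_old[:n_old]])
    y = np.concatenate([y_new, y_old[:n_old]])
    model = LinearRegression().fit(X, y)
    print(n_old, r2_score(y_val, model.predict(X_val)))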

Chapter 5: Up the patience to at least 5?

@munnm, in the 05_resilience/continuous_eval.ipynb notebook, please set some value for the patience argument (preferably > 1) in the EarlyStopping() callback. Otherwise, training stops after the first epoch without improvement. Users not familiar with Keras's EarlyStopping callback API might not understand what's wrong at first glance.
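
A minimal sketch of the suggested change (restore_best_weights is an extra suggestion, not something the notebook currently sets):

from tensorflow.keras.callbacks import EarlyStopping

# patience=5 tolerates transient plateaus in val_loss instead of stopping
# at the first epoch without improvement; restore_best_weights rolls the
# model back to the best epoch when training stops.
early_stopping = EarlyStopping(monitor='val_loss', patience=5,
                               restore_best_weights=True)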

Chapter 2: Mixed Representation

mixed_image_tabular_model = Model(inputs=[image_tabular_input, tiled_input], outputs=merged_image_output)

NameError: name 'tiled_input' is not defined

Should be 'image_input'?
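
If image_input is the Input layer defined earlier in the notebook, that looks right. A minimal self-contained sketch of the shape of the fix (all shapes and names here are illustrative):

import tensorflow as tf
from tensorflow.keras import Input, Model, layers

image_input = Input(shape=(28, 28, 1), name='image')
image_tabular_input = Input(shape=(10,), name='tabular')

x = layers.concatenate(
    [layers.Flatten()(image_input), image_tabular_input])
merged_image_output = layers.Dense(1, activation='sigmoid')(x)

# Every tensor passed to inputs= must be an Input that actually feeds the
# graph; referencing the undefined tiled_input raises the NameError above.
mixed_image_tabular_model = Model(
    inputs=[image_tabular_input, image_input],
    outputs=merged_image_output)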

Chapter 6: Storm Reports

With tfx==0.24, I had to substitute:

from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen

Also, the Transform step fails with a kernel restart; see tensorflow/tfx#2598.
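
For reference, the substitution side by side (the pre-0.24 import path is approximate):

# Earlier tfx releases:
# from tfx.components import BigQueryExampleGen

# tfx==0.24:
from tfx.extensions.google_cloud_big_query.example_gen.component import (
    BigQueryExampleGen)

example_gen = BigQueryExampleGen(query='SELECT ...')  # query elided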

Chapter 2: Feature Cross - Error in explanatory text

Dear authors,

The text

'''Creating a Feature Cross with BQML
Next, we'll create a feature cross of the features is_male and mother_race. To create a feature cross we apply ML.FEATURE_CROSS to a STRUCT of the features is_male and mother_race cast as a string. The STRUCT clause creates an ordered pair of the two features. The TRANSFORM clause is used for engineering features of our model. This allows us to specify all preprocessing during model creation and apply those preprocessing steps during prediction and evaluation. The rest of the features within the TRANSFORM clause remain unchanged.'''

has to be changed, as the two features crossed are is_male and plurality, not is_male and mother_race.
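
For reference, a hedged sketch of the kind of statement the corrected text describes (the dataset, model name, and column list are illustrative, not the notebook's exact query):

from google.cloud import bigquery

client = bigquery.Client()

# Cross is_male and plurality (cast to STRING) inside a TRANSFORM clause,
# so the same preprocessing is applied at prediction and evaluation time.
ddl = """
CREATE OR REPLACE MODEL `mydataset.natality_model`
TRANSFORM(
    weight_pounds,
    ML.FEATURE_CROSS(
        STRUCT(CAST(is_male AS STRING) AS is_male,
               CAST(plurality AS STRING) AS plurality)) AS gender_X_plurality
)
OPTIONS (model_type='linear_reg', input_label_cols=['weight_pounds']) AS
SELECT weight_pounds, is_male, plurality
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds IS NOT NULL
"""
client.query(ddl).result()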

Thanks

Best regards

Jerome

Chapter 3: Cascade evaluate ValueError: The pyarrow library is not installed

Dear authors,
The evaluate component of the pipeline fails because the pyarrow module is missing.

Solved by changing the packages_to_install request in the pipeline definition:

@dsl.pipeline(
    name='Cascade pipeline on SF bikeshare',
    description='Cascade pipeline on SF bikeshare'
)
def cascade_pipeline(
    project_id = PROJECT_ID
):
    ddlop = comp.func_to_container_op(run_bigquery_ddl, packages_to_install=['google-cloud-bigquery'])
        
    c1 = train_classification_model(ddlop, PROJECT_ID)
    c1_model_name = c1.outputs['created_table']
    
    c2a_input = create_training_data(ddlop, PROJECT_ID, c1_model_name, 'Typical')
    c2b_input = create_training_data(ddlop, PROJECT_ID, c1_model_name, 'Long')
    
    c3a_model = train_distance_model(ddlop, PROJECT_ID, c2a_input.outputs['created_table'], 'Typical')
    c3b_model = train_distance_model(ddlop, PROJECT_ID, c2b_input.outputs['created_table'], 'Long')
    
    evalop = comp.func_to_container_op(evaluate, packages_to_install=['google-cloud-bigquery[bqstorage,pandas]', 'pandas'])
    error = evalop(PROJECT_ID, c1_model_name, c3a_model.outputs['created_table'], c3b_model.outputs['created_table'])
    print(error.output)

Best Regards

Jerome

Chapter 3: Rebalancing

  • On a GCP AI Notebook, install xgboost first: !pip install xgboost
  • Colab auth fails, as per issue #5.
  • Needed to manually create the natality dataset in the US region.
  • Replaced df['NEAREST_CENTROIDS_DISTANCE'].iloc[0] with average_pred['NEAREST_CENTROIDS_DISTANCE'].iloc[0]

Errors in two_phase_predictions.ipynb

  • "audio" path should be "audio_spectros".
  • "image_batch" variable not found.
  • fit_generator() method deprecated, prefer fit().
  • The spectrometer png files get moved, so need to redownload or copy them.

Chapter 5: Continuous evaluation

Dear authors,

it seems to me that Section 2 of the continuous evaluation notebook needs to be updated. Setting up continuous evaluation is no longer straightforward and requires more setup information to be used flawlessly.

Thanks
Best Regards

Jerome

Chapter 5: Continued Evaluation: Dataset Access, EarlyStopping, Evaluation

  1. The munn-sandbox project is not publicly available, so the txtcls data is not available.

I recreated the data using code from https://datalab.office.datisan.com.au/notebooks/training-data-analyst/blogs/textclassification/txtcls.ipynb, as:

query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
(SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
title
FROM
bigquery-public-data.hacker_news.stories
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.
://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

from google.cloud import bigquery
client = bigquery.Client()
df = client.query(query).to_dataframe()
df.to_csv('titles_full.csv', header=False, index=False, encoding='utf-8', sep=',')

I had to swap the column order:
COLUMNS = ['source', 'title']

  2. With EarlyStopping enabled, training finished after just 2 epochs:

callbacks=[EarlyStopping(), TensorBoard(model_dir)],

Without it, the loss was minimised after 20 epochs.

  3. The Evaluation job section is still a to-do:

"some stuff here about setting up Eval jobs"
