hundredblocks / ml-powered-applications

Companion repository for the book Building Machine Learning Powered Applications

Home Page: https://mlpowered.com/

License: MIT License

Languages: Jupyter Notebook 99.58%, Python 0.36%, HTML 0.05%, Shell 0.01%

ml-powered-applications's Introduction

Building ML Powered Applications

Book cover

Welcome to the companion code repository for the O'Reilly book Building ML Powered Applications. The book is available on Amazon.

This repository consists of the following parts:

  • A set of Jupyter notebooks in the notebook folder that illustrate concepts covered in the book.

  • A library in the ml_editor folder containing core functions for the book's case study example, a machine-learning-driven writing assistant.

  • A Flask app that demonstrates a simple way to serve results to users.

  • The images/bmlpa_figures folder, which contains reproductions of a few figures that were hard to read in the first print version.

Credit and thanks go to Bruno Guisard, who conducted a thorough review of the code in this repository.

Setup instructions

Python environment

This repository has been tested on Python 3.6 and 3.7. It aims to support any Python 3 version.

To set up, start by cloning the repository:

git clone https://github.com/hundredblocks/ml-powered-applications.git

Then, navigate to the repository and create a Python virtual environment using virtualenv:

cd ml-powered-applications

virtualenv ml_editor

You can then activate it by running:

source ml_editor/bin/activate

Then, install the project requirements:

pip install -r requirements.txt

The library uses a few models from spaCy. To download the small and large English models (both required to run the app and the notebooks), run these commands from a terminal with your virtualenv activated:

python -m spacy download en_core_web_sm

python -m spacy download en_core_web_lg
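
Once both downloads finish, you can confirm the models load correctly (a minimal check, not part of the repository's code):

import spacy

nlp = spacy.load("en_core_web_sm")  # raises OSError if the download failed
print(nlp("A quick test sentence.")[0].text)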

Finally, the notebooks and library leverage the nltk package, which comes with a set of resources that need to be downloaded individually. To do so, open a Python session in an activated virtual environment, import nltk, and download the required resource.

Here is an example of how to do this for the punkt resource, from an active virtual environment with nltk installed:

python

import nltk

nltk.download('punkt')

Notebook examples

The notebook folder contains usage examples for concepts covered in the book. Most of the examples use only one of the subfolders in the archives (the one that contains data for writers.stackexchange.com).

I've included a processed version of the data as a .csv for convenience.
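
To sanity-check your environment, you can load this CSV with pandas (a minimal sketch; the file name writers.csv and its location under data/ are assumptions, so adjust the path to match your clone):

import pandas as pd

df = pd.read_csv("data/writers.csv")  # hypothetical path; point this at the included CSV
print(df.shape)
print(df.columns.tolist())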

If you want to generate this data yourself, or generate it for another subfolder, you should:

  • Download a subfolder from the Stack Exchange archives

  • Run parse_xml_to_csv to convert it to a DataFrame

  • Run generate_model_text_features to generate a DataFrame with precomputed features

The notebooks belong to a few categories of concepts, listed below:

  • Data Exploration and Transformation

  • Initial Model Training and Performance Analysis

  • Improving the Model

  • Model Comparison

  • Generating Suggestions from Models

Pretrained models

You can train and save models using the notebooks in the notebook folder. For convenience, I've included three trained models and two vectorizers, serialized in the models folder. These models are loaded by the notebooks demonstrating methods to compare model results, as well as by the Flask app.
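
Loading one of these serialized artifacts is a one-liner with joblib (a minimal sketch; vectorizer_1.pkl does ship in the models folder, but note the pickles may fail to load under much newer scikit-learn versions, as discussed in the issues below):

import joblib
from pathlib import Path

# Deserialize the vectorizer shipped with the repository.
vectorizer = joblib.load(Path("models") / "vectorizer_1.pkl")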

Running the prototype Flask app

To run the app, navigate to the root of the repository and run:

FLASK_APP=app.py flask run

The above command should spin up a local web app that you can access at http://127.0.0.1:5000/

Troubleshooting

If you have any questions or encounter any roadblocks, please feel free to open an issue or email me at [email protected].

Project structure inspired by the great Cookiecutter Data Science.

ml-powered-applications's People

Contributors

bguisard · dependabot[bot] · hundredblocks


ml-powered-applications's Issues

Not an issue, but a nice feature to have

This is not a bug report but a feature request: the notebook file names should be prefixed with the numbers of the chapters in which each file is discussed and used.
I have listed preliminary prefixed notebook file names below.
The author can check whether these prefixes are correct and, if so, rename the files accordingly.

ch04_ch05_clustering_data.ipynb
ch04_ch06_ch07_vectorizing_text.ipynb
ch04_ch06_dataset_exploration.ipynb
ch04_exploring_data_to_generate_features.ipynb
ch04_tabular_data_vectorization.ipynb
ch04_third_model.ipynb
ch05_black_box_explainer.ipynb
ch05_comparing_data_to_predictions.ipynb
ch05_feature_importance.ipynb
ch05_splitting_data.ipynb
ch05_top_k.ipynb
ch05_train_simple_model.ipynb
ch07_ch08_second_model.ipynb
ch07_ch11_comparing_models.ipynb
ch07_generating_recommendations.ipynb

/v1 does not work out of the box

I cloned the repo and followed the instructions, and the venv seems to have been installed fine. Following along with the book, I wanted to play with v1. When I click the "Get Recommendations" button, I see this on the server console:

127.0.0.1 - - [01/Jan/2020 18:52:38] "GET /v1 HTTP/1.1" 200 -
Exception on /v1 [POST]
Traceback (most recent call last):
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/me/gh/ml-powered-applications/app.py", line 15, in v1
    return handle_text_request(request, "v1.html")
  File "/Users/me/gh/ml-powered-applications/app.py", line 31, in handle_text_request
    suggestions = get_recommendations_from_input(question)
  File "/Users/me/gh/ml-powered-applications/ml_editor/ml_editor.py", line 279, in get_recommendations_from_input
    tokenized_sentences = preprocess_input(processed)
  File "/Users/me/gh/ml-powered-applications/ml_editor/ml_editor.py", line 45, in preprocess_input
    sentences = nltk.sent_tokenize(text)
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 105, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/nltk/data.py", line 868, in load
    opened_resource = _open(resource_url)
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/nltk/data.py", line 993, in _open
    return find(path_, path + ['']).open()
  File "/Users/me/gh/ml-powered-applications/ml_editor/lib/python3.7/site-packages/nltk/data.py", line 701, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/me/nltk_data'
    - '/Users/me/gh/ml-powered-applications/ml_editor/bin/../nltk_data'
    - '/Users/me/gh/ml-powered-applications/ml_editor/bin/../share/nltk_data'
    - '/Users/me/gh/ml-powered-applications/ml_editor/bin/../lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

127.0.0.1 - - [01/Jan/2020 18:52:44] "POST /v1 HTTP/1.1" 500 -

Is there an instruction missing?
I made sure nltk is installed in the venv, which is activated.
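
The traceback itself points at the likely fix: installing the nltk package does not fetch the punkt tokenizer data, which must be downloaded separately. From the same activated venv, this should resolve it (download_dir is optional; by default the data lands in one of the paths listed in the search output above):

import nltk

nltk.download('punkt')
# or, to target one of the searched locations explicitly:
# nltk.download('punkt', download_dir='/Users/me/nltk_data')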

can't run FLASK_APP=app.py flask run

Try 'flask run --help' for help.

Error: While importing 'app', an ImportError was raised:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/cli.py", line 218, in locate_app
    __import__(module_name)
  File "/Users/xingvoong/github/ml-study/ml-powered-applications/app.py", line 5, in <module>
    from ml_editor.ml_editor import get_recommendations_from_input
  File "/Users/xingvoong/github/ml-study/ml-powered-applications/ml_editor/ml_editor.py", line 5, in <module>
    import pyphen
ModuleNotFoundError: No module named 'pyphen'

I already installed pyphen.
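
A common cause (an assumption; the report doesn't confirm it) is that pyphen was installed into a different interpreter than the one running Flask, e.g. the system Python 3.9 shown in the traceback rather than the virtualenv. A quick check from the same shell you run flask from:

import sys

print(sys.executable)  # should point inside your virtualenv, not /usr/local
import pyphen          # raises ModuleNotFoundError if missing for this interpreter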

vectorizing_text.ipynb Cell #4 error

On my Windows 10 laptop with a GPU, 14 of this book's 15 notebook files completed successfully without errors. One notebook, vectorizing_text.ipynb, gives the error message "Kernel Restarting: the kernel appears to have died. It will restart automatically."

It dies on cell 4:
umap_embedder = umap.UMAP()
umap_bow = umap_embedder.fit_transform(bag_of_words)

Is this error due to insufficient RAM on my laptop?
Any suggestions or workarounds to fix this error, please? Thanks.

Can this program be run by lowering the size of bag_of_words?

How can I estimate the memory requirements of a notebook file?

My PC has an Intel i7-9750H CPU @ 2.60 GHz,
an NVIDIA GeForce RTX 2070 with Max-Q Design, and 16 GB of RAM.

What kind of hardware did you use to test these programs?
How much RAM does your computer have?

Is it a UNIX machine?

I have UNIX on a basic, very low clock speed laptop (no GPU).
I don't think it will work.
I do not know how to set up my UNIX (Ubuntu 18.04) laptop for machine learning use.

Any suggestions to get this program working, please?

Thanks,
SSJ
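
One possible workaround (a sketch, not from the repo; the data path data/writers.csv and the column name full_text are assumptions): cap the vocabulary size so the bag-of-words matrix stays small enough for UMAP to embed in 16 GB of RAM.

import pandas as pd
import umap
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("data/writers.csv")  # hypothetical path to the processed data

# Capping max_features bounds the width of the bag-of-words matrix,
# which is what UMAP has to hold in memory.
vectorizer = CountVectorizer(max_features=5000)
bag_of_words = vectorizer.fit_transform(df["full_text"].astype(str))  # hypothetical column

umap_embedder = umap.UMAP()
umap_bow = umap_embedder.fit_transform(bag_of_words)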

Can't run pip install -r requirements.txt

Could you please provide a Docker image with all of this app's dependencies installed, so we can run your sample code?

We keep hitting many kinds of dependency errors while preparing the run environment.

Pandas 0.24.2 is not compatible with new versions of Python

Hello,

I am trying to run the project from the book. However, it requires Python 3.6 or 3.7, and Anaconda no longer offers these versions; the oldest one available is 3.8. Additionally, the code uses pandas 0.24.2, which is not compatible with the latest versions of Python.

I am having trouble following the book's project for these reasons. Is there a more recent repo?

ImportError: cannot import name 'joblib' from 'sklearn.externals', ModuleNotFoundError: No module named 'sklearn.externals.joblib'

Not sure if I have to update something, but I am currently having an issue running the code in the "Training a simple model" notebook, specifically with importing joblib. When I run the code as it is set up (the notebook currently imports with "from sklearn.externals import joblib"), I get the first error listed above. So I rewrote the line as "import joblib" and get the second error above. Not sure what to do from here.
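
For reference, the usual version-proof import pattern (scikit-learn removed sklearn.externals.joblib in 0.23; joblib has been a standalone package since):

try:
    from sklearn.externals import joblib  # scikit-learn < 0.23
except ImportError:
    import joblib  # scikit-learn >= 0.23: pip install joblib

Note that this only fixes the import itself; pickles written by the old sklearn.externals.joblib can still fail to load, as described in the next issue.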

Licence?

Hello,

Would you mind adding a licence to this project code?

Personally, I'm not going to use it directly for anything, but it might be a good starter for a project, and without a licence I'd hesitate to use it for that purpose.

Apache 2, MIT, CC-BY-SA, etc. would be great.

Thanks

ModuleNotFoundError: No module named 'sklearn.externals.joblib' on joblib.load

I'm trying to run train_simple_model.ipynb and already hit the ImportError: cannot import name 'joblib' from 'sklearn.externals', so I installed and upgraded joblib to 1.0.1 and did import joblib directly, which cleared that error.

However, in the first cell, the code fails along this path: from ml_editor.model_v1 import get_model_probabilities_for_input_texts --> VECTORIZER = joblib.load(curr_path / vectorizer_path) --> obj = _unpickle(fobj, filename, mmap_mode) --> obj = unpickler.load() --> dispatch[key[0]](self) --> klass = self.find_class(module, name) --> __import__(module, level=0) --> ModuleNotFoundError: No module named 'sklearn.externals.joblib'.

I suspect this has to do with how vectorizer_1.pkl was created. Is it because vectorizer_1.pkl was saved with the old joblib, so that loading it asks for the old joblib library?

I tried to recreate the 3 models and 2 vectorizers using my new joblib, hoping that this error would go away, then realized from searching for joblib.dump that I can't find where the models and vectorizers are created. It seems that vectorizer_1.pkl is only created at the end of train_simple_model.ipynb with joblib.dump(vectorizer, vectorizer_path), but it is already being used in the first cell of the notebook, leading to the error in this issue.

Are the artifacts in the models folder pre-trained somewhere already? If not, which notebooks generated them? (So I can run those notebooks with new libraries to create loadable versions of the pickles.) I hope to go through these notebooks without downgrading libraries or pinning to old versions, as that is not sustainable in the long run.

P.S. I also saw a ModuleNotFoundError: No module named 'sklearn.ensemble.forest' when loading models; it's probably because the pickled model was trained on an older scikit-learn API.
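
One way out, if retraining is acceptable (a sketch under stated assumptions: the data path data/writers.csv, the column name body_text, and the output file name are all hypothetical): refit the artifact with your current scikit-learn and re-dump it, then point the first cell at the new file.

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("data/writers.csv")  # hypothetical path to the included processed data

# Refit with current scikit-learn/joblib so the resulting pickle loads cleanly.
vectorizer = TfidfVectorizer().fit(df["body_text"].astype(str))  # hypothetical column name
joblib.dump(vectorizer, "models/vectorizer_refit.pkl")  # new name, to avoid clobbering the original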
