tirthajyoti / machine-learning-with-python Goto Github PK

View Code? Open in Web Editor NEW

3.0K 157.0 1.8K 98.99 MB

Practice and tutorial-style notebooks covering wide variety of machine learning techniques

Home Page: https://machine-learning-with-python.readthedocs.io/en/latest/

License: BSD 2-Clause "Simplified" License

Jupyter Notebook 99.78% Python 0.20% HTML 0.01% CSS 0.01%

numpy statistics pandas matplotlib regression scikit-learn classification clustering decision-trees random-forest

machine-learning-with-python's Introduction

Python Machine Learning Jupyter Notebooks (ML website)

Dr. Tirthajyoti Sarkar, Fremont, California (Please feel free to connect on LinkedIn here)

Also check out these super-useful Repos that I curated

Requirements

Python 3.6+
NumPy (pip install numpy)
Pandas (pip install pandas)
Scikit-learn (pip install scikit-learn)
SciPy (pip install scipy)
Statsmodels (pip install statsmodels)
MatplotLib (pip install matplotlib)
Seaborn (pip install seaborn)
Sympy (pip install sympy)
Flask (pip install flask)
WTForms (pip install wtforms)
Tensorflow (pip install tensorflow>=1.15)
Keras (pip install keras)
pdpipe (pip install pdpipe)

You can start with this article that I wrote in Heartbeat magazine (on Medium platform):

"Some Essential Hacks and Tricks for Machine Learning with Python"

Essential tutorial-type notebooks on Pandas and Numpy

Jupyter notebooks covering a wide range of functions and operations on the topics of NumPy, Pandans, Seaborn, Matplotlib etc.

Detailed Numpy operations
Detailed Pandas operations
Numpy and Pandas quick basics
Matplotlib and Seaborn quick basics
Advanced Pandas operations
How to read various data sources
PDF reading and table processing demo
How fast are Numpy operations compared to pure Python code? (Read my article on Medium related to this topic)
Fast reading from Numpy using .npy file format (Read my article on Medium on this topic)

Tutorial-type notebooks covering regression, classification, clustering, dimensionality reduction, and some basic neural network algorithms

Regression

Simple linear regression with t-statistic generation

Polynomial regression using scikit-learn pipeline feature (check the article I wrote on Towards Data Science)
Decision trees and Random Forest regression (showing how the Random Forest works as a robust/regularized meta-estimator rejecting overfitting)
Detailed visual analytics and goodness-of-fit diagnostic tests for a linear regression problem
Robust linear regression using HuberRegressor from Scikit-learn

Classification

Logistic regression/classification (Here is the Notebook)

k-nearest neighbor classification (Here is the Notebook)
Decision trees and Random Forest Classification (Here is the Notebook)
Support vector machine classification (Here is the Notebook) (check the article I wrote in Towards Data Science on SVM and sorting algorithm)

Naive Bayes classification (Here is the Notebook)

Clustering

K-means clustering (Here is the Notebook)
Affinity propagation (showing its time complexity and the effect of damping factor) (Here is the Notebook)
Mean-shift technique (showing its time complexity and the effect of noise on cluster discovery) (Here is the Notebook)
DBSCAN (showing how it can generically detect areas of high density irrespective of cluster shapes, which the k-means fails to do) (Here is the Notebook)
Hierarchical clustering with Dendograms showing how to choose optimal number of clusters (Here is the Notebook)

Dimensionality reduction

Principal component analysis

Deep Learning/Neural Network

Demo notebook to illustrate the superiority of deep neural network for complex nonlinear function approximation task
Step-by-step building of 1-hidden-layer and 2-hidden-layer dense network using basic TensorFlow methods

Random data generation using symbolic expressions

How to use Sympy package to generate random datasets using symbolic mathematical expressions.
Here is my article on Medium on this topic: Random regression and classification problem generation with symbolic expression

Synthetic data generation techniques

Notebooks here

Simple deployment examples (serving ML models on web API)

Serving a linear regression model through a simple HTTP server interface. User needs to request predictions by executing a Python script. Uses Flask and Gunicorn.
Serving a recurrent neural network (RNN) through a HTTP webpage, complete with a web form, where users can input parameters and click a button to generate text based on the pre-trained RNN model. Uses Flask, Jinja, Keras/TensorFlow, WTForms.

Object-oriented programming with machine learning

Implementing some of the core OOP principles in a machine learning context by building your own Scikit-learn-like estimator, and making it better.

See my articles on Medium on this topic.

Unit testing ML code with Pytest

Check the files and detailed instructions in the Pytest directory to understand how one should write unit testing code/module for machine learning models

Memory and timing profiling

Profiling data science code and ML models for memory footprint and computing time is a critical but often overlooed area. Here are a couple of Notebooks showing the ideas,

machine-learning-with-python's People

Contributors

Stargazers

Watchers

Forkers

pursh2002 overfitter harshilpatel548 sa11 sheshankvamshi4 safibaig vserpak bigdata06 paulozip ankancode nhatnguyen12 optionalg marclachapelle benjaminudoh10 hbcbh1999 nalinik dgq2011 thegreenjedi shah0150 manilwagle shabbir-hasan oav711 s91-maker sthitaprajnas vanradd waiyong nojuman ftolleson cruncharlie kamilamin vahagn37 karant17 linlanggirl levyhe sunqiangzai bobking1990 grseb9s winggy chellabeatrixkiddo amitgupta847 thgngu hinafirdaus salut7 birajaghoshal fitrialif wata909 iamneaz tyjk hubbucket-team lepy ralvite carlislelee goldenamir therolfe tito421 thangarajan8 dataswang faadal nanduriprabhakar avinashtangudu bansalkanav da115115 swapnilawasthi sandokhan saytosid kaggleabc wguo123 vrad1 anoru xiangqianzsh tarunsingh272 rkakamilan msoancah wojohowitz00 kashyap2108 raininging nanda-dash ephraimmagopa prateek2901 karthiknagaraja scottlynn73 sunshine1204 aaronvvv raghavendrachari08 wahe3bru yushu-liu zyyj007 jdc08161063 mist-lin jsrlavande little1tow oulongwen fooway allensmile weixianliao googlepeng pileggio wyx1227 mfcardenas nicoleyesnicole

machine-learning-with-python's Issues

Question about How fast are NumPy ops.ipynb

Hey just wondering, for the How fast are NumPy ops.ipynb

When considering the speed for the log(10) of all the elements in the Numpy array a1, shouldn't you also include the creation of the initial Numpy array?

Line 50 is this:

t1=time.time()
a2=np.log10(a1)
t2 = time.time()
print("With direct NumPy log10 method it took {} seconds".format(t2-t1))
speed.append(t2-t1)

But isn't it more fair to make it this:

t1=time.time()
a1 = np.array(l1)
a2=np.log10(a1)
t2 = time.time()
print("With direct NumPy log10 method it took {} seconds".format(t2-t1))
speed.append(t2-t1)

Considering that it is an additional step not present in the other methods? In your code the bolded line is line 40.

df1.csv? df2.csv?

May I know where I can download the df1.csv/df2.csv in the Pandas Operations notebook? Thanks

Using scipy's genetic algorithm for initial parameter estimation in gradient descent

I see you are writing Python code for optimization on GitHub. A general problem for gradient descent and other non-linear algorithms - particularly for more complex equations - is the choice of initial parameters to start the "descent" in error space. Without good starting parameters, the algorithm will stop in a local error minimum. For this reason the authors of scipy have added a genetic algorithm for initial parameter estimation for use in gradient descent. The module is named scipy.optimize.differential_evolution.

I have used scipy's Differential Evolution genetic algorithm to determine initial parameters for fitting a double Lorentzian peak equation to Raman spectroscopy of carbon nanotubes and found that the results were excellent. The GitHub project, with a test spectroscopy data file, is:

https://github.com/zunzun/RamanSpectroscopyFit

If you have any questions, please let me know. My background is in nuclear engineering and industrial radiation physics, and I love Python, so I will be glad to help.

Adding Reinforcement Machine Learning models

I don't see any reinforcement machine learning models. Can I add UCB and Thompson sampling? I would love to work on it.

Machine-Learning-with-Python/Classification/DecisionTrees_RandomForest_Classification.ipynb

Add link to license in readme

Thought it may be helpful/convenient to have a link to the license in the readme

Add indications on how to run Jupyter notebooks with Docker in a few minutes

The https://github.com/machine-learning-helpers/docker-python-jupyter project builds a Docker image so that the (your) Jupyter notebooks can be run out-of-the-box on almost any platform in a few minutes.

It gives something like:

Initialization of the Git repository for the Jupyter notebooks:

$ mkdir -p ~/dev/ml
$ cd ~/dev/ml
$ git clone https://github.com/tirthajyoti/PythonMachineLearning.git

Initialization of the Docker image to run those Jupyter notebooks:

$ docker pull artificialintelligence/python-jupyter

Usage:

$ cd ~/dev/ml/PythonMachineLearning
$ docker run -d -p 9000:8888 -v ${PWD}:/notebook -v ${PWD}:/data artificialintelligence/python-jupyter

And then you can open http://localhost:9000 in your browser.

Any modification to the notebooks may be committed to the Git repository (if you are registered as a contributor), and/or submitted as a pull request.

Shutdown the Docker image

$ docker ps
CONTAINER ID        IMAGE                                   COMMAND                  CREATED             STATUS              PORTS                    NAMES
431b12a93ccf        artificialintelligence/python-jupyter   "/bin/sh -c 'jupyt..."   4 minutes ago       Up 4 minutes        0.0.0.0:9000->8888/tcp   friendly_euclid
$ docker kill 431b12a93ccf

So, all the above could be added to your README.md file.

Statistically significant function in regression model

Hi,

I'm wondering what the yesno-fuction does in the following notebook:
https://github.com/tirthajyoti/Machine-Learning-with-Python/blob/master/Regression/Regression_Diagnostics.ipynb

def yes_no(b):
if b:
return 'Yes'
else:
return 'No'

It should decide whether a parameter is significantly important or not for the model?
Where does the b refer to and what's the threshold for it to decide it's not statistically significant?

I usually look at the p-values in the statsmodels-ols table and when they fall below 0.05, they are significant, but in this notebook something else seems to be happening and I'm wondering if you could elaborate a bit on it (What is b?, how is it calculated?, what's the b's threshold? How to change the threshold from 0.01 to 0.05?) When the p-value in the ols-table is above 0.05, but the yes_no-function decides it's significant, what should I do (leave the parameter out or not)?

Kind regards,
Matthias

Wrong interpretation of the Shapiro-Wilk test

In the Regression_diagnostics notebook , you are presenting the Shapiro-Wilk test.

The Shapiro-Wilk test's null hypothesis is that the data come from a Gaussian distribution. Therefore, the lower the p-value, the higher the change to reject the Gaussian distribution. The notebook says the opposite: