
dlab-berkeley / python-machine-learning

D-Lab's 6-hour introduction to machine learning in Python. Learn how to perform classification, regression, clustering, and model selection using scikit-learn.

License: Other

Jupyter Notebook 100.00%
machine-learning scikit-learn classification regression auto-ml clustering python


python-machine-learning's Issues

03_preprocessing fit vs fit_transform vs transform

We received a lot of questions about fit_transform and transform.
A dedicated section briefly explaining fit vs. fit_transform vs. transform (or a link to a document/article) might help participants use these methods correctly in the future.
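
A minimal sketch of the distinction, using StandardScaler as an example transformer:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_train, X_test = train_test_split(X, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean/std from the training data only
X_train_scaled = scaler.transform(X_train)  # apply the learned scaling
# fit_transform is shorthand for fit followed by transform on the same data:
X_train_scaled = scaler.fit_transform(X_train)
# On test data, call only transform: reusing the training statistics avoids leakage
X_test_scaled = scaler.transform(X_test)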

Confusion matrix in 1_classification.ipynb

1_classification.ipynb

In the markdown cell on the confusion matrix:
-----Incorrect-----

  1. Precision:
    (Predicted Positives) / (True Positives)
  2. Recall (or Sensitivity):
    (Predicted Positives) / (Condition Positives)
  3. Specificity (like recall for negative examples):
    (Predicted Negatives) / (Condition Negatives)

This should be corrected to:
-----Correct-----

  1. Precision:
    (True Positives) / (Predicted Positives)
  2. Recall (or Sensitivity):
    (True Positives) / (Condition Positives)
  3. Specificity (like recall for negative examples):
    (True Negatives) / (Condition Negatives)
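
For reference, the corrected definitions map directly onto the entries of scikit-learn's confusion matrix; a minimal sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision   = tp / (tp + fp)  # true positives / predicted positives
recall      = tp / (tp + fn)  # true positives / condition positives
specificity = tn / (tn + fp)  # true negatives / condition negatives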

03_preprocessing

The code breaks at the following line:

data = data.dropna(subset='sex')

See comments in the notebook. Likely an issue with dependencies.
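
One likely fix, assuming this is a pandas version issue: older pandas releases require subset to be list-like, and passing a list works across versions:

# subset as a list is accepted by both older and newer pandas versions
data = data.dropna(subset=["sex"])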

Python Machine Learning Updates

@EastBayEv Here is the list of updates for this workshop.
Part 1 - Classification:

  • Include code on how to plot and visualize decision trees (see the sketch after this list)
  • Potentially use a more interesting dataset

Part 2 - Regression:

  • Minor update to the regression section in line 5 to remove an error (need to create a copy of the data before assigning values)
  • Maybe some notes about non-parametric methods?

Part 3 - Clustering:

  • Consider adding a section on visualizing and using dendrograms
  • Consider including a section on dimensionality reduction (e.g., PCA), because these techniques usually work in tandem with clustering
  • Other advanced topics, like self-organizing maps (SOMs), in challenges or future readings

Part 4 - TPOT:

  • Minor update in line 6 of the TPOT section for Windows machines (using !cat)
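
For the decision-tree visualization item above, a minimal sketch using scikit-learn's built-in plot_tree (the breast-cancer dataset is just a placeholder for whatever dataset the notebook ends up using):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(14, 8))
plot_tree(tree, feature_names=list(X.columns),
          class_names=["malignant", "benign"], filled=True, ax=ax)
plt.show()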

Part 2: Regression

  1. Consider starting this section with an overview of the data-processing workflow, such as a flowchart or decision tree.
  2. Consider including guidelines on choosing which type of regression to use.
  3. Potentially we could emphasize the non-parametric / non-linear regression techniques, since I see these as the more likely cases in which people would turn to sklearn.
  4. I would also suggest introducing a little more scaffolding into the challenge exercises. (or potentially intersperse smaller challenge questions throughout a notebook). I find that going from passive reading/executing of the notebook to coding can be difficult sometimes. For example, break the challenge question down into steps, and include a bit of skeleton code for students to start from.

Estimators vs Transformers

A small terminology issue: The regression notebook seems to confuse "estimators" and "transformers". The notebook refers, for example, to SimpleImputer as an estimator but I think it should be classified as a transformer. My understanding is that estimators involve prediction, whereas transformers do not (i.e. preprocessing). Here's a stackoverflow discussion: https://stackoverflow.com/questions/54899647/what-is-the-difference-between-transformer-and-estimator-in-sklearn
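
A rough illustration of the API difference (both share fit, but differ in what comes after):

import numpy as np
from sklearn.impute import SimpleImputer           # transformer: fit, then transform
from sklearn.linear_model import LinearRegression  # predictor: fit, then predict

X = np.array([[1.0], [np.nan], [3.0]])
y = np.array([1.0, 2.0, 3.0])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)    # transforms the data, makes no predictions

model = LinearRegression().fit(X_imputed, y)
y_pred = model.predict(X_imputed)       # predicts a target from the data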

Solutions Notebook

  1. Remove the 'iid' parameter from the GridSearchCV objects, as it has been deprecated.

Slide deck suggestions

How about a slide toward the beginning about what a "model" is (vs algorithm, for example) in machine learning?

The workshop mainly uses scikit-learn. Do we want to explain why we are using this library, and maybe give a brief mention of the other big ML libraries, like PyTorch and TensorFlow?

Do we want to say something in this intro about deep learning even if we aren't going to get there in the workshop?

Do we want to mention the trend toward AutoML, particularly because there is a notebook on this included at the end of the workshop?

Vulnerabilities in requirements

Dependabot is raising various alerts about the requirements.txt file. Overall, the file is too complicated: it should contain only a few packages: numpy, matplotlib, pandas, scipy, scikit-learn, and tpot.

It will probably be easiest just to remake this file.
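
A sketch of the remade file (unpinned here; exact version bounds would need to be checked against the notebooks):

# requirements.txt -- core dependencies only
numpy
scipy
pandas
matplotlib
scikit-learn
tpot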

Comments from the workshop

Here is some feedback for the workshop and I'd love to participate in improving the workshop!

Presenter:

  • Excellent pacing and very clear explanation of complex concepts!
  • Multiple breaks helped pace the workshop, and participants used that time to ask questions.
  • I liked how the presenter showed what the future workshops on deep learning will cover!

Content:

  • This workshop covers a lot of important content and concepts! Participants said in the chat at the end that the workshop was very useful.
  • The instructor used an iPad to draw diagrams while explaining; embedding those drawings as images in the Jupyter notebook would be helpful.
  • The introductory slide decks explaining broad machine-learning concepts and what we will (and will not) cover were very helpful.
  • Some repetition of the process might be helpful (explicitly showing, case by case, what the model is, what the loss function is, what the goal is, etc.). There is a lot of jargon, so a cheat sheet / list of terms with brief definitions might also be helpful.
  • Bullet-point objectives for each lesson might help clarify which concepts each lesson covers!
  • It would be nice to have a "take-home message" slide that summarizes what the workshop covered.
  • Some challenge questions have no coding component (for example, lesson 03, challenge 1); these would be a good opportunity to use Zoom's poll function to gauge participants' understanding of the concepts.

Correlation in Regression Lesson

The correlation call did not work for some participants because newer pandas versions no longer default to numeric_only=True (see the 01_regression.ipynb Correlation issue below).

06_clustering - Pickling error

PicklingError: ("Can't pickle <class 'numpy.dtype[float64]'>: it's not found as numpy.dtype[float64]", 'PicklingError while hashing {'X': array([[-0.8137693 , 0.91314495, 0.57967737, 0.00940568, 0.89901425,\n -0.49161791],\n [ 1.63793717, -0.2249728 , -1.51950333, 0.40881057, -0.92860994,\n -0.29477607],\n [-0.9352552 , -0.59635859, 1.02684012, 0.36443225, 0.91528059,\n -0.38068585],\n ...,\n [-0.14376985, 0.91913504, 0.41820193, 2.33187118, 0.78938354,\n 0.40167704],\n [-0.68216227, 0.07453187, 1.30424664, 0.04392215, 0.83929671,\n 0.06888088],\n [ 0.20869125, 0.03859131, 0.58381776, -0.39986107, 0.47987732,\n -0.46492749]]), 'connectivity': None, 'n_clusters': None, 'return_distance': False}: PicklingError("Can't pickle <class 'numpy.dtype[float64]'>: it's not found as numpy.dtype[float64]")')

Part 1: Classification

  1. I think we've already decided to replace the iris dataset, but I'll include it here as a reminder. Whatever dataset we use, I would suggest putting it into a pandas DataFrame, since that will be the most common form for people to work with.

  2. Include a visualization of the decision tree, if possible.

  3. I think there is some inconsistency in the definitions of specificity, precision, and recall between the notebook and the slides, but I'm having trouble remembering exactly what it is.

  4. For each of the ML models included, I would suggest emphasizing the parameters that one is likely to tune via hyperparameter optimization.

  5. Instead of random forests, consider using the time to discuss another classification technique (such as SVM)

  6. Include a more explicit discussion of the number of cross-validation folds and how it plays into over- and under-fitting (a sketch follows this list).
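
For item 6, a sketch of how the effect of the fold count could be demonstrated (on a stand-in dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# More folds -> more training data per fit, but noisier per-fold estimates
for k in (3, 5, 10):
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=k)
    print(k, scores.mean().round(3), scores.std().round(3))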

Add a Readme

Do we want to add the usual Readme with install instructions, etc.?

01_regression.ipynb Correlation

When using pandas version 2.0 or later, the correlation code requires an additional argument.

Running data.corr() might give an error which looks like this:

ValueError: could not convert string to float: 'chevrolet chevelle malibu'

Need to change the code snippet to:

data.corr(numeric_only=True)
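
Alternatively, selecting the numeric columns first behaves the same under both older and newer pandas versions:

data.select_dtypes(include="number").corr()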

04_classification - seaborn

Issue with seaborn when plotting histograms. It doesn't seem to occur on machines other than Renata's; updating seaborn has been tried.

Re-organize and simplify slides

Since the slides are organized into distinct sections, I think that inserting headers in the slides to prompt the instructor and students when it is time to go to a notebook (and vice versa in the notebooks) will help the materials stand on their own more easily.

Also, I would suggest simplifying the slides, especially in the model evaluation section. Many of these topics are covered in the individual notebooks, so I would suggest focusing on a higher-level/abstract discussion of model evaluation rather than specific metrics, perhaps also mentioning the role of researcher interpretation in model evaluation.

Notebook 4 formula

In the recall formula, instead of saying true positives over condition positives, Jose wanted the denominator written out explicitly as true positives + false negatives (i.e., recall = TP / (TP + FN)).

Regression suggestions

Quite a lot of time is spent on preprocessing in this notebook. Would it be possible to streamline the preprocessing a bit and focus more on the regression models themselves? Perhaps the preprocessing section could emphasize feature selection for the models rather than cleaning and transforming data?

The challenge at the end seems rather difficult and open-ended. I wonder whether learners are usually able to complete this quickly enough to go over together by the end of the workshop?

General Workshop Improvements

  • Replace Iris dataset
  • There is no baseline model for classification (a decision tree?). What about logistic regression?
  • Some sections feel light on explanation (e.g., feature importances, comparing different algorithms; no ROC curves?)
  • Other types of hyperparameter tuning (random search, Bayesian search)
  • XGBoost is generally considered the gold standard for shallow learning models. Replace AdaBoost?
  • Code could be cleaned up in general, with more comments
  • The regression section would be a great place to introduce general modeling pipelines (data cleaning, feature transformation, feature engineering (maybe not applicable here), model training, hyperparameter tuning/cross-validation, model evaluation)
  • No need for a separate DummyEncoder class: this can be handled using OneHotEncoder or even pandas get_dummies
  • If we are going to use a transformer + pipelines, we should think about adding the model object to the pipeline as well. In general this is better practice, since you can then save off entire model pipelines rather than just feature-transformation pipelines (see the sketch after this list).
  • I typically see KNN used for more naive classification rather than regression. Not sure it is necessary to include.
  • We don't talk about Naive Bayes in classification. I feel this is a canonical algorithm that could be introduced.
  • The absence of any dimensionality-reduction/latent-variable techniques for clustering seems like a gap.
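
For the pipeline point above, a minimal sketch of a full model pipeline (the column names are hypothetical):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "weight"]   # placeholder column names
categorical_cols = ["sex"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# With the model as the final step, the whole pipeline can be saved and reloaded as one object
pipe = Pipeline([("preprocess", preprocess), ("model", Ridge())])
# pipe.fit(X_train, y_train); pipe.predict(X_test)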

Part 3: Clustering

  1. I've noticed that the number of algorithms introduced in the classification and clustering sections is much smaller than in regression. For consistency, I would suggest including more examples of algorithms in the first and third notebooks (even if only as a reference for later).
  2. Dataset: I would encourage using an imported dataset (rather than made-up dots or random shapes), and the same dataset for both types of clustering. This would work well with the same dataset as classification (although for clustering, sticking to 2-D makes a lot of sense!). Using a real dataset would also let us emphasize how the user has to interpret the results of a clustering algorithm.
  3. I would also suggest ordering the notebooks such that classification and clustering are next to each other since they illustrate the difference between supervised and unsupervised learning.
  4. For both sections, I would include information on evaluating cluster fit with metrics, and how that can play into parameter selection.
  5. For the challenge section, I think that solutions are already included in this notebook.

Seaborn package

The seaborn package requires installation, but it is not mentioned in the notebooks.
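
For example, a short note near the imports would cover it:

# seaborn requires a separate installation: pip install seaborn
import seaborn as sns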

Simplify initial demonstration of models by using default parameter settings

In the classification notebook, when each machine learning model (e.g., decision trees) is introduced for the first time, it is instantiated with all hyperparameters explicitly specified, each of which requires explanation. A simpler approach might be to first introduce the model using just the default settings to establish basic understanding, then explain the hyperparameters later for more advanced tuning.
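
A sketch of the suggested progression, on a stand-in dataset:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# First pass: defaults only, to establish the basic fit/predict workflow
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Later: revisit with explicit hyperparameters once the basics are understood
clf_tuned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
clf_tuned.fit(X, y)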

Part 2. Regression

  1. Imputation for Categorical Variables: The np.unique() output in the imputation section could be confusing compared to the previous output, where the NaNs are shown in a DataFrame. Consider converting cp_imp back to a pandas DataFrame to show the difference between the two after imputation.
  2. Dummy Encoding: I believe dummy encoding can be done by passing drop='first' as an argument to the sklearn OneHotEncoder object. This should remove the need to create a separate DummyEncoding class (see the sketch after this list).
  3. ColumnTransformer: Spelling mistakes in "ColumntTransformer for Combined Preprocessing" opening description -> "ColumntTransformer" should be "ColumnTransformer", "differntially" should be "differentially"
  4. Transform the test Data: Spelling mistake after data is saved -> "...everything else is just a matter of choosing your mdoel..." should be "model"
  5. GLM Ridge Regression: Spelling mistake in opening description -> "Ridge regression takes a hyerparameter..." should be "hyperparameter"
  6. GLM Ridge Regression: "Leave One Out Cross Validation" (LOOCV) is not explained. A "see more" link might be useful.
  7. Non-Linear Models: Might be helpful to include a quick explainer comparing linear vs non-linear models and pros/cons. Currently they are introduced without explanation.
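
For item 2, a minimal sketch (OneHotEncoder does accept drop='first'; the cp column and its values are illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"cp": ["typical", "atypical", "non-anginal", "typical"]})

# scikit-learn: drop the first level to get k-1 dummy encoding
enc = OneHotEncoder(drop="first")
dummies = enc.fit_transform(df[["cp"]]).toarray()

# pandas equivalent
dummies_pd = pd.get_dummies(df["cp"], drop_first=True)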

Part 3: Clustering

  1. General Note: May be useful to add some content on dimensionality reduction (e.g., PCA, SVD, LCA).
  2. K-Means Clustering: May want to touch upon the problem of determining the optimal number of clusters and some common heuristics (e.g., the elbow method, silhouette scores); see the sketch after this list.
  3. Agglomerative Clustering: Change the n_clusters parameter in the model from 3 to 2.
  4. Agglomerative Clustering: In the visualization, we assign a label to the scatter plots, but never call plt.legend() to add the legend.
  5. Challenge DBSCAN: When comparing the inferred clusters to the set of labels in the blobs data, we do not call len(set(labels)) to compare with the inferred number of clusters as we do in the moons data.
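
For item 2, a minimal sketch of scoring candidate cluster counts with the silhouette score:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))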

Classification suggestions

Most courses on ML teach regression before classification (including The Elements of Statistical Learning, which we recommend in the slide deck). Is there an advantage to reversing this and teaching classification first instead?

The most common algorithm for classification is logistic regression. Why don't we start with this here, before jumping to decision trees, etc.?

The explanation of each algorithm (how it works, what its advantages and disadvantages are, when to use it) seems rather thin. Do we want to give a bit more of an introduction (and intuition) to the algorithms being taught?
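
If logistic regression came first, the opening example could be this small (the dataset is a placeholder):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A high max_iter avoids convergence warnings on unscaled data
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))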

Part 4: TPOT

  1. I'm not super experienced with TPOT, but in teaching this section the notebook often feels a little incomplete and tacked on. I would suggest one of the following: a) expand the notebook to include information such as when and how to use TPOT, which algorithms are compatible with it, and which parameters are important to know or think about; or b) move the TPOT code to an appendix or extra resource rather than keeping it part of the core curriculum, and focus on teaching students how to select algorithms themselves.
  2. Dataset: This notebook also uses iris and should be replaced.
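
For reference, a minimal TPOT run, assuming the classic TPOTClassifier API (the dataset and settings are illustrative, not tuned):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # writes the winning pipeline as plain scikit-learn code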
