
dlab-berkeley / python-machine-learning

D-Lab's 6-hour introduction to machine learning in Python. Learn how to perform classification, regression, clustering, and model selection using scikit-learn.

License: Other

Jupyter Notebook 100.00%
machine-learning scikit-learn classification regression auto-ml clustering python


python-machine-learning's Issues

03_preprocessing fit vs fit_transform vs transform

We received a lot of questions about fit_transform and transform.
A dedicated section briefly explaining fit vs. fit_transform vs. transform (or a link to a document/article) might help participants use these methods correctly in the future.
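
A minimal sketch of the distinction, using StandardScaler as an example transformer:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_train, X_test = train_test_split(X, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean/std from the training data only
X_train_scaled = scaler.transform(X_train)  # apply the learned scaling
# fit_transform is shorthand for fit followed by transform on the same data:
X_train_scaled = scaler.fit_transform(X_train)
# On test data, call only transform: reusing the training statistics avoids leakage
X_test_scaled = scaler.transform(X_test)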

Confusion matrix in 1_classification.ipynb

1_classification.ipynb

In the markdown cell on the confusion matrix:
-----Incorrect-----

  1. Precision:
    (Predicted Positives) / (True Positives)
  2. Recall (or Sensitivity):
    (Predicted Positives) / (Condition Positives)
  3. Specificity (like recall for negative examples):
    (Predicted Negatives) / (Condition Negatives)

This should be corrected to:
-----Correct-----

  1. Precision:
    (True Positives) / (Predicted Positives)
  2. Recall (or Sensitivity):
    (True Positives) / (Condition Positives)
  3. Specificity (like recall for negative examples):
    (True Negatives) / (Condition Negatives)
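
For reference, the corrected definitions map directly onto the entries of scikit-learn's confusion matrix; a minimal sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision   = tp / (tp + fp)  # true positives / predicted positives
recall      = tp / (tp + fn)  # true positives / condition positives
specificity = tn / (tn + fp)  # true negatives / condition negatives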

03_preprocessing

The code breaks at the following line:

data = data.dropna(subset='sex')

See comments in the notebook. Likely an issue with dependencies.
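
One likely fix, assuming this is a pandas version issue: older pandas releases require subset to be list-like, and passing a list works across versions:

# subset as a list is accepted by both older and newer pandas versions
data = data.dropna(subset=["sex"])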

Python Machine Learning Updates

@EastBayEv Here is the list of updates for this workshop.
Part 1 - Classification:

  • Include code on how to plot and visualize decision trees (see the sketch after this list)
  • Potentially use a more interesting dataset

Part 2 - Regression:

  • Minor update to the regression section in line 5 to remove an error (need to create a copy of the data before assigning values)
  • Maybe some notes about non-parametric methods?

Part 3 - Clustering:

  • Consider adding a section on visualizing and using dendrograms
  • Consider including a section on dimensionality reduction (e.g., PCA), because these techniques usually work in tandem with clustering
  • Other advanced topics, like self-organizing maps (SOMs), in challenges or future readings

Part 4 - TPOT:

  • Minor update in line 6 of the TPOT section for Windows machines (using !cat)
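
For the decision-tree visualization item above, a minimal sketch using scikit-learn's built-in plot_tree (the breast-cancer dataset is just a placeholder for whatever dataset the notebook ends up using):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(14, 8))
plot_tree(tree, feature_names=list(X.columns),
          class_names=["malignant", "benign"], filled=True, ax=ax)
plt.show()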

Part 2: Regression

  1. Consider starting this section with an overview of the data-processing workflow, such as a flowchart or decision tree.
  2. Consider including guidelines on choosing which type of regression to use.
  3. Potentially we could emphasize the non-parametric / non-linear regression techniques, since I see these as the more likely cases in which people would turn to sklearn.
  4. I would also suggest introducing a little more scaffolding into the challenge exercises. (or potentially intersperse smaller challenge questions throughout a notebook). I find that going from passive reading/executing of the notebook to coding can be difficult sometimes. For example, break the challenge question down into steps, and include a bit of skeleton code for students to start from.

Estimators vs Transformers

A small terminology issue: The regression notebook seems to confuse "estimators" and "transformers". The notebook refers, for example, to SimpleImputer as an estimator but I think it should be classified as a transformer. My understanding is that estimators involve prediction, whereas transformers do not (i.e. preprocessing). Here's a stackoverflow discussion: https://stackoverflow.com/questions/54899647/what-is-the-difference-between-transformer-and-estimator-in-sklearn
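
A rough illustration of the API difference (both share fit, but differ in what comes after):

import numpy as np
from sklearn.impute import SimpleImputer           # transformer: fit, then transform
from sklearn.linear_model import LinearRegression  # predictor: fit, then predict

X = np.array([[1.0], [np.nan], [3.0]])
y = np.array([1.0, 2.0, 3.0])

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)    # transforms the data, makes no predictions

model = LinearRegression().fit(X_imputed, y)
y_pred = model.predict(X_imputed)       # predicts a target from the data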

Solutions Notebook

  1. Remove the 'iid' parameter from the GridSearchCV objects, as it has been deprecated.

Slide deck suggestions

How about a slide toward the beginning about what a "model" is (vs algorithm, for example) in machine learning?

The workshop mainly uses scikit-learn. Do we want to explain why we are using this library, and maybe give a brief mention of the other big ML libraries, like PyTorch and TensorFlow?

Do we want to say something in this intro about deep learning even if we aren't going to get there in the workshop?

Do we want to mention the trend toward AutoML, particularly because there is a notebook on this included at the end of the workshop?

Vulnerabilities in requirements

Dependabot is raising various alerts about the requirements.txt file. Overall, the file is too complicated: it should contain only a few packages: numpy, matplotlib, pandas, scipy, scikit-learn, and tpot.

It will probably be easiest just to remake this file.
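
A sketch of the remade file (unpinned here; exact version bounds would need to be checked against the notebooks):

# requirements.txt -- core dependencies only
numpy
scipy
pandas
matplotlib
scikit-learn
tpot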

Comments from the workshop

Here is some feedback for the workshop and I'd love to participate in improving the workshop!

Presenter:

  • Excellent pacing and very clear explanation of complex concepts!
  • Multiple breaks helped pace the workshop, and participants used that time to ask questions.
  • I liked how the presenter showed what the future workshops on deep learning will cover!

Content:

  • This workshop covers a lot of important content and concepts! Participants said in the chat at the end that the workshop was very useful.
  • The instructor used an iPad to draw diagrams while explaining; embedding those drawings as images in the Jupyter notebook would be helpful.
  • The introductory slide decks explaining broad machine-learning concepts and what we will (and will not) cover were very helpful.
  • Some repetition of the process might be helpful (explicitly showing, case by case, what the model is, what the loss function is, what the goal is, etc.). There is a lot of jargon, so a cheat sheet / list of terms with brief definitions might also be helpful.
  • Bullet-point objectives for each lesson might help clarify which concepts each lesson covers!
  • It would be nice to have a "take-home message" slide that summarizes what the workshop covered.
  • Some challenge questions have no coding component (for example, lesson 03, challenge 1); these would be a good opportunity to use Zoom's poll function to gauge participants' understanding of the concepts.

Correlation in Regression Lesson

The correlation call did not work for some participants because newer pandas versions no longer default to numeric_only=True (see the 01_regression.ipynb Correlation issue below).

06_clustering - Pickling error

PicklingError: ("Can't pickle <class 'numpy.dtype[float64]'>: it's not found as numpy.dtype[float64]", 'PicklingError while hashing {'X': array([[-0.8137693 , 0.91314495, 0.57967737, 0.00940568, 0.89901425,\n -0.49161791],\n [ 1.63793717, -0.2249728 , -1.51950333, 0.40881057, -0.92860994,\n -0.29477607],\n [-0.9352552 , -0.59635859, 1.02684012, 0.36443225, 0.91528059,\n -0.38068585],\n ...,\n [-0.14376985, 0.91913504, 0.41820193, 2.33187118, 0.78938354,\n 0.40167704],\n [-0.68216227, 0.07453187, 1.30424664, 0.04392215, 0.83929671,\n 0.06888088],\n [ 0.20869125, 0.03859131, 0.58381776, -0.39986107, 0.47987732,\n -0.46492749]]), 'connectivity': None, 'n_clusters': None, 'return_distance': False}: PicklingError("Can't pickle <class 'numpy.dtype[float64]'>: it's not found as numpy.dtype[float64]")')

Part 1: Classification

  1. I think we've already decided to replace the iris dataset, but I'll include it here as a reminder. Whatever dataset we use, I would suggest putting it into a pandas DataFrame, since that will be the most common form for people to work with.

  2. Include a visualization of the decision tree, if possible.

  3. I think there is some inconsistency in the definitions of specificity, precision, and recall between the notebook and the slides, but I'm having trouble remembering exactly what it is.

  4. For each of the ML models included, I would suggest emphasizing the parameters that one is likely to tune via hyperparameter optimization.

  5. Instead of random forests, consider using the time to discuss another classification technique (such as SVM)

  6. Include a more explicit discussion of the number of cross-validation folds and how it plays into over- and under-fitting (a sketch follows this list).
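
For item 6, a sketch of how the effect of the fold count could be demonstrated (on a stand-in dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# More folds -> more training data per fit, but noisier per-fold estimates
for k in (3, 5, 10):
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=k)
    print(k, scores.mean().round(3), scores.std().round(3))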

Add a Readme

Do we want to add the usual Readme with install instructions, etc.?

01_regression.ipynb Correlation

When using pandas version 2.0 or later, the correlation code requires an additional argument.

Running data.corr() might give an error which looks like this:

ValueError: could not convert string to float: 'chevrolet chevelle malibu'

Need to change the code snippet to:

data.corr(numeric_only=True)
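
Alternatively, selecting the numeric columns first behaves the same under both older and newer pandas versions:

data.select_dtypes(include="number").corr()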

04_classification - seaborn

Issue with seaborn when plotting histograms. It doesn't seem to occur on machines other than Renata's; updating seaborn has been tried.

Re-organize and simplify slides

Since the slides are organized into distinct sections, I think that inserting headers in the slides to prompt the instructor and students when it is time to go to a notebook (and vice versa in the notebooks) will help the materials stand on their own more easily.

Also, I would suggest simplifying the slides, especially in the model evaluation section. Many of these topics are covered in the individual notebooks, so I would suggest focusing on a higher-level/abstract discussion of model evaluation rather than specific metrics, perhaps also mentioning the role of researcher interpretation in model evaluation.

Notebook 4 formula

In the recall formula, instead of saying true positives over condition positives, Jose wanted the denominator written out explicitly as true positives + false negatives (i.e., recall = TP / (TP + FN)).

Regression suggestions

Quite a lot of time is spent on preprocessing in this notebook. Would it be possible to streamline the preprocessing a bit and focus more on the regression models themselves? Perhaps the preprocessing section could emphasize feature selection for the models rather than cleaning and transforming data?

The challenge at the end seems rather difficult and open-ended. I wonder whether learners are usually able to complete this quickly enough to go over together by the end of the workshop?

General Workshop Improvements

  • Replace Iris dataset
  • There is no baseline model for classification (a decision tree?). What about logistic regression?
  • Some sections feel light on explanation (e.g., feature importances, comparing different algorithms; no ROC curves?)
  • Other types of hyperparameter tuning (random search, Bayesian search)
  • XGBoost is generally considered the gold standard for shallow learning models. Replace AdaBoost?
  • Code could be cleaned up in general, with more comments
  • The regression section would be a great place to introduce general modeling pipelines (data cleaning, feature transformation, feature engineering (maybe not applicable here), model training, hyperparameter tuning/cross-validation, model evaluation)
  • No need for a separate DummyEncoder class: this can be handled using OneHotEncoder or even pandas get_dummies
  • If we are going to use a transformer + pipelines, we should think about adding the model object to the pipeline as well. In general this is better practice, since you can then save off entire model pipelines rather than just feature-transformation pipelines (see the sketch after this list).
  • I typically see KNN used for more naive classification rather than regression. Not sure it is necessary to include.
  • We don't talk about Naive Bayes in classification. I feel this is a canonical algorithm that could be introduced.
  • The absence of any dimensionality-reduction/latent-variable techniques for clustering seems like a gap.
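
For the pipeline point above, a minimal sketch of a full model pipeline (the column names are hypothetical):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "weight"]   # placeholder column names
categorical_cols = ["sex"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# With the model as the final step, the whole pipeline can be saved and reloaded as one object
pipe = Pipeline([("preprocess", preprocess), ("model", Ridge())])
# pipe.fit(X_train, y_train); pipe.predict(X_test)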

Part 3: Clustering

  1. I've noticed that the number of algorithms introduced in the classification and clustering sections is much smaller than in regression. For consistency, I would suggest including more examples of algorithms in the first and third notebooks (even if only as a reference for later).
  2. Dataset: I would encourage using an imported dataset (rather than made-up dots or random shapes), and the same dataset for both types of clustering. This would work well with the same dataset as classification (although for clustering, sticking to 2-D makes a lot of sense!). Using a real dataset would also let us emphasize how the user has to interpret the results of a clustering algorithm.
  3. I would also suggest ordering the notebooks such that classification and clustering are next to each other since they illustrate the difference between supervised and unsupervised learning.
  4. For both sections, I would include information on evaluating cluster fit with metrics, and how that can play into parameter selection.
  5. For the challenge section, I think that solutions are already included in this notebook.

Seaborn package

The seaborn package requires installation, but it is not mentioned in the notebooks.
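
For example, a short note near the imports would cover it:

# seaborn requires a separate installation: pip install seaborn
import seaborn as sns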

Simplify initial demonstration of models by using default parameter settings

In the classification notebook, when each machine learning model (e.g., decision trees) is introduced for the first time, it is instantiated with all hyperparameters explicitly specified, each of which requires explanation. A simpler approach might be to first introduce the model using just the default settings to establish basic understanding, then explain the hyperparameters later for more advanced tuning.
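
A sketch of the suggested progression, on a stand-in dataset:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# First pass: defaults only, to establish the basic fit/predict workflow
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Later: revisit with explicit hyperparameters once the basics are understood
clf_tuned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
clf_tuned.fit(X, y)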

Part 2. Regression

  1. Imputation for Categorical Variables: The np.unique() output in the imputation section could be confusing compared to the previous output, where the NaNs are shown in a DataFrame. Consider converting cp_imp back to a pandas DataFrame to show the difference between the two after imputation.
  2. Dummy Encoding: I believe dummy encoding can be done by passing drop='first' as an argument to the sklearn OneHotEncoder object. This should remove the need to create a separate DummyEncoding class (see the sketch after this list).
  3. ColumnTransformer: Spelling mistakes in "ColumntTransformer for Combined Preprocessing" opening description -> "ColumntTransformer" should be "ColumnTransformer", "differntially" should be "differentially"
  4. Transform the test Data: Spelling mistake after data is saved -> "...everything else is just a matter of choosing your mdoel..." should be "model"
  5. GLM Ridge Regression: Spelling mistake in opening description -> "Ridge regression takes a hyerparameter..." should be "hyperparameter"
  6. GLM Ridge Regression: "Leave One Out Cross Validation" (LOOCV) is not explained. A "see more" link might be useful.
  7. Non-Linear Models: Might be helpful to include a quick explainer comparing linear vs non-linear models and pros/cons. Currently they are introduced without explanation.
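
For item 2, a minimal sketch (OneHotEncoder does accept drop='first'; the cp column and its values are illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"cp": ["typical", "atypical", "non-anginal", "typical"]})

# scikit-learn: drop the first level to get k-1 dummy encoding
enc = OneHotEncoder(drop="first")
dummies = enc.fit_transform(df[["cp"]]).toarray()

# pandas equivalent
dummies_pd = pd.get_dummies(df["cp"], drop_first=True)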

Part 3: Clustering

  1. General Note: May be useful to add some content on dimensionality reduction (e.g., PCA, SVD, LCA).
  2. K-Means Clustering: May want to touch upon the problem of determining the optimal number of clusters and some common heuristics (e.g., the elbow method, silhouette scores); see the sketch after this list.
  3. Agglomerative Clustering: Change the n_clusters parameter in the model from 3 to 2.
  4. Agglomerative Clustering: In the visualization, we assign a label to the scatter plots, but never call plt.legend() to add the legend.
  5. Challenge DBSCAN: When comparing the inferred clusters to the set of labels in the blobs data, we do not call len(set(labels)) to compare with the inferred number of clusters as we do in the moons data.
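
For item 2, a minimal sketch of scoring candidate cluster counts with the silhouette score:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))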

Classification suggestions

Most courses on ML teach regression before classification (including The Elements of Statistical Learning, which we recommend in the slide deck). Is there an advantage to reversing this and teaching classification first instead?

The most common algorithm for classification is logistic regression. Why don't we start with this here, before jumping to decision trees, etc.?

The explanation of each algorithm (how it works, what its advantages and disadvantages are, when to use it) seems rather thin. Do we want to give a bit more of an introduction (and intuition) to the algorithms being taught?
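
If logistic regression came first, the opening example could be this small (the dataset is a placeholder):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A high max_iter avoids convergence warnings on unscaled data
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))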

Part 4: TPOT

  1. I'm not super experienced with TPOT, but in teaching this section the notebook often feels a little incomplete and tacked on. I would suggest one of the following: a) expand the notebook to include information such as when and how to use TPOT, which algorithms are compatible with it, and which parameters are important to know or think about; or b) move the TPOT code to an appendix or extra resource rather than keeping it part of the core curriculum, and focus on teaching students how to select algorithms themselves.
  2. Dataset: This notebook also uses iris and should be replaced.
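
For reference, a minimal TPOT run, assuming the classic TPOTClassifier API (the dataset and settings are illustrative, not tuned):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # writes the winning pipeline as plain scikit-learn code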
