rhiever / data-analysis-and-machine-learning-projects

Repository of teaching materials, code, and data for my data analysis and machine learning projects.

Home Page: http://www.randalolson.com/blog/

Languages: Python 0.94%, Jupyter Notebook 99.06%
Topics: machine-learning, python, data-analysis, data-science, ipython-notebook, evolutionary-algorithm

data-analysis-and-machine-learning-projects's Introduction

Randy Olson's data analysis and machine learning projects

© 2016 - current, Randal S. Olson

This is a repository of teaching materials, code, and data for my data analysis and machine learning projects.

Each directory will (usually) correspond to one of the blog posts on my web site.

Be sure to check the documentation (usually in IPython Notebook format) in the directory you're interested in for the notes on the analysis, data usage terms, etc.

If you don't have the necessary software installed to run IPython Notebook, don't fret. You can use nbviewer to view a notebook on the web.

For example, if you want to view the notebook in the wheres-waldo-path-optimization directory, copy the full link to the notebook then paste it into nbviewer.

License

Instructional Material

All instructional material in this repository is made available under the Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.

You are free to:

  • Share—copy and redistribute the material in any medium or format
  • Adapt—remix, transform, and build upon the material

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution—You must give appropriate credit (mentioning that your work is derived from work that is © Randal S. Olson and, where practical, linking to http://www.randalolson.com/), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • No additional restrictions—You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:

  • You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
  • No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

Software

Except where otherwise noted, the example programs and other software provided in this repository are made available under the OSI-approved MIT license.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data-analysis-and-machine-learning-projects's People

Contributors

allardbrain, andrewliesinger, haraldschilly, igorrocha, mgschwan, nyoung85, pauldebus, rhiever, yaph


data-analysis-and-machine-learning-projects's Issues

Feedback from Sebastian on ML notebook

Feedback from @rasbt:

  • Okay, let me be very nit-picky here. I would either spell all the package names in lower-case or use the common convention: NumPy, seaborn, matplotlib, SciPy, scikit-learn.
  • In "scikit-learn: The main Machine Learning package in Python." I would suggest replacing "main" by "essential" or so. It is really great for basic stuff, and essential has the positive tone of "important" and also "fundamental" at the same time.
  • About the iris images: They look great! But one question: are they really attribution-free? I am wondering because I looked very hard to find some good ones that meet this criterion.
  • Since this is more of a beginner audience, maybe define "accuracy", e.g., "fraction of correctly classified flower samples".
  • "hand-measuring 100 randomly-sampled flowers of each species" -> Maybe use "50" so that the reader can directly relate to the dataset.
  • Instead of "scatter matrix", maybe consider the term "scatter plot matrix" since "scatter matrix" is typically something else: an "unnormalized" covariance matrix (e.g., in LDA)
  • Maybe mention that random forests are scale-invariant, e.g., you could mention that a typical procedure in the data preprocessing pipeline (required by most ML algos) is to scale the features, but that it can be skipped here because you are using decision trees (I believe this is the only scale-invariant algo that is used in ML) -- maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side note, but you probably already know this: most gradient-based optimization algos…
  • "There are several random forest classifier parameters that we can tune" -- yes there are, but typically, the idea behind random forest is that you don't need to tune any of these except for the number of trees.
  • "It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting); I suspect the high variance here comes from the fact that you are only using 10 trees, in RF you typically use hundreds or thousands of trees since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF since it is a very simple dataset that does not have many features (the random sampling of features is e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests. And in the end, conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.
  • When you plot the cross-val error, you could also print the standard deviation (see the sketch after this list).
  • RandomForestClassifier(n_estimators=10, max_depth=1); I wouldn't recommend showing people this example; it could give them the wrong idea: you don't prune trees in a forest.
  • Maybe also mention the problems with KNN, because people could think that it is typically a great classifier since it performs so well here. It's really susceptible to the curse of dimensionality, and you always have to keep the training set around (lazy learner). In this context, I would also mention that the scale of the features matters (if you use Euclidean distance) and in this case we don't have to worry about it because everything is in cm.
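
A minimal sketch of the n_estimators and standard-deviation suggestions above, assuming the notebook's features and labels are in placeholder variables all_inputs and all_classes:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Try increasingly large forests and report the spread, not just the mean score.
for n_trees in [10, 50, 100, 500]:
    scores = cross_val_score(RandomForestClassifier(n_estimators=n_trees),
                             all_inputs, all_classes, cv=10)
    print(n_trees, round(scores.mean(), 3), round(scores.std(), 3))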

KeyError while using seaborn plotting

Hi. I am getting KeyError 'class' while attempting to plot iris data.

sb.pairplot(iris_data.dropna(), hue='class')

gives the following stack trace; please advise.


KeyError Traceback (most recent call last)
in ()
----> 1 sb.pairplot(iris_data.dropna(), hue='class')

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/linearmodels.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws)
1583 hue_order=hue_order, palette=palette,
1584 diag_sharey=diag_sharey,
-> 1585 size=size, aspect=aspect, dropna=dropna, **grid_kws)
1586
1587 # Add the markers here as PairGrid has figured out how many levels of the

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/axisgrid.py in __init__(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, diag_sharey, size, aspect, despine, dropna)
1221 index=data.index)
1222 else:
-> 1223 hue_names = utils.categorical_order(data[hue], hue_order)
1224 if dropna:
1225 # Filter NA from the list of unique hue names

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1993
1994 def _getitem_column(self, key):

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1997 # get column
1998 if self.columns.is_unique:
-> 1999 return self._get_item_cache(key)
2000
2001 # duplicate columns & possible reduce dimensionality

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1343 res = cache.get(item)
1344 if res is None:
-> 1345 values = self._data.get(item)
1346 res = self._box_item_values(item, values)
1347 cache[item] = res

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3223
3224 if not isnull(item):
-> 3225 loc = self.items.get_loc(item)
3226 else:
3227 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1876 return self._engine.get_loc(key)
1877 except KeyError:
-> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key))
1879
1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 'class'
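
This usually means the DataFrame simply has no column named 'class'. A quick way to check, and a hedged sketch of the fix (the actual column name depends on how the CSV was read):

print(iris_data.columns.tolist())  # see what the label column is actually called

# If the label column is named differently (e.g. 'species'), rename it to match:
iris_data = iris_data.rename(columns={'species': 'class'})
sb.pairplot(iris_data.dropna(), hue='class')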

ML notebook: Expand data testing section

Expand the data testing section to create actual unit tests and explain assert statements a little better. Currently newcomers to unit tests would have no idea what's going on with assert statements.
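
For instance, a couple of minimal tests in that spirit (a sketch only; the column names and value ranges are assumptions based on the iris notebook):

# An assert raises AssertionError when its condition is False, which makes
# silent data problems loud and visible.
def test_species_names(iris_data):
    # Only the three known species should appear in the label column.
    assert set(iris_data['class'].unique()) <= {
        'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

def test_sepal_lengths(iris_data):
    # Sepal lengths outside this range are almost certainly entry errors.
    assert iris_data['sepal_length_cm'].between(2.5, 8.0).all()

test_species_names(iris_data)
test_sepal_lengths(iris_data)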

twitter got 400 in follower-factory

The entire error reads:
TwitterHTTPError: Twitter sent status 400 for URL: 1.1/application/rate_limit_status.json using parameters: (oauth_consumer_key=&oauth_nonce=8407703836662294415&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1607558162&oauth_version=1.0&oauth_signature=wkpQTGhB4qpoMvkpmvzUFMfNJ%2Bc%3D)
details: {'errors': [{'code': 215, 'message': 'Bad Authentication data.'}]}
It seems the authentication method or API is no longer valid.

Best parameters result not reproducible

Hi, this is a very helpful example. Fun to read and easy to follow. I just have one question. You close with Reproducibility, but each time I run the cell to compute the best parameters for DecisionTreeClassifier, I get different answers most of the time. That would appear to make the result not reproducible. Any reason?
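
The likely cause is unseeded randomness in both the classifier and the cross-validation shuffling. A hedged sketch of pinning it down, assuming a grid search similar to the notebook's (all_inputs/all_classes are placeholders):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Fixing random_state on everything that shuffles or breaks ties randomly
# makes repeated runs return the same best parameters.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid={'max_depth': [1, 2, 3, 4, 5],
                                'max_features': [1, 2, 3, 4]},
                    cv=cv)
grid.fit(all_inputs, all_classes)
print(grid.best_params_)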

Getting this error when I try to plot my DataFrame 'callers' with seaborn


KeyError Traceback (most recent call last)
~\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2894 try:
-> 2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'callers'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
in
----> 1 sns.pairplot(callers , hue = 'callers')

~\AppData\Roaming\Python\Python38\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48

~\AppData\Roaming\Python\Python38\site-packages\seaborn\axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
1923 # Set up the PairGrid
1924 grid_kws.setdefault("diag_sharey", diag_kind == "hist")
-> 1925 grid = PairGrid(data, vars=vars, x_vars=x_vars, y_vars=y_vars, hue=hue,
1926 hue_order=hue_order, palette=palette, corner=corner,
1927 height=height, aspect=aspect, dropna=dropna, **grid_kws)

~\AppData\Roaming\Python\Python38\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48

~\AppData\Roaming\Python\Python38\site-packages\seaborn\axisgrid.py in __init__(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, corner, diag_sharey, height, aspect, layout_pad, despine, dropna, size)
1212 index=data.index)
1213 else:
-> 1214 hue_names = categorical_order(data[hue], hue_order)
1215 if dropna:
1216 # Filter NA from the list of unique hue names

~\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2900 if self.columns.nlevels > 1:
2901 return self._getitem_multilevel(key)
-> 2902 indexer = self.columns.get_loc(key)
2903 if is_integer(indexer):
2904 indexer = [indexer]

~\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
-> 2897 raise KeyError(key) from err
2898
2899 if tolerance is not None:

KeyError: 'callers'
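
Here the root cause is that hue must name a column inside the DataFrame, and 'callers' is the name of the DataFrame itself, not one of its columns. A sketch of the fix (the real column name is whatever the first line prints):

print(callers.columns.tolist())            # list the actual column names
sns.pairplot(callers, hue='some_column')   # 'some_column' is a placeholder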

Hi, I'm getting a KeyError for 'species'; please advise after looking at this error

I am using seaborn, but this is just a command to count how many data points are present for each class.
I wrote this: iris["species"].value_counts()


KeyError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
----> 1 iris["species"].value_counts()

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

Thank you
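
As with the 'class' error above, this means the label column is not actually called 'species'. If the CSV was loaded without a header row, one option (a sketch; the file name and column order are assumptions) is to supply the names up front:

import pandas as pd

iris = pd.read_csv('iris.csv',
                   names=['sepal_length', 'sepal_width',
                          'petal_length', 'petal_width', 'species'])
print(iris['species'].value_counts())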

follower factory's alpha value calculation doesn't work for small accounts

That alpha = ... for the plot assumes a large number of followers. Otherwise, the alpha value is out of bounds!

I think this is an easy fix, e.g. the line below, although maybe there are better ideas out there for normalizing the transparency w.r.t. the total number of points to plot:

alpha=0.1 * min(9, 80000. / len(days_since_2006)))
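
The same idea in a hedged general form (the 0.9 ceiling and 0.005 floor are arbitrary choices; days_since_2006 is the per-follower series from the script):

n_points = len(days_since_2006)
# Clamp so alpha always stays inside matplotlib's valid (0, 1] range,
# fading the points out only as the follower count grows large.
alpha = max(0.005, min(0.9, 8000.0 / n_points))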

ML notebook: Add interpretation section

What features are being used to make the classification?

Type of Classification | Description | Example
Categorical (Nominal) | Classification of entities into particular categories. | "That thing is a dog." / "That thing is a car."
Ordinal | Classification of entities in some kind of ordered relationship. | "You are stronger than him." / "It is hotter today than yesterday."
Adjectival or Predicative | Classification based on some quality of an entity. | "That car is fast." / "She is smart."
Cardinal | Classification based on a numerical value. | "He is six feet tall." / "It is 25.3 degrees today."

Categorical classification is also called nominal classification because it classifies an entity in terms of the name of the class it belongs to. This is the type of classification we focus on in this document.

Why are those features important?

Let’s imagine that you’ve landed a consulting gig with a bank that has asked you to identify customers with a high likelihood of defaulting on next month’s bill. Armed with the machine learning techniques that you’ve learned and practiced, you analyze the data set given by your client and train a random forest that achieves reasonably high accuracy. Your next task is to present to the business stakeholders from the client’s team how you achieved these results. What would you say to them? Will they be able to understand all the hyperparameters of the algorithm that you tweaked in order to land on your final model? How will they react when you start talking about the number of estimators and the Gini criterion of the random forest?
Although it is important to be proficient in understanding the inner workings of the algorithm, it is far more essential to be able to communicate the findings to an audience who may not have any theoretical or practical knowledge of machine learning. Just showing that the algorithm predicts well is not enough. You have to attribute the predictions to the elements of the input data that contribute to your accuracy. Thankfully, the random forest implementation in sklearn does give an output called "feature importances" that helps us explain the predictive power of the features in the dataset. But there are certain drawbacks to this method, which we will explore in this post, along with an alternative technique for assessing feature importances that overcomes these drawbacks.
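
A minimal sketch of both approaches in scikit-learn (the model, data splits, and feature_names are placeholders; permutation importance is the usual alternative to the impurity-based scores):

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)

# Impurity-based importances: fast, but biased toward high-cardinality
# features and computed from the training data only.
print(dict(zip(feature_names, rf.feature_importances_)))

# Permutation importance: shuffle one feature at a time on held-out data
# and measure how much the score drops.
result = permutation_importance(rf, X_test, y_test, n_repeats=10)
print(dict(zip(feature_names, result.importances_mean)))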

What does that say about the problem domain?

A problem domain is the area of expertise or application that needs to be examined to solve a problem. Defining a problem domain means looking only at the topics of an individual's interest and excluding everything else. For example, when developing a system to measure good practice in medicine, carpet drawings at hospitals would not be included in the problem domain. In this example, the domain refers to relevant topics solely within the delimited area of interest: medicine. This points to a limitation of an overly specific, or overly bounded, problem domain. An individual may think they are interested in medicine and not interior design, but a better solution may exist outside of the problem domain as it was initially conceived. For example, IDEO researchers noticed that patients in hospitals spent a huge amount of time staring at acoustic ceiling tiles, which "became a symbol of the overall ambiance: a mix of boredom and anxiety from feeling lost, uninformed, and out of control."

UsageError: Line magic function `%install_ext` not found.

In the file Example Machine Learning Notebook.ipynb, code line 37: since %install_ext was deprecated, it is now better to ask the user to install watermark:

pip install watermark

followed by

%load_ext watermark

%watermark -a 'author' -nmv --packages numpy,pandas,sklearn,matplotlib,seaborn

ML notebook: Add preprocessing and a sklearn pipeline

From a Reddit comment:

Advice: I think you are missing a few big things like preprocessing/scaling and pipelines.

Before using the learners, inputs should be scaled so that each feature has equal weight. Either StandardScaler or MinMaxScaler is appropriate (both from sklearn.preprocessing). If you think some features are more important, you can scale them later to increase their relative importance in prediction. These are more parameters you would tune using CV, but they can be really numerous, so GridSearch is out the window and you would have to consider alternatives like Nelder-Mead search, genetic search, or multivariate gradient descent if you suspect convexity.

You have to fit these scalers on the training data and then use the trained fit to transform the testing data. Using Pipelines simplifies this whole process (fits the scaler and learner at once, transforms and predicts at once).
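
A hedged sketch of that advice (the classifier choice and parameter grid are illustrative; the key point is that the scaler inside the Pipeline is re-fit on each training fold):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scale', StandardScaler()),  # fit on training folds only
                 ('clf', SVC())])

grid = GridSearchCV(pipe,
                    param_grid={'clf__C': [0.1, 1, 10],
                                'clf__gamma': [0.01, 0.1, 1]},
                    cv=10)
grid.fit(all_inputs, all_classes)  # placeholders for the notebook's data
print(grid.best_params_, grid.best_score_)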

Ball Outcome

Cricinfo want to create a simple cricket simulator to test their scorecard and so have come to you for help.
After a ball is bowled, there are eight possible outcomes. Below we list the eight outcomes and each outcome’s effect on the score:
• 0 runs: add 1 to the ball count
• 1 run: add 1 to the ball count, add 1 to the run count
• 2 runs: add 1 to the ball count, add 2 to the run count
• 4 runs: add 1 to the ball count, add 4 to the run count, add 1 to the 4s count
• 6 runs: add 1 to the ball count, add 6 to the run count, add 1 to the 6s count
• Wide: add 1 to the extras count
• No ball: add 1 to the extras count
• Out: add 1 to the ball count, mark batsman as out
Cricinfo store a batsman’s record using the variable:
state = list(balls = 0, runs = 0, fours = 0, sixes = 0, extras = 0, out = FALSE)
Write the function oneBall that takes the input state and one outcome and returns the updated state based on the eight outcomes above.
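
The snippet above is R (list(...)); here is a sketch of oneBall in Python, to match the rest of this repository (field names taken from the state variable above):

def oneBall(state, outcome):
    # Return a copy of state updated for a single delivery.
    s = dict(state)
    if outcome in ('wide', 'no ball'):
        s['extras'] += 1          # wides and no-balls don't count as balls here
    elif outcome == 'out':
        s['balls'] += 1
        s['out'] = True
    else:                         # outcome is a run count: 0, 1, 2, 4, or 6
        s['balls'] += 1
        s['runs'] += outcome
        if outcome == 4:
            s['fours'] += 1
        elif outcome == 6:
            s['sixes'] += 1
    return s

state = {'balls': 0, 'runs': 0, 'fours': 0, 'sixes': 0, 'extras': 0, 'out': False}
state = oneBall(state, 4)  # a boundary: balls=1, runs=4, fours=1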

ML notebook: Add interpretation section

Add a section near the end trying to interpret the model:

  • What features are being used to make the classification?
  • Why are those features important?
  • What does that say about the problem domain?

Some stations do not report average daily temp

KSAF in Santa Fe, New Mexico does not record an average daily temperature, only the mean. In order for wunderground_parser.py to work in this case, all of the index values need to be shifted down by one after reading weather_data[0].
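
A heavily hedged sketch of the suggested workaround, assuming weather_data is the flat list of table values the parser reads (the flag and helper below are hypothetical):

# Hypothetical sketch: when the station omits the average-temperature field,
# every field after weather_data[0] sits one index earlier than the parser expects.
shift = 1 if station_missing_avg_temp else 0   # hypothetical per-station flag
value_at = lambda i: weather_data[i - shift]   # use for indices after weather_data[0]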
