rhiever / data-analysis-and-machine-learning-projects

Repository of teaching materials, code, and data for my data analysis and machine learning projects.

Home Page: http://www.randalolson.com/blog/

Languages: Python 0.94%, Jupyter Notebook 99.06%
Topics: machine-learning, python, data-analysis, data-science, ipython-notebook, evolutionary-algorithm

data-analysis-and-machine-learning-projects's Introduction

Randy Olson's data analysis and machine learning projects

© 2016 - current, Randal S. Olson

This is a repository of teaching materials, code, and data for my data analysis and machine learning projects.

Each directory will (usually) correspond to one of the blog posts on my web site.

Be sure to check the documentation (usually in IPython Notebook format) in the directory you're interested in for the notes on the analysis, data usage terms, etc.

If you don't have the necessary software installed to run IPython Notebook, don't fret. You can use nbviewer to view a notebook on the web.

For example, if you want to view the notebook in the wheres-waldo-path-optimization directory, copy the full link to the notebook then paste it into nbviewer.

License

Instructional Material

All instructional material in this repository is made available under the Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.

You are free to:

  • Share—copy and redistribute the material in any medium or format
  • Adapt—remix, transform, and build upon the material

for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution—You must give appropriate credit (mentioning that your work is derived from work that is © Randal S. Olson and, where practical, linking to http://www.randalolson.com/), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • No additional restrictions—You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:

  • You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
  • No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

Software

Except where otherwise noted, the example programs and other software provided in this repository are made available under the OSI-approved MIT license.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data-analysis-and-machine-learning-projects's People

Contributors

allardbrain, andrewliesinger, haraldschilly, igorrocha, mgschwan, nyoung85, pauldebus, rhiever, yaph


data-analysis-and-machine-learning-projects's Issues

Feedback from Sebastian on ML notebook

Feedback from @rasbt:

  • Okay, let me be very nit-picky here. I would either spell all the package names in lower-case or use the common convention: NumPy, seaborn, matplotlib, SciPy, scikit-learn.
  • In "scikit-learn: The main Machine Learning package in Python." I would suggest replacing "main" by "essential" or so. It is really great for basic stuff, and essential has the positive tone of "important" and also "fundamental" at the same time.
  • About the iris images: They look great! But one question: are they really attribution-free? I am wondering because I looked very hard to find some good ones that meet this criterion.
  • Since this is more of a beginner audience, maybe define "accuracy", e.g., "fraction of correctly classified flower samples".
  • "hand-measuring 100 randomly-sampled flowers of each species" -> Maybe use "50" so that the reader can directly relate to the dataset.
  • Instead of "scatter matrix", maybe consider the term "scatter plot matrix" since "scatter matrix" is typically something else: an "unnormalized" covariance matrix (e.g., in LDA)
  • Maybe mention that random forests are scale-invariant, e.g., you could mention that a typical procedure in the data preprocessing pipeline (required by most ML algos) is to scale the features, but that it can be skipped here because you are using decision trees (I believe this is the only scale-invariant algo that is used in ML) -- maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side note, but you probably already know this: most gradient-based optimization algos…
  • "There are several random forest classifier parameters that we can tune" -- yes there are, but typically, the idea behind random forest is that you don't need to tune any of these except for the number of trees.
  • "It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting); I suspect the high variance here comes from the fact that you are only using 10 trees, in RF you typically use hundreds or thousands of trees since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF since it is a very simple dataset that does not have many features (the random sampling of features is e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests. And in the end, conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.
  • When you plot the cross-val error, you could also print the standard deviation (see the sketch after this list).
  • RandomForestClassifier(n_estimators=10, max_depth=1); I wouldn't recommend showing people this example; it could give them the wrong idea: you don't prune trees in a forest.
  • Maybe also mention the problems with KNN, because people could think that it is typically a great classifier since it performs so well here. It's really susceptible to the curse of dimensionality, and you always have to keep the training set around (lazy learner). In this context, I would also mention that the scale of the features matters (if you use Euclidean distance) and in this case we don't have to worry about it because everything is in cm.
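
A minimal sketch of the n_estimators and standard-deviation suggestions above, assuming the notebook's features and labels are in placeholder variables all_inputs and all_classes:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Try increasingly large forests and report the spread, not just the mean score.
for n_trees in [10, 50, 100, 500]:
    scores = cross_val_score(RandomForestClassifier(n_estimators=n_trees),
                             all_inputs, all_classes, cv=10)
    print(n_trees, round(scores.mean(), 3), round(scores.std(), 3))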

KeyError while using seaborn plotting

Hi. I am getting KeyError 'class' while attempting to plot iris data.

sb.pairplot(iris_data.dropna(), hue='class')

gives the following stack trace; please advise.


KeyError Traceback (most recent call last)
in ()
----> 1 sb.pairplot(iris_data.dropna(), hue='class')

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/linearmodels.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws)
1583 hue_order=hue_order, palette=palette,
1584 diag_sharey=diag_sharey,
-> 1585 size=size, aspect=aspect, dropna=dropna, **grid_kws)
1586
1587 # Add the markers here as PairGrid has figured out how many levels of the

/Users/mgudipati/anaconda/lib/python2.7/site-packages/seaborn/axisgrid.py in __init__(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, diag_sharey, size, aspect, despine, dropna)
1221 index=data.index)
1222 else:
-> 1223 hue_names = utils.categorical_order(data[hue], hue_order)
1224 if dropna:
1225 # Filter NA from the list of unique hue names

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
1990 return self._getitem_multilevel(key)
1991 else:
-> 1992 return self._getitem_column(key)
1993
1994 def _getitem_column(self, key):

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
1997 # get column
1998 if self.columns.is_unique:
-> 1999 return self._get_item_cache(key)
2000
2001 # duplicate columns & possible reduce dimensionality

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1343 res = cache.get(item)
1344 if res is None:
-> 1345 values = self._data.get(item)
1346 res = self._box_item_values(item, values)
1347 cache[item] = res

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3223
3224 if not isnull(item):
-> 3225 loc = self.items.get_loc(item)
3226 else:
3227 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/mgudipati/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
1876 return self._engine.get_loc(key)
1877 except KeyError:
-> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key))
1879
1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 'class'
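
This usually means the DataFrame simply has no column named 'class'. A quick way to check, and a hedged sketch of the fix (the actual column name depends on how the CSV was read):

print(iris_data.columns.tolist())  # see what the label column is actually called

# If the label column is named differently (e.g. 'species'), rename it to match:
iris_data = iris_data.rename(columns={'species': 'class'})
sb.pairplot(iris_data.dropna(), hue='class')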

ML notebook: Expand data testing section

Expand the data testing section to create actual unit tests and explain assert statements a little better. Currently newcomers to unit tests would have no idea what's going on with assert statements.
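
For instance, a couple of minimal tests in that spirit (a sketch only; the column names and value ranges are assumptions based on the iris notebook):

# An assert raises AssertionError when its condition is False, which makes
# silent data problems loud and visible.
def test_species_names(iris_data):
    # Only the three known species should appear in the label column.
    assert set(iris_data['class'].unique()) <= {
        'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

def test_sepal_lengths(iris_data):
    # Sepal lengths outside this range are almost certainly entry errors.
    assert iris_data['sepal_length_cm'].between(2.5, 8.0).all()

test_species_names(iris_data)
test_sepal_lengths(iris_data)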

twitter got 400 in follower-factory

The entire error reads:
TwitterHTTPError: Twitter sent status 400 for URL: 1.1/application/rate_limit_status.json using parameters: (oauth_consumer_key=&oauth_nonce=8407703836662294415&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1607558162&oauth_version=1.0&oauth_signature=wkpQTGhB4qpoMvkpmvzUFMfNJ%2Bc%3D)
details: {'errors': [{'code': 215, 'message': 'Bad Authentication data.'}]}
It seems the authentication method or API is no longer valid.

Best parameters result not reproducible

Hi, this is a very helpful example. Fun to read and easy to follow. I just have one question. You close with Reproducibility, but each time I run the cell to compute the best parameters for DecisionTreeClassifier, I get different answers most of the time. That would appear to make the result not reproducible. Any reason?
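
The likely cause is unseeded randomness in both the classifier and the cross-validation shuffling. A hedged sketch of pinning it down, assuming a grid search similar to the notebook's (all_inputs/all_classes are placeholders):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Fixing random_state on everything that shuffles or breaks ties randomly
# makes repeated runs return the same best parameters.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid={'max_depth': [1, 2, 3, 4, 5],
                                'max_features': [1, 2, 3, 4]},
                    cv=cv)
grid.fit(all_inputs, all_classes)
print(grid.best_params_)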

Getting this error when I try to plot my DataFrame 'callers' with seaborn


KeyError Traceback (most recent call last)
~\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2894 try:
-> 2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'callers'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
in
----> 1 sns.pairplot(callers , hue = 'callers')

~\AppData\Roaming\Python\Python38\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48

~\AppData\Roaming\Python\Python38\site-packages\seaborn\axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
1923 # Set up the PairGrid
1924 grid_kws.setdefault("diag_sharey", diag_kind == "hist")
-> 1925 grid = PairGrid(data, vars=vars, x_vars=x_vars, y_vars=y_vars, hue=hue,
1926 hue_order=hue_order, palette=palette, corner=corner,
1927 height=height, aspect=aspect, dropna=dropna, **grid_kws)

~\AppData\Roaming\Python\Python38\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48

~\AppData\Roaming\Python\Python38\site-packages\seaborn\axisgrid.py in __init__(self, data, hue, hue_order, palette, hue_kws, vars, x_vars, y_vars, corner, diag_sharey, height, aspect, layout_pad, despine, dropna, size)
1212 index=data.index)
1213 else:
-> 1214 hue_names = categorical_order(data[hue], hue_order)
1215 if dropna:
1216 # Filter NA from the list of unique hue names

~\Anaconda\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2900 if self.columns.nlevels > 1:
2901 return self._getitem_multilevel(key)
-> 2902 indexer = self.columns.get_loc(key)
2903 if is_integer(indexer):
2904 indexer = [indexer]

~\Anaconda\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
-> 2897 raise KeyError(key) from err
2898
2899 if tolerance is not None:

KeyError: 'callers'
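
Here the root cause is that hue must name a column inside the DataFrame, and 'callers' is the name of the DataFrame itself, not one of its columns. A sketch of the fix (the real column name is whatever the first line prints):

print(callers.columns.tolist())            # list the actual column names
sns.pairplot(callers, hue='some_column')   # 'some_column' is a placeholder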

Hi, I'm getting a KeyError for 'species'; please advise after looking at this error

I am using seaborn, but this is just a command to count how many data points are present for each class.
I wrote this: iris["species"].value_counts()


KeyError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
----> 1 iris["species"].value_counts()

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'species'

Thank you
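
As with the 'class' error above, this means the label column is not actually called 'species'. If the CSV was loaded without a header row, one option (a sketch; the file name and column order are assumptions) is to supply the names up front:

import pandas as pd

iris = pd.read_csv('iris.csv',
                   names=['sepal_length', 'sepal_width',
                          'petal_length', 'petal_width', 'species'])
print(iris['species'].value_counts())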

follower factory's alpha value calculation doesn't work for small accounts

That alpha = ... for the plot assumes a large number of followers. Otherwise, the alpha value is out of bounds!

I think this is an easy fix, e.g. the line below, although maybe there are better ideas out there for normalizing the transparency w.r.t. the total number of points to plot:

alpha=0.1 * min(9, 80000. / len(days_since_2006)))
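
The same idea in a hedged general form (the 0.9 ceiling and 0.005 floor are arbitrary choices; days_since_2006 is the per-follower series from the script):

n_points = len(days_since_2006)
# Clamp so alpha always stays inside matplotlib's valid (0, 1] range,
# fading the points out only as the follower count grows large.
alpha = max(0.005, min(0.9, 8000.0 / n_points))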

ML notebook: Add interpretation section

What features are being used to make the classification?

Type of Classification | Description | Example
Categorical (Nominal) | Classification of entities into particular categories. | "That thing is a dog." / "That thing is a car."
Ordinal | Classification of entities in some kind of ordered relationship. | "You are stronger than him." / "It is hotter today than yesterday."
Adjectival or Predicative | Classification based on some quality of an entity. | "That car is fast." / "She is smart."
Cardinal | Classification based on a numerical value. | "He is six feet tall." / "It is 25.3 degrees today."

Categorical classification is also called nominal classification because it classifies an entity in terms of the name of the class it belongs to. This is the type of classification we focus on in this document.

Why are those features important?

Let’s imagine that you’ve landed a consulting gig with a bank that has asked you to identify customers with a high likelihood of defaulting on next month’s bill. Armed with the machine learning techniques that you’ve learned and practiced, you analyze the data set given by your client and train a random forest that achieves reasonably high accuracy. Your next task is to present to the business stakeholders from the client’s team how you achieved these results. What would you say to them? Will they be able to understand all the hyperparameters of the algorithm that you tweaked in order to land on your final model? How will they react when you start talking about the number of estimators and the Gini criterion of the random forest?
Although it is important to be proficient in understanding the inner workings of the algorithm, it is far more essential to be able to communicate the findings to an audience who may not have any theoretical or practical knowledge of machine learning. Just showing that the algorithm predicts well is not enough. You have to attribute the predictions to the elements of the input data that contribute to your accuracy. Thankfully, the random forest implementation in sklearn does give an output called "feature importances" that helps us explain the predictive power of the features in the dataset. But there are certain drawbacks to this method, which we will explore in this post, along with an alternative technique for assessing feature importances that overcomes these drawbacks.
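
A minimal sketch of both approaches in scikit-learn (the model, data splits, and feature_names are placeholders; permutation importance is the usual alternative to the impurity-based scores):

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)

# Impurity-based importances: fast, but biased toward high-cardinality
# features and computed from the training data only.
print(dict(zip(feature_names, rf.feature_importances_)))

# Permutation importance: shuffle one feature at a time on held-out data
# and measure how much the score drops.
result = permutation_importance(rf, X_test, y_test, n_repeats=10)
print(dict(zip(feature_names, result.importances_mean)))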

What does that say about the problem domain?

A problem domain is the area of expertise or application that needs to be examined to solve a problem. Defining a problem domain means looking only at the topics of an individual's interest and excluding everything else. For example, when developing a system to measure good practice in medicine, carpet drawings at hospitals would not be included in the problem domain. In this example, the domain refers to relevant topics solely within the delimited area of interest: medicine. This points to a limitation of an overly specific, or overly bounded, problem domain. An individual may think they are interested in medicine and not interior design, but a better solution may exist outside of the problem domain as it was initially conceived. For example, IDEO researchers noticed that patients in hospitals spent a huge amount of time staring at acoustic ceiling tiles, which "became a symbol of the overall ambiance: a mix of boredom and anxiety from feeling lost, uninformed, and out of control."

UsageError: Line magic function `%install_ext` not found.

In the file Example Machine Learning Notebook.ipynb, code line 37: since %install_ext was deprecated, it is now better to ask the user to install watermark:

pip install watermark

followed by

%load_ext watermark

%watermark -a 'author' -nmv --packages numpy,pandas,sklearn,matplotlib,seaborn

ML notebook: Add preprocessing and a sklearn pipeline

From a Reddit comment:

Advice: I think you are missing a few big things like preprocessing/scaling and pipelines.

Before using the learners, inputs should be scaled so that each feature has equal weight. Either StandardScaler or MinMaxScaler is appropriate (both from sklearn.preprocessing). If you think some features are more important, you can scale them later to increase their relative importance in prediction. These are more parameters you would tune using CV, but they can be really numerous, so GridSearch is out the window and you would have to consider alternatives like Nelder-Mead search, genetic search, or multivariate gradient descent if you suspect convexity.

You have to fit these scalers on the training data and then use the trained fit to transform the testing data. Using Pipelines simplifies this whole process (fits the scaler and learner at once, transforms and predicts at once).
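
A hedged sketch of that advice (the classifier choice and parameter grid are illustrative; the key point is that the scaler inside the Pipeline is re-fit on each training fold):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scale', StandardScaler()),  # fit on training folds only
                 ('clf', SVC())])

grid = GridSearchCV(pipe,
                    param_grid={'clf__C': [0.1, 1, 10],
                                'clf__gamma': [0.01, 0.1, 1]},
                    cv=10)
grid.fit(all_inputs, all_classes)  # placeholders for the notebook's data
print(grid.best_params_, grid.best_score_)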

Ball Outcome

Cricinfo want to create a simple cricket simulator to test their scorecard and so have come to you for help.
After a ball is bowled, there are eight possible outcomes. Below we list the eight outcomes and each outcome’s effect on the score:
• 0 runs: add 1 to the ball count
• 1 run: add 1 to the ball count, add 1 to the run count
• 2 runs: add 1 to the ball count, add 2 to the run count
• 4 runs: add 1 to the ball count, add 4 to the run count, add 1 to the 4s count
• 6 runs: add 1 to the ball count, add 6 to the run count, add 1 to the 6s count
• Wide: add 1 to the extras count
• No ball: add 1 to the extras count
• Out: add 1 to the ball count, mark batsman as out
Cricinfo store a batsman’s record using the variable:
state = list(balls = 0, runs = 0, fours = 0, sixes = 0, extras = 0, out = FALSE)
Write the function oneBall that takes the input state and one outcome and returns the updated state based on the eight outcomes above.
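
The snippet above is R (list(...)); here is a sketch of oneBall in Python, to match the rest of this repository (field names taken from the state variable above):

def oneBall(state, outcome):
    # Return a copy of state updated for a single delivery.
    s = dict(state)
    if outcome in ('wide', 'no ball'):
        s['extras'] += 1          # wides and no-balls don't count as balls here
    elif outcome == 'out':
        s['balls'] += 1
        s['out'] = True
    else:                         # outcome is a run count: 0, 1, 2, 4, or 6
        s['balls'] += 1
        s['runs'] += outcome
        if outcome == 4:
            s['fours'] += 1
        elif outcome == 6:
            s['sixes'] += 1
    return s

state = {'balls': 0, 'runs': 0, 'fours': 0, 'sixes': 0, 'extras': 0, 'out': False}
state = oneBall(state, 4)  # a boundary: balls=1, runs=4, fours=1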

ML notebook: Add interpretation section

Add a section near the end trying to interpret the model:

  • What features are being used to make the classification?
  • Why are those features important?
  • What does that say about the problem domain?

Some stations do not report average daily temp

KSAF in Santa Fe, New Mexico does not record an average daily temperature, only the mean. In order for wunderground_parser.py to work in this case, all of the index values need to be shifted down by one after reading weather_data[0].
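
A heavily hedged sketch of the suggested workaround, assuming weather_data is the flat list of table values the parser reads (the flag and helper below are hypothetical):

# Hypothetical sketch: when the station omits the average-temperature field,
# every field after weather_data[0] sits one index earlier than the parser expects.
shift = 1 if station_missing_avg_temp else 0   # hypothetical per-station flag
value_at = lambda i: weather_data[i - shift]   # use for indices after weather_data[0]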
