
learningdataminingwithpython's Introduction

LearningDataMiningWithPython

Updated code for the Learning Data Mining With Python book.

Libraries change, bugs get found, and things could use a little more explaining. That's the role of this repository: to act as a companion to the book "Learning Data Mining with Python", written by Robert Layton. This git repository will be updated with improved code and instructions, designed to further the lessons learnt in the book.

Buy the book here:

Scope of this Repo

At this stage, we won't be going past the scope of the book. Feel free to add a feature request, and I'll try to fulfil it somehow, somewhere.

Want to go further?

Check out the author's website at

I also have a site, LearningTensorFlow.com, if you want to learn about Google's TensorFlow framework. I also run the dataPipeline website, which hosts a blog covering all things data analysis, along with projects.

About This Book

Harness the power of Python to analyze data and create insightful predictive models

  • Learn data mining in practical terms, using a wide variety of libraries and techniques
  • Learn how to find, manipulate, and analyze data using Python
  • Step-by-step instructions on creating real-world applications of data mining techniques

Who This Book Is For

If you are a programmer who wants to get started with data mining, then this book is for you.

What You Will Learn

  • Apply data mining concepts to real-world problems
  • Predict the outcome of sports matches based on past results
  • Determine the author of a document based on their writing style
  • Use APIs to download datasets from social media and other online services
  • Find and extract good features from difficult datasets
  • Create models that solve real-world problems
  • Design and develop data mining applications using a variety of datasets
  • Set up reproducible experiments and generate robust results
  • Recommend movies, online celebrities, and news articles based on personal preferences
  • Compute on big data, including real-time data from the Internet

In Detail

The next step in the information age is to gain insights from the deluge of data coming our way. Data mining provides a way of finding this insight, and Python is one of the most popular languages for data mining, providing both power and flexibility in analysis.

This book teaches you to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis. Next, we move on to more complex data types including text, images, and graphs. In every chapter, we create models that solve real-world problems.

There is a rich and varied set of libraries available in Python for data mining. This book covers a large number of them, including the IPython Notebook, pandas, scikit-learn, and NLTK.

Each chapter of this book introduces you to new algorithms and techniques. By the end of the book, you will have gained substantial insight into using Python for data mining, along with a good understanding of the algorithms and their implementations.

learningdataminingwithpython's People

Contributors

robertlayton

learningdataminingwithpython's Issues

p197 net1.fit(X_train,y_train)

TypeError: ('An update must have the same type as the original shared variable (shared_var=hidden.b, shared_var.type=TensorType(float32, vector), update_val=Elemwise{add,no_inplace}.0, update_val.type=TensorType(float64, vector)).', 'If the difference is related to the broadcast pattern, you can call the tensor.unbroadcast(var, axis_to_unbroadcast[, ...]) function to remove broadcastable dimensions.')
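This error usually indicates a float32/float64 mismatch: with Theano configured as floatX=float32, the network's shared variables are float32, but NumPy arrays default to float64, so the computed updates come back as float64 and clash with the shared variables. A minimal workaround sketch, assuming X_train and y_train are NumPy arrays and net1 is the network object from the chapter:

import numpy as np

# Cast the inputs to match Theano's floatX=float32; integer class labels
# go to int32 for the softmax output layer.
X_train = X_train.astype(np.float32)
y_train = y_train.astype(np.int32)
net1.fit(X_train, y_train)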

Label Json format in ch6_label_twitter

I have created JSON files with tweets, and I followed your tutorial, in a way similar to https://github.com/giswqs/Learning-Python/blob/master/Learning-Data-Mining-with-Python/Chapter%206/ch6_label_twitter.ipynb

However, I do not get the
<IPython.core.display.Javascript at 0x10562f438> object in [119], and after I run [120] I get the error:
Javascript error adding output! ReferenceError: load_next_tweet is not defined See your browser Javascript console for more details.

List of what I have installed
alabaster==0.7.9
anaconda-clean==1.0
anaconda-client==1.5.1
anaconda-navigator==1.2.3
argcomplete==1.0.0
astroid==1.4.7
astropy==1.0.4
Babel==2.3.4
backports-abc==0.4
backports.shutil-get-terminal-size==1.0.0
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.5.1
bitarray==0.8.1
blaze==0.10.1
bokeh==0.12.2
boto==2.42.0
Bottleneck==1.0.0
cdecimal==2.3
cffi==1.7.0
chest==0.2.3
click==6.6
cloudpickle==0.2.1
clyent==1.2.2
colorama==0.3.7
configobj==5.0.6
configparser==3.5.0
contextlib2==0.5.3
cryptography==1.5
cycler==0.10.0
Cython==0.24.1
cytoolz==0.8.0
dask==0.11.0
datashape==0.5.2
decorator==4.0.10
dill==0.2.5
docutils==0.12
dynd==0.7.3.dev1
enum34==1.1.6
et-xmlfile==1.0.1
fastcache==1.0.2
filelock==2.0.6
Flask==0.11.1
Flask-Cors==2.1.2
funcsigs==1.0.2
functools32==3.2.3.post2
futures==3.0.5
gevent==1.1.2
greenlet==0.4.10
grin==1.2.1
h5py==2.5.0
HeapDict==1.0.0
idna==2.1
imagesize==0.7.1
ipaddress==1.0.16
ipykernel==4.5.0
ipython==5.1.0
ipython-genutils==0.1.0
ipywidgets==5.2.2
itsdangerous==0.24
jdcal==1.2
jedi==0.9.0
Jinja2==2.8
jsonschema==2.5.1
jupyter==1.0.0
jupyter-client==4.4.0
jupyter-console==5.0.0
jupyter-core==4.2.0
Keras==1.1.0
lazy-object-proxy==1.2.1
llvmlite==0.13.0
llvmpy==0.12.7
locket==0.2.0
lxml==3.6.4
MarkupSafe==0.23
matplotlib==1.4.3
mistune==0.7.3
mock==2.0.0
mpmath==0.19
multipledispatch==0.4.8
nb-anacondacloud==1.2.0
nb-conda==2.0.0
nb-conda-kernels==2.0.0
nbconvert==4.2.0
nbformat==4.1.0
nbpresent==3.0.2
networkx==1.11
nltk==3.2.1
nose==1.3.7
notebook==4.2.3
numba==0.15.1
numexpr==2.4.4
numpy==1.11.2
odo==0.5.0
openpyxl==2.3.2
pandas==0.17.1
partd==0.3.6
path.py==0.0.0
pathlib2==2.1.0
patsy==0.4.1
pbr==1.10.0
pep8==1.7.0
pexpect==4.0.1
pickleshare==0.7.4
Pillow==3.3.1
pkginfo==1.3.2
ply==3.9
prompt-toolkit==1.0.3
protobuf==3.1.0
psutil==4.3.1
ptyprocess==0.5.1
py==1.4.31
pyaml==16.9.0
pyasn1==0.1.9
pycairo==1.10.0
pycosat==0.6.1
pycparser==2.14
pycrypto==2.6.1
pycurl==7.43.0
pyflakes==1.3.0
Pygments==2.1.3
pylint==1.5.4
pyOpenSSL==16.0.0
pyparsing==2.0.3
pytest==2.9.2
python-dateutil==2.5.3
pytz==2016.6.1
PyYAML==3.12
pyzmq==15.4.0
QtAwesome==0.3.3
qtconsole==4.2.1
QtPy==1.1.2
redis==2.10.5
requests==2.11.1
rope==0.9.4
scikit-image==0.11.3
scikit-learn==0.19.dev0
scipy==0.17.1
simplegeneric==0.8.1
simplejson==3.10.0
singledispatch==3.4.0.3
six==1.10.0
snowballstemmer==1.2.1
sockjs-tornado==1.0.3
Sphinx==1.4.6
spyder==3.0.1
SQLAlchemy==1.0.13
statsmodels==0.6.1
sympy==1.0
tables==3.2.2
tensorflow==0.11.0rc0
terminado==0.6
Theano==0.8.2
toolz==0.8.0
tornado==4.4.1
traitlets==4.3.0
twitter==1.17.1
unicodecsv==0.14.1
wcwidth==0.1.7
Werkzeug==0.11.11
widgetsnbextension==1.2.6
wrapt==1.10.6
xgboost==0.40
xlrd==1.0.0
XlsxWriter==0.9.3
xlwt==1.1.2

Browser: Chrome on Ubuntu 14.04 (but the same issue appears under Win10).

in ch6 error

In the book, the code is:

from sklearn.base import TransformerMixin
from nltk import word_tokenize  # needed by NLTKBOW.transform

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
import os
input_filename = os.path.join(os.path.expanduser("~"), "data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "data", "twitter", "python_classes.json")
import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])
with open(labels_filename) as inf:
    labels = json.load(inf)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(pipeline, tweets, labels, cv=100, scoring='f1')
import numpy as np
print("Score:{:.3f}".format(np.mean(scores)))

error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-222b6b4d208f> in <module>()
      1 from sklearn.cross_validation import cross_val_score
----> 2 scores = cross_val_score(pipeline,tweets,labels,cv=100,scoring='f1')
      3 import numpy as np
      4 print("Score:{:.3f}".format(np.mean(scores)))

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1420         Array of scores of the estimator for each run of the cross validation.
   1421     """
-> 1422     X, y = indexable(X, y)
   1423 
   1424     cv = check_cv(cv, X, y, classifier=is_classifier(estimator))

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    199         else:
    200             result.append(np.array(X))
--> 201     check_consistent_length(*result)
    202     return result
    203 

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    174     if len(uniques) > 1:
    175         raise ValueError("Found arrays with inconsistent numbers of samples: "
--> 176                          "%s" % str(uniques))
    177 
    178 

ValueError: Found arrays with inconsistent numbers of samples: [ 1 98]

In the GitHub notebook, the code is:

# Labelling the class values for the twitter dataset.
import os
input_filename = os.path.join(os.path.expanduser("~"), "data", "twitter", "python_tweets.json")
classes_filename = os.path.join(os.path.expanduser("~"), "data", "twitter", "python_classes.json")
import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])
print("Loaded {} tweets".format(len(tweets)))
with open(classes_filename) as inf:
    labels = json.load(inf)
n_samples = min(len(tweets), len(labels))
sample_tweets = [t.lower() for t in tweets[:n_samples]]
labels = labels[:n_samples]
import numpy as np
y_true = np.array(labels)
print("{:.1f}% have class 1".format(np.mean(y_true == 1) * 100))
from sklearn.base import TransformerMixin
from nltk import word_tokenize

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                 for document in X]
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])
scores = cross_val_score(pipeline, sample_tweets, y_true, cv=10, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))

error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-321d5e05b52f> in <module>()
      3                      ('naive-bayes', BernoulliNB())
      4                      ])
----> 5 scores = cross_val_score(pipeline, sample_tweets, y_true, cv=10, scoring='f1')
      6 print("Score: {:.3f}".format(np.mean(scores)))

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1422     X, y = indexable(X, y)
   1423 
-> 1424     cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
   1425     scorer = check_scoring(estimator, scoring=scoring)
   1426     # We clone the estimator to make sure that all the folds are

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py in check_cv(cv, X, y, classifier)
   1675         if classifier:
   1676             if type_of_target(y) in ['binary', 'multiclass']:
-> 1677                 cv = StratifiedKFold(y, cv)
   1678             else:
   1679                 cv = KFold(_num_samples(y), cv)

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py in __init__(self, y, n_folds, shuffle, random_state)
    503                  random_state=None):
    504         super(StratifiedKFold, self).__init__(
--> 505             len(y), n_folds, shuffle, random_state)
    506         y = np.asarray(y)
    507         n_samples = y.shape[0]

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/cross_validation.py in __init__(self, n, n_folds, shuffle, random_state)
    243             raise ValueError(
    244                 ("Cannot have number of folds n_folds={0} greater"
--> 245                  " than the number of samples: {1}.").format(n_folds, n))
    246 
    247         if not isinstance(shuffle, bool):

ValueError: Cannot have number of folds n_folds=10 greater than the number of samples: 1.
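Both tracebacks point to the same underlying problem: the loaded labels do not line up with the loaded tweets (98 samples on one side and 1 on the other in the first run; a y of length 1 in the second, which is why even cv=10 is "greater than the number of samples: 1"). A minimal diagnostic sketch, assuming the chapter's file layout of one tweet JSON object per line and a single flat JSON list of labels:

import json
import os

input_filename = os.path.join(os.path.expanduser("~"), "data", "twitter", "python_tweets.json")
classes_filename = os.path.join(os.path.expanduser("~"), "data", "twitter", "python_classes.json")

# Tweets: one JSON object per line. Labels: a single flat JSON list of 0/1.
with open(input_filename) as inf:
    tweets = [json.loads(line)['text'] for line in inf if line.strip()]
with open(classes_filename) as inf:
    labels = json.load(inf)

# cross_val_score fails before fitting anything if these lengths differ.
print(len(tweets), len(labels))
# A labels length of 1 usually means python_classes.json holds a nested list
# or a single object rather than a flat list -- re-run the labelling notebook
# (ch6_label_twitter) so it writes one label per tweet before cross-validating.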

chap9 code / wrong enron folder

According to p201, "set the data folder for the Enron Dataset enron_data_folder=.....".
If you execute the script with this as-is, it won't work. A minor change would be to modify the default:


def get_enron_corpus(num_authors=10, data_folder=enron_data_folder,
                     min_docs_author=10, max_docs_author=100,
                     random_state=None):

Training on the Enron data

On page 158, in scores = cross_val_score(pipeline, documents, classes, scoring='f1'), how should the pipeline be constructed?
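To address the question: a minimal sketch of one plausible way to construct that pipeline, assuming a bag-of-words extractor feeding a classifier as elsewhere in the book; the feature extractor and parameters here are illustrative, not necessarily the page 158 originals:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score

# Illustrative authorship pipeline: character 3-gram counts feeding a linear
# SVM; swap in the chapter's own extractor and classifier as needed.
pipeline = Pipeline([('feature_extraction', CountVectorizer(analyzer='char', ngram_range=(3, 3))),
                     ('classifier', SVC(kernel='linear'))])
scores = cross_val_score(pipeline, documents, classes, scoring='f1')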

getdata.py fails to download books

While the 1st book for burton, with id 4657, was correct, the next one, with id 2400, failed: after looking into the gutenberg path, it seems that the generated URL does not point to the book. I got some issues with some other links too.
I got good results with: url_base = "http://eremita.di.uminho.pt"
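As a concrete sketch of that workaround (assuming getdata.py composes each download URL from a url_base prefix, as the issue implies):

# Point the base URL at a Project Gutenberg mirror that still serves the
# expected path layout; the rest of getdata.py is unchanged.
url_base = "http://eremita.di.uminho.pt"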

Missing directory name

In Chapter 9, getdata.py references "data_folder" many times (line 63, for example). The script will fail, as the variable is never defined.
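A minimal sketch of one way to resolve it, assuming the same ~/data layout used elsewhere in the book (the "books" subfolder name is an assumption; match your local path):

import os

# Hypothetical definition for the missing variable: getdata.py uses
# data_folder without ever assigning it, so define it near the top of the
# script before its first use.
data_folder = os.path.join(os.path.expanduser("~"), "data", "books")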

Chapter 11

hidden_layer = lasagne.layers.DenseLayer(input_layer, num_units=12, nonlinearity=lasagne.nonlinearities.sigmoid)
output_layer = lasagne.layers.DenseLayer(hidden_layer, num_units=3, nonlinearity=lasagne.nonlinearities.softmax)
theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W
I keep getting this error at runtime; any guidance would be appreciated.
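This DisconnectedInputError typically means the cost being differentiated is not actually connected to the parameter W, e.g. because the loss was built on a detached variable or the parameters were collected from the wrong layer. A minimal wiring sketch that avoids it, assuming Lasagne's standard training setup (layer sizes taken from the snippet above; other names are illustrative):

import theano
import theano.tensor as T
import lasagne

# The input layer must wrap input_var so the loss graph contains the weights.
input_var = T.matrix('inputs')
target_var = T.ivector('targets')
input_layer = lasagne.layers.InputLayer(shape=(None, 4), input_var=input_var)
hidden_layer = lasagne.layers.DenseLayer(input_layer, num_units=12,
                                         nonlinearity=lasagne.nonlinearities.sigmoid)
output_layer = lasagne.layers.DenseLayer(hidden_layer, num_units=3,
                                         nonlinearity=lasagne.nonlinearities.softmax)

# Take the prediction and the parameter list from the output layer; building
# the loss on anything not derived from get_output(output_layer) leaves the
# weights outside the cost's computational graph.
prediction = lasagne.layers.get_output(output_layer)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(output_layer, trainable=True)
updates = lasagne.updates.sgd(loss, params, learning_rate=0.1)
train_fn = theano.function([input_var, target_var], loss, updates=updates)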

How it works in Naive Bayes

Hi,

Thank you for your book: nice indeed. I'm reading it on Safari, and there is something I cannot understand. In the paragraph 'How it works' in the chapter 'Social Media Insight using Naive Bayes', shouldn't it be, for the sample [0, 0, 0, 1], P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0) = 0.7 x 0.6 x 0.6 x 0.7, instead of 0.3 x 0.6 x 0.6 x 0.7? For class 0 we have [0.3, 0.4, 0.4, 0.7], and it is said that 'The second and third values are 0.6, because the value of that feature in the sample was 0.' The first feature's value is 0 as well, so its factor should be 1 - 0.3 = 0.7.

Thank you for your help in advance
Bye
Fabio
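For what it's worth, the arithmetic can be checked mechanically. A minimal sketch, assuming Bernoulli features with P(feature=1 | C=0) = [0.3, 0.4, 0.4, 0.7] as quoted above:

import numpy as np

# Per-feature P(feature_i = 1 | C = 0) from the chapter's worked example.
p1_given_c0 = np.array([0.3, 0.4, 0.4, 0.7])
sample = np.array([0, 0, 0, 1])

# Bernoulli likelihood: use p where the feature is 1, and (1 - p) where it is 0.
per_feature = np.where(sample == 1, p1_given_c0, 1 - p1_given_c0)
print(per_feature)         # [ 0.7  0.6  0.6  0.7], matching the questioner's reading
print(per_feature.prod())  # P(D | C=0) under the naive independence assumption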

in ch5 ValueError

Code:

import os
import pandas as pd
import numpy as np

data_folder = os.path.join(os.path.expanduser("~"), "data", "Ads")
data_filename = os.path.join(data_folder, "ad.data")

def convert_number(x):
    try:
        return float(x)
    except ValueError:
        return np.nan

from collections import defaultdict
converters = defaultdict(convert_number)
converters[1558] = lambda x: 1 if x.strip() == "ad." else 0
ads = pd.read_csv(data_filename, header=None, converters=converters)
x = ads.drop(1558, axis=1).values
y = ads[1558]

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
xd = pca.fit_transform(x)
np.set_printoptions(precision=3, suppress=True)
pca.explained_variance_ratio_

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-0c33a13666bd> in <module>()
      1 from sklearn.decomposition import PCA
      2 pca = PCA(n_components=5)
----> 3 xd = pca.fit_transform(x)
      4 import numpy as np
      5 np.set_printoptions(precision=3,suppress=True)

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/decomposition/pca.py in fit_transform(self, X, y)
    239 
    240         """
--> 241         U, S, V = self._fit(X)
    242         U = U[:, :self.n_components_]
    243 

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/decomposition/pca.py in _fit(self, X)
    266             requested.
    267         """
--> 268         X = check_array(X)
    269         n_samples, n_features = X.shape
    270         X = as_float_array(X, copy=self.copy)

/home/kongnian/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

ValueError: could not convert string to float: '?'

OS Information:
Linux ubuntu 4.4.0-41-generic #61-Ubuntu SMP Tue Sep 27 17:27:48 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
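A likely culprit: pandas applies a converter only to columns whose keys already exist in the converters mapping, and a defaultdict does not create keys until they are accessed, so convert_number never runs and the '?' strings survive all the way to PCA. A minimal sketch of the usual workaround, mapping every column explicitly:

# Build an explicit dict so pandas sees every column index; column 1558
# keeps its separate label converter.
converters = {i: convert_number for i in range(1558)}
converters[1558] = lambda x: 1 if x.strip() == "ad." else 0
ads = pd.read_csv(data_filename, header=None, converters=converters)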

import statement update train_test_split

I noted that in the chapter one oner_application code you're using an outdated import statement for sklearn's train_test_split function. Please update as follows to comply with scikit-learn version 0.24.1.

Existing code:
from sklearn.cross_validation import train_test_split

Please update to:
from sklearn.model_selection import train_test_split

The code runs fine after that amendment, but since this import likely appears in other notebooks in this publication, the same change will be needed in subsequent files.
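For notebooks that should run under both old and new scikit-learn, a small compatibility shim is one option:

try:
    # scikit-learn >= 0.18 moved the cross-validation utilities here
    from sklearn.model_selection import train_test_split
except ImportError:
    # fall back for the older versions the book was written against
    from sklearn.cross_validation import train_test_split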
