topepo / fes

Code and Resources for "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Kuhn and Johnson

Home Page: https://bookdown.org/max/FES

License: GNU General Public License v2.0

HTML 63.09% R 35.55% Makefile 1.21% Shell 0.15%

fes's Introduction

Code and Data Sets for Feature Engineering and Selection by Max Kuhn and Kjell Johnson (2019).

Link to buy on Amazon

Link to buy from CRC Press

The repo currently contains:

  • A venue to ask questions or make comments about the book. Please help us make it better!
  • Data_Sets directory contains all of the new data sets used in the text.
  • Other directories contain code that reproduces the analyses in the book. Some analyses are contained in single subdirectories or files while others are split up (as is best for each case). Each analysis has a list of required R packages and enumerates the package versions that were used for the analyses contained in the text. Most analyses use visualizations and tables that are consistent with the on-line version and are interactive where possible.

(there is no code for chapter 10)

For questions or comments, please file an issue.

fes's People

Contributors

davft, kjell-stattenacity, topepo


fes's Issues

Elaboration On Pre-Processing

3.4.7 What Should Be Included Inside of Resampling?
Version: 2018-09-08

When should you do pre-processing to avoid overfitting?

Do you do it separately on both the training and the test set, which may give you different values? Or do you apply the same values from the training set to the test set, which may lead to overfitting?

I am confused by this quote below. For example, if you use mode imputation for a class on the training set, should you use mode imputation on the class for the test set, even if you get a different mode? Also, if you use mean imputation for numerical data in the training set, should you also use mean imputation on the test set (meaning you will have a different mean value than the training set)?

"To provide a solid methodology, one should constrain themselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set)."

http://www.feat.engineering/review-predictive-modeling-process.html#resampling
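
In practice, the quote usually means estimating the preprocessing parameters on the training set only and then applying them unchanged to new data. A minimal R sketch with caret::preProcess (the iris split is just an illustrative stand-in):

library(caret)

data(iris)
set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[in_train, 1:4]
test_df  <- iris[-in_train, 1:4]

# Estimate the centering/scaling values from the training data only
pp <- preProcess(train_df, method = c("center", "scale"))

train_proc <- predict(pp, train_df)
test_proc  <- predict(pp, test_df)  # reuses the training-set means and SDs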

Precision equation in section 3.2.2

Version: 2018-05-12
Section: 3.2.2
The precision equation’s numerator says “# truly non-STEM predicted correctly“. It should say “# truly STEM predicted correctly“. The numeric values are correct.

Section 3.2.1 Minor Typo

Section 3.2.1
2018-09-09
"performace" should be "performance". "that what it would" should be "than what it would"

In general, this type of sample makes the model performace metric appear worse that what it would be without the sample.

typo in chapter 6.3.1.2

in version date "2018-05-12" in chapter 6.3.1.2

The second row from the left in Figure 6.9

should be

The second column from the left in Figure 6.9

Hmm, since this is the second occurrence of this issue, maybe my English fails me here; in that case, sorry for the inconvenience.

If this is a mistake, then it also occurs in chapter 6.3.1.3:

The ICA components are shown in the third row of Figure 6.9

A bit of feedback

Great idea for a book. So much of an analyst's time is spent here, and no real resources are available treating it in a comprehensive manner.

I did a quick skim over your work so far. Great job! Here are a few thoughts:

  1. I think that some time should be spent on engineering features with a specific outcome in mind. For example, you cover PCA, but not LDA.

  2. There are a lot more 1:1 transformations that I imagine your readers would want to know about. See: http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

  3. I'm not a big fan of hashing categorical variables, though it definitely should be covered. Another option when you have a specific outcome (and a prediction algorithm that needs dimension reduction) is to fit a simple hierarchical model e.g.:

$$
y_{ij} = \theta_j X_{ij} + \epsilon_{ij}
$$
where
$$
\theta_j \sim N(0, \sigma).
$$
Here X is the dummy-coded categorical variable. Then use \theta_j as the encoded value for X. This shrinks rare categories toward 0 so they don't have a lot of impact, and puts the rest on the right scale for a regression. Often I find that interactions with this extracted feature are important. (See the sketch after this list.)

On a related note, you should mention the option of including dummy codes for the c most common categories and an 'other' bucket for the rest.

  4. Feature engineering is perhaps most important in dealing with text, where the raw data is rarely if ever analyzed directly. This should probably be a whole chapter alone and deal with bag-of-words, co-occurrence, sentiment analysis, word vectors, stemming, tokenizing, etc.

  5. What about feature extraction from network data? e.g. number of friends, shared partners, etc.

  6. You sort of assume that you've already got your data in a flat file. How about dealing with engineering out of a structured DB (e.g. SQL)? If you have an outcome in one table, how do you extract relevant features from other tables if there is a one-to-one, many-to-one, one-to-many or many-to-many relationship?

  7. What about image data? e.g. using the (intermediate) output of a general-purpose NN model (imagenet), gabor filters, etc.

  8. Shouldn't there be a discussion of what transformations are needed for what types of models? e.g. Don't code interactions for an RF. Do scale your predictors if your fitting algorithm is gradient descent.

  9. How should outliers be handled? Univariate and multivariate. The choice between row exclusion and column transformation.
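
One concrete way to fit the partial-pooling encoder sketched in point 3 is lme4 (my choice here, not something the book prescribes); the fitted random intercepts play the role of the \theta_j. A minimal sketch with hypothetical data:

library(lme4)

# Hypothetical data: outcome y with a small effect for each of 10 categories
set.seed(1)
cat_levels <- letters[1:10]
true_mu <- rnorm(10, 0, 0.5)
names(true_mu) <- cat_levels
df <- data.frame(cat = sample(cat_levels, 500, replace = TRUE))
df$y <- true_mu[df$cat] + rnorm(500)

# Random-intercept model: theta_j ~ N(0, sigma); rare categories are
# shrunk toward 0
fit <- lmer(y ~ 1 + (1 | cat), data = df)
theta <- ranef(fit)$cat

# Use the shrunken theta_j as the encoded value for each row's category
df$cat_encoded <- theta[df$cat, "(Intercept)"]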

I want to translate this project into Chinese.

Dear Max Kuhn and Kjell Johnson,
I think this project is a great work and will help a lot of people learn how to deal with problems in feature engineering.

Also, before this I had not seen a knowledge system that introduced how feature engineering should proceed.

This project can teach me the how, the why, and what comes next. That's why I think this is a great job.

So, I want to translate this project into Chinese to help more people know it.

I don't know if this is OK? Max Kuhn and Kjell Johnson, please give me some feedback if anything is wrong. https://www.ourantech.club/2018/06/21/feature-engineering/

Typos in `1 Introduction`

In the version dated "2018-05-12", there are several typos in the Introduction section.

1.2.7 Big Data

First, it simple might not

Should be "simply" not "simple"

For high bias, lo variance models

Should be "low" not "lo"

doubling or tripling the amount of training data is unlikely to more the parameter estimates

Incomplete sentence, should read something about "unlikely to improve the parameter estimates", etc.

1.4 Feature Selection

with 0 possible values, is converted to a set of -1 binary variables

Should read "a set of 0-1 binary variables".

Chapter 6 typos

  • p111 par1: Last sentence is not a sentence.
  • p111 par2: "learn how to mitagate" -> learn how to mit_i_gate.
  • p111 par4: "In this chapter we will provide such approaches for": next word missing?
  • "as you can see": not my preferred writing style. Could there be code/an example to see how \lambda = -1.09 was derived?
  • p131 par6: distribution
  • Typo on pg 122: “Resampling. Process.”
  • Typo: pg 133, add ‘s’ to ‘component’ in 2nd-to-last sentence in 2nd paragraph.
  • ReLu? Used on pg 137 and then defined on pg 138.
  • p137 paragraph 2 typo: “east” should be “ease”.
  • Section 6.3.1.3, last paragraph: it is not clear that "the data" refers to the Chicago weekend data.

Minor missing word in Section 3.4.5

In the version dated "2018-05-12", there is a word missing in Section 3.4.5, 3rd paragraph:

and that the true value represented by the green vertical line.

should be

and that the true value is represented by the green vertical line.

typo in chapter 3.1

in version date 2018-05-12 in chapter 3.1 there is a sentence

While the imbalance hasa significant impact on the analysis

there is an "a" which should not be there

Section 2.3 Minor Typo

Version 2018-09-09
Section 2.3
"algoroithm" is mispelled.

For each interaction term, the same resampling algoroithm was used to quantify the cross-validated ROC from a model with only the main effects and a model with the main effects and the interaction term.

Please cite relevant prior work

I honestly think the encoding categoricals chapter should cite our prior work on effects/impact coding and vtreat. It is provably stronger than hashing (we have a note on that if you want), and our work has citation instructions at github.com/WinVector/vtre…

minor typos section 3

Great material, thanks for sharing!

In the version dated "2018-05-12", there is a minor typo in section 3.6 where there should be a dot after the reference.

> algorithms that can be employed (Chong and Żak 2008) These methods
> algorithms that can be employed (Chong and Żak 2008). These methods

Also, in section 3.7, you enumerated two things:

1. It prevents the test set from being used during the model development process and
2. Many evaluations of an external data set are used to assess the differences.

I know it's quite a minor detail, but the "and" ending the first bullet point might be inconsistent with the rest of the book's enumerations. I don't think there is a strict rule for that, but in these matters I often opt for consistency. You seem to frequently start with upper case and end with a dot. That being said, it's not a big deal anyway.

I really look forward to reading the missing-values handling section (an important concern for practitioners handling real-world data). Please give some love to this section.

Thanks

Section 6.2: Typo?

Hello everyone,

I believe there is a typo under the explanation of the hinge function. :

One other feature construction method related to splines and the multivariate adaptive regression spline (MARS) model (Friedman 1991) is the single, fixed knot spline. The hinge function transformation used by that methodology is.

h(x) = x I(x > 0)

Where I is an indicator function that is zero when x is greater than zero and zero otherwise.

Shouldn't it be: where I is an indicator function that is one when x is greater than zero and zero otherwise (so that h(x) is x when x > 0 and zero otherwise)?
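
For concreteness, a one-line R version of the intended hinge behavior:

# MARS hinge: x when x > 0, zero otherwise (equivalently pmax(x, 0))
h <- function(x) x * (x > 0)
h(c(-2, -1, 0, 1, 2))
#> [1] 0 0 0 1 2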

Thank you!

Some ideas for interactions chapter

In Chapter 7 Detecting Interaction Effects and particularly in Section 7.4.2 The Lasso, I think that it could be interesting to provide some comments on

Michael Lim & Trevor Hastie 2016 Learning interactions via hierarchical group-lasso regularization

The authors provide a package with their method in CRAN called glinternet.

It was compared in

A systematic comparison of statistical methods to detect interactions in exposome-health associations. https://www.ncbi.nlm.nih.gov/pubmed/28709428

and glinternet was one of the methods with better performance.
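
A minimal sketch of how glinternet might be applied, with simulated data and hypothetical variable names; numLevels marks continuous predictors with 1 and categorical predictors with their category counts:

library(glinternet)

set.seed(1)
n    <- 200
x1   <- rnorm(n)
x2   <- rnorm(n)
cat3 <- sample(0:2, n, replace = TRUE)  # categories must be coded 0, 1, 2, ...
X    <- cbind(x1, x2, cat3)
y    <- x1 + x1 * x2 + rnorm(n)

# Cross-validated hierarchical group-lasso over main effects + interactions
fit <- glinternet.cv(X, y, numLevels = c(1, 1, 3))
coef(fit)  # selected main effects and interactions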

needs rephrasing

in version date 2018-05-12 in chapter 1 there is a sentence

but one of goal this book
I guess that needs rephrasing.

Switch host for mathjax?

Would it be possible to switch the CDN for MathJax? I work in a US corporation that unfortunately blocks access to bootcss due to the country that hosts the site. Thus, none of the mathematical notation throughout the book renders; raw LaTeX is displayed instead.

A number of books published to bookdown.org suffer from the same issue.

We publish internally using a number of rendering packages, including bookdown. For example, one switch (and minor upgrade) that we employ is to go from:

https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML

to

https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML

by way of

output: 
  bookdown::html_document2:
    mathjax: "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-MML-AM_HTMLorMML.js"

Thanks!

Consider explicit definitions of "feature" and "feature engineering" in the first chapter

Comments or questions on the content

Thanks for making your materials available for comment in the draft stage--what a great idea!

I would suggest that you consider explicit definitions of "feature" and "feature engineering" in the first chapter. You have explanations of what they are, but consider sentences that start like this:

A feature is a...

Feature engineering is the process of...

Date on the first page: 2018-05-12

Typo in Section 5.5

In the version dated "2018-05-12", Section 5.5, the number 0.21 is stated as a percentage in the first line below but used as a proportion a few lines later also shown below. Table 5.2 also confirms that the number should be 0.21 (or 21%).

> The rate of hyperlinks in the STEM profiles was 0.21% while 11.6% of the non-STEM profiles contained at least one link.

> For the STEM profiles, the odds of containing a hyperlink are relatively small with a value of 0.21 / 0.79 = 0.266.

A few minor issues picked up

In the version dated "2018-05-12", the following minor issues were picked up:

Section 1.2.2 Supervised and Unsupervised Procedures:
Exploratory data analysis (EDA) (Tukey 1977) is used understand the
Should be (seems like a “to” is missing there):
Exploratory data analysis (EDA) (Tukey 1977) is used to understand the

Section 1.2.4 The Model versus the Modeling Process
Summary measures for each model, such as model accuracy, are used to understand the level of difficulty for the problem and and to determine which models appear to best suit the data
Should be (the one “and” must be deleted):
Summary measures for each model, such as model accuracy, are used to understand the level of difficulty for the problem and to determine which models appear to best suit the data

Section 1.2.5 Model Bias and Variance
Linear regression is linear in the model parameters and adding polynomial terms to the model can be effective way
Should be (“an” is missing):
Linear regression is linear in the model parameters and adding polynomial terms to the model can be an effective way

Section 1.2.5 Model Bias and Variance
was improved by modifying the predictor variables and was able to show results on par with a support vector machine model (low bias)
Shouldn’t this be as follows (the example compared to a neural network model, not an SVM model)?
was improved by modifying the predictor variables and was able to show results on par with a neural network model (low bias)

Adding mind maps as summary to sections of book

Hello Mr. Kuhn and Mr. Johnson,
thanks for making your book available online; it is already an inspiring read.

I just love mind maps, and I have made it a habit to generate them when I think they will help my future self quickly understand things I once knew. Maybe it would be beneficial for your readers if something like the mind map below were placed at the end of sections.
Best regards,
Uwe Sterr

[mind map image: engineering numeric predictors]

Minor typos in chapters 2 - 4

In version dated "2018-05-12":

  • Part 2.4 Predictive Modeling Across Sets, first sentence:

Physicians have an strong preference towards logistic regression due to its inherent interpretability.

Should be "a"

  • Part 2.4 Predictive Modeling Across Sets, under Fig 2.10:

It is also interesting to note that the model of the risk set requires all 8 predictors while recursive feature elimination for the risk, imaging predictors and imaging predictor interactions set only requires only 4 predictors to achieve a better cross-validated area under the ROC curve.

One "only" is unnecessary.

  • Part 2.2 Preprocessing, under Fig 2.3:

These three pairs are highlighted in red boxes along the diagonal of the coorleation matrix in Figure 2.3.

Should be "correlation".

  • Part 3 A Review of the Predictive Modeling Process second sentence:

These topics are fairly general with regards to empirical modeling and include: metric for measuring performance for regression and classification problems, approaches for optimal data usage which includes data splitting and resampling, best practices for model tuning, and recommendations for comparing model performance.

Should be "include" not "includes"

  • Part 3.1 Illustrative Example: OkCupid Profile Data second paragraph:

While the imbalance hasa significant impact on the analysis, the illustration presented here will mostly side-step this issue by down-sampling the instances such that the number of profiles in each class are equal.

Should be "has a".

  • Part 3.2 Measuring Performance, near "(McElreath 2016)" reference:

The question that one really wants to know is “if my value was predicted to be an event, what is are the chances that it is truly is an event?” or Pr[Y = STEM|P = STEM].

"is" is unnecessary.

  • Part 3.2 Measuring Performance, near "(McElreath 2016)" reference:

Sensitivity (or specificity, depending on one’s point of view) are the “likelihood” parts of this equation.

Should be "is ... part" instead.

  • Part 3.2 Measuring Performance, above Fig 3.3:

Table 3.1 can also be visualized using a mosaic plot such as the one shown in Figure 3.3(b) where the size of the blocks are proportional to the amount of data in each cell.

Should be either "where the sizes ... are" or "where the size ... is".

  • Part 3.2 Measuring Performance, above Fig 3.3:

The mosaic plot for this confusion matrix is shown in Figure 3.3(a) where the blue block in the upper left becomes larger but there is also an increase in the red block in the lower right.

There is no red block in the lower right (probably meant to be upper right).

  • Part 3.3 Data Splitting, first paragraph:

The test set is used only at the conclusion of these activities for estimating a final, unbiased assessment of the model’s performance. It is critical that the test set not be used prior to this point. Looking at its results will bias the outcomes since the testing data will have become part of the model development process.

"Its" in the sentence above refers to test set which has no "results", hence indicated sentence probably requires rephrasing.

  • Part 3.4.1 V-Fold Cross-Validation and Its Variants, third to last paragraph:

Also, Section 4.4 has a more extensive description of how the assessment datasets can be used to drive improvements to models.

Should be "data sets" instead of "datasets".

  • Part 3.4.6 What Should Be Included Inside of Resampling?, second sentence:

This is somewhat of a simplification.

"of" is unnecessary.

  • Part 3.4.6 What Should Be Included Inside of Resampling?, first paragraph:

For example, in Section 1.1, a transformation procedure was used to modify the predictors variables and this resulted in an improvement in performance.

"predictors variables" is incorrect, I think, it should be "predictor variables", or only "predictors" or "variables".

  • Part 3.4.6 What Should Be Included Inside of Resampling?, second to last paragraph:

While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data.

Should be "has" instead of "have".

  • Part 3.6 Model Optimization and Tuning near "(Srivastava et al. 2014)" reference:

This is the rate at which coefficients are randomly set to zero during and is most likely to attenuate overfitting (Srivastava et al. 2014).

Probably an unfinished part of the sentence: "during ..."?

  • Part 3.6 Model Optimization and Tuning, above Table 3.3:

The learning rate parameter controls the rate of decent during the parameter estimation iterations and these values were contrasted to be between zero and one.

Should be "descent".

  • Part 3.6 Model Optimization and Tuning, last paragraph:

Depending on the problem, this bias might over-estimate the model’s true performance.

Should be "overestimate" instead.

missing blank

in version date 2018-05-12 in chapter 1

consider Figure 1.2a
should be
consider Figure 1.2 a

double "for"

in version date 2018-05-12 in chapter 2.3 there is a sentence

Let’s consider the results for for MaxRemodelingRatio which indicates

one "for" is needs to be deleted

Missing words

Two small filler words are missing:

In the version dated "2018-05-12", at the end of paragraph 1.2.1:

[...] The risk of this type of overfitting is especially dangerous when the number of data points is small and the number of potential predictors is very large. [...]

In the version dated "2018-05-12", at the beginning of paragraph 1.2.2:

[...] Exploratory data analysis (EDA) (Tukey 1977) is used to understand the major characteristics of the predictors and outcome so that any particular challenges associated with the data can be discovered prior to modeling. [...]

Two chapter six typos

In section 6.1 it should be "Unfortunately":

Unfortnately, the Box-Cox procedure cannot

In section 6.2.1 it should be "grid":

This is a tuning parameter that can be determined over gird

Suggestion for Encoding Categoricals

Consider adding methods for encoding categorical variables that include entity embeddings, or so-called *-2vec methods, where instead of using large sparse dummy variables or even hashing, we use dense embeddings.

This is largely applicable when using a neural network but has been used more generally (GBDT). This method has successfully been used to win Kaggle contests. Most info is contained in blog posts and Jeremy Howard's fastai course, but there is this paper I am aware of:

Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737.


Also, another useful technique, likewise popularized by Kaggle competitors, is the mean-encoding (aka target encoding, aka likelihood encoding) approach. There are many variants, but here are a couple of descriptions:

https://www.coursera.org/learn/competitive-data-science/lecture/b5Gxv/concept-of-mean-encoding

http://www.statsblogs.com/2017/09/25/custom-level-coding-in-vtreat/

https://datascience.stackexchange.com/questions/11024/encoding-categorical-variables-using-likelihood-estimation
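
A minimal sketch of the basic idea with simulated data and hypothetical names; the per-category outcome mean is smoothed toward the global mean so that rare categories are not encoded from a handful of rows:

# Smoothed mean (target) encoding for a binary outcome y and category cat
set.seed(1)
train <- data.frame(
  cat = sample(letters[1:5], 1000, replace = TRUE),
  y   = rbinom(1000, 1, 0.3)
)

global_mean <- mean(train$y)
m <- 20  # hypothetical smoothing weight: pseudo-observations of the global mean
stats <- aggregate(y ~ cat, data = train, FUN = function(v) {
  (sum(v) + m * global_mean) / (length(v) + m)
})
train$cat_encoded <- stats$y[match(train$cat, stats$cat)]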

Wrong bibtex name expansion

In Section 3.6 Model Optimization and Tuning the sentence

See I, Bengio, and Courville (2016) for an excellent primer on neural networks and deep learning
models.

has just 'I.' for Ian Goodfellow.

Maybe you have Lastname, Firstname wrong ... I generally stick to First Last and Anotherfirst Anotherlast and Yetmorefirst Yetmorelast in my bibtex file.

typo in chapter 6.3.4

in version date "2018-05-12" in chapter 6.3.4

For example, if the true class boundary is a diagonal line, most classification trees cannot easily would have difficultly emulating this pattern.

should probably be

For example, if the true class boundary is a diagonal line, most classification trees would have difficultly emulating this pattern.

Section 4.3.1 Typo

2018-09-09
What is the maximum length? I believe "0K" is a typo.

The maximum length was approximately 0K characters although 10% of the profiles contained less than 14 characters.

Request for additional discussion in 5.6 Factors versus Dummy Variables in Tree-Based Models

Reference to version dated 2018-05-12. I am hoping you could add some additional discussion to Section 5.6, Factors versus Dummy Variables in Tree-Based Models, regarding variable importance. My team and I have been working with our categorical data both ways (dummy encodings and factors), and we noticed that the relative importance of dummy-encoded variables is usually at the bottom of the importance ranking, while when we use the factors directly, they are sometimes ranked as the most important. This has mattered for us for gaining inference from these models. We are using the gbm and randomForest packages along with caret for CV tuning.

Qualitative x quantitative in 6.2.2

"Qualitative" is mistakenly used instead of "quantitative" in Section 6.2.2, which may cause confusion. The first paragraph says:

Binning, also known as categorization or discretization, is the process of translating a qualitative variable into a set of two or more qualitative buckets. For example, a variable might be translated into quantiles

I believe it was meant to say a quantitative variable.

Btw, if this is still debatable, I'd argue that categorical would be clearer than qualitative for describing variables that cannot be ordered throughout the book.

Section 5.1 Creating Dummy Variables

In this section it is explained that when encoding weekday, a categorical variable with 7 levels, it is okay to use only 6 dummy variables. Generally speaking, when encoding a variable with N levels, it is okay to use only N-1 dummy variables.

I agree that this is the right approach for linear models. For tree models, however, I would be cautious about this approach. I would advise everybody to use all N dummy variables (not just N-1) for tree models. If you use, say, a decision tree, and you have no dummy variable for category N (assuming that this is the reference category that was left out), and the decision tree should learn some property about category N, it would have to split off every other category (1..N-1) until it eventually can learn the property about category N.

This should be mentioned in this section or somewhere in the chapter.
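
For reference, base R's model.matrix() can produce either encoding; a small sketch:

df <- data.frame(day = factor(c("Mon", "Tue", "Wed")))

# Reference-cell coding: intercept plus N-1 dummy columns
model.matrix(~ day, df)

# Full coding: one dummy column per category (all N levels)
model.matrix(~ day + 0, df)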

typo in chapter 6.3.1

in version date "2018-05-12" in chapter 6.3.1

The drawback of using a supervised method is that we need enough a larger data set to reserve samples for testing and validation to avoid overfitting.

I guess you want to delete the word "enough".

Preventing overfitting with supervised encodings: adding noise

Probably jumping the gun here since the overfitting chapter isn't written yet: I see a lot of Kaggle entries adding small amounts of Gaussian noise during feature engineering to prevent overfitting (in likelihood encodings, for example), which feels a bit weird to me.

I'd imagine that averaged-out-of-fold predictions might be more appropriate, but that's more computationally expensive. I'd love to see a discussion comparing these two approaches and when one might be better than the other. I'd also be curious how you might select the amount of noise to add, which seems like it would require a validation set anyway.
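
A sketch of the out-of-fold variant with simulated data and hypothetical names: each row is encoded from category means computed on the other folds, so a row's own outcome never leaks into its feature.

set.seed(1)
train <- data.frame(
  cat = sample(letters[1:5], 1000, replace = TRUE),
  y   = rbinom(1000, 1, 0.3)
)

k <- 5
fold <- sample(rep(1:k, length.out = nrow(train)))
train$cat_oof <- NA_real_

for (i in 1:k) {
  held_out <- fold == i
  # Category means from the other k-1 folds; a level unseen there yields NA
  means <- tapply(train$y[!held_out], train$cat[!held_out], mean)
  train$cat_oof[held_out] <- means[train$cat[held_out]]
}

# The noise alternative would instead jitter an in-sample encoding, e.g.
# encoded + rnorm(n, sd = small_value), with sd needing to be tuned somehow.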

Wrong color reference to figure 1.2b

In the version dated "2018-05-12", the diagonal line in figure 1.2b in section 1.1 is referred to as being red, while it is dashed grey. (Is there also a typo, 'a ineffective model' rather than 'an ineffective model'? No native speaker here :))

[...] upper left corner while a ineffective model will stay along the diagonal line shown in red. [...]

typo in chapter 6.3.4

in version date "2018-05-12" in chapter 6.3.4

Data depth measures how close data a data point is to the center of its distribution

should be

Data depth measures how close a data point is to the center of its distribution

code sharing

Hello,
Would you kindly share the code that produces Figure 4.14(c)?

Thank you,

Ronen

Typos - Pedram

From the preface: I believe "effectively" should be "effective" here.

In addition to having a good approach to the modeling process, building an effectively predictive model requires other good practices.

Suggestions for section on "Creating Features from Text Data"

The following feedback is based on the version dated "2018-05-24" and is for section 5.5: Creating Features from Text Data.

Typos

The rate of hyperlinks in the STEM profiles was 0.21% while 11.6% of the non-STEM profiles contained at least one link.

The rate for the STEM profiles is 21.0%.

Therefore the words or phrases used in these open text fields could be very important to predictng the outcome.

List of common methods for preprocessing text data

I would suggest adding even more basic processing here, such as tokenization (single words, n-grams) and cleaning such as removing HTML (sometimes it is informative, as with the links). You don't need to mention everything here, but these are typical super-basic features for text.

Discussion of stemming

Stemming can be useful, but not always. For example, check out "Comparing Apples to Apple" by Schofield and Mimno. Often practitioners assume stemming is good practice, but detailed study has found that, for topic modeling at least, stemming doesn't help and sometimes hurts. I have found similar results with text classification. It's at the very least something that practitioners should not use automatically. Instead, build a model with stemming and without, and compare, for example, AUC. I don't know that you will want to spend the pages here on training both and comparing, but I would definitely mention that stemming is not necessarily going to make the model better, and include the Schofield and Mimno reference. Here is some more recent work by the same authors about the effect of text preprocessing on models for text:

We find that many common practices either have no measurable effect or have a negative effect after accounting for biases induced by feature selection.

😬😬😬

Discussion of tf-idf

For the paragraphs on tf-idf, I'd suggest the following minor edits:

The strategy shown here for computing and evaluating features from text is fairly simplistic and is not the only approach that could be taken. One other method for determining relevant features uses the term frequency-inverse document frequency (tf-idf) statistic (Amati and Van Rijsbergen 2002). Here the goal is to find words or terms that are important to individual documents in the collection of documents at hand. For example, if the words in this book were processed, there are some that would have high frequency (e.g. predictor, encode, or resample) but are unusual in most other contexts.

For a word W in document D, the term frequency tf is the number of times that W is contained in D (usually adjusted for the length of D). The inverse document frequency idf is a weight that normalizes by how often the word occurs in the current collection of documents. As an example, suppose the term frequency of the word feynman in a specific profile was 2. In the sample of profiles used to derive the features, this word occurs in 55 out of 10,000 profiles. The inverse document frequency is usually represented as the log of the ratio, log2(10000/55), or 7.5. The tf-idf value for feynman in this profile would be 2×log2(10000/55), or 15. However, suppose that in the same profile, the word internet also occurs twice. This word is more prevalent across profiles and occurs at least once in 1,529 profiles. The tf-idf value here is 5.4; in this way the same raw count is down-weighted according to its abundance in the overall data set. The word feynman occurs the same number of times as internet in this hypothetical profile, but feynman is more distinctive or important (as measured by tf-idf) because it is rarer overall, and thus possibly more effective as a feature for prediction.
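
The arithmetic in the suggested text checks out and can be reproduced directly in R:

# tf-idf as described above: term frequency times log2 inverse document frequency
tf_idf <- function(tf, docs_with_word, total_docs) {
  tf * log2(total_docs / docs_with_word)
}

tf_idf(2, 55, 10000)    # "feynman": ~15.0
tf_idf(2, 1529, 10000)  # "internet": ~5.4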

Adding mouse hover information to fig. 6.11

In the version dated "2018-05-12", for figure 6.11 it would help in understanding the text if the name of a station were displayed when the mouse hovers over it, similar to fig. 6.10.

Figure 2.3 has the wrong pairs highlighted

In the currently visible version, Figure 2.3 highlights the wrong pairs of variables.

Three blocks are highlighted, and the text below the figure describes which. However, what is in the figure does not correspond. I include an enlarged screenshot to make it more obvious:

[enlarged screenshot of Figure 2.3]

We have 'adjacent pairs' 1, 2 and 8 highlighted. I think we want 3, 9 and 10 instead.

inconsistent reference formats

In the version dated "2018-05-12", capitalization in article/book names is inconsistent. Also, sometimes initials are used for names, sometimes not:

Agresti, Alan. 2012. Categorical Data Analysis. Wiley-Interscience.
Altman, D. 1991. “Categorising Continuous Variables.” British Journal of Cancer, no. 5:975.

typo in chapter 6.3.1.1

in version date "2018-05-12" in chapter 6.3.1.1

The left-most row of Figure 6.9

should be

The left-most column of Figure 6.9

Double "and"

In the version dated "2018-05-12", section 1.2.4 in the third paragraph:

Summary measures for each model, such as model accuracy, are used to understand the level of difficulty for the problem and and to determine which models appear to best suit the data.

Typo in 6.3.1.3 Independent Component Analysis

It reads:

The first component has very little relationship with the outcome nad contains a set of outlying values. In this component, most stations had little to no contribution to the score.

"nad" should read "and".

Feature Engineering & Selection vs. Representation Learning

I'm a big fan of Applied Predictive Modeling, and writing a book entirely on feature engineering and selection is an excellent opportunity to focus on a key activity of the end-to-end process. I find the initial chapters very well done, as usual, and the fact that they are free online is really admirable. Hence, I'm providing honest feedback regarding a key point which I think may be on the minds of many potential readers. This point is about the overall motivation of feature engineering and selection.

Recently, e.g. in I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, 2016 (http://www.deeplearningbook.org/contents/intro.html), a distinction has been reported between "classical machine learning" and "representation learning", where the difference is that in the former approach features are hand-designed by data scientists while in the latter they are learned by machines (Figure 1.5):

"[...] One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself.This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also enable AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or for a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers. [..]"

I think the reader could be interested in your point of view.
