blue-yonder / tsfresh

Automatic extraction of relevant features from time series:

Home Page: http://tsfresh.readthedocs.io

License: MIT License

Jupyter Notebook 76.42% Python 23.56% Dockerfile 0.01%

data-science feature-extraction time-series

tsfresh's Introduction

tsfresh

This repository contains the TSFRESH python package. The abbreviation stands for

"Time Series Feature extraction based on scalable hypothesis tests".

The package provides systematic time-series feature extraction by combining established algorithms from statistics, time-series analysis, signal processing, and nonlinear dynamics with a robust feature selection algorithm. In this context, the term time-series is interpreted in the broadest possible sense, such that any types of sampled data or even event sequences can be characterised.

Spend less time on feature engineering

Data Scientists often spend most of their time either cleaning data or building features. While we cannot change the first thing, the second can be automated. TSFRESH frees your time spent on building features by extracting them automatically. Hence, you have more time to study the newest deep learning paper, read hacker news or build better models.

Automatic extraction of 100s of features

TSFRESH automatically extracts 100s of features from time series. Those features describe basic characteristics of the time series such as the number of peaks, the average or maximal value or more complex features such as the time reversal symmetry statistic.

The set of features can then be used to construct statistical or machine learning models on the time series to be used for example in regression or classification tasks.

Forget irrelevant features

Time series often contain noise, redundancies or irrelevant information. As a result most of the extracted features will not be useful for the machine learning task at hand.

To avoid extracting irrelevant features, the TSFRESH package has a built-in filtering procedure. This filtering procedure evaluates the explanatory power and importance of each characteristic for the regression or classification task at hand.

It is based on the well developed theory of hypothesis testing and uses a multiple test procedure. As a result the filtering process mathematically controls the percentage of irrelevant extracted features.
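As an illustration of what a multiple test procedure does, here is a minimal sketch of the Benjamini-Hochberg procedure in plain Python. It is only one common way to control the false discovery rate; the function name, the fdr_level parameter and the example p-values are assumptions made for this sketch, not the package's actual implementation.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a list of booleans: True where the feature is kept as relevant."""
    m = len(p_values)
    # Sort feature indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is below k / m * fdr_level.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr_level:
            largest_k = rank
    # Keep every feature up to and including that rank.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per extracted feature.
p_values = [0.001, 0.04, 0.03, 0.5]
relevant = benjamini_hochberg(p_values, fdr_level=0.05)
```

Only the first feature survives here: each of the others has a p-value above the threshold for its rank.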

The TSFRESH package is described in the following open access paper:

Christ, M., Braun, N., Neuffer, J., and Kempa-Liehr A.W. (2018). Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh -- A Python package). Neurocomputing 307, p. 72-77, doi: 10.1016/j.neucom.2018.03.067.

The FRESH algorithm is described in the following whitepaper:

Christ, M., Kempa-Liehr, A.W., and Feindt, M. (2017). Distributed and parallel time series feature extraction for industrial big data applications. ArXiv e-print 1610.07717, https://arxiv.org/abs/1610.07717.

Systematic time-series feature extraction even works for unsupervised problems:

Teh, H.Y., Wang, K.I-K., Kempa-Liehr, A.W. (2021). Expect the Unexpected: Unsupervised feature selection for automated sensor anomaly detection. IEEE Sensors Journal 15.16, p. 18033-18046, doi: 10.1109/JSEN.2021.3084970.

Because tsfresh provides time-series feature extraction essentially for free, you can concentrate on engineering new time series, such as differences of signals from synchronous measurements, which provide even better time-series features:

Kempa-Liehr, A.W., Oram, J., Wong, A., Finch, M., Besier, T. (2020). Feature engineering workflow for activity recognition from synchronized inertial measurement units. In: Pattern Recognition. ACPR 2019. Ed. by M. Cree et al. Vol. 1180. Communications in Computer and Information Science (CCIS). Singapore: Springer, p. 223–231. doi: 10.1007/978-981-15-3651-9_20.
Simmons, S., Jarvis, L., Dempsey, D., Kempa-Liehr, A.W. (2021). Data Mining on Extremely Long Time-Series. In: 2021 International Conference on Data Mining Workshops (ICDMW). Ed. by B. Xue et al. Los Alamitos: IEEE, p. 1057-1066. doi: 10.1109/ICDMW53433.2021.00137.

Systematic time-series feature engineering makes it possible to work with time-series samples of different lengths, because every time series is projected into a well-defined feature space. This approach allows the design of robust machine learning algorithms in applications with missing data.

Kennedy, A., Gemma, N., Rattenbury, N., Kempa-Liehr, A.W. (2021). Modelling the projected separation of microlensing events using systematic time-series feature engineering. Astronomy and Computing 35.100460, p. 1–14, doi: 10.1016/j.ascom.2021.100460

Is your time-series classification problem imbalanced? There is a good chance that undersampling of time-series feature matrices might solve your problem:

Dempsey, D.E., Cronin, S.J., Mei, S., Kempa-Liehr, A.W. (2020). Automatic precursor recognition and real-time forecasting of sudden explosive volcanic eruptions at Whakaari, New Zealand. Nature Communications 11.3562, p. 1-8, doi: 10.1038/s41467-020-17375-2.

Natural language processing of written texts is an example of applying systematic time-series feature engineering to event sequences, which is described in the following open access paper:

Tang, Y., Blincoe, K., Kempa-Liehr, A.W. (2020). Enriching Feature Engineering for Short Text Samples by Language Time Series Analysis. EPJ Data Science 9.26, p. 1–59. doi: 10.1140/epjds/s13688-020-00244-9

Advantages of tsfresh

TSFRESH has several selling points, for example

it is field tested
it is unit tested
the filtering process is statistically/mathematically correct
it has comprehensive documentation
it is compatible with sklearn, pandas and numpy
it allows anyone to easily add their favorite features
it runs on your local machine as well as on a cluster

Next steps

If you are interested in the technical workings, go to see our comprehensive Read-The-Docs documentation at http://tsfresh.readthedocs.io.

The algorithm, especially the filtering part, is also described in the paper mentioned above.

We appreciate any contributions. If you are interested in helping us make TSFRESH the biggest archive of feature extraction methods in Python, just head over to our How-To-Contribute instructions.

If you want to try out tsfresh quickly or if you want to integrate it into your workflow, we also have a docker image available:

docker pull nbraun/tsfresh

Backwards compatibility

If you need to reproduce or update time-series features, which were computed with the matrixprofile feature calculators, you need to create a Python 3.8 environment:

conda create --name tsfresh__py_3.8 python=3.8
conda activate tsfresh__py_3.8
pip install tsfresh[matrixprofile]

Acknowledgements

The research and development of TSFRESH was funded in part by the German Federal Ministry of Education and Research under grant number 01IS14004 (project iPRODICT).

tsfresh's Issues

select_features returns empty DataFrame

Hi,

I tried to run tsfresh on my sample data (2 time series). After calling extract_features I received the following matrix:

1003 feature_1 feature_2 ...
1004 feature_2 feature_3 ...

Then I call select_features like this:

ys = pd.Series([1, 2], index = [1003, 1004], name = 'target')
select_features(features, ys)

But all I receive is an empty DataFrame. What am I doing wrong?

ImportError: No module named 'StringIO'

I cannot import, the following error appears:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-0fc1191c9b20> in <module>()
----> 1 from tsfresh import extract_relevant_features

/home/nobody/anaconda3/lib/python3.5/site-packages/tsfresh/__init__.py in <module>()
     19 
     20 
---> 21 from tsfresh.convenience.relevant_extraction import extract_relevant_features
     22 from tsfresh.feature_extraction import extract_features
     23 from tsfresh.feature_selection import select_features

/home/nobody/anaconda3/lib/python3.5/site-packages/tsfresh/convenience/relevant_extraction.py in <module>()
      5 from __future__ import absolute_import
      6 import pandas as pd
----> 7 from tsfresh.feature_extraction import extract_features
      8 from tsfresh.feature_selection import select_features
      9 from tsfresh.utilities.dataframe_functions import restrict_input_to_index

/home/nobody/anaconda3/lib/python3.5/site-packages/tsfresh/feature_extraction/__init__.py in <module>()
      3 """
      4 
----> 5 from tsfresh.feature_extraction.extraction import extract_features
      6 from tsfresh.feature_extraction.settings import FeatureExtractionSettings

/home/nobody/anaconda3/lib/python3.5/site-packages/tsfresh/feature_extraction/extraction.py in <module>()
      9 import pandas as pd
     10 import numpy as np
---> 11 from tsfresh.utilities import dataframe_functions, profiling
     12 from tsfresh.feature_extraction.settings import FeatureExtractionSettings
     13 

/home/nobody/anaconda3/lib/python3.5/site-packages/tsfresh/utilities/profiling.py in <module>()
      6 """
      7 
----> 8 import cProfile, pstats, StringIO
      9 import logging
     10 

ImportError: No module named 'StringIO'

A little searching shows that the StringIO module is gone in Python 3 and has been replaced with the io module, from whence io.StringIO should be imported.
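A minimal sketch of an import that works on both Python 2 and Python 3, assuming the profiling code only needs an in-memory text buffer for pstats output. The helper profile_call and the function work are hypothetical names for illustration, not part of tsfresh.

```python
import cProfile
import pstats

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    # Collect the human-readable report in memory instead of printing it.
    buffer = StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


def work():
    # Hypothetical workload to profile.
    return sum(range(1000))


result, report = profile_call(work)
```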

Explain the FeatureExtractionSettings object in documentation

Especially how to control which features are calculated with setting.name_to_param dictionary.

Remove Warnings from the user experience

I understand that printing all the UserWarnings and RuntimeWarnings is useful while developing on the source code of tsfresh, but as a user, it makes me wonder whether I screwed something up, or if there are bugs in the calculations themselves.

I'm assuming it's all intentional, but that's an assumption I'd rather not have to make, and a not-ideal user experience.

If you're ok with blocking warnings by default, I'm happy to implement that (I do something similar for auto_ml, since I assume it's my responsibility as the project's author to handle these warnings, and that my end users should not have to be bothered by my design choices).
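One possible shape for this, sketched with the standard library warnings module only. noisy_calculation is a hypothetical stand-in for a feature calculator, and which warning categories to silence is an assumption.

```python
import warnings


def noisy_calculation(x):
    # Hypothetical stand-in for a calculator that emits a warning.
    warnings.warn("divide by zero encountered", RuntimeWarning)
    return x * 2


def run_quietly(func, *args, **kwargs):
    """Call func with RuntimeWarning and UserWarning silenced."""
    with warnings.catch_warnings():
        # The filters are restored when the with-block exits.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        warnings.simplefilter("ignore", category=UserWarning)
        return func(*args, **kwargs)


value = run_quietly(noisy_calculation, 21)
```

Warning filters are per process, so if the calculation runs in worker processes the filters would have to be applied inside each worker as well.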

Add to FAQ: Comparing tsfresh to other feature extraction and selection methods

Hi,

I reviewed the documentation. There are 2 main things in tsfresh:

Feature extraction (I saw you have a rather long list of features you create)
Feature filtering

I have some general questions:

What is unique about the feature filtering? For example, let's say I want to use the features for classification. Can't I just pass all the features to information-gain software (like FSelector in R), which will tell me how good or bad each feature is based on information gain, and select features that way?
You have a long list of features. From your experience, how many are left after feature filtering? I know it is a very general question and very time-series specific, but I wonder what the typical filtering looks like.
What is the "smart thing" about the tsfresh solution? Is it new features, a new filtering method, or a combination? Please explain.
Are there any other insights you can provide on the algorithm vs. other feature extraction and feature importance algorithms (RF, information.gain, …)?

extract_features fails on a 0-indexed DataFrame

Can I just confirm that in terms of the DataFrame passed to extract_features, the
DataFrame columns must be named with a string or at least > 0 digit?

The document indeed demonstrates named columns in all the DataFrame examples,
however it does not specifically say that the DataFrame cannot be default
0-indexed and perhaps there is a good reason for that.

This is easily worked around by naming the columns as per all the DataFrame examples and resolves this e.g.:

df.columns = ['metric', 'timestamp', 'value']
df_features = extract_features(df, column_id='metric', column_sort='timestamp', column_kind=None, column_value=None)

But just for clarification, I am constructing a single timeseries Flat DataFrame with just the default
pandas df indexing

                                0           1         2
0     stats.statsd.bad_lines_seen  1478736060  0.000000
1     stats.statsd.bad_lines_seen  1478736120  0.000000
2     stats.statsd.bad_lines_seen  1478736180  0.000000

And then passing extract_features the following parameters:

df_features = extract_features(df, column_id=0, column_sort=1, column_kind=None, column_value=None)

This causes the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-21-65f3bdeb4c3b> in <module>()
      5 
      6 from tsfresh import extract_features, extract_relevant_features, select_features
----> 7 df_features = extract_features(df, column_id=0, column_sort=1, column_kind=None, column_value=None)

/opt/python_virtualenv/projects/tsfresh-py2712/lib/python2.7/site-packages/tsfresh/feature_extraction/extraction.pyc in extract_features(timeseries_container, feature_extraction_settings, column_id, column_sort, column_kind, column_value)
     67     kind_to_df_map, column_id, column_value = \
     68         dataframe_functions.normalize_input_to_internal_representation(timeseries_container, column_id, column_sort,
---> 69                                                                        column_kind, column_value)
     70 
     71     # Use the standard setting if the user did not supply ones himself.

/opt/python_virtualenv/projects/tsfresh-py2712/lib/python2.7/site-packages/tsfresh/utilities/dataframe_functions.pyc in normalize_input_to_internal_representation(df_or_dict, column_id, column_sort, column_kind, column_value)
    260                 raise ValueError("You have NaN values in your id column.")
    261     else:
--> 262         raise ValueError("You have to set the column_id which contains the ids of the different time series")
    263 
    264     # Either the column for the value must be given...

ValueError: You have to set the column_id which contains the ids of the different time series

And having poked around it is not that simple as utilities/dataframe_functions.py
will always raise ValueError if column_id = 0

>>> column_id = 0
>>> if column_id:
...     print(column_id)
... else:
...     print('fail')
...
fail
>>>

And even if that is changed in utilities/dataframe_functions.py to handle
passing 0 e.g.

#    if column_id:
    if column_id or column_id == 0:

That just raises

--> 259                 raise AttributeError("The given column for the id is not present in the data.")
    260             elif kind_to_df_map[kind][column_id].isnull().any():
    261                 raise ValueError("You have NaN values in your id column.")

AttributeError: The given column for the id is not present in the data.

So without having to reverse engineer all the kind_to_df_map stuff, it is easier
for the time being to just note in the documentation that the DataFrame columns
should not be 0-indexed?

Or maybe it is just an issue that has not been reported?
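For the truthiness part of the problem, a check against None avoids rejecting a column label of 0. This is only a sketch: require_column_id is a hypothetical helper, and it does not address the second error about the column not being present in the data.

```python
def require_column_id(column_id):
    """Raise if no id column was given, but accept falsy labels such as 0."""
    # "if column_id:" would reject 0; comparing against None does not.
    if column_id is None:
        raise ValueError(
            "You have to set the column_id which contains the ids "
            "of the different time series"
        )
    return column_id


accepted = require_column_id(0)
```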

Python 3 support

Currently we only support Python 2. In future releases we want to support both Python 2 and 3. This howto outlines the main steps towards Python 3 support.

PCA after feature extraction

Hello.

Am I understanding it correctly, that it is recommended to perform PCA on the extracted features to achieve better accuracy in classification?

I saw it in the paper: https://arxiv.org/pdf/1610.07717v1.pdf

List of features

I don't see a list of features to be calculated somewhere, can we add that information into documentation?

Add testing information to how_to_contribute

It would be good to add a section in how_to_contribute on testing with examples and some information on how the project is tested locally.

Several feature calculators lack unit tests

The feature calculators augmented_dickey_fuller, fft_coefficient, number_cwt_peaks, spkt_welch_density, cwt_coefficients lack unit tests. See feature_calculators.py for their implementations. Stubs for the corresponding tests can be found in test_feature_calculations.py

Permission denied for robot_execution_failures data download

tsfresh.examples.robot_execution_failures.download_robot_execution_failures()

got error

"Permission denied: '/usr/local/lib/python2.7/site-packages/tsfresh/examples/data'"

I have to manually sudo mkdir the folder and sudo wget http://archive.ics.uci.edu/ml/machine-learning-databases/robotfailure-mld/lp1.data into the folder '/usr/local/lib/python2.7/site-packages/tsfresh/examples/data/robotfailure-mld/' to make df, y = load_robot_execution_failures() work.

link issue with repo description

In the repo description "Automatic extraction of relevant features from time series: http://tsfresh.readthedocs.io.", there should be a space before the final period, or the URL won't open correctly.

Add example/documentation how to use tsfresh for time series forecasting

Improve unit tests with data sets

Currently our test cases only use integer data sets. We should add tests using data sets containing floats, NaNs and +/-Infs as well as data sets of length 1.

Update the Quick Start

In order to follow along on the Quick Start page, in the Dive in section, I believe the method used to load the data should be updated, as the code would not run as is.

from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures

download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()

Classifying the feature calculators.

We can use our @set_property decorator to give the feature calculator objects properties.

This can be used, for example, to mark fast calculators with @set_property("minimal", True). We could use such properties to classify our feature calculators even further. For example, we could denote calculators that are "time independent", "medium costly" or "highly costly".
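To make the idea concrete, here is a rough sketch of how such a decorator can attach properties and how calculators can then be filtered by them. The decorator body, the "cost" property and both example calculators are assumptions for illustration and may differ from the real tsfresh code.

```python
def set_property(key, value):
    """Return a decorator that stores value under key on the function."""
    def decorate(func):
        setattr(func, key, value)
        return func
    return decorate


@set_property("minimal", True)
@set_property("cost", "low")
def mean(x):
    return sum(x) / len(x)


@set_property("cost", "high")
def pairwise_abs_diff_sum(x):
    # Quadratic in the series length, hence marked as costly.
    return sum(abs(a - b) for a in x for b in x)


calculators = [mean, pairwise_abs_diff_sum]
# Select only the calculators flagged as minimal.
minimal = [f.__name__ for f in calculators if getattr(f, "minimal", False)]
```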

This issue should be a starting point for a discussion.

No Selected Features

Using a collection of 147 time series with 20 time stamps each, the method extract_features returns 211 features. Some of the features are constant across all 147 IDs but others are not constant.

When the method select_features is applied to the (147, 211) DataFrame X, the outcome is an empty DataFrame.

However, there are features whose STD is not zero, X.iloc[:, X.std().values!=0].shape, (147, 165)
This is an example of the correlation between these 165 non-constant features and target

Other attempts were made such as:

renaming the columns to match the demo notebooks. Resulted in different features but the same empty DF of selected features.
Use extract_relevant_features directly. Resulted in different features but the same empty DF of selected features.
Use other series type. Resulted in different features but the same empty DF of selected features.
Use more than one series type, i.e. DF of shape (147, 12), where 2 of the columns are id and time, the remaining columns are series types. Resulted in different features but the same empty DF of selected features.
Use longer time series, e.g. 200 time stamps. Resulted in different features but the same empty DF of selected features.
Use different values for extraction_settings.fdr_level = 0.1. Resulted in different features but the same empty DF of selected features. Extracted features were still 211.

Questions

Why only 211, even when extending the series to 200 time stamps?
Why no selected features from any series type when the initial pool of features is 211?
Are there any parameters missing to be set?
Is there any problem with the type of inputs?

Fix pypi package overview and documentation

Currently, travis is starting "setup.py upload", which uploads tsfresh to pypi. This setup.py also uses the README.MD as the package description on pypi. However, the documentation looks bad because pypi does not support the github markdown highlighting. We should have a separate readme for pypi.

Also, pypi or travis seems to upload the documentation not only to rtd but also to http://pythonhosted.org/tsfresh/. This seems like a waste of resources; we should only host it on rtd (I prefer rtd because you can directly edit the documentation in your browser).

Add static type annotations

We already include type annotations for most functions in their docstrings. mypy offers a comment syntax for annotating types in Python 2. This would help us ensure that these annotations are correct.

Provide a stand alone script operating on csv

We could provide a stand alone script that runs tsfresh directly on csv files and returns the results as another csv file.

Discuss FeatureExtractionSettingsObject

At the moment the FeatureExtractionSettingsObject feels clumsy and hard to understand.

Also, the naming of some methods does not seem to fit their purpose (I am looking at you, set_default_parameters).

We should rethink our FeatureExtractionSettingsObject API and think about which functionality is still missing. (For example, at the moment do_not_calculate is not able to iterate over all kinds of time series.)

Provide the user with some sense of progress while extract_features is calculating

Right now the only progress indicators are the Warning messages.

I know that doing the computation inside an async map call makes this trickier. But even just knowing how many different things it's calculating, and reporting back every once in a while on how many total have been completed, would be really useful. I'm not sure whether to read a quick xkcd comic while waiting, or go home and let this run overnight.
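A rough sketch of the counting idea using concurrent.futures from the standard library. run_with_progress and its parameters are hypothetical, and a thread pool is used only to keep the example self-contained; CPU-bound feature calculation would more likely use a process pool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_progress(func, items, report_every=2, log=print):
    """Apply func to every item and report how many tasks have completed."""
    items = list(items)
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Remember each task's position so results keep the input order.
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if done % report_every == 0 or done == len(items):
                log("completed {}/{} tasks".format(done, len(items)))
    return results


messages = []
squares = run_with_progress(lambda v: v * v, range(5), log=messages.append)
```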

Provide function to produce list of features from FeatureExtractionSettings object

There should be an easy way to obtain a list of all features a FeatureExtractionSettings object is configured to calculate.

Improve documentation for saving and loading of feature extraction calculators

Hello Max and the other contributors,

Great work on this library, it works really well.

Is it possible to save and load the extraction calculators that were used on a dataset? For instance, let's say that I have a dataset and I run extract_features and then select_features. As a result, some features are removed, and each remaining feature was calculated by a specific calculator. Is there a way to get those calculators?

The reason for this request is that I need to run future datasets through the same calculators, since my model is trained on them.

Let me know if this is possible or if I am missing something.

Thank you.

Add MinimalFeatureExtractionSettings

Right now our FeatureExtractionSettings contains many features.

It would be nice to have a smaller FeatureExtractionSettings object that only calculates basic properties such as mean, median, min, max. This would allow users to tinker with data quickly and calculate the full list of features later.
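To show the scale of what such a minimal set would compute, here is a plain Python sketch. The dictionary MINIMAL_CALCULATORS and the function extract_minimal_features are hypothetical and not tied to the actual settings object.

```python
import statistics

# Hypothetical minimal set: name -> calculator taking one series.
MINIMAL_CALCULATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "minimum": min,
    "maximum": max,
}


def extract_minimal_features(series):
    """Compute only the basic properties of one time series."""
    return {name: func(series) for name, func in MINIMAL_CALCULATORS.items()}


features = extract_minimal_features([1, 2, 3, 10])
```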

Add check if target y is pandas Series.

user @danjo89 reported an error
https://gist.github.com/danjo89/7db0fc3e145337969cec1f0e08a239fe

We do not have a check that the target y is a pandas Series. In the above case the target was a pandas DataFrame. We can easily add a check for that.

ZeroDivisionError: float division by zero

I noticed quite a few divide by zero warnings, like below:
/home/preston/.local/lib/python2.7/site-packages/scipy/signal/_peak_finding.py:412: RuntimeWarning: divide by zero encountered in double_scalars

But then, at one point after running for a while, the entire extract_features() process errored out.

Traceback (most recent call last):
  File "df_dev_script.py", line 590, in <module>
    features = extract_features(combined_df, column_id='store_id', column_sort='created_at_in_local_time')
  File "/home/preston/.local/lib/python2.7/site-packages/tsfresh/feature_extraction/extraction.py", line 93, in extract_features
    extracted_features = pool.map(partial_extract_features_for_one_time_series, kind_to_df_map.items())
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
ZeroDivisionError: float division by zero

Should be pretty easy to do safe division that doesn't error.
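Something along these lines, as a sketch: safe_divide is a hypothetical helper, and returning NaN as the default is an assumption about what a feature calculator should report when the ratio is undefined.

```python
def safe_divide(numerator, denominator, default=float("nan")):
    """Divide, returning default instead of raising on a zero denominator."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return default


ratio = safe_divide(6.0, 3.0)
undefined = safe_divide(1.0, 0.0)
```

Plain Python floats raise ZeroDivisionError, while numpy scalars only emit a RuntimeWarning like the one quoted above, so both paths may need handling.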

Feature combinations for different time series combination

At the moment the calculated features are always based on one type of time series.

It would be nice to have features that are based on multiple time series.

We could use something along the lines of http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html to get feature combinations.

Add new feature: approximate entropy

For a description of the feature see https://en.wikipedia.org/wiki/Approximate_entropy
A tutorial on how to add new features to tsfresh can be found in our documentation: http://tsfresh.readthedocs.io/en/latest/text/how_to_add_custom_feature.html
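As a starting point, here is a straightforward (quadratic time) sketch of approximate entropy following the usual textbook definition. The defaults m=2 and r=0.2 are assumptions; r is an absolute tolerance here, whereas in practice it is often chosen relative to the standard deviation of the series.

```python
import math


def approximate_entropy(x, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) of a sequence of numbers."""
    x = list(x)
    n = len(x)
    if n < m + 1:
        raise ValueError("series too short for the chosen m")

    def phi(length):
        # All overlapping templates of the given length.
        templates = [x[i:i + length] for i in range(n - length + 1)]
        count = len(templates)
        total = 0.0
        for a in templates:
            # Templates within Chebyshev distance r (self-match included).
            matches = sum(
                1 for b in templates
                if max(abs(u - v) for u, v in zip(a, b)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


constant_value = approximate_entropy([1.0] * 10)
alternating_value = approximate_entropy([0.0, 1.0] * 10)
```

Regular series such as the two above give values at or near zero; irregular series give larger values.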

Provide example for a workflow in case of a multiclass target

When selecting features we assume the classification case only when the target is binary:

target_is_binary = len(set(y)) == 2

Otherwise we assume the regression case.

So for a multiclass target the user has to perform a binary one versus rest classification. We should provide an example how to efficiently do that using tsfresh, i.e.:

Calculate all features once
Select the features separately for each one versus rest case
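The two steps could be sketched as follows. The select argument stands in for whatever binary feature selection is used, and toy_select plus the example features are made up for illustration.

```python
def one_vs_rest_targets(y):
    """Map each class label to a binary target: 1 for that class, else 0."""
    classes = sorted(set(y))
    return {label: [1 if value == label else 0 for value in y]
            for label in classes}


def select_per_class(features, y, select):
    """Run select once per one-versus-rest target and union the results."""
    relevant = set()
    for label, binary_y in one_vs_rest_targets(y).items():
        relevant |= set(select(features, binary_y))
    return relevant


def toy_select(features, binary_y):
    # Stand-in selector: keep a feature if it equals the binary target.
    return {name for name, column in features.items() if column == binary_y}


# Features are calculated once and reused for every class.
features = {"is_a": [1, 0, 0, 1], "is_c": [0, 0, 1, 0], "noise": [1, 1, 0, 0]}
y = ["a", "b", "c", "a"]
relevant = select_per_class(features, y, toy_select)
```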

Add new feature: p-value of the Augmented Dickey-Fuller test

Currently we calculate the Augmented Dickey-Fuller test statistic as feature. The p-value of the Augmented Dickey-Fuller test might also be a relevant feature.

As the statsmodels function adfuller returns the test statistic as well as the p-value, it would make sense to memoize or buffer the results and add the new feature as a separate function.
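A sketch of the buffering idea with functools.lru_cache. fake_adfuller is a made-up stand-in that returns a (statistic, p-value) pair so the example runs without statsmodels; the two feature function names are hypothetical as well.

```python
from functools import lru_cache

call_count = {"n": 0}


def fake_adfuller(values):
    """Stand-in for a costly test returning (statistic, p_value)."""
    call_count["n"] += 1
    statistic = sum(values) / len(values)
    p_value = 1.0 / (1.0 + abs(statistic))
    return statistic, p_value


@lru_cache(maxsize=128)
def _cached_test(values_as_tuple):
    # The cache key must be hashable, hence the tuple.
    return fake_adfuller(values_as_tuple)


def adf_statistic_feature(x):
    return _cached_test(tuple(x))[0]


def adf_p_value_feature(x):
    return _cached_test(tuple(x))[1]


series = [1.0, 2.0, 3.0]
statistic = adf_statistic_feature(series)
p_value = adf_p_value_feature(series)
```

Both features are served by a single call of the underlying test, at the cost of converting each series to a tuple.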

A tutorial on how to add new features to tsfresh can be found in our documentation: http://tsfresh.readthedocs.io/en/latest/text/how_to_add_custom_feature.html

Fix code on quick start documentation

Reported by user WickedWicky on r/machinelearning

Following this page http://tsfresh.readthedocs.io/en/latest/text/quick_start.html and I don't know if that is supposed to be up to date but the code blocks don't fit together.
"from tsfresh import select features" should be "from tsfresh import select_features"
and the last block you use 'df' as an argument where you defined it as 'timeseries' earlier.

Can tsfresh be used to solve this problem

Hi,

I have 400,000 time series, each of length 1,000.
Each time series has a class 0 or 1 (binary classification problem). The dataset is balanced between 0 and 1

I have created my own features (min, max, sd, var, linear regression) on different lengths of the time series (50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000), which came out to 900 features.

I have run the 900 features in a grid search with CART (decision tree).
The best accuracy I got was 0.56 or so.

Can tsfresh be used here (and hopefully do a much better job)? If so, how?

This is a real-life problem, let's see how tsfresh handles it (vs. other methods).

ipython notebook error

in notebook , robot_failure_example.ipynb ,

X_filtered = extract_relevant_features(df, y, column_id='id', column_sort='time')

had error of

"ValueError: Column a__autocorrelation__lag_8 of dataframe must not contain NaN values "

Find the optimal versions for our requirements

For our requirements we have to figure out the minimum versions we actually need. Optimizing our requirements increases our compatibility with different package versions and helps us support more environments.

extraction and filtering not equal to filtered extraction

User @Huandao0812 reported in issue #22:

Hi Max, I tried to do the same by your example, but my X_new has different number of features than my X_old, my code is here https://github.com/Huandao0812/lstm_exp/blob/master/test_tsfresh.py#L46
can you have a quick look

update: I check the diff of 2 set of columns and this is the difference:
the X_new has 2 more columns than the X_old
diff columns = set(['feature__cwt_coefficients__widths_(2, 5, 10, 20)_coeff_13__w_20', 'feature__cwt_coefficients__widths(2, 5, 10, 20)__coeff_3__w_5'])

Parallelize the calculation of p-values

Currently we calculate the p-values per feature in a loop:

for feature in df_features['Feature']:
        if target_is_binary:
            # Decide if the current feature is binary or not
            if len(set(X[feature].values)) == 2:
                df_features.loc[df_features.Feature == feature, "type"] = "binary"
                p_value = target_binary_feature_binary_test(X[feature], y, settings)
            else:
                df_features.loc[df_features.Feature == feature, "type"] = "real"
                p_value = target_binary_feature_real_test(X[feature], y, settings)
        else:
            # Decide if the current feature is binary or not
            if len(set(X[feature].values)) == 2:
                df_features.loc[df_features.Feature == feature, "type"] = "binary"
                p_value = target_real_feature_binary_test(X[feature], y, settings)
            else:
                df_features.loc[df_features.Feature == feature, "type"] = "real"
                p_value = target_real_feature_real_test(X[feature], y, settings)

        # Add p_values to df_features
        df_features.loc[df_features['Feature'] == feature, "p_value"] = p_value

This loop could be parallelized with joblib or something similar.
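A rough sketch of the same idea with concurrent.futures instead of joblib. The function p_values_in_parallel, the test_for_feature callable and toy_test are hypothetical; a thread pool keeps the sketch self-contained, and for CPU-bound tests a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def p_values_in_parallel(feature_columns, y, test_for_feature, max_workers=4):
    """Compute one p-value per feature, dispatching the tests to a pool."""
    names = list(feature_columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of names.
        p_values = list(pool.map(
            lambda name: test_for_feature(feature_columns[name], y), names))
    return dict(zip(names, p_values))


def toy_test(column, y):
    # Stand-in for a real significance test.
    return 0.01 if column == y else 0.9


columns = {"f1": [0, 1, 1], "f2": [1, 1, 0]}
p_value_by_feature = p_values_in_parallel(columns, [0, 1, 1], toy_test)
```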

Parallelize the calculation of features

Currently we calculate the features in a list comprehension:

extracted_features = [_extract_features_for_one_time_series(relevant_time_series, str(kind),
                                                                column_id, column_value, feature_extraction_settings)
                                  for kind, relevant_time_series in kind_to_df_map.iteritems()]

A first step would be to parallelize this list comprehension with joblib or something similar.

Further, as can be seen here the calculation of features is fragmented both horizontally and vertically. So we can also parallelize on a per sample basis.

Dask support

To efficiently handle huge data sets, we could make use of Dask

Efficient rolling window of time series (e.g. while using temporary results)

I want to extract features from a rolling window over a table whose columns are several time series, and make predictions based on the time series in that window.
Currently, as far as I understand the docs, I have to extract the time series and tile them like in the example, so there would be a lot of duplicate data because of the rolling window, which doesn't seem memory efficient. Is there a rolling window API or a better way to do it?
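For reference, the tiling described above can be sketched as a generator that emits one window at a time in long format (window id, time index, value). rolling_windows is a hypothetical helper; it still repeats values across windows, but it does not need to hold all windows in memory at once.

```python
def rolling_windows(values, window, step=1):
    """Yield (window_id, time_index, value) rows for each rolling window."""
    starts = range(0, len(values) - window + 1, step)
    for window_id, start in enumerate(starts):
        for offset in range(window):
            yield window_id, start + offset, values[start + offset]


rows = list(rolling_windows([10, 11, 12, 13], window=3))
```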

Thanks!

Track successful applications of tsfresh

It would be nice to have an overview in the documentation of the fields (biology, manufacturing line optimization, predictive maintenance, financial data) in which tsfresh has proved helpful.

Implement new feature: sample entropy

@CYHSM mentioned this paper https://www.ncbi.nlm.nih.gov/pubmed/10843903?dopt=Abstract

We should implement the entropy feature sample entropy (SampEn).
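
As a starting point, here is a direct, unoptimised sketch of the usual SampEn definition: count pairs of templates of length m and of length m + 1 that lie within tolerance r of each other in the Chebyshev distance, then take the negative log of the ratio. The function name, the `<= r` comparison, and the handling of the undefined cases are assumptions of this sketch, not a proposed tsfresh signature.

```python
import math


def sample_entropy(x, m, r):
    """Sample entropy of sequence x with template length m and tolerance r.

    r is an absolute tolerance; scaling it by the standard deviation of x
    (a common convention) is left to the caller.
    """
    n = len(x)

    def count_matches(length):
        # Use the same n - m start positions for both template lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                # Chebyshev distance between the two templates.
                dist = max(abs(x[i + k] - x[j + k]) for k in range(length))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)      # matching pairs of length m
    a = count_matches(m + 1)  # matching pairs of length m + 1
    if b == 0:
        return float("nan")   # undefined: nothing matches at length m
    if a == 0:
        return float("inf")   # no pair stays similar at length m + 1
    return -math.log(a / b)


print(sample_entropy([0, 0, 0, 1], 1, 0.5))
```

The double loop is quadratic in the series length, so a vectorised version would likely be needed before using this on long series.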

Allow NaN or None values to be passed in, and silently ignored

In my DataFrames, I oftentimes find it very reasonable to have quite a few NaN or None values.

In order to run tsfresh, I just set all those values to 0, which is... not ideal. Even imputing missing values would not work particularly well for several of my use cases.

Yet it seems (from a super naive outsider's perspective) like this is filtering that tsfresh could do relatively easily itself.

When it grabs each time_series, it can simply remove or categorically ignore NaN/None, and compute features on the values that do exist. This makes my life easier when, say, one customer signs up a month after another customer, and thus has missing values for that month.

Again, super naive outsider perspective here, I know this might be impossible. But if it is possible, I'd love to add in that bit of filtering!
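
Until something like this exists in the library, the filtering can be done on the caller's side before extraction. The sketch below assumes long-format rows with hypothetical `id`, `time` and `value` keys; `is_missing` and `drop_missing` are illustrative helpers, not tsfresh functions.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(rows, value_key="value"):
    """Return only the rows whose value is present."""
    return [row for row in rows if not is_missing(row[value_key])]


rows = [
    {"id": "a", "time": 0, "value": 1.0},
    {"id": "a", "time": 1, "value": float("nan")},
    {"id": "b", "time": 0, "value": None},  # customer b had not signed up yet
    {"id": "b", "time": 1, "value": 3.0},
]
clean_rows = drop_missing(rows)
print(clean_rows)
```

With a pandas DataFrame in the same layout, `df.dropna(subset=["value"])` should have the same effect. Each id then simply ends up with a shorter series.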

Add to FAQ: different length of time series possible

Does it support time series of different lengths?

I am interested in using it for audio tasks.

Thanks!

Confusion about extract_relevant_features

I have two time series with different lengths and values, and they belong to two different labels. I transformed them into a pandas.DataFrame like this

When I use the method extract_relevant_features, it returns an empty list like this

I don't understand the result. Does it mean the two series have no relevant features?

Here is my code. pitch (pitch2) is a list like

My code:

import numpy as np
import pandas as pd
from tsfresh import extract_relevant_features

import pitch_analyse  # the poster's own module, not part of tsfresh

y = [0, 1]
y = np.array(y)
pitch = pitch_analyse.pitch_profile("new001.wav")
pitch2 = pitch_analyse.pitch_profile(r"D:\emotion_new\angry\angry_1_009.wav")
# list(...) so that extend() also works on Python 3, where range() is not a list
time1 = list(range(len(pitch)))
time2 = range(len(pitch2))
time1.extend(time2)
id_dic = [0]*len(pitch)+[1]*len(pitch2)
pitch.extend(pitch2)
series = {'id': id_dic, 'time': time1,'value': pitch}
series = pd.DataFrame(series)
features_filtered_direct = extract_relevant_features(series, y, column_id='id', column_sort='time')
print(features_filtered_direct)

Dickey Fuller feature - return test statistic or boolean?

As it currently stands, the feature returned by augmented_dickey_fuller is the test statistic. Should we (1) return the test statistic as the feature or (2) return a boolean for the reject/fail-to-reject outcome of the hypothesis test?

In conventional analysis, this value is compared with the critical values at the 1%, 5% and 10% significance levels to test for stationarity (e.g. if the DF statistic > the 5% critical value, then we fail to reject the hypothesis of non-stationarity).
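
To make option (2) concrete, the comparison can be sketched as below. `adf_decisions` is a hypothetical helper, and the critical values are illustrative numbers only; in practice they depend on the sample size and the regression specification and should come from the test implementation (for example, `statsmodels.tsa.stattools.adfuller` returns them alongside the statistic).

```python
def adf_decisions(statistic, critical_values):
    """Map each significance level to True if the unit-root null is rejected.

    The null hypothesis (non-stationarity) is rejected when the statistic is
    more negative than the critical value for that level.
    """
    return {level: statistic < crit for level, crit in critical_values.items()}


# Illustrative critical values, not computed for any particular series.
critical_values = {"1%": -3.43, "5%": -2.86, "10%": -2.57}
decisions = adf_decisions(-3.0, critical_values)
print(decisions)
```

A boolean throws away how far the statistic lies from the threshold, which is one argument for keeping the raw statistic as the feature and leaving the thresholding to the downstream model.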

Add feature class: Count/Percentage same value

I had an idea for a class of features.

percentage of data points that occur at least a second time
sum of data points that occur at least a second time
percentage of observed values that occur at least a second time
...
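
One possible reading of the three ideas above, as a sketch: a "data point" is every element of the series, an "observed value" is a distinct value, and "occurring at least a second time" means the value appears two or more times. The function and key names are hypothetical, and shares are returned as fractions between 0 and 1 rather than percentages.

```python
from collections import Counter


def reoccurrence_features(x):
    """Features describing how much of the series consists of repeated values."""
    if not x:
        raise ValueError("x must not be empty")
    counts = Counter(x)
    # Distinct values that show up two or more times.
    repeated = {value for value, c in counts.items() if c >= 2}
    repeated_points = [v for v in x if v in repeated]
    return {
        "share_of_reoccurring_points": len(repeated_points) / len(x),
        "sum_of_reoccurring_points": sum(repeated_points),
        "share_of_reoccurring_values": len(repeated) / len(counts),
    }


result = reoccurrence_features([1, 1, 2, 3, 3, 3, 4])
print(result)
```

For real-valued series, exact equality will rarely hold, so some rounding or binning before counting may be needed.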

Installation problem on Windows 10 - Anaconda2

I ran this command to install tsfresh on Windows 10:

"C:\Anaconda2\Scripts\pip.exe" install "C:\Anaconda2\extrapackages\tsfresh-master.zip"

I got this error:

Processing c:\anaconda2\extrapackages\tsfresh-master.zip
Complete output from command python setup.py egg_info:

Installed c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg
ERROR:root:Error parsing
Traceback (most recent call last):
  File "c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg\pyscaffold\contrib\pbr\pbr\core.py", line 111, in pbr
    attrs = util.cfg_to_args(path, dist.script_args)
  File "c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg\pyscaffold\contrib\pbr\pbr\util.py", line 246, in cfg_to_args
    pbr.hooks.setup_hook(config)
  File "c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg\pyscaffold\contrib\pbr\pbr\hooks\__init__.py", line 25, in setup_hook
    metadata_config.run()
  File "c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg\pyscaffold\contrib\pbr\pbr\hooks\base.py", line 27, in run
    self.hook()
  File "c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg\pyscaffold\contrib\pbr\pbr\hooks\metadata.py", line 26, in hook
    self.config['name'], self.config.get('version', None))
  File "c:\users\vinodv\appdata\local\temp\pip-jwysox-build\.eggs\pyscaffold-2.5.6-py2.7.egg\pyscaffold\contrib\pbr\pbr\packaging.py", line 710, in get_version
    raise Exception("Versioning for this project requires either an sdist"
Exception: Versioning for this project requires either an sdist tarball, or access to an upstream git repository. Are you sure that git is installed?
error in setup command: Error parsing c:\users\vinodv\appdata\local\temp\pip-jwysox-build\setup.cfg: Exception: Versioning for this project requires either an sdist tarball, or access to an upstream git repository. Are you sure that git is installed?

----------------------------------------

I have installed git from git-scm.
The "Path" system variable contains the "C:\Program Files\Git\cmd" path.

Improve FAQ

Add the following points to FAQ:

PCA, why is it useful and which implementation to use
Extraction and filtering is equal to filtered extraction
How to save time by pickling calculated features
How can I apply tsfresh to huge time series datasets?
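
For the pickling point, the FAQ entry could be built around a small load-or-compute pattern like the sketch below. It uses only the standard library. `load_or_compute` and `expensive_extraction` are hypothetical names, and the dict returned here merely stands in for a real feature matrix.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute):
    """Return the cached result if cache_path exists, else compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_extraction():
    """Stand-in for a slow feature extraction run."""
    calls.append(1)
    return {"value__mean": [1.0, 2.0]}


cache_path = os.path.join(tempfile.mkdtemp(), "features.pkl")
first = load_or_compute(cache_path, expensive_extraction)   # computes and writes
second = load_or_compute(cache_path, expensive_extraction)  # reads from disk
print(len(calls))
```

If the features live in a pandas DataFrame, `DataFrame.to_pickle` and `pandas.read_pickle` cover the same ground. The cache has to be deleted by hand whenever the input data or the extraction settings change.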