Multivariate time series classification, about johannfaouzi/pyts

Comments (22)

johannfaouzi commented on May 13, 2024 4

Hi,

Thanks for the link, I didn't know this paper. I gave it a quick look and it looks interesting. A PR would be very welcome if you have some time and interest in the future.

I have not been maintaining pyts a lot lately, but I have been working on implementations of several variations of DTW (Sakoe-Chiba band, Itakura parallelogram, Multiscale DTW and Fast DTW) with numba to make it much faster (my current implementation of DTW is pretty bad). All of this work is still offline and I hope that I can release a new version of pyts this fall / winter (I don't have as much time as I would like to spend on this package).

Regarding your current problem, I think that it could also be interesting to look at the accuracy with each feature (by feature I mean a single time series in a MTS, not a timestamp), which could give you some hints about the relevance of each feature. You could also fit a standard classifier (Logistic Regression, SVM, Random Forest, etc.) with the predictions / probabilities of each classifier for each feature (which could be more powerful than voting if you have a large dataset and/or a lot of classes).
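A minimal sketch of both ideas, assuming scikit-learn is available and that the data is stored as an array of shape (n_samples, n_features, n_timestamps). The toy data, the 3-nearest-neighbors base classifier and all variable names are illustrative assumptions, not something pyts provides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic toy data (illustration only): 80 samples, 3 features, 30 timestamps.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 3, 30))
X[:, 0, :] += y[:, None]  # only feature 0 carries information about the class

base = KNeighborsClassifier(n_neighbors=3)  # arbitrary choice of base classifier

# 1) Cross-validated accuracy of each feature taken alone.
scores = [cross_val_score(base, X[:, j, :], y, cv=5).mean()
          for j in range(X.shape[1])]

# 2) Stacking: the out-of-fold class probabilities obtained for each feature
#    become the input columns of a standard classifier.
meta_X = np.hstack([
    cross_val_predict(base, X[:, j, :], y, cv=5, method="predict_proba")
    for j in range(X.shape[1])
])
meta_clf = LogisticRegression().fit(meta_X, y)
```

The per-feature scores give a rough ranking of the features. The stacked classifier should itself be evaluated on held-out data, since it is fitted here on all the samples.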

from pyts.

johannfaouzi commented on May 13, 2024 1

Hi,

Currently, pyts does not have any implementation of an algorithm that specifically deals with multivariate time series.

One solution, maybe not always ideal, would be to consider the m time series of a multivariate time series to be independent. If you make such an assumption, you can split your dataset into m datasets and train a classifier independently on each dataset, which will give you a prediction (and a probability for each class if the classifier can output probabilities). Finally, to get a single prediction for your multivariate time series, you can use a voting classifier. Regarding the voting, with a classifier that doesn't output probabilities you will have to use hard voting, while with a classifier that can output probabilities you can also use soft voting.

Once again, I will emphasize that considering the m time series of a multivariate time series to be independent may be a strong assumption in theory, but in practice it can still perform well.
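The independent-classifier approach with voting could look roughly like this. It is only a sketch, assuming scikit-learn and data of shape (n_samples, n_features, n_timestamps); the helper names fit_per_feature, soft_vote and hard_vote are made up for this example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def fit_per_feature(X, y, make_clf):
    """Fit one classifier per feature; X has shape (n_samples, n_features, n_timestamps)."""
    return [make_clf().fit(X[:, j, :], y) for j in range(X.shape[1])]


def soft_vote(clfs, X):
    """Average the class probabilities of the per-feature classifiers."""
    probas = np.mean(
        [clf.predict_proba(X[:, j, :]) for j, clf in enumerate(clfs)], axis=0
    )
    return clfs[0].classes_[np.argmax(probas, axis=1)]


def hard_vote(clfs, X):
    """Majority vote over the per-feature predictions (ties go to the first class)."""
    classes = clfs[0].classes_
    preds = np.stack([clf.predict(X[:, j, :]) for j, clf in enumerate(clfs)], axis=1)
    counts = (preds[:, :, None] == classes[None, None, :]).sum(axis=1)
    return classes[np.argmax(counts, axis=1)]


# Synthetic toy data (illustration only): 60 samples, 3 features, 50 timestamps.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 3, 50)) + y[:, None, None]  # class 1 is shifted by +1

clfs = fit_per_feature(X, y, lambda: KNeighborsClassifier(n_neighbors=3))
y_soft = soft_vote(clfs, X)
y_hard = hard_vote(clfs, X)
```

Here the predictions are computed on the training data only to keep the example short; in practice you would predict on a separate test set.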

What do you think about it?


johannfaouzi commented on May 13, 2024 1

@sknepal Here are some Python packages hosted on GitHub that might be interesting for you:

tsfresh: already mentioned,
seglearn: pretty similar to what tsfresh provides, the article says that it could be faster than tsfresh,
tslearn: the most similar package to pyts, it provides implementations for algorithms that are not available in pyts currently.


johannfaouzi commented on May 13, 2024 1

@sknepal The paper "Multivariate recurrence plots" introduces the Joint Recurrence Plot that is mentioned on Wikipedia.

Advantages:

It's based on literature.
It's super easy to implement: it's the Hadamard product between the recurrence plots for each feature, which means that one can just use np.prod(recurrence_plots, axis=...) to transform the list of recurrence plots into one recurrence plot only (one needs to give the right value for axis).

Cons:

Merging several images into one might make you lose a lot of information.
I can't guarantee you that it would improve your predictive performance.
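For illustration, the Hadamard product can be sketched with NumPy alone. The recurrence_plot helper below is a simplified stand-in (trajectory dimension 1, threshold picked from a percentile of the distances), not the pyts implementation, and the product only behaves as a logical AND because the plots are binary:

```python
import numpy as np


def recurrence_plot(x, percentage=10.0):
    """Binary recurrence plot of a univariate series (one timestamp per state).

    Pixel (i, j) is 1 when |x[i] - x[j]| is at most a threshold chosen as the
    `percentage`-th percentile of all pairwise distances.
    """
    dist = np.abs(x[:, None] - x[None, :])
    eps = np.percentile(dist, percentage)
    return (dist <= eps).astype(int)


def joint_recurrence_plot(X, percentage=10.0):
    """Hadamard product of the per-feature plots; X has shape (n_features, n_timestamps)."""
    plots = np.stack([recurrence_plot(x, percentage) for x in X])
    return np.prod(plots, axis=0)  # axis 0 indexes the features


# Synthetic toy series (illustration only): 3 features, 50 timestamps.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
jrp = joint_recurrence_plot(X)
```

A pixel of the joint plot is 1 only when all the features recur at the same pair of timestamps, which is why the joint plot is usually much sparser than the individual ones.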


johannfaouzi commented on May 13, 2024 1

I think that resampling (either under-sampling or over-sampling) would be the best. Usually image encoding algorithms expect the time gap between each timestamp to be the same.

under-sampling: You can use pyts.approximation.PiecewiseAggregateApproximation to perform under-sampling.
over-sampling: You can use pyts.preprocessing.InterpolationImputer to perform imputation (which requires you to add missing values at the timestamps that you want); there is no built-in over-sampling method right now, but adding missing values for some given timestamps and imputing these missing values is literally over-sampling.
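To make the two options concrete, here is a NumPy-only sketch. It only mimics what the two pyts classes are used for above and does not call them; the function names are hypothetical, and the example assumes the timestamps of each hour are evenly spread over that hour:

```python
import numpy as np


def undersample_paa(x, output_size):
    """Piecewise aggregate approximation: mean of `output_size` consecutive chunks."""
    return np.array([chunk.mean() for chunk in np.array_split(x, output_size)])


def oversample_linear(t, x, t_new):
    """Linearly interpolate the values x observed at times t onto the grid t_new."""
    return np.interp(t_new, t, x)


# Illustration only: one hour with 50 samples, the next hour with 30 samples.
hour_a = np.linspace(0.0, 1.0, 50)
hour_b = np.linspace(0.0, 1.0, 30) ** 2

hour_a_30 = undersample_paa(hour_a, 30)            # 50 -> 30 values
hour_b_50 = oversample_linear(np.linspace(0.0, 1.0, 30), hour_b,
                              np.linspace(0.0, 1.0, 50))  # 30 -> 50 values
```

Either way, every hour ends up with the same number of values, so the image encoders receive inputs of a fixed length.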


johannfaouzi commented on May 13, 2024 1

Unfortunately I don't think that the algorithms implemented in this package will help you for your use case. The term time series classification usually refers (at least in the literature) to a single label being associated to a whole time series. For instance:

the blue time series, X-motion, could be class 0,
the orange time series, Qom-motion, could be class 1, and
the green time series, Y-motion, could be class 0.

In this case, the objective of time series classification would be to predict a label for a new whole time series.

Your use case makes me think of regression models for binary time series. Unfortunately this package does not deal at all with this topic. Here are a few references:

Hope this helps you a bit.


johannfaouzi commented on May 13, 2024 1

If you use a fixed window length and consider a single output (0 or 1), then you can transform your task into a time series classification task. For instance, with a window length of 100, you can transform your whole multivariate time series into several smaller multivariate time series of length 100 and a single label for each small time series (the future value of beat with a given time gap).
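A rough NumPy sketch of this windowing step is shown below. The function name make_windows and its gap and step parameters are assumptions made for the example; `gap` is the number of timestamps between the end of a window and the value to predict:

```python
import numpy as np


def make_windows(signals, beats, window=100, gap=1, step=1):
    """Cut a long multivariate series into fixed-length labeled windows.

    signals: array of shape (n_features, n_timestamps)
    beats:   binary array of shape (n_timestamps,)
    Returns X of shape (n_windows, n_features, window) and y of shape (n_windows,),
    where y[k] is the value of `beats` located `gap` timestamps after window k ends.
    """
    n_timestamps = signals.shape[1]
    starts = np.arange(0, n_timestamps - window - gap + 1, step)
    X = np.stack([signals[:, s:s + window] for s in starts])
    y = beats[starts + window + gap - 1]
    return X, y


# Synthetic toy data (illustration only): 3 signals, 1000 timestamps, rare beats.
rng = np.random.default_rng(0)
signals = rng.normal(size=(3, 1000))
beats = (rng.random(1000) < 0.05).astype(int)

X, y = make_windows(signals, beats, window=100, gap=1, step=10)
```

Consecutive windows overlap heavily when step is small, so the train/test split should be made along time rather than by shuffling windows.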

Concerning the multivariate aspect of your task and the interdependence between the different signals, the literature on multivariate time series classification is unfortunately much scarcer. Like integration of multimodal data, the interdependence is usually domain-dependent or task-dependent, and thus there is no approach that works for every task. Adding your knowledge about the interdependence may be the best solution.


senemaktas commented on May 13, 2024 1

Thank you for your suggestion. We are trying as many techniques and methods as possible and hope to get good results.
It may be helpful to process the signal window by window rather than as a whole. I hope this works. Thanks a lot for your help.


Andrea0911 commented on May 13, 2024

Hi, thanks for the reply. I think it is possible to implement this.

I have implemented in Python an Adaptive DTW (DTW-A), presented in this paper, that has two DTW variations (maybe I could submit a PR with this algorithm one day):

The first one considers the multivariate time series (MTS) as a set of independent time series (DTW-Independent), and the second one considers the MTS as a whole (DTW-Dependent). Depending on the distances obtained with these two options, the adaptive algorithm chooses the better one and computes the KNN with this final distance.

But, sadly, I was unable to achieve good results, so I will try your approach.

Thanks for the suggestion.


Andrea0911 commented on May 13, 2024

Hi!

Thanks for your suggestion, I will try this approach and maybe come back here to ask another question hehe.

Currently I don't have so much time to improve the code and submit the PR, as soon as I have free time I will try to work on the code.


sknepal commented on May 13, 2024

@Andrea0911 Would you mind sharing what approach you used at the end?

@johannfaouzi The examples in the documentation say:

# Parameters
n_samples, n_features = 100, 144

But n_features here is the number of values of a single feature over N time steps, correct? It is not the n_features of a multivariate time series (i.e. multiple features at a single time step). That got me confused for a while. For example:

feature 1	feature 2	...	feature n	label
23.4	12.5	...	1.34	Class 1
27.3	14.7	...	2.46	Class 1
...	...	...	...	Class 2
26.5	11.2	...	2.32	Class 2

I have a dataset like this, where each row corresponds to one day's sample and there are X features. Can I use pyts implementation of RP, GADF, MTF etc. to encode each row to a single image?

Also: are you aware if recurrence plot supports multivariate ts? I have seen one example here that uses recurrence plot for plotting multivariate ts. Can you please look into it? Thank you!


johannfaouzi commented on May 13, 2024

I agree that the use of n_features is not ideal. When I wrote the code, I was not aware of multivariate time series. The next minor version release (0.8, you can have a look at the branch) will replace n_features with n_timestamps, to make a clear distinction between features and timestamps. Unfortunately this version will not contain any tool to deal with multivariate time series, but it's on my todo list.

I had a look at the example and if I understood correctly, they use a trajectory with a dimension equal to 1 (timestamp), then they compute the recurrence plot with shape (n_timestamps, n_timestamps) by aggregating along each feature. It can be done with your dataset by using RecurrencePlots(dimension=1). However, I find their definition non standard. If you have a look at the original paper describing recurrence plots, it is mentioned that the trajectory is computed along the timestamps, not the features (they only consider univariate time series in the article). Moreover, the article that is mentioned in the notebook deals with univariate time series only. To extend the definition for multivariate time series, there are several possibilities that come to my mind:

replacing the distance between two vectors with a distance between two matrices,
computing the distance for each feature independently and aggregating the images by taking the mean for each pixel,
the Wikipedia page mentions two extensions for multivariate time series: cross recurrence plots and joint recurrence plots. I never read the corresponding papers so I can't comment on that, but it would be interesting to add these tools to pyts.
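The first two possibilities can be sketched with NumPy as follows. This is a simplified reading, with a trajectory dimension of 1 (so each state is just the vector of feature values at one timestamp) and no thresholding; the function names are hypothetical and this is not what pyts computes:

```python
import numpy as np


def vector_distance_image(X):
    """Euclidean distance between the feature vectors at timestamps i and j.

    X has shape (n_features, n_timestamps); the result is (n_timestamps, n_timestamps).
    """
    diff = X[:, :, None] - X[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))


def mean_distance_image(X):
    """Per-feature |x[i] - x[j]| images, averaged pixel by pixel over the features."""
    return np.abs(X[:, :, None] - X[:, None, :]).mean(axis=0)


# Synthetic toy series (illustration only): 4 features, 30 timestamps.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))
img_vector = vector_distance_image(X)
img_mean = mean_distance_image(X)
```

With a single feature the two images coincide; with several features they weight large per-feature differences differently, so it may be worth comparing them on your data.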

There is one thing that I don't understand with your dataset: which axis represents time (the timestamps)? The second axis (the columns) represents the features, but does the first axis represent the samples or the timestamps? Do you have a dataset like this one for each subject? When dealing with multivariate time series, I expect a dataset with three dimensions: samples, features and timestamps.


sknepal commented on May 13, 2024

Sorry, my bad -- there's a third dimension accounting for subjects (just as in the first post).

I get pretty different images with RecurrencePlots(dimension=1) (or even with any other encoding technique), which is why I wanted to try that, but again, doing so on multivariate data is not based on literature, so I am not sure if that will make sense even if I get good results.

So far, I have tried treating all samples as independent data points and running classification algorithms on them. That was not good enough. LSTM wasn't good either. Then I tried tsfresh to extract features from the signals and ran xgboost and randomforest on them. It somewhat improved the performance, but it's still not good enough. Is there any other approach you can suggest?

Thank you for your help!


Andrea0911 commented on May 13, 2024

@sknepal sorry for the late answer, the approach that achieved the best results was with tsfresh


sknepal commented on May 13, 2024

Just realized that with a CNN I could just create one RP for each feature and stack them together for prediction. I found a paper where they stacked distance matrices instead of recurrence plots though (and obtained good results): Classification of Recurrence Plot's Distance Matrices with a CNN for Activity Recognition.

Here's their reasoning for it:

Determining the threshold ε of a RP is a difficult task and very subjective. A bad threshold will lead a RP to have most of its entries as ones or zeros. This will make it difficult to find discriminative patterns. To overcome this, instead of analyzing the RP, we will directly work with the distance matrix D.
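As a rough idea of what the stacked input could look like, here is a NumPy sketch that builds one unthresholded distance matrix per feature and keeps them as separate channels. The channels-first layout and the function name are assumptions for this example and are not taken from the paper:

```python
import numpy as np


def stacked_distance_matrices(X):
    """One |x[i] - x[j]| matrix per feature, kept as separate channels.

    X has shape (n_samples, n_features, n_timestamps); the result has shape
    (n_samples, n_features, n_timestamps, n_timestamps).
    """
    return np.abs(X[:, :, :, None] - X[:, :, None, :])


# Synthetic toy data (illustration only): 8 samples, 3 features, 40 timestamps.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3, 40))
images = stacked_distance_matrices(X)
```

Each sample then looks like an image with one channel per feature; the axes may need to be reordered depending on what the CNN framework expects.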


johannfaouzi commented on May 13, 2024

This paper also skips the thresholding step.


sknepal commented on May 13, 2024

Is there a way we can skip thresholding in pyts too?

Edit: Never mind. :) Just saw the documentation. Thresholding is disabled by default, it seems.


sknepal commented on May 13, 2024

@johannfaouzi

What would you suggest as an image encoding technique when the number of samples varies from one time period to the next? For example, the data for one hour may consist of 50 samples, whereas the data for the next hour consists of 30.

I am guessing I could either interpolate or reduce the dimension to get a fixed number of samples and then do the encoding. However, are there any approaches built into pyts that could be of help?

Thank you!


senemaktas commented on May 13, 2024

Hi,

Great work with time series, congrats!

This is a question rather than an issue. I'm working on a multivariate time series classification problem, and the dataset has the form:
feature 1 feature 2 ... feature n label time_series_group
23.4 12.5 ... 1.34 Class 1 1
27.3 14.7 ... 2.46 Class 1 1
... ... ... ... Class 1 1
26.5 11.2 ... 2.32 Class 1 1
28.4 11.5 ... 1.54 Class 1 2
27.3 14.3 ... 2.09 Class 1 2
... ... ... ... Class 1 2
21.2 14.9 ... 3.34 Class 1 2
25.4 12.5 ... 1.34 Class 2 3
28.3 14.7 ... 2.46 Class 2 3
... ... ... ... Class 2 3
29.5 11.9 ... 1.35 Class 2 3
27.4 10.5 ... 1.54 Class 2 4
21.3 17.3 ... 2.09 Class 2 4
... ... ... ... Class 2 4
26.4 13.6 ... 2.47 Class 2 4

The dataset above isn't real; it is only an example. The column time_series_group identifies the rows that belong to the same multidimensional time series. For instance, let us assume that every group has the same number of rows m, so the first m rows (identified by time_series_group = 1) hold the values of the n features for one time series of Class 1. Within group 1, each feature column is a time series of that feature's behavior; therefore, Class 1 is identified by n (the number of features) time series.
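A table in this long format can be turned into a three-dimensional array of shape (samples, features, timestamps) with a few lines of NumPy. The sketch below assumes that every group has the same number of rows and that the rows of a group are already in time order; the function name long_to_3d and the toy values are made up for the example:

```python
import numpy as np


def long_to_3d(values, groups, labels):
    """Reshape a long table into (n_samples, n_features, n_timestamps).

    values: array of shape (n_rows, n_features), the feature columns
    groups: array of shape (n_rows,), the time_series_group column
    labels: array of shape (n_rows,), the label column (constant within a group)
    """
    group_ids, first_idx = np.unique(groups, return_index=True)
    X = np.stack([values[groups == g].T for g in group_ids])
    y = labels[first_idx]  # one label per group, taken from its first row
    return X, y


# Toy table (illustration only): 4 groups of m = 5 rows, n = 3 features.
values = np.arange(60, dtype=float).reshape(20, 3)
groups = np.repeat([1, 2, 3, 4], 5)
labels = np.repeat(["Class 1", "Class 2"], 10)

X, y = long_to_3d(values, groups, labels)
```

Each X[k, j] is then one univariate time series (feature j of group k), which is the layout that per-feature classifiers or image encoders can be applied to.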

Is there a way to solve this kind of problem with pyts? If not, do you have any thoughts about how to approach it?

Hi @Andrea0911. Did you solve that problem? I am trying to do a similar thing. Is there anything you can suggest? Thanks.


johannfaouzi commented on May 13, 2024

Hi @senemaktas,

Could you give me more information about your use case and your data? We provide some tools in the pyts.multivariate module.

Best,
Johann


senemaktas commented on May 13, 2024

Hi @johannfaouzi

I have made a dataset with 3 inputs, which are 3 signals. The red marks in the graph are the beats (ground truth: each timestamp is marked 0 or 1 to indicate whether it is a beat). The output contains far too many zeros, so there is also a class imbalance. I am looking for a way to predict the 1 values. Thank you very much.


senemaktas commented on May 13, 2024

My goal is to estimate the output by examining the values of the 3 signals over a specific time period, because these signals are actually interdependent. I think this is a complex process. I will continue to work with regression instead of classification.

Thank you very much for your answer and suggestion. All the best.

