wenjiedu / saits

The official PyTorch implementation of the paper "SAITS: Self-Attention-based Imputation for Time Series". A fast, state-of-the-art (SOTA) deep-learning model for efficient time-series imputation (imputing multivariate incomplete time series containing NaN missing values with machine learning). https://arxiv.org/abs/2202.08516

Home Page: https://doi.org/10.1016/j.eswa.2023.119619

License: MIT License

Python 96.96% Shell 3.04%
time-series imputation-model missing-values self-attention partially-observed-data partially-observed-time-series partially-observed interpolation time-series-imputation incomplete-data

saits's Introduction

🤙 Contact info:

👋 Hi, I'm Wenjie Du (杜文杰 in Chinese). My research focuses on modeling time series with machine learning, especially partially-observed time series (POTS), namely incomplete time series with missing values, a.k.a. irregularly-sampled time series. I strongly advocate open-source and reproducible research, and I always devote myself to turning my work into valuable real-world applications. The Unix philosophy "Do one thing and do it well" is also my life philosophy, and I always strive to walk my talk. My research goal is to model this non-trivial and kaleidoscopic world with machine learning to make it a better place for everyone. It's my honor if my work can help you in any way.

🤔 POTS is ubiquitous in the real world and is vital to deploying AI in industry. However, it still lacks attention from academia and also lacks a dedicated toolkit, even in a community as vast as Python's. Therefore, to facilitate the work of researchers and engineers dealing with POTS, I'm leading the PyPOTS Research Team (pypots.com) to build a comprehensive Python toolkit ecosystem for POTS modeling, including data preprocessing, neural network training, and benchmarking. Stars 🌟 on our repos are of course very welcome if you like what we're trying to achieve with PyPOTS.

💬 I'm open to questions related to my research and always try my best to help others. I love questioning myself and I never stop. If you have questions for discussion or are interested in collaboration, please feel free to drop me an email or ping me on LinkedIn/WeChat/Slack (contact info is at the top) 😃 You can follow me on Google Scholar and GitHub to get notified of our latest publications and open-source projects. Note that I'm very glad to help review papers related to my research, but ONLY open-source ones with readable code.

❤️ If you enjoy what I do, you can fund me and become a sponsor. And I assure you that every penny from sponsorships will be used to support impactful open-science research.

😊 Thank you for reading my profile. Feel free to contact me if you'd like to start a discussion.



saits's Issues

Some questions about multivariate time series Imputation.

Thank you for your work. I recently read your paper "SAITS: Self-Attention-based Imputation for Time Series". I am also working on multivariate time series imputation, and I have some questions I hope to discuss with you.
1. I recently ran your method on my own dataset. My data-processing approach is to first split the data into a training set and a test set and then build the time sequences, training on the training set and evaluating on the test set. I know that imputation is an unsupervised task and does not use the ground truth of the missing values; some people split the data into training and test sets while others do not, and these two partitioning strategies give somewhat different results with your algorithm. How do you view the partitioning of datasets?
2. Could your algorithm overfit? Since the back-propagated loss is the MAE on the non-missing entries rather than on the whole dataset, I feel that as the number of training epochs grows the model gradually tends to overfit.
3. Currently the stopping condition is reaching a specified number of epochs, which has to be tuned for each dataset. If we split the data into training and test sets, can we stop training by checking when the MAE on the (artificially) missing entries of the training set reaches its minimum?
Thank you very much.

Can the training sequence length (n_steps) be adjusted dynamically?

Hello, in the provided example.py, saits = SAITS(n_steps=48, n_features=37, n_layers=2, d_model=256, d_inner=128, n_heads=4, d_k=64, d_v=64, dropout=0.1, epochs=10) shows that n_steps is set to 48, because every RecordID in the example dataset has 48 samples.
However, in my dataset the number of samples per RecordID is not fixed, e.g. 1, 7, or even 216. Would it be feasible to set n_steps to the largest count over all RecordIDs, e.g. 216? Or is there another option? (See the sketch below.) Many thanks!
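One common workaround, given here as a minimal sketch of my own rather than the author's answer, is to pad every record with NaN up to a fixed n_steps so that the missing mask marks the padded positions as missing; very long records can instead be split into windows. The function name and the record shapes below are illustrative assumptions.

import numpy as np


def pad_records_to_fixed_length(records, n_steps):
    """Pad variable-length records with NaN so every record has exactly n_steps rows.

    records: list of arrays, each shaped [record_length, n_features].
    Padded positions are NaN, so a missing mask built afterwards treats them as missing.
    """
    n_features = records[0].shape[1]
    padded = np.full((len(records), n_steps, n_features), np.nan, dtype="float32")
    for i, rec in enumerate(records):
        length = min(len(rec), n_steps)  # truncate records longer than n_steps
        padded[i, :length] = rec[:length]
    return padded


# Illustrative usage with made-up record lengths (1, 7, and 216 time steps, 37 features)
records = [np.random.rand(1, 37), np.random.rand(7, 37), np.random.rand(216, 37)]
X = pad_records_to_fixed_length(records, n_steps=216)  # shape: (3, 216, 37)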

window truncate function

import numpy as np


def window_truncate(feature_vectors, seq_len):
    """Generate time-series samples by truncating non-overlapping windows of a given sequence length from the data.

    Parameters
    ----------
    feature_vectors : time-series data, len(shape)=2, [total_length, feature_num]
    seq_len : sequence length of each generated sample
    """
    # Start indices of consecutive, non-overlapping windows: 0, seq_len, 2*seq_len, ...
    start_indices = np.asarray(range(feature_vectors.shape[0] // seq_len)) * seq_len
    sample_collector = []
    for idx in start_indices:
        sample_collector.append(feature_vectors[idx: idx + seq_len])

    # Returned shape: [n_samples, seq_len, feature_num]
    return np.asarray(sample_collector).astype('float32')

Wenjie,

I have some questions if you do not mind to clarify

  1. In the implementation, is the training data generated by dividing the time series into segments based on the sequence length?
  2. What is the advantage of such a training-data configuration over the sliding-window approach, e.g., generating the training set with a one-time-step lag: [t-n, t-n+1, ... t], [t-n+1, t-n+2, ... t+1], [t-n+2, t-n+3, ... t+2]? Wouldn't the sliding-window approach generate more training samples? (See the sketch after this list.)
  3. I am not very familiar with the Transformer architecture. In a typical RNN-based imputation method, there are the concepts of sequence length (i.e., the length of historical or future data used as input) and prediction horizon (i.e., how far in the future or past the model tries to impute). For SAITS, what are the equivalent concepts, or does the concept of a prediction horizon exist at all?
  4. I understand from your paper that the sequence length is fixed across models for comparison purposes. How does the sequence length affect imputation accuracy? How would you recommend determining an appropriate sequence length for the problem at hand?
  5. An unrelated question: does your PyPOTS currently work with the Air Quality dataset?
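For reference, here is a minimal sketch of the sliding-window alternative mentioned in question 2 (illustrative only, not code from the repository): it yields overlapping samples offset by a configurable stride, producing more, but highly correlated, training samples.

import numpy as np


def sliding_window_samples(feature_vectors, seq_len, stride=1):
    """Generate overlapping samples with a sliding window (stride=1 gives a one-time-step lag).

    feature_vectors: time-series data shaped [total_length, feature_num].
    Returns an array shaped [n_samples, seq_len, feature_num].
    """
    total_length = feature_vectors.shape[0]
    sample_collector = [
        feature_vectors[start: start + seq_len]
        for start in range(0, total_length - seq_len + 1, stride)
    ]
    return np.asarray(sample_collector).astype("float32")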

Thanks in advance,
Haochen

Question about loss

Thank you for your work, and please understand that this is not a direct question about the code. Is there any reason the loss function does not include a classification error term? Some models that perform reconstruction and imputation include a classification error in the loss function. Have you ever trained models this way? If so, please let me know what the results were like.

pd.concat

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

Calculation of the loss function

Thank you for excellent work!

In the code, the imputation loss of MIT does not cover the complement feature vector.

Secondly, the paper describes taking the raw data X without artificial masking as input to the MIT formula, but in the corresponding code you use the artificially masked X^.

Is there something I am misunderstanding? I look forward to your clarification of my doubts!

Training stage of an attention based model!

Greetings Wenjie,
I was very impressed by your work "SAITS". I am trying to create an attention-based model on my own as part of my Bachelor's project, and I have a few questions:
I wanted to know how many trials you ran for SAITS in the training stage. I noticed you already answered a similar question from the user "Rajesh90123", where you said you "just let the experiment run till I thought it was good to stop". Just for reference, could you please tell me how many trials you ran during hyperparameter tuning before you decided it was good enough to stop? Was it in the range of 100s, 1,000s, 10,000s, or more? Please let me know at your earliest convenience. Thank you!

Using CSV files versus h5 data

Hello Wenjie,

Thank you for releasing the code; I have a couple of questions. I am trying to run the code with the Air Quality dataset in Google Colab. These are some of my doubts:

  1. !CUDA_VISIBLE_DEVICES=2 python run_models.py --config_path configs/AirQuality_SAITS_best.ini
    Running this gives me the following error message.
    OSError: Unable to open file (unable to open file: name = 'dataset_generating_scripts/RawData/AirQuality/PRSA_Data_20130301-20170228/datasets.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

The entire dataset is in .csv format.
1a) Is there an option to use the default .csv data?
1b) How do I convert .csv to h5 format? (See the sketch after these questions.)

  2. Where should we change the file path of the dataset for training purposes?
    Is it in the file configs/AirQuality_SAITS_best.ini?
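Regarding 1b, a minimal sketch of converting a CSV file into an HDF5 dataset (the file name, feature columns, and dataset key below are my own assumptions; the datasets.h5 that run_models.py actually expects is produced by the scripts under dataset_generating_scripts/, which also handle splitting and masking):

import h5py
import pandas as pd

# Illustrative only: read one raw PRSA CSV file and store its numeric feature
# columns as a float32 matrix in an HDF5 file.
df = pd.read_csv("PRSA_Data_Aotizhongxin_20130301-20170228.csv")  # assumed file name
feature_cols = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3", "TEMP", "PRES"]  # assumed columns

with h5py.File("datasets.h5", "w") as hf:
    hf.create_dataset("X", data=df[feature_cols].to_numpy(dtype="float32"))  # assumed key "X"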

Please do let me know, thanks.

Niharika

Some questions about the generated data

Hello, I ran into some problems while reproducing the code and hope you can answer them.
1. The imputed data are identical to the original data. In the ETT test set, the missing_mask shows that position (0, 3) should be artificially missing, but the finally generated value is exactly the same as the original value. The Excel file was exported after data standardization, and the h5 file is the generated file.
[Screenshot: Snipaste_2024-06-03_14-56-24]

2. If I use my own dataset, do I have to define the config file for training myself? Also, were the *_best config files in the configs folder obtained through your own repeated training?

3. I want to use the finally imputed h5 file for downstream forecasting work. Can it be converted to CSV for export? (See the sketch below.)
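For question 3, a minimal sketch of exporting an imputed h5 file to CSV (the dataset key and shapes are assumptions; check the actual keys with h5py first):

import h5py
import pandas as pd

# Illustrative only: flatten imputed samples of shape [n_samples, n_steps, n_features]
# back into a 2-D table and write it to CSV. "imputed_data" is an assumed key;
# inspect the real keys with `list(h5py.File("imputations.h5").keys())`.
with h5py.File("imputations.h5", "r") as hf:
    imputed = hf["imputed_data"][:]

n_samples, n_steps, n_features = imputed.shape
pd.DataFrame(imputed.reshape(n_samples * n_steps, n_features)).to_csv(
    "imputed_data.csv", index=False
)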

Question about temporal dependencies and feature correlations captured by DMSA

Hello, I have a question about the self-attention in your paper. An N×N self-attention matrix Q·Kᵀ represents attention relationships along a single dimension of length N. However, the paper states that "Such a mechanism makes DMSA able to capture the temporal dependencies and feature correlations between time steps in the high dimensional space with only one attention operation". How can a single attention operation in DMSA capture attention across both dimensions (temporal and feature) at once?

Inquiry About Using SAITS Model with 'physionet_2012' Dataset

Hi everyone:

While using the 'physionet_2012' dataset, I found that the 'y' values are not output as expected. I noticed that the paper's 'Downstream classification task' section covers related content. If the 'In-hospital_death' values are not 0 and 1 but some other numbers, and the content of 'X' influences the mortality rate, can the SAITS model be used to predict this? If so, which function of SAITS should I use?

Thank you for your assistance.

Best regards,

Test data

Hi,

After certain modifications and the inclusion of some code snippets, I was able to train, validate, and get the MAE for the test data.
I want to obtain the de-normalized values after imputation on the test data, both predicted and actual. Can you help?

Question about output of the first DMSA

Hello, I want to ask about the modeling code in saits.py. I used only the first DMSA block and fed in X and the missing mask the same way you do, but after passing through the encoder layer the data becomes all NaN. What could be the reason for this?
Looking forward to your reply

Configs of ETTm1

Hello,
Could you share the configuration settings on the ETTm1 dataset?
Thanks!

Question about MAE

Hi, Wenjie

import torch

def masked_mae_cal(inputs, target, mask):
    """Calculate the Mean Absolute Error over positions where mask == 1 only (1e-9 avoids division by zero)."""
    return torch.sum(torch.abs(inputs - target) * mask) / (torch.sum(mask) + 1e-9)

I have a small doubt about the calculation of the MAE.
I found that you normalize the dataset with standard scaling, which means the target and input are standardized. So why not calculate the MAE after inverting the scaling on them? (See the sketch below.)
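For context, a minimal sketch (my own, not the repository's code) of how the masked MAE could be reported in the original units by inverting the standard scaling first; scaler is assumed to be the StandardScaler fitted on the raw training data.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Example fit (assumed variable): scaler = StandardScaler().fit(raw_train_data_2d)


def masked_mae_original_scale(imputations, targets, mask, scaler):
    """Masked MAE after inverse-transforming standardized values.

    imputations/targets/mask: numpy arrays shaped [n_samples, n_steps, n_features].
    """
    n_features = imputations.shape[-1]
    imp_orig = scaler.inverse_transform(imputations.reshape(-1, n_features))
    tgt_orig = scaler.inverse_transform(targets.reshape(-1, n_features))
    mask_flat = mask.reshape(-1, n_features)
    return np.sum(np.abs(imp_orig - tgt_orig) * mask_flat) / (np.sum(mask_flat) + 1e-9)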

Loss_MIT wrong?

I saw that the MIT loss computation in core.py is

MIT_loss = self.customized_loss_func(
    X_tilde_3, inputs["X_ori"], inputs["indicating_mask"]
)

which computes the MAE between X̃3 and X_ori and differs from the paper.

Final error calculation

Hello Wenjie,

I have a doubt regarding the calculation of the final error metrics on the test data.

Suppose my sample data looks like this:

date          A       B
timestamp1    3       5
timestamp2    4       7
timestamp3    6       8
timestamp4    8       10

After introducing 50% missingness :

date          A       B
timestamp1    Nan     5
timestamp2    Nan     7
timestamp3    6       8
timestamp4    Nan    Nan

After imputation :

date          A       B
timestamp1    2       5
timestamp2    4       7
timestamp3    6       8
timestamp4    6       5
  1. Are the MAE, RMSE, and MRE calculated only on the imputed values or on the whole dataset?
  2. Could you explain the MAE, RMSE, and MRE formulas/equations used? (See the sketch after this list.)
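For question 2, a short sketch of masked RMSE and MRE definitions consistent with the masked_mae_cal function quoted in an earlier issue (a sketch mirroring that masked-error convention, not a verbatim quote from the repository): only positions where mask == 1, i.e. the artificially masked values, contribute to the error.

import torch


def masked_rmse_cal(inputs, target, mask):
    """Root Mean Squared Error over positions where mask == 1 only."""
    return torch.sqrt(
        torch.sum(((inputs - target) * mask) ** 2) / (torch.sum(mask) + 1e-9)
    )


def masked_mre_cal(inputs, target, mask):
    """Mean Relative Error: total masked absolute error divided by total masked target magnitude."""
    return torch.sum(torch.abs(inputs - target) * mask) / (
        torch.sum(torch.abs(target) * mask) + 1e-9
    )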

Thank you, Please let me know

Regards
Niharika Joshi

How to understand the NNI fine-tuning?

First, the file SAITS_basic_config.ini under the NNI_tuning folder is missing two args, "MIT" and "ORT", which breaks the script python ../../run_models.py --config_path SAITS_basic_config.ini --param_searching_mode. You may want to add these two args to that .ini file and also check the other .ini files when you have time.
Second, I am wondering how NNI helps tune the parameters. To be more specific, which parameters does NNI change? When and by how much are they changed? Are only the parameters listed in the SAITS_searching_space.json file changed?

Thanks for your attention.

Question about hyperparameter optimization

Hello, I would like to ask you about the hyperparameter optimization for the model. In your file NNI_tuning/SAITS/SAITS_searching_config.yml, you described the settings for the hyperparameters and the training command, which also includes a JSON file. However, when I tried to run the command for hyperparameter optimization on SAITS, I encountered an error: "No option 'mit' in section: 'training'". I supplemented the missing parameters and ran it again, but I only obtained the parameters set in the SAITS_basic_config.ini file. Could you please advise me on how to iterate through the parameters in the JSON file to obtain the optimal parameters?

Custom dataset

Thank you for your wonderful work. I would like to know whether I can use this model, or train a model from scratch, to impute my time series?

Few more questions on SAITS working method

I am really impressed with your work: "SAITS: SELF-ATTENTION-BASED IMPUTATION FOR TIME SERIES".
I have been studying your code and paper, and I have a few questions:

  1. Is there any particular reason for not using a learning-rate scheduler?
  2. Also, when I tried replicating the code, I found that my program ran indefinitely during the random search for hyperparameter tuning, especially with the use of loguniform for the learning rate. I didn't find any lines specifying the maximum number of trials for hyperparameter tuning in your code or in your paper "SAITS: SELF-ATTENTION-BASED IMPUTATION FOR TIME SERIES". Could you please provide information on this?
  3. Do you change the artificial missing rate for MIT in the training stage based on the missing rate at test time? For example, if you have to test your model on a test dataset with an 80% missing rate, does the MIT missing rate stay fixed at 20% as in the provided code, or do you retrain the entire model after manually changing the MIT missing rate to 80% in the training code?
  4. Do you perform hyperparameter tuning on different missing rates in the validation data, or do you tune on one particular missing rate in the validation dataset and save that model to use for test datasets with any missing rate? What I mean is: running hyperparameter tuning on a validation dataset with a 20% missing rate, saving the best model, and using that same best model to impute missing data in a test dataset with an 80% missing rate?
    Thank you, and if my questions are not clear, please comment and I will try my best to describe them better. English is my second language, and I am not fluent in speaking it.

Question about training, validation, testing and missing rates?

Greetings, sir.
In your research paper, "SAITS: Self-Attention-based Imputation for Time Series", you present a table with different missing rates (20% to 90%). Do you prepare a best model for each validation dataset with the corresponding missing rate so as to produce the result for that missing rate at test time, or do you save one model based on a specific missing rate in the validation data (say 20%) and use it to produce the test results for other missing rates (say 30% artificial missing in the test data)?
