
rakshitha123 / tsforecasting


This repository contains the implementations of experiments on a set of publicly available datasets used in time series forecasting research.

Home Page: https://forecastingdata.org/

License: Other

R 55.58% Python 17.87% Jupyter Notebook 26.56%
benchmarks datasets forecasting global-models time-series

tsforecasting's People

Contributors

kashif, pitmonticone, pmontman, rakshitha123, timoschowski


tsforecasting's Issues

Default trainer from Gluon package in the deep_learning_experiments script

Hello, thank you for making this benchmark available.
I noticed that the deep learning implementation relies on the Gluon package, and in the train_model function the trainer is always used with its default parameters. I wonder whether it would be better to adjust some parameters to the dataset at hand (primarily the num_batches_per_epoch parameter).
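
For illustration, a minimal sketch of what that could look like; the import paths and the estimator below are assumptions about the GluonTS version in use (Trainer under gluonts.mx.trainer), and all values are placeholders rather than the repository's actual configuration:

    # Sketch only: pass an explicitly configured Trainer instead of the default.
    from gluonts.mx.trainer import Trainer
    from gluonts.model.simple_feedforward import SimpleFeedForwardEstimator

    num_series = 111  # hypothetical: number of series in the dataset
    trainer = Trainer(
        epochs=100,
        # Scale the batches per epoch with the dataset size instead of
        # relying on the default value.
        num_batches_per_epoch=max(50, num_series // 2),
    )
    estimator = SimpleFeedForwardEstimator(
        freq="D",
        prediction_length=56,
        trainer=trainer,
    )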

Data leakage for global univariate setting?

Hi, thanks for your amazing work! One issue I'm wondering about: what constitutes a data leak in the global univariate setting?

Please correct me if I'm wrong, but I noticed that in the global univariate setting, where the time series are unaligned (specifically, they have unaligned end dates), a shorter time series X can have its test set in the same period as a longer time series Y's training set (examples include the Vehicle Trips and Tourism datasets).

There could possibly be some leakage of global information (e.g. in some financial time series there can be a systematic movement of all assets), giving global models an unfair advantage.
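
For concreteness, a minimal sketch of the alignment concern; the dates and lengths are made up, not taken from any of the datasets:

    # Sketch: with a per-series split that holds out the last `horizon`
    # points, a short series' test window can fall inside a longer
    # series' training window in calendar time.
    import pandas as pd

    horizon = 7
    x_index = pd.date_range("2020-01-01", periods=30, freq="D")  # short series X
    y_index = pd.date_range("2020-01-01", periods=60, freq="D")  # long series Y

    x_test = x_index[-horizon:]   # test period of X
    y_train = y_index[:-horizon]  # training period of Y

    # Prints 7: X's entire test window overlaps Y's training window.
    print(len(x_test.intersection(y_train)))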

sktime integration of `data_loader.convert_tsf_to_dataframe`

Really nice collection of forecasting benchmark datasets you have here!

We were wondering (at sktime) whether you would be open to us integrating (a possibly modified) data_loader.convert_tsf_to_dataframe into the data_io module of sktime, using the Monash forecasting repository as a data endpoint. Of course with proper attribution and crediting of the source.

Since tsf is based on the ts format and the loaders are similar, it would fit nicely with a current refactoring effort in the space.

If you'd be up for a chat and/or collaboration, feel free to visit us on the sktime slack, in the forecasting or forecasting-global channel
(go to https://github.com/alan-turing-institute/sktime, README -> slack badge at the top).
It might also be nice to collaborate on "nice" benchmark functionality that can be loaded as a package and directly interfaces with existing base class templates, with no need to write extra glue code specific to the benchmark.

Missing documentation in the notebook

In the forecastingdata_python notebook, it would be nice if you could add something like this:

After a fresh R installation, you need to install devtools by running install.packages("devtools") in the R command line.

By the way, thank you for this great forecasting repository and the tools you created in the GitHub repo.

Why is arima giving an error when I use a subset of NN5 dataset without missing values?

Here is the procedure I followed:

  1. Download nn5_daily_dataset_without_missing_values.tsf from Zenodo.
  2. Remove all the time series except the first and second ones from the TSF file and save it as nn5_small.tsf. The resulting TSF file contains two time series, T1 and T2.
  3. Run do_fixed_horizon_local_forecasting("nn5_small", "arima", "nn5_small.tsf", "series_name", "start_timestamp")

When I follow the above steps, I get the following error:

> do_fixed_horizon_local_forecasting("nn5_small", "arima", "nn5_small.tsf","series_name", "start_timestamp")
[1] "Started loading nn5_small"
[1] "started Forecasting"
[1] 1
[1] 2
[1] "Finished Forecasting"
Time difference of 1.184725 mins
The length of the provided data differs.
Length of holdout: 56
Length of forecast: 0
Error: Cannot proceed.
In addition: Warning messages:
1: The chosen seasonal unit root test encountered an error when testing for the first difference.
From stl(): NA/NaN/Inf in foreign function call (arg 1)
0 seasonal differences will be used. Consider using a different unit root test. 
2: The chosen seasonal unit root test encountered an error when testing for the first difference.
From stl(): NA/NaN/Inf in foreign function call (arg 1)
0 seasonal differences will be used. Consider using a different unit root test. 
3: In array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  : 'data' must be of a vector type, was 'NULL'

When I checked the result file, I realized that there is no forecast for T1.

$ cat results/fixed_horizon_forecasts/nn5_small_arima.txt 
T1
T2,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771,12.9393424036281,15.3486394557823,18.75,29.6768707482993,30.7256235827664,16.7375283446712,12.046485260771

All five other methods (ets, ses, theta, tbats, and dhr_arima) work fine.

I wonder if this is expected behaviour?

The cp1252 encoding fails on certain files

Hi,

I tried using the given Python script to load some datasets. It fails on the Wikipedia web traffic dataset due to non-Latin characters in the page titles. Should the encoding be utf-8 for all files?
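
As a stopgap on the reading side, a minimal sketch that tries UTF-8 first and falls back to cp1252; read_tsf_lines is a hypothetical helper, not part of the repository's loader:

    # Sketch: attempt UTF-8, then fall back to the Windows-1252 code page.
    def read_tsf_lines(path):
        for encoding in ("utf-8", "cp1252"):
            try:
                with open(path, "r", encoding=encoding) as f:
                    return f.readlines()
            except UnicodeDecodeError:
                continue
        raise ValueError(f"could not decode {path} as utf-8 or cp1252")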

Thanks

Unable to reproduce the results from paper

Hi,

Monash Forecasting Repository and your work is greatly appreciated. Thanks a lot for making the work reproducible.

However, I tried experimenting with a few datasets and found that I was unable to reproduce the same results for the local models such as ARIMA, ETS, and SES. I used the same script and the same lag and horizon values. Could you please let me know if I am doing something wrong? Should I change some parameters for these local models in order to attain the results reported in the paper?

These are the results I got for the COVID dataset.

                   ETS    TBATS   SES     ARIMA
My experiment      8.98   8.98    8.977   6.104
Published results  5.33   5.72    7.776   6.117

Additionally, I found that TBATS, ETS, and SES almost always show the same error. Do you have an idea why this could be the case for the COVID dataset?

data_loader.R prohibitively slow

I tried using data_loader.R on temperature_rain_dataset_with_missing_values.tsf, and it had not finished after over an hour, so I halted it.

I modified the script and it now loads in a minute or so. The speedup all comes from pre-allocating the lists the lines are read into. Would you like to incorporate these changes somehow -- via a pull request, by modifying it yourself, etc.?

License

Hi,

Thanks for making this data repository available! Could you please add a license for the code (e.g., the data loader)?

From GitHub docs:

without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work.

Thank you,
Alex

UnicodeDecodeError in data_loader.py when reading m4_monthly_dataset.tsf on Windows

I downloaded and extracted the M4 monthly dataset and started data_loader.py. I get the following error message:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-78dfe61da181> in <module>
      1 filename = 'm4_monthly_dataset.tsf'
      2 loaded_data, frequency, forecast_horizon, contain_missing_values, contain_equal_length = \
----> 3     data_loader.convert_tsf_to_dataframe("tsf_data/"+filename)
      4 
      5 print('loaded_data',loaded_data)

~\Documents\PythonScripts\Timeseries2020\MonashTSForecastingArchiv\data_loader.py in convert_tsf_to_dataframe(full_file_path_and_name, replace_missing_vals_with, value_column_name)
     28 
     29     with open(full_file_path_and_name, 'r', encoding='utf-8') as file:
---> 30         for line in file:
     31             # Strip white space from start/end of line
     32             line = line.strip()

~\miniconda3\envs\sktime\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 437: invalid start byte

If I change the encoding from utf-8 to ansi, it works:

 #with open(full_file_path_and_name, 'r', encoding='utf-8') as file:
 with open(full_file_path_and_name, 'r', encoding='ansi') as file:

I am working on Windows 10 with Python 3.9.4.
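
For what it's worth, the ansi codec is a Windows-only alias for mbcs (the system ANSI code page), so on most Western-locale Windows machines encoding="cp1252" should be the portable equivalent of the fix above.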

covariate and forecasting target

Hello! I had a question regarding the forecasting target vs. additional covariates in a dataset. For example, in the temperature_rain dataset there are some fields that, judging by the column names, I assume are the forecast target, while others are covariates. How does the tsf format distinguish between those?
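
As a starting point for investigating this, a minimal sketch that loads the file and lists the attribute columns declared in its @attribute lines; the file path is a placeholder, and the returned tuple follows the signature shown in the UnicodeDecodeError traceback above:

    # Sketch: inspect which per-series attributes a .tsf file declares.
    from data_loader import convert_tsf_to_dataframe

    (df, frequency, forecast_horizon,
     contain_missing_values, contain_equal_length) = convert_tsf_to_dataframe(
        "tsf_data/temperature_rain_dataset_with_missing_values.tsf"  # placeholder path
    )
    # Attribute columns from the file's @attribute lines, plus the series values.
    print(df.columns)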

Question regarding MASE seasonality for M4 competition datasets

Hi,

The M4 competition defined the seasonality of the weekly, daily, and hourly datasets to be 1. However, it seems that your work uses ~52 for weekly, 7 for daily, and 24 for hourly. In other words, are the mean MASE scores reported in your work different from those of the competition for these frequencies, but not for the other frequencies?
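
To make the dependence explicit, a minimal numpy sketch of MASE with the seasonality m as a parameter; this mirrors the standard definition, not necessarily the exact code used in the repository:

    # Sketch: MASE scales the forecast error by the in-sample seasonal
    # naive error, so the choice of m directly changes the score.
    import numpy as np

    def mase(train, actual, forecast, m=1):
        # Mean absolute error of the seasonal naive method on the training data.
        scale = np.mean(np.abs(train[m:] - train[:-m]))
        return np.mean(np.abs(actual - forecast)) / scale

    train = np.arange(1.0, 29.0)  # hypothetical daily training series
    actual = np.array([29.0, 30.0])
    forecast = np.array([28.5, 29.5])
    print(mase(train, actual, forecast, m=1))  # 0.5   (competition convention)
    print(mase(train, actual, forecast, m=7))  # ~0.07 (seasonal convention)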

Reproducing the results for vehicle_trips_dataset_without_missing_values

Hi, thank you for the hard work establishing this repository, collecting the datasets and releasing the code!

I would like to follow up on the discussion that we had in #11 (comment) about reproducing the results reported on https://forecastingdata.org/.

In short, I am unable to reproduce the MASE scores for vehicle_trips_dataset_without_missing_values reported in the table using the R code provided on GitHub.

               TBATS   SES     Theta   ETS     ARIMA   PR
My experiment  1.856   2.273   1.914   1.964   2.051   2.196
Published      1.860   1.224   1.244   1.305   1.282   1.212

For mean/median sMAPE/MAE (and all other non-seasonal metrics), the results for PR and SES are identical to those in the paper, but for ETS and Theta the numbers are quite different.

I did some more investigation with git blame on the file experiments/feature_functions.R, and it seems that the problem might be caused by past changes that incorrectly set the SEASONALITY_VALS value used when computing the results for the vehicle_trips dataset.

If I change SEASONALITY_VALS[[7]] <- 7 to SEASONALITY_VALS[[7]] <- 24, then both the MASE results for all models and the sMAPE/MAE scores for Theta and ETS get very close to the ones published in the appendix.


Can you please tell me if there are mistakes in my reasoning and, if so, what I should change in my setup to obtain the correct results comparable to the official table?
