pyaf's Introduction

PyAF (Python Automatic Forecasting)


PyAF is an Open Source Python library for Automatic Forecasting built on top of popular Python data science modules: NumPy, SciPy, Pandas and scikit-learn.

PyAF works as an automated process for predicting future values of a signal using a machine learning approach. It provides a set of features that is comparable to some popular commercial automatic forecasting products.

PyAF has been developed, tested and benchmarked with Python 3.x.

PyAF is distributed under the 3-Clause BSD license.

Demo

import numpy as np, pandas as pd
import pyaf.ForecastEngine as autof

if __name__ == '__main__':
   # generate a daily signal covering one year 2016 in a pandas dataframe
   N = 360
   df_train = pd.DataFrame({"Date": pd.date_range(start="2016-01-25", periods=N, freq='D'),
                            "Signal": (np.arange(N)//40 + np.arange(N) % 21 + np.random.randn(N))})

   # create a forecast engine, the main object handling all the operations
   lEngine = autof.cForecastEngine()

   # get the best time series model for predicting one week
   lEngine.train(iInputDS=df_train, iTime='Date', iSignal='Signal', iHorizon=7);
   lEngine.getModelInfo() # => relative error 7% (MAPE)

   # predict one week
   df_forecast = lEngine.forecast(iInputDS=df_train, iHorizon=7)
   # list the columns of the forecast dataset
   print(df_forecast.columns)

   # print the real forecasts
   # Future dates : ['2017-01-19T00:00:00.000000000' '2017-01-20T00:00:00.000000000' '2017-01-21T00:00:00.000000000' '2017-01-22T00:00:00.000000000' '2017-01-23T00:00:00.000000000' '2017-01-24T00:00:00.000000000' '2017-01-25T00:00:00.000000000']
   print(df_forecast['Date'].tail(7).values)

   # signal forecast : [ 9.74934646  10.04419761  12.15136455  12.20369717  14.09607727 15.68086323  16.22296559]
   print(df_forecast['Signal_Forecast'].tail(7).values)

Also available as a jupyter notebook.

Features

PyAF allows forecasting a time series (or a signal) for future values in a fully automated way. To build forecasts, PyAF uses time information (identifying long-term evolution and periodic patterns), analyzes the past of the signal, and can exploit exogenous data (user-provided time series that may be correlated with the signal) as well as the hierarchical structure of the signal (by aggregating forecasts of spatial components, for example).

PyAF uses Pandas as a data access layer. It consumes data coming from a pandas dataframe (with time and signal columns), builds a time series model, and outputs the forecasts in a pandas dataframe. Pandas is an excellent data access layer: it can read/write a huge set of file formats, access various data sources (databases), and provides an extensive set of algorithms to handle dataframes (aggregation, statistics, linear algebra, plotting, etc.).
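
For illustration, a minimal sketch of this workflow, assuming a hypothetical CSV file my_data.csv with 'Date' and 'Signal' columns:

    import pandas as pd
    import pyaf.ForecastEngine as autof

    # any pandas-readable source works here (CSV, SQL, parquet, ...)
    df = pd.read_csv('my_data.csv', parse_dates=['Date'])

    # train a model and forecast 10 periods ahead; the result is again a pandas dataframe
    lEngine = autof.cForecastEngine()
    lEngine.train(iInputDS=df, iTime='Date', iSignal='Signal', iHorizon=10)
    df_forecast = lEngine.forecast(iInputDS=df, iHorizon=10)
    df_forecast.to_csv('my_forecast.csv', index=False)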

PyAF statistical time series models are built/estimated/trained using scikit-learn.

The following features are available:

  1. Training a model to forecast a time series (given in a pandas data-frame with time and signal columns).
    • PyAF uses a machine learning approach (the signal is cut into estimation and validation parts, respectively, 80% and 20% of the signal).
    • A time-series cross-validation can also be used.
  2. Forecasting a time series model on a given horizon (forecast result is also a pandas data-frame) and providing prediction/confidence intervals for the forecasts.
  3. Generic training features
    • Signal decomposition as the sum of a trend, periodic and AR components.
    • PyAF works as a competition between a comprehensive set of possible signal transformations and linear decompositions. For each transformed signal, a set of possible trends, periodic components and AR models is generated and all the possible combinations are estimated. The best decomposition in terms of performance is kept to forecast the signal (the performance is computed on a part of the signal that was not used for the estimation).
    • Signal transformation is supported before signal decompositions. Four transformations are supported by default. Other transformations are available (Box-Cox, etc.).
    • All models are estimated using standard procedures and state-of-the-art time series modeling. For example, trend regressions and AR/ARX models are estimated using scikit-learn linear regression models.
    • Standard performance measures are used (L1, RMSE, MAPE, MedAE, LnQ, etc.)
  4. PyAF analyzes the time variable and infers the frequency from the data.
    • Natural time frequencies are supported: Minute, Hour, Day, Week and Month.
    • Strange frequencies like every 3.2 days or every 17 minutes are supported if data are recorded accordingly (every other Monday => two weeks frequency).
    • The frequency is computed as the mean duration between consecutive observations by default (as a pandas DateOffset).
    • The frequency is used to generate values for future dates automatically.
    • PyAF does its best when dates are not regularly observed. Time frequency is approximate in this case.
    • Real/Integer valued (fake) dates are also supported and handled in a similar way.
  5. Exogenous Data Support
    • Exogenous data can be provided to improve the forecasts. These are expected to be stored in an external data-frame (this data-frame will be merged with the training data-frame); a minimal usage sketch is given after this list.
    • Exogenous data are integrated into the modeling process through their past values (ARX models).
    • Exogenous variables can be of any type (numeric, string, date or object).
    • Exogenous variables are dummified for the non-numeric types, and standardized for the numeric types.
  6. PyAF implements Hierarchical Forecasting. It follows the excellent approach used in Rob J Hyndman and George Athanasopoulos' book. Thanks @robjhyndman
    • Hierarchies and grouped time series are supported.
    • Bottom-Up, Top-Down (using proportions), Middle-Out and Optimal Combinations are implemented.
  7. The modeling process is customizable and has a huge set of options. The default values of these options should however be OK to produce a reasonable quality model in a limited amount of time (a few minutes).
    • These options give access to a full set of signal transformations and AR-like models that are not enabled by default.
    • These options enable, among others, Logit and Fisher transformations as well as XGBoost, Support Vector Regression, LightGBM and Croston intermittent-demand models.
    • By default, PyAF uses a fast mode that activates many popular models. It is also possible to activate a slow mode, in which PyAF explores all possible models.
    • Specific models and features can be customized.
  8. A benchmarking process is in place (using M1, M2, M3 competitions, NN3, NN5 forecasting competitions).
    • This process will be used to control the quality of modeling changes introduced in the future versions of PyAF. A related github issue is created.
    • Benchmarks data/reports are saved in a separate github repository.
    • Sample benchmark report with 1001 datasets from the M1 Forecasting Competition.
  9. Basic plotting functions using matplotlib with standard time series and forecasts plots.
  10. Software Quality Highlights
    • An object-oriented approach is used for the system design. Separation of concerns is the key factor here.
    • Fully written in Python with NumPy, SciPy, Pandas and scikit-learn objects. Tries to be column-based everywhere for performance reasons (respecting some modeling time and memory constraints).
    • Internally using a fit/predict pattern, inspired by scikit-learn, to estimate/forecast the different signal components (trends, cycles and AR models).
    • A test-driven approach (TDD) is used. Test scripts are available in the tests directory, one directory for each feature.
    • TDD implies that even the most recent features have some sample scripts in this directory. Want to know how to use cross-validation with PyAF? Here are some scripts.
    • Some jupyter notebooks are available for demo purposes with standard time series and forecasts plots.
    • Very simple API for training and forecasting.
  11. A basic RESTful Web Service (Flask) is available.
    • This service allows building a time series model, forecasting future values and generating some standard plots by providing a minimal specification of the signal in the JSON request body (at least a link to a csv file containing the data).
    • See this doc and the related github issue for more details.
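
As an illustration of the exogenous data support (item 5 above), here is a minimal sketch; the file names and column names are hypothetical, and the iExogenousData keyword (a tuple of the exogenous dataframe and the list of exogenous variables) is an assumption to be checked against the exogenous-data notebook:

    import pandas as pd
    import pyaf.ForecastEngine as autof

    # training data: a time column and a signal column
    df_train = pd.read_csv('sales.csv', parse_dates=['Date'])       # hypothetical file

    # exogenous data: the same time column plus extra variables (promotions, holidays, ...)
    df_exog = pd.read_csv('promotions.csv', parse_dates=['Date'])   # hypothetical file
    lExogVars = ['Promotion', 'Holiday']                            # hypothetical column names

    lEngine = autof.cForecastEngine()
    # the exogenous dataframe and variable list are passed at training time (assumed keyword)
    lEngine.train(iInputDS=df_train, iTime='Date', iSignal='Sales', iHorizon=12,
                  iExogenousData=(df_exog, lExogVars))
    df_forecast = lEngine.forecast(iInputDS=df_train, iHorizon=12)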

PyAF is a work in progress. The set of features is evolving. Your feature requests, comments, help, hints are very welcome.

Installation

PyAF has been developed, tested and used with Python 3.x.

It can be installed from PyPI for the latest official release:

pip install pyaf

The development version is also available by executing:

pip install scipy pandas scikit-learn matplotlib pydot xgboost statsmodels
pip install --upgrade git+https://github.com/antoinecarme/pyaf.git

Development

Code contributions are welcome. Bug reports, requests for new features, documentation and tests are also welcome. Please use the GitHub platform for these tasks.

You can check out the latest sources of PyAF from GitHub with the command:

git clone https://github.com/antoinecarme/pyaf.git

Project history

This project was started in the summer of 2016 as a proof of concept (POC) to check the feasibility of an automatic forecasting tool based only on the data science software available in Python (NumPy, SciPy, Pandas, scikit-learn, etc.).

See the AUTHORS.rst file for a complete list of contributors.

Help and Support

PyAF is currently maintained by the original developer. Support will be provided when possible; even if you are not creating an issue, you are encouraged to follow these guidelines.

Bug reports, improvement requests, documentation, hints and test scripts are welcome. Please use the GitHub platform for these tasks.

Please don't ask too much about new features. PyAF is only about forecasting (the last F). To keep PyAF design simple and flexible, we avoid Feature Creep.

For your commercial forecasting projects, please consider using the services of a forecasting expert near you (be it an R or a Python expert).

Documentation

An introductory notebook to the time series forecasting with PyAF is available here. It contains some real-world examples and use cases.

A specific notebook describing the use of exogenous data is available here.

Notebooks describing an example of hierarchical forecasting models are available for Signal Hierarchies and for Grouped Signals.

The python code is not yet fully documented. This is a top priority (TODO).

Communication

Comments, appreciation, remarks, etc. are welcome. Your feedback is especially appreciated if you use this library in a project or a publication.

pyaf's People

Contributors

antoinecarme, shaido987


pyaf's Issues

Add Variance stabilizing transformations

https://en.wikipedia.org/wiki/Variance-stabilizing_transformation

In applied statistics, a variance-stabilizing transformation is a data transformation that is specifically chosen either to simplify considerations in graphical exploratory data analysis or to allow the application of simple regression-based or analysis of variance techniques.

The aim behind the choice of a variance-stabilizing transformation is to find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value.
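
For illustration only (this is not PyAF's API), a minimal sketch of variance stabilization with the log and Box-Cox transforms from SciPy:

    import numpy as np
    from scipy import stats

    # a signal whose variability grows with its level (multiplicative noise)
    rng = np.random.RandomState(0)
    x = np.exp(np.linspace(0.0, 3.0, 200)) * (1.0 + 0.1 * rng.randn(200))

    # log transform: stabilizes the variance when the noise is multiplicative
    y_log = np.log(x)

    # Box-Cox transform: estimates the exponent from the data (lambda = 0 reduces to log)
    y_boxcox, lmbda = stats.boxcox(x)
    print("estimated Box-Cox lambda:", lmbda)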

pyAF_introduction stalling during training?

The introductory notebook hangs calling lEngine.train(ozone_dataframe, )

(screenshot: jupyter notebook hanging during training)

I am investigating...

ipykernel==4.5.2
ipython==5.3.0
ipywidgets==6.0.0
jupyter==1.0.0
jupyter-client==5.0.0
jupyter-console==5.1.0
jupyter-core==4.3.0
Keras==1.2.2
matplotlib==2.0.0

Improve Plots for Hierarchical Models

Plots that are generated for hierarchical models are too elementary.

Add more significant annotation for all the hierarchy nodes:

  1. MAPE for the node model.
  2. Top-Down average proportions for the edges.
  3. Use a specific color for each level of the hierarchy.
  4. Other annotations ?

Failure in an artificial test

After updating #6, the following test fails:

python3 tests/artificial/transf_log/trend_constant/cycle_12/ar_12/test_artificial_1024_log_constant_12_12_100.py

Add LnQ Performance measure

According to :

https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error

A limitation to SMAPE is that if the actual value or forecast value is 0, the value of error will boom up to the upper-limit of error. (200% for the first formula and 100% for the second formula).

Provided the data are strictly positive, a better measure of relative accuracy can be obtained based on the log of the accuracy ratio: log(Ft / At) This measure is easier to analyse statistically, and has valuable symmetry and unbiasedness properties. When used in constructing forecasting models the resulting prediction corresponds to the geometric mean (Tofallis, 2015).
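
A minimal sketch of a measure based on the log accuracy ratio (one plausible definition of LnQ as the sum of squared log ratios; the definition actually retained in PyAF may differ):

    import numpy as np

    def lnq(actual, forecast):
        # both arrays must be strictly positive
        log_ratio = np.log(np.asarray(forecast, dtype=float) / np.asarray(actual, dtype=float))
        return np.sum(log_ratio ** 2)

    # example: lnq([100.0, 110.0, 120.0], [98.0, 115.0, 118.0])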

Store version information in Model metadata

In order to analyze issues accurately, it is necessary to know the exact version of each component used.

  • system (uname -a ?)
  • Python version
  • scikit-learn
  • Pandas
  • NumPy
  • SciPy
  • matplotlib
  • SQLAlchemy

These versions should be stored in the model (along with the model training date, etc.).
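
A minimal sketch of how these versions could be collected (attaching the resulting dictionary to the trained model is hypothetical; the attribute name is not part of the current API):

    import platform
    import numpy, scipy, pandas, sklearn, matplotlib

    def get_version_info():
        # collect the versions of the system and of each component used for training
        return {
            "system": platform.platform(),
            "python": platform.python_version(),
            "numpy": numpy.__version__,
            "scipy": scipy.__version__,
            "pandas": pandas.__version__,
            "scikit-learn": sklearn.__version__,
            "matplotlib": matplotlib.__version__,
        }

    # hypothetical usage: store get_version_info() in the model metadata at training time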

Exception Handling

To make error handling easier, PyAF API calls should all raise the same kind of exception (PyAF_Error) or a subclass of it.
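
A minimal sketch of such an exception hierarchy (the base class name is taken from this issue; the subclass is hypothetical):

    class PyAF_Error(Exception):
        """Base class for all exceptions raised by PyAF API calls."""
        pass

    class PyAF_InvalidSignalError(PyAF_Error):
        """Hypothetical subclass: raised when the input signal cannot be modeled."""
        pass

    # API calls would then surface any internal failure as, e.g.:
    #     raise PyAF_Error("training failed for signal 'Ozone'")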

User Guide

At least

  1. API documentation in Python code
  2. Some examples

Perform signal transformation in a uniform way

Some signal transformations need nothing, while others need the signal to be positive (log = boxcox(0)).
Some apply a scaling/translation to achieve invariance; this needs to be applicable to all transformations (new option).

Artificial dataset test failure

The following test fails :

tests/artificial/transf_/trend_poly/cycle_7/ar_/test_artificial_1024__poly_7__20.py

Fails with the exception :

ValueError: shapes (1012,1030) and (1000,) not aligned: 1030 (dim 1) != 1000 (dim 0)

This seems to be a scikit-learn usage error.

Improve numerical stability

On the Heroku platform, there are some very tiny differences in the dates generated by PyAF. This may impact the extraction of exogenous data.

Add a document about plotting features of PyAF

PyAF has an API call lEngine.standardPlots(). It gives some classical plots (signal against forecast, residuals, trends, cycles, AR).

All the plots are generated with matplotlib.

Document the plots generated.

The REST service (issue #20) also gives the same plots in a png/base64 encoding; these should be documented too.
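
A minimal usage sketch (the output-prefix argument and the generated file names are assumptions to be confirmed in the documentation):

    # after training (see the Demo section):
    lEngine.standardPlots('outputs/ozone')   # assumed to write the standard plots as PNG files with this prefix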

Add a RAML doc for the RESTful API

https://en.wikipedia.org/wiki/RAML_%28software%29

RESTful API Modeling Language (RAML) is a YAML-based language for describing RESTful APIs.[2] It provides all the information necessary to describe RESTful or practically-RESTful APIs. Although designed with RESTful APIs in mind, RAML is capable of describing APIs that do not obey all constraints of REST (hence the description "practically-RESTful"). It encourages reuse, enables discovery and pattern-sharing, and aims for merit-based emergence of best practices.

Investigate TensorFlow usage with PyAF

Following #12, only Keras with the Theano backend was tested on a 24-core machine (HP Z600).

Try to configure TensorFlow (with or without an NVIDIA GPU) on this machine and perform the same tests.

Investigate GPU usage with PyAF

Following #12, only Keras with the Theano backend was tested on a 24-core machine (HP Z600).

Try to configure an NVIDIA GPU on this machine and perform the same tests.

MAPE, sMAPE computations are slow

PyAF protects these indicators against zero values in the signal:

def protect_small_values(self, signal, estimator):
    # drop the observations where the signal is (almost) zero before computing MAPE/sMAPE
    eps = 1.0e-13
    keepThis = (np.abs(signal) > eps)
    signal1 = signal[keepThis]
    estimator1 = estimator[keepThis]
    # self.dump_perf_data(signal, signal1)
    return (signal1, estimator1)

This is not necessary.

An approximation is better:

rel_error = np.abs(estimator - signal) / (np.abs(signal) + eps)
MAPE = np.mean(rel_error)

Installation should be easier.

The README file currently says that it is necessary to clone the github repository. This is not required.

A simple pip install should be enough, e.g.:

pip install git+git://github.com/antoinecarme/pyaf.git

A lot of imports need to be updated (the fully qualified name of a class should start with pyaf., etc.).

Support Date types

antoine@z600:~/dev/python/packages/pyaf$ ipython3 tests/bench/test_yahoo.py
ACQUIRED_YAHOO_LINKS 4818
YAHOO_DATA_LINK AAPL https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/YahooFinance/nasdaq/yahoo_AAPL.csv
YAHOO_DATA_LINK GOOG https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/YahooFinance/nasdaq/yahoo_GOOG.csv
load_yahoo_stock_prices my_test 2
BENCH_TYPE YAHOO_my_test OneDataFramePerSignal
BENCH_DATA YAHOO_my_test <pyaf.Bench.TS_datasets.cTimeSeriesDatasetSpec object at 0x7fdc9c5f7fd0>
TIME : Date N= 1246 H= 12 HEAD= ['2011-07-28T00:00:00.000000000' '2011-07-29T00:00:00.000000000'
'2011-08-01T00:00:00.000000000' '2011-08-02T00:00:00.000000000'
'2011-08-03T00:00:00.000000000'] TAIL= ['2016-07-05T00:00:00.000000000' '2016-07-06T00:00:00.000000000'
'2016-07-07T00:00:00.000000000' '2016-07-08T00:00:00.000000000'
'2016-07-11T00:00:00.000000000']
SIGNAL : GOOG N= 1246 H= 12 HEAD= [ 610.941019 603.691033 606.771021 592.40099 601.171059] TAIL= [ 694.950012 697.77002 695.359985 705.630005 715.090027]
GOOG Date
0 610.941019 2011-07-28
1 603.691033 2011-07-29
2 606.771021 2011-08-01
3 592.400990 2011-08-02
4 601.171059 2011-08-03

Tuning Keras Models

Following #12, MLP and LSTM Keras models with the Theano backend were tested on a 24-core machine (HP Z600).

These models may need some tuning (RNN architecture improvements).

Some validation on artificial datasets is also needed.

Avoid unnecessary failures

Sometimes the signal is too small/easy to forecast. PyAF fails when the signal has only one row!

The goal here is to make PyAF as robust as possible against very small/bad datasets.
PyAF should automatically produce reasonable/naive/trivial models in these cases.
It should not fail in any case (normal behavior is expected; this is useful in an M2M context).
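
For illustration only (not PyAF code), a minimal sketch of the kind of trivial fallback model described above, repeating the last observed value:

    import pandas as pd

    def naive_forecast(df, time_col, signal_col, horizon, freq='D'):
        # trivial fallback for degenerate datasets: repeat the last observed value
        last_date = df[time_col].iloc[-1]
        last_value = df[signal_col].iloc[-1]
        future_dates = pd.date_range(start=last_date, periods=horizon + 1, freq=freq)[1:]
        return pd.DataFrame({time_col: future_dates, signal_col + '_Forecast': last_value})

    # a robust engine could switch to such a model when the dataset is too small to estimate anything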

Artificial dataset test failure - Warning about singular matrix

A second group of failures. The following warning is raised :

logs/test_artificial_1024_cumsum_constant_5__20.log:UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.

sample script :

tests/artificial/transf_cumsum/trend_constant/cycle_5/ar_/test_artificial_1024_cumsum_constant_5__20.py

Benchmarking Process

Need to run a benchmarking process to review the current state of PyAF.

As a first step, we will see this as a sanity check (correcting some bugs here and there ;).

In a second step, a report will be generated with performance figures.

Try to reduce memory usage

A lot of pandas dataframes are created internally at each step of the signal decomposition process.
Try to get rid of unnecessary dataframes.

Some memory profiling is also welcome.

Investigate IoT Time Series Applications

At least check the possibility of using PyAF in this context.
PyAF is not aware of the data source type (time series database, web service, etc.) as long as the dataset is stored in a pandas dataframe.
Is there a link with hierarchical models?
A jupyter notebook with a real example is welcome.

Document the Algorithmic Details of PyAF

Need a document to describe the algorithmic aspects of time series forecasting in PyAF.

  1. The overall algorithm
  2. The detail of the signal decomposition
  3. The machine learning aspects
  4. Advanced usage/control of the algorithms.
  5. Hierarchical forecasting.

Signal Transformation Computation is too slow

We copy the dataset and use a Python-level loop over the whole dataset, which is too slow:

def specific_invert(self, df):
    # invert the cumulative-sum transformation by differencing, one row at a time
    df_orig = df.copy()
    df_orig.iloc[0] = df.iloc[0]
    for i in range(1, df.shape[0]):
        df_orig.iloc[i] = df.iloc[i] - df.iloc[i - 1]
    return df_orig

There is room for improvement.
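
A minimal vectorized sketch (assuming the same semantics as the loop above and a pandas Series/DataFrame input): compute first differences with diff() and keep the first value unchanged.

    def specific_invert(self, df):
        # same result as the loop above, without iterating row by row
        df_orig = df.diff()
        df_orig.iloc[0] = df.iloc[0]
        return df_orig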

Add Timing reports for all operations

There is already a log for 'fit/train' with

INFO:pyaf.std:END_TRAINING_TIME_IN_SECONDS 'Ozone' 3.291870594024658

Add the same kind of logging for 'predict/forecast' and 'plot'.
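
A minimal sketch of the requested logging, mirroring the existing training message (the exact log label is an assumption):

    import time, logging

    logger = logging.getLogger('pyaf.std')

    start = time.time()
    df_forecast = lEngine.forecast(iInputDS=df_train, iHorizon=7)
    # assumed label, mirroring END_TRAINING_TIME_IN_SECONDS
    logger.info("END_FORECAST_TIME_IN_SECONDS '%s' %s", 'Signal', time.time() - start)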
