
aika

Introduction

Aika is a project born out of the desire to make working with time series, and in particular doing time series research, as painless as possible. As a rule, time series computations form a DAG; however, when running such computations "live" we generally have a complex graph that we want to run up to "now". Doing so brings in several quite complex requirements:

  • Incremental running: we generally do not want to recompute the entire history, so we need to identify how much to run.
  • Look back: when running incrementally, we may need data from a considerable lookback on the inputs to calculate a new entry right now, as with, for example, an exponential moving average.
  • Completeness: data may not be available for some inputs at the expected time. In that case, if we simply run all the nodes we may incorrectly evaluate new rows, when the desired behaviour is to wait until that data is ready.

These requirements represent quite complex use cases and are vital to the correct running of time series systems; however, they are also hard to get right. There are further requirements that such a system might have: distributed computing is one example. A graph with thousands of nodes in somewhat parallel pipelines will benefit from parallelisation, but this requires understanding when parents are complete.

The goal of aika is, as far as possible, to abstract away these three concerns and make it possible for researchers to think only of writing simple Python functions, running them, and little by little building up a reliable graph that can easily be transferred into a production setting.

Notebooks

The project includes some notebooks that introduce the essential parts of aika: in particular, how to create tasks, chain them together, run them, and view their outputs. There is at this time no exhaustive user guide beyond these notebooks. To access them, simply clone the repository and install the project-level requirements file. This should give a working distribution on most systems.

Developer Notes

Repo structure

All libraries should be defined in their own sub-folder under the libs/ directory.

Dependencies for each package should be declared inline in the install_requires list of setup.py.

Note that because we are creating different packages and distributions under the same namespace, there must be no __init__.py under src/aika; the first __init__.py should appear only under src/aika/<package>. Otherwise the modules will not be importable. See the Python documentation on namespace packages.
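
For example, a single library under this convention might be laid out as follows (the package name my_library is purely illustrative):

    libs/
        my_library/
            setup.py
            src/
                aika/              <- no __init__.py at this level
                    my_library/
                        __init__.py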

Compiling dependencies

The requirements.txt file at the top level contains all the dependencies for all the packages in the repo.

This file can be updated by running the compile_requirements.sh script. This should be done if any of the following apply:

  • You have created a new library under libs/
  • You have added, removed, or changed the version specifier of any dependency in an existing library
  • You wish to update dependencies to pick up recent releases

Note that this requirements file is intended only for purposes such as running the tests in a consistent way. This code is intended to be used as a collection of libraries, and any actual restrictions on versions must be specified in the setup.py of the individual libraries in the usual way. The top-level file is only a convenience for developers.

User notes

The datagraph currently supports parameters that are one of:

  • Python primitives such as numbers and strings.
  • Tuples.
  • Frozen dicts.

For user ease, dicts are converted into frozen dicts and lists are converted into tuples upon dataset creation. That means a dataset with a list parameter will be considered equal to one with a tuple parameter containing the same values.
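
A minimal sketch of the conversion described above, assuming a helper along these lines (the name _freeze is illustrative, not aika's actual code):

    def _freeze(value):
        # Dicts become hashable frozen mappings and lists become tuples, so
        # parameter equality ignores the original container type.
        if isinstance(value, dict):
            return frozenset((k, _freeze(v)) for k, v in value.items())
        if isinstance(value, (list, tuple)):
            return tuple(_freeze(v) for v in value)
        return value

    # A list parameter and a tuple parameter with the same values are equal:
    assert _freeze({"window": [1, 2, 3]}) == _freeze({"window": (1, 2, 3)})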

Note that this is a decision about the persistence layer: tasks themselves can be parameterised with lists and tuples, and those are not converted. This is especially important when working with pandas, which treats lists and tuples differently in indexing operations.

We will add further parameter types in the future, including sets and arbitrary hashable Python objects.

aika's People

Contributors

phil20686, domkennedy, benwestfield, benhc

aika's Issues

Causal Dataset Generator

Following on from creating pipelines, we need to create dataset generators to "feed" the pipelines. These are designed for time series research and as such should generate causally correct dataset objects with aligned time series indexes suitable for supervised learning. They should have a variety of options for how to "step" through the data, producing overlapping or non-overlapping datasets.
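
A minimal sketch of the "stepping" behaviour, assuming a pandas index (the function step_windows is illustrative and not part of aika):

    import pandas as pd

    def step_windows(index: pd.DatetimeIndex, window: int, overlap: bool):
        # Yield successive windows of the index, either overlapping
        # (stride of one step) or non-overlapping (stride of one window).
        stride = 1 if overlap else window
        for start in range(0, len(index) - window + 1, stride):
            yield index[start : start + window]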

ML Pipelines

Bivariate pipelines are incredibly powerful. Most open source platforms (e.g. sklearn) only support univariate pipelines, but we want pipelines that have both an X and a y, or data and response datasets, simultaneously. The critical parts of this are:

  1. To construct an interface for fit/transform-type logic that both receives and returns a dataset object with both an X and a y, as sketched below.
  2. To build a wrapper for at least XGBoost and linear-regression-type models in sklearn.
  3. To make sure that all fitted models, transformers, and pipelines are pickle-able, so that learned pipelines can be easily stored.
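
A rough sketch of what such a bivariate interface might look like; the class and attribute names here are assumptions for illustration, not aika's actual API:

    import pandas as pd

    class XYDataset:
        # Carries the features (X) and the response (y) together.
        def __init__(self, X: pd.DataFrame, y: pd.Series):
            self.X = X
            self.y = y

    class BivariateTransformer:
        # Both fit and transform receive a dataset holding X and y.
        def fit(self, data: XYDataset) -> "BivariateTransformer":
            raise NotImplementedError

        def transform(self, data: XYDataset) -> XYDataset:
            raise NotImplementedError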

Joblib Runner

The intended usage pattern is that people define their graph in code, then select a subset of end points; we call this list of tasks a "Graph". A runner should run the graph by doing, roughly, the following steps:

  1. Scan through the parents recursively, checking their completeness and stopping where a task is complete, i.e. constructing the minimum subset of tasks that must be computed.
  2. Run these tasks in order.
  3. Avoid the edge case where a previously complete task becomes incomplete while the graph is running. For example, if you have a completeness checker with a one-minute offset and the graph takes more than one minute to run, there must be sensible behaviour in this case.

It would be nice to have a version of this where tasks are farmed out to multiple processes via joblib.
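
A rough sketch of such a runner using joblib, under the assumption that the graph is available as a mapping from each task to its parents (the names here are illustrative, not aika's actual classes):

    from joblib import Parallel, delayed

    def run_graph(tasks, parents, run_task, n_jobs=4):
        # tasks: iterable of task ids; parents: dict mapping a task id to the
        # set of its parent ids; run_task: callable that executes one task.
        done, remaining = set(), set(tasks)
        while remaining:
            # Every task whose parents are all complete can run in parallel.
            ready = [t for t in remaining if parents.get(t, set()) <= done]
            if not ready:
                raise RuntimeError("Cycle or missing parent in the task graph")
            Parallel(n_jobs=n_jobs)(delayed(run_task)(t) for t in ready)
            done.update(ready)
            remaining.difference_update(ready)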

Add validation method

The intention is that tasks should validate their output data before writing it, so tasks need a validate method that is called prior to writing. Validation should be optional, but it should essentially be possible to pass in any callable. Any exception from this callable should be caught and re-raised with a message indicating that it is a validation error in the relevant task.
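
A minimal sketch of how the hook might behave (ValidationError and write_with_validation are hypothetical names used only for illustration):

    class ValidationError(Exception):
        pass

    def write_with_validation(task_name, data, write_output, validator=None):
        # Run the optional validator before persisting; wrap any failure so
        # it is clearly reported as a validation error for the given task.
        if validator is not None:
            try:
                validator(data)
            except Exception as err:
                raise ValidationError(
                    f"Validation failed in task {task_name!r}"
                ) from err
        write_output(data)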

Graph class does not deal correctly with disconnected nodes

Since we are using nx.DiGraph underneath, and it is initialised from edges, nodes with no edges at all do not get run, since they are effectively hidden from the graph. We can easily avoid this by checking for nodes which do not appear in the edges.
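
One possible fix, sketched with networkx (the function name is illustrative): add all nodes explicitly after the edges, so that edge-free tasks remain visible to the graph.

    import networkx as nx

    def build_graph(all_tasks, edges):
        graph = nx.DiGraph()
        graph.add_edges_from(edges)      # adds every node that appears in an edge
        graph.add_nodes_from(all_tasks)  # also keeps tasks with no edges at all
        return graph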

Build Tools require pip-tools, which is not installed by default

Because of the set -e flag at the top of the script, errors are suppressed, so it took me a fair while to realise that the failure was due to pip-tools not being installed, meaning that no pip-compile command was available in the shell. It feels like maybe we should add a line to the compile script to pip-install the script's own dependencies?

Bug where scan doesn't find datasets

There is a bug where scan was not finding the datasets, due to not normalising the parameters consistently with the file-system backends. For nested parameters, such as dictionaries inside dictionaries, the ordering matters when evaluating equality, since it leads to different hashes in some cases.

add scan method

We need to add a method to the engines that provides a list of the metadata objects for all stored items meeting a certain designation. This should be a function on IPersistenceEngine with the signature:

def scan(self, name: str, params: Dict[str, Any]) -> List[DatasetMetaDataStub]:

where, for example, an entry in params of foo=3 means it would return only metadata stubs whose parameter foo equals 3.
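
A naive in-memory illustration of the intended matching semantics, using plain dicts in place of DatasetMetaDataStub objects:

    from typing import Any, Dict, List

    def scan_stubs(
        stored: List[Dict[str, Any]], name: str, params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        # Keep stubs with the requested name whose parameters include every
        # requested key/value pair.
        return [
            stub
            for stub in stored
            if stub["name"] == name
            and all(stub["params"].get(k) == v for k, v in params.items())
        ]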

Improve calendar support for holidays

We want to improve calendar support for holidays for completeness checking. It should be easy to create a completeness checker from a time of day and a cday (custom business day) such that it understands holidays.
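
One way to back such a checker with pandas, as a sketch (aika's actual completeness-checker API is not shown here):

    import pandas as pd
    from pandas.tseries.holiday import USFederalHolidayCalendar
    from pandas.tseries.offsets import CustomBusinessDay

    cday = CustomBusinessDay(calendar=USFederalHolidayCalendar())
    # Expected data timestamps: each business day, holidays skipped, at 16:30.
    expected = pd.date_range("2024-01-01", "2024-01-31", freq=cday) + pd.Timedelta(
        hours=16, minutes=30
    )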

MongoBackEnd with FileStore

Currently the mongo backend has to use GridFS to store large dataframes, since BSON has a hard 16MB size limit for a single document. GridFS is not transaction aware and its updates are not atomic. It would be good to have a persistence engine where the metadata lives in mongo but the actual data is written to a file; atomic file writes provide a nice backend for writing, and this allows virtually unlimited data storage.
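
A sketch of the atomic file write this relies on (write to a temporary file, then rename into place); the helper name is illustrative:

    import os
    import tempfile

    def atomic_write_bytes(path: str, payload: bytes) -> None:
        # os.replace is atomic, so readers never observe a partially written file.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(payload)
            os.replace(tmp_path, path)
        finally:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)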

Configure checks for hash-collision

Since the data is always written exclusively by the graph runner, we know that we can always detect a hash collision before it propagates to children. In essence, any modification of a dataset, such as an append, must search by hash and then check that all of the parameters, including parental hashes, match. This check guarantees that the hash-based search has found the correct dataset. Note that in the event of a hash collision we must simply error out; the chance of this in intended use cases is of order 10^-10. We might look at md5 hashes for even more security. If a user ever came across a collision, a one-character change to the dataset name should fix it.
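
A rough sketch of the check described above; engine.find_by_hash and the metadata attributes are hypothetical names used only for illustration:

    def find_dataset_checked(engine, expected_metadata):
        # Look up by hash, then confirm the full parameters (including
        # parental hashes) match before trusting the result.
        candidate = engine.find_by_hash(expected_metadata.hash)
        if candidate is None:
            return None
        if candidate.params != expected_metadata.params:
            # A genuine hash collision: error out, as described above.
            raise RuntimeError(
                "Hash collision detected; rename the dataset to work around it."
            )
        return candidate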

add "find" method

We need a method on the persistence engine that will find all the data "node names" matching a regex pattern.

Map-Reduce type behaviour

Block bootstrapping is the use case that I have in mind here. Suppose that you want to fit a model on 100 block-bootstrapped time series derived from a single dataset. On each of these 100 datasets we then need to perform three tasks in sequence, and then we want tasks that "see" all 100 of the final outputs and can do some type of aggregation on them. For example, suppose they are returns and we want the aggregation to return the distribution of Sharpe ratios across the 100 datasets. We currently require that the python functions have named parameters, so we would have to create a function with 100 explicitly named parameters and then create a normal task out of that function. This will work, but it is obviously very painful. It would be better to have one function with a single argument that represents a list of input datasets/tasks; this seems like it should be possible in our setup.
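
A hypothetical sketch of the desired aggregation step, written as a single function taking a list of inputs (none of these names are aika's API, and the factor of 252 assumes daily returns):

    from typing import List

    import numpy as np
    import pandas as pd

    def sharpe_distribution(bootstrap_returns: List[pd.Series]) -> pd.Series:
        # One function receiving all bootstrap outputs at once and returning
        # the distribution of annualised Sharpe ratios across them.
        sharpes = [np.sqrt(252) * r.mean() / r.std() for r in bootstrap_returns]
        return pd.Series(sharpes, name="sharpe")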

Names in time_series_task args are too common

There are a number of special arguments in time_series_task that are too common, and if they conflict with the names of arguments in the functions it is wrapping then it is a problem. For example, no function with the argument "name" can currently be run, because that is one of the reserved arguments for the task name. We should make these less common by prepending aika_ or task_ etc. as appropriate.
