datastream's People

Contributors

kostko, mitar, mstajdohar

datastream's Issues

Add an API call to regenerate all generated streams

Add an API call to regenerate all generated streams. This will be useful if we improve the implementation of operators (for example, add overflow parameters) and want to recompute things.

It could be implemented easily: we just delete all of the derived stream's datapoints, mark the stream as pending, and then call backprocess.

It is true, though, that this means datapoint values for an existing stream will change, which breaks the API contract we have (that datapoints can only be appended). Maybe it is better not to support this and simply require the user to delete the derived stream and add a new one with the same parameters. The stream's ID will change, so it will be clear that it is another stream with different datapoints. For the end-user, things will work the same, because the end-user queries streams by query tags.
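
A minimal sketch of the regenerate path under these assumptions; the helper names here (_remove_datapoints, _mark_pending, backprocess_stream) are hypothetical and not part of the current API:

    # Hypothetical sketch of regenerating one derived stream: drop its
    # datapoints, treat it as never processed, then recompute it.
    def regenerate_derived_stream(backend, stream_id):
        backend._remove_datapoints(stream_id)    # delete all derived datapoints
        backend._mark_pending(stream_id)         # mark the stream as pending
        backend.backprocess_stream(stream_id)    # recompute from source streams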

Why are we storing tags as a list?

Why are we storing tags as a list? This is really hard to work with. At the very least we should document it and show the use cases we envision, along with proposed ways to work with them.

Downsampling into the future

Why is it not possible to downsample into the future? Currently, even if you pass a future time as the until parameter to downsample, it downsamples only up to the current time, and you have to use _time_offset to really get data downsampled. I think we should allow the user to "freeze" datapoints into the future as well. This might be useful when importing.
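
For illustration, the call in question looks roughly like this (assuming an already-initialized datastream instance and a downsample_streams entry point taking until):

    import datetime

    # `datastream` stands for an already-initialized API instance. Today a
    # future `until` is effectively clamped to the current time.
    future = datetime.datetime.utcnow() + datetime.timedelta(hours=6)
    datastream.downsample_streams(until=future)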

Support for null values

We should support passing in null values. We can then use them to signal that a value should have been present but was missing, for example when we read data from nodes and a node is down. This is necessary so that we know when to connect datapoints in a graph and when not to (because there are missing values in between).

When downsampling, missing values would simply be ignored: count downsampling would not count them, average would ignore them, and so on. If some interval being downsampled has no values (or only missing values), the result of downsampling would be a null value as well.

The question is whether we store every null value or only the first one. In theory it would be enough to store only the first one, but then less real data would be available (with all nulls stored, we would know later whether values were really missing at particular times or whether monitoring was simply not running). And is storing only the first one really easy to implement? We would then have to check whether we already stored one, or set some mark on the descriptor object whenever we save a null, or something similar. Or we could simply store multiple null values and be done with it: a simple implementation (only downsampling has to be smart) and all data available.
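
A minimal standalone sketch of null-aware downsampling for one interval (not the actual operator code):

    # Sketch: downsample one interval's values while ignoring nulls. If only
    # nulls (or nothing) fall into the interval, the result is null too.
    def downsample_interval(values):
        present = [v for v in values if v is not None]
        if not present:
            return None  # only missing values: the downsampled result is null too
        return {"count": len(present), "mean": sum(present) / len(present)}

    print(downsample_interval([1.0, None, 3.0]))  # {'count': 2, 'mean': 2.0}
    print(downsample_interval([None, None]))      # None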

Auto-correlation operator

It might be interesting to have an auto-correlation operator, or some other way of finding recurring patterns.
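
A possible shape for such an operator, sketched with NumPy over a plain list of values (how it would plug into the operator machinery is left open):

    import numpy as np

    # Sketch: normalized auto-correlation of a series; peaks at non-zero lags
    # hint at recurring patterns (e.g. a daily period).
    def autocorrelation(values):
        x = np.asarray(values, dtype=float)
        x = x - x.mean()
        result = np.correlate(x, x, mode="full")[x.size - 1:]
        return result / result[0]  # lag 0 normalized to 1.0

    print(autocorrelation([1, 2, 1, 2, 1, 2, 1, 2]))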

Allow appending custom metadata to each datapoint

Allow appending custom metadata to each datapoint, so that we can store additional data which is not really used in processing, but should still be kept to make the data more useful for later analysis.

An example: when you measure packet loss by pinging a host with multiple packets, store how many packets were sent, so that packet loss can be computed.

This should be stored only on the highest-granularity stream and would not be downsampled.

This would modify append to allow adding this metadata, and get_data to take a flag for retrieving it (false by default).
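
The call shape could look roughly like this; the attributes and include_attributes keyword names are assumptions for illustration, not existing API:

    # `datastream` stands for an already-initialized API instance; stream_id,
    # granularity, start and end come from earlier calls. Keyword names below
    # are hypothetical.
    datastream.append(
        stream_id,
        0.25,                             # packet loss value
        attributes={"packets_sent": 20},  # kept only at highest granularity
    )

    datapoints = datastream.get_data(
        stream_id, granularity, start, end,
        include_attributes=True,          # hypothetical flag, default False
    )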

Update README file

Explain what the package is, what its purpose and goals are, where to ask questions, and so on.

Downsampling: random sample

I think we should introduce a new downsample function which can work on any data type: random sample. For a set of values, it chooses a random sample among them.

The only issue is that it has to be a deterministic process, so that the same samples are generated when downsampling is rerun. Maybe we should seed the random generator based on a hash of the stream's UUID?
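
A sketch of such a deterministic sampler; seeding on the stream UUID plus the interval timestamp is only one possible scheme:

    import hashlib
    import random

    # Sketch: pick one value per interval, deterministically, so reruns of
    # downsampling select exactly the same sample.
    def random_sample(values, stream_uuid, interval_timestamp):
        seed = hashlib.sha1(
            ("%s:%s" % (stream_uuid, interval_timestamp)).encode("utf-8")
        ).hexdigest()
        return random.Random(seed).choice(values)

    print(random_sample([3, 1, 4, 1, 5], "example-stream-uuid", 1357000200))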

Use Monary

Maybe use the Monary driver to read data from MongoDB and compute aggregations using NumPy.
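
Roughly this kind of flow, assuming Monary's query(db, collection, query, fields, types) call shape; the database, collection and field names below are assumptions about the storage layout, not the actual datastream schema:

    import numpy
    from monary import Monary

    monary = Monary("127.0.0.1")  # local MongoDB
    values, = monary.query(
        "datastream", "datapoints",  # database and collection (assumed names)
        {},                          # query: all datapoints
        ["v"],                       # field holding the numeric value (assumed)
        ["float64"],                 # requested NumPy dtype
    )
    monary.close()

    print(numpy.mean(values), numpy.max(values))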

Remove callback of the API

We could remove the callback from the official API. Users can always subclass the API and add code which is called after append by wrapping append. Instead, we might return from append what we are currently passing to the callback.
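
Roughly, a user who still wants a hook would wrap append in a subclass; Datastream here stands in for whatever API class users already instantiate, and this assumes append returns what used to be passed to the callback:

    # Sketch of a user-side hook replacing the built-in callback.
    class NotifyingDatastream(Datastream):
        def append(self, *args, **kwargs):
            result = super(NotifyingDatastream, self).append(*args, **kwargs)
            self.on_append(result)  # user-defined hook
            return result

        def on_append(self, result):
            print("appended:", result)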

Try/except should be moved out of operators

Currently we use try/except in operators to catch TypeError exceptions. Operator implementations should not have to care about that: they should assume they get a deserialized value as input and compute whatever they want. If there is an issue deserializing, we should catch it and warn about it somewhere else.
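
In other words, roughly this shape, where a single framework-level call site handles bad values and the operators stay plain (the function names are illustrative only):

    import warnings

    # Operators assume clean, deserialized input...
    def sum_operator(values):
        return sum(values)

    # ...and one framework-level call site catches deserialization problems.
    def apply_operator(operator, raw_values, deserialize):
        try:
            values = [deserialize(value) for value in raw_values]
        except TypeError as error:
            warnings.warn("skipping values that could not be deserialized: %s" % error)
            return None
        return operator(values)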

Make sure all operations can be run concurrently multiple times

Make sure all operations can be run concurrently multiple times. There are two main issues.

Ensuring that concurrent runs of downsampling do the expected thing (not overwriting or duplicating work). We could probably lock streams as they start being downsampled, and other runs would skip them. We should make sure streams do not stay locked indefinitely. The same goes for backprocessing of dependent streams.

Ensuring that datapoints can be appended concurrently. This mostly already works, even for processing of dependent streams. The only known issue is with the derive operator, which expects the reset stream to be processed before the data stream so that it can know whether a reset happened. Maybe we should just document this and require the user to ensure that ordering? Or should we make it work regardless of the order? The issue with the latter path is that it seems we would have to store datapoints not only when a reset happened, but also when it did not.
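
For the locking part, one possible shape with pymongo is an atomic per-stream claim with an expiry, so a crashed worker cannot hold a lock forever; the downsample_locks collection and its fields are assumptions, not part of datastream:

    import datetime
    from pymongo.errors import DuplicateKeyError

    # Sketch: claim a stream before downsampling it. A unique index on
    # stream_id makes the upsert race-safe; create_index is idempotent.
    def try_lock_stream(db, stream_id, timeout=datetime.timedelta(minutes=5)):
        db.downsample_locks.create_index("stream_id", unique=True)
        now = datetime.datetime.utcnow()
        try:
            db.downsample_locks.update_one(
                {"stream_id": stream_id, "locked_until": {"$lt": now}},
                {"$set": {"locked_until": now + timeout}},
                upsert=True,
            )
        except DuplicateKeyError:
            return False  # another run currently holds this stream
        return True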

Provide metadata about range of datapoints available in a stream

Provide metadata about the range of datapoints available in a stream at the highest granularity. Information about the earliest and latest datapoints should be provided as stream metadata.

Additionally, the latest datapoint at each granularity level should be provided as well.
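
The stream metadata could then look roughly like this (the shape and field names are only a suggestion, with made-up timestamps):

    # Suggested shape only, not the existing stream metadata format.
    stream_metadata = {
        "earliest_datapoint": "2013-01-01T00:00:00Z",  # highest granularity
        "latest_datapoint": "2013-06-30T23:59:50Z",    # highest granularity
        "latest_by_granularity": {
            "seconds": "2013-06-30T23:59:50Z",
            "minutes": "2013-06-30T23:59:00Z",
            "hours": "2013-06-30T23:00:00Z",
        },
    }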

More levels of granularity

While developing the Django interface to datastream, Nejc observed that the current granularities are too coarse for easy implementation of smooth zooming in and out of plots. For example, zooming in from hours to minutes means a 60x increase in the number of datapoints the client has to fetch to display a more detailed plot.

There should probably be an upper limit, something like 10x, between levels of granularity. So I am proposing that we extend the levels to:

  • seconds
  • 10 seconds
  • 1 minute
  • 10 minutes
  • 1 hour
  • 6 hours
  • 24 hours

Instead of the current 1+3 levels we would have 1+6. This means more storage space is needed, but easier client implementation.
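
With these levels the step between neighbouring granularities stays at or below 10x, as a quick check shows:

    # Proposed granularity levels, in seconds, and the ratio between
    # neighbouring ones.
    levels = [
        ("seconds", 1),
        ("10 seconds", 10),
        ("1 minute", 60),
        ("10 minutes", 600),
        ("1 hour", 3600),
        ("6 hours", 21600),
        ("24 hours", 86400),
    ]

    for (fine_name, fine), (coarse_name, coarse) in zip(levels, levels[1:]):
        print("zooming in from %s to %s: %dx more datapoints"
              % (coarse_name, fine_name, coarse // fine))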

Support for string/errors

I was thinking that all errors nodes report could be stored in datastream as well. We could then display them as discrete events in visualizations, which would make debugging easier. We could make a fake stream in django-datastream which reads from the relational database, or we could simply store all errors in datastream to begin with.

So I imagine support for a data type which would be a list of strings, or maybe a list of simple JSON-able objects? (You can have multiple errors at a given moment.) I would go for JSON-able objects, because then you can have more metadata about the error than just a string.

Downsampling would be simple as well: we would probably support only min, max, and sum for this data type (maybe I missed some). min is the first object (a list of one object) in the concatenation of all objects, sum is the concatenation of all objects, and max is the last object (a list of one object). Ah, and we can support count as well. Do we count the number of datapoints or the number of all objects in all datapoints?
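
Those reductions are straightforward to express over lists of JSON-able error objects; a standalone sketch with made-up error objects:

    # Each datapoint value is a list of JSON-able objects (one per error).
    datapoints = [
        [{"error": "timeout", "host": "node-1"}],
        [{"error": "refused", "host": "node-2"}, {"error": "timeout", "host": "node-2"}],
    ]

    concatenated = [obj for value in datapoints for obj in value]

    downsampled = {
        "sum": concatenated,       # concatenation of all objects
        "min": concatenated[:1],   # first object, as a list of one
        "max": concatenated[-1:],  # last object, as a list of one
        "count": len(datapoints),  # or len(concatenated), as discussed above
    }
    print(downsampled)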

Tests sometimes fail

Tests sometimes fail at exactly the same location, with 4 datapoints instead of 3. Why?

Datapoints.__getitem__ should not return a generator for a single item

It would seem that Datapoints.__getitem__ returns a generator:

    def __getitem__(self, key):
        if self.cursor is None:
            raise IndexError

        for datapoint in self.cursor.__getitem__(key):
            yield self.stream._format_datapoint(datapoint)

Doesn't this result in unexpected behavior when you only want a single index, like datapoints[0]? This returns a generator object wrapping the single item instead of the item itself! I can see how this can be useful for slices, but for a scalar index this behavior seems wrong.
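
One way to fix it would be to special-case integer keys and keep the lazy behaviour only for slices; as a bonus, the IndexError is then raised eagerly instead of only once the generator is iterated (a sketch, not a tested patch):

    def __getitem__(self, key):
        if self.cursor is None:
            raise IndexError

        if isinstance(key, slice):
            # Keep lazy formatting for slices.
            return (
                self.stream._format_datapoint(datapoint)
                for datapoint in self.cursor.__getitem__(key)
            )

        # For a scalar index, return the formatted datapoint itself.
        return self.stream._format_datapoint(self.cursor.__getitem__(key))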
