wlanslovenija / datastream

Datastream API provides a powerful and unified Python API for time-series data.
Home Page: http://datastream.readthedocs.org/
License: Other
Add an API call to regenerate all generated streams. This will be useful if we improve the implementation of operators (for example, add overflow parameters) and want to recompute things.
It could be easily implemented: we just delete all derived stream datapoints, mark the streams as pending, and then call backprocess.
It is true, though, that this means datapoint values for an existing stream would change. That breaks the API contract we have (that datapoints can only be appended). Maybe it is better not to support this and simply require the user to delete the derived stream and add a new one with the same parameters. The stream's ID will change, so it will be clear that it is a different stream with different datapoints. For the end user things will work the same, because end users query streams by tags.
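The delete/mark-pending/backprocess flow can be sketched in a few lines. This is a minimal in-memory model with hypothetical DerivedStream and regenerate names, not the real datastream internals:

```python
# Minimal in-memory sketch of the proposed "regenerate" call; the class and
# function names here are assumptions for illustration only.

class DerivedStream:
    def __init__(self, source, operator):
        self.source = source          # input datapoints of the source stream
        self.operator = operator      # function: source datapoints -> derived datapoints
        self.datapoints = []
        self.pending = False

    def backprocess(self):
        # Recompute all derived datapoints from the source stream.
        self.datapoints = self.operator(self.source)
        self.pending = False

def regenerate(stream):
    # Proposed API call: drop derived datapoints, mark the stream as
    # pending, then backprocess it from its source.
    stream.datapoints = []
    stream.pending = True
    stream.backprocess()

# Example: a derived stream that sums consecutive pairs of source values.
source = [1, 2, 3, 4]
stream = DerivedStream(source, lambda xs: [a + b for a, b in zip(xs, xs[1:])])
regenerate(stream)
print(stream.datapoints)  # [3, 5, 7]
```

Note that after regenerate the derived values may differ from what was previously stored, which is exactly the append-only contract concern raised above.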
Why are we storing tags as a list? This is really hard to work with. At the very least we should document this and show the use cases we envision, along with proposed ways of working with them.
Why is it not possible to downsample into the future? Currently, even if you give an until parameter to downsample with a future time, it downsamples only up to the current time, and you have to use _time_offset to really get it downsampled. I think we should allow the user to "freeze" datapoints into the future as well. This might be useful when importing.
We should support passing in null values. We can then use them to signal that a value should have been there but was missing. For example, when we read data from nodes and a node is down. This is necessary so that we know when to connect datapoints in a graph and when not to (because there are missing values in between).
When downsampling, missing values would simply be ignored: count downsampling would not count them, average would ignore them, and so on. If some interval has no values (or only missing values), the result of downsampling would be a null value as well.
The question is, do we store every null value or only the first one? In theory it would be enough to store only the first one, but then less real data would be available (we could not tell later whether there were times when values were really missing or whether monitoring was simply not running). And is storing only the first one really easy to implement? We would then have to check whether we already stored one. Or, when saving a null, we could set some mark on the descriptor object? Or we could simply store multiple null values and be done with it: a simple implementation (only downsampling has to be smart) with all data available.
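The null-aware downsampling semantics described above can be sketched directly, assuming datapoint values arrive as a plain list where None marks a missing measurement:

```python
# Sketch of null-aware downsampling: missing values (None) are ignored, and
# an interval with no real values downsamples to null.

def downsample_count(values):
    # count ignores missing values entirely
    present = [v for v in values if v is not None]
    return len(present) if present else None

def downsample_mean(values):
    present = [v for v in values if v is not None]
    if not present:
        # an interval with no (or only missing) values yields a null value
        return None
    return sum(present) / len(present)

print(downsample_count([1, None, 3]))  # 2
print(downsample_mean([1, None, 3]))   # 2.0
print(downsample_mean([None, None]))   # None
```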
It might be interesting to have an auto-correlation operator, or some other way of finding recurring patterns?
Allow appending custom metadata to each datapoint, so that we can store additional data which should not really be used, but should still be kept to make the data more useful for later analysis.
An example: when you measure packet loss by pinging a host with multiple packets, store how many packets were sent to compute the packet loss.
This metadata should be stored only on the highest-granularity stream and would not be downsampled.
This would modify append to allow adding the metadata, and get_data to take a flag for retrieving it (false by default).
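The proposed signatures could look roughly like this. The include_metadata flag name and the in-memory Stream class are assumptions for illustration, not the current datastream API:

```python
# Hypothetical sketch of per-datapoint metadata support; append/get_data
# signatures below are assumptions, not the existing datastream API.

class Stream:
    def __init__(self):
        self._datapoints = []

    def append(self, value, metadata=None):
        # metadata is kept only at the highest granularity and would never
        # be downsampled
        self._datapoints.append({'v': value, 'm': metadata})

    def get_data(self, include_metadata=False):
        if include_metadata:
            return list(self._datapoints)
        return [{'v': dp['v']} for dp in self._datapoints]

stream = Stream()
# packet loss of 25%, with the number of packets sent kept as metadata
stream.append(0.25, metadata={'packets_sent': 4})
print(stream.get_data())                       # [{'v': 0.25}]
print(stream.get_data(include_metadata=True))  # [{'v': 0.25, 'm': {'packets_sent': 4}}]
```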
Explain what the package is, what its purpose and goals are, where to ask questions, and so on.
I think we should introduce a new downsample function which can work on any data type: random sample. For a set of values, it chooses a random sample among them.
The only issue is that it has to be a deterministic process, so that the same samples are generated when downsampling is rerun. Maybe we should seed the random generator based on a hash of the stream's UUID?
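The deterministic seeding could work along these lines. Seeding from both the stream's UUID and the interval key (an assumption here) keeps the sample stable across reruns while still varying between intervals:

```python
import hashlib
import random

# Sketch of a deterministic "random sample" downsampler: the RNG is seeded
# from the stream's UUID plus the downsampling interval, so rerunning
# downsampling picks the same sample. Names are illustrative assumptions.

def random_sample(values, stream_uuid, interval_key):
    seed_material = '%s:%s' % (stream_uuid, interval_key)
    seed = int(hashlib.sha1(seed_material.encode('utf-8')).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.choice(values)

values = ['a', 'b', 'c', 'd']
first = random_sample(values, 'stream-uuid', '2014-01-01T00:00')
second = random_sample(values, 'stream-uuid', '2014-01-01T00:00')
print(first == second)  # True: rerunning yields the same sample
```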
Check and update documentation.
Maybe use the Monary driver to read data from MongoDB and compute aggregations using NumPy.
We could remove the callback from the official API. Users can always inherit from the API and add code which is called after append by wrapping append. We might instead return from append what we are currently passing to the callback.
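The wrapping pattern would look like this. The base Datastream class and its return value below are hypothetical stand-ins for the real API:

```python
# Sketch of dropping the callback: users subclass and wrap append. The base
# class here is a hypothetical stand-in whose append returns what used to
# be passed to the callback.

class Datastream:
    def append(self, stream_id, value):
        # ... store the datapoint ...
        # return what was previously passed to the callback
        return {'stream_id': stream_id, 'value': value}

class NotifyingDatastream(Datastream):
    def __init__(self):
        self.notifications = []

    def append(self, stream_id, value):
        result = super(NotifyingDatastream, self).append(stream_id, value)
        # user code that used to live in the callback goes here
        self.notifications.append(result)
        return result

ds = NotifyingDatastream()
ds.append('uuid-1', 42)
print(ds.notifications)  # [{'stream_id': 'uuid-1', 'value': 42}]
```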
In the summary on PyPI, it says documentation is available at http://datastream.readthedocs.org/. When I actually go there, I get a 404 error page.
Add an expected count of database queries to tests and fail if there is a mismatch. This way we can make sure we do not introduce additional queries by mistake.
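One way to sketch this is a context manager around a query counter. The Backend class and its counter are assumptions standing in for whatever instrumentation the real backend would expose:

```python
import contextlib

# Sketch of asserting an expected query count in tests, assuming a backend
# that can report how many queries it has issued (a hypothetical counter).

class Backend:
    def __init__(self):
        self.query_count = 0

    def query(self, q):
        self.query_count += 1
        return []

@contextlib.contextmanager
def assert_num_queries(backend, expected):
    before = backend.query_count
    yield
    actual = backend.query_count - before
    assert actual == expected, 'expected %d queries, got %d' % (expected, actual)

backend = Backend()
with assert_num_queries(backend, 2):
    backend.query('find datapoints')
    backend.query('find metadata')
```

A test using this fails as soon as a code change issues more (or fewer) queries than the recorded expectation.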
Currently we use try/except in operators to catch TypeError exceptions. Operator implementations should not have to care about that. They should assume they are getting a deserialized value as input and compute whatever they want. If there is some issue with deserializing, we should catch it and warn somewhere else.
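The separation could look like this. The deserialize function below is a placeholder for datastream's real deserialization logic, and the missing-value convention is an assumption:

```python
import warnings

# Sketch of moving deserialization error handling out of operators: a
# wrapper catches TypeError/ValueError and warns, so the operator itself
# only ever sees clean input. deserialize() is a stand-in for real logic.

def deserialize(raw):
    return float(raw)  # placeholder for the real deserialization

def safe_input(raw):
    try:
        return deserialize(raw)
    except (TypeError, ValueError):
        warnings.warn('could not deserialize %r, treating as missing' % (raw,))
        return None

def sum_operator(values):
    # the operator never has to care about malformed input
    return sum(v for v in values if v is not None)

values = [safe_input(raw) for raw in ['1.5', None, '2.5']]
print(sum_operator(values))  # 4.0
```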
Make sure all operations can be run concurrently multiple times. There are two main issues.
Assuring that concurrent runs of downsampling do the expected thing (not overriding or duplicating work). We could probably lock streams as downsampling of them starts, and have other runs skip them. We should make sure they do not stay locked indefinitely. The same applies to backprocessing of dependent streams.
Assuring that datapoints can be appended concurrently. Mostly this already works, even for processing of dependent streams. The only known issue is with the derive operator, which expects the reset stream to be processed before the data stream, so that it can know whether a reset happened. Maybe we should just document this and require the user to assure it? Or should we make it work no matter the order? The issue with the latter path is that it seems we would have to store datapoints not just when a reset happened, but also when it did not.
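The per-stream lock with expiry mentioned for the first issue can be sketched as follows. This is an in-memory toy; a real implementation would need an atomic database operation, and the timeout value is an arbitrary assumption:

```python
import time

# Sketch of per-stream downsampling locks with expiry, so a crashed worker
# cannot keep a stream locked indefinitely. In-memory only; a real version
# would use an atomic find-and-modify against the database.

class StreamLocks:
    def __init__(self, timeout=300.0):
        self.timeout = timeout
        self._locks = {}  # stream_id -> acquisition timestamp

    def try_acquire(self, stream_id, now=None):
        now = time.time() if now is None else now
        acquired_at = self._locks.get(stream_id)
        if acquired_at is not None and now - acquired_at < self.timeout:
            return False  # another run is downsampling this stream; skip it
        self._locks[stream_id] = now  # fresh lock, or expired lock taken over
        return True

    def release(self, stream_id):
        self._locks.pop(stream_id, None)

locks = StreamLocks(timeout=300.0)
print(locks.try_acquire('s1', now=0.0))    # True: lock taken
print(locks.try_acquire('s1', now=10.0))   # False: still held, skip stream
print(locks.try_acquire('s1', now=400.0))  # True: expired lock taken over
```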
Provide metadata about the range of datapoints available in a stream at the highest granularity. Information about the earliest and latest datapoints should be provided as stream metadata.
Additionally, at each granularity level, the latest datapoint at that granularity should be provided as well.
While developing the Django interface to datastream, Nejc observed that the current granularity is too coarse for an easy implementation of smooth zooming in and out of plots. For example, zooming in from hours to minutes means a 60x increase in the number of datapoints the client has to fetch to display a more detailed plot.
There should probably be an upper limit, like 10x, between levels of granularity, so I am proposing that we extend the levels: instead of the current 1+3 levels we would have 1+6. This means more storage space is needed, but client implementation is easier.
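One possible 1+6 scheme that keeps every step at 10x or less can be checked numerically. The particular levels below are an assumption for illustration, not a decided set:

```python
from datetime import timedelta

# Illustrative 1+6 granularity levels with every zoom step at 10x or less;
# these specific levels are an assumption, not the agreed scheme.

levels = [
    timedelta(seconds=1),
    timedelta(seconds=10),
    timedelta(minutes=1),
    timedelta(minutes=10),
    timedelta(hours=1),
    timedelta(hours=6),
    timedelta(days=1),
]

ratios = [b.total_seconds() / a.total_seconds() for a, b in zip(levels, levels[1:])]
print(ratios)             # [10.0, 6.0, 10.0, 6.0, 6.0, 4.0]
print(max(ratios) <= 10)  # True: no zoom step fetches more than 10x datapoints
```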
I was thinking that all errors nodes had could be stored as a datastream as well. We could then display them as discrete events in visualizations, which would make debugging easier. We could make a fake stream in django-datastream which reads from the relational database, or we could simply store all errors into datastream to begin with.
So I imagine support for a data type which would be a list of strings, or maybe a list of simple JSON-able objects? (You can have multiple errors at a given moment.) I would go for JSON-able objects, because you can then have more metadata about the error, not just a string.
Downsampling would be simple as well: we would probably support only min, max, and sum for this data type (maybe I missed some). min is the first object (a list of one object) in the concatenation of all objects, sum is the concatenation of all objects, and max is the last object (a list of one object). Ah, and we can support count. Do we count the number of datapoints or the number of all objects in all datapoints?
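The concatenation-based semantics above can be sketched directly, assuming each datapoint value is a plain Python list of JSON-able dicts:

```python
# Sketch of downsampling for a "list of JSON-able objects" data type, using
# the concatenation-based min/sum/max semantics described above.

def ds_sum(datapoints):
    # sum: concatenation of all objects across all datapoints
    return [obj for dp in datapoints for obj in dp]

def ds_min(datapoints):
    # min: list holding the first object of the concatenation
    return ds_sum(datapoints)[:1]

def ds_max(datapoints):
    # max: list holding the last object of the concatenation
    return ds_sum(datapoints)[-1:]

datapoints = [
    [{'error': 'timeout'}],
    [{'error': 'dns'}, {'error': 'refused'}],
]
print(ds_sum(datapoints))  # [{'error': 'timeout'}, {'error': 'dns'}, {'error': 'refused'}]
print(ds_min(datapoints))  # [{'error': 'timeout'}]
print(ds_max(datapoints))  # [{'error': 'refused'}]
```

For count, the two candidate definitions would give 2 (datapoints) versus 3 (objects) on this example, which is exactly the open question.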
From InfluxDB.
Tests sometimes fail at exactly the same location with 4 datapoints instead of 3. Why?
It would seem that Datapoints.__getitem__ returns a generator:

    def __getitem__(self, key):
        if self.cursor is None:
            raise IndexError

        for datapoint in self.cursor.__getitem__(key):
            yield self.stream._format_datapoint(datapoint)

Doesn't this result in unexpected behavior when you only want a single index, like datapoints[0]? That returns a generator object wrapping the single object, not the object itself! I can see how this is useful for slices, but when accessing a scalar index this behavior seems wrong.
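One possible fix, sketched here against a plain list standing in for the MongoDB cursor, is to return a single formatted datapoint for integer indexes and keep the lazy generator only for slices:

```python
# Sketch of a scalar-aware __getitem__; a plain list stands in for the real
# MongoDB cursor, and _format is a stand-in for _format_datapoint.

class Datapoints:
    def __init__(self, cursor, format_datapoint):
        self.cursor = cursor
        self._format = format_datapoint

    def __getitem__(self, key):
        if self.cursor is None:
            raise IndexError
        if isinstance(key, slice):
            # slices stay lazy, as before
            return (self._format(dp) for dp in self.cursor[key])
        # scalar index: return the datapoint itself, not a generator
        return self._format(self.cursor[key])

datapoints = Datapoints([1, 2, 3], lambda dp: {'v': dp})
print(datapoints[0])          # {'v': 1}
print(list(datapoints[0:2]))  # [{'v': 1}, {'v': 2}]
```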