I think there's a bug here: <div class="Box Box--condensed my-2"

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Possible fix (completely untested): <a class="issue-link js-issue-link" data-error-tex

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

[BUG] whisper-fetch.py --drop incorrect timestamps about whisper HOT 10 CLOSED

cdeil commented on May 27, 2024

[BUG] whisper-fetch.py --drop incorrect timestamps

from whisper.

Comments (10)

deniszh commented on May 27, 2024

Hi @cdeil ,

As far as I remember it was not initial intention of author in #57
Other users also noticed that - e.g. #250

from whisper.

deniszh commented on May 27, 2024

TBH I think proper implementation should not drop point but replace it with 0 or previous value instead. But then we should not call this function "drop", isn't it?

from whisper.

piotr1212 commented on May 27, 2024

I don't see where it says this is intended. Returning the wrong timestamp does not make any sense to me, can't imagine why someone would want this.

from whisper.

cdeil commented on May 27, 2024

This cost me a day at work, because I ran whisper-fetch.py --drop=nulls and got timestamps for 2010 - 2012 and requested another data extract from a legacy system and was debugging why the extract doesn't work properly there. But really the data was for 2018-2020 as it should be already in the Whisper file, I just created wrong timestamps due to this bug.

For now I'm changing my code to just call whisper.fetch() in my pipeline instead of whisper-fetch.py, put the timestamps and values into a pandas.Series and call dropna to drop irrelevant (t, val).

from whisper.

cdeil commented on May 27, 2024

Possible fix (completely untested): #306

from whisper.

cdeil commented on May 27, 2024

Another issue that I ran into is that here you're using local time of the machine that I'm running the data processing on, which gave incorrect results in my case:

whisper/bin/whisper-fetch.py

Lines 86 to 89 in 8d21c56

    
           if options.time_format: 
        
             timestr = time.strftime(options.time_format, time.localtime(t)) 
        
           else: 
        
             timestr = time.ctime(t)

I'm now using this, I think that should be correct:

def read_whisper(path):
    (fromTime, untilTime, step), val = whisper.fetch(path, fromTime=0, archiveToSelect=None)
    fromTimeStamp = pd.Timestamp(fromTime, unit="s", tz="Europe/Berlin")
    index = pd.date_range(
        start=fromTimeStamp,
        freq="H",
        periods=len(val),
    )
    data = {"val": val}
    return pd.DataFrame(data, index=index)

from whisper.

cdeil commented on May 27, 2024

I was getting incorrect data with whisper.fetch for archiveToSelect=60, tried varies fromTime values.

I see correct data with whisper-dump.py, but I don't want to write temp CSV files.

Wrote this, which seems to work fine:

def read_whisper_archive(path: str, archive_id: int) -> pd.DataFrame:
    """Whisper data read direct implementation with Numpy and Pandas"""
    infos = whisper.info(path)
    if archive_id < 0 or archive_id >= len(infos["archives"]):
        raise ValueError(f"Invalid archive_id = {archive_id}")

    dtype = np.dtype([
        ("time", ">u4"),
        ("val", ">f8")
    ])

    offset = infos["archives"][archive_id]["offset"]
    data = np.fromfile(path, dtype=dtype, offset=offset)
    data = data[np.nonzero(data["time"])]
    # The astype is needed to avoid this error later on
    # ValueError: Big-endian buffer not supported on little-endian compiler
    df = pd.DataFrame(
        data={"val": data["val"].astype(float)},
        index=pd.to_datetime(data["time"], unit="s")
    )
    df = df.sort_index()
    return df

This should be much faster and memory-efficient than the current whisper-dump.py, which used Python types and lists, and also more convenient for use cases where people want to do data analysis directly, or dump to binary formats like Parquet / HDF5 / ....

Is the processing correct, i.e. is it guaranteed that non-filled values have time=0? and is the sorting by time at the end needed, or is this already the case in the file? the description at https://graphite.readthedocs.io/en/stable/whisper.html unfortunately doesn't explain where values are filled within the archive.

Do you think it could make sense to add a function like this to this repo? The Numpy & Pandas import could be delayed to the function, i.e. it would be an optional dependency.
Alternatively I could just make a file whisper_pandas.py in my private Github account share some functions there.

from whisper.

deniszh commented on May 27, 2024

Hi @cdeil,
I agreed that Whisper has many use cases, but IMO using it for analytical purposes is not widely adopted.
That's why I do not think that including pandas or numpy as part of library is a good idea.
But you can add your scripts to contrib/ directory, which exist for exact that reason.

from whisper.

stale commented on May 27, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

from whisper.

cdeil commented on May 27, 2024

Just in case someone finds this old thread and is looking for a Whisper file Pandas reader, check this out:
https://github.com/cdeil/whisper_pandas/blob/main/whisper_pandas.py

Of course, any feedback or contribution would be welcome.

Specifically I'm not sure if the data = data[data["time"] != 0] line and sort_index is valid.

It seems to work for my files, but the WhisperDB docs at https://graphite.readthedocs.io/en/latest/whisper.html unfortunately don't say how the file is initialised or where the points are inserted.

from whisper.

[BUG] whisper-fetch.py --drop incorrect timestamps about whisper HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	if options.time_format:
	timestr = time.strftime(options.time_format, time.localtime(t))
	else:
	timestr = time.ctime(t)