Giter Site home page Giter Site logo

Comments (18)

jdonges avatar jdonges commented on August 24, 2024 1

In my view, lines of length 0 are an ill-defined concept and cannot be counted (because a line of length is not a line). This is why I think an index shift may be a viable solution, when documenting it thoroughly. Lines of length N, while an edge case, can occur and should be accounted for.

@fkuehlein please make sure that statistics that rely on weighted line distributions such as Determinism are then still computed correctly.

On #188: here the non-quadtratic case for CrossRecurrencePlot can be accounted for by overriding the vertline_dist() and diagline_dist() methods.

from pyunicorn.

zugnachpankow avatar zugnachpankow commented on August 24, 2024

Hey @madarax64,

I recently joined the pyunicorn team and will try my best to help you.
I checked what changed in ._white_vertline_dist since 0.6.1 and the only thing i came across was the ctypedef that before directly happened in timeseries/_ext/numerics and is now imported from core/_ext/types.

  1. Could you maybe create a minimally complete example that reproduces the error and post it (tagging me)?
  2. What do you mean by particular datasets? Maybe it is helpful if you can provide some more info.

from pyunicorn.

madarax64 avatar madarax64 commented on August 24, 2024

Hello @MaxBechthold ,
Thanks for the response. I have attached such an example at the bottom of this post. You'll need to install pyts to replicate it, however. For reference, I install PyUnicorn directly from this Github repo (i.e. by pip3 install git+https://github.com/pik-copan/pyunicorn) and I am running in a Linux environment.

I am working on a classification problem using the RQA measures available from the PyUnicorn library as features to a machine learning classifier. I am using 128 publicly-available datasets (the UCR 2018 dataset archive), which I am sourcing from the PyTs library mentioned above. As you'll see from the example code, it is very easy to load the data for any of these datasets by name. I encounter this issue when I use the ACSF1 dataset from that repository, but not when I use others e.g the Ham dataset, which is what I meant by "particular datasets".

As mentioned previously, I have a PyUnicorn 0.6.1 installation on a Windows machine (using the wheel I linked to it in the first post above), and I do not have this issue in that environment, suggesting that it is a problem either with PyUnicorn 0.7 or the Linux environment.

Thanks

----BEGIN Example----

import numpy as np
from pyts.datasets import fetch_ucr_dataset
from pyunicorn.timeseries.recurrence_plot import RecurrencePlot

def loadData(dataset):
	dataDict = fetch_ucr_dataset(dataset)
	return dataDict['data_train'], dataDict['data_test']

def extract_feature(x):
	x_ = x[~np.isnan(x)]
	features = []
	rp = RecurrencePlot(x_, recurrence_rate=0.1, dim=1, tau=1, metric='supremum',silence_level=2)
	features.append(rp.average_diaglength())
	features.append(rp.average_vertlength())
	features.append(rp.average_white_vertlength()) #this is the offending feature
	features.append(rp.determinism())
	features.append(rp.diag_entropy())
	return features

if __name__ == "__main__":
	srcDatasetName = 'ACSF1' #ACSF1 fails, Ham works
	x_train, x_test  = loadData(srcDatasetName) #load the data from the 2018 UCR TS archive
	extracted_features = []
	for sample in x_train:
		extracted_features.append(extract_feature(sample))`

----End Example----

from pyunicorn.

zugnachpankow avatar zugnachpankow commented on August 24, 2024

Hey @madarax64, thanks for getting back to me. I will try to replicate the problem in the next days and let you know.

from pyunicorn.

madarax64 avatar madarax64 commented on August 24, 2024

Sure. Thanks alot!

from pyunicorn.

zugnachpankow avatar zugnachpankow commented on August 24, 2024

Hey @madarax64, so I was actually able to reproduce your error. Unfortunately I was not able to fix it or find a workaround... The only thing that changed for said method was the type conversion between C and python. Restoring the state from version 0.6.1. did not fix the problem though, which leaves me thinking that this might be some interaction problem with cython. Sadly I have not expertise in C and with cython yet to solve this problem.

If 0.6.1 works with those datasets for you I think the best idea is to actually keep using it for this particular sets. Would that work for you for now? I hope we can solve this with further commits in the future.

from pyunicorn.

madarax64 avatar madarax64 commented on August 24, 2024

Hi @zugnachpankow ,
Thanks for the speedy response! I'm not familiar with Cython myself, but I feel you're probably right.

I will try to workaround the issue as you've suggested. Hopefully a future commit will fix it. In the meantime, is it ok if I leave the issue open i.e. so it remains visible?
Thanks again!

from pyunicorn.

zugnachpankow avatar zugnachpankow commented on August 24, 2024

Agreed, let's leave this issue open. And thanks for pointing this problem out!

from pyunicorn.

madarax64 avatar madarax64 commented on August 24, 2024

Thanks @zugnachpankow !

from pyunicorn.

ntfrgl avatar ntfrgl commented on August 24, 2024

This appears to be a logical bug of the off-by-one type in the implementation of RecurrencePlot.white_vertline_dist(): The array white_vertline for the return value is initialised with length equal to the size of the recurrence matrix, which does not cover the edge case in timeseries._ext.numerics._white_vertline_distβ€Ž() that an entire column might be white, i.e., k = n_time. In other words, the allocated histogram does not cover the full event space.

This bug has gone unnoticed until the Cython compiler directive boundscheck was activated in 79ded90.

@fkuehlein, could you please write a test exposing this edge case and investigate whether downstream uses of this method, and related ones, would have any issues if the return value had a length increased by one?

from pyunicorn.

fkuehlein avatar fkuehlein commented on August 24, 2024

Thank you @ntfrgl for specifying this.

  1. You're right about the off-by-one type of the bug, although I found that an index shift would do the job while sparing us any downstream complications of changing the length of white_vertline:

    Certainly, the edge case of an all-white column in the RP should be accounted for. Instead, I see no sense in counting lines of length 0 and indeed there are none ever counted, which is why the first histogram entry is practically unused. Therefore, I'd propose to just shift the indexing from vertline[k] to vertline[k-1]. At my first sight, this should induce no downstream complications, and it makes more sense mathematically. Would you agree, or should we ask some RQA pro here first?

  2. I also found that the same problem occurs in other places, which would have to be changed accordingly in order to be consistent:

    The same applies to RecurrencePlot.vertline_dist() and its helper methods, producing the same error.
    The same also applies to RecurrencePlot.diagline_dist(), although no error occurs here as the index will only exceed bounds on the last loop (the main diagonal), where it is not used anymore, making that last loop pretty useless.

Note that the current behavior appears to be in line with the corresponding legacy Weave code (see 4644229, 4eb2478 and 5e54a36), so the problem was not in the Cython port here.

The edge cases can easily be triggered in a parametrized test including thresholds of 0 and a rather high value, producing an all-white or all-black RP respectively. Adding tests for all above methods is overdue anyway, RecurrencePlot's coverage is currently only at about 36%.

from pyunicorn.

fkuehlein avatar fkuehlein commented on August 24, 2024

... of course, the quicker fix would be to just ensure that the index k will not exceed the bounds of the indexed array.

Still, I think shifting the index as proposed above would be more sensible. But as this issue affects large parts of RecurrencePlot and the index shift would change the behavior of the respective methods a bit, I'll check back with Reik and Jonathan before deciding this.

In the meantime I implemented some tests against this issue, trying a parametrized test fixture this time. Although parametrized fixtures are cool for being reusable over multiple tests, I haven't found a proper way to vary multiple parameters as subsequent pytest.mark.parametrize()-statements would do. If you know one, let me know!

from pyunicorn.

ntfrgl avatar ntfrgl commented on August 24, 2024

Yes, changing the output semantics of an essential library method requires a discussion with stakeholders, and an update to the documentation where appropriate. Since the currently documented output is supposed to be a "frequency distribution of line lengths", I think that it would be prudent to either include the actual length-0 and length-N frequencies or to note that some of them are not computed. I would personally prefer the former, but it is up to the active users to make the call.

I have added a commit to the linked branch which suggests a possible factorisation of the fixture parametrisation.

from pyunicorn.

fkuehlein avatar fkuehlein commented on August 24, 2024

Awesome, parametrising fixtures that way is even neater than what I was hoping for. A good start for further improving test coverage, thanks.

Also agreed on the question of line-length frequencies, although I still don't quite understand how a length-0 line would be defined. In my eyes, starting to count at length 1 makes sense, right? Anyway, Reik has already said to look into it, so I'll see his response first.

from pyunicorn.

fkuehlein avatar fkuehlein commented on August 24, 2024

As noticed in #188, line distributions will also throw an IndexError with non-quadratic CrossRecurrencePlots. While basic RecurrencePlots are always quadratic, CrossRecurrencePlots don't need to be.

Hence, when fixing line distribution measures here, the non-quadratic case should be accounted for in the respective base class methods, or overriding methods must be implemented in the child class. Then, the testing of line distributions and use of test fixtures should be extended to CrossRecurrencePlot and all of test_timeseries.py.

from pyunicorn.

fkuehlein avatar fkuehlein commented on August 24, 2024

In the above commits 740f6ca and 00d8773, I implemented the agreed-upon index shift attempting not to corrupt or change the behavior of any downstream methods. I checked for correct results manually, but will try adding quantative assertions to the respective tests before opening a PR.

Among other things, in 14f822d I renamed the Cython line distribution helpers as I found current names to be misleading.
Please let me know if I missed something there, or in the other commits.

from pyunicorn.

ntfrgl avatar ntfrgl commented on August 24, 2024

I obviously agree with the above approach, but let me briefly clarify how I would make sense of length 0.

A black line within an underlying linear grid might be mathematically defined in two ways, which correspond to slightly different automata for recognising them:

  1. Filter out white cells, then form equivalence classes among the remaining black cells (reflexive and transitive closure of the relation neighbours), and finally count the sizes of the equivalence classes.
  2. Form equivalence classes among all cells (reflexive and transitive closure of the relation neighbours which are both black), and then count the number of black cells within each equivalence class.

In the approach (1), length 0 cannot exist, whereas in (2), each white cell is an equivalence class of its own.

from pyunicorn.

fkuehlein avatar fkuehlein commented on August 24, 2024

Ah ok, I think I get the idea now!

from pyunicorn.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.