Comments (18)
In my view, lines of length 0 are an ill-defined concept and cannot be counted (because a line of length is not a line). This is why I think an index shift may be a viable solution, when documenting it thoroughly. Lines of length N, while an edge case, can occur and should be accounted for.
@fkuehlein please make sure that statistics that rely on weighted line distributions such as Determinism are then still computed correctly.
On #188: here the non-quadtratic case for CrossRecurrencePlot can be accounted for by overriding the vertline_dist() and diagline_dist() methods.
from pyunicorn.
Hey @madarax64,
I recently joined the pyunicorn team and will try my best to help you.
I checked what changed in ._white_vertline_dist since 0.6.1 and the only thing i came across was the ctypedef that before directly happened in timeseries/_ext/numerics and is now imported from core/_ext/types.
- Could you maybe create a minimally complete example that reproduces the error and post it (tagging me)?
- What do you mean by particular datasets? Maybe it is helpful if you can provide some more info.
from pyunicorn.
Hello @MaxBechthold ,
Thanks for the response. I have attached such an example at the bottom of this post. You'll need to install pyts to replicate it, however. For reference, I install PyUnicorn directly from this Github repo (i.e. by pip3 install git+https://github.com/pik-copan/pyunicorn
) and I am running in a Linux environment.
I am working on a classification problem using the RQA measures available from the PyUnicorn library as features to a machine learning classifier. I am using 128 publicly-available datasets (the UCR 2018 dataset archive), which I am sourcing from the PyTs library mentioned above. As you'll see from the example code, it is very easy to load the data for any of these datasets by name. I encounter this issue when I use the ACSF1 dataset from that repository, but not when I use others e.g the Ham dataset, which is what I meant by "particular datasets".
As mentioned previously, I have a PyUnicorn 0.6.1 installation on a Windows machine (using the wheel I linked to it in the first post above), and I do not have this issue in that environment, suggesting that it is a problem either with PyUnicorn 0.7 or the Linux environment.
Thanks
----BEGIN Example----
import numpy as np
from pyts.datasets import fetch_ucr_dataset
from pyunicorn.timeseries.recurrence_plot import RecurrencePlot
def loadData(dataset):
dataDict = fetch_ucr_dataset(dataset)
return dataDict['data_train'], dataDict['data_test']
def extract_feature(x):
x_ = x[~np.isnan(x)]
features = []
rp = RecurrencePlot(x_, recurrence_rate=0.1, dim=1, tau=1, metric='supremum',silence_level=2)
features.append(rp.average_diaglength())
features.append(rp.average_vertlength())
features.append(rp.average_white_vertlength()) #this is the offending feature
features.append(rp.determinism())
features.append(rp.diag_entropy())
return features
if __name__ == "__main__":
srcDatasetName = 'ACSF1' #ACSF1 fails, Ham works
x_train, x_test = loadData(srcDatasetName) #load the data from the 2018 UCR TS archive
extracted_features = []
for sample in x_train:
extracted_features.append(extract_feature(sample))`
----End Example----
from pyunicorn.
Hey @madarax64, thanks for getting back to me. I will try to replicate the problem in the next days and let you know.
from pyunicorn.
Sure. Thanks alot!
from pyunicorn.
Hey @madarax64, so I was actually able to reproduce your error. Unfortunately I was not able to fix it or find a workaround... The only thing that changed for said method was the type conversion between C and python. Restoring the state from version 0.6.1. did not fix the problem though, which leaves me thinking that this might be some interaction problem with cython. Sadly I have not expertise in C and with cython yet to solve this problem.
If 0.6.1 works with those datasets for you I think the best idea is to actually keep using it for this particular sets. Would that work for you for now? I hope we can solve this with further commits in the future.
from pyunicorn.
Hi @zugnachpankow ,
Thanks for the speedy response! I'm not familiar with Cython myself, but I feel you're probably right.
I will try to workaround the issue as you've suggested. Hopefully a future commit will fix it. In the meantime, is it ok if I leave the issue open i.e. so it remains visible?
Thanks again!
from pyunicorn.
Agreed, let's leave this issue open. And thanks for pointing this problem out!
from pyunicorn.
Thanks @zugnachpankow !
from pyunicorn.
This appears to be a logical bug of the off-by-one type in the implementation of RecurrencePlot.white_vertline_dist()
: The array white_vertline
for the return value is initialised with length equal to the size of the recurrence matrix, which does not cover the edge case in timeseries._ext.numerics._white_vertline_distβ()
that an entire column might be white, i.e., k = n_time
. In other words, the allocated histogram does not cover the full event space.
This bug has gone unnoticed until the Cython compiler directive boundscheck
was activated in 79ded90.
@fkuehlein, could you please write a test exposing this edge case and investigate whether downstream uses of this method, and related ones, would have any issues if the return value had a length increased by one?
from pyunicorn.
Thank you @ntfrgl for specifying this.
-
You're right about the off-by-one type of the bug, although I found that an index shift would do the job while sparing us any downstream complications of changing the length of
white_vertline
:Certainly, the edge case of an all-white column in the RP should be accounted for. Instead, I see no sense in counting lines of length 0 and indeed there are none ever counted, which is why the first histogram entry is practically unused. Therefore, I'd propose to just shift the indexing from
vertline[k]
tovertline[k-1]
. At my first sight, this should induce no downstream complications, and it makes more sense mathematically. Would you agree, or should we ask some RQA pro here first? -
I also found that the same problem occurs in other places, which would have to be changed accordingly in order to be consistent:
The same applies to
RecurrencePlot.vertline_dist()
and its helper methods, producing the same error.
The same also applies toRecurrencePlot.diagline_dist()
, although no error occurs here as the index will only exceed bounds on the last loop (the main diagonal), where it is not used anymore, making that last loop pretty useless.
Note that the current behavior appears to be in line with the corresponding legacy Weave code (see 4644229, 4eb2478 and 5e54a36), so the problem was not in the Cython port here.
The edge cases can easily be triggered in a parametrized test including threshold
s of 0 and a rather high value, producing an all-white or all-black RP respectively. Adding tests for all above methods is overdue anyway, RecurrencePlot
's coverage is currently only at about 36%.
from pyunicorn.
... of course, the quicker fix would be to just ensure that the index k
will not exceed the bounds of the indexed array.
Still, I think shifting the index as proposed above would be more sensible. But as this issue affects large parts of RecurrencePlot
and the index shift would change the behavior of the respective methods a bit, I'll check back with Reik and Jonathan before deciding this.
In the meantime I implemented some tests against this issue, trying a parametrized test fixture this time. Although parametrized fixtures are cool for being reusable over multiple tests, I haven't found a proper way to vary multiple parameters as subsequent pytest.mark.parametrize()
-statements would do. If you know one, let me know!
from pyunicorn.
Yes, changing the output semantics of an essential library method requires a discussion with stakeholders, and an update to the documentation where appropriate. Since the currently documented output is supposed to be a "frequency distribution of line lengths", I think that it would be prudent to either include the actual length-0 and length-N frequencies or to note that some of them are not computed. I would personally prefer the former, but it is up to the active users to make the call.
I have added a commit to the linked branch which suggests a possible factorisation of the fixture parametrisation.
from pyunicorn.
Awesome, parametrising fixtures that way is even neater than what I was hoping for. A good start for further improving test coverage, thanks.
Also agreed on the question of line-length frequencies, although I still don't quite understand how a length-0 line would be defined. In my eyes, starting to count at length 1 makes sense, right? Anyway, Reik has already said to look into it, so I'll see his response first.
from pyunicorn.
As noticed in #188, line distributions will also throw an IndexError
with non-quadratic CrossRecurrencePlot
s. While basic RecurrencePlot
s are always quadratic, CrossRecurrencePlot
s don't need to be.
Hence, when fixing line distribution measures here, the non-quadratic case should be accounted for in the respective base class methods, or overriding methods must be implemented in the child class. Then, the testing of line distributions and use of test fixtures should be extended to CrossRecurrencePlot
and all of test_timeseries.py
.
from pyunicorn.
In the above commits 740f6ca and 00d8773, I implemented the agreed-upon index shift attempting not to corrupt or change the behavior of any downstream methods. I checked for correct results manually, but will try adding quantative assertions to the respective tests before opening a PR.
Among other things, in 14f822d I renamed the Cython line distribution helpers as I found current names to be misleading.
Please let me know if I missed something there, or in the other commits.
from pyunicorn.
I obviously agree with the above approach, but let me briefly clarify how I would make sense of length 0.
A black line within an underlying linear grid might be mathematically defined in two ways, which correspond to slightly different automata for recognising them:
- Filter out white cells, then form equivalence classes among the remaining black cells (reflexive and transitive closure of the relation neighbours), and finally count the sizes of the equivalence classes.
- Form equivalence classes among all cells (reflexive and transitive closure of the relation neighbours which are both black), and then count the number of black cells within each equivalence class.
In the approach (1), length 0 cannot exist, whereas in (2), each white cell is an equivalence class of its own.
from pyunicorn.
Ah ok, I think I get the idea now!
from pyunicorn.
Related Issues (20)
- BUG: segfault when calling `CouplingAnalysis.mutual_information()` and `CouplingAnalysis.information_transfer()` HOT 5
- BUG: `CrossRecurrencePlot()` throws TypeError with fixed recurrence rate HOT 5
- Tests fail with NumPy 1.24.4 HOT 3
- LINT: few remaining refactor and warning flags to be resolved HOT 2
- Tests failing with `Unable to lock file`
- pip install pyunicorn error:error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2 HOT 1
- LINT: disabled `pylint` flags should be re-enabled where possible, or disabled locally HOT 5
- Replace: `progressbar` -> `tqdm`
- MAINT: add `cython-lint` to test suite
- MAINT: review funcionality and use of `CouplingAnalysis.get_nearest_neighbors()` and add a specific test
- ENH: Theiler window option for diagonal RQA measures in `RecurrencePlot` HOT 2
- CI: set up Travis CI for macOS and Windows builds
- MAINT: drop `climate.MapPlots` in favour of `climate.CartopyPlots` HOT 2
- ENH: Remove old examples folder
- `Network.nsi_local_clustering()` might encounter divide by zero when `typical_weight` is given HOT 1
- module 'numpy' has no attribute 'int' HOT 1
- Failed building wheel for pyunicorn while installing HOT 4
- Printing `ClimateData` object throws `KeyError` from `h5netcdf` HOT 2
- Add `typical_weight` argument for all n.s.i. measures to enable their "corrected" versions
- numpy module not found during installation (Win10) HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pyunicorn.