aeon-toolkit / aeon Goto Github PK

A toolkit for machine learning from time series

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.01% Python 99.96% Shell 0.02%

data-mining data-science forecasting machine-learning scikit-learn time-series time-series-analysis time-series-classification time-series-clustering time-series-regression

aeon's People


haskarb jingmouren scorcism ltsaprounis time-series-machine-learning lmmentel hadifawaz1999 clarenceke rafaaygar datavaluepeople allanzhang721 katiebuc thearchiver qubiit guzalbulatova sandy4321 shiningstarcode vinicius-ianni afnindar-f josegron arno-dutra jerronl aiwalter sylvaincom lxrxr harel-coffee vaseekaran-v hedeershowk steenrotsman kgmuzungu kevinlu1248 zy18811 wwzeng1 chrisholder mloning akshatvishu paulrabich brunohdmacedo xiaopu222 halzorg frankl1 buke2016 ashish090798 shanthshivam andregdmitri jose-gilberto pcjdhhhh atharva131201 cks7 itsdivya1309 aquibkhann jackpickert youssefragab99 ulysses1212 vedant2100 anonymouscodes911 nimanzik alicesn tvilight4 aadya940 codelionx maxwell1503 vntzyy ryantigi254 lfun1 nirojasva krstopro alexbanwell1 alexgmatt1 harshithasudhakar quatuordecimber wayneadams abhash297 amitkumargope xuhong1999 chiyu-chiu jasonmokk futuer-szd amishkakru patricejada mettebeekman aries99c nileenagp jevdende adm-unl moonzyyy acquayefrank vishalbelsare jasonlines addi-p epsilon-ent-sol aryanpola hfawaz grandson-of-phyxis matthewmiddlehurst irknyazev cyril-meyer manuelmusngi datadote phershbe

aeon's Issues

[FORK] Github repo branch protection rules

The main branch should be protected to avoid accidental pushes and changes should be introduced through PR. The rule should include admin/core devs as means of clearly communicating changes.

[FORK] Decide on the primary communication channel/medium for the community

Some options are;

use existing slack workspace without any changes
use existing slack workspace and rename it
create a new slack workspace

Do we also do the same for the twitter and linkedin?

[GOV] Changes to the code of conduct

Recent changes have already been reverted (#18 #17), but we should discuss if we want to make any further changes to the CoC.

[FORK] Get scikit-time Pypi

We are currently in communication with the current owners on obtaining access to the scikit-time PyPi.

[GOV] New governance for aeon

Hi, what do you think we should do with sktime governance? Personally I would like to scrap it all and start again, but that might be too disruptive?

The current governance roles are:

Core developer
Community council (CC)
Code of conduct committee (CoCC)
Algorithm maintainer (in codeowners file)

[ENH] Deprecate the alignment module

Is your feature request related to a problem? Please describe.

The alignment module was introduced with no clear demand. It introduced a new soft dependency, dtw-python, rather than using our own implementations. There is a whole infrastructure of base classes etc., but there are no examples that I can see. The functionality "computes the full alignment between X[0] and X[1]" is available in the distances module as follows.

import numpy as np
from sktime.distances import dtw_alignment_path

x = np.random.rand(10)
y = np.random.rand(10)
path = dtw_alignment_path(x, y)
print(path)

from sktime.distances import distance_alignment_path

path2 = distance_alignment_path(x, y, metric="dtw")
print(path2)

I don't think there is anyone here who wants to maintain this.

The only place it is used is the dists_kernels module in the class DistFromAligner, which itself is not used anywhere. In my view, distances or alignments between time series should be found by calling distance functions which can return alignments if required. If you want a distance matrix/kernel for a set of time series, call the pairwise_distances function. That is how it is done in distances. This is how other packages provide distances. In Aligner and dist_kerns there is a confusing set of things such as "Composer that creates distance from aligner."
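To make the "distance functions that can return alignments" idea concrete, here is a minimal pure-Python sketch. It is not the sktime implementation: `dtw_with_path` and `pairwise_distance` are hypothetical names, the point-wise cost is assumed to be the squared difference, and no warping window is applied.

```python
import math


def dtw_with_path(x, y):
    """Return (distance, alignment path) for two univariate series.

    Illustrative only: squared-difference cost, full warping window.
    """
    n, m = len(x), len(y)
    # cost[i][j] holds the best cumulative cost of aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Backtrack from the end of both series to recover the alignment
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda move: move[0])
    path.reverse()
    return cost[n][m], path


def pairwise_distance(series, distance=dtw_with_path):
    """Build a symmetric distance matrix by calling a plain function."""
    k = len(series)
    matrix = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            dist, _ = distance(series[a], series[b])
            matrix[a][b] = matrix[b][a] = dist
    return matrix
```

Under this design the alignment is just a second return value, and a distance matrix needs no wrapper class, only a function argument.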

Describe the solution you'd like
I want to delete the directory alignment and the class DistFromAligner.

Describe alternatives you've considered
An alternative would be to attempt to merge it with dist_and_kerns and distances. I do not want to do this

Additional context

These are the only usages of this module. There are no docs related to it that I can see. It should be simple to get rid of it.

[FORK] Decide on digital assets

To ensure operations and releases we would require:

github organization and repo
documentation account - readthedocs.org
pypi account with registered package name

Optional

domain name to host documentation under
email account
anaconda/conda-forge account to release conda packages #33
linkedin account
twitter account

[FORK] Retaining documentation of previous contributors and core developers

Do we keep the contributors? I'm not sure of the protocol

[FORK] get project conda set up/access

PyPi now has a release to install from, but aeon does not have a conda install option.

[FORK] Update the landing page

I think we should remove all of this
https://github.com/scikit-time/scikit-time/blob/main/README.md
for now, and just have a simple message "This is a fork of sktime made by previous core developers, and is a work in progress" or something. Any thoughts on that?

[FORK] keep the same licence?

just checking, but I assume we keep the same licence?

https://github.com/scikit-time/scikit-time/blob/main/LICENSE

[ENH] Deprecate the alignment module

The alignment module wraps a package called dtw-python, which contains an implementation of DTW that is slower than ours. The functionality of alignment is available in distances using our own distance functions. The only places it is used are in KNN, which I would like to remove
#47
#39
and dists_kernels, which I would also like to deprecate. @chrisholder

[DOC] Notebooks structure and content in examples directory

I would like to take the opportunity of the reboot to restructure and partially rewrite the notebooks in the examples directory. There is an issue of whether we need separate notebooks and webpages. Let's discuss; I don't feel strongly either way, but the structure needs looking at (scikit-learn has an examples and a docs dir).

Currently they look like this: just a dumping ground with inconsistent naming, such as AA_datatypes_and_datasets.

I think at the root dir we should have some basic info and then one notebook per module, e.g.

getting_started.ipynb
data_input.ipynb
annotation.ipynb
classification.ipynb
clustering.ipynb
distances.ipynb
forecasting.ipynb
networks.ipynb
regression.ipynb
transformations.ipynb

Then have subfolders by module containing more detailed notebooks, with links from the root notebooks to those subdirectories.

[MNT] Python 3.7 EOL

Python 3.7 reaches its end-of-life for security support in 4 months (https://endoflife.date/python). Maybe now is a good time to drop it if it is going to be a big deal?

[ENH] base classifier input checks

Is your feature request related to a problem? Please describe.

There are two checks I think should happen in the classifier base class

1. We have a clear distinction between classifiers and regressors. I think we should check that the target is not continuous in fit (and predict), since everything will behave strangely if a classifier is built on a regression problem. However, I'm not sure how best to do this, since the input can be a Series or a numpy array of floats. How many unique values make it a regression problem?
2. We should check in predict that the input X has the same dimensions as the X passed in fit. Differing dimensions should only be allowed with early classification.
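The two checks could be sketched roughly as below. This is a hypothetical illustration, not the base class API: `looks_continuous`, `ShapeCheckedClassifier` and `allow_unequal_length` are invented names, X is assumed to be a nested list shaped (n_cases, n_channels, n_timepoints), and the continuity heuristic only flags non-integer floats, leaving the "how many unique values" question open.

```python
def looks_continuous(y):
    """Heuristic: True if any label is a float with a fractional part.

    String labels, ints and whole-number floats such as 1.0 pass as classes.
    """
    return any(isinstance(v, float) and not v.is_integer() for v in y)


class ShapeCheckedClassifier:
    """Hypothetical base class that remembers the shape seen in fit."""

    # An early classifier could set this to True to accept shorter series
    allow_unequal_length = False

    def fit(self, X, y):
        if looks_continuous(y):
            raise ValueError(
                "y looks continuous; use a regressor for this problem"
            )
        # Record (n_channels, n_timepoints) from the first training case
        self._fit_shape = (len(X[0]), len(X[0][0]))
        return self

    def check_predict_input(self, X):
        n_channels, n_timepoints = len(X[0]), len(X[0][0])
        if n_channels != self._fit_shape[0]:
            raise ValueError("X has a different number of channels than in fit")
        if n_timepoints != self._fit_shape[1] and not self.allow_unequal_length:
            raise ValueError("X has a different series length than in fit")
        return X


class EarlyClassifierSketch(ShapeCheckedClassifier):
    """Stand-in for an early classifier, which may see truncated series."""

    allow_unequal_length = True
```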

[FORK] Package name: `sktime` or `scikit_time`

Do we keep sktime as our package name or change it to avoid conflicts with the previous version?

Currently, it would be like scikit-learn, which calls their main package sklearn.

[FORK] Minimal CI pipeline

As we are currently private, we can't run GitHub Actions. This should be fixed asap.

PR to remove workflows: #15
PR to revert removal: #16

[BUG] plot_series has side-effect on index name

Describe the bug
Function plot_series has side-effects as y forgets index name after calling plot_series. Probably it can be fixed by simply adding a y = y.copy() to the function.

To Reproduce

from sktime.datasets import load_airline
from sktime.utils.plotting import plot_series

y = load_airline()
y = y.to_frame()
y_name1 = y.index.name
plot_series(y)
y_name2 = y.index.name

assert y_name1 == y_name2

Expected behavior

Additional context

Versions
'0.15.0'

[ENH] Parallel Backend

Should we change the parallel backend to threading in the classifiers?

Parallel(n_jobs=n_jobs, backend="threading")(

I have noticed issues when calling nested Parallel-constructs.

@MatthewMiddlehurst @TonyBagnall I recall, you noticed similarly?

[ENH] Improved error message for MultiIndex DataFrames in case level(-1) does not contain a time index

Is your feature request related to a problem? Please describe.
For panel forecasting we use a MultiIndex, and the inner index level has to be the time index. This is an established convention within the framework, so we should check for it and raise a more informative error message when it is violated. Users might not be aware of this convention.

[BUG] Contractable BOSS deprecation

ContractableBOSS has a deprecation warning.

I'm not sure if the deprecation has happened or if we still want it to happen; this requires resolving one way or another.

[DOC] remove references to mtype/machine type scitype/stype

The terms mtype/machine type and scitype/stype were introduced with the datatype module. These are non-standard terms that mean nothing to most people and are sure to cause confusion.

For example, the header for the _convert module is this
"""Machine type converters for scitypes."""
and there are comments like this
"obj : object to convert - any type, should comply with mtype spec for as_scitype"

There is no reference here to what these mean, nor a link to an explanation. As far as I can tell, it is a reinvention of the concept of abstract data types with some weird inclusion of objects. So a scitype would be an abstract data type such as a list, and an mtype would be an implementation of a list such as a linked list or an array. I see absolutely no need for this distinction and think all it does is confuse things.

It basically gives a form of automation to convert, but I find it convoluted and confusing.

Exports
-------
convert_to(obj, to_type: str, as_scitype: str, store=None)
    converts object "obj" to type "to_type", considerd as "as_scitype"

convert(obj, from_type: str, to_type: str, as_scitype: str, store=None)
    same as convert_to, without automatic identification of "from_type"

mtype(obj, as_scitype: str)
    returns "from_type" of obj, considered as "as_scitype"
---

I would like to remove these terms completely and replace them with, for example,
input type/internal type/output type,
then have simple converters that convert from one to another, and remove the scitype.
Just looking at usages of this, to_type is a required parameter, so all this does

def convert_to(
    obj,
    to_type: str,
    as_scitype: str = None,
    store=None,
    store_behaviour: str = None,
):

is some convoluted checking, and then it calls convert! This adds no value that I can see. In the first instance I would just change the terminology, but really I doubt the code below adds any value at all.

    # input checks on to_type, as_scitype; coerce to_type, as_scitype to lists
    to_type = _check_str_or_list_of_str(to_type, obj_name="to_type")

    # sub-set a preliminary set of as_scitype from to_type, as_scitype
    if as_scitype is not None:
        # if not None, subset to types compatible between to_type and as_scitype
        as_scitype = _check_str_or_list_of_str(as_scitype, obj_name="as_scitype")
        potential_scitypes = mtype_to_scitype(to_type)
        as_scitype = list(set(potential_scitypes).intersection(as_scitype))
    else:
        # if None, infer from to_type
        as_scitype = mtype_to_scitype(to_type)

    # now further narrow down as_scitype by inference from the obj
    from_type = infer_mtype(obj=obj, as_scitype=as_scitype)
    as_scitype = mtype_to_scitype(from_type)

    # if to_type is a list, we do the following:
    # if on the list, then don't do a conversion (convert to from_type)
    # if not on the list, we find and convert to first mtype that has same scitype
    if isinstance(to_type, list):
        # no conversion if from_type is in the list
        if from_type in to_type:
            to_type = from_type
        # otherwise convert to first element of same scitype
        else:
            same_scitype_mtypes = [
                mtype for mtype in to_type if mtype_to_scitype(mtype) == as_scitype
            ]
            if len(same_scitype_mtypes) == 0:
                raise TypeError(
                    "to_type contains no mtype compatible with the scitype of obj,"
                    f"which is {as_scitype}"
                )
            to_type = same_scitype_mtypes[0]
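As a rough illustration of the "simple converters" proposal above, the sketch below keys plain functions by (input type, output type). Everything here is hypothetical: the registry, the `register` decorator, and the two type names `list_of_series` and `dict_of_series` are invented for the example and are not part of the datatypes module.

```python
# Registry of converter functions keyed by (input_type, output_type)
CONVERTERS = {}


def register(input_type, output_type):
    """Decorator that records a converter under its type pair."""

    def decorator(func):
        CONVERTERS[(input_type, output_type)] = func
        return func

    return decorator


@register("list_of_series", "dict_of_series")
def _list_to_dict(obj):
    # Key each series by its position in the list
    return {i: list(series) for i, series in enumerate(obj)}


@register("dict_of_series", "list_of_series")
def _dict_to_list(obj):
    # Restore list order from the sorted keys
    return [list(obj[key]) for key in sorted(obj)]


def convert(obj, input_type, output_type):
    """Convert obj between two named types, with no inference step."""
    if input_type == output_type:
        return obj
    try:
        converter = CONVERTERS[(input_type, output_type)]
    except KeyError:
        raise TypeError(
            f"no converter from {input_type!r} to {output_type!r}"
        ) from None
    return converter(obj)
```

The caller states both types explicitly, so there is no inference and no third "considered as" argument to explain.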

[FORK] Rename the project mentions from `sktime` to new package name

This includes:

documentation
README #43
configuration files: pyproject.toml etc.

[FORK] Changing codeowners

I suggest we just delete the current version
https://github.com/scikit-time/scikit-time/blob/main/CODEOWNERS
and give admins ownership of everything. Any thoughts?

[ENH] Deprecate Python 3.7 and bump scikit-learn version

Some estimators don't support Python 3.7 anymore, so we should think about deprecating it. Also, scikit-learn>=1.1.0 no longer supports Python 3.7.

We can then probably bump to scikit-learn>=1.1.0, and possibly raise the minimum versions of other dependencies as well.

[ENH] Investigate/improve performance of data checks and pd.MultiIndex operations/iterations

Is your feature request related to a problem? Please describe.
There seems to be quite a problem related to performance of some data checks and pd.MultiIndex operations.

Describe the solution you'd like
Related: sktime/sktime#4139

[ENH] Redesign KNN classifier and regressor

The KNN Regressor and Classifier were redesigned to replace the inheritance model with containment. The algorithms now always work out the pairwise distance for the train data, which is then never used, since the algorithms that can exploit the triangle inequality to find neighbours are only usable with hard coded scikit-learn distances, not our elastic distances. Any option other than algorithm="brute" breaks the current version. This was all done against my opinion, but I lost the energy to fight it.

They are also tightly coupled to distances through a string list, rather than using the distance factory as before. This is bad practice, since the addition of new distances will also require a change to the classifier/regressor. It is better to use the factory.

I will dig a little deeper into the scikit-learn algorithm, but my preference would be to abandon the scikit-learn version altogether and just implement KNN ourselves.
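The coupling argument can be illustrated with a small sketch. This is not the previous distance factory or the existing KNN code: `DISTANCES`, `get_distance_function` and `BruteForceKNN` are hypothetical names, and the two distances are simple stand-ins for the elastic ones.

```python
import math
from collections import Counter


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


# Single place where distance names are resolved (stand-in for a factory)
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan}


def get_distance_function(metric):
    """Resolve a name or a callable to a distance function."""
    if callable(metric):
        return metric
    try:
        return DISTANCES[metric]
    except KeyError:
        raise ValueError(f"unknown distance {metric!r}") from None


class BruteForceKNN:
    """Plain brute-force k-NN that holds no list of distance names."""

    def __init__(self, distance="euclidean", n_neighbors=1):
        self.distance = distance
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Lazy learner: only store the training data
        self._X, self._y = list(X), list(y)
        self._dist = get_distance_function(self.distance)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Rank training cases by distance to the query series
            ranked = sorted(
                range(len(self._X)), key=lambda i: self._dist(x, self._X[i])
            )
            votes = Counter(self._y[i] for i in ranked[: self.n_neighbors])
            preds.append(votes.most_common(1)[0][0])
        return preds
```

Registering a new entry in `DISTANCES`, or passing a callable, makes a new distance available without touching the classifier, and nothing is precomputed on the training data in fit.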

[BUG] Early Classifier changes

We generally need to look at EarlyClassification. There is a deprecation warning.

I am not sure if we still want to do this. TEASER is also failing some classification tests, which we think is due to my refactoring of _class_dictionary in the Classifier base class.

see this PR for more info

#52
excluded TEASER because of global test failures
#68

@patrickzib

[BUG] forecaster.cutoff should be Period and Timestamp not PeriodIndex and DatetimeIndex

Describe the bug
A new bug was introduced with sktime==0.16.0: the cutoff attribute of a forecaster is now a PeriodIndex, whereas in the past it was a Period. The same applies to DatetimeIndex and Timestamp.

To Reproduce

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster
y = load_airline()
forecaster = NaiveForecaster(strategy="drift")
forecaster.fit(y)

y_pred = forecaster.predict(fh=[1,2,3])
forecaster.cutoff

With sktime==0.15.1 this returns Period('1960-12', 'M') and Timestamp('1960-12-31 00:00:00', freq='M')

With sktime==0.16.0 this returns PeriodIndex(['1960-12'], dtype='period[M]', name='Period') and DatetimeIndex(['1960-12-31'], dtype='datetime64[ns]', name='Period', freq='M')
Expected behavior

Additional context

Versions

[FORK] Choose a project name!

scikit-time is just a temporary name, and we need to select a new one for the project. Some suggestions:

scikit-series
skseries
Py-ts
py-time
ml-time

[GOV] Set up scikit-time slack

scikit-time is live under https://join.slack.com/t/scikit-timeworkspace/shared_invite/zt-1ph9lewat-x1ZgqoPIydbEzuswe4fJmQ

[BUG] Deprecate ProximityForest

Describe the bug
ProximityForest is broken and at this point it is easiest to deprecate it and maybe look for someone to reimplement it. See here for a discussion

sktime/sktime#3648

To Reproduce

from sktime.classification.distance_based import ProximityForest
from sktime.datasets import load_unit_test

pf = ProximityForest()
trainX, trainy = load_unit_test()
pf.fit(trainX, trainy)

this results in an infinite recursion in _fit leading to stack overflow.
RecursionError: maximum recursion depth exceeded while calling a Python object
this happens on all problems I tried (unit_test, arrow_head, GunPoint, ItalyPowerDemand)

Expected behaviour

It should fit and predict with default parameters on data in an allowed format.

Additional context
Removal was previously blocked. The original author of PF is unavailable.

[ENH] Delete _contrib folder

contrib was a folder we included at the beginning of sktime as a working area for code not suitable or ready for sktime proper. We made it private a year or so ago, but I think there is no need for it any more. My group certainly doesn't need it. The simplest solution, and the one I am proposing, is just deleting the directory.

[ENH] Redesign of time series forest classifier/regressor

We currently have duplicate versions of the same algorithm TSF, and for some reason I have never understood it plays a very prominent role in examples etc. I think we should simplify to a single version and not use it so much, it is a very obscure algorithm not used at all in practice.

[ENH] Remove dists_kernels

we currently have a module distances, which contains elastic distance functions, and one called dists_kernels that wraps everything in objects. I would like to remove dists_kernels. It is confusing having both, and I see no benefit from wrapping distances and introducing things like PwTrafoPanelPipeline, BasePairwiseTransformerPanel etc

I think this description would make little sense to most people.
https://www.sktime.org/en/stable/api_reference/dists_kernels.html

Conceptually, insisting that functions be wrapped in objects (the functor pattern) is much more Java-like than Python. Even Java recognises that functors are a clumsy pattern and introduced lambdas. To go the other way with Python just seems perverse. @chrisholder can comment, but I believe all other relevant packages treat distances as functions.

#39
#46
https://arxiv.org/abs/2205.15181

[ENH] Write a function to perform the logic of DerivativeSlopeTransformer using numpy arrays instead of pd.DataFrame

A quick review reveals a few possible improvements to EE, which was one of the earliest classifiers we introduced.

1. When calculating derivatives, EE uses a DerivativeSlopeTransformer, which has internal type nested_univ. This means the calculation is shoving a numpy array into a pandas DataFrame, doing inefficient calculations, then pulling it back into numpy.

            # X is a 3D numpy that then gets internally converted to pandas
            der_X = DerivativeSlopeTransformer().fit_transform(X)
            # it is then converted back to numpy: remove this and just create differences
            if isinstance(der_X, pd.DataFrame):
                der_X = from_nested_to_3d_numpy(der_X)

Rather than changing DerivativeSlopeTransformer, which is embedded in a morass of code, I think this really simple transform should just be done in a static method in ElasticEnsemble using numpy only. Basically, convert this

        def get_der(x):
            der = []
            for i in range(1, len(x) - 1):
                der.append(((x[i] - x[i - 1]) + ((x[i + 1] - x[i - 1]) / 2)) / 2)
            return pd.Series([der[0]] + der + [der[-1]])

        return [get_der(x) for x in X]

to use numpy. There are other possible optimisations in EE, there is no testing for correctness for EE and it does not include all of the original
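A numpy-only version of the loop above might look like the following sketch. The function name `derivative_slope` is hypothetical, the time axis is assumed to be the last axis (so 1D, 2D and 3D arrays all work), and each series is assumed to have at least three points, as the original loop implicitly requires.

```python
import numpy as np


def derivative_slope(X):
    """Vectorised form of get_der, applied along the last (time) axis."""
    X = np.asarray(X, dtype=float)
    if X.shape[-1] < 3:
        raise ValueError("each series needs at least three time points")
    # (x[i] - x[i-1]) for i = 1 .. n-2
    backward = X[..., 1:-1] - X[..., :-2]
    # (x[i+1] - x[i-1]) / 2 for i = 1 .. n-2
    central = (X[..., 2:] - X[..., :-2]) / 2
    der = (backward + central) / 2
    # Repeat the first and last derivative so the output keeps the input length
    return np.concatenate([der[..., :1], der, der[..., -1:]], axis=-1)
```

Because the slicing is done on the last axis, a 3D array of shape (n_cases, n_channels, n_timepoints) is transformed in one call with no conversion to pandas.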

[ENH] Remove TimeSeriesSVC

Is your feature request related to a problem? Please describe.

TimeSeriesSVC was introduced to sktime on 25th Dec 2022 without review, as a wrapper for the scikit-learn support vector machine. I see no issue associated with this classifier. Its purpose seems to be to find a use case for the dists_kerns module, but it is problematic and I think it should be removed for the following reasons.

It is not properly integrated with the distances module: it does everything through AggrDist, which is a BasePairwiseTransformerPanel, which takes a BasePairwiseTransformer. This is a tremendous amount of infrastructure to support creating a distance matrix, and it is very confusing.
Elastic distances do not generate positive semi-definite kernels. For people who use SVMs, this would be a deal breaker. The classifier has the comment "typically, SVC literature assumes kernels to be positive semi-definite. However, any pairwise transformer can be passed as kernel, including distances. This will still produce classification results, which may or may not be performant". I don't think we should introduce classifiers on this basis. If there were a paper and related results, then fair enough. However, SVMs are not used in the TSC literature and have historically performed poorly at time series classification.
It's just a wrapper. Anyone who understood it could just use the scikit-learn SVC, without having to get their head around the dist_and_kerns module. The core implementation is just this

    def _fit(self, X, y):
        self._X = X
        kernel_mat = self._kernel(X)
        self.svc_estimator_.fit(kernel_mat, y)
        return self

    def _predict(self, X):
        kernel_mat = self._kernel(X, self._X)
        y_pred = self.svc_estimator_.predict(kernel_mat)
        return y_pred
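The positive semi-definite point raised in the list above can be checked numerically. The helper below is a sketch with a hypothetical name, `is_positive_semidefinite`, and an assumed tolerance; it tests the eigenvalues of a symmetric matrix and is not part of TimeSeriesSVC.

```python
import numpy as np


def is_positive_semidefinite(K, tol=1e-10):
    """True if K is square, symmetric and has no eigenvalue below -tol."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1]:
        return False
    if not np.allclose(K, K.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues
    return bool(np.linalg.eigvalsh(K).min() >= -tol)


# A raw distance matrix between two distinct series: zero diagonal,
# positive off-diagonal entries. Its eigenvalues are +1 and -1.
distance_matrix = [[0.0, 1.0], [1.0, 0.0]]

# A linear (Gram) kernel built from feature vectors is positive semi-definite.
features = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
gram_matrix = features @ features.T
```

Any non-zero symmetric matrix with a zero diagonal has trace zero, so it must have at least one negative eigenvalue. That is why a raw distance matrix passed directly as a kernel fails this check.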

Describe the solution you'd like
Remove this classifier

[ENH] Clean up mlflow dependencies

deprecate pip install sktime[mlflow,mlflow_tests]. Move mlflow to all_extras and mlflow_tests into dev. I think @ltsaprounis you were also supportive for this?

[FORK] Replace sponsor info

remove the sponsor this project info

[ENH] CI is slow

CI is slow on some platforms, taking up to 132 minutes.

Among the slowest tests is sktime/forecasting/tests/test_all_forecasters.py::TestAllForecasters

Screenshot taken from: [test-unix (3.9, macOS-11)]

To start with: can we do something about these forecasting tests? They take >1000 seconds.

[GOV] Set up google analytics and integrate with readthedocs

[BUG] dependency on packaging module not in the install

there seems to be a new dependency which is not installed and is not a soft dependency

# -*- coding: utf-8 -*-
"""Utility to check soft dependency imports, and raise warnings or errors."""

__author__ = ["fkiraly", "mloning"]

import io
import sys
import warnings
from importlib import import_module
from inspect import isclass

from packaging.requirements import InvalidRequirement, Requirement
from packaging.specifiers import InvalidSpecifier, SpecifierSet

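One way to surface this kind of problem early is to check whether a module can be imported before relying on it. The snippet is a standard-library sketch only; `missing_modules` is a hypothetical helper, not part of the soft dependency utility quoted above, and it only handles top-level module names.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Example: report whether 'packaging' is available in this environment
# without actually importing it.
not_installed = missing_modules(["packaging"])
if not_installed:
    print(f"missing required modules: {not_installed}")
```

A check like this only reports the gap; the actual fix is still to declare the module as an install requirement.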

[BUG] B028 No explicit stacklevel keyword argument found.

Describe the bug
Calls to warn are triggering a new pre-commit failure about stack levels.

To Reproduce

Expected behavior

Adding a stacklevel argument fixes it. I will fix this in #52; I am putting this issue up for reference.
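For reference, the fix looks like the sketch below. The function name and message are made up for the example; `stacklevel=2` is one common choice, which attributes the warning to the caller of the function rather than to the line containing `warnings.warn`.

```python
import warnings


def old_helper():
    """Example function that warns with an explicit stacklevel."""
    warnings.warn(
        "old_helper is deprecated and will be removed in a future release",
        DeprecationWarning,
        stacklevel=2,  # point at the caller, which satisfies the B028 check
    )
    return 42
```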

[GOV] Choose and set up password manager for the project

[FORK] Write a "mission statement" on why we forked and what is different

This should be a medium-term goal; it does not necessarily have to be done as soon as we make the repository public.

We actually have to decide what is different first!

[BUG] failed test test_softdeps.py::test_est_fit_without_modulenotfound[TEASER]

Describe the bug
changing some docs has opened a can of worms it seems.

sktime/tests/test_softdeps.py::test_est_fit_without_modulenotfound[TEASER] - KeyError: 0
====== 1 failed, 1204 passed, 1 skipped, 30 warnings in 93.89s (0:01:33) =======
make: *** [Makefile:40: test_softdeps] Error 1

So it fails test_est_fit_without_modulenotfound for some reason. I'm guessing, but is this testing whether TEASER throws module not found when built without the correct soft dependency? It seems slow for the dictionary classifiers.

This is another example of the completely OTT testing (see #59). It's like the test suite is meant to be an oracle that can spot all possible bugs. This is just unrealistic and impractical imo.

To Reproduce

https://github.com/scikit-time/scikit-time/actions/runs/4184490173/jobs/7250212056
Expected behavior

This is how it fails

it is imo a problem with the test not the classifier, I will look closer.
Versions

[FORK] GitHub Billing

The GitHub organisation is currently on GitHub Free, which has a number of limitations (including limited actions minutes). https://github.com/organizations/scikit-time/settings/billing

How did sktime deal with this issue? I assume we can leverage our academic links i.e. https://docs.github.com/en/billing/managing-billing-for-your-github-account/discounted-subscriptions-for-github-accounts

[BUG] Documentation build breaks under Python 3.10

Describe the bug

The documentation build breaks with Python 3.10 because of sphinx-doc/sphinx#9512.

To Reproduce

pip install .[dev,docs]
cd docs
make html

Expected behavior

Additional context

The Sphinx version is currently pinned at 4.1.1, whereas the latest is 6.1.3. Does anyone remember why it is pinned?

Versions

[ENH] Operator overloading in BaseClasses

Is your feature request related to a problem? Please describe.

Operator overloading (apparently called magic or dunder methods by python people) was introduced across the toolkit last year.

sktime/sktime#2164

I simply do not see the need, for classification, clustering and regression at least. It seems like a contrivance, doing it just because you can. I can see no issues asking for it and no genuine use case, and to me at least the high-level logic of equating a pipeline with multiplication is not at all intuitive. It is not, for example, transitive. It may be that other people like it, in which case no problem.

Describe the solution you'd like

In the push for a simpler toolkit, I would like to strip it out and focus instead on using sklearn pipelines, which are more familiar to users.

Describe alternatives you've considered

If these are popular we can leave them as is; I'm just looking to simplify and reduce the bloat of the base classes.

Additional context
I would like to do the same for clustering and regression

[ENH] Support of ONNX format for model persistence and deployment

ONNX is a standardised ML persistence format for deployment, e.g. on edge devices.
We would have to write a custom ONNX "wrapper"; here is how that works for scikit-learn: https://onnx.ai/sklearn-onnx/

Useful links:
https://github.com/onnx/onnxmltools
https://github.com/onnx/onnx

Related:
sktime/sktime#1240
