
time-series-machine-learning / tsml-java


Java time series machine learning tools in a Weka compatible toolkit

License: GNU General Public License v3.0

Java 91.76% MATLAB 0.13% Python 0.11% TypeScript 8.00%

tsml-java's People

Contributors

a-pasos-ruiz, abostrom, c-eg, changweitan, craftycodie, divyodatasci, goastler, herrmannm, james-large, jasonlines, lukewalker5498, matthewmiddlehurst, mjflynn, oliver-boys, tonybagnall


tsml-java's Issues

SaveParameterInfo must go!

So, looking at my ancient interfaces to tidy, absorb or remove, we have
SaveParameterInfo: just a single abstract method that is surely better somewhere else, or could just be abandoned in favour of Weka's getOptions.
Is there an easy way in NetBeans to find all classes that implement an interface?
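
For reference, the interface is presumably no more than the following (a sketch; the signature is inferred from the HiveCoteModule discussion further down, where getParameters() is said to live here):

public interface SaveParameterInfo {
    // Returns a comma-separated description of the classifier's parameters,
    // written into the second line of the results file.
    String getParameters();
}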

Data cleaning in addition to loading

Following talks on Tuesday and today, I reckon a decent addition for both public AND research usability would be a set of default, bog-standard, non-controversial and unintelligent Weka Filters, or new 'Transformers' or whatever, for data cleaning (of time series data in particular).

These could then optionally be applied to data when loading it in via whatever means (or always 'applied', such that for standardised data like the UCR archive nothing happens). The main ideas from my POV for these would be:

  • Remove identical instances

  • Remove instances with missing classes

  • Perform basic missing value imputation

  • Perform any basic padding or truncating to get to equal length

  • Maybe more?

  • Provide a summary of changes made

From a public usability point of view, people with their fresh, imperfect dataset could then easily get numbers from our algorithms, which are not currently engineered to handle difficult factors. From a research point of view, when adding datasets to UCR TSC or UEA MTSC, we could potentially leave datasets in an imperfect form, similar to the option to leave datasets unnormalised, while also providing the bog-standard, difficult-to-argue-with way of getting them into a complete standard form for easy comparison between classifiers (a sketch of the basic chain is given below). For particular applications, people more involved/interested can define settings for, or write, their own data cleaning processes.

This could also be applied to sktime. As an addition to all this, we should probably add support for other file types, namely .ts files but also simple CSV files with no meta/attribute information. Maybe this should all be done after the/a dedicated time series instance type is drafted up. On a side note, this could probably make for a decent third-year project too.
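
As a rough illustration, three of the bullets above are already covered by stock Weka filters; a minimal sketch, assuming we simply chain them (equal-length padding/truncation and the change summary would need new code):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class BasicCleaning {
    public static Instances clean(Instances data) throws Exception {
        data.deleteWithMissingClass();                   // drop instances with missing class values

        RemoveDuplicates dedup = new RemoveDuplicates(); // drop identical instances
        dedup.setInputFormat(data);
        data = Filter.useFilter(data, dedup);

        ReplaceMissingValues impute = new ReplaceMissingValues(); // basic mean/mode imputation
        impute.setInputFormat(data);
        data = Filter.useFilter(data, impute);

        return data;
    }
}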

Non Time Series classes

currently we have vector_classifiers, vector_clusterers etc.

I propose a refactor to a single directory
weka_uea
with subdirs classifiers, clusterers etc.

Any objections?

Uniform Contracting

Of the classifiers that currently have contracting implemented, each has done it slightly differently. It would be a good idea to discuss a uniform way to implement contracting in a classifier, or, if necessary, to define the different methods of contracting and make it clearer which one each classifier uses (i.e. separate interfaces). One possible shape for a shared interface is sketched below.
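
A minimal sketch of a single shared contracting interface (hypothetical; the name and method are placeholders, not taken from the repository):

import java.util.concurrent.TimeUnit;

public interface TrainTimeContractable {
    // The classifier must make a best effort to finish buildClassifier
    // within the given wall-clock limit.
    void setTrainTimeLimit(TimeUnit unit, long amount);
}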

Tuning classifiers with TunedClassifier

I'm doing a study of TSF and RISE in both sktime and, err, time-mine. I'm going to use James's TunedClassifier, which on first look seems excellent: documented, clear example etc. I may even use it correctly. My issue here is: should we have a new interface

interface Tuneable {
    ParameterSpace getDefaultParameterSpace();
}

and maybe more? It feels a little hacky manually setting it up in ClassifierLists. A sketch of how a classifier might plug in is given below.
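
A hypothetical sketch of a classifier advertising its own default search space, so ClassifierLists no longer has to wire it up by hand. The ParameterSpace API used here (addParameter taking a name and an array of candidate values) is an assumption, not taken from the repository:

public class MyForest implements Tuneable {
    private int numTrees = 500; // the parameter being tuned

    @Override
    public ParameterSpace getDefaultParameterSpace() {
        ParameterSpace ps = new ParameterSpace();
        ps.addParameter("numTrees", new Integer[]{100, 250, 500}); // assumed signature
        return ps;
    }
}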

Experiments cmdline parameters

An idea for dealing with the current ClassifierLists issue of handling parameters. Defining a name with the parameter embedded, e.g. "RotF<100/200/500>", becomes clunky with a large number of parameter options.

Currently there are 2 use cases:

  1. I have some predetermined hyper-parameters that I want to set, e.g. numTrees, and I want to pass this via cmdline.
  2. I have some predetermined dependent parameters that I want to set via cmdline, say build time limit, e.g. 1m, 2m, 5m, 10m, 30m, 1h. Each of these depends on the previous one, and it is redundant to have 6 different cluster jobs, one for each time limit, as work would be repeated. It would be better to run one job which contracts the classifier for 1m, records results, then 2m, records results, etc.

Current thoughts on how this would work on the cmdline:

  1. pass the parameter key/values as a string separated by spaces, e.g. "-p param1 val1 param2 val2".
  2. pass incremental parameter sets in the same way, but each increment is a separate cmdline option, e.g. "-ip trainContract 60 -ip trainContract 120 -ip trainContract 300".

Benefits:

  • easy parameter passing for experiments: everything can be passed from the bash script, which makes it easier to loop over parameters
  • no need to define a new classifier id (e.g. RotF200) per parameter you're examining
  • future-proofing: every parameter for a classifier can be passed through this interface, so there's no need to add more and more JCommander options as parameters grow

I have a hacky version on the ee branch currently: https://github.com/uea-machine-learning/tsml/blob/a0278c0f11ae2fa5e8812bbef229debd69cdfdac/src/main/java/experiments/Experiments.java#L102
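
For reference, a minimal sketch of how a "-p param1 val1 param2 val2" string could be split into key/value pairs (plain Java, independent of JCommander; the class name is a placeholder):

import java.util.LinkedHashMap;
import java.util.Map;

public class ParameterStringParser {
    public static Map<String, String> parse(String paramString) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] tokens = paramString.trim().split("\\s+");
        if (tokens.length % 2 != 0)
            throw new IllegalArgumentException("Expected key/value pairs: " + paramString);
        for (int i = 0; i < tokens.length; i += 2)
            params.put(tokens[i], tokens[i + 1]);   // e.g. "numTrees" -> "500"
        return params;
    }
}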

timeseriesweka.classifiers.distance_based

this package needs tidying up and refactoring. In particular:

  1. does proximity_forest need to be in its own package/directory? (James)
  2. can FastWWS be better assimilated? (Tony)
  3. All things elastic_ensemble (Jason and George)

Issue in ImpRandomSearch

I found what could be a mistake in the code related to the random search for shapelet selection.

In this line in ImpRandomSearch.java:

Shapelet shape = checkCandidate.process(getTimeSeries(timeSeries,shapelet.var3), shapelet.var1, shapelet.var2, shapelet.var3);

var1 is the length and var2 is the position, as stated by a comment preceding it; however, the order is actually different in the checkCandidate function, whose signature is as follows:

protected Shapelet checkCandidate(Instance series, int start, int length, int dimension)

This causes some of the selected shapelets to have length 0, i.e. no content at all.
I found this issue while trying to write the generated shapelets to a file and finding some lines to be empty.
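
The presumed fix (untested) is simply to swap the two middle arguments so that start and length line up with the signature:

Shapelet shape = checkCandidate.process(getTimeSeries(timeSeries, shapelet.var3),
                                        shapelet.var2,   // start (position)
                                        shapelet.var1,   // length
                                        shapelet.var3);  // dimension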

set seed in AbstractClassifierWithTrainingInfo

Seems sensible; just need to double check it is not already in AbstractClassifier or any of our existing interfaces. IIRC Weka has Randomizable, but I think it's cleaner to just use our own.
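
A sketch of what a home-grown seeding interface could look like (hypothetical; it mirrors weka.core.Randomizable but stays within our own hierarchy):

public interface Seedable {
    void setSeed(int seed);
    int getSeed();
}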

timeseriesweka.filters.shapelet_transforms

Yikes! I will have a little tidy up (GraceShapeletTransform), but I think this needs a bottom-up reconstruction. Perhaps do the Python version first. I think James has a start on the slimmed-down version?

Time Series Extraction

Utilities.extractTimeSeries includes the class value in the extraction of the time series. This should be dropped.

FlatCote.main() results in java.lang.IllegalArgumentException

Execution of timeseriesweka.classifiers.hybrids.FlatCote.main() in master (ef5fd10) produces an exception with the following details:

Exception in thread "main" java.lang.IllegalArgumentException: Src and Dest differ in # of attributes: 68 != 1
at weka.core.RelationalLocator.copyRelationalValues(RelationalLocator.java:87)
at weka.filters.Filter.copyValues(Filter.java:371)
at weka.filters.Filter.push(Filter.java:288)
at weka.filters.SimpleBatchFilter.batchFinished(SimpleBatchFilter.java:257)
at weka.filters.Filter.useFilter(Filter.java:682)
at weka_extras.classifiers.ensembles.AbstractEnsemble.buildClassifier(AbstractEnsemble.java:890)
at timeseriesweka.classifiers.hybrids.FlatCote.buildClassifier(FlatCote.java:104)
at timeseriesweka.classifiers.hybrids.FlatCote.main(FlatCote.java:185)

The problem is caused by applying the ShapeletTransform filter in weka_extras.classifiers.ensembles.AbstractEnsemble.buildClassifier (AbstractEnsemble.java:890): it seems that going through Filter.useFilter causes the problem for ShapeletTransform.

If I call ShapeletTransform.process(data) directly, the exception is not thrown. Tested via a change to AbstractEnsemble, line 888, from:

else {
    transform.setInputFormat(data);
    this.trainInsts = Filter.useFilter(data, transform);
}

into:

else { // changed by davcem: there is some filter issue when using Filter.useFilter
    if (this.transform instanceof ShapeletTransform) {
        this.trainInsts = ((ShapeletTransform) transform).process(data);
    } else {
        transform.setInputFormat(data);
        this.trainInsts = Filter.useFilter(data, transform);
    }
}

This is just a quick hack, but I think that the "new way" of useFilter causes issues with the ShapeletTransform filter.

Java version/netbeans

A discussion point/issue arising from Travis: should we switch to IntelliJ and update from Java 8? An open question we have gone over before, but I'll leave it here.

Filters

the long-term question is: do we want to continue using the Filter mechanism in Weka for transformers?

Pros for Filter:

  1. Fits into the Weka way
  2. It's how we have done them all so far

Cons:

  1. Stupid BatchFilter usage pattern
  2. I prefer the name Transformer to Filter, and sktime uses Transformer

Any other thoughts? Whatever we do, they all need capabilities checking. A minimal Transformer alternative is sketched below.
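
A minimal sketch of what a Transformer alternative to weka.filters.Filter might look like (hypothetical; the name is borrowed from sktime and the split into fit and transform methods is an assumption):

import weka.core.Instances;

public interface Transformer {
    // Learn anything needed from the data (e.g. shapelets), then transform it.
    Instances fitTransform(Instances data) throws Exception;

    // Transform new data using whatever was learned in fitTransform.
    Instances transform(Instances data) throws Exception;
}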

TechnicalInformation

Another thing we should do from a documentation/public usability point of view over time is to add to all our own classifiers (where appropriate) the technical information type stuff.

The interface TechnicalInformationHandler defines getTechnicalInformation(), which returns a TechnicalInformation object that essentially defines the BibTeX for a paper reference for the classifier, e.g. in J48:

/**
   * Returns an instance of a TechnicalInformation object, containing 
   * detailed information about the technical background of this class,
   * e.g., paper reference or book this class is based on.
   * 
   * @return the technical information about this class
   */
public TechnicalInformation getTechnicalInformation() {
    TechnicalInformation result;

    result = new TechnicalInformation(Type.BOOK);
    result.setValue(Field.AUTHOR, "Ross Quinlan");
    result.setValue(Field.YEAR, "1993");
    result.setValue(Field.TITLE, "C4.5: Programs for Machine Learning");
    result.setValue(Field.PUBLISHER, "Morgan Kaufmann Publishers");
    result.setValue(Field.ADDRESS, "San Mateo, CA");

    return result;
}

redesign of ShapeletTransformClassifier

Desired characteristics

  1. Use ContractRotF not CAWPE
  2. Make Contractable and manage split of time between transform and classifier
  3. Facility to save transform and shapelets

These will benefit from the shapelet transform enhancements covered in other issues.

TrainAccuracyEstimator interface

So, almost certainly originating with me: some classifiers implement this interface, but all they do is a CV. This is high maintenance when Experiments could do it anyway. So I think that unless a classifier has the capacity to do something other than CV, it should not implement this interface.

For those that do, I think we should split the time to estimate error and the final build time in ClassifierResults, and store them both.

thoughts?

ShapeletTransform setting k

This is a perennial problem we need to address: how do we set k? I know from recent experiments that it matters. Using the best 100 is significantly worse than using the number generated (which seemed variable!). In ShapeletTransformClassifier I manually set it in createTransformData. This is brittle and unsatisfactory. I will try it with different default values. A priori, I think setting it to 500 should be sufficient.


ShapeletTransform finding shapelets length 2

when using TransformExperiments on Yoga, the top shapelets are all of length 2, when generating with the code shown in the attached screenshots (omitted here).


As an aside, it might be worth not allowing identical shapelets, even from different series?

redesign of HiveCote classifier

this needs updating. Desired characteristics:

  1. Use ClassifierResults not bespoke file writing
  2. Mechanism for checking if results are present, then build from file
  3. Ability to thread the components
  4. Contractable and Checkpointable, just by using Contractable and Checkpointable components.

volunteers?

refactor out HiveCoteModule interface

this interface has three methods:

public interface HiveCoteModule {
    public double getEnsembleCvAcc();
    public double[] getEnsembleCvPreds();
    public String getParameters();
}

The first two are better in TrainAccuracyEstimate (although I would change the names to getTrainAcc and getTrainPreds). The third already exists in SaveParameterInfo.

any thoughts?

Branch protection

We need some form of branch protection for the dev branch. @MatthewMiddlehurst accidentally deleted the dev branch earlier this week and managed to recover it (phew!). To prevent this from happening again, we need some kind of protection over the important branches. Master is already protected.

On GitHub, to protect a branch you have to make a branch protection rule. I made a rule for master, which is why you can't push directly to it. We have two options for dev:

  1. make an empty rule which stops accidental deletion / renaming / etc
  2. protect dev in the same way as master, preventing pushes except via pull request. This is a popular option to maintain a clean dev version, but comes at the cost/pain of having to do pull requests every time (although I think it's a good idea)

We at the very least need option 1; I would like some feedback on option 2 :)

timeseriesweka.utilities

Is all this necessary? James and Aaron are, I think, responsible for the bulk of this code. What uses it? Apart from the obvious, is most of it not needed, or only used by shapelets?

timeseriesweka.classifiers

  1. FastWWS: this is for Petitjean DTW. It should be refactored to somewhere more sensible
  2. boss: these are variants for the IDA paper. I think they can go
  3. cote: do we use these?
  4. ensembles: this seems a bit ad hoc
  5. randomboss: this should go in Matthew's development branch (i.e. local). I think we should have maybe two BOSS versions on the repo: the original (BOSS) and our best effort so far (RandomBOSS)
  6. Classifiers: I suggest we organise these into packages: dictionary, shapelet, frequencydomain, interval and timedomain

Experiments/ClassifierList

I need an adaptation to Experiments/ClassifierLists. When contracted, I want the file structure to reflect that. So if I build ShapeletTransformClassifier for 100 hours, results should be in ShapeletTransformClassifier100. A sketch of the naming rule is below.
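
A minimal sketch of the naming rule (hypothetical helper; where it would live in Experiments is an open question):

// Append the contract length to the classifier name so the results directory
// reflects the contract, e.g. ("ShapeletTransformClassifier", 100) -> "ShapeletTransformClassifier100".
public static String contractedResultsName(String classifierName, long contractHours) {
    return contractHours > 0 ? classifierName + contractHours : classifierName;
}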

ShapeletTransform contract change

Hi, in addition to the precalculated timing approach we use, I would like a wrapper that measures actual time, so that if the calculation is off, the transform stops looking when the contract is exceeded, or continues looking if time is left. A sketch of the guard is below.
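
A sketch of the wall-clock guard (hypothetical; the two abstract methods stand in for the real shapelet search machinery):

import java.util.concurrent.TimeUnit;

public abstract class ContractedShapeletSearch {
    // Placeholders for the real shapelet search loop.
    protected abstract boolean moreCandidatesToSearch();
    protected abstract void evaluateNextCandidate();

    public void searchUnderContract(long contractHours) {
        long contractNanos = TimeUnit.HOURS.toNanos(contractHours);
        long startTime = System.nanoTime();
        while (moreCandidatesToSearch()) {
            if (System.nanoTime() - startTime > contractNanos)
                break;               // contract exceeded: stop looking early
            evaluateNextCandidate(); // time left: keep looking
        }
    }
}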

distance measures

we currently have two packages for distance functions:
timeseriesweka.elastic_distance_measures and
timeseriesweka.filters.shapelet_transforms.distance_functions
It would be nice to tidy this up somehow. Any thoughts? We also have the wrapped FastWWS with its own package structure.

Distances are not classifier-only, so I think a single package is good. Weka puts them in core, but I don't like that. I think just a package distances? Maybe subpackages elastic_distance and window_distance? And then, do we continue to implement Weka's DistanceFunction?

BOSS train files

Hi, BOSS currently creates the train files internally, but the format is not consistent with the others. It currently looks like:

Beef,BOSSEnsemble,train
FullyNestedEstimates,true,numSeries,1,numclassifiers0,12,windowSize,10,wordLength,10,alphabetSize,4,norm,true,windowSize,22,wordLength,8,alphabetSize,4,norm,true,windowSize,25,wordLength,12,alphabetSize,4,norm,true,windowSize,400,wordLength,8,alphabetSize,4,norm,true,windowSize,445,wordLength,14,alphabetSize,4,norm,true,windowSize,16,wordLength,10,alphabetSize,4,norm,false,windowSize,22,wordLength,16,alphabetSize,4,norm,false,windowSize,25,wordLength,14,alphabetSize,4,norm,false,windowSize,31,wordLength,10,alphabetSize,4,norm,false,windowSize,55,wordLength,16,alphabetSize,4,norm,false,windowSize,379,wordLength,16,alphabetSize,4,norm,false,windowSize,445,wordLength,16,alphabetSize,4,norm,false
0.6
0.0,0.0

but it should look like:
Beef,RISE,train,0,NANOSECONDS,PREDICTIONS, Generated by Experiments.java
FullyNestedEstimates,true,numClassifiers,500,MinInterval,16,Filter0,PowerSpectrum,Filter1,ACF
0.7333333333333333,116582716091,125647425168,-1,-1,5
0,0,,0.498,0.1,0.15,0.118,0.134,,3726164889,,
0,0,,0.706,0.08,0.052,0.084,0.078,,4325897589,,
0,0,,0.696,0.106,0.044,0.078,0.076,,4017730289,,

I have not looked at RBOSS though

Standard tie-breaking practice

Based on a small discussion we had in the office: do we have a standard practice for tie-breaking? Do all classifiers use the same method, and what method should be used? I.e. first item, random selection, or weighted random selection using the class distribution (a sketch of seeded random tie-breaking is given below).
This is mostly in the context of class selection for classification, but it could be extended to other situations with ties.
Mainly for discussion; if this is a non-issue feel free to close.
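
A sketch of one candidate policy, seeded random selection among tied classes, so that results stay reproducible under a fixed seed (a standalone helper, not from the repository):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class TieBreak {
    // Returns the index of the largest value, breaking ties uniformly at random.
    public static int argMax(double[] dist, Random rng) {
        double max = Double.NEGATIVE_INFINITY;
        List<Integer> ties = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] > max) {
                max = dist[i];
                ties.clear();
                ties.add(i);
            } else if (dist[i] == max) {
                ties.add(i);
            }
        }
        return ties.get(rng.nextInt(ties.size()));
    }
}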

DTW variants

A bit more housekeeping of my ancient code. DTW, important benchmark etc. I've moved the variants into classifiers.distance_based. I know a lot of it is old, because it has Eclipse indentation from 10 years ago. We have:

SlowDTW_1NN: standard naive version with full CV; seems fully integrated, but needs updating for new experiments etc.


FastDTW_1NN: my own (unpublished) DTW speed-ups. Also integrated, as with the slow version.

FastDTW: wrapper for the SDM fast version; not integrated, and may have a memory bug.

DTW_kNN: I have no idea; I think I can remove this. I think it's just a more configurable SlowDTW with added early abandon.

So what to do?

  1. Smash it all into one configurable NN_DTW (or whatever we call it)
  2. keep FastDTW separate and TechnicalInfo'ed, but merge the others, deleting DTW_kNN

Any thoughts? The other classifiers here are called:

DD_DTW
DTD_C
ElasticEnsemble
NN_CID
ProximityForest

Numerosity Reduction in BOP

Numerosity reduction in the BOP class BagOfPatternsFilter, method buildBag, is possibly not functioning correctly.
prevPattern is initialised to an array of -1 values and is used to compare to the current pattern, which is removed if identical.
It appears that the values of this array are never updated; posting this issue to look into it after the sktime sprint. The presumed fix is sketched below.
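
A sketch of the presumed fix (untested; names are based on the issue text, not the actual code): prevPattern must be updated after each comparison, otherwise every pattern is compared against the initial array of -1s and numerosity reduction never fires.

import java.util.Arrays;
import java.util.List;

public class NumerosityReduction {
    // Adds the pattern to the bag only if it differs from its predecessor,
    // and returns the value prevPattern should take for the next iteration.
    public static int[] addIfNotRepeated(List<int[]> bag, int[] pattern, int[] prevPattern) {
        if (!Arrays.equals(pattern, prevPattern))
            bag.add(pattern);
        return Arrays.copyOf(pattern, pattern.length); // the missing update
    }
}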

setClassifier and horribleGlobalPath

Hi, James has this to be backward compatible with my hack for reading CAWPE from file. I propose we remove 'classic' and my global strings, and just have the switch inside setClassifier(ExperimentalArguments a).

Does anyone know of any reason not to do this? The only downside is that you need to assume the results files for CAWPEFROMFILE will all be in the same directory, and that will have to be where CAWPEFROMFILE writes to (if you use setClassifier to run this). I have no problem with that.


Master-Compile errors package statistics.transformations

I checked out master (ef5fd10) yesterday and noticed compile problems, since the package specification for all files in statistics.transformations is package transformations instead of package statistics.transformations;.
src/main/java/statistics/tests/ResidualTests.java also imports the transformations package.
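
The fix is the one-line package declaration in every file under src/main/java/statistics/transformations (plus the matching import in ResidualTests.java):

package statistics.transformations; // was: package transformations;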

Test

test for slack

ShapeletTransform duplicates

I'd like a check to not allow exactly the same shapelets. It generally won't be an issue; this arose because flat lines are discriminatory in Yoga, so the shapelet 0,0,0 appears multiple times, wasting the shapelet allocation.

vector_classifiers

this directory needs consolidation and tidying up. In particular, we have
ensembles:
stackers
weightedvoters
and then loads of other one-offs. We should maybe model the Weka file structure.

Clear out and tidy up

Changes to core repo

  1. Remove development package and most in it, replace with experiments package.
