
time-series-machine-learning / tsml-java


Java time series machine learning tools in a Weka compatible toolkit

License: GNU General Public License v3.0

Java 91.76% MATLAB 0.13% Python 0.11% TypeScript 8.00%

tsml-java's People

Contributors

a-pasos-ruiz, abostrom, c-eg, changweitan, craftycodie, divyodatasci, goastler, herrmannm, james-large, jasonlines, lukewalker5498, matthewmiddlehurst, mjflynn, oliver-boys, tonybagnall


tsml-java's Issues

SaveParameterInfo must go!

So, looking at my ancient interfaces to tidy, absorb or remove, we have
SaveParameterInfo: just a single abstract method that is surely better somewhere else, or could just be abandoned in favour of Weka's getOptions.
Is there an easy way in NetBeans to find all classes that implement an interface?
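
For reference, the interface is presumably no more than the following (a sketch; the signature is inferred from the HiveCoteModule discussion further down, where getParameters() is said to live here):

public interface SaveParameterInfo {
    // Returns a comma-separated description of the classifier's parameters,
    // written into the second line of the results file.
    String getParameters();
}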

Data cleaning in addition to loading

Following talks on Tuesday and today, I reckon a decent addition for both public AND research usability would be a set of default, bog-standard, non-controversial and unintelligent Weka Filters, or new 'Transformers' or whatever, for data cleaning (of time series data in particular).

These could then optionally be applied to data when loading it in via whatever means (or always 'applied', such that for standardised data like the UCR archive nothing happens). The main ideas from my POV for these would be:

  • Remove identical instances

  • Remove instances with missing classes

  • Perform basic missing value imputation

  • Perform any basic padding or truncating to get to equal length

  • Maybe more?

  • Provide a summary of changes made

From a public usability point of view, people with their fresh, imperfect dataset could then easily get numbers from our algorithms, which are not currently engineered to handle difficult factors. From a research point of view, when adding datasets to UCR TSC or UEA MTSC, we could potentially leave datasets in an imperfect form, similar to the option to leave datasets unnormalised, while also providing the bog-standard, difficult-to-argue-with way of getting them into a complete standard form for easy comparison between classifiers (a sketch of the basic chain is given below). For particular applications, people more involved/interested can define settings for, or write, their own data cleaning processes.

This could also be applied to sktime. As an addition to all this, we should probably add support for other file types, namely .ts files but also simple CSV files with no meta/attribute information. Maybe this should all be done after the/a dedicated time series instance type is drafted up. On a side note, this could probably make for a decent third-year project too.
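
As a rough illustration, three of the bullets above are already covered by stock Weka filters; a minimal sketch, assuming we simply chain them (equal-length padding/truncation and the change summary would need new code):

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class BasicCleaning {
    public static Instances clean(Instances data) throws Exception {
        data.deleteWithMissingClass();                   // drop instances with missing class values

        RemoveDuplicates dedup = new RemoveDuplicates(); // drop identical instances
        dedup.setInputFormat(data);
        data = Filter.useFilter(data, dedup);

        ReplaceMissingValues impute = new ReplaceMissingValues(); // basic mean/mode imputation
        impute.setInputFormat(data);
        data = Filter.useFilter(data, impute);

        return data;
    }
}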

Non Time Series classes

currently we have vector_classifiers, vector_clusterers etc.

I propose a refactor to a single directory
weka_uea
with subdirs classifiers, clusterers etc.

Any objections?

Uniform Contracting

Of the classifiers that currently have contracting implemented, each has done it slightly differently. It would be a good idea to discuss a uniform way to implement contracting in a classifier, or, if necessary, to define the different methods of contracting and make it clearer which one each classifier uses (i.e. separate interfaces). One possible shape for a shared interface is sketched below.
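
A minimal sketch of a single shared contracting interface (hypothetical; the name and method are placeholders, not taken from the repository):

import java.util.concurrent.TimeUnit;

public interface TrainTimeContractable {
    // The classifier must make a best effort to finish buildClassifier
    // within the given wall-clock limit.
    void setTrainTimeLimit(TimeUnit unit, long amount);
}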

Tuning classifiers with TunedClassifier

I'm doing a study of TSF and RISE in both sktime and, err, time-mine. I'm going to use James's TunedClassifier, which on first look seems excellent: documented, clear example etc. I may even use it correctly. My issue here is: should we have a new interface

interface Tuneable {
    ParameterSpace getDefaultParameterSpace();
}

and maybe more? It feels a little hacky manually setting it up in ClassifierLists. A sketch of how a classifier might plug in is given below.
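
A hypothetical sketch of a classifier advertising its own default search space, so ClassifierLists no longer has to wire it up by hand. The ParameterSpace API used here (addParameter taking a name and an array of candidate values) is an assumption, not taken from the repository:

public class MyForest implements Tuneable {
    private int numTrees = 500; // the parameter being tuned

    @Override
    public ParameterSpace getDefaultParameterSpace() {
        ParameterSpace ps = new ParameterSpace();
        ps.addParameter("numTrees", new Integer[]{100, 250, 500}); // assumed signature
        return ps;
    }
}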

Experiments cmdline parameters

An idea for dealing with the current ClassifierLists issue of handling parameters. Defining a name with the parameter embedded, e.g. "RotF<100/200/500>", becomes clunky with a large number of parameter options.

Currently there are 2 use cases:

  1. I have some predetermined hyper-parameters that I want to set, e.g. numTrees, and I want to pass this via cmdline.
  2. I have some predetermined dependent parameters that I want to set via cmdline, say build time limit, e.g. 1m, 2m, 5m, 10m, 30m, 1h. Each of these depends on the previous one, and it is redundant to have 6 different cluster jobs, one for each time limit, as work would be repeated. It would be better to run one job which contracts the classifier for 1m, records results, then 2m, records results, etc.

Current thoughts on how this would work on the cmdline:

  1. pass the parameter key/values as a string separated by spaces, e.g. "-p param1 val1 param2 val2".
  2. pass incremental parameter sets in the same way, but each increment is a separate cmdline option, e.g. "-ip trainContract 60 -ip trainContract 120 -ip trainContract 300".

Benefits:

  • easy parameter passing for experiments: everything can be passed from the bash script, which makes it easier to loop over parameters
  • no need to define a new classifier id (e.g. RotF200) per parameter you're examining
  • future-proofing: every parameter for a classifier can be passed through this interface, so there's no need to add more and more JCommander options as parameters grow

I have a hacky version on the ee branch currently: https://github.com/uea-machine-learning/tsml/blob/a0278c0f11ae2fa5e8812bbef229debd69cdfdac/src/main/java/experiments/Experiments.java#L102
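
For reference, a minimal sketch of how a "-p param1 val1 param2 val2" string could be split into key/value pairs (plain Java, independent of JCommander; the class name is a placeholder):

import java.util.LinkedHashMap;
import java.util.Map;

public class ParameterStringParser {
    public static Map<String, String> parse(String paramString) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] tokens = paramString.trim().split("\\s+");
        if (tokens.length % 2 != 0)
            throw new IllegalArgumentException("Expected key/value pairs: " + paramString);
        for (int i = 0; i < tokens.length; i += 2)
            params.put(tokens[i], tokens[i + 1]);   // e.g. "numTrees" -> "500"
        return params;
    }
}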

timeseriesweka.classifiers.distance_based

this package needs tidying up and refactoring. In particular:

  1. does proximity_forest need to be in its own package/directory? (James)
  2. can FastWWS be better assimilated? (Tony)
  3. All things elastic_ensemble (Jason and George)

Issue in ImpRandomSearch

I found what could be a mistake in the code related to the random search for shapelet selection.

In this line in ImpRandomSearch.java:

Shapelet shape = checkCandidate.process(getTimeSeries(timeSeries,shapelet.var3), shapelet.var1, shapelet.var2, shapelet.var3);

var1 is the length and var2 is the position, as stated by a comment preceding it; however, the order is actually different in the checkCandidate function, whose signature is as follows:

protected Shapelet checkCandidate(Instance series, int start, int length, int dimension)

This causes some of the selected shapelets to have length 0, i.e. no content at all.
I found this issue while trying to write the generated shapelets to a file and finding some lines to be empty.
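
The presumed fix (untested) is simply to swap the two middle arguments so that start and length line up with the signature:

Shapelet shape = checkCandidate.process(getTimeSeries(timeSeries, shapelet.var3),
                                        shapelet.var2,   // start (position)
                                        shapelet.var1,   // length
                                        shapelet.var3);  // dimension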

set seed in AbstractClassifierWithTrainingInfo

Seems sensible; just need to double check it is not already in AbstractClassifier or any of our existing interfaces. IIRC Weka has Randomizable, but I think it's cleaner to just use our own.
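
A sketch of what a home-grown seeding interface could look like (hypothetical; it mirrors weka.core.Randomizable but stays within our own hierarchy):

public interface Seedable {
    void setSeed(int seed);
    int getSeed();
}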

timeseriesweka.filters.shapelet_transforms

Yikes! I will have a little tidy up (GraceShapeletTransform), but I think this needs a bottom-up reconstruction. Perhaps do the Python version first. I think James has a start on the slimmed-down version?

Time Series Extraction

Utilities.extractTimeSeries includes the class value in the extraction of the time series. This should be dropped.

FlatCote.main() results in java.lang.IllegalArgumentException

Execution of timeseriesweka.classifiers.hybrids.FlatCote.main() in master (ef5fd10) produces an exception with the following details:

Exception in thread "main" java.lang.IllegalArgumentException: Src and Dest differ in # of attributes: 68 != 1
at weka.core.RelationalLocator.copyRelationalValues(RelationalLocator.java:87)
at weka.filters.Filter.copyValues(Filter.java:371)
at weka.filters.Filter.push(Filter.java:288)
at weka.filters.SimpleBatchFilter.batchFinished(SimpleBatchFilter.java:257)
at weka.filters.Filter.useFilter(Filter.java:682)
at weka_extras.classifiers.ensembles.AbstractEnsemble.buildClassifier(AbstractEnsemble.java:890)
at timeseriesweka.classifiers.hybrids.FlatCote.buildClassifier(FlatCote.java:104)
at timeseriesweka.classifiers.hybrids.FlatCote.main(FlatCote.java:185)

The problem is caused by applying the ShapeletTransform filter in weka_extras.classifiers.ensembles.AbstractEnsemble.buildClassifier (AbstractEnsemble.java:890): it seems that going through Filter.useFilter causes the problem for ShapeletTransform.

If I call ShapeletTransform.process(data) directly, the exception is not thrown. Tested via a change to AbstractEnsemble, line 888, from:

else {
    transform.setInputFormat(data);
    this.trainInsts = Filter.useFilter(data, transform);
}

into:

else { // changed by davcem: there is some filter issue when using Filter.useFilter
    if (this.transform instanceof ShapeletTransform) {
        this.trainInsts = ((ShapeletTransform) transform).process(data);
    } else {
        transform.setInputFormat(data);
        this.trainInsts = Filter.useFilter(data, transform);
    }
}

This is just a quick hack, but I think that the "new way" of useFilter causes issues with the ShapeletTransform filter.

Java version/netbeans

A discussion point/issue arising from Travis: should we switch to IntelliJ and update from Java 8? An open question we have gone over before, but I'll leave it here.

Filters

the long-term question is: do we want to continue using the Filter mechanism in Weka for transformers?

Pros for Filter:

  1. Fits into the Weka way
  2. It's how we have done them all so far

Cons:

  1. Stupid BatchFilter usage pattern
  2. I prefer the name Transformer to Filter, and sktime uses Transformer

Any other thoughts? Whatever we do, they all need capabilities checking. A minimal Transformer alternative is sketched below.
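
A minimal sketch of what a Transformer alternative to weka.filters.Filter might look like (hypothetical; the name is borrowed from sktime and the split into fit and transform methods is an assumption):

import weka.core.Instances;

public interface Transformer {
    // Learn anything needed from the data (e.g. shapelets), then transform it.
    Instances fitTransform(Instances data) throws Exception;

    // Transform new data using whatever was learned in fitTransform.
    Instances transform(Instances data) throws Exception;
}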

TechnicalInformation

Another thing we should do from a documentation/public usability point of view over time is to add to all our own classifiers (where appropriate) the technical information type stuff.

The interface TechnicalInformationHandler defines getTechnicalInformation(), which returns a TechnicalInformation object that essentially defines the BibTeX for a paper reference for the classifier, e.g. in J48:

/**
   * Returns an instance of a TechnicalInformation object, containing 
   * detailed information about the technical background of this class,
   * e.g., paper reference or book this class is based on.
   * 
   * @return the technical information about this class
   */
public TechnicalInformation getTechnicalInformation() {
    TechnicalInformation result;

    result = new TechnicalInformation(Type.BOOK);
    result.setValue(Field.AUTHOR, "Ross Quinlan");
    result.setValue(Field.YEAR, "1993");
    result.setValue(Field.TITLE, "C4.5: Programs for Machine Learning");
    result.setValue(Field.PUBLISHER, "Morgan Kaufmann Publishers");
    result.setValue(Field.ADDRESS, "San Mateo, CA");

    return result;
}

redesign of ShapeletTransformClassifier

Desired characteristics

  1. Use ContractRotF not CAWPE
  2. Make Contractable and manage split of time between transform and classifier
  3. Facility to save transform and shapelets

These will benefit from the shapelet transform enhancements covered in other issues.

TrainAccuracyEstimator interface

So, almost certainly originating with me: some classifiers implement this interface, but all they do is a CV. This is high maintenance when Experiments could do it anyway. So I think that unless a classifier has the capacity to do something other than CV, it should not implement this interface.

For those that do, I think we should split the time to estimate error and the final build time in ClassifierResults, and store them both.

thoughts?

ShapeletTransform setting k

This is a perennial problem we need to address: how do we set k? I know from recent experiments that it matters. Using the best 100 is significantly worse than using the number generated (which seemed variable!). In ShapeletTransformClassifier I manually set it in createTransformData. This is brittle and unsatisfactory. I will try it with different default values. A priori, I think setting it to 500 should be sufficient.


ShapeletTransform finding shapelets length 2

when using TransformExperiments on Yoga, the top shapelets are all of length 2, when generating with the code shown in the attached screenshots (omitted here).


As an aside, it might be worth not allowing identical shapelets, even from different series?

redesign of HiveCote classifier

this needs updating. Desired characteristics:

  1. Use ClassifierResults not bespoke file writing
  2. Mechanism for checking if results are present, then build from file
  3. Ability to thread the components
  4. Contractable and Checkpointable, just by using Contractable and Checkpointable components.

volunteers?

refactor out HiveCoteModule interface

this interface has three methods:

public interface HiveCoteModule {
    public double getEnsembleCvAcc();
    public double[] getEnsembleCvPreds();
    public String getParameters();
}

The first two are better in TrainAccuracyEstimate (although I would change the names to getTrainAcc and getTrainPreds). The third already exists in SaveParameterInfo.

any thoughts?

Branch protection

We need some form of branch protection for the dev branch. @MatthewMiddlehurst accidentally deleted the dev branch earlier this week and managed to recover it (phew!). To prevent this from happening again, we need some kind of protection over the important branches. Master is already protected.

On GitHub, to protect a branch you have to make a branch protection rule. I made a rule for master, which is why you can't push directly to it. We have two options for dev:

  1. make an empty rule which stops accidental deletion / renaming / etc
  2. protect dev in the same way as master, preventing pushes except via pull request. This is a popular option to maintain a clean dev version, but comes at the cost/pain of having to do pull requests every time (although I think it's a good idea)

We at the very least need option 1; I would like some feedback on option 2 :)

timeseriesweka.utilities

Is all this necessary? James and Aaron are, I think, responsible for the bulk of this code. What uses it? Apart from the obvious, is most of it not needed, or only used by shapelets?

timeseriesweka.classifiers

  1. FastWWS: this is for Petitjean DTW. It should be refactored to somewhere more sensible
  2. boss: these are variants for the IDA paper. I think they can go
  3. cote: do we use these?
  4. ensembles: this seems a bit ad hoc
  5. randomboss: this should go in Matthew's development branch (i.e. local). I think we should have maybe two BOSS versions on the repo: the original (BOSS) and our best effort so far (RandomBOSS)
  6. Classifiers: I suggest we organise these into packages: dictionary, shapelet, frequencydomain, interval and timedomain

Experiments/ClassifierList

I need an adaptation to Experiments/ClassifierLists. When contracted, I want the file structure to reflect that. So if I build ShapeletTransformClassifier for 100 hours, results should be in ShapeletTransformClassifier100. A sketch of the naming rule is below.
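
A minimal sketch of the naming rule (hypothetical helper; where it would live in Experiments is an open question):

// Append the contract length to the classifier name so the results directory
// reflects the contract, e.g. ("ShapeletTransformClassifier", 100) -> "ShapeletTransformClassifier100".
public static String contractedResultsName(String classifierName, long contractHours) {
    return contractHours > 0 ? classifierName + contractHours : classifierName;
}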

ShapeletTransform contract change

Hi, in addition to the precalculated timing approach we use, I would like a wrapper that measures actual time, so that if the calculation is off, the transform stops looking when the contract is exceeded, or continues looking if time is left. A sketch of the guard is below.
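
A sketch of the wall-clock guard (hypothetical; the two abstract methods stand in for the real shapelet search machinery):

import java.util.concurrent.TimeUnit;

public abstract class ContractedShapeletSearch {
    // Placeholders for the real shapelet search loop.
    protected abstract boolean moreCandidatesToSearch();
    protected abstract void evaluateNextCandidate();

    public void searchUnderContract(long contractHours) {
        long contractNanos = TimeUnit.HOURS.toNanos(contractHours);
        long startTime = System.nanoTime();
        while (moreCandidatesToSearch()) {
            if (System.nanoTime() - startTime > contractNanos)
                break;               // contract exceeded: stop looking early
            evaluateNextCandidate(); // time left: keep looking
        }
    }
}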

distance measures

we currently have two packages for distance functions:
timeseriesweka.elastic_distance_measures and
timeseriesweka.filters.shapelet_transforms.distance_functions
It would be nice to tidy this up somehow. Any thoughts? We also have the wrapped FastWWS with its own package structure.

Distances are not classifier-only, so I think a single package is good. Weka puts them in core, but I don't like that. I think just a package distances? Maybe subpackages elastic_distance and window_distance? And then, do we continue to implement Weka's DistanceFunction?

BOSS train files

Hi, BOSS currently creates the train files internally, but the format is not consistent with the others. It currently looks like:

Beef,BOSSEnsemble,train
FullyNestedEstimates,true,numSeries,1,numclassifiers0,12,windowSize,10,wordLength,10,alphabetSize,4,norm,true,windowSize,22,wordLength,8,alphabetSize,4,norm,true,windowSize,25,wordLength,12,alphabetSize,4,norm,true,windowSize,400,wordLength,8,alphabetSize,4,norm,true,windowSize,445,wordLength,14,alphabetSize,4,norm,true,windowSize,16,wordLength,10,alphabetSize,4,norm,false,windowSize,22,wordLength,16,alphabetSize,4,norm,false,windowSize,25,wordLength,14,alphabetSize,4,norm,false,windowSize,31,wordLength,10,alphabetSize,4,norm,false,windowSize,55,wordLength,16,alphabetSize,4,norm,false,windowSize,379,wordLength,16,alphabetSize,4,norm,false,windowSize,445,wordLength,16,alphabetSize,4,norm,false
0.6
0.0,0.0

but it should look like:
Beef,RISE,train,0,NANOSECONDS,PREDICTIONS, Generated by Experiments.java
FullyNestedEstimates,true,numClassifiers,500,MinInterval,16,Filter0,PowerSpectrum,Filter1,ACF
0.7333333333333333,116582716091,125647425168,-1,-1,5
0,0,,0.498,0.1,0.15,0.118,0.134,,3726164889,,
0,0,,0.706,0.08,0.052,0.084,0.078,,4325897589,,
0,0,,0.696,0.106,0.044,0.078,0.076,,4017730289,,

I have not looked at RBOSS though

Standard tie-breaking practice

Based on a small discussion we had in the office: do we have a standard practice for tie-breaking? Do all classifiers use the same method, and what method should be used? I.e. first item, random selection, or weighted random selection using the class distribution (a sketch of seeded random tie-breaking is given below).
This is mostly in the context of class selection for classification, but it could be extended to other situations with ties.
Mainly for discussion; if this is a non-issue feel free to close.
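
A sketch of one candidate policy, seeded random selection among tied classes, so that results stay reproducible under a fixed seed (a standalone helper, not from the repository):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class TieBreak {
    // Returns the index of the largest value, breaking ties uniformly at random.
    public static int argMax(double[] dist, Random rng) {
        double max = Double.NEGATIVE_INFINITY;
        List<Integer> ties = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] > max) {
                max = dist[i];
                ties.clear();
                ties.add(i);
            } else if (dist[i] == max) {
                ties.add(i);
            }
        }
        return ties.get(rng.nextInt(ties.size()));
    }
}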

DTW variants

A bit more housekeeping of my ancient code. DTW, important benchmark etc. I've moved the variants into classifiers.distance_based. I know a lot of it is old, because it has Eclipse indentation from 10 years ago. We have:

SlowDTW_1NN: standard naive version with full CV; seems fully integrated, but needs updating for new experiments etc.


FastDTW_1NN: my own (unpublished) DTW speed-ups. Also integrated, as with the slow version.

FastDTW: wrapper for the SDM fast version; not integrated, and may have a memory bug.

DTW_kNN: I have no idea; I think I can remove this. I think it's just a more configurable SlowDTW with added early abandon.

So what to do?

  1. Smash it all into one configurable NN_DTW (or whatever we call it)
  2. keep FastDTW separate and TechnicalInfo'ed, but merge the others, deleting DTW_kNN

Any thoughts? The other classifiers here are called:

DD_DTW
DTD_C
ElasticEnsemble
NN_CID
ProximityForest

Numerosity Reduction in BOP

Numerosity reduction in the BOP class BagOfPatternsFilter, method buildBag, is possibly not functioning correctly.
prevPattern is initialised to an array of -1 values and is used to compare to the current pattern, which is removed if identical.
It appears that the values of this array are never updated; posting this issue to look into it after the sktime sprint. The presumed fix is sketched below.
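
A sketch of the presumed fix (untested; names are based on the issue text, not the actual code): prevPattern must be updated after each comparison, otherwise every pattern is compared against the initial array of -1s and numerosity reduction never fires.

import java.util.Arrays;
import java.util.List;

public class NumerosityReduction {
    // Adds the pattern to the bag only if it differs from its predecessor,
    // and returns the value prevPattern should take for the next iteration.
    public static int[] addIfNotRepeated(List<int[]> bag, int[] pattern, int[] prevPattern) {
        if (!Arrays.equals(pattern, prevPattern))
            bag.add(pattern);
        return Arrays.copyOf(pattern, pattern.length); // the missing update
    }
}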

setClassifier and horribleGlobalPath

Hi, James has this to be backward compatible with my hack for reading CAWPE from file. I propose we remove 'classic' and my global strings, and just have the switch inside setClassifier(ExperimentalArguments a).

Does anyone know of any reason not to do this? The only downside is that you need to assume the results files for CAWPEFROMFILE will all be in the same directory, and that will have to be where CAWPEFROMFILE writes to (if you use setClassifier to run this). I have no problem with that.


Master-Compile errors package statistics.transformations

I checked out master (ef5fd10) yesterday and noticed compile problems, since the package specification for all files in statistics.transformations is package transformations instead of package statistics.transformations;.
src/main/java/statistics/tests/ResidualTests.java also imports the transformations package.
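
The fix is the one-line package declaration in every file under src/main/java/statistics/transformations (plus the matching import in ResidualTests.java):

package statistics.transformations; // was: package transformations;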

Test

test for slack

ShapeletTransform duplicates

I'd like a check to not allow exactly the same shapelets. It generally won't be an issue; this arose because flat lines are discriminatory in Yoga, so the shapelet 0,0,0 appears multiple times, wasting the shapelet allocation.

vector_classifiers

this directory needs consolidation and tidying up. In particular, we have
ensembles:
stackers
weightedvoters
and then loads of other one-offs. We should maybe model the Weka file structure.

Clear out and tidy up

Changes to core repo

  1. Remove development package and most in it, replace with experiments package.
