time-series-machine-learning / tsml-java
Java time series machine learning tools in a Weka compatible toolkit
License: GNU General Public License v3.0
Raised after Matthew's comments. Should we put all our contract, train estimate, etc. interfaces in one place? If so, where?
So looking at my ancient interfaces to tidy/absorb or remove, we have
SaveParameterInfo: just a single abstract method that is surely better somewhere else, or could just be abandoned in favour of Weka's getOptions.
Is there an easy way in NetBeans to find all classes that implement an interface?
Following talks on Tuesday and today, I reckon a decent addition for both public AND research usability would be to have default, bog-standard, non-controversial and unintelligent Weka Filters, or new 'Transformers' or whatever, for data cleaning (of time series data in particular).
These could then optionally be applied to data when loading it in via whatever means (or always 'applied', such that for standardised UCR data nothing happens). The main ideas from my POV would be:
Remove identical instances
Remove instances with missing classes
Perform basic missing value imputation
Perform any basic padding or truncating to get to equal length
Maybe more?
Provide a summary of changes made
Such that, from a public usability point of view, people with a fresh, imperfect dataset can easily get numbers from our algorithms, which are not currently engineered to handle difficult factors. From a research point of view, when adding datasets to the UCR TSC or UEA MTSC archives, we could leave datasets in an imperfect form, similar to the option to leave datasets unnormalised, while also providing a bog-standard, hard-to-argue-with way of getting them into a complete standard form for easy comparison between classifiers. For particular applications, people more involved/interested can define settings for, or write, their own data cleaning processes.
This can also be applied to sktime. In addition to all this, we should probably add support for other file types, namely .ts files but also simple csv files with no meta/attribute information. Maybe this should all be done after the/a dedicated time series instance type is drafted up. On a side note, this could probably make for a decent third year project too.
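One of the proposed cleaning steps above, padding or truncating to equal length, can be sketched in a few lines. This is a hypothetical, dependency-free illustration; the class and method names are not from the toolkit, and padding with the last observed value is just one of several reasonable policies.

```java
// Hypothetical sketch of one proposed cleaning step: force every series to a
// target length by truncating or padding (here, repeating the last value).
public class SeriesLengthNormaliser {

    // Return a copy of 'series' of exactly targetLength values.
    public static double[] toLength(double[] series, int targetLength) {
        double[] out = new double[targetLength];
        for (int i = 0; i < targetLength; i++) {
            out[i] = i < series.length
                   ? series[i]
                   : series[series.length - 1]; // pad with the last value
        }
        return out;
    }

    public static void main(String[] args) {
        double[] padded = toLength(new double[]{1, 2, 3}, 5);
        double[] truncated = toLength(new double[]{1, 2, 3, 4, 5}, 3);
        System.out.println(padded.length + " " + truncated.length);
    }
}
```

A real Transformer would additionally record what it changed, to support the "summary of changes" idea above.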
The Gradle config is still set to the old name "uea-tsc", so it is very confusing when opened in an IDE that supports Gradle!
currently we have vector_classifiers, vector_clusterers etc.
I propose refactor to have a single directory
weka_uea
with subdirs classifiers, clusterers, etc.
any objections?
The classifiers that currently implement contracting each do it slightly differently. It would be a good idea to discuss a uniform way to implement contracting, or, if necessary, define different methods of contracting and make it clearer which one each classifier uses (i.e. separate interfaces).
Weka's getOptions is probably more intuitive and plays better with Weka, so we should use that instead of getParameters. This needs changing when adjusting the interfaces.
I'm doing a study of TSF and RISE in both sktime and, err, time-mine. I'm going to use James's TunedClassifier, which on first look seems excellent: documented, clear example, etc. I may even use it correctly. My issue here is, should we have a new interface
interface Tuneable {
    ParameterSpace getDefaultParameterSpace();
}
and maybe more? It feels a little hacky manually setting it up in ClassifierLists.
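To make the proposal concrete, here is a self-contained sketch of how a classifier might implement such an interface. Note that ParameterSpace is stubbed here as a simple name-to-values map purely so the example runs standalone; the toolkit's real ParameterSpace class may look quite different, and ExampleClassifier is hypothetical.

```java
import java.util.*;

// Illustrative sketch of the proposed Tuneable interface, with a stubbed
// ParameterSpace (the real one in the toolkit may differ).
public class TuneableSketch {

    static class ParameterSpace {
        final Map<String, List<Object>> space = new LinkedHashMap<>();
        void addParameter(String name, List<Object> values) {
            space.put(name, values);
        }
    }

    interface Tuneable {
        ParameterSpace getDefaultParameterSpace();
    }

    // A classifier exposing its own sensible default grid, rather than the
    // grid being set up manually in ClassifierLists.
    static class ExampleClassifier implements Tuneable {
        public ParameterSpace getDefaultParameterSpace() {
            ParameterSpace ps = new ParameterSpace();
            ps.addParameter("numTrees", Arrays.asList(100, 200, 500));
            return ps;
        }
    }

    public static void main(String[] args) {
        ParameterSpace ps = new ExampleClassifier().getDefaultParameterSpace();
        System.out.println(ps.space.keySet());
    }
}
```

A tuner such as TunedClassifier could then pull the default space from any Tuneable it wraps, falling back to a user-supplied one if given.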
Idea for dealing with the current ClassifierLists issue of handling parameters. Defining a name with the parameter, e.g. "RotF<100/200/500>" becomes clunky with a large number of parameter options.
Currently there are two use cases:
Current thoughts on how this would work on the cmdline are:
Benefits:
I have a hacky version on the ee branch currently: https://github.com/uea-machine-learning/tsml/blob/a0278c0f11ae2fa5e8812bbef229debd69cdfdac/src/main/java/experiments/Experiments.java#L102
this package needs tidying up and refactoring. In particular:
I found what could be a mistake in the code related to the random search for shapelet selection.
In this line in ImpRandomSearch.java:
Shapelet shape = checkCandidate.process(getTimeSeries(timeSeries,shapelet.var3), shapelet.var1, shapelet.var2, shapelet.var3);
var1 is the length and var2 is the position, as stated by a comment preceding it; however, the argument order is actually different in the checkCandidate function, which is as follows:
protected Shapelet checkCandidate(Instance series, int start, int length, int dimension)
This causes some of the selected shapelets to have length 0, i.e. no content at all.
I found this issue while trying to write the generated shapelets to a file and found some lines to be empty.
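A minimal, self-contained illustration of the reported swap, with checkCandidate stood in by a plain subsequence extractor (the real method does distance calculations; this stub only shows the argument-order effect):

```java
// checkCandidate expects (series, start, length, dimension). If the caller
// passes the length in the start slot and the position in the length slot,
// the extracted subsequence can end up empty, matching the reported bug.
public class ArgOrderDemo {

    // Stand-in for checkCandidate: extracts series[start .. start+length).
    static double[] checkCandidate(double[] series, int start, int length,
                                   int dimension) {
        double[] sub = new double[length];
        System.arraycopy(series, start, sub, 0, length);
        return sub;
    }

    public static void main(String[] args) {
        double[] series = {5, 6, 7, 8, 9};
        int position = 2, length = 3;
        // correct order: start, then length
        double[] ok = checkCandidate(series, position, length, 0);
        // swapped order: a position of 0 landing in the length slot
        // produces a shapelet of length 0
        double[] broken = checkCandidate(series, length, 0, 0);
        System.out.println(ok.length + " " + broken.length);
    }
}
```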
seems sensible, just need to double check it is not already in AbstractClassifier or any of our existing interfaces. IIRC Weka has Randomizable, but I think it's cleaner to just use our own
Utilities.extractTimeSeries includes the class value in the extraction of the time series. This should be dropped.
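A sketch of the suggested fix, assuming (as is standard in this toolkit) that the class value is the last attribute of the instance. The class name here is illustrative, not the actual Utilities implementation:

```java
import java.util.Arrays;

// When extracting the raw series from an instance whose last attribute is
// the class value, copy only the first n-1 values so the label is dropped.
public class ExtractSeries {

    static double[] extractTimeSeries(double[] instanceValues) {
        // drop the trailing class value
        return Arrays.copyOf(instanceValues, instanceValues.length - 1);
    }

    public static void main(String[] args) {
        double[] raw = {1.0, 2.0, 3.0, 0.0}; // final value is the class label
        System.out.println(extractTimeSeries(raw).length);
    }
}
```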
Execution of timeseriesweka.classifiers.hybrids.FlatCote.main() in master (ef5fd10) produces an exception with the following details:
Exception in thread "main" java.lang.IllegalArgumentException: Src and Dest differ in # of attributes: 68 != 1
at weka.core.RelationalLocator.copyRelationalValues(RelationalLocator.java:87)
at weka.filters.Filter.copyValues(Filter.java:371)
at weka.filters.Filter.push(Filter.java:288)
at weka.filters.SimpleBatchFilter.batchFinished(SimpleBatchFilter.java:257)
at weka.filters.Filter.useFilter(Filter.java:682)
at weka_extras.classifiers.ensembles.AbstractEnsemble.buildClassifier(AbstractEnsemble.java:890)
at timeseriesweka.classifiers.hybrids.FlatCote.buildClassifier(FlatCote.java:104)
at timeseriesweka.classifiers.hybrids.FlatCote.main(FlatCote.java:185)
The problem is caused by applying the ShapeletTransform filter in weka_extras.classifiers.ensembles.AbstractEnsemble.buildClassifier (AbstractEnsemble.java:890): it seems that the use of Filter.useFilter causes the problem for ShapeletTransform.
If I call the ShapeletTransform.process(data) directly the exception is not thrown, tested via change of AbstractEnsemble:
Line 888: else {
    transform.setInputFormat(data);
    this.trainInsts = Filter.useFilter(data, transform);
}
into:
Line 888: else { // changed by davcem: there is some filter issue when using useFilter with ShapeletTransform
    if (this.transform instanceof ShapeletTransform) {
        this.trainInsts = ((ShapeletTransform) transform).process(data);
    } else {
        transform.setInputFormat(data);
        this.trainInsts = Filter.useFilter(data, transform);
    }
}
This is just a quick hack, but I think that the "new way" of useFilter causes issues with the ShapeletTransform Filter.
I propose we keep this off the repo for now, until it is better documented and structured. Thoughts?
A discussion point/issue arising from Travis: should we switch to IntelliJ and update from Java 8? It's an open question we have gone over, but I'll leave it here.
the long term question is, do we want to continue using the Filter mechanism in weka for transformers.
Pros for Filter:
any other thoughts? Whatever we do, they all need capabilities checking
one issue is how to deal with parameters that are dependent on the data; for example, the number of features in TSF is sqrt(m).
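One common way to handle this is to store a sentinel value meaning "resolve from the data" and compute the real value at build time. A minimal, hypothetical sketch (names illustrative, not the actual TSF code):

```java
// Sketch of resolving a data-dependent parameter at build time: the number
// of features per tree in TSF defaults to sqrt(m) for series length m.
public class DataDependentParam {

    static int numFeatures = -1; // -1 = "resolve from the data at build time"

    static int resolveNumFeatures(int seriesLength) {
        return numFeatures > 0
             ? numFeatures                      // explicitly set by the user
             : (int) Math.sqrt(seriesLength);   // default: sqrt(m)
    }

    public static void main(String[] args) {
        System.out.println(resolveNumFeatures(100)); // sqrt(100) = 10
    }
}
```

For tuning, the same idea extends to a parameter space expressed as functions of the data rather than fixed values.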
to be included in George's tidy up
to be moved to experiments
Another thing we should do from a documentation/public usability POV over time is to add to all our own classifiers (where appropriate) the technical information type stuff.
The interface TechnicalInformationHandler defines getTechnicalInformation(), which returns a TechnicalInformation object that essentially defines the BibTeX for a paper reference for the classifier, e.g. J48:
/**
 * Returns an instance of a TechnicalInformation object, containing
 * detailed information about the technical background of this class,
 * e.g., paper reference or book this class is based on.
 *
 * @return the technical information about this class
 */
public TechnicalInformation getTechnicalInformation() {
    TechnicalInformation result = new TechnicalInformation(Type.BOOK);
    result.setValue(Field.AUTHOR, "Ross Quinlan");
    result.setValue(Field.YEAR, "1993");
    result.setValue(Field.TITLE, "C4.5: Programs for Machine Learning");
    result.setValue(Field.PUBLISHER, "Morgan Kaufmann Publishers");
    result.setValue(Field.ADDRESS, "San Mateo, CA");
    return result;
}
Desired characteristics
these will benefit from the shapelet transform enhancements on other Issues
so, almost certainly originating with me, some classifiers implement this interface but all they do is a CV. This is high-maintenance when Experiments could do it anyway. So I think that unless a classifier has the capacity to do something other than CV, it should not implement this interface.
For those that do, I think we should split the time to estimate error and the final build time in ClassifierResults, and store them both.
thoughts?
this is a perennial problem we need to address. How do we set k? I know from recent experiments that it matters: using the best 100 is significantly worse than using the number generated (which seemed variable!). In ShapeletTransformClassifier I manually set it in createrTransformData. This is brittle and unsatisfactory. I will try different default values; a priori, I think setting it to 500 should be sufficient.
this needs updating. Desired characteristics:
volunteers?
this interface has three methods
public interface HiveCoteModule {
    public double getEnsembleCvAcc();
    public double[] getEnsembleCvPreds();
    public String getParameters();
}
the first two are better in TrainAccuracyEstimate (although I would change the names to getTrainAcc and getTrainPreds). The third already exists in SaveParameterInfo.
any thoughts?
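A purely illustrative sketch of what the suggested consolidation might look like, using the proposed names; the dummy implementation and its returned values are hypothetical:

```java
// Sketch of the first two HiveCoteModule methods folded into
// TrainAccuracyEstimate under the proposed names.
public class TrainEstimateSketch {

    interface TrainAccuracyEstimate {
        double getTrainAcc();
        double[] getTrainPreds();
    }

    // Hypothetical classifier returning canned estimates for illustration.
    static class DummyClassifier implements TrainAccuracyEstimate {
        public double getTrainAcc() { return 0.9; }
        public double[] getTrainPreds() { return new double[]{0, 1, 1}; }
    }

    public static void main(String[] args) {
        TrainAccuracyEstimate est = new DummyClassifier();
        System.out.println(est.getTrainAcc());
    }
}
```

getParameters would then live solely in SaveParameterInfo (or whatever replaces it), removing the overlap.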
We need some form of branch protection for the dev branch. @MatthewMiddlehurst accidentally deleted the dev branch earlier this week and managed to recover it (phew!). To prevent this from happening again, we need some kind of protection over the important branches. Master is already protected.
On github, to protect a branch you have to make a branch protection rule. I made a rule for master which is why you can't push directly to it. We have two options for dev:
We at the very least need option 1; I would like some feedback on option 2 :)
I need an adaptation to Experiments/ClassifierLists. When contracted, I want the file structure to reflect that: if I build ShapeletTransformClassifier for 100 hours, results should be in ShapeletTransformClassifier100.
hi, in addition to the precalculated timing approach we use, I would like a wrapper that measures actual time, so that if the calculation is off, the transform stops looking when the contract is exceeded, or continues looking if time is left.
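The requested wrapper amounts to checking wall-clock time around the candidate-evaluation loop. A minimal, hypothetical sketch (the search body is stubbed; the real transform would evaluate shapelet candidates there):

```java
// Sketch of a wall-clock contract: keep evaluating candidates while time
// remains, stop once the contract is exceeded.
public class ContractedSearch {

    static int search(long contractNanos) {
        long start = System.nanoTime();
        int evaluated = 0;
        while (System.nanoTime() - start < contractNanos) {
            evaluated++; // evaluate one more candidate here
            if (evaluated >= 1_000_000) break; // safety cap for this demo
        }
        return evaluated;
    }

    public static void main(String[] args) {
        // with a 5ms contract, at least one candidate gets evaluated
        System.out.println(search(5_000_000L) > 0);
    }
}
```

In the real transform the time check would wrap each candidate evaluation, so an optimistic precalculated estimate no longer causes the contract to be blown.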
we currently have two packages for distance functions:
timeseriesweka.elastic_distance_measures and
timeseriesweka.filters.shapelet_transforms.distance_functions
it would be nice to tidy this up somehow. Any thoughts? We also have the wrapped FastWWS with its own package structure.
distances are not classifier-only, so I think a single package is good. Weka puts them in core, but I don't like that. I think just a package distances? Maybe subpackages elastic_distance and window_distance? And then, do we continue to implement Weka's DistanceFunction?
hi, BOSS currently creates the train files internally, but the format is not consistent with the others. It currently looks like
Beef,BOSSEnsemble,train
FullyNestedEstimates,true,numSeries,1,numclassifiers0,12,windowSize,10,wordLength,10,alphabetSize,4,norm,true,windowSize,22,wordLength,8,alphabetSize,4,norm,true,windowSize,25,wordLength,12,alphabetSize,4,norm,true,windowSize,400,wordLength,8,alphabetSize,4,norm,true,windowSize,445,wordLength,14,alphabetSize,4,norm,true,windowSize,16,wordLength,10,alphabetSize,4,norm,false,windowSize,22,wordLength,16,alphabetSize,4,norm,false,windowSize,25,wordLength,14,alphabetSize,4,norm,false,windowSize,31,wordLength,10,alphabetSize,4,norm,false,windowSize,55,wordLength,16,alphabetSize,4,norm,false,windowSize,379,wordLength,16,alphabetSize,4,norm,false,windowSize,445,wordLength,16,alphabetSize,4,norm,false
0.6
0.0,0.0
but should look like
Beef,RISE,train,0,NANOSECONDS,PREDICTIONS, Generated by Experiments.java
FullyNestedEstimates,true,numClassifiers,500,MinInterval,16,Filter0,PowerSpectrum,Filter1,ACF
0.7333333333333333,116582716091,125647425168,-1,-1,5
0,0,,0.498,0.1,0.15,0.118,0.134,,3726164889,,
0,0,,0.706,0.08,0.052,0.084,0.078,,4325897589,,
0,0,,0.696,0.106,0.044,0.078,0.076,,4017730289,,
I have not looked at RBOSS though
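For reference, the expected first header line can be assembled as below. This is an illustrative sketch matching the Experiments.java-style example shown above, not the actual file-writing code:

```java
// Build the first header line of a results file in the standard
// Experiments.java format: dataset,classifier,split,fold,timeUnit,...
public class TrainFileHeader {

    static String header(String dataset, String classifier, String split,
                         int fold, String timeUnit) {
        return dataset + "," + classifier + "," + split + "," + fold + ","
             + timeUnit + ",PREDICTIONS, Generated by Experiments.java";
    }

    public static void main(String[] args) {
        System.out.println(header("Beef", "BOSSEnsemble", "train", 0,
                                  "NANOSECONDS"));
    }
}
```

Lines two and three (parameter string, then summary stats) and the per-instance prediction lines would follow, as in the RISE example.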
Based on a small discussion we had in the office, do we have a standard practice for tie-breaking? Do all classifiers use the same method and what method should be used? i.e. first item, random selection, weighted random selection using class distribution.
Mostly in the context of class selection for classification but could be extended to other situations with ties.
Mainly for discussion, if this is a non-issue feel free to close.
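For discussion purposes, here is a sketch of two of the policies mentioned: deterministic first-index and seeded random selection among the tied maxima. Illustrative only; not taken from any existing classifier in the toolkit:

```java
import java.util.Random;

// Two tie-breaking policies for choosing the predicted class from a
// probability distribution.
public class TieBreaking {

    // Deterministic: first index achieving the maximum.
    static int firstMax(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++)
            if (dist[i] > dist[best]) best = i;
        return best;
    }

    // Random among the tied maxima; seeded rng keeps results reproducible.
    static int randomAmongMax(double[] dist, Random rng) {
        double max = dist[firstMax(dist)];
        int count = 0;
        for (double d : dist) if (d == max) count++;
        int pick = rng.nextInt(count);
        for (int i = 0; i < dist.length; i++)
            if (dist[i] == max && pick-- == 0) return i;
        return -1; // unreachable
    }

    public static void main(String[] args) {
        double[] dist = {0.4, 0.4, 0.2};
        System.out.println(firstMax(dist)); // always class 0
        int r = randomAmongMax(dist, new Random(0));
        System.out.println(r == 0 || r == 1); // one of the tied classes
    }
}
```

Whichever policy is chosen, seeding it from the classifier's random seed matters for reproducible experiments.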
bit more housekeeping of my ancient code. DTW, an important benchmark etc. I've moved the variants into classifiers.distance_based. I know a lot of it is old, because it has Eclipse indentation from 10 years ago. We have
SlowDTW_1NN: Standard naive version with full CV, seems fully integrated, but needs updating for new experiments etc
FastDTW_1NN: my own (unpublished) DTW speed ups. Also integrated as slow.
FastDTW: Wrapper for the SDM fast version, not integrated, and may have a memory bug.
DTW_kNN: I have no idea; I think I can remove this. I think it's just a more configurable SlowDTW with added early abandon.
So what to do?
any thoughts? Other classifiers here are called
DD_DTW
DTD_C
ElasticEnsemble
NN_CID
ProximityForest
Numerosity reduction in BOP class BagOfPatternsFilter, method buildBag possibly not functioning correctly.
prevPattern is initialised to an array of -1 values and is used to compare against the current pattern, which is removed if identical.
It appears that the values in this array are never updated; posting this issue to look into it after the sktime sprint.
double check this and if true, remove the redundant feature
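If the bug is confirmed and the feature kept, the fix is simply to update prevPattern on every iteration. A self-contained sketch of numerosity reduction over string patterns, with the update applied (illustrative, not the BagOfPatternsFilter code itself):

```java
import java.util.*;

// Numerosity reduction: drop a pattern only when identical to the one
// immediately before it. Without updating 'prev' each iteration (the
// reported bug), nothing is ever removed.
public class NumerosityReduction {

    static List<String> reduce(List<String> patterns) {
        List<String> kept = new ArrayList<>();
        String prev = null;
        for (String p : patterns) {
            if (!p.equals(prev)) kept.add(p);
            prev = p; // the fix: update prev every iteration
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(reduce(Arrays.asList("ab", "ab", "ba", "ab")));
    }
}
```

Note only consecutive repeats are removed; the later "ab" is kept, which is the intended behaviour of numerosity reduction.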
@TonyBagnall and I need to switch the repo ownership to him, along with the travis integration
hi, James has this to be backward compatible with my hack for reading CAWPE from file. I propose we remove classic and my global strings, and just have the switch inside setClassifier(ExperimentalArguments a)
any reason not to do this that anyone knows of? The only downside is that you have to assume the results files for CAWPEFROMFILE will all be in the same directory, which will have to be where CAWPEFROMFILE writes to (if you use setClassifier to run this). I have no problem with that.
I checked out master (ef5fd10) yesterday and noticed compile problems, since the package declaration for all files in statistics.transformations is package transformations; instead of package statistics.transformations;
src/main/java/statistics/tests/ResidualTests.java also imports the transformations package.
test for slack
I'd like a check to disallow exactly identical shapelets. Generally this won't be an issue; it arose because flat lines are discriminatory in Yoga, so the shapelet 0,0,0 appears multiple times, wasting shapelet allocation.
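One simple way to implement the check: keep a set of already-accepted value sequences and skip exact repeats. A hypothetical sketch over raw double arrays (the real check would live in the shapelet search and compare Shapelet content):

```java
import java.util.*;

// Drop shapelets whose content is an exact duplicate of one already kept,
// e.g. the repeated flat 0,0,0 shapelet from Yoga.
public class ShapeletDedup {

    static List<double[]> dedup(List<double[]> shapelets) {
        Set<String> seen = new HashSet<>();
        List<double[]> kept = new ArrayList<>();
        for (double[] s : shapelets)
            if (seen.add(Arrays.toString(s))) // false if content already seen
                kept.add(s);
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> in = Arrays.asList(
            new double[]{0, 0, 0},
            new double[]{1, 2, 3},
            new double[]{0, 0, 0});
        System.out.println(dedup(in).size());
    }
}
```

Arrays.toString is used as a cheap content key here; comparing Arrays.equals pairwise or hashing the values directly would work equally well.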
this directory needs consolidation and tidying up. In particular, we have
ensembles:
stackers
weightedvoters
then loads of other one-offs. We should maybe model the Weka file structure.
Changes to core repo