yinlou / mltk

Machine Learning Tool Kit

License: BSD 3-Clause "New" or "Revised" License

Java 100.00%

java machine-learning

mltk's Issues

Regression trees cause GC churn

The regression tree methods in MLTK allocate and discard a very large number of objects, which causes GC churn and puts heavy strain on the JVM. A great deal of time is spent in garbage collection, and the impact is even worse if you are trying to run several regressions in parallel, since the JVM doesn't do a good job of concurrent garbage collection. (The scalability issue alone will probably mean I can't use MLTK for my task.)

An object instance recycling scheme would help immensely with this problem.
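As a rough illustration of what such a recycling scheme could look like, here is a minimal single-threaded object pool. SimplePool and its methods are hypothetical names made up for this sketch; nothing here is part of MLTK, and the double[] buffers only stand in for whatever scratch objects the tree code allocates.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical helper, not part of MLTK: a minimal single-threaded object pool.
final class SimplePool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;

    SimplePool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse a previously released instance if one exists, otherwise allocate.
    T acquire() {
        T obj = free.pollFirst();
        return obj != null ? obj : factory.get();
    }

    // Hand an instance back to the pool; the caller must not use it afterwards.
    void release(T obj) {
        free.addFirst(obj);
    }

    // Number of instances currently waiting to be reused.
    int idleCount() {
        return free.size();
    }

    public static void main(String[] args) {
        // double[] buffers stand in for per-split scratch arrays (an assumption).
        SimplePool<double[]> pool = new SimplePool<>(() -> new double[1024]);
        double[] a = pool.acquire();
        pool.release(a);
        double[] b = pool.acquire();
        System.out.println(a == b); // prints "true": the same array was reused
    }
}
```

The pool is not thread-safe. For parallel regressions, one pool per worker thread (for example via ThreadLocal) would avoid contention, at the cost of holding more idle objects.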

Is a normalization step needed in feature preparation?

If the original features are on different ranges rather than [-1, 1], do we need to normalize/calibrate their values before running mltk.predictor.gam.GAMLearner? Or will GAMLearner take care of it?

Another question: for mltk.predictor.evaluation.Evaluator, where can we find its metric output? I don't see the metric numbers displayed in the console output or saved in any output file.
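Whether GAMLearner needs pre-scaled features is the open question here. If scaling is wanted before the data is written out, a min-max rescaling of one feature column onto [-1, 1] could be sketched as follows. MinMaxScaler and scaleToUnitRange are hypothetical names for this example and are not MLTK APIs.

```java
import java.util.Arrays;

// Hypothetical helper, not part of MLTK: min-max scaling of one feature column.
final class MinMaxScaler {
    // Linearly maps the column's [min, max] onto [-1, 1].
    // A constant column maps to all zeros to avoid dividing by zero.
    static double[] scaleToUnitRange(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            scaled[i] = range == 0.0 ? 0.0 : 2.0 * (column[i] - min) / range - 1.0;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] ages = {20.0, 35.0, 50.0};
        System.out.println(Arrays.toString(scaleToUnitRange(ages))); // [-1.0, 0.0, 1.0]
    }
}
```

The min and max should be computed on the training set only and then reused for validation and test data, so that all splits share the same mapping.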

"mvn clean package" doesn't work

I'm trying to build on OS X 10.14.4.

Downloaded Oracle's java 12 with brew install cask java
Downloaded maven with brew install maven
Ran mvn clean package as directed here.

I get this error:

1 error
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  9.082 s
[INFO] Finished at: 2019-04-23T21:20:38-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:jar (attach-javadocs) on project mltk: MavenReportException: Error while generating Javadoc: 
[ERROR] Exit code: 1 - javadoc: error - The code being documented uses modules but the packages defined in http://docs.oracle.com/javase/8/docs/api/ are in the unnamed module.
[ERROR] 
[ERROR] Command line was: /Library/Java/JavaVirtualMachines/openjdk-12.0.1.jdk/Contents/Home/bin/javadoc @options @packages
[ERROR] 
[ERROR] Refer to the generated Javadoc files in '/Users/dan/work/mltk/target/apidocs' dir.
[ERROR] 
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Might just be docs?

Could you provide some working data sets?

This is a very interesting package for ML in Java.
I assume some working data sets were used during development.
Could you provide some of them as demos to help users get started?

Support for data already in program

Does mltk support data structures that are already present in the program at runtime? I'm loading and manipulating data after fetching it from a remote database, and would love to pump my matrices directly into mltk without writing to disk.

Need for programmatic setup of datasets

Is there any way to build datasets in memory, using the API, or do I need to write out a file to disk, and read it back in?

I tried creating a dataset using the API, but the methods and constructors of Attribute are not visible, so I can't create a List<Attribute>, so I can't create an Instances object, so I can't create cross-validation folds.

GAM plots with nominal interaction terms

It seems to me that there is a bug in Visualizer when one of the interaction terms is not of type BinnedAttribute, because at line 251 there is a cast Bins bins1 = ((BinnedAttribute) f1).getBins(); without a preceding type test. This is odd because earlier checks handle NominalAttribute, but the rest of the code seems to work only on BinnedAttribute (in particular, it needs boundaries, which are only defined for bins).
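The kind of type test the issue asks for can be sketched with stand-in classes. Attr, BinnedAttr, NominalAttr and boundariesOf below are hypothetical and are not the MLTK types. The sketch only shows the guard pattern: check with instanceof before casting, and fail with a descriptive message otherwise.

```java
// Stand-in types for illustration only; they are NOT the MLTK classes.
final class SafeCastSketch {
    interface Attr {
        String name();
    }

    static final class BinnedAttr implements Attr {
        private final String name;
        private final double[] boundaries;

        BinnedAttr(String name, double[] boundaries) {
            this.name = name;
            this.boundaries = boundaries;
        }

        public String name() { return name; }

        double[] boundaries() { return boundaries; }
    }

    static final class NominalAttr implements Attr {
        private final String name;

        NominalAttr(String name) { this.name = name; }

        public String name() { return name; }
    }

    // Returns bin boundaries, or fails with a clear message
    // instead of an unexplained ClassCastException.
    static double[] boundariesOf(Attr f) {
        if (f instanceof BinnedAttr) {
            return ((BinnedAttr) f).boundaries();
        }
        throw new IllegalArgumentException(
                "Attribute '" + f.name() + "' is not binned; no boundaries available");
    }

    public static void main(String[] args) {
        System.out.println(boundariesOf(new BinnedAttr("age", new double[] {30.0, 60.0})).length);
        try {
            boundariesOf(new NominalAttr("color"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A real fix would more likely add a separate plotting branch for nominal interaction terms rather than throw, but that depends on how Visualizer is meant to draw them.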

Add MLTK to Maven Central

I'm using MLTK from Clojure and deployment/installation of code dependent on it would be much nicer if it were in Maven Central.

ElasticNet results inconsistent

If I run an elastic net learner a dozen times in a row, I get nine different results. I've confirmed that my input data is identical on each round (the hashcodes on my data strings and attribute strings are identical each round). Is there anything variable or non-deterministic about the elastic net process? Also, all of my work is done using BigDecimal, so any floating point issues would be within the regression. Is that what's happening here?
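One general mechanism that can produce run-to-run differences, offered here only as a possibility and not as a diagnosis of MLTK, is that floating-point addition is not associative. If any step accumulates doubles in an order that varies between runs (for example, iteration over a hash-based collection or a parallel reduction), the result can differ in the last bits even for identical inputs. The class and method names below are made up for this sketch.

```java
// Illustration only: the same three doubles summed in two different orders.
final class SummationOrder {
    static double sumLeftToRight(double[] xs) {
        double s = 0.0;
        for (int i = 0; i < xs.length; i++) {
            s += xs[i];
        }
        return s;
    }

    static double sumRightToLeft(double[] xs) {
        double s = 0.0;
        for (int i = xs.length - 1; i >= 0; i--) {
            s += xs[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] xs = {0.1, 0.2, 0.3};
        System.out.println(sumLeftToRight(xs)); // 0.6000000000000001
        System.out.println(sumRightToLeft(xs)); // 0.6
    }
}
```

Preparing the inputs with BigDecimal does not protect against this once the values are converted to double inside the learner.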

Documentation is missing many important details

I have been trying to use MLTK for a regression problem, and I found there are many important details missing from the documentation:

There is no info on how to set up datasets in memory programmatically, rather than reading from file (see #14).
On the page https://github.com/yinlou/mltk/wiki/Basics , there is no explanation of "(class)". I eventually figured out that it's how you set the target variable. That doesn't make sense for regression, though, because the target variable is not a class. It's also made more confusing because the section "Example attribute file" has the "(class)" attribute last, but under "Sparse input format", you list the target first, implying that the position is significant (which it is not).
There is no explanation that lines are simply ignored if the number of attributes doesn't match the number of columns. This is more confusing because, comparing the sections "Example attribute file", "Example data file" and "Sparse Input Format", it's not clear whether the target attribute even needs to be set in the attribute file, or whether its discrete/continuous nature is inferred based on whether you're doing classification or regression (and whether it is inferred that the first column is always the target). If you have the wrong number of attributes, no instances are read.
If you don't specify the target attribute, it is set to NaN, and the regression will fail unless you go through all the Instances and set a target value on each one. I couldn't figure out why I was getting NaN until I read through the code.
You do say that the attribute file parameter to InstancesReader.read is "optional", but even the JavaDoc doesn't explain that you need to set this to null for this purpose. The way that attributes are automatically inferred for sparse and dense cases if you do set this to null is not explained. (For example, the target is always unset if you don't specify an attributes file, so you'll always get NaN for the target values unless you manually set the target, as described above.)
There is no example explaining the usage with a dense datafile.

That's as far as I have gotten so far... I finally got my first regression results, but it took me several hours to figure out how to use this properly... hopefully this feedback helps!

Typo in DoublePairComparator

My guess is that it does not create a bug, but there is a typo in the function compare in DoublePairComparator in mltk.predictor.evaluation.AUC.
You wrote:

int cmp = Double.compare(o1.v1, o2.v1);
if (cmp == 0) {
    cmp = Double.compare(o2.v2, o2.v2);
}

but the second comparison should be Double.compare(o1.v2, o2.v2).
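A self-contained version of the corrected comparison is sketched below. DoublePair here is a stand-in written for this example with the field names v1 and v2 taken from the snippet above; it is not the class from MLTK. With the typo, the tie-break always returns 0, so pairs with equal v1 compare as equal regardless of v2.

```java
import java.util.Comparator;

final class DoublePairOrdering {
    // Stand-in for the pair type in the issue; field names follow the quoted snippet.
    static final class DoublePair {
        final double v1;
        final double v2;

        DoublePair(double v1, double v2) {
            this.v1 = v1;
            this.v2 = v2;
        }
    }

    // Orders by v1 first, then breaks ties on v2 using BOTH operands.
    static final Comparator<DoublePair> BY_V1_THEN_V2 = (o1, o2) -> {
        int cmp = Double.compare(o1.v1, o2.v1);
        if (cmp == 0) {
            cmp = Double.compare(o1.v2, o2.v2);
        }
        return cmp;
    };

    public static void main(String[] args) {
        DoublePair a = new DoublePair(1.0, 2.0);
        DoublePair b = new DoublePair(1.0, 3.0);
        System.out.println(BY_V1_THEN_V2.compare(a, b) < 0); // true: tie on v1, a.v2 < b.v2
    }
}
```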

How to run GA2M with FAST?

I've been reading the docs trying to create an example of classification with GA2MLearner using command-line tools.

I checked in https://github.com/dfrankow/mltk/tree/master/examples with train_ga2m.sh. You should be able to check it out and run it. If I can get it to work, I'm happy to pass it back as an example, as requested in #17.

Several questions:

How do we generate a sensible pairwise terms file to pass to GA2MLearner instead of including all? I think that would possibly include the FAST algorithm, but I don't know how to use it.
Why does Evaluator not have any output?
Can I use the command line to run predictions (in this case classification output) on the test set?

JAR file

I am new to Java. Could you provide a prebuilt .jar file, please?

Residuals not saved when building GAM

On the Intelligible Models webpage you say that we should pass the option -R cal_housing.residual when building the GAM model (step 3), so that it can be used to detect the interactions in step 4.

However, this version of the code does not handle that option and does not save the residuals. Can you confirm that the residuals you mention are those stored in rTrain in your code?
I've fixed it and can submit a pull request if you want.

By the way, the option -T is not handled either, so the score on the test set is not computed.

Thanks a lot for sharing your work!