mdabros / sharplearning

Machine learning for C# .Net

License: MIT License

C# 100.00%

machine-learning csharp decision-trees adaboost gradient-boosting-machine random-forest neural-nets deep-learning ensemble-learning cross-validation

sharplearning's Introduction

SharpLearning

SharpLearning is an open-source machine learning library for C# .Net. The goal of SharpLearning is to provide .Net developers with easy access to machine learning algorithms and models.

Currently the main focus is supervised learning for classification and regression, while also providing the necessary tools for optimizing and validating the trained models.

SharpLearning provides a simple high-level interface for machine learning algorithms.
In SharpLearning a machine learning algorithm is referred to as a Learner, and a machine learning model is referred to as a PredictorModel. An example of usage can be seen below:

// Create a random forest learner for classification with 100 trees
var learner = new ClassificationRandomForestLearner(trees: 100);

// learn the model
var model = learner.Learn(observations, targets);

// use the model for predicting new observations
var predictions = model.Predict(testObservations);

// save the model for use with another application
model.Save(() => new StreamWriter("randomforest.xml"));

All machine learning algorithms and models implement the same interface for easy replacement.

Currently SharpLearning supports the following machine learning algorithms and models:

DecisionTrees
Adaboost (trees)
GradientBoost (trees)
RandomForest
ExtraTrees
NeuralNets (layers for fully connected and convolutional nets)
Ensemble Learning

All the machine learning algorithms have sensible default hyperparameters for easy usage. However, several optimization methods are available for hyperparameter tuning:

GridSearch
RandomSearch
ParticleSwarm
GlobalizedBoundedNelderMead
Hyperband
BayesianOptimization

License

SharpLearning is covered under the terms of the MIT license. You may therefore link to it and use it in both open-source and proprietary software projects.

Documentation

SharpLearning contains XML documentation to help guide the user while using the library.

Code examples and more information about how to use SharpLearning can be found in SharpLearning.Examples.

The wiki also contains a set of guides on how to get started.

Installation

The recommended way to get SharpLearning is to use NuGet. The packages are provided and maintained in the public NuGet Gallery. More information can be found in the getting started guide on the wiki.

Learner and model packages:

SharpLearning.DecisionTrees - Provides learning algorithms and models for DecisionTree regression and classification.
SharpLearning.AdaBoost - Provides learning algorithms and models for AdaBoost regression and classification.
SharpLearning.RandomForest - Provides learning algorithms and models for RandomForest and ExtraTrees regression and classification.
SharpLearning.GradientBoost - Provides learning algorithms and models for GradientBoost regression and classification.
SharpLearning.Neural - Provides learning algorithms and models for neural net regression and classification. Layers available for fully connected and convolutional nets.
SharpLearning.XGBoost - Provides learning algorithms and models for regression and classification using the XGBoost library. CPU and GPU learning supported. This package is x64 only.
SharpLearning.Ensemble - Provides ensemble learning for regression and classification. Makes it possible to combine the other learners/models from SharpLearning.
SharpLearning.Common.Interfaces - Provides common interfaces for SharpLearning.

Validation and model selection packages:

SharpLearning.CrossValidation - Provides cross-validation, training/test set samplers and learning curves for SharpLearning.
SharpLearning.Metrics - Provides classification, regression, impurity and ranking metrics.
SharpLearning.Optimization - Provides optimization algorithms for hyperparameter tuning.

Container/IO packages:

SharpLearning.Containers - Provides containers and base extension methods for SharpLearning.
SharpLearning.InputOutput - Provides csv parsing and serialization for SharpLearning.
SharpLearning.FeatureTransformations - Provides CsvRow transforms like missing value replacement and matrix transforms like MinMaxNormalization.

Contributing

Contributions are welcome in the following areas:

Add new issues with bug descriptions or feature suggestions.
Add more examples to SharpLearning.Examples.
Solve existing issues by forking SharpLearning and creating a pull request.

When contributing, please follow the contribution guide.


sharplearning's Issues

NeuralNet: Add support for sample weights

Add support for sample weights to the NeuralNet learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.

This is currently on hold until #9 has been decided.

Optimization: Add option for how to sample hyper parameters

Currently, all optimizers in SharpLearning.Optimization use random uniform sampling for sampling hyper parameters from the provided min/max boundaries. This is not always optimal, for instance when dealing with a hyper parameter like learning rate that can span a large range of values, such as 0.0001 to 1.0. Random uniform sampling in this case concentrates the samples in a small part of the range: roughly 90% of the draws fall between 0.1 and 1.0. Here it would be much better to sample uniformly in log space. Hence, it should be possible to select which space to sample from for each hyper parameter when setting up an optimizer.

This should include at least random uniform sampling from the following spaces:

Linear (current method)
Logarithmic
Exponential

At the same time, setup of the hyper-parameter ranges could be changed from setting up an array of arrays to using a type to guide the user better:

Current method

var parameters = new double[][]
{
   new double[] { 80, 300 }, // iterations (min: 80, max: 300)
   new double[] { 0.02, 0.2 }, // learning rate (min: 0.02, max: 0.2)
};

Proposed method

var parameters = new OptimizerParameter[]
{
   new OptimizerParameter(min: 80, max: 300,  SamplingMethod.Linear), // iterations (min: 80, max: 300)
   new OptimizerParameter(min: 0.02, max: 0.2, SamplingMethod.Logarithmic), // learning rate (min: 0.02, max: 0.2)
};
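The difference between linear and logarithmic sampling can be sketched as follows. This is only an illustration under stated assumptions: `ParameterSampling`, `SampleLinear` and `SampleLogarithmic` are hypothetical names, not part of SharpLearning.Optimization, and the logarithmic variant assumes a strictly positive minimum.

```csharp
using System;

// Hypothetical helper, not part of SharpLearning.Optimization.
public static class ParameterSampling
{
    // Linear: uniform over [min, max).
    public static double SampleLinear(double min, double max, Random random)
    {
        return min + random.NextDouble() * (max - min);
    }

    // Logarithmic: uniform over [log10(min), log10(max)), mapped back with 10^x.
    // Requires 0 < min <= max.
    public static double SampleLogarithmic(double min, double max, Random random)
    {
        if (min <= 0.0 || max < min)
            throw new ArgumentException("Requires 0 < min <= max.");

        var logMin = Math.Log10(min);
        var logMax = Math.Log10(max);
        var exponent = logMin + random.NextDouble() * (logMax - logMin);
        return Math.Pow(10.0, exponent);
    }
}
```

For a range of 0.0001 to 1.0, the logarithmic variant spends about a quarter of its draws in each decade (0.0001 to 0.001, 0.001 to 0.01, and so on), whereas the linear variant rarely produces values below 0.1.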

AdaBoostLearners: Add subsample ratio pr. tree as hyper parameter

RandomForest and GradientBoost learners have a hyper parameter, subSampleRatio, which controls how many training samples are forwarded to each tree in the ensemble. When subsampling is active, samples from the training data will be drawn with replacement, leading to more variation among the trees in the ensemble. This parameter should also be introduced in the AdaBoostLearners (ClassificationAdaBoostLearner and RegressionAdaBoostLearner), to have more possibilities for regularizing this type of model.

In the RandomForest implementation of this feature, there is sampling with replacement even if subSampleRatio=1.0; this is part of the algorithm's design. However, for the AdaBoost implementation of this feature, if subsampling is off (subSampleRatio=1.0), no sampling with replacement should be introduced, and the whole training set should be considered in each tree of the ensemble. This will result in the 'classic' AdaBoost algorithm if subSampleRatio is set to 1.0.

Besides the difference when subSampleRatio=1.0, the AdaBoost implementation should be very similar to the RandomForest implementation, which can be found here RandomForest.
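The intended behaviour for AdaBoost could look roughly like the sketch below. `SubSampling.SelectIndices` is a hypothetical helper written for illustration, not the existing RandomForest code. It returns every index exactly once when subSampleRatio is 1.0 and draws with replacement otherwise.

```csharp
using System;
using System.Linq;

// Hypothetical helper illustrating the proposed behaviour; not the SharpLearning implementation.
public static class SubSampling
{
    // Returns the training indices a single tree in the ensemble should see.
    public static int[] SelectIndices(int sampleCount, double subSampleRatio, Random random)
    {
        if (sampleCount <= 0)
            throw new ArgumentException("sampleCount must be positive.");
        if (subSampleRatio <= 0.0 || subSampleRatio > 1.0)
            throw new ArgumentException("subSampleRatio must be in (0, 1].");

        // Subsampling off: use the whole training set, each index exactly once.
        if (subSampleRatio == 1.0)
            return Enumerable.Range(0, sampleCount).ToArray();

        // Subsampling on: draw the requested fraction with replacement.
        var count = Math.Max(1, (int)(sampleCount * subSampleRatio));
        var indices = new int[count];
        for (int i = 0; i < count; i++)
            indices[i] = random.Next(sampleCount);

        return indices;
    }
}
```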

[Question] How would you use the model to make predictions on new data?

I have read all the examples and gone through the source code, but haven't been able to answer the question.

I have set up a data set, trained and tested the model, but now I would like to use the model to make predictions on new data. How would I achieve this?

Example:
Target value has 3 classifications: good, bad, average

New data comes in -> use trained/tested model to make a prediction on the target value. Also, would it be possible to get a probability/confidence of the prediction of the target value i.e. 25% good, 50% average, 25% bad.

.Net Core and .Net Standard support

Add .NET Core and .Net Standard support to make SharpLearning available on more platforms. Porting to .Net Standard involves the following tasks:

Retarget the projects' .NET Framework version to .NET Framework 4.6.2.
Determine the portability of the code using the API Portability Analyzer. This has been done, and only the GenericXmlDataContractSerializer from SharpLearning.InputOutput uses unsupported API calls.
Change the implementation of GenericXmlDataContractSerializer to conform to .NET Standard 2.0. This is possible with the available API calls; however, there are issues with serializing some of the Math.NET containers used in the NeuralNet models. This might be solved together with #9, since CNTK will most likely replace Math.NET in the SharpLearning.Neural project.
Change the project format to .NET Core.

After the porting process, the continuous integration on AppVeyor must be updated.

[Question] 2d - 3d output in neural networks

Is it possible?
And if it is possible, how do I train the network?

AdaBoost: Add support for sample weights

Add support for sample weights to the AdaBoost learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.

The DecisionTreeLearners, used by AdaBoost, already support sample weights, so implementing it only involves setting up the sample weights and forwarding the weights to the DecisionTreeLearner in each boosting iteration. The learners can be found here in the AdaBoost project: AdaBoost

The work is currently in progress in the branch adaboost-sample-weight-support.

How to vectorize text?

Hi, thanks for the great library!

My CSV has text in some of the columns. Some of them are categorical (e.g. month of the year) and some have free text (e.g., book title). Looks like SharpLearning.InputOutput.Csv.CsvRowExtensions.ToF64Matrix is trying to parse stringified numbers. What if my CSV consists of non-number values? Is there a recommended way or should I wire another lib to do TF-IDF/word2vec/char embedding/etc?

LearningCurves: Add support for weighted learners

Extend the ILearningCurvesCalculator interface to support the IWeightedIndexedLearner interface. This depends on #14 being completed.
Implement support for sample weights in LearningCurvesCalculator.

Most implementations of IOptimizer don't properly pass on the random seed to all internally used algorithms

Hi mdabros,

I love the Optimizer classes of SharpLearning; I use them heavily for hyperparameter tuning.
But I would like to use different random seeds, and parts of the Optimizer classes don't allow that.
Example:
If you look at the constructor of your BayesianOptimizer you can see that it doesn't pass on the "seed" parameter to all other classes that BayesianOptimizer creates instances of, sometimes it will just forward a hardcoded 42 instead.
I know, 42 is the answer, but I would prefer "seed" in this case... ;)
The other IOptimizer implementations have similar issues.
Would be nice if you could modify that some time...

Thank you!

Best regards
Florian

How to load data from SQL server table

All given samples contain CSV method only.

            #region Read data

            // Use StreamReader(filepath) when running from filesystem
            var parser = new CsvParser(() => new StringReader(Resources.winequality_white));
            var targetName = "quality";

            // read feature matrix (all columns different from the targetName)
            F64Matrix observations = parser.EnumerateRows(c => c != targetName).ToF64Matrix();

            // read targets
            var targets = parser.EnumerateRows(targetName).ToF64Vector();

I would like to load data from a List object; how can this be done?
Thanks!

edit:

    /// <summary>
    /// Parses the CsvRows to a double array. Only CsvRows with a single column can be used
    /// </summary>
    /// <param name="dataRows"></param>
    /// <returns></returns>
    public static double[] ToF64Vector(this IEnumerable<CsvRow> dataRows)
    {
      if (dataRows.First<CsvRow>().ColumnNameToIndex.Count != 1)
        throw new ArgumentException("Vector can only be genereded from a single column");
      return dataRows.SelectMany<CsvRow, double>((Func<CsvRow, IEnumerable<double>>) (values => (IEnumerable<double>) values.Values.AsF64())).ToArray<double>();
    }

This code only works with CsvRow list.
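One way to bypass CsvRow altogether is to flatten the list into plain arrays. The sketch below assumes a hypothetical `Sample` type standing in for rows loaded from a database, and assumes the matrix constructor taking (values, rows, columns) that appears elsewhere on this page expects row-major values. Both assumptions should be checked against the library before relying on them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type, e.g. filled from a SQL query.
public class Sample
{
    public double[] Features { get; set; }
    public double Target { get; set; }
}

public static class ListConversion
{
    // Flattens the features into one array laid out as values[row * columns + column].
    public static double[] ToRowMajor(IReadOnlyList<Sample> samples, int columns)
    {
        var values = new double[samples.Count * columns];
        for (int row = 0; row < samples.Count; row++)
        {
            if (samples[row].Features.Length != columns)
                throw new ArgumentException($"Row {row} does not have {columns} features.");

            Array.Copy(samples[row].Features, 0, values, row * columns, columns);
        }
        return values;
    }

    // Collects the target values in the same row order.
    public static double[] ToTargets(IReadOnlyList<Sample> samples)
    {
        var targets = new double[samples.Count];
        for (int row = 0; row < samples.Count; row++)
            targets[row] = samples[row].Target;
        return targets;
    }
}

// Assumed usage with SharpLearning (not compiled here):
// var observations = new F64Matrix(ListConversion.ToRowMajor(samples, columns), samples.Count, columns);
// var targets = ListConversion.ToTargets(samples);
```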

Strongly-named assemblies

First, let me say thanks for the great package: it works much better in our app than our previous (non-learning) solution.

We've got one issue to report: in our next release, all the various subsystems need to be in signed assemblies, which means we can't at the moment use SharpLearning since it would need to be called by a signed assembly and is itself unsigned. Any chance we could see signed versions of the SharpLearning Nuget packages?

Thanks,

Alistair

Failing unit test ClassificationGradientBoostLearner_LearnWithEarlyStopping

This always fails when I run all.ps1
cc: @mdabros

Failed   ClassificationGradientBoostLearner_LearnWithEarlyStopping
Error Message:
   Assert.AreEqual failed. Expected a difference no greater than <1E-06> between expected value <0.162790697674419> and actual value <0.13953488372093>.
Stack Trace:
   at SharpLearning.GradientBoost.Test.Learners.ClassificationGradientBoostLearnerTest.ClassificationGradientBoostLearner_LearnWithEarlyStopping() in E:\oss\SharpLearning\src\SharpLearning.GradientBoost.Test\Learners\ClassificationGradientBoostLearnerTest.cs:line 120
Standard Output Messages:


Debug Trace:
Iteration 1 Validation Error: 0.674418604651163
   Iteration 11 Validation Error: 0.290697674418605
   Iteration 21 Validation Error: 0.244186046511628
   Iteration 31 Validation Error: 0.22093023255814
   Iteration 41 Validation Error: 0.186046511627907
   Iteration 51 Validation Error: 0.186046511627907
   Iteration 61 Validation Error: 0.197674418604651
   Iteration 71 Validation Error: 0.174418604651163
   Iteration 81 Validation Error: 0.13953488372093
   Iteration 91 Validation Error: 0.162790697674419

Both for Debug and Release.

Add support for enabling/disabling messages from learners during training.

Currently, some of the learners in SharpLearning output information during the training period. This includes early stopping with gradient boost and neural networks. It should be possible to enable and disable these messages and, preferably, also to choose where the messages are written, for instance to Console, Trace, or a log file.

This could be made by adding an Action to receive the message. This should probably be part of a separate interface, for learners supporting this. Something like:

public interface ILogger
{
   Action<string> Log { get; set; }
}

The learners would then add the message to the log. Likewise, other algorithms like the optimizers could also implement this interface.
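A minimal sketch of how a learner might use such an interface is shown below. `DemoLearner` is a hypothetical stand-in, not an existing SharpLearning type, and the default of silently discarding messages is an assumption about the desired behaviour.

```csharp
using System;

public interface ILogger
{
    Action<string> Log { get; set; }
}

// Hypothetical learner used only to illustrate the proposal.
public class DemoLearner : ILogger
{
    // Receives training messages; by default they are discarded.
    public Action<string> Log { get; set; } = message => { };

    public void Learn(int iterations)
    {
        for (int i = 1; i <= iterations; i++)
        {
            // A real learner would do the training work here.
            Log($"Iteration {i}");
        }
    }
}
```

A caller could then route the messages with `new DemoLearner { Log = Console.WriteLine }`, or pass a lambda that appends to a file or forwards to Trace.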

Better default parameters for DecisionTreeLearners

Currently, the DecisionTreeLearners (ClassificationDecisionTreeLearner and RegressionDecisionTreeLearner) do not have very good default parameters. With maximumTreeDepth=2000, using the default parameters will, in most cases, result in a model that overfits the problem. Hence, a better set of default parameters should be found that, in more cases, results in a better regularized model.

Thanks for developing and sharing this brilliant machine learning package in C#

Unnecessary System Files Generated

For our project MetaMorpheus, after adding the SharpLearning NuGet package to our EngineLayer and TaskLayer, an excessive number of unnecessary system dll files is generated in the output folder of the GUI WPF project after building (in both release and debug). Here is a list of these files. We couldn't determine where the problem is, since there is no trace of it in the .csproj files or references of the GUI or related projects. So please help us if you have any idea! Thanks a lot.

Text Classification

I am pretty new to machine learning and all of these things.
I want to ask a question about text classification:
Can I use this library for text classification and how?

P.S I need to split one string and classify substrings by Categories\Groups\etc, for example "Ray's Potato Chips with Ketchup taste 80g" I need to split into
Category "Potato Chips"
Groups:"Ketchup"
Brand(or smthng):"Ray's"

Which model type should I use for financial price prediction?

First of all thank you for the great library!
My question is simple: I want to predict next period price with pre-computed history values.
I have over 30 rows of data for each price.
Prices and data are decimal.

For example history:
Indicator1 - Indicator 2 - Indicator 3 - Price - Trend
10,01121 - 23,56540 - 12.00001 - 12,23321 - UP
9,00001 - 3,00040 - 2.00001 - 1,23300 - DOWN
...
...
And data to predict coming like
8,11211 - 1,00020 - 0.00021 - 3,5555 - ?
I want to get TREND field.

Which model should I use? Any example will be perfect?
Regards!

Nuget package for SharpLearning.XGBoost can't install

Currently there is an issue when installing the SharpLearning.XGBoost package: NuGet will try to add a reference to the native xgboost dll.

The reason for this is how the NuGet package for PicNet.XGBoost.Net has been created. I have opened an issue to get this solved: PicNet/XGBoost.Net#24

Random Forest: m_random and parallel RNG

Hello, I would first like to thank you for making such an excellent and accessible repo. We're getting quite a bit of use out of it in multiple projects.

I've noticed that despite setting the seed in the Random Forest learner, I get slightly varying results from run-to-run given identical inputs. I suspect the problem is in the following code block:

Parallel.ForEach(rangePartitioner, (work, loopState) =>
{
   results.Add(CreateTree(observations, targets, indices, new Random(m_random.Next())));
});

The parallel random number generation is a problem; the same random numbers are generated because the seed is set in the constructor, but there is a "race" between the threads to grab the next random number. Locking m_random would not help, I think. This should be fixed by generating a random number for each tree prior to entering the Parallel.ForEach loop, like so:

int[] randomNumbers = new int[m_trees];
for(int i = 0; i < randomNumbers.Length; i++)
{
   randomNumbers[i] = m_random.Next();
}
Parallel.ForEach(rangePartitioner, (work, loopState) =>
{
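   // Assumes rangePartitioner yields (from, to) index ranges of size 1, so work.Item1 is the tree index.
   results.Add(CreateTree(observations, targets, indices, new Random(randomNumbers[work.Item1])));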
});
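The idea of drawing the per-tree seeds before the parallel loop can be checked in isolation with the sketch below. It is not the SharpLearning code: `BuildTree` is a hypothetical stand-in for CreateTree, and Parallel.For with a result array is used instead of the range partitioner to keep the example short.

```csharp
using System;
using System.Threading.Tasks;

public static class DeterministicParallelSeeds
{
    // Stand-in for CreateTree: the result depends only on the tree's own generator.
    static int BuildTree(Random random)
    {
        return random.Next();
    }

    public static int[] Run(int trees, int seed)
    {
        var master = new Random(seed);

        // Draw one seed per tree sequentially, before any parallel work starts,
        // so the assignment of seeds to trees does not depend on thread timing.
        var seeds = new int[trees];
        for (int i = 0; i < trees; i++)
            seeds[i] = master.Next();

        // Each iteration writes only to its own slot, so no locking is needed
        // and the order of the results is fixed as well.
        var results = new int[trees];
        Parallel.For(0, trees, i =>
        {
            results[i] = BuildTree(new Random(seeds[i]));
        });

        return results;
    }
}
```

Calling `Run` twice with the same arguments should return identical arrays, regardless of how the threads are scheduled.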

Let me know if I'm overlooking something. I'd be happy to fork and make a pull request.

Rob

AdaBoostLearners: Add features pr. split as regularization hyper parameter

RandomForest and GradientBoost learners have a hyper parameter, featuresPrSplit, which controls how many randomly selected features are considered during the decision tree's search for a new split. This parameter should also be introduced in the AdaBoostLearners (ClassificationAdaBoostLearner and RegressionAdaBoostLearner), to have more possibilities for regularizing this type of model.

Since the DecisionTreeLearner used in AdaBoost already supports 'featuresPrSplit', the implementation should simply add the hyper parameter to the AdaBoost learner constructors and forward the parameter to the DecisionTreeLearner.

Math.NET Matrices

This is a really great library. Was there a specific reason why you chose to roll your own Matrix class, rather than leveraging Math.NET?

Ideally I'd like to marry the two (not only for consistency with modules I've already written, but even for smaller things like using Matrix rather than Matrix). Before I jump in and start changing anything, though, I thought I'd check with the author to see if there was a specific reason behind it.

If I do proceed with integrating the two, more than happy to submit back a PR too, just let me know.

change unsafe code to safe code

SharpLearning.Neural full .net core2.0/standard 2.0 support

SharpLearning.Neural depends on math.net, which does not currently support .net core 2.0/.net standard 2.0. Hence, SharpLearning.Neural will only work on .NET Desktop/Windows.

Support for .net core 2.0/.net standard 2.0 is planned for math.net, so full support for SharpLearning.Neural will also be possible once this is implemented. Eventually, #9 might also solve this.

Random Forest Regression generates constant values for prediction

Hi,

I am using random forest for learning. In the output, the first 40 or so predictions vary (float values), but after that they are constant.

I was wondering if you have seen this behavior. The data has 24 features and 500 observations. The first 42 predictions vary, but after that the value is constant.

I can provide the data and the code if you would like that.

regards,
Avi

CS0012 The type 'Object' is defined in an assembly that is not referenced. You must add a reference to assembly 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'.

Fresh download, after restoring packages via NuGet.

How to solve it????

An item with the same key has already been added

Hello,

Might not be an actual issue, but more of a question on how to handle my error. I want to feed data into the parser that is in the SharpLearning.InputOutput.Csv package. Below is my code:

string rawTarget = Transformations.ReturnColumnAsCSVString(Data, OutputColumn);

System.Windows.Forms.Clipboard.SetText(rawTarget);

var targetparser = new CsvParser(() => new StringReader(rawTarget));

var targets2 = targetparser.EnumerateRows(OutputColumn).ToF64Vector();

So, first I pull my data into a string, this results in a string looking like this:
"Vwap";7049.4;6983.3;6981.8;6871.0;6846.7;6811.0
(obviously there is a lot more)

Then I use the StringReader to read my string and parse it to an F64 vector. The error I get is:
An item with the same key has already been added.

I have also tried to convert the string to a stream and then use a StreamReader, but this results in the exact same error. I am at a loss as to how to solve this.

Hope anyone can provide a solution! Thanks in advance!
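One possible cause, assuming the parser treats the first line as the header row and ';' as the column separator, is that a single-line string turns every value into a column name, so two equal values collide as duplicate keys. Under that assumption, writing the column name on the first line and one value per following line avoids the problem. `CsvColumnFormatting` below is a hypothetical helper, not part of SharpLearning.InputOutput.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of SharpLearning.InputOutput.
public static class CsvColumnFormatting
{
    // Writes the column name on the first line and one value on each following line.
    public static string ToSingleColumnCsv(string columnName, IEnumerable<double> values)
    {
        var builder = new StringBuilder();
        builder.Append(columnName).Append('\n');

        foreach (var value in values)
        {
            // Invariant culture keeps '.' as the decimal separator.
            builder.Append(value.ToString(CultureInfo.InvariantCulture)).Append('\n');
        }

        return builder.ToString();
    }
}
```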

Exception: Source array was not long enough. Check srcIndex and length, and the array's lower bounds

The following code throws a System.IndexOutOfRangeException on line 328 in GBMDecisionTreeLearner.cs

            var sut = new RegressionSquareLossGradientBoostLearner();

            Random rnd = new Random(42);
            var rows = 10000;
            var columns = 1;
            double[] values = new double[rows * columns];
            for (int i = 0; i < rows * columns; i++)
                values[i] = rnd.NextDouble();
            Containers.Matrices.F64Matrix observations = new Containers.Matrices.F64Matrix(values, 1, 10000);
            double[] targets = new double[rows];
            for (int i = 0; i < rows; i++)
                targets[i] = rnd.NextDouble();

            var model = sut.Learn(observations, targets);

Add TPrediction[] Predict(F64Matrix observations) to public interface IPredictor<TPrediction>

I think it would be a good idea to add the F64Matrix overload to the IPredictor interface, as it would make it easier to use the IPredictorModel interface in your code. The models seem to implement it already.

It would add a dependency for SharpLearning.Containers.Matrices in SharpLearning.Common.Interfaces, but I think it is unlikely that you would use the SharpLearning library without referencing SharpLearning.Containers.Matrices anyway.

Better error messages from learners in case of dimensionality mismatch

Currently, there are no checks to verify that the dimensions of the observation matrix and the target array match before learning is started. As a result, a mismatch only surfaces as an exception from somewhere inside the learner implementation, which gives the user poor feedback.

Checks should be added to all learners to ensure that the provided arguments and data is valid, before starting the learning process.
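Such a check could be as small as the following sketch. `LearnerArgumentChecks` and its method name are hypothetical, and the dimensions are passed as plain integers so the example does not depend on any SharpLearning type.

```csharp
using System;

// Hypothetical guard that a learner could call before training starts.
public static class LearnerArgumentChecks
{
    public static void VerifyObservationsAndTargets(int observationRows, int observationColumns, int targetCount)
    {
        if (observationRows <= 0 || observationColumns <= 0)
        {
            throw new ArgumentException(
                $"Observations must not be empty, got {observationRows} rows and {observationColumns} columns.");
        }

        if (observationRows != targetCount)
        {
            throw new ArgumentException(
                $"Observations have {observationRows} rows, but {targetCount} targets were provided.");
        }
    }
}
```

With a check like this, passing a matrix with 1 row and 10000 columns together with 10000 targets would fail immediately with a message naming both sizes.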

Metrics: Consider adding support for sample weighted metrics

When dealing with imbalanced data sets, it can be beneficial to use sample weighted metrics. This task should be split into several tasks, one for each metric, if sample weights are to be supported in the metrics project.

Duplicate efforts

Hi @mdabros!

I've just found your library a couple days ago and couldn't help but notice the similarity between both of our projects, SharpLearning and Accord.NET. Since we both share the same goal (bring serious machine learning to .NET), and instead of duplicating our efforts, wouldn't you be willing to join the Accord.NET project as well?

Seeing your extremely well-organized repository and coding skills, you would be more than welcome in joining Accord.NET as one of its authors.

Regards,
Cesar

Ensemble learners: Add support for sample weights

Add support for sample weights to the Ensemble learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.

This task requires #14 to be done first, since the ensemble learners need to be extended to also support weighted learners in the constructor.

Following that, the ensemble learners must implement the weighted learner interfaces and should simply forward the sample weights to the learners in the ensemble. The ensemble learners can be found here in the ensemble project: Ensemble learners

Replace SharpLearning.Containers.Matrices.F64Matrix with multidimensional array

In SharpLearning, the F64Matrix class, which is part of the Learner interfaces, is mostly used as a container for holding the features for a learning problem. While SharpLearning does contain some arithmetic extensions for the F64Matrix, the arithmetic is not used by any of the learners. Also, more efficient implementations can be found in Math.net.

This suggests that the primary container for features in SharpLearning should rather be a standard .net type like a multidimensional array (double[,]) or jagged array (double[][]), with some extension methods to add the current functionality of the F64Matrix.

An alternative, also suggested in #6, would be to replace the F64Matrix directly by using Math.net as the matrix provider. However, since only the SharpLearning.Neural project is using matrix arithmetic and with the plan of using CNTK as backend, math.net is a large dependency to take, if only using the matrix class as a feature container. So currently, I am leaning more towards replacing F64Matrix with a standard .net type. However, to better handle integration between Math.Net and SharpLearning, maybe a separate project, SharpLearning.MathNet, could be added with efficient conversions between Math.net and SharpLearning containers (both copy and shared memory). This of course depends on what data structure ends up replacing F64Matrix, if any.

These are my current thoughts, and people are very welcome to discuss and pitch in with suggestions.

OutOfMemory Exception in F64Matrix constructor - maximum array bounds exceeded

Hi there!

First: Thanks for the great work, excellent design you have there!

I am experiencing an OutOfMemory Exception in the constructor of F64Matrix that does not really come from memory shortage, but from the fact that F64Matrix internally uses a single one-dimensional double array that can quite easily exceed .NET's internal limits on maximum array size.

In my case I tried to create an F64Matrix with 10 million rows and 55 columns.

My preferred suggestion would be to abstract the matrix to an IF64Matrix interface that probably only consists of the At() method overloads. This would enable users to provide a custom implementation that is capable of handling larger amounts of data, if needed even by swapping data from and to disk.
Another solution could be to change the internal implementation of F64Matrix to use an array of double arrays, which I believe could also help.
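The arithmetic behind the failure, and the jagged alternative, can be sketched as below. The names are hypothetical. The size comparison assumes the 2 GB per-object limit that applies on the .NET Framework unless gcAllowVeryLargeObjects is enabled.

```csharp
public static class LargeMatrixSketch
{
    // Bytes needed by a single flat double[] holding rows * columns elements.
    public static long FlatArrayBytes(long rows, long columns)
    {
        return rows * columns * sizeof(double);
    }

    // Jagged layout: one double[] per row, so no single array holds all elements.
    public static double[][] CreateJagged(int rows, int columns)
    {
        var data = new double[rows][];
        for (int row = 0; row < rows; row++)
            data[row] = new double[columns];
        return data;
    }
}
```

For 10 million rows and 55 columns, `FlatArrayBytes` gives 4,400,000,000 bytes, which is well above 2 GB. In the jagged layout each row array is only 440 bytes and the outer array holds 10 million references, so no single allocation comes near the limit.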

Thanks for your help and keep up the excellent work!

Best regards

Florian

GradientBoost: Add support for sample weights

Add support for sample weights to the GradientBoost learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.

The GBMDecisionTreeLearner, used by GradientBoost, does not support sample weights, so adding sample weight support to the GradientBoost learners requires first adding it to the GBMDecisionTreeLearner. Adding sample weight support to the GBMDecisionTreeLearner primarily requires using the weights in the loss functions: GradientBoost Loss

Following that, sample weight support can be added to the GradientBoost learners. The learners can be found here in the GradientBoost project: GradientBoost

RandomForest: Add support for sample weights

Add support for sample weights to the RandomForest learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.

The DecisionTreeLearners, used by RandomForest, already support sample weights, so implementing it only involves setting up the sample weights and forwarding the weights to the DecisionTreeLearner for each tree. The learners can be found here in the RandomForest project: RandomForest

Please share your vision of .NET deep Learning

@mdabros Please excuse me if I am hijacking your excellent work here.

Daniel from MSFT is gathering a broad vision for .NET Deep Learning here. I think you may have a unique view on this.

Possible evolution: trained neuralnet predict C output

Hello,
I created a NeuralNet and trained it on my computer.
I would now like to port it to an embedded target, and I wondered whether it would be possible to output C code for the predict step of a trained network. Is that possible? If so, could you give me some advice on where to look?

Thank you

Sample weight for XGBoost

In Python XGBoost one can provide weights for each row of the data, see http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit. I tried to look for a way to specify such weights in SharpLearning, but could not find it. Is this possible?

Feature Suggestion

First of all, I want to congratulate you for this project. I have a suggestion, couldn't figure where to write it other than issues on GitHub.

My suggestion is to expose the number of observations (or, better, their indices, since one can count them) that fall into the left and right child of a node.

Consider switching to XorShift for RNG

XorShift is a simple alternative to the built-in Random class in C#, which provides better performance and randomness. I benchmarked the RNG implementations of Math.net as seen in the picture. The XorHome in the list is my own quick port of xoroshiro128+ from here: http://xoroshiro.di.unimi.it/xoroshiro128plus.c

As seen in the picture, the Math.Net XorShift is much faster than the built-in random.
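To show how small such a generator is, here is a minimal sketch of Marsaglia's 64-bit xorshift with the shift triple (13, 7, 17). It is neither the Math.NET implementation nor the xoroshiro128+ port benchmarked above; the class name `XorShift64Sketch` is made up for this illustration.

```csharp
using System;

// Illustrative xorshift64 generator; not the Math.NET or xoroshiro128+ code.
public sealed class XorShift64Sketch
{
    ulong m_state;

    public XorShift64Sketch(ulong seed)
    {
        // The state must never be zero, otherwise every output is zero.
        m_state = seed == 0 ? 0x9E3779B97F4A7C15UL : seed;
    }

    // Advances the state with three shift-and-xor steps and returns it.
    public ulong NextUInt64()
    {
        ulong x = m_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        m_state = x;
        return x;
    }

    // Uniform double in [0, 1), built from the top 53 bits of the output.
    public double NextDouble()
    {
        return (NextUInt64() >> 11) * (1.0 / (1UL << 53));
    }
}
```

Two instances created with the same seed produce the same sequence, which keeps seeded learners reproducible.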

CrossValidation CrossValidate ProbabilityPredictions error means?

Hi,
when I run "CrossValidation_CrossValidate_ProbabilityPredictions" example with my data, it gives these values:
Cross-validation error: 6.88921441267999
Training error: 6.86281040890985
I searched a lot but couldn't find any documentation about it. Could you explain what these values mean, and what they should be for the best prediction?
Thanks.
Regards!

System.AccessViolationException when retrieving data from a large dataset

In F64MatrixColumnView.RowPtr, integer overflow can occur when row * m_strideInBytes is larger than int.MaxValue, resulting in an invalid offset being applied to m_dataPtr:

double* RowPtr(int row)
{
    return (double*)((byte*)m_dataPtr + row * m_strideInBytes);
}

I am happy to fix this (it just requires a cast to long), but I'm not sure how to contribute - do I branch from master, then push and create a pull request from the GitHub website? In the future, if I find a simple bug like this, should I raise an issue, or can I just push with details and let you decide whether it's a good fix?
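The wrap-around can be reproduced without pointers. The sketch below compares the 32-bit multiplication used in RowPtr with the same product computed after a cast to long; `RowOffsetSketch` and its method names are hypothetical and exist only for this illustration.

```csharp
using System;

// Hypothetical helper that isolates the offset arithmetic from RowPtr.
public static class RowOffsetSketch
{
    // Same expression as in RowPtr: the product is computed in 32-bit
    // arithmetic and wraps around before it is widened to long.
    public static long OffsetWithInt(int row, int strideInBytes)
    {
        return unchecked(row * strideInBytes);
    }

    // Proposed fix: widen one operand first so the product is 64-bit.
    public static long OffsetWithLong(int row, int strideInBytes)
    {
        return (long)row * strideInBytes;
    }
}
```

With 300000 rows and a stride of 8000 bytes (1000 doubles per row), the 32-bit version yields -1894967296 while the 64-bit version yields the intended 2400000000.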

I haven't contributed to an open source project before!

CNTK as backend for SharpLearning.Neural

The Microsoft team working on CNTK has recently released the initial version of the C#/.Net API with support for both evaluation and training of neural networks. A more feature-complete version, with support for layers and other helpful features, should arrive before the end of the year. Currently, there seem to be a few performance-related issues (microsoft/CNTK#2374 and microsoft/CNTK#2386), but hopefully these will also be solved in the next release.

Using CNTK as backend for SharpLearning.Neural will add operators for more layer and network types, while also enabling GPU training and evaluation. Using a well supported deep learning toolkit as backend will also help to ensure that future operator, layer and network types will be available faster.

This task will require a large rewrite of SharpLearning.Neural, most likely only keeping the top level interface. However, since all the core operations are available from CNTK, most of the hard work is already completed.

This task should be split into multiple others when a design of how CNTK should be integrated has been completed. A few considerations:

Should the integration be "simple", i.e. only have a NeuralNetLearner and NeuralNetModel in SharpLearning and use the layer construction and related functionality from CNTK directly?
Should the integration hide CNTK behind an adapter to make it easier to support other deep learning toolkits like TensorFlow(Sharp)?

CrossValidation: Add support for weighted learners

Extend the ICrossValidation interface to support the IWeightedIndexedLearner interface. This depends on #14 being completed.
Implement support for sample weights in CrossValidation.

ExtremelyRandomizedTrees: Add support for sample weights

Add support for sample weights to the ExtremelyRandomizedTrees learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.

The DecisionTreeLearners, used by ExtremelyRandomizedTrees, already support sample weights, so implementing it only involves setting up the sample weights and forwarding the weights to the DecisionTreeLearner for each tree. The learners can be found here in the RandomForest project:
ExtremelyRandomizedTrees

Add IWeightedLearner and IWeightedIndexedLearner interfaces to Common.Interfaces

Add interface for learners supporting sample weights:

IPredictorModel<TPrediction> Learn(F64Matrix observations, double[] targets, 
double[] sampleWeights);

Add interface for learners supporting sample indices and sample weights:

IPredictorModel<TPrediction> Learn(F64Matrix observations, double[] targets, 
double[] sampleWeights, int[] indices);

Unit tests fail because of localization settings

Some unit tests compare against hardcoded strings written in the test method. These fail on systems that use a comma (,) as the decimal separator instead of a dot (.). These strings should probably be loaded from a resource, or the entire library should work with the invariant culture unless otherwise specified.
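The difference can be shown with the standard System.Globalization types. In the sketch below, a custom NumberFormatInfo with a comma separator stands in for a machine with such regional settings; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper showing culture-dependent versus invariant formatting.
public static class InvariantCultureSketch
{
    // Always uses a dot as the decimal separator, whatever the machine settings.
    public static string FormatInvariant(double value)
    {
        return value.ToString(CultureInfo.InvariantCulture);
    }

    // Simulates a locale that uses a comma as the decimal separator.
    public static string FormatWithCommaSeparator(double value)
    {
        var commaFormat = new NumberFormatInfo { NumberDecimalSeparator = "," };
        return value.ToString(commaFormat);
    }
}
```

Formatting 0.25 gives "0.25" with the invariant culture and "0,25" with the comma format, which is why a hardcoded "0.25" in a test only matches on some machines.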