mdabros / sharplearning
Machine learning for C# .Net
License: MIT License
In Python XGBoost one can provide weights for each row of the data, see http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit. I tried to look for a way to specify such weights in SharpLearning, but could not find it. Is this possible?
@mdabros Apologies if I am hijacking your excellent work here.
Daniel from MSFT is gathering a broad vision for .NET Deep Learning here. I think you may have a unique view on this.
First, let me say thanks for the great package: it works much better in our app than our previous (non-learning) solution.
We've got one issue to report: in our next release, all the various subsystems need to be in signed assemblies, which means we can't at the moment use SharpLearning since it would need to be called by a signed assembly and is itself unsigned. Any chance we could see signed versions of the SharpLearning Nuget packages?
Thanks,
Alistair
Currently, the DecisionTreeLearners (ClassificationDecisionTreeLearner and RegressionDecisionTreeLearner) do not have very good default parameters. With maximumTreeDepth=2000, using the default parameters will, in most cases, result in a model that overfits the problem. Hence, a better set of default parameters should be found, one that results in a better regularized model in more cases.
Hi, thanks for the great library!
My CSV has text in some of the columns. Some of them are categorical (e.g. month of the year) and some have free text (e.g., book title). Looks like SharpLearning.InputOutput.Csv.CsvRowExtensions.ToF64Matrix is trying to parse stringified numbers. What if my CSV consists of non-number values? Is there a recommended way or should I wire another lib to do TF-IDF/word2vec/char embedding/etc?
Extend the ILearningCurvesCalculator interface to support the IWeightedIndexedLearner interface. This depends on #14 being completed.
Implement support for sample weights in LearningCurvesCalculator.
I am pretty new to machine learning and all of these things.
I want to ask a question about text classification:
Can I use this library for text classification, and if so, how?
P.S I need to split one string and classify substrings by Categories\Groups\etc, for example "Ray's Potato Chips with Ketchup taste 80g" I need to split into
Category "Potato Chips"
Groups:"Ketchup"
Brand(or smthng):"Ray's"
Add support for sample weights to the ExtremelyRandomizedTrees learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.
The DecisionTreeLearners, used by ExtremelyRandomizedTrees, already support sample weights, so implementing it only involves setting up the sample weights and forwarding the weights to the DecisionTreeLearner for each tree. The learners can be found here in the RandomForest project:
ExtremelyRandomizedTrees
XorShift is a simple alternative to the built-in Random class in C#, which provides better performance and randomness. I benchmarked the RNG implementations of Math.net as seen in the picture. The XorHome in the list is my own quick port of xoroshiro128+ from here: http://xoroshiro.di.unimi.it/xoroshiro128plus.c
As seen in the picture, the Math.Net XorShift is much faster than the built-in random.
Is it possible ?
And if it's possible, how to train the network ?
Add support for sample weights to the NeuralNet learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.
This is currently on-hold until #9 has been decided.
Hi,
when I run "CrossValidation_CrossValidate_ProbabilityPredictions" example with my data, it gives these values:
Cross-validation error: 6.88921441267999
Training error: 6.86281040890985
I searched a lot but couldn't find any documentation about it. Could you explain what these values mean, and what they should be for the best prediction?
Thanks.
Regards!
Add support for sample weights to the Ensemble learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.
This task requires #14 to be done first, since the ensemble learners need to be extended to also support weighted learners in the constructor.
After that, the learners must implement the weighted learner interfaces and should simply forward the sample weights to the learners in the ensemble. The ensemble learners can be found here in the ensemble project: Ensemble learners
RandomForest and GradientBoost learners have a hyper parameter, featuresPrSplit, which controls how many randomly selected features are considered during the decision tree's search for a new split. This parameter should also be introduced in the AdaBoostLearners (ClassificationAdaBoostLearner and RegressionAdaBoostLearner), to have more possibilities for regularizing this type of model.
Since the DecisionTreeLearner used in AdaBoost already supports 'featuresPrSplit', the implementation should simply add the hyper parameter to the AdaBoost learner constructors and forward the parameter to the DecisionTreeLearner.
When dealing with imbalanced data sets, it can be beneficial to use sample weighted metrics. This task should be split into several tasks, one for each metric, if sample weights are to be supported in the metrics project.
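As one illustration of what a sample weighted metric could look like, here is a minimal sketch of a weighted mean squared error. The class and method names are hypothetical and not part of the SharpLearning metrics project; the formula is the standard weighted average of squared errors.

```csharp
using System;

// Hypothetical helper, not part of SharpLearning.
public static class WeightedMetrics
{
    // Weighted mean squared error: sum(w_i * (t_i - p_i)^2) / sum(w_i).
    public static double WeightedMeanSquaredError(
        double[] targets, double[] predictions, double[] sampleWeights)
    {
        if (targets.Length != predictions.Length || targets.Length != sampleWeights.Length)
            throw new ArgumentException("All arrays must have the same length.");

        var weightedErrorSum = 0.0;
        var weightSum = 0.0;
        for (var i = 0; i < targets.Length; i++)
        {
            var error = targets[i] - predictions[i];
            weightedErrorSum += sampleWeights[i] * error * error;
            weightSum += sampleWeights[i];
        }

        if (weightSum <= 0.0)
            throw new ArgumentException("Sample weights must sum to a positive value.");

        return weightedErrorSum / weightSum;
    }
}
```

With all weights equal, this reduces to the ordinary mean squared error, so an unweighted metric could stay the default.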
First of all, I want to congratulate you for this project. I have a suggestion, couldn't figure where to write it other than issues on GitHub.
My suggestion is, number of observations (or better their indexes, one can count them) that fall to the left and right child of a node.
Currently, all optimizers in SharpLearning.Optimization use random uniform sampling for sampling hyper parameters from the provided min/max boundaries. This is not always optimal, for instance when dealing with a hyper parameter like learning rate that can span a large range of values, like 0.0001 to 1.0. Using random uniform sampling in this case might result in only sampling values in a small part of the range. In this case it would be much better to sample uniformly in the log space. Hence, it should be possible to select which space to sample from for each hyper parameter when setting up an optimizer.
This should include at least random uniform from:
At the same time, setup of the hyper-parameter ranges could be changed from setting up an array of arrays to using a type to guide the user better:
Current method
var parameters = new double[][]
{
new double[] { 80, 300 }, // iterations (min: 80, max: 300)
new double[] { 0.02, 0.2 }, // learning rate (min: 0.02, max: 0.2)
};
Proposed method
var parameters = new OptimizerParameter[]
{
new OptimizerParameter(min: 80, max: 300, SamplingMethod.Linear), // iterations (min: 80, max: 300)
new OptimizerParameter(min: 0.02, max: 0.2, SamplingMethod.Logarithmic), // learning rate (min: 0.02, max: 0.2)
};
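To make the difference between the two sampling spaces concrete, here is a minimal sketch of linear and logarithmic sampling between a min and max bound. The class and method names are hypothetical and only mirror the proposed SamplingMethod.Linear and SamplingMethod.Logarithmic options; this is not the optimizer implementation.

```csharp
using System;

// Hypothetical helper illustrating the two proposed sampling spaces.
public static class ParameterSampling
{
    // Uniform sample between min and max in the original (linear) space.
    public static double SampleLinear(double min, double max, Random random)
    {
        return min + random.NextDouble() * (max - min);
    }

    // Uniform sample in log10 space, mapped back to the original space.
    // Requires 0 < min <= max, since the logarithm is undefined otherwise.
    public static double SampleLogarithmic(double min, double max, Random random)
    {
        if (min <= 0.0 || max < min)
            throw new ArgumentException("Requires 0 < min <= max.");

        var logMin = Math.Log10(min);
        var logMax = Math.Log10(max);
        var value = Math.Pow(10.0, logMin + random.NextDouble() * (logMax - logMin));

        // Clamp to guard against floating point rounding at the boundaries.
        return Math.Min(max, Math.Max(min, value));
    }
}
```

For a learning rate range of 0.0001 to 1.0, linear sampling places about 90% of the samples above 0.1, while logarithmic sampling gives each decade (0.0001 to 0.001, 0.001 to 0.01, and so on) roughly the same share.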
Add support for sample weights to the GradientBoost learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.
The GBMDecisionTreeLearner, used by GradientBoost, does not support sample weights, so adding sample weight support to the GradientBoost learners requires first adding it to the GBMDecisionTreeLearner. Adding sample weight support to the GBMDecisionTreeLearner primarily requires using the weights in the loss functions: GradientBoost Loss
Following that, sample weight support can be added to the GradientBoost learners. The learners can be found here in the GradientBoost project: GradientBoost
For our project MetaMorpheus, after adding Sharplearning NuGet Package to our EngineLayer and TaskLayer, in the GUI WPF project, there is an excessive amount of unnecessary system dll files generated in the output folder after building (No matter release or debug). Here is a list of these files. We really couldn't determine where is the problem since there is no trace in the .csproj files and references of GUI nor related projects. So please help us if you have any idea! Thanks a lot.
In SharpLearning, the F64Matrix class, which is part of the Learner interfaces, is mostly used as a container for holding the features for a learning problem. While SharpLearning does contain some arithmetic extensions for the F64Matrix, the arithmetic is not used by any of the learners. Also, more efficient implementations can be found in Math.net.
This suggests that the primary container for features in SharpLearning should rather be a standard .net type like a multidimensional array (double[,]) or a jagged array (double[][]), with some extension methods to add the current functionality of the F64Matrix.
An alternative, also suggested in #6, would be to replace the F64Matrix directly by using Math.net as the matrix provider. However, since only the SharpLearning.Neural project is using matrix arithmetic and with the plan of using CNTK as backend, math.net is a large dependency to take, if only using the matrix class as a feature container. So currently, I am leaning more towards replacing F64Matrix with a standard .net type. However, to better handle integration between Math.Net and SharpLearning, maybe a separate project, SharpLearning.MathNet, could be added with efficient conversions between Math.net and SharpLearning containers (both copy and shared memory). This of course depends on what data structure ends up replacing F64Matrix, if any.
These are my current thoughts, and people are very welcome to discuss and pitch in with suggestions.
I have read all the examples and gone through the source code, but haven't been able to answer the question.
I have setup a data set, trained and tested the model, but now I would like to use the model to make predictions on new data. How would I achieve this?
Example:
Target value has 3 classifications: good, bad, average
New data comes in -> use trained/tested model to make a prediction on the target value. Also, would it be possible to get a probability/confidence of the prediction of the target value i.e. 25% good, 50% average, 25% bad.
Currently, there are no checks to verify that the dimensions of the observation matrix and the target array match before learning is started. As a result, invalid input fails somewhere inside the learner implementation, which gives the user poor error messages and feedback.
Checks should be added to all learners to ensure that the provided arguments and data are valid before starting the learning process.
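A minimal sketch of such an up-front check is shown below. The helper name is hypothetical, and a jagged array stands in for the observation matrix so that the example is self-contained; a real check would read the row count from the F64Matrix.

```csharp
using System;

// Hypothetical helper, shown with a jagged array as a stand-in for the
// observation matrix.
public static class LearnerInputChecks
{
    // Throws a descriptive exception before learning starts if the
    // number of observation rows and the number of targets differ.
    public static void VerifyObservationsAndTargets(double[][] observations, double[] targets)
    {
        if (observations == null) throw new ArgumentNullException(nameof(observations));
        if (targets == null) throw new ArgumentNullException(nameof(targets));

        if (observations.Length == 0)
            throw new ArgumentException("Observations do not contain any rows.");

        if (observations.Length != targets.Length)
            throw new ArgumentException(
                $"Observations have {observations.Length} rows, " +
                $"but targets have {targets.Length} elements. The two must match.");
    }
}
```

Calling this at the top of each Learn method would turn an index error deep inside a learner into a message that names the mismatching dimensions.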
Add support for sample weights to the AdaBoost learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.
The DecisionTreeLearners, used by AdaBoost, already support sample weights, so implementing it only involves setting up the sample weights and forwarding the weights to the DecisionTreeLearner in each boosting iteration. The learners can be found here in the AdaBoost project: AdaBoost
The work is currently in progress in the branch adaboost-sample-weight-support
The following code throws a System.IndexOutOfRangeException on line 328 in GBMDecisionTreeLearner.cs
var sut = new RegressionSquareLossGradientBoostLearner();
Random rnd = new Random(42);
var rows = 10000;
var columns = 1;
double[] values = new double[rows * columns];
for (int i = 0; i < rows * columns; i++)
values[i] = rnd.NextDouble();
Containers.Matrices.F64Matrix observations = new Containers.Matrices.F64Matrix(values, 1, 10000);
double[] targets = new double[rows];
for (int i = 0; i < rows; i++)
targets[i] = rnd.NextDouble();
var model = sut.Learn(observations, targets);
Currently there is an issue when installing the SharpLearning.XGBoost package, Nuget will try to add a reference for the native xgboost dll:
The reason for this is how the nuget package for PicNet.XGBoost.Net has been created. I have opened an issue to get this solved: PicNet/XGBoost.Net#24
Fresh download, after restoring packages via NuGet.
How to solve it????
This always fails when I run all.ps1
cc: @mdabros
Failed ClassificationGradientBoostLearner_LearnWithEarlyStopping
Error Message:
Assert.AreEqual failed. Expected a difference no greater than <1E-06> between expected value <0.162790697674419> and actual value <0.13953488372093>.
Stack Trace:
at SharpLearning.GradientBoost.Test.Learners.ClassificationGradientBoostLearnerTest.ClassificationGradientBoostLearner_LearnWithEarlyStopping() in E:\oss\SharpLearning\src\SharpLearning.GradientBoost.Test\Learners\ClassificationGradientBoostLearnerTest.cs:line 120
Standard Output Messages:
Debug Trace:
Iteration 1 Validation Error: 0.674418604651163
Iteration 11 Validation Error: 0.290697674418605
Iteration 21 Validation Error: 0.244186046511628
Iteration 31 Validation Error: 0.22093023255814
Iteration 41 Validation Error: 0.186046511627907
Iteration 51 Validation Error: 0.186046511627907
Iteration 61 Validation Error: 0.197674418604651
Iteration 71 Validation Error: 0.174418604651163
Iteration 81 Validation Error: 0.13953488372093
Iteration 91 Validation Error: 0.162790697674419
Both for Debug and Release.
First of all thank you for the great library!
My question is simple: I want to predict the next period's price from pre-computed history values.
I have over 30 rows of data for each price.
The price and the other values are decimals.
For example history:
Indicator1 - Indicator 2 - Indicator 3 - Price - Trend
10,01121 - 23,56540 - 12.00001 - 12,23321 - UP
9,00001 - 3,00040 - 2.00001 - 1,23300 - DOWN
...
...
And data to predict coming like
8,11211 - 1,00020 - 0.00021 - 3,5555 - ?
I want to get TREND field.
Which model should I use? An example would be perfect.
Regards!
Hello, I would first like to thank you for making such an excellent and accessible repo. We're getting quite a bit of use out of it in multiple projects.
I've noticed that despite setting the seed in the Random Forest learner, I get slightly varying results from run-to-run given identical inputs. I suspect the problem is in the following code block:
Parallel.ForEach(rangePartitioner, (work, loopState) =>
{
results.Add(CreateTree(observations, targets, indices, new Random(m_random.Next())));
});
The parallel random number generation is a problem; the same random numbers are generated because the seed is set in the constructor, but there is a "race" between the threads to grab the next random number. Locking m_random would not help, I think. This should be fixed by generating a random number for each tree prior to entering the Parallel.ForEach loop, like so:
int[] randomNumbers = new int[m_trees];
for(int i = 0; i < randomNumbers.Length; i++)
{
randomNumbers[i] = m_random.Next();
}
Parallel.ForEach(rangePartitioner, (work, loopState) =>
{
results.Add(CreateTree(observations, targets, indices, randomNumbers[work]));
});
Let me know if I'm overlooking something. I'd be happy to fork and make a pull request.
Rob
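The idea above can be tried in isolation. The following is a minimal, self-contained sketch in which a stand-in BuildTree function replaces the real CreateTree call; all names are hypothetical. Seeds are drawn sequentially from the master generator before the parallel loop, and each result is written to its own index, so neither the seeds nor the result order depend on thread scheduling.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the tree building loop, not SharpLearning code.
public static class DeterministicParallelSeeds
{
    // Stand-in for building one tree: the result depends only on the seed.
    static int BuildTree(int seed)
    {
        return new Random(seed).Next(1000);
    }

    public static int[] BuildAll(int treeCount, int masterSeed)
    {
        // Draw all seeds sequentially, before any parallel work starts.
        var master = new Random(masterSeed);
        var seeds = new int[treeCount];
        for (var i = 0; i < treeCount; i++)
        {
            seeds[i] = master.Next();
        }

        // Each iteration only reads its own seed and writes its own slot,
        // so no shared Random instance is touched concurrently.
        var results = new int[treeCount];
        Parallel.For(0, treeCount, i => results[i] = BuildTree(seeds[i]));
        return results;
    }
}
```

Besides making runs repeatable, this avoids calling Random.Next from several threads at once, which the Random class is not designed for.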
Add .NET Core and .Net Standard support to make SharpLearning available on more platforms. Porting to .Net Standard involves the following tasks:
After the porting process the continuous integration on appveyor must be updated.
I think it would be a good idea to add the F64Matrix overload to the IPredictor interface as it would make it easier to use the IPredictorModel interface in your code. The models seem to implement it already.
It would add a dependency for SharpLearning.Containers.Matrices in SharpLearning.Common.Interfaces, but I think it is unlikely that you would use the SharpLearning library without referencing SharpLearning.Containers.Matrices anyway.
SharpLearning.Neural depends on math.net, which does not currently support .net core 2.0/.net standard 2.0. Hence, SharpLearning.Neural will only work on .NET Desktop/Windows.
Support for .net core 2.0/.net standard 2.0 is planned for math.net, so full support for SharpLearning.Neural will also be possible once this is implemented. Eventually, #9 might also solve this.
All the given samples only use the CSV method.
#region Read data
// Use StreamReader(filepath) when running from filesystem
var parser = new CsvParser(() => new StringReader(Resources.winequality_white));
var targetName = "quality";
// read feature matrix (all columns different from the targetName)
F64Matrix observations = parser.EnumerateRows(c => c != targetName).ToF64Matrix();
// read targets
var targets = parser.EnumerateRows(targetName).ToF64Vector();
I would like to load data from a List object; how can this be done?
Thanks!
edit:
/// <summary>
/// Parses the CsvRows to a double array. Only CsvRows with a single column can be used
/// </summary>
/// <param name="dataRows"></param>
/// <returns></returns>
public static double[] ToF64Vector(this IEnumerable<CsvRow> dataRows)
{
if (dataRows.First<CsvRow>().ColumnNameToIndex.Count != 1)
throw new ArgumentException("Vector can only be genereded from a single column");
return dataRows.SelectMany<CsvRow, double>((Func<CsvRow, IEnumerable<double>>) (values => (IEnumerable<double>) values.Values.AsF64())).ToArray<double>();
}
This code only works with CsvRow list.
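One possible way to get from an in-memory list to a matrix without going through CSV is to flatten the list into a single row-major array. The helper below is a hypothetical sketch, not part of SharpLearning. It assumes that the resulting array can be handed to an F64Matrix constructor of the form new F64Matrix(values, rows, columns), as used in the GBMDecisionTreeLearner report on this page; the meaning and order of those arguments is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SharpLearning.
public static class ListConversion
{
    // Copies a list of equally sized rows into one row-major array.
    public static double[] ToRowMajor(
        IReadOnlyList<double[]> rows, out int rowCount, out int columnCount)
    {
        if (rows == null || rows.Count == 0)
            throw new ArgumentException("At least one row is required.");

        rowCount = rows.Count;
        columnCount = rows[0].Length;
        var values = new double[rowCount * columnCount];

        for (var r = 0; r < rowCount; r++)
        {
            if (rows[r].Length != columnCount)
                throw new ArgumentException(
                    $"Row {r} has {rows[r].Length} values, expected {columnCount}.");

            // Row r occupies the slice [r * columnCount, (r + 1) * columnCount).
            Array.Copy(rows[r], 0, values, r * columnCount, columnCount);
        }

        return values;
    }
}
```

The targets are already a plain double[] in the Learn signatures shown on this page, so a List<double> of targets would only need a call to ToArray().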
Hello,
Might not be an actual issue, but more of a question on how to handle my error. I want to feed data into the parser that is in the SharpLearning.InputOutput.Csv package. Below is my code:
`string rawTarget = Transformations.ReturnColumnAsCSVString(Data, OutputColumn);
System.Windows.Forms.Clipboard.SetText(rawTarget);
var targetparser = new CsvParser(() => new StringReader(rawTarget));
var targets2 = targetparser.EnumerateRows(OutputColumn).ToF64Vector();`
So, first I pull my data into a string, this results in a string looking like this:
"Vwap";7049.4;6983.3;6981.8;6871.0;6846.7;6811.0
(obviously there is a lot more)
Then I use the StringReader to read my string and parse it to an F64 vector. The error I get is:
An item with the same key has already been added.
I have also tried to convert the string to a stream and then use the StreamReader, but this results in the exact same error. I am at a loss as to how to solve this.
Hope anyone can provide a solution! Thanks in advance!
Hello,
I created a NeuralNet and trained it on my computer.
I now wish to port it to an embedded target, and I was wondering whether it would be possible to output C code for the predict step of a trained network. Is that the case? If so, could you give me some advice on where to look?
Thank you
Hi @mdabros!
I've just found your library a couple days ago and couldn't help but notice the similarity between both of our projects, SharpLearning and Accord.NET. Since we both share the same goal (bring serious machine learning to .NET), and instead of duplicating our efforts, wouldn't you be willing to join the Accord.NET project as well?
Seeing your extremely well-organized repository and coding skills, you would be more than welcome in joining Accord.NET as one of its authors.
Regards,
Cesar
Add interface for learners supporting sample weights:
IPredictorModel<TPrediction> Learn(F64Matrix observations, double[] targets,
double[] sampleWeights);
Add interface for learners supporting sample indices and sample weights:
IPredictorModel<TPrediction> Learn(F64Matrix observations, double[] targets,
double[] sampleWeights, int[] indices);
The Microsoft team working on CNTK has recently released the initial version of the C#/.Net API with support for both evaluation and training of neural networks. A more feature complete version, with support for layers and other helpful features, should arrive before the end of the year. Currently, there seem to be a few performance related issues (microsoft/CNTK#2374 and microsoft/CNTK#2386), but hopefully these will also be solved in the next release.
Using CNTK as backend for SharpLearning.Neural will add operators for more layer and network types, while also enabling GPU training and evaluation. Using a well supported deep learning toolkit as backend will also help to ensure that future operator, layer and network types will be available faster.
This task will require a large rewrite of SharpLearning.Neural, most likely only keeping the top level interface. However, since all the core operations are available from CNTK, most of the hard work is already completed.
This task should be split into multiple others when a design of how CNTK should be integrated has been completed. A few considerations:
Hi,
I am using Random Forest for learning. In the output I get, the first 40 or so predictions (float values) vary, but after that they are just constant.
I was wondering if you have seen this behavior. The data has 24 features and 500 observations. The first 42 predictions vary, but after that they are constant.
I can provide the data and the code if you would like that.
regards,
Avi
Hi there!
First: Thanks for the great work, excellent design you have there!
I am experiencing an OutOfMemory exception in the constructor of F64Matrix that does not really come from memory shortage, but from the fact that F64Matrix internally uses a single one-dimensional double array that can quite easily exceed .NET's internal limits on maximum array size.
In my case I tried to create an F64Matrix with 10 million rows and 55 columns.
My preferred suggestion would be to abstract the matrix to an IF64Matrix interface that probably only consists of the At() method overloads. This would enable users to provide a custom implementation that is capable of handling larger amounts of data, if needed even by swapping data from and to disk.
Another solution could be to change the internal implementation of F64Matrix to use an array of double arrays, which I believe could also help.
Thanks for your help and keep up the excellent work!
Best regards
Florian
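For scale, 10 million rows times 55 columns is 550 million doubles, or about 4.4 GB in one array, which is above the 2 GB per-object limit that .NET Framework applies unless gcAllowVeryLargeObjects is enabled. Both suggestions can be sketched together as below. The IF64Matrix name and the At() overloads come from the suggestion above, but the exact members and the JaggedF64Matrix class are assumptions made for illustration, not the library's API.

```csharp
using System;

// Assumed shape of the suggested abstraction; member names are guesses.
public interface IF64Matrix
{
    int RowCount { get; }
    int ColumnCount { get; }
    double At(int row, int col);
    void At(int row, int col, double value);
}

// Hypothetical implementation backed by an array of row arrays, so that
// no single allocation has to hold the whole matrix.
public sealed class JaggedF64Matrix : IF64Matrix
{
    readonly double[][] m_rows;

    public JaggedF64Matrix(int rowCount, int columnCount)
    {
        if (rowCount < 1 || columnCount < 1)
            throw new ArgumentException("Matrix must have at least one row and one column.");

        RowCount = rowCount;
        ColumnCount = columnCount;
        m_rows = new double[rowCount][];
        for (var r = 0; r < rowCount; r++)
        {
            m_rows[r] = new double[columnCount];
        }
    }

    public int RowCount { get; }
    public int ColumnCount { get; }

    public double At(int row, int col) => m_rows[row][col];

    public void At(int row, int col, double value) => m_rows[row][col] = value;
}
```

The trade-off is that rows are no longer contiguous in memory, so code that relies on pointer arithmetic over one flat buffer would need a different access path.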
Hi mdabros,
I love the Optimizer classes of SharpLearning, I use them heavily for hyperparameter tuning.
But I would like to use different random seeds, and parts of the Optimizer classes don't allow that.
Example:
If you look at the constructor of your BayesianOptimizer you can see that it doesn't pass on the "seed" parameter to all other classes that BayesianOptimizer creates instances of, sometimes it will just forward a hardcoded 42 instead.
I know, 42 is the answer, but I would prefer "seed" in this case... ;)
The other IOptimizer implementations have similar issues.
Would be nice if you could modify that some time...
Thank you!
Best regards
Florian
RandomForest and GradientBoost learners have a hyper parameter, subSampleRatio, which controls how many training samples are forwarded to each tree in the ensemble. When subsampling is active, samples from the training data will be drawn with replacement, leading to more variation among the trees in the ensemble. This parameter should also be introduced in the AdaBoostLearners (ClassificationAdaBoostLearner and RegressionAdaBoostLearner), to have more possibilities for regularizing this type of model.
In the RandomForest implementation of this feature, there is sampling with replacement even if subSampleRatio=1.0; this is part of the algorithm's design. However, for the AdaBoost implementation of this feature, if subsampling is off (subSampleRatio=1.0), no sampling with replacement should be introduced, and the whole training set should be considered in each tree of the ensemble. This will result in the 'classic' AdaBoost algorithm if subSampleRatio is set to 1.0.
Besides the difference when subSampleRatio=1.0, the AdaBoost implementation should be very similar to the RandomForest implementation, which can be found here RandomForest.
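The intended AdaBoost behaviour can be sketched as a small index sampling function. The class and method names are hypothetical and this is not the RandomForest implementation; it only illustrates the rule described above: the full training set when subSampleRatio is 1.0, and sampling with replacement otherwise.

```csharp
using System;

// Hypothetical helper illustrating the proposed AdaBoost subsampling rule.
public static class AdaBoostSubSampling
{
    public static int[] SampleIndices(int[] allIndices, double subSampleRatio, Random random)
    {
        if (subSampleRatio <= 0.0 || subSampleRatio > 1.0)
            throw new ArgumentException("subSampleRatio must be in the range (0.0, 1.0].");

        // Subsampling off: use every sample exactly once ('classic' AdaBoost).
        if (subSampleRatio >= 1.0)
        {
            return (int[])allIndices.Clone();
        }

        // Subsampling on: draw the requested number of samples with replacement.
        var count = Math.Max(1, (int)(allIndices.Length * subSampleRatio));
        var sample = new int[count];
        for (var i = 0; i < count; i++)
        {
            sample[i] = allIndices[random.Next(allIndices.Length)];
        }
        return sample;
    }
}
```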
In F64MatrixColumnView.RowPtr, integer overflow can occur when row * m_strideInBytes is larger than int.MaxValue, resulting in an invalid offset being applied to m_dataPtr:
double* RowPtr(int row)
{
return (double*)((byte*)m_dataPtr + row * m_strideInBytes);
}
I am happy to fix this (it just requires a cast to long), but I'm not sure how to contribute - do I branch from master, then push and create a pull request from the GitHub website? In the future, if I find a simple bug like this, should I raise an issue, or can I just push with details and let you decide whether it's a good fix?
I haven't contributed to an open source project before!
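The effect of the cast can be seen without any pointers. The sketch below uses hypothetical method names and assumes that both the row index and the stride are of type int, as the description of the overflow implies.

```csharp
using System;

// Hypothetical demonstration of the overflow, not SharpLearning code.
public static class OffsetOverflow
{
    // Both operands are int, so the product is computed in 32 bits and
    // wraps around silently before it is widened to long.
    public static long OffsetWithIntMath(int row, int strideInBytes)
    {
        return unchecked(row * strideInBytes);
    }

    // Casting one operand to long first makes the whole multiplication
    // 64-bit, so the offset stays correct for large matrices.
    public static long OffsetWithLongMath(int row, int strideInBytes)
    {
        return (long)row * strideInBytes;
    }
}
```

For example, with row = 10,000,000 and a stride of 440 bytes (55 doubles), the correct offset is 4,400,000,000 bytes, which does not fit in an int.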
Add support for sample weights to the RandomForest learners. This will make it possible to handle imbalanced datasets directly in the learners, instead of under/oversampling the dataset in order to balance it.
The DecisionTreeLearners, used by RandomForest, already support sample weights, so implementing it only involves setting up the sample weights and forwarding the weights to the DecisionTreeLearner for each tree. The learners can be found here in the RandomForest project: RandomForest
This is a really great library. Was there a specific reason why you chose to roll your own Matrix class, rather than leveraging Math.NET?
Ideally I'd like to marry the two (not only for consistency with modules I've already written, but even for smaller things like using Matrix rather than Matrix). Before I jump in and start changing anything, though, I thought I'd check with the author to see if there was a specific reason behind it.
If I do proceed with integrating the two, more than happy to submit back a PR too, just let me know.
Currently, some of the learners in SharpLearning will output information during the training period. This includes the early stopping with gradient boost and neural networks. It should be possible to enable and disable these messages, and preferably, also possible to choose where the messages should be outputted. For instance to Console, Trace, or alternatively a log file.
This could be made by adding an Action to receive the message. This should probably be part of a separate interface, for learners supporting this. Something like:
public interface ILogger
{
    // Receives each message; assign for example Console.WriteLine.
    Action<string> Log { get; set; }
}
The learners would then add the message to the log. Likewise, other algorithms like the optimizers could also implement this interface.
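A minimal sketch of how a learner might use such an interface is shown below. The ILogger shape follows the proposal above; the ExampleIterativeLearner class is purely hypothetical and stands in for any learner that reports progress during training.

```csharp
using System;

public interface ILogger
{
    Action<string> Log { get; set; }
}

// Hypothetical learner used only to illustrate the logging hook.
public sealed class ExampleIterativeLearner : ILogger
{
    // Silent by default, so messages are opt-in.
    public Action<string> Log { get; set; } = message => { };

    public void Learn(int iterations)
    {
        for (var i = 1; i <= iterations; i++)
        {
            // A real learner would do its training work here.
            Log($"Iteration {i} completed");
        }
    }
}
```

Messages could then be routed with, for example, learner.Log = Console.WriteLine; or learner.Log = message => System.Diagnostics.Trace.WriteLine(message);, and writing to a log file would only need a different delegate.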
Some unit tests compare against hardcoded strings written in the test method. These fail on systems that use a comma (,) as the decimal separator instead of a dot (.). These strings should probably be loaded from a resource, or the entire library should work with the invariant culture unless otherwise specified.
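The second option can be sketched with two small helpers that always pass CultureInfo.InvariantCulture. The class and method names are hypothetical; the point is only that formatting and parsing stop depending on the machine's regional settings.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers, not part of SharpLearning.
public static class InvariantFormatting
{
    // Always formats with '.' as the decimal separator.
    public static string Format(double value)
    {
        return value.ToString(CultureInfo.InvariantCulture);
    }

    // Always parses with '.' as the decimal separator.
    public static double Parse(string text)
    {
        return double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

With these, a test that compares against a hardcoded string such as "0.5" should behave the same on a machine configured for a comma decimal separator as on one configured for a dot.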