sethjuarez / numl


Machine Learning for .NET

Home Page: http://numl.net

License: MIT License

C# 99.30% PowerShell 0.47% Batchfile 0.01% Shell 0.22%

numl's Issues

value return -1

The col value often comes back as -1:

// uh oh, need to return something?
// a weird node of some sort...
// but just in case...
if (col == -1)
    BuildLeafNode(y.Mode());

Data feed function:

public static IEnumerable<Value> GetData()
{
    Random r = new Random(500);
    string rs = "";

    for (int i = 0; i < 2000; i++)
    {
        var a = r.Next(1, 500);
        var sum = i;
        if (sum <= 100)
            rs = "s";
        else if (sum > 100 && sum <= 250)
            rs = "m";
        else
            rs = "l";

        yield return new Value { V1 = 2, V2 = i, R = rs };
    }
}

Datetime as label to learn - error "Dimensions do not match"

Hi Seth,

I am trying to use one integer as a feature and one DateTime as the label with the LinearRegression generator, but the Learner.Learn method fails with the error "Dimensions do not match".

The training data passed is like
{"Timestamp": "2016-06-06T15:49:46.6420000Z", "num1":32}

Below is the stacktrace -

at numl.Math.LinearAlgebra.Vector.Dot(Vector one, Vector two)
at numl.Supervised.Regression.LinearRegressionModel.Predict(Vector x)
at numl.Learner.GenerateModel(IGenerator generator, Matrix x, Vector y, IEnumerable`1 examples, Double trainingPct, Int32 total)
at numl.Learner.Learn(IEnumerable`1 examples, Double trainingPercentage, Int32 repeat, IGenerator generator)
at MyLib.Repo.GenerateMyModel(List`1 trainingData)

Can you please help?
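
One possible workaround, offered as an assumption rather than a confirmed fix: regression targets are numeric, so the timestamp could be converted to a number before it is used as the label. The sketch below uses only the .NET base class library. `TimestampLabel` and its methods are hypothetical helpers, and whether this resolves the "Dimensions do not match" error in numl is not confirmed by this thread.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: turns an ISO 8601 timestamp into a numeric value
// that a regression model can treat as an ordinary double.
public static class TimestampLabel
{
    // Seconds since the Unix epoch (1970-01-01T00:00:00Z), with fractions.
    public static double ToUnixSeconds(string iso8601)
    {
        DateTimeOffset parsed = DateTimeOffset.Parse(
            iso8601, CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal);
        return parsed.ToUnixTimeMilliseconds() / 1000.0;
    }

    // Converts a predicted numeric value back into a timestamp.
    public static DateTimeOffset FromUnixSeconds(double seconds)
    {
        long milliseconds = (long)Math.Round(seconds * 1000.0);
        return DateTimeOffset.FromUnixTimeMilliseconds(milliseconds);
    }
}
```

With this approach the training class would expose a `double` label (for example `TimestampLabel.ToUnixSeconds("2016-06-06T15:49:46.6420000Z")`) in place of the DateTime, and predictions would be mapped back with `FromUnixSeconds`.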

Deserialization error when using Load(STREAM)

Hi Seth,

I'm getting an exception when trying to deserialize from a stream. Please check.
The error and the code are attached below

ERROR:
Unhandled Exception: System.InvalidOperationException: There is an error in XML document (0, 0). ---> System.Xml.XmlException: Root element is missing.
at System.Xml.XmlTextReaderImpl.Throw(Exception e)
at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
at System.Xml.XmlTextReaderImpl.Read()
at System.Xml.XmlTextReader.Read()
at System.Xml.XmlReader.MoveToContent()
at Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReaderDecisionTreeModel.Read1_DecisionTreeModel()
--- End of inner exception stack trace ---
at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)
at System.Xml.Serialization.XmlSerializer.Deserialize(Stream stream)
at numl.Supervised.Model.Load(Stream stream) in c:\projects\numl\numl\Supervised\Model.cs:line 82
at numl.Supervised.DecisionTree.DecisionTreeModel.Load(Stream stream) in c:\projects\numl\numl\Supervised\DecisionTree\DecisionTreeModel.cs:line 72

class Program
{
    static void Main(string[] args)
    {
        var data = new List<MyData>();
        for (var i = 0; i < 1000; i++)
        {
            data.Add(new MyData { Prop1 = i, Prop2 = i + 1, Prop3 = i + 2, Result = i % 2 == 0 });
        }
        var d = Descriptor.Create<MyData>();
        var g = new DecisionTreeGenerator(d);
        g.SetHint(false);
        var learningModel = Learner.Learn(data, 0.80, 1000, g);
        Console.WriteLine(learningModel);

        string modelXml;
        using (var ms = new MemoryStream())
        {
            learningModel.Model.Save(ms);
            ms.Position = 0;
            using (var reader = new StreamReader(ms, Encoding.Unicode))
            {
                modelXml = reader.ReadToEnd();
            }
        }

        // now read
        var m = new DecisionTreeModel();
        using (var stream = new MemoryStream())
        {
            using (var sr = new StreamWriter(stream, Encoding.Unicode))
            {
                sr.Write(modelXml);
                stream.Position = 0;
                // THIS GIVES AN ERROR
                m = (DecisionTreeModel)m.Load(stream);
            }
        }

    }
}


public class MyData
{
    [Feature]
    public double Prop1 { get; set; }
    [Feature]
    public double Prop2 { get; set; }
    [Feature]
    public double Prop3 { get; set; }
    [Label]
    public bool Result { get; set; }
}
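
A possible explanation for the "Root element is missing" error, independent of numl: in the repro above the StreamWriter is never flushed before stream.Position is reset, so the MemoryStream may still be empty when Load reads it. The sketch below reproduces that behaviour with base class library types only; `StreamRoundTrip` is a hypothetical name, and this is offered as a likely cause rather than a confirmed diagnosis.

```csharp
using System.IO;
using System.Text;

public static class StreamRoundTrip
{
    // Writes text into a MemoryStream, rewinds it, and reads it back.
    // When flushBeforeRewind is false, the text is still sitting in the
    // StreamWriter's internal buffer and never reaches the stream.
    public static string WriteThenRead(string text, bool flushBeforeRewind)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new StreamWriter(stream, Encoding.Unicode);
            writer.Write(text);
            if (flushBeforeRewind)
            {
                writer.Flush(); // push buffered characters into the MemoryStream
            }

            stream.Position = 0;
            var reader = new StreamReader(stream, Encoding.Unicode);
            return reader.ReadToEnd();
        }
    }
}
```

For a short document, `WriteThenRead(xml, false)` reads back an empty string, which an XML parser would report as a missing root element, while `WriteThenRead(xml, true)` returns the original text. In the repro, calling `sr.Flush()` before `stream.Position = 0` would be the corresponding change.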

LinearRegression- unable to load saved model

I've found what appears to be a bug in LinearRegressionModel.Load/Save methods.
I get this error when loading a previously saved model.
Attached model and code to reproduce it - just run the code twice to repro the problem.

Unhandled Exception: System.InvalidOperationException: There is an error in XML document (18, 6). ---> System.InvalidOperationException: There is an error in XML document (18, 6). ---> System.InvalidOperationException:  was not expected.
   at Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReaderVector.Read1_v()
   --- End of inner exception stack trace ---
   at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)
   at numl.Utils.Xml.Read[T](XmlReader reader) in c:\projects\numl\numl\Utils\Xml.cs:line 152
   at numl.Supervised.Regression.LinearRegressionModel.ReadXml(XmlReader reader) in c:\projects\numl\numl\Supervised\Regression\LinearRegressionModel.cs:line 76
   at System.Xml.Serialization.XmlSerializationReader.ReadSerializable(IXmlSerializable serializable, Boolean wrappedAny)
   at Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReaderLinearRegressionModel.Read1_LinearRegressionModel()

---CODE-----

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using numl;
using numl.Model;
using numl.Supervised;
using numl.Supervised.Regression;

namespace MachineLearningByFormulaDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // generate data to train our neural network
            var rnd = new Random();
            Func<double, double, double> func = (l, r) => l + 2 * r;


            IModel model;
            var data = new List<ModelItem>();
            const string modelFileName = "test.mdl";
            if (File.Exists(modelFileName))
            {
                model = new LinearRegressionModel().Load(modelFileName);
            }
            else
            {
                for (var i = 0; i < 100; i++)
                {
                    var left = rnd.NextDouble(0, 50000);
                    var right = rnd.NextDouble(0, 50000);
                    var result = func(left, right);
                    data.Add(new ModelItem { LeftOperand = left, RightOperand = right, Result = result });
                }
                var d = Descriptor.Create<ModelItem>();
                var g = new LinearRegressionGenerator { Descriptor = d };
                var learningModel = Learner.Learn(data, .80, 1000, g);
                model = learningModel.Model;
                model.Save(modelFileName);
            }

            // test our trained network with some sample data
            for (var i = 0; i < 10; i++)
            {
                var left = rnd.NextDouble(0, 50000);
                var right = rnd.NextDouble(0, 50000);
                var item = new ModelItem { LeftOperand = left, RightOperand = right /* Result will be predicted */};
                model.Predict(item);

                var predictedResult = item.Result;
                var expectedResult = func(left, right);
                var diff = Math.Abs(expectedResult - predictedResult) / expectedResult;
                Console.WriteLine("Expected: {0:N2}, Predicted: {1:N2}, Difference: {2:P}", expectedResult, item.Result, diff);
            }
        }
    }

    public class ModelItem
    {
        [Feature]
        public double LeftOperand { get; set; }
        [Feature]
        public double RightOperand { get; set; }
        [Label]
        public double Result { get; set; }
    }

    public static class RandomExtensions
    {
        public static double NextDouble(
            this Random random,
            double minValue,
            double maxValue)
        {
            return random.NextDouble() * (maxValue - minValue) + minValue;
        }
    }
}

Difference between Label and StringLabel Attributes

What is the difference between the [Label] and [StringLabel] attributes?

Error "Sequence contains no elements..."

Hi Seth,

I get this error at random times with the same data. It occurs either at the learning step in the code below:

generator = new DecisionTreeGenerator(5) { Descriptor = descriptor, Hint = 0 };
learningModel = Learner.Learn(data, 0.8, 5, generator);

or while deserializing the model JSON, like this:

JsonReader.ReadJson(jsonData);

The error really does occur randomly, and it goes away with the same data after a few retries.

Can you please help me track it down?

How to load saved models?

After using

void Save(string file);

to save the trained model, how do I load the model again?

Model saving and loading

Add ability to save and load models

KMeans Cluster Member Proximity to Center

I've been playing with K-Means the last few days and found some clusters with some data I'm interested in, which is great! I'd love to be able to rank the members of the clusters I am interested in by distance from the cluster center.

It doesn't look like the data available in the Cluster type offers the ability to do that. Is there something I'm missing? I see that there is a vector for the center of the cluster. I think the piece I am missing is a vector for each member. If I had that, I could then do something like a Euclidean distance calculation to determine the ranking that I want. Does this sound like the right path?

Any thoughts are certainly appreciated!
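
The ranking described above can be sketched without any numl types, assuming the center and each member are available as plain double arrays of the same length. `ClusterRanking` and both method names are hypothetical; how to obtain member vectors from numl's Cluster type is not covered in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClusterRanking
{
    // Straight-line (Euclidean) distance between two points.
    public static double EuclideanDistance(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // Returns the members ordered from closest to farthest from the center.
    public static List<double[]> RankByDistance(double[] center, IEnumerable<double[]> members)
    {
        return members.OrderBy(m => EuclideanDistance(center, m)).ToList();
    }
}
```

Because `OrderBy` is a stable sort, members at equal distance keep their original relative order.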

GRU example

First, hats off for this great library. It's so much better designed than Encog or Accord.Net, and it has .NET Core support, yay!

On to my question:

You recently added GRU support to the library, and I was wondering how to use it for time series prediction. (I know the theory behind it; I am just wondering how to use descriptors, and the library in general, to achieve this.)

Let's assume we have these classes. (I am using a simple, artificial example here. I know this is not a good model, but it shows the principle.)

class SensorReading
{
    DateTime TimeStamp;
    double Temperature;
}

Let's assume we have those readings at 1-minute intervals, so we get, say, a List<SensorReading> with 1,000,000 entries.

Now, usually I would apply a sliding window to this input, so I get the following time series entities out of it:

class SensorTimeSeries
{
    IList<SensorReading> Input;
    SensorReading Output;
}

So I get a List<SensorTimeSeries> out of the List<SensorReading> with, say, 5 input data points and the corresponding output (the 6th reading) to predict, for a total of 1,000,000 - 5 = 999,995 entries.

My question now is: how do I create a GRU model for this? How do I train and evaluate it later on?

To reiterate, my training data would be a List<SensorTimeSeries> with 999,995 entries, each containing 5 readings as input and one as the label to predict.
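
The sliding-window step itself can be sketched as follows. This covers only the data preparation, not the GRU model; `SlidingWindow.Build` is a hypothetical helper that works on plain temperature values rather than the SensorReading class.

```csharp
using System;
using System.Collections.Generic;

public static class SlidingWindow
{
    // Turns a series into (inputs, output) pairs: each window holds
    // inputSize consecutive values, and the output is the value that follows.
    public static List<Tuple<double[], double>> Build(IList<double> series, int inputSize)
    {
        if (inputSize < 1)
            throw new ArgumentOutOfRangeException("inputSize");

        var windows = new List<Tuple<double[], double>>();
        // The last usable start index leaves room for inputSize inputs plus one output.
        for (int start = 0; start + inputSize < series.Count; start++)
        {
            var input = new double[inputSize];
            for (int k = 0; k < inputSize; k++)
            {
                input[k] = series[start + k];
            }
            windows.Add(Tuple.Create(input, series[start + inputSize]));
        }
        return windows;
    }
}
```

A series of N values with 5 inputs per window yields N - 5 windows, so 10 readings produce 5 training examples.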

Feature value greater than range found in all trained instances throws an exception on prediction using DecisionTree

Suppose a particular feature property takes values in the range 0-80 across the training data. If you then attempt to predict on an instance where that property is set to 100, the DecisionTreeGenerator model prediction throws an exception. I'm not sure if this is by design, but I thought that even though the training data never contained the value used in the prediction, the model would at least know that it is greater than all the values it was trained on.

You can find the offending test in my pull request. The Test that illustrates the issue is called:
ArbitraryPrediction_Test_With_Feature_Value_Greater_Than_Trained_Instances

Model.Load(string) fails for KernelPerceptronGenerator

Seth,

Could you please fix the Load functionality for KernelPerceptron? Thanks.
This code fails:

            var data = new List<MyData>();
            for (var i = 0; i < 100; i++)
            {
                data.Add(new MyData { Prop1 = i, Prop2 = i + 1, Prop3 = i + 2, Result = i % 2 == 0 });
            }
            var descriptor = Descriptor.Create<MyData>();

            var kernel = new RBFKernel(3);
            var generator = new KernelPerceptronGenerator(kernel) { Descriptor = descriptor };

            var learningModel = Learner.Learn(data, 0.80, 1000, generator);
            Console.WriteLine(learningModel);

            learningModel.Model.Save("model.mdl");
            // THIS FAILS ----
            learningModel.Model.Load("model.mdl");

Error:


There is an error in XML document (2, 2). ---> System.MissingMethodException: No parameterless constructor defined for this object.
   at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandleInternal& ctor, Boolean& bNeedSecurityCheck)
   at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.Activator.CreateInstance(Type type, Boolean nonPublic)
   at System.Activator.CreateInstance(Type type)
   at numl.Supervised.Perceptron.KernelPerceptronModel.ReadXml(XmlReader reader) in c:\projects\numl\numl\Supervised\Perceptron\KernelPerceptronModel.cs:line 63
   at System.Xml.Serialization.XmlSerializationReader.ReadSerializable(IXmlSerializable serializable, Boolean wrappedAny)
   at Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializationReaderKernelPerceptronModel.Read1_KernelPerceptronModel()
   --- End of inner exception stack trace ---
   at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)
   at System.Xml.Serialization.XmlSerializer.Deserialize(Stream stream)
   at numl.Utils.Xml.Load(String file, Type t) in c:\projects\numl\numl\Utils\Xml.cs:line 94
   at numl.Supervised.Model.Load(String file) in c:\projects\numl\numl\Supervised\Model.cs:line 75

Please port to .NET Standard

I'd love to use this library as part of a .NET Standard project, as well as .NET Core.
Thank you.

Linear Regression with NuML

I'm trying to do a really basic linear regression (Z = 2 * X + 1) prediction using NuML. Given the data is so linear I can't understand why the predicted value is so far off unless I am doing something wrong. I have the target class

public class Sample
{
    public float V { get; set; }
    public float X { get; set; }
    public float Y { get; set; }
    public float Z { get; set; }

    public Func<float, float, float, float> OutputStrategy { get; set; }
    public Sample(Func<float, float, float, float> outputStrategy)
    {
        OutputStrategy = outputStrategy;
    }
    public void Seed(int i)
    {
        V = (float) i;
        X = (float) 2 * i;
        Y = (float) 3 * i;
        Z = OutputStrategy(V, X, Y);
    }
}

and I have the NuML code to set up the source values and predict an answer for an arbitrary new data point:

NB: The output strategy is a simple 2 * A + 1. I've tried it with multivariate analysis and the prediction is further away

public static void Main(string[] args)
{
    // Generate sample data
    int sampleSize = 1000;
    Sample[] samples = new Sample[sampleSize];
    Func<float, float, float, float> outputStrategy = (A, B, C) => 2 * A + 1;
    for (int i = 0; i < sampleSize; i++)
    {
        samples[i] = new Sample(outputStrategy);
        samples[i].Seed(i);
    }

    // calculate model
    var generator = new LinearRegressionGenerator();
    var descriptor = Descriptor.New("Samples")
        .With("V").As(typeof(float))
        .With("X").As(typeof(float))
        .With("Y").As(typeof(float))
        .Learn("Z").As(typeof(float));
    generator.Descriptor = descriptor;
    var model = Learner.Learn(samples, 0.6, 50, generator);

    // Use prediction
    var targetSample = new Sample(outputStrategy);
    targetSample.Seed(sampleSize + 1);
    var predictedSample = model.Model.Predict(targetSample);
    var predictedValue = predictedSample.Z;
    var actualValue = outputStrategy(targetSample.V, targetSample.X, targetSample.Y);
    Console.Write("Predicted Value = {0}, Actual Value = {1}, Difference = {2} {3:0.00}%", predictedValue, actualValue, actualValue - predictedValue, (decimal) (actualValue - predictedValue) / (decimal) predictedValue * 100M);
    Console.ReadKey();
}

This gives a difference of about 0.5%, which was surprising considering the line is completely straight. I have tried using different percentages of the dataset for training and different numbers of iterations of the model, but it makes no difference to the output.

If I use an even slightly more complicated model, I get much worse predictive capabilities. If I use logistic regression, the predicted output of Z is always 1?!
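
As a baseline for comparison, a closed-form least-squares fit on the single feature V recovers the line from this data. The sketch below is not numl's algorithm; `SimpleLeastSquares` is a hypothetical helper. One thing that might be worth checking, as an assumption this thread does not confirm: X and Y are exact multiples of V, so the three features carry the same information, which can make a multi-feature fit poorly conditioned.

```csharp
using System;

public static class SimpleLeastSquares
{
    // Fits y = slope * x + intercept by ordinary least squares.
    // Returns (slope, intercept).
    public static Tuple<double, double> Fit(double[] x, double[] y)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("Need two arrays of equal length with at least two points.");

        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.Length; i++)
        {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.Length;
        meanY /= y.Length;

        double sxx = 0, sxy = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double dx = x[i] - meanX;
            sxx += dx * dx;
            sxy += dx * (y[i] - meanY);
        }
        if (sxx == 0)
            throw new ArgumentException("All x values are identical.");

        double slope = sxy / sxx;
        return Tuple.Create(slope, meanY - slope * meanX);
    }
}
```

Fitting V = 0..999 against Z = 2 * V + 1 with this helper gives a slope close to 2 and an intercept close to 1, which is the result a linear model should approach on this data.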

Documentation

Do you have any documentation, other than the API References?

I like your framework but, I must admit, it is hard to use without any documentation.

Thanks in advance!

LoadJson not implemented

I am using version 0.9.14-beta of nuget pkg for aspnetcore, and trying to load saved json of model. For DecisionTreeModel, the LoadJson method is not implemented.
Any help?

Unused local variable

Not a big deal, but wondering if this is indicative of some feature that was not implemented

Learner.cs line 92 var total is not used.

Unsupervised model specification

Create standardization around unsupervised models.

Performance tweak in DecisionTreeGenerator.GetBestSplit

Hi Seth,

I've run numl in the profiler that comes with Visual Studio and found this line of code to cause a lot of CPU utilization:
DecisionTreeGenerator.cs, line 212, inside GetBestSplit()
Activator.CreateInstance(ImpurityType);

It's essentially a call to a costly Activator.CreateInstance from within a "for" loop. This could potentially be optimized by taking it out of the loop, or even calling it once and caching, if ImpurityType property is not meant to be assigned.

I'm not seeing any calls to the setter outside the DecisionTreeGenerator class, so it could potentially be made private, e.g.

public Type ImpurityType { get; private set; }
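
The hoisting idea can be sketched as below. `StandInImpurity` and `SplitSearch` are hypothetical stand-ins, not numl types, and reusing one instance is only safe under the assumption that the object keeps no state between calls.

```csharp
using System;

// Hypothetical stand-in for an impurity measure; counts how often it is constructed.
public sealed class StandInImpurity
{
    public static int Created;

    public StandInImpurity()
    {
        Created++;
    }

    public double Measure(double value)
    {
        return value * value;
    }
}

public static class SplitSearch
{
    // Pattern flagged by the profiler: one reflection-based construction per iteration.
    public static double SumPerIteration(Type impurityType, double[] values)
    {
        double total = 0;
        foreach (double v in values)
        {
            var measure = (StandInImpurity)Activator.CreateInstance(impurityType);
            total += measure.Measure(v);
        }
        return total;
    }

    // Hoisted variant: construct once, reuse inside the loop.
    public static double SumHoisted(Type impurityType, double[] values)
    {
        var measure = (StandInImpurity)Activator.CreateInstance(impurityType);
        double total = 0;
        foreach (double v in values)
        {
            total += measure.Measure(v);
        }
        return total;
    }
}
```

Both methods return the same total; the hoisted variant calls Activator.CreateInstance once instead of once per element.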

Thanks.

Automatic model/parameter selection based on descriptor labels

Use label detection to select appropriate models

Guid property

I needed a guid property, and I implemented it for my solution. But I was thinking that maybe someone else might be interested in that. Should I prepare a PR for that? Or do you want to keep the core stuff as slim as possible?

KNNModel serialization

When I serialize a KNNModel I get an exception on K.ToString("r").
If I change it to K.ToString("d") it seems to work fine.

Could you help me, please?

public override void WriteXml(XmlWriter writer)
{
    writer.WriteAttributeString("K", K.ToString("d"));
    Xml.Write(writer, Descriptor);
    Xml.Write(writer, X);
    Xml.Write(writer, Y);
}
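
The behaviour reported here matches the .NET documentation for standard numeric format strings: the round-trip specifier ("R"/"r") is documented for floating-point types and BigInteger, while "D"/"d" is documented for integral types. The sketch below checks this with base class library calls only, assuming K is an int (which the working "d" format suggests); `RoundTripFormat` is a hypothetical name.

```csharp
using System;
using System.Globalization;

public static class RoundTripFormat
{
    // Returns true when the formatting call throws FormatException.
    public static bool Throws(Func<string> format)
    {
        try
        {
            format();
            return false;
        }
        catch (FormatException)
        {
            return true;
        }
    }

    // "d" is the decimal specifier for integral types.
    public static string FormatInt(int k)
    {
        return k.ToString("d", CultureInfo.InvariantCulture);
    }

    // "r" is the round-trip specifier for floating-point types.
    public static string FormatDouble(double value)
    {
        return value.ToString("r", CultureInfo.InvariantCulture);
    }
}
```

Calling `RoundTripFormat.Throws(() => 5.ToString("r"))` is expected to report a FormatException, whereas `FormatInt(5)` and `FormatDouble(0.5)` format normally.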

Move Serialization to Json

I am thinking of moving all serialization to json. Any thoughts?

Model load issue

I tried to load a saved model but it gave System.TypeLoadException-Cannot find type numl.Tests.Data.Iris on JsonReader.ReadJson

var data = Iris.Load();
var description = Descriptor.Create();
var generator = new DecisionTreeGenerator(50);
var model = generator.Generate(description, data) as DecisionTreeModel;
model.Save(AppDomain.CurrentDomain.BaseDirectory + "/model_cache/test.json");
var learntmodel = JsonReader.ReadJson(System.IO.File.ReadAllText(AppDomain.CurrentDomain.BaseDirectory + "/model_cache/test.json"));

Add Genetic Algorithms

Add the ability to solve any problem using genetic algorithms, should be extensible to allow defining of custom problems and solving heuristics. Properties would include population growth rate, cross over rate, elitists and other genetic metrics.

Use case: An ARIMA type regression algorithm using moving genetic solvers, allowing time series predictors to modulate over time.

Bug: always selects the highest value

[V2, 0.0016]
 |- 0 ≤ x < 49.5
 |  [V1, 0.0000]
 |   |- 1 ≤ x < 1.01
 |   |   +(L, 1)-----------------------------<<<< this supposed to be "s" 
 |- 49.5 ≤ x < 99.01
 |   +(L, 1)



    [Test]
    public void ValueObject_Test_With_Yield_Enumerator()
    {
        var data = ValueObject.GetData();
        var generator = new DecisionTreeGenerator()
        {
            Descriptor = Descriptor.Create<ValueObject>()
        };

        var decisionTree = new DecisionTreeGenerator();
        var model = generator.Generate(data);

        var o = new ValueObject() { V1 = 1, V2 =10 };
        var os = model.Predict<ValueObject>(o).R;
        Assert.AreEqual("l".Sanitize(), os);
    }

Using StringFeature Kills Application

Doing anything with the library that uses StringFeature causes the application to just sit idly recursively allocating memory until it has consumed all system memory (if you allow it to), regardless of the data set length, string size, trainer parameters, etc.

Recommendation Problem

I've installed numl from NuGet and there is no Recommendation class in it.

Complex Feature type with Property Path support

When using the description method in numl the Descriptor object throws an exception when a Feature or Label attribute is defined on a complex type property. Properties can be complex types from external libraries thus preventing numl attribute usage.

For example given the below type:
public class Foo
{
[Feature]
public Bar One { get; set; }
[Label]
public bool IsOK { get; set; }
}

public class Bar
{
public int A { get; set; }
public int B { get; set; }
public int C { get; set; }
}

The descriptor would throw an error on converting the type Bar to a double. This is the same for nullable type properties also.

A suggested implementation is to use property path(s) in the Feature attribute. This would allow the Descriptor to extract one or more sub properties from the complex type.

Suggested Implementation:
public class Foo
{
[FeatureSelector("Bar.A", "Bar.B", "Bar.C")]
public Bar One { get; set; }
[Label]
public bool IsOK { get; set; }
}
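
The property-path idea could be prototyped with reflection, as in the sketch below. `PropertyPath.Resolve` is a hypothetical helper, not part of numl, and it assumes paths are written as property names relative to the root object (for example "One.A"), which differs from the type-based "Bar.A" form in the suggestion above. The Foo and Bar classes are repeated without attributes so the block compiles on its own.

```csharp
using System;
using System.Reflection;

public class Bar
{
    public int A { get; set; }
    public int B { get; set; }
    public int C { get; set; }
}

public class Foo
{
    public Bar One { get; set; }
    public bool IsOK { get; set; }
}

public static class PropertyPath
{
    // Walks a dotted path of public property names and returns the final value.
    // Returns null if an intermediate value along the path is null.
    public static object Resolve(object root, string path)
    {
        object current = root;
        foreach (string name in path.Split('.'))
        {
            if (current == null)
                return null;

            PropertyInfo property = current.GetType().GetProperty(name);
            if (property == null)
                throw new ArgumentException("Property '" + name + "' not found on " + current.GetType().Name + ".");

            current = property.GetValue(current, null);
        }
        return current;
    }
}
```

A descriptor could then convert each resolved value with `Convert.ToDouble`, so `Resolve(foo, "One.B")` on a Foo whose Bar has B = 2 would contribute 2.0 as a feature value.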

Docs link is broken.

The link to the documentation in the readme is broken and the main site only has a title on the documentation page.

I would love to read the documentation if there is some.

NullReferenceException using Learner

Hi!
I'm using the example provided in the "Getting Started" section, and I'm constantly getting NullReferenceException when using the Learner.
The exception occurs on Ject.GetCtor. If I substitute the Parallel.For on the Learner for a normal For, the exceptions disappear.

System.NullReferenceException was unhandled by user code
  HResult=-2147467261
  Message=Object reference not set to an instance of an object.
  Source=mscorlib
  StackTrace:
       at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
       at System.Collections.Generic.Dictionary`2.set_Item(TKey key, TValue value)
       at numl.Utils.Ject.GetCtor(Type type) in c:\projects\numl\numl\Utils\Ject.cs:line 182
       at numl.Utils.Ject.Create(Type type) in c:\projects\numl\numl\Utils\Ject.cs:line 195
       at numl.Supervised.DecisionTree.DecisionTreeGenerator.GetBestSplit(Matrix x, Vector y, List`1 used) in c:\projects\numl\numl\Supervised\DecisionTree\DecisionTreeGenerator.cs:line 215
       at numl.Supervised.DecisionTree.DecisionTreeGenerator.BuildTree(Matrix x, Vector y, Int32 depth, List`1 used) in c:\projects\numl\numl\Supervised\DecisionTree\DecisionTreeGenerator.cs:line 112
       at numl.Supervised.DecisionTree.DecisionTreeGenerator.Generate(Matrix x, Vector y) in c:\projects\numl\numl\Supervised\DecisionTree\DecisionTreeGenerator.cs:line 92
       at numl.Learner.GenerateModel(IGenerator generator, Matrix x, Vector y, IEnumerable`1 examples, Double trainingPct, Int32 total) in c:\projects\numl\numl\Learner.cs:line 145
       at numl.Learner.<>c__DisplayClasse.<Learn>b__d(Int32 i) in c:\projects\numl\numl\Learner.cs:line 111
       at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
  InnerException:
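
The stack trace ends inside Dictionary.Insert called from Ject.GetCtor, under Parallel.For. Dictionary is not safe for concurrent writes, so one plausible cause (an assumption, since the Ject source is not shown here) is a shared constructor cache being updated from several threads at once. A thread-safe cache could look like the sketch below; `FactoryCache` is a hypothetical name, not numl's implementation.

```csharp
using System;
using System.Collections.Concurrent;

public static class FactoryCache
{
    // ConcurrentDictionary allows concurrent reads and writes without external locking.
    private static readonly ConcurrentDictionary<Type, Func<object>> Cache =
        new ConcurrentDictionary<Type, Func<object>>();

    // Looks up (or creates and stores) a factory for the type, then invokes it.
    public static object Create(Type type)
    {
        Func<object> factory = Cache.GetOrAdd(type, t => () => Activator.CreateInstance(t));
        return factory();
    }
}
```

Under contention GetOrAdd may run the value factory more than once, but only one result is stored, which is harmless here because every factory for a given type behaves the same.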

Unable to match split value 10 for feature [V2, 1, 1][2]

Hi, there is another issue, and I am not sure whether it is a real issue or I am doing something wrong:

// value should be "very weak" , but I am getting "very good" 
namespace Console
{
    public class Value
    {
        [Feature]
        public double V1 { get; set; }
        [Feature]
        public double V2 { get; set; }
        [Feature]
        public double V3 { get; set; }
        [Label]
        public string R { get; set; }

        public static IEnumerable<Value> GetData()
        {
            Random r = new Random();
            for (int i = 0; i < 8000; i++)
            {
                string label = "s";
                double v1 = r.Next(0, 100);
                double q = r.Next(0, 100);
                double math = r.Next(0, 100);
                double avrge = (q + i + math) / 3;
                if (avrge > 80)
                    label = "very good";
                else if (avrge > 60)
                    label = "good";
                else if (avrge > 50)
                    label = "middle";
                else if (avrge > 30)
                    label = "weak";
                else
                    label = "very weak";

                yield return new Value { V1 = v1, V2 = q, V3 = math, R = label };

            }
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            var data = Value.GetData().ToList() ;
            var description = Descriptor.Create<Value>();
            var generator = new DecisionTreeGenerator(512,50, description, null);
            var model = generator.Generate(description, data);


            Value currentValue = new Value { V1 = 1, V2 = 20, V3=30 };
            var pay = model.Predict<Value>(currentValue).R;

        }
    }
}

Levenshtein Edit Distance Algorithm

It seems I keep running across the Levenshtein Edit Distance algorithm. In reviewing the code for this library I found other distance algorithms implemented, and wondered what might have caused this one to be left out. (See Issue #16)

Given the importance of the algorithm and the relevance to information processing and loosely to machine learning, there are a few challenges to be considered.

Given a class:

class SimilarWordEntry
{
    [Feature] public string WordA { get; set; }
    [Feature] public string WordB { get; set; }

    // Where LevenshteinDistance is computed following the guidelines at: http://en.wikipedia.org/wiki/Levenshtein_distance
    [Feature, LevenshteinDistance("WordA","WordB")] 
    public int EditDistance { get; set; }

    [Feature] public int WordALength { get { return WordA.Length; } }
    [Feature] public int WordBLength { get { return WordB.Length; } }

    // Where Soundex is computed following the guidelines at: http://en.wikipedia.org/wiki/Soundex
    [Feature, Soundex("WordA")] public string WordASoundex { get; set; }
    [Feature, Soundex("WordB")] public string WordBSoundex { get; set; }  

    [Label] public double SimilarityScore { get; set; }
}

With this syntax we can convert two input features into a third feature extracted by use of the algorithm. The syntax is simply convenience over populating a Feature property with the result of the edit distance algorithm.
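
For reference, the distance such an attribute would compute can be sketched with the standard dynamic-programming formulation, keeping only two rows of the table. `EditDistance.Levenshtein` is a hypothetical helper name, not an existing numl API.

```csharp
using System;

public static class EditDistance
{
    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // previous[j] holds the distance between the first i-1 chars of a
        // and the first j chars of b; current is the row being filled.
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitutionCost = a[i - 1] == b[j - 1] ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + substitutionCost;
                current[j] = Math.Min(Math.Min(insertion, deletion), substitution);
            }

            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }
}
```

With this helper, the EditDistance property could be populated as `EditDistance.Levenshtein(WordA, WordB)` before the object is handed to a descriptor.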

Thoughts?

2 issues with naive bayes - prediction is same every time; and memory exception if more than couple hundred records in training data

I must be doing something horribly wrong. I'm trying to use naive bayes to categorize some data based on input via a spreadsheet. The code runs, but it seems to give me the same category regardless of my input. I've appended some sample code below (gloss over the helper call that reads Excel; that just takes the sheet & converts to a .net dataset). Also, I'm limiting the training data to the first 100 rows. Any more than that and the program gets an out of memory exception!

So I have two questions:

1. How do I get a meaningful prediction/classification?
2. How do I load a much larger training dataset without running out of memory?

    [Serializable]
    public class PeerCategory
    {
        [Feature]
        public string PeerCategoryDesc { get; set; }
        [Label]
        public string CAPeerCategory { get; set; }
    }

    [TestFixture]
    //[Ignore]
    public class BayesTest
    {

        [Test]
        public void TestExcel()
        {
            string pathToFile = @"C:\data\Training Data.xlsx";
            using (var reader = new FileReaderExcel(pathToFile))
            {
                int index = reader.GetSheetIndexByName("V6");
                var dt = reader.GetDataTableFromExcelSheet(index, true);
                var peerCatList = (
                    from DataRow row in dt.Rows
                    select new PeerCategory()
                    {
                        PeerCategoryDesc = row.Field<string>(0), 
                        CAPeerCategory = row.Field<string>(1)
                    }).Take(100).ToList();


                var width = peerCatList.Select(t => t.CAPeerCategory).Distinct().Count();

                IGenerator generator = new NaiveBayesGenerator(width);
                generator.Descriptor = Descriptor.Create<PeerCategory>();
                LearningModel learned = Learner.Learn(peerCatList, 0.8, 1000, generator);

                IModel model = learned.Model;
                double accuracy = learned.Accuracy;


                var value1 = model.Predict(GetItem("paint"));
                var value2 = model.Predict(GetItem("flower"));
                var value3 = model.Predict(GetItem("sofa"));
                var value4 = model.Predict(GetItem("desk"));
                var value5 = model.Predict(GetItem("bones"));
                                // value1 thru 5 will be the same category over and over again!!

            }

        }

        public PeerCategory GetItem(string desc)
        {
            var item = new PeerCategory()
            {
                PeerCategoryDesc = desc,
                CAPeerCategory = string.Empty
            };
            return item;
        }
    }

Unexplained IndexOutOfRangeException

Hi Seth,

Had an idea to do a REALLY simple attempt to learn a function that I would have ordinarily implemented as a switch statement, just for the mind-bending. :-)

The code was written to be run in LinqPad.

void Main()
{
    Assembly.GetAssembly(typeof(Learner)).Dump();

    var gen = new numl.Supervised.NeuralNetwork.NeuralNetworkGenerator();
    gen.Descriptor = Descriptor.Create<WindDirection>();

    var learned = Learner.Learn(WindDirection.TrainingData(), 16/20, 1, gen);

    var model = learned.Model;
    var accuracy = learned.Accuracy.Dump();

    var windDir = new WindDirection(350, null);
    model.Predict(windDir); //Uncomment this if you are running this in LinqPad .Dump("Prediction");
}

// Define other methods and classes here
public class WindDirection {
    [Feature]
    public double Degrees { get; set; }

    [StringLabel()]
    public String Direction { get; set; }

    public WindDirection(double degrees, string direction)
    {
        this.Degrees = degrees;
        this.Direction = direction;
    }

    public static WindDirection[] TrainingData()
    {
        return new[] {
            // Training Values
            new WindDirection(0,     "N"  ),
            new WindDirection(22.5,  "NNE"),
            new WindDirection(45,    "NE" ),
            new WindDirection(67.5,  "ENE"),
            new WindDirection(90,    "E"  ),
            new WindDirection(112.5, "ESE"),
            new WindDirection(135,   "SE" ),
            new WindDirection(157.5, "SSE"),
            new WindDirection(180,   "S"  ),
            new WindDirection(202.5, "SSW"),
            new WindDirection(225,   "SW" ),
            new WindDirection(247.5, "WSW"),
            new WindDirection(270,   "W"  ),
            new WindDirection(292.5, "WNW"),
            new WindDirection(315,   "NW" ),
            new WindDirection(337.5, "NNW"),

            // Testing Values
            new WindDirection(22.5, "NNE"),
            new WindDirection(112.5, "ESE"),
            new WindDirection(11.25, "N"),
            new WindDirection(359-11.25, "N")
        };
    }
}

However, running the above Main function results in the following IndexOutOfRangeException.

   at numl.Model.StringProperty.Convert(Double val) in c:\projects\numl\numl\Model\StringProperty.cs:line 109
   at numl.Learner.GenerateModel(IGenerator generator, Matrix x, Vector y, IEnumerable`1 examples, Double trainingPct) in c:\projects\numl\numl\Learner.cs:line 169
   at numl.Learner.<>c__DisplayClasse.<Learn>b__d(Int32 i) in c:\projects\numl\numl\Learner.cs:line 110
   at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(Object param0)

I have looked at the relevant files here and I can't find the cause of the exception at the relevant lines.

I am using the NuGet 0.8.17.0 build when I am getting this exception.

Have I failed to follow the documentation correctly?

Thoughts?
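
One detail in the snippet above is worth checking: the training percentage is written as 16/20. In C# both operands are integer literals, so the division is done in integer arithmetic and the result is 0, not 0.8. Whether that is what triggers the IndexOutOfRangeException inside numl is an assumption, not something confirmed here. The sketch below only demonstrates the language behaviour; the class and method names are made up for illustration.

```csharp
using System;

public static class TrainingFractionSketch
{
    // Integer literals: 16 / 20 is evaluated as int division (0),
    // and only then widened to double.
    public static double IntegerDivision() => 16 / 20;

    // Making one operand a double forces floating-point division.
    public static double FloatingDivision() => 16.0 / 20;
}
```

Passing 16.0/20 (or simply 0.8) to Learner.Learn would rule this out as a cause.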

Image Processing?

I am totally new to machine learning. I am trying to figure out where to dive in.

My job is to be able to categorize images. Specifically patient labels. I will need to categorize common indicators on the label. (Though not my scenario, a decent example may be patient race: African-American, Caucasian, etc.)

But the image will also have barcodes and other numbers on them that are not the same from image to image (and should be ignored by the system).

To add one more level of complexity, there are many different kinds of patient labels. All of them will have the "race" info on them, but in different fonts and in different places. (And maybe even abbreviated differently.)

Is NuML able to do this kind of thing? If so I will dig in and learn it.

LinearRegression Accuracy Always Zero

While taking a more thorough look into the Linear Regression implementation, I'm seeing that Accuracy tends to report as 0%. Here is the code that is currently being used in (dev branch) Learner.cs:

            // testing            
            object[] test = GetTestExamples(testingSlice, examples);
            double accuracy = 0;

            for (int j = 0; j < test.Length; j++)
            {
                // items under test
                object o = test[j];

                // get truth
                var truth = Ject.Get(o, descriptor.Label.Name);

                // if truth is a string, sanitize
                if (descriptor.Label.Type == typeof(string))
                    truth = StringHelpers.Sanitize(truth.ToString());

                // make prediction
                var features = descriptor.Convert(o, false).ToVector();

                var p = model.Predict(features);
                var pred = descriptor.Label.Convert(p);

                // assess accuracy
                if (truth.Equals(pred))
                    accuracy += 1;
            }

            // get percentage correct
            accuracy /= test.Length;

Then this is consumed later in Learner.Best:

            var q = from m in models
                    where m.Accuracy == (models.Select(s => s.Accuracy).Max())
                    select m;

            return q.FirstOrDefault();

So basically, it iterates through the test slice, makes a prediction for each item, and then assesses the prediction against the truth. But currently there is only one implementation of that assessment: truth.Equals(pred). Learner.Best() then consumes the result by picking the model with the highest (max) value of Accuracy.

This approach means that unless two doubles are exactly equal (not likely except for possibly trivial data), LinearRegression will always produce 0% Accuracy.

I wanted to abstract this out, but I wanted to get thoughts on how to approach this, as there are a lot of possible routes forward.

We could...

1. Pass in more parameters.
   - With or without creating overloads for convenience.
2. Create a simple enum and pass this in as a single parameter.
   - Avoids getting ridiculous with just shoving parameters.
3. Create some kind of TestOption object/hierarchy and pass this in.
   - Current implementation would be a descendant like TruthEqualsPredictionTestOption.
   - This would also be the default to avoid breaking changes.
4. Change Learner singleton implementation from static class to a singleton instance, in which case we could subclass Learner with overrides for different methods.

I personally waver between the TestOption approach and the Learner changes. Each has its pros and cons.

With the TestOption approach, we can easily keep from having breaking changes. But we would then have to change the Learner.Best() method depending on what the options instance is, and we end up with a switch statement, or worse, an if-then-else chain.

With the Learner singleton changes, we could more cleanly address the various capabilities of the Learner class. But this would probably entail breaking changes. I could actually write an ILearnerThing interface that has a default implementation that uses the current static class as-is, and this would avoid breaking changes. However, going forward, we would have a fragmented approach to using the library. Also, this would possibly (probably?) incorporate using DI of some sort which brings along with it more design decisions, i.e. complexity.

So, those are my thoughts. The goal is simply to get some accuracy with LinearRegression and do it in such a way that if we get a good statistician personage (or maybe one of you already is), it gives them easy access to a more robust assessment of accuracy without getting too YAGNI.
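
To make the discussion concrete, here is a minimal sketch of two ways a regression result could be scored instead of truth.Equals(pred). None of this is numl code: RegressionScoreSketch, ToleranceAccuracy and RootMeanSquaredError are hypothetical names, and the tolerance value is something the caller would have to choose.

```csharp
using System;

public static class RegressionScoreSketch
{
    // Fraction of predictions that land within `tolerance` of the truth.
    // Higher is better, so it could still be ranked with Max().
    public static double ToleranceAccuracy(double[] truth, double[] predicted, double tolerance)
    {
        if (truth.Length != predicted.Length)
            throw new ArgumentException("truth and predicted must have the same length");
        if (truth.Length == 0)
            return 0;

        int hits = 0;
        for (int i = 0; i < truth.Length; i++)
            if (Math.Abs(truth[i] - predicted[i]) <= tolerance)
                hits++;

        return (double)hits / truth.Length;
    }

    // Root mean squared error. Lower is better, so a Best()-style
    // selection would need Min() rather than Max().
    public static double RootMeanSquaredError(double[] truth, double[] predicted)
    {
        if (truth.Length != predicted.Length)
            throw new ArgumentException("truth and predicted must have the same length");
        if (truth.Length == 0)
            return 0;

        double sum = 0;
        for (int i = 0; i < truth.Length; i++)
        {
            double diff = truth[i] - predicted[i];
            sum += diff * diff;
        }

        return Math.Sqrt(sum / truth.Length);
    }
}
```

The second method shows why the choice of metric leaks into Learner.Best(): an error measure reverses the ordering, which is the switch-statement concern raised above.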

[Question] Training on big data

Is there any strategy for generating a model from a huge amount of data, or for updating a prebuilt model with new data?

P.S. Thank you so much, I love this project!

How To Make A Prediction

I can create a model successfully by following the sample code. Now I need to make a prediction with my unit data. How can I do this?

xproj can't load in my VS2015

I downloaded your project and tried to open it in VS2015, but I always get an error and the project will not open. Do you have any guidance that can help me open your projects? Thanks.

Please add support for decimal properties

Hi there,

My model consists of decimal properties labeled as [Feature] and [Label], and I'm getting the following exception in Ject.cs. It appears DoubleConverter.CanConvertTo(typeof(decimal)) returns false, causing this exception.

As a workaround I've switched all properties to Double, but I'm wondering if anything can be done about it.

System.InvalidCastException was unhandled by user code
HResult=-2147467262
Message=Cannot convert 20 to Decimal
Source=numl
StackTrace:
at numl.Utils.Ject.Convert(Double val, Type t) in z:\Builds\work\6fc28cb662d1e0f0\numl\Utils\Ject.cs:line 287
at numl.Model.Property.Convert(Double val) in z:\Builds\work\6fc28cb662d1e0f0\numl\Model\Property.cs:line 79
at numl.Supervised.DecisionTree.DecisionTreeGenerator.BuildLeafNode(Double val) in z:\Builds\work\6fc28cb662d1e0f0\numl\Supervised\DecisionTree\DecisionTreeGenerator.cs:line 243
at numl.Supervised.DecisionTree.DecisionTreeGenerator.BuildTree(Matrix x, Vector y, Int32 depth, List`1 used) in z:\Builds\work\6fc28cb662d1e0f0\numl\Supervised\DecisionTree\DecisionTreeGenerator.cs:line 172
at numl.Supervised.DecisionTree.DecisionTreeGenerator.Generate(Matrix x, Vector y) in z:\Builds\work\6fc28cb662d1e0f0\numl\Supervised\DecisionTree\DecisionTreeGenerator.cs:line 91
at numl.Learner.GenerateModel(IGenerator generator, Matrix x, Vector y, IEnumerable`1 examples, Double trainingPct) in z:\Builds\work\6fc28cb662d1e0f0\numl\Learner.cs:line 143
at numl.Learner.<>c__DisplayClasse.b__d(Int32 i) in z:\Builds\work\6fc28cb662d1e0f0\numl\Learner.cs:line 110
at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.b__c()
InnerException:
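
For reference, a double can be turned into a decimal without going through a TypeConverter, because System.Double implements IConvertible. The sketch below is a hypothetical helper, not numl's Ject.Convert, and only shows that System.Convert.ChangeType handles decimal as a target type.

```csharp
using System;
using System.Globalization;

public static class DecimalConvertSketch
{
    // Hypothetical stand-in for a "double to property type" conversion.
    // Convert.ChangeType goes through IConvertible, which supports decimal.
    public static object FromDouble(double val, Type target)
    {
        return Convert.ChangeType(val, target, CultureInfo.InvariantCulture);
    }
}
```

Note that NaN, infinities and values outside the decimal range throw an OverflowException on this path, so a real fix would need to decide how to treat those.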

Predict method allows bonehead null-reference bug

Hi folks,

Great work here!

I've recently run into a null-reference exception on this line due to not providing a proper Descriptor. I realize the requirement is clear when reading the code. It just isn't all that clear when you create a model, load it back from a serialized state, and then discover that you also need a descriptor to go with it.

Anyway, a pre-condition check on the Descriptor would, I think, help clarify things for new folks.

Support coreclr

Started working on this (with a ton of help from @bdschrisk)

Large Integer/Double Error during Prediction

Hi,

I created 2 models ( Decision Tree and Neural Network models) on the classification dataset that I have. The models were created fine but I face an error when doing the prediction with either of these models (using code Model.Predict<Class>Object). The error was due to the conversion of double to int32 for features that have large numbers (10000 or more). I created a class for the data record objects and had specified double for all features. However, the numl code (Ject class) automatically converts double to Int32 for reasons that I could not understand. When looking at the Ject, Property & model classes, I could not comprehend where the type T is specified and why it is specified as Int32. Here are the error message and stack trace. Appreciate your help.

Value was either too large or too small for an Int32.

at System.Convert.ToInt32(Double value)
at System.Double.System.IConvertible.ToInt32(IFormatProvider provider)
at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider)
at System.Convert.ChangeType(Object value, Type conversionType)
at numl.Utils.Ject.Convert(Double val, Type t) in c:\projects\numl\numl\Utils\Ject.cs:line 313
at numl.Model.Property.Convert(Double val) in c:\projects\numl\numl\Model\Property.cs:line 79
at numl.Supervised.Model.Predict(Object o) in c:\projects\numl\numl\Supervised\Model.cs:line 38
at numl.Supervised.Model.Predict[T](T o) in c:\projects\numl\numl\Supervised\Model.cs:line 48
at WebSecurityRatingApp.Default.btnTwitter_Click(Object sender, EventArgs e)
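
The top frames of this trace are plain .NET behaviour: Convert.ToInt32(double) throws an OverflowException when the value does not fit in an Int32. The sketch below reproduces only that part, using a made-up helper name. Why the library ends up with Int32 as the target type is not addressed here. A value of 10000 converts without error in this sketch, so the failing values were presumably far larger.

```csharp
using System;

public static class Int32RangeSketch
{
    // Returns false instead of throwing when the double is outside
    // the Int32 range (roughly -2.1 billion to +2.1 billion).
    public static bool TryToInt32(double value, out int result)
    {
        try
        {
            result = Convert.ToInt32(value);
            return true;
        }
        catch (OverflowException)
        {
            result = 0;
            return false;
        }
    }
}
```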

Async support with cancellation

Hi Seth,

Please add an awaitable Learner.LearnAsync method that accepts a CancellationToken as one of its parameters and supports graceful cancellation. Cancellation doesn't need to be instantaneous.

Thank you.
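
A rough sketch of what such a method might look like, assuming the learner repeats independent training runs in a Parallel.For loop (the stack traces elsewhere on this page suggest it does). Everything here is hypothetical: AsyncLearnSketch, trainOnce and score are illustration names, not numl APIs.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLearnSketch
{
    // Runs `repeat` independent training passes off the calling thread and
    // returns the best-scoring model. Assumes repeat >= 1.
    public static Task<TModel> LearnAsync<TModel>(
        Func<int, TModel> trainOnce,
        Func<TModel, double> score,
        int repeat,
        CancellationToken token)
    {
        return Task.Run(() =>
        {
            var results = new TModel[repeat];
            var options = new ParallelOptions { CancellationToken = token };

            // Parallel.For stops scheduling new iterations and throws
            // OperationCanceledException once the token is cancelled.
            // Iterations already running are allowed to finish.
            Parallel.For(0, repeat, options, i => { results[i] = trainOnce(i); });

            return results.OrderByDescending(score).First();
        }, token);
    }
}
```

This gives the non-instantaneous, graceful cancellation asked for: a pass that is already training completes, but no further passes start.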

Text classification with numl

I am building software to classify texts about health.
I have many texts already classified as "positive" or "negative", and these are my training texts.
The user will enter new texts, and my software will evaluate each one and say whether it is "positive" or "negative".
Today I use Weka and Naïve Bayes for this, but I would like a framework specific to .NET.

So, I have found numl but I have not found a sample about this.

Is it possible?

Thanks

Unpredictable Results Using NaiveBayesGenerator

This example uses the Tennis.cs sample data from the numl.net website:

Tennis[] data = Tennis.GetData();
IGenerator generator = new NaiveBayesGenerator(2);
generator.Descriptor = Descriptor.Create<Tennis>();
LearningModel learned = Learner.Learn(data, 0.80, 1000, generator);
IModel model = learned.Model;
double accuracy = learned.Accuracy;
Tennis t = new Tennis
{
    Outlook = Outlook.Sunny,
    Temperature = Temperature.High,
    Windy = false
};
Tennis predictedVal = model.Predict(t);
DisplayMessage(OutputTextbox, String.Format("Result: {0} (accuracy {1}%)", predictedVal.Play, accuracy * 100));

The result of this is that 3 predictions were true and 5 were false -- the correct response should be false. Here's the output:

7/29/2014 9:21:11 AM: Result: False (accuracy 100%)
7/29/2014 9:21:12 AM: Result: False (accuracy 100%)
7/29/2014 9:21:13 AM: Result: False (accuracy 100%)
7/29/2014 9:21:14 AM: Result: True (accuracy 100%)
7/29/2014 9:21:15 AM: Result: False (accuracy 100%)
7/29/2014 9:21:16 AM: Result: True (accuracy 100%)
7/29/2014 9:21:17 AM: Result: True (accuracy 100%)
7/29/2014 9:21:18 AM: Result: False (accuracy 100%)

All the other supervised generators produce the expected result. I've only been using this library for about an hour though, so hopefully I'm doing something wrong. Thanks!

Edit: correct response should be false, I accidentally wrote true.

Enumerable Label not supported.

Any label that implements IEnumerable throws an exception saying that it needs the EnumerableFeature attribute, but I need a Label, not a Feature.
