The classifier from harthur

Classifier Failing with medium dataset

Hi,

First of all, thank you for a really helpful Naive Bayes classifier library. I am using it in Node.js, but I am also running into the same problem with your library.

I have a dataset of 1212 sets and 294419 wordsets generated by your library. I use socket.io to handle websocket connections from clients. The first classification attempt always returns a result, but every attempt after the first one hangs.

With a smaller amount of data, like the example you posted, it works.

To make it clear, here is the code I used:

io.sockets.on('connection', function(socket) {
    // Classify
    socket.on('classify', function(data) {
        if(typeof data.namespace == 'string' && typeof data.keywords == 'string') {
            console.log('Classifying: ' + data.keywords);
            var start = new Date().getTime();

            bayes.classify(data.keywords, function(cat) {
                console.log('Classified: ' + cat);
                var elapsed = new Date().getTime() - start;

                if(typeof cat == 'string' && cat !== '' && cat != 'unclassified') {
                    var result = {
                        classifyStatus: {
                            code: 200,
                            msg: 'Success',
                            timing: elapsed
                        },
                        text: data.keywords,
                        category: cat
                    };
                    console.log(result);
                    socket.emit('classifyCategory', result);
                }
                else {
                    console.log('Unclassified');
                    socket.emit('classifyCategory', {
                        classifyStatus: {
                            code: 200,
                            msg: 'Cannot classify into any categories',
                            timing: elapsed
                        },
                        text: data.keywords,
                        category: 'unclassified'
                    });
                }
            });
        }
        else {
            console.log('Unclassified - no data from client');
            socket.emit('classifyCategory', {
                classifyStatus: {
                    code: 500,
                    msg: 'Failed',
                    timing: 0
                },
                text: data.keywords,
                category: ''
            });
        }
    });
});
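One way to keep the client from waiting forever while the hang is being investigated is to guard the callback with a timer. The following is only a minimal sketch, assuming the symptom is that the callback passed to `bayes.classify` never fires on later calls. `withTimeout` is a hypothetical helper, not part of the library, and the 5000 ms value in the usage comment is an arbitrary choice.

```javascript
// Hypothetical helper (not part of the classifier library).
// Wraps a callback so that it fires exactly once: either with the real
// result, or with a timeout error if no result arrives within `ms`.
function withTimeout(callback, ms) {
  var done = false;

  var timer = setTimeout(function () {
    if (done) return;
    done = true;
    callback(new Error('Timed out after ' + ms + ' ms'));
  }, ms);

  // This is the function to hand to the asynchronous call.
  return function (result) {
    if (done) return; // ignore late or duplicate results
    done = true;
    clearTimeout(timer);
    callback(null, result);
  };
}

// Usage with the handler above (assumes `bayes` and `socket` exist):
//
// bayes.classify(data.keywords, withTimeout(function (err, cat) {
//   if (err) {
//     socket.emit('classifyCategory', {
//       classifyStatus: { code: 500, msg: err.message, timing: 0 },
//       text: data.keywords,
//       category: ''
//     });
//     return;
//   }
//   // ... handle `cat` as before ...
// }, 5000));
```

This does not fix the underlying hang. It only turns a silent stall into an error that can be logged and reported to the client.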

UTF-8 support

Sadly, the classifier does not support UTF-8 text. The problem lies here:

getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}

doc.split(/\W+/);

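does not seem to work for UTF-8 text. In JavaScript, \W matches every character outside the ASCII set [A-Za-z0-9_], so non-Latin letters are treated as separators.

Here is an example in a Cyrillic-script language (like Russian):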

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead, it should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]

I was looking for a fix, but ended up here:
http://stackoverflow.com/questions/280712/javascript-unicode-regexes
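A possible replacement tokenizer can be sketched with Unicode property escapes, assuming a JavaScript engine that supports them (the `u` flag with `\p{...}`, for example Node.js 10 or later). `getWordsUnicode` is a hypothetical name, and this version uses a `Set` instead of underscore's `uniq`. It is not the library's own code.

```javascript
// Hypothetical Unicode-aware variant of getWords (not the library's code).
// Splits on any run of characters that are not letters, digits, or "_"
// in any script, then removes duplicates while keeping first-seen order.
function getWordsUnicode(doc) {
  if (Array.isArray(doc)) {
    return doc;
  }
  var words = doc
    .split(/[^\p{L}\p{N}_]+/u)
    .filter(function (word) {
      return word.length > 0; // drop empty strings at the edges
    });
  return Array.from(new Set(words));
}

console.log(getWordsUnicode("Надежда за обич еп.36 Тест"));
// [ 'Надежда', 'за', 'обич', 'еп', '36', 'Тест' ]
```

Languages written without spaces between words, such as Chinese, would still come out as one long token with this approach and need a real word segmenter.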
