andreekeberg / ml-classify-text-js

Machine learning based text classification in JavaScript using n-grams and cosine similarity

Home Page: https://www.npmjs.com/package/ml-classify-text

License: MIT License

JavaScript 100.00%
text-classification text-classifier machine-learning cosine-similarity artificial-intelligence similarity n-grams n-gram classification classifier training natural-language-processing library sentiment-analysis labels predictions

ml-classify-text-js's Introduction

📄 ClassifyText (JS)


Use machine learning to classify text using n-grams and cosine similarity.

A minimal library, usable both in the browser and in Node.js, that lets you train a model with a large number of text samples (and corresponding labels), and then use that model to quickly predict one or more appropriate labels for new text samples.

Installation

Using npm

npm install ml-classify-text

Using yarn

yarn add ml-classify-text

Getting started

Import as an ES6 module

import Classifier from 'ml-classify-text'

Import as a CommonJS module

const { Classifier } = require('ml-classify-text')

Basic usage

Setting up a new Classifier instance

const classifier = new Classifier()

Training a model

const positive = [
	'This is great, so cool!',
	'Wow, I love it!',
	'It really is amazing'
]

const negative = [
	'This is really bad',
	'I hate it with a passion',
	'Just terrible!'
]

classifier.train(positive, 'positive')
classifier.train(negative, 'negative')

Getting a prediction

const predictions = classifier.predict('It sure is pretty great!')

if (predictions.length) {
	predictions.forEach((prediction) => {
		console.log(`${prediction.label} (${prediction.confidence})`)
	})
} else {
	console.log('No predictions returned')
}

Returning:

positive (0.5423261445466404)

Advanced usage

Configuration

The following configuration options can be passed either directly to a new Model, or indirectly via the Classifier constructor.

Options

Property     Type                  Default   Description
nGramMin     int                   1         Minimum n-gram size
nGramMax     int                   1         Maximum n-gram size
vocabulary   Array | Set | false   []        Terms mapped to indexes in the model data; set to false to store terms directly in the data entries
data         Object                {}        Key-value store of labels and training data vectors
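For example, the options can be passed straight to the Classifier constructor (the values below are only illustrative):

import Classifier from 'ml-classify-text'

// Use unigrams and bigrams, and skip the vocabulary by storing
// terms directly in the data entries
const classifier = new Classifier({
	nGramMin: 1,
	nGramMax: 2,
	vocabulary: false
})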

Using n-grams

The default behavior is to split up texts by single words (known as a bag of words, or unigrams).

This has a few limitations, since by ignoring the order of words, it's impossible to correctly match phrases and expressions.

This is where n-grams come in: when set to use more than one word per term, they act like a sliding window moving across the text (a continuous sequence of words of the specified length), which can greatly improve the accuracy of predictions.

Example of using n-grams with a size of 2 (bigrams)

const classifier = new Classifier({
	nGramMin: 2,
	nGramMax: 2
})

const tokens = classifier.tokenize('I really dont like it')

console.log(tokens)

Returning:

{
    'i really': 1,
    'really dont': 1,
    'dont like': 1,
    'like it': 1
}

Serializing a model

After training a model on large sets of data, you'll want to store all of it, so that you can set up a new model from the same training data at another time and quickly make predictions.

To do this, simply use the serialize method on your Model, and either save the data structure to a file, send it to a server, or store it in any other way you want.

const model = classifier.model

console.log(model.serialize())

Returning:

{
    nGramMin: 1,
    nGramMax: 1,
    vocabulary: [
    	'this',    'is',      'great',
    	'so',      'cool',    'wow',
    	'i',       'love',    'it',
    	'really',  'amazing', 'bad',
    	'hate',    'with',    'a',
    	'passion', 'just',    'terrible'
    ],
    data: {
        positive: {
            '0': 1, '1': 2, '2': 1,
            '3': 1, '4': 1, '5': 1,
            '6': 1, '7': 1, '8': 2,
            '9': 1, '10': 1
        },
        negative: {
            '0': 1, '1': 1, '6': 1,
            '8': 1, '9': 1, '11': 1,
            '12': 1, '13': 1, '14': 1,
            '15': 1, '16': 1, '17': 1
        }
    }
}
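Since the same structure can be passed back as configuration (see Configuration above), a stored model can later be restored by handing it to a new Classifier. A minimal sketch in Node.js, assuming the serialized data is saved as JSON:

const fs = require('fs')
const { Classifier } = require('ml-classify-text')

// Persist the serialized model to disk
fs.writeFileSync('model.json', JSON.stringify(classifier.model.serialize()))

// Later: restore a classifier from the stored model data
const stored = JSON.parse(fs.readFileSync('model.json', 'utf8'))
const restored = new Classifier(stored)

console.log(restored.predict('It really is amazing'))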

Documentation

Contributing

Read the contribution guidelines.

Changelog

Refer to the changelog for a full history of the project.

License

ClassifyText is licensed under the MIT license.

ml-classify-text-js's People

Contributors

andreekeberg


ml-classify-text-js's Issues

Installation error

Hi,

I was trying to install the module on a Linux server; I was able to install it on a Mac.

npm WARN EBADENGINE }
npm ERR! code 127
npm ERR! path /home/www/test/node_modules/core-js-pure
npm ERR! command failed
npm ERR! command sh -c node -e "try{require('./postinstall')}catch(e){}"
npm ERR! sh: node: command not found

npm ERR! A complete log of this run can be found in:
npm ERR! /root/.npm/_logs/2021-12-04T13_18_28_279Z-debug.log

Support for tokenization of languages without spaces

Need to implement a smarter method of tokenization that takes into account languages that traditionally do not use spaces between words (currently resulting in full-sentence tokens not suitable for the current method of cosine similarity comparisons). A possible pre-segmentation workaround is sketched after the list below.

Some of these languages include:

  • Chinese
  • Japanese
  • Thai
  • Khmer
  • Lao
  • Burmese
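One workaround in the meantime (not part of the library; just a sketch) is to pre-segment such text into space-separated words with the standard Intl.Segmenter API before handing it to the classifier:

import Classifier from 'ml-classify-text'

// Pre-segment Japanese text into space-separated words using
// Intl.Segmenter (available in modern browsers and Node.js 16+)
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' })

const segment = (text) =>
	[...segmenter.segment(text)]
		.filter((part) => part.isWordLike)
		.map((part) => part.segment)
		.join(' ')

const classifier = new Classifier()

classifier.train([segment('これは素晴らしい')], 'positive')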

Add some more usage examples

Create some practical examples and add them to the documentation, such as sentiment analysis, profanity detection, etc.
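A rough sketch of what a profanity detection example could look like, using the documented train and predict methods (the sample texts below are placeholders):

import Classifier from 'ml-classify-text'

const classifier = new Classifier()

// Placeholder training data; a real example would use much larger sample sets
classifier.train(['what utter garbage', 'you are an idiot'], 'profane')
classifier.train(['have a great day', 'thanks for the help'], 'clean')

const [prediction] = classifier.predict('thanks, that was really helpful')

console.log(prediction ? prediction.label : 'no prediction')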

Cannot add vocabulary

(Screenshot from WebStorm omitted.) As the screenshot shows, parsed.vocabulary is set, and I tried to pass it to the classifier both in the constructor and on the model itself.

However, the vocabulary was not set in the model.

Is that fixable?

Is it possible to save training data?

Not really an issue.
For better performance, is there an API to save the training data as a model and re-use it later?

For instance,

classifier.saveAsJson();

classifier.loadModel();

Add ability to define custom getter/setter methods for model data

Add support for passing custom methods to Classifier and Model, defining custom getter/setter methods that allow us to bypass the data object literal currently stored directly as a property on Model instances.

For example:

const { createClient } = require('redis')
const client = createClient()

const classifier = new Classifier({
    getData: async (label) => {
        return await client.get(label || '*')
    },
    setData: async (label, data) => {
        return await client.set(label, data)
    }
})

There will be default getData and setData methods that work like the current version (but with some performance improvements, such as being asynchronous by default and optimizing the way the data object is stored in memory).

From there we can update the train and predict methods in Classifier to call those, instead of the current behaviour of directly accessing this._model.data.

When using the library in this way, the data that still remains in Model (and is accessible via the serialize method) simply serves as general model metadata, and one would need to use the same custom data store every time the model is used later on.

It should also be noted that this needs to wait until the upcoming version, in which all methods have been rewritten to be asynchronous by default, has been released.

Add support for stop words

Add a simple StopWords class and a related stopWords getter and setter in the Model class, which will store a list of words in a Set, and remove these both when calling Classifier.train (after splitWords and before tokenize), and in the same way when calling Classifier.predict.
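Until such support exists, a rough external workaround (the stop word list below is only illustrative) is to filter stop words out of the text before passing it to train and predict:

import Classifier from 'ml-classify-text'

const classifier = new Classifier()

// Illustrative stop word filtering applied before training and prediction
const stopWords = new Set(['this', 'is', 'it', 'a', 'with'])

const removeStopWords = (text) =>
	text
		.split(/\s+/)
		.filter((word) => !stopWords.has(word.toLowerCase()))
		.join(' ')

classifier.train([removeStopWords('This is really bad')], 'negative')
classifier.predict(removeStopWords('It really is amazing'))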

Modernize the project (ESM)

Hi, this is a very nice library. I was just wondering if it's still active, and whether you would like it to be upgraded to fully use ESM, built with esbuild and tested with Vitest, instead of Babel, webpack, and the other outdated build tools? I can make a PR for that.
