
Comments (5)

haifengl commented on April 28, 2024

Multiple data formats are available, depending on the problem and the algorithm. What is your problem, and which algorithm are you trying to use?

from smile.

lfomendes commented on April 28, 2024

Hello haifengl,
I want to try some algorithms (naive Bayes, logistic regression, random forests) to analyze Twitter posts, so my features will be binary indicators such as "contains the word 'love'".
There is the Bag option here -> https://github.com/haifengl/smile/blob/master/Smile/src/main/java/smile/feature/Bag.java But I'm afraid that the model and each tweet will have most of their features set to 0, so the double[][] will be too large to process.

I have found SparseDataset.java and BinarySparseDataset.java, but I don't understand how to use them with the classifiers.

Thanks


haifengl commented on April 28, 2024

For document classification, I suggest using the Maximum Entropy classifier (MaxEnt class, http://haifengl.github.io/smile/doc/index.html). Mathematically, it is equivalent to logistic regression, and our implementation supports sparse data. It takes an integer array as the feature vector, in which each element is the index of a nonzero feature. Check out the unit test case for examples.
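
To illustrate the sparse representation described above, here is a minimal sketch that turns a tweet into an integer array of the indices of its nonzero features. It does not call any Smile API; the class `SparseFeatures` and the methods `buildIndex` and `encode` are hypothetical names made up for this illustration, and the three-word vocabulary is only an example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Smile: builds index-of-nonzero-feature arrays.
public class SparseFeatures {
    /** Assigns each vocabulary word a column index in order of first appearance. */
    public static Map<String, Integer> buildIndex(String... vocabulary) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String word : vocabulary) {
            index.putIfAbsent(word, index.size());
        }
        return index;
    }

    /** Returns the sorted, distinct indices of the vocabulary words present in the text. */
    public static int[] encode(String text, Map<String, Integer> index) {
        TreeSet<Integer> present = new TreeSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            Integer i = index.get(token);
            if (i != null) {
                present.add(i);
            }
        }
        return present.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        Map<String, Integer> index = buildIndex("love", "hate", "coffee");
        int[] x = encode("I love love coffee", index);
        System.out.println(Arrays.toString(x)); // prints [0, 2]
    }
}
```

Each tweet then costs memory proportional to the number of vocabulary words it contains, not to the size of the vocabulary, which is what avoids the large, mostly zero double[][].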


lfomendes commented on April 28, 2024

Hmm, I will definitely try that.
I will compare MaxEnt with an implementation that vectorizes features using the hashing trick.

Thank you very much
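
For readers unfamiliar with the hashing trick mentioned above, a rough sketch follows. It maps each token straight to a bucket index through a hash function, so no vocabulary has to be stored. The class name `HashingTrick`, the use of `String.hashCode()`, and the bucket count are assumptions chosen for illustration, not something specified in this thread, and distinct words can collide in the same bucket.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch of the hashing trick; not taken from Smile.
public class HashingTrick {
    /** Hashes each token into [0, numBuckets) and returns the sorted, distinct buckets. */
    public static int[] encode(String text, int numBuckets) {
        TreeSet<Integer> buckets = new TreeSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            // floorMod keeps the index non-negative even when hashCode() is negative.
            buckets.add(Math.floorMod(token.hashCode(), numBuckets));
        }
        return buckets.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] x = encode("I love coffee", 1 << 18);
        System.out.println(Arrays.toString(x)); // at most three indices, each below 262144
    }
}
```

The output has the same shape as an index-of-nonzero-features array, so the two approaches can be compared on the same classifier input format.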


haifengl commented on April 28, 2024

A common mistake in NLP is to use every word in the documents as a feature. It is better to do feature selection first and use the resulting (much) smaller set of words as features (with the Bag class as the helper). BTW, tree-based methods (e.g. Random Forest) will be very slow if the number of features is too large.
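
As a rough illustration of reducing the vocabulary before building features, the sketch below keeps only the k words that occur in the most documents. Document frequency is a deliberately simple stand-in: the thread does not say which selection method to use, and a real pipeline would likely drop stop words or use a more discriminative score. The class `TopWords` and its `select` method are hypothetical names, and no Smile API is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: select the k words with the highest document frequency.
public class TopWords {
    public static List<String> select(List<String> documents, int k) {
        // Count, for each word, the number of documents that contain it.
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : documents) {
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
            seen.remove(""); // split() can yield an empty token
            for (String word : seen) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        // Sort by descending count; break ties alphabetically for a stable result.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("I love coffee", "I love tea", "I hate mondays");
        System.out.println(select(tweets, 3)); // prints [i, love, coffee]
    }
}
```

Note that the top word here is "i", which shows why plain frequency is a weak criterion. The selected list would then serve as the small vocabulary from which the features are built.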

