Comments (5)
Multiple data formats are available, depending on the problem and the algorithm. What is your problem, and which algorithm are you trying to use?
from smile.
Hello haifengl,
I want to try some algorithms (naive Bayes, logistic regression, random forests) to analyze Twitter posts, so my features will be binary indicators such as "contains the word 'love'".
There is the Bag option here -> https://github.com/haifengl/smile/blob/master/Smile/src/main/java/smile/feature/Bag.java But I'm afraid that the model and each tweet will have most of their features set to 0, so the double[][] will be too large to process.
I have found SparseDataset.java and BinarySparseDataset.java, but I don't understand how to use them with the classifiers.
Thanks
from smile.
For document classification, I suggest using the Maximum Entropy classifier (MaxEnt class, http://haifengl.github.io/smile/doc/index.html). Mathematically, it is equivalent to logistic regression, and our implementation supports sparse data. It takes an integer array per document, in which each element is the index of a nonzero feature. Check out the unit test case for examples.
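To make the "integer array of nonzero feature indices" idea concrete, here is a minimal sketch in plain Java. It does not call Smile at all; the class name `SparseTweetFeatures`, the helpers `buildIndex` and `encode`, and the tiny vocabulary are all hypothetical and only illustrate the encoding. The MaxEnt Javadoc and unit test remain the reference for the exact input the classifier expects.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Smile: turns a tweet into the sorted
// indices of the vocabulary words it contains.
final class SparseTweetFeatures {

    /** Assigns each distinct vocabulary word a column index, in list order. */
    static Map<String, Integer> buildIndex(List<String> vocabulary) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String word : vocabulary) {
            index.putIfAbsent(word, index.size());
        }
        return index;
    }

    /** Returns the ascending, de-duplicated indices of known words in the tweet. */
    static int[] encode(String tweet, Map<String, Integer> index) {
        TreeSet<Integer> active = new TreeSet<>();
        for (String token : tweet.toLowerCase().split("\\W+")) {
            Integer id = index.get(token);
            if (id != null) {
                active.add(id); // words outside the vocabulary are ignored
            }
        }
        return active.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        Map<String, Integer> index =
                buildIndex(Arrays.asList("love", "hate", "coffee", "monday"));
        int[] x = encode("I LOVE coffee, love it!", index);
        System.out.println(Arrays.toString(x)); // prints [0, 2]
    }
}
```

Each tweet then costs memory proportional to the number of vocabulary words it actually contains, rather than to the vocabulary size as a dense double[] row would.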
from smile.
Hmm, I will definitely try that.
I will compare MaxEnt with an implementation that vectorizes the features using the hashing trick.
Thank you very much
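For reference, the hashing trick mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the class `HashedTweetFeatures`, the bucket count, and the use of `String.hashCode()` as the hash function are hypothetical choices, not something taken from Smile. Each token is mapped straight to one of `numBuckets` columns, so no vocabulary has to be built or stored, at the cost of occasional collisions between different words.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical helper, not part of Smile: hashes tokens into a fixed
// number of buckets and returns the sorted indices of the buckets hit.
final class HashedTweetFeatures {

    /** Maps a token to a bucket in [0, numBuckets); floorMod keeps it non-negative. */
    static int bucket(String token, int numBuckets) {
        return Math.floorMod(token.hashCode(), numBuckets);
    }

    /** Returns the ascending, de-duplicated bucket indices for the tweet's tokens. */
    static int[] encode(String tweet, int numBuckets) {
        TreeSet<Integer> active = new TreeSet<>();
        for (String token : tweet.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                active.add(bucket(token, numBuckets));
            }
        }
        return active.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // 1 << 18 buckets is an arbitrary illustrative size.
        int[] x = encode("I love coffee on Monday", 1 << 18);
        System.out.println(Arrays.toString(x));
    }
}
```

The output has the same shape as a vocabulary-based index array (sorted indices of nonzero binary features), so the two encodings can be compared on the same classifier.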
from smile.
A common mistake in NLP is to use every word in the documents as a feature. It is better to do feature selection first and use this (much) smaller set of words as features (with the Bag class as the helper). BTW, tree-based methods (e.g. Random Forest) will be very slow if the number of features is too large.
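As a rough illustration of selecting a smaller word set first, the sketch below keeps the k words that appear in the most documents. This is a deliberately simple stand-in, not the selection method recommended above and not a Smile API; the class `VocabularySelector` and its method are hypothetical. In practice one would usually also drop stop words or rank words with a class-aware score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Smile: picks a small vocabulary by
// document frequency (the number of documents that contain each word).
final class VocabularySelector {

    /** Returns up to k words, most frequent first, ties broken alphabetically. */
    static List<String> topByDocumentFrequency(List<String> documents, int k) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : documents) {
            // A set ensures each word is counted at most once per document.
            for (String token : new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")))) {
                if (!token.isEmpty()) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        List<String> words = new ArrayList<>(df.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(df.get(b), df.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "I love coffee", "love this monday", "coffee and love");
        System.out.println(topByDocumentFrequency(tweets, 2)); // prints [love, coffee]
    }
}
```

The selected words would then serve as the fixed feature list, so each tweet is described by a few hundred or thousand columns instead of the full vocabulary.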
from smile.