
Comments (8)

xiamx avatar xiamx commented on June 14, 2024 4

I think the latest source renamed -verbose to the more descriptive -lrUpdateRate.

from fasttext.

gojomo avatar gojomo commented on June 14, 2024 1

Oddly enough, it seems the 'verbose' parameter may affect how often the learning rate is updated; see:

if (tokenCount > args.verbose) {

So perhaps try a small value there.
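For concreteness, a hedged sketch of what that could look like on the command line (train.txt and model are placeholder names, and the values are illustrative only): on the source version quoted above you would pass a small -verbose value, while on builds with the rename mentioned earlier the same interval is exposed as -lrUpdateRate.

./fasttext supervised -input train.txt -output model -verbose 1
./fasttext supervised -input train.txt -output model -lrUpdateRate 50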

from fasttext.

havardl avatar havardl commented on June 14, 2024 1

I have quite a small binary dataset, with around 400 texts for each of the two classes (fewer than 900 in total). I was able to increase precision and recall from around 0.53 to 0.64 by playing around with the different parameters. The one that had the most effect was -lrUpdateRate, with a setting of 150000-200000. Bucket needed to be above 100000, but increasing it beyond that had little effect.
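For reference, a command along those lines (train.txt and model are placeholder names; the values are just the ranges mentioned above) would look roughly like:

./fasttext supervised -input train.txt -output model -lrUpdateRate 150000 -bucket 100000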

Any ideas as to why fastText is performing so poorly on this sample? Running a plain Naive Bayes classifier on the same sample gives between 0.86 and 0.89 accuracy with different normalization methods.

from fasttext.

alexbeletsky avatar alexbeletsky commented on June 14, 2024

I have the same question. My dataset is about 5x bigger, but I still get quite poor results (P@1 = 0.37). It could also be related to the quality of my dataset, though it would be interesting to know the answer.

@gojomo, very interesting about verbose... is this an issue?

from fasttext.

lukewendling avatar lukewendling commented on June 14, 2024

Same problem; I'll add a use case to move the discussion along:

I want to use FT to classify questions from users in a chatbot app. Input is like "I want to sign up", "How do I get a login", "How do I get started?". The chatbot will eventually be able to classify many types of user input, but until I've collected actual questions from users, I want to seed the "signup" class of questions with a small number (<100) of inputs that I make up, so that my app knows "this is a signup request".

Problem:
With default settings for the FT trainer on very few (but closely related, e.g. all containing the word 'signup') observations, the predictions are not helpful: with 3 classes across 100 examples, I get probabilities of ~33% no matter what the input is, including gibberish input ("abc123").
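For anyone reproducing this, a quick sketch of how to inspect those probabilities from stdin (model.bin is a placeholder for the trained model, and 3 is the number of labels to return):

echo "abc123" | ./fasttext predict-prob model.bin - 3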

Perhaps FT is inherently a bad choice for tiny datasets and early-stage deployments like this. It's such a great tool for larger datasets that I was hoping to get it integrated into the app early on.

from fasttext.

cpuhrsch avatar cpuhrsch commented on June 14, 2024

Hello @bratao ,

Thank you for your post. You might find more support for this kind of issue on one of our community boards. In particular, the Facebook group has a lot of members, many of whom are ML experts keen to discuss applications of this library.

Specifically, there are the following:

Facebook group
Stack Overflow tag
Google group

If you do decide to move this to one of our other community boards, please consider closing this issue.

Thanks,
Christian

from fasttext.

matanox avatar matanox commented on June 14, 2024

Have you used a pre-trained embedding when training your classifier? You should typically get good results with this amount of supervised training data if you do.
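As a rough sketch of that setup (train.txt, model, and wiki.en.vec are placeholders for your own files; -dim must match the dimensionality of the pre-trained vectors, e.g. 300 for the published Wikipedia vectors):

./fasttext supervised -input train.txt -output model -pretrainedVectors wiki.en.vec -dim 300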

from fasttext.

CharlesCCC avatar CharlesCCC commented on June 14, 2024

Is there any update on this issue?
I'm also experiencing this problem with a small dataset; the precision doesn't even get close to 60%.

I saw people suggesting parameter tweaks, and I tried most of the suggestions, but it didn't help much; the value fluctuates between 50% and 53%.


To improve the performance of fastText on small datasets, the learning rate should be increased (for example, use -lr 0.5) as well as the number of epochs (for example, use -epoch 20). You can also decrease the number of buckets (for example, use -bucket 100000), to reduce the model size. A good starting point is something like:
./fasttext supervised -input TRAIN.txt -output MODEL -dim 10 -lr 0.5 -epoch 20 -bucket 100000
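Building on that starting point, word bigrams often help with short texts; a hedged variant (same placeholder file names, values as suggested above) might be:

./fasttext supervised -input TRAIN.txt -output MODEL -dim 10 -lr 0.5 -epoch 20 -bucket 100000 -wordNgrams 2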

from fasttext.
