amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply

License: Apache License 2.0

Python 18.58% Jupyter Notebook 81.42%
deep-learning machine-learning tensorflow keras python nlp computer-vision graph-neural-networks tabular-data

ktrain's People

Contributors

00001h, amaiya, chicodelarosa, hunaidkhan2000, ilos-vigil, jgraham909, lambdaofgod, logeshb, lutich, niekvdplas, nrhodes, reluxingzeng, sanidhya-singh, textprobe, timomulder, vochicong


ktrain's Issues

Binary Classification with BERT

I was able to successfully train a binary classifier using BERT, but the saved model files turned out to be around 3.5 GB. Is this expected?
When I run inference with these models, I briefly see "Epoch 1/1" as a message. Am I missing any code that should be run before applying the model to new data?

Getting an error while executing this code

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    'C:\Users\anand.nataraj\Downloads\aclImdb',
    maxlen=500,
    preprocess_mode='bert',
    train_test_names=['train', 'test'],
    classes=['pos', 'neg'])

ERROR:

File "", line 1
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder('C:\Users\anand.nataraj\Downloads\aclImdb',
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
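For reference, this SyntaxError comes from Python itself rather than ktrain: in a plain string literal, \U starts a unicode escape, so a Windows path needs a raw string, escaped backslashes, or forward slashes (a minimal illustration of the three options):

    # Any of these forms avoid the "unicodeescape" SyntaxError on Windows paths:
    path = r'C:\Users\anand.nataraj\Downloads\aclImdb'      # raw string
    path = 'C:\\Users\\anand.nataraj\\Downloads\\aclImdb'   # escaped backslashes
    path = 'C:/Users/anand.nataraj/Downloads/aclImdb'       # forward slashes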

Below is the dataset that I'm loading:

dataset file structure: http://prntscr.com/qng8k1

download link: https://ai.stanford.edu/~amaas/data/sentiment/


Please let me know whether the dataset and file structure I'm using are correct.

text.texts_from_df does not read test data

Issue: the training data is read as the test data.

    if val_df is not None:
        test = val_df
        x_test = train[text_column].fillna('fillna').values
        y_test = train[label_columns].values
        x_train = x
        y_train = y

should be

    if val_df is not None:
        test = val_df
        x_test = test[text_column].fillna('fillna').values
        y_test = test[label_columns].values
        x_train = x
        y_train = y

Random State while splitting

Hey there,

I think the user should be able to set a random state, when for example, loading texts from folder for pre-processing. By default, if no test folder is given, 10% of the training set is taken as validation set. But this is done differently every time, i.e. the data are taken randomly.

This is a problem when you want to compare different models, because the data are not split the same way. I see you use load_files and train_test_split, both from sklearn; both accept a random_state argument according to their documentation.

It would be great if we could pass it too. I can make a PR if you want.
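For context, this is the scikit-learn behaviour the request builds on: with a fixed random_state, repeated splits are identical. A standalone sketch, independent of ktrain's own loaders:

    from sklearn.model_selection import train_test_split

    texts = ['good movie', 'bad movie', 'great plot', 'terrible acting']
    labels = [1, 0, 1, 0]

    # With random_state fixed, the same 10% validation split is produced on every run.
    x_train, x_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.1, random_state=42)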

predictor.explain() only in Jupyter-Notebooks

Hello and thanks for the amazing work.

I found that the predictor.explain() method only works for me in Jupyter notebooks and not in standard Python scripts with console or IDE output (which I would expect to behave the same). Is this by design, or am I missing something? If not, please make it work or give me a hint.

Regards,
caesar

AttributeError on 'BERTPreprocessor'

Hey @amaiya

There is no issue with loading and preprocessing the data using texts_from_csv and texts_from_df, but after that step, when I load the pretrained BERT model and pass preproc to it, I hit this issue:

AttributeError: 'BERTPreprocessor' object has no attribute 'max_features'

I did set max_features in the data-loading methods, and I have inspected the returned preproc object: there is no 'max_features' attribute.

Allow loading of downloaded BERT

Small enhancement recommendation: allow loading the BERT embeddings from an already downloaded zip file, so that the package doesn't always download its own copy.

load predictor issue

Hi Amaiya,

I saved the predictor with predictor.save in Colab, then I ran ktrain.load_predictor on my computer to load the predictor, and I get the following error:

Call to keras.models.load_model failed. Try using the learner.model.save_weights and learner.model.load_weights instead.
Error was:
Traceback (most recent call last):
File "Bert_Classifier_WS.py", line 16, in
predictor = ktrain.load_predictor(root_path + 'pred_bert_26.h5')
File "C:\Users\a2c-admin\AppData\Roaming\Python\Python37\site-packages\ktrain\core.py", line 1199, in load_predictor
raise ValueError('model must be of instance Model')
ValueError: model must be of instance Model

I copied to my computer the two files that I obtained from predictor.save in Colab.
Any ideas about why I get this error?

Thank you in advance.
Emanuele

get_predictor().predict() returns class label name only

On a trained BERT model, when I run the code below:

Input:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict(data_df["para"][0])

Output:
'target'

This does not match the documentation and the examples shown, so I'm assuming this is an issue.

DistilBert / XLM

Hi,

Thanks for this great library. Coming from the FastAI course, it looks so familiar and easy to use! Great job. I was wondering if you could add some new models, especially the DistilBert one from HuggingFace that seems lighter and 60% faster to train than BERT while keeping 97% of BERT's power according to their paper.

XLM from Facebook also seems to be a great cross-lingual model: https://github.com/facebookresearch/XLM#ii-cross-lingual-language-model-pretraining-xlm

Thank you!

newbie question: slow training on Colab, possibly due to giant training set

First of all: thank you, this lib makes getting your hands dirty much faster.

Second:

I managed to get everything running in Colab with a dataset I've created for sentiment analysis. I have quite a lot of data, so I thought "why not throw a lot at BERT?" Then:

begin training using triangular learning rate policy with max lr of 2e-05...
Train on 195000 samples, validate on 76263 samples
Epoch 1/3
   416/195000 [..............................] - ETA: 51:10:26 - loss: 1.6178 - acc: 0.2404

As you can see, I am training on 195,000 items and validating on 76,263 items, and the grand question now is: that's too much data, right? Because I don't like the ETA up there: 51 hours!

And that is 51 hours on Google Colab with GPU support.

Expected different Shape in TextClassification-Model

Hey,
I tried your library with the comments example for training BERT text classification and it worked great. Now I wanted to check it on my own dataset. I provide the same folder structure with train/test classes and can generate the training and test data, using the multilingual model, which includes German, the language the corpus is in.

learner = ktrain.get_learner(text.text_classifier('bert', (x_train, y_train), preproc=preproc), 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=5)
-> Is Multi-Label? False
-> maxlen is 500
-> done.
learner.fit_onecycle(2e-5, 1)
-> ValueError: Error when checking target: expected dense_5 to have shape (1998,) but got array with shape (1639,)

When I try executing the same thing again, it complains about a different layer having a shape mismatch each time. Can you point me to what I might have done wrong?

Unable to use fit_onecycle on a model I compiled due to optimizer

I'm trying to experiment with the onecycle policy of the (AMAZING) ktrain library.
I took the old cats vs dogs via augmentation from keras blog
https://gist.github.com/fchollet/0830affa1f7f19fd47b06d4cf89ed44d, and I want to see how the training policy changes the results.

However, learner.fit_onecycle returns the error 'RMSprop' object has no attribute 'beta_1'.
How should I compile my model, to allow onecycle to do its magic?
Is there a way to allow it to override the preexisting optimizer compiled into the model?

It is not possible to compile without an optimizer or to use the ktrain fit_onecycle without first compiling the model.
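One workaround, offered as an assumption rather than a confirmed fix: the error suggests the onecycle momentum schedule looks for a beta_1 attribute on the optimizer, so recompiling the gist's model with Adam (which exposes beta_1/beta_2) should let it run. The names model, train_generator, and validation_generator refer to the objects built in the gist:

    from tensorflow.keras.optimizers import Adam

    # Recompile the already-built model with Adam instead of RMSprop.
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(),        # fit_onecycle supplies the learning-rate schedule
                  metrics=['accuracy'])

    learner = ktrain.get_learner(model, train_data=train_generator,
                                 val_data=validation_generator)
    learner.fit_onecycle(2e-4, 5)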

Many thanks :)

keras_bert's tokenizer breaks some unicode chars while lowercasing

Hello! I found out that the lowercasing procedure within Tokenizer._tokenize in keras_bert (keras_bert/tokenizer.py line 103: text = unicodedata.normalize('NFD', text)) breaks some national characters that exist in the BERT vocab. For example, the Russian char й gets translated to и. This causes some Russian words that exist in the vocabulary to be broken into the wrong syllables.
This can be turned off by passing the cased=True parameter to the Tokenizer class. So I suggest adding a "cased" parameter to BERTPreprocessor.__init__ and passing it through to BERT_Tokenizer:

--- preprocessor.py	2020-01-17 20:55:01.005621433 +0300
+++ preprocessor.py.new	2020-01-17 20:57:09.279613894 +0300
@@ -586,3 +586,3 @@
     def __init__(self, maxlen, max_features, classes=[],
-                 lang='en', ngram_range=1, multilabel=False):
+                 lang='en', ngram_range=1, multilabel=False, cased=False):
@@ -597,3 +597,3 @@
         token_dict[token] = len(token_dict)
-        tokenizer = BERT_Tokenizer(token_dict)
+        tokenizer = BERT_Tokenizer(token_dict, cased=cased)
         self.tok = tokenizer

and to texts_from_array in data.py, to pass it further to BERTPreprocessor:

--- data.py 2020-01-17 21:15:23.258549536 +0300
+++ data.py.new 2020-01-17 21:14:42.000000000 +0300
@@ -262,3 +262,3 @@
random_state=None,
- verbose=1):
+ verbose=1, cased=False):
"""
@@ -310,3 +310,3 @@
classes = class_names,
- lang=lang, ngram_range=ngram_range)
+ lang=lang, ngram_range=ngram_range, cased=cased)
trn = preproc.preprocess_train(x_train, y_train, verbose=verbose)
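To illustrate the underlying behaviour: NFD normalization decomposes й into и plus a combining breve, and the subsequent accent stripping drops the breve, so the token no longer matches the vocabulary entry (a standalone demonstration, independent of keras_bert):

    import unicodedata

    text = 'й'
    decomposed = unicodedata.normalize('NFD', text)   # 'и' followed by combining breve (U+0306)
    stripped = ''.join(ch for ch in decomposed
                       if unicodedata.category(ch) != 'Mn')  # strip combining marks
    print(stripped)   # prints 'и'; the original letter is lost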

Cannot get learner from iterator zip object

get_learner fails when the training data is a zip of iterators such as when it is used for image segmentation tasks (while augmenting images and masks together).

EDIT:

It works by hacking together a custom Iterator class, but it's not a particularly elegant hack...

image_gen and mask_gen below are keras.preprocessing.image.ImageDataGenerator.flow_from_directory() objects.

class Iterator():
    """Pairs an image iterator and a mask iterator so they can be consumed together."""

    def __init__(self, image_gen, mask_gen):
        self.image_gen = image_gen
        self.mask_gen = mask_gen
        # Expose the attributes that ktrain/Keras expect to find on a data iterator.
        self.batch_size = image_gen.batch_size
        self.target_size = image_gen.target_size
        self.color_mode = image_gen.color_mode
        self.class_mode = image_gen.class_mode
        self.n = image_gen.n
        self.seed = image_gen.seed
        self.total_batches_seen = image_gen.total_batches_seen

    def __iter__(self):
        return self

    def __next__(self):
        # Advance both iterators in lockstep and return (image_batch, mask_batch).
        return next(self.image_gen), next(self.mask_gen)

    def __getitem__(self, key):
        return self.image_gen[key], self.mask_gen[key]

Any ideas how we could make this more elegant?
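One possibly cleaner alternative, sketched under the assumption that the learner also accepts a keras.utils.Sequence (PairedSequence is a hypothetical name): wrap the two directory iterators in a Sequence so length and batch indexing are handled explicitly.

    from tensorflow.keras.utils import Sequence

    class PairedSequence(Sequence):
        """Yields (image_batch, mask_batch) pairs from two flow_from_directory iterators."""

        def __init__(self, image_gen, mask_gen):
            self.image_gen = image_gen
            self.mask_gen = mask_gen

        def __len__(self):
            # Both iterators are assumed to produce the same number of batches.
            return len(self.image_gen)

        def __getitem__(self, idx):
            # flow_from_directory iterators support indexing by batch.
            return self.image_gen[idx], self.mask_gen[idx]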

Selection of a specific BERT pretrained model

Hi Amaiya,
it's a great project! Congrats!
However, I'd like to specify which specific BERT pretrained model to use.
For example, I'd like to try the latest Multilingual Cased model, but I don't see in the documentation how to do that.
I guess it might be an option in this call:
txt.text_classifier('bert', (x_train, y_train))

Thank you in advance.
Cheers,
Emanuele

Regarding Deployment on Flask

Hi, I have an issue regarding deployment: I am not able to deploy a ktrain multi-class text classification model. I tried to load the model and the .preproc file, but it does not work.

OOM when allocating error

Hello. I'm trying to run the bert text classification model.

(X_train, y_train), (X_test, y_test), preproc = text.texts_from_array(
    x_train=X_train_orig, y_train=y_train_orig,
    x_test=X_test_orig, y_test=y_test_orig,
    class_names=target_names, preprocess_mode='bert',
    lang='he', ngram_range=1, maxlen=350, max_features=35000)

model = text.text_classifier('bert', train_data=(X_train, y_train), preproc=preproc)

learner = ktrain.get_learner(model, train_data=(X_train, y_train), batch_size=6)

learner.autofit(2e-5, 5)

learner.validate(val_data=(X_test, y_test))

I'm receiving the following error in a loop (it doesn't stop):
2020-01-07 20:35:10.067077: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at random_op.cc:76 : Resource exhausted: OOM when allocating tensor with shape[768,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I'm using a Tesla K80 GPU on Azure, 11 GB (the machine type recommended for fast.ai).

This is the log output that appears when the run starts:


using Keras version: 2.2.4-tf
2020-01-07 20:41:07.554584: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-07 20:41:07.562014: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-01-07 20:41:07.652799: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e265486bb0 executing computations on platform CUDA. Devices:
2020-01-07 20:41:07.652839: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-01-07 20:41:07.655415: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2020-01-07 20:41:07.656208: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e2654f8820 executing computations on platform Host. Devices:
2020-01-07 20:41:07.656234: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-01-07 20:41:07.656711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: b38a:00:00.0
2020-01-07 20:41:07.656949: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-01-07 20:41:07.658177: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-01-07 20:41:07.659292: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-01-07 20:41:07.659651: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-01-07 20:41:07.661072: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-01-07 20:41:07.662144: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-01-07 20:41:07.665315: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-07 20:41:07.666074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-01-07 20:41:07.666341: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-01-07 20:41:07.669069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-07 20:41:07.669100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2020-01-07 20:41:07.669115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2020-01-07 20:41:07.670062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10802 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: b38a:00:00.0, compute capability: 3.7)

How can I solve it?

Thanks
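Not a confirmed fix, just the usual first mitigations for GPU out-of-memory errors with BERT on an 11 GB card: reduce the sequence length and the batch size, both of which directly shrink the activation memory. A sketch reusing the calls above with smaller values:

    (X_train, y_train), (X_test, y_test), preproc = text.texts_from_array(
        x_train=X_train_orig, y_train=y_train_orig,
        x_test=X_test_orig, y_test=y_test_orig,
        class_names=target_names, preprocess_mode='bert',
        lang='he', ngram_range=1, maxlen=128, max_features=35000)   # maxlen reduced from 350

    model = text.text_classifier('bert', train_data=(X_train, y_train), preproc=preproc)
    learner = ktrain.get_learner(model, train_data=(X_train, y_train), batch_size=3)  # batch_size reduced from 6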

Language preprocessing with Hugging Face transformers

Your last release seems to fix my previous issue #51

However, now that I've upgraded to the latest version, I'm getting this error:

train_data = t.preprocess_train(x_train, y_train)

preprocessor.py in detect_lang(texts, sample_size)
    335             continue
    336     if len(lst) == 0:
--> 337         raise Exception('could not detect language in random sample of %s docs.'  % (sample_size))
    338     return max(set(lst), key=lst.count)
    339 

Exception: could not detect language in random sample of 32 docs.

SyntaxError when importing ktrain

Python version: 3.5
OS: Linux 16.04
Installation by: pip3

When I run:
import ktrain

it raises the error:

Traceback (most recent call last):
  File "bert.py", line 9, in <module>
    import ktrain
  File "/usr/local/lib/python3.5/dist-packages/ktrain/__init__.py", line 2, in <module>
    from .core import *
  File "/usr/local/lib/python3.5/dist-packages/ktrain/core.py", line 1, in <module>
    from .imports import *
  File "/usr/local/lib/python3.5/dist-packages/ktrain/imports.py", line 30, in <module>
    from fastprogress import master_bar, progress_bar 
  File "/usr/local/lib/python3.5/dist-packages/fastprogress/__init__.py", line 1, in <module>
    from .fastprogress import master_bar, progress_bar, force_console_behavior
  File "/usr/local/lib/python3.5/dist-packages/fastprogress/fastprogress.py", line 42
    if h!= 0: return f'{h}:{m:02d}:{s:02d}'
                                          ^
SyntaxError: invalid syntax

download "uncased_L-12_H-768_A-12.zip" behind a firewall

Hi, I would like to try your package; however, I have to work behind a firewall, so I cannot run the "texts_from_folder" function for that reason. I have tried downloading the archive, unzipping it, and putting it into the working folder, but I still have this issue.

Could you please let me know where I should put the folder or the specific file?

ktrain import fails due to failure to import master_bar from fastprogress

I installed ktrain on my GPU server (running Ubuntu 18.04.2).
Trying to import ktrain failed:

>>> import ktrain
using Keras version: 2.2.4-tf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/ktrain/__init__.py", line 2, in <module>
    from .imports import *
  File "/usr/local/lib/python3.6/dist-packages/ktrain/imports.py", line 173, in <module>
    from fastprogress import master_bar, progress_bar
ImportError: cannot import name 'master_bar'

Reinstalling fastprogress did not solve the issue; indeed, import fastprogress itself does not fail.

RE: tok_dct method on StandardTextPreprocessor(TextPreprocessor) class

Arun, firstly I want to say Happy New Year! Secondly, you have done an excellent job with ktrain and provided TensorFlow users with some very useful and easy-to-use new tools!

In relation to the preproc.tok_dct method of the StandardTextPreprocessor class, I always get an empty dictionary even though I have "fitted" the data. How can I get the vocabulary that was created? Have I misinterpreted the method? preproc.max_features and preproc.tok work fine. Here is what I did:

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=texts,
    y_train=test,
    val_pct=0.3,
    class_names=names,
    preprocess_mode='standard',
    maxlen=200,
    max_features=2000,
    random_state=1)

preproc.max_features --> this returns 2000
preproc.tok_dct --> this returns {}
preproc.tok --> this returns <keras_preprocessing.text.Tokenizer at 0x7f8a77f58cf8>

BERT / ELMO embeddings for NER

When are pretrained BERT and ELMo embeddings for NER planned?
Could you help me with the development process? I could try to contribute these features.

Integration with tensorboard

Is integration with TensorBoard supported? I tried passing callbacks to autofit() and ran into a "'function' object has no attribute 'fetch_callbacks'" error.
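If autofit() forwards a callbacks list to the underlying Keras fit call (an assumption based on the parameter mentioned above, not a confirmed API), the intended usage would look roughly like this, with ./logs as a hypothetical log directory:

    from tensorflow.keras.callbacks import TensorBoard

    # Hypothetical: pass a standard Keras TensorBoard callback through autofit().
    tb = TensorBoard(log_dir='./logs')
    learner.autofit(2e-5, 5, callbacks=[tb])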

Serving a trained model

I have a ktrain text classifier based on BERT. What would be the right way to go about saving the model for serving?

slow training

Why is the training so slow, even if I increase the batch size? It took me more than 4 hours and was still running, then it stopped without completing all the epochs.

Update BERT IMDB example in README to work with latest changes

I recommend changing the following line in "Example: Text Classification of IMDb Movie Reviews Using BERT":

model = txt.text_classifier('bert', (x_train, y_train))

->

model = txt.text_classifier('bert', (x_train, y_train), preproc)

Thanks for a great repo! Best

Multilingual model from HuggingFace

Congratulations on this awesome, easy-to-use library.

I am using Greek text to test the multilingual BERT model from Hugging Face.

In the preprocessing method for the train and test data:

train_data = t.preprocess_train(x_train, y_train)

I am getting an output:

preprocessing train... language: en

It is a bit misleading that the language identified is 'English'.

I think the BERT tokenizer handles the preprocessing and does not need to know the input language; the language detection is only there for the automatic model selection (English / multilingual).
Is that correct?
If so, the detected language does not need to be printed out when using Hugging Face's multilingual model?

Can't get 'bert' model to run

clean_sample (1).zip

Trying to use BERT in the ktrain.text.text_classifier results in the following error:

ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[ 0, 0, 0, ..., 73, 64, 5783], [ 0, 0, 0, ..., 14, 183, 1045], [ 0, 0, 0, ..., 1504, 196, 235], ..., [ 0, 0, 0, ..., 12, ...

My code works fine with both fasttext and nbsvm.

The last line results in the error; attached is the CSV I am reading from.

NUM_WORDS = 50000
MAXLEN = 800
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,
                      'unfil', label_columns=list_,
                      val_filepath=None, # if None, 10% of data will be used for validation
                      max_features=NUM_WORDS, maxlen=MAXLEN,
                      ngram_range=1)

model = text.text_classifier('bert', (np.array(x_train), np.array(y_train)), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))

learner.lr_find()
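For context, the BERT classifier expects two input arrays (token ids and segment ids), and those are only produced when the text is loaded with preprocess_mode='bert'; the texts_from_csv call above uses the default 'standard' mode, which yields a single array. A hedged sketch of the loading step under that assumption (note that BERT also caps sequence length at 512, so MAXLEN=800 would need lowering):

    (x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(
        DATA_PATH, 'unfil', label_columns=list_,
        val_filepath=None,          # if None, 10% of data is used for validation
        maxlen=512,                 # BERT's maximum sequence length
        preprocess_mode='bert')     # produces the two arrays the BERT model expects

    model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)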

Pandas 1.0

Is there a reason you are checking for Pandas < 1.0 in setup.py?
With Pandas 1.0 it throws an error:
ERROR: ktrain 0.9.0 has requirement pandas<1.0, but you'll have pandas 1.0.0 which is incompatible.
Thanks

Thank you

Not an issue (apologies), but I think this is FANTASTIC, particularly as it abstracts away some relatively new methods behind such a simple API. Often there's pressure to test state-of-the-art methods for business use cases, but having to write or even implement relatively low-level code takes a lot of time (and is hard!), so this is brilliant as a tool for prototyping and testing a hypothesis. I particularly like the idea of formatting input data to a convention, as reformatting input data is often the hardest and most tedious part of creating models.

Thank you so much for this; I would be keen to contribute in any way that might help, if you accept donations etc. It's a shame that the Keras library itself seems to be so far behind the cutting edge, and this is a brilliant tool for augmenting it.

Is there anywhere I could find details of the API itself? Perhaps I could help by documenting it?

Multilingual Bert

That's a great project!
However, it is not clear which BERT pre-trained model version is loaded.
Is there a way to specify the multilingual version, or to choose between Base/Large and Cased/Uncased?

For Chinese, just "bert" is specified here in both the preprocessor and the text_classifier:
https://github.com/amaiya/ktrain/blob/master/examples/text/ChineseHotelReviews-BERT.ipynb

I'm interested in using the "BERT-Base Multilingual Cased" model from here: https://github.com/google-research/bert
Thank you.
Regards,
Emanuele

Predictor values are always the name of the categories column

Sorry if this is too basic a question; I'm trying to learn about text classification and I've been dealing with this issue for a few days now.
I'm following the example you provided:
https://github.com/amaiya/ktrain/blob/master/tutorial-04-text-classification.ipynb

The only difference is that I have my data in CSV format and I use the pandas library to load it:

df = pd.read_csv('Historical_EdD_EN-ES.csv', low_memory=False, encoding='utf-8')
df['labels'] = pd.cut(df['RelEditDistance_NOFT'], [0, 0.15, 1], labels=[1, 0], include_lowest=True)
df = df[['Source_NOFT', 'labels']]

So I create the categories (1 and 0) based on the 'RelEditDistance_NOFT' column values: anything below 0.15 is category 1 and the rest is 0. The 'Source_NOFT' column contains the text sentences.

df.head()

Source_NOFT | labels
View the help system for the Adaptive Keys | 0
Zoom in or zoom out | 0
If the slots are faulty, seek technical assist... | 1
Installing a storage drive into drive bay 3 | 0
Lenovo ThinkServer RD650 server (data node) | 1

Then I follow your code to create the model:

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(
    train_df=df, text_column='Source_NOFT', label_columns=['labels'],
    max_features=MAX_FEATURES, maxlen=MAX_LENGTH,
    ngram_range=3, preprocess_mode='standard')

model = text.text_classifier('nbsvm', (x_train, y_train), preproc=preproc)

learner:

learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))
learner.fit(0.001, 3, cycle_len=1, cycle_mult=2)

and predictor:

predictor = ktrain.get_predictor(learner.model, preproc)

Here's where the problem arises:

data = [ 'This movie was horrible! The plot was boring. Acting was okay, though.',
'The film really sucked. I want my money back.',
'What a beautiful romantic comedy. 10/10 would see again!']
predictor.predict(data)

result:
['labels', 'labels', 'labels']

I was expecting to get the actual category values, 1 or 0.

Could you please give me a hint about what could be happening?
Is there something I'm doing wrong?

Thanks a lot in advance,
Fran.
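One thing that may explain this, offered as an assumption rather than a confirmed diagnosis: the predictor's class names appear to come from the label column names, so a single column called 'labels' yields a single class named 'labels'. Splitting the label into one 0/1 column per category (pd.get_dummies below; the 'class_' prefix is arbitrary) would give the predictor distinct class names:

    import pandas as pd

    # Hypothetical reshaping: one indicator column per category instead of a single 'labels' column.
    dummies = pd.get_dummies(df['labels'], prefix='class')       # one column per category, e.g. class_0 / class_1
    df = pd.concat([df[['Source_NOFT']], dummies], axis=1)

    (x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(
        train_df=df, text_column='Source_NOFT',
        label_columns=list(dummies.columns),
        max_features=MAX_FEATURES, maxlen=MAX_LENGTH,
        ngram_range=3, preprocess_mode='standard')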

Parameter "activation" is ignored in _build_bert function.

I suggest it should be like this:

--- models.py.org	2019-11-14 13:11:35.732456424 +0000
+++ models.py	2019-11-14 13:11:31.452468006 +0000
@@ -216,7 +216,7 @@ def _build_bert(x_train, y_train, num_cl
                       seq_len=maxlen)
     inputs = model.inputs[:2]
     dense = model.get_layer('NSP-Dense').output
-    outputs = Dense(units=num_classes, activation='softmax')(dense)
+    outputs = Dense(units=num_classes, activation=activation)(dense)
     model = Model(inputs, outputs)
     model.compile(loss=loss_func,
                   optimizer=U.DEFAULT_OPT,

Need TF upgraded to 2.0

AttributeError: module 'tensorflow' has no attribute 'placeholder'

I am assuming this error occurs because ktrain is tied to TF < 2.0?

BI-GRU Embeddings Training

While trying to build a binary text classifier with a very small dataset (fewer than 2k files), I was wondering which model would fit best, in terms of accuracy of course, but also in training time and inference time.

I basically tested everything here, except BERT, because I don't have a GPU right now (I can't use cloud services due to confidential data). I did try DistilBERT with another library and got good results.

I will soon have to make a crucial decision about which model to pick, and I'm struggling to decide, especially because my data can come in many languages. That's important.

  • The LogReg model looks nice, but it seems we only get a trainable word embedding built from our own vocabulary, which, in my case with a small dataset, might not be ideal (I still get really good results with it). What if new data at inference time is suddenly completely different?

  • The NBSVM also seems interesting: very quick and gives nice results. I just have a false-positive rate that is a little high in my opinion (around 14%), and this is the most important metric for me.

  • FastText is quick, but as I said earlier, I can have text in different languages. There are talks here and there about merging fastText word vectors into a single vector space, but I can't use that with this library right now. By default, ktrain seems to detect the language of the dataset; in my case, it can switch from "en" to "fr" depending on the randomly sampled data taken by the function, which is problematic too (though I'm not sure it downloads the specific French fastText vectors).

  • Finally, the Bidirectional GRU seems to give me good results and a low false-positive rate (6%). This is similar to what I get with DistilBERT, which of course has much longer training and inference times.

If I go with BIGRU, I would like to know how it really works inside. I read the paper you linked, and I also printed the model layers with print_layers():

0 (trainable=False) : <keras.engine.input_layer.InputLayer object at 0x7fe82d337a58>
1 (trainable=True) : <keras.layers.embeddings.Embedding object at 0x7fe82d10b898>
2 (trainable=True) : <keras.layers.core.SpatialDropout1D object at 0x7fe82d0a9828>
3 (trainable=True) : <keras.layers.wrappers.Bidirectional object at 0x7fe82cfa9a58>
4 (trainable=True) : <keras.layers.pooling.GlobalAveragePooling1D object at 0x7fe815ab5ba8>
5 (trainable=True) : <keras.layers.pooling.GlobalMaxPooling1D object at 0x7fe80806e358>
6 (trainable=True) : <keras.layers.merge.Concatenate object at 0x7fe815ab5be0>
7 (trainable=True, wd=None) : <keras.layers.core.Dense object at 0x7fe815ab5cc0>

My question is about the embedding layer. It appears that trainable is set to True. Could you confirm that this means we start with pre-trained fastText vectors plus our own vocabulary taken from the training data (if not already present in fastText), and that the weights are then updated during training?

This is very important to me because, if I understand correctly, we start with a solid base with a pre-trained vector embedding AND our own vocabulary.

Sorry for the long post.

Thanks in advance for your help!

explain currently only supports English

Hey there,
First of all, thanks for providing such an awesome framework; it's nice to work with. I'm trying to get some insight into what my model has learned, but I work on German data. So I wanted to know how the current text explainer could be made to also handle German or other languages, because I do not see why it should currently work only for English.
Best,
Richard
