
📚 Text classification library with Keras

Home Page: https://jfilter.github.io/text-classification-keras/

License: MIT License

text-classification sentiment-analysis deep-learning nlp keras keras-tensorflow nlp-machine-learning framework library cnn-text-classification

text-classification-keras's Introduction

Text Classification Keras

A high-level text classification library implementing various well-established models, with a clean and extendable interface for implementing custom architectures.

Quick start

Install

pip install text-classification-keras[full]

The [full] extra will additionally install TensorFlow, spaCy, and Deep Plots. Choose this if you want to get started right away.

Usage

from texcla import experiment, data
from texcla.models import TokenModelFactory, YoonKimCNN
from texcla.preprocessing import FastTextWikiTokenizer

# input text
X = ['some random text', 'another random text lala', 'peter', ...]

# input labels
y = ['a', 'b', 'a', ...]

# use the special tokenizer used for constructing the embeddings
tokenizer = FastTextWikiTokenizer()

# preprocess data (once)
experiment.setup_data(X, y, tokenizer, 'data.bin', max_len=100)

# load data
ds = data.Dataset.load('data.bin')

# construct base
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='fasttext.wiki.simple', embedding_dims=300)

# choose a model
word_encoder_model = YoonKimCNN()

# build a model
model = factory.build_model(
    token_encoder_model=word_encoder_model, trainable_embeddings=False)

# use experiment.train as wrapper for Keras.fit()
experiment.train(x=ds.X, y=ds.y, validation_split=0.1, model=model,
    word_encoder_model=word_encoder_model)

Check out more examples.

API Documentation

https://jfilter.github.io/text-classification-keras/

Advanced

Embeddings

Choose a pre-trained word embedding by setting the embedding_type and the corresponding embedding dimensions. Set embedding_type=None to initialize the word embeddings randomly (but make sure to set trainable_embeddings=True so you actually train the embeddings).

factory = TokenModelFactory(embedding_type='fasttext.wiki.simple', embedding_dims=300)
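To train randomly initialized embeddings instead, the same factory call can be used with embedding_type=None (a minimal sketch reusing ds and YoonKimCNN from the quick start; remember to keep trainable_embeddings=True):

factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type=None, embedding_dims=300)
model = factory.build_model(
    token_encoder_model=YoonKimCNN(), trainable_embeddings=True)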

FastText

Several pre-trained FastText embeddings are included. For now, only the word embeddings are available, not the n-gram features. All embeddings have 300 dimensions.

GloVe

The GloVe embeddings are a predecessor to FastText; in general, prefer FastText embeddings over GloVe. The dimensions of the pre-trained GloVe embeddings vary.
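
For example, to use the 100-dimensional GloVe vectors referenced in the model examples below (a minimal sketch reusing ds from the quick start; it assumes embedding_dims should match the 100 in glove.6B.100d):

factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='glove.6B.100d', embedding_dims=100)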

Tokenization

  • To work on token (or word) level, use a TokenTokenizer, e.g. TwokenizeTokenizer or SpacyTokenizer.
  • To work on token and sentence level, use SpacySentenceTokenizer.
  • To create a custom Tokenizer, extend Tokenizer and implement the token_generator method (see the sketch below).
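
A minimal sketch of a custom tokenizer follows. It assumes Tokenizer can be imported from texcla.preprocessing (as FastTextWikiTokenizer is) and that token_generator yields (document_index, token) pairs, as the built-in word-level tokenizers appear to do; check their source before relying on this.

from texcla.preprocessing import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    # Hypothetical example: lowercases and splits on whitespace.
    # The exact shape of the yielded token_data (assumed here to be
    # (document_index, token)) should be verified against the built-in
    # word-level tokenizers.
    def token_generator(self, texts, **kwargs):
        for doc_idx, text in enumerate(texts):
            for token in text.lower().split():
                yield doc_idx, token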

Spacy

You may use spaCy for tokenization. See the spaCy instructions on how to download a model for your target language, e.g. for English:

python -m spacy download en
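
With the model downloaded, the spaCy-based tokenizer can be dropped in where the quick start uses FastTextWikiTokenizer (a minimal sketch; X and y are the texts and labels from the quick start, and SpacyTokenizer is assumed to take no required arguments):

from texcla import experiment
from texcla.preprocessing import SpacyTokenizer

tokenizer = SpacyTokenizer()
experiment.setup_data(X, y, tokenizer, 'data_spacy.bin', max_len=100)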

Models

Token-based Models

When working on token level, use TokenModelFactory.

from texcla.models import TokenModelFactory, YoonKimCNN

factory = TokenModelFactory(ds.num_classes, ds.tokenizer.token_index,
    max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)

Currently supported models include YoonKimCNN, AttentionRNN, and AveragingEncoder (see the models module for the full list).

TokenModelFactory.build_model takes the provided word encoder and classifies its output via a Dense layer.
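
Conceptually, the factory-built YoonKimCNN model corresponds to the plain Keras model sketched below. This is a rough, hypothetical equivalent, not the library's code: the kernel sizes and filter count follow the Keras layer summary shown in the issues further down, while the activations and dropout rate are assumptions.

from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          Concatenate, Dropout, Dense)
from keras.models import Model

vocab_size, max_tokens, num_classes = 20000, 100, 2   # stand-ins for your data
tokens = Input(shape=(max_tokens,))
x = Embedding(vocab_size, 300)(tokens)                 # pre-trained or random embeddings
pooled = [GlobalMaxPooling1D()(Conv1D(64, k, activation='relu')(x))
          for k in (3, 4, 5)]                          # parallel convolution branches
x = Dropout(0.5)(Concatenate()(pooled))
output = Dense(num_classes, activation='softmax')(x)   # classification head
model = Model(tokens, output)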

Sentence-based Models

When working on sentence level, use SentenceModelFactory.

from texcla.models import SentenceModelFactory, AttentionRNN

# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500,
    max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()

# Allows you to compose arbitrary word encoders followed by sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)

  • Hierarchical attention networks (HANs) can be built by composing two attention-based RNN models. This is useful when a document is very large.
  • For smaller documents, a reasonable way to encode sentences is to average the words within them. This can be done by passing token_encoder_model=AveragingEncoder().
  • Mix and match encoders as you see fit for your problem.

SentenceModelFactory.build_model creates a tiered model: words within a sentence are first encoded using word_encoder_model, and the resulting sentence encodings are then encoded using sentence_encoder_model.
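
For example, to average words within each sentence and then run an attention RNN over the sentence vectors (a minimal sketch reusing the factory above; it assumes AveragingEncoder and AttentionRNN are importable from texcla.models and that the keyword arguments match the names used above):

from texcla.models import AveragingEncoder, AttentionRNN

model = factory.build_model(token_encoder_model=AveragingEncoder(),
    sentence_encoder_model=AttentionRNN())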

Related

Contributing

If you have a question, found a bug, or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

Acknowledgements

Built upon the work by Raghavendra Kotikalapudi: keras-text.

Citation

If you find Text Classification Keras useful for an academic publication, then please use the following BibTeX to cite it:

@misc{raghakotfiltertexclakeras,
    title={Text Classification Keras},
    author={Raghavendra Kotikalapudi and Johannes Filter and contributors},
    year={2018},
    publisher={GitHub},
    howpublished={\url{https://github.com/jfilter/text-classification-keras}},
}

License

MIT.

text-classification-keras's People

Contributors

jfilter, mrmathias, raghakot, zuenko


text-classification-keras's Issues

Bug in Multi-Class Classification?

Hi,

We tried text-classification-keras; it works really nicely with single classes.

Today I tried it on a dataset with multiple classes per document, and ran into an error which I couldn't fix.

python eswc_eurlex.py train
Using TensorFlow backend.
INFO:texcla.embeddings:Building embeddings index...
INFO:texcla.embeddings:Done
number of classes: 1
19314 76406184
INFO:texcla.embeddings:Loading embeddings for all words in the corpus
2018-12-05 19:07:36.743116: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-05 19:07:36.755305: W tensorflow/core/framework/allocator.cc:113] Allocation of 37423200 exceeds 10% of system memory.
2018-12-05 19:07:36.829602: W tensorflow/core/framework/allocator.cc:113] Allocation of 37423200 exceeds 10% of system memory.
2018-12-05 19:07:36.845069: W tensorflow/core/framework/allocator.cc:113] Allocation of 37423200 exceeds 10% of system memory.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 200)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 300)     9355800     input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 198, 64)      57664       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 197, 64)      76864       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 196, 64)      96064       embedding_1[0][0]                
__________________________________________________________________________________________________
global_max_pooling1d_1 (GlobalM (None, 64)           0           conv1d_1[0][0]                   
__________________________________________________________________________________________________
global_max_pooling1d_2 (GlobalM (None, 64)           0           conv1d_2[0][0]                   
__________________________________________________________________________________________________
global_max_pooling1d_3 (GlobalM (None, 64)           0           conv1d_3[0][0]                   
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 192)          0           global_max_pooling1d_1[0][0]     
                                                                 global_max_pooling1d_2[0][0]     
                                                                 global_max_pooling1d_3[0][0]     
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 192)          0           concatenate_1[0][0]              
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1)            193         dropout_1[0][0]                  
==================================================================================================
Total params: 9,586,585
Trainable params: 230,785
Non-trainable params: 9,355,800
__________________________________________________________________________________________________
Traceback (most recent call last):
  File "eswc_eurlex.py", line 110, in <module>
    train()
  File "eswc_eurlex.py", line 102, in train
    experiment.train(x=ds.X, y=ds.y, validation_split=0.1, model=model, word_encoder_model=word_encoder_model, shuffle=True) # wohlg added shuffle
  File "/home/wohlg/anaconda3/lib/python3.6/site-packages/texcla/experiment.py", line 76, in train
    batch_size=batch_size, callbacks=create_callbacks(exp_path, patience), **fit_args)
  File "/home/wohlg/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 950, in fit
    batch_size=batch_size)
  File "/home/wohlg/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 802, in _standardize_user_data
    check_array_length_consistency(x, y, sample_weights)
  File "/home/wohlg/anaconda3/lib/python3.6/site-packages/keras/engine/training_utils.py", line 236, in check_array_length_consistency
    'and ' + str(list(set_y)[0]) + ' target samples.')
ValueError: Input arrays should have the same number of samples as target arrays. Found 19314 input samples and 76406184 target samples.

I tracked the issue down to the Dataset class.
We have 19314 documents as input, which have 3956 distinct labels; per document it's about 5 labels on average. In lines 34/35 of https://github.com/jfilter/text-classification-keras/blob/master/texcla/data.py the MultiLabelBinarizer seems to create a matrix of size 19314 * 3956 = 76406184 for y; this also corresponds with the message in the exception.
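
For reference, scikit-learn's MultiLabelBinarizer returns one binary row per document, so the total number of elements matches the figure in the error message:

from sklearn.preprocessing import MultiLabelBinarizer

y = [['a', 'b'], ['b'], ['a', 'c']]          # labels per document
Y = MultiLabelBinarizer().fit_transform(y)   # shape: (num_docs, num_distinct_labels)
print(Y.shape)                               # (3, 3) here; (19314, 3956) above
assert 19314 * 3956 == 76406184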

What is needed to make Keras understand that len(y) = num_docs * num_classes in the multilabel case, and not just num_docs?

I'd be very grateful for help,
Best, Gerhard

https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec: 403 -- Forbidden

Traceback (most recent call last):
  File "imdb.py", line 55, in <module>
    train()
  File "imdb.py", line 34, in train
    ds_train.num_classes, ds_train.tokenizer.token_index, max_tokens=MAX_LEN, embedding_type='fasttext.wiki.simple', embedding_dims=300)
  File "/home/chenqing/anaconda3/lib/python3.7/site-packages/texcla/models/token_model.py", line 32, in __init__
    embedding_type, embedding_dims, embedding_path)
  File "/home/chenqing/anaconda3/lib/python3.7/site-packages/texcla/embeddings.py", line 236, in get_embeddings_index
    embedding_type_obj['file'], origin=embedding_type_obj['url'], extract=extract, cache_subdir='embeddings', file_hash=embedding_type_obj.get('file_hash',))
  File "/home/chenqing/anaconda3/lib/python3.7/site-packages/keras/utils/data_utils.py", line 224, in get_file
    raise Exception(error_msg.format(origin, e.code, e.msg))
Exception: URL fetch failure on https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec: 403 -- Forbidden

subprocess.CalledProcessError in tokenize.py

Great project, Sir! Thanks a lot for maintaining this; I think it is super useful.

I ran into the following error when running the example code from the readme in my newly created environment on a Win10 x64 machine:

(kerastext) E:>python texcla_test.py
deep_plots: No display found. Using non-interactive Agg backend.
> Using TensorFlow backend.
> sed: -e expression #1, char 8: unknown option to `s'
> Traceback (most recent call last):
>   File "texcla_test.py", line 15, in <module>
>     experiment.setup_data(X, y, tokenizer, 'data.bin', max_len=100)
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\site-packages\texcla\experiment.py", line 128, in setup_data
>     tokenizer.build_vocab(X)
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\site-packages\texcla\preprocessing\tokenizer.py", line 194, in build_vocab
>     for token_data in self.token_generator(texts, **kwargs):
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\site-packages\texcla\preprocessing\word_tokenizer.py", line 142, in token_generator
>     tokens = fastTextWikiTokenizer.tokenize(text)
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\site-packages\texcla\libs\fastTextWikiTokenizer\tokenize.py", line 63, in tokenize
>     return(preproc(s).split())
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\site-packages\texcla\libs\fastTextWikiTokenizer\tokenize.py", line 59, in preproc
>     return __digits(__spaces(__normalize_text(s.lower())))
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\site-packages\texcla\libs\fastTextWikiTokenizer\tokenize.py", line 38, in __normalize_text
>     ['sed', *commands], input=s.encode()).decode("utf-8")
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\subprocess.py", line 336, in check_output
>     **kwargs).stdout
>   File "C:\Users\Dlenz\Anaconda3\envs\kerastext\lib\subprocess.py", line 418, in run
>     output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command '['sed', '-e', "s/’/'/g", '-e', "s/′/'/g", '-e', "s/''/ /g", '-e', "s/'/ ' /g", '-e', 's/“/"/g', '-e', 's/”/"/g', '-e', 's/"/ /g', '-e', 's/\\./ \\. /g', '-e', 's/<br \\/>/ /g', '-e', 's/, / , /g', '-e', 's/(/ ( /g', '-e', 's/)/ ) /g', '-e', 's/\\!/ \\! /g', '-e', 's/\\?/ \\? /g', '-e', 's/\\;/ /g', '-e', 's/\\:/ /g', '-e', 's/-/ - /g', '-e', 's/=/ /g', '-e', 's/=/ /g', '-e', 's/*/ /g', '-e', 's/|/ /g', '-e', 's/«/ /g']' returned non-zero exit status 1.

It seems to me like there is an easy fix; I don't see it though.

Env setup:

conda create -n kerastext python==3.6
activate kerastext
conda install pandas scipy scikit-learn tensorflow-gpu
python -m pip install --upgrade pip
pip install text-classification-keras[full]==0.1.1

keras_test.py :

from texcla import experiment, data
from texcla.models import TokenModelFactory, YoonKimCNN
from texcla.preprocessing import FastTextWikiTokenizer

# input text
X = ['some random text', 'another random text lala', 'peter', ...]

# input labels
y = ['a', 'b', 'a', ...]

# use the special tokenizer used for constructing the embeddings
tokenizer = FastTextWikiTokenizer()

# preprocess data (once)
experiment.setup_data(X, y, tokenizer, 'data.bin', max_len=100)

# load data
ds = data.Dataset.load('data.bin')

# construct base
factory = TokenModelFactory(
    ds.num_classes, ds.tokenizer.token_index, max_tokens=100,
    embedding_type='fasttext.wiki.simple', embedding_dims=300)

# choose a model
word_encoder_model = YoonKimCNN()

# build a model
model = factory.build_model(
    token_encoder_model=word_encoder_model, trainable_embeddings=False)

# use experiment.train as wrapper for Keras.fit()
experiment.train(x=ds.X, y=ds.y, validation_split=0.1, model=model,
    word_encoder_model=word_encoder_model)

Any help is greatly appreciated!

Invalid Syntax Error

File "/data/xxxx/anaconda2/lib/python2.7/site-packages/texcla/experiment.py", line 69
history = model.fit(**fit_args, epochs=epochs,
^
SyntaxError: invalid syntax
