
BERT-Classifier

A general text classifier based on BERT. Multi-process data processing, multi-gpu parallel training, rich monitoring indicators.


Introduction • Feature • Getting Started • Tensorboard • Multi-GPU • Fast • Average CKPT • Benchmark • My Blog

Created by Yaohua Guo • https://www.guoyaohua.com

Introduction

BERT is a pre-trained language model proposed by Google AI in 2018. It has achieved excellent results on many NLP tasks and marked a turning point for the field. The academic paper that describes BERT in detail and provides full results on a number of tasks can be found here: https://arxiv.org/abs/1810.04805.

Since BERT, a number of excellent models that have swept the NLP field, such as RoBERTa and XLNet, have been built as improvements on top of it.

BERT-Classifier is a general text classifier that is simple and easy to use. It extends BERT to accept up to three text segments as input for prediction. The overall pipeline is rebuilt on the low-level TensorFlow API, which removes the inflexibility of the TensorFlow Estimator. The training process is optimized to cut model initialization time and to avoid the Estimator's habit of repeatedly reloading the computation graph, and rich monitoring metrics are recorded during training (precision, recall, AUC, ROC curve, confusion matrix, F1 score, learning rate, loss, etc.) so that you can effectively track how the model changes.

BERT-Classifier takes full advantage of Python's multiprocessing to speed up data preprocessing across CPU cores; preprocessing runs more than 10 times faster than the original bert run_classifier (the exact speedup depends on the number of CPU cores, their clock frequency, and memory size).

The checkpoint saving mechanism is optimized to keep the top N checkpoints according to a chosen metric, and a checkpoint-averaging function is added that fuses model parameters from multiple training stages to further enhance model robustness.

It also supports packaging the trained models into services for use by downstream tasks.

Feature

  • 💪 State-of-the-art: based on the pretrained 12/24-layer BERT models released by Google AI, which are considered a milestone in the NLP community.
  • 🐣 Easy-to-use: requires only a few lines of code to fine-tune the model or run inference.
  • Fast: data preprocessing is more than 10 times faster than the original bert run_classifier (the exact speedup depends on the number of CPU cores, clock frequency, and memory size).
  • 🐙 Multi-GPU: supports parallel training on multiple GPUs, with the graph optimized for data-parallel training.
  • 💎 Reliable: tested on a variety of datasets; days of running without a break, an OOM, or any nasty exceptions.
  • 🎁 Rich-monitoring: enriched tensorboard monitoring, adding precision, recall, AUC, ROC curve, confusion matrix, F1 score, learning rate, loss, and other metrics.
  • 🔔 Checkpoints: optimized checkpoint saving; the top N checkpoints can be kept according to a specified metric.
  • 💾 Model-average: supports averaging model parameters from multiple training stages to further enhance model robustness.
  • 🌈 Service: supports packaging the trained model as a web service for downstream use.

Note that Bert-Classifier MUST run on Python >= 3.5 with TensorFlow == 1.14.0. Bert-Classifier does not support TensorFlow 2.0!
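For example, to set up a matching environment (this assumes a GPU machine; install the plain tensorflow package instead for CPU-only use):

$pip install tensorflow-gpu==1.14.0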

Getting Started

Download a Pre-trained BERT Model

Download one of the models listed below, then uncompress the zip file into a folder such as ./pre_train_model/uncased_L-12_H-768_A-12/.

List of released pretrained BERT models: (You can also download it here)

| Model | Description |
| --- | --- |
| BERT-Large, Uncased (Whole Word Masking) | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Large, Cased (Whole Word Masking) | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
| BERT-Base, Multilingual Cased (New, recommended) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Base, Multilingual Uncased (Orig, not recommended) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
| BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters |
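For example, BERT-Base, Uncased can be fetched from Google's release bucket (this URL comes from the original BERT release and may change):

$wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
$unzip uncased_L-12_H-768_A-12.zip -d ./pre_train_model/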

Fine-tune the model on your own dataset

Create data processor

You first need to define a processor that fits your dataset. All data processors should subclass the DataProcessor base class and be defined in the processor.py file.

class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of `InputExample`s for prediction."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

All four of the above methods need to be implemented: get_train_examples, get_dev_examples, and get_test_examples should each return a list of InputExample instances, while get_labels returns a list containing all category names (strings).

InputExample is defined as follows:

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, text_c=None, label=None, weight=1):
        """Constructs a InputExample.

        Args:
          guid: Unique id for the example.
          text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
          text_c: (Optional) string. The untokenized text of the third sequence.
            Only must be specified for sequence pair tasks.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
          weight: (Optional) float. The weight of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.text_c = text_c
        self.label = label
        self.weight = weight

Here is a concrete example that you can adapt to your own data.

import os
from tqdm import tqdm
import tokenization  # tokenization module from the BERT repo

class MyProcessor(DataProcessor):
    """Custom data processor."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(os.path.join(data_dir, "eval.tsv"), "eval")

    def get_test_examples(self, file_path):
        """See base class."""
        return self._create_examples(file_path, "test")

    def get_labels(self):
        """See base class."""
        return ["Label1", "Label2", "Label3"]

    def _create_examples(self, path, set_type):
        """Creates examples for the training and dev sets."""
        # ['text1','text2','text3','weight','label']
        examples = []
        i = 0
        with open(path, 'r', encoding='utf-8') as f:
            for line in tqdm(f, desc="Loading data:"):
                i += 1
                guid = "%s-%s" % (set_type, i)
                data = line[:-1].split('\t')
                if set_type == "test":
                    text_a = tokenization.convert_to_unicode(data[0])
                    text_b = tokenization.convert_to_unicode(data[1])
                    text_c = tokenization.convert_to_unicode(data[2])
                    # sample weight always 1 when doing inference.
                    weight = 1
                    # Not used during inference, so it can be any label.
                    label = "Label1"
                else:
                    text_a = tokenization.convert_to_unicode(data[0])
                    text_b = tokenization.convert_to_unicode(data[1])
                    text_c = tokenization.convert_to_unicode(data[2])
                    weight = float(data[3])
                    label = data[4]
                examples.append(
                    InputExample(guid=guid, text_a=text_a, text_b=text_b, text_c=text_c, label=label, weight=weight))
        return examples

After building a custom processor, you need to load and use it in run_fine_tune.py.

from processor import MyProcessor
# run_fine_tune.py, line 164:
data_processor = MyProcessor()
# run_fine_tune.py, line 165:
model = BertClassifier(data_processor, 
                       num_labels, 
                       bert_config_file,
                       max_seq_length, 
                       vocab_file, 
                       tensorboard_dir, 
                       init_checkpoint, 
                       keep_checkpoint_max, 
                       use_GPU, 
                       label_smoothing, 
                       cycle)

Model fine-tune

After defining the processor for your data, you can train the model. First, here is a brief overview of the parameters in run_fine_tune.py.

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| data_dir | str | "./data/" | The input data dir. Should contain the .tsv files (or other data files) for the task. |
| output_dir | str | "./output/" | The output directory where the model checkpoints will be written. |
| tensorboard_dir | str | "./tensorboard/" | The tensorboard output dir. |
| bert_config_file | str | Required | The config json file corresponding to the pre-trained BERT model. This specifies the model architecture. |
| vocab_file | str | Required | The vocabulary file that the BERT model was trained on. |
| init_checkpoint | str | Required | Initial checkpoint (usually from a pre-trained BERT model). |
| do_lower_case | bool | True | Whether to lower-case the input text. Should be True for uncased models and False for cased models. |
| max_seq_length | int | 224 | The maximum total input sequence length after WordPiece tokenization. Longer sequences are truncated; shorter ones are padded. |
| do_train | bool | True | Whether to run training. |
| do_predict | bool | False | Whether to run the model in inference mode on the test set. |
| train_batch_size | int | 16 | Total batch size for training. |
| eval_batch_size | int | 128 | Total batch size for eval. |
| predict_batch_size | int | 128 | Total batch size for predict. |
| learning_rate | float | 5e-5 | The initial learning rate for Adam. |
| num_train_epochs | int | 1 | Total number of training epochs to perform. |
| warmup_proportion | float | 0.1 | Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training. |
| save_checkpoints_steps | int | 1000 | How often to save the model checkpoint. |
| cycle | int | 1 | Number of cycles for the polynomial-decay learning rate. |
| keep_checkpoint_max | int | 20 | The maximum number of checkpoints to keep. |
| predict_file | str | (Optional) | The predict input file, only for inference mode. |
| label_smoothing | float | 0.1 | Label smoothing factor for model regularization (see the sketch below). |
| use_GPU | bool | True | Whether to use the GPU to speed up training. |
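For reference, label smoothing softens the one-hot training targets toward the uniform distribution. A minimal sketch of the idea (illustrative only, not necessarily the repository's exact implementation):

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Blend a one-hot target with the uniform distribution:
    target = (1 - eps) * one_hot + eps / num_classes."""
    num_classes = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / num_classes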

You can fine-tune the model with the following command:

$python run_fine_tune.py  \
	--bert_config_file=./pre_train_model/uncased_L-12_H-768_A-12/bert_config.json \
	--vocab_file=./pre_train_model/uncased_L-12_H-768_A-12/vocab.txt \
	--init_checkpoint=./pre_train_model/uncased_L-12_H-768_A-12/bert_model.ckpt

In training mode, the model reads TFRecord files as input, which makes better use of memory. After you run run_fine_tune.py, the program first preprocesses the raw input data and writes it to TFRecord files stored in the data_dir directory. This happens only on the first run; if the raw data changes, delete the TFRecord files in data_dir so that the program regenerates them from the new data.
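For example, to force regeneration (this assumes the cached files carry a .tf_record suffix; check your data_dir for the exact file names):

$rm ./data/*.tf_record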

Use model for inference

You only need a few lines of code to use the model for inference, as shown below:

from BertClassifier import BertClassifier
from processor import MyProcessor

data_processor = MyProcessor()
# num_labels, bert_config_file, and the other arguments are the same
# configuration values described in the parameter table above.
model = BertClassifier(data_processor, 
                       num_labels, 
                       bert_config_file,
                       max_seq_length, 
                       vocab_file, 
                       tensorboard_dir, 
                       init_checkpoint, 
                       keep_checkpoint_max, 
                       use_GPU, 
                       label_smoothing, 
                       cycle)
model.predict(file_path='./data/test.tsv', predict_batch_size=128, output_dir='./predict')
# Or single sample inference
# prob = model.predict(input_example=input_example)

In inference mode, the model accepts two types of input:

Single-sample prediction: in some streaming scenarios you may need to predict just one sample at a time. In this case, construct an InputExample instance from the input features and pass it to model.predict, which returns the probability distribution of the predicted result.
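A minimal sketch of single-sample inference (this assumes model was constructed as shown above and that InputExample is importable from processor.py, as in the earlier snippets; the guid is arbitrary and the label is ignored at inference time):

from processor import InputExample

example = InputExample(guid="predict-1",
                       text_a="first sentence",
                       text_b="second sentence",
                       text_c="third sentence")
# Returns the probability distribution over the labels.
prob = model.predict(input_example=example)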

Batch prediction: you can pass in a file path directly. The model first parses the file using the MyProcessor defined above, performs batch inference, and saves the results to the directory specified by output_dir.

Inference service

A simple model inference microservice is built with Flask; you can call the service via HTTP requests. You can easily start it with the following command:

$python start_service.py \
	./pre_train_model/uncased_L-12_H-768_A-12/vocab.txt \
	./pre_train_model/uncased_L-12_H-768_A-12/bert_config.json \
	./pre_train_model/uncased_L-12_H-768_A-12/bert_model.ckpt \
	5666

[Figure: inference service]
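As an illustration only, calling the service might look like the following; the route name, payload format, and response shape here are hypothetical, so check start_service.py for the actual API:

import requests

# Hypothetical endpoint and payload; verify against start_service.py.
resp = requests.post("http://localhost:5666/predict",
                     json={"text_a": "first sentence",
                           "text_b": "second sentence",
                           "text_c": "third sentence"})
print(resp.json())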

Tensorboard

Bert-Classifier adds a wealth of monitoring metrics to show more intuitively how model performance changes during training. You can run tensorboard with the following command.

$tensorboard --logdir ./tensorboard_dir

[Figures: TensorBoard scalars, images, and histograms]

Multi-GPU support

Bert-Classifier uses data parallelism to implement multi-GPU training. Depending on your needs, you can train and run inference on the CPU, a single GPU, or multiple GPUs.

# Use CPU to train
$python run_fine_tune.py  \
	--bert_config_file=./pre_train_model/uncased_L-12_H-768_A-12/bert_config.json \
	--vocab_file=./pre_train_model/uncased_L-12_H-768_A-12/vocab.txt \
	--init_checkpoint=./pre_train_model/uncased_L-12_H-768_A-12/bert_model.ckpt \
	--use_gpu=False
	
# Use single GPU to train
$CUDA_VISIBLE_DEVICES=0 python run_fine_tune.py  \
	--bert_config_file=./pre_train_model/uncased_L-12_H-768_A-12/bert_config.json \
	--vocab_file=./pre_train_model/uncased_L-12_H-768_A-12/vocab.txt \
	--init_checkpoint=./pre_train_model/uncased_L-12_H-768_A-12/bert_model.ckpt \
	--use_gpu=True
	
# Use multi-GPU to train (no CUDA_VISIBLE_DEVICES restriction, so all
# visible GPUs are used)
$python run_fine_tune.py  \
	--bert_config_file=./pre_train_model/uncased_L-12_H-768_A-12/bert_config.json \
	--vocab_file=./pre_train_model/uncased_L-12_H-768_A-12/vocab.txt \
	--init_checkpoint=./pre_train_model/uncased_L-12_H-768_A-12/bert_model.ckpt \
	--use_gpu=True

In multi-GPU training mode, Bert-Classifier keeps a copy of the model with shared parameters on each GPU and automatically splits each input batch evenly across the GPUs for forward propagation and gradient computation. The gradients computed on each GPU are then gathered and averaged before the parameters are updated.
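A minimal sketch of the tower-style gradient averaging described above (illustrative; not the repository's exact code):

import tensorflow as tf

def average_gradients(tower_grads):
    """tower_grads: one list of (grad, var) pairs per GPU.
    Returns a single list of (grad, var) with grads averaged across GPUs."""
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        # Stack this variable's gradient from every GPU and take the mean.
        grads = tf.stack([g for g, _ in grads_and_vars], axis=0)
        averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
    return averaged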

[Figure: multi-GPU data-parallel training]

Fast data preprocessing

Bert-Classifier uses multiple processes to accelerate data preprocessing, which is more than 10 times faster than the original BERT preprocessing (the exact speedup depends on the number of CPU cores, clock speed, and memory size). The program adaptively starts an appropriate number of worker processes based on the machine's CPU core count.
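A minimal sketch of the idea, assuming a per-example conversion function convert_fn (a hypothetical name standing in for the real feature-conversion routine):

import multiprocessing as mp

def preprocess_parallel(examples, convert_fn):
    """Convert examples to features in parallel, one worker per CPU core."""
    with mp.Pool(processes=mp.cpu_count()) as pool:
        return pool.map(convert_fn, examples)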

[Figure: multi-process data preprocessing]

Note: since Python's multiprocessing mechanism is not memory-friendly, preprocessing can run out of memory (OOM) on machines with too little RAM.

Average checkpoints

The average_checkpoints.py script averages the parameters of multiple checkpoints, which usually improves model robustness. You can perform checkpoint averaging with the following command.

$python average_checkpoints.py \
	--model_dir=./checkpoints/ \
	--output_dir=./merged_ckpt/ \
	--max_count=20
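Conceptually, the script reads each variable from the selected checkpoints and writes back their element-wise mean; a sketch of that idea (not the script's exact code):

import numpy as np
import tensorflow as tf

def averaged_variables(ckpt_paths):
    """Return {variable_name: mean value across the given checkpoints}."""
    values = {}
    for path in ckpt_paths:
        reader = tf.train.load_checkpoint(path)
        for name in reader.get_variable_to_shape_map():
            values.setdefault(name, []).append(reader.get_tensor(name))
    return {name: np.mean(vals, axis=0) for name, vals in values.items()}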

[Figure: checkpoint averaging]

Benchmark

All experiments are based on BERT-Base, the GPU is a GTX 1080 (8 GB), and the TensorFlow version is 1.14.0.

Fine-tune

Max batch size

The table below shows the maximum training batch size that fits in GPU memory for different max_seq_len values and GPU counts.

| max_seq_len | 1 GPU | 2 GPUs | 4 GPUs |
| --- | --- | --- | --- |
| 32 | 154 | 308 | 616 |
| 64 | 73 | 146 | 292 |
| 96 | 47 | 94 | 188 |
| 128 | 34 | 68 | 136 |
| 256 | 14 | 28 | 56 |
| 512 | 5 | 10 | 20 |

[Figure: fine-tune max batch size]

Speed

For computation speed, we measured the time (ms) the model takes to run one training step at full GPU load under different max_seq_len settings.

| max_seq_len | 1 GPU | 2 GPUs | 4 GPUs |
| --- | --- | --- | --- |
| 32 | 711 | 739 | 765 |
| 64 | 693 | 717 | 730 |
| 96 | 674 | 701 | 716 |
| 128 | 666 | 694 | 713 |
| 256 | 602 | 633 | 675 |
| 512 | 515 | 532 | 568 |

[Figure: fine-tune speed]

Inference

Max batch size

The table below shows the maximum inference batch size that fits in GPU memory for different max_seq_len values and GPU counts.

| max_seq_len | 1 GPU | 2 GPUs | 4 GPUs |
| --- | --- | --- | --- |
| 32 | 4510 | 9005 | 17985 |
| 64 | 2216 | 4408 | 8806 |
| 96 | 1477 | 2953 | 5902 |
| 128 | 739 | 1479 | 2958 |
| 256 | 369 | 739 | 1478 |
| 512 | 128 | 256 | 512 |

[Figure: inference max batch size]

Speed

This test compares the time (s) the model takes to run one inference pass at full GPU load under different max_seq_len settings.

| max_seq_len | 1 GPU | 2 GPUs | 4 GPUs |
| --- | --- | --- | --- |
| 32 | 9.06 | 12.07 | 15.31 |
| 64 | 9.84 | 11.94 | 14.42 |
| 96 | 7.87 | 8.02 | 8.56 |
| 128 | 5.19 | 5.41 | 5.79 |
| 256 | 5.52 | 5.98 | 6.52 |
| 512 | 4.83 | 5.17 | 5.37 |

[Figure: inference speed]

Citing

If you use Bert-Classifier in a scientific publication, we would appreciate a reference to the following BibTeX entry:

@misc{yaohua2019bertclassifier,
  title={bert-classifier},
  author={Yaohua, Guo},
  howpublished={\url{https://github.com/guoyaohua/BERT-Classifier}},
  year={2019}
}

