A character-based word segmenter that labels each character in the input sequence.
- BiLSTM-CRF
- Character-based word segmentation
- CRF as an inference layer
- Based on the CoNLL tagging scheme
- BIES:
  - micro-avg: 98.46
  - macro-avg: 87.04
- Boundary:
  - micro-avg: 99.18
  - macro-avg: 95.95
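For intuition, here is a minimal sketch (not the repository's code; the function names are illustrative) of how a segmented sentence maps to per-character tags under the two schemes above, assuming "Boundary" means binary word-start tagging:

```python
# Illustrative sketch of the two tagging schemes; not the repository's code.

def to_bies(words):
    """BIES tags: B(egin)/I(nside)/E(nd) for multi-character words, S(ingle)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags += ["B"] + ["I"] * (len(w) - 2) + ["E"]
    return tags

def to_boundary(words):
    """Binary boundary tags: 1 where a word starts, 0 elsewhere (an assumption)."""
    tags = []
    for w in words:
        tags += ["1"] + ["0"] * (len(w) - 1)
    return tags

# Note: Thai combining marks count as separate characters here.
words = ["ผม", "กิน", "ข้าว"]  # "I eat rice", already segmented
print(to_bies(words))      # ['B', 'E', 'B', 'I', 'E', 'B', 'I', 'I', 'E']
print(to_boundary(words))  # ['1', '0', '1', '0', '0', '1', '0', '0', '0']
```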
See best2010_cooker for more details.
- train set
  - raw: best2010_80.shuf.raw.sl
  - segmented: best2010_80.shuf.seg.sl
- validation set
  - raw: best2010_10v.shuf.raw.sl
  - segmented: best2010_10v.shuf.seg.sl
- test set
  - raw: best2010_10t.shuf.raw.sl
  - segmented: best2010_10t.shuf.seg.sl
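Judging from the file names, 80 / 10v / 10t presumably correspond to a shuffled 80% train / 10% validation / 10% test split of the BEST 2010 corpus.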
- python3 >= 3.7.3
- torch >= 1.7.1+cu101
- allennlp >= 2.0.1
- numpy >= 1.20.0
- pathlib >= 1.0.1
- gensim >= 3.8.3
- pickle
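Note that pickle ships with the Python standard library (as does pathlib on Python >= 3.4), and the +cu101 torch build normally comes from PyTorch's own wheel index, so a plain pip install is only a rough starting point:

```sh
# Rough starting point; pick the torch wheel matching your CUDA version
pip install 'torch>=1.7.1' 'allennlp>=2.0.1' 'numpy>=1.20.0' 'gensim>=3.8.3'
```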
- train: train a model
- eval: evaluate a model with conlleval
- decode: decode an input file (unsegmented text) into segmented words
- interactive: decode input text (unsegmented text) interactively in a terminal (command prompt)
Modes can be specified in the sample script
- sample_scripts/sample_seg_w.sh
- Set CRF_TOP_K to control the number of candidate segmentation results (see the example invocation below)
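For reference, a hedged example of invoking segmenter.py directly with flags from the usage below. The file locations, the model file name, and the task value "seg" are assumptions; the sample script above is the authoritative entry point:

```sh
# Train for 20 epochs on GPU 0 (paths and task name are placeholders)
python3 segmenter.py -x train -t seg -g 0 -e 20 -b 64 \
    -p best2010/ --train_data best2010_80.shuf.seg.sl \
    --valid_data best2010_10v.shuf.seg.sl

# Decode raw text with a trained model, keeping the top-5 CRF candidates
python3 segmenter.py -x decode -t seg -m models/<trained_model>.pt \
    -p best2010/ --decode_data best2010_10t.shuf.raw.sl \
    --output_data best2010_10t.shuf.decoded.sl --crf_top_k 5
```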
usage: segmenter.py [-h] [--execute_mode {train,eval,decode,interactive}]
[--task TASK] [--quiet] [--gpu GPU]
[--epoch_begin EPOCH_BEGIN] [--epoch_end EPOCH_END]
[--break_point BREAK_POINT] [--batch_size BATCH_SIZE]
[--grad_clip GRAD_CLIP] [--optimizer OPTIMIZER]
[--adam_alpha ADAM_ALPHA] [--adam_beta1 ADAM_BETA1]
[--adam_beta2 ADAM_BETA2]
[--adam_weight_decay ADAM_WEIGHT_DECAY]
[--adagrad_lr ADAGRAD_LR]
[--adagrad_lr_decay ADAGRAD_LR_DECAY]
[--adagrad_weight_decay ADAGRAD_WEIGHT_DECAY]
[--adadelta_lr ADADELTA_LR] [--adadelta_rho ADADELTA_RHO]
[--adadelta_weight_decay ADADELTA_WEIGHT_DECAY]
[--rmsprop_lr RMSPROP_LR] [--rmsprop_alpha RMSPROP_ALPHA]
[--rmsprop_weight_decay RMSPROP_WEIGHT_DECAY]
[--sgd_lr SGD_LR] [--sgd_momentum SGD_MOMENTUM]
[--sgd_weight_decay SGD_WEIGHT_DECAY]
[--scheduler SCHEDULER]
[--exponential_gamma EXPONENTIAL_GAMMA]
[--model_path MODEL_PATH]
[--input_data_path_prefix PATH_PREFIX]
[--train_data TRAIN_DATA] [--valid_data VALID_DATA]
[--test_data TEST_DATA] [--decode_data DECODE_DATA]
[--output_data OUTPUT_DATA] [--input_data_format {sl,wl}]
[--output_data_format {sl,wl}]
[--output_token_delimiter OUTPUT_TOKEN_DELIM]
[--lowercase_alphabets] [--normalize_digits]
[--ignored_labels IGNORED_LABELS]
[--token_freq_threshold TOKEN_FREQ_THRESHOLD]
[--token_max_vocab_size TOKEN_MAX_VOCAB_SIZE]
[--pretrained_embed_usage PRETRAINED_EMBED_USAGE]
[--unigram_embed_model_path UNIGRAM_EMBED_MODEL_PATH]
[--external_dic_path EXTERNAL_DIC_PATH]
[--tagging_unit TAGGING_UNIT]
[--embed_dropout EMBED_DROPOUT]
[--rnn_dropout RNN_DROPOUT]
[--rnn_unit_type RNN_UNIT_TYPE] [--rnn_bidirection]
[--rnn_batch_first] [--rnn_n_layers RNN_N_LAYERS]
[--rnn_hidden_size RNN_HIDDEN_SIZE]
[--evaluation_method EVALUATION_METHOD]
[--mlp_dropout MLP_DROPOUT]
[--unigram_embed_size UNIGRAM_EMBED_SIZE]
[--mlp_n_layers MLP_N_LAYERS]
[--mlp_hidden_size MLP_HIDDEN_SIZE]
[--mlp_activation MLP_ACTIVATION]
[--inference_layer_type INFERENCE_LAYER]
[--crf_top_k CRF_TOP_K]
optional arguments:
-h, --help show this help message and exit
--execute_mode {train,eval,decode,interactive}, -x {train,eval,decode,interactive}
Choose a mode from among 'train', 'eval', 'decode',
and 'interactive'
--task TASK, -t TASK Select a task
--quiet, -q Do not output log file and serialized model file
--gpu GPU, -g GPU GPU device id (use CPU if a negative value is specified)
--epoch_begin EPOCH_BEGIN
Conduct training from i-th epoch (Default: 1)
--epoch_end EPOCH_END, -e EPOCH_END
Conduct training up to i-th epoch (Default: 5)
--break_point BREAK_POINT
The number of training instances after which the model is
evaluated and saved (Default: 10000)
--batch_size BATCH_SIZE, -b BATCH_SIZE
The number of examples in each mini-batch (Default:
64)
--grad_clip GRAD_CLIP
Gradient norm threshold to clip (Default: 5.0)
--optimizer OPTIMIZER, -o OPTIMIZER
Choose optimizing algorithm from among 'adam',
'adadelta', 'adagrad', 'rmsprop', and 'sgd' (Default:
adam)
--adam_alpha ADAM_ALPHA
alpha (learning rate) for Adam (Default: 0.001)
--adam_beta1 ADAM_BETA1
beta1 for Adam (Default: 0.9)
--adam_beta2 ADAM_BETA2
beta2 for Adam (Default: 0.999)
--adam_weight_decay ADAM_WEIGHT_DECAY
Weight decay (L2 penalty) for Adam (Default: 0)
--adagrad_lr ADAGRAD_LR
Initial learning rate for AdaGrad (Default: 0.1)
--adagrad_lr_decay ADAGRAD_LR_DECAY
Learning rate decay for AdaGrad (Default: 0)
--adagrad_weight_decay ADAGRAD_WEIGHT_DECAY
Weight decay (L2 penalty) for AdaGrad (Default: 0)
--adadelta_lr ADADELTA_LR
Initial learning rate for AdaDelta (Default: 1.0)
--adadelta_rho ADADELTA_RHO
rho for AdaDelta (Default: 0.9)
--adadelta_weight_decay ADADELTA_WEIGHT_DECAY
Weight decay (L2 penalty) for AdaDelta (Default: 0)
--rmsprop_lr RMSPROP_LR
Initial learning rate for RMSprop (Default: 0.1)
--rmsprop_alpha RMSPROP_ALPHA
alpha for RMSprop (Default: 0.99)
--rmsprop_weight_decay RMSPROP_WEIGHT_DECAY
Weight decay (L2 penalty) for RMSprop (Default: 0)
--sgd_lr SGD_LR Initial learning rate for SGD optimizers (Default:
0.1)
--sgd_momentum SGD_MOMENTUM
Momentum factor for SGD (Default: 0.0)
--sgd_weight_decay SGD_WEIGHT_DECAY
Weight decay (L2 penalty) for SGD (Default: 0)
--scheduler SCHEDULER, -s SCHEDULER
Choose a scheduler for the optimizer; currently only
'exponential' is available (Default: exponential)
--exponential_gamma EXPONENTIAL_GAMMA
Multiplicative factor of learning rate decay for the
exponential scheduler (Default: 0.99)
--model_path MODEL_PATH, -m MODEL_PATH
pt/pkl file path of the trained model. The corresponding
'xxx.hyp' and 'xxx.s2i' files are also read when you
specify an 'xxx_izzz.pt/pkl' file
--input_data_path_prefix PATH_PREFIX, -p PATH_PREFIX
Path prefix of input data
--train_data TRAIN_DATA
File path succeeding 'input_data_path_prefix' of
training data
--valid_data VALID_DATA
File path succeeding 'input_data_path_prefix' of
validation data
--test_data TEST_DATA
File path succeeding 'input_data_path_prefix' of test
data
--decode_data DECODE_DATA
File path succeeding 'input_data_path_prefix' of input
text to decode
--output_data OUTPUT_DATA
File path to output segmented text
--input_data_format {sl,wl}, -f {sl,wl}
Choose format of input data from among 'sl' and 'wl'
(Default: sl)
--output_data_format {sl,wl}
Choose format of output data from among 'sl' and 'wl'
(Default: sl)
--output_token_delimiter OUTPUT_TOKEN_DELIM
Delimiter symbol between words for SL-format output in
decode/interactive mode (Default: ' ')
--lowercase_alphabets
Lowercase alphabets in input text
--normalize_digits Normalize all digits in input text to the same symbol
--ignored_labels IGNORED_LABELS
Specify labels to be ignored during evaluation, in the
format 'label_1,label_2,...,label_N'
--token_freq_threshold TOKEN_FREQ_THRESHOLD
Token frequency threshold. Tokens whose frequency is
lower than the threshold are regarded as unknown tokens
(Default: 1)
--token_max_vocab_size TOKEN_MAX_VOCAB_SIZE
Maximum size of token vocabulary. If set to a positive
value, low-frequency tokens are regarded as unknown
tokens so that the vocabulary size does not greatly
exceed the specified size (Default: -1)
--pretrained_embed_usage PRETRAINED_EMBED_USAGE
Specify usage of the pretrained embedding model from
among 'init', 'concat', and 'add'
--unigram_embed_model_path UNIGRAM_EMBED_MODEL_PATH
File path of pretrained model of token (character or
word) unigram embedding
--external_dic_path EXTERNAL_DIC_PATH
File path of external word dictionary listing known
words
--tagging_unit TAGGING_UNIT
Specify tagging unit, 'single' for character-based
model
--embed_dropout EMBED_DROPOUT
Dropout ratio for embedding layers (Default: 0.0)
--rnn_dropout RNN_DROPOUT
Dropout ratio for RNN vertical layers (Default: 0.2)
--rnn_unit_type RNN_UNIT_TYPE
Choose unit type of RNN from among 'lstm', 'gru', and
'plain' (tanh) (Default: lstm)
--rnn_bidirection Use bidirectional RNN (Default: False)
--rnn_batch_first Provide input and output tensors as (batch, seq,
feature) (Default: False)
--rnn_n_layers RNN_N_LAYERS
The number of RNN layers (Default: 1)
--rnn_hidden_size RNN_HIDDEN_SIZE
The size of hidden units (dimension) for RNN (Default:
256)
--evaluation_method EVALUATION_METHOD
Evaluation method for segmentation
--mlp_dropout MLP_DROPOUT
Dropout ratio for MLP of sequence labeling model
(Default: 0.0)
--unigram_embed_size UNIGRAM_EMBED_SIZE
The size (dimension) of token (character or word)
unigram embedding (Default: 128)
--mlp_n_layers MLP_N_LAYERS
The number of MLP layers of the sequence labeling model.
The last layer projects the input hidden vector onto a
space whose dimension equals the number of labels
(Default: 1)
--mlp_hidden_size MLP_HIDDEN_SIZE
The size of hidden units (dimension) of MLP of
sequence labeling model (Default: 300)
--mlp_activation MLP_ACTIVATION
Choose the activation function for the multi-layer
perceptron from between 'sigmoid' and 'relu' (Default:
relu)
--inference_layer_type INFERENCE_LAYER
Choose type of inference layer for sequence labeling
model from between 'softmax' and 'crf' (Default: crf)
--crf_top_k CRF_TOP_K
Specify the top k for CRF (Viterbi algorithm) decoding
(Default: 5)
- sl: sentence line (one sentence per line)
- wl: word line (one token per line)
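An illustration of the two formats, assuming the usual conventions (one sentence per line with delimiter-separated tokens for sl; one token per line with a blank line between sentences for wl) for the Thai sentence ผมกินข้าว ("I eat rice"):

```
# sl (one sentence per line, space-delimited tokens)
ผม กิน ข้าว

# wl (one token per line, blank line between sentences)
ผม
กิน
ข้าว
```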
- Okumura-Takamura-Funakoshi Lab: Natural Language Processing Group at Tokyo Institute of Technology
- AIAT: Artificial Intelligence Association of Thailand
- Implementation based on a modification of seikanlp
- N/A