
RE2RNN

Source code for the EMNLP 2020 paper: "Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks", Chengyue Jiang, Yinggong Zhao, Shanbo Chu, Libin Shen, and Kewei Tu.

Citation

@inproceedings{jiang-etal-2020-cold,
    title = "Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks",
    author = "Jiang, Chengyue  and
      Zhao, Yinggong  and
      Chu, Shanbo  and
      Shen, Libin  and
      Tu, Kewei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.258",
    pages = "3193--3207",
}

Requirements

  • pytorch 1.3.1
  • tensorly 0.5.0
  • numpy
  • tqdm
  • automata-tools
  • pyparsing
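
If you use pip, the dependencies above can be installed with something like the following (a suggested command, not part of the original instructions; you may need a different pytorch install command for your CUDA setup):

pip install torch==1.3.1 tensorly==0.5.0 numpy tqdm automata-tools pyparsing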

Data

Raw dataset files, preprocessed dataset files, the GloVe word embedding matrix, rules for each dataset, and the decomposed automata files can be downloaded here: Google Drive, Tencent Drive.

You can download and extract the zip file, and replace the original data directory. The directory structure should be:

.
├── data
│   ├── ATIS
│   │   ├── automata
│   │   ├── ....
│   ├── TREC
│   │   ├── automata
│   │   ├── ....
│   ├── ....
├── src
│   ├── ....
├── src_simple
│   ├── ....
├── model
│   ├── ....
├── imgs
│   ├── ....

If you have done this, you can skip ahead to the training section.

Regular Expressions

We provide the RE rules for three datasets: ATIS, QC (TREC-6), and SMS. Our REs are word-level, not character-level. The symbols and their meanings are shown in the following table.

Symbol   Meaning
$        wildcard
%        numbers, e.g. 5, 1996
&        punctuation
?        0 or 1 occurrence
*        zero or more occurrences
+        one or more occurrences
(a|b)    a or b

Regular Expression Examples

ATIS - abbreviation label.

[abbreviation]
( $ * ( mean | ( stand | stands ) for | code ) $ * ) | ( $ * what is $ EOS )

SMS - spam label.

[spam]
$ * dating & * $ * call $ *
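
To make the word-level symbols concrete, here is a small, self-contained Python sketch (an illustration only, not part of this repository; function and variable names are made up) that reads the SMS spam rule above by mapping each token to its symbol class and then matching with Python's re module. The actual system compiles these rules into automata instead, as described in the next section.

import re, string

def to_symbols(tokens):
    # Map each token to the symbol it would match: % for numbers, & for punctuation.
    out = []
    for tok in tokens:
        if re.fullmatch(r'\d+', tok):
            out.append('%')
        elif all(ch in string.punctuation for ch in tok):
            out.append('&')
        else:
            out.append(tok)
    return out

# Word-level pattern "$ * dating & * $ * call $ *" (the SMS spam rule above):
# any words, then "dating", optional punctuation, any words, "call", any words.
pattern = r'(\S+ )*dating (& )*(\S+ )*call (\S+ )*'
sentence = "hot singles dating ! call 0906 now".split()
print(re.fullmatch(pattern, ' '.join(to_symbols(sentence)) + ' ') is not None)  # True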

Regular Expression to FA

We show examples on the ATIS dataset; for the other datasets, simply change the --dataset option to TREC or SMS.

Prepare the dataset

You first need to download the GloVe 6B embeddings and place the embedding files into data/emb/glove.6B/. You can then prepare the dataset from the raw dataset files by running the following command.

python data.py --dataset ATIS

RE to FA

We turn the regular expressions into finite automata using our automata-tools package, implemented by @linonetwo. The tool is modified from https://github.com/sdht0/automata-from-regex. This package requires the 'dot' command for drawing the automata.

Run the following commands to convert the REs and the reversed REs (used for the backward direction) into FAs.

python create_automata.py --dataset ATIS --automata_name all --reversed 0
python create_automata.py --dataset ATIS --automata_name all --reversed 1

The regular expression for the ATIS abbreviation label shown above can be represented by the following automaton (see the rendered figure in the repository).

Run the REs

The RE system's result is obtained by running the un-decomposed automaton you just created.

python main.py --model_type Onehot --dataset ATIS --only_probe 1 --wfa_type viterbi \
--epoch 0 --automata_path_forward all.1105155534-1604591734.6171093.split.pkl --seed 0

Decomposing FAs

We decompose the FAs using tensor rank decomposition.

Run the following command to decompose an FA (replace automata_name with the name of the automaton file created in the previous step).

python decompose_automata.py --dataset ATIS --automata_name automata_name --rank 150 --init svd
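
For intuition about what decompose_automata.py does: the WFA transition weights form a 3-way tensor (vocabulary x states x states), and tensor rank (CP) decomposition approximates it with rank-R factor matrices. Below is a minimal sketch on a toy random tensor using tensorly; all sizes and variable names are illustrative, and this is not the repository's implementation.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

V, K, R = 200, 40, 20                      # toy vocab size, number of states, rank
T = np.random.rand(V, K, K)                # stand-in for the WFA transition tensor

weights, factors = parafac(tl.tensor(T), rank=R, init='svd')
E_V, D1, D2 = factors                      # factors of shape (V, R), (K, R), (K, R)

# Relative error of the rank-R approximation.
T_hat = np.einsum('r,vr,ir,jr->vij', weights, E_V, D1, D2)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))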

FAs to FA-RNN and training FA-RNN.

To train FA-RNNs initialized from the decomposed automata on ATIS, SMS, and TREC, make sure you have finished the steps above. If you have downloaded the decomposed automata and placed them in the right location, you can run:

python main.py --dataset ATIS --run save_dir --model_type FSARNN --beta 0.9 \
 --wfa_type forward --seed 0 --lr 0.001 --bidirection 0 --farnn 0 --random 0
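
For intuition about what the FSARNN model computes: once the transition tensor is factored as T[w] ≈ D1 diag(E_V[w]) D2^T, the WFA forward pass over a sentence becomes an RNN-style update, which is the recurrence the paper trains. The numpy sketch below is an illustration with made-up sizes, not the repository's implementation.

import numpy as np

V, K, R = 200, 40, 20                 # toy vocab size, number of states, rank
E_V = np.random.rand(V, R)            # word factor (initialised from the decomposition)
D1 = np.random.rand(K, R)             # state factors
D2 = np.random.rand(K, R)

def farnn_step(h_prev, word_id):
    # One step of the decomposed WFA: h_t = ((h_{t-1} D1) * E_V[w_t]) D2^T
    return ((h_prev @ D1) * E_V[word_id]) @ D2.T

h = np.zeros(K)
h[0] = 1.0                            # start in the initial state
for w in [3, 17, 42]:                 # a toy word-id sequence
    h = farnn_step(h, w)
print(h.shape)                        # (K,)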

Please check the function get_automata_from_seed in utils/utils.py to understand which automaton you are using.

If you use newly decomposed automata, you need to specify the --automata_path_forward and --automata_path_backward options.

For example: FARNN

python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 0 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 0 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl

For example: BiFARNN

python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 1 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 0 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl \
--automata_path_backward automata.newrule.reversed.randomseed150.False.0.0735.0.pkl

For example: FAGRU

python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 0 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 1 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl

We also provide a cleaner version of the code in /src_simple, with some options and unimportant code removed; it contains only the FA-RNN related code. For example:

python main_min.py --dataset ATIS --run save_dir --model_type FSARNN --beta 0.3

Interpretability and Models.

You first need to download the FA-RNN models and config files here: Google Drive, Tencent Drive. Please place the files in the /model directory.

To see the log and hyper-parameters of a provided model, simply use pickle to load the corresponding .res config file. For example, to obtain the hyper-parameters for the model D0.9739-T0.9653-DI0.8655-TI0.8645-1106095843-1604656723.555744-ATIS-0.model, you can run:

import pickle

# Load the saved config / results for the corresponding model.
config = pickle.load(open('1106095843-1604656723.5809364.res', 'rb'))
print(config)

Note that some unused hyper-parameters in the config files were removed in the final/simple version of the code, so a config file may not be directly usable; just filter out the unused hyper-parameters.

We provide several examples showing how to convert the trained model parameters back into WFAs and threshold them into NFAs. See jupyter/checkTrainedRules.ipynb.
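
The notebook walks through the details; the core idea, shown here as a minimal numpy sketch with illustrative names and sizes (not the notebook's actual code), is to rebuild the transition tensor from the trained factors and keep only the transitions whose weight exceeds a threshold.

import numpy as np

V, K, R = 200, 40, 20
E_V = np.random.rand(V, R)            # trained word factor (illustrative)
D1 = np.random.rand(K, R)             # trained state factors (illustrative)
D2 = np.random.rand(K, R)

# Reconstruct the WFA transition tensor (V x K x K) from the factors,
# then threshold it to recover NFA-style 0/1 transitions.
T = np.einsum('vr,ir,jr->vij', E_V, D1, D2)
nfa_transitions = T > 0.5
print(nfa_transitions.sum(), "transitions kept")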


re2rnn's Issues

Use my dataset

Hello, thank you very much for your work; it is very meaningful, but I have some questions for you. Can I replace the dataset with my own? If so, how should I do it? Looking forward to your reply.

About the tensor rank decomposition

1. When the number of states is very large, e.g. K = 750, the error of the tensor rank decomposition is very large. How can this be addressed?
2. When K = 750, what rank should we choose?

About atis.{}.new.pkl

Hello, thank you very much for your beautiful work. I want to study your paper. I created the directory structure as in the README, but when I execute "python data.py --dataset ATIS" it throws an error (screenshot omitted).
I checked data.py, but I still don't understand what "atis.{}.new.pkl" refers to.
Waiting for your help, thanks.

BrokenPipeError: [Broken pipe]

When I reproduce your code, there is a problem that I cannot solve, so I am asking for your help.
When I run the command python create_automata.py --dataset ATIS --automata_name all --reversed 0,
the error is:
BrokenPipeError: [Broken pipe]
(screenshot omitted)

About merging multiple WFAs into a single WFA

Hello. The paper mentions merging multiple WFAs into one; how exactly is this done? For example, for the word-embedding factor V_embedding with shape (V, r), is r split among the sub-WFAs? Also, for D1 and D2, my guess is that values are placed into the r x K matrices according to the number of states of each sub-WFA. Is my guess correct?
