sentence simplification with character-level transformer
- python3
- tensorflow
- nltk (if you need the bleu scores)
cd src
- prepare the data
mkdir trial trial/data
./mock.py
./data.py
- train a new model
mkdir trial/model trial/pred
./train.py
by default, these paths are used
- src/trial/model: for model checkpoints
- src/trial/pred: for predicted validation outputs
- ~/cache/tensorboard-logdir/explicharr: for tensorboard summaries
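a minimal sketch of how these defaults resolve to absolute paths, in case you want to inspect or clean them from your own script. `DEFAULT_PATHS` and `resolve` are hypothetical names for illustration and are not taken from train.py; the sketch assumes the first two paths are relative to the repository root and the third to your home directory.

```python
import os

# defaults as listed in this readme; the dict itself is an illustration,
# not something defined in train.py
DEFAULT_PATHS = {
    "model": "src/trial/model",
    "pred": "src/trial/pred",
    "logdir": "~/cache/tensorboard-logdir/explicharr",
}


def resolve(name, root="."):
    """Expand ~ and make repository-relative defaults absolute."""
    path = os.path.expanduser(DEFAULT_PATHS[name])
    # os.path.join keeps `path` unchanged when it is already absolute
    return os.path.abspath(os.path.join(root, path))


if __name__ == "__main__":
    for name in DEFAULT_PATHS:
        print(name, resolve(name))
```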
- docs: paper and slides
- pred: output samples
- data: aligned sentences
- src: code
- showcase.ipynb: how to use the model without reading the source code
- mock.py: code for selecting and cleaning data for training and validation
- data.py: code for converting data to numpy arrays
- train.py: code for training, evaluating, and profiling
- model.py: the main implementation
- util*.py: various utilities
- bleu.py: script for evaluating bleu scores
these characters have special meanings
- \xa0: non-breaking space, for unknown characters
- \x0a: newline, for marking the end of sequence
- \x20: space, for token boundaries
for more details, see our paper
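the special characters can be illustrated with a small character-level encoder. this is only a sketch under assumptions: `build_vocab`, `encode`, and `decode` are hypothetical helpers, the id layout is made up, and the actual conversion in data.py may work differently.

```python
# hypothetical helpers; the repository's data.py may differ
UNK = "\xa0"  # non-breaking space stands in for unknown characters
EOS = "\x0a"  # newline marks the end of a sequence
SEP = "\x20"  # space marks token boundaries


def build_vocab(sentences):
    """Map each character in the data to an integer id, specials first."""
    chars = sorted(set("".join(sentences)) - {UNK, EOS, SEP})
    return {c: i for i, c in enumerate([UNK, EOS, SEP] + chars)}


def encode(sentence, vocab):
    """Turn a sentence into ids: normalize whitespace, map unseen
    characters to UNK, and append the end-of-sequence marker."""
    text = SEP.join(sentence.split()) + EOS
    return [vocab.get(c, vocab[UNK]) for c in text]


def decode(ids, vocab):
    """Invert encode, stopping at the first end-of-sequence marker."""
    index = {i: c for c, i in vocab.items()}
    chars = []
    for i in ids:
        c = index[i]
        if c == EOS:
            break
        chars.append(c)
    return "".join(chars)


if __name__ == "__main__":
    vocab = build_vocab(["the cat"])
    print(encode("the dog", vocab))
    print(repr(decode(encode("the dog", vocab), vocab)))
```

characters that never appeared in the data used to build the vocabulary (here `d`, `o`, `g`) all collapse to the id of `\xa0`, so decoding shows them as non-breaking spaces.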