harvardnlp / im2markup
Neural model for converting Image-to-Markup (by Yuntian Deng yuntiandeng.com)
Home Page: https://im2markup.yuntiandeng.com
License: MIT License
(base) [centos@datascience-gpudev-1 im2markup]$ python scripts/evaluation/evaluate_bleu.py --result-path results/results.txt --data-path data/sample/test_filter.lst --label-path data/sample/formulas.norm.lst
2019-12-15 02:03:15,469 root INFO Script being executed: scripts/evaluation/evaluate_bleu.py
Traceback (most recent call last):
File "scripts/evaluation/evaluate_bleu.py", line 87, in <module>
main(sys.argv[1:])
File "scripts/evaluation/evaluate_bleu.py", line 60, in main
labels[img_path] = labels_tmp[int(idx)]
KeyError: 32771
(base) [centos@datascience-gpudev-1 im2mark
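One way to narrow down a KeyError like this is to check whether every index referenced in the data file actually has a line in the label file. The sketch below is only an illustration: the helper name `find_missing_indices` is hypothetical, and it assumes each data line has the form "image_name index", which is a guess based on the `labels_tmp[int(idx)]` line in the traceback.

```python
def find_missing_indices(data_lines, num_labels):
    """Return (image, index) pairs whose index has no line in the label file.

    Assumes each data line looks like "image_name index" (an assumption,
    not confirmed by the repository).
    """
    missing = []
    for line in data_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        img_path, idx = parts[0], int(parts[1])
        if not 0 <= idx < num_labels:
            missing.append((img_path, idx))
    return missing
```

If this reports entries such as index 32771, the data file and the label file probably come from different versions or subsets of the dataset.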
I am looking at the cnn.lua code and have a few questions.
What are
model:add(nn.AddConstant(-128.0))
model:add(nn.MulConstant(1.0 / 128))
and
model:add(nn.Transpose({2, 3}, {3,4})) -- (batch_size, H, W, 512)
model:add(nn.SplitTable(1, 3)) -- #H list of (batch_size, W, 512)
about?
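As a rough reading of the two snippets (based on the shape comments in the quoted code, not on confirmation from the authors), the first pair rescales pixel values and the second pair rearranges the CNN feature map into a list of rows. A NumPy sketch of the same operations, with hypothetical function names:

```python
import numpy as np

def normalize_pixels(x):
    # nn.AddConstant(-128.0) followed by nn.MulConstant(1.0 / 128):
    # maps pixel values from [0, 255] to roughly [-1, 1).
    return (x - 128.0) * (1.0 / 128)

def to_row_list(features):
    # features: (batch_size, 512, H, W), assumed output of the conv stack.
    # nn.Transpose({2, 3}, {3, 4}) swaps Torch dims 2<->3, then 3<->4
    # (1-indexed), giving (batch_size, H, W, 512).
    x = np.swapaxes(features, 1, 2)  # (batch_size, H, 512, W)
    x = np.swapaxes(x, 2, 3)         # (batch_size, H, W, 512)
    # nn.SplitTable(1, 3) then splits along H, giving a list of
    # H arrays, each of shape (batch_size, W, 512).
    return [x[:, h] for h in range(x.shape[1])]
```

Each element of the returned list is one row of the feature grid, which is presumably what the row encoder consumes one at a time.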
Hi Authors,
Your trained model is in Torch. I want to test my code with the pretrained weights of your CNN model, but I am working in a TensorFlow and Keras environment. Can you help me convert the Torch model to HDF5 format? I have not been able to do it.
Hi,
I tried to use the Math-to-LaTeX Toy Example pre-trained model and test on my own equation. I use the following commands to perform testing:
th src/train.lua -phase test -gpu_id 1 -load_model -model_dir model/latex -visualize
-data_base_dir data/sample/images_processed/
-data_path data/sample/test_filter.lst
-label_path data/sample/formulas.norm.lst
-output_dir results
-max_num_tokens 500 -max_image_width 800 -max_image_height 800
-batch_size 5 -beam_size 5
When I follow your provided steps and test on your test data, everything is fine. But when I change "-data_base_dir" and "-data_path" to point to my own cropped equations (such as 9+9+8=26, all in printed font, not handwritten) and keep "-label_path" unchanged, the test output "results.txt" is still nearly the same as the ground-truth labels in your "formulas.norm.lst". Even if I change my equations, as long as "formulas.norm.lst" is not changed, the test output stays the same.
But once I change "formulas.norm.lst" to contain the correct LaTeX expressions for my equations, the test output starts to make sense. Why is this the case? I would expect the model to predict labels without the help of the ground-truth labels, right? The labels should only be used to calculate the loss, distance, etc.
It looks like there is an offset in the dataset provided:
In im2latex_train.lst, the first line is:
1 60ee748793 basic
Which corresponds to this equation:
\int_{-\epsilon}^\infty dl\: {\rm e}^{-l\zeta} \int_{-\epsilon}^\infty dl' {\rm e}^{-l'\zeta} ll'{l'-l \over l+l'} \{3\,\delta''(l) - {3 \over 4}t\,\delta(l) \} =0. \label{eq21}
But the image 60ee748793 doesn't match. This image matches with the equation of the next line:
2 66667cee5b basic
Which is:
ds^{2} = (1 - {qcos\theta\over r})^{2\over 1 + \alpha^{2}}\lbrace dr^2+r^2d\theta^2+r^2sin^2\theta d\varphi^2\rbrace -{dt^2\over (1 - {qcos\theta\over r})^{2\over 1 + \alpha^{2}}}\, .\label{eq:sps1}
I see that the sample data in the repo starts at line index 0, which would explain why the first line is treated as line 0 and would account for the "offset".
However, it would mean that the first equation doesn't have a matching image.
Did I miss something, or is there really an issue with the dataset?
Thanks for your help!
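If the index column is indeed 0-based, as suspected above, the lookup would work as in the following sketch. The helper name `formula_for` is hypothetical, and the 0-based convention is the poster's hypothesis, not something the dataset documentation quoted here confirms.

```python
def formula_for(entry, formulas):
    """Look up the formula for an entry such as '1 60ee748793 basic'.

    formulas is the list of lines from the formulas file; the first
    column of the entry is assumed to be a 0-based line index.
    """
    idx_str, image_name, _render_type = entry.split()
    return image_name, formulas[int(idx_str)]  # index 1 -> second line
```

Under this reading, the entry with index 1 points at the second line of the formulas file, which matches the observation that image 60ee748793 shows the second equation.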
51238 1a00a76d4e basic appears in im2latex_train.lst.
The formulas around line 51238 in im2latex_formulas.lst do not match the LaTeX content of image 1a00a76d4e.
1a00a76d4e should instead point to line 51729 in im2latex_formulas.lst.
I have found several cases like this, but I am not sure how many there are.
I downloaded the data from https://zenodo.org/record/56198#.XZ7yK_n_yHt.
Is something wrong?
Hi,
I tried to implement this model in PyTorch, but I ran into some problems. As mentioned in your paper (http://arxiv.org/pdf/1609.04938v1.pdf), the experiments were run on a 12GB Nvidia Titan X GPU, and you trained the model for 12 epochs and used the validation perplexity to choose the best model. But in my experiment, the accuracy was still 0 after 20 epochs. After that, I tried training on a small number of samples (10 samples) to check the correctness of the code, and found that the loss did not converge until 500 epochs. I have two GTX 1080Ti GPUs; with batch_size = 16, an epoch takes about 30 minutes (batch_size = 20 runs out of memory). Even if 500 epochs would complete the training, that is far too long. I checked my code repeatedly but didn't find any other errors. I am curious what the problem is and don't understand why there is such a big difference. Could you provide a PyTorch version of the code to learn from? Or what mistakes do you think might cause this problem?
Looking forward to your reply.
Hello Authors:
We modified and trained your model on our PCs and got a fairly high BLEU score on the test dataset. We use a Transformer instead of an RNN or LSTM. But when we use the trained model to predict on local images (for example, a screenshot of a LaTeX formula), the results are not good. We applied some data augmentation, such as a random downsample ratio and random Gaussian blur, but the accuracy on local images is still low. Would you share any thoughts about that? I would greatly appreciate any advice. Thanks!
Hi,
I am trying to replicate your results. I have no issues loading the trained model and testing it on the toy test samples (100), but when I use the same model to get the accuracy on all test samples in the 100K dataset (10355), the test accuracy becomes NaN after some time and I get an error after 2000 samples. I do not understand this behavior. I changed the token length to get rid of warnings, but that did not help. Please let me know if you have faced the same issue.
log.txt
[01/27/19 17:43:52] 1.046239
[01/27/19 17:43:52] Number of samples 2000 - Accuracy = nan
[01/27/19 17:43:54] 1.082996
[01/27/19 17:43:58] 1.228099
[01/27/19 17:44:00] 1.140648
[01/27/19 17:44:03] 1.131666
[01/27/19 17:44:06] 1.043551
[01/27/19 17:44:09] 1.162436
[01/27/19 17:44:11] 1.087319
[01/27/19 17:44:14] 1.575318
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=163 error=59 : device-side assert triggered
/home/mxm7832/torch/install/bin/luajit: /home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:163
stack traceback:
[C]: in function 'v'
/home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Sigmoid_updateOutput'
/home/mxm7832/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: in function 'func'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
src/model/model.lua:360: in function 'feval'
src/model/model.lua:885: in function 'step'
src/train.lua:111: in function 'train'
src/train.lua:289: in function 'main'
src/train.lua:295: in main chunk
[C]: in function 'dofile'
...7832/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
When I used your script to normalize the formula, I found that there was an illegal LaTeX symbol 'ule' generated. After checking, I found that the symbol was originally '\rule', but the program mistook the '\r' in front of the symbol for a carriage return. This problem was solved when I changed the "\ rule {" to "\ \rule {" in line 344 of the preprocess_latex.js.
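The same pitfall can be illustrated in Python, whose string literals treat "\r" the same way JavaScript's do. This is only an analogy in a different language from the repo's preprocess_latex.js, not a patch for that file:

```python
plain = "\rule {"     # "\r" is parsed as a carriage return, leaving "ule {"
escaped = "\\rule {"  # an escaped backslash keeps the literal text \rule {
raw = r"\rule {"      # a raw string gives the same result as escaping

# The unescaped literal silently loses the backslash and the letter "r".
lost_backslash = plain[0] == "\r" and plain[1:] == "ule {"
```

Escaping the backslash (or using a raw string where the language has one) keeps the command intact, which is consistent with the fix described above.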
I took the formulas file from the original dataset and created 7 different renderings of each formula. My training files are now as follows:
vocab file : same as dataset
formulas and formulas.norm files generated by https://github.com/Miffyli/im2latex-dataset
images folder, with all images preprocessed , again using above repo
and also train_filter.lst, validation_filter.lst, and test_filter.lst
When I run training with th src/train.lua -phase train -gpu_id 1 -model_dir model -input_feed -prealloc -data_base_dir input_files/img5 -data_path input_files/train.lst -val_data_path input_files/validation.lst -label_path input_files/formulas.norm.lst -vocab_file input_files/latex_vocab.txt -max_num_tokens 150 -max_image_width 500 -max_image_height 160 -batch_size 10 -beam_size 1, it throws the following error:
[06/28/18 11:52:50] Command Line Arguments:
[06/28/18 11:52:50] -phase train -gpu_id 1 -model_dir model -input_feed -prealloc -data_base_dir input_files/img5 -data_path input_files/train.lst -val_data_path input_files/validation.lst -label_path input_files/formulas.norm.lst -vocab_file input_files/latex_vocab.txt -max_num_tokens 150 -max_image_width 500 -max_image_height 160 -batch_size 10 -beam_size 1
[06/28/18 11:52:50] End Command Line Arguments
[06/28/18 11:52:50] Using CUDA on GPU 1
[06/28/18 11:52:50] Building model
[06/28/18 11:52:50] Creating model with fresh parameters
[06/28/18 11:52:50] Loading vocab from input_files/latex_vocab.txt
[06/28/18 11:52:50] Switching on memory preallocation
[06/28/18 11:52:50] cnn_featuer_size: 512
[06/28/18 11:52:50] dropout: 0.000000
[06/28/18 11:52:50] encoder_num_hidden: 256
[06/28/18 11:52:50] encoder_num_layers: 1
[06/28/18 11:52:50] decoder_num_hidden: 512
[06/28/18 11:52:50] decoder_num_layers: 1
[06/28/18 11:52:50] target_vocab_size: 175
[06/28/18 11:52:50] target_embedding_size: 80
[06/28/18 11:52:50] max_encoder_l_w: 62
[06/28/18 11:52:50] max_encoder_l_h: 20
[06/28/18 11:52:50] max_decoder_l: 151
[06/28/18 11:52:50] input_feed: true
[06/28/18 11:52:50] batch_size: 10
[06/28/18 11:52:50] prealloc: true
[06/28/18 11:52:50] Number of parameters: 9255007
[06/28/18 11:52:58] Data base dir input_files/img5
[06/28/18 11:52:58] Load training data from input_files/train.lst
[06/28/18 11:52:58] Training data loaded from input_files/train.lst
[06/28/18 11:52:58] Load validation data from input_files/validation.lst
[06/28/18 11:52:58] Validation data loaded from input_files/validation.lst
[06/28/18 11:52:58] Lr: 0.100000
/home/saurabh/torch/install/bin/luajit: bad argument #2 to '?' (out of range at /home/saurabh/torch/pkg/torch/generic/Tensor.c:913)
stack traceback:
[C]: at 0x7f9cbfbef590
[C]: in function '__index'
/home/saurabh/torch/install/share/lua/5.1/image/init.lua:1840: in function 'rgb2y'
src/data/data_gen.lua:78: in function 'nextBatch'
src/train.lua:106: in function 'train'
src/train.lua:289: in function 'main'
src/train.lua:295: in main chunk
[C]: in function 'dofile'
...rabh/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Can anyone tell me what might be causing this? Training with the original dataset works fine.
Hi, I have a question about recognizing images that are not from the dataset. I tried cropping images from scientific papers, at a size similar to the images in the dataset. However, the output fails completely, even for simple formulas. For debugging, I also took screenshots of images from the dataset, but the output for the screenshots fails even though the original images succeed. The screenshot and the original image are almost the same, which really confuses me. Am I missing some preprocessing steps, or do I need to retrain the model with different image sizes? I hope this question doesn't sound too naive, but I really need some help. Thank you very much!
Hi! I recently experimented with your model and found that the accuracy is more than 3% higher than what you reported in the paper. I would like to ask whether you have modified the model, or whether you think there may be a problem.
2018-03-27 10:40:10,218 root INFO BLEU = 91.20, 96.9/94.0/91.3/88.6 (BP=0.984, ratio=0.985, hyp_len=537287, ref_len=545740)
2018-03-27 12:02:39,761 root INFO Accuracy (w spaces): 0.821012
2018-03-27 12:02:39,761 root INFO Accuracy (w/o spaces): 0.846854
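For readers checking the BLEU line above, the reported score can be reproduced from the other numbers in that log line using the standard BLEU definition (brevity penalty times the geometric mean of the n-gram precisions). The function name below is hypothetical; the inputs are copied from the log.

```python
import math

def bleu_from_log(precisions, hyp_len, ref_len):
    # Brevity penalty: 1 if the hypothesis is at least as long as the
    # reference, otherwise exp(1 - ref_len / hyp_len).
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the n-gram precisions (given in percent).
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp, bp * math.exp(log_mean)

# Values taken from the log line: 96.9/94.0/91.3/88.6, hyp_len, ref_len.
bp, bleu = bleu_from_log([96.9, 94.0, 91.3, 88.6], hyp_len=537287, ref_len=545740)
```

Because the precisions in the log are rounded to one decimal place, the recomputed score agrees with the reported BLEU only up to rounding.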
when I run
python scripts/preprocessing/preprocess_formulas.py --mode normalize --input-file data/sample/formulas.lst --output-file data/sample/formulas.norm.lst
I get
2016-10-05 05:29:51,614 root INFO Script being executed: scripts/preprocessing/preprocess_formulas.py
D(T)=\left(\begin{array}{cc}a(T)&0\0&a(T)^{-1}\end{array}\right) \ |a(T)|>1
{ [ParseError: KaTeX parse error: Expected 'EOF', got '' at position 68: rray}\right) \̲ |a(T)|>1 ] name: 'ParseError', position: 68 }
A_{ab} \stackrel\mathrm{ def}{\equiv} \frac{\partial ^2L_\mathrm{ q}}{\partial\dot{q}_a^{n_a}\partial \dot{q}_b^{n_b}}.
A _ { a b } \stackrel
[TypeError: Cannot read property 'type' of undefined]
2016-10-05 05:29:52,441 root INFO Jobs finished
I am running it on an ubuntu machine.
Let me know how to fix this.
Hi,
Fabulous work here. I am trying to create a dataset for mathematical logic (set / proof / model theory etc). Is it possible to obtain the code you used to create the image dataset from the LaTeX sources?
Best,
Andrew
When I use the model to test on a large dataset, I get an error. How can I deal with it?
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/cutorch/Tensor.lua:14: cuda runtime error (59) : device-side assert triggered at /root/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c:18
stack traceback:
[C]: in function 'copy'
/root/torch/install/share/lua/5.1/cutorch/Tensor.lua:14: in function 'localize'
src/model/model.lua:598: in function 'feval'
src/model/model.lua:885: in function 'step'
src/train.lua:116: in function 'train'
src/train.lua:300: in function 'main'
src/train.lua:306: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
I get a high generalization error when predicting on pictures of LaTeX formulas from the real world. For example, below is the prediction for one formula picture:
\begin{array} { c c } { { { { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } & { } &
And this is my training result:
EM 14.03 - BLEU-4 74.61 - perplexity -1.42 - Edit 78.67
Has anyone run into the same situation?
I want to test the model on a CPU. How can I do that? I tried setting gpu_id to 0, but it returns an error stating that spatialconvolution requires CUDA. I therefore added the line cudnn.convert(model[1], nn) in model.lua before line 55 to get past that error. But now I'm getting another error:
bad argument #2 to '?' (number expected, got userdata)
stack traceback:
[C]: in ?
[C]: in function '__add'
../im2markup/src/model/model.lua:593: in function 'feval'
../im2markup/src/model/model.lua:887: in function 'step'
../im2markup/src/train.lua:111: in function 'train'
../im2markup/src/train.lua:291: in function 'main'
../im2markup/src/train.lua:297: in main chunk
Any help is appreciated! Thanks
when launching :
th src/train.lua -phase train -gpu_id 1
-model_dir model
-input_feed -prealloc
-data_base_dir data/sample/images_processed/
-data_path data/sample/train_filter.lst
-val_data_path data/sample/validate_filter.lst
-label_path data/sample/formulas.norm.lst
-vocab_file data/sample/latex_vocab.txt
-max_num_tokens 150 -max_image_width 500 -max_image_height 160
-batch_size 20 -beam_size 1
cudnn is not found.
I tried "luarocks install cudnn",
but it still doesn't work.
Hi, I'm trying to train this model on another dataset of real mathematical expressions.
While doing that, a question came up: why do you downsample images in preprocess_images.py? Can I change downsample_ratio to 1.0 for better performance?
Thanks.
My generated vocabulary has 556 tokens. How many does yours have? Can you provide your vocabulary file?
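Vocabulary sizes can differ between runs for reasons such as a different normalization step or a different frequency cutoff. The sketch below shows one simple way to count tokens; it assumes formulas are whitespace-tokenized, and `build_vocab` and `min_count` are hypothetical names rather than the repo's actual script or options.

```python
from collections import Counter

def build_vocab(formulas, min_count=1):
    """Collect tokens that occur at least min_count times.

    Assumes each formula is a string of whitespace-separated tokens.
    """
    counts = Counter(tok for formula in formulas for tok in formula.split())
    return sorted(tok for tok, c in counts.items() if c >= min_count)
```

Running this on the same normalized formulas with different `min_count` values shows how much the cutoff alone changes the vocabulary size.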
I want to download the dataset from http://lstm.seas.harvard.edu/latex/im2text_small.tgz, but the URL redirects to https://lstmvis.vizhub.ai/. Can you please tell me the new dataset URL? Thank you very much.
If possible, could you make a Python package for this?
I would be very grateful!
On page two:
produces a feature grid V of size D × H' × W', where c denotes the number of channels and H' and W' are the reduced sizes from pooling.
I think it should be "D denotes the number of channels"
Hello,
I understand that the model can't generalize unless it is trained on real images of different types together with their OCR labels. We can provide such a dataset, with the aim of reaching accuracy comparable to Mathpix. I don't have the hardware to train, so I need a little help with that. Could you share your email address, if possible?
I took screenshots on my computer to capture formulas from papers, but none of them are recognized. Are there any special preprocessing steps for such data?
Hi, I'm testing this model in two different programming languages, Python and Lua.
But whenever I test it in Python, the results are noticeably worse than with the Lua version (by almost 8 percent in Edit Distance Accuracy).
Can you explain why you implemented it in Lua?
I wanted to run the code on Windows 10, but I got this error and it produces an empty file. I installed Perl, and it still doesn't work...
python scripts/preprocessing/preprocess_filter.py --filter --image-dir data/sample/images_processed --label-path data/sample/formulas.norm.lst --data-path data/sample/validate.lst --output-path data/sample/validate_filter.lst
2022-05-12 07:39:23,971 root INFO Script being executed: scripts/preprocessing/preprocess_formulas.py
'perl' is not recognized as an internal or external command,
operable program or batch file.
2022-05-12 07:39:23,984 root ERROR FAILED: perl -pe 's|hskip(.*?)(cm|in|pt|mm|em)|hspace{\1\2}|g' ../dataset/im2latex_formulas.lst > ../dataset-preprocess/im2latex_formulas.norm.lst
'cat' is not recognized as an internal or external command,
operable program or batch file.
2022-05-12 07:39:23,997 root ERROR FAILED: cat ../dataset-preprocess/im2latex_formulas.norm.lst.tmp | node scripts/preprocessing/preprocess_latex.js normalize > ../dataset-preprocess/im2latex_formulas.norm.lst
2022-05-12 07:39:23,998 root INFO Jobs finished
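On Windows, one workaround for the first failing step is to do the substitution in Python instead of shelling out to perl. The sketch below approximates the perl one-liner shown in the log; the function name `normalize_hskip` is hypothetical, and the later step that pipes through node still needs Node.js installed.

```python
import re

# Same pattern as in the logged perl command: "hskip", a lazily matched
# amount, then a unit.
_HSKIP = re.compile(r"hskip(.*?)(cm|in|pt|mm|em)")

def normalize_hskip(line):
    """Rewrite e.g. '\\hskip 1cm' as '\\hspace{ 1cm}'."""
    return _HSKIP.sub(r"hspace{\1\2}", line)
```

Applying `normalize_hskip` to each line of the formulas file and writing the result out would replace both the `perl` call and the need for `cat` in that step.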
When I trained the model on my own dataset of 300,000 images, the GPU memory usage kept increasing until it was exhausted, which killed the training process.
I am new to Torch and need your help to figure out this problem, @da03.
Hi, I want to generate some handwriting images. Can you share your code?
I found that the provided model has a vocabulary size of 525; however, following the preprocessing, I got a vocabulary of size 496.
After training, I started to evaluate and found the predictions interesting.
The trained model predicts some more complicated LaTeX well, such as fractions or sqrt, but fails on some simpler formulas.
For example,
ground truth is "y=x^+2x +1" but the prediction is "y=x^2+2x +2x + 1".
ground truth is "270" but the prediction is "2700".
The decoder duplicates the last symbol(s).
Any hints on how to tune the model to alleviate this issue?
My training results look reasonable:
Epoch: 11 Step 43142 - Val Accuracy = 0.923066 Perp = 1.137150
Epoch: 12 Step 47064 - Val Accuracy = nan Perp = 1.138024
Hello author,
How can I edit the code to test on another dataset?
How can I edit the code to show the predicted mathematical expressions for new test data?
Thanks.
Can you please help with the steps for generating data through augmentation and preparing the data for training?
I found that some of the experiments in your paper were run on the CROHME dataset, but the CROHME dataset is not in an image format. How do you deal with that? Thank you.
Hi, guys,
I am trying to use the scripts in this repo to preprocess the im2latex dataset, but I ran into this error:
2020-08-26 17:16:23,199 root INFO Script being executed: scripts/preprocessing/preprocess_formulas.py
Traceback (most recent call last):
File "scripts/preprocessing/preprocess_formulas.py", line 87, in <module>
main(sys.argv[1:])
File "scripts/preprocessing/preprocess_formulas.py", line 65, in main
for line in fin:
File "/home/songyuc/software/python/anaconda/anaconda3/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 2270: invalid continuation byte
So, how can I solve this?
Any answer or idea will be appreciated!
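A byte such as 0xe7 that is not valid UTF-8 usually means part of the file is in another encoding. One possible workaround, sketched below, is to try UTF-8 first and fall back to Latin-1. The helper name `read_lines` is hypothetical, and Latin-1 is only a guess at the real encoding; it never fails to decode, so it may produce wrong characters instead of an error.

```python
def read_lines(path, encodings=("utf-8", "latin-1")):
    """Read a text file, trying each encoding in turn.

    The fallback to latin-1 is an assumption about the data, not a
    confirmed property of the formulas file.
    """
    last_error = None
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fin:
                # Iterating the file splits only on \n, \r and \r\n.
                return [line.rstrip("\n") for line in fin]
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error
```

An alternative is `open(path, encoding="utf-8", errors="replace")`, which keeps UTF-8 and substitutes a replacement character for the undecodable bytes.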
Can you explain what the 'Accuracy' value means?
Thanks.
I reproduced the results for printed and handwritten equations. Nice results.
In addition, I generated 2k printed 320x80 fraction equations (e.g., \frac{1}{2}+\frac{1}{2} = 1) as training and validation data.
The training step seems fine, but the test output is the same for any input (e.g., \frac{1}{3}+\frac{1}{3} = \frac{2}{3}).
In this case I set -max_num_tokens 50.
Are there any restrictions on the image format (or shape)? Thank you.
To be honest, I'm not very familiar with Lua, so I would really like to see a Python implementation. It's very important for me. Thank you very much.