facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents

Home Page: https://facebookresearch.github.io/nougat/

License: MIT License

Python 99.67% Dockerfile 0.33%

nougat's Introduction

Nougat: Neural Optical Understanding for Academic Documents

This is the official repository for Nougat, the academic document PDF parser that understands LaTeX math and tables.

Project page: https://facebookresearch.github.io/nougat/

Install

From pip:

pip install nougat-ocr

From repository:

pip install git+https://github.com/facebookresearch/nougat

Note: on Windows, if you want to use a GPU, make sure you install the correct PyTorch version first. Follow the instructions at https://pytorch.org/get-started/locally/

There are extra dependencies if you want to call the model from an API or generate a dataset. Install via

pip install "nougat-ocr[api]" or pip install "nougat-ocr[dataset]"

Get prediction for a PDF

CLI

To get predictions for a PDF run

$ nougat path/to/file.pdf -o output_directory

A path to a directory or to a file where each line is a path to a PDF can also be passed as a positional argument

$ nougat path/to/directory -o output_directory

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
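
To illustrate the --pages format described above, here is a minimal sketch of how a spec such as '1-4,7' could be expanded into page numbers. This is not Nougat's own parser; the function name parse_pages is hypothetical and only shows the intended meaning of the format.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as '1-4,7' into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, stop = part.split("-", 1)
            first, last = int(start), int(stop)
            if first > last:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(first, last + 1))  # both boundaries included
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1-4,7"))  # [1, 2, 3, 4, 7]
```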

The default model tag is 0.1.0-small. If you want to use the base model, use 0.1.0-base.

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base

In the output directory, every PDF is saved as a .mmd file, a lightweight markup language that is mostly compatible with Mathpix Markdown (we make use of the LaTeX tables).

Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of [MISSING_PAGE] responses, try to run with the --no-skipping flag. Related: #11, #67
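
To judge whether a rerun with --no-skipping is worthwhile, it can help to count how many pages were skipped in an output file. The sketch below assumes that skipped pages appear in the .mmd text as bracketed markers starting with MISSING_PAGE (such as [MISSING_PAGE]); the helper name count_missing_pages is hypothetical.

```python
import re

# Matches [MISSING_PAGE] and variants with a suffix, e.g. [MISSING_PAGE_EMPTY:1].
MISSING_PAGE = re.compile(r"\[MISSING_PAGE[^\]]*\]")


def count_missing_pages(mmd_text: str) -> int:
    """Return the number of missing-page markers found in .mmd text."""
    return len(MISSING_PAGE.findall(mmd_text))


sample = "# Title\n\n[MISSING_PAGE_EMPTY:1]\n\nSome text\n\n[MISSING_PAGE]\n"
print(count_missing_pages(sample))  # 2
```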

API

With the extra dependencies installed, you can use app.py to start an API. Call

$ nougat_api

Then get a prediction for a PDF file by making a POST request to http://127.0.0.1:8503/predict/. The endpoint also accepts the parameters start and stop to limit the computation to selected page numbers (boundaries are included).

The response is a string with the markdown text of the document.

curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'

To limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
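
The same request can be sketched in Python. This is a hedged example rather than an official client: the helper names build_predict_url and predict are hypothetical, the third-party requests package is assumed to be installed, and the response body is assumed to be a JSON-encoded string as suggested by the accept header in the curl call.

```python
import os
from urllib.parse import urlencode


def build_predict_url(base="http://127.0.0.1:8503", start=None, stop=None) -> str:
    """Build the /predict/ URL, adding start/stop only when they are given."""
    params = {k: v for k, v in (("start", start), ("stop", stop)) if v is not None}
    url = f"{base.rstrip('/')}/predict/"
    return f"{url}?{urlencode(params)}" if params else url


def predict(pdf_path: str, start=None, stop=None) -> str:
    """POST a PDF to a running nougat_api server and return the markdown text."""
    import requests  # third-party; imported lazily so the URL helper works without it

    with open(pdf_path, "rb") as fh:
        files = {"file": (os.path.basename(pdf_path), fh, "application/pdf")}
        resp = requests.post(
            build_predict_url(start=start, stop=stop), files=files, timeout=600
        )
    resp.raise_for_status()
    return resp.json()  # assumption: the body is a JSON-encoded string


print(build_predict_url(start=1, stop=5))
# http://127.0.0.1:8503/predict/?start=1&stop=5
```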

Dataset

Generate dataset

To generate a dataset you need

A directory containing the PDFs
A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"

Next run

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

Additional arguments include

Argument	Description
`--recompute`	recompute all splits
`--markdown MARKDOWN`	Markdown output dir
`--workers WORKERS`	How many processes to use
`--dpi DPI`	What resolution the pages will be saved at
`--timeout TIMEOUT`	max time per paper in seconds
`--tesseract`	Tesseract OCR prediction for each page

Finally create a jsonl file that contains all the image paths, markdown text and meta information.

python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl

For each jsonl file you also need to generate a seek map for faster data loading:

python -m nougat.dataset.gen_seek file.jsonl
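
To show why a seek map speeds up loading, here is a minimal sketch of the general idea: record the byte offset of every line once, then jump straight to any record without scanning the file. The actual on-disk format written by nougat.dataset.gen_seek is not described here, so build_seek_map and read_record are hypothetical helpers, not the repository's implementation.

```python
import json
import os
import tempfile


def build_seek_map(jsonl_path: str) -> list[int]:
    """Return the byte offset at which each line of a jsonl file starts."""
    offsets = []
    with open(jsonl_path, "rb") as fh:
        while True:
            pos = fh.tell()
            if not fh.readline():
                break
            offsets.append(pos)
    return offsets


def read_record(jsonl_path: str, offsets: list[int], index: int) -> dict:
    """Load a single record by seeking directly to its offset."""
    with open(jsonl_path, "rb") as fh:
        fh.seek(offsets[index])
        return json.loads(fh.readline())


# Small demo on a temporary file with three records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    with open(path, "w", encoding="utf-8", newline="\n") as fh:
        for i in range(3):
            fh.write(json.dumps({"id": i}) + "\n")
    offsets = build_seek_map(path)
    record = read_record(path, offsets, 2)

print(record)  # {'id': 2}
```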

The resulting directory structure can look as follows:

root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map

Note that the .mmd and .json files in the path/paired/output (here images) are no longer required. Removing them can be useful when pushing to an S3 bucket, since it halves the number of files.

Training

To train or fine tune a Nougat model, run

python train.py --config config/train_nougat.yaml

Evaluation

Run

python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json

To get the results for the different text modalities, run

python -m nougat.metrics path/to/results.json

FAQ

Why am I only getting [MISSING_PAGE]?

Nougat was trained on scientific papers found on arXiv and PMC. Is the document you're processing similar to those? What language is the document in? Nougat works best with English papers; other Latin-based languages might work. Chinese, Russian, Japanese etc. will not work. If these requirements are fulfilled, the cause might be false positives in the failure detection when computing on CPU or older GPUs (#11). Try passing the --no-skipping flag for now.

Where can I download the model checkpoint from?

They are uploaded here on GitHub in the release section. They can also be downloaded during the first execution of the program. Choose the preferred model by passing --model 0.1.0-{base,small}

Citation

@misc{blecher2023nougat,
      title={Nougat: Neural Optical Understanding for Academic Documents}, 
      author={Lukas Blecher and Guillem Cucurull and Thomas Scialom and Robert Stojnic},
      year={2023},
      eprint={2308.13418},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgments

This repository builds on top of the Donut repository.

License

Nougat codebase is licensed under MIT.

Nougat model weights are licensed under CC-BY-NC.


nougat's Issues

RuntimeError: CUDA error: out of memory

Does the current code support distributed training? I trained on a machine with 4 GPUs, each with 24GB of memory, but still encountered the error RuntimeError: CUDA error: out of memory
How can I solve this problem?

Can it be used to ocr non-English text?

Datasets examples

Can you provide some examples of the training datasets?
A few exact examples from train.jsonl would help.

TypeError: BARTDecoder.prepare_inputs_for_inference() got an unexpected keyword argument 'past_key_values'

Document solution to `No GPU found. Conversion on CPU is very slow.`

I think it would be cool to add information to the "Install" section of the README that in order to use the GPU, one needs to install CUDA 11.7 or 11.8 and pytorch according to https://pytorch.org/get-started/locally/ (so something like pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117) on Windows before running pip install nougat-ocr (which otherwise installs only the CPU versions). I don't know though if these steps are only needed on Windows or also on other platforms.

Thanks a lot for developing this, it performs much better than the previous stuff I dabbled with.
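
A quick way to check whether the installed PyTorch build can see a GPU is sketched below. It is only a diagnostic aid under the assumption that a CPU-only build is the cause of the warning; the function name describe_device is hypothetical.

```python
def describe_device() -> str:
    """Report whether the installed PyTorch build can use a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return f"cuda ({torch.cuda.get_device_name(0)})"
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return f"cpu (torch built with CUDA: {torch.version.cuda})"


print(describe_device())
```

If this reports a CPU-only build on a machine with an NVIDIA GPU, reinstalling PyTorch with a CUDA-enabled wheel as described above is the likely fix.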

Chinese model

When will a Chinese version be released?

module 'albumentations' has no attribute 'Affine'. Did you mean: 'IAAAffine'?

MacBook Pro M1 Max, OS: Ventura 13.4

install cmd:
pip3.10 install git+https://github.com/facebookresearch/nougat

get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 5, in <module>
    from predict import main
  File "/usr/local/lib/python3.10/site-packages/predict.py", line 17, in <module>
    from nougat import NougatModel
  File "/usr/local/lib/python3.10/site-packages/nougat/__init__.py", line 7, in <module>
    from .model import NougatConfig, NougatModel
  File "/usr/local/lib/python3.10/site-packages/nougat/model.py", line 34, in <module>
    from nougat.transforms import train_transform, test_transform
  File "/usr/local/lib/python3.10/site-packages/nougat/transforms.py", line 74, in <module>
    alb.Affine(shear={"x": (0, 3), "y": (-3, 0)}, cval=(255, 255, 255), p=0.03),
AttributeError: module 'albumentations' has no attribute 'Affine'. Did you mean: 'IAAAffine'?
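
One plausible cause of this error is an installed albumentations release that does not provide Affine. The sketch below reports the installed version and whether the attribute exists; check_attr is a hypothetical helper, not part of Nougat or albumentations.

```python
import importlib
from importlib import metadata


def check_attr(package: str, attr: str) -> str:
    """Report the installed version of a package and whether it exposes an attribute."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return f"{package} is not installed"
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "unknown"  # e.g. standard-library modules have no distribution
    status = "has" if hasattr(module, attr) else "lacks"
    return f"{package} {version} {status} {attr}"


print(check_attr("albumentations", "Affine"))
```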

about e-book data

Was the e-book data used for training LLaMA generated by Nougat?

It doesn't seem to run on macOS?

Env:

OS: macOS 13.4 22F66 x86_64
Host: MacBookPro16,1
Kernel: 22.5.0
CPU: Intel i9-9880H (16) @ 2.30GHz
GPU: Intel UHD Graphics 630, AMD Radeon Pro 55
Memory: 35222MiB / 65536MiB

Error

WARNING:root:No GPU found. Conversion on CPU is very slow.
/Users/zero/anaconda3/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1678402629672/work/aten/src/ATen/native/TensorShape.cpp:3484.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  0%|                                                     | 0/3 [00:00<?, ?it/s]

I'm having some difficulty running nougat on macOS:

I found no way to use the GPU AMD Radeon Pro 55 with this program.
Even after 10 minutes, there's still no progress bar to display.

How to switch between "0.1.0-small" and "0.1.0-base"?

I have tried using nougat

nougat .\Nougat- Neural Optical Understanding for Academic Documents.pdf -o ./

It downloaded the 0.1.0-small version. How can I use the 0.1.0-base version for comparison?

Test dataset for benchmark evaluation

Thanks for your wonderful work!

As noted in your paper, there seems to be a lack of public benchmarks for academic documents. Would you kindly consider releasing your test dataset as a benchmark, allowing for comparative analysis?

Add setup instructions for this repo

I'm trying to run and modify this source code to be able to feed images directly into the model. The setup fails for me every time with a bunch of errors like:

python3.10/site-packages/pkg_resources/_vendor/packaging/requirements.py", line 37, in __init__
    raise InvalidRequirement(str(e)) from e
pkg_resources.extern.packaging.requirements.InvalidRequirement: .* suffix can only be used with `==` or `!=` operators
    torch (>=1.9.*)

when running setup.py install.

With the HuggingFace repo I can install all the requirements and get the app running. I figured it might be helpful to add to the repo which tooling is used for getting the model running within this source code.

I also know it was recommended to use LaTeX-OCR for my purposes in here but I've had poor performance with it (might be OS related) and wanted to see how this could do.

Any comparisons with commercial software Mathpix

Mathpix is one of the leading companies for PDF to LaTeX conversion. It also does pretty well on math equations. Have you conducted any experiments with Mathpix so that we can get some idea of how well the model performs in comparison?

Mathpix: https://mathpix.com/pdf-to-latex

getting x,y positions of the text in the original image?

Is there a way to compute rect boxes for the text detected? Or know roughly the starting x,y coordinate of a text paragraph?

How to train other languages

Checkpoint download fails after an apparently successful installation

WARNING:root:No GPU found. Conversion on CPU is very slow.
downloading nougat checkpoint version 0.1.0-small to path /root/.cache/torch/hub/nougat
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 790, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 536, in _make_request
response = conn.getresponse()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 461, in getresponse
httplib_response = super().getresponse()
File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 844, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/util.py", line 38, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 790, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 536, in _make_request
response = conn.getresponse()
File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 461, in getresponse
httplib_response = super().getresponse()
File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/nougat", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/predict.py", line 77, in main
args = get_args()
File "/usr/local/lib/python3.10/dist-packages/predict.py", line 55, in get_args
args.checkpoint = get_checkpoint(args.checkpoint)
File "/usr/local/lib/python3.10/dist-packages/nougat/utils/checkpoint.py", line 65, in get_checkpoint
download_checkpoint(checkpoint)
File "/usr/local/lib/python3.10/dist-packages/nougat/utils/checkpoint.py", line 50, in download_checkpoint
binary_file = download_as_bytes_with_progress(download_url, file)
File "/usr/local/lib/python3.10/dist-packages/nougat/utils/checkpoint.py", line 21, in download_as_bytes_with_progress
resp = requests.get(url, stream=True, allow_redirects=True)
File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
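
Connection errors like the one above are sometimes transient, so retrying the download with a growing delay may help. The wrapper below is a generic sketch, not something Nougat ships: retry is a hypothetical helper, and it relies on the fact that requests' ConnectionError derives from OSError.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(OSError,)):
    """Call func(), retrying with exponential backoff when it raises one of `exceptions`."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in for the download that fails twice, then succeeds.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Remote end closed connection without response")
    return "ok"


result = retry(flaky_download, attempts=5, base_delay=0.0)
print(result, calls["count"])  # ok 3
```

If the download keeps failing, the checkpoints can also be fetched manually from the release section and passed with --checkpoint.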

any pretrained model ?

No output MMD File for some PDFs

Greetings! There seems no output files when I apply Nougat to some documents. The output looks like this:

$ nougat ch-01.pdf -o .
...
WARNING:root:Found repetitions in sample 2
WARNING:root:Found repetitions in sample 3
WARNING:root:Skipping page 11 due to repetitions.
WARNING:root:Skipping page 12 due to repetitions.
 50%|█████████████████████████████████████                                     | 3/6 [01:16<01:06, 22.18s/it]WARNING:root:Found repetitions in sample 3
WARNING:root:Skipping page 16 due to repetitions.
100%|██████████████████████████████████████████████████████████████████████████| 6/6 [02:28<00:00, 24.71s/it]

What do these warning messages mean? Is there any way I can successfully get an MMD file? Looking forward to your reply!

Please tell me how to use the nougat_api interface, what parameters it accepts, and how it can receive requests from other servers?

web

MPS acceleration

Hi all,
Why doesn't the code support MPS acceleration? Should it work, or has it not been implemented yet?

No GPU found

I tried converting a PDF file, but I get the following warning:

WARNING:root:No GPU found

My machine does have a modern nvidia GPU with an up-to-date driver.

Pytorch Multi-GPU Support?

I installed pytorch to allow Nougat to work with my GPUs, but unfortunately only one (of two) is being used. How can I get pytorch to access the shared memory as opposed to defaulting to one GPU and using its memory?

Allow torch script export

Hi, first of all, thanks a lot for providing such an amazing model.

Probably this isn't at the top of your agenda, but your models are not compatible with TensorRT or TorchScript because some places in your code use the numpy library. I'm referring to:

from nougat.transforms import train_transform, test_transform

This code uses calls like np.asarray, which trigger errors such as:

Traceback (most recent call last):
  File "/home/philip/latex_training/deploy.py", line 221, in <module>
    main()
  File "/home/philip/latex_training/deploy.py", line 149, in main
    module = torch.jit.script(nougat_wrapper)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_script.py", line 1284, in script
    return torch.jit._recursive.create_script_module(
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 480, in create_script_module
    return create_script_module_impl(nn_module, concrete_type, stubs_fn)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_script.py", line 614, in _construct
    init_fn(script_module)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 520, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 542, in create_script_module_impl
    script_module = torch.jit.RecursiveScriptModule._construct(cpp_module, init_fn)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_script.py", line 614, in _construct
    init_fn(script_module)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 520, in init_fn
    scripted = create_script_module_impl(orig_value, sub_concrete_type, stubs_fn)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 546, in create_script_module_impl
    create_methods_and_properties_from_stubs(concrete_type, method_stubs, property_stubs)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 397, in create_methods_and_properties_from_stubs
    concrete_type._create_methods_and_properties(property_defs, property_rcbs, method_defs, method_rcbs, method_defaults)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_recursive.py", line 867, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/torch/jit/_script.py", line 1341, in script
    fn = torch._C._jit_script_compile(
RuntimeError: 
Python builtin <built-in function asarray> is currently not supported in Torchscript:
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/nougat/transforms.py", line 18
    def f(im):
        return transform(image=np.asarray(im))["image"]
                               ~~~~~~~~~~ <--- HERE
'f' is being compiled since it was called from 'SwinEncoder.__to_tensor_getter'
  File "/home/philip/latex_training/env/lib/python3.10/site-packages/nougat/model.py", line 144
    def to_tensor(self):
        if self.training:
            return train_transform
                   ~~~~~~~~~~~~~~~ <--- HERE
        else:
            return test_transform

As far as I know, Pytorch has no plans to make numpy arrays compatible.

Containerising the API

I'd like to write a Dockerfile to deploy the API in a container. Would you be open to this contribution? Any particular requirements for it if so? I was going to go with a slim Python 3.11 base image.

load error with the current pretrained model

When I use the current pretrained model (small version) from https://github.com/facebookresearch/nougat/releases, I get this error:
Traceback (most recent call last):
File "/home/mengfanqing/nougat-0.1.0-small/train.py", line 208, in
train(config)
File "/home/mengfanqing/nougat-0.1.0-small/train.py", line 182, in train
trainer.fit(model_module, data_module, ckpt_path=config.get("resume_from_checkpoint_path", None))
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 946, in _run
self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 399, in _restore_modules_and_callbacks
self.resume_start(checkpoint_path)
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 84, in resume_start
self._loaded_checkpoint = _pl_migrate_checkpoint(loaded_checkpoint, checkpoint_path)
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/utilities/migration/utils.py", line 142, in _pl_migrate_checkpoint
old_version = _get_version(checkpoint)
File "/home/mengfanqing/anaconda3/envs/naugat_official/lib/python3.10/site-packages/pytorch_lightning/utilities/migration/utils.py", line 163, in _get_version
return checkpoint["pytorch-lightning_version"]
KeyError: 'pytorch-lightning_version'

This is my environment: packages in environment at /home/mengfanqing/anaconda3/envs/naugat_official:
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
aiohttp 3.8.5 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
albumentations 1.3.1 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.05.30 h06a4308_0
certifi 2023.7.22 pypi_0 pypi
charset-normalizer 3.2.0 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cmake 3.27.2 pypi_0 pypi
datasets 2.14.4 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
filelock 3.12.3 pypi_0 pypi
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.9.0 pypi_0 pypi
huggingface-hub 0.16.4 pypi_0 pypi
idna 3.4 pypi_0 pypi
imageio 2.31.3 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
lazy-loader 0.3 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
levenshtein 0.21.1 pypi_0 pypi
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
lightning-utilities 0.9.0 pypi_0 pypi
lit 16.0.6 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
munch 4.0.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.1 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
nougat-ocr 0.1.0 pypi_0 pypi
numpy 1.25.2 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
opencv-python-headless 4.8.0.76 pypi_0 pypi
openssl 3.0.10 h7f8727e_2
orjson 3.9.5 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pandas 2.1.0 pypi_0 pypi
pillow 10.0.0 pypi_0 pypi
pip 23.2.1 py310h06a4308_0
pyarrow 13.0.0 pypi_0 pypi
pymupdf 1.23.3 pypi_0 pypi
pymupdfb 1.23.3 pypi_0 pypi
python 3.10.12 h955ad1f_0
python-dateutil 2.8.2 pypi_0 pypi
python-levenshtein 0.21.1 pypi_0 pypi
pytorch-lightning 2.0.8 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pywavelets 1.4.1 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
qudida 0.0.4 pypi_0 pypi
rapidfuzz 3.2.0 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2023.8.8 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
ruamel-yaml 0.17.32 pypi_0 pypi
ruamel-yaml-clib 0.2.7 pypi_0 pypi
scikit-image 0.21.0 pypi_0 pypi
scikit-learn 1.3.0 pypi_0 pypi
scipy 1.11.2 pypi_0 pypi
sconf 0.2.5 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 68.0.0 py310h06a4308_0
six 1.16.0 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
sympy 1.12 pypi_0 pypi
threadpoolctl 3.2.0 pypi_0 pypi
tifffile 2023.8.30 pypi_0 pypi
timm 0.5.4 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.3 pypi_0 pypi
torch 2.0.1 pypi_0 pypi
torchmetrics 1.1.1 pypi_0 pypi
torchvision 0.15.2 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
transformers 4.25.1 pypi_0 pypi
triton 2.0.0 pypi_0 pypi
typing-extensions 4.7.1 pypi_0 pypi
tzdata 2023.3 pypi_0 pypi
urllib3 2.0.4 pypi_0 pypi
wheel 0.38.4 py310h06a4308_0
xxhash 3.3.0 pypi_0 pypi
xz 5.4.2 h5eee18b_0
yarl 1.9.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_0

I only changed one item in config/train_config.yml: resume_from_checkpoint_path: '/home/mengfanqing/nougat-0.1.0-small/model/pytorch_model.bin', which points to the small version from https://github.com/facebookresearch/nougat/releases. I wonder whether there is something wrong with the checkpoint?

different output when test 'Neural Attentive Circuits'

ModFC and ModFFN. The Modulated Fully-Connected (ModFC) layer is the component that enables a module to condition its computation on a code. It replaces the fully-connected layer and supports a multiplicative conditioning of the input vector \(\bm{x}\) by the code vector \(\bm{c}\). It can be expressed

\[\bm{y}=\text{ModFC}(\bm{x};\bm{c})=\bm{W}(\bm{x}\odot(1+\alpha\,\text{LayerNorm }(\bm{W}_{c}\bm{c})))+\bm{b},\] (1)

In Mathpix, this renders as:

ModFC and ModFFN. The Modulated Fully-Connected (ModFC) layer is the component that enables a module to condition its computation on a code. It replaces the fully-connected layer and supports a multiplicative conditioning of the input vector 
Undefined control sequence \bm
 by the code vector 
Undefined control sequence \bm
. It can be expressed
Undefined control sequence \bm
(1)

I noticed 'MODEL_TAG = "0.1.0-small"' in nougat/utils/checkpoint.py.
I'm interested in obtaining a more advanced or accurate checkpoint to reproduce the output shown on https://facebookresearch.github.io/nougat/. Could you guide me on how I can achieve this?

ValueError: batch_size should be a positive integer value, but got batch_size=0

nougat sd.pdf
WARNING:root:No output directory. Output will be printed to console.
/home/caiyu/miniconda3/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
sampler <torch.utils.data.sampler.SequentialSampler object at 0x7f9a052ef3d0> 0 False
Traceback (most recent call last):
File "/home/caiyu/miniconda3/bin/nougat", line 8, in
sys.exit(main())
^^^^^^
File "/home/caiyu/miniconda3/lib/python3.11/site-packages/predict.py", line 113, in main
dataloader = torch.utils.data.DataLoader(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/caiyu/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 358, in init
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/caiyu/miniconda3/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 232, in init
raise ValueError("batch_size should be a positive integer value, "
ValueError: batch_size should be a positive integer value, but got batch_size=0
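
The traceback shows the DataLoader receiving batch_size=0. Since the CLI accepts --batchsize/-b, passing it explicitly (for example -b 1) may avoid the error. A guard of the following kind would also prevent it; effective_batch_size is a hypothetical helper and not part of Nougat.

```python
def effective_batch_size(requested: int) -> int:
    """Clamp a computed or requested batch size to at least 1."""
    return max(1, int(requested))


print(effective_batch_size(0))  # 1
print(effective_batch_size(4))  # 4
```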

Binary file of pdffigures2

Could you please provide the binary file for pdffigures2? Due to some reasons, I am unable to build pdffigures2 myself.

Optimize DataLoader to Adapt to System Resources

Description:

Current Behavior:

Currently, the DataLoader parameters like num_workers, pin_memory, and shuffle are hard-coded. This doesn't allow the software to adapt to different system configurations, leading to potentially inefficient use of resources.

Files:

lightning_module.py: The NougatDataPLModule class where the DataLoader is configured.

Proposed Changes:

Dynamic num_workers: Update the num_workers parameter to be set dynamically based on the number of available CPU cores.
```
num_workers = min(8, multiprocessing.cpu_count() - 1)
```
Dynamic pin_memory: Enable pin_memory when CUDA is available.
```
pin_memory = torch.cuda.is_available()
```
Dynamic shuffle: Although shuffling is generally good for training, making this parameter dynamic allows it to be turned off for validation and testing.

Expected Outcome:

By making these changes, the DataLoader can be more efficient and adaptable to different system resources.
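
The three proposed changes could be combined roughly as follows. This is a sketch under stated assumptions, not the repository's code: dataloader_kwargs is a hypothetical helper, the cap of 8 workers comes from the snippet above, and the caller is expected to pass the result of torch.cuda.is_available() as cuda_available.

```python
import os


def dataloader_kwargs(training: bool, cuda_available: bool, max_workers: int = 8) -> dict:
    """Derive DataLoader settings from the machine instead of hard-coding them."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return {
        # Leave one core free; never go below 0 (0 means load in the main process).
        "num_workers": max(0, min(max_workers, cpu_count - 1)),
        # Pinned memory only helps when batches are copied to a CUDA device.
        "pin_memory": cuda_available,
        # Shuffle for training only; keep validation and test order stable.
        "shuffle": training,
    }


print(dataloader_kwargs(training=True, cuda_available=False))
```

The returned dictionary can then be unpacked into torch.utils.data.DataLoader(dataset, batch_size=..., **kwargs).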

no result

I used this model to convert a PDF into .mmd, but the .mmd contains only [MISSING_PAGE_EMPTY:1]. No result.

AttributeError: module 'cv2.dnn' has no attribute 'DictValue'

When trying to run the CLI on any PDF, I run into the error above. Here's the stack trace.

Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 5, in <module>
    from predict import main
  File "/usr/local/lib/python3.8/dist-packages/predict.py", line 17, in <module>
    from nougat import NougatModel
  File "/usr/local/lib/python3.8/dist-packages/nougat/__init__.py", line 7, in <module>
    from .model import NougatConfig, NougatModel
  File "/usr/local/lib/python3.8/dist-packages/nougat/model.py", line 16, in <module>
    import cv2
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 175, in bootstrap
    if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
    py_module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.8/dist-packages/cv2/typing/__init__.py", line 169, in <module>
    LayerId = cv2.dnn.DictValue

When will Chinese be supported?

I just tried Nougat, unfortunately it does not support Chinese. When will it be possible to support the conversion of Chinese documents?

Detail about generating dataset

I am trying to generate a dataset, which involves converting .tex to .html with LaTeXML and running nougat.dataset.split_htmls_to_pages, but I ran into some problems.
Problems:

If the .tex file has a \begin{figure} block, no page is recognized.
No page is recognized without the --nocrossref parameter to latexmlpost; but if I use --nocrossref, the reference content, such as the number in square brackets, disappears, as in the image below.

Env:
Ubuntu 16.04 & 22.04
LaTeXML = 0.8.6 & 0.8.7
Python = 3.10
Tex and PDF file source: https://arxiv.org/abs/adap-org/9912004

Question:

How can I solve the problems mentioned above?
How do we achieve the features described in the paper, including replacing user-defined macros, standardizing whitespace, adding optional brackets, normalizing tables, and replacing references and citations with their correct numbers? Does LaTeXML handle some of them, and how?
Is there a recommended version of LaTeXML? Could you please provide the latexml and latexmlpost commands you use?

Does it support languages other than English? I wanted to try it on Japanese text but I can't.

Kindly let me know if there is a chance or a method to use it on Japanese text.

Implementation of more informative and understandable docstrings

I would like to add more docstrings to the project.

What is the version of pytorch_lightning?

The latest pytorch_lightning does not have Trainer parameters such as resume_from_checkpoint and gpus. Could you tell me which version of pytorch_lightning you use? Moreover, could you provide a requirements.txt for this repo?
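
For context, both arguments were removed from `Trainer` in PyTorch Lightning 2.0: `gpus` is expressed as `accelerator="gpu"` plus `devices`, and the checkpoint path is passed to `trainer.fit(..., ckpt_path=...)`. Below is a rough sketch of that translation for anyone who wants to run on a 2.x release; `migrate_trainer_kwargs` is a hypothetical helper, not part of this repository or of Lightning, and it only covers these two arguments.

```python
from typing import Any, Dict, Tuple


def migrate_trainer_kwargs(old: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split 1.x-style Trainer kwargs into (Trainer kwargs, fit kwargs) for 2.x."""
    trainer_kwargs = dict(old)  # copy so the caller's dict is left untouched
    fit_kwargs: Dict[str, Any] = {}

    # `gpus=N` became `accelerator="gpu", devices=N`.
    gpus = trainer_kwargs.pop("gpus", None)
    if gpus:
        trainer_kwargs["accelerator"] = "gpu"
        trainer_kwargs["devices"] = gpus

    # `resume_from_checkpoint=path` moved to `trainer.fit(..., ckpt_path=path)`.
    ckpt = trainer_kwargs.pop("resume_from_checkpoint", None)
    if ckpt is not None:
        fit_kwargs["ckpt_path"] = ckpt

    return trainer_kwargs, fit_kwargs


if __name__ == "__main__":
    print(migrate_trainer_kwargs({"gpus": 1, "resume_from_checkpoint": "last.ckpt"}))
```

The two dictionaries would then be used as `pl.Trainer(**trainer_kwargs)` and `trainer.fit(model, datamodule, **fit_kwargs)`.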

Provide an end-to-end fine-tuning notebook.

Hi, I want to experiment with fine-tuning Nougat on another language. Please provide a notebook for fine-tuning.

Can Nougat support Chinese characters?

It works for English PDF files, but it does not seem to handle PDFs that contain Chinese or Japanese characters.
Can I train it myself? I think it would be quite difficult for me to prepare the training data.

WARNING:root:Found repetitions in sample 0

I tried applying Nougat to individual pages of the PDF of the EPR paper https://cds.cern.ch/record/405662/files/PhysRev.47.777.pdf.

While the first page works and I get a print out of the text, pages 2-4 don't work. I get errors like this:

504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  0%|                                                       | 0/1 [00:00<?, ?it/s]WARNING:root:Found repetitions in sample 0
INFO:root:Processing file pg_0004.pdf with 1 pages
[MISSING_PAGE_EMPTY:1]

Add support for model tag selection in cli `--model`

The default model tag is 0.1.0-small. If we want to use the base model (or other tags in the future), we should be able to pass 0.1.0-base as a model tag like this:

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base
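
A minimal sketch of how such a flag could be wired with argparse. This is not the repository's actual predict.py; `build_parser` is a hypothetical function, and only the options needed to illustrate the request are included. The default tag is the one named above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced CLI parser that accepts a model tag (illustrative only)."""
    parser = argparse.ArgumentParser(prog="nougat")
    parser.add_argument("pdf", nargs="+", help="PDF(s) to process.")
    parser.add_argument("--out", "-o", help="Output directory.")
    # The tag falls back to the small model when -m/--model is not given.
    parser.add_argument("--model", "-m", default="0.1.0-small", help="Model tag to use.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["path/to/file.pdf", "-o", "output_directory", "-m", "0.1.0-base"]
    )
    print(args.model)
```

The parsed `args.model` would then be handed to whatever code resolves or downloads the checkpoint for that tag.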

The markdown file is empty

After running the command nougat xxx.pdf, I got the .mmd file, but it has no content. What's the reason?

Torch.Size Error

download 0.1.0-base
pip install nougat-ocr
nougat test.pdf -c nougat_model_base -o ./

ERROR Message:
File "/path/lib/python3.9/site-packages/predict.py", line 78, in main
model = NougatModel.from_pretrained(args.checkpoint).to(torch.bfloat16)
File "/path/lib/python3.9/site-packages/nougat/model.py", line 682, in from_pretrained
model = super(NougatModel, cls).from_pretrained(
File "/path/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2379, in from_pretrained
) = cls._load_pretrained_model(
File "/path/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2695, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.class.name}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for NougatModel:
size mismatch for decoder.model.model.decoder.embed_positions.weight: copying a param with shape torch.Size([4098, 1024]) from checkpoint, the shape in current model is torch.Size([3586, 1024]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.
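
A note on reading the two shapes in this message. Assuming the decoder uses an mBART-style learned positional embedding, which allocates 2 extra rows beyond the configured maximum length, the row counts translate into maximum sequence lengths as sketched below. The offset is an assumption about the decoder, not something stated in the error itself.

```python
# Assumption: mBART-style learned positional embeddings reserve 2 extra rows.
POSITION_OFFSET = 2


def max_positions_from_rows(rows: int) -> int:
    """Recover the configured maximum number of positions from the embedding row count."""
    return rows - POSITION_OFFSET


checkpoint_rows = 4098  # shape reported for the checkpoint
model_rows = 3586       # shape reported for the freshly built model

print(max_positions_from_rows(checkpoint_rows))  # 4096
print(max_positions_from_rows(model_rows))       # 3584
```

Under that assumption, the checkpoint was trained with a maximum length of 4096 while the model was instantiated with 3584, which suggests the configuration used to build the model does not match the downloaded weights.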

for idx, sample in tqdm(enumerate(dataloader), total=len(dataloader)): for prediction

If we are using just a single PDF, sample is a list, but inference expects an image tensor, so the code below does not work. We should pass sample[0] instead, where sample[0] is the tensor stored at index 0 of the list.
model_output = model.inference(image_tensors=sample)

This is a function to which I pass a single PDF file and make a prediction for each page:

from functools import partial

import torch
from tqdm import tqdm

from nougat import NougatModel
from nougat.postprocessing import markdown_compatible
from nougat.utils.dataset import LazyDataset


def predict():
    # Load the pretrained Nougat model from a local checkpoint directory.
    model = NougatModel.from_pretrained(
        "C:/Users/sshamsu/Documents/New folder/nougat weights"
    ).to(torch.bfloat16)
    if torch.cuda.is_available():
        model.to("cuda")

    # LazyDataset takes the path of the PDF and a function that prepares each page image.
    dataset = LazyDataset(
        "C:/Users/sshamsu/Downloads/research paper for Nought.pdf",
        partial(model.encoder.prepare_input, random_padding=False),
    )
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=1,
        shuffle=False,
        collate_fn=LazyDataset.ignore_none_collate,
    )

    prediction = []
    for page_num, page_as_tensor in tqdm(enumerate(dataloader)):
        # page_as_tensor is a list; the image tensor is stored at index 0.
        model_output = model.inference(image_tensors=page_as_tensor[0])
        output = markdown_compatible(model_output["predictions"][0])
        prediction.append(output)

    final_mmd = "".join(prediction).strip()
    return final_mmd

zipfile.BadZipFile: File is not a zip file

/home/huangwei/.local/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
0%| | 0/3 [00:07<?, ?it/s]
Traceback (most recent call last):
File "/home/huangwei/.local/bin/nougat", line 8, in
sys.exit(main())
File "/home/huangwei/.local/lib/python3.8/site-packages/predict.py", line 130, in main
model_output = model.inference(image_tensors=sample)
File "/home/huangwei/.local/lib/python3.8/site-packages/nougat/model.py", line 653, in inference
output["predictions"] = postprocess(
File "/home/huangwei/.local/lib/python3.8/site-packages/nougat/postprocessing.py", line 504, in postprocess
return [
File "/home/huangwei/.local/lib/python3.8/site-packages/nougat/postprocessing.py", line 505, in
postprocess_single(s, markdown_fix=markdown_fix) for s in generation
File "/home/huangwei/.local/lib/python3.8/site-packages/nougat/postprocessing.py", line 435, in postprocess_single
if last_word in words.words():
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/corpus/util.py", line 121, in getattr
self.__load()
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/corpus/util.py", line 81, in __load
root = nltk.data.find(f"{self.subdir}/{self.__name}")
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/data.py", line 555, in find
return find(modified_name, paths)
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/data.py", line 542, in find
return ZipFilePathPointer(p, zipentry)
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/data.py", line 394, in init
zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "/home/huangwei/.local/lib/python3.8/site-packages/nltk/data.py", line 935, in init
zipfile.ZipFile.init(self, filename)
File "/usr/lib/python3.8/zipfile.py", line 1269, in init
self._RealGetContents()
File "/usr/lib/python3.8/zipfile.py", line 1336, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

KeyError:'SLURM_NODEID'

When I run python train.py --config config/train_nougat.yaml, I get the following error:

Traceback (most recent call last):
File "train.py", line 220, in
train(config)
File "train.py", line 173, in train
trainer = pl.Trainer(
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/pytorch_lightning/utilities/argparse.py", line 348, in insert_env_defaults
return fn(self, **kwargs)
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 419, in init
self._accelerator_connector = AcceleratorConnector(
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 221, in init
self._lazy_init_strategy()
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 828, in _lazy_init_strategy
self.strategy.set_world_ranks()
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 211, in set_world_ranks
self.cluster_environment.set_global_rank(self.node_rank * self.num_processes + self.local_rank)
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/pytorch_lightning/strategies/parallel.py", line 60, in node_rank
return self.cluster_environment.node_rank() if self.cluster_environment is not None else 0
File "/root/miniconda3/envs/test_38/lib/python3.8/site-packages/lightning_fabric/plugins/environments/slurm.py", line 136, in node_rank
return int(os.environ["SLURM_NODEID"])
File "/root/miniconda3/envs/test_38/lib/python3.8/os.py", line 673, in getitem
raise KeyError(key) from None
KeyError: 'SLURM_NODEID'

Why does this problem arise? Are there any other special settings required?

How to make it work for any image

I wrote an equation on white paper, took a photo of it with my phone, and then used Nougat to extract the equation, but I got nothing. What's the problem?

Floating point exception (core dumped)

Floating point exception (core dumped)

pip list | grep torch
pytorch-lightning 2.0.8
torch 2.0.1
torchmetrics 1.1.1
torchvision 0.15.2

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

CUDA Version: 11.7

About export to onnx model?

Hi, thank you for sharing such great work!

I want to ask how to convert the Nougat model to an ONNX file and run inference with onnxruntime.