
marker's Introduction

Marker

Marker converts PDF to markdown quickly and accurately.

  • Supports a wide range of documents (optimized for books and scientific papers)
  • Supports all languages
  • Removes headers/footers/other artifacts
  • Formats tables and code blocks
  • Extracts and saves images along with the markdown
  • Converts most equations to LaTeX
  • Works on GPU, CPU, or MPS

How it works

Marker is a pipeline of deep learning models:

  • Extract text, OCR if necessary (heuristics, surya, tesseract)
  • Detect page layout and find reading order (surya)
  • Clean and format each block (heuristics, texify)
  • Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)

It only uses models where necessary, which improves speed and accuracy.

Examples

| PDF | Type | Marker | Nougat |
|-----|------|--------|--------|
| Think Python | Textbook | View | View |
| Think OS | Textbook | View | View |
| Switch Transformers | arXiv paper | View | View |
| Multi-column CNN | arXiv paper | View | View |

Performance

[Chart: overall benchmark results]

The above results are with marker and nougat set up so that each takes ~4GB of VRAM on an A6000.

See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

Hosted API

There is a hosted API for marker available here. It has been tuned for performance, and generally takes 10s + 1s/page for conversion.

Commercial usage

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Community

Discord is where we discuss future development.

Limitations

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

  • Marker will not convert 100% of equations to LaTeX. This is because it first has to detect each equation and then convert it, and either step can miss.
  • Tables are not always formatted 100% correctly - text can be in the wrong column.
  • Whitespace and indentations are not always respected.
  • Not all lines/spans will be joined properly.
  • This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.

Installation

You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See here for more details.

Install with:

pip install marker-pdf

Optional: OCRMyPDF

Only needed if you want to use ocrmypdf as the OCR backend. Note that ocrmypdf includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions.

See the instructions here.

Usage

First, some configuration (a sample set of overrides follows this list):

  • Inspect the settings in marker/settings.py. You can override any settings with environment variables.
  • Your torch device will be automatically detected, but you can override this. For example, TORCH_DEVICE=cuda.
    • If using GPU, set INFERENCE_RAM to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set INFERENCE_RAM=16.
    • Depending on your document types, marker's average memory usage per task can vary slightly. You can configure VRAM_PER_TASK to adjust this if you notice tasks failing with GPU out of memory errors.
  • By default, marker will use surya for OCR. Surya is slower on CPU, but more accurate than tesseract. If you want faster OCR, set OCR_ENGINE to ocrmypdf. This also requires external dependencies (see above). If you don't want OCR at all, set OCR_ENGINE to None.
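
For example, a minimal set of overrides might look like the following, set as environment variables (or in a local.env file in the repo root, as noted elsewhere in this README). All values are illustrative; pick the ones that match your hardware:

TORCH_DEVICE=cuda
INFERENCE_RAM=16
VRAM_PER_TASK=5
OCR_ENGINE=surya
DEFAULT_LANG=English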

Convert a single file

marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 --langs English
  • --batch_multiplier is how much to multiply the default batch sizes by if you have extra VRAM. Higher numbers take more VRAM but process faster. It defaults to 2; the default batch sizes take ~3GB of VRAM.
  • --max_pages is the maximum number of pages to process. Omit this to convert the entire document.
  • --langs is a comma-separated list of the languages in the document, used for OCR.

Make sure the DEFAULT_LANG setting is set appropriately for your document. The list of supported languages for OCR is here. If you need more languages, you can use any language supported by Tesseract if you set OCR_ENGINE to ocrmypdf. If you don't need OCR, marker can work with any language.
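
If you'd rather call marker from Python than from the CLI, a minimal sketch looks like this. The convert_single_pdf signature is taken from the tracebacks quoted in the issues below; load_all_models is assumed to be marker's model-loading helper — treat both as unverified against your installed version:

# Hedged sketch of the Python entry point; names are assumptions, not a
# documented API. convert_single_pdf(fname, model_lst, ...) matches the
# tracebacks quoted later on this page.
from marker.convert import convert_single_pdf
from marker.models import load_all_models  # assumption: model-loading helper

model_lst = load_all_models()
full_text, out_meta = convert_single_pdf("/path/to/file.pdf", model_lst, max_pages=10)

with open("/path/to/output/file.md", "w", encoding="utf-8") as f:
    f.write(full_text)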

Convert multiple files

marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000
  • --workers is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK if you're using GPU.
  • --max is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
  • --min_length is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images, which slows everything down.
  • --metadata_file is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, DEFAULT_LANG will be used. The format is:
{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}

You can use language names or codes. The exact codes depend on the OCR engine. See here for a full list for surya codes, and here for tesseract.
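
As a sketch, you could generate that metadata file for a folder of PDFs with a few lines of Python (here assuming every document is English; adjust per file as needed):

# Build a metadata.json in the format shown above; the English-only
# assumption is illustrative.
import json
import os

input_folder = "/path/to/input/folder"
metadata = {
    fname: {"languages": ["English"]}
    for fname in sorted(os.listdir(input_folder))
    if fname.lower().endswith(".pdf")
}

with open("/path/to/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)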

Convert multiple files on multiple GPUs

MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
  • METADATA_FILE is an optional path to a json file with metadata about the pdfs. See above for the format.
  • NUM_DEVICES is the number of GPUs to use. Should be 2 or greater.
  • NUM_WORKERS is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK.
  • MIN_LENGTH is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images, which slows everything down.

Note that the env variables above are specific to this script, and cannot be set in local.env.

Troubleshooting

There are some settings that you may find useful if things aren't working the way you expect:

  • OCR_ALL_PAGES - set this to true to force OCR on all pages. This can be very useful if table layouts aren't recognized properly by default, or if there is garbled text.
  • TORCH_DEVICE - set this to force marker to use a given torch device for inference.
  • OCR_ENGINE - can be set to surya or ocrmypdf.
  • DEBUG - setting this to True shows ray logs when converting multiple pdfs.
  • Verify that you set the languages correctly, or passed in a metadata file.
  • If you're getting out of memory errors, decrease the worker count (and/or increase the VRAM_PER_TASK setting). You can also try splitting long PDFs into multiple files.

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.

Useful settings

These settings can improve/change output quality:

  • OCR_ALL_PAGES will force OCR across the document. Many PDFs have bad text embedded due to older OCR engines being used.
  • PAGINATE_OUTPUT will put a horizontal rule between pages. Default: False.
  • EXTRACT_IMAGES will extract images and save separately. Default: True.
  • BAD_SPAN_TYPES specifies layout blocks to remove from the markdown output.
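
For example, several of these can be combined as environment variables on a single run (values illustrative):

PAGINATE_OUTPUT=true EXTRACT_IMAGES=false marker_single /path/to/file.pdf /path/to/output/folder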

Benchmarks

Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.

Benchmarks show that marker is 4x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.

Speed

| Method | Average Score | Time per page (s) | Time per document (s) |
|--------|---------------|-------------------|------------------------|
| marker | 0.613721 | 0.631991 | 58.1432 |
| nougat | 0.406603 | 2.59702 | 238.926 |

Accuracy

thinkpython.pdf, thinkos.pdf, and thinkdsp.pdf are non-arXiv books; multicolcnn.pdf, switch_trans.pdf, and crowd.pdf are arXiv papers.

| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
| marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467 |
| nougat | 0.44009 | 0.588973 | 0.322706 | 0.401342 | 0.160842 | 0.525663 |

Peak GPU memory usage during the benchmark is 4.2GB for nougat, and 4.1GB for marker. Benchmarks were run on an A6000 Ada.

Throughput

Marker takes about 4GB of VRAM on average per task, so you can convert 12 documents in parallel on an A6000 (48GB of VRAM / 4GB per task = 12 tasks).


Running your own benchmarks

You can benchmark the performance of marker on your machine. Install marker manually with:

git clone https://github.com/VikParuchuri/marker.git
poetry install

Download the benchmark data here and unzip. Then run benchmark.py like this:

python benchmark.py data/pdfs data/references report.json --nougat

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit --nougat to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):

  • Surya
  • Texify
  • Pypdfium2/pdfium
  • DocLayNet from IBM
  • ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!


marker's Issues

JSON output?

Just wondering whether this could be used to output the parsed content in a structured format such as JSON. I tried looking at the internals and couldn't tell whether this is easily possible, or whether you go directly from PDF to markdown.
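
A sketch of one way to get JSON today: the convert_single_pdf entry point (as seen in tracebacks elsewhere on this page) returns the markdown text plus a metadata dict, so you can wrap the result yourself. load_all_models is an assumed helper name:

# Hedged sketch: wrap marker's (text, metadata) result in JSON yourself.
import json

from marker.convert import convert_single_pdf
from marker.models import load_all_models  # assumed helper

models = load_all_models()
full_text, out_meta = convert_single_pdf("doc.pdf", models)

with open("doc.json", "w", encoding="utf-8") as f:
    json.dump({"markdown": full_text, "metadata": out_meta}, f)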

How do I run Marker (convert_single.py, and others) Offline?

Hello,

I am trying out Marker in an environment that's not connected to the internet (security reasons). I am running into several issues even running convert_single.py.

(it's a Windows 10 virtual env).

Can you help me with how to run Marker offline?

Thank you,
Apurva

Errors for reference:
(marker-py3.10) (marker) C:\Users\APATHAK2\Documents\marker-master\marker-master>python convert_single.py C:\Users\APATHAK2\Documents\marker-master\marker-master\6941.pdf C:\Users\APATHAK2\Documents\marker-master\marker-master\6941.md --parallel_factor 2 --max_pages 22
Traceback (most recent call last):
  File "C:\Users\APATHAK2\AppData\Local\pypoetry\Cache\virtualenvs\marker-UNt7FmZX-py3.10\lib\site-packages\urllib3\connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "C:\Users\APATHAK2\AppData\Local\pypoetry\Cache\virtualenvs\marker-UNt7FmZX-py3.10\lib\site-packages\urllib3\connectionpool.py", line 1096, in _validate_conn
    conn.connect()
  File "C:\Users\APATHAK2\AppData\Local\pypoetry\Cache\virtualenvs\marker-UNt7FmZX-py3.10\lib\site-packages\urllib3\connection.py", line 642, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
  File "C:\Users\APATHAK2\AppData\Local\pypoetry\Cache\virtualenvs\marker-UNt7FmZX-py3.10\lib\site-packages\urllib3\connection.py", line 782, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
  File "C:\Users\APATHAK2\AppData\Local\pypoetry\Cache\virtualenvs\marker-UNt7FmZX-py3.10\lib\site-packages\urllib3\util\ssl_.py", line 470, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
  File "C:\Users\APATHAK2\AppData\Local\pypoetry\Cache\virtualenvs\marker-UNt7FmZX-py3.10\lib\site-packages\urllib3\util\ssl_.py", line 514, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Python310\lib\ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\Python310\lib\ssl.py", line 1071, in _create
    self.do_handshake()
  File "C:\Python310\lib\ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()

IndexError: list index out of range

Traceback (most recent call last):
  File "marker/convert_single.py", line 22, in <module>
    full_text, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, parallel_factor=args.parallel_factor)
  File "marker/marker/convert.py", line 120, in convert_single_pdf
    blocks = order_blocks(
  File "marker/marker/ordering.py", line 103, in order_blocks
    add_column_counts(doc, doc_blocks, model, batch_size)
  File "marker/marker/ordering.py", line 97, in add_column_counts
    predictions = batch_inference(rgb_images, bboxes, words, model)
  File "marker/marker/ordering.py", line 58, in batch_inference
    encoding = processor(
  File "/home/vscode/.cache/pypoetry/virtualenvs/marker-rdHpm9Sx-py3.10/lib/python3.10/site-packages/transformers/models/layoutlmv3/processing_layoutlmv3.py", line 114, in __call__
    features = self.image_processor(images=images, return_tensors=return_tensors)
  File "/home/vscode/.cache/pypoetry/virtualenvs/marker-rdHpm9Sx-py3.10/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 546, in __call__
    return self.preprocess(images, **kwargs)
  File "/home/vscode/.cache/pypoetry/virtualenvs/marker-rdHpm9Sx-py3.10/lib/python3.10/site-packages/transformers/models/layoutlmv3/image_processing_layoutlmv3.py", line 299, in preprocess
    images = make_list_of_images(images)
  File "/home/vscode/.cache/pypoetry/virtualenvs/marker-rdHpm9Sx-py3.10/lib/python3.10/site-packages/transformers/image_utils.py", line 124, in make_list_of_images
    if is_batched(images):
  File "/home/vscode/.cache/pypoetry/virtualenvs/marker-rdHpm9Sx-py3.10/lib/python3.10/site-packages/transformers/image_utils.py", line 97, in is_batched
    return is_valid_image(img[0])
IndexError: list index out of range

I can get as much information as you need.

Please add support for spellchecking of more languages, e.g. Hunspell

PySpellChecker supports only a few languages - English, Spanish, French, Portuguese, German, Russian, Arabic, Basque and Latvian...
(ref. https://pypi.org/project/pyspellchecker/)

May I suggest and kindly ask that you implement hunspell or cyhunspell, as they support far more languages (including Croatian and Slovenian)?

It would be a much better fit for Tesseract...

I am not a coder, but I don't think I'm wrong on this. In case I am, here is how I came to my conclusion (it took me a while to figure out that tesseract wasn't the problem and that SpellChecker is the issue):

wsl
cd /mnt/d/03Marker/marker
poetry shell

(marker-py3.10) me@rpc:/mnt/d/03Marker/marker$ which python3
/home/me/.cache/pypoetry/virtualenvs/marker-LscICKmA-py3.10/bin/python3

Added 2 language libraries, from https://github.com/tesseract-ocr/tessdata:
ref.1 https://github.com/tesseract-ocr/tessdata/blob/main/hrv.traineddata
ref.2 https://github.com/tesseract-ocr/tessdata/blob/main/slv.traineddata

sudo wget https://github.com/tesseract-ocr/tessdata/blob/main/hrv.traineddata -O /usr/share/tesseract-ocr/5/tessdata/hrv.traineddata
HTTP request sent, awaiting response... 200 OK
Length: 14196 (14K) [text/plain]
Saving to: ‘/usr/share/tesseract-ocr/5/tessdata/hrv.traineddata’

sudo wget https://github.com/tesseract-ocr/tessdata/blob/main/slv.traineddata -O /usr/share/tesseract-ocr/5/tessdata/slv.traineddata
HTTP request sent, awaiting response... 200 OK
Length: 14206 (14K) [text/plain]
Saving to: ‘/usr/share/tesseract-ocr/5/tessdata/slv.traineddata’

sudo apt-get install tesseract-ocr-hrv tesseract-ocr-slv
The following NEW packages will be installed:
tesseract-ocr-hrv tesseract-ocr-slv
0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.

poetry show | grep spellchecker
pyspellchecker 0.7.2

poetry show | grep pytesseract
pytesseract 0.3.10

(marker-py3.10) me@rpc:/mnt/d/03Marker/marker$ tesseract --list-langs
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (8):
deu
eng
fra
hrv
por
rus
slv
spa

The language libraries are located at - "/usr/share/tesseract-ocr/5/tessdata"
That location is set at settings.py as TESSDATA_PREFIX: str = "/usr/share/tesseract-ocr/5/tessdata"
and also in local.env as TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata

(marker-py3.10) me@rpc:/mnt/d/03Marker/marker/marker$ sudo /home/me/.cache/pypoetry/virtualenvs/marker-LscICKmA-py3.10/bin/python3 TessSpell.py
[sudo] password for me:
slv is not supported.
hrv is not supported.

TessSpell.py

from spellchecker import SpellChecker

def check_language_support(lang_code):
    spell = SpellChecker()
    if lang_code in spell.languages():
        print(f"{lang_code} is supported.")
    else:
        print(f"{lang_code} is not supported.")

if __name__ == "__main__":
    check_language_support("slv")  # Check for Slovenian
    check_language_support("hrv")  # Check for Croatian

Install on WSL

Hi, I am having problems with the installation of scripts/install/ghostscript_install.sh. It seems to get stuck in a loop:

gcc -fvisibility=hidden -DSHARE_LCMS=0 -DHAVE_MKSTEMP -DHAVE_FILE64 -DHAVE_FSEEKO -DHAVE_MKSTEMP64 -DHAVE_SETLOCALE -DHAVE_SSE2 -DHAVE_BSWAP32 -DHAVE_BYTESWAP_H -DHAVE_STRERROR -DHAVE_ISNAN -DHAVE_ISINF -DHAVE_PREAD_PWRITE=1 -DGS_RECURSIVE_MUTEXATTR=PTHREAD_MUTEX_RECURSIVE -O2 -DNDEBUG -Wall -Wstrict-prototypes -Wundef -Wmissing-declarations -Wmissing-prototypes -Wwrite-strings -fno-strict-aliasing -Werror=declaration-after-statement -fno-builtin -fno-common -Werror=return-type -Wno-unused-local-typedefs -DHAVE_STDINT_H=1 -DHAVE_DIRENT_H=1 -DHAVE_SYS_DIR_H=1 -DHAVE_SYS_TIME_H=1 -DHAVE_SYS_TIMES_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_LIBDL=1 -DGX_COLOR_INDEX_TYPE="unsigned long long" -D__USE_UNIX98=1 -DBUILD_PDF=1 -I./pdf -DHAVE_RESTRICT=1 -fno-strict-aliasing -DHAVE_POPEN_PROTO=1 -DSHARE_LCMS=0 -I./lcms2mt/include -o ./obj/cmsmtrx.o -c ./lcms2mt/src/cmsmtrx.c
gcc -fvisibility=hidden -DSHARE_LCMS=0 -DHAVE_MKSTEMP -DHAVE_FILE64 -DHAVE_FSEEKO -DHAVE_MKSTEMP64 -DHAVE_SETLOCALE -DHAVE_SSE2 -DHAVE_BSWAP32 -DHAVE_BYTESWAP_H -DHAVE_STRERROR -DHAVE_ISNAN -DHAVE_ISINF -DHAVE_PREAD_PWRITE=1 -DGS_RECURSIVE_MUTEXATTR=PTHREAD_MUTEX_RECURSIVE -O2 -DNDEBUG -Wall -Wstrict-prototypes -Wundef -Wmissing-declarations -Wmissing-prototypes -Wwrite-strings -fno-strict-aliasing -Werror=declaration-after-statement -fno-builtin -fno-common -Werror=return-type -Wno-unused-local-typedefs -DHAVE_STDINT_H=1 -DHAVE_DIRENT_H=1 -DHAVE_SYS_DIR_H=1 -DHAVE_SYS_TIME_H=1 -DHAVE_SYS_TIMES_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_LIBDL=1 -DGX_COLOR_INDEX_TYPE="unsigned long long" -D__USE_UNIX98=1 -DBUILD_PDF=1 -I./pdf -DHAVE_RESTRICT=1 -fno-strict-aliasing -DHAVE_POPEN_PROTO=1 -DSHARE_LCMS=0 -I./lcms2mt/include -o ./obj/cmsnamed.o -c ./lcms2mt/src/cmsnamed.c
gcc -fvisibility=hidden -DSHARE_LCMS=0 -DHAVE_MKSTEMP -DHAVE_FILE64 -DHAVE_FSEEKO -DHAVE_MKSTEMP64 -

and so on. What can I do?

Import error: from nougat import NougatModel

My environment is under WSL2, using Python 3.10, CUDA 11.8 and torch 2.1.0.
The version of nougat I installed is 0.3.3, and I don't know which version is the right one.

The error message is as follows:
(cu118py310) k101@Skythinkbook:~/code/marker$ python convert_single.py ./data/images/verilog.pdf ./results/verilog.md
Traceback (most recent call last):
  File "/home/k101/code/marker/convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "/home/k101/code/marker/marker/convert.py", line 6, in <module>
    from marker.cleaners.equations import replace_equations
  File "/home/k101/code/marker/marker/cleaners/equations.py", line 8, in <module>
    from nougat import NougatModel
ImportError: cannot import name 'NougatModel' from 'nougat' (/home/k101/anaconda3/envs/cu118py310/lib/python3.10/site-packages/nougat/__init__.py)

Extraction from 2-column text: marker mixes paragraphs from the left and right columns.

I've installed marker using WSL under Win11. Tested it in English, Croatian and Slovenian - it does a perfect job removing headers, footers, sidelines etc.

It struggles with:

  • text bullets (I can fix that easily by hand or with regex if required) and
  • 2-column text - it mixes paragraphs from the left and right columns - not easily resolvable (unsolvable, for me at least).

Do you have any idea why is this happening or how to fix it?

I attached Test pdf file, in case you want to check output yourself.
pdfTestEN.pdf
Test pdf file is created with LibreOffice & Microsoft Print to Pdf.

Page Range feature

It would be extremely helpful to be able to target a specific set of pages for slice converting but also for updating previous attempts at converting (say for a page that was incorrectly or poorly converted).

I would imagine the syntax as (for a document that is 26 pages long):

python convert_single.py --page-range 1,10 a.pdf b.md

This would convert pages: 1,2,3,4,5,6,7,8,9,10

python convert_single.py --page-range 5,6 a.pdf b.md

This would convert pages: 5,6

python convert_single.py --page-range 24, a.pdf b.md

This would convert pages: 24,25,26

python convert_single.py --page-range -4,15 a.pdf b.md

This would convert pages: 23,24,25,26,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
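
A minimal sketch of a parser implementing exactly the semantics proposed above (a hypothetical helper, not part of marker):

# Hypothetical page-range parser matching the proposed examples.
def parse_page_range(spec: str, num_pages: int) -> list[int]:
    """Parse 'start,end' where either side may be empty and start may be negative."""
    start_str, _, end_str = spec.partition(",")
    start = int(start_str) if start_str else 1
    end = int(end_str) if end_str else num_pages
    if start < 0:
        # A negative start counts back from the last page, then wraps to page 1.
        tail = list(range(num_pages + start + 1, num_pages + 1))
        return tail + list(range(1, end + 1))
    return list(range(start, end + 1))

# parse_page_range("24,", 26)   -> [24, 25, 26]
# parse_page_range("-4,15", 26) -> [23, 24, 25, 26, 1, 2, ..., 15]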

Make it available through an API

The current implementation of marker is pretty bare-bones; there is no easy way to use it other than running the script. It would be interesting to have a simple API call that takes a file and returns the results.
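
As a sketch of what that could look like, assuming the convert_single_pdf / load_all_models entry points seen in other issues on this page, and using FastAPI purely as an example server:

# Hedged sketch of a minimal HTTP wrapper around marker; not an official API.
import tempfile

from fastapi import FastAPI, File, UploadFile
from marker.convert import convert_single_pdf
from marker.models import load_all_models  # assumed helper

app = FastAPI()
models = load_all_models()  # load the model pipeline once at startup

@app.post("/convert")
async def convert(file: UploadFile = File(...)):
    # Write the upload to a temp file, since marker takes a path on disk.
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        full_text, out_meta = convert_single_pdf(tmp.name, models)
    return {"markdown": full_text, "metadata": out_meta}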

Trouble with single pages

Hi! Thanks for your work!! I am having a problem, let's see if you can find a solution.

I have a PDF that contains 77 pages. If I convert all the pages, the conversion comes out well, but if I try to convert only the first page, it does not extract the same content and starts failing. Do you know why this might be happening?

Takes too much time on M1 Apple Silicon chip

I am testing a small 8-page PDF with normal text on an M1 chip (CPU) with --parallel_factor 1000, and it takes around 1 minute. The documentation in the README claims much less time per page. Can you please explain why this is, and whether you used GPUs when benchmarking speed? If so, what was the configuration, so I can achieve this?

Markdown sublists (lists of lists of lists)

First off, this is nearly magical. Thank you.

Second, there appears to be one issue where nested lists don't work properly.
test.md
611-mentally-ill-persons-12-21-2020.pdf

You'll notice it does 2 levels of list properly, but in section .06 where it gets very nested it breaks down.
If you can point me to the right place in your code I'd be happy to try to figure it out and send you a PR.
It looks like it would just need to prepend 4 spaces per indent level for list items, but a quick look didn't reveal where you detect that.

Bug: error on the `table.py` file when converting any PDF

Hello!

I tried converting a few PDF files, and all of them failed with the following error in the cleaners/table.py file:

pnum=last_block.pnum,
AttributeError: 'NoneType' object has no attribute 'pnum'

Adding a None check fixed it and now the conversion works:

if block.most_common_block_type() != "Table":
    if len(current_lines) > 0:
        blockPnum = 0
        if last_block is not None:
            blockPnum = last_block.pnum
        new_block = Block(
            lines=deepcopy(current_lines),
            pnum=blockPnum,
            bbox=current_bbox
        )

Now, I'm not sure if this is caused by my hardware or a misconfiguration on my part. I'm running it with an AMD card. Is this worth a pull request?
Thanks for making this tool, it was fun to play with!

All these structures were converted into Level 2 (##)

I converted a PDF file which is a book. The book has a structure with Sections (level 1), Chapters (level 2), and Headings (level 3), but by using Marker, all these structures were converted into Level 2 (##) in the Markdown format.

Missing requirements.txt in repository, and indexer error on execution

No requirements.txt file on the VikParuchuri repo
https://github.com/VikParuchuri/marker

Used fork with requirements.txt to see missing packages
https://github.com/gardner/marker

Followed instructions for Windows
#12

Detectron 2 installation without issues (ref. Fix the "identifier 'single_box_iou_rotated' is undefined")

Adjusted requirements.txt - removed detectron2 (already installed), changed nougat to nougat-ocr.

Installed Pillow 10.0.1 (had to differ from the required Pillow==9.5.0 because of other dependencies, see below)

(D:\03Marker\vMarker) D:\03Marker\vMarker>pip show Pillow
Name: Pillow
Version: 10.0.1
Summary: Python Imaging Library (Fork)
Home-page: https://python-pillow.org
Author: Jeffrey A. Clark (Alex)
Author-email: [email protected]
License: HPND
Location: d:\03marker\vmarker\lib\site-packages
Requires:
Required-by: detectron2, fvcore, imageio, img2pdf, layoutparser, matplotlib, ocrmypdf, pdf2image, pdfplumber, pikepdf, pytesseract, reportlab, scikit-image, torchvision

(D:\03Marker\vMarker) D:\03Marker\vMarker>pip check
No broken requirements found.

(D:\03Marker\vMarker) D:\03Marker\vMarker>python --version
Python 3.10.13

(D:\00Torch\vTorch) D:\00Torch\vTorch>python
import torch
print(torch.__version__)
print(torch.cuda.is_available())
2.1.1
import detectron2
print(detectron2.__version__)
0.6
Arranged settings (as best as I could - CUDA etc.) according to the instructions

When trying to run:

from dir where (convert_single.py is) - D:\03Marker\vMarker\marker
python convert_single.py D:/xLLMDocBase/10CFR50_Appendix_B.pdf D:/xLLMDocBase/output.md --parallel_factor 2 --max_pages 10

Error:
(D:\03Marker\vMarker) D:\03Marker\vMarker\marker>python convert_single.py D:/xLLMDocBase/10CFR50_Appendix_B.pdf D:/xLLMDocBase/output.md --parallel_factor 2 --max_pages 10
Traceback (most recent call last):
  File "D:\03Marker\vMarker\marker\convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "D:\03Marker\vMarker\marker\marker\convert.py", line 5, in <module>
    from marker.extract_text import get_text_blocks
  File "D:\03Marker\vMarker\marker\marker\extract_text.py", line 4, in <module>
    from spellchecker import SpellChecker
  File "D:\03Marker\vMarker\lib\site-packages\spellchecker\__init__.py", line 2, in <module>
    from spellchecker.core import Spellchecker,getInstance
  File "D:\03Marker\vMarker\lib\site-packages\spellchecker\core.py", line 26, in <module>
    from indexer import DictionaryIndex
ModuleNotFoundError: No module named 'indexer'

Can't install indexer (it requires a much earlier Python version - 2.7)

Please, can you give me any advice on this?
Can you send me original requirements.txt?
Is my python command line call correct?

ImportError: cannot import name 'field_validator' from 'pydantic'

Python 3.9.16
Name: pydantic
Version: 1.10.10

Traceback (most recent call last):
  File "/marker/convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "/marker/marker/convert.py", line 3, in <module>
    from marker.cleaners.table import merge_table_blocks, create_new_tables
  File "/marker/marker/cleaners/table.py", line 2, in <module>
    from marker.schema import Line, Span, Block, Page
  File "/marker/marker/schema.py", line 4, in <module>
    from pydantic import BaseModel, field_validator
ImportError: cannot import name 'field_validator' from 'pydantic' (/opt/conda/lib/python3.9/site-packages/pydantic/__init__.cpython-39-x86_64-linux-gnu.so)

I tried changing it to from pydantic.functional_validators import field_validator, but that doesn't work either.

This issue can be solved by pip install -U pydantic which upgrades to 2.5.3, although it results in:

fastapi 0.88.0 requires pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2, but you have pydantic 2.5.3 which is incompatible.

CUDA out of memory

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
C:\Users\acer\miniconda3\Lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3527.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "D:\github_projects\marker-master\convert_single.py", line 22, in <module>
    full_text, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, parallel_factor=args.parallel_factor)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\github_projects\marker-master\marker\convert.py", line 146, in convert_single_pdf
    filtered, eq_stats = replace_equations(
                         ^^^^^^^^^^^^^^^^^^
  File "D:\github_projects\marker-master\marker\cleaners\equations.py", line 294, in replace_equations
    predictions = get_nougat_text_batched(images, flat_reformat_region_lens, nougat_model, batch_size)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\github_projects\marker-master\marker\cleaners\equations.py", line 109, in get_nougat_text_batched
    model_output = nougat_model.inference(image_tensors=sample, early_stopping=False)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\nougat\model.py", line 580, in inference
    last_hidden_state = self.encoder(image_tensors)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\nougat\model.py", line 123, in forward
    x = self.model.layers(x)
        ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\container.py", line 215, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\timm\models\swin_transformer.py", line 413, in forward
    x = blk(x)
        ^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\timm\models\swin_transformer.py", line 295, in forward
    attn_windows = self.attn(x_windows, mask=self.attn_mask)  # nW*B, window_size*window_size, C
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\acer\miniconda3\Lib\site-packages\timm\models\swin_transformer.py", line 183, in forward
    attn = (q @ k.transpose(-2, -1))
            ~~^~~~~~~~~~~~~~~~~~~~~
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.38 GiB is allocated by PyTorch, and 16.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

cannot import name 'cached_property' from 'nougat.utils'

I'm using a python3.9 env

(venv) (base) marker dev % python3 convert_single.py /Users/nunamia/Documents/github/nougat/demo /Users/nunamia/Documents/github/nougat/demo/output.md
Traceback (most recent call last):
  File "/Users/nunamia/Documents/github/marker/convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "/Users/nunamia/Documents/github/marker/marker/convert.py", line 6, in <module>
    from marker.cleaners.equations import replace_equations
  File "/Users/nunamia/Documents/github/marker/marker/cleaners/equations.py", line 8, in <module>
    from nougat import NougatModel
  File "/Users/nunamia/Documents/github/marker/venv/lib/python3.9/site-packages/nougat/__init__.py", line 1, in <module>
    from nougat.app import Nougat
  File "/Users/nunamia/Documents/github/marker/venv/lib/python3.9/site-packages/nougat/app.py", line 5, in <module>
    from nougat.asgi import serve
  File "/Users/nunamia/Documents/github/marker/venv/lib/python3.9/site-packages/nougat/asgi.py", line 6, in <module>
    from nougat.context.request import Request
  File "/Users/nunamia/Documents/github/marker/venv/lib/python3.9/site-packages/nougat/context/__init__.py", line 1, in <module>
    from nougat.context.request import Request
  File "/Users/nunamia/Documents/github/marker/venv/lib/python3.9/site-packages/nougat/context/request.py", line 8, in <module>
    from nougat.utils import cached_property, File
ImportError: cannot import name 'cached_property' from 'nougat.utils' (/Users/nunamia/Documents/github/marker/venv/lib/python3.9/site-packages/nougat/utils/__init__.py)

I searched and found this code:
https://github.com/Riparo/nougat/blob/8453bc37e0b782f296952f0a418532ebbbcd74f3/nougat/context/request.py#L81

But I checked the official version and there is no such part of the code.

the version is:
[[package]]
name = "nougat-ocr"
version = "0.1.17"
description = "Nougat: Neural Optical Understanding for Academic Documents"
optional = false
python-versions = ">=3.7"
files = [
{file = "nougat_ocr-0.1.17-py3-none-any.whl", hash = "sha256:f776732c716250972c7de11a47b36e94fa48e271d67045a427f19f12eeeef118"},
]

MPS version not working

Set parallel factor to max 2. Set `TORCH_DEVICE` setting to `mps`. See how to adjust settings in the README.

Originally posted by @VikParuchuri in #40 (comment)

This still doesn't work and gives the following error:

NotImplementedError: The operator 'aten::roll' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

got an error when using convert_single.py to convert pdf to md

running commands: python convert_single.py ./test.pdf ./test.md

Error:

Traceback (most recent call last):
  File "/marker/convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "/marker/marker/convert.py", line 9, in <module>
    from marker.postprocessors.editor import edit_full_text
  File "/marker/marker/postprocessors/editor.py", line 11, in <module>
    tokenizer = AutoTokenizer.from_pretrained(settings.EDITOR_MODEL_NAME)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 736, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2073, in _from_pretrained
    raise ValueError(
ValueError: Non-consecutive added token '' found. Should have index 259 but has index 0 in saved vocabulary.

Non-UTF-8 code starting with '\xe2'

I hope I've managed to fulfill the prerequisites, but when I try to convert a couple of different files, I get an error message. I'm running Manjaro btw (a version of Arch Linux):

SyntaxError: Non-UTF-8 code starting with '\xe2' in file /home/niklas/Downloads/x.pdf on line 2, but no encoding declared; see https://peps.python.org/pep-0263/ for details

Any help would be appreciated. Cheers and many thanks for making and sharing this software!

the table issue

It seems the table is never recognized successfully; even when it works, the format is still not right and cannot be read. I only use CPU and convert a single file. Which part of the code works on the table function? Thanks

How to obtain the model's pre-training parameters offline?

How to obtain the pre-training parameters of LayoutLMv3ForSequenceClassification and LayoutLMv3Processor offline?
I find that settings.ORDERER_MODEL_NAME does not exist on master!

The following is the code in marker.ordering.py:
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor
from PIL import Image
import io

from marker.schema import Page
from marker.settings import settings

processor = LayoutLMv3Processor.from_pretrained(settings.ORDERER_MODEL_NAME)

def load_ordering_model():
    model = LayoutLMv3ForSequenceClassification.from_pretrained(
        settings.ORDERER_MODEL_NAME,
        torch_dtype=settings.MODEL_DTYPE,
    ).to(settings.TORCH_DEVICE)
    model.eval()
    return model
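
For the general offline pattern (not marker-specific): download the Hugging Face weights once on a connected machine, then load them from the local path with networking disabled. The model ID below is a placeholder, not marker's actual orderer checkpoint:

# Run on a connected machine first: cache the weights locally.
# (placeholder model ID -- substitute whatever settings.ORDERER_MODEL_NAME points to)
from huggingface_hub import snapshot_download

local_dir = snapshot_download("some-org/some-orderer-model")
print(local_dir)  # copy this directory to the offline machine

# Then, on the offline machine (with TRANSFORMERS_OFFLINE=1 set in the environment):
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("/path/to/copied/dir")
model = LayoutLMv3ForSequenceClassification.from_pretrained("/path/to/copied/dir")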

Units in readme benchmark table

Thanks for making this cool project!

The benchmarks section of the readme would be more helpful if it specified the units used when measuring time per page and accuracy. Different setups will obviously perform differently, but it would be helpful for people to understand the rough order of magnitude of the tool's speed.

Font information by using this

I see that we get markdown files as the output from this library. I want to use this library for resume parsing. Is there any way to extract font information (font size, color, name) in the output?
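
Marker's markdown output doesn't carry font data, but one workable route for resumes is to read the spans directly from the PDF with PyMuPDF, independently of marker (a sketch):

# Hedged sketch using PyMuPDF's span dictionaries, which carry font
# name, size, and color alongside each run of text.
import fitz  # PyMuPDF

doc = fitz.open("resume.pdf")
for page in doc:
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines" key
            for span in line["spans"]:
                print(span["font"], span["size"], span["color"], span["text"])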

Failed table reading

Just tested it out for the first time on one document. It failed to read the table properly.
Also, the flow is cut off from the column-1 paragraph to the column-2 paragraph if there is a table at the end, even though the paragraph continues from column 1 to column 2.

(Screenshots attached, taken 2024-01-04.)

Not compatible with python 3.9?

I was able to install the project with pyenv and 3.9.18 on macos, but when running convert_single.py I got an error:

Traceback (most recent call last):
  File "/Users/milep/projects/marker/convert_single.py", line 3, in <module>
    from marker.convert import convert_single_pdf
  File "/Users/milep/projects/marker/marker/convert.py", line 3, in <module>
    from marker.cleaners.table import merge_table_blocks, create_new_tables
  File "/Users/milep/projects/marker/marker/cleaners/table.py", line 2, in <module>
    from marker.schema import Line, Span, Block, Page
  File "/Users/milep/projects/marker/marker/schema.py", line 56, in <module>
    class Span(BboxElement):
  File "/Users/milep/projects/marker/marker/schema.py", line 61, in Span
    ascender: float | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

It worked with the 3.10.13 version.
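
For context, the float | None annotation syntax (PEP 604) is evaluated at class-definition time here and only exists on Python 3.10+; the pre-3.10 spelling would be:

# Equivalent to `ascender: float | None = None` on Python 3.10+.
from typing import Optional

ascender: Optional[float] = None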

The way to install VikParuchuri/marker on Windows 10.

The most challenging aspect of installing Marker on Windows lies in the detectron2 package developed by Facebook Research. Facebook Research is not very Windows-friendly, and they basically do not support or provide installation guidance for Windows.

The following records the process of installing VikParuchuri/marker on Windows 10.


To install the detectron2 package on Windows, you need to clone detectron2 and make some modifications before installation:

  1. Compilation of detectron2 requires a C/C++ compiler. I have MSVC (Visual Studio 2022) cl.exe in my environment, and you must have a similar C/C++ compiler in your environment.
    Visual Studio Download: https://visualstudio.microsoft.com/vs/community/

  2. Compilation of detectron2 requires NVIDIA CUDA's nvcc. You must install the CUDA Toolkit first. I installed version 12.3.
    CUDA Toolkit Download: https://developer.nvidia.com/cuda-downloads

  3. The torch package may also need to be installed. I installed the latest version provided by PyTorch:
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

  4. Install wheel:
    pip install wheel

  5. Clone detectron2:
    git clone https://github.com/facebookresearch/detectron2.git

  6. Fix the "identifier 'single_box_iou_rotated' is undefined" issue by Viliami. (Refer to: facebookresearch/detectron2#1601 (comment))

  7. Install the local detectron2.
    Install detectron2: pip install -e detectron2

If everything goes smoothly, detectron2 should be installed. If there are any issues, you'll need to check the error logs for further investigation.


Installing the Windows version of Tesseract and Ghostscript.

  1. To install Tesseract OCR on Windows
    run the installer tesseract-ocr-w64-setup-5.3.3.20231005.exe or a newer version
    https://digi.bib.uni-mannheim.de/tesseract/

  2. To install Ghostscript on Windows
    run the installer gs10021w64.exe or a newer version
    https://ghostscript.readthedocs.io/en/gs10.02.0/Install.html


Installing the VikParuchuri/marker

  1. git clone https://github.com/VikParuchuri/marker.git
  2. Remove detectron2 from VikParuchuri/marker/requirements.txt and install it manually using the aforementioned steps
  3. The nougat entry in VikParuchuri/marker/requirements.txt installs the wrong package. It needs to be removed from requirements.txt, and the version developed by facebookresearch (https://github.com/facebookresearch/nougat) installed instead:
    pip install nougat-ocr
  4. Install the missing dependencies.
    pip install -r requirements.txt
    pip install ftfy
    pip install spellchecker
    pip install pyspellchecker
    pip install ocrmypdf
    pip install nltk
    pip install thefuzz
    pip uninstall python-magic
    pip install python-magic-bin

Query Regarding Support for Additional Languages in Marker

Dear Marker Maintainers,

I hope this message finds you well. I am reaching out to inquire about the current language support within Marker and the potential for expanding this to include additional languages.

Having perused the settings.py file, I noted that there is a provision for a selection of languages, predominantly European ones alongside a few Asian languages. However, I am particularly interested in understanding whether there are plans afoot to incorporate further languages into this impressive tool.

The utility of Marker would be significantly enhanced by the inclusion of languages such as Arabic, which possesses unique orthographic characteristics, or smaller European languages that may not have been the focus of extensive testing.

Moreover, I would be grateful for any guidance on the process of contributing to the language list. Is there a particular protocol for proposing new languages, and are there specific requirements that a language must meet to be considered for addition?

I appreciate the remarkable work that has gone into developing Marker and look forward to its continued evolution.

Best regards,
yihong1120

M1 Mac Install Issue PyTorch

When I run poetry install I get:
Installing torch (2.1.0): Failed

RuntimeError

Unable to find installation candidates for torch (2.1.0)

Poetry installed the venv in Python 3.12

Why is the model not as good as Nougat?

Why is the conversion result not as good as Nougat, especially when it comes to handling formulas and tables? Marker conversion introduces many errors, while Nougat has almost no errors.

Zero height box found, cannot convert properly

Every single PDF I have tried gets the following error:

Zero height box found, cannot convert properly
Traceback (most recent call last):
  File "/Users/grim/src/marker/convert_single.py", line 22, in <module>
    full_text, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, parallel_factor=args.parallel_factor)
  File "/Users/grim/src/marker/marker/convert.py", line 108, in convert_single_pdf
    block_types = detect_document_block_types(
  File "/Users/grim/src/marker/marker/segmentation.py", line 51, in detect_document_block_types
    encodings, metadata, sample_lengths = get_features(doc, blocks)
  File "/Users/grim/src/marker/marker/segmentation.py", line 160, in get_features
    encoding, other_data = get_page_encoding(doc[i], blocks[i])
  File "/Users/grim/src/marker/marker/segmentation.py", line 104, in get_page_encoding
    raise ValueError
ValueError

macOS 14.2
python 3.9

installed via exact instructions from git repo

Could not get markdown file

It looks like the program is working fine, but it doesn't produce the markdown file.

The document has 850 pages. Is that because there are too many pages in a single document?

mac install issue

this is what the instructions say:

Create a local.env file in the root marker folder with TESSDATA_PREFIX=/path/to/tessdata inside it

Where is the marker folder on a Mac?

Problem: Loss of Equations Between Paragraphs in PDF to Markdown Conversion

I have some questions about the implementation. Can the PDF to Markdown conversion with Marker also report the coordinate information of each paragraph, e.g. a layout.json report with 'bbox': (x0, y0, x1, y1) entries per paragraph?
PyMuPDF provides data containing this information.

There seems to be a significant error in recognizing and storing equations. For example, formulas within the text or between paragraphs are being lost. How can this issue be addressed?

[Attachment Included]

3.pdf
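
On the bbox question: as a marker-independent sketch, PyMuPDF (which the report mentions) can dump per-block coordinates to a layout.json like this:

# Hedged sketch: per-block bounding boxes via PyMuPDF, not a marker feature.
import json

import fitz  # PyMuPDF

doc = fitz.open("3.pdf")
layout = []
for page_num, page in enumerate(doc):
    # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type) tuples.
    for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
        layout.append({"page": page_num, "bbox": [x0, y0, x1, y1], "text": text})

with open("layout.json", "w", encoding="utf-8") as f:
    json.dump(layout, f, indent=2)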
