This repo contains code to extract text from a PDF, picture, or scanned document.
OCR (Optical Character Recognition) identifies the words in a picture or scanned document and converts them into machine-readable text that can be processed further by a computer. Although the technology is mature and uses advanced techniques, it still quite often produces erroneous output.
This repo contains the code for BYOB Challenge: OCR De-noising.
- Clone the repository using:
  git clone https://github.com/ViswanathaReddyGajjala/pdf2text.git
- Go to pdftoimage.com and convert the PDF file to JPG images.
- Place the PDF and the corresponding images in the /data/demo folder.
- Run the demo.py file.
- The result is printed on the command line (Windows) or in the terminal (Ubuntu).
- Note: Before running, remove any files left over from previous runs in the /data/demo and /data/result folders.
- Localized text proposals on a pdf.
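
The manual steps above (clearing old files, placing the PDF and its images in the demo folder, running demo.py) could be scripted roughly as follows. This is only a sketch and not part of the repository: the helper names stage_demo_inputs and run_demo are hypothetical, the folders are assumed to be data/demo and data/result under the repository root, the source files are assumed to live outside data/demo, and demo.py is assumed to take no arguments.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def stage_demo_inputs(repo_root, pdf_path, image_paths):
    """Clear old files, then copy the PDF and its page images into data/demo.

    Hypothetical helper: folder names follow the README and are assumed to
    be relative to the repository root. Source files must not already sit
    inside data/demo, because that folder is emptied first.
    """
    repo_root = Path(repo_root)
    demo_dir = repo_root / "data" / "demo"
    result_dir = repo_root / "data" / "result"

    for folder in (demo_dir, result_dir):
        folder.mkdir(parents=True, exist_ok=True)
        # Delete files left over from a previous run (subfolders are kept).
        for entry in folder.iterdir():
            if entry.is_file():
                entry.unlink()

    staged = []
    for src in [pdf_path, *image_paths]:
        dst = demo_dir / Path(src).name
        shutil.copy2(src, dst)
        staged.append(dst)
    return staged


def run_demo(repo_root):
    """Run demo.py from the repository root (assumes it needs no arguments)."""
    return subprocess.run([sys.executable, "demo.py"], cwd=repo_root, check=True)


# Example call with hypothetical file names:
# stage_demo_inputs(".", "scan.pdf", ["scan-1.jpg", "scan-2.jpg"])
# run_demo(".")
```

Clearing both folders before copying keeps results from an earlier document from mixing with the current run, which is what the note above guards against.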