hocr's Introduction

Hocr

THIS REPO IS NOW DEPRECATED, PLEASE USE HOCR.NET

C# Library for converting PDF files to Searchable PDF Files

  • Need to batch convert hundreds of scanned PDFs to searchable PDFs?
  • Don't want to pay thousands of dollars for a component?

I have personally tested this library with over 110 thousand PDFs. Beyond a few fringe cases, the code has performed as designed: I was able to process 110k PDFs (some hundreds of pages long) over a 3-day period using 5 servers.

Internally, Hocr uses Tesseract, GhostScript, iTextSharp and the HtmlAgilityPack. Please check the licensing for each NuGet package to make sure you are in compliance.

This library IS THREAD-SAFE, so you can process multiple PDFs at the same time in different threads; you do not need to process them one at a time.

Use Hocr!

Example Usage:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using Hocr.Enums;
using Hocr.Pdf;

namespace Hocr.Cmd
{
    internal class Program
    {
        private static bool _running = true;

        private static int _fileCounter = 3;

        private static PdfCompressor _comp;

        public static void Example(byte[] data, string outfile)
        {
            Tuple<byte[], string> odata = _comp.CreateSearchablePdf(data, new PdfMeta
            {
                Author = "Vince",
                KeyWords = string.Empty,
                Subject = string.Empty,
                Title = string.Empty
            });
            File.WriteAllBytes(outfile, odata.Item1);
            Console.WriteLine("OCR BODY: " + odata.Item2);
            Console.WriteLine("Finished " + outfile);
            // Example runs on several threads at once, so decrement the counter atomically.
            if (Interlocked.Decrement(ref _fileCounter) == 0)
                Environment.Exit(0);
        }

        private static void Main(string[] args)
        {
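            // Ghostscript (distiller) switches handed to the compressor via DistillerOptions below.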
            List<string> distillerOptions = new List<string>
            {
                "-dSubsetFonts=true",
                "-dCompressFonts=true",
                "-sProcessColorModel=DeviceRGB",
                "-sColorConversionStrategy=sRGB",
                "-sColorConversionStrategyForImages=sRGB",
                "-dConvertCMYKImagesToRGB=true",
                "-dDetectDuplicateImages=true",
                "-dDownsampleColorImages=false",
                "-dDownsampleGrayImages=false",
                "-dDownsampleMonoImages=false",
                "-dColorImageResolution=265",
                "-dGrayImageResolution=265",
                "-dMonoImageResolution=265",
                "-dDoThumbnails=false",
                "-dCreateJobTicket=false",
                "-dPreserveEPSInfo=false",
                "-dPreserveOPIComments=false",
                "-dPreserveOverprintSettings=false",
                "-dUCRandBGInfo=/Remove"
            };


            PdfCompressorSettings pdfSettings = new PdfCompressorSettings
            {
                PdfCompatibilityLevel = PdfCompatibilityLevel.Acrobat_7_1_6,
                WriteTextMode = WriteTextMode.Word,
                Dpi = 400,
                ImageType = PdfImageType.Tif,
                ImageQuality = 100,
                CompressFinalPdf = true,
                DistillerMode = dPdfSettings.prepress,
                DistillerOptions = string.Join(" ", distillerOptions.ToArray())
            };

            _comp = new PdfCompressor(pdfSettings);
            _comp.OnExceptionOccurred += Compressor_OnExceptionOccurred;
            _comp.OnCompressorEvent += _comp_OnCompressorEvent;


            byte[] data = File.ReadAllBytes(@"Test1.pdf");
            byte[] data1 = File.ReadAllBytes(@"Test2.pdf");
            byte[] data2 = File.ReadAllBytes(@"Test3.pdf");

            new Thread(()=>
            {
                Console.WriteLine("Started Test 1");
                Example(data, @"Test1_ocr.pdf");
            }).Start();

            new Thread(() =>
            {
                Console.WriteLine("Started Test 2");
                Example(data1, @"Test2_ocr.pdf");
            }).Start();

            new Thread(() =>
            {
                Console.WriteLine("Started Test 3");
                Example(data2, @"Test3_ocr.pdf");
            }).Start();

            int counter = 0;
            while (_running)
            {
                Thread.Sleep(1000);
                Console.WriteLine("Working...." + counter);
                counter++;
            }

            Console.WriteLine("Finished!");
            Console.ReadLine();
        }

        private static void _comp_OnCompressorEvent(string msg) { Console.WriteLine(msg); }

        private static void Compressor_OnExceptionOccurred(PdfCompressor c, Exception x)
        {
            Console.WriteLine("Exception Occured! ");
            Console.WriteLine(x.Message);
            Console.WriteLine(x.StackTrace);
            _running = false;
        }
    }
}
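
Because PdfCompressor is thread-safe, the same pattern scales to a whole folder of scanned PDFs. The sketch below is not part of the library; it assumes the same CreateSearchablePdf signature and PdfMeta type shown in the example above, and the helper name and folder handling are illustrative only.

using System;
using System.IO;
using System.Threading.Tasks;
using Hocr.Pdf;

internal static class BatchExample
{
    // Illustrative helper (not part of the library): OCR every PDF in a folder, a few files at a time.
    public static void OcrFolder(PdfCompressor comp, string folder)
    {
        string[] files = Directory.GetFiles(folder, "*.pdf");

        Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 4 }, file =>
        {
            byte[] input = File.ReadAllBytes(file);

            // Same call as in the example above: Item1 is the searchable PDF bytes,
            // Item2 is the recognized text body.
            Tuple<byte[], string> result = comp.CreateSearchablePdf(input, new PdfMeta
            {
                Author = string.Empty,
                KeyWords = string.Empty,
                Subject = string.Empty,
                Title = string.Empty
            });

            string outFile = Path.ChangeExtension(file, null) + "_ocr.pdf";
            File.WriteAllBytes(outFile, result.Item1);
            Console.WriteLine("Finished " + outFile);
        });
    }
}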

Special Thanks to Koolprasadd for his original article at: https://tech.io/playgrounds/10058/scanned-pdf-to-ocr-textsearchable-pdf-using-c

hocr's People

Contributors

fairfieldtekllc, winterleaf


hocr's Issues

Hocr.CMD Errors

I thought this may be caused by the GhostScript version, but I have updated everything.

HOCR Error

Parser.cs file incorrectly parses embedded Div tags

Great job on this project! I'm finding it very useful! I did find a bug, though, and wanted to report it.

I was chasing a bug where the PDF rebuilt from the split hOCR files ends up with the text doubled up on a page. You don't see it when viewing the PDF, but if you extract the text from the final PDF and look through it, the text is often repeated. The problem turned out to be how the div nodes are grabbed.

In many of the hOCR files I was working with, the body tag contained a single div that encapsulated the rest of the page, and all the other div tags were nested inside it. The line below grabs every div tag in the document and puts them in a collection of nodes:
HtmlNodeCollection nodes = body.SelectNodes("//div");

This is pretty efficient, but in my scenario that first div contained all the other divs as child elements. Since those nested divs are also collected into the NodeCollection, they end up being parsed again by the node-parsing code. The collection from the line above does not take the hierarchy of the nodes into account; it simply returns every matching node as a flat list.

My fix was to change the above code line to this:
var divs = body.ChildNodes.Where(node => node.Name.ToLower() == "div");
HtmlNodeCollection nodes = new HtmlNodeCollection(null);
foreach (var div in divs) { nodes.Add(div); }

This only grabs the div nodes that are direct children of the body tag (in my case, a single div), builds a collection from them, and parses that. Since child nodes are parsed along with their parents, everything gets parsed exactly once and there are no repeats.
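
An alternative, not taken from the report above, is to keep SelectNodes but use a relative XPath so that only direct children of the body tag are matched. A minimal HtmlAgilityPack sketch (the class and method names here are just for illustration):

using System;
using HtmlAgilityPack;

internal static class DivSelectionSketch
{
    public static void Run(string hocrHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(hocrHtml);

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");

        // "//div" matches every div in the document, including nested ones,
        // which is what produced the duplicated text. "./div" matches only
        // the divs that are direct children of the body tag.
        HtmlNodeCollection nodes = body.SelectNodes("./div");

        Console.WriteLine(nodes?.Count ?? 0);
    }
}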

I've included an example hOCR file for reference. I had to change the extension to .txt so I could upload it.

example.txt
