hocr's Introduction

Hocr

THIS REPO IS NOW DEPRECATED, PLEASE USE HOCR.NET

C# Library for converting PDF files to Searchable PDF Files

  • Need to batch convert hundreds of scanned PDFs to searchable PDFs?
  • Don't want to pay thousands of dollars for a component?

I have personally tested this library with over 110 thousand PDFs. Beyond a few fringe cases, the code has performed as designed: I was able to process 110k PDFs (some hundreds of pages long) over a 3-day period using 5 servers.

Internally, Hocr uses Tesseract, GhostScript, iTextSharp and the HtmlAgilityPack. Please check the licensing for each NuGet package to make sure you are in compliance.

This library IS THREAD-SAFE, so you can process multiple PDFs at the same time in different threads; you do not need to process them one at a time.

Use Hocr!

Example Usage:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using Hocr.Enums;
using Hocr.Pdf;

namespace Hocr.Cmd
{
    internal class Program
    {
        private static bool _running = true;

        private static int _fileCounter = 3;

        private static PdfCompressor _comp;

        public static void Example(byte[] data, string outfile)
        {
            Tuple<byte[], string> odata = _comp.CreateSearchablePdf(data, new PdfMeta
            {
                Author = "Vince",
                KeyWords = string.Empty,
                Subject = string.Empty,
                Title = string.Empty
            });
            File.WriteAllBytes(outfile, odata.Item1);
            Console.WriteLine("OCR BODY: " + odata.Item2);
            Console.WriteLine("Finished " + outfile);
            // Example runs on several threads at once, so decrement the counter atomically.
            if (Interlocked.Decrement(ref _fileCounter) == 0)
                Environment.Exit(0);
        }

        private static void Main(string[] args)
        {
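            // Ghostscript (distiller) switches handed to the compressor via DistillerOptions below.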
            List<string> distillerOptions = new List<string>
            {
                "-dSubsetFonts=true",
                "-dCompressFonts=true",
                "-sProcessColorModel=DeviceRGB",
                "-sColorConversionStrategy=sRGB",
                "-sColorConversionStrategyForImages=sRGB",
                "-dConvertCMYKImagesToRGB=true",
                "-dDetectDuplicateImages=true",
                "-dDownsampleColorImages=false",
                "-dDownsampleGrayImages=false",
                "-dDownsampleMonoImages=false",
                "-dColorImageResolution=265",
                "-dGrayImageResolution=265",
                "-dMonoImageResolution=265",
                "-dDoThumbnails=false",
                "-dCreateJobTicket=false",
                "-dPreserveEPSInfo=false",
                "-dPreserveOPIComments=false",
                "-dPreserveOverprintSettings=false",
                "-dUCRandBGInfo=/Remove"
            };


            PdfCompressorSettings pdfSettings = new PdfCompressorSettings
            {
                PdfCompatibilityLevel = PdfCompatibilityLevel.Acrobat_7_1_6,
                WriteTextMode = WriteTextMode.Word,
                Dpi = 400,
                ImageType = PdfImageType.Tif,
                ImageQuality = 100,
                CompressFinalPdf = true,
                DistillerMode = dPdfSettings.prepress,
                DistillerOptions = string.Join(" ", distillerOptions.ToArray())
            };

            _comp = new PdfCompressor(pdfSettings);
            _comp.OnExceptionOccurred += Compressor_OnExceptionOccurred;
            _comp.OnCompressorEvent += _comp_OnCompressorEvent;


            byte[] data = File.ReadAllBytes(@"Test1.pdf");
            byte[] data1 = File.ReadAllBytes(@"Test2.pdf");
            byte[] data2 = File.ReadAllBytes(@"Test3.pdf");

            new Thread(()=>
            {
                Console.WriteLine("Started Test 1");
                Example(data, @"Test1_ocr.pdf");
            }).Start();

            new Thread(() =>
            {
                Console.WriteLine("Started Test 2");
                Example(data1, @"Test2_ocr.pdf");
            }).Start();

            new Thread(() =>
            {
                Console.WriteLine("Started Test 3");
                Example(data2, @"Test3_ocr.pdf");
            }).Start();

            int counter = 0;
            while (_running)
            {
                Thread.Sleep(1000);
                Console.WriteLine("Working...." + counter);
                counter++;
            }

            Console.WriteLine("Finished!");
            Console.ReadLine();
        }

        private static void _comp_OnCompressorEvent(string msg) { Console.WriteLine(msg); }

        private static void Compressor_OnExceptionOccurred(PdfCompressor c, Exception x)
        {
            Console.WriteLine("Exception Occured! ");
            Console.WriteLine(x.Message);
            Console.WriteLine(x.StackTrace);
            _running = false;
        }
    }
}
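
Because PdfCompressor is thread-safe, the same pattern scales to a whole folder of scanned PDFs. The sketch below is not part of the library; it assumes the same CreateSearchablePdf signature and PdfMeta type shown in the example above, and the helper name and folder handling are illustrative only.

using System;
using System.IO;
using System.Threading.Tasks;
using Hocr.Pdf;

internal static class BatchExample
{
    // Illustrative helper (not part of the library): OCR every PDF in a folder, a few files at a time.
    public static void OcrFolder(PdfCompressor comp, string folder)
    {
        string[] files = Directory.GetFiles(folder, "*.pdf");

        Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 4 }, file =>
        {
            byte[] input = File.ReadAllBytes(file);

            // Same call as in the example above: Item1 is the searchable PDF bytes,
            // Item2 is the recognized text body.
            Tuple<byte[], string> result = comp.CreateSearchablePdf(input, new PdfMeta
            {
                Author = string.Empty,
                KeyWords = string.Empty,
                Subject = string.Empty,
                Title = string.Empty
            });

            string outFile = Path.ChangeExtension(file, null) + "_ocr.pdf";
            File.WriteAllBytes(outFile, result.Item1);
            Console.WriteLine("Finished " + outFile);
        });
    }
}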

Special Thanks to Koolprasadd for his original article at: https://tech.io/playgrounds/10058/scanned-pdf-to-ocr-textsearchable-pdf-using-c

hocr's People

Contributors

fairfieldtekllc, winterleaf


hocr's Issues

Hocr.CMD Errors

I thought this may be caused by the GhostScript version, but I have updated everything.

HOCR Error

Parser.cs file incorrectly parses embedded Div tags

Great job on this project! I'm finding it very useful! I did find a bug, though, and wanted to report it.

I was chasing a bug where the PDF rebuilt from the split hOCR files ends up with the text doubled up on a page. You don't see it when viewing the PDF, but if you extract the text from the final PDF and look through it, the text is often repeated. The problem turned out to be how the div nodes are grabbed.

In many of the hOCR files I was working with, the body tag contained a single div that encapsulated the rest of the page, and all the other div tags were nested inside it. The line below grabs every div tag in the document and puts them in a collection of nodes:
HtmlNodeCollection nodes = body.SelectNodes("//div");

This is pretty efficient, but in my scenario that first div contained all the other divs as child elements. Since those nested divs are also collected into the NodeCollection, they end up being parsed again by the node-parsing code. The collection from the line above does not take the hierarchy of the nodes into account; it simply returns every matching node as a flat list.

My fix was to change the above code line to this:
var divs = body.ChildNodes.Where(node => node.Name.ToLower() == "div");
HtmlNodeCollection nodes = new HtmlNodeCollection(null);
foreach (var div in divs) { nodes.Add(div); }

This only grabs the div nodes that are direct children of the body tag (in my case, a single div), builds a collection from them, and parses that. Since child nodes are parsed along with their parents, everything gets parsed exactly once and there are no repeats.
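
An alternative, not taken from the report above, is to keep SelectNodes but use a relative XPath so that only direct children of the body tag are matched. A minimal HtmlAgilityPack sketch (the class and method names here are just for illustration):

using System;
using HtmlAgilityPack;

internal static class DivSelectionSketch
{
    public static void Run(string hocrHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(hocrHtml);

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");

        // "//div" matches every div in the document, including nested ones,
        // which is what produced the duplicated text. "./div" matches only
        // the divs that are direct children of the body tag.
        HtmlNodeCollection nodes = body.SelectNodes("./div");

        Console.WriteLine(nodes?.Count ?? 0);
    }
}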

I've included an example hOCR file for reference. I had to change the extension to .txt so I could upload it.

example.txt
