gowengit / docnet

DocNET is a fast PDF editing and reading library for modern .NET applications

License: MIT License

C# 99.03% Shell 0.97%

pdf netstandard netcore csharp jpeg pdf-document pdf-converter pdf-document-processor pdf-extractor pdf-conversion

docnet's Introduction

docnet

Description

docnet aims to be a fast PDF editing and data extraction library. It is a .NET Standard 2.0 wrapper for the PDFium C++ library used by Chromium.

PDFium version: 5445

Supported platforms:

win
linux
osx

Features

Examples

Render PDF page as PNG and display all character bounding boxes: example

Note: If you have issues running on Linux make sure that libgdiplus is installed since this example uses System.Drawing.Common.
Convert JPEG file to PDF: example

Usage

DocLib.Instance should be treated as a singleton that lives as long as your application. It should only be disposed when you intend to clean all unmanaged resources of PDFium.

.NET Framework Support

Newer versions of .NET Framework are also supported. Docnet.Core.targets tries to automatically find which version of the native PDFium binary to copy, but that can sometimes be unreliable, especially when running as AnyCPU. You can manually specify the DocnetRuntime property in your project file to influence which library version is copied. Allowed values are win-x64, win-x86, linux and osx.

The example below makes sure that the x64 binary is always copied on Windows:

  <PropertyGroup>
    <DocnetRuntime Condition=" '$([MSBuild]::IsOsPlatform(Windows))' ">win-x64</DocnetRuntime>
  </PropertyGroup>

docnet's Issues

Incorrect scalingFactor documentation

The documentation for PageDimensions ctor describes the scalingFactor as

Page scaling PPI factor

However, this seems to be incorrect. Based on my experiments, it is a pixels-per-point factor, where 72 points equal one inch.
Example: for 300 pixels per inch, the scaling factor should be 300/72 ≈ 4.1667.

Could it be changed to something like the following?

Page scaling factor in pixels-per-point. Convert PPI to scaling factor as PPI/72.
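
The conversion described in this issue can be sketched as a small helper. This is only an illustration of the arithmetic (PPI divided by 72 points per inch); the class and method names are hypothetical and are not part of docnet.

```csharp
using System;

public static class ScalingFactor
{
    // PDF user space uses 72 points per inch.
    private const double PointsPerInch = 72.0;

    // Converts a target resolution in pixels per inch to a pixels-per-point factor.
    public static double FromPpi(double pixelsPerInch)
    {
        if (pixelsPerInch <= 0)
            throw new ArgumentOutOfRangeException(nameof(pixelsPerInch));
        return pixelsPerInch / PointsPerInch;
    }

    // Rendered size in pixels of a page side given in points.
    public static int PixelsFor(double points, double pixelsPerInch) =>
        (int)Math.Round(points * FromPpi(pixelsPerInch));
}
```

For example, a US Letter page is 612 points wide, so at 300 PPI it would render 2550 pixels wide.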

Converting PDF to image containing form fields returns blank fields

When calling the "GetImage" function with a PDF file that contains "Form Fields", the generated image does not contain the form fields values, becoming all blank after the operation.

We tried to change the RenderFlags used as arguments to GetImage function, but with no success at all.

Here is the code snippet that I used:

using (var docReader = rasterizer.GetDocReader(pdfBuffer, new Docnet.Core.Models.PageDimensions(1080, 1920)))
{
    for (var i = 0; i < docReader.GetPageCount(); i++)
    {
        using (var pageReader = docReader.GetPageReader(i))
        {
            var pageBytes = pageReader.GetImage(new NaiveTransparencyRemover(255, 255, 255), RenderFlags.RenderForPrinting);

            return BytesToBitmap(pageReader, pageBytes, dpi);
        }
    }
}

Is there anything, or any argument missing when calling the GetImage function to get the desired result?

Here follows the screenshots before and after the operations.

Screenshot before converting.

Screenshot after converting.

Original PDF.
OriginalFilledForms.pdf

Thanks in advance.

Constant scaling factor for PageReader

The dimensions I pass to GetDocReader(...) are used to scale the PDF to that size. If I have different documents with different sizes, they get scaled differently, so the resulting images all have the same size (assuming the aspect ratio is the same, e.g. A4 and A5).

Currently, one workaround would be to use reflection to access the private _scaling field and calculate the original size of the document, then open it again with a pixel size adjusted to give a fixed resolution.

GetDocReader() could have an overload that takes a single double specifying the resolution, e.g. in pixels per inch or pixels per cm.

If there is interest, I could look into it and create a pull request.

.GetText() works, .GetImage() doesn't on first page on test file, wikipedia_0.pdf

Describe the bug
.GetText() works, .GetImage() doesn't on first page of your test file, wikipedia_0.pdf

To Reproduce
Steps to reproduce the behavior:

use .net 5.0
add Docnet.Core package and System.Drawing.Common
build unit test involving .GetText and .GetImage
Fails

Expected behavior
Image is created

Screenshots

Desktop (please complete the following information):
Windows
.net 5
unit test via Microsoft.VisualStudio.TestTools.UnitTesting

Merge more than 2 files at once

Is your feature request related to a problem? Please describe.

I need to merge more than 2 files at once (up to 100).
In my use case I need to merge all PDFs in a specified folder to a combined.pdf file.
I did it by calling the provided DocLib.Instance.Merge method N-1 times (where N is the number of PDFs to merge).
The code looks like this:

  if (Directory.Exists(folderPath))
  {
      string[] files = Directory
        .GetFiles(folderPath)
        .Where(f => f.EndsWith(".pdf") && Path.GetFileName(f) != "combined.pdf")
        .ToArray();

      if (files.Length > 0)
      {
          byte[] data = File.ReadAllBytes(files[0]);
          for (int i = 1; i < files.Length; i++)
          {
              data = DocLib.Instance.Merge(data, File.ReadAllBytes(files[i]));
          }

          File.WriteAllBytes(Path.Combine(folderPath, "combined.pdf"), data);
      }
  }

It works, but it gets increasingly slow as the number of files grows (details in this Google Spreadsheet).

Describe the solution you'd like
Instead I would like to write this :

if (Directory.Exists(folderPath))
{
    string[] files = Directory
      .GetFiles(folderPath)
      .Where(f => f.EndsWith(".pdf") && Path.GetFileName(f) != "combined.pdf")
      .ToArray();

    var docOne = files[0];
    var otherdocs = files.Where(f => f != docOne).ToArray();

    if (files.Length > 0)
    {
        byte[] docOneBytes = File.ReadAllBytes(docOne);
        var otherdocsBytes = new List<byte[]>();

        for (int i = 1; i < files.Length; i++)
        {
            otherdocsBytes.Add(File.ReadAllBytes(files[i]));
        }

        docOneBytes = DocLib.Instance.Merge(docOneBytes, otherdocsBytes);
        File.WriteAllBytes(Path.Combine(folderPath, "combined.pdf"), docOneBytes);
    }
}

And get good performance.
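
Until such an overload exists, one possible mitigation is to merge neighbouring documents pairwise instead of folding everything into a single growing accumulator. The sketch below is an assumption-laden illustration: `PairwiseMerge` and `MergeAll` are hypothetical names, and the `merge` delegate stands in for a two-document merge such as the `DocLib.Instance.Merge(byte[], byte[])` call used above. The number of merge calls is still N-1, but each input passes through roughly log2(N) merges rather than up to N-1. Whether that actually helps depends on how the underlying merge is implemented, and it has not been measured here.

```csharp
using System;
using System.Collections.Generic;

public static class PairwiseMerge
{
    // Merges documents in balanced rounds while preserving their order.
    // `merge` is a stand-in for a two-document merge function.
    public static byte[] MergeAll(IReadOnlyList<byte[]> docs, Func<byte[], byte[], byte[]> merge)
    {
        if (docs == null || docs.Count == 0)
            throw new ArgumentException("At least one document is required.", nameof(docs));
        if (merge == null)
            throw new ArgumentNullException(nameof(merge));

        var current = new List<byte[]>(docs);
        while (current.Count > 1)
        {
            var next = new List<byte[]>((current.Count + 1) / 2);
            for (int i = 0; i < current.Count; i += 2)
            {
                // Merge each adjacent pair; an odd trailing document is carried over unchanged.
                next.Add(i + 1 < current.Count
                    ? merge(current[i], current[i + 1])
                    : current[i]);
            }
            current = next;
        }
        return current[0];
    }
}
```

With docnet this might be called as `PairwiseMerge.MergeAll(allBytes, DocLib.Instance.Merge)`, assuming the two-argument byte[] overload shown in the issue.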

Save page image

Detect Blank Page

Is your feature request related to a problem? Please describe.
The library should be able to detect blank pages. IPageReader should have a property or a function to return the number of objects or anything that can help to identify if the current page is empty.

Describe the solution you'd like
As per my limited understanding of pdfium, probably use FPDFPage_CountObject function to get the number of objects. Not 100% sure, open to more discussions.

Describe alternatives you've considered
IPageReader.GetText is always empty even when the page is not blank. I tested on the pdfs that have images and text both.
IPageReader.GetImage essentially returns the page as an image. Could a blank image be used to detect whether the page is blank?
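
The image-based alternative mentioned above could be sketched as follows. This assumes the rendered buffer is a BGRA byte array with 4 bytes per pixel and that untouched background is either fully transparent or white; both are assumptions about the renderer's output, and `BlankPage.IsBlank` is a hypothetical helper, not a docnet API. Scanned pages with noise or faint marks would need a more tolerant heuristic.

```csharp
using System;

public static class BlankPage
{
    // Returns true if every pixel is fully transparent or within `tolerance` of white.
    public static bool IsBlank(byte[] bgra, byte tolerance = 5)
    {
        if (bgra == null || bgra.Length % 4 != 0)
            throw new ArgumentException("Expected a BGRA buffer with 4 bytes per pixel.", nameof(bgra));

        for (int i = 0; i < bgra.Length; i += 4)
        {
            byte b = bgra[i], g = bgra[i + 1], r = bgra[i + 2], a = bgra[i + 3];
            if (a == 0)
                continue; // nothing was drawn on this pixel

            if (255 - b > tolerance || 255 - g > tolerance || 255 - r > tolerance)
                return false; // a visibly non-white pixel
        }
        return true;
    }
}
```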

Support rendering of a page to a graphics object (DeviceContext)

I have been trying to use the docnet library to print pdf files. To do this, I currently need to get an image of the pdf page as a byte array and then load that into a bitmap and finally render it using the Graphics object when printing. Technically this works fine, but is relatively slow since rendering of the image is done using the Graphics DrawImage method.

It should be possible to improve performance if, instead of marshaling the image data twice, I could render the image directly into the underlying DeviceContext of the Graphics object. The FPDF_RenderPage function could be used for this. This function is only available on Windows so an exception would be required if it is run on another operating system.

If this feature is accepted, I am able to provide an initial implementation.

Accept byte array directly instead of a file path

It would be nice if we could pass in a pdf's byte[] directly to GetDocReader, instead of a file path. In my case, all I have is the byte array, so I need to first create a temp file with this, then pass that in, then ensure I delete it (I used the same idea as your test case with the TempFile).

I don't have time to actually create a full pull request, but I've tested the following addition to DocReader.cs and it seemed to work just fine:

public DocReader(byte[] fileData, string password, int dimOne, int dimTwo)
{
    _dimOne = dimOne;
    _dimTwo = dimTwo;
    lock (DocLib.Lock)
    {
        // Copy the managed bytes into unmanaged memory for PDFium.
        IntPtr unmanagedPointer = Marshal.AllocHGlobal(fileData.Length);
        Marshal.Copy(fileData, 0, unmanagedPointer, fileData.Length);
        var d = fpdf_view.FPDF_LoadMemDocument(unmanagedPointer, fileData.Length, password);
        // Caveat: PDFium documents that this buffer must stay valid while the
        // document is open, so freeing it here may be unsafe.
        Marshal.FreeHGlobal(unmanagedPointer);
        _docWrapper = new DocumentWrapper(d);
    }
}

Obviously you probably want to add more error checking etc, but that does work fine.

Otherwise, your library solves a huge gap in dotnet core right now around dealing with PDF files!

Kudos!
Eric.

Bug on specific control in BoundBox class: "Validator.CheckOrder"

I encountered an issue with the pageReader.GetCharacters() method. For a PDF file in which the text is written from bottom to top, the Validator.CheckOrder check throws an exception.

To solve this problem, I had to delete these "Validator.CheckOrder" checks:

After many tests on many pages (more than 100000) across many documents (more than 30000), Validator.CheckOrder appears to be unnecessary.
It would be preferable to remove it.

Performance comparison to other pdfium based libs

I have been starting to look for alternatives to iText7 for text extraction since iText has issues with some pdf documents that I need to handle which seem to have an uncommon (but still valid) encoding. docnet can handle these and all other documents that I used for the comparison just fine. In a next step I was looking at the extraction performance and I found that there are significant differences between the various libs that are pdfium based (currently looking at docnet, Patagames Pdfium.NET SDK and ironPDF).

I found that docnet was performing significantly slower than Patagames for a 63-page PDF document of about 1.5 MB (1.3 seconds vs 200 ms, measured with BenchmarkDotNet). What could be the reason that PDFium-based text extraction shows such significant performance differences? Are there any options for doing the extraction differently with docnet that I could try out?

Currently my code looks like this:

        using (var docReader = DocLib.Instance.GetDocReader(opts.FilePath, new PageDimensions()))
        {
            for (var i = 0; i < docReader.GetPageCount(); i++)
            {
                using (var pageReader = docReader.GetPageReader(i))
                {
                    var text = pageReader.GetText();
                    AddTextToResult(text, opts);
                }
            }
        }

Support angle for characters

It would be nice to be able to supply angle information for the characters found. This will also help in the future for grouping characters into words.

How to use the Split-Method?

I want to split a large pdf document in smaller documents.

var bytes = DocLib.Instance.Split("MyPDF.pdf","1,1,1");

But how do i get the 3 separate pages from these bytes?

Signature is not rendered in new version of pdfium

The signature is not being rendered.
This problem is solved in PdfiumViewer project and the issue link is here

Efficient memory usage in the case of big PDF Merge operations

Is your feature request related to a problem? Please describe.
We have a requirement which involves merging quite a lot of PDF files, and the library seems not to scale well with the increasing size (or number) of the PDF files involved.
To be specific, we perform a number of sequential calls to a web API, and each call returns a PDF file that we store in memory as a byte[]. Every time, we call the Merge(byte[], byte[]) method and build (in memory) the PDF corresponding to the 'sum' of all the previous ones. Then we save it.
The problem is that the operation is quite heavy memory-wise because the result gets bigger and bigger, and we often end up with an OutOfMemoryException.

Describe the solution you'd like
Some kind of PDF merge operation that allows us to avoid loading the result into memory (maybe using, for example, some kind of output Stream).

Describe alternatives you've considered
We've considered splitting up the operation and save multiple smaller files instead of a big one, but that's not something we can do for backwards compatibility reasons.

Additional context
Here's an example Console App I've done for testing purposes demonstrating the issue:

static async Task Main(string[] args)
{
    byte[]? accumulator = null;
    var files = Directory.EnumerateFiles(@"C:\temp\pdfs").ToList();
    for (int i = 0; i < files.Count(); i++)
    {
        if (i == 0)
        {
            accumulator = await File.ReadAllBytesAsync(files[i]);
            continue;
        }

        accumulator = DocLib.Instance.Merge(accumulator, await File.ReadAllBytesAsync(files[i]));
    }

    await using var sResult = File.OpenWrite(@"C:\temp\pdfs\all.pdf");
    await sResult.WriteAsync(accumulator);
}

Thank you for the project.

Ability to rotate a page in a document

Is your feature request related to a problem? Please describe.
I cannot see a way to rotate pages in a given PDF document. I believe the underlying library supports this.

Describe the solution you'd like
I would like a way to set rotation for one or more pages in a document (Rotate90, Rotate180, Rotate270).

Describe alternatives you've considered
I cannot see any alternatives within docnet. There seem to be other libraries that support this (Aspose, etc.)

Additional context
N/A

pdfium.dll locked by IIS/IIS Express

We are using V2.4.0 alpha4 (from Nuget) in a Web Forms project (.net 4.8).

The initial run works fine: the DLL is copied and loaded correctly.
The next time a build is attempted (in Visual Studio), I get the following error:

Unable to copy file ...\packages\Docnet.Core.2.4.0-alpha.4\runtimes\win-x64\native\pdfium.dll" to "bin\pdfium.dll". The process cannot access the file 'bin\pdfium.dll' because it is being used by another process.

Killing the "IIS Worker Process" resolves the issue, but isn't a practical solution.

The same happens when using it on IIS. Again, the initial run works fine and DLL is copied and loaded fine.
When trying to publish the site, the bin directory can not be deleted, because 'bin\pdfium.dll' is locked by the web server.

Recycling the app pool of the website solves the problem, but again, that isn't really a practical solution.

Is there any way to release the lock after using docnet?

Create PDF from multiple bitmaps?

I saw there's an example using JPEG, but is it possible to generate a PDF from a set of bitmaps as well?

Warning MSB3270 There was a mismatch between the processor architecture of the project being built "MSIL" and the processor architecture of the reference

Hi, is there any solution for this?

Severity Code Description Project File Line Suppression State
Warning MSB3270 There was a mismatch between the processor architecture of the project being built "MSIL" and the processor architecture of the reference "C:\Users\mkhas.nuget\packages\docnet.core\1.4.0\lib\netstandard2.0\Docnet.Core.dll", "AMD64". This mismatch may cause runtime failures. Please consider changing the targeted processor architecture of your project through the Configuration Manager so as to align the processor architectures between your project and references, or take a dependency on references with a processor architecture that matches the targeted processor architecture of your project. SWCG_BRM C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\Microsoft.Common.CurrentVersion.targets 2106

System.AccessViolationException

Anybody else getting these exceptions intermittently?

System.AccessViolationException: 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.' on GetPageReader(0)

Edit: Fixed when using locks on all usages. Then the question is: is it possible to make its usage multithreaded?

expose clipping for IPageReader.GetImage

I'm writing a viewer that converts a server-side PDF to image tiles to send to the client to display.

The viewer has zoom functionality, so to avoid pixelation when zoomed in, the server passes the zoom level to DocLib.GetDocReader to get a good quality image which it then crops into tiles.

The problem that I have is that as the zoom level increases, the image that we get from IPageReader.GetImage gets bigger until System.Drawing.Bitmap can no longer handle an image that big (and it starts being a large amount of memory to allocate).

Looking at the source code for PageReader, it looks like PDFium supports clipping the image, but this isn't exposed through Docnet:

docnet/src/Docnet.Core/Readers/PageReader.cs

Lines 211 to 214 in 728e6c9

    
    clipping.Left = 0;
    clipping.Right = width;
    clipping.Bottom = 0;
    clipping.Top = height;

It would be great if this was exposed so I could pass in coordinates to get an area within the image rather than the whole image.
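
As a partial workaround while clipping is not exposed, tiles can be cut directly from the raw pixel buffer, which at least avoids creating a single huge System.Drawing.Bitmap for the whole page. The sketch below assumes a tightly packed BGRA buffer (4 bytes per pixel, no row padding); `Tiles.Crop` is a hypothetical helper, not part of docnet. The full page still has to be rendered into memory first, so this does not replace real clipping support.

```csharp
using System;

public static class Tiles
{
    private const int BytesPerPixel = 4; // assumed BGRA layout

    // Copies the rectangle (x, y, tileWidth, tileHeight) out of a width*height BGRA buffer.
    public static byte[] Crop(byte[] bgra, int width, int height,
                              int x, int y, int tileWidth, int tileHeight)
    {
        if (bgra == null || bgra.Length != width * height * BytesPerPixel)
            throw new ArgumentException("Buffer size does not match the given dimensions.", nameof(bgra));
        if (x < 0 || y < 0 || tileWidth <= 0 || tileHeight <= 0 ||
            x + tileWidth > width || y + tileHeight > height)
            throw new ArgumentOutOfRangeException(nameof(x), "Tile must lie inside the image.");

        var tile = new byte[tileWidth * tileHeight * BytesPerPixel];
        for (int row = 0; row < tileHeight; row++)
        {
            // Copy one row of the tile at a time.
            int src = ((y + row) * width + x) * BytesPerPixel;
            int dst = row * tileWidth * BytesPerPixel;
            Buffer.BlockCopy(bgra, src, tile, dst, tileWidth * BytesPerPixel);
        }
        return tile;
    }
}
```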

extract images from PDF

Some of the PDF files that my application processes were created by scanners, so they're basically a PDF containing nothing but one image per page.

I would like to extract the images so that I can deal with them as images rather than PDFs. I don't want to use GetImage to convert the whole page to an image because this will include the margins around the image.

Looking at the source code, it looks like Docnet includes the PDFium calls required to extract images from PDF files:

docnet/src/Docnet.Core/Bindings/PdfiumWrapper.cs

Lines 3211 to 3214 in 728e6c9

    
    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFImageObj_GetBitmap")]
    internal static extern IntPtr FPDFImageObjGetBitmap(IntPtr image_object);

docnet/src/Docnet.Core/Bindings/PdfiumWrapper.cs

Lines 1869 to 1887 in 728e6c9

    
    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetBuffer")]
    internal static extern IntPtr FPDFBitmapGetBuffer(IntPtr bitmap);

    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetWidth")]
    internal static extern int FPDFBitmapGetWidth(IntPtr bitmap);

    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetHeight")]
    internal static extern int FPDFBitmapGetHeight(IntPtr bitmap);

    [SuppressUnmanagedCodeSecurity]
    [DllImport("pdfium", CallingConvention = CallingConvention.Cdecl,
        EntryPoint = "FPDFBitmap_GetStride")]
    internal static extern int FPDFBitmapGetStride(IntPtr bitmap);

It would be great if this was exposed so it was available to be used through Docnet.

Problem resolving pdfium dependencies in a Linux Docker environment

I have problems resolving the pdfium dependency in a Docker container on a Linux host in my .NET Core 2.2 project. I run it under Alpine.
I have installed pdfium and libpdfium and even linked them into the runtime folder. The environment is x64. I have also installed libgdiplus-dev (but it is not present inside the runtime folder).

The error I receive is:

System.DllNotFoundException: Unable to load shared library 'pdfium' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: Error loading shared library libpdfium: No such file or directory

These are the files pertaining to pdfium in my runtime folder inside the container:
libpdfium.dll -> /app/libpdfium.so
libpdfium.so
pdfium.dll -> /app/runtimes/linux-x64/native/pdfium.so
pdfium.so -> /app/runtimes/linux-x64/native/pdfium.so
pdfium_x64.so

(Yes, there are more than one link to pdfium. It is mainly due to me trying to fix the problems.)

Feature: support GetWords()

Is your feature request related to a problem? Please describe.
We usually query a document word-by-word. However, docnet only supports character-oriented queries. Character-oriented queries are really cool since users can build word-oriented queries based on them. However, I believe it will be better if this common requirement could be implemented in docnet package.

Describe the solution you'd like
Thus, there is a need for GetWords function to return a list of words. Each word model has the location box and the text information, just like GetCharacters.

Include pdfium without extension for Mac OS X runtime

Hello,
I tried to run my solution, which targets .NET Framework 4.7, with Mono on Mac OS X, and it returned the error "pdfium not loaded". I spent a few hours trying to understand the issue until I got help from my teammates. My output folder included the file pdfium.dylib, but it still didn't work. The workaround was to remove the ".dylib" extension from the file in the output folder.
The file pdfium.dylib is loaded successfully by .NET Core apps but not by Mono. Of course, using Mono with the library is a rare case, but one improvement could be to check whether the environment is Mono and then copy the file without the extension, or to copy both files if the runtime is Mac OS X.

Just wanted to report this in case it saves other developers some time.

Digital signatures not rendered

I have a digitally signed PDF that I'm trying to render.
All the page content is rendered correctly, except the signature.
I'm trying to render it with the RenderAnnotations flag.

Cluster characters into words

Hi, thanks for the great library! When can we expect clustering characters into words, since I see it's mentioned in the features list.

Page Dimensions

Hello, sorry for insisting :)
Can you help me with: #72 (comment)

Accept stream input

Currently it appears that this library supports byte[] and filepath inputs. It would be much more flexible if it also allowed for Stream input.

PDF to image not working in Azure

The sample code at https://github.com/GowenGit/docnet/blob/master/examples/pdf-to-image/PdfToImage/Program.cs uses System.Drawing.Common, but it is not supported in Azure.

Can you provide a sample using SkiaSharp?

Thanks

Page Dimensions

Hello, I want to print a ticket with paper size: 70mm x 40mm.
I get an error at line: using var _reader = DocLib.Instance.GetDocReader(pdf, pageDim);

'dimOne can't be more than dimTwo'
at Docnet.Core.Validation.Validator.CheckNotGreaterThan(Int32 valueOne, Int32 valueTwo, String nameOne, String nameTwo)
at Docnet.Core.Models.PageDimensions..ctor(Int32 dimOne, Int32 dimTwo)

Can you allow the width to be greater than the height, please?
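
The stack trace above shows that the PageDimensions constructor rejects a first value larger than the second. If the two values are treated as the shorter and longer side bounds rather than strictly as width and height (an assumption about the library's intent, not something confirmed here), a landscape ticket could be handled by ordering the sides before constructing the dimensions. The helper names below are hypothetical.

```csharp
using System;

public static class TicketDimensions
{
    // Converts millimetres to pixels at the given resolution (25.4 mm per inch).
    public static int MmToPixels(double millimetres, double pixelsPerInch) =>
        (int)Math.Round(millimetres / 25.4 * pixelsPerInch);

    // Orders two sides so the first is never greater than the second.
    public static (int DimOne, int DimTwo) Ordered(int width, int height) =>
        (Math.Min(width, height), Math.Max(width, height));
}
```

For the 70mm x 40mm ticket, the two sides would first be converted with MmToPixels at the printer's resolution, then passed through Ordered before being given to PageDimensions.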

Split pages of a document into individual PDF files

Is your feature request related to a problem? Please describe.
I wonder if it would be possible to implement a method where we split a PDF file such that each page in the document becomes an individual file.

Describe the solution you'd like
It looks like it is possible to split a PDF in two by providing a page range. But instead of doing that in a loop, where I extract a single page at each step and create a PDF from it, it would be great if the new method simply returned a list of pages that could later be saved as separate files.

Describe alternatives you've considered
As stated above, the alternative seems to be looping over the document using the existing Split methods.

Additional context
Maybe there is a better approach that solves this in a more elegant way?
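
The loop-based alternative described above can be sketched as a small helper. To keep it independent of any particular overload, the per-page extraction is passed in as a delegate; `PageSplitter` and `SplitIntoPages` are hypothetical names. With docnet, the delegate might be something like `i => DocLib.Instance.Split(path, i, i)`, assuming an overload that takes zero-based, inclusive from/to page indices, which should be checked against the installed version.

```csharp
using System;
using System.Collections.Generic;

public static class PageSplitter
{
    // Calls `extractPage` once per page index and collects the resulting single-page documents.
    public static List<byte[]> SplitIntoPages(int pageCount, Func<int, byte[]> extractPage)
    {
        if (pageCount < 0)
            throw new ArgumentOutOfRangeException(nameof(pageCount));
        if (extractPage == null)
            throw new ArgumentNullException(nameof(extractPage));

        var pages = new List<byte[]>(pageCount);
        for (int i = 0; i < pageCount; i++)
        {
            pages.Add(extractPage(i));
        }
        return pages;
    }
}
```

Each element of the returned list can then be written to its own file, for example with File.WriteAllBytes.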

Doesn't work on Android

PDF doesn't load on android device

Exception reported by GetCharacters method

Using the method 'GetCharacters' the exception "top coordinate can't be more than bottom coordinate" is reported for the attached document.

Test code:
using (var docReader = DocLib.Instance.GetDocReader(@"input2.pdf", new PageDimensions(10000, 10000)))
{
    using (var pageReader = docReader.GetPageReader(0))
    {
        var characters = pageReader.GetCharacters();
    }
}

DocNet Version: 2.3.1 / Windows 10
testfile.pdf

Expose flags for page rendering

Have you considered exposing the different rendering flags to the fpdf_view.FPDF_RenderPageBitmapWithMatrix(bitmap, _page, matrix, clipping, _flags_here_) call in your PageReader.GetImage() method?

I am most interested in being able to render the pdf with annotations enabled: FPDF_ANNOT 0x01.

GetImage without transparency and black and white mode

Thanks for the awesome library.
It would be nice to have a GetImage() method that retrieves the bitmap array without transparency,
and maybe a black-and-white bitmap array option as well.
There are many cases where transparency just takes additional memory, for example documents destined for OCR.
PDFs mostly contain documents, so these could be useful cases.
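
As an illustration of what such an option might do, the sketch below flattens a BGRA buffer onto a white background and reduces it to one luminance byte per pixel, which cuts memory to a quarter. It assumes straight (non-premultiplied) alpha and a tightly packed BGRA layout; both are assumptions about the rendered buffer, and `Grayscale.FromBgra` is a hypothetical helper rather than a docnet API.

```csharp
using System;

public static class Grayscale
{
    // Blends each pixel over white, then converts it to a single luminance byte.
    public static byte[] FromBgra(byte[] bgra)
    {
        if (bgra == null || bgra.Length % 4 != 0)
            throw new ArgumentException("Expected a BGRA buffer with 4 bytes per pixel.", nameof(bgra));

        var gray = new byte[bgra.Length / 4];
        for (int i = 0, p = 0; i < bgra.Length; i += 4, p++)
        {
            int a = bgra[i + 3];
            // Alpha-blend each channel over a white background.
            int b = (bgra[i] * a + 255 * (255 - a)) / 255;
            int g = (bgra[i + 1] * a + 255 * (255 - a)) / 255;
            int r = (bgra[i + 2] * a + 255 * (255 - a)) / 255;
            // Rec. 601 luma weights in integer arithmetic.
            gray[p] = (byte)((299 * r + 587 * g + 114 * b) / 1000);
        }
        return gray;
    }
}
```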

Add Arm/64-bit target binary

Hi,

Recently we added a 32-bit arm target to docnet, this target has been working well in Kavita (self-hosted comic/book reader) since then, however, some of our users are running a 64-bit operating system on their raspberry pi and it looks like they are unable to use the 32-bit build.

I know there are issues with Travis credits but I wanted to ask how feasible it would be to add a 64-bit target. If it sounds ok I can see about submitting a pull request.

Thank you

strong name signed version on nuget

Is your feature request related to a problem? Please describe.
Hit the problem A strongly-named assembly is required. (Exception from HRESULT: 0x80131044) on Windows when the app requires.

Describe the solution you'd like
Perhaps sign the lib? I'm not sure how to make it yet, however.

Describe alternatives you've considered
The app based on the lib would turn the feature off, but that would reduce the overall security sometimes.

Missing essential examples

This looks like some interesting code, and might lead to something promising, but how are the methods supposed to be called? For example, how are the args supposed to be structured in the main execution code, where there are 3 different scenarios? I could not find any documentation or real, actual examples. Without that, users are left to wide-open, unending guessing, starting with proper method calling and input for "args" below:

        // Presumably expects three arguments: JPEG path, width, height.
        if (args.Length != 3)
        {
            throw new ArgumentException(nameof(args));
        }

        var file = new JpegImage
        {
            Bytes = File.ReadAllBytes(args[0]),
            Width = int.Parse(args[1]),
            // The original read args[1] here, leaving the third argument unused.
            Height = int.Parse(args[2])
        };

        var bytes = DocLib.Instance.JpegToPdf(new[] { file });

        File.WriteAllBytes("file.pdf", bytes);

Is there a way to create a PDF and add pages?

I need to be able to create a new page and content and insert it into a target PDF.

Currently I'm merging multiple inputs into a single output PDF.
I want to insert an index page, showing the list of documents, and which page number they appear at in the output.
So, the output document would be an index page, followed by all of the inputs concatenated together.

I don't seem to be able to find a way to create PDFs or pages in DocNet. Seems pretty fundamental to a PDF toolkit.

Arm/RaspberryPi support

Hi!

Would there be any interest to also have an Arm/Linux/RaspberryPi target for this project? The use case we have is for handling pdf files in the self-hosted book viewer Kavita on a Raspberry Pi. If that would be acceptable I could look into submitting a pull request.

I suppose this would mostly be about compiling a 'docnet/src/Docnet.Core/runtimes/linux-arm/native/pdfium.so' binary and hooking it into the plumbing.

Thank you.

Signed assembly (strong-named)

Hi,
We would like to use docnet in a project that is strongly named (signed).

From the docs:
A strong-named assembly can only use types from other strong-named assemblies. Otherwise, the integrity of the strong-named assembly would be compromised.
https://docs.microsoft.com/en-us/dotnet/standard/assembly/create-use-strong-named

Would it be possible to sign the assembly that is in the nuget package? All dotnet libraries usually come strongly named (signed) including the dotnet core ones.

Example PDF to Image is not thread-safe in web environments

The sample for converting PDF pages to an image works fine until it is used in a multi-threaded fashion. I used the sample in an ASP.NET Core 2.2 Razor Page handler. This handler creates thumbnails of PDFs on the fly and is therefore called from the client in parallel, leading to multiple parallel calls to the sample code.

Since I could not come up with a real solution, I put a Mutex around the sample code. That works, but is it a good solution?
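
For calls within a single process, a plain lock is usually a lighter alternative to a Mutex. The sketch below funnels all rendering work through one process-wide gate; `PdfGate` is a hypothetical wrapper, not a docnet type, and it only serializes code that actually goes through it. It makes no claim about which docnet calls are already thread-safe internally.

```csharp
using System;

public static class PdfGate
{
    private static readonly object Gate = new object();

    // Runs `work` while holding a process-wide lock, so only one caller at a time
    // executes the wrapped code.
    public static T Run<T>(Func<T> work)
    {
        if (work == null)
            throw new ArgumentNullException(nameof(work));

        lock (Gate)
        {
            return work();
        }
    }
}
```

A handler would wrap its whole render step, e.g. `var png = PdfGate.Run(() => RenderThumbnail(path));`, where RenderThumbnail stands for the sample code. Note that `await` cannot be used inside a lock, so the wrapped work has to be synchronous.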

Is there an example of how to extract images?

I'm looking for a comparative example of how to extract images from a PDF using DocNET.

Here are a couple examples from other libraries I'm looking at:

UglyToad's PdfPig image extraction example: https://github.com/UglyToad/PdfPig/blob/master/examples/ExtractImages.cs
See Patagames "Extracting Images" code example: https://pdfium.patagames.com/

Can this be done in DocNET as well?

left coordinate can't be more than right coordinate

We have a problem with certain PDF files, in some documents we get this error:

left coordinate can't be more than right coordinate, STACK: at Docnet.Core.Validation.Validator.CheckOrder(Int32 coordOne, Int32 coordTwo, String nameOne, String nameTwo)
at Docnet.Core.Models.BoundBox..ctor(Int32 left, Int32 top, Int32 right, Int32 bottom)
at Docnet.Core.Readers.PageReader.d__10.MoveNext()

Is it possible to add a specific property that will avoid this validation on BoundBox class so that the developer can later do his own validation?

We don't know exactly where the problem is and on what character this is happening, but this only happens on certain documents, we also have an example that we can send you if you could look at it and determine the real reason?

I must also emphasize that if I insert the same PDF into some other software such as ABBYY it does not report any error and all the characters are read properly.

Transparency issues when generating jpeg image

Basically Docnet generates black background instead of white or transparent.

003_Test_Output (1).zip
003_Test.pdf

Question about functionality

Hey peeps!
I was wondering if docnet can be used to digitally sign PDFs?
Thank you :)

Comments on PDF files not showing on rendered image

It would be great if comments on PDF files were included in the rendered image.

Pdf to PNG when the picture is a little fuzzy, how to solve it?

Unable to load DLL 'pdfium': The specified module could not be found

I must be doing something stupid. I installed your package via NuGet for my ASP.NET application, but when attempting to use docnet I get "Unable to load DLL 'pdfium': The specified module could not be found". I've tried manually copying pdfium.dll to the root of the application and setting it to always copy to the output directory to ensure it makes it into \bin, but no go. Any ideas?
