Low level tools for reading, writing and manipulation of PDFs

License: Artistic License 2.0

Raku 100.00%

pdf raku-module

pdf-raku's Introduction

PDF-raku

Overview

This is a low-level Raku module for accessing and manipulating data from PDF documents.

It presents a seamless view of the data in PDF or FDF documents; behind the scenes handling indexing, compression, encryption, fetching of indirect objects and unpacking of object streams. It is capable of reading, editing and creation or incremental update of PDF files.

This module understands physical data structures rather than the logical document structure. It is primarily intended as a base for higher level modules, or for exploring or patching data in PDF or FDF files.

It is possible to construct basic documents and perform simple edits by direct manipulation of PDF data. This requires some knowledge of how PDF documents are structured. Please see 'The Basics' and 'Recommended Reading' sections below.

Classes/roles in this module include:

PDF - PDF document root (trailer)
PDF::IO::Reader - for indexed random access to PDF files
PDF::IO::Filter - a collection of standard PDF decoding and encoding tools for PDF data streams
PDF::IO::IndObj - base class for indirect objects
PDF::IO::Serializer - data marshalling utilities for the preparation of full or incremental updates
PDF::IO::Crypt - decryption / encryption
PDF::IO::Writer - for the creation or update of PDF files
PDF::COS - Raku Bindings to PDF objects [Carousel Object System, see COS]

Example Usage

To create a one-page PDF that displays 'Hello, world!':

#!/usr/bin/env raku
# creates examples/helloworld.pdf
use PDF;
use PDF::COS::Name;
use PDF::COS::Dict;
use PDF::COS::Stream;
use PDF::COS::Type::Info;

sub prefix:</>($s) { PDF::COS::Name.COERCE($s) };

# construct a simple PDF document from scratch
my PDF $pdf .= new;
my PDF::COS::Dict $catalog = $pdf.Root = { :Type(/'Catalog') };

my @MediaBox  = 0, 0, 250, 100;

# define font /F1 as core-font Helvetica
my %Resources = :Procset[ /'PDF', /'Text'],
                :Font{
                    :F1{
                        :Type(/'Font'),
                        :Subtype(/'Type1'),
                        :BaseFont(/'Helvetica'),
                        :Encoding(/'MacRomanEncoding'),
                    },
                };

my PDF::COS::Dict $page-index = $catalog<Pages> = { :Type(/'Pages'), :@MediaBox, :%Resources, :Kids[], :Count(0) };
# add some standard metadata
my PDF::COS::Type::Info $info = $pdf.Info //= {};
$info.CreationDate = DateTime.now;
$info.Producer = "Raku PDF";

# define some basic content
my PDF::COS::Stream() $Contents = { :decoded("BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET" ) };

# create a new page. add it to the page tree
$page-index<Kids>.push: { :Type(/'Page'), :Parent($page-index), :$Contents };
$page-index<Count>++;

# save the PDF to a file
$pdf.save-as: 'examples/helloworld.pdf';

Then to update the PDF, adding another page:

#!/usr/bin/env raku
use PDF;
use PDF::COS::Stream;
use PDF::COS::Type::Info;

my PDF $pdf .= open: 'examples/helloworld.pdf';

# locate the document root and page tree
my $catalog = $pdf<Root>;
my $Parent = $catalog<Pages>;

# create additional content, use existing font /F1
my PDF::COS::Stream() $Contents = { :decoded("BT /F1 16 Tf  15 25 Td (Goodbye for now!) Tj ET" ) };

# create a new page. add it to the page-tree
$Parent<Kids>.push: { :Type( :name<Page> ), :$Parent, :$Contents };
$Parent<Count>++;

# update or create document metadata. set modification date
my PDF::COS::Type::Info $info = $pdf.Info //= {};
$info.ModDate = DateTime.now;

# incrementally update the existing PDF
$pdf.update;

Description

A PDF file consists of data structures, including dictionaries (hashes), arrays, numbers and strings, plus streams for holding graphical data such as images, fonts and general content.

PDF files are also indexed for random access and may also have internal compression and/or encryption.

They have a reasonably well specified structure. The document starts from the Root entry in the trailer dictionary, which is the main entry point into a PDF.

This module is based on the PDF 1.7 specification (ISO 32000-1:2008). It implements syntax, basic data-types, serialization and encryption rules as described in the first four chapters of the specification. Read and write access to data structures is via direct manipulation of tied arrays and hashes.

The Basics

The examples/helloworld.pdf file that we created above contains:

%PDF-1.3
%...(control characters)
1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (Raku PDF)
>>
endobj

2 0 obj <<
  /Type /Catalog
  /Pages 3 0 R
>>
endobj

3 0 obj <<
  /Type /Pages
  /Count 1
  /Kids [ 4 0 R ]
  /MediaBox [ 0 0 250 100 ]
  /Resources <<
    /Font <<
      /F1 6 0 R
    >>
    /Procset [ /PDF /Text ]
  >>
>>
endobj

4 0 obj <<
  /Type /Page
  /Contents 5 0 R
  /Parent 3 0 R
>>
endobj

5 0 obj <<
  /Length 44
>> stream
BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET
endstream
endobj

6 0 obj <<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
  /Encoding /MacRomanEncoding
>>
endobj

xref
0 7
0000000000 65535 f 
0000000014 00000 n 
0000000101 00000 n 
0000000155 00000 n 
0000000334 00000 n 
0000000404 00000 n 
0000000501 00000 n 
trailer
<<
  /ID [ <d743a886fcdcf87b69c36548219ea941> <d743a886fcdcf87b69c36548219ea941> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF

The PDF is composed of a series of indirect objects. For example, the first object is:

1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (Raku PDF)
>> endobj

It's an indirect object with object number 1 and generation number 0, with a << ... >> delimited dictionary containing the date that the document was created and the name of the producer. This PDF dictionary is roughly equivalent to the Raku hash:

{ :CreationDate("D:20151225000000Z00'00'"), :Producer("Raku PDF"), }

The bottom of the PDF contains:

trailer
<<
  /ID [ <d743a886fcdcf87b69c36548219ea941> <d743a886fcdcf87b69c36548219ea941> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 7
>>
startxref
610
%%EOF

The << ... >> delimited section is the trailer dictionary and the main entry point into the document. The entry /Info 1 0 R is an indirect reference to the first object (object number 1, generation 0) described above. The entry /Root 2 0 R points to the root of the actual PDF document, commonly known as the Document Catalog.
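As a language-neutral illustration of the file layout (written in Python, not with this module's API), the sketch below shows how a reader might find its starting point: it looks near the end of the file for the last `startxref` keyword and reads the byte offset that follows. The function name `find_startxref` and the 1024-byte tail window are assumptions made for this example.

```python
import re

def find_startxref(data: bytes, tail: int = 1024) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword.

    Hypothetical helper: only the final `tail` bytes are searched, on the
    assumption that the trailer sits at the very end of the file.
    """
    chunk = data[-tail:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", chunk)
    if not matches:
        raise ValueError("expected file trailer 'startxref ... %%EOF'")
    return int(matches[-1])

# A cut-down stand-in for the tail of examples/helloworld.pdf
sample = (
    b"%PDF-1.3\n...objects...\n"
    b"trailer\n<< /Info 1 0 R /Root 2 0 R /Size 7 >>\n"
    b"startxref\n610\n%%EOF\n"
)
print(find_startxref(sample))  # prints 610
```

The returned offset is where the cross reference section begins; the trailer dictionary follows it.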

Immediately above the trailer is the cross reference table:

xref
0 7
0000000000 65535 f 
0000000014 00000 n 
0000000101 00000 n 
0000000155 00000 n 
0000000334 00000 n 
0000000404 00000 n 
0000000501 00000 n

This indexes the indirect objects in the PDF by byte offset and generation number, for random access. Entries marked n are in use; entries marked f are free.
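To make the table layout concrete, here is a minimal Python sketch (independent of this module) that parses a classic cross reference section into a lookup keyed by object number. The name `parse_xref` is an assumption for this example, and the sketch ignores PDF 1.5+ cross reference streams.

```python
XREF = """\
xref
0 7
0000000000 65535 f
0000000014 00000 n
0000000101 00000 n
0000000155 00000 n
0000000334 00000 n
0000000404 00000 n
0000000501 00000 n
"""

def parse_xref(text):
    """Map object number -> (byte offset, generation number, in use?)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("not a cross reference section")
    entries = {}
    i = 1
    while i < len(lines):
        # each subsection starts with: <first object number> <entry count>
        first, count = (int(n) for n in lines[i].split())
        for k in range(count):
            offset, gen, kind = lines[i + 1 + k].split()
            entries[first + k] = (int(offset), int(gen), kind == "n")
        i += 1 + count
    return entries

entries = parse_xref(XREF)
for num, (offset, gen, in_use) in sorted(entries.items()):
    print(num, gen, offset, "in use" if in_use else "free")
```

Object 1 (the Info dictionary) resolves to byte offset 14, which is where `1 0 obj` starts in the file listed above.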

We can quickly put PDF to work using the Raku REPL, to better explore the document:

snoopy: ~/git/PDF-raku $ raku -M PDF
> my $pdf = PDF.open: "examples/helloworld.pdf"
ID => [CÜ{ÃHADCN:C CÜ{ÃHADCN:C], Info => ind-ref => [1 0], Root => ind-ref => [2 0]
> $pdf.keys
(Root Info ID)

This is the root of the PDF, loaded from the trailer dictionary

> $pdf<Info>
{CreationDate => D:20151225000000Z00'00', ModDate => D:20151225000000Z00'00', Producer => Raku PDF}

That's the document information entry, commonly used to store basic meta-data about the document.

(PDF::IO has conveniently fetched indirect object 1 from the PDF, when we dereferenced this entry).

> $pdf<Root>
{Pages => ind-ref => [3 0], Type => Catalog}

The trailer Root entry references the document catalog, which contains the actual PDF content. Exploring further; the catalog potentially contains a number of pages, each with content.

> $pdf<Root><Pages>
{Count => 1, Kids => [ind-ref => [4 0]], MediaBox => [0 0 250 100], Resources => Font => F1 => ind-ref => [6 0], Type => Pages}
> $pdf<Root><Pages><Kids>[0]
{Contents => ind-ref => [5 0], Parent => ind-ref => [3 0], Type => Page}
> $pdf<Root><Pages><Kids>[0]<Contents>
{Length => 44}
"BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET"

The page /Contents entry is a PDF stream which contains graphical instructions. In the above example, these select font /F1 at 24 points and output the text Hello, world! at coordinates 15, 25.
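Content streams use postfix notation: operands come first, followed by the operator that consumes them. The Python sketch below (not this module's API) splits the stream above into operator/operand groups. The tokenizer is deliberately simplified and is an assumption for illustration: it handles names, numbers, literal strings and alphabetic operators only, not arrays, hex strings or inline images.

```python
import re

# literal string | name | number | operator (simplified)
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[-+]?\d*\.?\d+|[A-Za-z*]+"
)

def operations(content):
    """Yield (operator, operands) pairs from a simple content stream."""
    operands = []
    for tok in TOKEN.findall(content):
        if tok[0] in "(/+-." or tok[0].isdigit():
            operands.append(tok)      # operand: keep until an operator arrives
        else:
            yield tok, operands       # operator: emit with collected operands
            operands = []

content = "BT /F1 24 Tf  15 25 Td (Hello, world!) Tj ET"
for op, args in operations(content):
    print(op, args)
```

Reading the result: `Tf` sets font /F1 at size 24, `Td` moves the text position to 15, 25 and `Tj` shows the string, all inside a `BT` ... `ET` text block.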

Reading and Writing of PDF files:

PDF is a base class for opening or creating PDF documents.

my $pdf = PDF.open("mydoc.pdf", :repair)
Opens an input PDF (or FDF) document.
- :!repair causes the reader to load only the trailer dictionary and cross reference tables from the tail of the PDF (Cross Reference Table or a PDF 1.5+ Stream). Remaining objects will be lazily loaded on demand.
- :repair causes the reader to perform a full scan, ignoring and recalculating the cross reference stream/index and stream lengths. This can be handy if the PDF document has been hand-edited.

$pdf.update
Performs an incremental update to the input PDF, which must be an indexed PDF (not applicable to PDFs opened with :repair, FDF or JSON files). A new section is appended to the PDF that contains only updated and newly created objects. This method can be used as a fast and efficient way to make small updates to a large existing PDF document.
- :diffs(IO::Handle $fh) - saves just the updates to an alternate location. This can later be appended to the base PDF to reproduce the updated PDF.

$pdf.save-as("mydoc-2.pdf", :compress, :stream, :preserve, :rebuild)
Saves a new document, including any updates. Options:
- :compress - compress objects for minimal size
- :!compress - uncompress objects for human readability
- :stream - write the PDF progressively
- :preserve - copy the input PDF, then incrementally update. This is generally faster and ensures that any digital signatures are not invalidated.
- :rebuild - discard any unreferenced objects and renumber the remaining objects. It may be a good idea to rebuild a PDF document that has been incrementally updated a number of times.

Note that the :compress and :rebuild options are a trade-off. The document may take longer to save, however file-sizes and the time needed to reopen the document may improve.

$pdf.save-as("mydoc.json", :compress, :rebuild); my $pdf2 = $pdf.open: "mydoc.json"
Documents can also be saved to, and opened from, an intermediate JSON representation. This can be handy for debugging, analysis and/or ad-hoc patching of PDF files.
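The idea behind the :repair option described above can be sketched in a few lines of Python. This is not the module's implementation: it only shows how an index can be recalculated by scanning for `N G obj` headers instead of trusting the stored cross reference table. The name `rebuild_index` is an assumption, and the sketch assumes newline-terminated lines and does not skip stream bodies or recalculate stream lengths, which a real repair pass must do.

```python
import re

# an indirect object header at the start of a line, e.g. b"4 0 obj"
OBJ = re.compile(rb"(?m)^(\d+)\s+(\d+)\s+obj\b")

def rebuild_index(data: bytes):
    """Map (object number, generation) -> byte offset by scanning the file.

    Later definitions overwrite earlier ones, so objects redefined by an
    appended incremental update take precedence.
    """
    index = {}
    for m in OBJ.finditer(data):
        index[(int(m.group(1)), int(m.group(2)))] = m.start()
    return index

data = (
    b"%PDF-1.3\n"
    b"1 0 obj <<\n  /Producer (Raku PDF)\n>>\nendobj\n\n"
    b"2 0 obj <<\n  /Type /Catalog\n  /Pages 3 0 R\n>>\nendobj\n"
)
index = rebuild_index(data)
for (num, gen), offset in sorted(index.items()):
    print(f"{num} {gen} obj at byte {offset}")
```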

Reading PDF Files

The .open method loads a PDF index (cross reference table and/or stream). The document can then be accessed randomly via the .ind-obj(...) method.

The document can be traversed by dereferencing Array and Hash objects. The reader will load indirect objects via the index, as needed.

use PDF::IO::Reader;
use PDF::COS::Name;

my PDF::IO::Reader $reader .= new;
$reader.open: 'examples/helloworld.pdf';

# objects can be directly fetched by object-number and generation-number:
my $page1 = $reader.ind-obj(4, 0).object;

# Hashes and arrays are tied. This is usually more convenient for navigating
my $pdf = $reader.trailer<Root>;
$page1 = $pdf<Pages><Kids>[0];

# Tied objects can also be updated directly.
$reader.trailer<Info><Creator> = PDF::COS::Name.COERCE: 't/helloworld.t';

Utility Scripts

pdf-rewriter.raku [--repair] [--rebuild] [--stream] [--[/]compress] [--password=Xxx] [--decrypt] [--class=Module] [--render] <pdf-or-json-file-in> [<pdf-or-json-file-out>]
This script is a thin wrapper for the PDF .open and .save-as methods. It can typically be used to:
- uncompress or render a PDF for human readability
- repair a PDF whose cross-reference index or stream lengths have become invalid
- convert between PDF and JSON

Decode Filters

Filters are used to compress or decompress stream data in objects of type PDF::COS::Stream. These are implemented as follows:

Filter Name	Short Name	Filter Class
ASCIIHexDecode	AHx	PDF::IO::Filter::ASCIIHex
ASCII85Decode	A85	PDF::IO::Filter::ASCII85
CCITTFaxDecode	CCF	NYI
Crypt		NYI
DCTDecode	DCT	NYI
FlateDecode	Fl	PDF::IO::Filter::Flate
LZWDecode	LZW	PDF::IO::Filter::LZW (`decode` only)
JBIG2Decode		NYI
JPXDecode		NYI
RunLengthDecode	RL	PDF::IO::Filter::RunLength

Input to all filters is byte strings, with characters in the range \x0 ... \xFF. latin-1 encoding is recommended to enforce this.

Each filter has encode and decode methods, which accept and return latin-1 encoded strings, or binary blobs.

use PDF::IO::Filter;
my Blob $encoded = PDF::IO::Filter.encode( :dict{ :Filter<RunLengthDecode> },
                                      "This    is waaay toooooo loooong!");
say $encoded.bytes;
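For readers curious about what a filter such as RunLengthDecode does to the bytes, here is a small Python sketch of the run-length scheme defined by the PDF specification. It is not the code in PDF::IO::Filter::RunLength, and the function names are assumptions; the encoder is a simple greedy one, so its output may differ byte-for-byte from other encoders while still decoding to the same data.

```python
def rl_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n == 128:                    # end-of-data marker
            break
        if n < 128:                     # copy the next n + 1 bytes literally
            out += data[i:i + n + 1]
            i += n + 1
        else:                           # repeat the next byte 257 - n times
            out += data[i:i + 1] * (257 - n)
            i += 1
    return bytes(out)

def rl_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 128 and data[i + run] == data[i]:
            run += 1
        if run > 1:                     # repeated byte: length code, then the byte
            out += bytes([257 - run, data[i]])
            i += run
        else:                           # literal bytes, up to 128 per group
            start = i
            i += 1
            while (i < len(data) and i - start < 128
                   and (i + 1 >= len(data) or data[i] != data[i + 1])):
                i += 1
            out += bytes([i - start - 1]) + data[start:i]
    out.append(128)                     # end-of-data marker
    return bytes(out)

text = b"This    is waaay toooooo loooong!"
encoded = rl_encode(text)
print(len(text), len(encoded))
print(rl_decode(encoded) == text)
```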

Encryption

PDF::IO::Crypt supports RC4 and AES encryption (revisions /R 2 - 4 and versions /V 1 - 4 of PDF Encryption).

To open an encrypted PDF document, specify either the user or owner password: PDF.open( "enc.pdf", :password<ssh!>)

A document can be encrypted using the encrypt method: $pdf.encrypt( :owner-pass<ssh1>, :user-pass<abc>, :aes )

:aes encrypts the document using stronger V4 AES encryption, introduced with PDF 1.6.

Note that it's quite common to leave the user-password blank. This indicates that the document is readable by anyone, but may have restrictions on update, printing or copying of the PDF.

An encrypted PDF can be saved as JSON. It will remain encrypted, and passwords may be required to reopen it.

Built-in objects

PDF::COS also provides a few essential derived classes that are needed to read and write PDF files, including encryption, object streams and cross reference streams.

Class	Base Class	Description
PDF	PDF::COS::Dict	document entry point - the trailer dictionary
PDF::COS::Type::Encrypt	PDF::COS::Dict	PDF Encryption/Permissions dictionary
PDF::COS::Type::Info	PDF::COS::Dict	Document Information Dictionary
PDF::COS::Type::ObjStm	PDF::COS::Stream	PDF 1.5+ Object stream (packed indirect objects)
PDF::COS::Type::XRef	PDF::COS::Stream	PDF 1.5+ Cross Reference stream
PDF::COS::TextString	PDF::COS::ByteString	Implements the 'text-string' data-type

pdf-raku's People

Forkers

shaneisley tklebanoff tbrowder niner melezhik

pdf-raku's Issues

Failed test 'write content'

"zef install PDF" output:

===> Testing: PDF:ver<0.2.8>:auth<github:p6-pdf>:api<PDF-1.7>
# Failed test 'write content'
# at home#sources/26CC3C5D046C986175D965C1FD37A27CDA20209C (PDF::Grammar::Test) line 42
# expected: "[ (DOS\\r\\nCR1\\(\0) <444f530d0a435232> ] TJ"
#  matcher: 'json-eqv'
#      got: "[ (DOSrnCR1\\(\0) <444f530d0a435232> ] TJ"
# :ast({:content(${:TJ($[{:array($[{:literal("DOSrnCR1(\0")}, {:hex-string("DOS\r\nCR2")}])},])})})
# Looks like you failed 1 test of 21
===> Testing [FAIL]: PDF:ver<0.2.8>:auth<github:p6-pdf>:api<PDF-1.7>

Using Rakudo version 2018.03 on Linux

Remove PDF::Reader::Tied run-time compositions

$object but PDF::Reader::Tied composes a completely new object which is wasteful and likely to lead to issues (two sets of objects).

This needs to be done declaratively . Will need a little reorganization of the reader methods + object class definitions. I'm hoping I can get rid of the anonymous ties currently being composed by PDF::Reader::Tied.

`pdf-rewriter.raku --render` corrupting an encrypted PDF's outlines

Or it's somehow leaving the outlines unencrypted. This is obvious if the output is viewed with evince, etc.

A simple sample encrypted PDF with outlines is attached:
tst.pdf

The result of pdf-rewriter --render tst.pdf out.pdf is also attached:
out.pdf

pdf-rewriter.raku failing to rewrite PDF 1.5 with /XRef and /ObjStm objects

If I read and write such a PDF:

$ raku -I . bin/pdf-rewriter.raku t/pdf/samples/pdf-1.5-obstm_and_xref_streams.pdf /tmp/out.pdf
opening t/pdf/samples/pdf-1.5-obstm_and_xref_streams.pdf ...
saving ...
done

Then GS no longer works on it:

gs /tmp/out.pdf 
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
   **** Error:  Trailer dictionary not found.
                Output may be incorrect.
   No pages will be processed (FirstPage > LastPage).

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

   **** The rendered output from this file may be incorrect.
GS>

The /Index entry in the /XRef is suspicious, and should probably be sorted anyway: [/Index [ 0 1 12 1 19 1 24 ...]
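For context, the /Index array of a cross reference stream is a flat list of (first object number, entry count) pairs, one pair per subsection. The Python sketch below (an illustration, not code from this module) unpacks such an array and checks whether its subsections are in ascending, non-overlapping order. The function names are assumptions, the first sample array uses only the leading values quoted above (the remainder is elided in the report), and the second array is made up to show a failing case.

```python
def subsections(index):
    """Split a flat /Index array into (first object number, count) pairs."""
    if len(index) % 2:
        raise ValueError("/Index must hold (first, count) pairs")
    return list(zip(index[0::2], index[1::2]))

def is_ascending(index):
    """True if subsections are sorted and do not overlap."""
    end = 0
    for first, count in subsections(index):
        if first < end:
            return False
        end = first + count
    return True

print(subsections([0, 1, 12, 1, 19, 1]))
print(is_ascending([0, 1, 12, 1, 19, 1]))
print(is_ascending([12, 1, 0, 1]))   # hypothetical unsorted array
```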

Tie Hash classes to attributes, not methods

For example, currently do:

our class PDF::Object::DOM::XRef
    is PDF::Object::Stream {
    method W is rw { self<W>; }
    method Size is rw { self<Size>; }
    method Index is rw { self<Index> }
...
}

Might be nicer to allow something like:

our class PDF::Object::DOM::XRef
    is PDF::Object::Stream {
    has Array $.W is tied;
    has Int $.Size is tied;
    has Int $.Index is tied;
...
}

Just needs a bit more introspection/metaprogramming in PDF::Object::Tie::Hash

PDF::Object.post-process needs to go

This is currently adding back-links, e.g. adding /Parent entry to Page which points back to pages.
Wasn't a good idea to attempt this post-serialization in the first place. Doesn't fit in incremental-updates. Needs to be done earlier. Will need to avoid or manage cyclical references.

Enumerated names are being over-escaped (latest Rakudo)

Consider:

use Test;
plan 1;
use PDF::IO::Writer;
enum ( :Heydər("Heydər Əliyev") );
is PDF::IO::Writer.write-name(Heydər), '/Heyd#c9#99r#20Əliyev'

Rakudo 2021.04 fails with:

1..1
not ok 1 - 
# Failed test at /tmp/tst.raku line 5
# expected: '/Heyd#c9#99r#20Əliyev'
#      got: '/#48#65#79#64#c9#99#72#20#c6#8f#6c#69#79#65#76'
# You failed 1 test of 1

CI tests failing on windows

e.g. https://github.com/pdf-raku/PDF-raku/runs/4708310704

Convert PDF::Object::Dict & Array from classes to roles

The assumption that we can statically analyse an indirect object to determine its type isn't holding throughout the dom. Resolution of object types may need to be done later - as a coercement when dereferencing via the DOM.

See pdf-raku/PDF-Class-raku#5.

Roles can be applied later and don't require the construction of new objects, when applied using the does operator.

Changing type resolution strategies may also affect PDF::Object::Delegator.

PNG 1, & 2 bit predictor decoding/encoding issues

Consider this script, run from PDF-Content-p6:

use PDF::Lite;
use PDF::Content::Image;
my PDF::Lite $pdf .= new;
my $page = $pdf.add-page;
$page.graphics: {
    my $y = 20;
    for <t/images/basn0g01.png t/images/basn3p02.png t/images/basn0g08.png> -> $file {
        my $img = PDF::Content::Image.open($file);
        .do($img,20,$y);
        $y += 80;
    }
}
$pdf.save-as: "/tmp/tst-image.pdf";

If the images are then decompressed via pdf-rewriter --uncompress /tmp/tst-image.pdf

The first two displayed images are corrupted in xdf. I suspect that 1 & 2 bit PNG predictors aren't being decoded correctly, but need to check this out.

LZW Compression is non-conformant

I'm noticing that LibGnuPDF::Filter and PDF::Storage::Filter are giving different results for LZW decompression.

It's looking like this module's implementation is not compliant. It should either be dropped or re-ported from Perl 5 (PDF::API2::Basic::PDF::Filter::LZWDecode).

Deprecations

Just to track some deprecations:

PDF::IO.substr - This method name for byte-level slicing made sense before we had newline combiners in latin encodings, but is now misleading. Replacing with PDF::IO.byte-str
PDF.permitted - Moving this method to PDF class

Low-level error when attempting to decompress LZW encoded PDF

I've found a particular PDF with LZW encoding that dies when attempting to decompress (see attached);

$ raku -I . bin/pdf-rewriter.raku --uncompress /tmp/000094.pdf 
opening /tmp/000094.pdf ...
uncompressing ...
Cannot unbox a type object (Any) to int.
  in method decode at /home/david/git/PDF-raku/lib/PDF/IO/Filter/LZW.rakumod (PDF::IO::Filter::LZW) line 64
  in method decode at /home/david/git/PDF-raku/lib/PDF/IO/Filter/LZW.rakumod (PDF::IO::Filter::LZW) line 24
  in method decode-item at /home/david/git/PDF-raku/lib/PDF/IO/Filter.rakumod (PDF::IO::Filter) line 24
  in method decode at /home/david/git/PDF-raku/lib/PDF/IO/Filter.rakumod (PDF::IO::Filter) line 13
  in method decode at /home/david/git/PDF-raku/lib/PDF/COS/Stream.rakumod (PDF::COS::Stream) line 83
  in block  at /home/david/git/PDF-raku/lib/PDF/COS/Stream.rakumod (PDF::COS::Stream) line 60
  in method uncompress at /home/david/git/PDF-raku/lib/PDF/COS/Stream.rakumod (PDF::COS::Stream) line 106
  in block  at /home/david/git/PDF-raku/lib/PDF/IO/Reader.rakumod (PDF::IO::Reader) line 873
  in method recompress at /home/david/git/PDF-raku/lib/PDF/IO/Reader.rakumod (PDF::IO::Reader) line 857
  in sub MAIN at bin/pdf-rewriter.raku line 62
  in block <unit> at bin/pdf-rewriter.raku line 5

Other LZW PDF compressed files seem to work ok.
000094.pdf

Either this is valid, and there's something wrong with LZW decompression, or it's not and this is a LTA error.

More investigation needed.

XRef Streams not reliably discarded in 1.4 save

E.g. after $ raku -I . bin/pdf-rewriter.raku t/pdf/samples/pdf-1.5-1.4-hybrid.pdf /tmp/out.pdf

Saved /tmp/out.pdf still has the XRef stream object.

Setup concurrency between PDF::IO::Serializer and PDF::IO::Writer

Currently the serializer builds a complete AST tree, which is then passed to the writer, which deconstructs and outputs it.

This is simple, but single threaded, and it creates a memory peak because the intermediate structure has to be fully constructed.

It would be good to use a supply, channel, or similar to allow the serializer to construct objects in parallel and the writer to consume them as they become available.

The ordering should probably be stable, i.e. objects are always produced and consumed in the same order, so that output PDFs remain structurally similar across repeated runs.

This should hopefully reduce peak memory usage and improve serialization speeds on multi-CPU platforms. Needs to be benchmarked.
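The shape of this proposal can be sketched with a bounded queue between a producer and a consumer. The Python below is only an illustration of the pattern, not a design for PDF::IO::Serializer or PDF::IO::Writer: the `serialize` and `write` functions, the object format and the queue size of 2 are all assumptions. A single producer feeding a FIFO queue keeps the output order stable, and the bound on the queue limits how many serialized objects exist at once.

```python
import queue
import threading

def serialize(objects, channel):
    """Producer: serialize objects one at a time, in a fixed order."""
    for obj_num, value in objects:
        channel.put((obj_num, f"{obj_num} 0 obj {value} endobj\n"))
    channel.put(None)                      # sentinel: nothing more to write

def write(channel, out):
    """Consumer: write each object as soon as it becomes available."""
    while (item := channel.get()) is not None:
        out.append(item[1])

objects = [(n, f"<< /N {n} >>") for n in range(1, 6)]
channel = queue.Queue(maxsize=2)           # bound caps objects held in memory
output = []

producer = threading.Thread(target=serialize, args=(objects, channel))
producer.start()
write(channel, output)
producer.join()
print("".join(output), end="")
```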

Adopt new coercion semantics

Affects method coerce() in PDF::COS PDF::COS::Coercer and elsewhere.

Deprecate, and replace with COERCE where practical.

There's also a two argument form: coerce($val, $type). Refactor to eliminate where possible. Rename (to coerce-to()?).

Failing to locate trailer in http://www.stillhq.com/pdfdb/000049/data.pdf

From the PDF Test Database. Looks valid to me.

 perl6 -I. bin/pdf-rewriter.pl --/repair /home/david/Documents/test-pdf/000049.pdf /tmp/out.pdf
opening /home/david/Documents/test-pdf/000049.pdf ...
Expected file trailer 'startxref ... %%EOF', got: "btype /Form  /FormType 1 /BBox [ ...                     "
  in method load-index at /home/david/git/PDF-p6/lib/PDF/Reader.pm (PDF::Reader) line 546
  in method load-cos at /home/david/git/PDF-p6/lib/PDF/Reader.pm (PDF::Reader) line 432
  in method open at /home/david/git/PDF-p6/lib/PDF/Reader.pm (PDF::Reader) line 243
  in method open at /home/david/git/PDF-p6/lib/PDF/Reader.pm (PDF::Reader) line 175
  in sub MAIN at bin/pdf-rewriter.pl line 29
  in block <unit> at bin/pdf-rewriter.pl line 7

Base specification should be ISO 32000/2008, not PDF 1.7

The source code and README reference the PDF 1.7 reference manual.

The ISO 32000 seems to be the more widely accepted standard. It's mostly just a matter of adjusting section references and checking any pasted comments, for example, in PDF::DAO::Type::Info.

Handle 'must be indirect reference' constraints in spec

A generic mechanism may be necessary to handle the 'must be an indirect reference' constraint on certain entries. For example, the Threads entry in the Catalog object; see [PDF 1.7 TABLE 3.25 Entries in the catalog dictionary].

Most likely the entry trait (Dictionaries) and index trait (Arrays) need an additional :ind-ref argument that is somehow interpreted during Dict/Array AST construction and/or passed through to the serializer, so that it 'knows' to construct an indirect object.

zef install PDF test fail

===> Searching for: PDF
===> Updating cpan mirror: https://raw.githubusercontent.com/ugexe/Perl6-ecosystems/master/cpan1.json
===> Updating p6c mirror: https://raw.githubusercontent.com/ugexe/Perl6-ecosystems/master/p6c1.json
===> Updated p6c mirror: https://raw.githubusercontent.com/ugexe/Perl6-ecosystems/master/p6c1.json
===> Updated cpan mirror: https://raw.githubusercontent.com/ugexe/Perl6-ecosystems/master/cpan1.json
===> Searching for missing dependencies: Compress::Zlib, PDF::Grammar:ver<0.2.1+>
===> Searching for missing dependencies: Compress::Zlib::Raw
===> Testing: Compress::Zlib::Raw:ver<1.0.1>:auth<github:retupmoca>
===> Testing [OK] for Compress::Zlib::Raw:ver<1.0.1>:auth<github:retupmoca>
===> Testing: Compress::Zlib:ver<1.1.0>:auth<github:retupmoca>
===> Testing [OK] for Compress::Zlib:ver<1.1.0>:auth<github:retupmoca>
===> Testing: PDF::Grammar:ver<0.2.4>:auth<github:pdf-raku>:api<PDF.1.7>
[PDF::Grammar] # loading t/helloworld.pdf (set $TEST_PDF to override)
===> Testing [OK] for PDF::Grammar:ver<0.2.4>:auth<github:pdf-raku>:api<PDF.1.7>
===> Testing: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7>
===> Testing [FAIL]: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7>
Aborting due to test failure: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7> (use --force-test to override)

stephenroe@Jester raku % raku -v
Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2020.10.
Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
Built on MoarVM version 2020.10.

on macOS Catalina 10.15.7 (19H15)

stephenroe@Jester raku % zef --verbose install PDF
===> Searching for: PDF
===> Found: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7> [via Zef::Repository::LocalCache]
===> Searching for missing dependencies: PDF::Grammar:ver<0.2.1+>
===> Found dependencies: PDF::Grammar:ver<0.2.4>:auth<github:pdf-raku>:api<PDF.1.7>, PDF::Grammar:ver<0.2.4>:auth<github:pdf-raku>:api<PDF.1.7> [via Zef::Repository::LocalCache]
===> Testing: PDF::Grammar:ver<0.2.4>:auth<github:pdf-raku>:api<PDF.1.7>
[PDF::Grammar] t/00objects.t ....... ok
[PDF::Grammar] t/content-ops.t ..... ok
[PDF::Grammar] t/content-parse.t ... ok
[PDF::Grammar] t/fdf-parse.t ....... ok
[PDF::Grammar] t/function-parse.t .. ok
[PDF::Grammar] t/pdf-components.t .. ok
[PDF::Grammar] t/pdf-objects.t ..... ok
[PDF::Grammar] # loading t/helloworld.pdf (set $TEST_PDF to override)
[PDF::Grammar] t/pdf-parsefile.t ... ok
[PDF::Grammar] t/pdf-regex.t ....... ok
[PDF::Grammar] All tests successful.
[PDF::Grammar] Files=9, Tests=602, 10 wallclock secs ( 0.17 usr 0.05 sys + 14.11 cusr 1.01 csys = 15.34 CPU)
[PDF::Grammar] Result: PASS
===> Testing [OK] for PDF::Grammar:ver<0.2.4>:auth<github:pdf-raku>:api<PDF.1.7>
===> Testing: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7>
[PDF] t/00-helloworld.t ...... ok
[PDF] t/01-readme.t .......... ok
[PDF] t/cos-coerce.t ......... ok
[PDF] t/cos-date-string.t .... ok
[PDF] t/cos-deref.t .......... ok
[PDF] t/cos-text-string.t .... ok
[PDF] t/cos-tie-entry.t ...... ok
[PDF] t/cos-tie-index.t ...... ok
[PDF] t/cos-tie.t ............ ok
[PDF] t/cos-type-info.t ...... ok
[PDF] t/cos-type-objstm.t .... ok
[PDF] t/cos-type-xref.t ...... ok
[PDF] t/cos-util.t ........... ok
[PDF] t/filter-ascii85.t ..... ok
[PDF] t/filter-asciihex.t .... ok
[PDF] t/filter-flate.t ....... ok
[PDF] t/filter-lzw.t ......... ok
[PDF] t/filter-predictors.t .. ok
[PDF] t/filter-runlength.t ... ok
[PDF] t/filter.t ............. ok
[PDF] t/indobj-array.t ....... ok
[PDF] t/indobj-bool.t ........ ok
[PDF] t/indobj-dict.t ........ ok
[PDF] t/indobj-name.t ........ ok
[PDF] t/indobj-null.t ........ ok
[PDF] t/indobj-num.t ......... ok
[PDF] t/indobj-stream.t ...... ok
[PDF] t/indobj-string.t ...... ok
[PDF] t/indobj.t ............. ok
[PDF] t/io-crypt.t ...........
[PDF] No subtests run
[PDF] t/io-serialize.t ....... ok
[PDF] t/io-util.t ............ ok
[PDF] t/io.t ................. ok
[PDF] t/pdf-cos.t ............ ok
[PDF] t/pdf-crypt-aes.t ......
[PDF] Failed 10/10 subtests
[PDF] t/pdf-crypt-rc4.t ......
[PDF] Failed 17/17 subtests
[PDF] t/pdf-open.t ...........
[PDF] All 10 subtests passed
[PDF] t/pdf-reencrypt.t ......
[PDF] Failed 6/6 subtests
[PDF] t/read-fdf.t ........... ok
[PDF] t/read-pdf.t ........... ok
[PDF] t/reader-deref.t ....... ok
[PDF] t/reader-exceptions.t .. ok
[PDF] t/update-encrypted.t ...
[PDF] Failed 5/5 subtests
[PDF] t/update.t ............. ok
[PDF] t/write-ast.t .......... ok
[PDF] t/write-indobj.t ....... ok
[PDF] t/write-pdf.t .......... ok
[PDF] Test Summary Report
[PDF] -------------------
[PDF] t/io-crypt.t (Wstat: 6 Tests: 0 Failed: 0)
[PDF] Non-zero wait status: 6
[PDF] Parse errors: No plan found in TAP output
[PDF] t/pdf-crypt-aes.t (Wstat: 6 Tests: 0 Failed: 0)
[PDF] Non-zero wait status: 6
[PDF] Parse errors: Bad plan. You planned 10 tests but ran 0.
[PDF] t/pdf-crypt-rc4.t (Wstat: 6 Tests: 0 Failed: 0)
[PDF] Non-zero wait status: 6
[PDF] Parse errors: Bad plan. You planned 17 tests but ran 0.
[PDF] t/pdf-open.t (Wstat: 6 Tests: 10 Failed: 0)
[PDF] Non-zero wait status: 6
[PDF] Parse errors: No plan found in TAP output
[PDF] t/pdf-reencrypt.t (Wstat: 6 Tests: 0 Failed: 0)
[PDF] Non-zero wait status: 6
[PDF] Parse errors: Bad plan. You planned 6 tests but ran 0.
[PDF] t/update-encrypted.t (Wstat: 6 Tests: 0 Failed: 0)
[PDF] Non-zero wait status: 6
[PDF] Parse errors: Bad plan. You planned 5 tests but ran 0.
[PDF] Files=47, Tests=839, 144 wallclock secs ( 0.36 usr 0.14 sys + 216.07 cusr 24.04 csys = 240.61 CPU)
[PDF] Result: FAIL
===> Testing [FAIL]: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7>
Aborting due to test failure: PDF:ver<0.4.4>:auth<github:pdf-raku>:api<PDF.1.7> (use --force-test to override)

PNG Predictors not working for BitsPerComponent < 8

See todo test in 6aac1a7

physical root should be trailer not document root

DOM pdf-raku/PDF-Class-raku#3 stems from the fact that we're treating the Root and Info entries differently.

The Trailer, not the Root entry, should be the entry-point for reading, serialization, and writing purposes. This will simplify the handling and interfaces between the above and the DOM.

A little refactoring is needed.

Handle FDF files, not just PDF

PDF::Tools has a current bias towards processing only PDF documents. FDF is another related format that we should be able to handle. The two formats have different DOM structures. PDF files are normally indexed; FDF files aren't.

With just a little more flexibility and a few more tests, I think that we should be able to handle both through the tool-chain (Reader, Storage, & Writer)

Implement /Parent indirect refs in serialisation

See second todo test in serialization.t

implement PDF 1.5+ Object Streams

Most likely in PDF::IO::Serializer.

These do result in smaller PDF files.

Potentially also of benefit to serialization, in reducing peak memory size and number of objects, but only if we can get the object streams built early enough.

Currently failing on Rakudo 2019.11+ blead

e.g.

$ perl6 -v;prove -e'perl6 -I.' -v t
This is Rakudo version 2019.11-388-g09e66e504 built on MoarVM version 2019.11-113-g703f023d5
implementing Perl 6.d.
t/00-helloworld.t ...... ===SORRY!=== Error while compiling /home/david/git/PDF-p6/t/00-helloworld.t
===SORRY!=== Error while compiling /home/david/git/PDF-p6/lib/PDF.pm (PDF)
===SORRY!=== Error while compiling /home/david/git/PDF-p6/lib/PDF/IO/Serializer.pm (PDF::IO::Serializer)
===SORRY!=== Error while compiling /home/david/git/PDF-p6/lib/PDF/COS/Stream.pm (PDF::COS::Stream)
Package 'PDF::COS::Stream' already has a sub 'Length' (did you mean to declare a multi-method?)
at /home/david/git/PDF-p6/lib/PDF/COS/Stream.pm (PDF::COS::Stream):6

at /home/david/git/PDF-p6/lib/PDF/IO/Serializer.pm (PDF::IO::Serializer):6

at /home/david/git/PDF-p6/lib/PDF.pm (PDF):10

at /home/david/git/PDF-p6/t/00-helloworld.t:5
Dubious, test returned 1 (wstat 256, 0x100)
No subtests run 
t/01-readme.t .......... ===SORRY!=== Error while compiling /home/david/git/PDF-p6/t/01-readme.t
===SORRY!=== Error while compiling /home/david/git/PDF-p6/lib/PDF.pm (PDF)
===SORRY!=== Error while compiling /home/david/git/PDF-p6/lib/PDF/IO/Serializer.pm (PDF::IO::Serializer)
===SORRY!=== Error while compiling /home/david/git/PDF-p6/lib/PDF/COS/Stream.pm (PDF::COS::Stream)
Package 'PDF::COS::Stream' already has a sub 'Length' (did you mean to declare a multi-method?)
at /home/david/git/PDF-p6/lib/PDF/COS/Stream.pm (PDF::COS::Stream):6

at /home/david/git/PDF-p6/lib/PDF/IO/Serializer.pm (PDF::IO::Serializer):6

at /home/david/git/PDF-p6/lib/PDF.pm (PDF):10

at /home/david/git/PDF-p6/t/01-readme.t:4

(... + lots more failures).

The simple trick of overriding the Attribute compose method is not currently working. May need an alternative.

Nested indirect objects do not serialize properly

They are currently being inlined. See first todo test in t/serialize.t

encryption support

This should be possible via the OpenSSL module, which supports AES.

This module should probably be an optional dependency. If it is missing, die with a message such as 'This PDF is AES encrypted. Please install the OpenSSL module and try again', or some such.
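A minimal sketch of the optional-dependency idea, using Raku's run-time require. The sub names module-available and require-aes-support are hypothetical, and nothing here calls into the OpenSSL module itself.

```raku
# Hypothetical helper: True if the named module can be loaded at run time.
sub module-available(Str $name --> Bool) {
    (try require ::($name)) !=== Nil;
}

# Hypothetical guard, to be called before attempting AES decryption.
sub require-aes-support() {
    die 'This PDF is AES encrypted. Please install the OpenSSL module and try again'
        unless module-available('OpenSSL');
}
```

Loading the module lazily like this keeps it out of the hard dependency list, so unencrypted documents can still be processed when it is absent.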

Slowdowns noticed after panda install

I noticed this when I was trying to profile code:

$ perl6 -v
This is Rakudo version 2016.03-105-ga3b8fef built on MoarVM version 2016.03-84-g4afd7b6
implementing Perl 6.c.
$ perl6 -I lib -e'use PDF::DAO;PDF::DAO.coerce: :stream{} for 1..10'
$ time perl6 -I lib -e'use PDF::DAO;PDF::DAO.coerce: :stream{} for 1..50'

real    0m2.875s
user    0m2.740s
sys     0m0.132s

But when I drop the -I lib from the command line, and use the same version, panda installed:

$ perl6 -e'use PDF::DAO;PDF::DAO.coerce: :stream{} for 1..10'
$ time perl6 -e'use PDF::DAO;PDF::DAO.coerce: :stream{} for 1..50'

real    0m25.763s
user    0m25.564s
sys     0m0.188s

Invocations of the coerce method are much slower. I suspect a Rakudo precompilation issue, but more investigation is needed.

Text line/word breaking wrt Standard Annex #14

http://unicode.org/reports/tr14/ contains some useful info that can be used to improve and better generalize word/line breaking in the $.page.text() method.

Without going overboard, the text breaking method could make use of the non-breaking classes, break opportunities and better handle numeric context.

A null incremental update to a PDF 1.5+ file can break Adobe reader

The Perl 5 PDF::API2 example from https://perlmonks.com/?node_id=1233444, which fails to render in Adobe Reader, is also a problem here.
Something as simple as:
use PDF; my PDF $p .= open: "test.pdf"; $p.save-as: "test6.pdf"
can fail to open in Adobe Reader if the input file contains PDF 1.5+ cross-reference streams.

Page composition - versus PDF::API2

A few issues with the existing PDF::API2 interface that I consider minor restrictions or dated behavior. There are probably several separate issues here, but I just want to jot them down for the moment:

It only lets you append content to an existing page. This makes it difficult to, e.g., insert a base background into an existing PDF. The most common use case is to append content, but the new API should also allow prepending of content to a page.
PDF::API2 exposes separate gsave, transform, and grestore methods. I'm leaning towards having a single gs object to which text, images and/or graphics can be appended. gs objects can be nested, forming a tree structure. I'd say nested gs transforms should, by default, be relative to each other.
PDF::API2 has a gfx object. My preference is to call it a canvas and model it more on the HTML5 Canvas API - see https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API
PDF::API2 has issues when appending to a page whose content hasn't been wrapped in a q .. Q (gsave ... grestore). This post discusses the issue and work-arounds - http://stackoverflow.com/questions/25524492/manipulating-a-pdf-file-with-different-rotations-and-scaling-with-perls-pdfap/29067780#29067780. Just make sure we solve this and also that we always emit 'well formed' content, so that anything downstream from us doesn't have the same issues.
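The prepend and q .. Q points above can be sketched with plain strings. This is only an illustration of the idea: compose-content is a hypothetical name, it works on raw content-stream text, not on any existing API, and it assumes the existing content is itself balanced.

```raku
# Hypothetical helper: isolate existing page content in a q ... Q pair, with
# optional new content placed before and/or after it, each in its own pair.
sub compose-content(Str $existing, Str :$prepend = '', Str :$append = '' --> Str) {
    my @parts;
    @parts.push: "q\n{$prepend.trim}\nQ" if $prepend.trim;
    @parts.push: "q\n{$existing.trim}\nQ";
    @parts.push: "q\n{$append.trim}\nQ" if $append.trim;
    @parts.join: "\n";
}

# Paint a light gray background before the page's existing text.
say compose-content(
    "BT /F1 12 Tf (Hello) Tj ET",
    :prepend("0.9 g 0 0 612 792 re f"),
);
```

Because each part sits in its own q .. Q pair, graphics state changes made by one part (such as the fill color above) do not leak into the next.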

Opening of large PDFs using a lot more memory on Rakudo 2020.1+ blead

$ perl6 -v
This is Rakudo version 2020.02.1-261-g478239e61 built on MoarVM version 2020.02.1-69-g16ff1585e
implementing Raku 6.d.
$ perl6 -I . bin/pdf-rewriter.raku ~/Documents/PDF32000_2008.pdf /tmp/out.pdf
opening /home/david/Documents/PDF32000_2008.pdf ...
^C

Seems to occur on index load. Regressed since Rakudo 2020.02.1. Investigation needed.

travis build error "Cannot find method 'find_symbol'"

Known issue. Already reported by moritz++ https://rt.perl.org/Ticket/Display.html?id=126816

Lighten numeric representations

Currently, the AST constructs numerics as pairs with key int, real or bool, and every single value is coerced to PDF::COS::Int, PDF::COS::Real or PDF::COS::Bool.

That seems like a lot of work, when these directly map to Raku representations and 99.9% of them are not indirect objects.

I'm considering a refactor where numbers are simply passed through, except in the (rare) case where they are indirect objects.

This should mostly be an internal refactor, however.

PDF::Grammar will need a lite parsing mode
PDF::Content works with internal structures somewhat and may be impacted (maybe it shouldn't)
Serialization to json will change (to something simpler and more compact)
Regress PDF::Lite, PDF::Class, PDF::Tags, PDF::Font::Loader

Would be good to support reading json in the previous format.

Possibly strings should be kept in the current format (literal, hex-string, name, stream); there's no dominant format or natural default.
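To illustrate the pass-through idea for numerics, here is a small sketch. The pair shape (key int, real or bool) comes from the description above. The lighten sub is hypothetical and says nothing about how PDF::Grammar or PDF::COS would actually implement the change.

```raku
# Hypothetical illustration: unwrap numeric and boolean AST pairs into
# native Raku values, and leave every other node untouched.
multi sub lighten(Pair $ast where { .key eq 'int'|'real'|'bool' }) { $ast.value }
multi sub lighten($other) { $other }

my $int-node  = :int(42);
my $real-node = :real(0.5);
my $name-node = :name<Catalog>;

say lighten($int-node);   # the Int itself, with no wrapper
say lighten($real-node);  # the Rat itself
say lighten($name-node);  # still a pair, since names keep their tag
```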

PDF::COS::TextString should be PDFDoc encoded, when not BOM/UTF-16BE

Currently assumes latin-1
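PDFDocEncoding agrees with Latin-1 for most bytes but differs for some, mainly in the 0x80 to 0xA0 range. The sketch below shows the shape of a fix under loud assumptions: the override table is deliberately partial (five entries only), the sub names are hypothetical, and UTF-16BE decoding is left to the caller.

```raku
# PARTIAL table of bytes where PDFDocEncoding differs from Latin-1.
# A real implementation would need the complete table from the PDF specification.
my %pdfdoc-overrides =
    0x80 => 0x2022,  # bullet
    0x84 => 0x2014,  # em dash
    0x85 => 0x2013,  # en dash
    0x92 => 0x2122,  # trade mark sign
    0xA0 => 0x20AC,  # euro sign
;

# True if the bytes start with the UTF-16BE byte order mark (FE FF).
sub has-utf16be-bom(Blob $bytes --> Bool) {
    so $bytes.elems >= 2 && $bytes[0] == 0xFE && $bytes[1] == 0xFF;
}

# Decode bytes as PDFDocEncoding, falling back to the Latin-1 code point
# for any byte that is not in the (partial) override table.
sub pdfdoc-decode(Blob $bytes --> Str) {
    $bytes.list.map({ chr(%pdfdoc-overrides{$_} // $_) }).join;
}

my $raw = Buf.new(0x41, 0x80, 0xA0);
say has-utf16be-bom($raw) ?? 'UTF-16BE text string' !! pdfdoc-decode($raw);
```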