PDF-Grammar

Although PDF documents do not lend themselves to an overall BNF-style grammar description, there are areas where grammars can be put to use, including:

  • PDF file header and trailer/xref parsing
  • Parsing of objects fetched via the xref index. Top-level objects commonly include dictionaries, streams, arrays or numbers.
  • The overall file structure for FDF files (which are not indexed), or for full-scan recovery of PDF files (headers, objects, cross-reference tables and footers).
  • Parsing the operators and operands that make up content streams. These are used to mark up text, forms, images and graphical elements.

PDF::Grammar is a set of Raku grammars for parsing and validation of real-world PDF examples. There are five grammars:

PDF::Grammar::Content - describes the text and graphics operators that are used to produce page layout.

PDF::Grammar::Content::Fast - is an optimized version of PDF::Grammar::Content.

PDF::Grammar::FDF - this describes the file structure of FDF (Form Data) exchange files.

PDF::Grammar::PDF - this describes the file structure of PDF documents, including headers, trailers, top-level objects and the cross-reference table.

PDF::Grammar::Function - a tokeniser for PostScript Calculator (type 4) functions.

PDF-Grammar has so far been tested against a number of sample PDF documents and may still be subject to change.

I have been working off the PDF 1.7 reference manual (http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). I've relaxed rules, when needed, to handle real-world examples.

Usage Notes

  • PDF input files typically contain a mixture of ASCII directives and binary data, plus byte-orientated addressing. For this reason:

    • files should be read as binary (avoid encoding layers)
    • strings should be decoded as latin1

    % raku -MPDF::Grammar::PDF -e'say PDF::Grammar::PDF.parse: slurp(@*ARGS[0], :bin).decode("latin-1")' flyer.pdf

  • This module is put to work by the down-stream PDF module. E.g. to uncompress a PDF, using the installed pdf-rewriter script:

    % pdf-rewriter.raku --uncompress flyer.pdf
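
The reason for decoding as latin1 in the first note above can be shown outside Raku. The following is a minimal sketch in Python, used purely for illustration and not part of PDF::Grammar; the sample bytes are made up. Latin-1 maps every byte to exactly one character, so positions in the decoded string match the byte offsets that PDF cross-reference data relies on:

```python
# Illustrative only: a made-up fragment mixing ASCII directives and binary bytes.
raw = (
    b"%PDF-1.4\n"
    b"%\xe2\xe3\xcf\xd3\n"
    b"1 0 obj\n<< /Length 4 >>\nstream\n\x00\xff\x80\x7f\nendstream\nendobj\n"
)

# latin-1 maps each byte 0x00-0xFF to the code point with the same value.
text = raw.decode("latin-1")

# One character per byte, so string indexes equal byte offsets.
assert len(text) == len(raw)

# The decoding is lossless: encoding again gives back the original bytes.
assert text.encode("latin-1") == raw

# A byte offset found in the raw data is also valid in the decoded string.
offset = raw.index(b"1 0 obj")
print(offset, repr(text[offset:offset + 7]))
```

A multi-byte encoding such as UTF-8 would either reject these bytes or merge several of them into one character, which would shift every later offset.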
    

Examples

  • parse some markup content:

    % raku -M PDF::Grammar::Content -e"say PDF::Grammar::Content.parse('(Hello, world\041) Tj')"

  • parse a PDF file:

    % raku -MPDF::Grammar::PDF -e'say PDF::Grammar::PDF.parsefile(@*ARGS[0])' flyer.pdf

  • dump the contents of a PDF

    use v6;
    use PDF::Grammar::PDF;
    use PDF::Grammar::PDF::Actions;
    
    sub MAIN(Str $pdf-file) {
        my $actions = PDF::Grammar::PDF::Actions.new;
    
        if PDF::Grammar::PDF.parsefile( $pdf-file, :$actions ) {
            say $/.ast.raku;
        }
        else {
            say "failed to parse PDF: $pdf-file";
        }
    }
    

AST Reference

The action methods in this module build an abstract syntax tree (AST). Each node in the tree is a key/value pair, where the key is the AST tag that indicates the type of the node.

For example, here's the AST tree for the following parse:

use PDF::Grammar::PDF;
use PDF::Grammar::PDF::Actions;
my $actions = PDF::Grammar::PDF::Actions.new;

PDF::Grammar::PDF.parse( q:to"--END-DOC--", :rule<ind-obj>, :$actions);
3 0 obj <<
   /Type /Pages
   /Count 1
   /Kids [4 0 R]
>>
endobj
--END-DOC--

say '# ' ~ $/.ast.raku;
# :ind-obj($[3, 0, :dict({:Count(:int(1)), :Kids(:array([:ind-ref($[4, 0])])), :Type(:name("Pages"))})])

Note that there's also a lite mode, which omits the type tags for bool, int, real and null values:

$actions .= new: :lite;
PDF::Grammar::PDF.parse( q:to"--END-", :rule<ind-obj>, :$actions);
3 0 obj << /Count 1 >> endobj
--END--
say '# ' ~ $/.ast.raku;
# :ind-obj($[3, 0, :dict({:Count(1)})])

The first example above is an indirect object (ind-obj) that contains a dictionary object (dict). Entries in the dictionary are:

  • Count with integer value (int) of 1.
  • Kids, an array (array) containing one indirect reference (ind-ref).
  • Type with name (name) 'Pages'.

In most cases, the node type corresponds to the name of the rule or token that was used to construct the node.

This AST representation is used extensively throughout the PDF tool-chain; for example, PDF::Writer uses it as an intermediate format for reserialization.
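
To make the relationship between the full and lite forms concrete, here is a small Python sketch. It is only a model of the data shapes shown above, not how PDF::Grammar::PDF::Actions is implemented: the `(tag, value)` tuples standing in for Raku pairs and the `to_lite` helper are assumptions made for illustration.

```python
# Hypothetical Python stand-in for the AST: each tagged node is a (tag, value) tuple.
LITE_SKIPPED_TAGS = {"bool", "int", "real", "null"}

def to_lite(node):
    """Return a copy of the node with bool/int/real/null tags removed."""
    if isinstance(node, tuple) and len(node) == 2 and isinstance(node[0], str):
        tag, value = node
        if tag in LITE_SKIPPED_TAGS:
            return value              # keep the bare value, drop the tag
        return (tag, to_lite(value))  # keep the tag, recurse into the value
    if isinstance(node, dict):
        return {key: to_lite(val) for key, val in node.items()}
    if isinstance(node, list):
        return [to_lite(item) for item in node]
    return node

# Model of the full AST from the first example.
full = ("ind-obj", [3, 0, ("dict", {
    "Count": ("int", 1),
    "Kids": ("array", [("ind-ref", [4, 0])]),
    "Type": ("name", "Pages"),
})])

lite = to_lite(full)
print(lite)
```

Only the scalar tags are dropped; structural tags such as dict, array, name and ind-ref remain, which matches the lite output shown above for the Count entry.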

For reference, here is a list of all AST node types:

| AST Tag | Raku Type | Description |
|---|---|---|
| array | Array[Any] | Array object type, e.g. [ 0 0 612 792 ] |
| body | Array[Hash] | The FDF/PDF body consisting of ind-obj and comment entries. A PDF with revisions has multiple body segments |
| bool | Bool | Boolean object type, e.g. true [1] |
| comment | Str | (Write only) a comment string |
| cos | Hash | A PDF or FDF document, consisting of a header and body array |
| dict | Hash | Dictionary object type, e.g. << /Type /Catalog /Pages 3 0 R >> |
| encoded | Str | Raw encoded stream data. This is returned as a latin-1 byte-string |
| entries | Array[Hash] | A list of entries in a cross reference segment |
| decoded | Str | Uncompressed/unencrypted stream data |
| gen-num | UInt | Object generation number |
| header | Hash | PDF or FDF header, e.g. %PDF-1.4 |
| hex-string | Str | A hex-string, e.g. <736e6f6f7079> |
| ind-ref | Array[UInt] | An indirect reference, e.g. 23 2 R |
| ind-obj | Any | An indirect object. This is a three element array that contains an object number, generation number and the object |
| int | Int | Integer object type, e.g. 42 [1] |
| obj-count | UInt | Object count/number of entries in a cross reference segment |
| obj-first-num | UInt | First object number in a cross reference segment |
| obj-num | UInt | Object number |
| offset | UInt | Byte offset of an indirect object in the file |
| literal | Str | A literal string, e.g. (Hello, World!) |
| name | Str | Name string, e.g. /Fred |
| null | Mu | Null object type, e.g. null [1] |
| real | Real | Real object type, e.g. 42.0 [1] |
| start | UInt | Start position of stream data (returned by ind-obj-nibble rule) |
| startxref | UInt | Byte offset from the start of the file to the start of the trailer |
| stream | Hash | Stream object type. A dictionary indirect object followed by stream data |
| trailer | Hash | Trailer. This typically contains the trailer dict entry |
| type | Str | Document type; 'pdf', or 'fdf' |
| version | Rat | The PDF / FDF version number, parsed from the header |

Note [1] Types bool, int, real, and null don't appear in lite mode.

See also

  • PDF - Raku module for PDF manipulation, including compression, encryption and reading and writing of PDF data.

Contributors

dwarring, moritz

pdf-grammar-raku's Issues

Honor PDF 2.0 /L (/Length) in content inline-image dictionary

The /L parameter, newly defined in PDF 2.0, resolves the ambiguity that arises when the EI terminator also appears within the inline image data, e.g.:

BI /L 6 ID abc EI EI

/L 6 indicates that we should consume 6 bytes as image data, then look for the EI terminator. The actual image data is: abc EI.
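
The behaviour requested here can be sketched as follows. This is an illustrative Python model only, not the grammar's implementation; `read_inline_image` is a hypothetical helper, and the fallback used when /L is absent (stop at the first whitespace-delimited EI) is an assumed heuristic chosen to show the ambiguity.

```python
import re

def read_inline_image(content: bytes):
    """Return (image_data, end_offset) for the first inline image in content."""
    bi = content.index(b"BI")
    id_match = re.search(rb"\sID\s", content[bi:])
    if id_match is None:
        raise ValueError("no ID operator after BI")
    dict_bytes = content[bi + 2 : bi + id_match.start()]
    data_start = bi + id_match.end()  # data begins after the whitespace byte following ID

    length_match = re.search(rb"/L\s+(\d+)", dict_bytes)
    if length_match:
        # /L present: consume exactly that many bytes, then expect EI.
        length = int(length_match.group(1))
        data = content[data_start : data_start + length]
        if len(data) < length:
            raise ValueError("image data shorter than /L")
        pos = data_start + length
        while pos < len(content) and content[pos : pos + 1].isspace():
            pos += 1
        if content[pos : pos + 2] != b"EI":
            raise ValueError("EI terminator not found after image data")
        return data, pos + 2

    # No /L: fall back to the first whitespace-delimited EI (may be wrong).
    ei_match = re.search(rb"\sEI(?=\s|$)", content[data_start:])
    if ei_match is None:
        raise ValueError("EI terminator not found")
    return content[data_start : data_start + ei_match.start()], data_start + ei_match.end()

print(read_inline_image(b"BI /L 6 ID abc EI EI"))
print(read_inline_image(b"BI /W 1 ID abc EI EI"))
```

With /L 6 the first call keeps the embedded EI as part of the image data; without /L the second call stops at the first EI and leaves a stray EI in the stream.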

Changed xref AST

The xref AST currently looks very similar to the physical structure with an implied sequence of object numbers.

xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n

Produces an AST of:

[
    {
        :obj-count(4),
        :obj-first-num(0),
        :entries[[0,   65535, 0],
                 [9,   0,     1],
                 [74,  0,     1],
                 [120, 0,     1],
                ]
    }
]

The issue is that the PDF layer has to rebuild this into another, more logical intermediate structure:

[
    {
        :obj-count(4),
        :entries[[0, 65535, 0,   0],
                 [1, 0,     9,   1],
                 [2, 0,     74,  1],
                 [3, 0,     120, 1],
                ]
    }
]

The proposal is to change the AST to the second, more logical form. This will help streamline and simplify PDF index loading.
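
The rebuild step that the PDF layer currently performs can be sketched as below. This is a Python illustration under stated assumptions, not code from the PDF module: `expand_xref_section` is a hypothetical name, and the entry layouts ([offset, gen-num, type] in the first form, [obj-num, gen-num, offset, type] in the second) are inferred from the two examples above.

```python
def expand_xref_section(section: dict) -> dict:
    """Convert the physical xref form into one with explicit object numbers."""
    first = section["obj-first-num"]
    entries = [
        # Object numbers are implied by position: first, first + 1, ...
        [first + index, gen_num, offset, entry_type]
        for index, (offset, gen_num, entry_type) in enumerate(section["entries"])
    ]
    return {"obj-count": section["obj-count"], "entries": entries}

# The first AST from this issue, modelled as plain Python data.
physical = {
    "obj-count": 4,
    "obj-first-num": 0,
    "entries": [[0, 65535, 0], [9, 0, 1], [74, 0, 1], [120, 0, 1]],
}

logical = expand_xref_section(physical)
print(logical)
```

Producing the second form directly in the grammar's actions would make this conversion unnecessary downstream.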
