PDF corpus

This project lets you quickly create hand-crafted PDF files. The main Python script, pdf-corpus.py, is an ad hoc template engine for prototyping new PDFs.

Installation

To compile the corpus, run make (you need a Python interpreter). All .txt files in the corpus/ folder are then converted into PDFs.

Description

Each PDF in the corpus is described by a .txt file that indicates the template to use and the content to insert in the template. The following templates are defined, but you can easily create your own by tweaking the Python code.

  • contentstream: A simple document containing one page in A4 format. You define the graphic commands to put in the page's content stream (see my cheat sheet). For convenience, a font resource is declared as /F1.
  • objects: A lower-level template for declaring objects directly. Simple streams can be defined, and the template computes their /Length field.
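
To make the contentstream idea concrete, here is a minimal sketch of the kind of file such a template could produce. It is not the code from pdf-corpus.py: the function name build_pdf, the object numbering, and the choice of Helvetica for /F1 are assumptions made for illustration. The sketch wraps a content stream in a one-page A4 document, computes the /Length entry from the stream data, and records byte offsets for the cross-reference table.

```python
def build_pdf(content: bytes) -> bytes:
    """Assemble a one-page A4 PDF around a content stream (hypothetical helper)."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # A4 is 595 x 842 points; the font resource is exposed as /F1.
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        # Assumption: /F1 maps to the standard Helvetica font.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        # /Length counts only the bytes between "stream\n" and "\nendstream".
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


# Graphic commands for the page: draw "Hello" in 24 pt using /F1.
example = build_pdf(b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET")
print(len(example), "bytes")
```

Writing the returned bytes to a file in binary mode should give a document that common readers open, although, as with the corpus itself, strict compliance is not guaranteed.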

Available corpus

The corpus already contains some files. These examples are classified into the following categories.

  • corpus/contentstream/: Playing with graphics instructions.
  • corpus/name/: Escape sequences in names.
  • corpus/number/: How numbers are parsed.

If you want to learn more about how these examples work, have a look at my blog posts: introduction to PDF syntax. I also make one-page cheat sheets about PDF. For further details, you can dive into the PDF specification.
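
As an illustration of what the corpus/name/ category exercises, the sketch below decodes the #xx hexadecimal escapes that the PDF specification allows in names (from PDF 1.2 onward). The function decode_pdf_name is a hypothetical helper, not part of pdf-corpus, and it assumes the name is passed without its leading slash. It leaves a # that is not followed by two hexadecimal digits untouched, which is only one possible choice: this is the kind of malformed input on which readers may disagree.

```python
import re

# Matches a '#' followed by exactly two hexadecimal digits.
_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name(raw: bytes) -> bytes:
    """Decode #xx escapes in a PDF name given without its leading slash."""
    return _ESCAPE.sub(
        lambda match: bytes.fromhex(match.group(1).decode("ascii")),
        raw,
    )


print(decode_pdf_name(b"Lime#20Green"))  # b'Lime Green'
print(decode_pdf_name(b"A#42"))          # b'AB'
```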

Disclaimer

Once compiled, these example files may not be fully compliant with the specification. In particular, they may be interpreted differently by different PDF readers.

Contributors

gendx
