[Raku PDF Project] / PDF::Grammar


Although PDF documents do not lend themselves to an overall BNF-style grammar description, there are areas where grammars can be put to use.

PDF::Grammar is a set of Raku grammars for parsing and validation of real-world PDF examples. There are five grammars:

PDF::Grammar::Content - describes the text and graphics operators that are used to produce page layout.

PDF::Grammar::Content::Fast - is an optimized version of PDF::Grammar::Content.

PDF::Grammar::FDF - this describes the file structure of FDF (Form Data) exchange files.

PDF::Grammar::PDF - this describes the file structure of PDF documents, including headers, trailers, top-level objects and the cross-reference table.

PDF::Grammar::Function - a tokeniser for PostScript Calculator (type 4) functions.

PDF::Grammar has so far been tested against a number of sample PDF documents and may still be subject to change.

I have been working off the PDF 1.7 reference manual (http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). I've relaxed rules, when needed, to handle real-world examples.

Usage Notes


AST Reference

The action methods in this module return ASTs. Each node in the tree is a key/value pair, where the key is the AST tag that indicates the type of the node.

For example, here's the AST tree for the following parse:

use PDF::Grammar::PDF;
use PDF::Grammar::PDF::Actions;
my $actions = PDF::Grammar::PDF::Actions.new;

PDF::Grammar::PDF.parse( q:to"--END-DOC--", :rule<ind-obj>, :$actions);
3 0 obj <<
   /Type /Pages
   /Count 1
   /Kids [4 0 R]
>> endobj
--END-DOC--

say '# ' ~ $/.ast.raku;
# :ind-obj($[3, 0, :dict({:Count(:int(1)), :Kids(:array([:ind-ref($[4, 0])])), :Type(:name("Pages"))})])
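Because each node is an ordinary Pair, the tree can be walked with `.key` and `.value`. The following is a minimal sketch that does not call the parser: it writes out the AST printed above as a literal and picks it apart. The variable names are illustrative only.

```raku
# The AST from the example above, written out as a literal
my $ast = :ind-obj($[3, 0, :dict({:Count(:int(1)), :Kids(:array([:ind-ref($[4, 0])])), :Type(:name("Pages"))})]);

# An ind-obj value is a three-element array:
# object number, generation number, and the object itself
my $obj-num = $ast.value[0];
my $gen-num = $ast.value[1];
my $object  = $ast.value[2];     # a Pair: dict => {...}

# The dict value is a Hash whose values are tagged nodes in turn
my $dict = $object.value;
my @kids = $dict<Kids>.value.map({ .value.join(' ') ~ ' R' });

say "object $obj-num $gen-num holds a {$object.key}";   # object 3 0 holds a dict
say "Type:  {$dict<Type>.value}";                       # Type:  Pages
say "Count: {$dict<Count>.value}";                      # Count: 1
say "Kids:  {@kids.join(', ')}";                        # Kids:  4 0 R
```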

Note that there's also a lite mode, in which values of type bool, int, real and null are returned without their type tags:

$actions .= new: :lite;
PDF::Grammar::PDF.parse( q:to"--END-", :rule<ind-obj>, :$actions);
3 0 obj << /Count 1 >> endobj
--END-
say '# ' ~ $/.ast.raku;
# :ind-obj($[3, 0, :dict({:Count(1)})])
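Code that consumes these ASTs may need to cope with both forms. The sketch below shows one possible approach, assuming (as the output above suggests for int) that lite mode simply leaves the value unwrapped. The `plain` helper is hypothetical and not part of PDF::Grammar.

```raku
# Hypothetical helper, not part of PDF::Grammar: return the underlying
# value whether or not it is wrapped in a bool/int/real/null tag
sub plain($node) {
    $node ~~ Pair && $node.key eq any(<bool int real null>)
        ?? $node.value
        !! $node;
}

my %full = Count => ('int' => 1);   # dict entry as built by the default actions
my %lite = Count => 1;              # the same entry in lite mode

say plain(%full<Count>);   # 1
say plain(%lite<Count>);   # 1
```

Other node types, such as name or array, are passed through unchanged, since they keep their tags in both modes.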

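The first example above is an indirect object (ind-obj) that contains a dictionary object (dict). Entries in the dictionary are:

Count - an integer (int) with the value 1.

Kids - an array (array) holding one indirect reference (ind-ref) to object 4 0.

Type - a name (name) with the value Pages.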

In most cases, the node type corresponds to the name of the rule or token that was used to construct the node.

This AST representation is used extensively throughout the PDF tool-chain; for example, PDF::Writer uses it as an intermediate format for reserialization.
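To give a feel for how the tags can drive reserialization, here is a rough sketch that dispatches on the node type and turns a small AST back into PDF syntax. It is not how PDF::Writer is implemented: `to-pdf` is a hypothetical helper that covers only a handful of node types and leaves out strings and streams, which need escaping and encoding.

```raku
# Hypothetical helper, not part of PDF::Grammar or PDF::Writer
sub to-pdf(Pair $node --> Str) {
    my $v = $node.value;
    given $node.key {
        when 'int' | 'real' { return ~$v }
        when 'bool'         { return $v ?? 'true' !! 'false' }
        when 'null'         { return 'null' }
        when 'name'         { return "/$v" }
        when 'ind-ref'      { return $v.join(' ') ~ ' R' }
        when 'array'        { return '[ ' ~ $v.map(&to-pdf).join(' ') ~ ' ]' }
        when 'dict' {
            # sort the keys so that the output order is predictable
            my @entries = $v.sort(*.key).map({ "/{.key} {to-pdf(.value)}" });
            return '<< ' ~ @entries.join(' ') ~ ' >>';
        }
        when 'ind-obj' {
            return ($v[0], $v[1], 'obj', to-pdf($v[2]), 'endobj').join(' ');
        }
        default { die "unhandled AST tag: {$node.key}" }
    }
}

my $ast = :ind-obj($[3, 0, :dict({:Count(:int(1)), :Kids(:array([:ind-ref($[4, 0])])), :Type(:name("Pages"))})]);

say to-pdf($ast);
# 3 0 obj << /Count 1 /Kids [ 4 0 R ] /Type /Pages >> endobj
```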

For reference, here is a list of all AST node types:

| AST Tag | Raku Type | Description |
|---|---|---|
| array | Array[Any] | Array object type, e.g. [ 0 0 612 792 ] |
| body | Array[Hash] | The FDF/PDF body, consisting of ind-obj and comment entries. A PDF with revisions has multiple body segments |
| bool | Bool | Boolean object type, e.g. true [1] |
| comment | Str | (Write only) a comment string |
| cos | Hash | A PDF or FDF document, consisting of a header and body array |
| dict | Hash | Dictionary object type, e.g. << /Type /Catalog /Pages 3 0 R >> |
| encoded | Str | Raw encoded stream data. This is returned as a latin-1 byte-string |
| entries | Array[Hash] | A list of entries in a cross reference segment |
| decoded | Str | Uncompressed/unencrypted stream data |
| gen-num | UInt | Object generation number |
| header | Hash | PDF or FDF header, e.g. %PDF-1.4 |
| hex-string | Str | A hex-string, e.g. <736e6f6f7079> |
| ind-ref | Array[UInt] | An indirect reference, e.g. 23 2 R |
| ind-obj | Any | An indirect object. This is a three-element array that contains an object number, generation number and the object |
| int | Int | Integer object type, e.g. 42 [1] |
| obj-count | UInt | Object count (number of entries) in a cross reference segment |
| obj-first-num | UInt | First object number in a cross reference segment |
| obj-num | UInt | Object number |
| offset | UInt | Byte offset of an indirect object in the file |
| literal | Str | A literal string, e.g. (Hello, World!) |
| name | Str | Name string, e.g. /Fred |
| null | Mu | Null object type, e.g. null [1] |
| real | Real | Real object type, e.g. 42.0 [1] |
| start | UInt | Start position of stream data (returned by ind-obj-nibble rule) |
| startxref | UInt | Byte offset from the start of the file to the start of the trailer |
| stream | Hash | Stream object type. A dictionary indirect object followed by stream data |
| trailer | Hash | Trailer. This typically contains the trailer dict entry |
| type | Str | Document type: 'pdf' or 'fdf' |
| version | Rat | The PDF / FDF version number, parsed from the header |

Note [1] Types bool, int, real, and null don't appear in lite mode.

See also