Wiki Discussion: DCG and phrase/3

Since I am currently working on a PDF/PostScript parser using DCGs and it has lots of places that make for useful examples, it will most likely serve as the basis for the Wiki.

The only documentation needed to understand PostScript well enough to write a parser is the
PostScript Language Reference, third edition.

The first iteration of the parser was done to get an idea of how the Prolog code needed to be structured to correctly and efficiently parse PostScript. It was written by basically reading the manual from the start and adding predicates/facts as I read. Knowing that the code would be a throw-away version, it was written so that as a whole it could be discarded, but such that the lower parts could be salvaged and reused. Also, the most important artifact needed from this version was the set of test cases.

Since the goal of the final version of the parser is to parse tens of thousands of PDF and PS files as a large batch job, and since previous experience writing and profiling many parsers shows that the byte-level parsing does almost all of the work (80%, 90% or more is not uncommon), that code needed to be fast and efficient. So instead of doing a classic lexer/tokenizer followed by a parser, the code attempts to go from reading bytes to syntax trees in one pass. Attempts here means that a byte is not processed again and again in several passes; even the use of append/2 or append/3 is considered too much overhead, and the clauses should be as deterministic as possible, with as few redos as possible.
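
As an illustration only (a hedged sketch; digits//1 and its helpers are names I made up, not the parser's actual code), that one-pass style means a nonterminal builds its result while it consumes bytes, with a cut keeping the clauses deterministic, rather than collecting pieces and joining them with append/3 afterwards:

```prolog
% Each byte is inspected exactly once; the cut prevents redos and the
% result list is built as parsing advances, so no append/3 is needed.
digits([D|Ds]) --> digit(D), digits_rest(Ds).

digits_rest([D|Ds]) --> digit(D), !, digits_rest(Ds).
digits_rest([])     --> [].

digit(D) --> [D], { 0'0 =< D, D =< 0'9 }.
```

A query such as ?- phrase(digits(Ds), `42abc`, Rest). binds Ds to the two digit codes and Rest to the remaining bytes, in a single left-to-right pass.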

PostScript does make use of delimiters for tokenization and even notes this in the manual, but the following sentence conflicted with the design goals:

Any token that consists entirely of regular characters and cannot be interpreted as a number is treated as a name object (more precisely, an executable name).

Implementing this in Prolog looks very easy at first: just parse as a number and, if that fails, it must be a name; what could be easier? The problem is that for every name, all the bytes that make up the name first have to be parsed as a number; only upon failure can the token be considered a name, and to be correct the entire name then has to be parsed again to make sure it has no illegal characters. And if a separate tokenizer is used, the bytes are processed in their entirety yet again, since the tokenization takes place before the classification/typing of the tokens.
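
A hedged sketch of that naive reading (naive_token//1, regulars//1 and regular/1 are hypothetical names; number//1 is the one from SWI-Prolog's library(dcg/basics), which is close to but not exactly PostScript's number syntax):

```prolog
:- use_module(library(dcg/basics)).        % number//1

% Try the whole token as a number first; only when that fails are the
% very same bytes re-parsed as a name, so every name is scanned twice.
naive_token(number(N)) --> number(N).
naive_token(name(A))   --> regulars(Cs), { atom_codes(A, Cs) }.

regulars([C|Cs]) --> [C], { regular(C) }, regulars(Cs).
regulars([C])    --> [C], { regular(C) }.

% Regular character: not PostScript whitespace and not a delimiter.
regular(C) :-
    \+ memberchk(C, [0, 9, 10, 12, 13, 32,
                     0'(, 0'), 0'<, 0'>, 0'[, 0'], 0'{, 0'}, 0'/, 0'%]).
```

For an input such as `20y`, phrase(naive_token(T), `20y`) first walks the bytes in the number branch, fails because of the y, and then walks the same bytes all over again in the name branch; exactly the double work described above.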

In the rewrite, the part requiring the most detail was writing the DCG that validates the input, especially for the types number and name: it has to process each byte once and, from that, determine both the type of the value and the bytes that make up the value. This was accomplished by starting with a pen-and-paper syntax diagram
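
The shape of such a DCG can be sketched as below (the state and predicate names are hypothetical, and the real diagram has many more states, covering reals, radix numbers, and so on). Each state encodes which type the bytes seen so far could still be, so when the token ends the type falls out of the current state directly and no byte is ever re-examined:

```prolog
% s_int: only digits seen so far; still an integer unless demoted.
% s_name: a non-numeric regular byte was seen; it can only be a name.
token(T) --> [C], { digit(C) },   !, s_int(T, [C]).
token(T) --> [C], { regular(C) }, !, s_name(T, [C]).

s_int(T, Acc) --> [C], { digit(C) },   !, s_int(T, [C|Acc]).
s_int(T, Acc) --> [C], { regular(C) }, !, s_name(T, [C|Acc]).
s_int(integer(N), Acc) -->                 % token ended while in s_int
    { reverse(Acc, Ds), number_codes(N, Ds) }.

s_name(T, Acc) --> [C], { regular(C) }, !, s_name(T, [C|Acc]).
s_name(name(A), Acc) -->                   % token ended while in s_name
    { reverse(Acc, Cs), atom_codes(A, Cs) }.

digit(C) :- 0'0 =< C, C =< 0'9.
regular(C) :-                              % not whitespace, not a delimiter
    \+ memberchk(C, [0, 9, 10, 12, 13, 32,
                     0'(, 0'), 0'<, 0'>, 0'[, 0'], 0'{, 0'}, 0'/, 0'%]).
```

With this shape, phrase(token(T), `1984`) yields integer(1984), while phrase(token(T), `12a`) is demoted to name('12a') the moment the a is seen, without re-reading the 1 and the 2.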

PostScript Number and Name object Syntax Diagram.tif (745.0 KB)

which then was upgraded to a GraphViz DOT file

PostScript Number Name Syntax Diagram.gv (20.6 KB)

and converted into an SVG
(Note: To better see the SVG, e.g., using Chrome on Windows, right-click below, select Open image in new tab, then in the other tab with the SVG, zoom in and pan.)
(Note: The license for the gv and svg files is the Creative Commons Attribution-ShareAlike 4.0 International License, but Discourse eats links in uploads and the image was done as a link to creativecommons.org.)

[Image: PostScript Number Name Syntax Diagram (fixed)]

Now, because the test cases existed, most of the low-level predicates existed and worked, and the overall design was visible, it was just a matter of cutting/pasting code into the newer version. Then, to verify that the code worked, the test cases were all commented out and then uncommented and fixed for each of the states in order. This went much faster than the original write.
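
A hedged sketch of how such per-state tests might look with library(plunit) (the group and test names are made up, and token//1 is the hypothetical sketch from above, not the parser's real suite); keeping one test block per state is what makes the uncomment-one-state-at-a-time workflow practical:

```prolog
:- use_module(library(plunit)).

:- begin_tests(state_s_int).

test(integer, true(T == integer(1984))) :-
    phrase(token(T), `1984`).

test(demoted_to_name, true(T == name('12a'))) :-
    phrase(token(T), `12a`).

:- end_tests(state_s_int).

% Run with: ?- run_tests(state_s_int).
```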

If you take a closer look at the original hand-drawn diagram and the SVG, you will notice that the SVG has some new states, namely s01 (Radix base 1) and s02 (Radix base 2), and that Transition sets are added at the top. The states were added so that the type of the value could be identified upon seeing an end of stream or a delimiter at that point, without passing data on to later states. The transition sets are there to avoid having to list about 90 or more individual characters on the transition lines. In the code, for each set and each character in a set there is a simple DCG clause, to make use of indexing. This makes for large source files, but very fast parsing. Yes, this could be done even faster by writing custom C functions and having Prolog interface to the C code, but writing Prolog/C interfaces is not in my playbook at present.
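
A hedged sketch of that clause-per-character expansion (the nonterminal name regular_char//1 is made up, and the real code would most likely generate these clauses rather than type them): each character in a transition set gets its own clause, so clause indexing (SWI-Prolog indexes on more than just the first argument) can jump straight to the matching clause instead of running a membership test and leaving choice points behind.

```prolog
% Transition set "regular character", expanded to one clause per character.
regular_char(0'a) --> [0'a].
regular_char(0'b) --> [0'b].
regular_char(0'c) --> [0'c].
regular_char(0'd) --> [0'd].
% ... and so on: one clause for each of the ~90 regular characters ...
```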