Wiki Discussion: DCG and phrase/3

This is a topic to discuss the Wiki

DCG and phrase/3

Since I am currently working on a PDF/PostScript parser using DCGs and it has lots of places that make for useful examples, it will most likely serve as the basis for the Wiki.

The only documentation needed for understanding PostScript well enough to write a parser is the
PostScript Language Reference, third edition.

The first iteration of the parser was done to get an idea of how the Prolog code needed to be structured to correctly and efficiently parse PostScript. It was written by basically reading the manual from the start and adding predicates/facts as I read. Knowing that the code would be a throwaway version, it was written so that it could be discarded as a whole, but such that the lower-level parts could be salvaged and reused. The most important artifact needed from this version was the set of test cases.

Since the goal of the final version of the parser is to parse tens of thousands of PDF and PS files as a large batch job, and knowing from previous experience of writing and profiling many parsers that byte-level parsing does almost all of the work (80%, 90% or more is not uncommon), that code needed to be fast and efficient. So instead of a classic lexer/tokenizer plus parser, the code attempts to go from reading bytes to syntax trees in one pass. By attempt I mean that a byte is not processed again and again in several phases; even the use of append/2 or append/3 is considered too much overhead, and the clauses should be as deterministic as possible, with as few redos as possible.
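As a minimal sketch of what one-pass, deterministic byte-level parsing looks like in a DCG (my illustration, not the actual parser code; the predicate names uint//1 and digit//1 are hypothetical), an unsigned integer can be accumulated as each byte is consumed, so no byte is revisited and no intermediate list of digits is ever built:

```prolog
% Hypothetical sketch: parse an unsigned integer in a single pass.
% The value is accumulated as each byte is consumed, so no byte is
% processed twice and append/3 is never needed; the cut keeps the
% clauses deterministic once a digit has been matched.

uint(N) --> digit(D), uint_(D, N).

uint_(Acc, N) --> digit(D), !, { Acc1 is Acc * 10 + D }, uint_(Acc1, N).
uint_(N, N)   --> [].

digit(D) --> [C], { between(0'0, 0'9, C), D is C - 0'0 }.
```

For example, `string_codes("123", Cs), phrase(uint(N), Cs)` yields `N = 123` without any second traversal of the input.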

PostScript does make use of delimiters for tokenization and even notes this in the manual, but the following sentence conflicted with the design goals:

Any token that consists entirely of regular characters and cannot be interpreted as a number is treated as a name object (more precisely, an executable name).

Implementing this in Prolog is actually very easy: just parse as a number and, if that fails, it must be a name. What could be easier? The problem is that for every name, all the bytes of the data that make up the name first have to be parsed as a number; then, upon failing, it could be considered a name, but to be correct the entire name has to be parsed again to make sure it has no illegal characters. And if a tokenizer is used, the bytes are processed in their entirety yet again, the tokenization taking place before the classification/typing of the tokens.
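The naive tokenize-then-classify approach just described can be sketched as follows (hypothetical predicate names, not the actual code; code_type(C, graph) stands in for PostScript's regular-character class for brevity). Every byte is traversed at least twice: once to collect it, again inside number_codes/2, and a third time in atom_codes/2 when the number conversion fails:

```prolog
% Naive tokenize-then-classify sketch.  regular//1 walks the bytes
% once to collect them; number_codes/2 walks them again to try the
% number interpretation; atom_codes/2 walks them a third time when
% that fails.  number_codes/2 throws a syntax error on non-numbers,
% hence the catch/3.

token(T) -->
    regular([C|Cs]),
    {   catch(number_codes(N, [C|Cs]), _, fail)
    ->  T = number(N)
    ;   atom_codes(A, [C|Cs]),
        T = executable_name(A)
    }.

regular([C|Cs]) --> [C], { code_type(C, graph) }, !, regular(Cs).
regular([])     --> [].
```

With this sketch, `string_codes("0a", Cs), phrase(token(T), Cs)` gives `T = executable_name('0a')`, but only after the codes have been scanned multiple times.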

In the rewrite, the part requiring the most detail was writing the DCG to validate the input, especially for the types number and name, as it processed each byte, and then to determine the type of the value and the bytes for the value. This was accomplished by starting with a pen-and-paper syntax diagram

PostScript Number and Name object Syntax Diagram.tif (745.0 KB)

which then was upgraded to a GraphViz DOT file

PostScript Number Name Syntax Diagram.gv (20.6 KB)

and converted into an SVG
(Note: To better see the SVG, e.g using Chrome on Windows, right click below, select Open image in new tab, then in other tab with SVG, zoom in and pan.)
(Note: The license for the .gv and .svg is the Creative Commons Attribution-ShareAlike 4.0 International License, but Discourse eats links in uploads, and the image was done as a link to creativecommons.org.)

(SVG image: PostScript Number Name Syntax Diagram (fixed))

Now, because the test cases existed, most of the low-level predicates existed and worked, and the overall design was visible, it was just a matter of cutting and pasting code into the newer version. Then, to verify the code worked, the test cases were all commented out and then uncommented, for each of the states in order, and fixed. This went much faster than the original write.

If you take a closer look at the original hand-drawn diagram and the SVG, you will notice that the SVG has some new states, namely s01 (Radix base 1) and s02 (Radix base 2), and at the top are added transition sets. The states were added so that the type of the value could be identified upon seeing an end of stream or delimiter at that point, rather than passing data on to later states. The transition sets are there to avoid having to list about 90 or more individual characters on the transition lines. In the code, for each set and each character in a set, there is a simple DCG clause to make use of indexing. This makes for large source files, but very fast parsing. Yes, this could be done even faster by writing custom C functions and having Prolog interface to the C code, but writing Prolog/C interfaces is not in my playbook at present.
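To make the per-character idea concrete, here is a sketch (my illustration, not the author's source; dec_digit//1 is a hypothetical name) of expanding one transition set into one clause per character. After DCG translation, each clause becomes, e.g., dec_digit(0'0, [0'0|S0], S0), and SWI-Prolog's just-in-time argument indexing can dispatch on the byte, so matching is a lookup rather than a linear test:

```prolog
% One clause per character of the transition set.  Compared with a
% single clause such as
%     dec_digit(C) --> [C], { between(0'0, 0'9, C) }.
% this trades source size for deterministic, indexed dispatch.

dec_digit(0'0) --> [0'0].
dec_digit(0'1) --> [0'1].
dec_digit(0'2) --> [0'2].
dec_digit(0'3) --> [0'3].
dec_digit(0'4) --> [0'4].
dec_digit(0'5) --> [0'5].
dec_digit(0'6) --> [0'6].
dec_digit(0'7) --> [0'7].
dec_digit(0'8) --> [0'8].
dec_digit(0'9) --> [0'9].
```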

For those who are tempted to write one of these parsers themselves, here are the test cases. Be warned that if you do this without looking at my suggestions and diagrams, it could take you days or longer if you can only work on it in your spare hours.

:- begin_tests(execute).

postscript_scan_test_case("0",number(0),[]).
postscript_scan_test_case("+0",number(0),[]).
postscript_scan_test_case("-0",number(0),[]).
postscript_scan_test_case("1",number(1),[]).
postscript_scan_test_case("-1",number(-1),[]).
postscript_scan_test_case("+1",number(1),[]).
postscript_scan_test_case("12",number(12),[]).
postscript_scan_test_case("-12",number(-12),[]).
postscript_scan_test_case("+12",number(12),[]).
postscript_scan_test_case("123",number(123),[]).
postscript_scan_test_case("-123",number(-123),[]).
postscript_scan_test_case("+123",number(123),[]).
postscript_scan_test_case("2147483647",number(2147483647),[]).
postscript_scan_test_case("2147483648",number(2147483648),[]).       % NB This is not converted to real as expected due to architectural limits as noted in appendix B. Prolog's limit for digits in integer is not limited by register size but by memory size since integers are symbolic.
postscript_scan_test_case("+2147483647",number(2147483647),[]).
postscript_scan_test_case("+2147483648",number(2147483648),[]).      % NB This is not converted to real as expected due to architectural limits as noted in appendix B. Prolog's limit for digits in integer is not limited by register size but by memory size since integers are symbolic.
postscript_scan_test_case("-2147483648",number(-2147483648),[]).
postscript_scan_test_case("-2147483649",number(-2147483649),[]).     % NB This is not converted to real as expected due to architectural limits as noted in appendix B. Prolog's limit for digits in integer is not limited by register size but by memory size since integers are symbolic.
postscript_scan_test_case("0.",number(0.0),[]).
postscript_scan_test_case("+0.",number(0.0),[]).
postscript_scan_test_case("-0.",number(0.0),[]).
postscript_scan_test_case(".0",number(0.0),[]).
postscript_scan_test_case("+.0",number(0.0),[]).
postscript_scan_test_case("-.0",number(0.0),[]).
postscript_scan_test_case("0.0",number(0.0),[]).
postscript_scan_test_case("1.",number(1.0),[]).
postscript_scan_test_case("-1.",number(-1.0),[]).
postscript_scan_test_case("+1.",number(1.0),[]).
postscript_scan_test_case("-0.002",number(-0.002),[]).
postscript_scan_test_case("-.002",number(-0.002),[]).
postscript_scan_test_case("34.5",number(34.5),[]).
postscript_scan_test_case("-3.62",number(-3.62),[]).
postscript_scan_test_case("-1.0",number(-1.0),[]).
postscript_scan_test_case("123.6e10",number(1236000000000.0),[]).
postscript_scan_test_case("123.6e-10",number(1.236e-8),[]).
postscript_scan_test_case("1.0E-5",number(1.0e-5),[]).
postscript_scan_test_case("1E6",number(1000000.0),[]).
postscript_scan_test_case("23E1",number(230.0),[]).
postscript_scan_test_case("10#123",number(123),[]).
postscript_scan_test_case("8#1777",number(1023),[]).
postscript_scan_test_case("16#FFFE",number(65534),[]).
postscript_scan_test_case("2#1000",number(8),[]).
postscript_scan_test_case("36#09AZ",number(12059),[]).
postscript_scan_test_case("36#09az",number(12059),[]).
postscript_scan_test_case("23#1",number(1),[]).
postscript_scan_test_case("1 ",number(1),[32]).

postscript_scan_test_case("0a",executable_name('0a'),[]).
postscript_scan_test_case("+0a",executable_name('+0a'),[]).
postscript_scan_test_case("-0a",executable_name('-0a'),[]).
postscript_scan_test_case("1a",executable_name('1a'),[]).
postscript_scan_test_case("-1a",executable_name('-1a'),[]).
postscript_scan_test_case("+1a",executable_name('+1a'),[]).
postscript_scan_test_case("2147483647a",executable_name('2147483647a'),[]).
postscript_scan_test_case("2147483648a",executable_name('2147483648a'),[]).
postscript_scan_test_case("-2147483648a",executable_name('-2147483648a'),[]).
postscript_scan_test_case("-2147483649a",executable_name('-2147483649a'),[]).
postscript_scan_test_case("0.a",executable_name('0.a'),[]).
postscript_scan_test_case("+0.a",executable_name('+0.a'),[]).
postscript_scan_test_case("-0.a",executable_name('-0.a'),[]).
postscript_scan_test_case("1.a",executable_name('1.a'),[]).
postscript_scan_test_case("-1.a",executable_name('-1.a'),[]).
postscript_scan_test_case("+1.a",executable_name('+1.a'),[]).
postscript_scan_test_case(".0a",executable_name('.0a'),[]).
postscript_scan_test_case("+.0a",executable_name('+.0a'),[]).
postscript_scan_test_case("-.0a",executable_name('-.0a'),[]).
postscript_scan_test_case("-0.002a",executable_name('-0.002a'),[]).
postscript_scan_test_case("-.002a",executable_name('-.002a'),[]).
postscript_scan_test_case("34.5a",executable_name('34.5a'),[]).
postscript_scan_test_case("-3.62a",executable_name('-3.62a'),[]).
postscript_scan_test_case("123.6e10a",executable_name('123.6e10a'),[]).
postscript_scan_test_case("123.6e-10a",executable_name('123.6e-10a'),[]).
postscript_scan_test_case("1.0E-5a",executable_name('1.0E-5a'),[]).
postscript_scan_test_case("1E6a",executable_name('1E6a'),[]).
postscript_scan_test_case("-1.0a",executable_name('-1.0a'),[]).
postscript_scan_test_case("-1.a",executable_name('-1.a'),[]).
postscript_scan_test_case("0.0a",executable_name('0.0a'),[]).
postscript_scan_test_case("10#123a",executable_name('10#123a'),[]).
postscript_scan_test_case("8#17778",executable_name('8#17778'),[]).
postscript_scan_test_case("16#FFFEG",executable_name('16#FFFEG'),[]).
postscript_scan_test_case("2#10002",executable_name('2#10002'),[]).
postscript_scan_test_case("36#09AZ.",executable_name('36#09AZ.'),[]).
postscript_scan_test_case("36#09az.",executable_name('36#09az.'),[]).
postscript_scan_test_case("abc",executable_name(abc),[]).
postscript_scan_test_case("Offset",executable_name('Offset'),[]).
postscript_scan_test_case("$$",executable_name($$),[]).
postscript_scan_test_case("23A",executable_name('23A'),[]).
postscript_scan_test_case("13-456",executable_name('13-456'),[]).
postscript_scan_test_case("a.b",executable_name('a.b'),[]).
postscript_scan_test_case("$MyDict",executable_name('$MyDict'),[]).
postscript_scan_test_case("@pattern",executable_name('@pattern'),[]).
postscript_scan_test_case("a ",executable_name('a'),[32]).
postscript_scan_test_case("+",executable_name('+'),[]).
postscript_scan_test_case("-",executable_name('-'),[]).
postscript_scan_test_case("1E",executable_name('1E'),[]).
postscript_scan_test_case("12E",executable_name('12E'),[]).
postscript_scan_test_case("1E+",executable_name('1E+'),[]).
postscript_scan_test_case("12E+",executable_name('12E+'),[]).
postscript_scan_test_case("1#",executable_name('1#'),[]).
postscript_scan_test_case("12#",executable_name('12#'),[]).

postscript_scan_test_case("(a)",literal_string("a"),[]).
postscript_scan_test_case("((a))",literal_string("(a)"),[]).
postscript_scan_test_case("(This is a string)",literal_string("This is a string"),[]).
postscript_scan_test_case("(a\\nb)",literal_string("a\nb"),[]).
postscript_scan_test_case("(a\\rb)",literal_string("a\rb"),[]).
postscript_scan_test_case("(a\\tb)",literal_string("a\tb"),[]).
postscript_scan_test_case("(a\\bb)",literal_string("a\bb"),[]).
postscript_scan_test_case("(a\\fb)",literal_string("a\fb"),[]).
postscript_scan_test_case("(a\\\\b)",literal_string("a\\b"),[]).
postscript_scan_test_case("(a\\(b)",literal_string("a(b"),[]).
postscript_scan_test_case("(a\\)b)",literal_string("a)b"),[]).
postscript_scan_test_case("()",literal_string(""),[]).
postscript_scan_test_case("(a\\\rb)",literal_string("ab"),[]).
postscript_scan_test_case("(a\\\nb)",literal_string("ab"),[]).
postscript_scan_test_case("(a\\\r\nb)",literal_string("ab"),[]).
postscript_scan_test_case("(\\0053)",literal_string("\005\3"),[]).
postscript_scan_test_case("(\\053)",literal_string("\053\"),[]).
postscript_scan_test_case("(\\53)",literal_string("+"),[]).
postscript_scan_test_case("(\\005)",literal_string("\005\"),[]).
postscript_scan_test_case("(\\05)",literal_string("\005\"),[]).
postscript_scan_test_case("(\\5)",literal_string("\005\"),[]).
postscript_scan_test_case("(\\a)",literal_string("a"),[]).
postscript_scan_test_case("(\\z)",literal_string("z"),[]).
postscript_scan_test_case("(a) ",literal_string("a"),[32]).

postscript_scan_test_case("<901fa3>",bytes([144,31,163]),[]).      % NB This is translated to bytes which is a representation suitable for SWI-Prolog.
postscript_scan_test_case("<901fa>",bytes([144,31,160]),[]).
postscript_scan_test_case("<901fa> ",bytes([144,31,160]),[32]).
postscript_scan_test_case("<~9jqo^~>",bytes([77,97,110,32]),[]).

postscript_scan_test_case("/name",literal_name(name),[]).
postscript_scan_test_case("/",literal_name(''),[]).
postscript_scan_test_case("/name ",literal_name(name),[32]).

postscript_scan_test_case("//name",immediately_evaluated_name(name),[]).
postscript_scan_test_case("//name ",immediately_evaluated_name(name),[32]).

postscript_scan_test_case("[123 /abc (xyz)]",array([number(123),literal_name(abc),literal_string("xyz")]),[]).

postscript_scan_test_case("{add 2 div}",procedure([executable_name(add),number(2),executable_name(div)]),[]).

postscript_scan_test_case("<<\n\c/F1 4 0 R\n/F2 5 0 R\n/F3 6 0 R\n/F5 7 0 R\n/F6 8 0 R\n>>",dictionary([literal_name('F1'),number(4),number(0),executable_name('R'),literal_name('F2'),number(5),number(0),executable_name('R'),literal_name('F3'),number(6),number(0),executable_name('R'),literal_name('F5'),number(7),number(0),executable_name('R'),literal_name('F6'),number(8),number(0),executable_name('R')]),[]).

test(001, [forall(postscript_scan_test_case(Input,Expected_postscript_object,Expected_rest))]) :-
    string_codes(Input,Codes),
    DCG = scan(Postscript_object),
    phrase(DCG,Codes,Rest),
    assertion( Rest == Expected_rest ),
    assertion( Postscript_object == Expected_postscript_object ).

test(003) :-
    Input = "\c
        (Strings may contain newlines\n\c
        and such.)\c
    ",
    string_codes(Input,Codes),
    DCG = scan(Postscript_object),
    phrase(DCG,Codes,Rest),
    assertion( Rest == [] ),
    assertion( Postscript_object == literal_string("Strings may contain newlines\nand such.") ).

test(004) :-
    % Note: Percent (%) is encoded as \x25 because in Prolog % starts a comment
    Input = "\c
        (Strings may contain special characters *!&}^\x25 and\n\c
        balanced parentheses ( ) (and so on).)\c
    ",
    string_codes(Input,Codes),
    DCG = scan(Postscript_object),
    phrase(DCG,Codes,Rest),
    assertion( Rest == [] ),
    assertion( Postscript_object == literal_string("Strings may contain special characters *!&}^% and\nbalanced parentheses ( ) (and so on).") ).

test(005) :-
    % Note: In a Prolog string \ is escaped as \\
    Input = "\c
        (These \\\n\c
        two strings \\\n\c
        are the same.)\c
    ",
    string_codes(Input,Codes),
    DCG = scan(Postscript_object),
    phrase(DCG,Codes,Rest),
    assertion( Rest == [] ),
    assertion( Postscript_object == literal_string("These two strings are the same.") ).

% DON'T ERASE. Used to gtrace individual test cases for scan//1
test(006) :-
    Input = "2#",
    string_codes(Input,Codes),
    DCG = scan(Postscript_object),
    phrase(DCG,Codes,Rest),
    assertion( Rest == [] ),
    assertion( Postscript_object == executable_name('2#') ).

:- end_tests(execute).

?- make.
% d:/cellular information/source code/prolog/pdf/postscript__interpreter compiled into postscript_interpreter 0.08 sec, 0 clauses
% PL-Unit: execute ........................................................................................................................................... done
% All 139 tests passed
true.

PS

After some more reviewing of the states, I found a missing state that is needed. I will leave that as an exercise for the reader; also remember to add any needed test cases.


I wrote a tutorial on DCGs a long time ago … I can send you a scan of it if you wish (it implements a bidirectional cdecl). I intend to eventually translate it from IBM Prolog syntax to SWI-Prolog (that shows you how long ago I wrote it) and publish it.
PM me via Discourse if you want to see it.


Starting to think that one should avoid closed lists when parsing with DCGs and instead use difference lists, i.e. accumulator passing. When the accumulator needs to be converted to a value, have a separate predicate that converts the codes to a value (e.g. a string or number), and that predicate should put the value in a structure, e.g. port(Number) or url(String), when returning it.
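A sketch of that suggestion (hypothetical names such as port//1 and digits//2, not code from the parser): the DCG itself only threads a difference list of codes; a wrapper closes the list, does the conversion, and returns the value inside a structure:

```prolog
% digits//2 is a pure difference-list accumulator over the codes;
% it never builds a closed list of its own and never uses append/3.
digits([C|Cs0], Cs) --> [C], { code_type(C, digit) }, !, digits(Cs0, Cs).
digits(Cs, Cs)      --> [].

% port//1 closes the difference list, converts the codes to a
% number, and wraps the value in a structure.
port(port(N)) -->
    digits(Cs, []),
    { Cs \= [],
      number_codes(N, Cs) }.
```

For example, `string_codes("8080", Cs), phrase(port(P), Cs)` gives `P = port(8080)`.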

EDIT

More rules of thumb that seem to be making more sense when converting BNF to DCGs for parsing.

  1. Create one module (base module) that contains the BNF converted to DCGs using difference list for accumulator passing.

  2. Use a new module as a library for the other code that builds on top of the base module. There can be more than one use of the base module, with conflicting needs, e.g. reproducing the input exactly as output, converting the input into a parse tree, or converting the input into an abstract syntax tree. In other words, don't put all of the code into one file, as this could result in predicates with the same predicate indicator but different uses, which obviously causes many unnecessary calls.

  3. Don’t do any premature optimizations; e.g. something as simple as the following could be a premature optimization that conflicts with a different need later.

without optimization

proseval(T0,T) -->
    "<",
    { T0 = [0'<|T1] },
    proseval_text(T1,T2),
    ">",
    { T2 = [0'>|T] }.

with premature optimization

proseval([0'<|T0],T) -->
    "<",
    proseval_text(T0,[0'>|T]),
    ">".

In other words, if you write DCGs for parsing based on BNF, and those same BNF rules are used for multiple purposes, then premature optimizations may have to be undone for a different use so the code can be reused. So just avoid the optimizations in the base module. Also, as I like to note often, get the code working correctly first, then go back and do optimizations if needed. If needed means that your code is so slow you can take a coffee break, not that you optimize because you can.
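Rules 1 and 2 above might look like this in outline (hypothetical module, file, and predicate names; in practice the two modules would live in separate files):

```prolog
% File ps_base.pl: the base module, containing only BNF-derived DCGs
% using difference lists, with no interpretation of the codes.
:- module(ps_base, [regular_run//2]).

regular_run([C|Cs0], Cs) -->
    [C], { code_type(C, graph) }, !, regular_run(Cs0, Cs).
regular_run(Cs, Cs) --> [].

% File ps_ast.pl: one of possibly several modules built on top of the
% base; this one converts the codes into AST terms.  A different
% client module could instead reproduce the input exactly.
:- module(ps_ast, [name_token//1]).
:- use_module(ps_base).

name_token(executable_name(A)) -->
    regular_run(Cs, []),
    { Cs \= [],
      atom_codes(A, Cs) }.
```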


EDIT

Another concept that I have known about but need to relate in the Wiki is a Chinese Wall when creating the structure that will be used, e.g. a parse tree or AST. How much information you need determines how the predicates that convert the codes to structures are created. In other words, you will need more than one set of instructions crossing over the Chinese Wall. Not knowing that there can be more than one Chinese Wall document is a major hurdle that many don't see past when trying to learn about parsing. It could also be one document with sections that have conflicting results. They don't see that sometimes one document will work and at other times many are needed. When many are needed, I am finding that modules come in handy.


EDIT

Need to consider a section on Mixfix operators (ref)


EDIT

Starting to think there are at least three different types of fixed-point checks.

  1. Character codes - Useful as a recognizer.
  2. Format preserving - Useful when all the white space and such needs to be preserved. I can't recall this one ever being used with programming languages; even for languages such as Python, where indentation means something, this level of parsing is too much. It has a use, but I very rarely need it. Another example where it would be needed is if a number has leading zeros that need to be preserved, e.g. 00001.
  3. Canonical - Unless preserving formatting is needed, this is the one to aim for.
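The difference between the character-codes and canonical fixed points can be illustrated with a toy round trip, with number_codes/2 standing in for a real scanner and un-scanner (my sketch, not the actual code):

```prolog
% scan/2 parses codes to a term; unscan/2 regenerates codes from it.
scan(number(N), Cs)   :- catch(number_codes(N, Cs), _, fail).
unscan(number(N), Cs) :- number_codes(N, Cs).

% "7" survives the round trip unchanged, a character-codes fixed
% point.  "00001" does not: the leading zeros are lost, but its
% canonical form "1" is stable, a canonical fixed point.
round_trip(Cs0, Cs) :-
    scan(T, Cs0),
    unscan(T, Cs).
```

For example, `string_codes("00001", Cs0), round_trip(Cs0, Cs), string_codes(S, Cs)` gives `S = "1"`: the round trip reaches the canonical form, which then re-scans to the same term.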

EDIT

Useful info in this topic: Phrase_from_file vs phrase

Attached are the two files related to parsing PostScript as part of processing PDF files.

One is the scanner (.pl); the other is the test cases (.plt).

Note: These have not been fully tested and are dumps of my code with lots of old code commented out but left in place. If you download them, do not expect me to answer any questions about them; it was a few years ago that I created them and, as I noted here, I never finished the project because there was a better way to solve the problem I needed to solve. They do have a copyright and are NOT released as open source; they are presented here to show how hard it is to create a PostScript scanner.

PostScript__scanner.pl (368.6 KB)
PostScript__scanner.plt (1.3 MB)