Parsing text using a formal grammar: SWIPL Example

Trying to understand the example.

So I think my definition of a “parser” may be much more inclusive than yours. A (text) parser is a program which takes flat text as input and produces a result. It ranges from a simple recognizer - the input text is, or is not, a legal “sentence” in the language formally specified by the grammar. Tokenizers (as I view them) produce a flat list of “tokens”. More capable parsers might produce an abstract syntax tree to be used in the next step of processing, e.g., mapping to a Prolog term, or some other suitable form for compiling.

pPEG is a generic parser system that uses a formal grammar specification to map input text to a generic kind of syntax tree, nothing more or less. If I want to produce Prolog terms from a grammar specification of Prolog syntax, I need a back end to define the semantics of the input text. But I could use the same formal grammar to produce, for example, a JSON equivalent of a term as a string, and that would require a different back end.

And because I could possibly implement any pPEG as a DCG (is that what you mean by “directly realized”?) doesn’t make it a “rip-off”. By the same token, DCG’s are a “rip-off” from Prolog itself. Furthermore, pPEG has an entirely different programming model - more like regular expressions’ direct execution model IMO. But, yes, I suppose it would be possible to “transpile” a PEG grammar to a DCG which would then presumably have to be asserted before it could be used. This would be more inline with traditional grammar systems, e.g., ANTLR, which I speculated might be one of the reasons grammars lagged far behind regular expressions in either usage or direct language support. There are probably many such options I might have chosen but did not. Instead I more closely followed the VM structure of implementations for other programming languages (JavaScript, Python, …) because I knew they worked.

Also note that pPEG operates on strings, not lists or arrays of characters, which could have a significant impact on memory usage and possibly even performance (not that I’m overly concerned about this right now):

?- S=`abcdefghijklmnopqrstuvwxyz`, term_size(S,Sz).
S = [97, 98, 99, 100, 101, 102, 103, 104, 105|...],
Sz = 78.

?- S="abcdefghijklmnopqrstuvwxyz", term_size(S,Sz).
S = "abcdefghijklmnopqrstuvwxyz",
Sz = 6.

Expressed as a list, how big would a 1000 line program be?

Surely there’s a lot going on in a Prolog system between syntax recognition and runtime execution. So I don’t know what you want to call it, but, given the expression 4-3-2, it’s what defines the difference between 4-(3-2) and (4-3)-2. The syntax is the same but the meaning (semantics?) is different. Also whether the string “1234`” is a base 10 number, or a base 16 number (or some other base). Again, same syntax but different meaning/interpretation/semantics? This seems quite different to me than arithmetic evaluation, for example.

A PEG grammar just specifies the syntax of a “language”. In addition, pPEG defines the structure of the output of a parse operation using such a specification. This defined structure is a tree where the terminal node values are strings. That’s it. This is analogous to what library(pcre) regular expressions do - they map input text to a dictionary of strings. So if you want something other than the defined result structure, e.g., a Prolog term, you require an additional translation step. I doubt whether you can can come up with a list of pros and cons of a two pass or N-pass “parser” in general, it really depends on the application. But using something like pPEG, the first pass is reduced to specifying a formal grammar for the syntax which, I would argue, is useful for many reasons.

Some grammar systems support semantic actions as part of the grammar rule and I would put DCGs in that category. It was decided not to do that in pPEG. It results in grammar specifications which are not portable between different implementations (it usually involves inserting code in the host language) and produces more complex grammar specifications which mix syntax and “semantic” issues together.

Now the SWIP example parser is just that, an example. If you only want to do syntax verification, i.e., recognition, you don’t need the back end at all. But in addition to being an example, the back end primarily exists for simple grammar testing reasons, e.g., so I can:

?- string_termList("X is Y+Z.",[X is Y+Z]).

?- string_termList("- - - .",[- - -]).

But I see no need to turn this back end into a grammar in it’s own right - that’s what Prolog is good at. And that’s probably why I rarely use DCGs, which, if I understand it, are rewritten into Prolog clauses in a fairly straight forward manner.

I’m also not interested in producing end-to-end documentation of what SWI-Prolog does, that’s what the reference manual is for. But what is missing, at least for me, is a concise specification of its syntax, so the pPEG specification fills that gap.

As far as comparing Prolog dialects, that’s much bigger than a parser issue. Comparing syntax’s is a necessary but insufficient step in that process. Having a formal (and testable) grammar for that comparison would seem to be useful.

Just to point out that the pPEG syntax for SWIP defines a superset of valid SWIP programs, i.e., the terms do not have to be clauses. It wouldn’t be hard to add this restriction or it could be done in the “semantic” analysis.

There is a downside to postponing this check, namely, the original source may have been “lost” so error messages with source location information would be problematical.

Without a formal spec, one is never sure, but this “parses”; as a ptree and translated to a term:

?- prolog_ptree("p(A :- B) :- q(A), r(B).",Ptree).
Ptree = 'Prolog'([expr(['Compound'([atom("p"), arg([var("A"), 'InfixOp'([op(":-")]), var("B")])]), 'InfixOp'([op(":-")]), expr(['Compound'([atom("q"), var("A")]), 'InfixOp'([op(",")]), 'Compound'([atom("r"), var("B")])])])]).

?- string_termList("p(A :- B) :- q(A), r(B).",[T]).
T =  (p((A:-B)):-q(A), r(B)).

True, because the Prolog grammar is context sensitive. For example a “,” in an argument sequence or a list is not an operator unless it’s in a parenthesized expression. And the “|” cannot appear as an operator in a list, but can in an argument sequence as you point out, although it “means” something completely different. So unfortunately you end up with three similar but different rules for expressions. Perhaps the standardization folks should have taken more care.

Operator priorities, or indeed operators, are not fundamentally supported by PEG (and therefore pPEG). I believe there are operator precedence grammars, but PEG isn’t one of them. Normally a grammar with operators (and precedences) is written in PEG with a hierarchical rule structure, but that doesn’t work if all the operators aren’t defined in the grammar itself. Fortunately most machine oriented languages don’t have this problem.

But although they’re not defined by the syntax, operator definitions of some description are required to recognize the Prolog syntax, i.e., to separate operands and operators in an expression (since they’re syntactically ambiguous). But only the operator class, i.e., prefix, infix, or postfix, is required for recognition. The “fudge” the pPEG Prolog grammar as written uses is a non-portable extension that tests whether the previous matched operator (rule op) is defined and is of the required class using current_op/3.

The Covington draft of the ISO standard (my only reference) has no formal spec for the syntax but it’s got a bunch of informal descriptions like:

There can be only one | in a list, and no commas after it.

The comma has three functions: it separates arguments of functors, it separates elements of lists, and it is an infix operator of priority 1000. Thus (a,b) (without a functor in front) is a structure, equivalent to ‘,’(a,b).

The same atom can be an operator in more than one class (such as the infix and
prefix minus sign). To avoid the need for unlimited lookahead when parsing, the same atom cannot be both an infix operator and a postfix operator.

The challenge is to capture these kinds of statements in a formal specification.

(Nothing of Impartance here)

Not the source for sure. (I find it hard to find anything in that morass of C code.) I used the SWIP reference manual (mainly SWI-Prolog -- Manual) , Covington, and experimentation.

No, but the SWIP reference manual does.

It only crashes on Windows. On POSIX systems it uses the sigaltstack mechanism to gracefully recover:

34 ?- peano(1000000, L, []), atom_codes(A, L), term_string(T, A).
ERROR: term_string/2: C-stack limit (8,388,608 bytes) exceeded.
ERROR: Use the shell command ulimit -s size to enlarge the limit.

Possibly something similar can be done on Windows. I don’t know how though … In particular as we use MinGW (gcc) which doesn’t do have the __try ... __catch MSVC extension.

I encapsulated this in a module to scope the flag setting:

:- module(peano_parse,[peano//1,parse_peano/2]).

:- set_prolog_flag(double_quotes, codes).

peano(0) --> !, "0".
peano(N) --> {M is N-1}, "s(", peano(M), ")".

parse_peano(N,T) :- 


?- parse_peano(1000,T).
T = s(s(s(s(s(s(s(s(s(s(...)))))))))).

But the pPEG engine is just a recursive descent parser so at some point it’s going to run out of memory. It doesn’t crash on MacOS, but the default error handling tries to output the humungous peano expression which totally clogs up the QT console used by the Mac bundle app. This then requires a manual “Force Quit”. (It may complete but I didn’t have the patience to find out.)

I can avoid this by catching the resource_error myself. Somewhere between 150000 and 175000, it runs out of stack:

?- catch(parse_peano(150000,T),error(resource_error(_),_),format("resource_error\n",[])).
T = s(s(s(s(s(s(s(s(s(s(...)))))))))).

?- catch(parse_peano(175000,T),error(resource_error(_),_),format("resource_error\n",[])).

But you raise another point regarding resiliency. The pPEG engine has a check to catch an infinite recursion on grammar rules; the so-called left hand recursion issue. It only kicks in once a rule depth threshold is exceeded, but this threshold isn’t a big number (currently 64). When it exceeds the threshold, the pPEG engine runs quite slowly since it does this rather expensive check on every rule call. So to run these artificially high test cases in reasonably acceptable time, I had to disable this check, but this is something I would not do in a production version, but perhaps the threshold requires some tuning.

But I’m not sure where this discussion is going. To me the value of this example is two-fold. 1) It’s an significant and understandable example (at least to Prolog users) of pPEG capabilities (a stretch case IMO), and 2) the grammar itself is a decent starting point for a formal specification of the SWIP syntax, and may be a reasonable starting point for other dialects. And since it’s executable, it has some value in testing any changes to the syntax going forward without requiring simultaneous changes to the builtin parser.

Also note the the pPEG implementation is heavily dependent on a few builtin string primitives like sub_string/5, so it’s not quite a pure Prolog implementation. Other SWIP dependencies include stack introspection for the rule recursion check and debug hooks for tracing.

(Nothing of Impartance here)

But you can’t compare a DCG grammar that only parses peano strings with a pPEG grammar that parses Prolog. If I define a pPEG grammar for peano strings:

Peano = 's(' ('0' / Peano) ')'

considerably deeper nesting (~5 times)can be handled without exhausting memory:

?- catch(parse_peano_Wgrammar(800000,Tree),error(resource_error(_),_),format("resource_error\n",[])).
Tree = 'Peano'(['Peano'(['Peano'(['Peano'(['Peano'([...])])])])]).

?- catch(parse_peano_Wgrammar(900000,Tree),error(resource_error(_),_),format("resource_error\n",[])).

It may never be quite space efficient as a DCG that ends up producing terms “natively” whereas the pPEG parser produces a syntax tree. (Note the extra square brackets in the 'Peano' terms designating a list of child nodes.)

As to whether a million record csv string can be parsed in the default memory configuration, I doubt it, but really don’t know. It’d parse approximately as many as the syntax tree representation fits in available memory.

Also the 1 GB of stack space is divided up into separate chunks, any of which may the limiting factor. I don’t know enough about the underlying machinery to speculate which stack it might be, and the csv parser may have different characteristics than the peano parser (which has heavy duty rule recursion).

There is no need for this. DCGs work fine with the default setting. You only need this flag if you want to write e.g. phrase(my_grammar, "hello world"). It isn’t common to parse a known string in source code, so this only applies to the toplevel. There you can write ?- phrase(my_grammar, `hello world`).

That suggests the global stack isn’t the limiting factor for this particular, so it’s probably the local stack. Executing the Peano grammar on the pPEG VM on top of Prolog is probably about 5 times the number of nested Prolog calls as a DCG of Peano. In general, executing a pPEG grammar would be similar to meta-interpreting the DCG.

I guess if I was really interested in parsing peano strings in a minimum of total space I’d write a fairly simple C primitive.

The intent of the SWIPL example is to mimic SWIP, so it should produce the same result:

?- op(9,fy,fy), op(9,yfx,yfx).

?- string_termList("fy 1 yfx 2.",[T]), write_canonical(T).
T = fy 1 yfx 2.

I’m not exactly sure what you think is going wrong.

So why is GNU Prolog interpretation correct?

My analysis: When operator precedence values are the same (in this case both 9), associativity is used to define semantics. With a prefix operator and a following infix operator there are 6 (2 prefix and 3 infix) possible combinations to determine left, right, or non-associative.

Non-associative: (fx,xfx) and (fx,xfy)

Right-associative: (fy,xfy) and (fy,xfx)

Left-associative: (fx,yfx)

This leaves (fy,yfx), as in this example. This is ambiguous, it could be interpreted either left or right and I haven’t seen any “rule” which dictates what it should be. SWIP has chosen left and GNU has chosen right. Such is the state of “standard” Prolog.

Note there are probably similar issues with prefix-postfix and infix-postfix combinations.

(Nothing of Impartance here)

Just to clarify a few things:

  1. When you write X = (1 xfy 2 yfx 3) in a top level command you are invoking the SWI-Prolog parser.

  2. This project: is completely independent of that parser. It only claims to replicate the results of the builtin SWI-Prolog parser solely for the purpose of testing/verifying the Prolog grammar captured in the example. Any so-called bugs in the SWI-Prolog parser will be replicated in this parser by design.

  3. The op_associativityEq/3 rules are part of this last parser, not the builtin parser. Yes it would be simple enough to modify the rules but then the two parsers would be inconsistent which is not helpful in testing the grammar.

  4. If the builtin SWIP parser changes, e.g., for Table 6 compatibility, I’m happy to modify the example parser to reflect those changes. File an issue on github if you think it violates the standard. After all these years I do wonder if changes in this area will break more things than they fix, but that’s not my call.

  5. I do not have access to the ISO standard because it’s behind a paywall. (This is a matter of principle for me.) So my “spec” is the SWI-Prolog implementation, for better or worse.

  6. If by an “unparser” (converts terms to strings?) you mean something like write_canonical, the example parser does not have one; I use write_canonical as you’ve done.

I suspect most Prolog users have only a superficial understanding of operator semantics and you’ve raised some valid points which the various Prolog implementers should probably address. It would appear the ISO standard has been somewhat ignored in this regard.

As I see it, in the pre-ISO days Prolog systems tried to be as tolerant as they could be and some projects used this enthusiastically to define DSLs (Domain Specific Languages). Prolog systems did resolve ambiguities in different ways though. ISO tried to resolve this mess. As a result though we ended up with something that is rather restrictive. Even the standard-aware SICStus Prolog decided not to follow. They use Prolog non-determinism. I’ve tried to build something fully compatible, but failed and dropped that project.

I don’t know where we should move. We could all cleanly implement the ISO rules. That breaks many of these fancy DSLs (forcing additional parenthesis), but would work pretty much ok for arithmetic, etc. Ideally we’d get to something that is well defined and flexible. For now I just leave things alone. If anyone wants to pick it up, please do so.

The last time this discussion came up was with the ALE system for NLP parsing that has been designed for SICStus when porting it to SWI-Prolog. That is not about how to handle fy 1 yfx 2, but the general idea that SICStus accepts a lot more than it should according to strict ISO.

Ideally anything that is valid ISO syntax should be accepted and produce the same term. I don’t know whether this is reachable using some small fixes or that would require a redesign of the operator handling.

Anyway, I’ve got other priorities :slight_smile:

Sure this was never tested. Only since an earlier discussion we have a small test suite for operator handling. Still, there are very few examples from the wild where this is an issue and thus it is better left nicely resting at the bottom of the priority stack :slight_smile: