Trying to understand the example.
So I think my definition of a "parser" may be much more inclusive than yours. A (text) parser is a program which takes flat text as input and produces a result. They range from simple recognizers - the input text is, or is not, a legal "sentence" in the language formally specified by the grammar - to tokenizers, which (as I view them) produce a flat list of "tokens", to more capable parsers that produce an abstract syntax tree to be used in the next step of processing, e.g., mapping to a Prolog term, or some other suitable form for compiling.
pPEG is a generic parser system that uses a formal grammar specification to map input text to a generic kind of syntax tree, nothing more or less. If I want to produce Prolog terms from a grammar specification of Prolog syntax, I need a back end to define the semantics of the input text. But I could use the same formal grammar to produce, for example, a JSON equivalent of a term as a string, and that would require a different back end.
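To make the back-end idea concrete, here is a minimal sketch. The ptree convention (a node is a rule name wrapping either a list of child nodes or a terminal string, as in the 'Prolog'(...) example further down) is pPEG's; the expr/int/op node names and the ptree_term/ptree_json predicates are hypothetical, purely for illustration:

% Hypothetical ptree for the input "1+2": each node is Name(Children) or Name(String).
example_ptree(expr([int("1"), op("+"), int("2")])).

% Back end 1: map the ptree to a Prolog term.
ptree_term(int(S), N) :-
    number_string(N, S).
ptree_term(expr([L, op(OpS), R]), Term) :-
    ptree_term(L, LT),
    ptree_term(R, RT),
    atom_string(Op, OpS),
    Term =.. [Op, LT, RT].

% Back end 2: map the same ptree to a JSON-ish string instead.
ptree_json(int(S), S).
ptree_json(expr([L, op(OpS), R]), JSON) :-
    ptree_json(L, LJ),
    ptree_json(R, RJ),
    format(string(JSON), '{"op":"~w","args":[~w,~w]}', [OpS, LJ, RJ]).

Running example_ptree(P), ptree_term(P, T) should give T = 1+2, while ptree_json/2 produces a JSON-ish string for the same tree.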
And because I could possibly implement any pPEG as a DCG (is that what you mean by "directly realized"?) doesn't make it a "rip-off". By the same token, DCGs are a "rip-off" from Prolog itself. Furthermore, pPEG has an entirely different programming model - more like regular expressions' direct execution model IMO. But, yes, I suppose it would be possible to "transpile" a PEG grammar to a DCG, which would then presumably have to be asserted before it could be used. This would be more in line with traditional grammar systems, e.g., ANTLR, which I speculated might be one of the reasons grammars lagged far behind regular expressions in either usage or direct language support. There are probably many such options I might have chosen but did not. Instead I more closely followed the VM structure of implementations for other programming languages (JavaScript, Python, ...) because I knew they worked.
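Purely to illustrate what such a transpilation might look like (again, this is not how pPEG actually works - it uses a VM), here is a hypothetical one-rule grammar and a hand-written DCG equivalent that builds the same style of ptree node:

% Hypothetical pPEG rule:   bit = '0' / '1'
% Hand-transpiled DCG equivalent producing a pPEG-style ptree node, bit(String):
bit(bit("0")) --> "0".
bit(bit("1")) --> "1".

For example, phrase(bit(Node), `1`) should give Node = bit("1").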
Also note that pPEG operates on strings, not lists or arrays of characters, which could have a significant impact on memory usage and possibly even performance (not that I'm overly concerned about this right now):
?- S=`abcdefghijklmnopqrstuvwxyz`, term_size(S,Sz).
S = [97, 98, 99, 100, 101, 102, 103, 104, 105|...],
Sz = 78.
?- S="abcdefghijklmnopqrstuvwxyz", term_size(S,Sz).
S = "abcdefghijklmnopqrstuvwxyz",
Sz = 6.
Expressed as a list, how big would a 1000 line program be?
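A rough check, using the same term_size/2 measure as above (the 26-character example suggests the list form costs about 3 cells per character, while the string is stored as a compact blob); assuming roughly 40 characters per line, something like:

?- length(Cs, 40000), maplist(=(0'a), Cs),     % ~1000 lines of ~40 characters
   term_size(Cs, ListSize),
   string_codes(S, Cs),
   term_size(S, StringSize).

One would expect ListSize to come out in the region of 120,000 cells, with StringSize a small fraction of that.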
Surely there's a lot going on in a Prolog system between syntax recognition and runtime execution. So I don't know what you want to call it, but, given the expression 4-3-2, it's what defines the difference between 4-(3-2) and (4-3)-2. The syntax is the same but the meaning (semantics?) is different. Also whether the string "1234" is a base 10 number, or a base 16 number (or some other base). Again, same syntax but different meaning/interpretation/semantics? This seems quite different to me than arithmetic evaluation, for example.
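For reference, with the standard left-associative (yfx) declaration of (-)/2, SWI-Prolog reads it as the second form:

?- X = 4-3-2, write_canonical(X).
-(-(4,3),2)
X = 4-3-2.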
A PEG grammar just specifies the syntax of a "language". In addition, pPEG defines the structure of the output of a parse operation using such a specification. This defined structure is a tree where the terminal node values are strings. That's it. This is analogous to what library(pcre) regular expressions do - they map input text to a dictionary of strings. So if you want something other than the defined result structure, e.g., a Prolog term, you require an additional translation step. I doubt whether you can come up with a list of pros and cons of a two-pass or N-pass "parser" in general; it really depends on the application. But using something like pPEG, the first pass is reduced to specifying a formal grammar for the syntax which, I would argue, is useful for many reasons.
Some grammar systems support semantic actions as part of the grammar rules, and I would put DCGs in that category. It was decided not to do that in pPEG. It results in grammar specifications which are not portable between different implementations (it usually involves inserting code in the host language) and produces more complex grammar specifications which mix syntax and "semantic" issues together.
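For example, a DCG can embed evaluation directly in the rules via {} goals, which is exactly the kind of syntax/semantics mixing pPEG avoids (a small sketch, not from any of the parsers discussed here):

% A DCG in which the "semantics" (evaluation) is embedded in the rules as {} actions:
sum(V)           --> digit(D), sum_rest(D, V).
sum_rest(Acc, V) --> "+", digit(D), { Acc1 is Acc + D }, sum_rest(Acc1, V).
sum_rest(V, V)   --> [].
digit(D)         --> [C], { code_type(C, digit(D)) }.

So phrase(sum(V), `1+2+3`) should yield V = 6; the grammar and the arithmetic are inseparable.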
Now the SWIP example parser is just that, an example. If you only want to do syntax verification, i.e., recognition, you don't need the back end at all. But in addition to being an example, the back end primarily exists for simple grammar testing reasons, e.g., so I can:
?- string_termList("X is Y+Z.",[X is Y+Z]).
true.
?- string_termList("- - - .",[- - -]).
true.
But I see no need to turn this back end into a grammar in its own right - that's what Prolog is good at. And that's probably why I rarely use DCGs, which, if I understand it, are rewritten into Prolog clauses in a fairly straightforward manner.
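For example, a rule like greeting --> [hello], person. is rewritten into an ordinary clause with two extra arguments threading the input list (roughly; SWI-Prolog's actual expansion differs in detail):

% greeting --> [hello], person.
% becomes, roughly:
greeting(S0, S) :-
    S0 = [hello|S1],
    person(S1, S).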
I'm also not interested in producing end-to-end documentation of what SWI-Prolog does; that's what the reference manual is for. But what is missing, at least for me, is a concise specification of its syntax, so the pPEG specification fills that gap.
As far as comparing Prolog dialects, that's much bigger than a parser issue. Comparing syntaxes is a necessary but insufficient step in that process. Having a formal (and testable) grammar for that comparison would seem to be useful.
Just to point out that the pPEG syntax for SWIP defines a superset of valid SWIP programs, i.e., the terms do not have to be clauses. It wouldn't be hard to add this restriction, or it could be done in the "semantic" analysis.
There is a downside to postponing this check, namely, the original source may have been "lost", so error messages with source location information would be problematic.
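A hedged sketch of what such a post-parse check might look like (the predicate name is hypothetical, and a real check would also need to track source positions, per the error-message problem just mentioned):

% Hypothetical check applied to terms produced by the back end:
valid_program_term((:- Directive)) :- !, callable(Directive).
valid_program_term((Head :- Body)) :- !, callable(Head), callable(Body).
valid_program_term(Fact)           :- callable(Fact).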
Without a formal spec, one is never sure, but this "parses"; shown below both as a ptree and translated to a term:
?- prolog_ptree("p(A :- B) :- q(A), r(B).",Ptree).
Ptree = 'Prolog'([expr(['Compound'([atom("p"), arg([var("A"), 'InfixOp'([op(":-")]), var("B")])]), 'InfixOp'([op(":-")]), expr(['Compound'([atom("q"), var("A")]), 'InfixOp'([op(",")]), 'Compound'([atom("r"), var("B")])])])]).
?- string_termList("p(A :- B) :- q(A), r(B).",[T]).
T = (p((A:-B)):-q(A), r(B)).
True, because the Prolog grammar is context sensitive. For example, a ',' in an argument sequence or a list is not an operator unless it's in a parenthesized expression. And the '|' cannot appear as an operator in a list, but can in an argument sequence as you point out, although it "means" something completely different. So unfortunately you end up with three similar but different rules for expressions. Perhaps the standardization folks should have taken more care.
Operator priorities, or indeed operators, are not fundamentally supported by PEG (and therefore pPEG). I believe there are operator precedence grammars, but PEG isn't one of them. Normally a grammar with operators (and precedences) is written in PEG with a hierarchical rule structure, but that doesn't work if all the operators aren't defined in the grammar itself. Fortunately most machine oriented languages don't have this problem.
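For illustration, the usual way precedence is encoded in a PEG is one rule layer per priority level, e.g. (a sketch, not part of the SWIP grammar):

expr   = term (('+' / '-') term)*
term   = factor (('*' / '/') factor)*
factor = [0-9]+ / '(' expr ')'

This works only because the operators and their relative priorities are fixed in the grammar itself, which is exactly what Prolog's user-definable operators prevent.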
But although they're not defined by the syntax, operator definitions of some description are required to recognize the Prolog syntax, i.e., to separate operands and operators in an expression (since they're syntactically ambiguous). But only the operator class, i.e., prefix, infix, or postfix, is required for recognition. The "fudge" used by the pPEG Prolog grammar as written is a non-portable extension that tests whether the previously matched operator (rule op) is defined and is of the required class using current_op/3.
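Conceptually the class test amounts to something like the following (the op_class/2 name is hypothetical; the actual extension is embedded in the grammar and simply consults current_op/3):

% Hypothetical sketch: does atom Op have a currently defined operator of the given class?
op_class(Op, prefix)  :- current_op(_, Type, Op), memberchk(Type, [fy, fx]), !.
op_class(Op, infix)   :- current_op(_, Type, Op), memberchk(Type, [xfx, xfy, yfx]), !.
op_class(Op, postfix) :- current_op(_, Type, Op), memberchk(Type, [xf, yf]), !.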
The Covington draft of the ISO standard (my only reference) has no formal spec for the syntax, but it has a bunch of informal descriptions like:
There can be only one | in a list, and no commas after it.
The comma has three functions: it separates arguments of functors, it separates elements of lists, and it is an infix operator of priority 1000. Thus (a,b) (without a functor in front) is a structure, equivalent to ','(a,b).
The same atom can be an operator in more than one class (such as the infix and prefix minus sign). To avoid the need for unlimited lookahead when parsing, the same atom cannot be both an infix operator and a postfix operator.
The challenge is to capture these kinds of statements in a formal specification.
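As an example of how one of these statements can be captured, a PEG-style list rule can encode "only one | and no commas after it" directly in its shape (a simplified sketch, not the rule from the SWIP grammar):

list  = '[' items? ']'
items = expr (',' expr)* ('|' expr)?

The rule shape itself guarantees at most one '|' per list, with a single expression and no commas after it.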
(Nothing of importance here)
Not the source for sure. (I find it hard to find anything in that morass of C code.) I used the SWIP reference manual (mainly SWI-Prolog -- The SWI-Prolog syntax), Covington, and experimentation.
No, but the SWIP reference manual does.
It only crashes on Windows. On POSIX systems it uses the sigaltstack mechanism to gracefully recover:
34 ?- peano(1000000, L, []), atom_codes(A, L), term_string(T, A).
ERROR: term_string/2: C-stack limit (8,388,608 bytes) exceeded.
ERROR: Use the shell command ulimit -s size to enlarge the limit.
Possibly something similar can be done on Windows. I don't know how though ... In particular as we use MinGW (gcc), which doesn't have the __try ... __catch MSVC extension.
I encapsulated this in a module to scope the flag setting:
:- module(peano_parse, [peano//1, parse_peano/2]).

:- set_prolog_flag(double_quotes, codes).   % double-quoted literals are code lists within this module

% peano//1: generate the text of the Peano numeral for N
peano(0) --> !, "0".
peano(N) --> {M is N-1}, "s(", peano(M), ")".

% parse_peano(+N, -T): generate the numeral text for N, then parse it back to a term
parse_peano(N,T) :-
    peano(N,L,[46]),            % 46 is the code for the terminating '.'
    string_codes(S,L),
    string_termList(S,[T]).
Now:
?- parse_peano(1000,T).
T = s(s(s(s(s(s(s(s(s(s(...)))))))))).
But the pPEG engine is just a recursive descent parser, so at some point it's going to run out of memory. It doesn't crash on macOS, but the default error handling tries to output the humungous peano expression, which totally clogs up the Qt console used by the Mac bundle app. This then requires a manual "Force Quit". (It may complete, but I didn't have the patience to find out.)
I can avoid this by catching the resource_error myself. Somewhere between 150000 and 175000, it runs out of stack:
?- catch(parse_peano(150000,T),error(resource_error(_),_),format("resource_error\n",[])).
T = s(s(s(s(s(s(s(s(s(s(...)))))))))).
?- catch(parse_peano(175000,T),error(resource_error(_),_),format("resource_error\n",[])).
resource_error
true.
But you raise another point regarding resiliency. The pPEG engine has a check to catch infinite recursion on grammar rules, the so-called left-recursion issue. It only kicks in once a rule depth threshold is exceeded, but this threshold isn't a big number (currently 64). Once it exceeds the threshold, the pPEG engine runs quite slowly since it does this rather expensive check on every rule call. So to run these artificially high test cases in reasonably acceptable time, I had to disable the check. That's not something I would do in a production version, but perhaps the threshold requires some tuning.
But I'm not sure where this discussion is going. To me the value of this example is two-fold: 1) it's a significant and understandable example (at least to Prolog users) of pPEG capabilities (a stretch case IMO), and 2) the grammar itself is a decent starting point for a formal specification of the SWIP syntax, and may be a reasonable starting point for other dialects. And since it's executable, it has some value in testing any changes to the syntax going forward without requiring simultaneous changes to the builtin parser.
Also note that the pPEG implementation is heavily dependent on a few builtin string primitives like sub_string/5, so it's not quite a pure Prolog implementation. Other SWIP dependencies include stack introspection for the rule recursion check and debug hooks for tracing.
But you can't compare a DCG grammar that only parses peano strings with a pPEG grammar that parses Prolog. If I define a pPEG grammar for peano strings:
Peano = 's(' ('0' / Peano) ')'
considerably deeper nesting (~5 times) can be handled without exhausting memory:
?- catch(parse_peano_Wgrammar(800000,Tree),error(resource_error(_),_),format("resource_error\n",[])).
Tree = 'Peano'(['Peano'(['Peano'(['Peano'(['Peano'([...])])])])]).
?- catch(parse_peano_Wgrammar(900000,Tree),error(resource_error(_),_),format("resource_error\n",[])).
resource_error
true.
It may never be quite as space efficient as a DCG that ends up producing terms "natively", whereas the pPEG parser produces a syntax tree. (Note the extra square brackets in the 'Peano' terms designating a list of child nodes.)
As to whether a million record csv string can be parsed in the default memory configuration, I doubt it, but really don't know. It'd parse approximately as many records as can be represented as a syntax tree in available memory.
Also the 1 GB of stack space is divided up into separate chunks, any of which may be the limiting factor. I don't know enough about the underlying machinery to speculate which stack it might be, and the csv parser may have different characteristics than the peano parser (which has heavy duty rule recursion).
There is no need for this. DCGs work fine with the default setting. You only need this flag if you want to write e.g. phrase(my_grammar, "hello world").
It isnât common to parse a known string in source code, so this only applies to the toplevel. There you can write ?- phrase(my_grammar, `hello world`).
That suggests the global stack isn't the limiting factor for this particular case, so it's probably the local stack. Executing the Peano grammar on the pPEG VM on top of Prolog probably involves about 5 times the number of nested Prolog calls as a DCG of Peano. In general, executing a pPEG grammar would be similar to meta-interpreting the DCG.
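For what it's worth, one way to see which stack is actually growing is to sample statistics/2 from inside the computation (a sketch; the keys report live usage, so sampling after the goal has failed back won't show much):

% Report current stack usage in bytes; call this at a point of interest,
% e.g. from a spy point or near the deepest recursion.
report_stacks :-
    statistics(localused, Local),
    statistics(globalused, Global),
    statistics(trailused, Trail),
    format("local=~D global=~D trail=~D~n", [Local, Global, Trail]).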
I guess if I was really interested in parsing peano strings in a minimum of total space Iâd write a fairly simple C primitive.
The intent of the SWIPL example is to mimic SWIP, so it should produce the same result:
?- op(9,fy,fy), op(9,yfx,yfx).
true.
?- string_termList("fy 1 yfx 2.",[T]), write_canonical(T).
yfx(fy(1),2)
T = fy 1 yfx 2.
I'm not exactly sure what you think is going wrong.
So why is GNU Prolog's interpretation correct?
My analysis: When operator precedence values are the same (in this case both 9), associativity is used to define semantics. With a prefix operator and a following infix operator there are 6 (2 prefix and 3 infix) possible combinations to determine left, right, or non-associative.
Non-associative: (fx,xfx) and (fx,xfy)
Right-associative: (fy,xfy) and (fy,xfx)
Left-associative: (fx,yfx)
This leaves (fy,yfx), as in this example. This is ambiguous: it could be interpreted either left or right, and I haven't seen any "rule" which dictates what it should be. SWIP has chosen left and GNU has chosen right. Such is the state of "standard" Prolog.
Note there are probably similar issues with prefix-postfix and infix-postfix combinations.
Just to clarify a few things:
- When you write X = (1 xfy 2 yfx 3) in a top level command you are invoking the SWI-Prolog parser.
- This project: https://github.com/ridgeworks/pPEGpl/tree/main/Examples/SWIP-grammar is completely independent of that parser. It only claims to replicate the results of the builtin SWI-Prolog parser, solely for the purpose of testing/verifying the Prolog grammar captured in the example. Any so-called bugs in the SWI-Prolog parser will be replicated in this parser by design.
- The op_associativityEq/3 rules are part of this last parser, not the builtin parser. Yes, it would be simple enough to modify the rules, but then the two parsers would be inconsistent, which is not helpful in testing the grammar.
- If the builtin SWIP parser changes, e.g., for Table 6 compatibility, I'm happy to modify the example parser to reflect those changes. File an issue on github if you think it violates the standard. After all these years I do wonder if changes in this area will break more things than they fix, but that's not my call.
- I do not have access to the ISO standard because it's behind a paywall. (This is a matter of principle for me.) So my "spec" is the SWI-Prolog implementation, for better or worse.
- If by an "unparser" (converts terms to strings?) you mean something like write_canonical, the example parser does not have one; I use write_canonical as you've done.
I suspect most Prolog users have only a superficial understanding of operator semantics, and you've raised some valid points which the various Prolog implementers should probably address. It would appear the ISO standard has been somewhat ignored in this regard.
As I see it, in the pre-ISO days Prolog systems tried to be as tolerant as they could be, and some projects used this enthusiastically to define DSLs (Domain Specific Languages). Prolog systems did resolve ambiguities in different ways though. ISO tried to resolve this mess. As a result, though, we ended up with something that is rather restrictive. Even the standards-aware SICStus Prolog decided not to follow. They use Prolog non-determinism. I've tried to build something fully compatible, but failed and dropped that project.
I don't know where we should move. We could all cleanly implement the ISO rules. That breaks many of these fancy DSLs (forcing additional parentheses), but would work pretty much OK for arithmetic, etc. Ideally we'd get to something that is well defined and flexible. For now I just leave things alone. If anyone wants to pick it up, please do so.
The last time this discussion came up was with the ALE system for NLP parsing, which had been designed for SICStus, when porting it to SWI-Prolog. That was not about how to handle fy 1 yfx 2, but the general idea that SICStus accepts a lot more than it should according to strict ISO.
Ideally anything that is valid ISO syntax should be accepted and produce the same term. I donât know whether this is reachable using some small fixes or that would require a redesign of the operator handling.
Anyway, I've got other priorities.
Sure, this was never tested. Only since an earlier discussion do we have a small test suite for operator handling. Still, there are very few examples from the wild where this is an issue, and thus it is better left nicely resting at the bottom of the priority stack.
But why would you do that, i.e., change an unambiguous set of operators to an ambiguous set that relies on some external rule/table rather than the explicit semantics of precedence and associativity?
Given the original set of operators, SWIP produces the same result as SICStus:
?- op( 500, fy, !),op( 500, fy, ?),op( 500, xfy, :).
true.
?- X = (![Y]:p(Y)),write_canonical(X).
!(:([A],p(A)))
X = ![Y]:p(Y).
I see the problem corner cases as ones that can be easily avoided through judicious operator definitions. Any DSLs that use such cases are asking for trouble IMO.
That said, it wouldn't be a bad idea to align SWIP with the standard (and several other Prolog implementations by the looks of it), but I agree it's pretty low on the priority scale given that it's probably been this way for years (decades?).
Fine, I'll accept your statement that there is a rule in a dead standards document hidden behind a paywall that says it's not ambiguous.
And it appears I'm not the only one who largely ignores said document, but I'm not a Prolog implementor. Any success at standardizing anything these days seems to rely on the communication and goodwill of those folks.