Parsing text using a formal grammar: SWIPL Example

Looks like somebody did also try to realize this rule in your pPEG
SWIPL Example. If I check the source code of pl-parse.pl
I find for example the following:

% simple infix
build_term_([V1, op(_P1,_A1,Op1), V2], VarsIn, VarsOut, Term) :-            
	not_op(V1), not_op(V2),
    Etc...

The only problem is that this is done too late. Is suspect doing
this in the pl-grammar.pl would be more appropriate. But in
pl-grammar.pl there is hardly a not_op/1.

What I found useful was pPEGs negation !. But I only found
a partial solution and not a full solution. Since ( op ) cannot
be parsed anymore.

So by 6.3.1.3 ISO would prohibit:

?- X = (- -).
X = - (-).

and

?- X = (- * -).
X =  (-)*(-).

If so, would you really want to enforce it?

I see only two practical alternatives to the expr grammar rule; either (as written):

	expr   = PrefixOp " " (&PrefixOp expr / !InfixOp expr)
	       / term " " ( InfixOp " " expr " " / PostfixOp " " )*

This appears to exploit 6.3.1.3 to a degree since it says an InfixOp cannot be an operand of a PrefixOp. So this fails (as does SWIP):

?- string_termList("- * .",[T]).
% pPEG Error: Prolog.expr failed, expected Prolog.expr at line 1.5:
%   1 | - * .
%           ^

Alternatively:

	expr   = PrefixOp " " expr
	       / term " " ( InfixOp " " expr " " / PostfixOp " " )*

and this fails:

?- string_termList("- * - .",[T]).
% pPEG Error: Prolog.expr failed, expected Prolog.expr at line 1.7:
%   1 | - * - .
%             ^

You have suggested a third alternative with a !PrefixOp guard, but then this fails:

% /Users/rworkman/Documents/PrologDev/SWI_Prolog/pPEG/SWIP-grammar/pl_grammar compiled into pl_grammar 0.02 sec, 0 clauses
?- string_termList("- .",[T]).
% pPEG Error: Prolog.expr failed, expected Prolog.expr at line 1.3:
%   1 | - .
%         ^

which seems even more restrictive than ISO, so it wouldn’t make my short list.

Scryer seems to take 6.3.1.3 literally. That seems too restrictive to me and, it appears, most Prolog implementations agree. For what it’s worth, here’s what Eclipse has to say on the matter (at least they say something):

A.3.3 Operator Ambiguities
Unlike the canonical syntax, operator syntax can lead to ambiguities.

For instance, when a prefix operator is followed by an infix or postfix operator, the prefix is often not meant to be a prefix operator, but simply the left hand side argument of the following infix or postfix. In order to decide whether that is the case, ECLiPSe uses the operator’s relative precedences and their associativities, and, if necessary, a two-token lookahead. If this rules out the prefix-interpretation, then the prefix is treated as a simple atom. In the rare case where this limited lookahead is not enough to disambigute, the prefix must be explicitly enclosed in parentheses.

Another source of ambiguity are operators which have been declared both infix and postfix. In this case, ECLiPSe uses a one-token lookahead to check whether the infix-interpretation can be ruled out. If yes, the operator is interpreted as postfix, otherwise as infix. Again, in rare cases parentheses may be necessary to enforce the interpretation as postfix.

When a binary prefix operator is followed by an infix operator, then either of them could be the main functor. Faced with the ambiguity, the system will prefer the infix interpretation. To force the binary prefix to be recognised, the infix must be enclosed in parentheses.

Ref: Formal definition of clause syntax

Note: Eclipse supports a binary prefix operator which is outside the scope of this discussion so the last paragraph can be ignored.

And this also fails:

But there should be some way to make it work.

Edit 17.05.2022:
It also works in simp_parse/2:

?- simp_parse([-], T).
T =  (-).

?- simp_parse(['(',-,')'], T).
T =  (-).

Thanks. Here’s another I found:

“Precedences in specifications and implementations of programming languages” :

But to my untrained eye, they don’t address the root of the problem with Prolog expressions, namely that operators and atoms are syntactically indistinguishable. So “- *” could be parsed as “-(*)” (it’s actually a syntax error in SWIP) and “- * -” as “*((-),(-))”. Note in the first, “-” is and operator and “*” is an atom, while the roles are reversed in the second. In theory with sufficient lookahead this can be resolved, but it gets a bit tricky in practice for the general case and it gets worse when trying to define it in a grammar which recognizes Prolog syntax, which is my main focus here.

In practice, it’s not a big deal since you can just add enough parentheses to produce the right semantics. But as @j4n_bur53 rightly points out, the implicit assumptions built into various builtin Prolog parsers can adversely affect portability. At least, a grammar forces you to make some of them explicit.

I guess the ECLiPSe realization is not ISO compatible.

Since it violates this here:

You can try yourself:

[eclipse 1]: X = (- * x0).
X = (-) * x0

Are you starting to see a trend here? I am.

My conclusion is that the ISO standard may not be ambiguous (I don’t really know for reasons stated earlier) but most implementations consider it too restrictive. But in loosening the restrictions ambiguities have been introduced and the strategies for resolving these often haven’t been well documented. In practice, nobody cares much since you can always add parentheses to get the desired semantics. Well, they don’t care until you try to port a Prolog program to a different system with different resolution strategies and things break.

What can be done at this point is unclear. If everyone documented their strategy like Eclipse does, that would be a start.

I just pushed a new version updating the SWIP Example. Fixes include:

  • proper treatment of trailing escape sequences in quoted atoms
  • check for proper operator class in expression reduction
  • proper handling of non-associative operators in right reduction of expressions
  • added special “''” escape sequence in char codes (only grammar change)

I think this includes everything discussed so far. The known discrepancy with SWIP “xfx fy” associativity is unchanged from previous versions. In such cases, SWIP will generate an error while the pPEG version will not.

I found computer assistent more expressions that exemplify differences.
Your [x0, **, -, x1] and your [-, *, x0] was indeed found again as well.

For fuzz3:

+--------- [-, :-, x0]
         +------------------------- [x0, **, -, x1]
                                  +------ ppeg3
                                  +------ []
                                        + trealla3
                                        + []
                                        + swi3
                                        + []
                                        + scryer3
                                        + []
                                        + jekejeke3
                                        + gnu3
         +---------------------- [:-, :-, x0]
                               +--------- eclipse3
                               +--------- sicstus3

And for fuzz4:

+---------- [**, **, x0]
          +----------------------- [-, **, x0]
                                 +------- trealla4
                                 +------- ['(', '(', ')', -, x0]
                                        + scryer4
                                        + gnu4
          +-------- [-, **, x0]
                  +--------- [-, *, x0]
                           +------- [-, :-, x0]
                                  +------ swi4
                                  +------ eclipse4
                           +------------- jekejeke4
                  +-------- [-, :-, x0]
                          +-------------- ppeg4
                          +-------------- sicstus4

There is no reference to the simplified parser simp_parse/2 anymore,
its about differences between the different Prolog systems.

Warning: I used an algorithm that picks the first shortest example, that
discriminates two paths. There are more examples not shown.