Parsing text using a formal grammar: SWIPL Example

Well since its your code, it should be quite easy for you to extend this change:

op_associativityEq(fy,yfx,left) :- current_prolog_flag(iso, false).
op_associativityEq(fy,yfx,right) :- current_prolog_flag(iso, true).
/* Add further pairs that differ from what ISO core standard
   specifies in Table 6 etc.. and what SWI-Prolog implements */

And you will also be able to play around with it conveniently:

?- op(9,fy,fy), op(9,yfx,yfx).
true.

?- string_termList("fy 1 yfx 2.", [T]), write_canonical(T), nl.
yfx(fy(1),2)
T = fy 1 yfx 2.

?- set_prolog_flag(iso, true).
true.

?- string_termList("fy 1 yfx 2.", [T]), write_canonical(T), nl.
fy(yfx(1,2))
T = fy 1 yfx 2.

?- set_prolog_flag(iso, false).
true.
/* Test more samples that differ from what ISO core standard
   specifies in Table 6 etc.. and what SWI-Prolog implements */

Hope this Helpsā€¦

Unfortunately, even if you, or Boris who offered help, have a complete list of
the pairs that differ, this does not yet allow you to update the C code of SWI-Prolog.
I suspect the unparse would also profit from some treatment, see below.

Otherwise you will be bugged by a parser and unparser that are not in sync.

Edit 13.04.2022:
But some writeq/1, what is usually used in the top-level for answer substitutions
do a better job than what SWI-Prolog built-in unparser does. Now both yfx(fy(1),2)
and fy(yfx(1,2)) are displayed as fy 1 yfx 2:

/* SWI-Prolog */
?- op(9,fy,fy), op(9,yfx,yfx).
true.

?- T = yfx(fy(1),2).
T = fy 1 yfx 2.

?- T = fy(yfx(1,2)).
T = fy 1 yfx 2.

This should not be the case, and is a defect, since it obviously
leads to reparsing errors. I donā€™t get the same defect when
using ECLiPSe Prolog:

[eclipse 1]: op(9,fy,fy), op(9,yfx,yfx).

Yes (0.00s cpu)
[eclipse 2]: T = yfx(fy(1),2).

T = (fy 1) yfx 2
Yes (0.00s cpu)
[eclipse 3]: T = fy(yfx(1,2)).

T = fy 1 yfx 2
Yes (0.00s cpu)

But the exact placement of paranthesis must possibly then depend
on current_prolog_flag(iso, F) for SWI-Prolog. Otherwise
reparsing goes also wrongā€¦

I specifically offered help with figuring out where in the C code this is happening (Jan W said it is complicated) and maybe how to untangle it; I have no motivation or energy to actually write code like this at this moment.

Navigating the C-code is not that difficult. You find everything in pl-read.c.
But I donā€™t know the correct fix. Maybe you need to reduce from right-to-left
instead from left-to-right, to get the ISO core standard result.

The parser reminds me a little bit of a early chart parser. You can run
them bottom up. There are also ways to run them bottom up from right-to-left
and from left-to-right. This might give different first solution parse trees I guess.

I have some Prolog parsers prototypes, using forward chaining. Maybe its
possible to pin down the problem this way. I didnā€™t find the left/right information
coded in pl-read.c somewhere, that is coded in the pPEG parser from this thread.

The left/right information is missing in the C code. So the fix is possibly different.

The current pPEG example Prolog parser does what I want it to do, it faithfully parses SWI-Prolog terms using a formal grammar. It doesnā€™t pretend to parse ISO or any other Prolog dialect.

But if you want to use it for some other purpose, feel free to modify the parser, or the grammar, to do so.

The pPEG parser could be a blessing, but it could be also a further curse.

We donā€™t know what the pPEG parser does. For example I donā€™t find the
op_associativityEq/3 table in pl-read.c. Where did you get it from? Also
very simple examples donā€™t work in the pPEG parser:

?- string_termList("(- -1).", [T]), write_canonical(T), nl.
-(-,1)
T =  (-)-1.

?- X = (- -1), write_canonical(X), nl.
-(-1)
X = - -1.

Edit 14.04.2022:
Thanks to my sixth sense in finding errors, which spares me the money
in a master class of test driven development, I picked a quite a difficult
part of the usual Prolog parser, which cannot be solved by a two pass

parser so easily, as in your approach of pl_grammar.pl and pl_parse.pl.
Especially in the SWI-Prolog dialect its tricky, because you need lexical
information how close the minus sign was to the number. Also I double

checked here, the example is not listed as a current limitation:

A pPEG formal grammar for SWI-Prolog syntax
https://github.com/ridgeworks/pPEGpl/tree/main/Examples/SWIP-grammar

Could be that we now have two dialects, SWI-Prolog and pPEG example?

And that would be your decison to make, depending on what your goals are. But Iā€™m pretty sure your goals are not the same as mine.

Well the code is all on github if you care. It is completely independent of SWI-Prolog, nothing to do with pl-read.c. I wrote the code for op_associativityEq/3 - it is in pl_grammar.pl - based on my understanding of actual SWI-Prolog semantics. It may not be accurate; itā€™s just an example.

And this would be a bug. In fact itā€™s doubly important because itā€™s a bug in the grammar because itā€™s recognizing the first ā€˜-ā€™ as an atom and the second as an infix operator, so thank you for finding it.

As Iā€™ve tried to state many times now, the pPEG example parser is not part of any Prolog system and is not a dialect of Prolog. It merely tries to mimic the SWI-Prolog parser using a formal grammar. Please donā€™t try to cast it as anything more. And itā€™s clearly as not free of bugs of its own as you clearly demonstrated, but Iā€™ll try to improve that situation.

So the problem in the grammar was that the rule for InfixOp did not apply the same guard as PrefixOp for ā€˜-ā€™ preceding a digit. Correct rule in pl_grammar.pl should be:

	InfixOp   = !('-' [0-9]) op <pl_grammar:testOp infix>  # '-' before number not infix

Now:

?- string_termList("(- -1).", [T]), write_canonical(T), nl.
-(-1)
T = - -1.

Iā€™ll include this fix in my next push to pPEG.

FWIW, I will not touch pPEG anymore, even not with tweezers,
until it comes with a test suite, in the same folder like where the
source code is housed. At least this is what Ulrich Neumerkel did

for ISO core standard prolog, and which serves as a point of orientation.
Now we have for SWI-Prolog dialect nor the pPEG dialect any point of
orientation. Its just arbitary code uploaded to the internet. Although I think

SWI-Prolog might be in a better situation. Not sure, it has plunit and maybe
it has somewhere read test cases? But it has only with_output_to/2, and
no with_input_from/2. How is parsing tested? I never did dig up some

SWI-Prolog read test cases, maybe there are some? Thats of course my fault.

Edit 14.04.2022:
Test suites in general are even ridiculed by people here in this forum,
in that citations get criminally tampered, like here:

But the footer of Ulrich Neumerkel nowhere shows the phrase
" (I tried them, and they all work for SWI-Prolog)". On the other
hand those who ridicule these efforts do not make a good

professional example by themselve in these matters. The word
Ulrich Neumerkel chose, i.e. ā€œvulgarizeā€ is maybe not optimal, this
is some Viennese lingo indicating outside academic matters, I

wouldnā€™t use this word, I would rather say ā€œdisseminateā€ or somesuch.

So I think thatā€™s a net win for me, at least until you understand what it is and isnā€™t, i.e., itā€™s not meant to be a replacement for any Prolog parser, itā€™s just an example to try and test a formal grammar specification of a particular Prolog dialectā€™s syntax. My initial test plan is to be able to syntactically recognize all Prolog files in SWIPā€™s top level ā€˜bootā€™ and ā€˜libraryā€™ directories. Unfortunately none of those files contain ā€œ- -1ā€.

By all means continue to use the official and definitive SWIP parser, thatā€™s what Iā€™m doing.

Bye for now and thanks for at least one useful actionable item.

P.S., the quick fix to the grammar I suggested causes other issues so this will continue to be a work in progress.

Here is one more test case, which you might be interested in:

First SWI-Prolog:

/* SWI-Prolog 8.5.8 */
?- op(9, fy, fy), op(9, xfy, xfy), op(9, yf, yf).
true.

?- X = (fy 1 xfy 2 yf), write_canonical(X), nl.
yf(fy(xfy(1,2)))
X = fy 1 xfy 2 yf.

And then pPEG:

/* pPEG SWIPL example 14.04.2022 */
?- op(9, fy, fy), op(9, xfy, xfy), op(9, yf, yf).
true.

?- string_termList("(fy 1 xfy 2 yf).", [T]), write_canonical(T), nl.
fy(yf(xfy(1,2)))
T = fy 1 xfy 2 yf.

Its also different.

Edit 14.04.2022:
Its not from my stock of test cases, I was currious what
happens when we use more than only two operators. And
my sixth sense hit the jackpot from the first test case I invented.

Wasnā€™t expecting that it goes wrong since your marketing
genious still left in limbo of believing pPEG = SWI-Prolog. Will
also check my systems. What would be also a testing method,

would be fuzz testing or Logtalks QuickCheck tailored towards
parsing, generating random samples. Maybe SWI-Prolog has
something like term_atom/2 so that cross checking between

SWI-Prolog and pPEG can be automatized. Now I want to visit
this event. I have sleeping problems, because of my sixth sense,
I am all the time fuzzing things, better replace me:

Fuzzing Java with Jazzer
Large tech companies such as Microsoft and Google are relying
on fuzzers more and more to automate finding security issues
in their software. In 2019, Google found the majority of potential
security issues in Chromium via fuzzing - over 18,000 bugs in total.
https://www.jug.ch/html/events/2022/jazzer.html

Yes, this one is more fundamental. When you see ā€œ(fy 1 xfy 2 ...)ā€ which should be right associative, do you reduce 1 xfy 2 ... or just reduce/replace 1 xfy 2 and continue. I suspect Iā€™m doing it wrong, but Iā€™ll have to think about it.

You can of course priorize defects. I have no problems with that.

BTW: The latest example is quite resilient in the ISO core sphere,
i.e. with the term building result as per ISO standard, which
differs from the term building SWI-Prolog result and also differs

from the term building pPEG 14.04.2022 result. Its a third possible
result, namely (fy 1 xfy 2 yf) = fy(xfy(1,yf(2))):

Dogelog Player: Pass āœ“
Formerly Jekejeke Prolog: Pass āœ“
ECLiPSe Prolog: Pass āœ“
Tau Prolog: Pass āœ“
Scryer Prolog: Pass āœ“
SICStus Prolog: Pass āœ“
GNU Prolog: Pass āœ“

I think Iā€™ve sorted it out in my local development version since:

?- op(9, fy, fy), op(9, xfy, xfy), op(9, yf, yf).
true.

?- string_termList("(fy 1 xfy 2 yf).", [T]), write_canonical(T), nl.
yf(fy(xfy(1,2)))
T = fy 1 xfy 2 yf.

but Iā€™m not taking it to the bank just yet.

Thanks.

What I could exclude as a source of error, is ordering of
the operator table itself. At least for SWI-Prolog C code it does not
have an impact, which sequence of operator definitions I use:

This sequence of op/3 calls:

?- op(9, fy, fy), op(9, yfx, yfx).
true.

?- X = (fy 1 yfx 2), write_canonical(X), nl.
yfx(fy(1),2)
X = fy 1 yfx 2.

And this sequence of op/3 calls, give the same:

?- op(9, yfx, yfx), op(9, fy, fy).
true.

?- X = (fy 1 yfx 2), write_canonical(X), nl.
yfx(fy(1),2)
X = fy 1 yfx 2.

This does not exclude the possibility, that reordering the
operator table, inside SWI-Prolog C code, wouldnā€™t give
another result. It only shows that op/3 cannot confuse the parser.

Edit 15.04.2022:
Here is an example where operator table ordering, respective
reduction rules ordering, influence the reduction result.
Reduction rules ordered like this:

[fy,X] ~> [fy(X)].
[X,yfx,Y] ~> [yfx(X,Y)].

?- reduce([fy,1,yfx,2],X), write_canonical(X), nl.
[yfx(fy(1),2)]
X = [fy 1 yfx 2].

Or the same reduction rules ordered like this, gives a different result:

[X,yfx,Y] ~> [yfx(X,Y)].
[fy,X] ~> [fy(X)].

?- reduce([fy,1,yfx,2],X), write_canonical(X), nl.
[fy(yfx(1,2))]
X = [fy 1 yfx 2].

The reducer code itself is here below, it has a cut (!), so its
sensitive to the reduction rule ordering:

:- op(1200,xfx,~>).

reduce(X, Y) :-
   (A  ~> B),
   append(U, V, X),
   append(A, H, V),
   append(B, H, W),
   append(U, W, Z), !, reduce(Z, Y).
reduce(X, X).

So I think Iā€™ve finally figured this one out. Itā€™s a little nasty because within a Prolog expression a ā€œ-ā€ can be a prefix operator, an infix operator, an atom, or the leading character of a number. So hereā€™s a little test that uses all of them:

?- T= - -1- -, write_canonical(T), nl.
-(-(-1),-)
T = - -1-(-).

It getā€™s a little tricky to cleanly separate the syntax (defined by a grammar) from the semantics in order to figure out which is which.

Cool!

When this is pushed, its possible to change the fuzzer to also
include - operator, prefix and infix, and negative numbers, so
that a greater audience could also life experience some validation

run. To automatically hunt for, more or now less?, easter eggs.

Edit 15.04.2022:
Right now I get, based on the 14.04.2022 version:

?- between(1,100,_), random_expr(0, _, A), swi_parse(A, X), 
ppeg_parse(A, Y), X \== Y, write('expr: '), write(A), write('\nswi: '), 
write_canonical(X), write('\nppeg: '), write_canonical(Y), nl, nl, fail.

expr: - fx x0 - x1 yf - x2
swi: -(-(-(fx(x0)),yf(x1)),x2)
ppeg: -(-(fx(x0),-(yf(x1),x2)))

expr: - - -8 yf
swi: -(-(yf(-8)))
ppeg: -(-(-),yf(8))

Etc...

The new random actions are:

random_expr(N, M, A) :-
   K is 14+N*7, %%% increase here
   random(0, K, I),
   random_action(I, N, M, A).

%%% add these actions before the last action
random_action(7, N, N, A) :-
   random(0, 21, X),
   Y is X - 10,
   number_codes(Y, L),
   atom_codes(A, L).
random_action(8, N, M, A) :- !,
   random_expr(N, H, B),
   random_expr(H, M, C),
   atom_concat(B, ' - ', D),
   atom_concat(D, C, A).
random_action(9, N, M, A) :- !,
   random_expr(N, M, B),
   atom_concat('- ', B, A).

I also changed the last random action so that it generates x<number> and not only <number>.

These are even cases where the ISO core standard agrees with
SWI-Prolog. Possibly because the operator - infix and prefix gives
no ambiguity, because the priorities are different:

?- current_op(X,Y,-).

X = 200
Y = fy ? ;

X = 500
Y = yfx

The test results are, for these two test cases:

Dogelog Player: Pass āœ“
Formerly Jekejeke Prolog: Pass āœ“
ECLiPSe Prolog: Pass āœ“
Tau Prolog: Fail āœ—
Scryer Prolog: Pass āœ“
SICStus Prolog: Pass āœ“
GNU Prolog: Pass āœ“

Tau Prolog fails the second test caseā€¦

Yes, I agree different precedences render the associativity values irrelevant.

After yet more fiddles to my development version:

?- string_termList("- fx x0 - x1 yf - x2.", [T]), write_canonical(T), nl.
-(-(-(fx(x0)),yf(x1)),x2)
T = -fx x0-x1 yf-x2.

?- string_termList("- - -8 yf.", [T]), write_canonical(T), nl.
-(-(yf(-8)))
T = - - -8 yf.

Hard to know when youā€™ve reached the end of the long tail.

A simple reader, without negative numbers, is a few lines of DCG.

?- reader(X,1200,[-,fx,x0,-,x1,yf,-,x2],[]), write_canonical(X), nl.
-(-(-(fx(x0)), yf(x1)), x2)
X = -fx x0-x1 yf-x2.

Here is the source code, unfortunately only for ISO core standard:

% reader(-Term, +Integer, +List, -List)
reader(X, L) -->
   reader_primary(Z, L, K), reader_secondary(Z, X, L, K).

% reader_primary(-Term, +Integer, -Integer, +List, -List)
reader_primary(H, L, R) --> [A], {current_op(R, M, A), is_prefix(M, E)}, !,
   {L < R -> throw(error(syntax_error(operator_clash),_)); true},
   {T is R-E}, reader(Z, T), {H =.. [A,Z]}.
reader_primary(X, _, 0) --> [X].

% reader_secondary(+Term, -Term, +Integer, +Integer, +List, -List)
reader_secondary(H, X, L, C) --> [A],
   {current_op(R, M, A), is_infix(M, D, E), L >= R}, !,
   {R-D < C -> throw(error(syntax_error(operator_clash),_)); true},
   {T is R-E}, reader(Z, T),
   {J =.. [A,H,Z]},
   reader_secondary(J, X, L, R).
reader_secondary(H, X, L, C) --> [A],
   {current_op(R, M, A), is_postfix(M, D), L >= R}, !,
   {R-D < C -> throw(error(syntax_error(operator_clash),_)); true},
   {J =.. [A,H]},
   reader_secondary(J, X, L, R).
reader_secondary(H, H, _, _) --> [].

% is_infix(-Atom, -Integer, -Imteger)
is_infix(xfx, 1, 1).
is_infix(yfx, 0, 1).
is_infix(xfy, 1, 0).

% is_prefix(-Atom, -Integer)
is_prefix(fx, 1).
is_prefix(fy, 0).

% is_postfix(-Atom, -Integer)
is_postfix(xf, 1).
is_postfix(yf, 0).

If I load the reader into SWI-Prolog, and if I run also the fuzzer, I can
compute some defect rate. I get that SWI-Prolog has a half as high
error rate than the pPEG parser from 14.04.2022:

?- aggregate_all(count, (between(1,10000,_), random_expr(0, _, A), 
ppeg_parse(A, X), simp_parse(A, Y), X \== Y), C).
C = 583.

?- aggregate_all(count, (between(1,10000,_), random_expr(0, _, A), 
swi_parse(A, X), simp_parse(A, Y), X \== Y), C).
C = 273.

The defect rates are ca. 6% for pPEG 14.04.2022 and ca. 3% for SWI-Prolog.