Formal grammar: SWIPL Example error handling

I made the following change:

string_termList(String,Terms) :-
	prolog_grammar(PG),  % from pl_grammar
%	peg_parse(PG,String,'Prolog'(Nodes)),
 	peg_parse(PG,String,'Prolog'(Nodes),_,[verbose(silent)]),
	nodes_terms(Nodes,Terms).

Now its silent:

?- string_termList("- x4 ** x2 =:= .", [Y]).
false.

And I can now also add failure handling in my adapter. The problem is
a parser that fails, and gets not catched by my adapter, avoids the X \== Y
check here, and gives false statistics. So I need to change my adapter:

Edit 02.05.2022:
Here are some test results. fuzz3.p can do parenthesis test cases
and fuzz4.p can provoke peg_parse errors. Both do not try to touch
the highly ambiguous cases.

Interestingly SWI agrees with my simplified parser, but disagrees
when I add peg_parse errors:

/* fuzz3.p, with parenthesis */
?- aggregate_all(count, (between(1,10000,_), random_expr(0, _, L, []),
|    ppeg_parse(L, X), simp_parse(L, Y), X \== Y), C).
C = 899.

?- aggregate_all(count, (between(1,10000,_), random_expr(0, _, L, []),
|    swi_parse(L, X), simp_parse(L, Y), X \== Y), C).
C = 0.

/* fuzz4.p, with parenthesis and provocation of peg_parse errors */
?- aggregate_all(count, (between(1,10000,_), random_expr(0, _, L, []),
|    ppeg_parse(L, X), simp_parse(L, Y), X \== Y), C).
C = 705.

?- aggregate_all(count, (between(1,10000,_), random_expr(0, _, L, []),
|    swi_parse(L, X), simp_parse(L, Y), X \== Y), C).
C = 302.

Maybe there are some explanations for these differences. One
example is already the SICStus and SWI-Prolog difference posted
a few hours ago. My simplified parser agrees with SICStus.

Now I had a look at the tokenization as well. There are cases where the
pPEG tokenizer is more error aware than the SWI-Prolog built-in tokenizer.

?- X = 'abc\xFF'.
X = abcÿ.

?- string_termList("'abc\\xFF' .", [T]).
false.

But unfortunately in verbose mode, it doesn’t show some error.

Edit 14.05.2022:
This is also some strange result, now both in the SWI-Prolog
tokenizer and in the pPEG SWIPL Example:

?- X = 'abc\xFxF'.
X = 'abc\u000FxF'.

?- string_termList("'abc\\xFxF' .", [T]).
T = 'abc\u000FxF'.

But it could be some extension? Possibly not, maybe more
a missed error, since this gives an error in both systems:

?- X = 'abc\xFxF\def'.
ERROR: Syntax error: Unknown character escape in quoted atom or string: `\d'

?- string_termList("'abc\\xFxF\\def' .", [T]).
% pPEG Error: _esc failed ...
false.

But the error somehow differs.

Thanks, this was a bug generating atoms which terminate in an escape sequence. Correct behaviour:

?- string_termList("'abc\\xFF' .", [T]).
T = abcÿ.

In general, anything that fails in verbose mode without a message is a bug.

Looks like the same error to me:

?- string_termList("'abc\\xFxF\\def' .", [T]).
% pPEG Error: _esc failed, expected ([\abcefnrstv'"`] / ('x' _hex+ '\'?) / (_octal+ '\'?) / ('u' _hex*4) / ('U' _hex*8)) at line 1.11:
%   1 | 'abc\xFxF\def' .
%                 ^
false.

This error is caused by a violation of the grammar which defines an escape sequence with the rule:

	_esc    = '\\' ( [\\abcefnrstv'"`] 
	               / 'x' _hex+ '\\'?
	               / _octal+ '\\'?
	               / 'u' _hex*4
	               / 'U' _hex*8
	               )

so the error message indicates the _esc rule failed and where it failed in the source.

Now I found a minor, really minor, further bug, concerning tokens.
This pPEG grammar production is incomplete:

_code   = _esc / ~[]

You can try:

?- X = 0'''.
X = 39.

?- X = 0'''.
X = 39.

?- string_termList("0'' .", [T]).
T = 39.

?- string_termList("0''' .", [T]).
% pPEG Error: Prolog.Prolog failed, expected _eox at line 1.4:
%   1 | 0''' .
%          ^
false.

So I understand:

?- X = 0'a.
X = 97.

?- X = 0''.
X = 39.

but not:

?- X = 0'''.
X = 39.

How is this valid syntax?

I can’t find relevant description in SWIP doc, but from Sicstus syntax:

 natural-number    --> digit...
                    |  base ' alpha...
                          { where each alpha must be less than the base,
                          treating a,b,... and A,B,... as 10,11,... }
                    |  0 ' char-item
                          { yielding the character code for char }
 
 char-item         --> char  { other than \ }
                    |  \ escape-sequence

In the good old days it was 0'<char>. With character escaping it became 0'<char-as-in-quoted-atom>. Quoted atoms allow for '' to represent a single '. AFAIK 0''' is ISO, although I’d use 0'\' for readability. 0'' is (probably) SWI-Prolog specific and there to ease the transition. Should be deprecated.

OK, think I understand. Two single quotes are a special escape sequence in quoted atoms so it must be allowed after “0'”.

Correct, quoted atoms have an additional escape so the following is required:

	_code   = _esc / "''" / ~[]

ie., an escape, two single quotes or any other single character (including a single quote). Now:

?- string_termList("0'''.",[T]).
T = 39.

?- string_termList("0''.",[T]).
T = 39.

?- string_termList("0'\'.",[T]).
T = 39.

Thanks for pointing this out.

Scryer Prolog is currently more strict:

$ target/release/scryer-prolog -v
"v0.9.0-117-gf2d3e55e"
$ target/release/scryer-prolog
?- X = 0''.
   error(syntax_error(incomplete_reduction),read_term/3:1
?- X = 0'''.
   X = 39.

But I wish they had better error messages.

That seems unnecessarily strict to me. The intent is clear and it’s not ambiguous, no? Although as Jan suggests, “0'\'” is perhaps better and presumably isn’t an error.

I guess you get the same if you change this:

_code   = _esc / "''" / ~[]

Into this here:

_code   = _esc / "''" / ~"'" ~[]

Right? Not 100% sure. Thats also what the ISO core standard says,
but the ISO core standard does use another formalization, it doesn’t
use some grammar formalism that has negation available.

You find that a single quote ' is exclude as a _code in this section:

6.4.2.1 Quoted characters.

Edit 16.05.2022:
How do I recompile the pPEG SWIPL Example? Can I simply edit pl_grammar.pl?
Woa! I only needed to reload the edited Prolog file, and it worked:

?- 
% ppegpl/examples/swip-grammar/pl_grammar compiled ...
?- string_termList("0'' .", [T]).
% pPEG Error: Prolog.integer failed, expected _code at line 1.3:
%   1 | 0'' .
%         ^
false.

?- string_termList("0''' .", [T]).
T = 39.

Close, but not quite. This would consume two characters after the "“0'”, a character which wasn’t a single quote and the following character. But I think this should do the trick (untested):

i.e., if not escape or “''”, any character other than a single quote. Thus the _code rule should fail if the character was a single quote.

Thanks to the power of quasi-quotations.

I think you meant:

?- string_termList("X = 1 2 'ab\\qc' 3 4  .",[T]).
% pPEG Error: Prolog.Prolog failed, expected _eox at line 1.9:
%   1 | X = 1 2 'ab\qc' 3 4  .
%               ^

for the pPEG test.

A couple of points. First PEG parsers are scannerless parsers so a message like “superfluous token” would never be generated (there are no tokens in the conventional sense).

Second, good error diagnostics from a recursive descent parser (like pPEG) can be quite challenging. Given multiple choices for a given rule, which failed choice could be considered the “real” cause of that failure? (Regular expressions have a similar issue but they usually don’t even try to produce a diagnostic message.) Generally a good heuristic is to use the cursor position (and rule name) at the furthest failure point in the string being parsed, although you can argue it’s not that helpful in this case.

Third, SWIP supports spaces in numbers:

?- X = 1 2 .
X = 12.

so that may be another source of confusion with this particular example. This may help clarify what’s actually happening in the parser:

?- string_termList("X = 1 'ab\\qc' 3 4  .",[T]).
% pPEG Error: Prolog.Prolog failed, expected _eox at line 1.7:
%   1 | X = 1 'ab\qc' 3 4  .
%             ^

So an expression was matched (i.e., “X = 1”) and then the parser terminated when it couldn’t find a full-stop. The high water mark was the beginning of the quote which definitely doesn’t match a full stop.

If parsing could get past this point to get to the syntax error in the quoted atom:

?- string_termList("X = 1+'ab\\qc' 3 4  .",[T]).
% pPEG Error: _esc failed, expected ([\abcefnrstv'"`] / ('x' _hex+ '\'?) / (_octal+ '\'?) / ('u' _hex*4) / ('U' _hex*8)) at line 1.11:
%   1 | X = 1+'ab\qc' 3 4  .
%                 ^

which may be closer to what you were expecting (except the parse doesn’t get that far).

There are two errors. Isn’t it somewhat subjective which is more interesting? Which one gets reported depends on the parser implementation and trying to dictate that seems like a dubious quest.

If the parser scans tokens on the go, i.e. if the parser is interwined with the scanner, then the first error the parser detects, could be the more interesting.

Such interwined parser and scanners are not ultra new. For example the modula programming language by Niklaus Wirth had first a multipass compiler, where a first pass was a scanner that generated tokens. And then other passes following.

But later versions of the modula programming languages interwined the scanner and parser. This could maybe also archived by scanner less parser? In your pipeline pl-parser.pl after pl-grammar.pl you are not anymore scanner less, although pl-grammar.pl is what one calls nowadays scanner less.

The above error when you would combine pl-grammar.pl and pl-parser.pl into a single scanner less parser, it would probably also show the ECLiPSe Error? A further option would be if the scanner does withhold the decoding of escaped characters or also some checks concerning numbers,

then if these conversions would be done during parsing, this could
be also some approach that would show the ECLiPSe Error.

This was the point I was trying to make, perhaps poorly; it’s implementation dependant.

Defining (and agreeing) on correct syntax and semantics is hard enough. Trying to decide on which error should be more “interesting” given multiple choices strikes me as a bridge too far. But don’t let me discourage you.

Personally, in this case, all the errors have to be fixed so it really doesn’t matter to me which one the parser finds “more interesting”.