Adding error detection to DCG rules

Once again I find myself floundering, getting more and more frustrated, staring at my code in the debugger and wondering why I feel like I am trying to herd sheep.

I have a set of rules that parse, but now I want to add error handling. I am never sure where or when to add the throw; just because one rule fails doesn’t mean it’s game over. Case in point: given an s-expression, when the body parse fails it is either because the DCG token list is empty (premature end of code) or because the next token is the “)”, which is expected at that point.

I have yet to figure out “the way”: the way in which you can repeatedly ask for more content, and when that fails, determine whether the failure is legitimate, i.e. the s-expression is closing… I am still working on it, but I wondered if there were any tried and tested techniques out there?

I think I know what I have done wrong and how to fix it and I hope to enlighten myself any time now but I thought I’d ask the brains on this forum for some advice and pointers etc.
:smiley:

Is this a grammar of your own design or one you picked up somewhere?

More importantly is the grammar unambiguous?

It’s all beautifully hand rolled from nothing; it’s a simple (!) s-expression based language I am trying to create. The parsing per se isn’t the problem, it’s how to insert the error detection… my issue is knowing how to identify the correct place in a set of rules where you can say with confidence “it is an error for me to be here”.
My tactic so far has been adding another clause to catch the previous one failing, because at that point I know that the previous clause should have worked iff the syntax was correct.

Here is some of the code… for example, at the point where one_term fails I know that either the next token MUST be cp “)” or some kind of error in the source is present.

collect_terms([T| Ts]) --> sexp_term(T), !, collect_terms(Ts).
collect_terms([]) --> [].

sexp_term(T) --> s_exp(T).

%%        s_exp(-Term).
%
s_exp(S) -->
    [op(Pos)],
    sexp_body(Pos, B), [cp(_)],
    {S =.. [sexp, Pos| B]}.

%
% the body can be "one_term" or it must be a closing paren that is
% ending the body. if the stack is empty or it's not a ) we fail.
%
sexp_body(Pos,[T| Ts])     --> one_term(T), !, sexp_body(Pos, Ts).
%
% we get here because one_term FAILED, the only legit reason for that is
% if the next token is a close paren.. if we ran out or it's not a close
% paren then it's time to throw an error...
%
sexp_body(OpenAt, []), [T] -->
    (   [T]
    ->  (   {functor(T,cp,_) -> true}
        ;   {arg(1,T,Pos)},
            syntax(Pos, expected_close_paren(OpenAt))
        )
    ;   syntax((-1,-1,-1),expected_close_paren(OpenAt))
    ).

% attempt to read one term: if this fails it MAY be due to an
% immediately following close paren which is ok. We leave that for the
% next level up to remove and continue....
one_term(T) -->
    (   list(T),      d('got a list ~w',[T])
    |   map(T),       d('got a map ~w',[T])
    |   atomic(T),    d('got an atomic: ~w',[T])
    |   sexp_term(T), d('got an sexp_term: ~w',[T])
    ).

I still have a LOT to learn about Prolog but I am hooked for life and this is just learning curve blues again.

I take it you have read Error Handling in Using Definite Clause Grammars in SWI-Prolog by Anne Ogborn.

Yes, once or twice at least, but what do I remember of it?
I don’t want to resync per se as it’s not safe to continue in my scheme.
I will read again in the morning and see if it helps.

Error handling was a hot topic in the 1970s and 1980s; gradually people realized that (a) it was a lot of work and (b) it wasn’t as helpful as first thought, especially as machines got faster. So, the consensus seems to be to just stop (don’t bother to “resync”) and output a message about how far you got.

Basically, you want to keep track of how far you go. There are a couple of ways of doing this.

One approach is to modify the [...] construct to record the farthest progress. If the parser fails, just output "Error occurred here: ". This is probably the easiest; and it’s likely to be “good enough”. I’m pretty sure I did this in some code I wrote 20+ years ago, but I don’t think I have that code any more. :frowning:
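
A minimal sketch of that idea (the names tok//1, farthest_rest/1 and report_stuck_point/0 are made up here for illustration, not from the post above): route every terminal through one wrapper nonterminal and keep a “high-water mark” of the remaining input.

:- dynamic farthest_rest/1.

% Call this before each parse to clear the record.
reset_progress :-
    retractall(farthest_rest(_)).

% Route every terminal through tok//1 so the parser records the shortest
% "remaining input" it has ever seen, i.e. the farthest point reached.
tok(T) -->
    [T],
    remember_rest.

remember_rest(Rest, Rest) :-
    length(Rest, N),
    (   farthest_rest(Best),
        length(Best, BestN),
        BestN =< N
    ->  true                              % we have already been further
    ;   retractall(farthest_rest(_)),
        assertz(farthest_rest(Rest))      % new high-water mark
    ).

% After a failed parse, report where we got stuck.
report_stuck_point :-
    (   farthest_rest(Rest)
    ->  print_message(error, format('Error occurred here: ~p', [Rest]))
    ;   true
    ).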

Another way is to add “failure” rules that indicate where things went bad. Typically, these won’t show quite the full progress unless you add a fair bit more complexity:

for_stmt --> [for], id, [in], expr, [':'].
for_stmt --> [for], id, [in], expr, error_in(for_stmt, expecting:':').
for_stmt --> [for], id, [in], error_in(for_stmt, expecting:'<expr>').
for_stmt --> [for], id, error_in(for_stmt, expecting:'in').
for_stmt --> [for], error_in(for_stmt, expecting:'<id>').

Notice that there’s no error if the for token is missing; that’s because there could be a rule that calls this and if for_stmt fails, it’ll try an alternative. (This kind of thing is what makes error handling so tricky.)

The error_in//2 predicate can output a suitable message with information about the current position.
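
error_in//2 isn’t spelled out above; a minimal sketch of one way it could look (the exception term here is my own invention, and this version throws rather than printing and carrying on):

% Peek at the next token, if any, and throw a syntax error recording the
% rule we were in, what we expected, and where we got to.
error_in(Rule, Expecting) -->
    (   [Tok]
    ->  { throw(syntax_error(in(Rule), Expecting, at(Tok))) }
    ;   { throw(syntax_error(in(Rule), Expecting, at(end_of_input))) }
    ).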

You can also refactor the above code to make it a bit less verbose. If you read Wirth’s original book on top-down recursive descent parsing, he explains how to add error information (sorry, I don’t have the reference off the top of my head; it was probably written in the 1970s or 1980s … might be Algorithms + Data Structures = Programs).
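
For example, one possible shape for that refactoring (just a sketch, building on the error_in//2 sketch above; expect//2 and expect_tok//2 are invented helper names, and id//0 and expr//0 are assumed to exist as in the grammar fragment):

% expect//2: call a nonterminal; if it fails, report the error instead.
expect(NonTerminal, Context) -->
    (   call(NonTerminal)
    ->  []
    ;   error_in(Context, expecting:NonTerminal)
    ).

% expect_tok//2: the same, but for a single terminal.
expect_tok(Tok, Context) -->
    (   [Tok]
    ->  []
    ;   error_in(Context, expecting:Tok)
    ).

for_stmt -->
    [for],                 % no error if 'for' is missing; let the caller try alternatives
    expect(id, for_stmt),
    expect_tok(in, for_stmt),
    expect(expr, for_stmt),
    expect_tok(':', for_stmt).

Note that the -> commits to the first parse of each piece, which is a slightly stronger commitment than the clause-per-failure version above.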

Going to an extreme, the second answer in this post is to use a meta-interpreter.

Don’t ask me questions about this as I can not answer them but it was something I was thinking about and someone did note it.

@peter.ludemann thanks for that…I solved my problem in the end like this:

collect_terms([T| Ts]) --> sexp_term(T), !, collect_terms(Ts).
collect_terms([]) --> [T], {arg(1,T,Pos)}, syntax(Pos, expected_sexpression).
collect_terms([]) --> [].

s_exp(S) -->
    [op(Pos)],
    sexp_body(Pos, B),
    [cp(_)],
    !,
    {S =.. [sexp, Pos| B]}.

sexp_body(Pos,[T| Ts])     --> one_term(T), !, sexp_body(Pos, Ts).
sexp_body(OpenAt, []), [T] -->
    (   [T]
    ->  (   {functor(T,cp,_) -> true}
        ;   {arg(1,T,Pos)},
            syntax(Pos, expected_close_paren(OpenAt))
        )
    ;   syntax((-1,-1,-1),expected_close_paren(OpenAt))
    ).

and it’s just what I needed… I am understanding more of the nuances of Prolog every waking minute! I have some tests that seem to be backing up my thoughts:

:- begin_tests(syntax_errors).
test(missing_close_at_open_position,
     [throws(ast(syntax((-1,-1,-1), expected_close_paren( (0,1,0) ))))]) :-
    ast2("(defvar x 1", _).
test(wrong_close_bracket,
     [throws(ast(syntax((12,1,12), expected_close_paren( (0,1,0) ))))]) :-
    ast2("(defvar x 1 ]", _).
test(missing_open_paren_throws,
     [throws(ast(syntax((4,1,4), expected_sexpression)))]) :-
    ast2("    doh )",_).
test(missing_open_paren_after_one_term_throws,
     [throws(ast(syntax((6,1,6), expected_sexpression)))]) :-
    ast2("(one) :two",_).
:- end_tests(syntax_errors).
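
(Assuming the usual library(plunit) setup, these run with:)

?- run_tests(syntax_errors).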

I will go to bed now, and then see what I did in the morning! :smiley: I am gradually beginning to understand that your 20 years against my 20 months of Prolog means your inner wiring must be awesome! Mine is at the level of a nematode I think.

I’m not awesome – and there was a 20+ year gap when I didn’t use Prolog at all. :wink:

But I first learned about compilers when people were still figuring out how to do things like LALR(k) parsers; nowadays many people probably skip directly to yacc/bison without much knowledge of what’s going on inside (and they probably don’t need that knowledge either).

Nope, ANTLR is very popular. I used to use it many years ago and it is a very good parser generator if you don’t know DCGs, but once I learned DCGs I never looked at any other means of parsing because DCGs are so easy and powerful.

I used flex/bison/yacc et al. many times over the years, but once I actually figured out what DCGs did, well, I never want to create anything serious in anything but Prolog! It’s hard work at times, but then nothing worthwhile is easy. Apparently.

I am not too familiar with the subject, but it would seem that difficult things tend to feel more worthwhile simply because they are, or turned out to be, difficult. For an overview and details I would recommend the book “Pre-Suasion” by Robert Cialdini. I have the book at home, so I could also try to dig up the references on this particular subject if anyone is interested.

How to do errors when parsing Prolog:

Consider term_min, a predicate with a few clauses, one of which detects a ‘[’ and so tries to parse a list.

%%% Must test for [] before [ term ].
term_min([], _, 0) --> ['['], [']'].
term_min(TermList, _, 0) --> ['['], term_list(TermList).

A term list is the head/start of a list: a term followed by a term-list tail. We need this step because we must have at least one term in the list. If we jumped straight to term_list_tail then “[ ]” as input would be a list - we want “[]” to be the empty list.

term_list([Term|TermList]) --> term(Term, 999, _), term_list_tail(TermList).

The term list tail is either a closing ‘]’, a ‘,’ followed by a term list, or a ‘|’ followed by a term and then ‘]’. Anything else is an error.

term_list_tail([]) --> [']'].
term_list_tail(TermList) --> [','], term_list(TermList).
term_list_tail(Term) --> ['|'], term(Term, 999, _), [']'].
term_list_tail(_) --> [C], {throw(syntax_error(badly_formed_list(C)))}.

A handy predicate for parsing is match/2:

%%%
%%%   Ensure the lookahead token matches our expectation
%%%   
match(Token, Token) :- !.
match(Wanted, Got) :-
        throw(syntax_error(unexpected_token(wanted(Wanted), got(Got)))).

An example of its use in parsing Prolog:

%%% Must test for {} before { term }.
term_min({}, _, 0) --> ['{'], ['}'].
term_min('{}'(Term), _, 0) --> ['{'], term(Term, 1200, _), [C], {match(C, '}')}.

I already have code like this

%%        matches_(+Specs, +Acc, -Out) is semidet.
%
%         Scans the current token; if it matches the functor
%         named by the head of Specs, it is accumulated and we keep
%         matching until all specs are done.

matches_([], Acc, Out) -->
    {   reverse(Acc, Out),
        debug(ast_match, '! All terms matched! ~p', [Out]),
        !
    }.

matches_([Spec|Specs], Acc, Out) -->
    {debug(ast_match, 'matches_? ~k', [Spec])},
    (   i(Spec, Token)
    ->  (   {debug(ast_match, 'matched!! ~p with ~p', [Spec, Token])},
            !,
            matches_(Specs, [Token|Acc], Out)
        )
    ;   (   {debug(ast_match, 'in BAIL handler for matches: Spec: ~p',[Spec])},
            (   [Term]
            ->  {Term =.. [F,Pos|_]},
                syntax_error_at(parse, unexpected_token(Spec, F), Pos)
            ;   syntax_error_at(parse, premature_eob(Spec), -1)
            )
        )
    ).


%%        may(+Spec, -Tokens) is semidet.
%
%         Delegates to must//2 but if the unexpected_token exception
%         is caught we just fail instead. The premature EOB is not
%         caught.
%
%         @see must//2.

may([], _) -->
    {throw(ast_error(may_empty_spec))}.

may(In, Out, A, B) :-
    catch(
        phrase(must(In, Out), A, B),

but I will study your reply in depth as I think there is more to learn! Thanks! :smiley:

You can also use DCGs to generate your error message of course. I’ve got a hook system for print_message that uses DCGs to turn terms into messages:

generate_message(error(syntax_error(Message), syntax_error(Goal, none, none, Message, none))) -->
    ['ERROR ' - [], nl,
     'Error class : syntax error' - [], nl,
     'Goal in error : ~w' - [Goal], nl,
     'Message : ~q.' - [Message], nl].

generate_message(error(syntax_error(Message), syntax_error(Goal, Filename, Position, Message, Afters))) -->
    {stream_position_line_count(Position, LineCount),
     stream_position_line_position(Position, LinePosition)},
    ['ERROR ' - [], nl,
     'Error class : syntax error' - [], nl,
     'Goal in error : ~w' - [Goal], nl,
     'Message : ~q.' - [Message], nl,
     'File : ~q' - [Filename], nl,
     'Position (start of term) : line number ~q, column number ~q.' - [LineCount, LinePosition], nl,
     'Remaining tokens : ~p' - [Afters], nl].
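
If the line format above is the usual Format-Args / nl convention, such a DCG can also be driven by hand in SWI-Prolog with print_message_lines/3 (report_error/1 here is just an illustrative wrapper, not part of the code above):

% Render an error term through generate_message//1 and print the
% resulting message lines on user_error.
report_error(Error) :-
    phrase(generate_message(Error), Lines),
    print_message_lines(user_error, kind(error), Lines).

% e.g.
% ?- report_error(error(syntax_error(oops),
%                       syntax_error(my_goal, none, none, oops, none))).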

Already got those in another module, ready to roll, @bwat… I asked a question recently about whether it was possible to somehow generate JSON, XML or plain text from the same message.

I’ve done my groundwork… it’s the DCG hand-to-hand combat I am enjoying today! :smiley: Thanks though!

prolog:message(ast_error(syntax(map, uneven_elements, (L,C), File))) -->
    ['~s:~d:~d: map has an odd number of elements'-[File, L, C]].

prolog:message(ast_error(syntax(parse, illegal_toplevel_form, (L,C), File))) -->
    ['~s:~d:~d: illegal top level form, only () allowed here'-[File, L, C]].

prolog:message(ast_error(syntax(parse, empty_toplevel_form, (L,C), File))) -->
    ['~s:~d:~d: empty form at top level'-[File, L, C]].

...and so on

Good stuff. I’ve also implemented my own little Lisp/Scheme system, but that was based on a VM in C, not Prolog. It was 2326 lines of C. I implemented it pretty much according to this standard: http://clhs.lisp.se/Body/02_b.htm

Have you read Queinnec (https://christian.queinnec.org/WWW/LiSP.html)?

not yet! :smiley:

Basically Queinnec plots out all of the different courses you can take as an implementor of Lisp/Scheme. So many combinations it’s mind-blowing.

A bit like Patrice Boizumault’s The Implementation of Prolog. Basically Boizumault plots out all of the different courses you can take as an implementor of Prolog. So many combinations it’s mind-blowing.

When you get to the level of serious implementation, the two languages are quite similar yet so very different. It was a special breed who developed those two strands.

Just noticed both books were written by Frenchmen.