Phrase_from_file with initial state as push back list

I am biting the bullet, mostly as a learning exercise, by attempting to rewrite my tokeniser using DCG instead of standard forms. I must be mad as it took enough pain to get it this far and it’s all unit tested and works!

Anyway, I want to use phrase_from_file but having re-read all the posts about lazy_list_location not being fit for constant use, and having used push-backs a few times, I want to know how you push the initial state onto the stream…

The rules would be something like this, I guess, the state being a triple (Offset,Line,Column), initially (0,1,0)… the question is, how do I get it in there?!

rule(Args) -->
    skip_ws,
    do_something(Args).

do_something(Args), [P] -->
    [P0],
    ...code...
    { transformed(P0, P) }.

skip_ws, [P] --> [P0, Chr],
    { memberchk(Chr, ` \r\n\t`),
      update_pos_by_chr(Chr, P0, P) }.

Am I on the right lines? What I am trying to convey is my understanding that if a DCG rule is a leaf rule, i.e. calls no other rules, then it MUST pop off, modify and push back the state. If it calls other rules, they must do the same, and the responsibility is mine for ensuring the stack is kept in a consistent state. Actually, I suddenly feel like I am doing FORTH coding!

This is both depressingly hard work and terrifyingly interesting all at the same time. I will finish it one day but I have to know I did as good a job as I could.
With your help!
Thanks
Sean :smiley:

I figured it out with the help of this page: https://www.metalevel.at/prolog/dcg

Basically, Markus provided this code:

state(S), [S] --> [S].
state(S0, S), [S] --> [S0].
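For anyone reading along, here is a small sketch of what those two non-terminals actually do, shown with phrase/3 so the normally hidden lists are visible (the example atoms s0, s1, a, b are mine):

```prolog
% state//1 peeks at the first element and pushes it straight back;
% state//2 consumes the old first element and pushes back a new one.
state(S), [S] --> [S].
state(S0, S), [S] --> [S0].

% ?- phrase(state(S), [s0, a, b], Rest).
% S = s0, Rest = [s0, a, b].        % nothing consumed
%
% ?- phrase(state(s0, s1), [s0, a, b], Rest).
% Rest = [s1, a, b].                % s0 replaced by s1
```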

and I have started my code like this:

tok(F,T) :-
    phrase_from_file( tok_(T), F).

tok_(Out) -->
    state(_, (0,1,0)),
    tok_([], Out).

As always, his pages are packed full of detail that you don’t see until you are ready, and there is some heavy stuff on that page that I want to play around with first… loving it!

Quiz question:

What if you wanted to define list//1 and rest//1 non-terminals, as in phrase(NT, List, Rest), without breaking DCGs abstraction?

Hint: call//1 + lambdas…

You rotter! Just when I was feeling comfortable.

I’ve played with the lambda library and found it interesting but I don’t think I am ready for that either!! I am trying to keep my Prolog code as simple as I can, trust me, I feel like a guru even talking about Prolog.
:smiley:
Plus, the hint is too vague, as I don’t even understand what you mean about the DCG abstraction, other than the fact that it hides the “In and Out” lists and provides the chaining/threading through the code!!! I told you I was stupid.

Part of my initial confusion earlier was thinking that Prolog lists have to be homogeneous, like in Haskell, so I didn’t think I could push a “state thing” onto the head of a list of character codes. Anyway, I now have state passing with phrase_from_file, BUT it’s failing on the return of the accumulated list of tokens and I cannot understand why. Here is the complete code so far; it scans all the brackets in my test file and it all looked good, but the exit from tok_/3 fails and I have no idea why!


tok(F,All) :-
    phrase_from_file( tok_(T), F).

% ----------------------------------------------------------------------
tok_(Tokens) -->    %%% FAILS ON THIS LINE IN THE DEBUGGER ON EXIT
    state(_, (0,1,0)),
    tok_([], Tokens),
    {format("EXIT::~n~p",[Tokens])}.

tok_(Acc, Out) -->
    skip,
    (   bracket(T)
    ),
    !,
    {print(T),nl},
    tok_([T|Acc], Out).

tok_(Acc, Out) -->
    {reverse(Acc,Out),
     format("All done:~n~p",[Out])}.
% ----------------------------------------------------------------------
bracket(op(Pos)) --> next(0'(), state(Pos), !.
bracket(cp(Pos)) --> next(0')), state(Pos), !.
bracket(ol(Pos)) --> next(0'[), state(Pos), !.
bracket(cl(Pos)) --> next(0']), state(Pos), !.
bracket(om(Pos)) --> next(0'{), state(Pos), !.
bracket(cm(Pos)) --> next(0'}), state(Pos).
% ----------------------------------------------------------------------
skip --> ws, !, skip.
skip --> [].

ws --> next(C), {memberchk(C, ` \r\n\t`), debug(t2, 'skipped ~p',[C])}.

next(Chr), [P] -->
    [P0, Chr],
    {movepos(Chr, P0, P),
     debug(t2,'next: ~p, ~p => ~p',[Chr,P0,P])}.

state(S), [S] --> [S].      % reads the current state
state(S0, S), [S] --> [S0]. % pushes back a new state

and here is the output

?- gtrace,tok('fixtures/simple.lisp',X).
% next: 41, 0,1,0 => 1,1,1                                                                                                                                                                                                                                                                                                   
% next: 41, 0,1,0 => 1,1,1                                                                                                                                                                                                                                                                                                   
cp((1,1,1))
% next: 91, 1,1,1 => 2,1,2                                                                                                                                                                                                                                                                                                   
% next: 91, 1,1,1 => 2,1,2                                                                                                                                                                                                                                                                                                   
ol((2,1,2))
% next: 93, 2,1,2 => 3,1,3                                                                                                                                                                                                                                                                                                   
% next: 93, 2,1,2 => 3,1,3                                                                                                                                                                                                                                                                                                   
cl((3,1,3))
% next: 123, 3,1,3 => 4,1,4                                                                                                                                                                                                                                                                                                  
% next: 123, 3,1,3 => 4,1,4                                                                                                                                                                                                                                                                                                  
om((4,1,4))
% next: 125, 4,1,4 => 5,1,5                                                                                                                                                                                                                                                                                                  
% next: 125, 4,1,4 => 5,1,5                                                                                                                                                                                                                                                                                                  
cm((5,1,5))
% next: 10, 5,1,5 => 6,2,0                                                                                                                                                                                                                                                                                                   
% skipped 10                                                                                                                                                                                                                                                                                                                 
All done:
[cp((1,1,1)),ol((2,1,2)),cl((3,1,3)),om((4,1,4)),cm((5,1,5))]EXIT::
[cp((1,1,1)),ol((2,1,2)),cl((3,1,3)),om((4,1,4)),cm((5,1,5))]
false.

I am still trying to see what I did wrong… :expressionless:

Shouldn’t the first clause be:

tok(F,All) :-
    phrase_from_file( tok_(All), F).

I fixed that after I posted; I just ignored the singleton warnings at the time. It is still failing to return the token list. I am looking at the example on the docs page, the “match” example… I don’t understand why it uses aggregate_all either… have I not understood how phrase_from_file works?

My tok_ predicate is recursive until it runs out of data…is it that I SHOULD be using findall and just sucking one token at a time out of the file? I will try that and see.

Sometimes I just find it hard to really understand what is being said on the doc pages.

I am going to try that approach, BUT it’s still not returning the list… so why should a single call be any different in terms of it failing? I wish I had the common language to make myself better understood at times like these.

This is not the bug you’re looking for… :smile: I think…

tok_(Tokens) -->    %%% FAILS ON THIS LINE IN THE DEBUGGER ON EXIT
    state(_, (0,1,0)),
    tok_([], Tokens),
    {format("EXIT::~n~p",[Tokens])}.

In the above grammar rule, you’re throwing away the first element. Try:

bug(All) :-
	phrase(do_copy(All), [1,2,3,4,5]).

do_copy(All) -->
	state(_, 0),
	copy(All).

copy([H| T]) --> [H], copy(T).
copy([]) --> [].
?- bug(All).
All = [0, 2, 3, 4, 5] .

Wow! That was subtle too…I fixed it with this additional predicate

init_state(S), [S,S0] --> [S0].

and modified the tok_ like so:

tok_(Tokens) -->
    init_state((0,1,0)),
    :

That at least gets me the full token list I expected. Thanks for that indeed…the dangers of near-blindly using other peoples code you may not fully understand yet!
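For later readers, the difference is easier to see with phrase/3 and a toy list (the example data is mine): state//2 consumes an element, while init_state//1 keeps everything and merely prepends.

```prolog
state(S0, S), [S] --> [S0].      % consumes one element, pushes the new state
init_state(S), [S, S0] --> [S0]. % pushes the state, puts the element back too

% ?- phrase(state(_, s), [1, 2, 3], Rest).
% Rest = [s, 2, 3].        % the 1 has been thrown away
%
% ?- phrase(init_state(s), [1, 2, 3], Rest).
% Rest = [s, 1, 2, 3].     % nothing lost
```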

It’s still returning false though so one dead bug accounted for and one subtle mistake still to find.

See e.g. DCGs provide a threading state abstraction: don't break it - Logtalk

Sometimes, usually for debugging, we want to peek at the difference list that’s implicit in a grammar rule. The state//1 non-terminal allows you to peek at a single element. You could, of course, generalize it to look at two, or three, or more, i.e. at any fixed number of elements. Using call//1 and a lambda expression, you can instead peek at all elements:

list(List) --> call({List}/[List,List]>>true).

For example:

bug(All) :-
	phrase(do_copy(All), [1,2,3,4,5]).

do_copy(All) -->
	state(_, 0),
	list(List), {writeq(List), nl},
	copy(All).

copy([H| T]) -->
    [H], list(List), {writeq(List), nl}, copy(T).
copy([]) -->
    [].

list(List) --> call({List}/[List,List]>>true).

state(S), [S] --> [S].      % reads the current state                        
state(S0, S), [S] --> [S0]. % pushes back a new state

Sample call:

?- bug(All).
[0,2,3,4,5]
[2,3,4,5]
[3,4,5]
[4,5]
[5]
[]
All = [0, 2, 3, 4, 5] ;
false.

Note that you cannot write:

list(List), List --> List.

If you try it, you will get:

?- bug(All).
ERROR: Arguments are not sufficiently instantiated

Why? The right side, List, will not be interpreted as a list but as a non-instantiated call to a non-terminal. You may be tempted to write instead:

list([H|T]), [H|T] --> [H|T].

But that fails after a long output, providing no solution:

?- bug(All).
[0]
[2]
[3]
[4]
[5]
[4,5]
[5]
[3,4]
[4]
[5]
...
false.

If you try to fix it by using instead:

list([]) --> [].
list([H|T]), [H|T] --> [H|T].

or:

list([H|T]), [H|T] --> [H|T].
list([]) --> [].

This will give you the expected solution. But instead of getting that solution once, you will get it repeated an awful number of times:

?- bug(All).
[]
[]
[]
[]
[]
[]
All = [0, 2, 3, 4, 5] ;
[5]
[]
All = [0, 2, 3, 4, 5] ;
[4]
[]
[]
All = [0, 2, 3, 4, 5] ;
[5]
[]
All = [0, 2, 3, 4, 5] ;
[4,5]
[]
[]
All = [0, 2, 3, 4, 5] ;
...

All this just to give you, hopefully, a debugging trick that may help in your use of DCGs.

The lockdown made me do it! :innocent::smile:

P.S. Of course, you can define an auxiliary predicate that allows you to do the same without using call/1 and a lambda expression… left as an exercise. Hint: Beetlejuice - Wikipedia
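(Spoiler for the exercise, hedged as my own guess at the intended answer: a DCG body is just a predicate with two extra list arguments, so defining the helper directly as a plain predicate and writing the name three times, hence presumably the Beetlejuice hint, makes the peeked value, the input list, and the output list one and the same.)

```prolog
% Saying it three times: the argument, the input list, and the
% output list are all the same variable, so nothing is consumed.
list(List, List, List).

% ?- phrase((list(L), [a, b]), [a, b]).
% L = [a, b].
```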

Why use an accumulator and a call to reverse/2, as in your code? You can write instead:

tok_(Tokens) -->
    init_state((0,1,0)),
    tok_collect(Tokens),
    {format("EXIT::~n~p",[Tokens])}.

tok_collect([T| Ts]) -->
    skip,
    (   bracket(T)
    ),
    !,
    {print(T),nl},
    tok_collect(Ts).
tok_collect([]) -->
     {format("All done:~n~p",[Out])}.

I used that code as it is neater, plus I never think of stuff like that… I remember the day the veil lifted as I single-stepped append/3 for the millionth time. I am just about ready to quit on this though, as I have wasted a whole day and only managed to increase my frustration levels.

For entertainment value (we all need a laugh), here is the complete file. I have at least discovered why it fails… it was two things: one I fixed and one I have yet to solve, but I rather suspect (a) it’s my fault and (b) see (a).

On completion the DCG stack still had the state, so I popped that, but it still failed because the stack had a remaining LF(10) character… so I think that the real issue is that my DCG rules are not fully consuming the source.
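One way to account for that trailing LF, assuming trailing whitespace is the only legitimate leftover, would be to run skip once more before popping the state; this is only a sketch against the rules in the file below:

```prolog
tokens_(Tokens) -->
    init_state((0,1,0)),
    tok_collect(Tokens),
    skip,            % consume any trailing whitespace, e.g. a final LF
    pop_state(_),
    {format("EXIT:~p",[Tokens])}.
```

Since tok_collect//1 stops at the first non-bracket it cannot skip past, anything that is pure whitespace after the last token would then be eaten before pop_state//1 empties the stack.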

:- module(t2, [tok/2]).
:- use_module(tokeniser).

tok(File, Tokens) :-
    (   phrase_from_file( tokens_(Tokens), File)
    ->  print(Tokens)
    ;   print("Failed, but why?  :(")
    ).

tokens_(Tokens) -->
    init_state((0,1,0)),
    tok_collect(Tokens),
    pop_state(_),
    {format("EXIT:~p",[Tokens])}.

tok_collect([T| Ts]) -->
    skip,
    (   bracket(T)
    ),
    !,
    {print(T),nl},
    tok_collect(Ts).

tok_collect(_) --> [].

token_stream_(Acc, Out) -->
    skip,
    stream_token(T),
    {debug(t2,'token_stream_ found: ~p',[T])},
    !,
    token_stream_([T|Acc], Out).

token_stream(Acc,Out) -->
    {reverse(Acc,Out)},
    {debug(t2,'token_stream_ ended: ~p',[Out])}.

stream_token(T) --> bracket(T), !.

bracket(op(Pos)) --> next(0'(, Pos).
bracket(cp(Pos)) --> next(0'), Pos).
bracket(ol(Pos)) --> next(0'[, Pos).
bracket(cl(Pos)) --> next(0'], Pos).
bracket(om(Pos)) --> next(0'{, Pos).
bracket(cm(Pos)) --> next(0'}, Pos).

skip --> ws(_), skip.
skip --> [].

ws(Pos) --> next(0'\s,Pos).
ws(Pos) --> next(0'\n,Pos).
ws(Pos) --> next(0'\t,Pos).
ws(Pos) --> next(0'\r,Pos).

next(Chr,P0), [P] -->
    [P0, Chr],
    {movepos(Chr, P0, P),
    debug(t2,'next: ~p, ~p => ~p',[Chr,P0,P])
    }.

init_state(S), [S, S0] --> [S0].
pop_state(S) --> [S].
state(S), [S] --> [S].
state(S0, S), [S] --> [S0].

Is there a place that tells me what all the icons in the debugger mean? This one has me confused, the little smiley happy-face man:
(screenshot of the debugger icon)

I end up with an unbound variable at the end of the list… is this a difference list then? I think that’s why I had the accumulator version, thinking about it.

I wrote:

tok_collect([]) -->
     {format("All done:~n~p",[Out])}.

But you changed that grammar rule to:

tok_collect(_) --> [].

It should be:

tok_collect([]) --> [].
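The difference is easy to see in isolation: the anonymous-variable base case never unifies the tail with [], so the result comes back as a partial list with an unbound tail (the toy names here are mine):

```prolog
collect_bad([T|Ts]) --> [T], collect_bad(Ts).
collect_bad(_) --> [].           % never closes the list

collect_ok([T|Ts]) --> [T], collect_ok(Ts).
collect_ok([]) --> [].           % unifies the tail with []

% ?- phrase(collect_bad(L), [a, b]).
% first solution leaves the tail unbound: L = [a, b|_Tail]
%
% ?- phrase(collect_ok(L), [a, b]).
% L = [a, b].
```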

Get some rest!

LOL. I must admit, I am a bit tired but I am watching this so not just yet:

I think string theorists will find that strings are difference lists…

Fixed it all!
NOW it’s time for bed! Will think about it… thanks @pmoura

Just to say thanks @pmoura for your help in resolving my brain wobble with it all.
As of very late last night I finally got all the V1 unit tests to pass against the new V2 tokeniser, using a DCG instead of the hand-cranked FSM I had before.

Not only is the code easier to follow and reason about (it’s practically BNF at times, especially as I have started using “|” instead of “;”, as I saw Markus Triska mention), but it is also approximately two thirds the size and still does the same thing.
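To illustrate the BNF feel (this is my own sketch, not the tokeniser’s actual rules):

```prolog
% With "|" a grammar rule reads almost like a BNF production:
%   digits ::= digit digits | digit
digits([D|Ds]) --> digit(D), ( digits(Ds) | { Ds = [] } ).
digit(D)       --> [D], { code_type(D, digit) }.

% ?- phrase(digits(Ds), `42`).
% Ds unifies with the two character codes of "42".
```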

It has been a (mostly) pleasant exercise and one I won’t forget!

Onwards… today I resume extending the AST and code generation phase. It’s rendering C functions and variable assignments… not that useful yet, but within an hour or two it should be doing unary, binary and n-ary operators.

Thanks again everybody who has dared come near my questions.
:smiley: