A tokeniser I've written. Any suggestions on how to improve it?

Jan,

FYI, I was the one parsing really large data files, ~3G. My DCG parser is working as expected, thanks for your input. :slight_smile:

boris> It is also not immediately obvious
boris> how one can nicely “chain” the steps.
boris> In practice, it is often easiest
boris> (and on modern hardware perfectly OK)
boris> to read everything to a list,
boris> then tokenize it making another list,
boris> then parse it, etc.

I agree .
Pursuing a parse solution whilst encumbered by the need to accommodate an input model that gives You only a “this” and a next called “peek”
is extremely tedious and limiting .
And such a model puts incredible limitations on potential capabilities of Your grammer .
The historical prevalence of that approach is so significant
that it has rendered all computer languages necessarily similar
in very fundamental ways .

You can over come some of those limitation by creating a “peek” buffer .
Now perhaps You can ask (peek(Y,1,scanner))
instead of (peek(Y,scanner))
and also as (peek(Y,2,scanner)) .
But now You are creating a peek buffer .
Thus why not buffer first a whole clause
and be done with the interferance of those limitations
and their corruption of the possibilities available to You
as possible manifestations of the grammar ?

And as You say the motivation for dealing with all of that
is dubious at this point in time if it is motivation towards
that which is efficient or optimal because conservative
in regards to memory requirements .

But then again perhaps the task involves
complete consideration of a 300 gb file .
What is the more reasonable approach ?
Oh yes we have a modern solution for that :
We shall deploy a Hadoop cluster to the cloud @ tape.machine.vintage.marketplace.turing.com ?
Or perhaps with a sober-second look recognize
that the desire for infinite instantaneous access
to anything and everything at any time –
though mediated by impressive experts like Google –
is still fundamentally limited by that model
with requirement for an authoritive provision
of universally valid succession enabling the certaintea of sequence ?

Surely there must be some reason why so many of THESE
outdated and useless relics have “separator” in their name ?

:- between(0,31,code) , format(’~N

~10r ~s’,[code,[code]]) , false ; true .

And perhaps the bright sun shining on the cloud is a distraction
from the consideration of how such a task has been proven to be possible
with COBOL code ?

The easier path IMO is just assuming that all input
necessary for consideration of a clause
is available as a memory buffer :
conveniently provisioned by Prolog to be considered
via the very capable and familiar abstraction
that is the syntax of the Prolog list .

But if the favour or the interest is for the ‘this’ and the ‘peek’ ,
then it is very helpful to realize that
You are inevitably either going to be using
a finite state machine like regular expressions
or creating a recursive descent parser .

A recursive descent parser is very wonderfully fun IMO ,
it is my favourite and oft repeated toy project.
The examples in the Wikipedia article for “Recursive Descent Parser”
are extremely excellent and it is a no-brainer IMO to start
with a cut-n-paste of that sample code .

I actually tried to do that 3 weeks ago
but got completely side-tracked by deciding
to see if I could use op/3 to configure Prolog
such that it was capable of parsing the Pascal example verbatim .

Very frustrating !

But I did learn some thing from that failed attempt .
In considering that task of how to provision that text that makes sense to Pascal ,
as an implementation of that which makes sense to Prolog :
I noticed there was perhaps available an easy trick .
Whilst the Pascal code is arranged such that
function arguments proceed with earliest as left-most towards rightost as most recent ,
it seemed more fitting to Prolog that the arguments proceed from right-to-left
in consideration of progression .

That puts it more in line with the way
that for Prolog the function signature is the consequent ,
whilst the function body is the antecedent .

Compare typical impertiv programming ,
for which the function signature is the antecedent ,
and a for loop in the funnction body provides the consequence .

antecedent => { consequent } .

consequent :- ( antecedent ) .

~~~~~~~~ %&ZHcx;. ~~~~~~~~~

f,[black] --> [white],f .

f(_p_,_q_) :- f(_q_,_p_)  .  %% in case the user screws up %%

Since you replied to me, just wanted to say welcome back, Kintaklen!

1 Like

That is a very good point .

But one thing to be aware of :

As soon as You use == ;
You should be very honest with Your Self
and say “I am no longer being a logic programmer I am being an imperativ programmer” .
You can give Your Self a spanking about that like I do
but as a Prolog programmer anticipate a very sore a5s
because Prolog sits exactly at that intersection :slight_smile:
(intersection here meaning the intersection of geometry not the intersection of set theory)
the intersection perhaps pioneerable transition of imperativ programming and logic programming .

The meaning of == is dependent upon the context of the runtime .

From the point of view of the logic programming ; “the context of the runtime” is some thing like an evil interloper Younger brother guilty of numerous violations . And target of some disdain for its crude behaviour and its insistence upon technique that (from that point of view) cannot be demonstrated to be reason-able . That “guilty” Younger brother now also in receipt of some negativity via that disdain can save a wound to His self-esteem with the retort “Ya but You NEED me , Older brother . You might Hate me , but You must admit : nothing You propose to be true can be shown to be truly true in reality without me” . And Younger brother is very well equipped with argument there . Because the requirement for demonstration a.k.a proof of the hypothesis requiring conclusion of truth to be a demonstration tht is achievable in the realm of the substantial is the first principle of Science .

Is it absurd to propose that “computer” be twinned with “science” ?

There is a concept from science : the hidden variable that is applicable here .
That is applicable here because informativ to consider that the context of the runtime is full of hidden variables . Older brother Logic can never reason about hidden variables . He thinks they are a sickly abhorrence that should not exist .

I never use == , thus I would because now aware of that concern You raise
try to develop Your solution into that which addresses the same valid concern
but does so without the use of == .


== \= var nonvar ! ground-without-when
^^^ bad guys .

:sunny: jan the man :

read_quote(Char, Quote, Stream, string(String), LeftOver) :-
    read_quote(Char, Quote, Stream, Chars),
    get_char(LeftOver),
    string_chars(String, Chars).

read_quote(Quote, Quote, Stream, []) :-
    !.
read_quote(Char, Quote, Stream, [Char|More]) :-
    get_char(Stream, Next),
    read_quote(Next, Quote, Stream, More).

:blush: kintalken the fool :

read_quote(\+!,Char, Quote, Stream, string(String), LeftOver) :-
    read_quote(Char, Quote, Stream, Chars),
    get_char(LeftOver),
    string_chars(String, Chars).

read_quote(!,Quote, Quote, Stream, []) :-
    /*<s>!</s>*/.
read_quote(\+!,Char, Quote, Stream, [Char|More]) :-
    get_char(Stream, Next),
    read_quote(Next, Quote, Stream, More).

~~~~~~~~ %&ZHcx;. ~~~~~~~~

/* ... snip ... */
read_quote(Quote, Quote, Stream, []) :-
    !.
/* ... snip ... */
/* ... snip ... */
read_quote(!,Quote, Quote, Stream, []) :-
    /*<s>!</s>*/.
/* ... snip ... */

My apologies , mistake in that code .

Should be :

/* ... snap! ... */

/*<s>
read_quote(!,Quote, Quote, Stream, []) :-
    .
</s>*/
read_quote(!,Quote,Quote,Stream,[]) ->
    ;
/* ... snap! ... */
%% <https://pasteboard.co/I7vw3lT.png> %%

~~~~~~~~~ %&ZHcx;. ~~~~~~~~~

Many thanks for those pointers Jan.

I’ve realised a DCG is the most elegant approach after getting to grips with the various ways of iterating in Prolog which I described here Six ways to iterate in Prolog

I’ve also realised A parsing example I took from The Art of Prolog is overcomplicated and could be boiled down to one DCG (something I’ll tackle next).

Here’s the simplified tokeniser I’ve settled on.

:- module(tokeniser, [string_tokens/2]).

%% tokenise(+String, -Tokens) is det

string_tokens(String, Tokens) :-
  string_chars(String, Chars),
  phrase(tokens(Tokens), Chars).

% Definite Clause Grammar (DCG)

tokens([Token|Tokens]) --> ws, token(Token), ws, !, tokens(Tokens).
tokens([])             --> [].

ws --> [W], { char_type(W, space) }, ws.
ws --> [].

token(P) --> [C], { char_type(C, punct), \+char_type(C, quote), string_chars(P, [C]) }.
token(Q) --> quote(Cs), { string_chars(Q, Cs) }.
token(W) --> word(Cs), { string_chars(W, Cs) }.

quote([Quote|Ls])  --> [Quote], { char_type(Quote, quote) }, quote_rest(Quote, Ls).
quote_rest(Quote, [L|Ls]) --> [L], { L \= Quote }, quote_rest(Quote, Ls).
quote_rest(Quote, [Quote]) --> [Quote]. 

word([L|Ls])      --> [L], { char_type(L, alnum) }, word_rest(Ls).
word_rest([L|Ls]) --> [L], { char_type(L, alnum) }, word_rest(Ls).
word_rest([])     --> [].

Much better! Note that you can do quoted strings without some form of escape like this:

quoted(String) -->
    [Quote], { char_type(Quote, quote) }, string(Codes), [Quote],
    !,
    { string_codes(String), Codes) }.

This exploits the non-greedy string//1 from library(dcg/basics). It is not faster than your code. I expect it to be in the same range, but you should run some tests to verify that. It is really compact and easy to read though :slight_smile:

1 Like

A slight improvement or two you might want to consider. quote is basically a specific version of between so you might want to generalize it to do more than use just ". With between then modify it so that it can do nested bookends, e.g. if the bookends are { and } then if you want to grab a matching bookend in

{
   Line 1
   {
       Line 2 
   }
}

between will need to increment a value when seeing more { and decrement when seeing } and when the value is zero exit the predicate.