The idea is to prepare input text for a DCG. I found that tokenising with DCGs directly from a stream of characters overcomplicated the “thinking in terms of formal grammars” part of the exercise. Since I don’t need escaped quotes for my project anyway, and stringing punctuation characters together caused all kinds of unforeseen problems (such as seeing “],” as one token instead of two), I decided to simplify the tokeniser I intend to use, for now, down to this:
:- module(tokeniser,
          [ string_tokens/2,        % +String, -Tokens
            read_file_to_tokens/2   % +FileName, -Tokens
          ]).
string_tokens(String, Tokens) :-
    open_string(String, Stream),
    read_tokens(Stream, Tokens).
read_file_to_tokens(SrcDest, Tokens) :-
    setup_call_cleanup(open(SrcDest, read, Stream, []),
                       read_tokens(Stream, Tokens),
                       close(Stream)).
% read_tokens(+Stream, -Tokens)
read_tokens(Stream, Tokens) :-
    get_char(Stream, Char),
    read_tokens(Char, Stream, [], Tokens).
% read_tokens(+Char, +Stream, +Acc, -Tokens)
% At end of file, reverse the accumulator to give the tokens in order.
read_tokens(end_of_file, _Stream, ReversedList, Tokens) :-
    !,
    reverse(ReversedList, Tokens).
% Whitespace is skipped.
read_tokens(InChar, Stream, Tokens, Tail) :-
    char_type(InChar, space), !,
    get_char(Stream, Char),
    read_tokens(Char, Stream, Tokens, Tail).
% An alphanumeric character starts a word.
read_tokens(InChar, Stream, Tokens, Tail) :-
    char_type(InChar, alnum), !,
    read_word(InChar, Stream, [], Word, LeftOver),
    read_tokens(LeftOver, Stream, [Word|Tokens], Tail).
% A quote character starts a quoted string, kept as a single token.
read_tokens(QuoteChar, Stream, Tokens, Tail) :-
    char_type(QuoteChar, quote), !,
    get_char(Stream, InChar),
    read_quote(InChar, Stream, QuoteChar, [QuoteChar], Quote, LeftOver),
    read_tokens(LeftOver, Stream, [Quote|Tokens], Tail).
% Any other punctuation character becomes a one-character token.
read_tokens(InChar, Stream, Tokens, Tail) :-
    char_type(InChar, punct), !,
    get_char(Stream, OutChar),
    string_chars(Token, [InChar]), % convert the char to a string, so punctuation tokens match the word and quote tokens
    read_tokens(OutChar, Stream, [Token|Tokens], Tail).
% read_word(+Char, +Stream, +Acc, -Word, -LeftOver)
% A non-alphanumeric character ends the word; it is handed back as
% LeftOver for the caller to dispatch on.
read_word(LeftOver, _Stream, ReversedList, Word, LeftOver) :-
    \+ char_type(LeftOver, alnum), !,
    reverse(ReversedList, Chars),
    string_chars(Word, Chars).
read_word(InChar, Stream, Chars, Tail, LeftOver) :-
    % char_type(InChar, alnum) is implied by the cut in the clause above
    get_char(Stream, OutChar),
    read_word(OutChar, Stream, [InChar|Chars], Tail, LeftOver).
% read_quote(+Char, +Stream, +QuoteChar, +Acc, -Quote, -LeftOver)
% Accumulate characters up to the closing quote. Both arguments are
% always ground here, so \== does what dif/2 plus a cut did; the
% end_of_file guard makes an unterminated quote fail instead of
% looping forever on repeated end_of_file reads.
read_quote(InChar, Stream, QuoteChar, Chars, Tail, LeftOver) :-
    InChar \== QuoteChar,
    InChar \== end_of_file, !,
    get_char(Stream, OutChar),
    read_quote(OutChar, Stream, QuoteChar, [InChar|Chars], Tail, LeftOver).
% The closing quote ends the token; the quote characters are kept.
read_quote(QuoteChar, Stream, QuoteChar, ReversedList, Quote, LeftOver) :-
    get_char(Stream, LeftOver),
    reverse([QuoteChar|ReversedList], Chars),
    string_chars(Quote, Chars).
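Loaded at the top level, the module should behave like this (the second query shows the “],” case from above now coming out as two tokens):

?- string_tokens("count := 1;", Tokens).
Tokens = ["count", ":", "=", "1", ";"].

?- string_tokens("[1], [2]", Tokens).
Tokens = ["[", "1", "]", ",", "[", "2", "]"].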
My thinking (at the moment) is that, to handle negative numbers, floats, and exponents delimited with ^, e, ** or who knows what, it’s best to just leave those as separate symbols in the token list, storing the digits as strings, and let the parser, probably written as a DCG, reassemble them; there is a sketch of that after the test file. Here is the test file I’m using:
:- begin_tests(tokeniser).
:- use_module(['/home/roblaing/libs/swipl/tokeniser.pl']).
% Example taken from http://www.learnprolognow.org/lpnpage.php?pagetype=html&pageid=lpn-htmlse56
test(lpn_eg) :-
    string_tokens("The cow under the table shoots.", Tokens),
    Tokens = ["The", "cow", "under", "the", "table", "shoots", "."].
% Example taken from Chapter 24 "A Compiler" from The Art of Prolog
test(pascal_eg) :-
    Code = "program factorial;
            begin
                read value;
                count := 1;
                result := 1;
                while count < value do
                    begin
                        count := count + 1;
                        result := result * count
                    end;
                write result
            end",
    string_tokens(Code, Tokens),
    Tokens = ["program", "factorial", ";", "begin", "read", "value", ";",
              "count", ":", "=", "1", ";", "result", ":", "=", "1", ";",
              "while", "count", "<", "value", "do", "begin", "count",
              ":", "=", "count", "+", "1", ";", "result", ":", "=",
              "result", "*", "count", "end", ";", "write", "result",
              "end"].
test(json) :-
Code = '{
"control": "red",
"step": 1,
"piece_count": [
["black", 12],
["red", 12]
],
"cell": [
["a", 1, "b"],
["a", 2, "wp"],
["a", 3, "b"],
["a", 4, "b"],
["a", 5, "b"],
["a", 6, "bp"],
["a", 7, "b"],
["a", 8, "bp"],
["b", 1, "wp"],
["b", 2, "b"],
["b", 3, "wp"],
["b", 4, "b"],
["b", 5, "b"],
["b", 6, "b"],
["b", 7, "bp"],
["b", 8, "b"],
["c", 1, "b"],
["c", 2, "wp"],
["c", 3, "b"],
["c", 4, "b"],
["c", 5, "b"],
["c", 6, "bp"],
["c", 7, "b"],
["c", 8, "bp"],
["d", 1, "wp"],
["d", 2, "b"],
["d", 3, "wp"],
["d", 4, "b"],
["d", 5, "b"],
["d", 6, "b"],
["d", 7, "bp"],
["d", 8, "b"],
["e", 1, "b"],
["e", 2, "wp"],
["e", 3, "b"],
["e", 4, "b"],
["e", 5, "b"],
["e", 6, "bp"],
["e", 7, "b"],
["e", 8, "bp"],
["f", 1, "wp"],
["f", 2, "b"],
["f", 3, "wp"],
["f", 4, "b"],
["f", 5, "b"],
["f", 6, "b"],
["f", 7, "bp"],
["f", 8, "b"],
["g", 1, "b"],
["g", 2, "wp"],
["g", 3, "b"],
["g", 4, "b"],
["g", 5, "b"],
["g", 6, "bp"],
["g", 7, "b"],
["g", 8, "bp"],
["h", 1, "wp"],
["h", 2, "b"],
["h", 3, "wp"],
["h", 4, "b"],
["h", 5, "b"],
["h", 6, "b"],
["h", 7, "bp"],
["h", 8, "b"]
]
}',
string_tokens(Code, Tokens),
Tokens = ["{","\"control\"",":","\"red\"",",",
"\"step\"",":","1",",",
"\"piece_count\"",":","[","[","\"black\"",",","12","]",",",
"[","\"red\"",",","12","]","]",",",
"\"cell\"",":","[","[","\"a\"",",","1",",","\"b\"","]",",",
"[","\"a\"",",","2",",","\"wp\"","]",",",
"[","\"a\"",",","3",",","\"b\"","]",",",
"[","\"a\"",",","4",",","\"b\"","]",",",
"[","\"a\"",",","5",",","\"b\"","]",",",
"[","\"a\"",",","6",",","\"bp\"","]",",",
"[","\"a\"",",","7",",","\"b\"","]",",",
"[","\"a\"",",","8",",","\"bp\"","]",",",
"[","\"b\"",",","1",",","\"wp\"","]",",",
"[","\"b\"",",","2",",","\"b\"","]",",",
"[","\"b\"",",","3",",","\"wp\"","]",",",
"[","\"b\"",",","4",",","\"b\"","]",",",
"[","\"b\"",",","5",",","\"b\"","]",",",
"[","\"b\"",",","6",",","\"b\"","]",",",
"[","\"b\"",",","7",",","\"bp\"","]",",",
"[","\"b\"",",","8",",","\"b\"","]",",",
"[","\"c\"",",","1",",","\"b\"","]",",",
"[","\"c\"",",","2",",","\"wp\"","]",",",
"[","\"c\"",",","3",",","\"b\"","]",",",
"[","\"c\"",",","4",",","\"b\"","]",",",
"[","\"c\"",",","5",",","\"b\"","]",",",
"[","\"c\"",",","6",",","\"bp\"","]",",",
"[","\"c\"",",","7",",","\"b\"","]",",",
"[","\"c\"",",","8",",","\"bp\"","]",",",
"[","\"d\"",",","1",",","\"wp\"","]",",",
"[","\"d\"",",","2",",","\"b\"","]",",",
"[","\"d\"",",","3",",","\"wp\"","]",",",
"[","\"d\"",",","4",",","\"b\"","]",",",
"[","\"d\"",",","5",",","\"b\"","]",",",
"[","\"d\"",",","6",",","\"b\"","]",",",
"[","\"d\"",",","7",",","\"bp\"","]",",",
"[","\"d\"",",","8",",","\"b\"","]",",",
"[","\"e\"",",","1",",","\"b\"","]",",",
"[","\"e\"",",","2",",","\"wp\"","]",",",
"[","\"e\"",",","3",",","\"b\"","]",",",
"[","\"e\"",",","4",",","\"b\"","]",",",
"[","\"e\"",",","5",",","\"b\"","]",",",
"[","\"e\"",",","6",",","\"bp\"","]",",",
"[","\"e\"",",","7",",","\"b\"","]",",",
"[","\"e\"",",","8",",","\"bp\"","]",",",
"[","\"f\"",",","1",",","\"wp\"","]",",",
"[","\"f\"",",","2",",","\"b\"","]",",",
"[","\"f\"",",","3",",","\"wp\"","]",",",
"[","\"f\"",",","4",",","\"b\"","]",",",
"[","\"f\"",",","5",",","\"b\"","]",",",
"[","\"f\"",",","6",",","\"b\"","]",",",
"[","\"f\"",",","7",",","\"bp\"","]",",",
"[","\"f\"",",","8",",","\"b\"","]",",",
"[","\"g\"",",","1",",","\"b\"","]",",",
"[","\"g\"",",","2",",","\"wp\"","]",",",
"[","\"g\"",",","3",",","\"b\"","]",",",
"[","\"g\"",",","4",",","\"b\"","]",",",
"[","\"g\"",",","5",",","\"b\"","]",",",
"[","\"g\"",",","6",",","\"bp\"","]",",",
"[","\"g\"",",","7",",","\"b\"","]",",",
"[","\"g\"",",","8",",","\"bp\"","]",",",
"[","\"h\"",",","1",",","\"wp\"","]",",",
"[","\"h\"",",","2",",","\"b\"","]",",",
"[","\"h\"",",","3",",","\"wp\"","]",",",
"[","\"h\"",",","4",",","\"b\"","]",",",
"[","\"h\"",",","5",",","\"b\"","]",",",
"[","\"h\"",",","6",",","\"b\"","]",",",
"[","\"h\"",",","7",",","\"bp\"","]",",",
"[","\"h\"",",","8",",","\"b\"","]","]","}"].
test(quoted) :-
    string_tokens('Preserve "Hello Text!" as a string', Tokens),
    Tokens = ["Preserve", "\"Hello Text!\"", "as", "a", "string"].
/* Disabled until escaped quotes are supported:
test(escaped_quote) :-
    string_tokens('Preserve "Hello\\" Text!" as a string', Tokens),
    Tokens = ["Preserve", "\"Hello\" Text!\"", "as", "\"a\"", "string"].
*/
:- end_tests(tokeniser).
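After loading the test file, the suite runs with ?- run_tests(tokeniser).

As for the numbers, here is a minimal sketch of the parser-side idea. The names signed_number//1 and unsigned_number//1 are mine, not part of anything above, and it only covers signed integers and simple decimals:

% Hypothetical DCG over the token list, reassembling numbers the
% tokeniser leaves split, e.g. ["-", "3", ".", "14"] for -3.14.
signed_number(N) --> ["-"], unsigned_number(U), { N is -U }.
signed_number(N) --> unsigned_number(N).

unsigned_number(F) -->              % integer part, ".", fraction part
    [Int, ".", Frac],
    { atomics_to_string([Int, ".", Frac], S),
      number_string(F, S) }.
unsigned_number(N) -->              % a plain integer such as ["42"]
    [S],
    { number_string(N, S) }.

?- phrase(signed_number(N), ["-", "3", ".", "14"]).
N = -3.14.

One wrinkle to keep in mind: because e is alphanumeric, "1.5e3" tokenises as ["1", ".", "5e3"], so an e-delimited exponent arrives glued to the last digits rather than as a separate symbol, while ^ and * are punctuation and do come through on their own (** as two "*" tokens).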