A tokeniser I've written. Any suggestions on how to improve it?

joeblog · April 3, 2019, 12:33pm

Many thanks for those pointers, Jan.

After getting to grips with the various ways of iterating in Prolog, which I described in Six ways to iterate in Prolog, I’ve realised a DCG is the most elegant approach.

I’ve also realised that the code in A parsing example, which I took from The Art of Prolog, is overcomplicated and could be boiled down to one DCG (something I’ll tackle next).

Here’s the simplified tokeniser I’ve settled on.

:- module(tokeniser, [string_tokens/2]).

%% string_tokens(+String, -Tokens) is semidet.
%  Fails if String contains an unterminated quote.

string_tokens(String, Tokens) :-
  string_chars(String, Chars),
  phrase(tokens(Tokens), Chars).

% Definite Clause Grammar (DCG)

tokens([Token|Tokens]) --> ws, token(Token), ws, !, tokens(Tokens).
tokens([])             --> ws.  % also accept input that is empty or only whitespace

ws --> [W], { char_type(W, space) }, ws.
ws --> [].

token(P) --> [C], { char_type(C, punct), \+char_type(C, quote), string_chars(P, [C]) }.
token(Q) --> quote(Cs), { string_chars(Q, Cs) }.
token(W) --> word(Cs), { string_chars(W, Cs) }.

% A quoted token keeps its opening and closing quote characters.
quote([Quote|Ls])  --> [Quote], { char_type(Quote, quote) }, quote_rest(Quote, Ls).
quote_rest(Quote, [L|Ls]) --> [L], { L \= Quote }, quote_rest(Quote, Ls).
quote_rest(Quote, [Quote]) --> [Quote].

word([L|Ls])      --> [L], { char_type(L, alnum) }, word_rest(Ls).
word_rest([L|Ls]) --> [L], { char_type(L, alnum) }, word_rest(Ls).
word_rest([])     --> [].
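
For anyone who wants to try it, here is a minimal plunit sketch of how I'd expect the tokeniser to behave. It assumes the module above is saved as tokeniser.pl in the same directory as the test file; the test names and sample strings are only illustrations of mine, not a complete test suite.

```prolog
:- use_module(library(plunit)).
:- use_module(tokeniser).   % assumption: the module above lives in tokeniser.pl

:- begin_tests(tokeniser).

% Words and punctuation come out as separate string tokens; whitespace is dropped.
test(words_and_punctuation) :-
    string_tokens("Hello, world!", Tokens),
    Tokens == ["Hello", ",", "world", "!"].

% Quoted text is a single token that still carries its quote characters.
test(quoted_text_is_one_token) :-
    string_tokens("say \"hi there\"", Tokens),
    Tokens == ["say", "\"hi there\""].

% The empty string yields no tokens.
test(empty_string) :-
    string_tokens("", Tokens),
    Tokens == [].

:- end_tests(tokeniser).
```

Loading the test file and calling run_tests/0 at the toplevel should run all three cases.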
