Tokenizing files with DCGs

natarg · August 12, 2022, 9:28am

I am trying to write a tokenizer with DCGs (I am a newbie) using this as a template.

My file format so far allows for strings, numbers and parentheses, and I’d like to add comments. Example:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; A comment can contain any(thing) at all 42.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

bla bla (blabla) 34

The code I have so far is

:- set_prolog_flag(double_quotes, chars).
:- use_module(library(pio)).

tokenize(String, Tokens) :- phrase(tokenize(Tokens), String).
tokenize([]) --> [].
tokenize(Tokens) --> skip_spaces,  tokenize(Tokens).
tokenize([Tokens]) --> skip_comment,  tokenize(Tokens).      % Skip comments
tokenize([String|Tokens]) --> string(String), tokenize(Tokens).
tokenize([Number|Tokens]) --> number(Number), tokenize(Tokens).
tokenize([P|Tokens])      --> parens(P), tokenize(Tokens).

skip_spaces   --> code_types(white, [_|_]) | code_types(newline, [_|_]).
skip_comment --> comment(_).                                % Comment-skip 

/* Tokenize numbers and strings */
number(N) --> code_types(digit, [C|Cs]), {number_codes(N,[C|Cs])}.
string(N) -->  code_types(alnum, [C|Cs]), {atom_codes(M,[C|Cs]), string_lower(M, A), atom_string(N, A) }.

/* Treat parentheses as separate words */
parens(P) --> [C], {C = 40, char_code(P, C)}.
parens(P) --> [C], {C = 41, char_code(P, C)}.


code_types(Type, [C|Cs]) --> [C], {code_type(C,Type)}, !, code_types(Type, Cs).
code_types(_, []) --> [].


/*  comments */
comment([]) --> [C], {char_code(C, 10)}.
comment([C|R]) --> [C], {char_code(C, 59)}, ... , comment(R).

/* Any sequence of characters */
... --> []| [_], ... .

The query phrase_from_file(tokenize(T), 'text.txt') produces the following error message (only the top reproduced here):

ERROR: Type error: `character' expected, found `59' (an integer)
ERROR: In:
ERROR:   [22] char_code(59,10)

I guess there is something I don’t understand about character encodings? Help is greatly appreciated.

oskardrums · August 12, 2022, 11:28am

Seems like comment//1 isn’t doing the right thing. The variable C is being unified with the first code point of the input (which is read from text.txt in this case), so C is an integer, not a character (an atom), as the error states.

To simply skip the comment, you could use something along the lines of:

:- use_module(library(dcg/basics)).
comment --> ";", string_without("\n", _).
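For the codes-versus-characters point above, here is a minimal sketch that skips a comment without dcg/basics. It assumes the input is a list of character codes, which is what phrase_from_file/2 supplies, and it compares codes directly instead of calling char_code/2. The names comment_line//0 and rest_of_line//0 are made up for this example.

```prolog
% Hypothetical comment skipper that works on character codes.
% 0'; is the code of ';' (59) and 0'\n is the code of a newline (10).
comment_line --> [0';], rest_of_line.

% Consume codes up to and including the end of the line.
% The second clause is only reached when no input is left, which
% covers a comment on the last line without a trailing newline.
rest_of_line --> [C], !, ( { C =:= 0'\n } -> [] ; rest_of_line ).
rest_of_line --> [].
```

With this loaded, the query atom_codes(';; note\nbla', Cs), phrase(comment_line, Cs, Rest), atom_codes(A, Rest) gives A = bla: the comment and its newline are consumed and the rest of the input is left untouched.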

natarg · August 12, 2022, 11:42am

Thanks. Your solution on input file text.txt returns false.
If I remove the comment, the rest tokenizes as expected.

Anyway, I’d really like to know how to do this from “first principles”, i.e. without dcg/basics.

Consider:

1. comment --> [59] , ... , [10].
2. comment --> [59], string_without([10], _).
3. comment --> [59], string_without("\n", _).
4. comment --> ";", string_without([10], _).
5. comment --> ";", string_without("\n", _).

1 and 2 produce the following output (why the nested list?):

?- phrase_from_file(tokenize(T), 'text.txt').
T = [[[[bla, bla, '(', blabla, ')', '34']]]]

3 gives T = [[]] whereas 4 and 5 both fail.

natarg · August 12, 2022, 12:16pm

Whoops! Sorry, the nesting was my bad. With that corrected, both 1 and 2 give sound results.
It seems I must use character codes when reading from files.
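A likely explanation for variants 3 to 5: the code in the first post sets the double_quotes flag to chars, so ";" and "\n" are read as lists of characters, while phrase_from_file/2 delivers character codes. A character never matches a code, so 4 and 5 fail at the first ";", and in 3 the "\n" terminator is most likely never recognised, so the comment swallows the rest of the file.

Putting the pieces together, here is a sketch of the whole tokenizer working on codes throughout, with the list nesting fixed. It is one possible arrangement, not a definitive one. The helper names (tokens//1, blank//0, run//2, digit_code/1, word_code/1, tokenize_file/2) are invented for this sketch, and trying numbers before words is a choice made here so that 34 comes out as an integer.

```prolog
:- use_module(library(pio)).        % phrase_from_file/2

% Hypothetical entry point: tokenize a whole file.
tokenize_file(File, Tokens) :-
    phrase_from_file(tokens(Tokens), File).

% Skip blanks and comments; comments add nothing to the token list.
tokens(Ts)     --> blank, !, tokens(Ts).
tokens(Ts)     --> comment, !, tokens(Ts).
tokens([T|Ts]) --> token(T), !, tokens(Ts).
tokens([])     --> [].

% One whitespace code (the space type includes newlines).
blank --> [C], { code_type(C, space) }.

% A comment runs from ';' to the end of the line (or end of input).
comment --> [0';], rest_of_line.

rest_of_line --> [C], !, ( { C =:= 0'\n } -> [] ; rest_of_line ).
rest_of_line --> [].

% Numbers are tried before words, so "34" becomes the integer 34.
token(N)   --> run(digit_code, Cs), { number_codes(N, Cs) }.
token(W)   --> run(word_code, Cs),
               { atom_codes(W0, Cs), downcase_atom(W0, W) }.
token('(') --> [0'(].
token(')') --> [0')].

digit_code(C) :- C >= 0'0, C =< 0'9.
word_code(C)  :- code_type(C, alnum).

% run(P, Cs): the longest non-empty run of codes satisfying P.
run(P, [C|Cs])      --> [C], { call(P, C) }, run_rest(P, Cs).
run_rest(P, [C|Cs]) --> [C], { call(P, C) }, !, run_rest(P, Cs).
run_rest(_, [])     --> [].
```

On a file holding the example from the first post, tokenize_file('text.txt', T) should give T = [bla, bla, '(', blabla, ')', 34]. Input containing a character that no clause handles, such as a full stop outside a comment, makes the query fail rather than skip it.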
