I am trying to write a tokenizer with DCGs (I am a newbie) using this as a template.
My file format so far allows for strings, numbers and parentheses, and I’d like to add comments. Example:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; A comment can contain any(thing) at all 42.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
bla bla (blabla) 34
The code I have so far is
:- set_prolog_flag(double_quotes, chars).
:- use_module(library(pio)).
tokenize(String, Tokens) :- phrase(tokenize(Tokens), String).
tokenize([]) --> [].
tokenize(Tokens) --> skip_spaces, tokenize(Tokens).
tokenize([Tokens]) --> skip_comment, tokenize(Tokens). % Skip comments
tokenize([String|Tokens]) --> string(String), tokenize(Tokens).
tokenize([Number|Tokens]) --> number(Number), tokenize(Tokens).
tokenize([P|Tokens]) --> parens(P), tokenize(Tokens).
skip_spaces --> code_types(white, [_|_]) | code_types(newline, [_|_]).
skip_comment --> comment(_). % Comment-skip
/* Tokenize numbers and strings */
number(N) --> code_types(digit, [C|Cs]), {number_codes(N,[C|Cs])}.
string(N) --> code_types(alnum, [C|Cs]), {atom_codes(M,[C|Cs]), string_lower(M, A), atom_string(N, A) }.
/* Treat parentheses as separate words */
parens(P) --> [C], {C = 40, char_code(P, C)}.
parens(P) --> [C], {C = 41, char_code(P, C)}.
code_types(Type, [C|Cs]) --> [C], {code_type(C,Type)}, !, code_types(Type, Cs).
code_types(_, []) --> [].
/* comments */
comment([]) --> [C], {char_code(C, 10)}.
comment([C|R]) --> [C], {char_code(C, 59)}, ... , comment(R).
/* Any sequence of characters */
... --> []| [_], ... .
The query phrase_from_file(tokenize(T), 'text.txt')
produces the following error message (only the top reproduced here):
ERROR: Type error: `character' expected, found `59' (an integer)
ERROR: In:
ERROR: [22] char_code(59,10)
I guess there is something I don’t understand about character encodings? Help is greatly appreciated.
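For what it is worth, the trace shows char_code/2 being called with the integer 59 as its first argument, which suggests that the input from phrase_from_file arrives as a list of character codes (integers) rather than characters, regardless of the double_quotes flag. Below is a minimal sketch of a comment rule written under that assumption. The names comment//0 and rest_of_line//0 are hypothetical and not part of the code above, and the sketch assumes every comment ends with a newline; it is an illustration, not a verified fix for the whole tokenizer.

```prolog
% Sketch only: skip a comment that starts with ';' and runs to the
% end of the line. Assumes the input is a list of character codes,
% so the terminals are compared against integers directly.

comment --> [59], rest_of_line.      % 59 is the code for ';'

rest_of_line --> [10], !.            % 10 is the code for newline: stop here
rest_of_line --> [_], rest_of_line.  % otherwise consume one code and go on
```

Trying it on a code list built with atom_codes/2, so that the double_quotes flag plays no role:

?- atom_codes('; any(thing) 42\nx', Cs), phrase(comment, Cs, Rest).

This should leave Rest = [120], that is, only the code for x after the comment line.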