Tokenize_atom equivalent that forces everything to lowercase atoms?

tokenize_atom/2 returns any token in a sentence that starts with an upper-case letter as a quoted atom:

https://www.swi-prolog.org/pldoc/doc_for?object=tokenize_atom/2

?- tokenize_atom('The big dog.', R).
R = ['The', big, dog, '.'].

Is there a similar predicate that instead lowercases each token, so that you end up with a list of unquoted atoms, at least for the words?

For example:

?- tokenize_atom_hypothetical('The big dog.', R).
R = [the, big, dog, '.'].

Even if there isn’t, it seems easy enough to do it:

tokenize_atom_and_downcase(A, Ts) :-
    tokenize_atom(A, Any_case),
    maplist(downcase_atom, Any_case, Ts).

With this:

?- tokenize_atom_and_downcase('The big dog ATE MY CAT!', Ts).
Ts = [the, big, dog, ate, my, cat, !].
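One caveat, as far as I can tell: tokenize_atom/2 returns numeric tokens as Prolog numbers, and downcase_atom/2 accepts any atomic term but always produces an atom, so a token such as 3 would come back as '3'. A minimal sketch that only downcases atoms and passes everything else through unchanged (the names tokenize_atom_keep_numbers/2 and downcase_token/2 are made up for this example):

```prolog
:- use_module(library(porter_stem)).   % provides tokenize_atom/2
:- use_module(library(apply)).         % provides maplist/3

% tokenize_atom_keep_numbers(+Text, -Tokens)
% Hypothetical helper: tokenize, then downcase only the atom tokens.
tokenize_atom_keep_numbers(A, Ts) :-
    tokenize_atom(A, Any_case),
    maplist(downcase_token, Any_case, Ts).

% downcase_token(+Token0, -Token)
% Atoms are lowercased; numbers and other non-atoms are left as they are.
downcase_token(T0, T) :-
    (   atom(T0)
    ->  downcase_atom(T0, T)
    ;   T = T0
    ).
```

Whether this matters depends on whether the code that consumes the tokens needs to tell numbers and atoms apart.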

Of course, if losing information is not a problem:

?- downcase_atom('The big dog.',L),tokenize_atom(L,T).
L = 'the big dog.',
T = [the, big, dog, '.'].

Actually, I cannot think of an example where your solution works differently from first tokenizing and then downcasing. Are there Unicode characters that would break this if you downcase them first?

I think that shouldn’t be the case. SWI-Prolog’s native Unicode handling is incomplete. It can merely pass on Unicode code points (on Windows limited to 16 bits) and do simple one-character classification and case folding. The SWI-Prolog Unicode library provides more advanced features.


SQL is case-insensitive, but quoted data should preserve its casing, which my solution loses. Yours works better in this respect.

I see (I think). So you mean that there can be additional logic after tokenizing but before downcasing. Fair point.
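To make that concrete, here is a rough sketch of such in-between logic. It assumes a hypothetical keep_case/1 table that lists the tokens whose casing must survive; a real application would decide this differently (for example by tracking quote tokens), and the predicate names below are invented for illustration only:

```prolog
:- use_module(library(porter_stem)).   % provides tokenize_atom/2
:- use_module(library(apply)).         % provides maplist/3

% keep_case(?Token)
% Hypothetical table of tokens that must keep their original casing.
keep_case('SQL').
keep_case('Prolog').

% tokenize_atom_downcase_unless(+Text, -Tokens)
% Tokenize first, then decide per token whether to downcase it.
tokenize_atom_downcase_unless(A, Ts) :-
    tokenize_atom(A, Tokens),
    maplist(downcase_unless_kept, Tokens, Ts).

% Downcase atom tokens that are not listed in keep_case/1;
% leave listed tokens and non-atoms (such as numbers) untouched.
downcase_unless_kept(T0, T) :-
    (   atom(T0),
        \+ keep_case(T0)
    ->  downcase_atom(T0, T)
    ;   T = T0
    ).
```

This only works when tokenizing comes first: once the whole input has been downcased, there is no casing left to test against.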