- string_codes/2 with escape sequences \xXX for CR with LF

Using SWI-Prolog (threaded, 64 bits, version 8.3.3) on Windows 10

When converting some input into character codes for DCGs, the CR with LF combination does not translate into the codes I expect.

Correct examples

?- string_codes("\x0D",Codes).
Codes = [13].

?- string_codes("\x0A",Codes).
Codes = [10].

?- string_codes("\x0D \x0A",Codes).
Codes = [13, 32, 10].

?- string_codes("\x0D\\x0A",Codes).
Codes = [13, 10].

Example with what seems to be an invalid conversion.

?- string_codes("\x0D\x0A",Codes).
Codes = [13, 120, 48, 65].

The documentation notes:

The closing \ is obligatory according to the ISO standard, but optional in SWI-Prolog to enhance compatibility with the older Edinburgh standard.

So is this a bug, a misreading of the documentation or something else? :grinning:

EDIT

Based on reply by Jan W. to use \uXXXX

NB \uXXXX needs four hex digits

?- string_codes("\u000D\u000A",Codes).
Codes = [13, 10].

\uXXXX results in error when two hex digits are given.

?- string_codes("\u0D\u0A",Codes).
ERROR: Syntax error: Illegal \u or \U sequence
ERROR: string_codes("
ERROR: ** here **
ERROR: \u0D\u0A",Codes) .

The docs you linked, for the \xXX..\ say,

The code \xa\3 emits the character 10 (hexadecimal ‘a’) followed by ‘3’. Characters specified this way are interpreted as Unicode characters. See also \u.

If I apply this logic to your invalid conversion, I can rephrase that to:

The code \x0D\x0A emits the character 13 (hexadecimal ‘d’) followed by ‘x0A’. Characters specified this way are interpreted as Unicode characters. See also \u.

This obviously does not answer your question but at least it seems consistent?

PS: if there is something, it has nothing to do with string_codes/2, it is about reading the string literal. Try:

X = "\x0D\x0A".
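The reading described above (hex digits after \x, then a closing backslash that is consumed when present) can be modelled with a small DCG. This is only a sketch that reproduces the outputs reported in this thread, not SWI-Prolog's actual reader. The names raw_decode/2, decode//1, hex_digits//1, optional_backslash//0 and hex_value/2 are made up for illustration, and only \x escapes are handled.

```prolog
:- use_module(library(apply)).          % foldl/4

% raw_decode(+Raw, -Codes)
% Raw is a string holding the undecoded text between the quotes,
% Codes is the code list this model decodes it to.
raw_decode(Raw, Codes) :-
    string_codes(Raw, RawCodes),
    phrase(decode(Codes), RawCodes).

% decode(-Codes)//  (hypothetical model, \x escapes only)
decode([C|Cs]) -->
    [0'\\, 0'x],                        % a backslash followed by x
    hex_digits(Ws),
    { Ws = [_|_] }, !,                  % require at least one hex digit
    optional_backslash,                 % closing backslash: consumed if present
    { hex_value(Ws, C) },
    decode(Cs).
decode([C|Cs]) --> [C], !, decode(Cs).  % any other code is copied unchanged
decode([]) --> [].

% hex_digits(-Weights)//: greedily collect the values of hex digits.
hex_digits([W|Ws]) -->
    [C], { code_type(C, xdigit(W)) }, !,
    hex_digits(Ws).
hex_digits([]) --> [].

optional_backslash --> [0'\\], !.
optional_backslash --> [].

% hex_value(+Weights, -Value): fold the digit values into one integer.
hex_value(Ws, V) :- foldl(add_digit, Ws, 0, V).
add_digit(W, V0, V) :- V is V0*16 + W.

% The backslashes are doubled here so that the raw text reaches the model.
% ?- raw_decode("\\x0D\\x0A", Codes).      % raw text: \x0D\x0A
% Codes = [13, 120, 48, 65].
% ?- raw_decode("\\x0D\\\\x0A", Codes).    % raw text: \x0D\\x0A
% Codes = [13, 10].
```

In this model the second backslash in \x0D\x0A is taken as the closing backslash of the first escape, which leaves x0A as ordinary characters. That matches both the output in the question and the \xa\3 example from the documentation.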

\xa\3 gets interpreted as \xa\ then 3. Since there is no escape for \3 then the only valid parse is as noted.

\x0D\x0A should be interpreted as \x0D then \x0A because the code should look ahead when seeing a backslash (\) to see if it is the start of an escape sequence. In this case \ is followed by x and thus signifies it is an escape sequence.

Unless you care about portability to other Prolog systems, use \uXXXX or \UXXXXXXXX instead of \x… The ISO Prolog syntax for character codes is awkward and the result means nothing as it is undefined what encoding should be respected. \u and \U are widely used in virtually all languages these days and are defined to be Unicode code points.
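For comparison, the three spellings discussed in this thread can be put side by side. This is a sketch based on the escape rules quoted earlier in the thread, not output copied from a session. The first form closes every \x escape with a backslash, as ISO requires; the other two are the SWI-Prolog forms recommended here and may not be accepted by other Prolog systems. In SWI-Prolog each query is expected to bind Codes to [13, 10].

```prolog
% ISO style: each \x escape is closed with a backslash.
?- string_codes("\x0D\\x0A\", Codes).

% \u takes exactly four hex digits.
?- string_codes("\u000D\u000A", Codes).

% \U takes exactly eight hex digits.
?- string_codes("\U0000000D\U0000000A", Codes).
```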
