Which code points are escaped inside quoted atoms and strings?

What is the policy for escaping code points inside atoms and strings?
Since this is a Unicode question, the ISO core standard might
not be extremely helpful. So what is the best practice?

I got in SWI-Prolog:

?- X = 'abc\xFEFF\def'. 
X = 'abcdef'.

?- X = "abc\xFEFF\def". 
X = "abcdef".

The 0xFEFF code point doesn’t get escaped. If I
take the idea that a quoted atom is also a printable atom,
then this might need some improvement. I was rather expecting:

?- X = 'abc\xFEFF\def'.
X = 'abc\xFEFF\def'

?- X = "abc\xFEFF\def".
X = "abc\uFEFFdef"

That a string is printed with \u … instead of \x … \
is to help JSON and JavaScript output. How can I change
SWI-Prolog behaviour? Are there some flags?

In theory it doesn’t matter too much, as long as we escape the quote and the backslash. The logic is in putQuoted() in pl-write.c. It prints anything >= 256 as is, and in the low range it prints all graph characters as is, except for the quote and the backslash. For the rest it uses a defined escape sequence such as \n if one exists, or the ISO \ooo\ form otherwise.
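To make that order of decisions concrete, here is a rough Prolog sketch of the classification as described above. It paraphrases the description and is not a transcription of pl-write.c. The names quoted_action/3 and named_escape/2 are made up for illustration, the table of named escapes is only a sample, and keeping the plain space as is is an assumption.

```prolog
% quoted_action(+Quote, +Code, -Action)
% Hypothetical helper: decide how a character code would be written
% inside quoted text delimited by Quote (39 for ', 34 for ").
quoted_action(Quote, Code, Action) :-
    (   ( Code =:= Quote ; Code =:= 92 )   % the quote itself or backslash
    ->  Action = escape
    ;   Code >= 256                        % high range: written as is
    ->  Action = as_is
    ;   Code =:= 32                        % assumption: plain space is kept
    ->  Action = as_is
    ;   code_type(Code, graph)             % low range, produces "ink"
    ->  Action = as_is
    ;   named_escape(Code, Name)           % e.g. newline -> \n
    ->  Action = named(Name)
    ;   Action = octal                     % ISO \ooo\ fallback
    ).

% Sample table of named escapes (illustrative, not exhaustive).
named_escape(7,  a).
named_escape(8,  b).
named_escape(9,  t).
named_escape(10, n).
named_escape(11, v).
named_escape(12, f).
named_escape(13, r).
```

Under this sketch, `quoted_action(39, 0xFEFF, A)` gives `A = as_is`, which matches the behaviour reported in the question.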

Printing may fail because the stream cannot represent the character. Then the logic depends on the stream_property/2 property representation_errors, which may raise an exception or print the character as \uXXXX, \UXXXXXXXX or (old school) an XML character entity. So, if you set the encoding of the stream to ascii and the representation_errors property to prolog, you get \u (or \U) escapes.
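A minimal sketch of that approach, assuming you want it only for a single writeq on user_output and want the old settings restored afterwards. The predicate name writeq_ascii/1 is made up; set_stream/2 and stream_property/2 with encoding and representation_errors are the standard SWI-Prolog calls.

```prolog
% writeq_ascii(+Term)
% Hypothetical helper: write Term quoted while the stream can only
% represent ASCII, so that other characters are written as escapes.
writeq_ascii(Term) :-
    Out = user_output,
    stream_property(Out, encoding(OldEnc)),
    stream_property(Out, representation_errors(OldMode)),
    setup_call_cleanup(
        ( set_stream(Out, encoding(ascii)),
          set_stream(Out, representation_errors(prolog))
        ),
        ( writeq(Out, Term),
          nl(Out)
        ),
        % Restore the previous settings even if writing raises.
        ( set_stream(Out, encoding(OldEnc)),
          set_stream(Out, representation_errors(OldMode))
        )).
```

Calling `writeq_ascii('abc\xFEFF\def')` should then show the code point as an escape instead of an invisible character.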

Possibly we should use iswprint() to classify. The result of the iswXYZ functions depends on the locale, though, and some systems have rather poor locale support. We could switch to the derived property table built into SWI-Prolog (derived from the Unicode standard properties). I’m not too sure about the “good” rules or their usefulness.

It prints U+FEFF fine. It is just a zero-width space, so there is little to see. How it works is explained in some detail in my previous reply.

I grepped the sources and found this:

so I guess this qualifies as “own stuff”?

Bottom line is that Prolog quoted write should write something read/1 can read. That is easy: escape the quote and the \. Next we may want something that is easy to read. Does the ISO standard say anything? I think it says newlines cannot be in a quoted string, so we need to escape newlines. After that it makes some sense to do the same for the well-known escapes such as \t and to write the ASCII control characters as \ooo\, as ISO says we cannot use \uXXXX. Above ASCII, ISO leaves us more or less in the dark AFAIK. We have two meaningful character classes: isprint() and isgraph(). As I understand it, isprint() says it is not a control character and isgraph() says it creates something visible. Added code_type/2 for print and indeed:

?- code_type(X, print), \+ (code_type(X, graph);code_type(X,space)).
false.

Adjusted putQuoted() to use iswgraph(), so characters that do not produce “ink” :slight_smile: are emitted as escapes. Now it appears Linux (glibc) classifies \uFEFF as print, graph and punct :frowning: Depending on OS and locale, all sorts of things may happen … Escapes are now emitted as \x<hex>\. Guess we should have a flag or option to produce \uXXXX.
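Since the classification depends on OS and locale, it may help to check what your own system reports for a given code point. A small sketch, assuming code_type/2 supports print as mentioned above; the predicate name code_classes/2 is made up.

```prolog
% code_classes(+Code, -Classes)
% Hypothetical helper: collect which of a few character classes
% code_type/2 reports for Code on this system and locale.
code_classes(Code, Classes) :-
    findall(Class,
            ( member(Class, [print, graph, punct, space]),
              code_type(Code, Class)
            ),
            Classes).
```

Try `code_classes(0xFEFF, Cs)` and compare with something plain like `code_classes(0'a, Cs)`. The answer for 0xFEFF is exactly the part that may differ between systems.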

Writing JSON is completely disconnected from this.

Ok. This was a bit more work than anticipated :frowning: Pushed some stuff that cleans this up:

  • Instead of iswgraph() it now uses the internal Unicode tables, escaping
    • Unassigned Unicode code points
    • Separators (Unicode Z*) except for the ASCII space
    • Control characters (Unicode C*)

There is a new flag character_escapes_unicode that controls whether the common \uXXXX or the idiosyncratic ISO Prolog \x<hex>\ is used as well as an option with the same name for write_term/2,3 that defaults to the flag. Also stream_property/2 and set_stream/2 now provide both prolog and unicode to decide on how to escape characters that cannot be represented.
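A short sketch of how the option could be used to compare the two notations side by side. It assumes a version that has the changes described here; the predicate name show_escapes/1 is made up, while character_escapes_unicode is the flag and option name given above.

```prolog
% show_escapes(+Term)
% Hypothetical helper: write Term quoted twice, once per escape style.
show_escapes(Term) :-
    write_term(Term, [ quoted(true),
                       character_escapes_unicode(true)    % \uXXXX style
                     ]),
    nl,
    write_term(Term, [ quoted(true),
                       character_escapes_unicode(false)   % ISO \x<hex>\ style
                     ]),
    nl.
```

For example `show_escapes('a\x1\b')` writes the same atom in both styles. To change the default for all quoted writes, use `set_prolog_flag(character_escapes_unicode, false)`.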

The default is to use the \uXXXX notation. I’m sure many users would have preferred to use \x<hex>\ as default …
