Which code points are escaped inside quoted atoms and strings?

j4n_bur53 · May 4, 2021, 7:00am

What is the policy for escaping code points inside atoms and strings?
Since this is a Unicode question, the ISO core standard might
not be extremely helpful. So what is the best practice?

I got in SWI-Prolog:

?- X = 'abc\xFEFF\def'. 
X = 'abcdef'.

?- X = "abc\xFEFF\def". 
X = "abcdef".

The 0xFEFF code point doesn’t get escaped. If I
take the idea that a quoted atom is also a printable atom,
then this might need some improvement. I was rather expecting:

?- X = 'abc\xFEFF\def'.
X = 'abc\xFEFF\def'

?- X = "abc\xFEFF\def".
X = "abc\uFEFFdef"

That a string is printed with \u … instead of \x … \
is to help JSON and JavaScript output. How can I change
SWI-Prolog behaviour? Are there some flags?

jan · May 4, 2021, 7:40am

In theory it doesn’t matter too much, as long as we escape the quote and the backslash. The logic is in putQuoted() in pl-write.c. It prints anything >= 256 as is, and in the low range it prints all graph characters as is except for the quote and the backslash. It then uses the defined escape sequence such as \n if one exists, or ISO \ooo\ otherwise.
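As a rough illustration of the rules described above, here is a minimal C sketch. It is not the real putQuoted(): the names quoted_escape() and is_graph_latin1() are hypothetical, the Latin-1 classifier is a stand-in for whatever the real code uses, and treating the space as a literal is my assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in classifier (assumption): visible ASCII and Latin-1 codes. */
static int is_graph_latin1(int c)
{ return (c >= 0x21 && c <= 0x7E) || (c >= 0xA1 && c <= 0xFF);
}

/* Hypothetical helper, not the real putQuoted().  Expects c >= 0.
 * Returns 1 and stores the escape in buf, or returns 0 with buf
 * empty when c should be written literally.
 */
static int quoted_escape(int c, int quote, char *buf, size_t size)
{ static const char codes[] = "\a\b\t\n\v\f\r";   /* codes with a named escape */
  static const char names[] = "abtnvfr";          /* matching escape letters */
  const char *p;

  assert(size >= 6);                 /* longest escape is "\377\" */
  buf[0] = '\0';

  if ( c >= 256 )                    /* high range: written as is */
    return 0;
  if ( c == quote || c == '\\' )     /* always escaped */
  { snprintf(buf, size, "\\%c", c);
    return 1;
  }
  if ( c == ' ' || is_graph_latin1(c) )
    return 0;
  if ( (p = memchr(codes, c, sizeof codes - 1)) )
  { snprintf(buf, size, "\\%c", names[p - codes]);
    return 1;
  }
  snprintf(buf, size, "\\%o\\", (unsigned)c);     /* ISO \ooo\ form */
  return 1;
}
```

Under these assumptions, quoted_escape(0xFEFF, '\'', buf, sizeof buf) returns 0, which matches the behaviour in the question: U+FEFF is above 255, so it is written without an escape.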

Printing may fail because the stream cannot represent the character. Then the logic depends on the stream_property/2 property representation_errors, which may raise an exception or print the character as \uXXXX, \UXXXXXXXX or (old school) an XML character entity. So, if you set the encoding of the stream to ascii and the representation error property to prolog, you get \u (or \U) escapes.
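The fallback for an ASCII stream can be sketched roughly as follows. This is an illustration only: put_ascii() and the rep_mode names are hypothetical and not taken from the SWI-Prolog sources, and the upper-case hex digits are my assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical modes mirroring the representation_errors values. */
typedef enum { REP_ERROR, REP_PROLOG, REP_XML } rep_mode;

/* Hypothetical helper: writes c as it would appear on an ASCII stream.
 * Returns the number of characters stored in buf, or -1 where the
 * real system would raise a representation error.
 */
static int put_ascii(int c, rep_mode mode, char *buf, size_t size)
{ assert(c >= 0 && size >= 11);      /* "\UXXXXXXXX" plus terminator */

  if ( c < 128 )                     /* representable: write as is */
    return snprintf(buf, size, "%c", c);

  switch ( mode )
  { case REP_PROLOG:                 /* \uXXXX, or \UXXXXXXXX above U+FFFF */
      if ( c <= 0xFFFF )
        return snprintf(buf, size, "\\u%04X", (unsigned)c);
      return snprintf(buf, size, "\\U%08X", (unsigned)c);
    case REP_XML:                    /* decimal XML character entity */
      return snprintf(buf, size, "&#%d;", c);
    default:                         /* REP_ERROR */
      return -1;
  }
}
```

With REP_PROLOG, U+FEFF comes out as \uFEFF; with REP_XML it comes out as a decimal character entity.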

Possibly we should use iswprint() to classify. The result of the iswXYZ functions depends on the locale, and some systems have rather poor locale support. We could switch it to the derived property table built into SWI-Prolog (derived from the Unicode standard properties). I’m not too sure about the “good” rules or the usefulness.

jan · May 4, 2021, 10:59am

It prints U+FEFF fine. It is just a zero-width no-break space, so there is little to see. How it works is explained in some detail in my previous reply.

Boris · May 4, 2021, 11:43am

I grepped the sources and found this:

https://github.com/SWI-Prolog/packages-http/blob/35c668565051f188b0b47a816a9408b64ae51c71/json.c#L223

  { TRYPUTC(c, out);
  }

  return 0;
}

#undef TRYPUTC
#define TRYPUTC(c, s) if ( Sputcode(c, s) < 0 ) { rc = FALSE; goto out; }

static foreign_t
json_write_string(term_t stream, term_t text)
{ IOSTREAM *out;
  char *a;
  pl_wchar_t *w;
  size_t len;
  int rc = TRUE;

  if ( !PL_get_stream(stream, &out, SIO_OUTPUT) )
    return FALSE;

  if ( PL_get_nchars(text, &len, &a, CVT_ATOM|CVT_STRING|CVT_LIST) )

so I guess this qualifies as “own stuff”?

jan · May 4, 2021, 3:10pm

Bottom line is that Prolog quoted write should write something read/1 can read. That is easy: escape the quote and the \. Next we may want something that is easy to read. Does the ISO standard say anything? I think it says newlines cannot be in a quoted string, so we need to escape newlines. After that it makes some sense to do the same for the well known escapes such as \t and write the ASCII control characters as \ooo\ as ISO says we cannot use \uXXXX. Above ASCII, ISO leaves us more or less in the dark AFAIK. We have two meaningful character classes: isprint() and isgraph(). As I understand it, isprint() says it is not a control character and isgraph() says it creates something visible. Added code_type/2 for print and indeed:

?- code_type(X, print), \+ (code_type(X, graph);code_type(X,space)).
false.

Adjusted putQuoted() to use iswgraph(), so characters that do not produce “ink” are emitted as escapes. Now it appears Linux (glibc) classifies \uFEFF as print, graph and punct. Depending on OS and locale, all sorts of things may happen … Escapes are now emitted as \x<hex>\. I guess we should have a flag or option to produce \uXXXX.

Writing JSON is completely disconnected from this.

jan · May 5, 2021, 9:49am

Ok. This was a bit more work than anticipated. Pushed some stuff that cleans this up.

Instead of iswgraph() it now uses the internal Unicode tables, escaping:
- Unassigned Unicode code points
- Separators (Unicode Z*) except for the ASCII space
- Control characters (Unicode C*)
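
The three rules above can be written down as a small predicate. This is only a sketch: needs_escape() is a hypothetical name, and it assumes the caller already has the two-letter Unicode general category of the code point (the real code takes this from SWI-Prolog's internal tables). The quote and the backslash still need their own escaping on top of this.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: gc is the two-letter Unicode general category
 * of code point c, e.g. "Ll", "Zs", "Cf" or "Cn".
 * Returns 1 if c should be written as an escape, 0 otherwise.
 */
static int needs_escape(int c, const char *gc)
{ assert(gc && strlen(gc) == 2);

  if ( gc[0] == 'C' )                /* Cc, Cf, Cs, Co and Cn (unassigned) */
    return 1;
  if ( gc[0] == 'Z' )                /* separators, except the ASCII space */
    return c != 0x20;
  return 0;
}
```

For example, U+FEFF has category Cf, so needs_escape(0xFEFF, "Cf") returns 1, while the plain space, needs_escape(0x20, "Zs"), returns 0.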

There is a new flag character_escapes_unicode that controls whether the common \uXXXX or the idiosyncratic ISO Prolog \x<hex>\ is used. write_term/2,3 has an option with the same name that defaults to the flag. Also, stream_property/2 and set_stream/2 now provide both prolog and unicode to decide how to escape characters that cannot be represented.

The defaults are to use the \uXXXX notation. I’m sure many users would have preferred to use \x<hex>\ as default …
