Which code points are escaped inside quoted atoms and strings?

What is the policy for escaping code points inside atoms and strings?
Since this is a Unicode question, the ISO core standard might
not be extremely helpful. So what is the best practice?

I got in SWI-Prolog:

?- X = 'abc\xFEFF\def'. 
X = 'abcdef'.

?- X = "abc\xFEFF\def". 
X = "abcdef".

The 0xFEFF code point doesn’t get escaped. If I
take the idea that a quoted atom is also a printable atom,
then this might need some improvement. I was rather expecting:

?- X = 'abc\xFEFF\def'.
X = 'abc\xFEFF\def'

?- X = "abc\xFEFF\def".
X = "abc\uFEFFdef"

A string is printed with \u… instead of \x…\
to help JSON and JavaScript output. How can I change
SWI-Prolog's behaviour? Are there any flags?

In theory it all doesn’t matter too much as long as we escape the quote and the backslash. The logic is in putQuoted() in pl-write.c. It prints anything >= 256 as is, and in the low range it prints all graph characters as is except for the quote and the backslash. It then uses a defined escape sequence such as \n if one exists, or ISO \ooo\ otherwise.
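As a rough illustration of the rule described above (not the actual putQuoted() code from pl-write.c), the low-range logic could be sketched in Prolog as follows. The names escape_code/3, quoted_codes/3, named_escape/2 and plain_low/1 are hypothetical, and plain_low/1 is only an assumed stand-in for the C isgraph() test.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% named_escape(Code, Letter): the well-known single-letter escapes.
named_escape(7,  0'a).
named_escape(8,  0'b).
named_escape(9,  0't).
named_escape(10, 0'n).
named_escape(11, 0'v).
named_escape(12, 0'f).
named_escape(13, 0'r).

% ASSUMPTION: space, ASCII graph characters and Latin-1 0xA1..0xFF
% are written as is; the real code asks the C library instead.
plain_low(C) :- C >= 0x20, C =< 0x7E.
plain_low(C) :- C >= 0xA1, C =< 0xFF.

% escape_code(+Quote, +Code, -Codes): Codes is the text written for Code.
% 92 is the backslash; Quote is 39 (') or 34 (").
escape_code(_, C, [C]) :- C >= 256, !.                     % high range: as is
escape_code(Q, C, [92, C]) :- ( C =:= Q ; C =:= 92 ), !.   % quote and backslash
escape_code(_, C, [C]) :- plain_low(C), !.
escape_code(_, C, [92, L]) :- named_escape(C, L), !.       % \n, \t, ...
escape_code(_, C, Cs) :- format(codes(Cs), "\\~8r\\", [C]). % ISO \ooo\

% quoted_codes(+Quote, +Codes, -Out): Out is the quoted, escaped text.
quoted_codes(Q, Codes, Out) :-
    maplist(escape_code(Q), Codes, Parts),
    append(Parts, Body),
    append([Q|Body], [Q], Out).
```

For example, `?- quoted_codes(39, [0'a, 10, 0'b], Out), format("~s~n", [Out]).` prints `'a\nb'`. Note that under this rule U+FEFF falls in the "as is" branch, which is exactly the behaviour questioned at the top of the thread.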

Printing may fail because the stream cannot represent the character. Then the logic depends on the stream_property/2 property representation_errors, which may raise an exception or print the character as \uXXXX, \UXXXXXXXX or (old school) an XML character entity. So, if you set the encoding of the stream to ascii and the representation error property to prolog, you get \u (or \U) escapes.
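A minimal sketch of that setup, assuming a reasonably recent SWI-Prolog: it writes a term to a temporary ASCII-encoded file whose representation_errors property is set to prolog, then reads the text back. The predicate name ascii_quoted/2 is made up for this example; the built-ins it calls (tmp_file_stream/3, set_stream/2, read_file_to_string/3) are standard SWI-Prolog.

```prolog
:- use_module(library(readutil)).

% ascii_quoted(+Term, -Text): Text is what writeq/2 emits for Term on an
% ASCII stream that escapes unrepresentable characters Prolog style.
ascii_quoted(Term, Text) :-
    tmp_file_stream(File, Out, [encoding(ascii)]),
    set_stream(Out, representation_errors(prolog)),
    call_cleanup(writeq(Out, Term), close(Out)),
    read_file_to_string(File, Text, []),
    delete_file(File).
```

Calling `?- ascii_quoted('abc\uFEFFdef', T).` should then give a T that contains only ASCII characters, with the U+FEFF code point shown as an escape. The same two set_stream/2 calls can be applied to user_output if the escapes are wanted at the toplevel.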

Possibly we should use iswprint() to classify. The result of the iswXYZ functions depends on the locale, and some systems have rather poor locale support. We could switch to the derived property table built into SWI-Prolog (derived from the Unicode standard properties). I’m not too sure about the “good” rules or the usefulness.

I don’t want to escape the quote and the backslash. The Prolog system might become bilingual, since it already accepts Unicode escapes. This already works in SWI-Prolog:

?- X = "abc\uFEFFdef", atom_codes(X, L).
X = "abcdef",       %%%%% here the \uXXXX disappeared
L = [97, 98, 99, 65279, 100, 101, 102].
                    %%%%% but its there, since 0xFEFF = 65279
?- X = 0xFEFF.
X = 65279.

The “\uXXXX” is already recognized. But how can I output “\uXXXX”? Currently the “\uXXXX” disappears in the output, see the answer X = "abcdef".

How can I make write/1 print “\uXXXX” for non-printable code points?

It prints U+FEFF fine. It is just a zero-width no-break space, so there is little to see. How it works is explained in some detail in my previous reply.

= non-printable.

Doesn’t print anything. My question is whether it’s possible:

How is this solved in json_write and friends? Ok I see:

% foreign json_write_string/2.

Some 3rd party library or some own stuff?

I grepped the sources and found this:

so I guess this qualifies as “own stuff”?

How is \u0082 currently classified? It’s above \u0080, so outside the ASCII range 0…127. I get:

?- X = "abc\u0082def".
X = "abc\202\def".

?- current_output(X), json_write(X, "abc\u0082def").
"abc‚def"
X = <stream>(0x1046b3a60).

The JSON or JavaScript standard will possibly not require that \u0082 gets
quoted during writing. But we might want it for printable JSON nevertheless.

Is there something like printable JSON?

Edit 04.05.2021
Just a little sanity check about my own system, I get:

?- set_prolog_flag(double_quotes, string).

?- X = 'abc\x82\def'.
X = 'abc\202\def'

?- X = "abc\x82\def".
X = "abc\u0082def"

An octal coding wouldn’t be allowed for strict JavaScript. It seems to have a strict and non-strict mode. And only in non-strict mode octal would be allowed.

Could be helpful for debugging?

Now having a look at browser function JSON.stringify().
It has a parameter:

replacer Optional
A function that alters the behavior of the stringification process

But I need a break from JavaScript. Should do something else.
Will investigate later or tomorrow again.

Also I am reading too much into it. “stringification” refers to
the whole JSON term and not how strings are written…

Bottom line is that Prolog quoted write should write something read/1 can read. That is easy: escape the quote and the \. Next we may want something that is easy to read. Does the ISO standard say anything? I think it says newlines cannot be in a quoted string, so we need to escape newlines. After that it makes some sense to do the same for the well known escapes such as \t and write the ASCII control characters as \ooo\ as ISO says we cannot use \uXXXX. Above ASCII, ISO leaves us more or less in the dark AFAIK. We have two meaningful character classes: isprint() and isgraph(). As I understand it, isprint() says it is not a control character and isgraph() says it creates something visible. Added code_type/2 for print and indeed:

?- code_type(X, print), \+ (code_type(X, graph);code_type(X,space)).
false.
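The baseline requirement stated above, that quoted write produces something read/1 can read back, can be checked with a small round-trip test. This is only a sketch for ground terms, and roundtrips/1 is a name made up for the example.

```prolog
% roundtrips(+Term): Term, written with quoting and parsed again,
% yields an identical term. Intended for ground terms only.
roundtrips(Term) :-
    format(atom(Text), "~q", [Term]),   % quoted write into an atom
    term_to_atom(Copy, Text),           % parse the text back
    Copy == Term.
```

For instance `?- roundtrips('abc\uFEFFdef').` should succeed whether or not U+FEFF is escaped in the output, since the escaping only affects readability for humans, not what the reader reconstructs.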

Adjusted putQuoted() to use iswgraph(), so characters that do not produce “ink” are emitted as escaped. Now it appears Linux (glibc) classifies \uFEFF as print, graph and punct. Depending on OS and locale all sorts of things may happen … Escapes are now emitted as \x<hex>\. I guess we should have a flag or option to produce \uXXXX.

Writing JSON is completely disconnected from this.

Ok, nice! Will have a look in due time.

This is related to Dogelog, which is a cross compiler towards JavaScript that generates a mixture of JSON and JavaScript. The lazy solution I am currently using is that the Prolog string datatype is written in a JavaScript-compatible way in my system. So in the cross compiler I can, for example, simply do:

?- set_prolog_flag(double_quotes, string).

?- writeq(add([0,1,2,'abc\uFEFFdef'])), write(';'), nl.
add([0, 1, 2, 'abc\xFEFF\def']);      %%% not JavaScript, atom used and not string

?- write(add([0,1,2,"abc\uFEFFdef"])), write(';'), nl.
add([0, 1, 2, "abc\uFEFFdef"]);       %%% we got a JavaScript statement here!

The generated printable JSON is also JSON compatible, since it stays within the scope of JSON syntax. For a “minified” output, I might provide some way to print non-printables directly, since this possibly needs less space. So there is possibly also a new “minified” flag in the pipeline here.
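For systems that do not write strings this way, a JavaScript-compatible literal can be produced by hand. The following is a sketch only, not Dogelog's or SWI-Prolog's implementation: it escapes the double quote and the backslash, keeps printable ASCII, and writes everything else as \uXXXX, using a surrogate pair above U+FFFF. The names js_string_codes/2, js_code/2, u_escape/2 and hex_digit/3 are hypothetical.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% js_string_codes(+Text, -Out): Out is Text as a double-quoted,
% ASCII-only JavaScript/JSON string literal (as a code list).
js_string_codes(Text, Out) :-
    string_codes(Text, Codes),
    maplist(js_code, Codes, Parts),
    append(Parts, Body),
    append([34|Body], [34], Out).          % 34 is the double quote

js_code(34, [92, 34]) :- !.                % "  becomes  \"
js_code(92, [92, 92]) :- !.                % \  becomes  \\
js_code(C, [C]) :- C >= 0x20, C =< 0x7E, !.
js_code(C, Cs) :- C =< 0xFFFF, !, u_escape(C, Cs).
js_code(C, Cs) :-                          % above the BMP: surrogate pair
    C1 is C - 0x10000,
    Hi is 0xD800 + (C1 >> 10),
    Lo is 0xDC00 + (C1 /\ 0x3FF),
    u_escape(Hi, H),
    u_escape(Lo, L),
    append(H, L, Cs).

% u_escape(+N, -Codes): Codes is \u followed by four hex digits.
u_escape(N, [92, 0'u, A, B, C, D]) :-
    maplist(hex_digit(N), [12, 8, 4, 0], [A, B, C, D]).

hex_digit(N, Shift, Code) :-
    V is (N >> Shift) /\ 0xF,
    atom_codes('0123456789ABCDEF', Digits),
    nth0(V, Digits, Code).
```

With this, `?- js_string_codes('abc\uFEFFdef', Out), format("~s~n", [Out]).` prints `"abc\uFEFFdef"`. Escaping every non-ASCII code point is the simplest safe choice here; a “minified” variant would pass more code points through unchanged.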

BTW: Generating JavaScript via Prolog can be pushed further, for example by using the foo() syntax available in SWI-Prolog. It was available in my system too, but is now only readable and not writable; I should make it writable again. Block syntax etc. might also be helpful. Just for quick and dirty, aka lazy, solutions.

I only need a “minified” flag. I don’t need a “\uXXXX” flag, this is regulated by the Prolog datatype atom or string. Since Prolog datatype string and the “\uXXXX” syntax is anyway not in the ISO core standard, I guess there is room to make these decisions.

Yeah, seems not to be stable. From the documentation:

in the default locale, iswgraph(0x2602) = false
in Unicode locale, iswgraph(0x2602) = true

ISO 30112 specifies which Unicode characters are included in the POSIX graph category.

https://en.cppreference.com/w/cpp/string/wide/iswgraph

Edit 04.05.2021:
Checked ISO 30112, it says:

graphic_char= <any char except control_chars and space> ;

But what is control_chars or space in Unicode? I implemented something
based on the Unicode categories Cc and Cf on my side.
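A classification along those lines could look roughly like the sketch below. It is not the code from either system: the Cc ranges are complete, but the Cf table lists only a few sample ranges and would have to be generated from the Unicode data files in practice. The names control_code/1, format_range/2 and nonprintable/1 are hypothetical.

```prolog
% Cc: the C0 and C1 control ranges (this part is complete).
control_code(C) :- C >= 0x00, C =< 0x1F.
control_code(C) :- C >= 0x7F, C =< 0x9F.

% Cf: a few sample ranges only, NOT the full Unicode table.
format_range(0x00AD, 0x00AD).  % soft hyphen
format_range(0x200B, 0x200F).  % zero-width space/joiners, direction marks
format_range(0x202A, 0x202E).  % bidi embedding controls
format_range(0x2060, 0x2064).  % word joiner, invisible operators
format_range(0xFEFF, 0xFEFF).  % zero-width no-break space (BOM)

% nonprintable(+Code): Code is in Cc or in the sampled part of Cf.
nonprintable(C) :- control_code(C), !.
nonprintable(C) :- format_range(Lo, Hi), C >= Lo, C =< Hi, !.
```

Under this rule both 0x82 and 0xFEFF count as non-printable, while 0x2602 (the umbrella symbol from the iswgraph() example) does not, independent of the locale.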

Ok. This was a bit more work than anticipated. Pushed some stuff that cleans this up:

  • Instead of iswgraph() it now uses the internal Unicode tables, escaping
    • Unassigned Unicode code points
    • Separators (Unicode Z*) except for the ASCII space
    • Control characters (Unicode C*)

There is a new flag character_escapes_unicode that controls whether the common \uXXXX or the idiosyncratic ISO Prolog \x<hex>\ is used. There is also an option with the same name for write_term/2,3 that defaults to the flag. In addition, stream_property/2 and set_stream/2 now provide both prolog and unicode to decide how to escape characters that cannot be represented.

The defaults are to use the \uXXXX notation. I’m sure many users would have preferred to use \x<hex>\ as default …
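The write_term option can be tried with a small helper such as the one below. It assumes a SWI-Prolog version that already includes the change described above, and quoted_text/3 is a name invented for this example.

```prolog
% quoted_text(+Term, +Unicode, -Text): Text is Term written with quoting,
% where Unicode (true or false) selects \uXXXX or \x<hex>\ escapes.
quoted_text(Term, Unicode, Text) :-
    with_output_to(string(Text),
        write_term(Term, [ quoted(true),
                           character_escapes_unicode(Unicode)
                         ])).
```

Comparing `?- quoted_text('abc\uFEFFdef', true, T1), quoted_text('abc\uFEFFdef', false, T2).` should show the two notations side by side. To change the default for all quoted output, `set_prolog_flag(character_escapes_unicode, false)` selects the ISO notation, and `set_stream(Stream, representation_errors(unicode))` or `representation_errors(prolog)` makes the corresponding choice for characters a stream cannot represent.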
