Print vs quoted output of Unicode characters

peter.ludemann · June 22, 2022, 6:34pm

There are two ways of escaping some Unicode characters: \u and \x.
For output, the quoted form prefers \x and the portray form prefers \u:

?- X = '\x2', Y = '\u0002',
   atom_codes(X, Codes), atom_chars(X, Chars),
   format('quoted(~q) canonical(~k) atom(~a) print(~p)~n', [X, X, X, X]).
quoted('\x2\') canonical('\x2\') atom() print('\u0002')
X = Y, Y = '\u0002',
Codes = [2],
Chars = ['\u0002'].

Is there a reason for quoted output preferring the \x form and portray preferring the \u form?
From reading the section of the manual on character escape sequences, I infer that the \u form isn’t in the ISO standard.

Also, I don’t understand the sentence “where \x defines a numeric character code, it doesn’t specify the character set in which the character should be interpreted”. What character set is being referred to, and how does the \u notation fix this problem?

j4n_bur53 · June 22, 2022, 7:26pm

It could be an IBM EBCDIC code, if the Prolog processor
character set is IBM EBCDIC. From the ISO Core Standard:

[Image: screenshot of an excerpt from the ISO Core Standard]

To that extent, it would be possible for a Prolog processor to
support more decimal digits than just the Latin decimal digits,
since the standard says:

[Image: second screenshot of an excerpt from the ISO Core Standard]

But something tells me the ISO core standard didn’t have this
use case in mind. It was rather thinking of atoms and variables, I guess.
On more or less the same page:

[Image: third screenshot of an excerpt from the ISO Core Standard]

Edit 22.06.2022:
In the above terminology, a Prolog processor that supports
more character codes than those listed in 6.5 supports extended
characters. This holds for Prolog processors that support Unicode.

Each Unicode glyph that is not from 6.5 would be an extended
character. But then a Prolog processor usually also supports
the Unicode code point collation, which is a further ingredient
as per 6.6, i.e. the numbering of the glyphs as per Unicode.
If the latter is the case, then \xXXXX\ and \uXXXX say the same.
But I wonder whether there has ever been a Prolog system where
\uXXXX was coded differently internally, because its 6.6 numbering
was not the Unicode code point numbering. Maybe there is some
such Prolog system?
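
The claim that both notations say the same can be checked directly. The following is only a minimal sketch, assuming SWI-Prolog (where \u is accepted as a non-ISO extension); the predicate name same_escape/0 is made up for illustration, and other systems may reject the \u form or number their characters differently.

```prolog
% Sketch for SWI-Prolog (an assumption; \u is not part of the ISO standard).
% same_escape/0 is a hypothetical helper name, not a library predicate.
% It succeeds if both escape notations are read as the same atom.
same_escape :-
    X = '\x41\',            % ISO hexadecimal escape, closed by a backslash
    Y = '\u0041',           % fixed four-hex-digit escape
    X == Y,                 % both denote the same atom
    atom_codes(X, [0'A]).   % whose single character code is that of 'A'
```

If the system's internal numbering were not the Unicode code point numbering, a check like this is where a difference between the two notations could show up.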

j4n_bur53 · June 25, 2022, 9:35am

What would make sense is an ISO core standard working
group that would draft these stream creation properties:

bom(Bool)
Specify detecting or writing a BOM.
encoding(Atom)
Specify a file encoding.
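
For comparison, SWI-Prolog already accepts options with these names in open/4. The sketch below assumes SWI-Prolog; the predicate names write_utf8_with_bom/2 and read_with_bom_check/3 are hypothetical helpers, and it does not claim that a future ISO draft would look like this.

```prolog
% Sketch assuming SWI-Prolog's open/4 options encoding/1 and bom/1.
% write_utf8_with_bom/2 and read_with_bom_check/3 are hypothetical names.

% Write Atom as a quoted term to File, encoded as UTF-8 with a leading BOM.
write_utf8_with_bom(File, Atom) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8), bom(true)]),
        format(Out, '~q.~n', [Atom]),
        close(Out)).

% Read the term back. bom(true) asks the system to look for a BOM and,
% if one is found, to set the stream encoding from it.
read_with_bom_check(File, Term, Encoding) :-
    setup_call_cleanup(
        open(File, read, In, [bom(true)]),
        ( stream_property(In, encoding(Encoding)),
          read(In, Term) ),
        close(In)).
```

A round trip such as writing 'caf\u00E9' and reading it back should then return the same atom, with the encoding taken from the BOM rather than from the reader's locale.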

After all, it is already 2022, and Prolog is 50 years old. But
can we be sure that Prolog texts are exchangeable if
they use Unicode code points?

What if a UTF-16 file, handy for CJK, comes along?

Edit 25.06.2022:
I feel I don’t have the relevant experience. I will now redo this stuff
for Dogelog Player, after having had it in formerly Jekejeke Prolog, and
I am still discovering new corners. Maybe I can tell more in a few years.

Also, experience from formerly Jekejeke Prolog is that things
might look different for UrlConnection, since the server might do
the BOM detection so that the client doesn’t need to. I guess SWI-Prolog
also has some experience here through http_open/3.
