Careful with surrogates

j4n_bur53 · July 8, 2022, 8:47pm

Strange. According to Unicode the surrogates are only unassigned
characters but not illegal code points. They are legal code points.
But the release notes say they are illegal code points:

Anyway with the latest version on Windows I get a re-read problem:

?- X = "abc💐def💐ghi".
X = "abc\uD83D\uDC90def\uD83D\uDC90ghi".

?- X = "abc\uD83D\uDC90def\uD83D\uDC90ghi".
ERROR: Syntax error: Illegal character code
ERROR: X = "abc
ERROR: ** here **
ERROR: \uD83D\uDC90def\uD83D\uDC90ghi" .

So what is shown as an answer substitution in the Windows
top-level cannot be pasted back and used. You see the
problem here. Strange that I can paste the string using
no escapes in the first query. But then it is shown with escapes
in the answer, and subsequent usage fails. The error message
says something different from the release notes; here it is "character":

P.S.: I got the idea that it's a legal code point and an unassigned character
from ISO_IEC_10646_2020, which shows me. Some encodings might
of course forbid it, but inside Unicode they are not really forbidden as code points.
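
The distinction made here (a surrogate is a code point, yet not something the encoding forms will carry) can be checked outside Prolog. A minimal sketch in Python, used only as a convenient stand-in; it says nothing about how SWI-Prolog classifies these values, and the helper name `encodable` is made up for illustration. Note that Python's Unicode database reports surrogates under their own general category, Cs, rather than Cn (unassigned).

```python
import unicodedata

high = "\ud83d"  # a lone high surrogate, U+D83D

# General category of the surrogate code point: "Cs" (surrogate).
category = unicodedata.category(high)

def encodable(text, encoding):
    """Hypothetical helper: True if `text` encodes in strict mode."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(category)
print(encodable(high, "utf-8"))          # strict UTF-8 rejects a lone surrogate
print(encodable(high, "utf-16-le"))      # so does strict UTF-16
print(encodable("\U0001F490", "utf-8"))  # the bouquet itself is fine
```

So the value is addressable as a code point, but the strict encoders refuse it, which is consistent with both readings in this post.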

j4n_bur53 · July 8, 2022, 9:00pm

On the other hand, on Windows with the latest SWI-Prolog 8.5.14,
it seems that sub_atom/5 has no problems with the surrogate pairs.
Interestingly, this works fine, and the result, an atom, is not escaped:

/* SWI-Prolog 8.5.14 */
?- X = "abc💐def💐ghi", sub_atom(X, 2, 3, _, Y).
X = "abc\uD83D\uDC90def\uD83D\uDC90ghi",
Y = 'c💐d'.

Maybe the problem is that strings in SWI-Prolog are escaped
with \u, so the result differs from atoms, where
currently no escaping happens. All fine for atoms.

Edit 08.07.2022:
Funny observation: the new Ciao WASM playground gives
UTF-8 bytes; it seems their atom_codes/2 doesn’t know Unicode so well:

/* Ciao WASM */
?- X = 'Flower 💐', atom_codes(X,L).
L=[70,108,111,119,101,114,32,240,159,146,144],X=Flower 💐 ?
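
The trailing four numbers in that list are the UTF-8 bytes of the bouquet character, not a code point. A small Python sketch (Python is only a stand-in here, not what Ciao does internally) that reproduces both views of the same text:

```python
text = "Flower \U0001F490"  # same text as the Ciao query, emoji written as an escape

# Byte-level view: one list element per UTF-8 byte.
utf8_bytes = list(text.encode("utf-8"))

# Code-point view: one list element per Unicode code point.
code_points = [ord(ch) for ch in text]

print(utf8_bytes)
print(code_points)
```

The byte view ends in 240, 159, 146, 144, as in the Ciao answer; the code-point view ends in the single value 128144.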

In SWI-Prolog, atom_codes/2 also works fine for the example which doesn’t work
fine in string printing. Although the string shows surrogate
pairs in printing, which cannot be reread, the atom_codes/2
result has no such pairs:

/* SWI-Prolog 8.5.14 */
?- X = "abc💐def💐ghi", atom_codes(X, L).
X = "abc\uD83D\uDC90def\uD83D\uDC90ghi",
L = [97, 98, 99, 128144, 100, 101, 102, 128144, 103|...].

Maybe the solution is to escape it with \U instead of two \u,
or else do it as in atoms and not escape it. With both solutions
you can also keep the illegal character error during read.
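
For reference, the relation between the code 128144 and the pair \uD83D\uDC90 is plain UTF-16 arithmetic. A minimal Python sketch; the function names are made up for illustration, and the two escape spellings are only built as text, with no claim about what any Prolog system prints or accepts:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(128144)
two_short_escapes = "\\u%04X\\u%04X" % (high, low)  # the form shown in the answer
one_long_escape = "\\U%08X" % 128144                # the form suggested above

print(two_short_escapes)
print(one_long_escape)
print(from_surrogate_pair(high, low))
```

The single long escape names the code point directly, so a reader never has to see a surrogate value.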

jfmc · July 9, 2022, 10:10am

Hi all,

Ciao has partial support for UTF-8: it allows it in identifiers (so you can write programs with UTF-8 syntax) but codes are still represented as bytes (for compatibility with some legacy programs that do not expect codes outside the 0-255 range). Meanwhile, translation between bytes<->codes can be done explicitly with other predicates. We are open to adopting whatever makes sense and reusing any existing test suite (especially if it is coordinated with other Prolog systems).

Cheers,

jan · July 9, 2022, 3:04pm

That is a hard point. Considering that UTF-16 cannot represent the surrogate code points (unlike UTF-8, which can represent all code points), we cannot represent these code points in the Windows version and thus must disallow creating atoms or strings from them. If we allow these code points, UTF-16 encoded streams will not be able to handle them. So, surely on Windows there is little choice. On other platforms we could facilitate them. For consistency, and considering they most likely will remain unassigned forever, I decided to exclude them on all platforms. If there is a good reason to allow them anyway on non-Windows we can change that.
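
The representability problem can be seen in a few lines of Python (a stand-in only, not SWI-Prolog internals). The sketch assumes Python's `surrogatepass` error handler, which is a Python-specific escape hatch for pushing surrogates through the encoders; strict UTF-8 rejects them too, so this only exposes the raw layouts.

```python
pair = "\ud83d\udc90"  # two separate surrogate code points, length 2

# UTF-16: the two code units are indistinguishable from an encoded U+1F490,
# so decoding yields ONE character and the original text is lost.
utf16 = pair.encode("utf-16-le", "surrogatepass")
back16 = utf16.decode("utf-16-le")

# UTF-8 layout: each surrogate gets its own 3-byte sequence and survives.
utf8 = pair.encode("utf-8", "surrogatepass")
back8 = utf8.decode("utf-8", "surrogatepass")

print(len(pair), len(back16), len(back8))
print(utf16.hex(), utf8.hex())
```

With 16-bit units there is no way to tell "two surrogate code points" from "one supplementary character", which is the reason given above for disallowing them.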

The code for writing strings bypassed the routines that correctly deal with printing UTF-16 content. Fixed. Should be in tomorrow's daily build.

You might be lucky. My decision to go for mixed char/wchar_t when Unicode support became necessary was probably wrong. My 0.01$ advice would be to make the system fully 8-bit transparent (if this is not already the case) and decide that this representation is UTF-8. Then you need recoding in I/O and adjusting all the code that splits atoms (atom_codes/2, atom_chars/2, sub_atom/5, etc.) and you need stuff to convert to OS names such as file names, etc. I’d strongly recommend making everything internally Unicode based and fully transparent. Please avoid requiring the user to do codes<->bytes translations explicitly. As I understand it (please correct me otherwise), explicit conversions were also part of Python < 3, to be corrected in Python 3 at a high price.
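
Why the text-splitting code must count code points rather than bytes can be sketched with the example string from earlier in the thread. This is Python as a stand-in; the slice mimics sub_atom(X, 2, 3, _, Y), and nothing here describes how any Prolog system stores text.

```python
text = "abc\U0001F490def\U0001F490ghi"

# Offset 2, length 3, counted in code points: a clean substring.
by_code_point = text[2:5]

# The same offsets applied to the UTF-8 bytes cut the emoji in half.
raw = text.encode("utf-8")
by_byte = raw[2:5]

try:
    by_byte.decode("utf-8")
    byte_slice_is_valid = True
except UnicodeDecodeError:
    byte_slice_is_valid = False

print(ascii(by_code_point))
print(by_byte, byte_slice_is_valid)
```

If the recoding happens once at the I/O boundary, user code only ever sees the first kind of slice.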

That route makes the semantics for the required changes to the built-in predicates for text manipulation trivial. The main point of discussion is how to expand the ASCII syntax into the full Unicode domain. Roughly, this is the status:

Unquoted identifiers are fairly trivial as Unicode defines ID_START and ID_CONTINUE for that.
Unicode also defines the set of uppercase characters to distinguish variables from atoms. For scripts that have no notion of case we need some rules like _var and __var for what is normally _Var.
As you may have picked up, SWI-Prolog and Dodgelog now read decimal numbers from other scripts as if they were Arabic digits. There could be good arguments against this?
White space is also well defined and all handled as normal Prolog white space.
Notably the group defining symbols that glue together (like in ASCII =..) is a bit harder. Right now I consider the main categories S* (symbols) and P* (punctuation) symbol characters. Possibly all or some could better be considered solo characters because, while it is normal for ASCII to combine symbols (=<), the Unicode world often has a proper stand-alone symbol for everything.
When producing quoted output, all control and unassigned code points are written as \uXXXX (it would be great if that could be accepted by all living Prolog systems rather than \xXXXX\ which is, AFAIK, only used by Prolog).
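
The classification in the list above can be approximated with a general-purpose Unicode database. A rough Python sketch, under stated assumptions: `classify` is a made-up name, `str.isidentifier()` tests XID_Start (close to, but not the same as, ID_START), and ASCII special cases such as parentheses and commas are ignored. It is not the SWI-Prolog tokenizer.

```python
import unicodedata

def classify(ch):
    """Rough token class of a single character, following the list above."""
    cat = unicodedata.category(ch)
    if ch == "_" or cat == "Lu":
        return "variable start"          # uppercase letter or underscore
    if ch.isidentifier():
        return "atom start"              # XID_Start, includes caseless scripts
    if cat == "Nd":
        return "digit %d" % unicodedata.decimal(ch)  # decimal digit, any script
    if ch.isspace():
        return "white space"
    if cat[0] in "SP":
        return "symbol"                  # S* and P* treated as symbol characters
    return "other"

# X, lambda, CJK ideograph, Arabic-Indic three, rightwards arrow, space
for ch in ["X", "\u03bb", "\u4e2d", "\u0663", "\u2192", " "]:
    print(ascii(ch), classify(ch))
```

A caseless letter such as U+4E2D lands in "atom start", which is why the list needs an extra convention like _var for variables in such scripts.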

AFAIK SICStus only allows for Unicode in quoted strings (?). IMO we should allow other scripts to be used for programming as well.

Anyone with an overview of the current status in various Prolog systems?
