Unicode code point U+10000 on Windows: using it in a string results in "Illegal character code"

Using SWI-Prolog (threaded, 64 bits, version 8.1.22) on Windows 10.

Unable to create Unicode code point U+10000 in a string.

Based on the documentation section Character Escape Syntax, the character a can be written in a string either directly or with any of the escape variations:

?- C = "a".
C = "a".

?- C = "\x61".
C = "a".

?- C = "\x000061".
C = "a".

?- C = "\141".
C = "a".

?- C = "\u0061".
C = "a".

?- C = "\U00000061".
C = "a".

U+FFFF also works:

?- C = "\xFFFF".
C = "".

?- C = "\x00FFFF".
C = "".

?- C = "\uFFFF".
C = "".

?- C = "\U0000FFFF".
C = "".

However, U+10000 does not work:

?- C = "\x010000".
ERROR: Syntax error: Illegal character code
ERROR: C = "\
ERROR: ** here **
ERROR: x010000" . 
?- C = "\U00010000".
ERROR: Syntax error: Illegal character code
ERROR: C = "
ERROR: ** here **
ERROR: \U00010000" . 
?- 

Changing the encoding with the Prolog flag does not change the result either.

?- current_prolog_flag(encoding,Encoding).
Encoding = text.

?- set_prolog_flag(encoding,utf8).
true.

?- current_prolog_flag(encoding,Encoding).
Encoding = utf8.

?- C = "\x010000".
ERROR: Syntax error: Illegal character code
ERROR: C = "\
ERROR: ** here **
ERROR: x010000" . 
?- C = "\U00010000".
ERROR: Syntax error: Illegal character code
ERROR: C = "
ERROR: ** here **
ERROR: \U00010000" .  

Is this because it is being done on Windows 10, or is it something else?


EDIT

I also tried changing the encoding of the stream user_input, but that resulted in a permission error.

?- stream_property(user_input,encoding(Encoding)).
Encoding = wchar_t.

?- set_stream(user_input,encoding(utf8)).
ERROR: No permission to encoding stream `user_input'
ERROR: In:
ERROR:   [10] set_stream(user_input,encoding(utf8))
ERROR:    [9] <user>

This is related to adding test cases for strings with this post.


Personal Notes

Surrogates and Supplementary Characters


Possible solutions to enable SWI-Prolog to work with all the Unicode code points

  1. Convert SWI-Prolog internal representation from wchar_t to something larger (64 bits) to hold all of the Unicode code points.
  2. Don’t convert the Unicode to code points but leave them as an encoding such as UTF-8, or UTF-16 and create predicates to work with them at the Prolog level, or C functions to work with the encoding at the lower level. Example: How to make python 3 print() utf8
  3. A temporary solution. When converting from an encoding such as UTF-8 to a code point, convert the code points larger than U+FFFF to U+FFFD, which is the Unicode replacement character.
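
A minimal sketch of the third option, written in Python purely for illustration (this is not SWI-Prolog code, and the helper name `clamp_to_bmp` is hypothetical). It assumes the text has already been decoded into code points and simply substitutes U+FFFD for anything above U+FFFF:

```python
REPLACEMENT = "\uFFFD"  # Unicode replacement character


def clamp_to_bmp(text):
    """Hypothetical helper: replace every code point above U+FFFF with U+FFFD."""
    return "".join(ch if ord(ch) <= 0xFFFF else REPLACEMENT for ch in text)


# U+10000 is replaced, the surrounding characters are kept.
print([hex(ord(c)) for c in clamp_to_bmp("a\U00010000b")])  # ['0x61', '0xfffd', '0x62']
```

The substitution is lossy, so it would only be suitable as a stopgap where a syntax error is worse than losing the original character.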

It is a long discussion that has been repeated several times :frowning: The bottom line is that Unicode text is represented internally as wchar_t arrays, which I once hoped could hold any Unicode value. Alas, on Windows wchar_t is a 16-bit unsigned type (unsigned short) meant to hold UTF-16, which means we cannot simply consider it an array of character codes. SWI-Prolog uses it as UCS-2, which means we cannot represent anything above 0xffff.
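
To make the 16-bit limit concrete, here is a small Python sketch (illustrative only, not taken from SWI-Prolog; the function name `to_surrogate_pair` is an assumption) showing that UTF-16 has to split U+10000 into two 16-bit units. A system that treats each 16-bit unit as one character code, as UCS-2 does, has no single slot for such a code point.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
units = struct.unpack("<2H", "\U00010000".encode("utf-16-le"))
print(units == (high, low))  # True
```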

At some point this needs to change. I’ve recently discussed the options with Matt Lilley. The bottom line is that there is no easy way out. There are a number of options, each requiring a different amount of work and offering a different degree of future-proofing. This waits for someone dedicated enough to get it done.


For the not so timid.

If you are like me and regularly use Windows, but want full Unicode support in SWI-Prolog without leaving Windows, then you can:

  1. Install WSL with Ubuntu on Windows
  2. Install SWI-Prolog via the PPA

Versions:
Windows: 10.0.18362 N/A Build 18362
WSL: 1
Ubuntu: 18.04.2 LTS
SWI-Prolog: (threaded, 64 bits, version 8.1.22)

?- stream_property(user_input,encoding(Encoding)).
Encoding = utf8.

?- C = "\U00010000".
C = "𐀀".

@jan

As noted in my previous reply, I installed SWI-Prolog on Ubuntu on WSL on Windows 10, and there the query works:

?- C = "\U00010000".
C = "𐀀".

So does this mean that SWI-Prolog can do full Unicode on a Linux OS but not the Windows OS? The previous discussion was not clear if this was specific to an OS.

Yes. On all other systems I’m aware of, wchar_t is a 32-bit integer and thus capable of representing all Unicode code points. To give Microsoft some credit, they were probably among the first to support early Unicode, when there were only 2^16 code points. They designed two interfaces, one using 8-bit chars and one using 16-bit chars where, AFAIK, NT and up internally use 16-bit chars. When Unicode grew beyond 2^16 code points they had a problem: introduce a 32-bit char API or use a variable-length encoding. They opted for the latter, adding surrogate pairs. Unix came a bit later in this process and moved to variable-length encodings based on 8-bit chars (now mostly UTF-8) for communication/storage and the 32-bit UCS-4 encoding internally for applications that want uniform arrays of code points.
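
A quick way to check the wchar_t width on a given machine, sketched in Python with the standard `ctypes` module (illustrative only; it reports the C compiler's wchar_t as seen by the Python build, which is assumed to match the platform default). On Windows this should report 16 bits, and on Linux and macOS typically 32.

```python
import ctypes

# Size of the platform's wchar_t in bits, as exposed by ctypes.
width_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print(f"wchar_t is {width_bits} bits on this platform")

# Largest value a single wchar_t could hold if treated as an unsigned code.
max_fixed = (1 << width_bits) - 1
print("U+10000 fits in one wchar_t:", 0x10000 <= max_fixed)
```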

At this moment SWI-Prolog is internally hybrid, using arrays of 8-bit chars for ISO Latin-1 (the first 256 Unicode chars) and wchar_t for atoms/strings that contain one or more chars > 255. Roughly, the options are:

  • Use wchar_t as UTF-16 on Windows and make sure all predicates such as atom_codes, sub_atom, etc. can deal with that. This also requires changes to I/O to read/write UTF-16. We can easily use the native Windows API, though.
  • Use 32-bit integers internally on Windows as well. This requires replacing all the C99 wchar functions we use, as on Windows these work on UTF-16. It also requires conversions for all the calls to the Windows API.
  • Do what several modern languages seem to do: use only UTF-8 internally. Then everything becomes uniform, but as with the first option we have to rewrite a lot. For Unix we can in most cases talk natively to the OS, but for Windows we need to translate everything to/from UTF-16.
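
As a rough illustration of what the first option implies, here is a Python sketch (not SWI-Prolog code; both function names are hypothetical) of a length computation over UTF-16 code units. It assumes well-formed UTF-16 and counts code points by skipping low surrogates, which is the kind of adjustment a length or indexing predicate would need once the storage is no longer one unit per character.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def code_point_length(units):
    """Count code points in well-formed UTF-16 by not counting low surrogates."""
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


units = utf16_units("a\U00010000b")
print(len(units))                # 4 code units
print(code_point_length(units))  # 3 code points
```

The same mismatch between storage units and characters shows up with UTF-8 storage, only with 8-bit units and sequences of up to four bytes.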

The only good news is that the introduction of wide strings was quite late and therefore all the stuff dealing with these issues is fairly well centralized.

