Ann: SWI-Prolog 10.1.7

Dear SWI-Prolog user,

SWI-Prolog 10.1.7 is ready for download. This release lands the
long-running overhaul of Unicode handling — both in the
source-language syntax and in the editor/terminal rendering — together
with a handful of behaviour changes to align with the Unicode
recommendations for programming languages. A few of these change
observable defaults; users who rely on the old behaviour should read
the highlights below before upgrading.

Status

Unicode handling is under discussion in the PIP working group. The
current implementation reflects where I think the consensus is
heading. SWI-Prolog will follow the decisions of the PIP group, which
implies that additional incompatible changes are not unlikely. Note that
these merely touch edge cases. Notably

  • Classification of solo characters.
  • Handling of bracket pairs (Ps/Pe) and quote pairs (Pi/Pf).
    Notably the term-representation of quote pairs is new and likely to
    be debated. These features may also be restricted using Prolog flags
    and/or options.
  • Handling of non-ASCII digits is not supported by number_codes/2, etc.
    This may change.

Highlights

  • Unicode source syntax now follows UAX #31 (XID_Start / XID_Continue),
    with super-/subscript-digit profile additions for variables and atoms.
  • Default source encoding on Windows is now UTF-8 (was the legacy ANSI
    codepage).
  • Pluggable Unicode normalisation using the Prolog flag unicode_atoms, and
    Unicode bidi-override / isolate code points are now unconditionally
    rejected in source as a Trojan-source defence (CVE-2021-42574).
  • xpce editor and terminal now render and edit NFD combining marks,
    CJK wide glyphs, and supplementary-plane code points (emoji, CJK
    Extensions B–G) correctly on every platform.

Unicode source syntax

The classifier follows the Unicode release reported by the new read-only
Prolog flag unicode_syntax_version (currently '17.0.0'). See section
Unicode Prolog source in the manual for a worked example. Changes:

  • Identifiers are XID_Start followed by XID_Continue per UAX #31,
    with the super- and subscript digits (², ³, ¹, ⁰..⁹, ₀..₉) added as
    XID_Continue so variables like X₁ work.
  • Variable vs. atom: a token is a variable iff it starts with _ or
    a code point in general category Lu. MODIFIED: titlecase
    letters (Lt, e.g. Dž) now start an atom, not a variable; this
    differs from earlier releases that used the broader derived
    Uppercase property.
  • MODIFIED: All Unicode symbol classes (Sm, Sc, Sk, So)
    and the connector/dash/other-punctuation classes (Pc, Pd, Po)
    are now treated as solo: each forms an atom on its own and does
    not glue with adjacent symbols. This is a deliberate
    break from earlier releases, in which Unicode symbols glued into
    compound atoms in the same way as ASCII symbols.
  • MODIFIED: NBSP (U+00A0) is no longer treated as whitespace.
    Outside quoted material it raises a stray-character syntax error.
    The layout set is now exactly Unicode Pattern_White_Space.
  • Bracket pairs (Ps/Pe) and quote pairs (Pi/Pf) are recognised
    as reader-level operators. An opening character followed by a
    Prolog term and the matching closing character reads as a unary
    compound '<open><close>'(Term), generalising the existing
    {Term} ⇒ '{}'(Term) form to the full Unicode Ps/Pe set.
    Quote pairs read literal text in the form selected by
    double_quotes; e.g. «hello, world» reads as
    '«»'("hello, world").
  • Seven Pattern_White_Space code points now act as line terminators
    everywhere (LF, VT, FF, CR, NEL, LS, PS). FIXED: %-comments,
    line-counter, and \<newline> continuation were silently ignoring
    NEL (U+0085), LS (U+2028), PS (U+2029) as well as CR/VT/FF.
  • In source code (read_term/2), numeric literals still use ASCII
    digits 0–9 only. Conversion via atom_number/2, number_codes/2 and
    number_chars/2 additionally accepts any Unicode Nd block; in a
    single number all digits must come from the same block.
  • New code_type/2 / char_type/2 categories prolog_layout (the eleven
    Pattern_White_Space code points) and prolog_end_of_line (the seven
    line-terminator-like ones). paren(Close) and the new
    quote(Close) cover the full Unicode Ps/Pe and Pi/Pf sets.
  • unicode_block/3 updated to Unicode 17.0.
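
The classification rules in the list above can be approximated outside
Prolog. What follows is a minimal Python sketch, not SWI-Prolog's
implementation. The helper name classify_start is hypothetical, the
Unicode tables are those bundled with the Python interpreter (possibly
older than 17.0), XID_Start is approximated by the letter categories, and
it assumes that ASCII characters keep their traditional Prolog classes,
so only non-ASCII symbols are reported as solo.

```python
import unicodedata

# The eleven Pattern_White_Space code points (the layout set).
PATTERN_WHITE_SPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85,
                       0x200E, 0x200F, 0x2028, 0x2029}
# The seven that act as line terminators: LF, VT, FF, CR, NEL, LS, PS.
END_OF_LINE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}
# General categories the announcement lists as solo.
SOLO_CATEGORIES = {"Sm", "Sc", "Sk", "So", "Pc", "Pd", "Po"}


def classify_start(ch):
    """Rough class of the first character of a token (hypothetical helper)."""
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cp in PATTERN_WHITE_SPACE:
        return "layout"
    if ch == "_" or cat == "Lu":
        return "variable"
    if cat in ("Ll", "Lt", "Lm", "Lo", "Nl"):
        return "atom"      # approximation of XID_Start minus Lu
    if cp > 0x7F and cat in SOLO_CATEGORIES:
        return "solo"      # assumption: ASCII symbols are handled elsewhere
    return "other"


if __name__ == "__main__":
    for ch in ["X", "\u01C5", "\u2261", "\u20AC", "\u00A0", "\u0085"]:
        print("U+%04X" % ord(ch), unicodedata.category(ch), classify_start(ch))
```

Under these assumptions the titlecase letter U+01C5 is reported as an atom
start and NBSP (U+00A0) falls outside the layout set, as described above.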

Unicode atoms / normalisation

  • ADDED: Unicode handling is guided by the 4-valued option
    unicode_atoms. The values are accept (default),
    nfc, error, and reject. It follows the same three-tier
    hierarchy as encoding: Prolog flag → stream property → per-call
    option of read_term/2,3, read_clause/2,3, set_stream/2 and open/4.
  • Modes nfc and error auto-load library(unicode) at every entry
    point (previously only the flag’s setter did so).
  • Mode error falls back to a wcwidth-based check when no
    normalisation hook is registered, which can over-reject scripts
    (e.g. Thai) that use combining marks in NFC. A new read-only
    atom_normalize_hook flag tells you whether the precise check is
    available.
  • Unicode bidi-override and isolate code points (U+202A..U+202E and
    U+2066..U+2069) are unconditionally rejected in source tokens,
    quoted strings, and comments as a defence against the Trojan-source
    attack (CVE-2021-42574).
  • Loading library(unicode) now installs an NFC normaliser into the
    kernel via PL_atom_normalize_hook(), which makes unicode_atoms(nfc)
    work without further setup.
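
To illustrate why normalisation and the bidi check matter, here is a small
Python sketch. It does not model the unicode_atoms modes themselves; the
helper names to_nfc, is_nfc and has_bidi_control are hypothetical and only
show the two underlying checks, using Python's own Unicode tables.

```python
import unicodedata

# U+202A..U+202E (embeddings and overrides) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def to_nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text):
    """True if text is already in NFC."""
    return to_nfc(text) == text


def has_bidi_control(text):
    """True if text contains a bidi override or isolate code point."""
    return any(ord(ch) in BIDI_CONTROLS for ch in text)


if __name__ == "__main__":
    decomposed = "cafe\u0301"      # 'e' followed by COMBINING ACUTE ACCENT
    composed = "caf\u00e9"         # precomposed 'é'
    print(decomposed == composed)            # two spellings, different atoms
    print(to_nfc(decomposed) == composed)    # the same after NFC
    print(has_bidi_control("x\u202Ey"))      # would be rejected in source
```

Without normalisation the two spellings of the same visible word are
distinct code point sequences, which is the problem that NFC addresses.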

Windows

  • MODIFIED: Default source encoding on Windows is now UTF-8. The
    system locale on Windows usually reports a legacy codepage
    (Windows-1252 or similar), which used to make the default Prolog
    encoding flag ANSI/Latin-1 and caused UTF-8 source files to be
    read as their byte-wise Latin-1 interpretation.
  • FIXED: init_locale also sets LC_CTYPE to UTF-8. Without
    this, mbrtowc/wcrtomb stayed ASCII-only and any UTF-8 byte above
    0x7F flowing through PL_canonicalise_text returned EILSEQ —
    surfacing as Syntax error: illegal_multibyte_sequence the moment
    a user typed an emoji or other non-ASCII char in the libedit prompt.
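
The effect of reading UTF-8 source under a legacy single-byte encoding can
be reproduced in a few lines of Python. This is only an illustration of the
byte-wise Latin-1 interpretation mentioned above; it uses Latin-1 rather
than Windows-1252 because Latin-1 maps every byte to a character.

```python
# A UTF-8 encoded source line containing non-ASCII characters.
source = "caf\u00e9 \U0001F602"
raw = source.encode("utf-8")

# Correct: decode the bytes as UTF-8.
as_utf8 = raw.decode("utf-8")

# Legacy behaviour: each byte becomes one Latin-1 character.
as_latin1 = raw.decode("latin-1")

if __name__ == "__main__":
    print(as_utf8)      # the original text
    print(as_latin1)    # mojibake: one character per UTF-8 byte
    print(len(source), len(raw), len(as_latin1))
```

The Latin-1 reading always "succeeds", which is why the problem used to
surface as wrong atoms rather than as a read error.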

C API

  • ADDED: PL_is_id_start, PL_is_id_continue, PL_is_uppercase,
    PL_is_decimal, PL_is_layout. Thin shims over the Unicode flag
    table so foreign extensions and embedded toolkits classify code
    points exactly as SWI-Prolog does, without needing the
    locale-dependent POSIX iswX() family.
  • ADDED: PL_wcwidth(int). Locale-portable display-column width
    for a Unicode code point — same answer as the rest of SWI-Prolog,
    regardless of the process LC_CTYPE or platform. On Unix/macOS it
    forwards to system wcwidth(3); on Windows it goes through the
    bundled mk_wcwidth() table.
  • ADDED: foreign_library_property/2 to query foreign library
    properties.
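
The idea behind a locale-independent column width can be sketched in
Python. This is not the PL_wcwidth() implementation. The helper names
display_width and column_of are hypothetical, and the rule used here (0 for
non-spacing and enclosing marks, 2 for East Asian wide or fullwidth, 1
otherwise) is a simplification that ignores control characters, tabs and
zero-width format characters.

```python
import unicodedata


def display_width(ch):
    """Approximate number of terminal columns for one code point."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0                      # combining marks take no column
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # CJK wide and fullwidth forms
    return 1


def column_of(text):
    """Visual column reached after writing text from column 0."""
    return sum(display_width(ch) for ch in text)


if __name__ == "__main__":
    for s in ["abc", "e\u0301", "\u6f22\u5b57", "a\u6f22e\u0301"]:
        print(repr(s), len(s), column_of(s))
```

Counting code points instead of columns gives the wrong caret position for
exactly the CJK and NFD cases that the editor fixes address.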

Editor and terminal (xpce)

  • FIXED: Editor column functions now account for CJK wide chars
    (each counts as 2 columns) and combining marks (each counts as 0).
    getColumnEditor() and getColumnLocationEditor() previously
    counted every non-tab character as one visual column, breaking caret
    positioning after CJK double-width and NFD content.
  • FIXED: Forward/backward char and delete now move by grapheme
    cluster, not code point. For NFD text a visible character may span
    multiple code points; the four basic editing operations previously
    stepped one code point at a time, leaving the caret stranded inside
    a cluster or deleting only part of a grapheme.
  • FIXED: Selection drag no longer corrupts text at grapheme
    cluster boundaries (Thai above-base vowel signs etc.).
  • FIXED: Selection endpoints snap to grapheme cluster boundaries.
  • FIXED: Terminal paint_chunks splits per cluster for non-ASCII
    text. When a system fixed-width font has no glyphs for a script
    (e.g. Thai), Pango falls back to a proportional font and shapes the
    run at its natural advance — so a selection within such a line no
    longer drifts horizontally.
  • FIXED: paint_line renders Unicode highlights without visual
    drift.
  • FIXED: Selection in display->copy sets both primary and
    secondary selection.
  • ADDED: nfd_style instance variable for visual NFD cluster
    highlighting on Editor and TerminalImage. Default @nil (disabled).
  • MODIFIED: auto_copy class variable on text_item, editor
    and terminal_image defaults to @off on Windows and macOS and
    @on elsewhere.
  • FIXED (Windows): supplementary-plane code points (emoji,
    CJK Extensions B–G) no longer render as their PUA-area low 16 bits
    or .notdef glyphs. Several wint_t → uchar_t fixes; charW
    now falls back to uint32_t when wchar_t is too narrow; the
    text-buffer file format now round-trips SMP correctly; c_width,
    uchar_display_width, paint_line, regex CHR width and
    ENC_WCHAR stream paths all honour full code points.
  • ENHANCED: SDL backend Windows font specs now cover CJK,
    symbol/emoji, Thai/Lao (Leelawadee UI) and Yi syllables (Microsoft
    Yi Baiti) in the mono/sans/serif fallback chains. Korean, Japanese,
    Chinese and many symbols rendered as tofu before.
  • ENHANCED: font ->member walks the Pango fallback chain by
    default, so it returns @on for code points the system can render
    via fallback (e.g. U+1F600). An optional family=[bool] argument
    forces the old “primary font only” check.
  • ENHANCED: font <-domain likewise unions Pango fontset coverage.
  • ENHANCED: display-level methods to control and observe SDL’s
    on-screen-keyboard policy from Prolog.
  • FIXED: PceEmacs class menu (raised a type error), XPCE native
    finder icons, thread monitor icons.
  • FIXED: Icons for the thread monitor (use new SVG icons).
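
Moving and deleting by grapheme cluster rather than by code point can be
sketched as follows. This is a deliberately simplified Python model, not
the xpce code and not the full UAX #29 algorithm: a cluster is taken to be
a base character plus any following combining marks (categories Mn, Mc,
Me), and the helper names clusters and delete_backward are hypothetical.

```python
import unicodedata


def clusters(text):
    """Split text into simplified grapheme clusters."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            out[-1] += ch             # attach the mark to the previous base
        else:
            out.append(ch)
    return out


def delete_backward(text):
    """Delete the last cluster, never just part of it."""
    return "".join(clusters(text)[:-1])


if __name__ == "__main__":
    nfd = "cafe\u0301"                # NFD spelling of 'café'
    print(clusters(nfd))              # the accent stays with its 'e'
    print(repr(delete_backward(nfd))) # removes 'e' and the accent together
    thai = "\u0e01\u0e34"             # KO KAI with above-base vowel SARA I
    print(len(thai), len(clusters(thai)))
```

Stepping one code point at a time would stop between the base and its mark,
which is the stranded-caret situation described in the list above.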

libedit

  • ADDED: el_get/2 to query editline properties. Initially
    supports editor(?Editor) (emacs / vi). Unknown properties
    raise domain_error(editline_property, _).
  • ADDED: bracketed_paste(?Boolean) property on el_set/2 and
    el_get/2. Tracked per el_context and toggled at runtime; skipped
    (and unbound) in vi mode.
  • ADDED: ^Z is bound to send EOF on Windows (matches the
    platform convention; Unix uses ^D).
  • FIXED: el_cursor() takes wchar_t units while electric() hands
    code points — caret motion past wide chars now lands on the right
    cell.
  • FIXED (Windows): pl_line and el_history_encoded use REP_EL
    not REP_MB so libedit’s UTF-8 bytes are decoded correctly under the
    legacy ANSI codepage.
  • ENHANCED: libedit uses PL_wcwidth as its wcwidth
    implementation, so swipl, swipl-win, pl-write/pl-fmt and xpce all
    share one column-width table.

Other

  • ADDED: library(macros) accepts dicts as valid macros.
  • ADDED: library(unicode) is now included in the WASM build.
  • FIXED: unload_file/1 clears isfile so a subsequent
    use_module/1 actually reloads — notably re-running
    :- use_foreign_library/1.
  • FIXED: tnodebug/0 no longer raises a domain error.
  • FIXED: code_type/2 no longer reports csymf twice.
  • C++ binding: PlTerm::unify_blob() now takes
    std::unique_ptr<T> for any T derived from PlBlob directly,
    without an upcast.

Documentation pipeline

  • The PDF manual is now built with lualatex (was pdflatex). Source
    files can carry literal Unicode without the \urldef ASCII-routing
    ceremony.

  • SWI manual sources renamed *.doc → *.plx (Prolog LaTeX) so file
    managers, editors and contributors no longer mistake them for MS
    Word documents. .gitattributes maps *.plx to TeX for editor
    syntax highlighting and git diff hunks.

    Enjoy — Jan

BTW: There are other approaches, for example
moving the weight from solo to graphic, so that emojis don’t
need quoting most of the time:

?- char_code(X, 0x1FA76).
X = 🩶 .
?- atom_concat(🩶, 🩶, X).
X = 🩶🩶 .

So one would put ‘So’ in the same category as the ISO core
standard graphic characters =, <, etc., allowing them
to form an atom by joining multiple characters.

I didn’t find any drawback, except that reading Prolog
text that contains such emojis is highly Unicode-version
dependent. So using solo here is the more defensive

and more Unicode-version-agnostic solution, and might
explain why SWI breaks away, maybe under the influence
of some PIP work. On the other hand, making quoting

unnecessary can lead to a more comfortable coding
and interaction experience, also for more serious applications
like theorem proving, which might use ‘So’.

But this works somewhat better in SWI now:
we can cheat our way into Agda-style identifiers, which
have a rich syntax that can contain dashes:

/* SWI-Prolog 10.1.7 */
?- X = ≡-dec-A .
X = '≡'-dec-A.

But isn’t that an unnecessary quoting of a solo on output?

Emoji are now solo. This means they do not need quotes unless several are adjacent, e.g.,

?- atom_concat(🩶, 🩶, X).
X = '🩶🩶'.

The idea is that most Unicode symbols have a meaning of their own and just about any concept that has a well-known icon has a Unicode representation. So, you can use Unicode symbols as operators and standalone atoms, while you need quotes to make a string of them.

Whether or not that is always a good idea is debatable of course. The main issue raised is to possibly move the math symbols (Sm) from solo to symbol. This would allow any of them to glue with e.g., # to represent a constraint in clpfd style. I don’t know. Please come up with examples of sequences of Unicode symbols you’d like to be handled as an atom (without quotes).

Yes. Fixed.

I don’t have any very prominent examples; I only have an
example showing that both solo and graphic have their
perks. The problem is what to do with UNASSIGNED.

So my “quoting of a solo during output?” remark was not made with 100%
certainty. For example, for the new Orca emoji most fonts still show
tofu, a diamond with a question mark, or a slashed tofu:

/* Unicode 16.0 */
?- code_category(0x1FACD, X).
X = 0.
?- char_code(X, 0x1FACD).
X = '\x1facd\'.

/* Unicode 17.0 */
?- code_category(0x1FACD, X).
X = 28.
?- char_code(X, 0x1FACD).
X = 🫍 .

So if a Prolog text is rejected when it contains UNASSIGNED
code points, there is the problem that a Prolog text written for a
Prolog system with Unicode 17.0 is not backward compatible

with a Prolog system with Unicode 16.0. So even dropping
the quotes for solo is dangerous. I get the above output using
an UNASSIGNED heuristic that displays a \x escape,

and an unquoted display when the category is known. SWI
with the solo quoting is more defensive, on the safe side
so to speak. I am using the Windows console with UTF-8:

/* SWI-Prolog 10.1.6 */
?- char_code(X, 0x1FACD).
X = '\U0001FACD'.
?- char_code(🫍,  X), format(atom(Y), '~16R', [X]).
ERROR: Syntax error: illegal_character
ERROR: char_code
ERROR: ** here **
ERROR: (🫍,  X), format(atom(Y), '~16R', [X]) .

/* SWI-Prolog 10.1.7 */
?- char_code(X, 0x1FACD).
X = '🫍'.
?- char_code(🫍 , X), format(atom(Y), '~16R', [X]).
X = 129741,
Y = '1FACD'.

But it also shows backward compatibility issues, since
the console allows pasting the Orca emoji. So I basically wonder
whether a Prolog text should mention a Unicode

version somewhere. This is a somewhat nasty consequence of how Unicode
evolves. Another approach would be to make UNASSIGNED solo,
but I rejected this approach, since UNASSIGNED might

become whatever category in a future Unicode release. Further,
a scanner / parser might also reject UNASSIGNED inside quotes,
basically forcing a \U, \u or \x escape. Could it be that SWI already

has a write option to force escaping more aggressively? A write
option that also takes a Unicode version argument can be extremely
costly, since it would require the Prolog system to have either

multiple Unicode databases or a versioned Unicode database.

Those are valid concerns. As is, we accept P* and S* as solos. There is also a derived property called Pattern_Syntax. That is a stable subset of P* and S*. The downside is that it was frozen in 2005 and lacks several currency symbols, a lot of arrows, the emoticons, etc.

I don’t know. Unicode is a moving target. As I understand it (but I could be wrong), it contains unassigned code points. These can become anything in the future. It also contains stable code points. These are guaranteed to never change. There is a third category that has an assignment, but without a guarantee it won’t change. I don’t know what properties can still change.

A fully versioned set of Unicode tables is getting rather expensive and complicated. What we could consider is to add another entry holding the stable Pattern Syntax symbols. Now we could use that to always quote the unstable ones (notably for write_canonical/1). We could also add a flag that tells the system whether to only allow for stable or any symbol/punctuation character?

Note there is a Prolog flag unicode_syntax_version that tells you the Unicode version used to generate the tables that drive the parser.

I guess the trade-off is between portability guarantees and the ability to do fancy stuff using the latest Unicode standard …
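
The version dependence discussed above is easy to observe in any runtime
that ships Unicode tables. The Python sketch below is only an analogy for
the escaping heuristic described earlier in the thread, not what any Prolog
system does: the helper names needs_escape and show are hypothetical, and
the \x...\ output format is copied from the examples above.

```python
import unicodedata


def needs_escape(ch):
    """True for code points without a usable assignment in these tables.

    Cn = unassigned (or noncharacter), Co = private use, Cs = surrogate.
    """
    return unicodedata.category(ch) in ("Cn", "Co", "Cs")


def show(ch):
    """Write ch literally if its category is known, else as a \\x escape."""
    return "\\x%x\\" % ord(ch) if needs_escape(ch) else ch


if __name__ == "__main__":
    # The answer for recently added code points depends on this version.
    print("Unicode tables:", unicodedata.unidata_version)
    for cp in (0x1F602, 0x1FACD, 0xF813, 0x0378):
        print("U+%04X" % cp, unicodedata.category(chr(cp)), show(chr(cp)))
```

Running this on interpreters with different Unicode tables gives different
output for the same input, which is the portability concern raised above.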

Ah, ok, I see it came with this version 10.1.7. Oki Doki!

Well, there is a PRIVATE_USE category covering U+E000–U+F8FF,
U+F0000–U+FFFFD and U+100000–U+10FFFD, for example. I
am currently mapping it to “invalid”, just as UNASSIGNED

and SURROGATE (when single standing) are mapped to “invalid”.
But one might not agree with that; for example, Mac users sometimes had the
Apple logo at U+F813, and Wikipedia lists dozens of other

use cases. So one conception of a Prolog processor could also be
that the machine runs on Unicode + Private_Use_Extension, but for
interoperability one would possibly reduce it to Unicode

and some ASCII representation of the Private_Use_Extension. Since the
Apple logo example is mapped to “invalid”, I get the following, which
is an ASCII representation of a Private_Use_Extension:

/* Unicode 16.0 and Unicode 17.0 */
?- char_code(X, 0xF813).
X = '\xf813\'.

On the other hand SWI-Prolog gives me:

/* SWI-Prolog 10.1.6 */
?- char_code(X, 0xF813).
X = '\uF813'.

/* SWI-Prolog 10.1.7 */
?- char_code(X, 0xF813).
X = ''.

BTW: I didn’t ban single standing surrogates completely; they are
tolerated in atoms and codes, although they might lead to unexpected results.
The “invalid” is rather a Prolog parser and unparser thing.

I think private use and “unstable” are different things. And yes, also for SWI-Prolog private use ends up as unassigned. I guess one could argue for an API that allows (dynamic) assignment …

SURROGATE values should be flagged as invalid characters, regardless of where they appear and/or whether it is a pair or the lone half of a pair. That is not yet implemented.

?- code_type(0xF813, X).
X = print ;
X = graph ;
X = punct ;
X = to_lower(63_507) ;
X = to_upper(63_507) ;
X = width(1).

The definition given by an AI is: An unstable Unicode code point
generally refers to a code point that is either unassigned (reserved),
a non-character, or private use.

My “invalid” definition is indeed different. While the non-characters are
66 specific code points, according to the AI, surrogates
are not called non-characters.

Among surrogates, we find the high surrogate code points
U+D800 to U+DBFF (U+DB80 to U+DBFF being the private-use high
surrogates), and the low surrogate code points U+DC00 to U+DFFF.

Well, some encodings require pairs, like UTF-16. A pair then
exists on the physical level of the encoding; you might not
see it on the logical level of the encoding.

For example, one might do a kind of normalization:

?- atom_codes(X, [0xD83D, 0xDE02]).
X = 😂 .
?- atom_codes(😂, [X]), format_atom('~16R', [X], Y).
X = 128514, 
Y = '1F602'.

I don’t know what SWI or the PIPs propose. Maybe your
announcement of SWI 10.1.7 has something? Still, a broken
pair on the physical level might then be seen on the

logical level, or it might be totally banned. If it’s not
banned, the behaviour might simply be:

?- atom_codes('\xD83D\', [X]), format_atom('~16R', [X], Y).
X = 55357, 
Y = 'D83D'.

So there are kind of two variation points in an extensive standard:
a) atom_codes(-, +) single standing surrogate behaviour, and
b) atom_codes(+, -) single standing surrogate behaviour.
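
The "kind of normalization" shown above is the standard UTF-16 pairing
arithmetic. The Python sketch below spells it out on lists of integers.
The function names combine and normalize_codes are hypothetical, and
passing a lone surrogate through unchanged is only one of the possible
behaviours being discussed here.

```python
def combine(hi, lo):
    """Combine a high and a low surrogate into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def normalize_codes(codes):
    """Replace each well-formed surrogate pair by its code point.

    Lone surrogates are passed through (the "wobbly" behaviour).
    """
    out = []
    i = 0
    while i < len(codes):
        c = codes[i]
        if (0xD800 <= c <= 0xDBFF and i + 1 < len(codes)
                and 0xDC00 <= codes[i + 1] <= 0xDFFF):
            out.append(combine(c, codes[i + 1]))
            i += 2
        else:
            out.append(c)
            i += 1
    return out


if __name__ == "__main__":
    print(["%X" % c for c in normalize_codes([0xD83D, 0xDE02])])
    print(["%X" % c for c in normalize_codes([0xD83D])])
```

With this model, variation point a) is whether normalize_codes runs when an
atom is built, and b) is whether a lone surrogate may come back out.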

It might be covered by your unicode_atoms flag. The problem is that
an announcement such as the one for SWI-Prolog 10.1.7 is not really a tutorial
with extensive examples and discussions. But maybe you

already have such a section on https://www.swi-prolog.org/ ? What I
demonstrated by means of atom_codes/2 is not tied to Prolog
source code. It is the runtime semantics of a built-in. It doesn’t fit here:

Unicode Prolog source
https://www.swi-prolog.org/pldoc/man?section=unicodesyntax

There is a whole bunch of runtime semantics that is affected
when adopting Unicode, i.e. atom_length/2, sub_atom/5,
etc. might undergo some revisions or clarifications.

A funny effect of “wobbly atoms” with single standing surrogates:
some atom_length/2 invariants are broken. This explains why there is
sometimes a tendency to ban “wobbly atoms”:

?- atom_length('abc\xD83D\', A).
A = 4.
?- atom_length('\xDE02\def', B).
B = 4.
?- atom_concat('abc\xD83D\', '\xDE02\def', X), atom_length(X, C).
X = 'abc😂def', C = 7.

Now we have 4+4 = 7. A prominent language that has
“wobbly atoms” with normalization is Java, in the form of Strings,
since Java Strings are basically wchar strings with 16-bit

codes, and Unicode code points are only an interpretation.
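
The 4+4 = 7 effect can be reproduced in Python by routing the strings
through UTF-16 code units. This is a sketch of the mechanism only: Python
strings themselves are sequences of code points and do not merge
surrogates on concatenation, so the hypothetical helper join_utf16 does the
merge explicitly with the surrogatepass error handler.

```python
def utf16_units(s):
    """Number of 16-bit code units needed for s."""
    return len(s.encode("utf-16-le", "surrogatepass")) // 2


def join_utf16(a, b):
    """Concatenate on the UTF-16 level, then reinterpret as code points."""
    raw = (a + b).encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


if __name__ == "__main__":
    a = "abc\ud83d"        # ends in a lone high surrogate
    b = "\ude02def"        # starts with a lone low surrogate
    joined = join_utf16(a, b)
    print(len(a), len(b), len(joined))   # code point lengths
    print(utf16_units(joined))           # code unit length is still additive
```

Length measured in code units stays additive; it is only the code point
view of a wobbly string that breaks the invariant.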

SWI-Prolog deals with encoding at I/O. After the I/O, everything is a sequence of Unicode code points and these do (should) not involve surrogate pairs, just as they do not include UTF-8 high bytes.

This should not happen. Which Prolog system? SWI-Prolog (also incorrectly) produces some unprintable atom.

atom_concat('abc\xD83D\', '\xDE02\def', X), atom_length(X, C).
ERROR: Syntax error: Illegal character code
ERROR: atom_concat('abc\
ERROR: ** here **
ERROR: xD83D\', '\xDE02\def', X), atom_length(X, C) .

I.e. it is not allowed to specify surrogate pairs using \x or \u syntax.

All these things work on code points. This does imply that the length may differ from the number of visible characters (graphemes) due to the use of combining characters. This seems common practice.

It is allowed in Java. You can write \uD83D alone. Neither the compiler
nor the runtime complains. What I noticed is that having some
quick-save format, i.e. qsave_program/2, can relieve you from

defining a Unicode syntax for the bootstrap part of your Prolog
system, and make you less dependent on a Unicode version.
For example, a transpiler might generate Java code for the

system clauses of a Prolog system. While doing so, any identifiers or the like
become quoted strings anyway. But even with such a more
low-level format, there is still the problem that you need to define

the runtime behaviour of the system that executes the code.
Here is the Java proof; single surrogates still work in JDK 26:

image

I think it’s there because Java wants to be able to program
I/O decoders and encoders in the Java programming language
itself, and those deal with single surrogates anyway.

So for low level systems programming wobbly strings make
sense. The programming languages JavaScript and Python
do also allow wobbly strings, it seems Java isn’t alone.

Java is older than UTF-16 (not by much) and probably imagined the now obsolete UCS-2 encoding as what the world would look like. JavaScript and Windows fell into the same trap.

You can do so in Prolog as well, but then you need to use get_byte/2 and put_byte/2 and deal with the encoding yourself.

P.S. I am tightening the system so that surrogates do not surface at the Prolog level at all.

It’s not necessarily a trap, and it has nothing to do with
Windows. It’s only that the strings have a dual view.
You can imagine there is a byte array, and you

can view it both as a char sequence and as an integer sequence.
It’s a little difficult to model in Prolog, since for example
the ISO core standard has nothing about a dual view of

atoms. The dual view appeared in 2004 with Java 1.5.
It was not available in 1995 with Java 1.0. The dual view
parallels the evolution of Unicode from 16 bit to 20 bit

in the same time frame. JavaScript also has this
dual view. Python 3 is a little special, since it doesn’t follow
exactly the same approach. Python 3 could maybe serve

as a model for Prolog semantics, though not syntax, but it
does not ban wobbly strings, and its operations behave
slightly differently from the dual view. The particular

string semantics can be a real problem for interoperability:

/* Java and JavaScript Windows 11 Console */
?- compare(C, '𝄞', '豈').
C = < .

/* Python 3, Scryer, Trealla Windows 11 Console */
?- compare(C, '𝄞', '豈').
C = > .

And then in SWI-Prolog:

/* SWI-Prolog 10.1.7, Windows 11 Console */
?- compare(C, '𝄞', '豈').
C = (<).

?- char_code('𝄞', X).
X = 119070.

?- char_code('豈', X).
X = 63744.

All of them, Java, JavaScript and SWI-Prolog, are not correct,
as the char codes show. Consistent Unicode semantics is hard!
Python 3, Scryer and Trealla do it correctly. One could do it

correctly in Java/JavaScript by using a more costly compare/3,
using the integer view instead of the char view. Dunno whether
SWI-Prolog does it differently on other platforms.
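
The two orderings can be reproduced side by side in Python. This is an
illustration of the mechanism, not of any Prolog system's compare/3: the
helper names are hypothetical, the code point comparison is what Python
does natively for strings, and the code unit comparison imitates a system
that compares UTF-16 units.

```python
def _order(ka, kb):
    """Return '<', '=' or '>' for two comparable keys."""
    return "<" if ka < kb else ">" if ka > kb else "="


def compare_code_points(a, b):
    """Standard order by Unicode code point."""
    return _order([ord(c) for c in a], [ord(c) for c in b])


def compare_utf16_units(a, b):
    """Order by 16-bit code units (big-endian bytes keep the unit order)."""
    return _order(a.encode("utf-16-be"), b.encode("utf-16-be"))


if __name__ == "__main__":
    clef = "\U0001D11E"    # U+1D11E, stored as D834 DD1E in UTF-16
    cjk = "\uF900"         # U+F900, a single code unit
    print(compare_code_points(clef, cjk))
    print(compare_utf16_units(clef, cjk))
```

The results differ because the high surrogate 0xD834 sorts below 0xF900
even though the code point 0x1D11E sorts above it.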

This mostly confirms my view on UTF-16, which we use for compatibility with the C API on Windows. This is of course a bug as the result must be > as it is on non-Windows.

Unicode started with UCS-2, which is just fine except that 2^16 code points is not enough, so we got UTF-16. That is an unfortunate choice, as it is mostly one character is one code point, except … So, it leads to a lot of hard-to-track bugs. Most modern languages use UTF-8 internally. Unfortunately I went for a dual representation, char* or wchar_t*. This was fine on non-Windows as it keeps the content of atoms as arrays of code points, but requires a lot of hacks on Windows.

Note that 10.1.7 switches the xpce graphics system from char*/wchar_t* to char*/uchar_t*, where uchar_t is typedeffed to uint32_t.

In that sense, there is no and there should be no byte view. There is a code-point view, a one-code-point atom view (chars) and a grapheme view (atom with base char and combining characters). Bytes only enter the picture when doing I/O, including the C/C++ API.

I didn’t say there is a byte view. I only wrote there is a byte array
with two views. A byte view would be a third view. The dual view has
only Java char and Java integer view, you cannot access bytes.

The Java char view is 16-bit Unicode, properly. And the Java integer
view is 20-bit Unicode properly, which has 16-bit Unicode as a subset.
This means that 20-bit Unicode still contains the high surrogate code points

U+D800 to U+DBFF and the low surrogate code points U+DC00
to U+DFFF. There is also the other direction: 20-bit Unicode is a subset
of sequences of 16-bit Unicode, through the high and low surrogate

encoding. Incomplete combos give single surrogates in wobbly strings.

In Java I cannot say “char” for a code point because the ISO
Prolog “char (code)” has a different meaning than the Java “char”.
The char view in Java is, as already stated, strictly 16-bit Unicode,

not the same as the ISO Prolog “char (code)” definition, which is
flexible and implementation dependent. The ISO Prolog char (code),
in a Unicode extension, could correspond to the Java integer

view, which is, as already stated, 20-bit Unicode. About a
grapheme view I have no clue, while I can relate the Java integer
view to the ISO core standard. It is just what the ISO core standard

calls the set of character codes, abbreviated as CC; it is literally
defined as a subset of I, i.e. as a subset of the integers:

CC can be larger than the ASCII minimum. But the ISO core
standard has no concept of grouping elements from CC into
graphemes; you would need to write a totally new chapter of

a novel PIP all by yourself and/or with the help of others and/or
as a derivative work of another document, i.e. taking an existing
standard that is already publicly or otherwise available.

In my previous paragraph I repeated the existence
of wobbly strings. But then I asked myself what
NFC will do with them. An AI tells me:

Most standard-compliant NFC implementations will
treat a lone surrogate as an unassigned/unrecognized character.
It will simply pass the surrogate through to the output unchanged.

And this little Python 3 test:

>>> import unicodedata
>>> print(hex(ord(unicodedata.normalize('NFC', '\ud800'))))
0xd800

In SWI-Prolog I can’t get past:

?- char_code(X, 0xd800).
ERROR: Type error: `character_code' expected, found `55296' (an integer)

So there is no door for low level system programming. Other
languages that are more UTF-8 leaning, might also not allow it,
since UTF-8 forbids it, they would need to adopt the WTF-8

(Wobbly Transformation Format). Python 3 seems to be
not in the strict UTF-8 camp, more wobbly friendly.
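
The difference between strict UTF-8 and a WTF-8 style encoding can be seen
with Python's surrogatepass error handler, which produces the generalized
three-byte form for a lone surrogate. This is a sketch of the encoding
issue only; the helper name strict_utf8_ok is hypothetical.

```python
def strict_utf8_ok(s):
    """True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = "\ud800"

# Generalized encoding: the surrogate gets an ordinary 3-byte form.
wobbly = lone.encode("utf-8", "surrogatepass")

if __name__ == "__main__":
    print(strict_utf8_ok("abc"))     # ordinary text encodes fine
    print(strict_utf8_ok(lone))      # strict UTF-8 refuses a lone surrogate
    print(wobbly.hex())
    print(wobbly.decode("utf-8", "surrogatepass") == lone)
```

A system that stores text as strict UTF-8 therefore has to either ban lone
surrogates or adopt such a relaxed encoding internally.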

See BUG: internal error in GUI debugger - #2 by alanbur