Ann: SWI-Prolog 10.1.7

jan · May 11, 2026, 9:53am

Dear SWI-Prolog user,

SWI-Prolog 10.1.7 is ready for download. This release lands the
long-running overhaul of Unicode handling — both in the
source-language syntax and in the editor/terminal rendering — together
with a handful of behaviour changes to align with the Unicode
recommendations for programming languages. A few of these change
observable defaults; users who rely on the old behaviour should read
the highlights below before upgrading.

Status

Unicode handling is under discussion in the PIP working group. The
current implementation reflects where I think the consensus is heading.
SWI-Prolog will follow the decisions of the PIP group, which implies
that additional incompatible changes are not unlikely. Note that these
merely touch edge cases, notably:

Classification of solo characters.
Handling of Bracket pairs (Ps/Pe) and quote pairs (Pi/Pf).
Notably the term-representation of quote pairs is new and likely to
be debated. These features may also be restricted using Prolog flags
and/or options.
Handling of non-ASCII digits is not supported by number_codes/2, etc.
This may change.

Highlights

Unicode source syntax now follows UAX #31 (XID_Start / XID_Continue),
with super-/subscript-digit profile additions for variables and atoms.
Default source encoding on Windows is now UTF-8 (was the legacy ANSI
codepage).
Pluggable Unicode normalisation via the Prolog flag unicode_atoms.
Unicode bidi-override / isolate code points are now unconditionally
rejected in source as a Trojan-source defence (CVE-2021-42574).
xpce editor and terminal now render and edit NFD combining marks,
CJK wide glyphs, and supplementary-plane code points (emoji, CJK
Extensions B–G) correctly on every platform.

Unicode source syntax

The classifier follows the Unicode release reported by the new read-only
Prolog flag unicode_syntax_version (currently '17.0.0'). See section
Unicode Prolog source in the manual for a worked example. Changes:

Identifiers are XID_Start followed by XID_Continue per UAX #31,
with the super- and subscript digits (², ³, ¹, ⁰..⁹, ₀..₉) added as
XID_Continue so variables like X² and X₁ work.
Variable vs. atom: a token is a variable iff it starts with _ or
a code point in general category Lu. MODIFIED: titlecase
letters (Lt, e.g. ǅ) now start an atom, not a variable; this
differs from earlier releases that used the broader derived
Uppercase property.
MODIFIED: All Unicode symbol classes (Sm, Sc, Sk, So)
and the connector/dash/other-punctuation classes (Pc, Pd, Po)
are now treated as solo: each forms an atom on its own and does
not glue with adjacent symbols. This is a deliberate
break from earlier releases, in which Unicode symbols glued into
compound atoms in the same way as ASCII symbols.
MODIFIED: NBSP (U+00A0) is no longer treated as whitespace.
Outside quoted material it raises a stray-character syntax error.
The layout set is now exactly Unicode Pattern_White_Space.
Bracket pairs (Ps/Pe) and quote pairs (Pi/Pf) are recognised
as reader-level operators. An opening character followed by a
Prolog term and the matching closing character reads as a unary
compound '<open><close>'(Term), generalising the existing
{Term} ⇒ '{}'(Term) form to the full Unicode Ps/Pe set.
Quote pairs read literal text in the form selected by
double_quotes; e.g. «hello, world» reads as
'«»'("hello, world").
Seven Pattern_White_Space code points now act as line terminators
everywhere (LF, VT, FF, CR, NEL, LS, PS). FIXED: %-comments,
line-counter, and \<newline> continuation were silently ignoring
NEL (U+0085), LS (U+2028), PS (U+2029) as well as CR/VT/FF.
In source code (read_term/2), numeric literals still use ASCII
digits 0–9 only. Conversion via atom_number/2, number_codes/2 and
number_chars/2 additionally accepts any Unicode Nd block; in a
single number all digits must come from the same block.
New code_type/2 / char_type/2 categories prolog_layout (the eleven
Pattern_White_Space code points) and prolog_end_of_line (the seven
line-terminator-like ones). paren(Close) and the new
quote(Close) cover the full Unicode Ps/Pe and Pi/Pf sets.
unicode_block/3 updated to Unicode 17.0.
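
The classification rules above can be approximated outside Prolog. The following is a minimal Python sketch, not the SWI-Prolog reader: it uses the general categories from Python's `unicodedata` tables (whose Unicode version depends on the Python release), treats any letter as an identifier start instead of the exact XID_Start set, and assumes that only non-ASCII symbols are solo. The name `classify_start` is hypothetical.

```python
import unicodedata

# Pattern_White_Space: the eleven code points that form the layout set.
PATTERN_WHITE_SPACE = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}
# The seven line terminators: LF, VT, FF, CR, NEL, LS, PS.
END_OF_LINE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}

# General categories the announcement lists as "solo".
SOLO_CATEGORIES = {"Sm", "Sc", "Sk", "So", "Pc", "Pd", "Po"}


def classify_start(ch: str) -> str:
    """Roughly classify a token by its first character (hypothetical helper)."""
    cp = ord(ch)
    if cp in PATTERN_WHITE_SPACE:
        return "layout"
    if ch == "_":
        return "variable"
    cat = unicodedata.category(ch)
    if cat == "Lu":
        return "variable"          # only Lu starts a variable
    if cat.startswith("L"):
        return "atom"              # Ll, Lo, Lm and titlecase Lt
    # Assumption: ASCII symbols keep their traditional gluing behaviour.
    if cp > 0x7F and cat in SOLO_CATEGORIES:
        return "solo"
    return "other"                 # digits, brackets, quotes, NBSP, ...
```

Under these assumptions `classify_start("\u01c5")` (ǅ, category Lt) yields `atom`, and NBSP (U+00A0) falls through to `other` because it is not in Pattern_White_Space.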

Unicode atoms / normalisation

ADDED: Unicode handling is guided by the 4-valued option
unicode_atoms. The values are accept (default),
nfc, error, and reject. It follows the same three-tier
hierarchy as encoding: Prolog flag → stream property → per-call
option of read_term/2,3, read_clause/2,3, set_stream/2 and open/4.
Modes nfc and error auto-load library(unicode) at every entry
point (previously only the flag’s setter did so).
Mode error falls back to a wcwidth-based check when no
normalisation hook is registered, which can over-reject scripts
(e.g. Thai) that use combining marks in NFC. A new read-only
atom_normalize_hook flag tells you whether the precise check is
available.
Unicode bidi-override and isolate code points (U+202A..U+202E and
U+2066..U+2069) are unconditionally rejected in source tokens,
quoted strings, and comments as a defence against the Trojan-source
attack (CVE-2021-42574).
Loading library(unicode) now installs an NFC normaliser into the
kernel via PL_atom_normalize_hook(), which makes unicode_atoms(nfc)
work without further setup.
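
To make the modes more concrete, here is a small Python model of the checks described above. It is a sketch under assumptions, not the kernel code: `check_atom_text` is a hypothetical name, the reading of `nfc` as "normalise" and `error` as "raise on non-NFC text" is my interpretation of the list above, and the `reject` mode is left out because its behaviour is not described here.

```python
import unicodedata

# Bidi override (U+202A..U+202E) and isolate (U+2066..U+2069) code points.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def has_bidi_control(text: str) -> bool:
    """True if the text contains a bidi override or isolate code point."""
    return any(ord(ch) in BIDI_CONTROLS for ch in text)


def check_atom_text(text: str, mode: str = "accept") -> str:
    """Hypothetical model of the accept / nfc / error modes."""
    if has_bidi_control(text):
        # Rejected in every mode, as a Trojan-source defence.
        raise ValueError("bidi control character in source")
    nfc = unicodedata.normalize("NFC", text)
    if mode == "accept":
        return text                # keep the code points as written
    if mode == "nfc":
        return nfc                 # silently normalise to NFC
    if mode == "error":
        if text != nfc:
            raise ValueError("text is not in NFC")
        return text
    raise ValueError(f"unknown mode: {mode}")
```

With a precise NFC check, Thai text such as `"\u0e01\u0e34"` passes in `error` mode because it is already in NFC even though it contains a combining mark; that is the case a width-based fallback can over-reject.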

Windows

MODIFIED: Default source encoding on Windows is now UTF-8. The
system locale on Windows usually reports a legacy codepage
(Windows-1252 or similar), which used to make the default Prolog
encoding flag ANSI/Latin-1 and caused UTF-8 source files to be
read as their byte-wise Latin-1 interpretation.
FIXED: init_locale also sets LC_CTYPE to UTF-8. Without
this, mbrtowc/wcrtomb stayed ASCII-only and any UTF-8 byte above
0x7F flowing through PL_canonicalise_text returned EILSEQ —
surfacing as Syntax error: illegal_multibyte_sequence the moment
a user typed an emoji or other non-ASCII char in the libedit prompt.

C API

ADDED: PL_is_id_start, PL_is_id_continue, PL_is_uppercase,
PL_is_decimal, PL_is_layout. Thin shims over the Unicode flag
table so foreign extensions and embedded toolkits classify code
points exactly as SWI-Prolog does, without needing the
locale-dependent POSIX iswX() family.
ADDED: PL_wcwidth(int). Locale-portable display-column width
for a Unicode code point — same answer as the rest of SWI-Prolog,
regardless of the process LC_CTYPE or platform. On Unix/macOS it
forwards to system wcwidth(3); on Windows it goes through the
bundled mk_wcwidth() table.
ADDED: foreign_library_property/2 to query foreign library
properties.
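
For readers unfamiliar with what a column-width function such as PL_wcwidth computes, the idea can be sketched in Python. This is an approximation built on `unicodedata`, not the bundled mk_wcwidth() table: it ignores control characters (for which wcwidth(3) returns -1) and other special cases, and the function names are hypothetical.

```python
import unicodedata


def display_width(ch: str) -> int:
    """Approximate column width of a single code point."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0            # non-spacing and enclosing combining marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2            # East Asian wide and fullwidth forms
    return 1


def text_width(text: str) -> int:
    """Sum of the per-code-point widths."""
    return sum(display_width(ch) for ch in text)
```

Because the answer comes from a table and not from the process locale, `text_width("e\u0301")` is 1 and `text_width("日本")` is 4 on every platform.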

Editor and terminal (xpce)

FIXED: Editor column functions now account for CJK wide chars
(each counts as 2 columns) and combining marks (each counts as 0).
getColumnEditor() and getColumnLocationEditor() previously
counted every non-tab character as one visual column, breaking caret
positioning after CJK double-width and NFD content.
FIXED: Forward/backward char and delete now move by grapheme
cluster, not code point. For NFD text a visible character may span
multiple code points; the four basic editing operations previously
stepped one code point at a time, leaving the caret stranded inside
a cluster or deleting only part of a grapheme.
FIXED: Selection drag no longer corrupts text at grapheme
cluster boundaries (Thai above-base vowel signs etc.).
FIXED: Selection endpoints snap to grapheme cluster boundaries.
FIXED: Terminal paint_chunks splits per cluster for non-ASCII
text. When a system fixed-width font has no glyphs for a script
(e.g. Thai), Pango falls back to a proportional font and shapes the
run at its natural advance — so a selection within such a line no
longer drifts horizontally.
FIXED: paint_line renders Unicode highlights without visual
drift.
FIXED: Selection in display->copy sets both primary and
secondary selection.
ADDED: nfd_style instance variable for visual NFD cluster
highlighting on Editor and TerminalImage. Default @nil (disabled).
MODIFIED: auto_copy class variable on text_item, editor
and terminal_image defaults to @off on Windows and macOS and
@on elsewhere.
FIXED (Windows): supplementary-plane code points (emoji,
CJK Extensions B–G) no longer render as their PUA-area low 16 bits
or .notdef glyphs. Several wint_t → uchar_t fixes; charW
now falls back to uint32_t when wchar_t is too narrow; the
text-buffer file format now round-trips SMP correctly; c_width,
uchar_display_width, paint_line, regex CHR width and
ENC_WCHAR stream paths all honour full code points.
ENHANCED: SDL backend Windows font specs now cover CJK,
symbol/emoji, Thai/Lao (Leelawadee UI) and Yi syllables (Microsoft
Yi Baiti) in the mono/sans/serif fallback chains. Korean, Japanese,
Chinese and many symbols rendered as tofu before.
ENHANCED: font ->member walks the Pango fallback chain by
default, so it returns @on for code points the system can render
via fallback (e.g. U+1F600). An optional family=[bool] argument
forces the old “primary font only” check.
ENHANCED: font <-domain likewise unions Pango fontset coverage.
ENHANCED: display-level methods to control and observe SDL’s
on-screen-keyboard policy from Prolog.
FIXED: PceEmacs class menu (raised a type error), XPCE native
finder icons, thread monitor icons.
FIXED: Icons for the thread monitor (use new SVG icons).
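
The grapheme-cluster fixes above can be illustrated with a simplified Python model. It is only a sketch: it treats a cluster as a base character plus any following combining marks (general category M*), which is much cruder than the full UAX #29 rules, and both function names are hypothetical.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Split text into base character + trailing combining marks."""
    out: list[str] = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)
    return out


def delete_backward(text: str, caret: int) -> tuple[str, int]:
    """Delete the whole cluster containing or ending at offset `caret`."""
    pos = 0
    for cl in clusters(text):
        end = pos + len(cl)
        if pos < caret <= end:
            return text[:pos] + text[end:], pos
        pos = end
    return text, caret         # caret at offset 0: nothing to delete
```

Stepping one code point at a time would turn `"ae\u0301b"` into `"aeb"` on backspace after the accent; deleting per cluster removes the `e` together with its combining mark.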

libedit

ADDED: el_get/2 to query editline properties. Initially
supports editor(?Editor) (emacs / vi). Unknown properties
raise domain_error(editline_property, _).
ADDED: bracketed_paste(?Boolean) property on el_set/2 and
el_get/2. Tracked per el_context and toggled at runtime; skipped
(and unbound) in vi mode.
ADDED: ^Z is bound to send EOF on Windows (matches the
platform convention; Unix uses ^D).
FIXED: el_cursor() takes wchar_t units while electric() handed it
code points; caret motion past wide chars now lands on the right
cell.
FIXED (Windows): pl_line and el_history_encoded use REP_EL
not REP_MB so libedit’s UTF-8 bytes are decoded correctly under the
legacy ANSI codepage.
ENHANCED: libedit uses PL_wcwidth as its wcwidth
implementation, so swipl, swipl-win, pl-write/pl-fmt and xpce all
share one column-width table.

Other

ADDED: library(macros) accepts dicts as valid macros.
ADDED: library(unicode) is now included in the WASM build.
FIXED: unload_file/1 clears isfile so a subsequent
use_module/1 actually reloads — notably re-running
:- use_foreign_library/1.
FIXED: tnodebug/0 no longer raises a domain error.
FIXED: code_type/2 no longer reports csymf twice.
C++ binding: PlTerm::unify_blob() now takes
std::unique_ptr<T> for any T derived from PlBlob directly,
without an upcast.

Documentation pipeline

The PDF manual is now built with lualatex (was pdflatex). Source
files can carry literal Unicode without the \urldef ASCII-routing
ceremony.
SWI manual sources renamed *.doc → *.plx (Prolog LaTeX) so file
managers, editors and contributors no longer mistake them for MS
Word documents. .gitattributes maps *.plx to TeX for editor
syntax highlighting and git diff hunks.

Enjoy — Jan

jan · May 12, 2026, 9:34am

Emoji are now solo. This means they do not need quotes unless several of them are adjacent, e.g.,

?- atom_concat(🩶, 🩶, X).
X = '🩶🩶'.

The idea is that most Unicode symbols carry a meaning of their own, and just about any concept that has a well-known icon has a Unicode representation. So you can use Unicode symbols as operators and stand-alone atoms, while you need quotes to make a string of them.

Whether or not that is always a good idea is debatable, of course. The main issue raised is whether to move the math symbols (Sm) from solo to symbol. This would allow any of them to glue with, e.g., # to represent a constraint in clpfd style. I don’t know. Please come up with examples of sequences of Unicode symbols you’d like to be handled as an atom (without quotes).

Yes. Fixed.

jan · May 12, 2026, 4:03pm

Those are true concerns. As is, we accept P* and S* as solos. There is also a derived property called Pattern Syntax, which is a stable subset of P* and S*. The downside is that it was frozen in 2005 and lacks several currency symbols, a lot of arrows, the emoticons, etc.

I don’t know. Unicode is a moving target. As I understand it (but I could be wrong), it contains unassigned code points. These can become anything in the future. It also contains stable code points. These are guaranteed to never change. There is a third category that has an assignment, but without a guarantee it won’t change. I don’t know what properties can still change.

A fully versioned set of Unicode tables is getting rather expensive and complicated. What we could consider is to add another entry holding the stable Pattern Syntax symbols. We could then use that to always quote the unstable ones (notably for write_canonical/1). We could also add a flag that tells the system whether to allow only stable symbol/punctuation characters or any of them.

Note there is a Prolog flag unicode_syntax_version that tells you the Unicode version used to generate the tables that drive the parser.

I guess the trade-off is between portability guarantees and the ability to do fancy stuff using the latest Unicode standard …

jan · May 13, 2026, 6:55am

I think private use and “unstable” are different things. And yes, also for SWI-Prolog private use ends up as unassigned. I guess one could argue for an API that allows (dynamic) assignment …

SURROGATE values should be flagged as invalid characters, regardless of where they appear and/or whether it is a pair or the lone half of a pair. That is not yet implemented.

?- code_type(0xF813, X).
X = print ;
X = graph ;
X = punct ;
X = to_lower(63_507) ;
X = to_upper(63_507) ;
X = width(1).

jan · May 13, 2026, 8:47am

SWI-Prolog deals with encoding at I/O. After the I/O, everything is a sequence of Unicode code points, and these do not (and should not) involve surrogate pairs, just as they do not include UTF-8 high bytes.

This should not happen. Which Prolog system? SWI-Prolog (also incorrectly) produces some unprintable atom

atom_concat('abc\xD83D\', '\xDE02\def', X), atom_length(X, C).
ERROR: Syntax error: Illegal character code
ERROR: atom_concat('abc\
ERROR: ** here **
ERROR: xD83D\', '\xDE02\def', X), atom_length(X, C) .

I.e. it is not allowed to specify surrogate pairs using \x or \u syntax.
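
For reference, the two values in the rejected escape sequences are the UTF-16 surrogate halves of a single code point. A short Python sketch of the standard UTF-16 decoding arithmetic (the function names are mine, not part of any SWI-Prolog API):

```python
def is_surrogate(cp: int) -> bool:
    """True for the reserved UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD83D, 0xDE02)))  # 0x1f602
```

So the intended character can be written directly as the single code point U+1F602, and no surrogate values need to appear at the Prolog level.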

All these things work on code points. This does imply that the length may differ from the number of visible characters (graphemes) due to the use of combining characters. This seems common practice.

jan · May 13, 2026, 11:37am

Java is older than UTF-16 (not by much) and probably imagined the now obsolete UCS-2 encoding as what the world would look like. JavaScript and Windows fell into the same trap.

You can do so in Prolog as well, but then you need to use get_byte/2 and put_byte/2 and deal with the encoding yourself.

P.S. I am tightening the system so that surrogates never surface at the Prolog level.

jan · May 13, 2026, 2:33pm

This mostly confirms my view on UTF-16, which we use for compatibility with the C API on Windows. This is of course a bug as the result must be > as it is on non-Windows.

Unicode started with UCS-2, which is just fine except that 2^{16} code points is not enough, so we got UTF-16. That is an unfortunate choice as it is mostly one character is one code point, except … So it leads to a lot of hard-to-track bugs. Most modern languages use UTF-8 internally. Unfortunately I went for a dual representation: char* or wchar_t*. This was fine on non-Windows as it keeps the content of atoms as arrays of code points, but it requires a lot of hacks on Windows.

Note that 10.1.7 switches the xpce graphics system from char*/wchar_t* to char*/uchar_t*, where uchar_t is typedeffed to uint32_t.

In that sense, there is no and there should be no byte view. There is a code-point view, a one-code-point atom view (chars) and a grapheme view (atom with base char and combining characters). Bytes only enter the picture when doing I/O, including the C/C++ API.
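
The different views give different lengths for the same text. A small Python illustration (Python strings are also sequences of code points; the grapheme count uses the rough "base character plus combining marks" approximation, not full UAX #29 segmentation):

```python
import unicodedata

text = "e\u0301\U0001F602"   # 'e' + combining acute accent, then one emoji

code_points = len(text)                             # 3: the code-point view
utf8_bytes = len(text.encode("utf-8"))              # 7: 1 + 2 + 4 bytes, I/O only
utf16_units = len(text.encode("utf-16-le")) // 2    # 4: the emoji needs a surrogate pair
graphemes = sum(1 for ch in text
                if not unicodedata.category(ch).startswith("M"))  # 2 visible characters

print(code_points, utf8_bytes, utf16_units, graphemes)  # 3 7 4 2
```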

alanbur · May 18, 2026, 1:37pm

See BUG: internal error in GUI debugger - #2 by alanbur
