Dear SWI-Prolog user,
SWI-Prolog 10.1.7 is ready for download. This release lands the
long-running overhaul of Unicode handling — both in the
source-language syntax and in the editor/terminal rendering — together
with a handful of behaviour changes to align with the Unicode
recommendations for programming languages. A few of these change
observable defaults; users who rely on the old behaviour should read
the highlights below before upgrading.
Status
Unicode handling is under discussion in the PIP working group. The
current implementation reflects where I think the consensus is
heading. SWI-Prolog will follow the decisions of the PIP group, so
additional incompatible changes may follow. These should only affect
edge cases, notably:
- Classification of solo characters.
- Handling of bracket pairs (Ps/Pe) and quote pairs (Pi/Pf).
  Notably the term representation of quote pairs is new and likely to
  be debated. These features may also be restricted using Prolog flags
  and/or options.
- Handling of non-ASCII digits is not supported by number_codes/2, etc.
  This may change.
Highlights
- Unicode source syntax now follows UAX #31 (XID_Start / XID_Continue),
  with super-/subscript-digit profile additions for variables and atoms.
- Default source encoding on Windows is now UTF-8 (was the legacy ANSI
  codepage).
- Unicode normalisation is now pluggable through the Prolog flag
  unicode_atoms. Unicode bidi-override / isolate code points are now
  unconditionally rejected in source as a Trojan-source defence
  (CVE-2021-42574).
- xpce editor and terminal now render and edit NFD combining marks,
  CJK wide glyphs, and supplementary-plane code points (emoji, CJK
  Extensions B–G) correctly on every platform.
Unicode source syntax
The classifier follows the Unicode release reported by the new read-only
Prolog flag unicode_syntax_version (currently '17.0.0'). See section
Unicode Prolog source in the manual for a worked example. Changes:
- Identifiers are XID_Start followed by XID_Continue per UAX #31,
  with the super- and subscript digits (², ³, ¹, ⁰..⁹, ₀..₉) added as
  XID_Continue so variables like X² and X₁ work.
- Variable vs. atom: a token is a variable iff it starts with _ or
  a code point in general category Lu. MODIFIED: titlecase
  letters (Lt, e.g. ǅ) now start an atom, not a variable; this
  differs from earlier releases that used the broader derived
  Uppercase property.
- MODIFIED: All Unicode symbol classes (Sm, Sc, Sk, So)
  and the connector/dash/other-punctuation classes (Pc, Pd, Po)
  are now treated as solo: each forms an atom on its own and does
  not glue with adjacent symbols. This is a deliberate
  break from earlier releases, in which Unicode symbols glued into
  compound atoms in the same way as ASCII symbols.
- MODIFIED: NBSP (U+00A0) is no longer treated as whitespace.
  Outside quoted material it raises a stray-character syntax error.
  The layout set is now exactly Unicode Pattern_White_Space.
- Bracket pairs (Ps/Pe) and quote pairs (Pi/Pf) are recognised
  as reader-level operators. An opening character followed by a
  Prolog term and the matching closing character reads as a unary
  compound '<open><close>'(Term), generalising the existing
  {Term} ⇒ '{}'(Term) form to the full Unicode Ps/Pe set.
  Quote pairs read literal text in the form selected by
  double_quotes; e.g. «hello, world» reads as
  '«»'("hello, world").
- Seven Pattern_White_Space code points now act as line terminators
  everywhere (LF, VT, FF, CR, NEL, LS, PS). FIXED: %-comments,
  line-counter, and \<newline> continuation were silently ignoring
  NEL (U+0085), LS (U+2028), PS (U+2029) as well as CR/VT/FF.
- In source code (read_term/2), numeric literals still use ASCII
  digits 0–9 only. Conversion via atom_number/2, number_codes/2 and
  number_chars/2 additionally accepts any Unicode Nd block; in a
  single number all digits must come from the same block.
- New code_type/2 / char_type/2 categories prolog_layout (the eleven
  Pattern_White_Space code points) and prolog_end_of_line (the seven
  line-terminator-like ones). paren(Close) and the new
  quote(Close) cover the full Unicode Ps/Pe and Pi/Pf sets.
- unicode_block/3 updated to Unicode 17.0.
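The variable-vs-atom rule in the list above can be modelled outside Prolog. The sketch below uses Python's unicodedata module purely as an illustration; it is not SWI-Prolog's classifier, the helper name starts_variable is hypothetical, and Python's Unicode tables may be a different release than the one reported by unicode_syntax_version.

```python
import unicodedata

def starts_variable(token: str) -> bool:
    """Model of the rule: a token is a variable iff its first
    character is '_' or has general category Lu."""
    first = token[0]
    return first == "_" or unicodedata.category(first) == "Lu"

# \u00c9 is LATIN CAPITAL LETTER E WITH ACUTE (Lu),
# \u01c5 is the titlecase digraph from the text (Lt),
# \u00e9 is LATIN SMALL LETTER E WITH ACUTE (Ll).
for tok in ["X1", "_tmp", "\u00c9clair", "\u01c5ungla", "\u00e9lan"]:
    kind = "variable" if starts_variable(tok) else "atom"
    print(f"{tok}: {kind}")
```

Under this model the titlecase token is classified as an atom, which matches the MODIFIED note above.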
Unicode atoms / normalisation
- ADDED: Unicode handling is guided by the 4-valued option
  unicode_atoms. The values are accept (default),
  nfc, error, and reject. It follows the same three-tier
  hierarchy as encoding: Prolog flag → stream property → per-call
  option of read_term/2,3, read_clause/2,3, set_stream/2 and open/4.
- Modes nfc and error auto-load library(unicode) at every entry
  point (previously only the flag’s setter did so).
- Mode error falls back to a wcwidth-based check when no
  normalisation hook is registered, which can over-reject scripts
  (e.g. Thai) that use combining marks in NFC. A new read-only
  atom_normalize_hook flag tells you whether the precise check is
  available.
- Unicode bidi-override and isolate code points (U+202A..U+202E and
  U+2066..U+2069) are unconditionally rejected in source tokens,
  quoted strings, and comments as a defence against the Trojan-source
  attack (CVE-2021-42574).
- Loading library(unicode) now installs an NFC normaliser into the
  kernel via PL_atom_normalize_hook(), which makes unicode_atoms(nfc)
  work without further setup.
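To make two of the points above concrete, here is a small Python sketch. It is only a model under stated assumptions: the function names are hypothetical, the bidi ranges are the ones quoted in the list, and the NFC test uses Python's own normaliser rather than the hook installed by library(unicode).

```python
import unicodedata

# Code point ranges quoted in the release notes:
# U+202A..U+202E and U+2066..U+2069.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def find_bidi_controls(text: str) -> list:
    """Return (index, 'U+XXXX') for every bidi override/isolate found."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if ord(ch) in BIDI_CONTROLS]

def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalisation."""
    return unicodedata.normalize("NFC", text) == text

def has_combining_mark(text: str) -> bool:
    """True if any code point is a nonspacing or enclosing mark."""
    return any(unicodedata.category(ch) in ("Mn", "Me") for ch in text)

thai = "\u0e01\u0e34"  # Thai consonant plus above-base vowel sign
print(find_bidi_controls("a\u202eb"))
print(is_nfc("e\u0301"), is_nfc("\u00e9"))
print(is_nfc(thai), has_combining_mark(thai))
```

The Thai sample is already in NFC yet still contains a combining mark, which is the situation in which a check that only looks for zero-width marks would reject valid text.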
Windows
- MODIFIED: Default source encoding on Windows is now UTF-8. The
  system locale on Windows usually reports a legacy codepage
  (Windows-1252 or similar), which used to make the default Prolog
  encoding flag ANSI/Latin-1 and caused UTF-8 source files to be
  read as their byte-wise Latin-1 interpretation.
- FIXED: init_locale also sets LC_CTYPE to UTF-8. Without
  this, mbrtowc/wcrtomb stayed ASCII-only and any UTF-8 byte above
  0x7F flowing through PL_canonicalise_text returned EILSEQ,
  surfacing as Syntax error: illegal_multibyte_sequence the moment
  a user typed an emoji or other non-ASCII char in the libedit prompt.
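The "byte-wise Latin-1 interpretation" mentioned above is easy to reproduce. The following Python sketch only illustrates the general encoding mismatch; it does not involve SWI-Prolog, and the sample word is arbitrary.

```python
# A word containing one non-ASCII letter (U+00E9).
original = "caf\u00e9"

# What a UTF-8 source file holds on disk.
utf8_bytes = original.encode("utf-8")

# What a reader sees when it assumes Latin-1: each byte becomes one
# character, so the two-byte UTF-8 sequence turns into two characters.
misread = utf8_bytes.decode("latin-1")

# Reading the same bytes as UTF-8 recovers the original text.
correct = utf8_bytes.decode("utf-8")

print(len(original), len(misread), correct == original)
```

The misread string is one character longer than the original because the single letter was stored as two bytes.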
C API
- ADDED: PL_is_id_start, PL_is_id_continue, PL_is_uppercase,
  PL_is_decimal, PL_is_layout. Thin shims over the Unicode flag
  table so foreign extensions and embedded toolkits classify code
  points exactly as SWI-Prolog does, without needing the
  locale-dependent POSIX iswX() family.
- ADDED: PL_wcwidth(int). Locale-portable display-column width
  for a Unicode code point: same answer as the rest of SWI-Prolog,
  regardless of the process LC_CTYPE or platform. On Unix/macOS it
  forwards to system wcwidth(3); on Windows it goes through the
  bundled mk_wcwidth() table.
- ADDED: foreign_library_property/2 to query foreign library
  properties.
Editor and terminal (xpce)
- FIXED: Editor column functions now account for CJK wide chars
  (each counts as 2 columns) and combining marks (each counts as 0).
  getColumnEditor() and getColumnLocationEditor() previously
  counted every non-tab character as one visual column, breaking caret
  positioning after CJK double-width and NFD content.
- FIXED: Forward/backward char and delete now move by grapheme
  cluster, not code point. For NFD text a visible character may span
  multiple code points; the four basic editing operations previously
  stepped one code point at a time, leaving the caret stranded inside
  a cluster or deleting only part of a grapheme.
- FIXED: Selection drag no longer corrupts text at grapheme
  cluster boundaries (Thai above-base vowel signs etc.).
- FIXED: Selection endpoints snap to grapheme cluster boundaries.
- FIXED: Terminal paint_chunks splits per cluster for non-ASCII
  text. When a system fixed-width font has no glyphs for a script
  (e.g. Thai), Pango falls back to a proportional font and shapes the
  run at its natural advance, so a selection within such a line no
  longer drifts horizontally.
- FIXED: paint_line renders Unicode highlights without visual
  drift.
- FIXED: Selection in display->copy sets both primary and
  secondary selection.
- ADDED: nfd_style instance variable for visual NFD cluster
  highlighting on Editor and TerminalImage. Default @nil (disabled).
- MODIFIED: auto_copy class variable on text_item, editor
  and terminal_image defaults to @off on Windows and macOS and
  @on elsewhere.
- FIXED (Windows): supplementary-plane code points (emoji,
  CJK Extensions B–G) no longer render as their PUA-area low 16 bits
  or .notdef glyphs. Several wint_t → uchar_t fixes; charW
  now falls back to uint32_t when wchar_t is too narrow; the
  text-buffer file format now round-trips SMP correctly; c_width,
  uchar_display_width, paint_line, regex CHR width and
  ENC_WCHAR stream paths all honour full code points.
- ENHANCED: SDL backend Windows font specs now cover CJK,
  symbol/emoji, Thai/Lao (Leelawadee UI) and Yi syllables (Microsoft
  Yi Baiti) in the mono/sans/serif fallback chains. Korean, Japanese,
  Chinese and many symbols rendered as tofu before.
- ENHANCED: font ->member walks the Pango fallback chain by
  default, so it returns @on for code points the system can render
  via fallback (e.g. U+1F600). An optional family=[bool] argument
  forces the old “primary font only” check.
- ENHANCED: font <-domain likewise unions Pango fontset coverage.
- ENHANCED: display-level methods to control and observe SDL’s
  on-screen-keyboard policy from Prolog.
- FIXED: PceEmacs class menu (raised a type error), XPCE native
  finder icons, thread monitor icons.
- FIXED: Icons for the thread monitor (use new SVG icons).
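The column rule from the first item of this list (wide characters count 2, combining marks count 0, everything else 1) can be sketched as follows. This is a simplified Python model based on unicodedata, not the xpce code and not the table behind PL_wcwidth; the function names are hypothetical and tabs are deliberately ignored.

```python
import unicodedata

def char_columns(ch: str) -> int:
    """Display columns for one code point in this simplified model."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # nonspacing / enclosing combining mark
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian wide or fullwidth
    return 1

def display_columns(text: str) -> int:
    """Visual column reached after the given text."""
    return sum(char_columns(ch) for ch in text)

samples = {
    "ascii": "abc",
    "cjk": "\u6f22\u5b57",   # two CJK ideographs
    "nfd": "e\u0301",        # e followed by combining acute accent
    "thai": "\u0e01\u0e34",  # consonant plus above-base vowel sign
}
for name, text in samples.items():
    print(name, len(text), display_columns(text))
```

Counting code points instead of columns gives a different number for every sample except the ASCII one, which is the mismatch the caret-positioning fix addresses.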
libedit
- ADDED: el_get/2 to query editline properties. Initially
  supports editor(?Editor) (emacs/vi). Unknown properties
  raise domain_error(editline_property, _).
- ADDED: bracketed_paste(?Boolean) property on el_set/2 and
  el_get/2. Tracked per el_context and toggled at runtime; skipped
  (and unbound) in vi mode.
- ADDED: ^Z is bound to send EOF on Windows (matches the
  platform convention; Unix uses ^D).
- FIXED: el_cursor() takes wchar_t units while electric() passes
  code points; caret motion past wide chars now lands on the right
  cell.
- FIXED (Windows): pl_line and el_history_encoded use REP_EL
  not REP_MB so libedit’s UTF-8 bytes are decoded correctly under the
  legacy ANSI codepage.
- ENHANCED: libedit uses PL_wcwidth as its wcwidth
  implementation, so swipl, swipl-win, pl-write/pl-fmt and xpce all
  share one column-width table.
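The difference between wchar_t units and code points only shows up where wchar_t is 16 bits wide, as on Windows. The Python sketch below models that case by counting UTF-16 code units; treating wchar_t as UTF-16 is an assumption made for illustration, and the helper names are hypothetical rather than part of libedit or SWI-Prolog.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit units needed to store the text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def unit_offset(text: str, code_point_index: int) -> int:
    """Convert a code point index into a 16-bit unit offset."""
    return utf16_units(text[:code_point_index])

line = "a\U0001F600b"  # 'a', an emoji outside the BMP, 'b'

print(len(line))             # code points
print(utf16_units(line))     # 16-bit units
print(unit_offset(line, 2))  # unit offset of the final 'b'
```

A cursor position expressed in code points therefore needs converting before it is passed to an interface that counts 16-bit units.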
Other
- ADDED: library(macros) accepts dicts as valid macros.
- ADDED: library(unicode) is now included in the WASM build.
- FIXED: unload_file/1 clears isfile so a subsequent
  use_module/1 actually reloads, notably re-running
  :- use_foreign_library/1.
- FIXED: tnodebug/0 no longer raises a domain error.
- FIXED: code_type/2 no longer reports csymf twice.
- C++ binding: PlTerm::unify_blob() now takes
  std::unique_ptr<T> for any T derived from PlBlob directly,
  without an upcast.
Documentation pipeline
- The PDF manual is now built with lualatex (was pdflatex). Source
  files can carry literal Unicode without the \urldef ASCII-routing
  ceremony.
- SWI manual sources renamed *.doc → *.plx (Prolog LaTeX) so file
  managers, editors and contributors no longer mistake them for MS
  Word documents. .gitattributes maps *.plx to TeX for editor
  syntax highlighting and git diff hunks.

Enjoy -- Jan